QuickStart

This section walks you through uploading a simple program and a small dataset, and then running the program on the dataset.

First, sign up for an account on mlcomp. This will enable you to create datasets and programs.

Create a dataset

Suppose we are interested in multiclass classification, where the goal is to predict the output class from an input feature vector.

To make this task concrete, let us start by creating a sample dataset:
1 2:1 3:1
2 1:1 3:1
3 1:1 2:1
1 2:1 3:1
2 1:1 3:1
2 1:1

Each line is an example, which starts with the output class (a positive integer) followed by a sequence of featureIndex:featureValue items. Features that are not listed have value zero, so the first line corresponds to an example of class 1 with feature vector (0, 1, 1). Note that this is basically the same format as the one used by SVMlight.
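
To make the format concrete, here is a minimal Python sketch of a parser for one line of the dataset. The function names parse_example and to_dense are hypothetical and are not part of mlcomp; the sketch assumes feature indices start at 1, as in the sample above.

```python
def parse_example(line):
    """Parse one 'label index:value ...' line into (label, sparse_features)."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for item in tokens[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def to_dense(features, dimension):
    """Expand a sparse feature dict into a dense list (indices are 1-based)."""
    return [features.get(i, 0.0) for i in range(1, dimension + 1)]


label, features = parse_example("1 2:1 3:1")
print(label, to_dense(features, 3))  # prints: 1 [0.0, 1.0, 1.0]
```

Keeping the features sparse (a dict keyed by index) avoids having to know the dimension in advance; the dense form is only needed for display or for algorithms that require it.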

First, download or copy the sample dataset into a file on your local disk. Then go to the add dataset page. Select the data file to upload. On the next page, fill out the form with the following metadata:

name: simple-dataset
description: a toy dataset for testing purposes.
format: MulticlassClassification

The name uniquely identifies this dataset on mlcomp (when you upload it, you might have to use a different name if simple-dataset is taken). The format specifies the format of the dataset and which programs can be run on it; there are other standard dataset formats. Now the dataset should be successfully saved into mlcomp.

Initially the dataset will be unprocessed; after a few moments, its status should change to processed, which means that it has been validated and that the data has been divided into a training set and a test set. Only after it has been processed can programs be run on it.
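
If you want to check a dataset locally before uploading, the following Python sketch mimics the two processing steps in spirit. It is not mlcomp's actual procedure: the validation rules are inferred from the format description above, and the 30% test fraction, the fixed random seed, and the function names are all assumptions made for illustration.

```python
import random


def is_valid_example(line):
    """Check 'label index:value ...' with a positive integer label."""
    tokens = line.split()
    if not tokens or not tokens[0].isdecimal() or int(tokens[0]) < 1:
        return False
    for item in tokens[1:]:
        index, separator, value = item.partition(":")
        if not separator or not index.isdecimal():
            return False
        try:
            float(value)
        except ValueError:
            return False
    return True


def split_examples(lines, test_fraction=0.3, seed=0):
    """Shuffle the example lines and split them into (train, test) lists."""
    examples = [line for line in lines if line.strip()]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(examples)
    num_test = max(1, int(round(len(examples) * test_fraction)))
    return examples[num_test:], examples[:num_test]


dataset = ["1 2:1 3:1", "2 1:1 3:1", "3 1:1 2:1",
           "1 2:1 3:1", "2 1:1 3:1", "2 1:1"]
valid = [line for line in dataset if is_valid_example(line)]
train, test = split_examples(valid)
print(len(train), "training examples,", len(test), "test examples")
```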

Note: Instead of filling out the form, you can also upload a zip file containing the dataset file (call it raw) and a file called metadata that holds the metadata (see details).
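
As a rough illustration, such an archive could be assembled in Python as sketched below. The file names raw and metadata come from the note above; the layout of the metadata file (one "key: value" line per field) is an assumption based on the form fields, so check the details page for the exact syntax. The function name build_dataset_zip is hypothetical.

```python
import os
import tempfile
import zipfile


def build_dataset_zip(zip_path, raw_text, metadata):
    """Write a zip archive holding the dataset as 'raw' plus a 'metadata' file."""
    # Assumed layout: one "key: value" line per metadata field.
    metadata_text = "".join(f"{key}: {value}\n" for key, value in metadata.items())
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("raw", raw_text)
        archive.writestr("metadata", metadata_text)


raw_text = "1 2:1 3:1\n2 1:1 3:1\n3 1:1 2:1\n1 2:1 3:1\n2 1:1 3:1\n2 1:1\n"
zip_path = os.path.join(tempfile.mkdtemp(), "simple-dataset.zip")
build_dataset_zip(zip_path, raw_text, {
    "name": "simple-dataset",
    "description": "a toy dataset for testing purposes.",
    "format": "MulticlassClassification",
})
print("Wrote", zip_path)
```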

Create a program

The program we will create is a Naive Bayes classifier for supervised multiclass classification.

The program has three operations: learn, predict, and setHyperparameter. (To draw an analogy with object-oriented programming, think about a program as a class and its operations as its methods; the program implements the interface given by its task type.) Concretely, a program is an executable script called run. The operations are as follows:

  • The learn operation takes a file containing the training data in the MulticlassClassification dataset format described above. Invoking this operation (done by mlcomp) produces the following result:
    % ./run learn path/to/trainingExamples
    Processing training examples...
    Smoothing and normalizing...
    Saving model...
    
    This operation must serialize the model parameters to some file in the current directory (in this case, a file called model).
  • The predict operation takes a file containing the test data (with the same format as the training data) and a file to which the predictions on that data should be output. Each line of the output file is a single predicted output, in the same order as the inputs. This operation is invoked as follows:
    % ./run predict path/to/testExamples path/to/outputPredictions
    Loading model...
    Predicting test examples...
    
  • The setHyperparameter operation is optional, but can be used if the learning algorithm needs to tune a hyperparameter. The program will be passed a hyperparameter, a non-negative real number, which will range across the values 0, 0.01, 0.1, 1, 10, 100. See reductions for more information on how hyperparameters are chosen. When setHyperparameter is invoked, the program saves the value to a file called hyperparameter; the value will be read during learn. An example invocation:
    % ./run setHyperparameter 0.1
    Saving hyperparameter 0.1
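
The three operations can be sketched as a single run script. The version below is a Python stand-in, not the Ruby program this guide refers to. It assumes a multinomial Naive Bayes model in which the hyperparameter acts as an additive smoothing constant, a default of 1.0 when no hyperparameter file exists, and a JSON-encoded model file; none of these choices are prescribed by mlcomp.

```python
#!/usr/bin/env python3
"""Sketch of a run script: multinomial Naive Bayes with additive smoothing."""
import json
import math
import os
import sys
from collections import defaultdict


def read_examples(path):
    """Yield (label, {featureIndex: value}) for each non-empty line."""
    with open(path) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                pairs = (item.split(":") for item in tokens[1:])
                yield tokens[0], {index: float(value) for index, value in pairs}


def log_ratio(numerator, denominator):
    """log(numerator / denominator), or -inf when the probability is zero."""
    if numerator <= 0 or denominator <= 0:
        return float("-inf")
    return math.log(numerator / denominator)


def set_hyperparameter(value):
    print("Saving hyperparameter", value)
    with open("hyperparameter", "w") as handle:
        handle.write(value)


def learn(train_path):
    smoothing = 1.0  # assumed default when setHyperparameter was never called
    if os.path.exists("hyperparameter"):
        with open("hyperparameter") as handle:
            smoothing = float(handle.read())
    print("Processing training examples...")
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: defaultdict(float))
    vocabulary = set()
    for label, features in read_examples(train_path):
        class_counts[label] += 1
        for index, value in features.items():
            feature_totals[label][index] += value
            vocabulary.add(index)
    print("Smoothing and normalizing...")
    total = sum(class_counts.values())
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, count in class_counts.items():
        denominator = sum(feature_totals[label].values()) + smoothing * len(vocabulary)
        model["log_prior"][label] = math.log(count / total)
        model["log_likelihood"][label] = {
            index: log_ratio(feature_totals[label][index] + smoothing, denominator)
            for index in vocabulary
        }
    print("Saving model...")
    with open("model", "w") as handle:
        json.dump(model, handle)


def predict(test_path, output_path):
    print("Loading model...")
    with open("model") as handle:
        model = json.load(handle)
    print("Predicting test examples...")
    with open(output_path, "w") as out:
        for _, features in read_examples(test_path):
            scores = {}
            for label, log_prior in model["log_prior"].items():
                likelihood = model["log_likelihood"][label]
                # Features unseen in training and non-positive values are ignored.
                scores[label] = log_prior + sum(
                    value * likelihood[index]
                    for index, value in features.items()
                    if value > 0 and index in likelihood
                )
            out.write(max(scores, key=scores.get) + "\n")


OPERATIONS = {"learn": learn, "predict": predict,
              "setHyperparameter": set_hyperparameter}


def main(argv):
    if argv[1] not in OPERATIONS:
        print("usage: run learn|predict|setHyperparameter ARGS", file=sys.stderr)
        return 2
    OPERATIONS[argv[1]](*argv[2:])
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

To try it locally, save the script as run, make it executable (chmod +x run), and invoke it as in the transcripts above. State is passed between operations only through the files hyperparameter and model in the current directory, which matches how the operations are described.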
    

Now, to upload the program, go to the add program page. Fill out the form with the following:

name: simple-naive-bayes
description: A Simple Naive Bayes implementation in Ruby.
task: MulticlassClassification

The name must uniquely identify the program in the mlcomp system (when you upload it, you might have to use a different name if simple-naive-bayes is taken). The task MulticlassClassification tells mlcomp which datasets the program can be run on and how to evaluate the program; there are other standard task types.

As with datasets, you can instead upload a single zip file containing a metadata file and the run script.

Create a run

Having uploaded a program called simple-naive-bayes and a dataset called simple-dataset, we are now ready to run the program on the dataset. To do that, go to the add runs page. Select simple-naive-bayes under the programs and simple-dataset under the datasets. You will be asked whether you want to tune the hyperparameter; check yes because our program supports setHyperparameter. Once you add the run, it is placed in the queue and then executed on one of our worker servers (or one that you provide). When it finishes, you can inspect the training and test times and error rates of this run.
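
For reference, an error rate for multiclass classification can be computed as the fraction of examples whose predicted class differs from the true class. The Python sketch below shows this computation under that assumption; it is not mlcomp's evaluation code, and the function names are hypothetical.

```python
def read_labels(path):
    """Read the first token of each non-empty line as an integer class.

    Works for example files (label first) and prediction files (label only).
    """
    with open(path) as handle:
        return [int(line.split()[0]) for line in handle if line.strip()]


def error_rate(true_labels, predicted_labels):
    """Fraction of positions where the prediction differs from the truth."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need one prediction per example, and at least one example")
    mistakes = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return mistakes / len(true_labels)


# One mistake (the last example) out of six.
print(error_rate([1, 2, 3, 1, 2, 2], [1, 2, 3, 1, 2, 1]))
```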

Each run is executed in a virtual machine (see below for information about the system specifications). By default, your program is allotted 1 GB of RAM and one hour of computation time.
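
Before uploading, it can be useful to check that your program stays within similar limits on your own machine. The Python sketch below is one way to approximate this on a Unix system; it is an assumption about how you might test locally, not a description of how mlcomp enforces its limits. The function name run_with_limits is hypothetical, and the memory cap uses the address-space limit (RLIMIT_AS), which only approximates a RAM limit.

```python
import resource  # Unix only
import subprocess
import sys


def run_with_limits(command, seconds=3600, memory_bytes=None):
    """Run command; return its exit code, or None if it exceeded the time limit."""
    def apply_memory_limit():
        # Runs in the child process just before the command starts.
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    try:
        completed = subprocess.run(
            command,
            timeout=seconds,
            preexec_fn=apply_memory_limit if memory_bytes else None,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Stand-in command; replace it with ["./run", "learn", "path/to/trainingExamples"].
print(run_with_limits([sys.executable, "-c", "print('ok')"], seconds=60))
```

Passing memory_bytes=1024**3 approximates the 1 GB default; the one-hour default for seconds mirrors the computation time limit.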