Domains

A domain defines a particular machine learning task. Each domain is equipped with the following:
  • a particular dataset format,
  • a particular program interface, and
  • standard evaluation metrics.
Please contact us if you want to add a domain for your problem.

There are two types of domains:

  • Supervised learning (e.g., classification, regression): For these domains, a run consists of two phases: learn and predict. Datasets must be split into two shards (for train and test).
  • Performing (e.g., clustering, optimization): For these domains, there is only one phase: perform. Datasets contain only one shard (raw).
NEW: create your own custom domains.


BinaryClassification

Type: supervised-learning

Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of two classes (positive or negative).

Dataset format:

One file where each line corresponds to an example:
output featureIndex:featureValue ... featureIndex:featureValue 
where featureIndex is a positive integer, featureValue is a real number, and output ∈ {-1, +1}. The feature indices must be sorted in increasing order. For the test file, output is 0. The predictions file contains a line for each test example:
predicted-output
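As an illustration, a line in this format could be parsed with a few lines of Python. This is only a sketch: the function name parse_example is hypothetical, and rejecting repeated indices is an assumption (the format only states that indices are positive and sorted in increasing order).

```python
def parse_example(line):
    """Parse 'output index:value ... index:value' into (output, {index: value})."""
    tokens = line.split()
    output = int(tokens[0])  # -1 or +1 for training data, 0 in the test file
    features = {}
    previous_index = 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        # Assumption: indices must be strictly increasing, so duplicates are rejected.
        if index <= previous_index:
            raise ValueError(f"feature index out of order or not positive: {token!r}")
        features[index] = float(value_text)
        previous_index = index
    return output, features
```

Features that do not appear on a line are simply absent from the returned dictionary.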

Click here to see a sample dataset.

Configuration file


MulticlassClassification

Type: supervised-learning

Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of K classes.

Dataset format:

One file where each line corresponds to an example:
output featureIndex:featureValue ... featureIndex:featureValue 
where featureIndex is a positive integer, featureValue is a real number, and output ∈ {1, 2, ..., K}. The feature indices must be sorted in increasing order. For the test file, output is 0. The predictions file contains a line for each test example:
predicted-output

Click here to see a sample dataset.

Configuration file


Regression

Type: supervised-learning

Task description: The goal of this task is to learn how to predict a real value Y given an input vector X.

Dataset format:

One file where each line corresponds to an example:
output featureIndex:featureValue ... featureIndex:featureValue 
where featureIndex is a positive integer, featureValue is a real number, and output is a real number. The feature indices must be sorted in increasing order. For the test file, output is 0. The predictions file contains a line for each test example:
predicted-output

Click here to see a sample dataset.

Configuration file


SequenceTagging

Type: supervised-learning

Task description: In this task, the input is a sequence (e.g., a sentence) and the output is a tag label for each position of the sequence (e.g., part-of-speech tags for each word in the sentence). The key part of this problem is that there are dependencies between the various labels. This is the canonical structured prediction task.

Dataset format:

The file format is the same as the one from the CoNLL shared task. In the data file, each line corresponds to a position of a sequence and empty lines denote the end of one sequence and the start of another. Each line looks like this (a sequence of input feature columns followed by one output):
input ... input output
where the input and output are strings. For example, for named-entity recognition:
France NNP B-LOC
If labels include B-X, I-X, then this is treated as a segmentation task where a sequence of B-X I-X ... I-X denotes a segment labeled as X. For these tasks (e.g., named-entity recognition), F1 is an appropriate evaluation metric. For the test file, output is "-". The predictions file contains lines parallel to the input like the following:
predicted-output
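To make the segmentation convention concrete, here is a rough Python sketch that groups lines into sequences and turns B-X / I-X tags into labeled segments. The function names are hypothetical. It assumes columns are separated by whitespace, that any tag not starting with B- or I- lies outside every segment, and that an I-X tag without a preceding B-X (or with a different label) is ignored.

```python
def read_sequences(lines):
    """Group rows into sequences; an empty line ends the current sequence."""
    sequences, current = [], []
    for line in lines:
        columns = line.split()
        if columns:
            # All columns but the last are inputs; the last one is the output tag.
            current.append((columns[:-1], columns[-1]))
        elif current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences


def extract_segments(tags):
    """Return (label, start, end) spans, end exclusive, from B-X / I-X tags."""
    segments = []
    start, label = None, None
    # The extra "O" is a sentinel that closes a segment running to the end.
    for position, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            segments.append((label, start, position))
        if tag.startswith("B-"):
            start, label = position, tag[2:]
        else:
            start, label = None, None
    return segments
```

Comparing the segments extracted from the true tags with those extracted from the predicted tags is one way to count the matches needed for F1.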

Click here to see a sample dataset.

Configuration file


CollaborativeFiltering

Type: supervised-learning

Task description: Given some entries of a matrix (e.g., where rows are users and columns are movies and each entry is a numeric rating), predict other entries.

Dataset format:

One file where each line corresponds to an entry of the matrix:
row-index column-index value
where row-index and column-index are positive integers and value is a real number. For the test file, value is 0. The predictions file contains a line for each test example:
predicted-value
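A minimal Python sketch of reading this format and producing a predictions file follows. The function names are hypothetical, and predicting the global mean of the training values is just a trivial baseline chosen for illustration, not a method prescribed for this domain.

```python
def read_entries(lines):
    """Parse 'row-index column-index value' lines into (row, column, value) tuples."""
    entries = []
    for line in lines:
        if not line.strip():
            continue
        row_text, column_text, value_text = line.split()
        entries.append((int(row_text), int(column_text), float(value_text)))
    return entries


def predict_global_mean(train_entries, test_entries):
    """Return one prediction line per test entry: the mean of all training values."""
    mean = sum(value for _, _, value in train_entries) / len(train_entries)
    return [f"{mean:g}" for _ in test_entries]
```

The returned list has one string per test entry, in the same order as the test file, which is the layout the predictions file expects.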

Click here to see a sample dataset.

Configuration file


DocumentClassification

Type: supervised-learning

Task description: The goal of this task is to learn how to classify text documents as one of K classes.

Dataset format:

A dataset consists of one or more datashards, where each datashard (e.g., train, test, or raw) is a directory consisting of one directory per class, which in turn contains one file for each document. All the file names must be distinct. An example of the directory layout inside the datashard:
label1/doc1
label1/doc2
label2/doc3
...
At test time, your program will be passed unlabeled examples, which are arranged in a datashard like this:
unlabeled/doc1
unlabeled/doc2
unlabeled/doc3
...
Your predictions should be written to a directory with the following structure, where each file can be empty (only the file name matters).
label1/doc1
label1/doc2
label2/doc3
...
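The directory layout above can be produced and read back with the Python standard library. This is a sketch under the assumption that predictions are held in a dictionary from document file name to label; the function names are hypothetical.

```python
import os


def write_predictions(predictions, output_dir):
    """Create output_dir/<label>/<document> for each predicted document."""
    for doc_name, label in predictions.items():
        label_dir = os.path.join(output_dir, label)
        os.makedirs(label_dir, exist_ok=True)
        # Only the file name matters, so an empty file is enough.
        open(os.path.join(label_dir, doc_name), "w").close()


def read_labels(datashard_dir):
    """Map each document file name to the name of the directory that contains it."""
    labels = {}
    for label in sorted(os.listdir(datashard_dir)):
        label_dir = os.path.join(datashard_dir, label)
        if os.path.isdir(label_dir):
            for doc_name in os.listdir(label_dir):
                labels[doc_name] = label
    return labels
```

Because all file names in a datashard are distinct, the file name alone identifies a document, which is why a flat dictionary suffices here.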

Click here to see a sample dataset.

Configuration file


WordSegmentation

Type: performing

Task description: This is an unsupervised learning task where we are given an unsegmented sequence of characters (phonemes) as input and the goal is to determine the word boundaries and output the words.

Dataset format:

The input is just one file in UTF-8 format containing sentences, one per line. Example input:
thisisatest
Example output:
this is a test

Click here to see a sample dataset.

Configuration file


ConstituencyParsingTest

Type: performing

Task description: The goal of this task is labeled constituency parsing with integrated part-of-speech tagging.

Dataset format:

The format follows the LDC constituency parse tree format.

Click here to see a sample dataset.

Configuration file


DependencyParsingTest

Type: performing

Task description: The goal of this task is labeled dependency parsing with integrated part-of-speech tagging.

Dataset format:

The format follows the popular CoNLL-X shared task format (described at http://ilk.uvt.nl/conll/index.html#dataformat). In particular, each line corresponds to a word in a sentence, and empty lines separate sentences. Each line has 10 tab-separated fields, some of which can be empty (indicated by _). See the link above for details.
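For orientation, a reader for this layout might look like the Python sketch below. The field names follow the CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the function name read_conllx is hypothetical, and all values are kept as strings so that empty fields remain "_".

```python
FIELDS = ("id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel")


def read_conllx(lines):
    """Return a list of sentences; each sentence is a list of {field: value} dicts."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line separates two sentences.
            if current:
                sentences.append(current)
                current = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} tab-separated fields: {line!r}")
        current.append(dict(zip(FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences
```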

Click here to see a sample dataset.

Configuration file


OnlineLearningMulticlass

Type: interactive-learning

Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of K classes.

Dataset format:

The examples in the datashard are presented one at a time. The feature vector is presented first, through STDIN:
featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer and featureValue is a real number. The feature indices must be sorted in increasing order. The program should output to STDOUT its prediction:
predicted-output
where predicted-output ∈ {1, 2, ..., K}. The program does not know how many classes there are and must find out through experience. Once the prediction is received, the correct label is presented through STDIN:
correct-label
where correct-label ∈ {1, 2, ..., K}.
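The exchange described above (read features, write a prediction, read the correct label) can be sketched in Python as follows. The learner is a plain multiclass perceptron chosen only for illustration, and the names run_online, read_line, and write_line are hypothetical. Since the number of classes is unknown, the sketch adds a class the first time its label is seen and guesses 1 before any label has arrived.

```python
from collections import defaultdict


def parse_features(line):
    """Parse 'index:value ... index:value' into a {index: value} dict."""
    return {int(i): float(v) for i, v in (tok.split(":", 1) for tok in line.split())}


def run_online(read_line, write_line):
    """read_line() returns the next input line ('' at end of input);
    write_line(text) sends one prediction."""
    weights = {}  # label -> {featureIndex: weight}, grown as labels are discovered
    while True:
        line = read_line()
        if not line:  # an empty string signals end of input
            break
        x = parse_features(line)
        if weights:
            # Highest-scoring known label; ties go to the smallest label.
            predicted = max(sorted(weights), key=lambda label: sum(
                weights[label].get(i, 0.0) * v for i, v in x.items()))
        else:
            predicted = 1
        write_line(str(predicted))
        correct = int(read_line())
        weights.setdefault(correct, defaultdict(float))
        if predicted != correct:
            # Perceptron update: move toward the correct label, away from the guess.
            for i, v in x.items():
                weights[correct][i] += v
                if predicted in weights:
                    weights[predicted][i] -= v
```

When wired to real streams, read_line could be sys.stdin.readline and write_line a function that prints with flush=True, since the correct label is only sent once the prediction has been received.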

Click here to see a sample dataset.

Configuration file


BanditMulticlass

Type: interactive-learning

Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of K classes. It is similar to OnlineLearningMulticlass, except that the oracle only tells you whether your prediction was correct or not.

Dataset format:

The examples in the datashard are presented one at a time. The feature vector is presented first, through STDIN:
featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer and featureValue is a real number. The feature indices must be sorted in increasing order. The program should output to STDOUT its prediction:
predicted-output
where predicted-output ∈ {1, 2, ..., K}. Unlike OnlineLearningMulticlass, the program knows that there are K classes (K is passed as the first argument to ./run). Once the prediction is received, either "yes" or "no" is presented through STDIN, depending on whether the predicted label was correct:
oracle-answer
where oracle-answer ∈ {yes, no}.
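A rough Python sketch of this protocol is given below. The update rule is a simple heuristic picked for illustration (reinforce the predicted class on "yes", penalize it on "no"), not an established bandit algorithm, and the names run_bandit, read_line, and write_line are hypothetical. The num_classes argument stands for K, which a real program would take from its first command-line argument.

```python
def run_bandit(num_classes, read_line, write_line):
    """read_line() returns the next input line ('' at end of input);
    write_line(text) sends one prediction."""
    weights = [dict() for _ in range(num_classes)]  # weights[k - 1] belongs to class k
    while True:
        line = read_line()
        if not line:  # an empty string signals end of input
            break
        x = {int(i): float(v) for i, v in (tok.split(":", 1) for tok in line.split())}
        scores = [sum(w.get(i, 0.0) * v for i, v in x.items()) for w in weights]
        # Highest-scoring class; ties go to the smallest class number.
        predicted = scores.index(max(scores)) + 1
        write_line(str(predicted))
        answer = read_line().strip()
        # Only the predicted class receives feedback, so only its weights change.
        step = 1.0 if answer == "yes" else -1.0
        w = weights[predicted - 1]
        for i, v in x.items():
            w[i] = w.get(i, 0.0) + step * v
```

A practical learner would usually add some exploration, because a "no" answer says nothing about which of the other classes is correct.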

Click here to see a sample dataset.

Configuration file


SemiSupervisedMulticlass

Type: supervised-learning

Task description: Given a dataset containing both labeled and unlabeled instances, the goal is to learn a classifier that makes use of both kinds of instances.

Dataset format:

Data consists of:
label1 feature:value feature:value...
label2 feature:value feature:value...
label3 feature:value feature:value...
...
Test data consists of:
label1 feature:value feature:value...
label2 feature:value feature:value...
label3 feature:value feature:value...
...
Predictions should be:
label1
label2
label3
...

Click here to see a sample dataset.

Configuration file


Creating a New Domain

If you have a task that does not fall into one of the standard categories, you can create a new domain. Follow the instructions below:
  1. Decide whether your domain type is supervised-learning (involves separate train and test phases) or performing (only one phase).
  2. Decide on the dataset format (input) and the format for the output of a program that operates in that domain.
  3. Create a sample dataset (for example, see the sample dataset for document classification). This dataset should be small but not degenerate: just complicated enough to demonstrate the characteristics of the format and domain.
  4. Create the helper program for this domain. The program should support the following operations:
    • inspect datashardPath: checks that the given datashard conforms to the format, and if so, extracts summary statistics and writes them out in YAML format to a status file.
    • split rawDatashardPath trainDatashardPath testDatashardPath: reads in the raw datashard, splits the examples into training and test sets, and writes each set to the corresponding path. This operation is only for supervised-learning domains.
    • stripLabels inDatashardPath outDatashardPath: reads in a datashard from inDatashardPath, removes the labels (e.g., replacing -1, +1 with 0), and outputs the result to outDatashardPath.
    • evaluate datashardPath predictionPath: reads in the true outputs from datashardPath and a program's predicted outputs from predictionPath, and computes error metrics suitable for that domain. Results should be written to the status file.
    For example, here is the program that does the above for document classification.
  5. Write a configuration file, which should be a YAML file. For example, see the configuration file for document classification.
Email the sample dataset, the helper program, and the configuration file to mlcomp.support@gmail.com and we will incorporate it into MLcomp.
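The four helper operations from step 4 might be sketched in Python as below, here for a line-based format such as the one used by BinaryClassification. Everything specific in this sketch is an assumption made for illustration: the function names, the explicit status_path argument, the YAML keys numExamples and errorRate, the 30% test fraction, and the use of error rate as the metric. It is not the helper program that MLcomp itself uses.

```python
import random


def inspect_datashard(datashard_path, status_path):
    """Check every line parses as 'output index:value ...' and record a summary."""
    num_examples = 0
    with open(datashard_path) as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            float(tokens[0])  # raises ValueError if the output is malformed
            for token in tokens[1:]:
                index_text, value_text = token.split(":", 1)
                int(index_text), float(value_text)
            num_examples += 1
    with open(status_path, "w") as f:
        f.write(f"numExamples: {num_examples}\n")  # hypothetical YAML key
    return num_examples


def split(raw_path, train_path, test_path, test_fraction=0.3, seed=0):
    """Shuffle the raw examples and write train and test datashards."""
    with open(raw_path) as f:
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    num_test = int(len(lines) * test_fraction)
    with open(test_path, "w") as f:
        f.writelines(lines[:num_test])
    with open(train_path, "w") as f:
        f.writelines(lines[num_test:])


def strip_labels(in_path, out_path):
    """Replace the output on every line with 0."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            tokens = line.split()
            if tokens:
                fout.write(" ".join(["0"] + tokens[1:]) + "\n")


def evaluate(datashard_path, prediction_path, status_path):
    """Compare true and predicted outputs and record the error rate."""
    with open(datashard_path) as f:
        truths = [float(line.split()[0]) for line in f if line.strip()]
    with open(prediction_path) as f:
        predictions = [float(line) for line in f if line.strip()]
    if len(truths) != len(predictions):
        raise ValueError("prediction count does not match the datashard")
    errors = sum(t != p for t, p in zip(truths, predictions))
    error_rate = errors / len(truths)
    with open(status_path, "w") as f:
        f.write(f"errorRate: {error_rate}\n")  # hypothetical YAML key
    return error_rate
```

A real helper would additionally map the operation names (inspect, split, stripLabels, evaluate) from its command-line arguments to these functions.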