All classifiers follow the same principle. First, they are trained using some kind of training data. The trained classifier can then be used to classify new data, i.e. to assign the same label to the new data as to the most similar training data.
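The principle above can be sketched as a nearest-neighbor rule. This is only an illustration in plain C++, not code from the library: `LabeledPattern`, `squaredDistance` and `classifyNearest` are hypothetical names, and real classifiers usually build a more compact model during training instead of storing every pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical training pattern: a feature vector plus its label.
struct LabeledPattern {
  std::vector<double> features;
  std::string label;
};

// Squared Euclidean distance between two vectors of equal dimension.
double squaredDistance(const std::vector<double>& a,
                       const std::vector<double>& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Returns the label of the training pattern closest to the query.
// Returns an empty string if there is no training data.
std::string classifyNearest(const std::vector<LabeledPattern>& training,
                            const std::vector<double>& query) {
  double best = std::numeric_limits<double>::infinity();
  std::string label;
  for (const LabeledPattern& p : training) {
    const double d = squaredDistance(p.features, query);
    if (d < best) {
      best = d;
      label = p.label;
    }
  }
  return label;
}
```

Here "training" amounts to storing the labeled patterns, and "classifying" means looking up the most similar one.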
Three different categories of classifiers exist in the LTI-Lib:
- unsupervised classifiers
- supervised sequence classifiers
- supervised instance classifiers
The first category includes all algorithms that are trained without knowing the proper labels or results for each training pattern. Clustering algorithms are included here, although most are not designed for later classifying new data. Supervised classifiers are trained knowing the expected result for each training pattern. They are divided into two groups: sequence classifiers work on time series, whereas instance classifiers work on n-dimensional vectors.
All classifiers have four essential properties:
- The parameters are static properties of the classifier. All classifiers have default settings for the parameters object. However, many classifiers will not work properly without adjustment of some of the parameters.
- The status contains error messages. If any of the methods, particularly train and classify, returns false, information about the nature of the failure can be obtained by calling getStatusString().
- The progress of the training can be monitored with a progressObject. See Progress Objects.
- The outputTemplate contains information on how internal results are mapped to an outputVector, which is the result of a classification. This member should only be changed by experienced users. Usually, the outputTemplate is set by the classifier itself during the training.
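The status convention from the list above (methods return false on failure, details are fetched with getStatusString()) can be sketched as follows. `MockClassifier` and `trainAndReport` are hypothetical stand-ins written for this illustration. The signature of `train` shown here is an assumption and does not reproduce the library's interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mock, not a cvr class: it only mimics the convention that
// methods return false on failure and leave a message in the status string.
class MockClassifier {
 public:
  // Fails when no training data is given (assumed failure condition).
  bool train(const std::vector<std::vector<double> >& data) {
    if (data.empty()) {
      status_ = "train: no training data";
      return false;
    }
    status_.clear();
    return true;
  }

  const std::string& getStatusString() const { return status_; }

 private:
  std::string status_;
};

// Returns an empty string on success, the status message on failure.
std::string trainAndReport(MockClassifier& c,
                           const std::vector<std::vector<double> >& data) {
  if (c.train(data)) {
    return std::string();
  }
  // In this sketch a failed call always leaves a message behind.
  assert(!c.getStatusString().empty());
  return c.getStatusString();
}
```

The point of the pattern is that the return value is checked first and the status string is consulted only after a failure.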
The results of the classification are returned in an outputVector. This object contains labels and corresponding recognition values. These can often be interpreted as probabilities. Section Output Vector gives details about this data structure.
The progressInfo object gives information on how many steps an algorithm will take until finished and how many have already been accomplished. Also the name of the classifier is usually given. The following progress infos exist:
The parameters of the classifier class define an enumeration eDistanceMeasure which is used to specify which distance is used in the classifier. Options are the L1 and L2 distances.
See also cvr::classifier::parameters.
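For reference, the two distances can be written out in a few lines. This is a minimal sketch in plain C++. The enumeration `DistanceMeasure`, its enumerator names and the function `vectorDistance` are hypothetical and only mirror the idea behind eDistanceMeasure, whose actual enumerators are not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical counterpart of eDistanceMeasure (enumerator names assumed).
enum class DistanceMeasure { L1, L2 };

// L1: sum of absolute differences (city-block distance).
// L2: square root of the sum of squared differences (Euclidean distance).
double vectorDistance(const std::vector<double>& a,
                      const std::vector<double>& b,
                      DistanceMeasure measure) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = a[i] - b[i];
    sum += (measure == DistanceMeasure::L1) ? std::fabs(d) : d * d;
  }
  return (measure == DistanceMeasure::L1) ? sum : std::sqrt(sum);
}
```

For the points (0, 0) and (3, 4) the L1 distance is 7 and the L2 distance is 5.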
The output vector is the result of a classification, i.e. of calling the classify method of a classifier. It assigns values to the labels. In the case of supervised classifiers these labels were supplied by the user during training. In the case of unsupervised classification the classifier usually assigns labels from 0 to C-1, where C is the number of classes found.
Output vectors can be the final result of a process. In this case they are usually displayed by an application or used for statistical analysis of the classification process. For the latter the classificationStatistics functor can be used. It is also possible to combine the results of several classifiers using the combination functor.
For further reading see the documentation of cvr::classifier::outputVector.
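To illustrate the idea of labels paired with values, the following sketch uses a simplified stand-in. `SimpleOutput`, `winnerLabel` and `normalizeToProbabilities` are hypothetical names and are not members of cvr::classifier::outputVector. The sketch also assumes that labels are non-negative integers, so that -1 can signal an empty result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an output vector: labels[i] belongs to values[i].
struct SimpleOutput {
  std::vector<int> labels;
  std::vector<double> values;
};

// Returns the label with the largest value, or -1 if the output is empty.
int winnerLabel(const SimpleOutput& out) {
  assert(out.labels.size() == out.values.size());
  if (out.values.empty()) {
    return -1;
  }
  std::size_t best = 0;
  for (std::size_t i = 1; i < out.values.size(); ++i) {
    if (out.values[i] > out.values[best]) {
      best = i;
    }
  }
  return out.labels[best];
}

// Scales the values so that they sum to one, which allows reading them as
// probabilities. Leaves the values unchanged if their sum is not positive.
void normalizeToProbabilities(SimpleOutput& out) {
  double sum = 0.0;
  for (double v : out.values) {
    sum += v;
  }
  if (sum <= 0.0) {
    return;
  }
  for (double& v : out.values) {
    v /= sum;
  }
}
```

With labels 0, 1, 2 and values 1, 3, 4 the winner is label 2, and normalization yields 0.125, 0.375 and 0.5.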
cvr::draw cvr::epsDraw cvr::draw3D cvr::draw2DDistribution
cvr::classifier2DVisualizer
cvr::classificationStatistics
Sammon's Mapping transforms points in n-dimensional space to points in m-dimensional space while trying to preserve all distances between the points. Usually, m will be 2 or 3 so that the points can be displayed using one of the cvr::draw classes and an appropriate cvr::viewer. Sammon's mapping can be very useful to get an idea of the distribution of higher dimensional data without losing as much information as when using e.g. cvr::principalComponents to reduce the dimensionality.
However, the mapping is a very difficult optimization task and might easily fail to converge to a minimum. Check the error to get an idea of how well the distances were preserved. If the error is large, there are several options:
- try to reduce the amount of data (points). This can be done e.g. by using a cvr::clustering algorithm with the number of clusters set to a fraction of the number of data points, say a third. Then use the cluster centers as data points. This is an easy quantization method.
- reduce the number of dimensions a little, e.g. to 10, with cvr::principalComponents or a feature selection method. Use the lower dimensional points for Sammon's mapping. Thus, at least you'll get the information from 10 dimensions.
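As a reminder of what the error measures: in Sammon's original formulation it is the so-called stress, written below with d*_ij for the distance between points i and j in the n-dimensional space and d_ij for the distance between their images in the m-dimensional space. These symbols are introduced here for illustration only, and whether cvr::sammonsMapping reports exactly this quantity is an assumption.

```latex
E = \frac{1}{\sum_{i<j} d^{*}_{ij}}
    \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}}
```

A value of E = 0 means that all pairwise distances are preserved exactly. Because each term is divided by d*_ij, errors on small original distances weigh more heavily than errors on large ones.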
For more information see cvr::sammonsMapping.