A Survey on Semi-Supervised Learning Algorithms — Part 3
Continuing the literature survey on semi-supervised learning algorithms, we now come to:
D. Generative Models
One of the oldest semi-supervised learning approaches, a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variables conditional on the observed variables (Zhu, 2008). A generative model is therefore extremely flexible: it can, for example, generate values for all variables in the model, whereas a discriminative model only allows sampling of the target variables conditional on the observed quantities. Since discriminative models cannot generally express more complex relationships between the observed and target variables, generative models are preferred in some problems. Discriminative models are nevertheless often claimed to exhibit better predictive performance than generative models. An additional concern, pointed out by Cozman & Cohen (2002), is that unlabeled data can degrade performance in generative models.
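To make the distinction concrete, here is a minimal sketch in plain NumPy of a generative classifier that models p(x, y) = p(y) p(x | y) and can therefore sample whole (x, y) pairs, something a purely discriminative model of p(y | x) cannot do. The toy data and the Gaussian class-conditional assumption are illustrative, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: two classes with Gaussian class-conditional densities.
X0 = rng.normal(loc=-2.0, scale=1.0, size=200)
X1 = rng.normal(loc=+2.0, scale=1.0, size=200)
X = np.concatenate([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# Generative model: estimate p(y) and p(x | y) by maximum likelihood.
priors = np.array([np.mean(y == c) for c in (0, 1)])
means = np.array([X[y == c].mean() for c in (0, 1)])
stds = np.array([X[y == c].std() for c in (0, 1)])

# Because the joint p(x, y) is modeled, we can sample brand-new (x, y) pairs...
c = rng.choice(2, p=priors)
x_new = rng.normal(means[c], stds[c])

# ...and also classify via Bayes' rule: p(y | x) proportional to p(y) p(x | y).
def posterior(x):
    lik = np.exp(-0.5 * ((x - means) / stds) ** 2) / stds
    joint = priors * lik
    return joint / joint.sum()

print(c, x_new, posterior(0.5))
```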
One example of a generative model for semi-supervised sequence learning is the Hidden Markov Model (HMM), trained in particular with the Baum-Welch algorithm (Rabiner, 1989).
The Baum-Welch algorithm is most famously used in speech recognition, and it has also been applied as a semi-supervised learning algorithm (Elworthy, 1994).
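As a rough illustration of Baum-Welch training on unlabeled sequences, here is a sketch using the third-party hmmlearn library, whose fit method runs EM (Baum-Welch) under the hood. The data, number of states, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

rng = np.random.default_rng(0)

# Unlabeled 1-D observation sequence (no state labels are provided).
obs = np.concatenate([
    rng.normal(0.0, 1.0, 300),
    rng.normal(5.0, 1.0, 300),
]).reshape(-1, 1)

# A 2-state Gaussian HMM; fit() runs Baum-Welch (EM) on the unlabeled sequence.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag",
                        n_iter=50, random_state=0)
model.fit(obs)

# The trained model decodes a hidden state for every observation.
states = model.predict(obs)
print(model.means_.ravel(), states[:10])
```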
Nigam, McCallum, Thrun & Mitchell (2000) introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. EM is a class of iterative algorithms for obtaining maximum likelihood or maximum a posteriori estimates in learning problems with missing data (Dempster, Laird, & Rubin, 1977). Initially, the naive Bayes classifier is trained with only the available labeled documents; it then assigns probabilistically-weighted class labels to each unlabeled document by calculating the expectation of the missing class labels. Next it trains the classifier (from scratch) on the newly formed labeled data set and iterates. This iteration continues until the model's parameter estimates no longer change, as measured by the complete log-likelihood of the parameters. In its maximum likelihood formulation, EM performs hill-climbing in data-likelihood space and so converges to a local maximum of the likelihood.
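The loop described above can be sketched roughly as follows, with scikit-learn's MultinomialNB standing in for the naive Bayes text classifier; the soft labels from the E-step are passed to the M-step as sample weights. The function name, convergence test, tolerance, and iteration cap are illustrative assumptions rather than details from the paper.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, max_iter=20, tol=1e-4):
    """Semi-supervised naive Bayes in the spirit of Nigam et al. (2000).

    Assumes dense count features and class labels 0..n_classes-1.
    """
    clf = MultinomialNB()
    clf.fit(X_lab, y_lab)                 # initial model: labeled documents only
    prev_post = None
    for _ in range(max_iter):
        # E-step: probabilistically label the unlabeled documents.
        post = clf.predict_proba(X_unlab)                 # (n_unlab, n_classes)

        # Simple convergence check: stop when the posteriors stop changing
        # (the paper monitors the complete log-likelihood instead).
        if prev_post is not None and np.abs(post - prev_post).max() < tol:
            break
        prev_post = post

        # M-step: retrain from scratch on labeled + soft-labeled data,
        # expanding each unlabeled document into one weighted row per class.
        X_all = np.vstack([X_lab] + [X_unlab] * n_classes)
        y_all = np.concatenate(
            [y_lab] + [np.full(X_unlab.shape[0], c) for c in range(n_classes)]
        )
        w_all = np.concatenate(
            [np.ones(X_lab.shape[0])] + [post[:, c] for c in range(n_classes)]
        )
        clf = MultinomialNB()
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf
```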
Certain assumptions about how the data are generated must be satisfied (Castelli & Cover, 1995; Castelli & Cover, 1996; Ratsaby & Venkatesh, 1995):
1. The data is generated by a mixture model.
In the image shown, the data follow a mixture model: the underlying data are drawn from three Gaussian distributions. In such a mixture model, which is a probabilistic model for representing the presence of subpopulations within an overall population, the three Gaussians could potentially represent three classes.
2. There is correspondence between mixture components and classes.
Intuitively, using unsupervised learning we could cluster the data shown in the above figure into three clusters. Then, with the help of the labeled data, we could associate each cluster with its class label. However, this requires the assumption that the three mixture components correspond to the classes.
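A minimal sketch of this "cluster, then label the clusters" intuition, assuming scikit-learn's GaussianMixture and a synthetic three-Gaussian data set (all names, means, and counts here are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Unlabeled data drawn from three well-separated Gaussian subpopulations.
X = np.vstack([rng.normal(m, 0.5, size=(200, 2)) for m in ((0, 0), (4, 0), (2, 3))])

# A handful of labeled points, a few from each class.
X_lab = np.array([[0, 0], [4, 0], [2, 3], [0.2, -0.1], [3.9, 0.3], [2.1, 2.8]])
y_lab = np.array([0, 1, 2, 0, 1, 2])

# Step 1: fit a 3-component mixture to the unlabeled data (unsupervised).
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Step 2: use the labeled points to map each mixture component to a class
# (majority vote among the labeled points assigned to that component).
comp_of_lab = gmm.predict(X_lab)
comp_to_class = {c: np.bincount(y_lab[comp_of_lab == c]).argmax() for c in range(3)}

# Classify the unlabeled points via their most likely component.
y_pred = np.array([comp_to_class[c] for c in gmm.predict(X)])
print(comp_to_class, y_pred[:10])
```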
However, Nigam et al. (2000) propose ways out if the data do not fit these assumptions. The extensions proposed are as follows:
1. Introduce a weighting factor that dynamically adjusts the strength of the unlabeled data’s contribution to the parameter estimation in EM (see the sketch after this list).
2. Reduction of the bias of Naïve Bayes by modeling each class with multiple mixture components, instead of a single component.
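In the EM sketch given earlier, the first extension amounts to scaling the sample weights of the unlabeled rows by a factor lam in (0, 1]; the variable name and value below are illustrative, chosen to mirror the weighting factor described by Nigam et al.

```python
# Modification to the M-step of the earlier em_naive_bayes sketch:
# down-weight the unlabeled documents' contribution by a factor lam in (0, 1].
lam = 0.5  # illustrative value; in practice it would be tuned, e.g. by cross-validation
w_all = np.concatenate(
    [np.ones(X_lab.shape[0])]                       # labeled docs keep full weight
    + [lam * post[:, c] for c in range(n_classes)]  # unlabeled docs scaled by lam
)
```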
The last extension is particularly interesting as it relaxes the second assumption, which states that there should be a one-to-one correspondence between mixture components and classes. Instead, this extension allows the algorithm to have a many-to-one correspondence between mixture components and classes. This is equivalent to saying that certain features can represent one sub-population of a class, while another sub-population of the same class is represented by a completely different feature subset. In the example provided in the paper, textual data are considered: if a class named Sport consists of documents that talk about hockey and baseball, then these documents would be characterized by the co-occurrence of words such as ‘bat’ and ‘baseball’ or ‘ice’ and ‘puck’. These dependencies cannot be captured by a single multinomial distribution over words in the Sport class, but with multiple mixture components per class, one multinomial can cover the hockey sub-topic and another the baseball sub-topic, thus more accurately capturing the co-occurrence patterns of the above four words.
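To illustrate the many-to-one idea, here is a rough sketch of a classifier that models each class with its own multi-component mixture and classifies via Bayes’ rule. It uses Gaussian mixtures rather than the multinomial word mixtures of the paper, purely for brevity, and the class name and parameters are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class MixturePerClassClassifier:
    """Each class is modeled by its own mixture (many components -> one class)."""

    def __init__(self, n_components_per_class=2):
        self.k = n_components_per_class

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # One mixture model per class, fit on that class's examples only.
        self.gmms_ = [
            GaussianMixture(n_components=self.k, random_state=0).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict(self, X):
        # log p(y) + log p(x | y), where p(x | y) is the class's mixture density.
        log_joint = np.column_stack([
            np.log(p) + g.score_samples(X) for p, g in zip(self.priors_, self.gmms_)
        ])
        return self.classes_[np.argmax(log_joint, axis=1)]
```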
Baluja (1998) uses the same algorithm for probabilistic modeling on a face orientation discrimination task.
Fujino et al. (2005) propose a hybrid generative-discriminative approach that extends generative mixture models with a ‘bias correction’ term and with discriminative training based on the maximum entropy principle.
Please find the detailed set of references in Part 1 of this series. Questions, comments — please feel free to drop them here.