Research Problem 1
Here is the motivation for the use of machine learning in digital libraries:
1. The digital library needs to be able to identify all documents relevant to a user’s query. This function is sometimes supported by an indexing system in which each document is tagged with the labels of all the topics it represents.
2. The indexing system is relatively easy to create in small document collections: an expert reads each document and then decides which topics it represents.
3. In large collections, however, this is prohibitively expensive—and clearly impossible if hundreds of thousands of documents are added to the library every week, or even on a daily basis.
4. In this latter case, one solution is to classify manually only a subset of the documents, and then employ the training set thus obtained for the induction of a classifier.
5. The induced classifier then labels those documents that have not been classified manually.
6. Note: the principle can be applied to other domains, not just digital libraries.
7. Note: each training example is described by a feature vector in which each feature gives the frequency of one English word. Each training example is labeled with the set of topics it represents.
Here is the research question: How many documents should we classify manually if we want to induce a high-performance classifier?
Put another way: How much can be gained from the use of machine learning?
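The word-frequency representation from note 7 can be sketched as follows. This is a minimal illustration: the tiny vocabulary, the function name, and the sample document are assumptions for the sake of the example, not part of the proposal.

```python
from collections import Counter

# Hypothetical mini-vocabulary; a real digital library would use
# thousands of English words.
VOCABULARY = ["library", "machine", "learning", "document", "topic"]

def feature_vector(text):
    """Map a document to relative word frequencies over VOCABULARY."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1          # avoid division by zero
    return [counts[w] / total for w in VOCABULARY]

doc = "machine learning helps the library label each document by topic"
vec = feature_vector(doc)                      # one frequency per vocabulary word
```

In the full setting, each such vector would additionally carry the set of topic labels attached to the document.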
Suppose we have 10^6 documents. If we manually classify only a few, the induced classifier will overfit the training examples, and thus perform poorly on the remaining documents. The situation will improve if the training set consists of, say, 10% of the collection or more; but then, the price of manual classification will become prohibitive.
This motivates an experimental study whose goal is to identify the right size of the training set. For the sake of simplicity, we begin with the simplified case where the classes are independent of each other. For the time being, we will ignore the case of hierarchical classifiers by considering only the class labels at the highest level; this was done in the following paper:
Sarinnapakorn, K. and Kubat, M. Combining Subclassifiers in Text Classification: A DST-Based Solution and a Case Study. IEEE Transactions on Knowledge and Data Engineering, 19 (2007), No. 12, pp. 1638–1651
Summary of the proposed experiment:
1. Take a testbed of your choice, perhaps the Eurovoc data.
2. Set aside 50% of the examples to be used for testing.
3. From the remaining 50%, create 10 training sets, the first containing 5% of the examples, the next 10%, and so on, until the last one contains 50% of the examples. For instance, if the original domain contains 100,000 examples, 50,000 examples will be withheld for testing. For training, we will create 10 subsets: T1 contains 5,000 examples, T2 contains twice as many, . . . , and T10 contains 50,000 examples. Make sure that T1 ⊂ T2 ⊂ . . . ⊂ T10.
4. Make sure that each of the classes is equally represented in each training set. This means that if 5,000 of the original 100,000 examples belong to class A, then 5% of the examples in each training set, Ti, belong to A; likewise, 5% of the testing examples belong to A.
5. For each Ti, induce a classifier, Ci, and test its performance on the testing set.
6. Plot the performance on the testing set. It is more than likely that, as the size of the training set grows, the performance on the testing set improves, but only up to a certain point, beyond which the performance does not improve no matter how much we extend the training set.
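Steps 2–4 above can be sketched in plain Python. This is only an illustration of the splitting scheme: the function name, the toy class labels, and the collection size are assumptions; a real study would plug in the chosen testbed (e.g., the Eurovoc examples).

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, seed=0):
    """Step 2: hold out 50% for testing. Step 3: build ten nested
    training sets T1 ⊂ ... ⊂ T10 holding 5%, 10%, ..., 50% of the
    collection. Step 4: preserve the class proportions throughout."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)

    test_set, pools = [], {}
    for y, xs in by_class.items():
        rng.shuffle(xs)
        half = len(xs) // 2
        test_set.extend((x, y) for x in xs[:half])   # 50% of each class for testing
        pools[y] = xs[half:]                         # remaining 50% for training

    training_sets = []
    for i in range(1, 11):                           # T1 .. T10
        frac = 0.05 * i                              # 5%, 10%, ..., 50% of the whole
        Ti = []
        for y, xs in pools.items():
            k = round(frac * len(by_class[y]))       # per-class quota, step 4
            Ti.extend((x, y) for x in xs[:k])        # prefixes guarantee T1 ⊂ T2 ⊂ ...
        training_sets.append(Ti)
    return training_sets, test_set

# Toy collection: 200 documents, two equally frequent classes.
training_sets, test_set = stratified_split(list(range(200)),
                                           ["A"] * 100 + ["B"] * 100)
```

Step 5 would then train one classifier per Ti and score it on `test_set`; taking nested prefixes of a single shuffled pool is what makes each Ti a subset of the next.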
Possible outcomes:
1. It may turn out that only a small percentage of all examples is enough for the induction of a relatively high-performance classifier. In this case, the use of machine learning is justified.
2. Conversely, it may turn out that even using 50% of the examples for training is not enough. In this case, machine learning does not seem to help.
3. Most likely, the observed result will be somewhere between these two extremes.
4. We might want to know whether the observation is the same in each of the studied experimental domains. This means that we want to repeat this experiment for several different domains.
Possible extensions: Once we get used to this kind of study, we will generalize it for the hierarchical case, where the interrelation of the class labels can be specified by a generalization tree or a directed acyclic graph (DAG), as in Vateekul, Kubat, and Sarinnapakorn (2014), the paper I sent you a week or two ago:
Hierarchical Multi-Label Classification with SVMs: a Case Study in Gene Function Prediction