Don Horrell wrote: Hi Paul Azunre,
I am trying to do multi-label classification on some text, and the number of times each label appears in the training data is heavily skewed.
Is there anything that Transfer Learning can do to help?
Thanks
Don.
I don't think this is a transfer learning problem per se; it is more of a fundamental challenge with multi-label classification. But I will try to suggest a way transfer learning can help.
The first question I would ask is: "Is the skew representative of the target distribution?" Practitioners are often obsessed with balanced datasets, but in my opinion we tend to forget that the distribution in the training data needs to reflect the target distribution in the wild, not necessarily be balanced. If your training data shows a 3% "anomaly" class, and your classifier is likely to see 3% of this class when deployed, then 3% "anomalies" in your training data is probably the right choice. I hope this makes sense.
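As a quick sanity check, you can tabulate the per-label counts and compare them against what you expect at deployment. Here is a minimal sketch; `train_labels` is a hypothetical list of label lists, one per sample, standing in for your data:

```python
# Inspect per-label frequencies in a multi-label dataset, so the training
# skew can be compared against the distribution expected in the wild.
from collections import Counter

# Hypothetical multi-label data: one list of labels per training sample.
train_labels = [
    ["bug", "ui"],
    ["bug"],
    ["anomaly"],
    ["bug", "performance"],
]

counts = Counter(label for labels in train_labels for label in labels)
total = len(train_labels)
for label, count in counts.most_common():
    print(f"{label}: {count}/{total} samples ({100 * count / total:.0f}%)")
```

If those percentages roughly match what the deployed model will see, the "skew" may not be a problem at all.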
Beyond this, I would try data augmentation: duplicate some of the samples in the class whose count you are trying to increase, and substitute some of the words in the duplicates with their synonyms. You can use either pretrained word embeddings or a thesaurus to find the synonyms (an example reference that discusses this: https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28). This will increase the count of your under-represented class and has been observed to lead to significant improvements. Technically, since you are using pretrained knowledge in the form of embeddings or a thesaurus, this is an example of transfer learning, even if people may not acknowledge it as such.
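Here is a rough sketch of the thesaurus variant, using NLTK's WordNet as the synonym source. The function name and parameters are just illustrative, not from any particular augmentation library:

```python
# Synonym-substitution data augmentation, with WordNet as the thesaurus.
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time download of the thesaurus


def synonym_augment(text, replace_prob=0.2, seed=None):
    """Return a copy of `text` with some words swapped for WordNet synonyms."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        # All distinct surface forms of the word's synonyms, excluding itself.
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()
            if lemma.name().lower() != word.lower()
        }
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(sorted(synonyms)))
        else:
            out.append(word)
    return " ".join(out)


# Duplicate under-represented samples and perturb the duplicates.
minority_samples = ["the server crashed after the latest software update"]
augmented = [synonym_augment(s, seed=i) for i, s in enumerate(minority_samples)]
print(augmented)
```

The embedding-based variant works the same way; you would just pick replacement candidates by cosine similarity of pretrained word vectors instead of WordNet synonymy.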
Hope this is helpful!
- Paul