Apologies for asking lots of questions, but I need to transfer all your knowledge into my brain before the end of Friday.
With images as the input data it is easy to synthesise lots of data from a smaller dataset.
Are there techniques we can use with Natural Language Processing to synthesise data automatically? What if the subject matter is focussed on some technical area, so randomly changing words, even if they are synonyms, is not really an option?
As you say, for images you can augment data simply by applying matrix/vector transformations.
For NLP, it gets trickier. The synonym approach is good, but in some cases it can change the meaning of your sentences. One thing you could try is randomly dropping stop words. Alternatively, if you compute global word-frequency statistics for the corpus, you can drop each word with a probability derived from its frequency (e.g. more frequent words are dropped more often). But again, you need to be careful, as this can also end up changing the meaning of the sentences.
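To make the two dropping ideas concrete, here is a minimal sketch in plain Python. Everything named in it is an assumption for illustration only: the tiny `STOP_WORDS` set, the function names, the whitespace tokenisation, and the `p` and `max_p` parameters. In particular, scaling the drop probability linearly with frequency is just one simple choice; other schedules would work too.

```python
import random
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real project would use a fuller list suited to its domain.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "is", "are", "and", "or", "for", "with"}


def drop_stop_words(tokens, p=0.3, rng=None):
    """Drop each stop word independently with probability p; keep all other tokens."""
    rng = rng or random.Random()
    kept = [t for t in tokens if not (t.lower() in STOP_WORDS and rng.random() < p)]
    return kept or list(tokens)  # fall back to the original rather than return nothing


def drop_probabilities(corpus, max_p=0.5):
    """Give each word a drop probability proportional to its corpus frequency.

    The most frequent word gets max_p; rarer words get proportionally less.
    """
    counts = Counter(t.lower() for sentence in corpus for t in sentence)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: max_p * count / top for word, count in counts.items()}


def drop_by_frequency(tokens, probs, rng=None):
    """Drop each token with its precomputed probability (unknown words are kept)."""
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() >= probs.get(t.lower(), 0.0)]
    return kept or list(tokens)


if __name__ == "__main__":
    corpus = [s.split() for s in [
        "the pump is connected to the valve",
        "the valve controls the flow of coolant",
    ]]
    probs = drop_probabilities(corpus)
    rng = random.Random(0)  # seeded so runs are repeatable
    for sentence in corpus:
        print(drop_stop_words(sentence, p=0.5, rng=rng))
        print(drop_by_frequency(sentence, probs, rng=rng))
```

Keeping `p` and `max_p` small limits how far each augmented sentence drifts from the original, and it is worth reading a sample of the outputs by hand, since in a technical domain even a dropped "of" or "to" can change what a sentence says.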
Thushan Ganegedara
PhD | Senior Data Scientist | AI/ML Educator