So if we're using the tokenisers in TensorFlow, what is the best way to deal with words that are not in the vocabulary of the word embeddings we have chosen to work with? I'm thinking of niche areas with lots of technical jargon (medical, engineering, science journals, etc.) that standard word embeddings may not cover. Other use-cases are misspelled words and text-speak.
If the TensorFlow tokenisers just map all of these important words to a single "unknown" token, the ML model will perform poorly.
Is it possible to load a standard word embedding and then to add in the extra words somehow?
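To make the problem concrete, here is a minimal sketch (with a made-up toy vocabulary, not a real embedding) of how a word-level tokenizer collapses every out-of-vocabulary word to one shared id, so the model cannot distinguish between different jargon terms:

```python
# Toy vocabulary standing in for a pretrained embedding's word list.
# Id 0 is the shared "unknown" token.
vocab = {"<UNK>": 0, "the": 1, "patient": 2, "has": 3, "a": 4, "fever": 5}

def tokenize(text):
    # Every word missing from the vocabulary maps to the same <UNK> id
    return [vocab.get(w, vocab["<UNK>"]) for w in text.lower().split()]

ids = tokenize("The patient has acute myocarditis")
# "acute" and "myocarditis" both become id 0 -- their identity is lost
```

This is the same behaviour you get from `tf.keras.preprocessing.text.Tokenizer` when an `oov_token` is set: all unseen words share one token, and with it one embedding vector.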
1. Instead of word embeddings, try character embeddings or a subword approach such as fastText, which can build a vector for any word from its pieces.
2. Use an existing word embedding, add vectors for the new words in your corpus, and then fine-tune it. My argument for doing this is that the model will quickly learn good vectors for the new words, since it probably already has good vectors for the words that appear in their context.
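Option 2 can be sketched as follows: extend the pretrained embedding matrix with new rows for the corpus-specific words, then fine-tune the whole matrix. The tiny 2-dimensional "pretrained" vectors and the word list here are made up for illustration; initialising new rows near the mean of the known vectors is one common heuristic, not the only choice:

```python
import numpy as np

# Stand-in for a loaded pretrained embedding (word -> vector)
pretrained = {"patient": np.array([0.1, 0.2]), "fever": np.array([0.3, 0.1])}
new_words = ["myocarditis", "troponin"]   # jargon missing from the embedding

vocab = list(pretrained) + new_words
dim = len(next(iter(pretrained.values())))
rng = np.random.default_rng(0)

rows = [pretrained[w] for w in pretrained]
# Initialise unknown words near the mean of the known vectors;
# the small noise keeps the new rows from being identical
mean_vec = np.mean(rows, axis=0)
rows += [mean_vec + rng.normal(scale=0.01, size=dim) for _ in new_words]
embedding_matrix = np.stack(rows)

# In Keras, this matrix would seed a *trainable* Embedding layer, e.g.:
#   tf.keras.layers.Embedding(
#       len(vocab), dim,
#       embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
#       trainable=True)
# so fine-tuning updates both the pretrained and the new rows.
```

If you want to preserve the pretrained vectors exactly, you could instead freeze the original rows and train only the appended ones, at the cost of a slightly more involved layer setup.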
PhD | Senior Data Scientist | AI/ML Educator