Paul Azunre

since May 08, 2020

Recent posts by Paul Azunre

Don Horrell wrote:My other ML interest is topic modelling, using document vectors.
There do not seem to be pre-trained sets of document vectors available yet, but when there are, how could we use transfer learning to take a pre-trained set of document vectors and adapt it to domain-specific documents e.g. medical documents, documents about programming, patents etc?

I understand gensim has doc2vec included, although I have not used it and so cannot speak to its efficiency or performance. I would try this first if I were you.

Beyond that, you can represent a document with any word embedding by simply averaging the vectors of the words it contains. You can also try sent2vec and average the sentence vectors to represent the document, or learn your own aggregation function that takes the word vectors as inputs - there are many options. A Google search on this topic should turn up plenty of instructions.
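As a concrete illustration, here is a minimal sketch of the averaging approach with made-up three-dimensional vectors (real pretrained embeddings would have hundreds of dimensions):

```python
import numpy as np

# Toy pretrained embeddings; real ones (word2vec, GloVe, fastText)
# would have hundreds of dimensions. These vectors are made up.
embeddings = {
    "the": np.array([0.1, 0.3, -0.2]),
    "patient": np.array([0.7, -0.1, 0.4]),
    "recovered": np.array([0.5, 0.2, 0.6]),
}

def document_vector(tokens, embeddings):
    """Represent a document as the average of its in-vocabulary word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:  # no known words at all
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# "quickly" is out of vocabulary here and is simply skipped.
doc = ["the", "patient", "recovered", "quickly"]
vec = document_vector(doc, embeddings)
```

The same averaging works on top of any embedding lookup, which is why it is such a common first baseline.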

Regarding adapting to your domain, fine-tune say BERT on your domain, and then aggregate the word vectors (again, via say averaging) to get a feature for your document that is domain-specific.

- Paul

Don Horrell wrote:Thanks for your reply, Paul. I'm a little confused though.
The pre-trained FastText word embeddings I have downloaded map words to vectors, so in my case (using TensorFlow to do some NLP classification), I can only train my classifier on the words in the embedding list.
That is the crux of my original question - how can I add domain-specific vocabulary to pre-trained word embeddings. Will your book cover this?



These are your options:
1. If your fastText embedding is in the .vec format, it is not enough. You will also need the .bin format, which includes the subword information required to handle out-of-vocabulary words with fastText.
2. Use ELMo or BERT; they handle out-of-vocabulary words out of the box.
3. To get even better performance, fine-tune ELMo or BERT on your data so they can learn your words better.
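For intuition on point 1: fastText composes a word's vector from the vectors of its character n-grams, which is what the .bin format stores and the .vec format does not. A rough sketch of the n-gram decomposition (boundary markers as in the fastText paper; the hashing of n-grams into the model's bucket table is omitted):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams fastText uses to compose a word vector.
    '<' and '>' mark word boundaries, as in the fastText paper."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Even an out-of-vocabulary word decomposes into subwords the model
# has seen, so its vector can be built by summing subword vectors.
grams = char_ngrams("hello", 3, 4)
```

Because every word decomposes this way, a .bin model can assemble a vector for a word it never saw during training.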

Hope this clarifies things,

- Paul

Lucian Maly wrote:Hi @Paul Azunre,

In your book, are you planning to cover the evolution from context-independent, in-vocabulary word embeddings to ones that take word order into account in their training? Or to compare how word2vec, GloVe, ELMo, and BERT generate different vectors for the same sentence? I have seen an attempt at something similar on a widely circulated example sentence, but nothing that was easy to digest...

Thank you.


Yes, we cover the evolution from context-independent embeddings to the contextual ones, compare and contrast them, and provide examples (including code) on how to use them. Thanks for sharing the sentence, that is a very cool example! We have some examples which I think serve an equivalent purpose in the book, and we hope to add even more before it is done.

- Paul

Don Horrell wrote:Hi Paul Azunre.
I am trying to do multi-label classification on some text. The number of times each label has been assigned to the training text shows a large skew.
Is there anything that Transfer Learning can do to help?


I don't think this is a transfer learning problem per se. I think this is more of a fundamental challenge with multilabel classification, but I will try to suggest a way TL can be used.

The first question I would ask would be - "is the skew representative of the target distribution?". Practitioners are obsessed with balanced datasets, but in my opinion we tend to forget that the distribution in the training data needs to reflect the target distribution in the wild, and not necessarily be balanced. If your training data distribution shows 3% "anomaly" class, and your classifier is likely to see 3% of this class when deployed, then 3% "anomalies" in your training data is probably the right thing to do. I hope this makes sense.
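If you conclude the skew is *not* representative, one light-weight remedy before touching the data is class weighting in the loss. A sketch of the common n_samples / (n_classes * count) heuristic (the label names here are made up):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency, following the
    common n_samples / (n_classes * count) heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 97% "normal" vs 3% "anomaly", as in the example above.
labels = ["normal"] * 97 + ["anomaly"] * 3
weights = balanced_class_weights(labels)
# The rare class gets a proportionally larger weight in the loss.
```

Most frameworks accept such a dictionary directly (e.g., a class_weight argument), so no resampling of the data is needed.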

Beyond this, I would try "data augmentation" - duplicate some of the samples in the class whose count you are trying to increase, and substitute some of the words in the duplicates with their synonyms. You can use either pretrained word embeddings or a thesaurus to do this, for instance. This will increase the count of your under-represented class and has been observed to lead to significant improvements. Technically, since you are using pretrained knowledge in the form of embeddings or a thesaurus, this is an example of transfer learning, even if people may not acknowledge it as such.
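A minimal sketch of this synonym-substitution idea, using a tiny hand-made thesaurus (in practice you would pull synonyms from WordNet or from nearest neighbours in a pretrained embedding space):

```python
import random

# Tiny hand-made thesaurus for illustration only; in practice use
# WordNet or nearest neighbours in a pretrained embedding space.
SYNONYMS = {
    "bad": ["poor", "terrible"],
    "movie": ["film"],
    "great": ["excellent", "wonderful"],
}

def augment(tokens, p=0.5, rng=None):
    """Duplicate a sample with some words swapped for synonyms."""
    rng = rng or random.Random(0)
    return [
        rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
        for t in tokens
    ]

sample = ["a", "bad", "movie"]
new_sample = augment(sample, p=1.0)  # every replaceable word is swapped
```

Each call produces a slightly different duplicate, which is exactly what you want when inflating an under-represented class.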

Hope this is helpful!

- Paul

Don Horrell wrote:What are the strengths and weaknesses of Gensim and TensorFlow for NLP?
Which is best for the different types of project?

There is no right or wrong answer to this question, I think. Each tool is somewhat different and has a different purpose, and as an NLP practitioner you are likely to use both during your career, depending on the situation.

Gensim has some very easy-to-use implementations of several key techniques - topic modeling (LSA, LDA, etc.), word2vec, and other popular embeddings like doc2vec. I have found it helpful for quick prototypes in the past, although I wouldn't use it in production - the word2vec model was kinda slow in my experience when I tried it.

TensorFlow is focused on neural networks and deep learning. If you are doing deep learning, Gensim isn't really the competitor - PyTorch is the competitor in that domain. I prefer TensorFlow, but a large fraction of the community swears by PyTorch. I have had to use PyTorch when something I am implementing depends on a library that has a dependency on it.

Better yet, start with Keras. It is a higher-level library and so more approachable, and you can swap out the backend if you wish.
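A minimal Keras sketch of the kind of model you might start with - embed token ids, average them, and classify. The vocabulary size, sequence length, and random inputs are all made up for illustration:

```python
import numpy as np
from tensorflow import keras

VOCAB_SIZE, MAX_LEN = 1000, 20  # assumed sizes for this sketch

# Embed token ids, average the embeddings over the sequence,
# then apply a single sigmoid unit for binary classification.
model = keras.Sequential([
    keras.layers.Embedding(VOCAB_SIZE, 16),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Fake integer-encoded batch, just to show the shapes involved.
x = np.random.randint(0, VOCAB_SIZE, size=(8, MAX_LEN))
probs = model.predict(x, verbose=0)  # shape (8, 1), values in (0, 1)
```

Three layers and a compile call - this approachability is why Keras is a good on-ramp before dropping down to raw TensorFlow or PyTorch.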

Hope this is a helpful answer, let me know if I can clarify anything further!

Serge Yurk wrote:Hello Paul.
Hope your book contains a lot of interesting info about cross-lingual models.
Could you please give me some links to success stories in this field for English-German and English-Russian?
Thank you in advance,


Yes, we will try to cover the basics of cross-lingual transfer, which is a complex topic and a burgeoning field of study right now.

Seq2Seq models with attention work pretty well on parallel datasets for pure translation, with the current trend being to replace them with transformer-based models.

Where transfer learning has been incorporated into cross-lingual learning specifically is through embedding models that are simultaneously trained on multiple tasks in multiple languages, e.g., mBERT and the Universal Sentence Encoder Multilingual. The idea is that learning multiple tasks simultaneously across multiple languages (more than 100 languages in the case of mBERT) allows the model to learn features for cross-lingual tasks like translation.
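The intuition is that a sentence and its translation end up close together in the shared embedding space. A toy sketch with made-up vectors standing in for a multilingual encoder's outputs (cosine similarity is the usual comparison):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for the outputs of a multilingual
# encoder such as mBERT or the multilingual Universal Sentence Encoder.
en = np.array([0.90, 0.10, 0.30])    # "The cat sleeps."
de = np.array([0.85, 0.15, 0.35])    # "Die Katze schläft." (translation)
ru = np.array([0.10, 0.90, -0.20])   # an unrelated Russian sentence

# The translation pair is far more similar than the unrelated pair.
sim_translation = cosine(en, de)
sim_unrelated = cosine(en, ru)
```

With a real multilingual encoder this same comparison underlies cross-lingual retrieval and translation-pair mining.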

Apparently this works pretty well for high resource languages like English, German and Russian, so you should be able to work with these methods. For lower resource languages, like many African Languages, more work remains to be done and I would say this is likely the most exciting line of research right now with many open problems.

- Paul

Don Horrell wrote:Hi Paul Azunre.
There are several pre-trained word embeddings available, but they generally cover the most common words.
Can I do something similar to Transfer Learning - start with a pre-trained set of word embeddings, then add my own domain-specific words somehow?


Don, the short answer is definitely absolutely YES!

Indeed, the new breed of methods - pretrained models like ELMo, BERT, etc. - work at the character or subword level. This means you are no longer limited to in-vocabulary words, as used to be the case with word2vec and similar methods. Even fastText, which preceded them, works at the subword level and handled this reasonably well.

Moreover, each method can be "fine-tuned" on your particular domain so that the general knowledge it contains can be adapted to the slang, differing meanings, and structure inherent in the natural language you care about, beyond merely any particular words!

Hope this is helpful!

peterr paul wrote:Hi

Can anyone suggest the best course for a fresher to learn?

Would it be better to go with AI or Machine learning?
And what would be the duration and methods of learning for such a course?

Machine Learning is a type of AI. AI broadly includes the symbolic, rule-based approaches we used many years ago. Today the dominant paradigm is Machine Learning, i.e., we train systems by giving them examples of input and output signals, rather than programming rules for every situation, which doesn't really work for complex problems.

TL;DR - Machine Learning is what you are looking to study, and it is the most prominent type of AI you should care about right now.

Lucian Maly wrote:Two emerging pretrained language models - BERT (which uses a bidirectional transformer) and ELMo (which uses a concatenation of forward and backward LSTMs) - open up new possibilities in language processing. I see in the book sample that, for instance, BERT plus logistic regression is the best approach for the email classification and also for the IMDB movie review classification, but what are the general rules for using one or the other, or even something else like GPT? Obviously the answer is not simple - it depends on the initial amount of data and hyperparameter tuning - but is there some kind of guidance or list of specific use cases on where to use which algorithm?

Thank you so much for the response.

Great question Lucian!

As you correctly alluded to, there is no simple yes or no answer, so that caveat has to be made first. I usually try all the embeddings in my toolbox on any given problem and see what works best. Sometimes issues arise that you could not predict in advance - something as silly as particular dependencies of the deployment environment may drive your decision more than marginal differences in performance between methods...

Generally, GPT has been preferred for text generation, while BERT and ELMo are preferred for classification and related tasks. I have seen ELMo do better than BERT when the amount of data for fine-tuning is very small, and other people swear by it in certain applications like low-resource languages... I think BERT is the most mature technique in terms of community mind-share, and the consequence of that is a stronger software ecosystem. There have been many advances in making it more efficient too - ALBERT, DistilBERT, etc. Variants of it blow the other approaches out of the water on most GLUE tasks (the established benchmark set) - importantly question answering, which has been finding a tremendous number of applications.

I hope this is a helpful response. We try to cover all of the above in the book.

Sherin Mathew wrote:
As per my knowledge, you would require a good grasp of the following subjects:
a. Linear algebra
b. Probability and Statistics
c. Artificial Intelligence and Neural Networks
d. Programming in any high-level language, preferably Python or MATLAB (built-in libraries and functions available)

Some of the prerequisites for learning Natural Language Processing: you should be able to understand concepts like sentence breaking, speech recognition, and information extraction; learn Python or TensorFlow; and have some knowledge of algorithms.

Excellent answer! As someone who came from many years of using MATLAB in academia and now uses Python, I would say: focus on Python for anything AI/ML/NLP.

Campbell Ritchie wrote:

Sherin Mathew wrote:. . . Linear algebra . . .


One answer is "Whenever we deal with matrices and vectors a lot, this becomes important." Our data is usually organized into a matrix of sorts before being fed to the NLP algorithm. Of course, it is also important for understanding the theory behind many of the algorithms...
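For instance, a bag-of-words representation is literally a document-term matrix, after which everything is linear algebra (the toy corpus is made up):

```python
import numpy as np

# Three tiny tokenized "documents".
docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran", "ran"]]
vocab = sorted({w for d in docs for w in d})

# Document-term count matrix: one row per document, one column per word.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d:
        X[i, vocab.index(w)] += 1
# From here on it is linear algebra: multiply X by an embedding
# matrix, factor it with SVD for LSA, feed it to a classifier, etc.
```

This is why linear algebra keeps showing up: most NLP pipelines reduce text to exactly this kind of matrix before any modeling happens.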

Abhisek Pattnaik wrote:Previously, NLP stood on its own ground, but now with AI and ML it has taken a new curve. What do I need to know to get a proper start with the subject?
Does the book cover these?

Thanks for your question! To get the most out of the book, we recommend some experience with Python, as well as some intermediate machine learning skills - such as an understanding of basic classification and regression concepts. It would also help to have some basic data manipulation and preprocessing skills with libraries such as Pandas and NumPy.

That said, the book was written in a way that allows you to pick up these skills with some extra work. The first couple of chapters, as Lucian mentioned above, will attempt to bring you up to speed on everything you need to know. It is a rapidly evolving field, so we will all need to keep learning to keep up from there!

RajKumar Valluru wrote:Hi Paul, so this book will teach how to leverage prebuilt NLP models?

Yes, exactly! This is the trend that has emerged in NLP recently. We cover the various trending architectures like ELMo and BERT in the context of applied code examples.

Campbell Ritchie wrote:Sorry if I'm late, but welcome to the Ranch and I hope you have lots of difficult interesting questions.

Thank you, Campbell and Jeanne! Looking forward to it.