Pre-trained model architectures - which one to choose

 
Author: Lucian Maly
Posts: 66
Two emerging pre-trained language models - BERT (which uses a bidirectional transformer) and ELMo (which concatenates representations from left-to-right and right-to-left LSTMs) - open up new possibilities in language processing. I see in the book sample that, for instance, BERT combined with logistic regression performs best for email classification and also for IMDB movie review classification, but what are the general rules for choosing one or the other, or even something else like GPT? Obviously the answer is not simple - it depends on the amount of available data and on hyperparameter tuning - but is there some kind of guidance, or a list of specific use cases, on where to use which algorithm?

Thank you so much for the response.
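
(For reference, the BERT-plus-logistic-regression recipe mentioned above looks roughly like the following. This is a minimal sketch, not the book's code: the model name, the mean-pooling strategy, and the toy spam/ham data are all assumptions.)

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool BERT's last hidden states into one fixed-size vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy stand-in for a real labeled email corpus (1 = spam, 0 = ham)
texts = ["Win a free prize now!!!", "You have been selected for a reward",
         "Meeting moved to 3pm", "Here are the minutes from yesterday"]
labels = [1, 1, 0, 0]

clf = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
print(clf.predict(embed(["Claim your reward today"])))
```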
 
Author
Posts: 14

Lucian Maly wrote: ... is there some kind of guidance, or a list of specific use cases, on where to use which algorithm?



Great question, Lucian!

As you correctly allude to, there is no simple yes-or-no answer, so that caveat has to be made first. I usually try all the embeddings in my toolbox on any given problem and see what works best. Sometimes issues arise that you could not predict in advance - something as silly as the particular dependencies of the deployment environment may drive your decision more than a marginal difference in performance between methods... A sketch of that workflow follows below.
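
(The "try everything in the toolbox" workflow can be as simple as plugging different featurizers into the same downstream classifier and cross-validating. A hedged sketch: the TF-IDF baseline is my own addition, `embed` is the mean-pooled BERT function from the first sketch, and on a real dataset you would use more folds.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def embed_tfidf(texts):
    # Cheap bag-of-words baseline; often surprisingly hard to beat
    return TfidfVectorizer().fit_transform(texts)

embedders = {
    "tfidf": embed_tfidf,
    "bert": embed,       # mean-pooled BERT from the earlier sketch
    # "elmo": ...        # a pooled ELMo featurizer would slot in the same way
}

for name, featurize in embedders.items():
    X = featurize(texts)  # texts/labels as in the earlier sketch
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=2)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```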

Generally, GPT has been preferred for text generation, while BERT and ELMo are preferred for classification and related tasks. I have seen ELMo do better than BERT when the amount of data for fine-tuning is very small, and other people swear by this in certain applications, like low-resource languages... I think BERT is the most mature technique in terms of community mind-share, and the consequence of that is a stronger software ecosystem. There have been many advances in making it more efficient too - ALBERT, DistilBERT, and so on. Variants of it blow the other approaches out of the water on most GLUE tasks (the established benchmark suite) - importantly Question Answering, which has been finding a tremendous number of applications. The division of labor is roughly as in the sketch below.
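
(To make that division of labor concrete, here is a sketch using Hugging Face pipelines. The model names are common public checkpoints, not recommendations from the book.)

```python
from transformers import pipeline

# GPT-family model for open-ended text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("The movie was", max_length=20)[0]["generated_text"])

# BERT-family model (DistilBERT fine-tuned on SST-2) for classification
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("An unexpectedly moving film."))
```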

I hope this is a helpful response. We try to cover all of the above in the book.

 