Collecting tweets: which is better, a Reader class or doing it in the Map function?

 
Arwa Saad
Greenhorn
Posts: 12
So, I'm working on Twitter sentiment analysis. I have created a Reader.java that reads tweets, cleans them, and stores them in HDFS. Then the Map function takes that input and does the sentiment work.
Currently I'm collecting a small number of tweets, but I thought: if I collect a huge number, wouldn't it be best to put the Reader's work in the Map function, for the sake of handling big data?
I tried doing it, but the Map function shows me a problem: it needs an input.

I'm confused. Can anyone explain which solution is best? Keep the Reader, or add its code to Map and do some workaround for the input file?

Thanks,
 
Karthik Shiraly
Bartender
Posts: 1210
Receiving tweets directly in mappers is not a good idea.
In such a design, every mapper is a separate process, usually running on a different machine, and each one opens its own connection to the Twitter streaming API and receives the same ~1% sample of tweets.

Note: The 1% is because the Twitter streaming API doesn't really give you all tweets that match your criteria, just about 1% of them.

Since every mapper is receiving the same set of tweets, mapper design becomes complicated:
1) Mappers have to prevent duplicate writes
If the number of mappers is m, tweet 1 should be written only by mapper 1, tweet 2 only by mapper 2, ... tweet m only by mapper m, tweet m+1 only by mapper 1, and so on. You'll have to write extra logic that uses the mapper or task ID to do this kind of coordinated prevention (see the sketch after this list).

2) Duplication prevention also wastes processing power
Since every mapper is responsible for writing just 1 tweet out of every m, it wastefully receives and processes the other m-1 tweets just to discard them.

3) Risk of data loss
If a mapper goes down, the tweets that were its responsibility will never be written, because the Twitter streaming API is a realtime API.
Once you miss a tweet, it will never be redelivered via the streaming API; it's left to you to retrieve it by tweet ID via the REST API.

4) Risk of Twitter throttling or banning your API access
Since all the mappers will likely use the same API token, possibly via the same gateway IP address, there's a risk that Twitter sees this as exceeding rate limits and throttles or bans your token or IP address. I'm not sure whether Twitter actually does this for the streaming API, but it does for the REST APIs. The risk is always there.

5) Overall network load of your cluster is higher
Since all the mappers open TCP channels to Twitter's API endpoint and receive tweets pushed over those channels, the overall network load of your cluster is unnecessarily higher than with a single reader or a limited number of readers. And since every mapper discards most of the tweets it receives, most of that network activity is in fact wasted.
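
To make point 1 concrete, here's a minimal sketch of the coordination logic every mapper would need. All the names in it are assumptions for illustration: tweets arriving tagged with a sequence number, and a made-up job property "num.mappers" that the driver would have to set.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that receives the live stream and keeps only
// "its" share of tweets - shown only to illustrate the coordination
// burden described in point 1, not as a recommended design.
public class StreamingTweetMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private int mapperId;
    private int numMappers;

    @Override
    protected void setup(Context context) {
        // Derive this mapper's index from its Hadoop task ID.
        mapperId = context.getTaskAttemptID().getTaskID().getId();
        // The total mapper count has to be supplied by the job driver;
        // "num.mappers" is an assumed, user-defined property.
        numMappers = context.getConfiguration().getInt("num.mappers", 1);
    }

    @Override
    protected void map(LongWritable tweetSeqNo, Text tweetJson, Context context)
            throws IOException, InterruptedException {
        // Every mapper sees every tweet, but only one of them may write it.
        // The other (numMappers - 1) mappers run this check just to discard
        // the tweet - pure wasted network traffic and CPU.
        if (tweetSeqNo.get() % numMappers == mapperId) {
            context.write(NullWritable.get(), tweetJson);
        }
    }
}

And even with all that in place, points 3 to 5 still stand: if this mapper dies, its share of tweets is simply lost.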

The typical design for Twitter analysis is to have one or two readers (two for redundancy) that receive tweets and put them in a durable message queue like Kafka or RabbitMQ. One or more queue consumers then pop items off that queue and write them to HDFS using an efficient binary format that is splittable and compressible, like Avro or Parquet. The mappers then just read those files from HDFS.
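
If it helps, here's a minimal sketch of the queue-consumer side of that design. Everything named in it is an assumption for illustration: a Kafka topic called "tweets", a broker at localhost:9092, a namenode at hdfs://namenode:9000, and a one-field Avro schema that just holds the raw tweet JSON.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TweetQueueConsumer {

    // Placeholder schema: a single string field holding the raw tweet JSON.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Tweet\","
        + "\"fields\":[{\"name\":\"json\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", "tweet-writers");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed namenode
        FileSystem fs = FileSystem.get(conf);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                 new GenericDatumWriter<GenericRecord>(SCHEMA))) {

            consumer.subscribe(Collections.singletonList("tweets"));
            // Write one Avro container file in HDFS; a real consumer would
            // roll files by size or time instead of writing one forever.
            writer.create(SCHEMA, fs.create(new Path("/tweets/batch-0001.avro")));

            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    GenericRecord tweet = new GenericData.Record(SCHEMA);
                    tweet.put("json", record.value());
                    writer.append(tweet);
                }
                writer.flush(); // force the Avro block out to HDFS
            }
        }
    }
}

A real consumer would also commit its Kafka offsets only after flushing, so that a crash neither loses nor duplicates tweets.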
 
Arwa Saad
Greenhorn
Posts: 12
Thank you so much Karthik Shiraly! Now it makes sense!
I appreciate it.
 