Win a copy of Head First Android this week in the Android forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Tim Cooke
  • Campbell Ritchie
  • Paul Clapham
  • Ron McLeod
  • Liutauras Vilda
Sheriffs:
  • Jeanne Boyarsky
  • Rob Spoor
  • Bear Bibeault
Saloon Keepers:
  • Jesse Silverman
  • Tim Moores
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
Bartenders:
  • Piet Souris
  • Al Hobbs
  • salvin francis

Storm Applied: Better than Hadoop?

 
Marshal
Posts: 5138
321
IntelliJ IDE Python Java Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi folks,

I'm not entirely clued in with either Storm or Hadoop, but I was interested by this statement on the Amazon summary section for the book:

Amazon summary wrote:
Like Hadoop, Storm processes large amounts of data but it does it reliably and in real time, guaranteeing that every message will be processed.


Does that imply that Hadoop is unreliable, not real time, and doesn't guarantee every message will be processed?
 
Author
Posts: 14
5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hadoop is a batch oriented so discussing it in terms of messages isn't really applicable. Hadoop works on a pool of data a batch at a time.

Storm is a stream processing system that operates on a stream of data, one message at a time.

With Hadoop, you want be able to extract any answers from your data until the batch is processed. Imagine for example the case of twitter's trending topics:

With Hadoop, you could take all of the tweets from the last 30 minutes and process them looking for trending topics. That 30 minutes of tweets is your batch.

With Storm, you can act on those tweets as a stream of data and start immediately detecting trending topics as you are working with the stream in real time rather than
as a batch.

So Hadoop is not real time. It doesn't operate on messages so guaranteeing messages are processed doesn't apply.
We included reliably in the summary because, Hadoop v1 had numerous reliability issues. Hadoop v2 addresses most of those, however, at this time, most people
equate Hadoop reliability with v1.
 
Tim Cooke
Marshal
Posts: 5138
321
IntelliJ IDE Python Java Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for the reply. That's certainly interesting.

I work in financial trading, so we deal with nothing but endless streams of data and we're always looking for new and interesting ways to analyse it. You've piqued my interest in Storm, I shall have to research some more.
 
Ranch Hand
Posts: 544
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

With Storm, you can act on those tweets as a stream of data and start immediately detecting trending topics as you are working with the stream in real time rather than as a batch



If I am getting this right, we would need to store the trends ( which would be always intermediate) somewhere ? Do we store these in HDFS or some other type of storage ?
I am viewing this response with the reference to my earlier question asked but this answer provided me more insight hence putting down the question here.

Regards,
Amit
 
Tim Cooke
Marshal
Posts: 5138
321
IntelliJ IDE Python Java Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Amit raises a good point. If Storm is analysing a stream of data, then is its output also a stream of data?
 
Sean Allen
Author
Posts: 14
5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
That would be implementation specific.

* You could output changes in trending topics like

"started trending" : "one direction"

"no longer trending" : "katy perry"

to a message queue and other parts of your system could act on that. (for example, put the output in a Kafka topic which could serve as the beginning of other streams).

* You could be updating an external database such as Redis or Riak that provide Set as a datatype.

* Storm provides a DRPC client that you can use to query the in memory state of a topology (this is covered in Chapter 9 in relation to Trident but you could if you really wanted to, use it without Trident).

 
Greenhorn
Posts: 10
Eclipse IDE Python Spring
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Storm is a fault tolerant framework, where all the messages are ack once they complete the flow in topology, which isn't the case in Hadoop.

Moreover the nimbus and supervisor are not tightly coupled unlike hadoop task tracker and job tracker, i.e the supervisor will still keep on processing the data if the nimbus go down.. and all the workers will complete the task allocated, and once nimbus is back it will work as if nothing happened.

 
Sean Allen
Author
Posts: 14
5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
A lot has changed with the introduction of Hadoop 2 and Yarn. See: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Re storm and tuple/message tracking. Its true you can track all messages to assure they are processed that is however, an optional feature. You can choose to not use guaranteed message processing and in turn, get better throughput.
 
Don't get me started about those stupid light bulbs.
reply
    Bookmark Topic Watch Topic
  • New Topic