Hadoop is batch oriented, so discussing it in terms of messages isn't really applicable. Hadoop works on a pool of data one batch at a time.
Storm is a stream processing system that operates on a stream of data, one message at a time.
With Hadoop, you won't be able to extract any answers from your data until the batch is processed. Imagine, for example, the case of Twitter's trending topics:
With Hadoop, you could take all of the tweets from the last 30 minutes and process them, looking for trending topics. That 30 minutes of tweets is your batch.
With Storm, you can act on those tweets as a stream of data and start detecting trending topics immediately, because you are working with the stream in real time rather than as a batch.
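To make the contrast concrete, here's a minimal sketch in plain Python (not Storm or Hadoop code; the function names and the tweet structure are made up for illustration). The batch function can only answer after all the tweets are in; the streaming class gives an up-to-date answer after every single tweet:

```python
from collections import Counter

def batch_trending(tweets, top_n=3):
    """Hadoop-style: wait for the whole batch, then count hashtags."""
    counts = Counter(tag for tweet in tweets for tag in tweet["hashtags"])
    return [tag for tag, _ in counts.most_common(top_n)]

class StreamTrending:
    """Storm-style: update counts one tweet (message) at a time."""
    def __init__(self, top_n=3):
        self.counts = Counter()
        self.top_n = top_n

    def on_tweet(self, tweet):
        self.counts.update(tweet["hashtags"])
        return [tag for tag, _ in self.counts.most_common(self.top_n)]

tweets = [
    {"hashtags": ["storm"]},
    {"hashtags": ["hadoop", "storm"]},
    {"hashtags": ["storm"]},
]

# Batch: one answer, only after the whole 30-minute window is collected.
print(batch_trending(tweets, top_n=1))   # ['storm']

# Stream: a fresh answer after each tweet arrives.
st = StreamTrending(top_n=1)
for t in tweets:
    print(st.on_tweet(t))
```

The logic being computed is the same in both cases; the difference is *when* you can read a result.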
So Hadoop is not real time. It doesn't operate on messages, so guaranteeing that messages are processed doesn't apply.
We included "reliably" in the summary because Hadoop v1 had numerous reliability issues. Hadoop v2 addresses most of those; however, at this time, most people equate Hadoop reliability with v1.
Thanks for the reply. That's certainly interesting.
I work in financial trading, so we deal with nothing but endless streams of data, and we're always looking for new and interesting ways to analyse it. You've piqued my interest in Storm; I shall have to research it some more.
"With Storm, you can act on those tweets as a stream of data and start immediately detecting trending topics as you are working with the stream in real time rather than as a batch"
If I am getting this right, we would need to store the trends (which would always be intermediate) somewhere? Do we store these in HDFS or some other type of storage?
I am asking this with reference to my earlier question, but this answer gave me more insight, hence putting the question down here.
* You could output changes in trending topics like
"started trending" : "one direction"
"no longer trending" : "katy perry"
to a message queue, and other parts of your system could act on that (for example, put the output in a Kafka topic, which could serve as the beginning of other streams).
* You could update an external database such as Redis or Riak that provides a set data type.
* Storm provides a DRPC client that you can use to query the in-memory state of a topology (this is covered in Chapter 9 in relation to Trident, but you could, if you really wanted to, use it without Trident).
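The first option above boils down to diffing the current trending set against the previous one and emitting change events. A minimal sketch in Python (the function name and event shape are invented here; in practice you'd publish each event to your Kafka topic or write it to Redis instead of returning a list):

```python
def trend_changes(previous, current):
    """Diff two trending-topic sets into change events for a queue.

    previous, current: sets of topic strings.
    Returns a list of one-key dicts mirroring the
    "started trending" / "no longer trending" messages above.
    """
    events = []
    for topic in sorted(current - previous):
        events.append({"started trending": topic})
    for topic in sorted(previous - current):
        events.append({"no longer trending": topic})
    return events

prev = {"katy perry", "storm"}
curr = {"one direction", "storm"}
print(trend_changes(prev, curr))
# [{'started trending': 'one direction'}, {'no longer trending': 'katy perry'}]
```

Because only the changes are emitted, downstream consumers don't have to re-scan the full trending set on every update.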
Storm is a fault-tolerant framework where every message is acked once it completes its flow through the topology, which isn't the case in Hadoop.
Moreover, Nimbus and the Supervisors are not tightly coupled, unlike Hadoop's TaskTracker and JobTracker: the Supervisors will keep processing data if Nimbus goes down, all the workers will complete the tasks allocated to them, and once Nimbus is back it will work as if nothing happened.
Re Storm and tuple/message tracking: it's true you can track all messages to ensure they are processed; that is, however, an optional feature. You can choose not to use guaranteed message processing and, in turn, get better throughput.
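The idea behind guaranteed message processing can be shown with a toy model (this is not Storm's actual API, where spouts implement an interface in Java; the class and method names here are invented to illustrate the mechanism): a tuple stays "pending" from the moment it's emitted until the topology acks it, and a failed tuple is put back on the queue for replay.

```python
class ReliableSpout:
    """Toy model of Storm's guaranteed message processing:
    emitted tuples stay pending until acked; failed ones are replayed."""

    def __init__(self, messages):
        self.queue = list(messages)
        self.pending = {}   # msg_id -> message, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None
        msg = self.queue.pop(0)
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = msg
        return msg_id, msg

    def ack(self, msg_id):
        # The tuple completed its flow through the topology; forget it.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # The tuple failed somewhere downstream; put it back for replay.
        self.queue.append(self.pending.pop(msg_id))

spout = ReliableSpout(["a", "b"])
mid, msg = spout.next_tuple()   # emits "a"
spout.fail(mid)                 # "a" goes back on the queue
mid, msg = spout.next_tuple()   # emits "b"
spout.ack(mid)                  # "b" is done
print(spout.queue)              # ['a'] -- still waiting to be replayed
```

Turning this tracking off simply means never keeping the `pending` map, which is exactly where the throughput gain comes from: no per-tuple bookkeeping, but no replay on failure either.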