
Alex Holmes

Author
Greenhorn
since Oct 19, 2012
Alex Holmes is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. For the last four years he has been gaining expertise in Hadoop, solving Big Data problems across a number of projects. He is the author of "Hadoop in Practice", published by Manning Publications. He has presented at JavaOne and Jazoon and is currently a technical lead at VeriSign.

Recent posts by Alex Holmes

Mohamed,

In distributed computing it is much better to read data from local disk than over the network. This is known as data locality, and it is one of the key aspects of Hadoop. When MapReduce pushes work out to the slave nodes, it does so in a way that favors reads from local disk over reads from the network.
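If you're curious what this looks like in code, here's a minimal sketch (using the standard HDFS FileSystem API; the file path is hypothetical) that asks HDFS which hosts hold each block of a file - the same information the MapReduce scheduler uses when placing map tasks:

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws IOException {
        // Connect to the cluster configured in core-site.xml
        FileSystem fs = FileSystem.get(new Configuration());

        // "/data/input.txt" is a hypothetical HDFS file
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));

        // Each BlockLocation lists the datanodes holding a replica of that
        // block; the scheduler favors running the map task on one of them
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println(Arrays.toString(block.getHosts()));
        }
      }
    }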

Thanks,
Alex
5 years ago
Predictive analytics is an umbrella term for things like data mining and machine learning. This chapter mostly covers Mahout; a list of the algorithms that it supports can be seen here: https://cwiki.apache.org/MAHOUT/algorithms.html
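To give a flavor of what Mahout code looks like, here's a minimal sketch of its user-based collaborative filtering API (the ratings file is hypothetical; each line is a userID,itemID,rating triple):

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv is hypothetical: one userID,itemID,rating per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
          System.out.println(item);
        }
      }
    }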
5 years ago
Cassandra is a real-time system and Hadoop is batch-based, so they end up complementing each other.
5 years ago
Hi,

There is definitely a learning curve with Hadoop, probably steeper than that of most NoSQL systems, which are more closely aligned with the real-time systems we are all accustomed to working with (such as relational databases). The additional learning time is really related to installation, management, and understanding MapReduce as both a framework and a programming model.

Having said that, I would argue that it's worthwhile understanding the Hadoop fundamentals; even if you don't end up using the technology, it will help you understand the MapReduce concepts, which are also leveraged internally by NoSQL solutions. Hadoop's emphasis on data locality is also a valuable lesson that we should all be aware of as good practice in distributed system design, one that helps reinforce our own architectural and design decisions.

Thanks,
Alex
5 years ago
Hi Mohamed,

The "Hadoop in Action" book covers streaming, which is how you'd integrate Python with MapReduce. This is actually a pretty common usage of Hadoop, and is a good way for Python programmers to work with Hadoop. Other tools worth considering would be Pig and Hive, which would be higher-level ways to work with Hadoop.

Thanks,
Alex
5 years ago
Hi Ashwin,

I wouldn't put Hadoop in the same camp as NoSQL technologies - for the most part NoSQL technologies tend to be real-time, versus Hadoop, which is batch-based and excels at ETL and data warehousing use cases. In terms of which NoSQL solution to pick, that's a tough choice, as there doesn't seem to be a clear winner in the marketplace at the moment. Having said that, Cassandra, MongoDB and HBase have distinctive traits which will likely push you to one of them based on how you intend to access your data. I'm not an expert on these systems so I won't attempt to push one over the other, but after you do some research I think it'll become apparent which one will work best for you.

Relational systems are quite different from Hadoop, and not only from the real-time/batch perspective. Hadoop isn't a transactional system, but it was architected from the ground up to scale, so you can typically work with much larger data sets than you can with monolithic database systems. Hadoop is also great at joining structured and unstructured data together, and at data aggregations and summarizations. You can also use tools like Mahout for predictive analytics.

Hope this helps some of your questions.

Thanks,
Alex
5 years ago
Hi Raj,

Microsoft is currently working on supporting Hadoop on its platform; you can read some more details on that here: http://www.informationweek.com/software/information-management/microsoft-releases-hadoop-on-windows/240009632

Thanks,
Alex
5 years ago
Chapter 1, which can be downloaded for free from http://www.manning.com/holmes/, shows some of the ways Hadoop can be used, as well as its limitations, which are probably worth looking at.
5 years ago
Hi,

The book does cover some elements of code design in chapter 12, and chapters 4 and 7 cover some MapReduce applications. But there's no section that's wholly dedicated to analysis and design.

I'm not sure when the book will start shipping from B&N, but I do know that Amazon is shipping: http://www.amazon.com/Hadoop-Practice-Alex-Holmes/dp/1617290238

Thanks,
Alex
5 years ago
If you're using JRuby then you can invoke the Hadoop APIs directly. If you're using a non-JVM language then you'll need to resort to making system calls to invoke the Hadoop scripts.
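For example, here's a minimal Java sketch of the HDFS FileSystem API (the paths are hypothetical) - a JRuby script can drive these same classes natively:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Copy a local file into HDFS and list the target directory
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/data/local.txt"));
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
          System.out.println(status.getPath());
        }

        // From a non-JVM language you'd shell out instead, the equivalent of:
        //   hadoop fs -put /tmp/local.txt /data/
      }
    }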
5 years ago
Hi David,

All very good questions! Yes, it's tricky these days to pick the right data storage system. A few years ago everything would automatically get stuck into a relational database, as that was all that was widely available. Traditional databases still have their place, and you'll still find a lot of technology companies using sharded relational databases (fronted by some memcached-like caching system) to service web requests. Ultimately it comes down to your particular application - you need to map out how you expect your data to be accessed, and whether you need things like transactions. Hadoop in its current form is a batch-based system, so you wouldn't want to use it (MapReduce) for serving any real-time data access use cases. HBase and friends, on the other hand, are suited for real-time access and scale very well too, but it's important to understand their limitations (such as how well they work at searching).

I don't go into NoSQL at length in my book, apart from covering some HBase and Hadoop integration use cases. My book should, however, give you a good sense of the use cases that work well for Hadoop, and hopefully that'll help you judge how it can be leveraged and whether it would be sufficient for your needs.

Thanks,
Alex
5 years ago
Hi Kiran, and thanks for your question.

The book currently covers integration with both HBase and relational databases in chapter 2. With regard to other NoSQL systems, this is still an emerging topic, and chapter 2 will be updated over time as new solutions appear.

Thanks for your interest,
Alex
5 years ago
I would classify Hadoop as an "extreme record processing tool". Hadoop doesn't have the transactional semantics offered by relational databases. It's really a batch-based system that allows you to work with large data volumes at scale.

HBase would be the canonical example of Hadoop integration with a NoSQL system. There's an article on Oracle's site about Hadoop integration with their NoSQL data store: http://www.oracle.com/technetwork/articles/bigdata/nosql-hadoop-1654158.html
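To give a sense of what real-time access through HBase looks like, here's a minimal sketch using the HBase client API (the table and column names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "users" is a hypothetical table with an "info" column family
        HTable table = new HTable(conf, "users");

        // Single-row write - milliseconds, not a batch job
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alex"));
        table.put(put);

        // Single-row read by key
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

        table.close();
      }
    }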
5 years ago
Hi Mohamed,

The short answer is that Hadoop is useful for working with data volumes larger than what you can store on a single machine. Hadoop is written almost entirely in Java, and MapReduce, which is Hadoop's computational tier, is a programming framework that lets you express your work in Java.
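For example, the canonical word count job expresses its map and reduce phases as two small classes (a minimal sketch against the org.apache.hadoop.mapreduce API):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // The map phase emits (word, 1) for every token in the input
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);
          }
        }
      }
    }

    // The reduce phase sums the counts for each word
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }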

If you read chapter 1 of my book (which you can download for free from http://www.manning.com/holmes/) you should hopefully get a better idea of what Hadoop is, and how it can be used.

Thanks,
Alex
5 years ago
Hi Ashley,

You need intermediate Java knowledge, and some Hadoop fundamentals (some of which are presented in Chapter 1). Books like "Hadoop in Action" and "Hadoop: The Definitive Guide" are great for fundamentals.

Thanks,
Alex
5 years ago