
Roadmap for Bigdata study

 
Marwan Adel
Greenhorn
Posts: 14
Dear All,
I am new to the big data area and would like some help from the folks here: can you recommend good big data resources for beginners?
 
chris webster
Bartender
Posts: 2407
Big Data is a Big Topic, so you probably need to decide on a more specific area to look at initially.

  • IBM Big Data University has a wide range of resources, although I haven't tried these myself.
  • MongoDB has some excellent free online courses on the MongoDB NoSQL database.
  • Datastax Academy has free online courses on the Cassandra NoSQL database.
  • Hortonworks Sandbox is an excellent way to get started with Hadoop, including several useful tutorials.
  • Data-wrangling with MongoDB is a free online course from Udacity on applying MongoDB for data science. You don't have to pay for the course - choose the "Access course materials" option.
  • Intro to Hadoop and MapReduce is another Udacity course on Hadoop (Cloudera).
  • Intro to Data Science is a Udacity course on data science using Python.
  • Data Science Certificate is a set of courses from Johns Hopkins University (via Coursera) looking at data science using the R language. You have to pay for the certificate track, but you can study the individual courses for free. R is widely used in data science and statistics, but these courses are not specifically about Big Data technologies.

I'm working on a small team doing R&D around Big Data technologies. We've found the following tools interesting so far:

  • MongoDB - a NoSQL database that stores data as JSON documents. Great for scalability, flexible data models and arbitrary queries. Not so good for number-crunching or easy administration.
  • Cassandra - a NoSQL database that stores data in column-family format. We're just starting to look at this. Great for scalability, robustness and speed. Not so good for flexible data models or arbitrary queries (you can only query by key columns).
  • Apache Spark - an excellent distributed processing engine that can run on a Hadoop or Cassandra cluster, in stand-alone mode, or on a local machine. It has APIs for Scala, Python and Java, with R coming soon. This is definitely going to be a core Big Data technology.
  • Cloudera or Hortonworks - pre-packaged bundles of Hadoop-based technologies. Free "sandbox" downloads available.
  • Python (especially with the IPython Notebook) - great for interactive work, ad hoc data analysis, prototyping etc. Not so good for scaling up/out, but powerful when combined with Spark.
  • Scala - primarily for developing scalable applications e.g. using Spark, Akka, Kafka, etc.
  • R - I don't use this myself, but some of my statistical colleagues like it. It's hard to scale up/out easily, though.
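
To make the MongoDB versus Cassandra trade-off above a bit more concrete, here is a toy sketch in plain Python. It does not use either database's real API: the `find` and `get_by_key` helpers and the sample records are made up for illustration. It only shows the general idea that a document store lets you filter on any field, while a key-based store expects you to look rows up by a key chosen in advance.

```python
# Toy illustration only: plain Python stand-ins, NOT the MongoDB or Cassandra APIs.
# All names and sample data here are hypothetical.

# Document model (MongoDB-style): each record is a free-form JSON-like dict.
documents = [
    {"user": "ann", "city": "Oslo", "tags": ["spark", "scala"]},
    {"user": "bob", "city": "Cairo"},  # fields may differ per document
    {"user": "cat", "city": "Oslo", "tags": ["python"]},
]


def find(docs, **criteria):
    """Return the documents whose fields match every criterion (any field can be queried)."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Key-based model (Cassandra-style): rows are addressed by a key chosen up front.
rows_by_user = {d["user"]: d for d in documents}


def get_by_key(rows, key):
    """Look up a single row by its key; returns None if the key is absent."""
    return rows.get(key)


oslo_users = [d["user"] for d in find(documents, city="Oslo")]
bob = get_by_key(rows_by_user, "bob")

print(oslo_users)
print(bob["city"])
```

In the key-based version, answering "who lives in Oslo?" would need either a full scan or a second structure keyed by city, which is roughly why data modelling for that kind of store tends to start from the queries you expect to run.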

Hope this gives you some ideas.
     