
Need advice on using Hadoop

 
Vikrama Sanjeeva
Ranch Hand
Posts: 760
Hi,

I'm here to get expert advice on whether or not to use Hadoop for my situation.

Brief intro: I'm a master's by research student working in the Big Data area, which I wish to explore further by continuing into a PhD.

We have a side project in its inception phase (not related to my research) to develop a mobile app using the Ionic framework. One of the non-functional requirements of the project is to collect various analytics: for example, how many times the app is downloaded and visited, which features are used most, which user uses what, etc. As part of the proposed solution, we plan to use Google Analytics (GA) to capture the required analytics and MongoDB to store users' data.

I believe these statistics can easily be captured by GA. Recently, however, I have been thinking of using Hadoop and its related technologies (Hive, Impala, Sqoop, etc.) for the analytics work. Why? Because this way I will get a chance to work on the Hadoop ecosystem, which would be a good complement to my master's research in Big Data.

What I know is that Hadoop is mainly used where we have really big data (in TBs or more) in a variety of formats (unstructured, semi-structured) and where value needs to be extracted from the data by performing analytics.

My question is: we will not have much data in the mobile app, but we do have data analytics work in the app. So does it make sense to export data from MongoDB into HDFS and use Hive or Impala for the analytics?

Please give feedback. Your expert advice is highly appreciated.

Many thanks.

Viki.
 
Karthik Shiraly
Bartender
Posts: 1210
I feel you may be underestimating the costs of developing your own analytics pipeline.

GA provides an end-to-end pipeline for free: 1) data collection, 2) data transfer, 3) data storage on Google's infrastructure, 4) data processing on Google's infrastructure, and 5) data visualization and presentation.
It's even possible to integrate with their API to record and visualize custom events you may be interested in.

But with a Hadoop-based DIY pipeline, you or your team have to do all of this. Your app's founder or investors would incur the non-trivial costs of skill building; the cost of the effort involved in developing and maintaining the data transfer, processing and visualization code; and the financial cost of buying the hardware or cloud infrastructure required for running all that.

You can always familiarize yourself with the ecosystem on the side as a non-critical prototype, but I don't think replacing GA with it is a good idea when you're doing it for the very first time.

I feel it's better to start with GA: gather data; get an estimate of how much data you are collecting over time (an indicator of how popular or not your app is); extract some information out of that data; set up a business culture where that information leads to decisions (sometimes unpleasant ones, such as dropping a feature that everybody loves except your users); and use something simple like RStudio or Python/Jupyter to write your own data analysis algorithms and visualize the results. Then, after some months, analyse whether the performance of your pipeline is so bad that it actually requires Hadoop.
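
To make the "estimate how much data you are collecting" step concrete, here is a back-of-the-envelope sketch in Python. All the numbers (daily users, events per user, bytes per event) are hypothetical placeholders, not figures from this thread; plug in whatever GA reports for your app.

```python
def projected_volume_gb(events_per_day: int, bytes_per_event: int, days: int) -> float:
    """Rough raw data volume in gigabytes (10**9 bytes) after `days` days."""
    return events_per_day * bytes_per_event * days / 1e9


# Hypothetical assumptions: 10,000 daily users, 50 events each, ~500 bytes per event.
daily_events = 10_000 * 50
one_year_gb = projected_volume_gb(daily_events, bytes_per_event=500, days=365)
print(f"Projected raw volume after one year: {one_year_gb:.2f} GB")
```

Under these made-up assumptions the app would collect roughly 91 GB in a year, which is well below the terabyte scale discussed above and would fit on a single machine.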


Also, Hadoop is not some magic wand that can extract useful information by itself. Hadoop's association with analytics is mostly marketing speak by the companies who specialize in it. It's about as accurate as saying something like "buy MS Excel and become a millionaire investor" or "buy MS Word and become the next JK Rowling".
In reality, it's nothing more than a computing system to distribute processing load to multiple machines. You still have to understand statistical theory and develop your own mining algorithms to get value out of your data. Often, the time and cost of setting up and maintaining a cluster is more than the time saved or profit brought in by it. It makes sense only at really large data scales.

Lastly, a small technical clarification: Sqoop won't help you with exporting data from MongoDB to Hadoop. Sqoop is only for JDBC-compliant databases.
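
One possible route around this, sketched here as an assumption rather than a recommendation, is to dump the MongoDB documents as newline-delimited JSON and copy that file into HDFS, where Hive can typically read it through a JSON SerDe. MongoDB's own `mongoexport` tool can produce such a file; the Python sketch below shows the same idea for documents already loaded as dictionaries. The function names and the sample document are hypothetical.

```python
import json
from datetime import datetime
from typing import Any, Iterable, TextIO


def to_json_line(doc: dict) -> str:
    """Serialize one document as a single JSON line."""

    def encode(value: Any) -> Any:
        # Called only for values json cannot serialize natively.
        if isinstance(value, datetime):
            return value.isoformat()
        return str(value)  # fallback, e.g. for MongoDB ObjectId values

    return json.dumps(doc, default=encode, sort_keys=True)


def write_json_lines(docs: Iterable[dict], out: TextIO) -> int:
    """Write one JSON document per line; return the number of documents written."""
    count = 0
    for doc in docs:
        out.write(to_json_line(doc) + "\n")
        count += 1
    return count


# Hypothetical document, standing in for what a MongoDB query might return.
sample_doc = {"_id": 1, "user": "viki", "last_login": datetime(2016, 1, 2, 3, 4, 5)}
print(to_json_line(sample_doc))
```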
 
Vikrama Sanjeeva
Ranch Hand
Posts: 760
Wonderful !!

I completely agree with the rationale behind your advice.

Just a few things:

  • You said GA provides an end-to-end pipeline for free. I tried to look into free storage, but the only free storage I found is the 15 GB in Google Drive. Maybe I'm missing something here? Could you please provide some starting links to this free end-to-end service, which includes all 5 services you mentioned (data collection, transfer, storage, processing, visualization)?


  • Secondly, just for info, we do have a Hadoop cluster set up as part of a uni project, which the management encouraged me to use instead of going for paid online services. (Initially I proposed Firebase, but the management is not willing to keep paying for the Firebase service in future; that's why they asked me to use the existing Hadoop cluster.)


  • Lastly, I really appreciate your ideas of learning the ecosystem on the side with a non-critical prototype, gathering data with GA, and setting up a business culture for taking the necessary decisions based on the collected information. This is indeed very practical advice, as is using Hadoop only when really required.

    Also thank you very much for raising the JDBC compliance issue of MongoDB !!

    Many thanks.

    Viki.



     
    Karthik Shiraly
    Bartender
    Posts: 1210
    Vikrama Sanjeeva wrote: You said GA provides an end-to-end pipeline for free. I tried to look into free storage, but the only free storage I found is the 15 GB in Google Drive. Maybe I'm missing something here? Could you please provide some starting links to this free end-to-end service, which includes all 5 services you mentioned (data collection, transfer, storage, processing, visualization)?


    I meant that Google Analytics itself already does all that. You don't have to add any other free storage. https://developers.google.com/analytics/
    GA even stores historical data for up to some years (3 or 5, I'm not sure; it'll be in their ToS). That includes any custom events you add to GA.
    For GA to show all those visualizations, it has to store the raw data somewhere, and it does that in Google's servers for free.

    Vikrama Sanjeeva wrote: Secondly, just for info, we do have a Hadoop cluster set up as part of a uni project, which the management encouraged me to use instead of going for paid online services. (Initially I proposed Firebase, but the management is not willing to keep paying for the Firebase service in future; that's why they asked me to use the existing Hadoop cluster.)

    That's good, that should reduce the infrastructure effort and cost drastically.

    Edit: Oh, and I forgot to add that you can export your data out of GA too. That should provide you with the raw data for prototyping your own data analysis algorithms. Those prototypes can show you the path forward, and whether to eventually go with a big hammer like Hadoop or Spark.
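
As a minimal sketch of what such a prototype could look like, the Python below totals event counts per feature from an exported CSV. The column names (`date`, `event_name`, `event_count`) and the sample rows are assumptions made up for illustration; a real GA export will have its own layout.

```python
import csv
import io
from collections import Counter


def events_per_feature(csv_file) -> Counter:
    """Sum the event counts for each event name in an exported CSV file."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["event_name"]] += int(row["event_count"])
    return totals


# Hypothetical export; in practice this would be open("export.csv", newline="").
sample = io.StringIO(
    "date,event_name,event_count\n"
    "2016-01-01,search,120\n"
    "2016-01-01,share,15\n"
    "2016-01-02,search,95\n"
)
totals = events_per_feature(sample)
for name, count in totals.most_common():
    print(name, count)
```

If a script like this runs comfortably on a laptop for months of data, that is a fair sign that a cluster is not needed yet.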
     
    Vikrama Sanjeeva
    Ranch Hand
    Posts: 760

    I understand that GA performs analytics and stores the analytics data in Google Analytics' own storage. I'm not sure how long it retains data, but as I remember, it can store it for a long time, maybe 2+ years. I integrated GA with a corporate portal back in 2010. It worked wonders, without a doubt.

    What I'm worried about is user data and app data, which I will store in MongoDB. If some analytics has to be performed on this data, I don't think GA will come in handy. After your advice (not to use Hadoop until it is really required), I think I have to use RStudio to write our own custom data analysis algorithms. I'm not familiar with R, but it is one of the languages on my research road map to master. How do you see things in this case, that is, performing analytics on user data and app data to extract value from the collected data, which in turn will help the business make future decisions? If you can suggest something simple here using R, it would be great for me and my app.

    Many thanks.

    Viki
     
    Karthik Shiraly
    Bartender
    Posts: 1210
    Vikrama Sanjeeva wrote: What I'm worried about is user data and app data, which I will store in MongoDB.

    I'm not sure exactly what kind of data you have in mind, but before rolling your own collection and storage, check if it's something that can be collected in the app and can be implemented using GA's custom events API.

    I don't see what suggestion I can make, since analytics is deeply tied to the domain of your app, and I have no idea what that is.
    If you're not sure what value to get out of it, then those are the first questions for you to answer: What is my business? Who are my users? What are my goals? Is it user growth? Is it product revenue growth? Is it ad revenue growth? Only after answering questions like that can you decide what data to measure to help reach those goals.
    For example, in shopping apps, a commonly used metric is how many visitors reached the checkout page, how long they stayed on that page, and how many actually converted into paying customers. A large difference may point to some problem or confusion in the UI of the checkout page, which can then be focused on. The goal is maximizing revenue.
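
The checkout metric above can be sketched as a simple funnel calculation. This Python example uses made-up stage names and counts; the real numbers would come from whatever events the app records.

```python
def funnel_report(stage_counts: list) -> list:
    """For each stage after the first, return (label, fraction kept from the previous stage)."""
    report = []
    for (prev_name, prev_n), (name, n) in zip(stage_counts, stage_counts[1:]):
        rate = n / prev_n if prev_n else 0.0
        report.append((f"{prev_name} -> {name}", rate))
    return report


# Hypothetical counts of distinct visitors reaching each stage.
stages = [("product_view", 1000), ("checkout_page", 200), ("purchase", 50)]
report = funnel_report(stages)
for label, rate in report:
    print(f"{label}: {rate:.0%}")
```

With these invented numbers, only a quarter of the visitors who reach the checkout page go on to pay, which is the kind of gap that would justify a closer look at that page.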
    Another example is product recommendations - how many people actually click on the recommended products, and is there any correlation between the current product and the product clicked on. This is a custom mouse click event that can be captured via GA.
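
A minimal sketch of the recommendation metric, assuming each recorded event says which product was being viewed, which product was recommended, and whether the recommendation was clicked. The event layout and product names are hypothetical.

```python
from collections import defaultdict


def click_through_rates(events) -> dict:
    """Map each (current_product, recommended_product) pair to clicks / impressions."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for current, recommended, was_clicked in events:
        pair = (current, recommended)
        shown[pair] += 1
        if was_clicked:
            clicked[pair] += 1
    return {pair: clicked[pair] / shown[pair] for pair in shown}


# Hypothetical events: (product being viewed, product recommended, clicked?)
events = [
    ("phone", "case", True),
    ("phone", "case", False),
    ("phone", "charger", False),
    ("laptop", "mouse", True),
]
rates = click_through_rates(events)
for pair, rate in sorted(rates.items()):
    print(pair, rate)
```

Pairs with a consistently high rate hint at a relationship between the current product and the recommended one; pairs that are shown often but never clicked are candidates for removal.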
     