How much data can be called big data when there is no fixed definition for this?

 
Monica Shiralkar, Ranch Hand:
Big data involves processing huge amounts of data quickly, but there is no clear definition of how much data qualifies as big data. In that case, how much data can be called big data?
Thanks.
 
Stephan van Hulst, Saloon Keeper:
Like you said, there is no fixed definition.
 
Liutauras Vilda, Sheriff:
To me, big data isn't just about lots of records or "big" file sizes, even though in most cases that's what it actually is.

Big data, to me, is more about the surrounding domain: not what the data is, but how the data gets handled. Data transformation (which might be significant), the scalability characteristics and requirements of the system where the data gets processed, fairly tight service level agreements (SLAs), and the like. All of that, to me, is what defines whether you are dealing with big data.

To be a bit more concrete: you may not know, and can barely predict, how much data you are going to receive during the day, but you need to have a system in place that can handle it (preferably in a cost-efficient way). Whether you received only one 10 MB file during the whole day or several terabytes of continuously streamed data, the data has to be processed within an agreed timeline, say 24 hours from the time it was received. Nowadays it is common to accommodate such requirements using infrastructure and services offered by cloud providers.

As an example of the kind of infrastructure and services we are talking about: a Kubernetes cluster whose nodes serve N pods, each of which may run a data processing pipeline and scale up or down based on demand. Alternatively, a managed service such as Google Dataflow could be used to run the data processing pipeline (implemented with Apache Beam), which can also run up to a certain number of workers and handle the data volume by parallelizing some of the steps.
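For illustration only, a minimal Apache Beam pipeline with the Java SDK might look something like the sketch below. The bucket paths, the threshold, and the single-column parsing are made-up placeholders rather than anything from a real project; the same code runs locally with the DirectRunner or on Dataflow, depending on the runner passed in the options.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class ThresholdPipeline {
    public static void main(String[] args) {
        // The runner (DirectRunner locally, DataflowRunner on Google Cloud) is chosen via the options.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("ReadRecords", TextIO.read().from("gs://example-bucket/incoming/*.csv"))          // placeholder path
            .apply("KeepAboveThreshold", Filter.by((String line) -> parseValue(line) > 100.0))        // placeholder threshold
            .apply("WriteResults", TextIO.write().to("gs://example-bucket/output/filtered"));         // placeholder path

        pipeline.run().waitUntilFinish();
    }

    // Assumes each line holds a single numeric value, purely for the sake of the example.
    private static double parseValue(String line) {
        return Double.parseDouble(line.trim());
    }
}

When such a pipeline runs on Dataflow, the read and filter steps are what the service spreads across however many workers it decides to bring up, which is the parallelization mentioned above.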

To avoid spending the enormous amounts of money mentioned above on services running 24/7 while waiting for data that may or may not arrive, such systems usually involve some orchestration that powers services up when there is work to be done and lets them go back to idle when they are finished.

So all of this, to me personally, is what describes the boundaries here. Whether you are dealing with Big Data or not is not decided by the data size alone.
 
Monica Shiralkar, Ranch Hand:
Thanks. That makes it clear that it is not just about the size of the data. Also, volume is just one of the 4 V's of big data: volume, variety, velocity, and veracity.
 
Tim Holloway, Saloon Keeper:
In trade discussions, as Liutauras has said, "big" isn't really about the amount of data, but what you're doing with it. Keeping Information Warehouses. Doing complex ad hoc correlations to see what falls out. It's about data analytics instead of data processing.

In most cases, you need at least a couple of gigabytes worth of samples to work from, but I suppose even very tiny sets could be considered "big data" if that's how they're used.
 
Monica Shiralkar, Ranch Hand:

Tim Holloway wrote: It's about data analytics instead of data processing.

I am trying to understand how it is not about data processing, because most definitions of big data involve the phrase "data processing", as in "processing of large volumes of data".
 
Marshal:
"It's analyzing the data, not processing it."

Of course analyzing data involves processing; in fact anything you do with data involves processing. So to say that it's not about processing is to misunderstand the statement.

If I were to say "I'm talking about chocolate, not food" you wouldn't be confused because chocolate is actually a food, would you? I hope not.
 
Monica Shiralkar, Ranch Hand:
Thanks. If a big data use case is to read terabytes of data from a Kafka topic using Spark Streaming, filter it using Spark, and post the results to a REST API, then what data analytics is happening in this big data use case?
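To make the use case concrete, here is a rough sketch of the Kafka-reading and filtering part with the Spark Structured Streaming Java API. The broker address, topic name, column layout, and threshold are placeholders rather than the actual project's values, and the sink is just the console; a real job could use foreachBatch to POST each micro-batch to the REST API.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class KafkaFilterJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaFilterJob")
                .getOrCreate();

        // Subscribe to a Kafka topic; broker and topic names are placeholders.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load();

        // Kafka delivers the payload as bytes, so cast it to a string first,
        // then keep only the records whose value exceeds a (made-up) threshold.
        Dataset<Row> filtered = events
                .selectExpr("CAST(value AS STRING) AS value")
                .filter(col("value").cast("double").gt(100.0));

        // Write the filtered stream out. A real job could instead POST each
        // micro-batch to the REST API from a foreachBatch sink.
        filtered.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}

The filter itself is a single line; whether that line counts as analytics or as plain business logic is really my question.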
 
Stephan van Hulst, Saloon Keeper:
It's in the way you're filtering the data.
 
Monica Shiralkar, Ranch Hand:
Thanks. In the filter condition (a Spark filter transformation), it checks whether the value of a parameter is greater than a threshold value. That's all. I feel that is business logic (not analytics), and business logic would be there in any application, big data or not.
 
Stephan van Hulst, Saloon Keeper:
To what end? If the filtering is done in order to understand the data better, that's Big Data. If the filtering is done because you already understand the data and know you're only interested in a specific part of it, that's not Big Data.
 
Monica Shiralkar, Ranch Hand:
What is meant by "to what end"? Filtering is done as per the requirement. That was the big data project in my previous organization; the project involved Spark and Kafka. Based on what you said, the project that was called a big data project there was not actually a big data project.
 
Tim Holloway, Saloon Keeper:
I think I need to clarify. The modern-day term for what we do with computer systems is Information Technology (IT). However, in mainframe days, the common term was Data Processing (DP). The reason I used the term "Data Processing" is that DP was more limited: little or no networking, and almost never external networking. Mostly just running data in batch (think punched cards or magnetic tapes) and dumb terminals running dumb applications. And, in fact, a lot of shops didn't even have online applications at all.

So what I was referring to was the raw processing of the data itself as opposed to finding out the shapes in the data.
 
Monica Shiralkar, Ranch Hand:
I think the use case I mentioned may not be said to involve data analytics, but it may still qualify as a big data project.
 
Liutauras Vilda, Sheriff:
I am very interested: who is asking you to define whether it is big data or not? After all, these are just words; you do whatever you have to do regardless of what it is called.

What's behind all that?
 
Monica Shiralkar, Ranch Hand:
Another thing I want to know is whether all 4 V's of big data (volume, velocity, variety, and veracity) are required to be present for it to be big data, or whether it can be big data if just one of the V's is present.
 
Stephan van Hulst, Saloon Keeper:
If I tell you 'no', are you going to ask how many of the four V's need to be present?

There are no hard rules, Monica. Big Data refers to a dynamic area of software engineering. It's like 'the Cloud'. When somebody uses these words, it's to give you a bit of context about what the discussion covers, not a list of hard requirements for systems to follow.
 
Monica Shiralkar, Ranch Hand:
Thanks

Liutauras Vilda wrote: To me, big data isn't just about lots of records or "big" file sizes, even though in most cases that's what it actually is.



As you have made clear, big data is not just about volume.

Big data may also be about the other attributes: variety, velocity, and veracity.

Big data computing tools like Spark and MapReduce are for dealing with high volume on a cluster, since that requires more than one computer. What about the cases where it is not about volume but about other attributes like variety, velocity, and veracity? Are Spark and MapReduce still used in such cases too? If not, what are some examples of big data tools that can be used?


 