Unlike Spark RDDs, are Spark DataFrames also used with small amounts of data?

 
Ranch Foreman
Spark RDDs are used for processing extremely large datasets on a cluster computing system. Are DataFrames also used only for large datasets, or are they used for smaller amounts of data too, e.g. some configuration to be read from a table in a SQL Server database?
Thanks.
 
Ranch Foreman
The difference between RDDs and DataFrames has nothing to do with the volume of data; it depends on what kind of data you have and what you want to do with it. You can use RDDs or Spark DataFrames to process a single record from a single file, or massive data sources containing gigabytes of data. You can run Spark on your laptop, in a cloud environment (Azure, AWS, Google Cloud Platform, etc.), or on a huge on-premises Hadoop cluster. It depends on what you want to do.
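
To make the point about volume concrete, here is a minimal PySpark sketch that pushes the same two rows through both APIs. It assumes PySpark is installed and uses a local master; the sample rows, column names and function names are made up for illustration and are not from any real system.

```python
# Hypothetical sample data: two rows are enough for either API.
SAMPLE_ROWS = [("timeout_seconds", "30"), ("retries", "3")]


def as_dict_via_rdd(spark, rows):
    """Process the rows with the low-level RDD API."""
    rdd = spark.sparkContext.parallelize(rows)
    # Convert each (name, value) pair so the value becomes an int.
    return dict(rdd.map(lambda kv: (kv[0], int(kv[1]))).collect())


def as_dict_via_dataframe(spark, rows):
    """Process the same rows with the DataFrame API."""
    from pyspark.sql.functions import col

    df = spark.createDataFrame(rows, ["name", "value"])
    typed = df.select("name", col("value").cast("int").alias("value"))
    return {r["name"]: r["value"] for r in typed.collect()}


def run_locally():
    """Start a local Spark session (needs PySpark) and run both variants."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("tiny-data").getOrCreate()
    )
    try:
        return (
            as_dict_via_rdd(spark, SAMPLE_ROWS),
            as_dict_via_dataframe(spark, SAMPLE_ROWS),
        )
    finally:
        spark.stop()
```

Calling `run_locally()` on a laptop should give the same dictionary from both functions. Neither API cares that there are only two rows; the DataFrame version simply gets named, typed columns.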

If you are working with structured data that can be represented as a table, then you would probably use DataFrames and the Spark SQL APIs.

You can read data into DataFrames from pretty much any source that Spark can read, and you can write data via DataFrames to any sink that Spark can write to.

Spark data sources

This includes JDBC sources, i.e. SQL databases.
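
For the configuration-table case in the question, a rough sketch with Spark's built-in `jdbc` data source could look like this. The host, database, table, credentials and the `name`/`value` column names are all hypothetical, and it assumes the Microsoft SQL Server JDBC driver jar is on Spark's classpath.

```python
def sqlserver_jdbc_options(host, database, table, user, password, port=1433):
    """Build the option map for Spark's built-in "jdbc" data source."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class shipped in Microsoft's SQL Server JDBC driver jar.
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


def load_config(spark, options):
    """Read a small table into a DataFrame, then into a plain dict.

    Assumes (hypothetically) that the table has columns `name` and `value`.
    """
    df = spark.read.format("jdbc").options(**options).load()
    # collect() is fine here only because a config table is tiny.
    return {row["name"]: row["value"] for row in df.collect()}
```

Given an existing `SparkSession` called `spark`, you would call something like `load_config(spark, sqlserver_jdbc_options("dbhost", "appdb", "dbo.app_config", "reader", "secret"))`. The same `spark.read` call works whether the table holds ten rows or ten million; only the `collect()` at the end assumes the data is small.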
 
Monica Shiralkar
Ranch Foreman
Thanks. I had this doubt because DataFrames are used for structured data, and structured data is more closely associated with relational databases (which handle only structured data) than with NoSQL databases (which handle both structured and unstructured data). From what I understand, of RDDs and DataFrames, the latter are also used with relational databases, which are associated with limited amounts of data, whereas using RDDs on a limited amount of data from a relational database must be pretty rare.
 