Spark RDDs are used for processing extremely large datasets on a cluster computing system. Are DataFrames likewise intended only for large datasets, or can they be used for smaller data too? For example, some configuration that needs to be read from a table in a SQL Server database?
The difference between RDDs and DataFrames has nothing to do with the volume of data, but with what kind of data it is, and what you want to do with it. You can use RDDs or Spark DataFrames to process a single record from a single file, or massive data-sources containing gigabytes of data. You can run Spark on your laptop, in a cloud environment (Azure, AWS, Google Cloud Platform etc), or on a huge on-premises Hadoop cluster, and so on. It depends on what you want to do.
If you are working with structured data that can be represented as a table, then you would probably use DataFrames and the Spark SQL APIs.
You can read data into DataFrames from pretty much any source that Spark can read, and you can write data via DataFrames to any sink that Spark can write to.
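So your SQL Server example is perfectly reasonable. Here is a minimal PySpark sketch of reading a small table over JDBC into a DataFrame; the server, database, table name, and credentials are all placeholder values, and it assumes the Microsoft SQL Server JDBC driver jar is on Spark's classpath (e.g. via `--jars`):

```python
from pyspark.sql import SparkSession

# Local Spark session; works the same whether you later run this
# on a laptop or submit it to a cluster.
spark = (
    SparkSession.builder
    .appName("read-config-table")
    .getOrCreate()
)

# All connection details below are hypothetical placeholders.
config_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .option("dbtable", "dbo.AppConfig")  # hypothetical config table
    .option("user", "spark_reader")      # hypothetical credentials
    .option("password", "...")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

config_df.show()
```

Even though the table might only hold a handful of rows, nothing about the DataFrame API penalizes you for that; Spark just issues the query and gives you back a (tiny) DataFrame.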
Thanks. The reason I had this doubt is that DataFrames are used for structured data, and structured data is more associated with relational databases (which handle structured data only) than with NoSQL databases (which deal with both structured and unstructured data). From what I understand, of the two, DataFrames are the ones commonly used with relational databases, which typically hold limited amounts of data, whereas using RDDs against a relational database with limited data must be pretty rare.