
How does lazy evaluation happen in dataframes in Spark which do not have actions unlike RDDs

 
Ranch Foreman
Posts: 2348
12
RDDs have transformations and actions. Spark uses lazy evaluation: execution happens in memory and is triggered only when an action is called. How does this work for DataFrames, which do not have actions the way RDDs do?
Thanks.
 
Ranch Hand
Posts: 32
3
RDDs are now provided as DataSets, which have the same action/transformation distinction:  https://spark.apache.org/docs/latest/rdd-programming-guide.html

DataFrames are now based on DataSets instead of old-style RDDs  https://spark.apache.org/docs/latest/sql-programming-guide.html

RDDs/DataSets are a lower-level construct.  DataFrames are and always have been based on these. So nothing has really changed here.

SparkSQL allows you to use a more declarative approach to define your operations on DataFrames (structured data) i.e. you tell Spark what you want, not how to do it.
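
To make the lazy part concrete, here is a toy sketch in plain Python. It is not Spark's API: `LazyFrame`, `is_adult` and the sample rows are hypothetical names invented for illustration. The idea it models is the one Spark uses for DataFrames: methods such as `select`, `filter` and `groupBy` only add a step to a plan, and nothing runs until an action such as `show`, `count`, `collect` or a write is called.

```python
# Toy model of lazy evaluation. This is NOT Spark's API: LazyFrame and its
# methods are hypothetical names used only for illustration.

class LazyFrame:
    def __init__(self, rows, plan=()):
        self._rows = rows   # source data: a list of dicts
        self._plan = plan   # recorded steps; nothing has run yet

    # "Transformations": return a new LazyFrame with one more recorded step.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    # "Actions": run the recorded plan and return a concrete result.
    def collect(self):
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records every time the predicate actually runs


def is_adult(row):
    calls.append(row["name"])
    return row["age"] >= 18


people = LazyFrame([
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 12},
    {"name": "Cy", "age": 51},
])

adults = people.filter(is_adult).select("name")  # only builds a plan
calls_before_action = len(calls)                 # predicate has not run yet
result = adults.collect()                        # the action triggers execution
```

The predicate is not called while the query is being chained together; it runs only inside `collect()`. Real Spark goes further than this sketch, since it can optimise the whole recorded plan before running it, but the transformation/action split is the same.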

 
Monica Shiralkar
Ranch Foreman
Posts: 2348
12
Thanks

Christopher Webster wrote:RDDs are now provided as DataSets, which have the same action/transformation distinction:  https://spark.apache.org/docs/latest/rdd-programming-guide.html



Yes, the operations on Datasets look similar to those on RDDs. I think some actions like reduceByKey and groupByKey, as in RDDs, are not available in DataFrames.

I wonder why, when RDDs are used so much less than Datasets and DataFrames, the Spark documentation has a good RDD programming guide but nothing comparable for DataFrames and Datasets. The RDD guide lists the transformations and actions clearly. I have a hard time finding out which actions Datasets support, whereas for RDDs I can easily look them up in the actions section of the RDD programming guide.

Christopher Webster wrote:DataFrames are now based on DataSets instead of old-style RDDs



What exactly does that mean? From what I understood, DataFrames look very different from Datasets, and a program using DataFrames looks like the below:



I am trying to understand where exactly lazy evaluation happens in the above code. (I understand it for RDD/Dataset code.)
This is very different from a Dataset program, which looks somewhat like RDD code with its transformations and actions.


Christopher Webster wrote:
RDDs/DataSets are a lower-level construct.  DataFrames are and always have been based on these. So nothing has really changed here.



Isn't only the RDD a low-level construct (with low-level APIs like reduceByKey and groupByKey), while both Datasets and DataFrames are high-level constructs?
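
One way to picture the low-level versus high-level difference is the plain-Python sketch below. It is only an analogy, not Spark code: `reduce_by_key`, `group_by_agg` and the `sales` data are hypothetical helpers made up for this example. The first mimics the RDD style, where you pass the combining function yourself; the second mimics the DataFrame style of `groupBy(...)` followed by an aggregate, where you only name the column and the aggregate you want.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data: (key, value) pairs.
sales = [("apples", 3), ("pears", 5), ("apples", 4)]


# RDD style: the caller supplies the combining function (the "how").
def reduce_by_key(pairs, func):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


totals_low_level = reduce_by_key(sales, lambda a, b: a + b)


# DataFrame style: the caller names a column and an aggregate (the "what");
# the engine decides how to compute it.
def group_by_agg(rows, key, column, agg):
    aggregates = {"sum": sum, "max": max, "min": min}
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    return {k: aggregates[agg](v) for k, v in grouped.items()}


rows = [{"fruit": fruit, "qty": qty} for fruit, qty in sales]
totals_declarative = group_by_agg(rows, key="fruit", column="qty", agg="sum")
```

Both calls give the same totals per fruit. The difference is who decides how the aggregation is carried out: with the function-passing style the engine has to run your lambda as given, while with the named aggregate it is free to choose its own strategy.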
 