Migrating from Spark 1.6 to newer version

 
Monica Shiralkar
Ranch Foreman
Posts: 2033
12
Around 3-4 years back, I worked on a project which used Spark 1.6. The way to develop Spark applications was to read the data into an RDD (e.g. from Kafka or HDFS) and run operations on the RDD: transformations and actions. Transformations are lazy, and the result of each transformation is another RDD, e.g. map, filter, flatMap. When an action is called, it triggers the Spark pipeline. This is how the processing happens in Spark 1.6. How does it happen in newer versions of Spark? Is it somewhat like that, or entirely different? I have not worked on it since then, and now the latest version is 3.0 and a lot of new features have been added, like Structured Streaming.
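
The kind of RDD pipeline I mean looked roughly like this (just a sketch from memory, paths and names made up):

import org.apache.spark.{SparkConf, SparkContext}

object OldStyleJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-pipeline").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("hdfs:///data/events.log")  // read data into an RDD
    val errors = lines
      .map(_.trim)                                      // transformation (lazy)
      .filter(_.contains("ERROR"))                      // transformation (lazy)
    println(errors.count())                             // action: triggers the pipeline

    sc.stop()
  }
}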

Thanks.
 
Ranch Foreman
Posts: 28
3
Probably the best place to start is with DataFrames and Spark SQL, as there have been a lot of really useful changes here since Spark 1.6.

If you are working with structured data, e.g. tables or CSV files, then you rarely need to use raw RDDs. Instead you can read your data via DataFrames or Datasets, which provide a lot more functionality and much better support for structure and data types.
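
For example, reading a CSV file into a DataFrame looks roughly like this (a minimal sketch, file path and options invented):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataframe-example")
  .master("local[*]")
  .getOrCreate()

// Schema is inferred here for brevity; in practice you would usually declare it explicitly.
val salesDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/sales.csv")

salesDf.printSchema()
salesDf.show(5)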

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine



Datasets seem to replace RDDs, effectively, but you will probably want to use DataFrames for this kind of data anyway:

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.



DFs allow you to use SQL or a SQL-like programming API to interact with your data, wherever it comes from.  So you could have DFs that are using data from different sources - HDFS, Hive, Parquet files, CSV files, SQL DBs, NoSQL DBs etc - and process all of it using SQL within Spark e.g. perform SQL joins between MongoDB and Hive and write the results to Cassandra, or whatever.
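
As a rough sketch (table names and connection details are invented), joining data from two different sources with Spark SQL might look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-join").getOrCreate()

val orders = spark.read.parquet("hdfs:///warehouse/orders")    // Parquet files on HDFS
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/shop")
  .option("dbtable", "customers")
  .load()                                                       // an external SQL database

orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

// Plain SQL over both sources, executed by Spark's distributed engine.
val totals = spark.sql(
  """SELECT c.name, SUM(o.amount) AS total
    |FROM orders o JOIN customers c ON o.customer_id = c.id
    |GROUP BY c.name""".stripMargin)

totals.show()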

Spark SQL gives you a common distributed programming platform for a huge range of data sources, which is really powerful and flexible.

And there is now an optional free package for Spark called Delta Lake which uses Parquet as the basic storage format but adds a commit log and other features to support ACID transactions, in-place updates etc.  I'm just starting to look at Delta Lake now, but I think it looks really promising as a way to combine many of the useful features of conventional RDBMS platforms with the power and flexibility of distributed storage and processing.
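
From what I've seen so far, a Delta table is used like just another DataFrame format, something like this (a sketch only, assuming the delta-core package is on the classpath and with made-up paths):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-example").getOrCreate()

val events = spark.read.parquet("hdfs:///staging/events")

// Write as a Delta table: Parquet files plus a transaction log.
events.write.format("delta").mode("overwrite").save("/delta/events")

// Read it back like any other DataFrame source.
spark.read.format("delta").load("/delta/events").show(5)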

Going beyond DataFrames, Spark has also added a lot more functionality around Machine Learning and Streaming (Structured Streaming is like DataFrames for streams), so there are plenty of interesting topics for you to look at.  
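
For instance, since you mentioned Kafka: a Structured Streaming job reading from Kafka is written with the same DataFrame API (rough sketch, broker and topic names invented, and it needs the spark-sql-kafka package):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-example").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// The stream is just a DataFrame; transform it with the usual API.
val values = stream.selectExpr("CAST(value AS STRING) AS value")

val query = values.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()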

Finally, the basic execution model is still the same i.e. actions trigger data to be processed via transformations.  But the Spark SQL APIs mean you are less concerned with how the engine works and can focus instead on what you want it to do i.e. a declarative approach.  Of course, it's still useful to understand what's going on underneath the hood, so you can figure out when to perform particular operations that might trigger aggregation of data etc.
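
To illustrate (column names invented): DataFrame transformations are still lazy and an action is what kicks off the job, but the optimizer works out the physical plan for you:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("lazy-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

// Nothing runs yet: these calls just build a logical plan.
val filtered = df.filter(col("value") > 1).select("key")

// count() is an action, so only now does Spark execute the optimized plan.
println(filtered.count())

// explain() shows the plan the optimizer produced.
filtered.explain()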
 
Monica Shiralkar
Ranch Foreman
Posts: 2033
12
Thanks. This will be very helpful.

So, if one is getting streaming data which is structured, then one may not use RDDs at all in the application and just do this using DataFrames? Or does one typically use a combination of both? If it is common to use a combination of RDDs and DataFrames in an application, then what part would the RDDs do and what part would the DataFrames do in the same flow?
 
Monica Shiralkar
Ranch Foreman
Posts: 2033
12
Surprisingly, on the Spark website https://spark.apache.org/examples.html, they choose to list the Word Count example using RDDs (the original way of doing it, suitable for unstructured or structured data) but not using the Dataset API (the new API, suitable for structured data).
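
For comparison, I suppose the same word count with the newer API would look something like this (just a sketch, file path made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder().appName("wordcount-ds").getOrCreate()
import spark.implicits._

val lines = spark.read.textFile("hdfs:///data/input.txt")   // Dataset[String] with a "value" column

val counts = lines
  .select(explode(split($"value", "\\s+")).as("word"))
  .groupBy("word")
  .count()

counts.show()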