As the author hasn't replied, here's my personal take as an occasional user of Spark since around 2014.
* AFAIK Spark is still written in Scala, which means new features appear in the Scala APIs first.
* This means the Scala API is usually a bit ahead of the others, and it will never be behind them.
* Personally, I find Scala is a very natural language for this kind of processing (which is why Spark is based on it), so I am most comfortable with the Scala API. YMMV of course.
* Python is very widely used with Spark, as it is a much more popular language than Scala generally, and it is often used by data scientists.
* Python is also a popular choice for people who use interactive notebook interfaces, like Jupyter, with Spark (although you can also use Scala with notebooks these days).
* But the Python API is usually a little behind the Scala API, and some features are slower in Python than in Scala (e.g. plain Python UDFs incur serialization overhead between the JVM and the Python interpreter that native Scala functions avoid).
* So Python is a good choice for data scientists or if you are not concerned about having the very latest API features.
* There is no good reason to use Java with Spark.
* Although Java now offers lambdas etc., it is still really clunky to write good functional code in Java compared to Scala.
* And Python is a much nicer language for data science and notebooks etc.
* If you're using Spark, pick a language API that works well with Spark and does the things that Spark does well.
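To illustrate what I mean by Spark-style processing: the core APIs are all chained functional transformations (flatMap, map, filter, reduceByKey and so on), which is why Scala feels so natural here. Below is a rough sketch of the classic word-count pipeline in plain Python collections — a local stand-in for the shape of the RDD API, not actual pyspark code (the flatMap/reduceByKey steps are mimicked with comprehensions and a dict):

```python
lines = [
    "to be or not to be",
    "that is the question",
]

# flatMap: one line -> many words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: conceptually word -> (word, 1), then sum per key
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

print(counts)
```

In real Spark this whole pipeline is two or three chained method calls on an RDD or DataFrame, and that chaining style is equally available from the Scala and Python APIs.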
Christopher Webster wrote:
There is no good reason to use Java with Spark.
While Python is the preferred choice when going for Spark ML, in other cases I think it depends on the team. Suppose the team has developers who are strong in Java rather than Scala: if we go with Java, we still keep the option of moving to Scala later, once the team has that skill set. If we go with Python today, however, that is a different route altogether, and a later move to Scala becomes much less likely. The reason is that Scala and Java, both being JVM languages, have more in common with each other than Scala has with Python.