ML stands for Machine Learning
Book: Machine Learning Systems
Site: jeffsmith.tech
Jeff Smith wrote: I talk about the choice to use Scala a bit in the book, and I get asked about it so much that I'm probably going to write a blog post on the topic. Let me try giving you a fairly broad answer.
First, I think it's important to acknowledge that learning generally applicable skills is usually the goal of a reader of a technical book like mine. If you're learning, then the choice of language for that learning is only of secondary importance.
But let's get into the question of why someone would choose to use Scala for a machine learning book. Here are some of my reasons:
1. Large portions of a production machine learning system need to be able to support high concurrency. This is usually a requirement of the model server, but it can come up in data collection as well. This means that it's useful to have a multi-threaded runtime like the JVM, the BEAM, the CLR, etc.
2. Beyond single-node concurrency, it's often necessary to build distributed data processing pipelines for things like feature generation and, increasingly, for model learning as well. Since my book isn't primarily about implementing distributed systems infrastructure, I wanted a straightforward answer for how to distribute computation: either a framework like Spark or language-native capabilities as in Distributed Erlang.
3. A lot of the techniques used in distributed systems rely upon ideas common to functional programming, such as immutable data and pure functions as first-class citizens. FP languages like Scala, Clojure, Haskell, F#, and others make using those techniques easy, but so do libraries and language features in multi-paradigm languages like JavaScript and Python.
4. The book is all about machine learning systems, so I really need access to good library implementations of common bits of machine learning functionality. This really wasn't optional; I wanted every chapter to only use code from that specific chapter. Languages with good enough machine learning libraries include Python, Scala, R, and not too many others.
5. A lot of the book is about data modeling and data engineering. Specifically, I tried to introduce a fair amount of material around uncertain data engineering. To teach all of that material, I needed some way of describing data structures. The simplest way to do this is with static types as in Scala or Haskell, but it could also be done using optional type annotations as in Erlang or some dialects of JavaScript.
6. The concepts of supervision and message passing are closely intertwined with the actor model. Ideally, I needed a robust actor model implementation that I could use for several different aspects of the machine learning system. This requirement is fulfilled by Erlang, Akka, and a few other less commonly used implementations.
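Criteria 3 and 5 are easier to see in code. The following is a minimal sketch, not taken from the book: the names `Uncertain`, `trusted`, and `weightedMean` are hypothetical, and it assumes that "uncertain data" can be modeled as a value paired with a confidence between 0 and 1. It uses only the Scala standard library.

```scala
// Hypothetical names throughout; an illustration, not the book's code.
object UncertainData {
  // Criterion 5: a static type that describes the data structure.
  // Case classes are immutable by default (criterion 3).
  final case class Uncertain[A](value: A, confidence: Double) {
    require(confidence >= 0.0 && confidence <= 1.0, "confidence must be in [0, 1]")

    // Transform the value and keep the confidence unchanged.
    def map[B](f: A => B): Uncertain[B] = Uncertain(f(value), confidence)
  }

  // Pure function: returns a new list and leaves its input untouched.
  def trusted[A](readings: List[Uncertain[A]], threshold: Double): List[Uncertain[A]] =
    readings.filter(_.confidence >= threshold)

  // Pure function: confidence-weighted mean.
  // Returns None when the total weight is zero, so there is nothing to divide by.
  def weightedMean(readings: List[Uncertain[Double]]): Option[Double] = {
    val totalWeight = readings.map(_.confidence).sum
    if (totalWeight == 0.0) None
    else Some(readings.map(r => r.value * r.confidence).sum / totalWeight)
  }
}
```

Because nothing here mutates shared state, functions like these can be handed to a distributed framework or run on several threads without extra coordination, which is the link between criteria 2 and 3.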
Let's score a few languages against these criteria.
R
1. Not natively.
2. Via libraries written in other languages.
3. Not by default.
4. Some good ML libraries.
5. Not easy to express.
6. No implementation I'm aware of.
Python
1. Not natively. C/C++ systems and libraries are often used to mitigate this.
2. Via libraries, usually written in other languages.
3. Only optionally and via libraries.
4. The best ML libraries of any language.
5. Recently added and only rarely used.
6. Only via libraries, none in common use.
Erlang
1. Arguably the canonical concurrency-oriented programming language.
2. Support built directly into the language as well as several commonly used libraries.
3. Very FP-oriented language allowing for only immutable data.
4. No widely used libraries.
5. Optional type annotations via Dialyzer.
6. The canonical actor model implementation.
Scala
1. Several approaches to concurrency, thanks to the JVM.
2. Spark is the biggest project in all of big data.
3. Very FP-oriented language that allows for some limited use of non-FP techniques (e.g. vars).
4. MLlib in Spark is very complete and scales to arbitrary workloads.
5. Incredibly rich and powerful type system.
6. Akka is the second most widely used actor model implementation.
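As an illustration of the first point, one of the JVM-backed approaches to concurrency is the standard library's `Future`. This is only a sketch under stated assumptions: `ConcurrentScoring`, `score`, and `scoreAll` are hypothetical names, and the "model" is a stand-in that averages its inputs. A real model server would likely use a dedicated thread pool rather than the global one.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical names; an illustration, not the book's code.
object ConcurrentScoring {
  // Stand-in for a model: the mean of the feature values.
  def score(features: Vector[Double]): Double = features.sum / features.size

  // Score each request as its own Future on the global thread pool.
  // Future.traverse collects the results in the same order as the requests.
  def scoreAll(requests: List[Vector[Double]]): Future[List[Double]] =
    Future.traverse(requests)(features => Future(score(features)))
}
```

The caller gets a single `Future[List[Double]]` back and can compose it further or block on it with `Await.result` at the edge of the program.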
Looking at all of this, Scala really was the only language that fulfilled all of my needs. I could have written a similar book about a problem other than machine learning using Erlang. Or I could have dropped a lot of the material and written something far narrower around machine learning using Python, relying upon things like Spark or TensorFlow Serving to fill the gaps that Python would have left. Or I could have done some mixing and matching, hoping that readers would be able to follow along across toolchains.
I chose to use Scala because I wanted someone to use pretty much the same tools to explore all of these areas. I think it's a fun trip, going through every phase of the machine learning process and layering in new techniques on top of the same tools.
Final caveat: that's an answer specifically about my book. My answer would be totally different if you were trying to get a job in ML or were building your first ML application. That said, I think the book will do a good job of preparing you for future ML development, regardless of what toolchain you choose to use.