The Scala compiler can be brutally slow. The community has a love-hate relationship with it. Love means “Yes, scalac is slow”. Hate means, “Scala — 1★ Would Not Program Again”. It’s hard to go a week without reading another rant about the Scala compiler.
Moreover, one of the Typesafe co-founders left the company shouting, “The Scala compiler will never be fast” (17:53). Even Scala inventor Martin Odersky provides a list of fundamental reasons why compiling is slow.
At Sumo Logic, we happily build over 600K lines of Scala code[1] with Maven and find this setup productive. Based on the public perception of the Scala build process, this seems about as plausible as a UFO landing on the roof of our building. Here’s how we do it:
Many modules
At Sumo Logic, we have more than 120 modules. Each has its own source directory, unit tests, and dependencies. As a result, each of them is reasonably small and well defined. Usually, you just need to modify one or a few of them, which means that you can just build them and fetch binaries of dependencies[2].
Using this method is a huge win in build time and also makes the IDE and test suites run more quickly. Fewer elements are always easier to handle.
We keep all modules in a single GitHub repository. Though we experimented with a separate repository for each project, keeping track of version dependencies was too complicated.
Parallelism on module level
Although Moore’s law is still at work, single cores have not become much faster since 2004. The Scala compiler has some parallelism, but it’s nowhere close to saturating eight cores[3] in our use case.
Enabling parallel builds in Maven 3 helped a lot. At first, it caused a lot of non-deterministic failures, but it turned out that always forking the Java compiler fixed most of the problems[4]. That allows us to saturate all of the CPU cores during most of the build time. Even better, it allows us to overcome other bottlenecks (e.g., fetching dependencies).
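As a rough sketch of what such an invocation can look like: Maven 3 accepts `-T` to build independent modules in parallel, and the maven-compiler-plugin reads the `maven.compiler.fork` property to run javac in a separate process. The exact command line is not given in this post, so the helper name `parallel_build_cmd`, the default of one thread per core, and the `test-compile` goal are illustrative assumptions.

```shell
# Hypothetical helper: prints a Maven command line for a parallel build.
#   -T <n>C                   Maven 3 option: n build threads per CPU core
#   -Dmaven.compiler.fork     maven-compiler-plugin: run javac in its own process
parallel_build_cmd() {
  threads_per_core="${1:-1}"   # assumption: one thread per core unless overridden
  echo "mvn -T ${threads_per_core}C -Dmaven.compiler.fork=true test-compile"
}
```

Calling `parallel_build_cmd 2` prints a command that uses two threads per core; the same `fork` setting can also be placed in the compiler plugin's configuration in the POM instead of on the command line.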
Incremental builds with Zinc
Zinc brings features from sbt to other build systems, providing two major gains:
- It keeps warmed compilers running, which avoids the startup JVM “warm-up tax”.
- It allows incremental compilation. We rarely compile from a clean state; usually we make a small change and only the affected code needs to be recompiled. This is a huge gain when doing Test Driven Development.
For a long time we were unable to use Zinc with parallel module builds. As it turns out, we needed to tell Zinc to fork Java compilers. Luckily, an awesome Typesafe developer, Peter Vlugter, implemented that option and fixed our issue.
Time statistics
The following example shows the typical development workflow of building one module. For this benchmark, we picked the largest one by lines of code (53K LOC).
This next example shows building all modules (674K LOC), the most time consuming task.
Usually we can skip test compilation, bringing build time down to 12 minutes.[5]
Wrapper utility
Still, some engineers were not happy, because:
- They often build and test more than needed.
- Computers get slow if you saturate the CPU (e.g., video conference becomes sluggish).
- Passing the correct arguments to Maven is hard.
Educating developers might have helped, but we picked the easier route. We created a simple bash wrapper that:
- Runs every Maven process with lower CPU priority (nice -n 15), so the build process doesn’t slow the browser, IDE, or a video conference.
- Makes sure that Zinc is running. If not, it starts it.
- Allows you to compile all the dependencies (downstream) easily for any module.
- Allows you to compile all the things that depend on a module (upstream).
- Makes it easy to select the kind of tests to run.
Though it is a simple wrapper, it improves usability a lot. For example, if you fixed a library bug for a module called “stream-pipeline” and would like to build and run unit tests for all modules that depend on it, just use this command:
bin/quick-assemble.sh -tu stream-pipeline
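To make the idea concrete, here is a minimal sketch of how a wrapper of this kind could map short options onto Maven's reactor flags. It is not the real quick-assemble.sh: the function name `wrapper_cmd` and the option letters `-d`, `-u`, and `-t` are assumptions chosen to mirror the command above, and the Zinc startup check is left out. Following this post's terminology, "downstream" maps to Maven's `-am` (also build what the module depends on) and "upstream" maps to `-amd` (also build what depends on the module).

```shell
# Hypothetical sketch of a Maven wrapper; NOT the real quick-assemble.sh.
# Option letters (-d, -u, -t) are illustrative assumptions.
# The function body is a subshell, so its variables do not leak to the caller.
wrapper_cmd() (
  scope=""               # by default, build only the named module
  tests="-DskipTests"    # skip tests unless -t is given
  OPTIND=1
  while getopts "dut" opt; do
    case "$opt" in
      d) scope="-am" ;;   # Maven --also-make: modules this one depends on
      u) scope="-amd" ;;  # Maven --also-make-dependents: modules that depend on it
      t) tests="" ;;      # run unit tests
      *) exit 2 ;;
    esac
  done
  shift $((OPTIND - 1))
  if [ $# -ne 1 ]; then
    echo "usage: wrapper_cmd [-d|-u] [-t] <module>" >&2
    exit 2
  fi
  # nice -n 15 lowers CPU priority so the build does not starve other apps.
  set -- nice -n 15 mvn -T 1C -pl "$1"
  [ -n "$scope" ] && set -- "$@" "$scope"
  [ -n "$tests" ] && set -- "$@" "$tests"
  echo "$* install"
)
```

The sketch only prints the command it would run, which keeps it easy to inspect; replacing the final `echo` with `"$@" install` would execute it instead.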
Tricks we learned along the way
- Print the longest chain of module dependency by build time.
That helps identify the “unnecessary or poorly designed dependencies,” which can be removed. This makes the dependency graph much more shallow, which means more parallelism.
- Run a build in a loop until it fails.
As simple as in bash: while bin/quick-assemble.sh; do :; done.
Then leave it overnight. This is very helpful for debugging non-deterministic bugs, which are common in a multithreading environment.
- Analyze the bottlenecks of build time.
CPU? IO? Are all cores used? Network speed? The limiting factor can vary during different phases. iStat Menus proved to be really helpful.
- Read the Maven documentation.
Many things in Maven are not intuitive. The “trial and error” approach can be very tedious for this build system. Reading the documentation carefully is a huge time saver.
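The first trick, finding the longest dependency chain by build time, amounts to a longest-path computation over the module graph. The sketch below is one possible way to do it, not the tool used for this post. It assumes a hypothetical input format of one line per module, `<module> <seconds> [dependencies...]`, with modules listed after the modules they depend on; the function name `longest_chain` is also an assumption.

```shell
# Hypothetical sketch: print the dependency chain with the largest total build time.
# Input (stdin): "<module> <seconds> [deps...]", dependencies listed before dependents.
longest_chain() {
  awk '
    {
      best = 0; via = ""
      # Pick the dependency whose own chain takes the longest to build.
      for (i = 3; i <= NF; i++)
        if (total[$i] > best) { best = total[$i]; via = $i }
      total[$1] = best + $2
      chain[$1] = (via == "" ? $1 : chain[via] " -> " $1)
      if (total[$1] > max) { max = total[$1]; last = $1 }
    }
    END { if (last != "") print max "s: " chain[last] }
  '
}
```

Fed with per-module timings, it prints the total time and the modules along the slowest chain, which is the part of the graph that limits how much a parallel build can help.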
Summary
Building at scale is usually hard. Scala makes it harder because of its relatively slow compiler, so you will hit these issues much earlier than in other languages. However, the problems are solvable through general development best practices, especially:
- Modular code
- Parallel execution by default
- Investment in tooling
Then it just rocks!
[1] ( find ./ -name '*.scala' -print0 | xargs -0 cat ) | wc -l
[2] All modules are built and tested by Jenkins and the binaries are stored in Nexus.
[3] The author’s 15-inch MacBook Pro from late 2013 has eight cores.
[4] We have little Java code. Theoretically, the Java 1.6 compiler is thread-safe, but it has some concurrency bugs. We decided not to dig into that, as forking seems to be an easier solution.
[5] Benchmark methodology:
- Hardware: MacBook Pro, 15-inch, Late 2013, 2.3 GHz Intel i7, 16 GB RAM.
- All tests were run three times and the median time was selected.
- Non-incremental Maven goal: clean test-compile.
- Incremental Maven goal: test-compile. A random change was introduced to trigger some recompilation.
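For anyone who wants to repeat a measurement like this, a minimal sketch of the "run it several times and take the median" step follows. The helper names `median` and `median_runtime` are assumptions, and timing uses `date +%s`, so results are in whole seconds only.

```shell
# Print the median of numeric lines read from stdin.
# For an even count this returns the lower of the two middle values.
median() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR > 0) print v[int((NR + 1) / 2)] }'
}

# Hypothetical helper: run a command N times and print the median wall time
# in whole seconds. Command output and exit status are ignored.
median_runtime() {
  runs="$1"; shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
    i=$((i + 1))
  done | median
}
```

For example, `median_runtime 3 mvn clean test-compile` would time the non-incremental goal three times and print the middle result.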