The characteristics of concurrent multi-module builds
This article discusses the effects of various concurrency scheduling options, seen in the context of an
actual build. For the sake of discussion, the available concurrency always matches or exceeds the
amount of schedulable work; if you're already saturating your CPU, there's little more to be gained performance-wise.
All measurements in this article are also based on ramdisk (tmpfs) based models, meaning that IO in general is
taken out of the equation.
The following abbreviations are used:
Figure 1: Dependency graph of project
My average project
I timed the actual phases of some other builds (my two test projects). These are fairly standard Maven projects
with lots of code and decent test coverage.
That means less than a rounding error (<1%) is left for all the other stuff.
The run-time view
To make things more interesting, I've transposed the (real) numbers from "my average project" onto the "imaginary" dependency graph seen in Figure 1, to
better understand what is happening. The figure has "time" along the X axis and shows the different modules along the Y axis.
It's possible to draw a large number of graphs that have significant changes in individual modules but no
change in the end outcome.
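As a minimal sketch of why this happens, assume every module starts the moment its dependencies have finished (concurrency is never the limit, as stated at the top). The module names A, B and C, the edges and the durations below are invented for illustration; only Y and Z being leaves comes from this article, and the real graph is the one in Figure 1. Quadrupling the duration of C changes nothing in the total, because C is not on the critical path.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: finish times of modules in a fully concurrent reactor build. */
final class ReactorTimeline {

    /**
     * Returns the time at which each module finishes, assuming a module starts
     * as soon as all of its dependencies have finished.
     * The durations map must iterate in reactor (topological) order.
     */
    static Map<String, Integer> finishTimes(Map<String, Integer> durations,
                                            Map<String, List<String>> dependencies) {
        Map<String, Integer> finish = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> module : durations.entrySet()) {
            int start = 0;
            for (String dep : dependencies.getOrDefault(module.getKey(), List.of())) {
                start = Math.max(start, finish.get(dep)); // wait for the slowest dependency
            }
            finish.put(module.getKey(), start + module.getValue());
        }
        return finish;
    }

    /** Wall-clock time of the whole build: the latest finish time of any module. */
    static int buildTime(Map<String, Integer> durations,
                         Map<String, List<String>> dependencies) {
        int latest = 0;
        for (int t : finishTimes(durations, dependencies).values()) {
            latest = Math.max(latest, t);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Hypothetical graph: B and C depend on A, Y depends on B and C, Z depends on C.
        Map<String, List<String>> deps = Map.of(
                "B", List.of("A"),
                "C", List.of("A"),
                "Y", List.of("B", "C"),
                "Z", List.of("C"));

        // Hypothetical durations, inserted in reactor order.
        Map<String, Integer> durations = new LinkedHashMap<>();
        durations.put("A", 10);
        durations.put("B", 30);
        durations.put("C", 5);
        durations.put("Y", 10);
        durations.put("Z", 5);

        System.out.println(buildTime(durations, deps)); // prints 50 (A -> B -> Y)

        durations.put("C", 20); // C becomes four times slower
        System.out.println(buildTime(durations, deps)); // still prints 50
    }
}
```

Only a change to a module on the longest chain (A, B or Y with these numbers) moves the end of the build.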
The runtime-leaf-module issue
There's a given set of modules that are reactor-leaf-modules in the reactor dependency
tree (Y and Z in this case). There is an additional set of runtime-leaf-modules that constitute the
"last modules to reach package/install" in a concurrent build. If we assume that jar/install is mostly a very small
phase at the end, we see that the race is all about reaching the packaging phase (between S and J in the figures).
So it would be possible to consider cross-module prioritization of threads/scheduling of tasks.
Number of schedulable tasks
Figure 5: Number of schedulable tasks
Figure 6: first module in reactor dependency is critical path of execution
In this scenario, the unit tests in the first module take a long time to complete.
What does it mean?
- Overall concurrency is governed by the reactor dependency graph
- There's little/no point in trying to schedule "things" out of order. We can just let everything from package
onwards respect the reactor dependency graph totally
- There are a number of crazy strategies I tried out, some of which I communicated to the dev list:
Most crazy optimizations only have a real use case if they're along the critical path, and then the effect is quite limited,
unless the optimization can affect /all/ of the potential critical paths that may arise, and even then it will be limited by the
number of runtime-leaf-modules.
- Profile information from previous runs could be used to influence priorities, but it looks mostly like it would be
Maven telling different surefire-modules how many resources they can consume to reach the overall goal
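
One way profile information could feed into priorities is the classic critical-path heuristic from list scheduling: rank each module by its own recorded duration plus the longest chain of modules that depend on it. This is a hedged sketch, not something Maven or surefire implements; the class name, the module names A, B and C, the graph and the recorded durations are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: derive scheduling priorities from durations recorded in a previous run. */
final class ProfilePriorities {

    /**
     * For each module, returns its own recorded duration plus the longest
     * chain of work among the modules that (transitively) depend on it.
     * A higher value means the module is more urgent.
     */
    static Map<String, Integer> remainingWork(List<String> reactorOrder,
                                              Map<String, Integer> recordedDurations,
                                              Map<String, List<String>> dependencies) {
        Map<String, Integer> remaining = new HashMap<>();
        // Walk the reactor order backwards so that every dependent module
        // is resolved before the modules it depends on.
        for (int i = reactorOrder.size() - 1; i >= 0; i--) {
            String module = reactorOrder.get(i);
            // At this point remaining.get(module) holds the longest tail of dependents, if any.
            int total = recordedDurations.get(module) + remaining.getOrDefault(module, 0);
            remaining.put(module, total);
            for (String dep : dependencies.getOrDefault(module, List.of())) {
                remaining.merge(dep, total, Integer::max); // keep the longest tail seen so far
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Hypothetical graph: B and C depend on A, Y depends on B and C, Z depends on C.
        Map<String, List<String>> deps = Map.of(
                "B", List.of("A"),
                "C", List.of("A"),
                "Y", List.of("B", "C"),
                "Z", List.of("C"));
        // Hypothetical durations, as if recorded during an earlier build.
        Map<String, Integer> recorded = Map.of("A", 10, "B", 30, "C", 5, "Y", 10, "Z", 5);

        Map<String, Integer> priority =
                remainingWork(List.of("A", "B", "C", "Y", "Z"), recorded, deps);
        System.out.println(priority.get("B") + " " + priority.get("C")); // prints 40 15
    }
}
```

With these numbers B (40) would outrank C (15) once A has finished, and the value for A (50) equals the length of the critical path. A ranking like this can only change the outcome when modules actually compete for threads or other resources.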