The characteristics of concurrent multi-module builds

This article discusses the effects of various concurrency scheduling options, seen in the context of an
actual build. For the sake of discussion, the available concurrency always matches or exceeds the
amount of schedulable work; if you're already saturating your CPU, there's little more to be gained performance-wise.
All measurements in this article are based on ramdisk-based (tmpfs) models, meaning that IO in general is
taken out of the equation.

Legend

The following abbreviations are used:

Letter  Phase
C       compile
TC      test compile
S       surefire
J       jar/install
X       blocked/unrunnable

This analysis tries to explain concurrency options seen in the context of one particular build. While this is definitely
non-exhaustive, it suffices to illustrate the challenges and restrictions encountered in making this /one/ build run optimally.

The module dependency graph is as follows:


Figure 1: Dependency graph of project

My average project

I timed the actual phases of some other builds (my two test projects). These are fairly standard Maven projects
with lots of code and decent test coverage.

The interesting thing (not shown) is that the average time for the different lifecycle phases in a multi-module build
did not vary much. Without losing too much accuracy I could define an "average" module in my multi-module build. For mvn -o clean install my "average" module in my project spends

That means less than the rounding error (<1%) is left for all the other stuff.

The run-time view

To make things more interesting, I've transposed the (real) numbers from "my average project" onto the "imaginary" dependency graph seen in figure 1, to
better understand what is happening. The figure has "time" along the X-axis, and shows the different modules along the Y-axis.


Figure 2: Weave-mode run-time scheduling of modules in the average build, time along X axis

The interesting bit about this is that minor variations in the individual modules have little impact on the end result:
the figures are to scale, so if you can keep them visible at the same time you'll see the (lack of) difference.


Figure 3: Module E changes characteristics (becomes shorter than average), module Z follows scheduling

It's possible to draw a large number of graphs that have significant changes in individual modules but no
change in the end outcome.

The runtime-leaf-module issue

There's a given set of modules that are reactor-leaf-modules in the reactor dependency
tree (Y and Z in this case). There is an additional set of runtime-leaf-modules that constitute the
"last modules to reach package/install" in a concurrent build. If we assume that jar/install is mostly a very small
phase at the end, we see that the race is all about reaching the packaging phase (between S and J in the figures).

Notable special cases:


Figure 4: Same graph as figure 3, but with critical path runtime-leaf-module shown with red line

The graph shows the "critical path" in this build. Although it cannot be known up front, it will in effect always limit the total time
spent building this project.
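
As a rough illustration of what "critical path" means here, the sketch below computes it for a small module graph. Everything in it is an assumption made for the example: only the module names E, Y and Z appear in this article, while module A, the edges and the durations are invented. It also simplifies scheduling to "a module starts once all of its dependencies have finished", which is coarser than the phase-by-phase weave mode shown in the figures.

```python
from functools import lru_cache

# Hypothetical graph: module -> modules it depends on (edges are invented).
DEPS = {"A": [], "E": ["A"], "Y": ["E"], "Z": ["E"]}
# Hypothetical total build time per module, in arbitrary time units.
DURATION = {"A": 4, "E": 3, "Y": 5, "Z": 2}


@lru_cache(maxsize=None)
def finish_time(module):
    """Earliest finish time, assuming unlimited concurrency."""
    start = max((finish_time(dep) for dep in DEPS[module]), default=0)
    return start + DURATION[module]


def critical_path():
    """Walk back from the last module to finish along its slowest dependency."""
    module = max(DEPS, key=finish_time)
    path = [module]
    while DEPS[module]:
        module = max(DEPS[module], key=finish_time)
        path.append(module)
    return list(reversed(path))


if __name__ == "__main__":
    print("critical path:", " -> ".join(critical_path()))
    print("total time:", max(finish_time(m) for m in DEPS))
```

With these invented numbers the last module to finish is Y, so shortening Z (or any module off the path) leaves the total time unchanged.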

Given this understanding, one could be tempted to look at a few other scenarios:

So it would be possible to consider cross-module prioritization of threads and scheduling of tasks.
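
One way such prioritization could look is sketched below: a small simulation that compares starting ready modules in declaration order against starting the module with the most downstream work first. This is an illustration under assumptions, not how Maven schedules anything. The graph, the durations, modules A and B, and the limit of two workers are all hypothetical, and prioritization only makes a difference when there are fewer workers than runnable modules, which is outside the assumption used in the rest of this article.

```python
import heapq

# Hypothetical graph: module -> modules it depends on.
DEPS = {"A": [], "B": [], "E": [], "Z": ["E"]}
# Hypothetical build time per module, in arbitrary time units.
DURATION = {"A": 5, "B": 5, "E": 1, "Z": 10}


def remaining_work(module):
    """Length of the longest chain of work that starts at this module."""
    downstream = [m for m, deps in DEPS.items() if module in deps]
    return DURATION[module] + max((remaining_work(m) for m in downstream), default=0)


def makespan(workers, priority):
    """Simulate the build; ready modules with the lowest priority key start first."""
    pending = {m: len(deps) for m, deps in DEPS.items()}
    ready = sorted((m for m, n in pending.items() if n == 0), key=priority)
    running = []  # heap of (finish_time, module)
    now = 0
    while ready or running:
        # Fill every free worker with the highest-priority ready module.
        while ready and len(running) < workers:
            module = ready.pop(0)
            heapq.heappush(running, (now + DURATION[module], module))
        # Advance time to the next module that finishes.
        now, done = heapq.heappop(running)
        for module, deps in DEPS.items():
            if done in deps:
                pending[module] -= 1
                if pending[module] == 0:
                    ready.append(module)
        ready.sort(key=priority)
    return now


def declaration_order(module):
    return list(DEPS).index(module)


def critical_first(module):
    return (-remaining_work(module), module)


if __name__ == "__main__":
    print("declaration order:", makespan(2, declaration_order))
    print("critical first:   ", makespan(2, critical_first))
```

In this made-up case the short module E gates the long module Z, so starting E first lets Z overlap with A and B instead of running alone at the end.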

Number of schedulable tasks


Figure 5: Number of schedulable tasks
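
The kind of profile such a figure shows can be approximated with a few lines of code. The sketch below counts how many modules are running at each point in time, assuming unlimited concurrency and module-level dependencies (a module starts when all of its dependencies have finished). The graph, module A and the durations are hypothetical and are not the numbers behind the figure.

```python
# Hypothetical graph: module -> modules it depends on (edges are invented).
DEPS = {"A": [], "E": ["A"], "Y": ["E"], "Z": ["E"]}
# Hypothetical build time per module, in arbitrary time units.
DURATION = {"A": 4, "E": 3, "Y": 5, "Z": 2}


def intervals():
    """Return module -> (start, finish), assuming unlimited concurrency."""
    result = {}

    def visit(module):
        if module not in result:
            start = max((visit(dep) for dep in DEPS[module]), default=0)
            result[module] = (start, start + DURATION[module])
        return result[module][1]

    for module in DEPS:
        visit(module)
    return result


def concurrency_profile():
    """List (time, number of running modules) at every start or finish event."""
    spans = intervals().values()
    events = sorted({t for span in spans for t in span})
    return [(t, sum(1 for start, end in spans if start <= t < end)) for t in events]


if __name__ == "__main__":
    for time, count in concurrency_profile():
        print(f"t={time:>2}: {'#' * count} ({count})")
```

Even in this tiny example the number of runnable modules never exceeds two, which hints at why extra threads beyond the width of the dependency graph bring no benefit.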

Variations


Figure 6: first module in reactor dependency is critical path of execution

In this scenario, the unit tests in the first module take a long time to complete.

What does it mean?