The term boot storm originates from virtual computing infrastructure. When many systems boot within a narrow time interval, the infrastructure is overwhelmed and becomes unresponsive. There is a similar effect in CI environments, which we call a dispatch storm. As of version 3.14, Cider-CI has built-in means to avoid dispatch storms.
Cider-CI is capable of dispatching many tasks within a very short time span. This is one of the qualities that make Cider-CI a very efficient testing platform.
When many tasks (more precisely, trials in Cider-CI parlance) are dispatched to the same executor, negative effects can occur. The underlying hardware of the executor might be overwhelmed by context switching. Concurrent project checkouts can be much slower than sequential checkouts, in particular if traditional magnetic hard disks are involved. The result is much slower execution and a higher likelihood of timeouts within the tests.
One solution is to limit the number of available slots on the executor to well below the number of real cores available on the executor. This has the disadvantage that the available resources are not used optimally.
In our experience, it is mostly the starting phase of a test that places a high demand on system resources. To mitigate this demand during the first few seconds of a test, Cider-CI version 3.14 introduces the dispatch_storm_delay_duration directive. After a trial has been dispatched, no further trial will be dispatched to the same executor until this duration has passed. The directive is placed on the task level; see the Dispatch-Storm Delay Demo for an example. The value is given in human-readable form, for example 30 Seconds.
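The dispatch rule described above can be illustrated with a small simulation. This is a minimal sketch and not Cider-CI's implementation: the Executor class, try_dispatch and simulate are hypothetical names, and the model assumes that all trials are ready at time zero and that no trial finishes during the simulation.

```python
from dataclasses import dataclass


@dataclass
class Executor:
    """Hypothetical stand-in for an executor; not Cider-CI's data model."""
    name: str
    free_slots: int
    # Earliest time (in seconds) at which the next trial may be dispatched.
    next_dispatch_at: float = 0.0


def try_dispatch(executor: Executor, now: float, delay_seconds: float) -> bool:
    """Dispatch one trial if a slot is free and the storm delay has passed."""
    if executor.free_slots <= 0:
        return False
    if now < executor.next_dispatch_at:
        return False
    executor.free_slots -= 1
    # Block further dispatches to this executor until the delay has passed.
    executor.next_dispatch_at = now + delay_seconds
    return True


def simulate(pending_trials: int, slots: int, delay_seconds: float) -> list[float]:
    """Return the dispatch times of trials that are all ready at t = 0.

    Simplification: trials never finish, so slots are never handed back.
    """
    executor = Executor("executor-1", free_slots=slots)
    times = []
    now = 0.0
    while pending_trials > 0 and executor.free_slots > 0:
        if try_dispatch(executor, now, delay_seconds):
            times.append(now)
            pending_trials -= 1
        else:
            # A slot is free, so only the delay blocks us: jump ahead in time.
            now = executor.next_dispatch_at
    return times


if __name__ == "__main__":
    print(simulate(pending_trials=4, slots=8, delay_seconds=30.0))
```

With a delay of 30 seconds the four trials are spread out over time instead of starting together, while a delay of zero reproduces the storm in which every trial starts at once.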
An additional setting in the server configuration is used as a default if a task does not provide a value. A low value between one and three seconds has shown a positive impact even on very light tests, since it covers the project checkout phase. For this reason, a low default of one second is in place.
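The fallback from the task-level value to the server default could look roughly like the following sketch. Everything here is an assumption made for illustration: parse_duration accepts only a simplified "number plus unit" form and may not match the duration grammar Cider-CI actually accepts, the task is modelled as a plain dictionary, and SERVER_DEFAULT merely mirrors the one-second default mentioned above.

```python
import re

# Assumed unit table; the units Cider-CI really accepts may differ.
UNIT_SECONDS = {"second": 1.0, "minute": 60.0, "hour": 3600.0}

# Mirrors the one-second server default mentioned in the text.
SERVER_DEFAULT = "1 Second"


def parse_duration(text: str) -> float:
    """Parse a simplified human-readable duration such as '30 Seconds'."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse duration: {text!r}")
    amount, unit = match.groups()
    # Accept singular and plural, in any letter case.
    unit = unit.lower().rstrip("s")
    if unit not in UNIT_SECONDS:
        raise ValueError(f"unknown unit in duration: {text!r}")
    return float(amount) * UNIT_SECONDS[unit]


def effective_delay(task: dict, server_default: str = SERVER_DEFAULT) -> float:
    """Use the task-level value if present, otherwise the server default."""
    value = task.get("dispatch_storm_delay_duration")
    return parse_duration(value if value is not None else server_default)


if __name__ == "__main__":
    print(effective_delay({}))
    print(effective_delay({"dispatch_storm_delay_duration": "30 Seconds"}))
```

A task without the directive falls back to the default, while a task that sets the directive overrides it.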
More complicated tests benefit from a higher value. Finding a fitting value for a particular set of tests requires observation over a few executions. We found no general setting that serves all the kinds of tests we execute. Watch out for the following patterns:
When any of these circumstances is present, it pays off to invest some time and adjust the dispatch_storm_delay_duration directive for the particular kind of test. We do not set the value for each single test; instead we set it per group of tests, for example for the "feature tests" and the "model tests" of the Madek project. The tests for Cider-CI itself, which satisfy three of the four given criteria, are another interesting example.