Preventing Dispatch Storms

The term boot storm originates from virtual computing infrastructure. When many systems boot within a narrow time interval, the infrastructure is overwhelmed and becomes unresponsive. A similar effect occurs in CI environments, which we call a dispatch storm. Cider-CI has built-in means to avoid dispatch storms as of version 3.14.

updated: 2016-06-03
created: 2016-04-27
level: intermediate
keywords: dispatching, dispatch storms

Cider-CI is capable of dispatching many tasks within a very short time span. This is one of the qualities that make Cider-CI a very efficient testing platform.

When many tasks (more precisely, trials in Cider-CI parlance) are dispatched to the same executor, negative effects can occur. The underlying hardware of the executor might be overwhelmed by context switching. Concurrent project checkouts can be much slower than sequential checkouts, in particular if traditional magnetic hard disks are involved. The result is much slower execution and a higher likelihood of timeouts within the tests.

One solution is to limit the number of available slots on the executor to well below the number of real cores available on it. This has the disadvantage that the available resources are not used optimally.

In our experience, it is mostly only the starting phase of a test that places a high demand on system resources. To mitigate these demands during the first few seconds of a test, Cider-CI version 3.14 introduces the dispatch_storm_delay_duration directive. After a trial has been dispatched, no further trial will be dispatched to the same executor until this duration has passed. The directive is placed on the task level; see the Dispatch-Storm Delay Demo for an example. The value is given in human-readable form, for example 30 Seconds.
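
The gating rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior, not Cider-CI's actual dispatcher code: the class name DispatchGate, the method try_dispatch, and the executor names are hypothetical, and time is passed in explicitly as seconds to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DispatchGate:
    """Hypothetical helper: remembers, per executor, when the next dispatch is allowed."""

    next_allowed: Dict[str, float] = field(default_factory=dict)

    def try_dispatch(self, executor: str, now: float, delay_seconds: float) -> bool:
        """Return True and open a new delay window if the executor may receive a trial."""
        if now < self.next_allowed.get(executor, 0.0):
            return False  # still inside the delay window of the previous trial
        # The delay of the trial being dispatched blocks further dispatches.
        self.next_allowed[executor] = now + delay_seconds
        return True


gate = DispatchGate()
results = [
    gate.try_dispatch("executor-1", now=0.0, delay_seconds=30.0),   # first trial goes out
    gate.try_dispatch("executor-1", now=10.0, delay_seconds=30.0),  # blocked, window still open
    gate.try_dispatch("executor-2", now=10.0, delay_seconds=30.0),  # other executor unaffected
    gate.try_dispatch("executor-1", now=30.0, delay_seconds=30.0),  # window has passed
]
print(results)  # [True, False, True, True]
```

Note that the delay only spaces out the start of trials on one executor. Trials that are already running continue, and other executors are not affected.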

An additional setting in the server configuration is used as the default if a task does not provide a value. A low value of one to three seconds has shown a positive impact even on very light tests since it covers the project checkout phase. For this reason, a low default of one second is in place.
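
How a task-level value and the server default could combine is sketched below. The function names, the unit table, and the simple "number unit" parsing are assumptions made for illustration; the duration grammar Cider-CI actually accepts may be richer. Only the one-second default and the 30 Seconds example come from the text.

```python
from typing import Optional

# Assumed unit table for illustration; the real grammar accepted by Cider-CI may differ.
UNIT_SECONDS = {"second": 1.0, "minute": 60.0, "hour": 3600.0}

# The text states that a default of one second is in place on the server.
SERVER_DEFAULT = "1 Second"


def parse_duration(text: str) -> float:
    """Convert a value such as '30 Seconds' into a number of seconds."""
    amount, unit = text.split()
    unit = unit.lower().rstrip("s")  # 'Seconds' -> 'second'
    return float(amount) * UNIT_SECONDS[unit]


def effective_delay(task_value: Optional[str], server_default: str = SERVER_DEFAULT) -> float:
    """Use the task's own value when present, otherwise fall back to the server default."""
    return parse_duration(task_value if task_value is not None else server_default)


print(effective_delay("30 Seconds"))  # task-level value wins: 30.0
print(effective_delay(None))          # falls back to the server default: 1.0
```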

More complicated tests benefit from a higher value. Finding a fitting value for a particular set of tests requires observation over a few executions. We found no general setting that served all the kinds of tests we execute. Watch out for the following circumstances:

When "big" executors with many cores are used.
When the project to be checked out is rather large, or the disks of the executor are fairly slow.
When a test involves many services which are started and run concurrently.
When at least one heavy compilation step is present in the test.

When any one of these circumstances is present, it pays off to invest some time and adjust the dispatch_storm_delay_duration directive for the particular kind of test. We don't set a dedicated value for every single test. Instead, we set it per group of tests; for the Madek project, for example, we group "feature tests" and "model tests". The tests for Cider-CI itself, which satisfy three of the four given criteria, are also an interesting example.