Abstract: I'm going to describe the current CPU usage of GPU-Grid, make a few assumptions about how things work behind the scenes, and then suggest a way to reduce CPU usage while keeping the performance of current GPUs constant and improving the performance of older GPUs.
Current situation: the new Kepler GPUs use a full CPU core for most WUs (only half a core for some). I remember someone (probably GDF) saying this was done to ensure they perform as fast as they can.
On older GPUs a mechanism called Swan Sync is enabled by default and keeps CPU usage rather low, e.g. ~8% on a GTX 570 if I remember correctly. On these cards performance started to suffer the faster they became, so disabling Swan Sync by setting the environment variable "Swan_Sync=0" became standard practice with older apps. This way an entire core was used and optimal GPU performance was achieved.
Assumptions:
- Using a full core does not involve any magic or additional calculations; the CPU just polls the GPU as fast as possible so the GPU never runs dry.
- Swan Sync itself seems to work pretty well, as evidenced by the slower cards.
- Swan Sync predicts how long a GPU will need for a given time step (or whatever the chunk of work is that the GPU processes without CPU intervention) and puts the CPU thread to sleep for roughly that long. After waking up, the CPU thread continuously polls the GPU for the results.
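To make the assumed sleep-then-poll cycle concrete, here's a minimal sketch of what one time step might look like. This is purely illustrative: `FakeGpu`, `launch_kernel` and `is_done` are hypothetical names I made up, and the real app certainly talks to the driver differently.

```python
import time

class FakeGpu:
    """Stand-in for the real device: the 'kernel' finishes after
    kernel_time seconds of wall-clock time (hypothetical, for illustration)."""
    def __init__(self, kernel_time):
        self.kernel_time = kernel_time
        self.start = None
    def launch_kernel(self):
        self.start = time.monotonic()
    def is_done(self):
        return time.monotonic() - self.start >= self.kernel_time

def run_time_step(gpu, predicted_sleep):
    """One Swan-Sync-style time step: sleep for the predicted duration,
    then busy-poll until the GPU is done. Returns the number of polls;
    0 polls means we slept at least as long as the GPU needed."""
    gpu.launch_kernel()
    time.sleep(predicted_sleep)     # CPU thread sleeps instead of spinning
    polls = 0
    while not gpu.is_done():        # busy-wait only for the remainder
        polls += 1
    return polls

# If we sleep longer than the kernel takes, no polling is needed at all:
print(run_time_step(FakeGpu(0.01), 0.02))  # -> 0
```

Disabling Swan Sync corresponds to `predicted_sleep = 0`: the loop then spins for the whole time step, which is the full-core case described above.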
- Where Swan Sync starts to "fall apart" is when the time needed per time step approaches the timer granularity of the OS scheduler, i.e. on fast cards.
- Each time step requires approximately the same time.
- Swan Sync is not yet adaptive.
Intermediate conclusion: what we'd actually want is to switch between Swan Sync and constant polling based on GPU performance rather than GPU generation. One could approximate GPU speed from the clock speed, number of shaders and compute capability... but let's take this a step further.
Suggestion: I'm about to describe an algorithm which is based on Swan Sync, but introduces a correction to the time prediction calculated by Swan Sync. This correction is determined empirically "on the fly" from readily available data, is continuously updated, and hence automatically accounts for all factors influencing the performance of a machine (static and temporary ones).
Here's how to do it: wiggle the CPU's sleep time up and down. We start with whatever sleep time Swan Sync calculates. For the next time step we reduce this time by a small amount, maybe 5%. After completing this time step we check whether the GPU finished its work earlier than our initial prediction forecast. In that case we apply a small correction to the time predicted by Swan Sync and repeat this two-step cycle. This mechanism should ensure that the sleep time is always short enough that the GPU is never limited by being polled too late. If this results in practically constant polling for very fast cards - fine, we're not losing anything compared to the current situation.
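The two-step cycle above could be sketched like this (all names are mine, and the 5% figures are just the placeholder values from the text):

```python
def next_sleep_time(base_prediction, correction, trial_step):
    """Sleep time for the next time step: the corrected Swan Sync
    prediction, shortened by 5% on every other ("trial") step."""
    sleep = base_prediction * correction
    return sleep * 0.95 if trial_step else sleep

def update_correction(correction, finished_before_wakeup, down_step=0.05):
    """After a trial step: if the GPU had already finished when the
    shorter sleep ended, our prediction is too long, so shrink the
    correction factor a little."""
    if finished_before_wakeup:
        return correction * (1.0 - down_step)
    return correction
```

The correction factor starts at 1.0 (pure Swan Sync) and drifts downward whenever the trial steps show the GPU finishing before we wake up, so the sleep time converges from above onto the real kernel duration.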
And there's another case to consider: both the predicted sleep time and the wiggled-down one are too short, i.e. in both cases some continuous polling of the GPU happened before the time step completed. In this case we can increase the sleep time (i.e. the correction to the Swan Sync estimate). However, I'd do this carefully, as it can cost GPU performance. Maybe try sleeping 5% longer every 50 time steps or so, and then check whether GPU performance suffered (the GPU was ready before polling started, so we keep the current sleep time) or whether it's still fine (still some polling before completion, so we can increase the sleep time further).
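The cautious upward direction might look like this (again a sketch with made-up names; the 5%/50-step numbers are the placeholders from above):

```python
def maybe_wiggle_up(correction, steps_since_increase, still_polling,
                    up_step=0.05, interval=50):
    """Every ~50 time steps, cautiously try sleeping 5% longer, but only
    if we are still busy-polling before completion even at the current
    sleep time (so a longer sleep cannot starve the GPU).
    Returns (new_correction, new_steps_since_increase)."""
    if steps_since_increase >= interval and still_polling:
        return correction * (1.0 + up_step), 0
    return correction, steps_since_increase + 1
```

Note the asymmetry compared to the downward wiggle: shortening the sleep can only cost a little CPU, while lengthening it can stall the GPU, which is why the increase is gated on both the interval and on polling still being observed.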
Obviously some fine-tuning of the wiggle step size and frequency would be needed, and I'd also keep a history of the last 100 or 1000 time steps, which should help make an even better prediction. Make this window small to keep the algorithm agile - it will react quickly to changes. Make it larger to smooth things out, so that the occasional odd value doesn't throw the timing completely off balance.
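The history window could be as simple as a sliding average over the last N measured step durations, e.g. (class and method names hypothetical):

```python
from collections import deque

class StepHistory:
    """Sliding window over the last N time-step durations. The window
    size trades agility (small N reacts quickly) against smoothness
    (large N damps the occasional odd value)."""
    def __init__(self, size=100):
        self.samples = deque(maxlen=size)  # old samples fall out automatically

    def add(self, duration):
        self.samples.append(duration)

    def predict(self, fallback):
        """Average of the window, or the Swan Sync estimate as fallback
        while the window is still empty."""
        if not self.samples:
            return fallback
        return sum(self.samples) / len(self.samples)
```

A plain average is the simplest choice; one could just as well use a median or an exponentially weighted average to be more robust against outliers.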
Summary: by constantly wiggling the sleep time down and carefully wiggling it up, it should be possible to achieve optimum GPU performance at little additional CPU load, independent of the GPU generation.
Let me know what you think!
Scanning for our furry friends since Jan 2002