Lunar Bug

On the morning of 1st July 2015, I was notified of high CPU load on most of our servers (openSUSE Linux). Needless to say, this made many of our customer-facing applications unavailable. The engineers troubleshooting the problem were at their wits' end. No code had been deployed. No new script had been run. There were no hacker attacks. We had not run any huge campaigns.


Most Java processes exhibited the high load. We ran strace and found that the system call they kept making was futex. A futex is a low-level locking primitive, used as a building block for higher-level synchronization constructs. It turns out that most of our Java processes were spinning on futex because of the "leap second". Because the Moon is gradually slowing Earth's rotation, the international timekeepers occasionally add a second to keep precise atomic time in step with the less regular, observed solar time. As a result, NTP (the protocol that keeps server clocks in sync) made the system clock count 23:59:59 twice on June 30, 2015. But the servers' internal timers did not register the extra second. From then on, futex waits with a timeout kept expiring immediately.
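For anyone who wants to repeat the strace step, here is a rough sketch of how the output can be tallied by system call. The helper name count_syscalls is my own invention, and it assumes the usual strace line shape of an optional "[pid N]" prefix followed by "name(args) = result". It is only a convenience wrapper, not the exact command we ran that morning.

```shell
#!/bin/sh
# Hypothetical helper: tally syscall names from raw strace output on stdin.
# Assumes lines such as:  [pid  1234] futex(0x7f..., FUTEX_WAIT, ...) = -1 ETIMEDOUT
# Lines that do not start with "name(" (e.g. "<... futex resumed>") are ignored.
count_syscalls() {
    sed -nE 's/^(\[pid +[0-9]+\] +)?([a-z_0-9]+)\(.*/\2/p' |
        sort | uniq -c | sort -rn
}

# Usage sketch (PID is a placeholder for the busy Java process):
#   timeout 5 strace -f -p "$PID" 2>&1 | count_syscalls | head -n 5
```

If one call dominates the list the way futex did for us, that is the one to look into.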


To give an analogy, imagine cleaning your inbox by finding very old emails and deleting them to make space. Because one second is like eons for a CPU, every lock request was treated like an old email from the Jurassic era and was cleaned up before it could even be opened. This continuous denial of lock requests causes spinning, and spinning locks that never resolve leave the CPU in a vegetative state.
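The same idea can be put in numbers. What follows is a toy model only, not kernel code: the function name wait_outcome, its parameters, and the millisecond values are all made up for illustration. It assumes the waiter computes its deadline from the wall clock, while the timer checks that deadline against a clock that has ended up one second ahead.

```shell
#!/bin/sh
# Toy model, not kernel code. All names and numbers are illustrative.
# $1 = wall-clock "now" in ms, $2 = requested timeout in ms,
# $3 = how far ahead (ms) the timer's clock is compared to the wall clock.
wait_outcome() {
    wall_now_ms=$1
    timeout_ms=$2
    skew_ms=$3
    deadline_ms=$((wall_now_ms + timeout_ms))   # deadline as the waiter sees it
    timer_now_ms=$((wall_now_ms + skew_ms))     # "now" as the timer sees it
    if [ "$timer_now_ms" -ge "$deadline_ms" ]; then
        echo "timed out immediately"
    else
        echo "sleeps $((deadline_ms - timer_now_ms)) ms"
    fi
}

wait_outcome 0 100 0      # clocks agree: prints "sleeps 100 ms"
wait_outcome 0 100 1000   # one-second skew: prints "timed out immediately"
```

With a one-second skew, any wait shorter than a second is already "in the past" when it is created, so the thread wakes up, retries, and burns CPU in a loop.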


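The fix was simple in the end: set the clock to its own current value, which makes the kernel resynchronize its timers. Run this command as root, then restart the affected processes.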

date -s "`date`"
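If you would rather see what will be run before touching the clock, the same fix can be wrapped in a small function. This is a sketch under my own assumptions: the reset_clock name and the DRY_RUN variable are hypothetical, and "date -s" with a free-form string is the GNU date behaviour we had on our Linux servers.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-line fix above.
# DRY_RUN=1 only prints the command; otherwise root is required.
reset_clock() {
    now=$(date)                              # current time, as date prints it
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'date -s "%s"\n' "$now"       # show the command, change nothing
    elif [ "$(id -u)" -ne 0 ]; then
        echo "must be root to set the clock" >&2
        return 1
    else
        date -s "$now"                       # set the clock to its own value
    fi
}

# Usage sketch:
#   DRY_RUN=1 reset_clock    # print the command only
#   reset_clock              # as root: actually set the clock
```

The affected processes still need a restart afterwards, as described above.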


We blamed lunar gravity in our RCA.


The Moon slowing down Earth's rotation. Frozen servers as a result. Wow!! I felt the heavens were pointing us to a glitch in the matrix!!