OpenStack Gone Wild, a Stable Bug’s Life

A while ago, one of our customers asked for more security in their OpenStack installation. So we went ahead and put a web proxy in front of their public¹ API endpoints, deploying SSL certificates for authentication at the same time.

For a couple of days, all went well. Then suddenly, overnight, one of the controller nodes stopped working completely. It turned out that its root disk had filled up within a couple of hours. Looking a bit closer, we found that the log files for the nova-api-os-compute daemon were the culprit. They were filled with very strange output that looked as if multiple threads had been writing their messages there in parallel. There was no way of finding a root cause for the explosion in that mess, so we cleaned up the logs, restarted the service, and everything was fine again.

Of course things are not that simple, and a couple of days later the scenario repeated. This time we were able to find some log lines from just before the kamikaze logging started, indicating that there had been issues with DNS resolution when the daemon tried to connect to some other API.

Oh well. With the switch to SSL, the endpoints of course no longer specified the IP address of the HA frontend, but an FQDN resolving to that IP address instead. Maybe there were some intermittent problems with the outside internet connection, causing timeouts in the DNS resolution. So we added static entries for the needed hostnames to the internal DNS resolvers, allowing them to be served without relying on outside connectivity.

But still the issue was not solved, though thanks to closer monitoring we were now able to clean up the affected nodes before their disks filled up completely. Then we found the next piece of the puzzle: sometimes the logs showed errors about the process running out of available file descriptors. Looking further, we found that even while the daemon was still working properly, it held a huge number of connections in state CLOSE_WAIT towards the endpoint for the glance API. CLOSE_WAIT means that the remote side has already closed the connection while the local process has not yet closed its own socket, so each of these connections keeps a file descriptor occupied.
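For anyone wanting to watch for the same symptom, here is a minimal sketch of how the CLOSE_WAIT connections could be counted per remote endpoint. It assumes a Linux host and only parses the IPv4 table in /proc/net/tcp; the function names are made up for this post, and it looks at all sockets on the host rather than at a single process.

```python
import socket
import struct
from collections import Counter

# State code for CLOSE_WAIT in the "st" column of /proc/net/tcp (Linux).
CLOSE_WAIT = "08"


def decode_endpoint(hex_endpoint):
    """Turn '0500000A:244C' into ('10.0.0.5', 9292).

    Assumes a little-endian host (e.g. x86), where the kernel prints the
    IPv4 address as a byte-swapped 32-bit hex number.
    """
    hex_ip, hex_port = hex_endpoint.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def count_close_wait(proc_net_tcp_text):
    """Count CLOSE_WAIT sockets per remote (ip, port) pair."""
    counts = Counter()
    # The first line of /proc/net/tcp is a column header, skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        # fields[2] is the remote address, fields[3] the state code.
        if len(fields) < 4 or fields[3] != CLOSE_WAIT:
            continue
        counts[decode_endpoint(fields[2])] += 1
    return counts
```

On an affected node, something like `count_close_wait(open("/proc/net/tcp").read()).most_common(5)` would list the remote endpoints that collect the most half-closed connections.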

This led to our next candidate culprit, the web proxy, since it was the component most recently introduced into the setup. Still we were unable to find a permanent fix for our problem, and for lack of time to do ever more debugging, we settled for a workaround: a cron job that restarts the nova daemon once per hour, making sure that it does not clog up with pending connections.

While this was not a perfect solution, it allowed us to spend a couple of weeks working on other tasks, with the cluster now running mostly stable. When we came back to look at the bug again, we found that someone else had already seen the same issue and tracked it down to a bug in the python-glanceclient library. The bug led to sockets not being cleaned up properly after use, which matched the scenario we had experienced pretty well. Not only were we glad that someone else had been seeing the same issue, there was even a bug fix proposed already.
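To illustrate the general mechanism, here is a small loopback sketch. It is not the python-glanceclient code, just an assumed toy setup with a made-up function name: the "server" side closes its end of the connection, and the client socket keeps occupying a file descriptor until the client explicitly closes it. In between, the kernel holds the client's connection in CLOSE_WAIT.

```python
import socket


def close_wait_demo():
    """Show that a client socket stays open after the peer has closed."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)

        client = socket.create_connection(server.getsockname(), timeout=5)
        conn, _ = server.accept()

        # The server side hangs up, which sends a FIN to the client.
        conn.close()

        # An empty read tells the client that the peer has closed.
        data = client.recv(1)

        # The client has not called close() yet, so its descriptor is
        # still allocated. This is the window in which the connection
        # sits in CLOSE_WAIT.
        still_open = client.fileno() != -1

        # Only an explicit close() releases the descriptor.
        client.close()
        released = client.fileno() == -1
        return data, still_open, released
    finally:
        server.close()
```

A client library that skips the final `close()` on some code path leaks one descriptor per request, which is how a long-running daemon can eventually hit its file descriptor limit.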

We manually tested the patch and confirmed that it indeed resolved our issue. The patch was merged soon afterwards and a new release was made. Happily we went ahead and rebuilt our OpenStack packages in order to deploy the bug fix without manual intervention, only to find out that once again we had been rejoicing too soon. During the Kilo cycle, version caps had been introduced into the stable branches for most of the dependencies. The intention was to allow changes to the libraries that are not backward compatible, making it easier to develop new features. But this in turn meant that the library release containing the bug fix was not picked up by our package build, because it exceeded the version cap. Manually overriding the cap was not possible either, because other changes in the library required updated versions of further dependencies, which in the end were not compatible with the stable branch code.
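To make the effect of such a cap concrete, here is a rough sketch of how a capped requirement excludes a newer release. The version numbers and the helper names are hypothetical and chosen only for illustration, and the comparison is a simplification for plain dotted numeric versions with the same number of parts; real tooling such as pip follows the full PEP 440 rules.

```python
import operator

# Comparison operators as they appear in a requirement specifier.
# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text):
    """Turn '0.14.2' into (0, 14, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, specifier):
    """Check a version against a specifier like '>=0.14.0,<0.15.0'."""
    candidate = parse_version(version)
    for clause in specifier.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                if not compare(candidate, parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError("unsupported clause: %r" % clause)
    return True
```

With a hypothetical cap of `>=0.14.0,<0.15.0`, a release numbered `0.14.2` would be accepted, while a fixed release numbered `0.15.0` would be rejected by `satisfies`, no matter how important the fix it carries.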

So I asked on the openstack-dev mailing list for a stable release of the library, naively assuming that this would be an easy thing to do, since both Icehouse and Juno were still supported and, to me, the criticality of a bug that breaks complete controller nodes was obvious. Sadly the response was disappointing: it turned out that nothing was set up to do such a stable release. In fact (and this is still the situation at the time of writing), developers are still discussing the policy under which such stable branches may be created some time in the future.

At the same time, following standard procedure, the bug in question had been marked with status “Fix Released”, which I feared would imply that no one would ever work on it again. Since there was no stable branch for the library, it was impossible to mark that branch as affected, as would have been the usual way of tackling this situation in other projects. Since from my operator’s point of view it was still the nova daemon that was effectively broken, I tried adding nova as an affected project to the bug, but sadly this doesn’t seem to be an accepted solution either.

In the end we had to create our own stable branch. Luckily, backporting the fix itself was as simple as doing a git cherry-pick. It took quite a bit of fiddling with the build process, however, because all the Python dependencies are usually pulled in from the official PyPI repository. Even though the syntax of requirements.txt allows specifying alternative sources, doing so blew up the tests that are run during the package build. In the end we had to resort to installing our custom library package in an explicit extra step, after building the package normally with the still broken official library package.

¹ Just to stop people from looking for them: “public” in this case is the OpenStack term for these endpoints, as opposed to internal endpoints. It does not imply that they are accessible from the public internet.

Fricklers Blog

About OpenStack, CEPH and automating all of this
