A while ago, one of our customers asked for more security in their OpenStack installation. So we went ahead and placed a web proxy in front of their public1 API endpoints, at the same time deploying SSL certificates for authentication.
For a couple of days, all went well. Then suddenly, overnight, one of the controller nodes
stopped working completely; it turned out that its root disk had filled up within
a couple of hours. Looking a bit closer, we found that the log files of the
nova-api-os-compute daemon were the culprit: they were filled with very strange output,
as if multiple threads had been writing their messages there in parallel. There was no
way to find a root cause for the explosion, so we cleaned up the logs, restarted
the service, and everything was fine again.
Of course, things are never that simple: a couple of days later the scenario repeated. This time we were able to find some log lines from just before the kamikaze logging started, indicating that there had been issues with DNS resolution when the daemon tried to connect to some other API.
Oh well. With the endpoints switched to SSL, they of course no longer specified the IP address of the HA frontend, but an FQDN resolving to that IP address instead. Maybe there were intermittent problems with the outside internet connection, causing timeouts in DNS resolution. So we added static entries for the needed hostnames to the internal DNS resolvers, allowing them to be served without relying on outside connectivity.
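As a sketch, those static entries were nothing fancier than pinned host records; the hostname and address below are made up for illustration:

```
# Pinned on the internal resolvers (or /etc/hosts) so the endpoint FQDN
# resolves even without upstream connectivity
203.0.113.10    api.cloud.example.com
```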
But still the issue was not solved, though thanks to closer monitoring we were now able to
clean up the affected nodes before their disks filled up completely. Then we found the
next piece of the puzzle: sometimes the logs showed errors about the process
running out of available file descriptors. Looking further, we found that even
while the daemon was still working properly, it had a huge number of connections
in CLOSE_WAIT state towards the endpoint of the glance API.
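The stuck connections are easy to spot from the shell. A minimal check on Linux counts sockets in CLOSE_WAIT straight from /proc (the glance endpoint would show up in the rem_address column of the same file):

```shell
# Count sockets in CLOSE_WAIT (state code 08 in /proc/net/tcp).
# On the affected node this number grew steadily between restarts.
awk '$4 == "08"' /proc/net/tcp | wc -l
```

The same information is available more readably via `ss -tan state close-wait` where the iproute2 tools are installed.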
This led to our next candidate culprit: the web proxy, the component that had been introduced into the setup most recently. Still we were unable to find a permanent fix for our problem, and for lack of time to do ever more debugging, we settled for a workaround: set up a cron job that restarts the nova daemon once per hour, making sure it does not clog up with pending connections.
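The workaround amounted to a crontab entry along these lines; file path and service name are what one would use on our distribution and should be treated as an assumption for any other setup:

```
# /etc/cron.d/nova-api-restart -- hourly stopgap against leaked sockets
0 * * * * root service nova-api-os-compute restart
```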
While this was not a perfect solution, it allowed us to spend a couple of weeks
working on other tasks, with the cluster now running mostly stable. When we
came back to look at the bug again, we found that someone else had already
seen the same issue and tracked it down to a bug within the
python-glanceclient library that caused sockets not to be
cleaned up properly after use. That matched up pretty well with the
scenario we had experienced. Not only were we glad that someone else
had run into it, there was even a bug fix proposed already.
We manually tested the patch and confirmed that it indeed resolved our issue. The patch was merged soon afterwards and a new release made. Happily we went ahead and rebuilt our OpenStack packages in order to deploy the bug fix without manual intervention, only to find out that once again we had rejoiced too soon. The reason: during the Kilo cycle, version caps had been introduced into the stable branches for most of the dependencies, with the intention of allowing backward-incompatible changes to the libraries and thus easier development of new features. But this in turn meant that the library release containing the bug fix was not picked up by our package build, because it exceeded the version cap. Manually overriding the cap was not possible either: the library contained other changes that required updated versions of further dependent libraries, which in the end were not compatible with the stable branch code.
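To illustrate the mechanism (the version numbers here are made up, not the actual Kilo caps): a capped entry in the stable branch's requirements looks like this, and any release at or above the upper bound, including the one carrying our fix, is simply ignored by the build:

```
# stable-branch requirements.txt entry (illustrative versions)
python-glanceclient>=0.15.0,<0.16.0
```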
So I asked on the
openstack-dev mailing list for a stable release of
the library, naively assuming that this would be an easy thing to do,
since both Icehouse and Juno were still being supported and, to me, the
criticality of a bug that broke complete controller nodes
was obvious. Sadly, the response was disappointing: it turned
out that nothing was set up to produce such a stable release. In
fact (and this is still the situation at the time of
this writing), developers are still discussing the policy under
which stable branches for the libraries might be created at some point in the future.
At the same time, following standard procedure, the bug in question had been marked with status “Fix Released”, which I feared would imply that no one would ever work on it again. Since there was no stable branch for the library, it was impossible to mark such a branch as affected, as would have been the usual way of tackling this situation for other projects. Since from my operator’s point of view it was still the nova daemon that was effectively broken, I tried adding nova as an affected project to the bug, but sadly this does not seem to be an accepted solution either.
In the end we had to create our own stable branch.
Backporting the fix itself was luckily as simple as doing a
git cherry-pick. It took quite a bit of fiddling with the
build process, however, because all the Python dependencies are
usually pulled in from the official PyPI repository. Even
though the syntax of
requirements.txt allows specifying
alternative sources, doing so blew up the
testing process performed during the package build. In
the end we had to resort to installing our custom library
package in an explicit extra step after building the package
normally against the still broken official library package.
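For reference, the alternative-source syntax we tried uses pip's requirements format; the index URL below is hypothetical:

```
# requirements.txt with an additional package index (pip syntax)
--extra-index-url https://pypi.internal.example.com/simple
python-glanceclient>=0.15.0
```

In our case this was what tripped up the test phase of the package build, hence the fallback to a separate installation step.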
Just to stop people from looking for them: “public” is here the OpenStack term for these endpoints, as opposed to internal ones. It does not imply that they are accessible from the public internet.↩