The Long Story: Crashes-XEN-Kernel, Optimization, NGINX

September 15th, 2010

So what do all of the above have in common? Nothing really, apart from being a summary of what I’ve done in the background on the server to stabilize the system and optimize the performance.

1) Crashes & XEN & Kernel

Those who have followed my rages on Twitter in the past may have noticed it. Since the upgrade to Debian lenny (on dom0, that’s the master system for controlling the virtualized guests) last November, I had a lot of random crashes of the Xen guests (evemaps is one of those). The easiest choice was to set the number of virtual CPUs (vcpu) down to 1 for every guest, and then it would run stable. But who’s happy to have only half the performance on a dual core system? Right.

After doing some research and talking with other people, the only suspect left seemed to be the kernel used in Debian ‘stable’ lenny. The 2.6.26 wasn’t an original kernel optimized for XEN. It was rather a kernel spiked with some strange unofficial forward-ported patches provided by openSUSE (*brrrr*).

First choice was to reinstall the dom0, but this time using CentOS 5 (a clone of Red Hat Enterprise Linux), which provides a stable OS for the master system. The reinstall went smoothly and I set vcpu=2 for the guests again. Only a few hours later the first guest crashed again *doh*, so back to vcpu=1.

Next step was to upgrade the kernel for the paravirtualized guests. Because I was lazy, I took the 2.6.32 kernel from the backports repository, which comes without the Xen patches but with the paravirt_ops interface. Everything ran stable for about 2 weeks, so two days ago I decided to give vcpu=2 another try and monitor for crashes (which had never been reproducible), since by now everything (Xen version, all kernels) had changed.

But: after about 28 hours the first guest crashed again with kernel errors, and while I was restarting the guest the whole server crashed and rebooted.

Back to the roots (only 1 vcpu). Perhaps &%§$%§ XEN is the root cause, perhaps the CPU … but nothing explains why the guests can run stable for months with only 1 assigned CPU. Maybe I should switch to KVM one day or upgrade to better and newer hardware … we’ll see.

2) PHP & SQL Optimization

Short and quick: using Zend Optimizer and eAccelerator to cache the compiled byte code of PHP scripts at runtime is working great. The average runtime per PHP script is usually less than 0.05 seconds.

On the SQL side I tuned some MySQL server settings and of course found and optimized bottlenecks in SQL scripts, update procedures, etc.

But as usual: You’ll always find something to tune and tweak.

3) NGINX: Frontend Webserver / Reverse Proxy

The newest addition was the installation of a frontend webserver which is responsible for delivering static data (like images, CSS, JavaScript) and handling connections and keep-alive with the client. Everything else it forwards (as a reverse proxy) to the internal backend webserver running the old and threaded Apache. For this setup I’m using nginx. Nginx has been designed to handle several thousand clients with only a minimum of memory and CPU usage. Big websites like wordpress.com or sourceforge are relying on nginx as load balancer, reverse proxy or simple webserver.
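To illustrate the split described above, here is a minimal sketch of such a frontend configuration. It is not the actual evemaps config: the server name, document root, file extensions, cache lifetime and backend address are all assumptions made up for the example.

```nginx
# Sketch only: server_name, root, and the backend address are hypothetical.
server {
    listen 80;
    server_name example.org;

    # nginx keeps the client connection alive, not the backend
    keepalive_timeout 65;

    # Static data is delivered directly by nginx
    location ~* \.(png|jpg|gif|css|js)$ {
        root /var/www/example;
        expires 7d;
    }

    # Everything else is forwarded to the internal backend Apache
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

The point of this layout is that the heavyweight Apache processes only see requests that actually need PHP, while slow clients and static files are handled by nginx.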

But nothing comes easy: after some configuration tests and dry runs on my test installation everything was looking good and stable, so I deployed the setup. After a couple of hours people told me that the new radar tracking feature wasn’t working anymore. Some quick checks later: the IGB headers weren’t forwarded to the backend Apache anymore, even though the site had been trusted in the IGB. After talking with some nginx people on IRC and checking the debug log I could find the root cause:

client sent invalid header line: “EVE_CHARNAME: Wollari”.

Apparently underscores “_” are invalid characters in the HTTP header definition (bad CCP!), but there was a config option for nginx (underscores_in_headers on;), undocumented until then, which helped in this case. If you ever wanna use nginx for EVE-related pages, don’t forget this option 🙂
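For reference, a minimal sketch of where the option can go. The surrounding blocks and the backend address are assumptions for the example; only underscores_in_headers itself comes from the fix described above. Putting it at the http level applies it to all virtual servers.

```nginx
# Sketch only: the server and location blocks are hypothetical.
http {
    # Accept client header names containing "_" (e.g. EVE_CHARNAME)
    # instead of treating them as invalid header lines.
    underscores_in_headers on;

    server {
        listen 80;

        location / {
            # assumed backend address
            proxy_pass http://127.0.0.1:8080;
        }
    }
}
```

As far as I understand it, nginx silently drops header lines it considers invalid by default (ignore_invalid_headers), which would explain why the request itself still worked and only the IGB headers went missing on the backend.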

Done

Well, that’s all for the moment. I hope I could give you some insights and background knowledge. The server is running stable (even if only on vcpu=1) and I can still improve a lot of smaller things here and there. And yes: I’ll keep a close eye on OpenNMS (+ text message on my mobile) and Munin as usual to be aware of upcoming trouble.

Fly safe!

One Response to “The Long Story: Crashes-XEN-Kernel, Optimization, NGINX”

  1. Ian Mantell says:

    Hei. I gave this some investigation time and followed the rfc texts regarding the http header syntax (oh my head).
    To be frank:
    CCP is actually “innocent” of using a bad character.
    NGINX is guilty of knowing every defined http/1.1 header.

    Explanation.
    The syntax for the header responses is defined in something etched in stone by the age of it, according to
    our trusted heroes of definition: http://www.w3.org/Protocols/rfc2616/rfc2616.html it is based on this dinosaur: http://www.ietf.org/rfc/rfc0822.txt
    that is actually describing the syntax for something that existed before the definition of email was finished.
    My grave robbery of this particular dino cemetery brought up that there is no “_” forbidden. Only some delimiters are reserved.
    So the error message rather seems to refer to the whole of the variable. Maybe preceding it with an X- could help, like in add-on headers for smtp.
    Not the tombr.. archaeologist’s duty, though 🙂
