I started keeping the list of hardest bugs I encountered and fixed and this latest one deserves a place on this list.
We have centralized logging based on syslog-ng open source product. Messages from the event sources flow into syslog-ng which inserts them into database. The event sources applications send events through log4Net RemoteSysLogAppender.
I was wiring up the new event source when I noticed that syslog-ng stopped processing any new messages. After a few hours of debugging I discovered what the problem was. The event source application sent message when an exception was caught. Part of the exception class is TargetSite property which is a name of the method name that threw an exception. Maybe because the exception was bubbling up up long on the chain or something else, the value of TargetSite property was a string longer than 255, something like this:
'Void .ctor(System.Data.ProviderBase.DbConnectionPoolIdentity, System.Data.SqlClient.SqlConnectionString, ta-da-da....
Syslog-ng receives this message and tries to insert it into database, but the database column for TargetSite is defined as
VARCHAR(255) and thus insert fails.
This is all would be fairly benign, but syslog-ng doesn't batches up messages instead of inserting them one by one. This also makes sense, but here's the kicker - if one message in the batch fails to be inserted, the whole batch is rolled back. Pretty bad. So essentially one malformed message plugs up the whole server and no new messages can come through.
This is well known poison message problem, but it appears to me that syslog-ng cannot handle poison message. It does retries, but has no things like dead letter queue. What made matters even worse, was that syslog-ng doesn't even have a log where it reports failures like this, or may be we didn't set up it that way and the only way to try to understand it was to through SQL Profiler which took some time. The fix was to ensure field length in the message matches database column length and purging syslog-ng queue from all messages.
Another nasty bug which also made it to my list was when I working with legacy code and couldn't figure out why all of the sudden database state was unexpectedly changing. I was stepping through the code in debugger and in completely innocuous section of code - boom the value changes. After narrowing it down to just a few lines of code, I found that state was changing after getting value from one of the objects properties. It was something like
var rate = obj.Rate;
and after this line the value is changed. It turned out that somebody put a piece of code into Rate getter to change things. I have no idea why, but this one also made it to the list.