Adding log4net to an application and configuring it to write to a file is simple, but getting every application in the organization to send some sort of telemetry (errors, warnings, messages) and, most importantly, making sense of this data is a completely different story. By making sense I mean:
- Seeing problems in the systems before customers do and reacting to those problems. Proactively detecting anomalies. Having Ops guys in the loop monitoring the systems.
- Knowing overall system health (are there more errors after the deployment we just did?)
- Filtering out noise. There are way too many events, so how do I focus on what's essential? This goes way beyond simply setting up a PagerDuty alert on a specific error message text.
- Querying events by date range, application, event text or other attributes (this is probably the simplest of all)
- Correlating events between systems
These are only some of the questions that today's event logging system should answer. Assume that we are talking about just application events and not OS, network appliances, IoT or other sources.
In 2016 this is a well-known problem with a ton of possible solutions and a multitude of vendor products, both on-prem and SaaS (syslog-ng, Raygun, Alert Logic, Riemann and of course the biggest one of all, Splunk). Assume that for whatever reason, be it cost, not enough volume or too much volume, you opt for a hybrid approach where, say, events flow to some centralized event hub (a service or product like syslog-ng) and then end up stored in a database table. Given that, we want to be able to start small, with a baby step that just makes sense of all these messages coming from all over the place. And this is where we finally come to a simple but quite useful feature: event grouping.
An event rarely occurs just once. There can be a unique event that has significance, but much more often the same or a similar event keeps coming in many times. A buggy piece of code gets called all the time, and every time it throws the dreaded Object Reference Not Set exception. If there are too many messages like that, you can't see the forest for the trees. If it's just a matter of one verbose event source, that source can be filtered out, but things get more complex when there are multiple event sources like that. Therefore, it is nice to group similar messages and show each message with its number of occurrences: Object Reference Not Set - 2,300 times, Input string has incorrect format - 1,200 times and so on. That sounds useful, but how do we actually do this?
Group events by message text
This is what first comes to mind: simply group those events by the full event text, essentially a SQL GROUP BY. That will work but won't be effective. Consider this exception:
Transaction (Process ID xxx) was deadlocked on (xxx) resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
The process ID and the resources differ from one occurrence to the next, and there are many types of messages like that. Simply running GROUP BY will therefore produce too many message groups, where in fact all of those groups are the same message with different values.
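To make the problem concrete, here is a minimal sketch of grouping by full message text. It assumes the events sit in a table called events with a single message column; the table name, the column name and the sample values (process IDs, resource names) are made up for illustration, and an in-memory SQLite database stands in for whatever database actually stores the events.

```python
import sqlite3


def deadlock(pid, resource):
    # Builds the deadlock message with made-up values in place of "xxx".
    return (
        f"Transaction (Process ID {pid}) was deadlocked on {resource} resources "
        "with another process and has been chosen as the deadlock victim. "
        "Rerun the transaction."
    )


# Hypothetical sample events: two identical messages and three deadlocks
# that differ only in their embedded values.
events = [
    "Object reference not set to an instance of an object.",
    "Object reference not set to an instance of an object.",
    deadlock(52, "lock"),
    deadlock(71, "lock"),
    deadlock(88, "lock | communication buffer"),
]


def group_by_text(messages):
    """Group messages by their exact text using a plain SQL GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (message TEXT)")
    conn.executemany(
        "INSERT INTO events (message) VALUES (?)", [(m,) for m in messages]
    )
    rows = conn.execute(
        "SELECT message, COUNT(*) AS occurrences "
        "FROM events GROUP BY message "
        "ORDER BY occurrences DESC, message"
    ).fetchall()
    conn.close()
    return rows


groups = group_by_text(events)
for message, occurrences in groups:
    print(occurrences, message[:70])
```

The two identical messages collapse into one group, but each of the three deadlocks ends up in a group of its own, which is exactly the noise we are trying to get rid of.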
Let the user pick messages to be grouped
So, why not let the user pick the messages to be grouped, telling us that they are in fact the same message? From what I have heard from people using this feature in a vendor product, that's too much work. Too many clicks, too much hassle. And this finally brings us to an effective feature which is not too difficult to implement.
Group events by regex
I was surprised that at this time, at least as far as I can tell, this feature is missing from the big vendors' products. I have seen people asking for it, but haven't yet seen it working. The idea is very simple. Define message patterns as regular expressions in what can be called a Message Editor. Write the matching code. Matching occurs in two places: on the fly, as events come in, and on demand, against a group of unmatched events shown on the page. How this is done is another topic, but we have seen encouraging results from this method - after defining message patterns, roughly two million events usually collapse into fewer than 50 unique event groups, and each of those event groups can be expanded into individual events with full details. This is so much easier to understand!
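As a rough illustration of the idea (not the actual implementation, which is a topic for another post), here is a minimal Python sketch. The pattern names, the regular expressions and the sample events are all assumptions made up for this example; in a real system the patterns would come from the Message Editor.

```python
import re
from collections import Counter

# Hypothetical message patterns: (group name, compiled regular expression).
PATTERNS = [
    (
        "Deadlock victim",
        re.compile(
            r"^Transaction \(Process ID \d+\) was deadlocked on .+ resources "
            r"with another process"
        ),
    ),
    ("Object reference not set", re.compile(r"Object reference not set", re.IGNORECASE)),
    ("Input string has incorrect format", re.compile(r"Input string .* format", re.IGNORECASE)),
]


def match_group(message):
    """Return the name of the first pattern that matches, or None."""
    for name, pattern in PATTERNS:
        if pattern.search(message):
            return name
    return None


def group_events(messages):
    """Count events per pattern and collect the ones no pattern matched."""
    counts = Counter()
    unmatched = []
    for message in messages:
        name = match_group(message)
        if name is None:
            unmatched.append(message)
        else:
            counts[name] += 1
    return counts, unmatched


# Made-up sample events for illustration.
sample_events = [
    "Transaction (Process ID 52) was deadlocked on lock resources with another "
    "process and has been chosen as the deadlock victim. Rerun the transaction.",
    "Transaction (Process ID 71) was deadlocked on lock resources with another "
    "process and has been chosen as the deadlock victim. Rerun the transaction.",
    "Object reference not set to an instance of an object.",
    "Object reference not set to an instance of an object.",
    "Input string was not in a correct format.",
    "Disk quota exceeded on volume D:",
]

counts, unmatched = group_events(sample_events)
for name, occurrences in counts.most_common():
    print(f"{name} - {occurrences} times")
print("Unmatched:", unmatched)
```

The same match_group function could serve both places mentioned above: called once per incoming event, or run over a page of unmatched events after a new pattern has been defined. First match wins here, so the order of the patterns matters; that is a design choice of this sketch, not a requirement of the approach.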