Event Correlation

opEvents does not just automatically suppress duplicate events (stateful or custom-matched); it can also create new events based on correlating recent event occurrences.
In opEvents versions 2.0.4 (and newer) you can use more fine-grained controls to deal with the triggering events.

This event correlation and synthesis feature is configured in the same way as the duplicate suppression, namely by putting event creation rules into conf/EventRules.nmis.

An event synthesis rule consists of:

an event name, which specifies the name of the newly created event,
a list of events (more precisely, their names), which are the events to consider for correlation,
a (minimum) count of events that have to be detected to trigger the rule,
an optional list of groupby clauses, which define whether the count is interpreted globally for all named events, or separately within smaller groups,
optional delayedaction and autoacknowledge clauses, which define how the triggering events should be handled,
an optional enrich clause, which adjusts the handling of the newly created event,
and finally a window parameter, which defines the time window to examine.

(If you compare suppression and synthesis rules closely, you'll see that the main difference is the lack of a suppress clause for synthesis rules.)

Here is an example rule:

'3' => {
	name => 'Customer Outage',
	events => ["Node Down","SNMP Down"],
	window => '60',
	count => 5,
	groupby => ['node.customer'], # count separately for every observed value of customer
	enrich => {priority => 3, answer => 42}, # all such items are inserted into the new event
},

This rule causes opEvents to look for Node Down and SNMP Down events in the last 60 seconds, separate them into per-customer groups (see grouping below); if it counts 5 or more such events in a group, then a new event called Customer Outage is created.

Event content and Enrichment

When a synthesis rule creates a new event, first the contents of the most recent triggering event are copied over. Then the event name is set to the name given in the rule, an audit trail of triggering events is added (by adding the properties nodes and eventids), the event is marked as synthetic, and finally any enrichment entries from the rule are added in.

Event Processing

At this point the new event is inserted into the database, and is ready for further action processing. This action processing (e.g. escalation, mail notification, custom logging) is performed immediately.

Handling of the Triggering Events

Please note: this feature is only available in opEvents 2.0.4 and newer.

If your rule contains a delayedaction clause (with a numeric value), then all potentially triggering events will have their action processing delayed by the given number of seconds.
This affects all events whose name is in the event list of your rule, regardless of whether the limits for triggering a synthetic event have been met. The delayedaction value should therefore be set to a relatively small value.
If your rule has the property autoacknowledge set to true or 1, then all triggering events will be automatically acknowledged and all action processing for them will be aborted.

The combination of these two controlling properties provides fine-grained storm control and the ability to create "combination events" that subsume and close any number of triggering events:

"Plain" Synthetic Events

If your rule sets neither delayedaction nor autoacknowledge, then the incoming potential trigger events will be processed normally and immediately, and any policy actions for them will be taken as soon as possible (but possibly after being delayed by the state_flap_window - see Deduplication and storm control in opEvents for details). The trigger events are thus completed, and visible as current/unacknowledged, completely independently of any synthetic events that might get triggered by them later.

"Combination" Events

If your rule sets delayedaction (and optionally autoacknowledge), then the incoming events are delayed and held for the given time before any policy actions are taken for them. (The delayedaction setting should be the same as or larger than the rule's window setting.)

If the requirements for a synthetic event are met during that time, then the new synthetic event can "combine" and supersede the triggering events. If autoacknowledge is set, then all the triggering events will be acknowledged, closed and no actions will be taken for them at all. (Without autoacknowledge the triggering events would not have actions performed but would remain unacknowledged.)

The net effect is that the current events view would show only the new synthetic event as 'current' and all the underlying triggering events would be categorized as closed (and optionally acknowledged), and thus be mostly hidden.
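
As a rough sketch, the Customer Outage rule shown earlier could be turned into a combination event along the following lines. The rule key '4', the event name and the chosen values are illustrative assumptions; only the property names delayedaction and autoacknowledge are taken from the description above.

```perl
'4' => {
	name => 'Customer Outage (combined)', # hypothetical event name
	events => ["Node Down","SNMP Down"],
	window => '60',
	count => 5,
	groupby => ['node.customer'],
	delayedaction => 60, # assumed value: hold trigger events for 60 seconds, the same as the window
	autoacknowledge => 1, # acknowledge the trigger events when the rule fires
},
```

With settings like these, every Node Down or SNMP Down event would have its actions held back for 60 seconds, whether or not the rule eventually fires.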

Grouping

If no groupby clause is present, then the whole set of matching events is counted together, which may be too generic for many common scenarios. For example, it would not be possible to create new events only for a particular customer or service group. Grouping solves this problem: the set is split into groups with matching property values, and the count threshold is applied to each group separately.

The groupby clause has the form of a list of node.X or event.Y property specifications (e.g. node.customer or node.group), which are used to group events into buckets for counting: only events that share the same values for all the listed grouping properties will be counted together. For example, the groupby clause [ 'node.customer', 'event.priority' ] would cause the rule to be applied independently for every combination of customer and event priority. The clause given in the example block above will create a Customer Outage event for any individual customer with 5 or more outages in 60 seconds; without the groupby, any 5 outages would cause a synthetic event to be created.
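
To illustrate grouping by more than one property, here is a minimal sketch of a rule that uses the two-property clause mentioned above. The rule key and event name are made up for this example; the remaining values are carried over from the earlier example rule.

```perl
'5' => {
	name => 'Customer Priority Outage', # hypothetical event name
	events => ["Node Down","SNMP Down"],
	window => '60',
	count => 5,
	# one bucket per (customer, priority) combination; each bucket is counted separately
	groupby => ['node.customer', 'event.priority'],
},
```

Under this rule, 3 matching events for one customer at one priority plus 3 at another priority would not trigger anything, because neither bucket reaches the count of 5.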

The common node properties are listed here, and the standard event properties are documented on this page.

Synthetic events and Storm Control

All synthesis rules are applied independently, thus a single event could be a trigger for multiple synthetic events. This is desirable for example for detecting both per-customer problems and global issues at the same time: a few problem events can trigger a customer-specific action, while the same events could be counted together with others for detecting and reacting to a major outage.
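
The following sketch shows how two independent rules might share the same trigger events, one counting per customer and one counting globally. The rule keys, the name Major Outage and the global threshold of 20 are assumptions chosen for illustration only.

```perl
'6' => {
	name => 'Customer Outage',
	events => ["Node Down","SNMP Down"],
	window => '60',
	count => 5,
	groupby => ['node.customer'], # per-customer detection
},
'7' => {
	name => 'Major Outage', # hypothetical event name
	events => ["Node Down","SNMP Down"],
	window => '60',
	count => 20, # assumed threshold; no groupby, so all matching events are counted together
},
```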

However, great care has been taken to avoid event storms caused by synthetic events: when a synthesis rule fires because there were at least count matching events in the time window, then all the matching events are marked as consumed and will not be considered for any future synthesis for this rule. In other words, there is no overlap between successful synthesis time windows.

Here is a practical example of the consequences of this design. Assume a rule that specifies 5 event matches in 120 seconds as its trigger. At some time T1 we count 25 such recent events, therefore the rule fires, a new event is created, and all 25 matches are consumed (not just the 5 that the trigger requires). The count of triggering events thus starts from scratch at time T1. Assume that event correlation is next performed at T2, four seconds later; now only the events since T1 are considered as potential triggers. If there were 3 matching events in these four seconds, no synthetic event is created. If another 4 arrive in the next few seconds, the count reaches 7 and the rule fires. On the other hand, if there had been 200 events between T1 and T2, then only one synthetic event would be created at time T2.

If a rule does not trigger because there are fewer than count trigger events, then naturally these events remain potential triggers until the time window moves past the events in question.
