How to use the Occurrence counter - Whitepaper

Introduction:

Alerting, and the rule set within it, is an important building block for monitoring a complex environment with thresholds.
These rules determine how precisely a specific system is checked and monitored.
It also matters in which timeframe a rule violation occurs, and how often.
A notification about every rule violation can be useful, but it can also lead to an unwanted and distracting flood of notifications.

With BVQ 2022.H1 the Occurrence counter is available as a new alerting function.
It is a useful tool for defining thresholds that fit your needs.

The following sections explain the use and benefits of this feature.

Alerting - Occurrence counter:

The Occurrence counter is part of the Alerting system in BVQ.
Until BVQ release 2022H1.4, any change in the status of a rule was reported directly with the respective alert level for that rule.


However, a single occurrence of a threshold break might not be a sufficient reason for an alert.
With the Occurrence counter, users can decide within a user-defined alert rule when they want to be informed about the status change of a rule.

For this purpose, BVQ has introduced three new options with the Occurrence counter:

  1. Violations per timeframe (sliding window)
  2. Violations per SLA interval (fixed window)
  3. Consecutive violations (usable in both sliding and fixed window)

The user can keep the default direct trigger mode active, which means the Occurrence counter stays disabled.
Alternatively, the user can activate the Occurrence counter and use these new options to specify in more detail when an alarm is triggered (delayed trigger mode).

Where can I find the Occurrence counter and how can I enable it:

By default, the Occurrence counter is disabled.
To enable it, activate the switch and add one or more counter definitions to your BVQ alert condition.

 Figure 1

Figure 1: Use Occurrence counter


After enabling the Occurrence counter, you can choose one of two methods: Consecutive violations or Violations per time.

Both methods can also be used in the Service Level Agreement (SLA) mode.
The capabilities and usage of SLA mode are also explained in this document.

INFO: To use the Occurrence counter with predefined alert rules, you have to clone the rule and edit the clone.
The Occurrence counter can only be used within custom alert rules.

How to enable and use the different options:

As illustrated in Figure 1, after enabling this option you can choose a method.
The two methods differ in how a status change of the respective alert condition is reported.
The alert condition defines the threshold value and the timeframe in which a threshold violation results in an alert.

Consecutive violations counts the violations that occur directly one after the other.
Violations per timeframe counts the violations that occur within a time interval.
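
The two counting methods can be illustrated with a minimal Python sketch. This is not BVQ code: the function names, the sample data, the reading of a violation as "value above threshold" and the exact window boundaries are assumptions made only for illustration.

```python
def consecutive_violations(values, threshold):
    """Count the violations in a row at the end of a series of measurements."""
    run = 0
    for value in reversed(values):
        if value <= threshold:
            break
        run += 1
    return run


def violations_per_timeframe(samples, threshold, now, window_minutes):
    """Count violations among (minute, value) samples inside the time window.

    Assumption: the window covers now - window_minutes < minute <= now.
    """
    return sum(
        1
        for minute, value in samples
        if now - window_minutes < minute <= now and value > threshold
    )


# Hypothetical latency measurements in ms, one every 5 minutes.
latency_ms = [2.1, 3.5, 2.0, 3.2, 3.8, 4.1]
samples = [(5 * i, value) for i, value in enumerate(latency_ms)]

print(consecutive_violations(latency_ms, threshold=3.0))
print(violations_per_timeframe(samples, 3.0, now=25, window_minutes=60))
```

With these made-up values, the last three measurements are above 3 ms, so the consecutive count is 3, while four of the six measurements in the 60-minute window are violations.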


You can add multiple Occurrence counters to each condition.
This is useful for defining multiple alert levels with separate counts and for combining "per time" with "consecutive" violations.
You can reorder them by dragging them to the desired position.
The conditions are checked from top to bottom; as soon as a matching condition is found, the conditions below it are no longer checked.

INFO: SLA mode and non-SLA mode cannot be mixed within a single Alert Condition.

 Figure 2
Consecutive violations

  • Type (type of this Occurrence counter): Consecutive violations - define how many consecutive violations in a row must occur before the condition is fulfilled
  • Amount (number of violations): Maximum number of violation occurrences that are counted until the Alert level is raised.
  • Alert level (BVQ Alert level): Desired Alert level, one of: OK, INFO, WARN, ERROR, UNKNOWN

Violations per timeframe

  • Type (type of this Occurrence counter): Violations per timeframe - define how often in a certain timeframe a condition must match until the condition is fulfilled
  • Time (minutes): Size of the sliding timeframe, in minutes. Only available for type "Violations per time" if SLA mode is turned off.
  • Amount (number of violations): Maximum number of violation occurrences that may be counted until the Alert level is raised.
  • Alert level (BVQ Alert level): Desired Alert level, one of: OK, INFO, WARN, ERROR, UNKNOWN

Violations per SLA interval (fixed window)

  • Type (type of this Occurrence counter): Violations per SLA interval - define how often in a fixed timeframe a condition must match until the condition is fulfilled
  • Time: If the SLA Mode is enabled the time setting is not available. The number of Alert conditions within the SLA interval will be counted.
  • Amount (number of violations): Maximum number of violation occurrences that may be counted until the Alert level is raised.
  • Alert level (BVQ Alert level): Desired Alert level, one of: OK, INFO, WARN, ERROR, UNKNOWN


Figure 2: Differences between the options

Illustration of the different methods:

The following chart illustrates the different methods, using performance data like the latency and IO/s of an MDisk. 

RED:         Every violation of the given alert rule is reported. (Occurrence counter OFF - direct trigger mode)

BLUE:       Only a consecutive sequence of exceedances triggers an alarm. (Occurrence counter ON, Consecutive violations - delayed trigger mode)

YELLOW:  Only a certain number of exceedances within a specified timeframe triggers an alarm. (Occurrence counter ON, Violations per time - delayed trigger mode)

 Figure 3

Figure 3: Real Life example and illustration of the different methods

Example & description

This table describes the different options using a real-life example.
More detailed information about each option is given, as well as a description of how to activate one of the options.

 Figure 4

Table columns: Method of choice, Example, How does it work?, What to fill in

Default (direct trigger mode)

  • How does it work? Each time a predefined threshold is exceeded, an event is triggered at the designated warning level (Info, Warning, Error). If the latency exceeds 3 ms, the threshold is reached and an error message is triggered. The example with the red border shows only one of these events for clarity, but this behavior applies to all further exceedances.
  • What to fill in: The default raises an Error event as soon as the rule is violated for the first time.

Consecutive violations

  • How does it work? An event is triggered at the designated warning level (Info, Warning, Error) only when 5 consecutive errors occur. This condition is shown inside the blue frame and occurs only once in this performance chart.
  • What to fill in: The latency of the MDisk is above the 3 ms threshold 5 times in a row, and therefore this rule raises an Error.

Violations per timeframe

  • How does it work? The yellow box shows the "Violations per time" setting, where a number of violations per time is specified. As soon as this number is reached, the designated warning level is displayed. However, once this number is no longer reached, the alert status drops back to the next better level.
  • What to fill in: In this example we have specified 3 violations per 60-minute timeframe. When this number is reached, an alert message is raised.


Figure 4: The different options using a real-life example

In non-SLA (sliding window) mode, the status of a rule varies with the current value.
If the value falls below the critical value again in the next measurement, this is reflected in the status of the alert rule, which changes, for example, from ERROR back to OK.
The relevant factors are the PI timing and the time specified for the sliding window.
SLA mode (fixed window) adds the possibility of maintaining this state for longer, over a defined SLA time interval.

SLA timing mode for BVQ Occurrence counter

As stated above, both options of the Occurrence counter can also be used in SLA mode.
SLA mode lets you set a fixed time window. This fixed time window specifies the period of time in which the warning rules apply.
If you have to prove that a certain rule is not exceeded within a certain period of time, this option is the right setting.

The main difference from the sliding window (used without the SLA option) is that the state of the alert rule is retained.
By default, SLA mode is disabled. To enable it, change the SLA INTERVAL from "SLA mode not enabled" to the desired timeframe.
The following comparison shows the differences:


Standard timing mode (sliding window)

  • Intended to be used for normal monitoring purposes.
  • Allows a separate timing to be defined for each Occurrence counter definition.
  • Uses the timing for a sliding window: the Time setting within the Occurrence counter definitions.
  • Resets to the default Alert level (typically OK) when the measurements in the sliding window are below the counts of all Occurrence counter definitions.

SLA timing mode (fixed window)

  • Intended to prove the health of service level objectives.
  • Forces all Occurrence counter definitions to use the configured SLA timing.
  • Uses the timing for a fixed window: the Time setting for the SLA interval.
  • Resets to the default Alert level (typically OK) at each start of the SLA interval fixed window.

Figure 5: Standard timing mode versus SLA timing mode


To compare the two options, we use a schematic diagram that contrasts the standard timing with the SLA timing.

In this example we use the term "Sliding Window" for the standard timing and the term "Fixed Window" for the SLA timing.
To keep the example simple, the following assumptions have been made:

  • PI Timing: 5 min
  • SLA Interval: 30 min
  • Violations are divided, depending on their frequency, into INFO, WARN and ERROR

Exact values are not given; the point is only to show how often a notification appears depending on the mode.

 Step 1: Start of the interval

A rule is checked every 5 minutes and leads to an error status if it is exceeded once or several times.

The picture shows that the state starts in the OK status and that another violation occurs every 5 minutes.

Accordingly, the status changes, regardless of sliding or fixed window, first to INFO, then to WARN and finally to ERROR.

Thus, both options are in the ERROR status in interval 25.

 Step 2: Behaviour of the fixed window mode

The SLA interval is set to 30 minutes in this example, so the alert level is reset to OK after exactly this period.

In the default timing mode, by contrast, the status is re-evaluated each time.
As step 3 shows, the status changes over time because each evaluation counts how many violations of the rule have occurred within the time window.


 Step 3: Behaviour of the sliding window mode

This example shows how the different alarm levels are displayed. In SLA mode, the status is reset after exactly 30 minutes, as shown in step 2, and the counting of rule violations starts again.

In sliding window mode, by contrast, the current state and the number of state changes in the time window are taken into account. After the 4th PI interval, i.e. after minute 20, the rule is no longer violated. The 30-minute sliding window now counts the rule violations within the last 30 minutes and determines the status from that count.

Here you can clearly see the "expiration" of the alarm level status.

Figure 6: A more schematic way to explain the differences
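
The schematic comparison can be reproduced with a small simulation. The following Python sketch is an illustration under stated assumptions, not BVQ's implementation: the level thresholds (INFO at 1, WARN at 2, ERROR at 3 violations), the function names and the choice that a 30-minute window holds six 5-minute measurements are all hypothetical.

```python
from collections import deque

# Hypothetical thresholds: number of violations needed for each level.
LEVELS = [(3, "ERROR"), (2, "WARN"), (1, "INFO")]


def level_for(count):
    """Map a violation count to an alert level (highest matching level)."""
    for amount, level in LEVELS:
        if count >= amount:
            return level
    return "OK"


def sliding_window_levels(violations, pi_minutes, window_minutes):
    """Level after each measurement, counting only the most recent window."""
    recent = deque(maxlen=window_minutes // pi_minutes)
    levels = []
    for violated in violations:
        recent.append(violated)  # the oldest measurement drops out automatically
        levels.append(level_for(sum(recent)))
    return levels


def fixed_window_levels(violations, pi_minutes, sla_minutes):
    """Level after each measurement, resetting the count at each SLA interval."""
    per_interval = sla_minutes // pi_minutes
    levels = []
    count = 0
    for index, violated in enumerate(violations):
        if index % per_interval == 0:
            count = 0  # start of a new fixed window
        count += int(violated)
        levels.append(level_for(count))
    return levels


# Four violations in a row, then eight clean measurements (60 minutes in total).
violations = [True] * 4 + [False] * 8
print(sliding_window_levels(violations, pi_minutes=5, window_minutes=30))
print(fixed_window_levels(violations, pi_minutes=5, sla_minutes=30))
```

In this sketch the fixed window holds ERROR until the 30-minute interval ends and then jumps back to OK, whereas the sliding window steps down through WARN and INFO as the old violations leave the window.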


Explanation based on the real-life example from above

Advanced example: For this we take the same chart as above. In this example we concentrate on the timeframe from 07:00 until 11:00. We also assume that nothing of particular interest happens before or after this time window.

Let's further assume you need to certify that a group of Storage Pool Disks (SVC MDisks) will meet a set of thresholds:

  • ERROR at 3 ms
  • Violations per time: 5, 10 and 20 occurrences per timeframe
  • Consecutive violations: more than 5 occurrences in a row
 Figure 6

Figure 6: Real-life example



BVQ Alert rule - SLA mode not enabled (sliding window)

  • Name: SVC MDisk violated
  • Performance indicator timing: 5 minutes
  • SLA interval: NONE (OFF)
  • AR Condition: Latency > 3 ms
  • 1. AR Condition > 3 ms: ERROR

INFO: Each colored box represents a sliding window of 1 hour and indicates the current status by its color.

Explanation

  • From 07:15 until 08:55 the 3 ms latency rule was exceeded 5 times. → Status raised to ERROR
  • The following 5-minute PI measurements also contained more than 5 violations of the 3 ms latency rule. → Status still at ERROR
  • From 09:15 the status is lowered in severity because the 3 ms latency rule is no longer exceeded 5 times in that 60-minute sliding window. → Status WARN


Figure 7: Disabled SLA mode


BVQ Alert rule - SLA mode enabled (fixed window), SLA interval of one day

  • Name: Daily SVC MDisk SLA violated
  • Performance indicator timing: 5 minutes
  • SLA interval: 1 day
  • AR Condition: Latency > 3 ms
  • 1. AR Condition Occurrence counter: ERROR, 20 times per SLA interval
  • 2. AR Condition Occurrence counter: WARN, 10 times per SLA interval
  • 3. AR Condition Occurrence counter: INFO, 5 times per SLA interval
  • 4. AR Condition Occurrence counter: ERROR, 5 times in a row per SLA interval

INFO: Each box shows by its color the status since the beginning of the day up to that moment.

Explanation

  • SLA mode sets the status to the default OK at the start of the day.
  • At 08:55, 5 violations have been counted. This raises the status to INFO.
  • At 09:10, 10 violations of the 3 ms latency rule have been counted, so the status rises to WARN.
  • At 09:15, 5 consecutive violations occur. This triggers the 4th AR Condition and the status rises directly to ERROR.
  • Because of SLA mode, this status persists until the next start of the day. It is then reset to OK.


Figure 8: Enabled SLA mode

While the normal Occurrence counter timing is based on a shorter sliding window (some minutes), the SLA timing mode is based on a fixed window that assesses the state of a custom alert rule over a defined, larger timeframe (hours, days, weeks). The SLA timing fixed window is aligned to Monday 00:00. At the start of each SLA interval, the Alert level is switched back to the default level (OK). As soon as the occurrence count is met, the Alert level is raised to the level configured in the Occurrence counter.
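
One way to picture the alignment to Monday 00:00 is to count fixed windows from the start of the current week. The Python sketch below follows that reading; the function name and this interpretation are assumptions, and the behavior for intervals that do not divide a week evenly is not described in this document.

```python
from datetime import datetime, timedelta


def sla_window_start(now, interval):
    """Start of the fixed window containing `now`.

    Assumption: windows are counted from Monday 00:00 of the current week.
    """
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    completed_windows = (now - monday) // interval  # whole windows since Monday
    return monday + completed_windows * interval


# Hypothetical point in time: Wednesday, 15 June 2022, 09:17.
now = datetime(2022, 6, 15, 9, 17)
print(sla_window_start(now, timedelta(days=1)))
print(sla_window_start(now, timedelta(weeks=1)))
```

For the daily interval the window starts at midnight of the same day; for the weekly interval it starts on the preceding Monday, 13 June 2022.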

Additional information

  • A mixture of consecutive violations and violations per timeframe within one rule is possible, but you cannot mix SLA mode and non-SLA mode within one rule.
  • If SLA timing mode is enabled, an Occurrence counter must be defined in each Alert condition, and all Occurrence counters are restricted to the SLA timing window instead of their individual ones.
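
These constraints can be written down as a simple consistency check. The following Python sketch is only a model for illustration: the `CounterDefinition` class, its field names and the `validate_condition` function are hypothetical and do not correspond to any BVQ interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CounterDefinition:
    kind: str                            # "consecutive" or "per_time"
    amount: int                          # number of violations
    alert_level: str                     # e.g. "INFO", "WARN", "ERROR"
    time_minutes: Optional[int] = None   # individual sliding window size


def validate_condition(
    counters: List[CounterDefinition], sla_interval_minutes: Optional[int]
) -> List[str]:
    """Return a list of problems; an empty list means the setup is consistent."""
    problems = []
    sla_mode = sla_interval_minutes is not None
    if sla_mode and not counters:
        problems.append("SLA mode needs an Occurrence counter in each condition")
    for number, counter in enumerate(counters, start=1):
        if sla_mode and counter.time_minutes is not None:
            problems.append(f"counter {number}: no individual time in SLA mode")
        if not sla_mode and counter.kind == "per_time" and counter.time_minutes is None:
            problems.append(f"counter {number}: sliding window size is missing")
    return problems


mixed = [
    CounterDefinition("per_time", 5, "INFO", time_minutes=60),
    CounterDefinition("consecutive", 5, "ERROR"),
]
print(validate_condition(mixed, sla_interval_minutes=None))
print(validate_condition(mixed, sla_interval_minutes=1440))
```

The same pair of counters is accepted without SLA mode, but in SLA mode the first counter is flagged because it still carries its own time setting.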

Attention: Some settings do not make sense and may prevent the alert rule from ever producing a result. The PI timing, the number of violations per time and the SLA interval must fit together.

What do I have to consider when specifying the timing values for SLA or non-SLA mode?

  • Violations per timeframe (sliding window): Time must be greater than Amount of violations × PI timing
  • Violations per SLA interval (fixed window): SLA interval must be greater than Amount of violations × PI timing

Here is an example: with a PI timing of 5 minutes and an amount of 3 violations, the Time must be greater than 15 minutes.
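
The timing rule amounts to a one-line comparison. The Python sketch below simply restates it; the function and parameter names are chosen for illustration and are not part of BVQ.

```python
def timing_is_plausible(window_minutes, amount, pi_minutes):
    """True if the window is longer than amount x PI timing.

    `window_minutes` is the sliding window Time or the SLA interval.
    """
    return window_minutes > amount * pi_minutes


# 3 violations per 60 minutes at a PI timing of 5 minutes.
print(timing_is_plausible(window_minutes=60, amount=3, pi_minutes=5))
# 3 violations per 15 minutes at a PI timing of 5 minutes.
print(timing_is_plausible(window_minutes=15, amount=3, pi_minutes=5))
```

The first setting passes because 60 is greater than 15. The second fails because the window is not strictly greater than 3 × 5 minutes.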

Summary

The BVQ Occurrence counter lets you specify how, and after how many violations, the Alerting options report on your environment. It supports scenarios ranging from very simple to highly complex.

The BVQ team can help you with any further questions!