How to use Occurrence counter Whitepaper
Introduction:
Alerting, together with its rule set, is an important building block for monitoring a complex environment by means of thresholds.
These rules determine how precisely a specific system is checked and monitored.
It also matters within which timeframe a rule violation occurs, and how often.
A notification about every rule violation can be useful, but it can also lead to an unwanted and distracting flood of notifications.
With BVQ 2022.H1, the Occurrence counter is available as a new alerting function.
It lets you define thresholds tailored to your needs.
The following sections explain the use and benefits of this feature.
Alerting - Occurrence counter:
The Occurrence counter is part of the Alerting system in BVQ.
Until BVQ release 2022H1.4, any change in the status of a rule was reported directly with the respective alert level for that rule.
However, a single threshold violation might not be sufficient reason for an alert.
With the Occurrence counter, the user can decide within a user-defined alert rule when to be informed about the status change of a rule.
For this purpose, BVQ has introduced three new options with the Occurrence counter:
- Violations per timeframe (sliding window)
- Violations per SLA interval (fixed window)
- Consecutive violations (usable in both sliding and fixed windows)
The user can keep the default direct trigger mode active, which means the Occurrence counter stays disabled.
Alternatively, the user can activate the Occurrence counter and use these new options to specify in more detail when an alarm is triggered (delayed trigger mode).
Where can I find the Occurrence counter and how can I enable it:
By default, the Occurrence counter is disabled.
To enable it, activate the switch and add one or more counter definitions to your BVQ alert condition.
After enabling the Occurrence counter, choose a method: Consecutive violations or Violations per time.
Both methods can also be used in Service Level Agreement (SLA) mode.
Its capabilities and usage are explained later in this document.
INFO: To use the Occurrence counter with predefined alert rules, you have to clone the rule and edit the clone.
The Occurrence counter can only be used within custom alert rules.
How to enable and use the different options:
As illustrated in Figure 1, after enabling this option you can choose a method.
The two methods differ in the way a status change of the respective alert condition is reported.
The alert condition defines the threshold value and the time frame within which a threshold violation results in an alert.
Consecutive violations counts the number of violations that occur directly one after the other.
Violations per timeframe counts the number of violations that occur per time interval.
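The difference between the two counting methods can be sketched in a few lines of Python. This is only an illustration of the idea, not BVQ's implementation: the function names, the sample latencies and the choice to express the window as a number of measurements are all assumptions made for this example.

```python
from typing import Sequence


def consecutive_violations(flags: Sequence[bool]) -> int:
    """Count violations in a row, going back from the newest measurement."""
    run = 0
    for violated in reversed(flags):
        if not violated:
            break
        run += 1
    return run


def violations_per_timeframe(flags: Sequence[bool], window_size: int) -> int:
    """Count violations among the newest `window_size` measurements."""
    if window_size <= 0:
        return 0
    return sum(1 for violated in flags[-window_size:] if violated)


# Hypothetical latency samples in ms, oldest first, checked against 3 ms.
latencies_ms = [2.1, 3.4, 3.8, 2.9, 3.2, 3.6, 3.9]
threshold_ms = 3.0
flags = [value > threshold_ms for value in latencies_ms]

print(consecutive_violations(flags))       # 3
print(violations_per_timeframe(flags, 6))  # 5
```

With the same data, the "consecutive" view sees only the last unbroken run of three, while the "per timeframe" view also counts the two earlier violations inside the window.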
You can add multiple Occurrence counters to each condition.
It can be useful to define multiple alert levels with separate counts and to combine "per time" with "consecutive" violations.
You can reorder them by dragging them to the desired position.
The conditions are checked from top to bottom. As soon as a matching condition is found, the conditions below it are not checked.
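The top-to-bottom, first-match behavior can be illustrated with a small Python sketch. The `CounterDefinition` class, its field names and the "count reached or exceeded" comparison are assumptions for this example and do not reflect BVQ's internal data model.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CounterDefinition:
    level: str        # alert level to report, e.g. "INFO", "WARN", "ERROR"
    method: str       # "per_time" or "consecutive" (hypothetical labels)
    count: int        # number of violations required for a match
    window_size: int  # number of most recent measurements considered


def longest_run(flags: Sequence[bool]) -> int:
    """Length of the longest unbroken run of violations."""
    best = run = 0
    for violated in flags:
        run = run + 1 if violated else 0
        best = max(best, run)
    return best


def evaluate(definitions: List[CounterDefinition],
             flags: Sequence[bool],
             default_level: str = "OK") -> str:
    """Return the level of the first matching definition, checked in order."""
    for definition in definitions:
        size = definition.window_size
        recent = list(flags[-size:]) if size > 0 else []
        if definition.method == "consecutive":
            observed = longest_run(recent)
        else:
            observed = sum(1 for violated in recent if violated)
        if observed >= definition.count:
            return definition.level  # definitions below are not checked
    return default_level


definitions = [
    CounterDefinition("ERROR", "per_time", 5, 12),
    CounterDefinition("WARN", "per_time", 3, 12),
    CounterDefinition("INFO", "consecutive", 2, 12),
]
flags = [False, True, False, True, True, False]
print(evaluate(definitions, flags))  # WARN
```

Because the first match wins, the order matters: with the list reversed, the same measurements would report INFO, since the consecutive definition would be checked first.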
INFO: Please note that SLA and non-SLA modes cannot be mixed within a single Alert Condition.
Illustration of the different methods:
The following chart illustrates the different methods, using performance data like the latency and IO/s of an MDisk.
RED: Every violation of the given alert rule triggers a notification. (Occurrence counter OFF) - (direct trigger mode)
BLUE: Only a consecutive sequence of exceedances triggers an alarm. (Occurrence counter ON - Consecutive Violations) - (delayed trigger mode)
YELLOW: Only a certain number of exceedances within a specified timeframe triggers an alarm. (Occurrence counter ON - Violations per Time) - (delayed trigger mode)
Example & description
This table describes the different options using a real-life example.
More detailed information about each option is given, as well as a description of how to activate one of the options.
In non-SLA mode (sliding window), the status of a rule varies with the current value.
If the value falls below the critical value again in the next measurement, this is reflected in the status of the alert rule, which changes, for example, from ERROR level back to OK.
What matters here is the PI timing and the time specified for the sliding window.
SLA mode (fixed window) adds the possibility of keeping this state for longer, over a defined SLA time interval.
SLA timing mode for BVQ Occurrence counter
As stated above, both options of the Occurrence counter can also be used in SLA mode.
SLA mode lets you set a fixed time window. This fixed time window specifies the period during which the alert rules apply.
If you have to prove that a certain rule is not violated within a certain period of time, this is the right setting.
The main difference from the sliding window (used without the SLA option) is that the state of the alert rule persists until the start of the next SLA interval.
By default, SLA mode is disabled. To enable it, change the SLA INTERVAL from "SLA mode not enabled" to the desired timeframe.
The following table compares the two modes:
Standard timing mode (sliding window) | SLA timing mode (fixed window) |
---|---|
Intended for normal monitoring purposes | Intended to prove the health of service level objectives |
Allows a separate timing to be defined for each Occurrence counter definition | Forces all Occurrence counter definitions to use the configured SLA timing |
Uses the timing of a sliding window: the time setting within the Occurrence counter definitions | Uses the timing of a fixed window: the time setting of the SLA interval |
Resets to the default Alert level (typically OK) when the measurements in the sliding window are below the counts of all Occurrence counter definitions | Resets to the default Alert level (typically OK) at each start of the SLA interval fixed window |
Figure 5: Standard timing mode versus SLA timing mode
In order to compare the differences between the two options, we have chosen a schematic diagram which compares the standard timing versus the SLA timing.
In this example we use the term "Sliding Window" for the standard timing and the term "Fixed Window" for the SLA timing.
To keep the example simple and understandable the following assumptions have been made:
- PI Timing: 5 min
- SLA Interval: 30 min
- Violations are divided by frequency into INFO, WARN and ERROR
Exact values are not given. The point is only to show how often a notification appears depending on the mode.
Figure 6: A more schematic way to explain the differences
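The reset behavior of the two modes can also be simulated. The sketch below uses the PI timing of 5 minutes and the SLA interval of 30 minutes (6 measurements) from the assumptions above. The 15-minute sliding window (3 measurements), the count of 2, the single ERROR level and the sample data are hypothetical values chosen only to make the difference visible.

```python
from typing import List, Sequence


def sliding_status(flags: Sequence[bool], window_size: int, count: int) -> List[str]:
    """Status after each measurement, based on the newest `window_size` values."""
    statuses = []
    for i in range(len(flags)):
        start = max(0, i - window_size + 1)
        seen = sum(1 for violated in flags[start:i + 1] if violated)
        statuses.append("ERROR" if seen >= count else "OK")
    return statuses


def fixed_status(flags: Sequence[bool], window_size: int, count: int) -> List[str]:
    """Status after each measurement, counted since the start of the fixed window."""
    statuses = []
    seen = 0
    for i, violated in enumerate(flags):
        if i % window_size == 0:
            seen = 0  # a new fixed window starts: reset to the default level
        seen += 1 if violated else 0
        statuses.append("ERROR" if seen >= count else "OK")
    return statuses


# One hour of 5-minute measurements: two violations at the very beginning.
flags = [True, True] + [False] * 10

sliding = sliding_status(flags, window_size=3, count=2)
fixed = fixed_status(flags, window_size=6, count=2)
print(sliding)
print(fixed)
```

In this run the sliding window returns to OK at the fourth measurement, once the violations have left the window, while the fixed window keeps ERROR until the next 30-minute interval begins at the seventh measurement.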
Explanation based on the real-life example from above
Advanced example: We take the same chart as above and concentrate on the timeframe from 07:00 until 11:00. We also assume that nothing of particular interest happens before or after this time window.
Let's further assume you need to certify that a group of Storage Pool Disks (SVC MDisks) meets a set of thresholds:
- ERROR at 3 ms
- Violations per time: 5, 10, 20 occurrences per time
- Consecutive violations: more than 5 occurrences
BVQ Alert rule - SLA mode not enabled (sliding window) | Setting |
---|---|
Name | SVC MDisk violated |
Performance indicator timing | 5 minutes |
SLA interval | NONE (OFF) |
AR Condition | Latency > 3 ms |
1. AR Condition > 3 ms | ERROR |
INFO: Each colored box represents a sliding window of 1 hour and indicates the current status by its color.
Explanation:
- From 07:15 until 08:55, the 3 ms latency rule was exceeded 5 times. → Status raised to ERROR
- The following 5-minute PI measurements also contained more than 5 violations of the 3 ms latency rule. → Status still at ERROR
- From 09:15, the severity was lowered because the 3 ms latency rule was no longer exceeded 5 times in that 60-minute sliding window. → Status WARN
Figure 7: Disabled SLA mode
BVQ Alert rule - SLA mode enabled (fixed window, SLA interval of one day) | Setting | Occurrence count |
---|---|---|
Name | Daily SVC MDisk SLA violated | |
Performance indicator timing | 5 minutes | |
SLA interval | 1 day | |
AR Condition | Latency > 3 ms | |
1. AR Condition Occurrence counter | ERROR | 20 times per SLA interval |
2. AR Condition Occurrence counter | WARN | 10 times per SLA interval |
3. AR Condition Occurrence counter | INFO | 5 times per SLA interval |
4. AR Condition Occurrence counter | ERROR | 5 times in a row per SLA interval |
INFO: Each box shows by its color the status since the beginning of the day up to that moment.
Explanation:
- At 08:55, 5 violations have been counted. This raises the status to INFO.
- At 09:10, 10 violations of the 3 ms latency rule have been counted, so the status is raised to WARN.
- At 09:15, 5 consecutive violations occur. This triggers the 4th alert condition, so the status is raised directly to ERROR.
- Because of SLA mode, this status persists until the start of the next day, when it is reset to OK.
Figure 8: Enabled SLA mode
While the normal Occurrence counter timing is based on a shorter sliding window (a few minutes), the SLA timing mode uses a fixed window to assess the state of a custom alert rule over a larger defined timeframe (hours, days, weeks). The SLA timing fixed window is aligned to Monday 00:00. At the start of each SLA interval, the Alert level is switched back to the default level (OK). As soon as the occurrence count is met, the Alert level is raised to the level configured in the Occurrence counter.
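To make the Monday 00:00 alignment concrete, the following Python sketch computes the start of the fixed window that contains a given timestamp. It assumes that windows are counted from the Monday of the current week and that the interval divides a week evenly. The function name `sla_window_start` is hypothetical, and BVQ's actual handling of time zones or unusual intervals may differ.

```python
from datetime import datetime, timedelta


def sla_window_start(now: datetime, interval: timedelta) -> datetime:
    """Start of the fixed window containing `now`, aligned to Monday 00:00."""
    # Monday 00:00 of the week that contains `now` (weekday() is 0 on Monday).
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0)
    # Number of whole intervals that have elapsed since that Monday.
    completed = (now - monday) // interval
    return monday + completed * interval


moment = datetime(2022, 6, 15, 9, 15)  # a Wednesday
print(sla_window_start(moment, timedelta(days=1)))      # 2022-06-15 00:00:00
print(sla_window_start(moment, timedelta(minutes=30)))  # 2022-06-15 09:00:00
print(sla_window_start(moment, timedelta(weeks=1)))     # 2022-06-13 00:00:00
```

The next reset to the default level would then occur at `sla_window_start(now, interval) + interval`.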
Additional information
Attention: Some settings do not make sense and may prevent the alert rule from ever producing a result. The PI timing, the number of violations per time, and the SLA interval parameters must be consistent with each other.
Here is an example:
X
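One plausible reading of this warning, offered here as an assumption rather than as documented BVQ behavior, is that a window can never contain more violations than it contains measurements. The Python sketch below checks a setting against that limit. The function names are hypothetical.

```python
from datetime import timedelta


def max_measurements(window: timedelta, pi_timing: timedelta) -> int:
    """Number of whole PI measurements that fit into the window."""
    return window // pi_timing


def is_reachable(count: int, window: timedelta, pi_timing: timedelta) -> bool:
    """True if `count` violations can occur within the window at all."""
    return 1 <= count <= max_measurements(window, pi_timing)


pi_timing = timedelta(minutes=5)
window = timedelta(minutes=30)

print(max_measurements(window, pi_timing))  # 6
print(is_reachable(5, window, pi_timing))   # True
print(is_reachable(10, window, pi_timing))  # False
```

Under this reading, a rule that requires 10 violations within 30 minutes at a PI timing of 5 minutes could never trigger, because only 6 measurements fall into the window.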
The new BVQ Occurrence counter gives you fine-grained control over how, and how often, the Alerting system reports on your environment. It covers scenarios ranging from very simple to highly complex.
The BVQ team is happy to help with any further questions.