AIOps Configuration
Enables users to define static or dynamic limits for various performance metrics. It helps detect anomalies proactively by triggering alerts when metric values breach static conditions or adaptive thresholds, ensuring timely action to maintain system stability and performance.
Adaptive Thresholds
These dynamic, model-generated values define acceptable upper and lower bounds for system metrics on an hourly basis. Unlike traditional static thresholds (e.g., CPU > 80%), these thresholds are not fixed; they are derived from 30 days of historical data to predict the expected range of values for the next 15 days. The model calculates and maintains three key values for each hour: lower bound, upper bound, and predicted value.
Example:
The model learns the typical traffic behavior of a branch network bandwidth based on historical patterns. For example, during weekday business hours (e.g., 10:00 AM to 6:00 PM), traffic is usually high due to regular operations. Conversely, during non-business hours or weekends, the traffic is minimal.
If the model observes a sudden and consistent spike in traffic during non-business hours that deviates from the learned pattern, it flags it as an anomaly. This detection is based on the trained adaptive threshold, not a hardcoded static rule.
Use Case
An admin at a remote branch notices network traffic surging at 2:00 AM on multiple days. Having learned that this time window typically has near-zero usage, the model classifies the event as an anomaly, potentially indicating unauthorized data transfers or a misconfigured backup job.
Key Terminologies
Predicted Value
This is the central expected value predicted by the ML model for a metric at a specific time interval. It acts as the baseline or reference point for calculating the upper and lower bounds.
Upper Bound
The maximum threshold (before any factor is applied) that the ML model expects for the metric at a given time. It indicates the highest value the metric should usually reach under expected conditions.
Lower Bound
The minimum threshold (before factor) the ML model expects for the metric at a specific time. It represents the lowest value expected under normal behavior.
Band
The range (gap) between the Predicted Value and the Upper Bound or Lower Bound. It is used, together with a factor, to calculate the Upper Limit or Lower Limit.
Formula:
Upper Band = Upper Bound − Predicted Value
Lower Band = Predicted Value − Lower Bound
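As a small worked Python sketch of the two band formulas, using the same illustrative values (60% predicted, 75% upper bound, 45% lower bound) that appear in the examples later in this section:

predicted, upper_bound, lower_bound = 60, 75, 45   # illustrative values (%)
upper_band = upper_bound - predicted               # 75 - 60 = 15
lower_band = predicted - lower_bound               # 60 - 45 = 15
print(upper_band, lower_band)                      # 15 15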
Difference between Bound & Band

Term | Meaning | Derived From
Bound | The absolute value predicted by the ML model (upper or lower bound). | Direct ML output
Band | The difference (gap) between the predicted value and a bound. | Calculated for thresholding logic
Add or Configure Range Thresholds
Go to Infraon Configuration -> IT Operations -> Threshold.
Click 'Add' and select the desired 'Protocol'.
Model-Based Thresholding
The system requires 30 days of historical data to analyze usage trends.
Once enough data is available, the model generates hourly range values for the next 15 days.
For each hour, the model calculates:
Lower bound
Upper bound
Predicted value
These values are collectively known as the Range Threshold and are used for anomaly detection.
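The product's internal model is not documented here, but the general idea can be illustrated with a minimal Python sketch. It assumes a simple per-hour statistical baseline (mean and standard deviation over the 30-day history); the function name and the two-standard-deviation width are illustrative assumptions, not the actual algorithm.

# Illustrative sketch only: approximates how hourly range thresholds
# (predicted value, upper bound, lower bound) might be derived from
# roughly 30 days of history. The product's internal model may differ.
from collections import defaultdict
from statistics import mean, stdev

def build_range_thresholds(samples):
    # samples: list of (hour_of_day, metric_value) pairs covering ~30 days
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    thresholds = {}
    for hour, values in by_hour.items():
        predicted = mean(values)                       # central expected value
        spread = stdev(values) if len(values) > 1 else 0.0
        thresholds[hour] = {
            "predicted": predicted,
            "upper_bound": predicted + 2 * spread,     # assumed two-sigma width
            "lower_bound": max(predicted - 2 * spread, 0.0),
        }
    return thresholds                                  # one entry per hour of day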
Add Range | Basic Details
These are user-defined thresholds that are manually configured with fixed values.
1. Severity
Specifies how critical the issue is when the threshold is breached. This helps prioritize alerts in the system and trigger different workflows or escalations. The system uses a classification tag to decide the alert priority.
Levels:
Minor (low risk)
Major (moderate risk)
Critical (immediate attention needed).
2. Condition
The rule that evaluates a metric against the threshold and defines the breach logic.
≥ Greater Than or Equal To
≤ Less Than or Equal To
3. Value
The static number that the monitored metric is compared with; it acts as the threshold limit for triggering alerts.
4. Poll Points
The number of recent historical data points (collected at polling intervals) that the system evaluates. The application collects metric values at defined intervals (e.g., every 5 minutes); each collected value is called a poll point.
This prevents the system from reacting to a single, momentary spike: it waits to see whether the issue is sustained across multiple data points, which helps avoid false alarms from short-lived spikes or dips.
Using too many poll points (e.g., 10 or more) can delay anomaly detection and is not recommended for quick response metrics.
5. Breached %
The percentage of poll points that must violate the condition to trigger an alert. It adds tolerance for occasional data fluctuation.
6. Effective Poll Points
The actual number of poll points (out of the total) that must breach the condition, derived from the "Breached %". It converts the user-defined percentage into an exact value that the system will use.
Formula: Effective Poll Points = Poll Points × Breached %
If the resulting number is ≥ 1.5, it is rounded up.
If it is < 1.5, it is rounded down.
For example, with 3 poll points and a Breached % of 60, 3 × 0.6 = 1.8, which rounds up to 2; so, for an anomaly to be triggered, at least two of the last three readings must cross the threshold.
This rounding logic ensures practical thresholds and avoids triggers caused by micro-fluctuations.
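A minimal Python sketch of this evaluation logic follows. The function names, the minimum of one effective poll point, and the generalization of the rounding rule (fractions of .5 and above round up, otherwise down) are assumptions for illustration, not the product's implementation.

import math

def effective_poll_points(poll_points, breached_pct):
    # Convert Breached % into the count of poll points that must breach.
    # Assumed generalization of the documented rounding rule; a minimum
    # of one breaching poll point is also assumed.
    raw = poll_points * (breached_pct / 100.0)
    return max(1, math.floor(raw + 0.5))

def breaches_threshold(readings, threshold, breached_pct, condition=">="):
    # readings: the most recent poll point values, oldest first.
    breached = sum(
        (r >= threshold) if condition == ">=" else (r <= threshold)
        for r in readings
    )
    return breached >= effective_poll_points(len(readings), breached_pct)

# Example: 3 poll points, Breached % = 60 -> 1.8, rounded up to 2.
print(effective_poll_points(3, 60))                    # 2
print(breaches_threshold([82, 91, 95], 90, 60, ">="))  # True (two of three readings >= 90)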
Enabling Adaptive Thresholds
In the configuration UI for Advanced Resource Configuration, toggle Prediction and set the necessary statistical inputs.
The system will apply the model-generated thresholds automatically.
These thresholds adjust for:
Time of day (e.g., weekday vs weekend behavior)
Device-specific patterns
Location-specific variations
Adaptive Threshold | Basic Details
These are machine learning (ML) driven thresholds that dynamically adjust based on historical data patterns.
1. Severity
As with static thresholds, this sets how critical an anomaly is based on how much the real-time value deviates from the model's predictions. It categorizes deviations into severity buckets to avoid overwhelming users with low-priority issues.
2. Upper Limit
The final computed threshold for raising alerts above normal behavior. It defines the boundary where high values are flagged as anomalies.
Formula:
Upper Limit = Upper Bound + (Factor × Upper Band)
In the example below, the Upper Limit works out to 90%; if the actual value exceeds 90%, an alert is raised.
3. Lower Limit
The final threshold for flagging values that are too low compared to learned trends. Helps in detecting underperformance or unusual drops.
Formula:
Lower Limit = Lower Bound − (Factor × Lower Band)
In the example below, the Lower Limit works out to 30%; if the value drops below 30%, an alert is raised.
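The two formulas translate directly into a small helper; here is a minimal Python sketch (the function name is illustrative) using the example values from this section: 60% predicted, 75%/45% bounds, factor 1.

# Computes the final adaptive alerting limits from the model outputs,
# using the Upper Limit and Lower Limit formulas documented above.
def adaptive_limits(predicted, upper_bound, lower_bound, factor):
    upper_band = upper_bound - predicted
    lower_band = predicted - lower_bound
    upper_limit = upper_bound + factor * upper_band
    lower_limit = lower_bound - factor * lower_band
    return upper_limit, lower_limit

print(adaptive_limits(60, 75, 45, 1))   # (90, 30)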
Use Case: Monitoring UPS Battery Percentage in a Data Center
Context
A financial services company manages multiple data centers, each powered by UPS systems to ensure availability during power outages. Monitoring UPS battery percentage is critical to prevent unexpected shutdowns and equipment damage.
The operations team wants to detect two types of anomalies:
Overcharging or sensor issues when the battery percentage goes unusually high.
Discharging or battery failure when the battery percentage drops sharply.
They use Adaptive Thresholding to account for differences in battery behavior across systems and time-of-day patterns (e.g., heavy usage during backups or power switches).
ML Prediction Snapshot (at 3:00 AM)
Predicted Value | Upper Bound | Lower Bound
60% | 75% | 45%
Upper Band = 75 − 60 = 15%
Lower Band = 60 − 45 = 15%
Threshold Configuration
Parameter | Value
Factor | 1
Poll Points | 3
Breached % | 100%
Upper Limit Calculation
Upper Limit = 75 + (1 × 15) = 90%
The system will raise a Critical anomaly if the battery percentage exceeds 90%.
This can indicate overcharging, stuck sensors, or faulty voltage regulators.
Lower Limit Calculation
Lower Limit = 45 − (1 × 15) = 30%
The system will raise a Critical anomaly if the battery percentage drops below 30%.
This can indicate a dying battery, failing charge cycles, or losing backup capability.
Polled Data (Every 5 Minutes)
Time | Battery %
3:00 AM | 28%
3:05 AM | 29%
3:10 AM | 27%
All three values are below the Lower Limit (30%), so a Critical anomaly is triggered.
The system flags a Lower Limit anomaly and sends an automated alert to the NOC team. Investigation reveals a failing battery module in Rack 2, which is replaced proactively, preventing potential data center downtime.
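Putting the numbers from this use case together, a minimal Python sketch of the evaluation (the names are illustrative, not the product's implementation):

# UPS example: factor 1, predicted 60%, lower bound 45%,
# 3 poll points with Breached % = 100 (all three must breach).
def lower_limit(predicted, lower_bound, factor):
    return lower_bound - factor * (predicted - lower_bound)

limit = lower_limit(predicted=60, lower_bound=45, factor=1)   # 30%
readings = [28, 29, 27]             # battery % at 3:00, 3:05, and 3:10 AM
breached = [value < limit for value in readings]

if all(breached):                   # Breached % = 100 -> every poll point must breach
    print(f"Critical anomaly: battery below {limit}% for {len(readings)} consecutive polls")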
4. Poll Points
As with static thresholds, this is the number of recent values to evaluate. These are sampled metric values collected over time and used to confirm whether a breach is consistent, allowing time-based validation of anomalies instead of reacting immediately.
Using a smaller number of poll points (e.g., 3 to 5) helps the system react quickly to sustained issues while still filtering out random spikes.
It ensures the ML model doesn’t raise false positives for one-off data irregularities.
5. Breached %
The percentage of those poll points that must breach the threshold to trigger an anomaly. It filters out irregular or spiky data patterns and ensures only consistent breaches raise alerts.
6. Factor
A multiplier applied to the band (i.e., the difference between the predicted value and upper/lower bound). It controls how aggressive or lenient the alerting is.
Formula:
Upper Threshold (Limit) = Upper Bound + (Factor × Upper Band)
Lower Threshold (Limit) = Lower Bound − (Factor × Lower Band)
How It Helps
To reduce noise (i.e., avoid frequent or insignificant alerts), use a higher factor.
To make the system more sensitive, use a lower factor like 0 or 0.5.
The Factor value is also often mapped to alert Severity levels:
Factor 1 → Minor
Factor 2 → Major
Factor 3 → Critical
Example 1: Using Factor for Upper Limit (CPU Usage)
Predicted Value: 60%
Upper Bound: 75%
→ Upper Band = 75 − 60 = 15%
Factor: 1
→ Upper Limit = 75 + (1 × 15) = 90%
If the CPU usage exceeds 90%, an anomaly is raised. Increasing the Factor to 2 would push the limit to 105%, making it more tolerant.
Example 2: Using Factor for Lower Limit (UPS Battery%)
Predicted Value: 60%
Lower Bound: 45%
→ Lower Band = 60 − 45 = 15%
Factor: 1
→ Lower Limit = 45 − (1 × 15) = 30%
If the UPS battery % drops below 30%, an anomaly is triggered. Reducing the Factor to 0.5 changes the threshold to 37.5%, making detection more sensitive.
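The effect of the Factor on sensitivity can be seen by recomputing the limits from these two examples with different factors; a short illustrative Python sketch:

# Shows how the Factor widens or narrows the adaptive alerting range.
def limits(predicted, upper_bound, lower_bound, factor):
    upper = upper_bound + factor * (upper_bound - predicted)
    lower = lower_bound - factor * (predicted - lower_bound)
    return upper, lower

for factor in (0.5, 1, 2):
    upper, lower = limits(60, 75, 45, factor)
    print(f"factor={factor}: upper limit={upper}%, lower limit={lower}%")
# factor=0.5: upper limit=82.5%, lower limit=37.5%
# factor=1: upper limit=90%, lower limit=30%
# factor=2: upper limit=105%, lower limit=15%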
7. Alert Above / Alert Below
These fields are static thresholds configured in addition to the adaptive ML-based thresholds. They act as benchmarking overrides that help reduce noise in situations where model predictions may tolerate natural variation, but strict enforcement is still required based on business expectations.
How It Helps
Helps reduce alert noise by enforcing absolute boundaries.
Ideal for metrics that have hard operational limits (e.g., CPU > 95%, Battery < 10%).
Not recommended for highly sensitive metrics, where adaptive thresholds alone should govern.
Example:
Let’s say we are monitoring CPU utilization at 10:00 AM.
Predicted Value | Upper Bound | Lower Bound | Factor
60% | 75% | 45% | 1
From this:
Upper Band = 75 − 60 = 15
Upper Limit = 75 + (1 × 15) = 90%
Lower Band = 60 − 45 = 15
Lower Limit = 45 − (1 × 15) = 30%
Let’s say Alert Above = 85% and Alert Below = 35%.
If the real-time CPU usage is 88%, it does not cross the adaptive Upper Limit (90%), but it does cross Alert Above = 85%, so an anomaly is still triggered.
Similarly, if the value drops to 32%, that is still above the ML-calculated Lower Limit (30%), but it is below Alert Below = 35%, so an anomaly is again triggered.
This demonstrates how Alert Above and Alert Below enforce stricter thresholds when needed.
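A minimal Python sketch of how the adaptive limits and the static overrides could be combined, using the numbers from this example (the function is illustrative, not the product's implementation):

# Combines the adaptive limits with the Alert Above / Alert Below overrides.
def is_anomalous(value, upper_limit, lower_limit, alert_above=None, alert_below=None):
    effective_upper = min(upper_limit, alert_above) if alert_above is not None else upper_limit
    effective_lower = max(lower_limit, alert_below) if alert_below is not None else lower_limit
    return value > effective_upper or value < effective_lower

# Adaptive limits 90% / 30%, Alert Above = 85%, Alert Below = 35%
print(is_anomalous(88, 90, 30, alert_above=85, alert_below=35))  # True  (88 > 85)
print(is_anomalous(32, 90, 30, alert_above=85, alert_below=35))  # True  (32 < 35)
print(is_anomalous(70, 90, 30, alert_above=85, alert_below=35))  # False (within both ranges)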