AIOps Configuration

AIOps Configuration enables users to define static or dynamic limits for various performance metrics. It helps proactively detect anomalies by triggering alerts when metric values breach static conditions or adaptive thresholds, ensuring timely action to maintain system stability and performance.

Adaptive Thresholds

Adaptive thresholds are dynamic, model-generated values that define acceptable upper and lower bounds for system metrics on an hourly basis. Unlike traditional static thresholds (e.g., CPU > 80%), they are derived from 30 days of historical data to predict the expected range of values for the next 15 days. For each hour, the model calculates and maintains three key values: the lower bound, the upper bound, and the predicted value.
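As a mental model, each hour's threshold can be pictured as a small record holding these three values. The sketch below is purely illustrative; the class and field names are hypothetical, not the product's schema.

```python
from dataclasses import dataclass

# Hypothetical record, for illustration only -- not the product's schema.
@dataclass
class HourlyThreshold:
    hour: int            # hour of day, 0-23
    lower_bound: float   # lowest value expected under normal behavior
    predicted: float     # central expected value
    upper_bound: float   # highest value expected under normal behavior

# A business-hours CPU example, matching the figures used later on this page:
peak = HourlyThreshold(hour=10, lower_bound=45.0, predicted=60.0, upper_bound=75.0)
```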

Example:

The model learns the typical traffic behavior of a branch network bandwidth based on historical patterns. For example, during weekday business hours (e.g., 10:00 AM to 6:00 PM), traffic is usually high due to regular operations. Conversely, during non-business hours or weekends, the traffic is minimal.

If the model observes a sudden, consistent spike in traffic during non-business hours that deviates from the learned pattern, it flags the spike as an anomaly. This detection is based on the trained adaptive threshold, not on a hardcoded static rule.

Use Case

An admin at a remote branch notices network traffic surging at 2:00 AM on multiple days. Because the model has learned that this time window typically has near-zero usage, it classifies the event as an anomaly, potentially indicating unauthorized data transfers or a misconfigured backup job.

Key Terminologies

Predicted Value

This is the central expected value predicted by the ML model for a metric at a specific time interval. It acts as the baseline or reference point for calculating the upper and lower bounds.

Example:

If CPU usage on a database server consistently hovers around 60% at 10:00 AM, the model may predict 60% as the expected (Predicted) value for that hour. Any significant deviation from this becomes a candidate for anomaly evaluation.

Upper Bound

The maximum threshold (before any factor is applied) that the ML model expects for the metric at a given time. It indicates the highest value the metric should usually reach under expected conditions.

Example:

If during business hours (10:00 AM), the Predicted CPU is 60%, and historically it never exceeds 75%, the model defines 75% as the Upper Bound for that hour. This means CPU reaching up to 75% is still considered acceptable.

Lower Bound

The minimum threshold (before any factor is applied) that the ML model expects for the metric at a specific time. It represents the lowest value expected under normal behavior.

Example:

If the predicted network bandwidth at 11:00 AM is 50 Mbps, and the lowest expected usage based on history is 30 Mbps, then 30 Mbps becomes the Lower Bound. Any significant drop below this may be flagged as an anomaly.

Band

The range (gap) between the Predicted Value and the Upper Bound or Lower Bound. The Upper Limit or Lower Limit is calculated from the band by applying a factor.

Example:

If the predicted CPU is 60% and the Upper Bound is 75%, the Upper Band is:

Upper Band = 75 − 60 = 15%

This 15% defines the maximum deviation tolerated before the factor is applied.
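As a one-line sketch of this arithmetic (variable names are illustrative only):

```python
# Illustrative only -- hypothetical names, not the product's API.
predicted = 60.0    # predicted CPU % for the hour
upper_bound = 75.0  # model's upper bound for the hour

upper_band = upper_bound - predicted  # width from the predicted center to the edge
print(upper_band)  # 15.0
```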

Difference between Bound & Band

| Term | Meaning | Derived From |
|------|---------|--------------|
| Bound | The absolute value predicted by the ML model (Upper Bound or Lower Bound). | Direct ML output |
| Band | The difference (gap) between the predicted value and a bound. | Calculated for thresholding logic |

Think of Bound as the edge, and Band as the width from the predicted center to that edge.

Add or Configure Range Thresholds

  • Go to Infraon Configuration -> IT Operations -> Threshold.

  • Click on 'Add' and select 'Protocol' as desired.

Please refer to the product documentation for guidance on adding Basic threshold types.

Model-Based Thresholding

  • The system requires 30 days of historical data to analyze usage trends.

  • Once enough data is available, the model generates hourly range values for the next 15 days.

  • For each hour, the model calculates:

    • Lower bound

    • Upper bound

    • Predicted value

  • These values are collectively known as the Range Threshold and are used for anomaly detection.
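The underlying prediction model is not specified in this document, so the following is only a rough sketch of the idea: it substitutes a simple per-hour statistic (mean ± 2 standard deviations over the 30-day history) for the real model, and all names are hypothetical.

```python
import statistics
from collections import defaultdict

# Rough sketch only: the real model is not documented here. This stand-in
# uses a per-hour mean +/- 2*stddev over ~30 days of history, and assumes
# at least two samples per hour of the day.
def range_thresholds(samples):
    """samples: iterable of (hour_of_day, value) pairs from historical polling.
    Returns {hour: (lower_bound, predicted, upper_bound)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    thresholds = {}
    for hour, values in by_hour.items():
        predicted = statistics.mean(values)    # central expected value
        spread = 2 * statistics.stdev(values)  # illustrative band width
        thresholds[hour] = (predicted - spread, predicted, predicted + spread)
    return thresholds
```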

Add Range | Basic Details

These are user-defined thresholds that are manually configured with fixed values.

1. Severity

Specifies how critical the issue is when the threshold is breached. This helps prioritize alerts in the system and trigger different workflows or escalations. The system uses a classification tag to decide the alert priority.

Levels:

  • Minor (low risk)

  • Major (moderate risk)

  • Critical (immediate attention needed)

Example: CPU ≥ 80% → Major, CPU ≥ 90% → Critical.
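A minimal sketch of that severity ladder (the helper below is hypothetical, not a product API):

```python
# Hypothetical severity ladder for the CPU example above.
def classify_cpu_severity(value: float):
    if value >= 90:
        return "Critical"  # immediate attention needed
    if value >= 80:
        return "Major"     # moderate risk
    return None            # no breach

print(classify_cpu_severity(85))  # Major
print(classify_cpu_severity(92))  # Critical
```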

2. Condition

The rule that evaluates a metric against the threshold value; it defines the breach logic.

  • ≥ Greater Than or Equal To

  • ≤ Less Than or Equal To

Example: If you set the condition to >= 80, the alert is triggered when the value is 80 or higher.

3. Value

The static number that is compared with the monitored metric; it acts as the threshold limit for triggering alerts.

Example: If the “Value” is set to 80 and the “Condition” is ≥, then any metric reaching 80 or above will be flagged.

4. Poll Points

The number of recent historical data points (collected during polling intervals) that the system evaluates. The application collects metric values at defined intervals (e.g., every 5 minutes); each collected value is called a poll point.

This prevents the system from reacting to a single, momentary spike: it waits to see whether the issue is sustained across multiple data points, helping avoid false alarms from short-lived spikes or dips.

Example: If three poll points are configured with a 5-minute polling interval, the system evaluates the last 15 minutes of data. An anomaly is triggered only if all three values breach the defined threshold (based on the breached percentage setting).
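A minimal sketch of this sliding-window check, assuming a simple in-memory window (all names hypothetical):

```python
from collections import deque

# Hypothetical sketch: keep the last N poll points and count how many breach.
POLL_POINTS = 3      # window size
THRESHOLD = 80.0     # static threshold value

window = deque(maxlen=POLL_POINTS)

def on_poll(value, required_breaches=POLL_POINTS):
    """Returns True when enough of the last POLL_POINTS values breach."""
    window.append(value)
    if len(window) < POLL_POINTS:
        return False  # not enough history yet
    breaches = sum(1 for v in window if v >= THRESHOLD)
    return breaches >= required_breaches

for v in [85, 82, 88, 79, 91]:
    print(v, on_poll(v))  # True only while all 3 windowed values are >= 80
```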

5. Breached %

The percentage of poll points that must violate the condition to trigger an alert. It adds tolerance for occasional data fluctuation.

Example: If 10 poll points are monitored and the breach% is set to 60%, then at least 6 values must exceed the threshold to raise an alert.

6. Effective Poll Points

The actual number of poll points (out of the total) that must breach the condition, derived from the “Breached %”. It converts the user-defined percentage into an exact count that the system will use.

Example: If 3 poll points are configured and the Breached % is 60%, then 3 × 0.6 = 1.8, which is rounded up to 2.

So, for an anomaly to be triggered, at least two of the last three readings must cross the threshold.

This rounding logic ensures practical thresholds and avoids micro-fluctuation triggers.
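The rounding can be sketched with a ceiling function (the helper name is hypothetical):

```python
import math

# Hypothetical helper sketching the rounding logic described above.
def effective_poll_points(poll_points: int, breached_pct: float) -> int:
    return math.ceil(poll_points * breached_pct / 100)

print(effective_poll_points(3, 60))   # 2  (3 x 0.6 = 1.8, rounded up)
print(effective_poll_points(10, 60))  # 6  (matches the Breached % example)
```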

Enabling Adaptive Thresholds

  • In the configuration UI for Advanced Resource Configuration, toggle Prediction and set the necessary statistical inputs.

  • The system will apply the model-generated thresholds automatically.

  • These thresholds adjust for:

    • Time of day (e.g., weekday vs weekend behavior)

    • Device-specific patterns

    • Location-specific variations

Adaptive Threshold | Basic Details

These are machine learning (ML) driven thresholds that dynamically adjust based on historical data patterns.

1. Severity

Like static thresholds, this sets how critical an anomaly is based on how much the real-time value deviates from the model's predictions. It categorizes deviations into severity buckets to avoid overwhelming users with low-priority issues.

Example: Assign different severity levels based on Factor.

2. Upper Limit

The final computed threshold for raising alerts above normal behavior. It defines the boundary where high values are flagged as anomalies.

Example:

  • Predicted = 60%

  • Upper Bound = 75% → Band = 15%

  • Factor = 1

  • Upper Limit = 75 + (1 × 15) = 90%

3. Lower Limit

The final threshold for flagging values that are too low compared to learned trends. Helps in detecting underperformance or unusual drops.

Example:

  • Predicted = 60%

  • Lower Bound = 45% → Band = 15%

  • Factor = 1

  • Lower Limit = 45 − (1 × 15) = 30%
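Both formulas can be sketched together as follows (the function is hypothetical, not the product's API):

```python
# Hypothetical sketch of the limit formulas described above:
#   Upper Limit = Upper Bound + Factor x Upper Band
#   Lower Limit = Lower Bound - Factor x Lower Band
def adaptive_limits(predicted, upper_bound, lower_bound, factor):
    upper_band = upper_bound - predicted
    lower_band = predicted - lower_bound
    upper_limit = upper_bound + factor * upper_band
    lower_limit = lower_bound - factor * lower_band
    return upper_limit, lower_limit

print(adaptive_limits(60, 75, 45, 1))  # (90, 30) -- matches the examples above
```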

Use Case: Monitoring UPS Battery Percentage in a Data Center

Context

A financial services company manages multiple data centers, each powered by UPS systems to ensure availability during power outages. Monitoring UPS battery percentage is critical to prevent unexpected shutdowns and equipment damage.

The operations team wants to detect two types of anomalies:

  • Overcharging or sensor issues when the battery percentage goes unusually high.

  • Discharging or battery failure when the battery percentage drops sharply.

They use Adaptive Thresholding to account for differences in battery behavior across systems and time-of-day patterns (e.g., heavy usage during backups or power switches).

ML Prediction Snapshot (at 3:00 AM)

| Predicted Value | Upper Bound | Lower Bound |
|-----------------|-------------|-------------|
| 60% | 75% | 45% |

  • Upper Band = 75 − 60 = 15%

  • Lower Band = 60 − 45 = 15%

Threshold Configuration

| Parameter | Value |
|-----------|-------|
| Factor | 1 |
| Poll Points | 3 |
| Breached % | 100% |

Upper Limit Calculation

  • Upper Limit = 75 + (1 × 15) = 90%

The system will raise a Critical anomaly if the battery percentage exceeds 90%.

This can indicate overcharging, stuck sensors, or faulty voltage regulators.

Lower Limit Calculation

  • Lower Limit = 45 − (1 × 15) = 30%

The system will raise a Critical anomaly if the battery percentage drops below 30%.

This can indicate a dying battery, failing charge cycles, or losing backup capability.

Polled Data (Every 5 Minutes)

| Time | Battery % |
|---------|-----------|
| 3:00 AM | 28% |
| 3:05 AM | 29% |
| 3:10 AM | 27% |

All three readings fall below the 30% Lower Limit, satisfying the 100% Breached % setting, so the system flags a Lower Limit anomaly and sends an automated alert to the NOC team. The investigation revealed a failing battery module in Rack 2, which was replaced proactively, preventing potential data center downtime.
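Replaying this scenario as a minimal sketch (names hypothetical):

```python
# Hypothetical replay of the UPS use case above.
LOWER_LIMIT = 30.0       # 45 - (1 x 15)
readings = [28, 29, 27]  # battery % at 3:00, 3:05, and 3:10 AM

breaches = [value < LOWER_LIMIT for value in readings]
# Breached % is 100%, so every poll point in the window must breach:
anomaly = all(breaches)
print(anomaly)  # True -> Lower Limit anomaly raised
```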

4. Poll Points

The number of recent values to evaluate (as with static thresholds). These are sampled metric values over time, used to confirm whether a breach is consistent. This allows time-based validation of anomalies instead of reacting to a single reading.

If polling happens every 5 minutes and you set 3 poll points, the system evaluates the last 15 minutes of data. An anomaly is triggered only when the required number of those 3 points (based on the Breached %) meets the threshold condition.

5. Breached %

The required percentage of those poll points that must breach the threshold to trigger an anomaly. It filters out irregular or spiky data patterns and ensures that only consistent breaches raise alerts.

Example: If 3 poll points are configured and the Breached % is set to 66%, the system calculates:

3 × 66% = 1.98 → rounded up to 2 poll points

So, for an anomaly to be raised, at least 2 out of 3 poll points must breach the threshold.

6. Factor

A multiplier applied to the band (i.e., the difference between the predicted value and upper/lower bound). It controls how aggressive or lenient the alerting is.

How It Helps

  • To reduce noise (i.e., avoid frequent or insignificant alerts), use a higher factor.

  • To make the system more sensitive, use a lower factor like 0 or 0.5.

  • The Factor value is also often mapped to alert Severity levels:

    • Factor 1 → Minor

    • Factor 2 → Major

    • Factor 3 → Critical

Example 1: Using Factor for Upper Limit (CPU Usage)

  • Predicted Value: 60%

  • Upper Bound: 75%

  • Upper Band = 75 − 60 = 15%

  • Factor: 1

  • Upper Limit = 75 + (1 × 15) = 90%

If the CPU usage exceeds 90%, an anomaly is raised. Increasing the Factor to 2 would push the limit to 105%, making it more tolerant.

Example 2: Using Factor for Lower Limit (UPS Battery%)

  • Predicted Value: 60%

  • Lower Bound: 45%

  • Lower Band = 60 − 45 = 15%

  • Factor: 1

  • Lower Limit = 45 − (1 × 15) = 30%

If the UPS battery % drops below 30%, an anomaly is triggered. Reducing the Factor to 0.5 changes the threshold to 37.5%, making detection more sensitive.
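A small sketch of how the Factor moves both limits for the figures used in these examples (names hypothetical):

```python
# Hypothetical sketch: sweeping the Factor for predicted=60, bounds 75/45.
def limits(predicted, upper_bound, lower_bound, factor):
    upper_band = upper_bound - predicted
    lower_band = predicted - lower_bound
    return upper_bound + factor * upper_band, lower_bound - factor * lower_band

for factor in (0.5, 1, 2):
    upper, lower = limits(60, 75, 45, factor)
    print(f"factor={factor}: upper limit {upper}%, lower limit {lower}%")
# factor=0.5: upper limit 82.5%, lower limit 37.5%  (more sensitive)
# factor=1:   upper limit 90%,   lower limit 30%
# factor=2:   upper limit 105%,  lower limit 15%    (more tolerant)
```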

7. Alert Above / Alert Below

These fields are static thresholds configured in addition to the adaptive ML-based thresholds. They act as benchmarking overrides that help reduce noise in situations where model predictions may tolerate natural variation, but strict enforcement is still required based on business expectations.

How It Helps

  • Helps reduce alert noise by enforcing absolute boundaries.

  • Ideal for metrics that have hard operational limits (e.g., CPU > 95%, Battery < 10%).

  • Not recommended for highly sensitive metrics, where adaptive thresholds alone should govern.

Example:

Let’s say we are monitoring CPU utilization at 10:00 AM.

| Predicted Value | Upper Bound | Lower Bound | Factor |
|-----------------|-------------|-------------|--------|
| 60% | 75% | 45% | 1 |

From this:

  • Upper Band = 75 − 60 = 15

  • Upper Limit = 75 + (1 × 15) = 90%

  • Lower Band = 60 − 45 = 15

  • Lower Limit = 45 − (1 × 15) = 30%

Let’s say Alert Above = 85% and Alert Below = 35%.

  • If the real-time CPU usage is 88%, it does not cross the adaptive upper limit (90%), but it does cross Alert Above = 85%, so an anomaly is still triggered.

  • Similarly, if the value drops to 32%, that is still above the ML-calculated lower limit (30%) but below Alert Below = 35%, so an anomaly is again triggered.

This demonstrates how Alert Above and Alert Below enforce stricter thresholds when needed.
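A minimal sketch of this combined evaluation, assuming the adaptive limits have already been computed (all names hypothetical):

```python
# Hypothetical sketch combining adaptive limits with static overrides.
def is_anomaly(value, upper_limit, lower_limit, alert_above=None, alert_below=None):
    if value > upper_limit or value < lower_limit:
        return True  # breached the adaptive (ML-derived) range
    if alert_above is not None and value > alert_above:
        return True  # static override for high values
    if alert_below is not None and value < alert_below:
        return True  # static override for low values
    return False

# Figures from the example above: adaptive limits 90/30, overrides 85/35.
print(is_anomaly(88, 90, 30, alert_above=85, alert_below=35))  # True
print(is_anomaly(32, 90, 30, alert_above=85, alert_below=35))  # True
print(is_anomaly(60, 90, 30, alert_above=85, alert_below=35))  # False
```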

  • During live monitoring, the system compares each polled value against the model's predicted upper and lower thresholds.

  • If a value falls outside the defined range, an anomaly event is generated.

  • If a corresponding trigger is configured, the anomaly event can automatically create a ticket.

  • Threshold models are retrained weekly to adapt to the latest usage patterns.

  • Avoid using Alert Above/Below on sensitive metrics unless you want to override adaptive behavior for definite business rules.
