# AIOps Configuration

AIOps Configuration enables users to define static or dynamic limits for various performance metrics. It helps proactively detect anomalies by triggering alerts when values breach static conditions or adaptive thresholds, ensuring timely action to maintain system stability and performance.

## **Adaptive Thresholds**

Adaptive thresholds are dynamic, model-generated values that define acceptable upper and lower bounds for system metrics on an hourly basis. Unlike traditional static thresholds (e.g., CPU > 80%), these thresholds are derived from **30 days of historical data** to predict the expected range of values for the next **15 days**. For each hour, the model calculates and maintains three key values: **lower bound, upper bound, and predicted value**.

**Example:**

The model learns the typical traffic behavior of a **branch network bandwidth** based on historical patterns. For example, during **weekday business hours** (e.g., 10:00 AM to 6:00 PM), traffic is usually high due to regular operations. Conversely, during **non-business hours** or weekends, the traffic is minimal.

If the model observes a **sudden and consistent spike in traffic during non-business hours** that deviates from the learned pattern, it flags the traffic as an **anomaly**. This detection is based on the **trained adaptive threshold**, not a hardcoded static rule.

**Use Case**

An admin at a remote branch notices network traffic surging at 2:00 AM on multiple days. Having learned that this time window typically has near-zero usage, the model classifies the event as an anomaly, potentially indicating unauthorized data transfers or a misconfigured backup job.

<figure><img src="https://8249392-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FE4mkwSP8a1BSD9BFNFav%2Fuploads%2FQbWbWdF4SRvdR7IegnSj%2FFrame%20634070.svg?alt=media&#x26;token=2d2f77ec-11f1-4915-aa8d-c92caefe456e" alt=""><figcaption></figcaption></figure>

### **Key Terminologies**

#### **Predicted Value**

This is the central expected value predicted by the ML model for a metric at a specific time interval. It acts as the baseline or reference point for calculating the upper and lower bounds.

{% hint style="info" %}
**Example:**

If CPU usage on a database server consistently hovers around **60% at 10:00 AM**, the model may predict **60%** as the expected (Predicted) value for that hour. Any significant deviation from this becomes a candidate for anomaly evaluation.
{% endhint %}

#### **Upper Bound**

The maximum threshold (before any factor is applied) that the ML model expects for the metric at a given time. It indicates the highest value the metric should usually reach under expected conditions.

{% hint style="info" %}
**Example:**

If during business hours (10:00 AM), the Predicted CPU is **60%,** and historically it never exceeds **75%,** the model defines 75% as the **Upper Bound** for that hour. This means CPU reaching up to 75% is still considered acceptable.
{% endhint %}

#### **Lower Bound**

The minimum threshold (before any factor is applied) that the ML model expects for the metric at a specific time. It represents the lowest value expected under normal behavior.

{% hint style="info" %}
**Example:**

If the predicted network bandwidth at 11:00 AM is **50 Mbps**, and the lowest expected usage based on history is **30 Mbps**, then 30 Mbps becomes the **Lower Bound**. Any significant drop below this may be flagged as an anomaly.
{% endhint %}

#### **Band**

The range **(gap)** between the Predicted Value and the Upper Bound or Lower Bound. The band is used to calculate the Upper Limit or Lower Limit by applying a factor.

{% hint style="success" %}
**Formula:**

* **Upper Band** = Upper Bound − Predicted Value
* **Lower Band** = Predicted Value − Lower Bound
  {% endhint %}

{% hint style="info" %}
**Example:**

If the predicted CPU is **60%**, and the Upper Bound is **75%**, the **Upper Band** is:

**75 − 60 = 15%**\
\
This **15%** defines the maximum deviation tolerated before factoring in additional limits.
{% endhint %}

#### **Difference between Bound & Band**

| **Term**  | **Meaning**                                                   | **Derived From**                  |
| --------- | ------------------------------------------------------------- | --------------------------------- |
| **Bound** | The absolute value predicted by the ML model (upper or lower bound). | Direct ML output                  |
| **Band**  | The difference or gap between predicted and bound values.     | Calculated for thresholding logic |

{% hint style="info" %}
Think of **Bound** as the edge, and **Band** as the width from the predicted center to that edge.
{% endhint %}
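The Bound/Band relationship can be sketched in a few lines of Python (a minimal illustration using the example values from this page; the function name is ours, not a product API):

```python
def bands(predicted: float, upper_bound: float, lower_bound: float):
    """Return (upper_band, lower_band): the widths from the predicted
    center out to each model-generated bound."""
    return upper_bound - predicted, predicted - lower_bound

# Example from the text: predicted CPU 60%, bounds 75% / 45%.
upper_band, lower_band = bands(60, 75, 45)
print(upper_band, lower_band)  # 15 15
```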

## **Add or Configure Range Thresholds**

* Go to **Infraon Configuration → IT Operations → Threshold**.
* Click **Add** and select the desired **Protocol**.

{% hint style="info" %}
Please refer to the [product documentation](https://docs.infraon.io/infraon-help/infinity-user-guide/infraon-configuration/it-operations/thresholds#instructions-to-add-a-threshold) for guidance on adding Basic threshold types.
{% endhint %}

### **Model-Based Thresholding**

* The system requires **30 days of historical data** to analyze usage trends.
* Once enough data is available, the model generates **hourly range values** for the next **15 days**.
* For each hour, the model calculates:
  * **Lower bound**
  * **Upper bound**
  * **Predicted value**
* These values are collectively known as the **Range Threshold** and are used for anomaly detection.

#### **Add Range | Basic Details**

These are user-defined thresholds that are manually configured with fixed values.

**1. Severity**

Specifies how critical the issue is when the threshold is breached. This helps prioritize alerts in the system and trigger different workflows or escalations. The system uses a classification tag to decide the alert priority.

```
Levels: 

Minor (low risk)
Major (moderate risk)
Critical (immediate attention needed).
```

{% hint style="info" %}
**Example**: CPU ≥ 80% → *Major*, CPU ≥ 90% → *Critical*.
{% endhint %}

**2. Condition**

The rule that evaluates a metric against the threshold value and defines the breach logic.

```
≥ Greater Than or Equal To
≤ Less Than or Equal To
```

{% hint style="info" %}
**Example**: If you set the condition to **>= 80**, the alert is triggered when the value is 80 or higher.
{% endhint %}

**3. Value**

The static number compared with the monitored metric; it acts as the threshold limit for triggering alerts.

{% hint style="info" %}
**Example**: If the “Value” is set to **80**, and the “Condition” is **≥**, then any metric reaching 80 or above will be flagged.
{% endhint %}

**4. Poll Points**

The number of recent historical data points (collected during polling intervals) that the system evaluates. The application collects metric values at defined intervals (e.g., every 5 minutes); each collected value is called a **poll point**.

It prevents the system from reacting to a single, momentary spike. It waits to see if the issue is sustained across multiple data points. It helps avoid false alarms from short-lived spikes or dips.

{% hint style="info" %}
**Example**: If **three poll points** are configured with a 5-minute polling interval, the system evaluates the last **15 minutes** of data. An anomaly is triggered **only if all three values** breach the defined threshold (based on the breached percentage setting).
{% endhint %}

{% hint style="warning" %}
Using too many poll points (e.g., 10 or more) can delay anomaly detection and is not recommended for quick response metrics.
{% endhint %}
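The poll-point evaluation described above can be sketched as follows (a hypothetical helper, not the product's internal implementation; the required breach count, derived from Breached %, is passed in directly):

```python
import operator

def is_anomaly(poll_points, condition, value, required):
    """Return True when at least `required` of the polled values breach
    the static threshold defined by `condition` and `value`."""
    op = operator.ge if condition == ">=" else operator.le
    breaches = sum(1 for p in poll_points if op(p, value))
    return breaches >= required

# Three 5-minute polls (the last 15 minutes), condition >= 80, all must breach.
print(is_anomaly([82, 85, 81], ">=", 80, required=3))  # True
print(is_anomaly([82, 79, 81], ">=", 80, required=3))  # False: only 2 of 3 breach
```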

**5. Breached %**

The percentage of poll points that must violate the condition to trigger an alert. It adds tolerance for occasional data fluctuations.

{% hint style="info" %}
**Example**: If 10 poll points are monitored and the breach% is set to 60%, then at least 6 values must exceed the threshold to raise an alert.
{% endhint %}

**6. Effective Poll Points**

The actual number of poll points (out of the total) that must breach the condition, derived from the “Breached %”. It converts the user-defined percentage into an exact count that the system will use.

{% hint style="success" %}
**Formula:** Poll Points × Breached%
{% endhint %}

{% hint style="info" %}
**Example**: If **3 poll points** are configured and the **breached %** is **60%**,\
→ 3 × 0.6 = **1.8**,\
→ This value is **rounded up** to 2.
{% endhint %}

So, for an anomaly to be triggered, at least two of the last three readings must cross the threshold.

{% hint style="warning" %}

* If the resulting number is **≥ 1.5**, it is **rounded up**.
* If it is **< 1.5**, it is **rounded down**.
  {% endhint %}

This rounding logic ensures practical thresholds and avoids micro-fluctuation triggers.
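The Effective Poll Points calculation, including the rounding rule, might look like this (a sketch that rounds the fractional part half up, which matches the documented examples; the function name is illustrative):

```python
import math

def effective_poll_points(poll_points: int, breached_pct: float) -> int:
    """Convert Breached % into the exact number of poll points that must
    breach the condition, rounding half up per the documented rule."""
    raw = poll_points * breached_pct / 100
    return math.floor(raw + 0.5)

print(effective_poll_points(3, 60))   # 2  (3 x 0.60 = 1.8, rounded up)
print(effective_poll_points(3, 66))   # 2  (3 x 0.66 = 1.98, rounded up)
print(effective_poll_points(10, 60))  # 6
```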

### **Enabling Adaptive Thresholds**

* In the configuration UI for Advanced Resource Configuration, toggle **Prediction** and set the necessary statistical inputs.
* The system will apply the model-generated thresholds automatically.
* These thresholds adjust for:
  * Time of day (e.g., weekday vs weekend behavior)
  * Device-specific patterns
  * Location-specific variations

#### **Adaptive Threshold | Basic Details**

These are machine learning (ML) driven thresholds that dynamically adjust based on historical data patterns.

**1. Severity**

Like static thresholds, this sets how critical an anomaly is based on how much the real-time value deviates from the model's predictions. It categorizes deviations into severity buckets to avoid overwhelming users with low-priority issues.

{% hint style="info" %}
**Example:** Assign different severity levels based on **Factor**.
{% endhint %}

**2. Upper Limit**

The final computed threshold for raising alerts above normal behavior. It defines the boundary where high values are flagged as anomalies.

{% hint style="success" %}
**Formula:**

Upper Limit = Upper Bound + (Factor × Upper Band)
{% endhint %}

{% hint style="info" %}
**Example:**

* Predicted = 60%
* Upper Bound = 75% → Band = 15%
* Factor = 1
* **Upper Limit** = 75 + (1 × 15) = **90%**
  {% endhint %}

{% hint style="danger" %}
If the actual value exceeds 90%, an alert is raised.
{% endhint %}

**3. Lower Limit**

The final threshold for flagging values that are too low compared to learned trends. Helps in detecting underperformance or unusual drops.

{% hint style="success" %}
**Formula:**

Lower Limit = Lower Bound − (Factor × Lower Band)
{% endhint %}

{% hint style="info" %}
**Example**:

* Predicted = 60%
* Lower Bound = 45% → Band = 15%
* Factor = 1
* **Lower Limit** = 45 − (1 × 15) = **30%**
  {% endhint %}

{% hint style="danger" %}
If the value drops below 30%, an alert is raised.
{% endhint %}
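Putting the two formulas together, the adaptive limits can be sketched as follows (an illustrative helper using the example values above, not a product API):

```python
def adaptive_limits(predicted, upper_bound, lower_bound, factor):
    """Compute the final alerting limits from the model's hourly outputs."""
    upper_band = upper_bound - predicted
    lower_band = predicted - lower_bound
    upper_limit = upper_bound + factor * upper_band
    lower_limit = lower_bound - factor * lower_band
    return upper_limit, lower_limit

# Predicted 60%, bounds 75% / 45%, factor 1.
print(adaptive_limits(60, 75, 45, factor=1))  # (90, 30)
```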

**Use Case: Monitoring UPS Battery Percentage in a Data Center**

**Context**

A financial services company manages multiple data centres, each powered by UPS systems to ensure availability during power outages. Monitoring **UPS battery percentage** is critical to prevent unexpected shutdowns and equipment damage.

The operations team wants to detect two types of anomalies:

* **Overcharging or sensor issues** when the battery percentage goes unusually high.
* **Discharging or battery failure** when the battery percentage drops sharply.

They use **Adaptive Thresholding** to account for differences in battery behavior across systems and time-of-day patterns (e.g., heavy usage during backups or power switches).

**ML Prediction Snapshot (at 3:00 AM)**

| **Predicted Value** | **Upper Bound** | **Lower Bound** |
| ------------------- | --------------- | --------------- |
| 60%                 | 75%             | 45%             |

* **Upper Band** = 75 − 60 = 15%
* **Lower Band** = 60 − 45 = 15%

**Threshold Configuration**

| **Parameter**   | **Value** |
| --------------- | --------- |
| **Factor**      | 1         |
| **Poll Points** | 3         |
| **Breached %**  | 100%      |

**Upper Limit Calculation**

* Upper Limit = 75 + (1 × 15) = 90%

The system will raise a **Critical anomaly** if the battery percentage exceeds 90%.

This can indicate overcharging, stuck sensors, or faulty voltage regulators.

**Lower Limit Calculation**

* Lower Limit = 45 − (1 × 15) = 30%

The system will raise a **Critical anomaly** if the battery percentage drops below 30%.

This can indicate a dying battery, failing charge cycles, or losing backup capability.

**Polled Data (Every 5 Minutes)**

| **Time**    | **Battery %** |
| ----------- | ------------- |
| **3:00 AM** | 28%           |
| **3:05 AM** | 29%           |
| **3:10 AM** | 27%           |

{% hint style="danger" %}
All values < Lower Limit (30%), **Anomaly Triggered (Critical)**
{% endhint %}

The system flags a **Lower Limit anomaly**, sending an automated alert to the NOC team. Investigation reveals a failing battery module in Rack 2, which is replaced proactively, preventing potential data center downtime.
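The evaluation in this scenario can be replayed in a short sketch (values taken from the tables above; the helper is illustrative, and with Breached % = 100 every poll point must breach):

```python
def lower_limit_anomaly(polls, lower_limit, breached_pct=100):
    """Return True when enough polled values fall below the lower limit."""
    required = round(len(polls) * breached_pct / 100)
    breaches = sum(1 for p in polls if p < lower_limit)
    return breaches >= required

# 3:00-3:10 AM battery readings vs. the computed Lower Limit of 30%.
print(lower_limit_anomaly([28, 29, 27], lower_limit=30))  # True -> Critical anomaly
```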

**4. Poll Points**

The number of recent values to evaluate (as with static thresholds). These are sampled metric values over time, used to confirm whether a breach is consistent. This allows time-based validation of anomalies instead of reacting immediately.

{% hint style="info" %}
If polling happens every **5 minutes**, and you set **3 poll points**, the system evaluates the last **15 minutes** of data.\
\
An anomaly will only be triggered if the defined number of those 3 points (based on the breached %) meet the threshold condition.
{% endhint %}

{% hint style="warning" %}

* Using a smaller number of poll points (e.g., 3 to 5) helps the system react quickly to sustained issues while still filtering out random spikes.
* It ensures the ML model doesn’t raise false positives for one-off data irregularities.
  {% endhint %}

**5. Breached %**

The required percentage of poll points that must breach the threshold to trigger an anomaly. It filters out irregular or spiky data patterns and ensures that only consistent breaches raise alerts.

{% hint style="info" %}
**Example**: If **3 poll points** are configured and the **Breached%** is set to **66%**, the system calculates:

3 × 66% = 1.98 → rounded up to 2 poll points

So, for an anomaly to be raised, at least 2 out of 3 poll points must breach the threshold.
{% endhint %}

**6. Factor**

A multiplier applied to the **band** (i.e., the difference between the predicted value and upper/lower bound). It controls how aggressive or lenient the alerting is.

{% hint style="success" %}
**Formula:**

Upper Threshold (Limit) = Upper Bound + (Factor × Upper Band)

Lower Threshold (Limit) = Lower Bound − (Factor × Lower Band)
{% endhint %}


**How It Helps**

* To **reduce noise** (i.e., avoid frequent or insignificant alerts), use a higher factor.
* To make the system more **sensitive**, use a **lower factor** like 0 or 0.5.
* The Factor value is also often mapped to alert **Severity levels**:
  * Factor 1 → Minor
  * Factor 2 → Major
  * Factor 3 → Critical

**Example 1: Using Factor for Upper Limit (CPU Usage)**

* **Predicted Value**: 60%
* **Upper Bound**: 75%
* → **Upper Band** = 75 − 60 = 15%
* **Factor**: 1
* → **Upper Limit** = 75 + (1 × 15) = **90%**

If the CPU usage exceeds 90%, an anomaly is raised.\
Increasing the Factor to 2 would push the limit to 105%, making it more tolerant.

**Example 2: Using Factor for Lower Limit (UPS Battery%)**

* **Predicted Value**: 60%
* **Lower Bound**: 45%
* → **Lower Band** = 60 − 45 = 15%
* **Factor**: 1
* → **Lower Limit** = 45 − (1 × 15) = **30%**

If the UPS battery % drops below 30%, an anomaly is triggered.\
Reducing the Factor to 0.5 changes the threshold to 37.5%, making detection more sensitive.

**7. Alert Above / Alert Below**

These fields are static thresholds configured in addition to the adaptive ML-based thresholds. They act as **benchmarking overrides** that help reduce noise in situations where model predictions may tolerate natural variation, but strict enforcement is still required based on business expectations.

**How It Helps**

* Helps reduce alert noise by enforcing **absolute boundaries**.
* Ideal for metrics that have hard operational limits (e.g., CPU > 95%, Battery < 10%).
* Not recommended for **highly sensitive metrics**, where adaptive thresholds alone should govern.

**Example:**

Let’s say we are monitoring **CPU utilization** at 10:00 AM.

| **Predicted Value** | **Upper Bound** | **Lower Bound** | **Factor** |
| ------------------- | --------------- | --------------- | ---------- |
| 60%                 | 75%             | 45%             | 1          |

From this:

* **Upper Band** = 75 − 60 = 15
* **Upper Limit** = 75 + (1 × 15) = **90%**
* **Lower Band** = 60 − 45 = 15
* **Lower Limit** = 45 − (1 × 15) = **30%**

Let’s say **Alert Above = 85%** and **Alert Below = 35%**.

* If the real-time CPU usage is **88%**, it does **not** cross the adaptive upper limit (90%)\
  → but it **does** cross the **Alert Above = 85%**, so **an anomaly is still triggered**.
* Similarly, if the value drops to **32%**, and that’s still **above** the ML-calculated lower limit (30%)\
  → but **below** Alert Below = 35%\
  → An anomaly is again triggered.

This demonstrates how Alert Above and Alert Below **enforce stricter thresholds** when needed.
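Combining the adaptive limits with the static overrides, the overall check might look like this (a sketch; the function and parameter names are illustrative, and the product's internal evaluation order may differ):

```python
def evaluate(value, upper_limit, lower_limit, alert_above=None, alert_below=None):
    """Flag a value that breaches either the adaptive limits or the
    stricter static Alert Above / Alert Below overrides."""
    if value > upper_limit or (alert_above is not None and value > alert_above):
        return "high anomaly"
    if value < lower_limit or (alert_below is not None and value < alert_below):
        return "low anomaly"
    return "normal"

# Adaptive limits 90% / 30%; overrides Alert Above = 85%, Alert Below = 35%.
print(evaluate(88, 90, 30, alert_above=85, alert_below=35))  # high anomaly
print(evaluate(32, 90, 30, alert_above=85, alert_below=35))  # low anomaly
print(evaluate(70, 90, 30, alert_above=85, alert_below=35))  # normal
```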

{% hint style="info" %}

* During live monitoring, the system compares each polled value against the model's predicted upper and lower thresholds.
* If a value falls outside the defined range, an anomaly event is generated.
* If a corresponding trigger is configured, the anomaly event can automatically create a ticket.
* Threshold models are retrained weekly to adapt to the latest usage patterns.
* Avoid using Alert Above/Below on sensitive metrics unless you want to override adaptive behavior for definite business rules.
  {% endhint %}
