Learn how to create and manage alerts.

All users in VMware Aria Operations for Applications (formerly known as Tanzu Observability by Wavefront) can examine alerts and drill down to find the problem.

If you are using the Terraform Provider, update to version 3.0.1. Earlier versions are not compatible with the 2022 alert experience.

Create Alert Video

Users with the Alerts permission follow a step-by-step process to create an alert. Watch this 90-second video. Note that the video was created in 2021 and uses the 2021 version of the UI, so some of the information in it might have changed.

You can also watch the video here.

Create Alert Tutorial

This tutorial creates an alert that allows you to specify the severity for each threshold. For example, you can:

  • Send an alert email of type Info to a group of engineers when a certain value is close to the SLO (e.g. 90% of budgeted CPU).
  • Send an alert Slack message of type Severe to engineers and engineering managers when the value has crossed that threshold (e.g. 95% of budgeted CPU).
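
The two-threshold example above can be sketched in a few lines of Python. This is only an illustration of how a value maps to a severity, not a description of how the alerting engine is implemented. The function name `classify` and the threshold table are hypothetical; the percentages come from the example above.

```python
from typing import Optional

# Hypothetical threshold table, ordered from most to least severe.
# The values mirror the example above: 95% -> SEVERE, 90% -> INFO.
THRESHOLDS = [("SEVERE", 95.0), ("INFO", 90.0)]


def classify(cpu_percent: float) -> Optional[str]:
    """Return the most severe level whose threshold is exceeded, or None."""
    for severity, limit in THRESHOLDS:
        if cpu_percent > limit:
            return severity
    return None


for value in (85.0, 92.5, 97.0):
    print(value, classify(value))
```

Checking the most severe threshold first means that a value above both thresholds is reported once, at the higher severity.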

Before you begin, ensure that you have the information for the required fields:

  • Alert data. For example, CPU of all production clusters. Be as specific as possible to speed up query execution.
  • Alert condition and associated severity. For example, it could be INFO severity if CPU of at least 1 cluster is at 90% for 5 minutes, but SEVERE if CPU of 75% of all clusters is at 90%.
  • Recipients. For each severity, you can specify an email, Slack notification, or one or more alert targets to notify when the alert changes state. When the alert changes state, each target that meets the condition is notified with the specified severity.

Step 0: Start Alert Creation

To start alert creation, do one of the following:

  • Alerts Browser - Click Alerting > Create Alert.
  • Chart - Click the ellipsis icon on the right of the query and select Create Alert from the menu.

Step 1a: Specify the Data to Watch and Alert On

In the Data section, specify the data that you want to monitor, optionally customize the chart, and click Next. You have many options:
  • Keep it simple, e.g. just specify ts() and a metric: ts(~sample.mem.used.percentage)
  • Use multiple queries, optionally with variables, to take advantage of the full power of Wavefront Query Language (WQL).
    Note: You must select one query as the alert query using the radio button. You can use results of other queries as chart variables in the selected query.
  • Use either WQL or PromQL.

Step 1b: Customize the Chart (Optional)

By default, each alert includes a line chart with a two-hour time window. You can modify the chart type, format, axis, and some other aspects of the chart. See the Chart Reference for background.

Important: The customizations for alert charts are more limited than the customizations for charts in dashboards.

Step 2: Specify Thresholds and Severities

1. In the Conditions section, specify thresholds for the alert. The threshold becomes visible in the chart.

You can alert when the query result is greater than or less than the specified threshold. Specify at least 1 threshold.

Note: If your Data query is a Boolean expression that includes a comparison operator, you can specify only one severity.
2. Click Test Condition to check if the alert would have fired in the current time window. Examine the test result, shown above the chart.
    Tip: Test Condition looks backward and does not always match how the alert fires in the future. See the FAQ below.
3. Optionally, fine-tune and test the condition.
  • Trigger Window: Length of time (in minutes) during which the Condition expression must be true before the alert fires. Minimum is 1. For example, if you enter 5, the alerting engine reviews the value of the condition during the last 5-minute window to determine whether the alert should fire.
  • Resolve Window (Optional): By default, the Resolve Window is set to the same number of minutes as the Trigger Window. Set the Resolve Window to greater than or equal to the Trigger Window to avoid resolve-fire cycles.

    The Resolve Window is the length of time (in minutes) during which the Condition expression must not be true before the alert switches to resolved. Minimum is 1.

4. For special cases, expand Additional Settings to also specify the following settings. The default is often best.
  • Checking Frequency: Number of minutes between checks of whether the condition is true. The default value is 5 minutes. When an alert is in the INVALID state, it is checked approximately every 15 minutes instead of at the specified checking frequency.
  • Evaluation Strategy: Allows you to select Real-time Alerting. By default, Operations for Applications ignores values for the last 1 minute to account for delays. This default evaluation strategy prevents spurious firings because many data sources are updated only at certain points in time. If you select this check box, the alerting engine considers values in the last 1 minute (the alert is evaluated strictly on the ingested data). See Limiting the Effects of Data Delays for some background.
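
To make the interplay of Trigger Window and Resolve Window concrete, here is a minimal simulation in Python. It assumes one Boolean condition value per minute and a check every minute, and it treats "must be true" as "true for every minute in the window". The real alerting engine also accounts for missing data, data delays, and the checking frequency, so treat this as a simplified model. All names are hypothetical.

```python
from typing import List


def simulate(condition: List[bool], trigger_window: int, resolve_window: int) -> List[str]:
    """Return the alert state after each minute for a per-minute condition series."""
    firing = False
    states = []
    for minute in range(1, len(condition) + 1):
        if not firing and minute >= trigger_window:
            # Fire only if the condition held for the whole trigger window.
            firing = all(condition[minute - trigger_window:minute])
        elif firing and minute >= resolve_window:
            # Resolve only if the condition was false for the whole resolve window.
            if not any(condition[minute - resolve_window:minute]):
                firing = False
        states.append("FIRING" if firing else "OK")
    return states


# True for 5 minutes, then a 2-minute dip, then true again.
series = [True] * 5 + [False] * 2 + [True] * 3

print(simulate(series, trigger_window=3, resolve_window=3))
print(simulate(series, trigger_window=3, resolve_window=1))
```

In this model, equal windows of 3 minutes ride out the short dip, while a Resolve Window of 1 resolves the alert during the dip and fires it again afterwards, which is the resolve-fire cycle that the guidance above is meant to avoid.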

Step 3: Specify Recipients

Alert recipients receive notifications when the alert changes state. For each severity, you can:
  • Specify any email address
  • Specify a PagerDuty key
  • Select from predefined alert targets. Alert targets allow fine-grained notification settings for a variety of messaging platforms (email, pager services) and communication channels.

Step 4 (Optional): Help Alert Recipients Resolve the Alert

If you already have information that helps recipients find the cause of the alert, specify it in the Content section:

  • Runbook: A URL to a wiki page, or another document that helps alert recipients resolve the alert.
  • Triage Dashboard(s): Start typing to select from dashboards on your Operations for Applications instance that have useful information, and optionally pass values to them. See How Do I Pass Values to Triage Dashboards.
  • Additional Information: Any other information that is useful to the alert recipient. This field supports Markdown. Click Preview to preview the Markdown output.
  • Related Logs: You see this field only if you have Logs permission. Click the plus button to add log tags. When an alert fires, you get a link to access the logs associated with these tags on the Logs Browser. Click Go To Logs to test if the tags work for your log data and refine the log search by adding or removing tags.
Click Additional Settings to further customize the notifications for special cases.
  • Resend Notifications: If selected, the notification of a firing alert is resent. You can specify the interval at which the notification is resent. By default, notifications are sent only when the alert changes state.
  • Unique PagerDuty Incidents: Select this option to receive separate PagerDuty notifications for each series that meets the alert conditions.
    For example, you get separate PagerDuty notifications for the two series below because the env tag is different.
    #first series
    app.errors source=machine env=prod
    
    #second series
    app.errors source=machine env=stage
    
  • Secure Metrics Details: If you are protecting metrics in your environment with metrics security policies, select this check box to send a simplified alert notification without metric details and alert images.
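
The Unique PagerDuty Incidents example above depends on the two series being distinct. The sketch below illustrates that idea in Python by treating a series as the combination of metric name, source, and tags. How the service actually builds PagerDuty incident keys is not described on this page, so `series_key` is a hypothetical helper used only for illustration.

```python
from typing import Dict, FrozenSet, Tuple

SeriesKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]


def series_key(metric: str, source: str, tags: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity for a series from metric, source, and tags."""
    return (metric, source, frozenset(tags.items()))


firing_series = [
    series_key("app.errors", "machine", {"env": "prod"}),
    series_key("app.errors", "machine", {"env": "stage"}),
    series_key("app.errors", "machine", {"env": "prod"}),  # same series as the first entry
]

# Two distinct series remain, so two separate notifications would be expected.
unique = set(firing_series)
print(len(unique))
```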

Click Preview Notification for a preview of the notification that users will see.

Step 5: Name and Activate the Alert

As a final step, you name the alert, optionally add alert tags, and activate the alert.

  1. (Required) Specify a Name that uniquely identifies the alert.
  2. (Optional) Specify one or more Alert tags. Tags make it easy to find alerts, for example, in the Alerts Browser.

Alert FAQs

Here are some frequently asked questions about alerts.

Why Can I Specify Only 1 Severity?

If your data query follows the format <expression> <comparisonOperator> <constant>, for example myCPU < 45000, the query itself already includes the condition.

In the example screenshot on the right, the threshold is 6000. Notice how the hover text shows either 0 or 1 for the different time series.
Because the threshold is predefined, you can select only 1 severity. All notifications will go to the same set of recipients with that severity.

If your query does NOT follow the <expression> <comparisonOperator> <constant> pattern, you can specify different thresholds and different severities.
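
A short Python sketch can show why a Boolean query leaves room for only one severity. It mimics the `myCPU < 45000` example: once the comparison is applied, each point is reduced to 0 or 1, so the original magnitude is no longer available for comparison against additional thresholds. The function name and sample values are hypothetical.

```python
from typing import List


def boolean_query(values: List[float], constant: float) -> List[int]:
    """Mimic `<expression> < <constant>`: 1 where the comparison holds, else 0."""
    return [1 if v < constant else 0 for v in values]


# Hypothetical myCPU samples.
my_cpu = [30000.0, 44000.0, 45000.0, 52000.0]

print(boolean_query(my_cpu, 45000))
```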

Who Gets Notified When the Alert Changes State?

We send alert notifications when the alert changes state.

  • An alert with a query that follows the pattern <expression> <comparisonOperator> <constant> sends a notification with the specified severity to all specified targets. This page calls this type of query a Boolean query.
  • A multi-threshold alert supports multiple severities and a different target for each severity. When the alert changes state, targets for conditions that meet the severity threshold are notified. Lower-severity targets always receive notifications for all higher severities.

For example, an alert fires when a metric stays at a value that indicates a problem for the specified amount of time. But you might also want to be notified when the alert is resolved or when the alert is snoozed. The alert target gives fine-grained control over which state changes trigger a notification.
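
The rule that lower-severity targets also receive higher-severity notifications can be sketched as follows. This is a simplified Python model, not the service's implementation. Only INFO and SEVERE appear on this page; the intermediate levels SMOKE and WARN, the function name, and the example addresses are assumptions.

```python
from typing import Dict, List

# Severity order from lowest to highest. SMOKE and WARN are assumed
# intermediate levels; this page mentions only INFO and SEVERE.
SEVERITY_ORDER = ["INFO", "SMOKE", "WARN", "SEVERE"]


def recipients_for(severity: str, targets: Dict[str, List[str]]) -> List[str]:
    """Collect targets configured at `severity` and at every lower severity."""
    rank = SEVERITY_ORDER.index(severity)
    notified: List[str] = []
    for level in SEVERITY_ORDER[: rank + 1]:
        notified.extend(targets.get(level, []))
    return notified


# Hypothetical recipients, one list per severity.
targets = {
    "INFO": ["engineers@example.com"],
    "SEVERE": ["managers@example.com"],
}

print(recipients_for("INFO", targets))
print(recipients_for("SEVERE", targets))
```

In this model, a SEVERE notification reaches both lists, while an INFO notification reaches only the INFO recipients.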

What’s an Alert Target?

Each alert is associated with one or more recipients: an email address, PagerDuty key, or alert target.

When the alert changes state, the recipients are notified. With an alert target, you can customize which state changes trigger a notification, for example, Alert Status Updated and Alert Resolved.

The maximum number of email alert targets is:

  • 10 for alerts with Boolean queries that follow the pattern <expression> <comparisonOperator> <constant>.
  • 10 per severity for multi-threshold alerts.

If you exceed the number, you receive a message like the following:

{"status":{"result":"ERROR","message":"Invalid notification specified: null","code":400}}
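
If you create alerts from a script, you might want to surface this message in a readable form. The Python sketch below parses the exact message shown above with the standard `json` module. It assumes the message arrives as a JSON response body; the helper name `describe_failure` is hypothetical.

```python
import json

# The error message shown above, assumed here to be a JSON response body.
response_body = (
    '{"status":{"result":"ERROR",'
    '"message":"Invalid notification specified: null","code":400}}'
)


def describe_failure(body: str) -> str:
    """Return 'code: message' for an error status, or 'ok' otherwise."""
    status = json.loads(body).get("status", {})
    if status.get("result") == "ERROR":
        return f"{status.get('code')}: {status.get('message')}"
    return "ok"


print(describe_failure(response_body))
```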

My Alert Fires with Test Condition, But Not In Production

Test Condition is useful in fine-tuning an alert, but doesn’t always match what happens in production.

For example:

  • If data comes in late, Test Condition won't match the actual alert firing. Data are visible looking back, but might not be there in real time.
  • If data meet the alert condition for the Trigger Window duration, the actual alert might still not fire because the alert check, which runs at the Checking Frequency interval, happens too soon or too late.

In both cases, Test Condition shows that the condition was met, but the actual alert might not fire.
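
The timing case can be illustrated with a toy Python model. It assumes one Boolean condition value per minute, a 5-minute Trigger Window, and a check every 5 minutes. How Test Condition evaluates data internally is not described on this page, so the minute-by-minute look-back below is an assumption, and all names are hypothetical.

```python
from typing import List


def window_true(condition: List[bool], end_minute: int, window: int) -> bool:
    """True if the condition held for every minute in the window ending at end_minute."""
    if end_minute < window:
        return False
    return all(condition[end_minute - window:end_minute])


# The condition is true for exactly 5 minutes: minutes 3 through 7 of a 10-minute span.
condition = [False, False, True, True, True, True, True, False, False, False]
trigger_window = 5
checking_frequency = 5

# Looking back, every minute can be inspected.
backtest = [m for m in range(1, 11) if window_true(condition, m, trigger_window)]

# The live check runs only every `checking_frequency` minutes.
live = [
    m
    for m in range(checking_frequency, 11, checking_frequency)
    if window_true(condition, m, trigger_window)
]

print(backtest, live)
```

In this model, the look-back finds a full window ending at minute 7, but the live checks at minutes 5 and 10 each see a window that includes a false value, so the alert never fires.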

How Do I Pass Values to Triage Dashboards?

The Content section allows you to specify one or more triage dashboards. For each dashboard, you can preset one or more dashboard variables so that the user sees what they’re interested in when they go to the triage dashboard. Here’s an example that uses the predefined Cluster Metrics Exploration dashboard that’s part of the Tour Pro integration as the target dashboard.

  1. In the target dashboard, show the predefined dashboard variables.
  2. Ensure that you know the variable names and possible values. In this example, the variable name is env and the value we want to set is dev.
  3. In the alert dialog, specify the name and value to set.

Edit Alerts

Users with the Alerts permission can change an alert at any time. The options are similar to what you see when you create an alert, but you can quickly focus on the things that you want to change.

Start the Alert Edit

  1. Click Alerting > All Alerts from the toolbar to display the Alerts Browser.
  2. Click the alert name, or click the ellipsis icon next to the alert and select Edit.

    You can search for the alert by name, status, alert tag, etc. See Performing Searches.
  3. Make changes (see next section).
  4. Click Show Firings at any time to see when the alert fired and fine-tune the behavior based on that information.

Change the Alert Properties

You can change the alert properties when you edit the alert.

Alert Name and Tags

In this section:
  • Change the alert name.
  • Click the X on any tag to remove it from the alert.
  • Click +Tag to add a tag to the alert.

Data

In this section:
  • Change the data to alert on.
  • Edit the existing alert query. For example, add filters to fine-tune the query. See Query Language Quickstart for background and a video or Query Reference if you're an advanced user.
  • Fine-tune the alert image. See the Chart Reference for details.

Conditions

In this section, you can fine-tune the alert condition and test the condition.
  • Change the alert threshold or thresholds and severity.
  • Change the Trigger Window and Resolve Window values.
  • Change the Checking Frequency and Evaluation Strategy values.
See Specify Thresholds and Severities for details on each option.

Recipients

In this section, you can view, change, or add recipients of alert notifications.
  • Specify one or more recipients for each severity.
  • You can specify an email address, PagerDuty key, or alert target that has already been created.
  • Recipients specified for a severity receive notifications for that severity and for all higher severities.
  • As a result, you cannot specify the same recipient for multiple severities. Most likely, the recipient already receives the notification because, for example, when an alert notification is sent at the SEVERE level, it also goes to all recipients at lower levels.

Content

In this section, you can add runbook URLs and specify other information that can help with alert resolution.
  • The Runbook URL can point to internal information.
  • Start typing to select from dashboards in your environment. You can set dashboard variables for the dashboard with the Pass option. See How Do I Pass Values to Triage Dashboards.
  • Specify other information you want included in the notification in the Additional Information section.

Save Your Changes

Click Save in the top right to save your changes.

Delete an Alert

You delete an alert from the Alerts Browser page. Only users with the Alerts permission can delete an alert.

  1. Click Alerting > All Alerts from the toolbar to display the Alerts Browser.
  2. Click the ellipsis icon next to the alert.
  3. Select Delete and confirm the deletion.

Restore a Deleted Alert

You can restore an alert from the trash if it was deleted less than 30 days ago and it wasn’t permanently deleted. You restore deleted alerts from the Alerts Browser page. Only users with the Alerts permission can restore a deleted alert.

  1. Click Alerting > All Alerts from the toolbar to display the Alerts Browser.
  2. Click All from the top right and select Deleted.
  3. Click the ellipsis icon next to the alert.
  4. Select Restore.

Restore an Alert Version

Each time you save an alert, you create an alert version. Up to 100 versions are supported.

  1. Find the alert in the Alerts Browser.
  2. Click the ellipsis icon and select Versions.
  3. Select a version.

Do More!