Google ML Engine Integration | VMware Aria Operations for Applications Documentation

Learn about the Google ML Engine Integration.

This page provides an overview of what you can do with the Google ML Engine integration. Only a limited number of integration documentation pages contain the setup steps and instructions. If you do not see the setup steps here, navigate to the Operations for Applications GUI. The detailed instructions for setting up and configuring all integrations, including the Google ML Engine integration, are on the Setup tab of the integration.

Log in to your Operations for Applications instance.
Click Integrations on the toolbar, search for and click the Google ML Engine tile.
Click the Setup tab to see the most up-to-date instructions.

Google Cloud Platform Integration

The Google Cloud Platform integration is a full-featured native integration offering agentless data ingestion of GCP metric data, as well as pre-defined dashboards and alert conditions for certain GCP services.

Metrics Configuration

Operations for Applications ingests Google Cloud Platform metrics using the v3 Stackdriver Monitoring APIs. For details on the metrics, see the metrics documentation.

Metrics originating from Google Cloud Platform are prefixed with gcp. within Operations for Applications. Once the integration has been set up, you can browse the available GCP metrics in the metrics browser.
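The metric names listed on this page look like Google Cloud Monitoring metric types with the `.googleapis.com` part dropped, slashes turned into dots, and `gcp.` added in front. The following Python sketch shows that naming pattern. It is an inference from the names on this page, not documented behavior, and the helper name `to_gcp_metric_name` is hypothetical. Ingested names may also carry an aggregation suffix, which the sketch does not add.

```python
def to_gcp_metric_name(metric_type: str) -> str:
    """Guess the Operations for Applications name for a GCP metric type.

    Assumption: the mapping drops ".googleapis.com", joins the remaining
    path segments with dots, and adds the "gcp." prefix.
    """
    service, _, path = metric_type.partition("/")
    suffix = ".googleapis.com"
    if service.endswith(suffix):
        service = service[: -len(suffix)]
    # Ignore empty segments so a trailing slash does not produce a trailing dot.
    segments = [service] + [part for part in path.split("/") if part]
    return "gcp." + ".".join(segments)


if __name__ == "__main__":
    for metric_type in (
        "ml.googleapis.com/prediction/error_count",
        "compute.googleapis.com/instance/cpu/utilization",
    ):
        print(metric_type, "->", to_gcp_metric_name(metric_type))
```

Use the metrics browser to confirm the actual names after the integration is set up.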

Dashboards

Operations for Applications provides Google Cloud Platform dashboards for the following services:

Google App Engine
Google BigQuery
Google Cloud Bigtable
Google Cloud Billing
Google Cloud Datastore
Google Cloud Dataproc
Google Cloud Functions
Google Cloud Logging
Google Cloud Pub/Sub
Google Cloud Router
Google Cloud Spanner
Google Cloud Storage
Google Cloud VPN
Google Compute Engine
Google Container Engine
Google Firebase
Google Kubernetes Engine
Google ML Engine

Alerts

The Google Cloud Platform integration dashboard contains predefined alert conditions. These conditions are embedded as queries in the dashboard’s charts. For example:

images/alert_condition.png

To create the alert, click the Create Alert link under the query and configure the alert properties (notification targets, condition checking frequency, etc.).

Add a GCP Integration

Adding a Google Cloud Platform (GCP) integration requires establishing a trust relationship between GCP and VMware Aria Operations for Applications (formerly known as Tanzu Observability by Wavefront). The minimum required permissions depend on the services that you use. See Google Cloud Platform Overview and Permissions for details.

The overall process involves the following:

Creating a service account in Google Cloud
Giving that account viewer privileges
Downloading a JSON private key
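Before pasting the downloaded key into the integration, it can help to confirm that the text is a service account key and not some other credential file. The following Python sketch checks for the fields that a Google Cloud service account key file normally contains. The function name `check_service_account_key` and the choice of which fields to require are assumptions made for illustration. The check is local only and does not verify that the key works or has viewer privileges.

```python
import json

# Fields normally present in a Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key_id", "private_key", "client_email")


def check_service_account_key(raw: str) -> list:
    """Return a list of problems found in the key text; an empty list means it looks usable."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: " + exc.msg]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    problems = ["missing field: " + field for field in REQUIRED_FIELDS if not key.get(field)]
    # A present but different type usually means the wrong credential file was downloaded.
    if key.get("type") and key["type"] != "service_account":
        problems.append("type is %r, expected 'service_account'" % key["type"])
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_key.py path/to/key.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        issues = check_service_account_key(handle.read())
    print("Key looks usable." if not issues else "\n".join(issues))
```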

To register a Google Cloud Platform integration:

In the Name text box, enter a meaningful name.
In the JSON key text box, enter your JSON key to give read-only access to a GCP project. Note: The JSON key is securely stored and never exposed except for read-only access to the GCP APIs.
(Optional) Select the categories to fetch.
(Optional) In the Metric Allow List text box, you can add metrics to an allow list by entering a regular expression.

For example, to monitor all the CPU metrics coming from the Compute Engine, enter ^gcp.compute.instance.cpu.*$.

Note: Metric names consist of the actual metric name and a suffix (starting with an underscore (“_”) or a dot (“.”)). The suffix represents an aggregation type, such as count, rate, min, max, sumOfSquaredDeviation, mean, and so on. In the regular expression, you must use the actual metric name without the aggregation type.

For example, for the Google Cloud Pub/Sub Engine, we collect a number of metrics, and some of them contain a suffix:

Push request latencies metrics:
- gcp.pubsub.subscription.push_request_latencies.bucket
- gcp.pubsub.subscription.push_request_latencies.count
- gcp.pubsub.subscription.push_request_latencies.mean
- gcp.pubsub.subscription.push_request_latencies.sumOfSquaredDeviation
Here, the actual metric name is gcp.pubsub.subscription.push_request_latencies, while bucket, count, mean, and sumOfSquaredDeviation are the aggregation types. When you create the regular expression, you must use only gcp.pubsub.subscription.push_request_latencies. For example, ^gcp.pubsub.subscription.push_request_latencies$.

Cumulative count of messages acknowledged by Acknowledge requests, grouped by delivery type:
- gcp.pubsub.subscription.ack_message_count_count
- gcp.pubsub.subscription.ack_message_count_rate
Here, the actual metric name is gcp.pubsub.subscription.ack_message_count, while _count and _rate are the aggregation types. When you create the regular expression, you must use only gcp.pubsub.subscription.ack_message_count. For example, ^gcp.pubsub.subscription.ack_message_count$.
(Optional) In the Additional Metric Prefixes text box, enter a comma-separated list of additional metric prefixes. Metric names that start with these prefixes are imported in addition to the categories that you selected.
(Optional) Change the Service Refresh Rate. The default is 5 minutes.
(Optional) Select whether you want to enable Histogram metrics ingestion.
1. (Optional) Select which histogram metrics to ingest.
  - All - The default option. All histogram metrics are ingested.
  - Custom - Allows you to ingest particular histogram metrics based on their Google Cloud Platform grouping functions, such as Count, Mean, and Standard Deviation. When you select a grouping function, only the histogram metrics with the respective grouping function will be ingested. If you deselect all check boxes, all histogram metrics will be ingested.
2. (Optional) Select to enable Detailed Histogram Metrics, Delta Counts, and Pricing & Billing information.
  
  Note: Enabling Detailed Histogram Metrics and Delta Counts will increase your ingestion rate and costs.
  
  If you select to enable the Pricing & Billing information, you must also provide an API key.
Click Register.
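The Metric Allow List note in the steps above can be tried out locally before registering. The following Python sketch assumes that the allow list expression is tested against the actual metric name, with the aggregation suffix already removed. The helper names and the suffix list are hypothetical, and the integration's own matching logic may differ. The suffix stripping is meant only for names as ingested, which carry a suffix.

```python
import re

# Aggregation types named on this page; the real list may be longer.
AGGREGATION_SUFFIXES = ("count", "rate", "min", "max", "sumOfSquaredDeviation", "mean", "bucket")


def base_metric_name(ingested_name: str) -> str:
    """Strip a trailing ".<type>" or "_<type>" aggregation suffix from an ingested name."""
    for suffix in AGGREGATION_SUFFIXES:
        for separator in (".", "_"):
            tail = separator + suffix
            if ingested_name.endswith(tail):
                return ingested_name[: -len(tail)]
    return ingested_name


def is_allowed(pattern: str, ingested_name: str) -> bool:
    """Assumed behavior: the expression is applied to the name without its suffix."""
    return re.search(pattern, base_metric_name(ingested_name)) is not None


if __name__ == "__main__":
    pattern = r"^gcp.pubsub.subscription.push_request_latencies$"
    name = "gcp.pubsub.subscription.push_request_latencies.mean"
    print(is_allowed(pattern, name))              # expression written against the actual name
    print(re.search(pattern, name) is not None)   # same expression against the suffixed name
```

In a regular expression, an unescaped dot matches any character, so `^gcp.compute.instance.cpu.*$` also matches names where another character stands in place of a dot. If you want literal dots only, a stricter form such as `^gcp\.compute\.instance\.cpu\..*$` is an option.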

Metric Name	Description
gcp.ml.prediction.error_count	The cumulative count of prediction errors.
gcp.ml.prediction.latencies	The latency of overhead, model, or user type.
gcp.ml.prediction.online.accelerator.duty_cycle	The average fraction of time over the past sample period during which the accelerators were actively processing.
gcp.ml.prediction.online.accelerator.memory.bytes_used	The amount of accelerator memory allocated by the model replica.
gcp.ml.prediction.online.cpu.utilization	The fraction of CPU allocated by the model replica and currently in use. May exceed 100% if the machine type has multiple CPUs.
gcp.ml.prediction.online.memory.bytes_used	The amount of memory allocated by the model replica and currently in use.
gcp.ml.prediction.online.network.bytes_received	The number of bytes received over the network by the model replica.
gcp.ml.prediction.online.network.bytes_sent	The number of bytes sent over the network by the model replica.
gcp.ml.prediction.online.replicas	The number of active model replicas.
gcp.ml.prediction.online.target_replicas	The desired number of active model replicas.
gcp.ml.prediction.prediction_count	The cumulative count of predictions.
gcp.ml.prediction.response_count	The cumulative count of different response codes.
gcp.ml.training.accelerator.memory.utilization	The fraction of allocated accelerator memory that is currently in use.
gcp.ml.training.accelerator.utilization	The fraction of allocated accelerator that is currently in use.
gcp.ml.training.cpu.utilization	The fraction of allocated CPU that is currently in use.
gcp.ml.training.memory.utilization	The fraction of allocated memory that is currently in use.
gcp.ml.training.network.received_bytes_count	The number of bytes received by the training job over the network.
gcp.ml.training.network.sent_bytes_count	The number of bytes sent by the training job over the network.