Get help when you have problems with your Kubernetes setup.

Depending on your setup, you typically deploy the following components into your Kubernetes cluster:

  • The Wavefront Collector for Kubernetes, which gathers metrics from each node.
  • The Wavefront proxy, which forwards the collected data to the Wavefront service.

Once deployed, the collector instances gather data at regular intervals from various sources and send the data to Wavefront via the Wavefront proxy.

Troubleshoot Using the Wavefront Collector Dashboard

The Wavefront Collector emits internal metrics that you can use to troubleshoot issues.

The Wavefront Collector Metrics dashboard in the Kubernetes integration shows these metrics.

screenshot of Kubernetes metrics

Troubleshoot Using the Data Collection Flow

In Kubernetes, a Node can be thought of as a virtual machine that runs several applications and services; these applications and services run as Pods. The Wavefront Collector deploys an instance on each Node to collect metrics from the Pods running there.

The Pods that the Wavefront Collector collects metrics from are referred to as a Source.

Next, the metrics from each Source are sent to the Wavefront Sink, and from there to the Wavefront service through the Wavefront proxy.

Because the Wavefront Collector runs on each Node, metrics that describe the Kubernetes environment or the cluster as a whole, such as cluster metrics, would otherwise be reported multiple times. To avoid reporting the same metric several times, one Wavefront Collector instance is elected as the leader and performs the tasks that only need to be done once.
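
Because one collector instance runs on every node, you can see the node-to-collector mapping directly. The label below is the same one used elsewhere in this guide; the output varies by cluster.

    kubectl get pods -l k8s-app=wavefront-collector -n <NAMESPACE> -o wide

The NODE column in the output shows which node each collector pod runs on.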

The following diagram shows how the data flows from your Kubernetes environment to Wavefront.

Kubernetes Collector Data Flow Diagram

Problems arise when data doesn't flow from one component to the next or when a component is misconfigured.

For example, identifying which metrics reach Wavefront and which don't helps you know where to look.

To troubleshoot data collection, follow the data flow from the source to Wavefront to find where the flow is broken:

  • Individual processes in the flow can cause problems.
  • Connections between processes can cause problems.



Symptom: No Data Flowing into Wavefront

Step 1: Verify that the Collector is Running

Highlights Kubernetes collector on the Kubernetes Collector data flow diagram

  • Run kubectl get pods -l k8s-app=wavefront-collector -n <NAMESPACE> to verify that all collector instances are ready and available.
  • If pods are marked as not ready, run kubectl describe pod <POD_NAME> -n <NAMESPACE> and check the Events section for the reason (see the example output below).
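
As a rough reference, a healthy collector DaemonSet looks something like the following; pod names, counts, and ages are illustrative and will differ in your cluster.

    $ kubectl get pods -l k8s-app=wavefront-collector -n <NAMESPACE>
    NAME                        READY   STATUS    RESTARTS   AGE
    wavefront-collector-7zxkq   1/1     Running   0          3d
    wavefront-collector-b5tn8   1/1     Running   0          3d

A pod that shows 0/1 under READY, a status such as CrashLoopBackOff, or a climbing RESTARTS count needs further investigation with kubectl describe and kubectl logs.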

Step 2: Verify that the Proxy is Running

Highlights Wavefront proxy on the Kubernetes Collector data flow diagram

  • Run kubectl get deployment wavefront-proxy -n <NAMESPACE> to verify the proxy instances are ready and available.
  • Run kubectl get pods -l app=wavefront-proxy -n <NAMESPACE> to verify there are no pod restarts.
  • Run kubectl logs <POD_NAME> -n <NAMESPACE> to check the proxy logs for errors connecting to the Wavefront service.
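
Exact log messages vary by proxy version, so when scanning the logs it is safest to grep broadly, for example:

    kubectl logs deployment/wavefront-proxy -n <NAMESPACE> --tail=200 | grep -iE 'error|unable|failed|40[13]'

Connection or authentication failures to the Wavefront service usually surface as repeated errors in this output.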

Step 3: Verify that the Collector Can Connect to the Proxy

Highlights arrow from the sink to the Wavefront proxy on the Kubernetes Collector data flow diagram

  • List collector pods with kubectl get pods -l k8s-app=wavefront-collector -n <NAMESPACE>.
  • Check the collector logs for errors sending points to the proxy with kubectl logs <POD_NAME> -n <NAMESPACE>.
  • To make sure that the collector can communicate with the proxy, verify that the proxyAddress in the collector's sink configuration points to the proxy service and port (see the sketch below).
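
The following is a minimal sketch of the sink configuration. The service name wavefront-proxy and port 2878 are assumptions; match them to how your proxy service is actually exposed.

    sinks:
    - proxyAddress: wavefront-proxy.<NAMESPACE>.svc.cluster.local:2878

If the address or port is wrong, the collector logs typically show repeated errors sending points to the proxy.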

Step 4: Verify that the Proxy Can Connect to Wavefront

Highlights arrow from the wavefront proxy to the wavefront service on the Kubernetes Collector data flow diagram

See Monitor Wavefront Proxies for monitoring and troubleshooting the proxy.
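
As an additional, rough sanity check of network reachability from inside the cluster to your Wavefront instance, you can run a throwaway curl pod. Replace <YOUR_INSTANCE> with your Wavefront URL; any HTTP response indicates the endpoint is reachable, but this does not validate the proxy's token or configuration.

    kubectl run wavefront-connectivity-test -n <NAMESPACE> -i --rm --restart=Never \
      --image=curlimages/curl --command -- curl -sI https://<YOUR_INSTANCE>.wavefront.com/api/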


Symptom: Incomplete Data in Wavefront

Step 1: Verify Collection Source Configurations

Highlights the source box on the Kubernetes Collector data flow diagram

See the Wavefront Collector Configurations to verify that the collector is configured correctly.
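
As a point of reference, a typical sources section of the collector configuration looks roughly like the following. The exact keys and source types vary by collector version and by what you chose to collect, so treat this as a sketch rather than your actual configuration.

    sources:
      kubernetes_source:          # node, pod, and container metrics from the kubelet
        prefix: kubernetes.
      prometheus_sources:         # additional Prometheus-format endpoints to scrape
      - url: http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics
        prefix: kube.state.

Confirm that every source you expect data from actually appears in your configuration and that its URL or discovery rule is correct.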


Step 2: Verify Filter Configuration

You can filter out data flowing into Wavefront at multiple points:

  • In some cases, the application (App Pod) can filter metrics and decide which metrics are available to collect. A common example of this is kube-state-metrics. See the kube-state-metrics documentation for configuration options. Highlights the app pod on the Kubernetes Collector data flow diagram

  • The Wavefront Kubernetes collector allows two levels of filtering internally, shown in the picture below.
    1. Filter metrics at the source level.
    2. Filter all metrics sent from the collector to Wavefront.

    Run kubectl get configmap collector-config -n <NAMESPACE> -o yaml and check both your source configuration and sink configuration for filters (see the sketch after this list). See Prefix, tags, and filter configurations for the Wavefront Collector.

    Highlights the source and sink on the Kubernetes Collector data flow diagram

  • Filter or rename metrics at the Proxy before sending them to Wavefront. See Wavefront proxy preprocessor rules. More information can be found in Monitoring Wavefront Proxies.

    Highlights arrow from the sink to the Wavefront proxy on the Kubernetes Collector data flow diagram
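
The following is a rough sketch of where the two filter levels live in the collector configuration. The filter key names (for example, metricDenyList versus the older metricBlacklist) vary by collector version, and the Prometheus target shown is hypothetical.

    sources:
      prometheus_sources:
      - url: http://my-app.<NAMESPACE>.svc:9102/metrics   # hypothetical target
        filters:
          metricDenyList:          # 1. filter at the source level
          - 'myapp.debug.*'
    sinks:
    - proxyAddress: wavefront-proxy:2878
      filters:
        metricDenyList:            # 2. filter everything the collector sends to Wavefront
        - 'kubernetes.sys_container.*'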


Step 3: Verify Metric Naming Configuration

If metrics reach Wavefront but under unexpected names, check the prefix settings in the collector configuration. See Prefix, tags, and filter configurations for the Wavefront Collector.

Step 4: Check Collector Health

Check for Leader Health Problems

The leader Wavefront Collector pod collects the metrics from its own node as well as metrics that are not specific to any one node, for example, cluster metrics, service monitoring metrics, and metrics from applications that use explicit rule definitions.

If a leader pod crashes because of insufficient resources, another pod automatically takes over the leader role. If the new leader hits the same problem and also crashes, the cycle continues. This leads to inconsistencies in data collection.

To check for leader health problems:

  1. Open the Wavefront Collector Metrics dashboard in the Kubernetes integration.
  2. In the Troubleshooting section, check whether the Leader Election chart is red or shows spikes in the number of times the leadership changed.

If you see these symptoms, then there are leader health problems.

The following screenshot shows that there are no leader health problems.

Leadership Election in a good state on the Collector Troubleshooting Dashboard


Check for Insufficient CPU Symptoms

To check for insufficient CPU, follow these steps:

  1. Open the Wavefront Collector Metrics dashboard in the Kubernetes integration.

  2. In the Troubleshooting section, see the Collector Restarts chart. If the chart is red or shows spikes, there are memory or CPU issues. See Fine-Tune the Time Window to customize the time window of the chart.

    The following screenshot shows that there are collector restarts. Collector restarts happening on the Collector Troubleshooting Dashboard

  3. In the Wavefront Collector Metrics dashboard's Data Collection section, check the Collection Latency chart. If you customized the time window in the steps above, make sure the same time window is selected for this chart.

    Collector Latency Graph on the collector Troubleshooting Dashboard


Check for Insufficient Memory Symptoms

To check for insufficient memory, follow these steps:

  • Run kubectl get pods -l app.kubernetes.io/component=collector -n <NAMESPACE> to find collector pods that have been restarting.
    • If a collector pod shows frequent restarts, check the termination reason by running kubectl describe pod <POD_NAME> -n <NAMESPACE> (see the example after this list).
    • OOM errors show that the collector has insufficient memory resources to run.
  • If your collector does not show OOM as its termination reason, check the logs for other errors by running kubectl logs <POD_NAME> -n <NAMESPACE>.
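
To confirm an out-of-memory kill, inspect the container's last termination state; a reason of OOMKilled means the pod hit its memory limit. The pod name below is a placeholder.

    kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A 5 'Last State'
    # Look for output similar to:
    #   Last State:  Terminated
    #     Reason:    OOMKilled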

To solve these problems, see Remedies for CPU or Memory Problems below.


Remedies for CPU or Memory Problems

  • Increase CPU or memory limits: The easiest way to resolve memory or CPU issues is to increase the memory and CPU limits. To determine the CPU or memory limit, see the charts in the Wavefront Collector Metrics dashboard’s Troubleshooting section.

  • Determine CPU limit: If you see high collector latencies, you need to adjust the CPU limit. When the collector is throttled due to lack of CPU availability, it leads to memory issues too. The process of increasing CPU is trial and error. Adjust the limit and monitor the collector latency graph after the update. If the latencies level out, you have found the correct CPU limit.

  • Determine memory limit by following these steps:
    1. Open the Wavefront Collector Metrics dashboard and see Collector Memory Usage (Top 20) chart on the Troubleshooting section. This chart gives you an idea of the limits you need.
    2. Customize the time frame on the chart to 2-4 days using the (-) icon on the chart and look for spikes. The spikes are most likely created by the elected leader.
    3. Start by setting the limit about 10% above the maximum value you find, then monitor the results. For example, based on the screenshot below, the maximum value is 43.5 Mi, so you can start with a limit of 47.85 Mi (43.5 x 1.10) and monitor the progress.

    Collector Memory Usage Graph on the Collector Troubleshooting Dashboard

  • Change the CPU or memory limit based on how the Wavefront Collector is installed:
    • Helm Deployments
      • Option 1: For details on updating the CPU and memory limits in Helm charts, see Parameters.
      • Option 2: Update the cpu and memory limits in the values.yaml file (see the sketch after this list).
    • Manual Deployments
      Update the container settings in the daemonset definition. The default limits are:
      resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
      
  • Reduce the collection load: Reduce the number of metrics that are collected so that the collector needs less CPU and memory in the first place. For example, reduce the load at the application pod or source level; this grants the largest reduction in overall load on the system, because fewer resources are required to remove a source than to filter downstream at the collector.

    • Filter metrics:

      • Configure the Wavefront collector to remove sources: If you have statically defined sources, comment out or remove sources that emit a large number of metrics from the sources list in the collector configuration file. This method removes the metrics with minimal processing, reducing the CPU and memory load on the collector.
      • Filter metrics at the source: Sources scraped by the collector have a way of filtering out metrics. You can filter the metrics on the source or from the Wavefront collector:
        • Some applications let you configure the metrics they produce. If your application can do that, you can reduce the metrics collected before the metrics are sent to the collector.
        • Change the Wavefront Collector source configuration to filter out metrics you don't need.
    • Disable Auto-Discovery: If the load is still high, you might be scraping pods based on annotations that the collector finds, which is standard for Helm charts or widely used containers. Disable auto-discovery and see if the load goes down (see the sketch after this list). If this works and you don't want the pods to be scraped in the future, remove the annotations.
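
For Helm deployments, the limits and the discovery toggle typically live in values.yaml. The key paths below are assumptions that vary by chart version, so verify them against the chart's Parameters documentation before applying.

    # values.yaml (illustrative; verify key paths against your chart version)
    collector:
      resources:
        limits:
          cpu: 1000m
          memory: 1024Mi
      discovery:
        enabled: false   # corresponds to the Disable Auto-Discovery remedy above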


Step 5: Check for Data Collection Errors

Use these metrics to help troubleshoot issues with data collection:

  • kubernetes.collector.target.collect.errors: Counter showing the number of errors collecting data from a target (pod, service, and so on).
    You can see this data on the Collection Errors Per Endpoint chart in the Troubleshooting section of the Wavefront Collector Metrics dashboard.
  • kubernetes.collector.source.collect.errors: Counter showing the number of errors per plugin type (prometheus, telegraf, and so on).
    You can see this data on the Collection Errors Per Type chart in the Troubleshooting section of the Wavefront Collector Metrics dashboard.
  • kubernetes.collector.target.points.collected: Counter showing the number of points collected from a single target (pod, service, and so on), as a per-second rate.
    You can see this data on the Points Collected Per Target (Top 20) chart in the Data Collection section of the Wavefront Collector Metrics dashboard.

Check the source of these metrics to identify the specific Kubernetes node on which the collector is running. Then check the logs for that collector instance for further troubleshooting.
