Etcd Monitoring

Learning how to monitor etcd is vital when running Arrikto EKF in production. Monitoring etcd lets you validate that EKF performs as expected and helps you detect and troubleshoot issues in a timely manner.

Inspecting the performance and status of Rok etcd is key to keeping the Rok cluster of your EKF installation healthy and functional. The Rok Monitoring Stack increases your observability into the way your Rok cluster interacts with etcd by collecting and visualizing the Prometheus metrics that etcd exposes directly. This helps you maintain high levels of performance and availability.

This guide also walks you through the steps to access the Rok etcd Grafana dashboard. Here is what you will need to follow them:

Important

Before proceeding, ensure that you have been granted proper rights to access the Rok Monitoring Stack UI. Currently, access to the Rok Monitoring Stack is allowed only to admin users.

Introduction

etcd exposes its metrics in the Prometheus format at the /metrics HTTP endpoint. You can use these metrics for real-time monitoring and debugging. However, etcd does not persist its metrics on its own: they are reset whenever etcd restarts.
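To make the shape of the /metrics output concrete, here is a minimal Python sketch that parses a few lines of the Prometheus text exposition format. The sample text is a hypothetical excerpt written for illustration, not output captured from a Rok etcd instance, and the parser is deliberately simplified (it assumes sample lines carry no trailing timestamp).

```python
# Hypothetical excerpt of an etcd /metrics response (values are made up).
SAMPLE = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 120
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 150
etcd_disk_wal_fsync_duration_seconds_sum 0.42
etcd_disk_wal_fsync_duration_seconds_count 150
"""


def parse_metrics(text):
    """Parse Prometheus text format into (metric_types, samples).

    metric_types maps a metric name to its declared type (from "# TYPE" lines).
    samples maps a series string (name plus labels) to its float value.
    Simplification: sample lines are assumed to have no trailing timestamp.
    """
    metric_types = {}
    samples = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)
            if len(parts) == 4 and parts[1] == "TYPE":
                metric_types[parts[2]] = parts[3]
            continue  # HELP lines and other comments are ignored
        # The value is the last whitespace-separated token on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return metric_types, samples


metric_types, samples = parse_metrics(SAMPLE)
for series, value in samples.items():
    print(series, value)
```

On a live cluster the same text would come from an HTTP GET on the etcd /metrics endpoint; the address and any TLS settings depend on your deployment and are not shown here.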

To persist etcd metrics on Kubernetes, the Rok Monitoring Stack creates a ServiceMonitor custom resource in the namespace where Rok is deployed. This resource configures Rok Prometheus to periodically pull metrics from Rok etcd and save them in its time-series database.
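As a rough illustration of what such a resource looks like, the Python sketch below assembles a ServiceMonitor manifest as a plain dictionary and prints it as JSON. The field layout follows the Prometheus Operator ServiceMonitor schema, but every concrete value (the resource name, namespace, label selector, port name, and scrape interval) is a hypothetical placeholder; the actual resource that the Rok Monitoring Stack creates may differ.

```python
import json


def build_service_monitor(name, namespace, match_labels, port, interval="30s"):
    """Return a ServiceMonitor manifest as a dict.

    All argument values are supplied by the caller; nothing here is taken
    from a real Rok deployment.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select the Service(s) whose endpoints should be scraped.
            "selector": {"matchLabels": dict(match_labels)},
            "namespaceSelector": {"matchNames": [namespace]},
            # Scrape the named Service port at /metrics on a fixed interval.
            "endpoints": [
                {"port": port, "path": "/metrics", "interval": interval}
            ],
        },
    }


# Hypothetical values, chosen only for illustration.
manifest = build_service_monitor(
    name="example-etcd",
    namespace="example-namespace",
    match_labels={"app": "example-etcd"},
    port="client",
)
print(json.dumps(manifest, indent=2))
```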

Note

By default, Rok Prometheus retains metrics for 3 days.

Metrics

Below you can view the categories of etcd v3.3 metrics:

  1. Stable metrics, under the etcd_ prefix.
  2. Debugging metrics, under the etcd_debugging_ prefix.
  3. System and Go application metrics, under the process_ and go_ prefixes.
  4. gRPC server metrics, under the grpc_server_ prefix.
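The prefix scheme above can be sketched as a small Python lookup. This is an illustrative helper, not part of etcd or the Rok Monitoring Stack; the function and category labels are made up for this example. Note that the etcd_debugging_ prefix has to be checked before the more general etcd_ prefix.

```python
# Ordered from most specific to least specific prefix, so that
# "etcd_debugging_" is matched before the broader "etcd_".
PREFIX_CATEGORIES = [
    ("etcd_debugging_", "debugging"),
    ("etcd_", "stable"),
    ("process_", "system and Go application"),
    ("go_", "system and Go application"),
    ("grpc_server_", "gRPC server"),
]


def categorize(metric_name):
    """Return the category of a metric name based on its prefix."""
    for prefix, category in PREFIX_CATEGORIES:
        if metric_name.startswith(prefix):
            return category
    return "unknown"


for name in (
    "etcd_server_has_leader",
    "etcd_debugging_mvcc_db_total_size_in_bytes",
    "process_resident_memory_bytes",
    "grpc_server_started_total",
):
    print(name, "->", categorize(name))
```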

Rok Prometheus collects and stores all metrics exposed by etcd, while Rok Grafana queries for and visualizes a subset of the collected metrics. The goal is to use collected metrics to sufficiently monitor the following areas:

  • leader election
  • disk operations
  • CPU usage
  • memory usage
  • network traffic

The table below lists the etcd metrics that are included in the Rok / etcd Grafana dashboard:

Name | Description | Type
etcd_disk_backend_commit_duration_seconds | The latency distributions of commit called by backend. | Histogram
etcd_disk_wal_fsync_duration_seconds | The latency distributions of fsync called by wal. | Histogram
etcd_network_client_grpc_received_bytes_total | The total number of bytes received from grpc clients. | Counter
etcd_network_client_grpc_sent_bytes_total | The total number of bytes sent to grpc clients. | Counter
etcd_network_peer_received_bytes_total | The total number of bytes received from peers. | Counter
etcd_network_peer_sent_bytes_total | The total number of bytes sent to peers. | Counter
etcd_server_has_leader | Whether or not a leader exists. 1 is existence, 0 is not. | Gauge
etcd_server_leader_changes_seen_total | The number of leader changes seen. | Counter
etcd_server_proposals_applied_total | The total number of consensus proposals applied. | Gauge
etcd_server_proposals_committed_total | The total number of consensus proposals committed. | Gauge
etcd_server_proposals_failed_total | The total number of failed proposals seen. | Counter
etcd_server_proposals_pending | The current number of pending proposals to commit. | Gauge
process_resident_memory_bytes | Resident memory size in bytes. | Gauge
grpc_server_handled_total | Total number of RPCs completed on the server, regardless of success or failure. | Counter
grpc_server_started_total | Total number of RPCs started on the server. | Counter
etcd_debugging_mvcc_db_total_size_in_bytes | Total size of the underlying database physically allocated in bytes. | Gauge
etcd_debugging_snap_save_total_duration_seconds | The total latency distributions of save called by snapshot. | Histogram
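
Histogram metrics such as etcd_disk_wal_fsync_duration_seconds are exposed as cumulative bucket counts, and latency percentiles are estimated from those buckets. The Python sketch below shows one way to do this, using linear interpolation within a bucket in the spirit of the PromQL histogram_quantile function. The bucket values are made up, the sketch works on raw cumulative counts rather than on rates over a time window, and it is not the query that the Rok / etcd dashboard uses.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, where the last upper bound is math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf or count == prev_count:
                # No finite upper bound (or an empty bucket) to interpolate in.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for fsync latency buckets, in seconds.
fsync_buckets = [(0.001, 50), (0.002, 90), (0.004, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, fsync_buckets)
p99 = histogram_quantile(0.99, fsync_buckets)
print(f"p50={p50:.4f}s p99={p99:.4f}s")
```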

View Grafana Dashboard

Note

The Rok Monitoring Stack places Grafana dashboards for individual EKF components under the EKF folder.

  1. Visit the Kubeflow central dashboard with your browser at

    https://<FQDN>

    Replace <FQDN> with your domain. For example:

    https://arrikto-cluster.apps.example.com
  2. If prompted, log in using your credentials:

    ../../_images/kubeflow-login.png
  3. Select Metrics from the left side bar to navigate to Grafana:

    ../../_images/kubeflow-dashboard-metrics.png
  4. In the left side bar, hover your cursor over the Dashboards entry and then click Manage to navigate to the Grafana Dashboards page:

    ../../_images/grafana-dashboard-manage.png

    Note

    In the Grafana Dashboards page you can search, view, and select dashboards.

  5. Go to the EKF folder and select the Rok / etcd dashboard:

    ../../_images/rok-etcd-grafana-dashboard-select.png
  6. View visualizations of collected Rok etcd metrics:

    ../../_images/rok-etcd-grafana-dashboard.png

Summary

In this guide you gained insight into how the Rok Monitoring Stack integrates with etcd and which metrics it collects and visualizes.

What’s Next

The next step is to learn how to monitor Rok and view the Rok Grafana dashboard.