Learning how to monitor etcd is of vital importance when running Arrikto EKF in production. Monitoring etcd lets you validate that EKF performs as expected, while it also helps you detect and troubleshoot issues in a timely manner.
Inspecting the performance and status of Rok etcd is key to keeping the Rok cluster of your EKF installation healthy and functional. The Rok Monitoring Stack gives you visibility into how your Rok cluster interacts with etcd by collecting and visualizing the Prometheus metrics that etcd exposes directly. This helps you maintain high levels of performance and availability.
This guide also contains commands that you can run to access the Rok etcd Grafana dashboard. Here is what you will need to follow them:
Before proceeding, ensure that you have been granted proper rights to access the Rok Monitoring Stack UI. Currently, access to the Rok Monitoring Stack is allowed only to admin users.
etcd uses Prometheus for metrics reporting and exposes its metrics at the /metrics HTTP endpoint.
The metrics that etcd exposes can be used for real-time monitoring and
debugging. However, etcd does not persist its metrics on its own, that is,
metrics are reset upon restarts.
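To make the format of the /metrics endpoint more concrete, here is a minimal sketch of how a scraped payload could be parsed. It assumes the standard Prometheus text exposition format. The sample payload, its values, and the `parse_metrics` helper are illustrative and not part of EKF, and the metric names are assumed to match those of upstream etcd.

```python
import re

# Sample payload in the Prometheus text exposition format. The values are
# made up for illustration; a real scrape of /metrics returns many more series.
SAMPLE = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="Range",grpc_type="unary"} 42
"""

# A sample line is: metric name, optional {labels}, whitespace, value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything that does not look like a sample
        name, labels, value = match.groups()
        samples.append((name, labels or "", float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Because the payload only reflects the state of the running etcd process, a scraper has to store these samples elsewhere if their history is to survive a restart.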
To persist etcd metrics on Kubernetes, the Rok Monitoring Stack creates a
ServiceMonitor custom resource in the namespace where Rok is deployed to
configure Rok Prometheus to periodically pull metrics from Rok etcd and save
them in its time-series database.
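As a rough illustration of what such a ServiceMonitor might contain, the sketch below builds a manifest as a plain Python dictionary. The `apiVersion`, `kind`, and `spec` fields follow the Prometheus Operator ServiceMonitor resource. The resource name, namespace, labels, port name, and scrape interval are hypothetical placeholders and do not reflect the actual manifest that the Rok Monitoring Stack creates.

```python
import json


def build_service_monitor(name, namespace, match_labels, port, interval="30s"):
    """Build a ServiceMonitor manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select the Service objects whose endpoints should be scraped.
            "selector": {"matchLabels": dict(match_labels)},
            # Scrape the /metrics path of the named Service port periodically.
            "endpoints": [
                {"port": port, "path": "/metrics", "interval": interval}
            ],
        },
    }


manifest = build_service_monitor(
    name="example-etcd",                   # hypothetical resource name
    namespace="example-namespace",         # hypothetical namespace
    match_labels={"app": "example-etcd"},  # hypothetical Service labels
    port="client",                         # hypothetical Service port name
)
print(json.dumps(manifest, indent=2))
```

To see the real configuration, inspect the ServiceMonitor objects in the namespace where Rok is deployed rather than relying on the placeholder values above.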
By default, Rok Prometheus retains metrics for 3 days.
Below you can view the categories of etcd v3.3 metrics:
- Stable metrics, under the
- Debugging metrics, under the
- System and Go application metrics, under the
- gRPC server metrics, under the
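The categories above can be told apart by metric name prefix. The sketch below assumes the prefixes used by upstream etcd (`etcd_`, `etcd_debugging_`, `process_` and `go_`, and `grpc_`). Both this prefix mapping and the `categorize` helper are assumptions for illustration, so verify them against the output of your own etcd version.

```python
def categorize(metric_name):
    """Map a metric name to one of the categories listed above.

    The prefix-to-category mapping is an assumption based on upstream etcd
    naming conventions.
    """
    # Check the longer prefix first, because etcd_debugging_ also starts with etcd_.
    if metric_name.startswith("etcd_debugging_"):
        return "debugging"
    if metric_name.startswith("etcd_"):
        return "stable"
    if metric_name.startswith(("process_", "go_")):
        return "system and Go application"
    if metric_name.startswith("grpc_"):
        return "gRPC server"
    return "other"


for name in (
    "etcd_server_has_leader",
    "etcd_debugging_snap_save_total_duration_seconds",
    "process_resident_memory_bytes",
    "grpc_server_started_total",
):
    print(name, "->", categorize(name))
```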
Rok Prometheus collects and stores all metrics exposed by etcd, while Rok Grafana queries for and visualizes a subset of the collected metrics. The goal is to use collected metrics to sufficiently monitor the following areas:
- leader election
- disk operations
- CPU usage
- memory usage
- network traffic
The table below lists the etcd metrics that are included in the Rok / etcd dashboard:
|The latency distributions of commit called by backend.
|The latency distributions of fsync called by wal.
|The total number of bytes received from grpc clients.
|The total number of bytes sent to grpc clients.
|The total number of bytes received from peers.
|The total number of bytes sent to peers.
|Whether or not a leader exists. 1 is existence, 0 is not.
|The number of leader changes seen.
|The total number of consensus proposals applied.
|The total number of consensus proposals committed.
|The total number of failed proposals seen.
|The current number of pending proposals to commit.
|Resident memory size in bytes.
|Total number of RPCs completed on the server, regardless of success or failure.
|Total number of RPCs started on the server.
|Total size of the underlying database physically allocated in bytes.
|The total latency distributions of save called by snapshot.
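Several of the metrics above are latency distributions, which Prometheus stores as histograms with cumulative buckets. The sketch below shows one way to estimate a quantile, such as the 99th percentile, from those buckets using linear interpolation, similar in spirit to the PromQL `histogram_quantile` function. The bucket bounds and counts are made up for illustration, and `estimate_quantile` is a hypothetical helper, not something that Rok Grafana exposes.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Either the quantile falls in the +Inf bucket or the bucket
                # is empty: report the highest finite bound seen so far.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative buckets for a latency histogram, with bounds in seconds.
example_buckets = [
    (0.001, 0),
    (0.002, 50),
    (0.004, 90),
    (0.008, 100),
    (math.inf, 100),
]
p99 = estimate_quantile(0.99, example_buckets)
print(f"estimated p99 latency: {p99:.4f}s")
```

The estimate is only as precise as the bucket layout allows, so treat it as an approximation of the true latency.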
The Rok Monitoring Stack places Grafana dashboards for individual EKF
components under the
Visit the Kubeflow central dashboard with your browser at https://<FQDN>
Replace <FQDN> with the value of your domain. For example: https://arrikto-cluster.apps.example.com
If prompted, log in using your credentials:
Select Metrics from the left side bar to navigate to Grafana:
In the left side bar, hover your cursor over the Dashboards entry and then click Manage to navigate to the Grafana Dashboards page:
In the Grafana Dashboards page you can search, view, and select dashboards.
Go to the EKF folder and select the Rok / etcd dashboard:
View visualizations of collected Rok etcd metrics:
In this guide you gained insight into how the Rok Monitoring Stack integrates with etcd and which metrics it collects and visualizes.