Rok Monitoring Stack Architecture

This guide contains information about the architecture of the Rok Monitoring Stack.

Introduction

The Rok Monitoring Stack (RMS) is a carefully curated collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules for end-to-end EKF cluster monitoring. RMS is a Prometheus-based monitoring stack built on top of the widely adopted, open-source kube-prometheus repository.

To configure and deploy RMS in a declarative manner based on GitOps, Arrikto organizes the kube-prometheus manifests into Kustomize packages. By applying Kustomize patches, we selectively configure components and tailor RMS to monitor physical nodes, Kubernetes, Rok external services and Rok.

Assuming that you already have your clone of the Arrikto GitOps repository, you can view the kustomization tree with the Rok monitoring manifests under rok/monitoring/.
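
To give a sense of the structure, here is a minimal, hypothetical kustomization for such a package. The resource and patch paths below are illustrative; the actual package layout, resource names, and patch files are the ones you will find under rok/monitoring/ in your clone:

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Pull in the upstream kube-prometheus manifests as-is (illustrative path).
resources:
- upstream/kube-prometheus
# Layer Arrikto-specific configuration on top of them (illustrative patch).
patchesStrategicMerge:
- patches/prometheus-rok.yaml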

Note

Currently, RMS follows release-0.7 of kube-prometheus, which is compatible with Kubernetes 1.19 and 1.20.

Components

The Rok Monitoring Stack consists of multiple components, each responsible for specific operations. Below you can view the components that RMS configures and deploys by default:

Prometheus Operator
    Provides Kubernetes-native deployment and management of Prometheus and related monitoring components.
Prometheus
    An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database, and modern alerting approach.
Node Exporter
    Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors.
Kube State Metrics
    A simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
Grafana
    An open-source visualization and analytics software that allows you to query, visualize, alert on, and explore metrics stored in various databases.

The core of the Rok Monitoring Stack is Prometheus: a full-fledged, widely adopted monitoring system and time series database built using an HTTP pull model. It includes a dimensional data model based on labels, a custom query language (PromQL), and an alerting system (Alert Manager). Prometheus is a graduated project of the Cloud Native Computing Foundation.

The Prometheus Operator for Kubernetes introduces the monitoring.coreos.com/v1 API and manages the Prometheus, ServiceMonitor and PodMonitor custom resources. More specifically, it synchronizes the configuration of the Prometheus server based on the spec of the Prometheus CR and ensures that metrics from all targets referred to by existing ServiceMonitors and PodMonitors are collected.

[Figure: prometheus_operator.png]
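
As a hypothetical example of this API, the following Prometheus custom resource would make the operator deploy a two-replica Prometheus server and configure it to scrape the targets of every ServiceMonitor carrying the team: rok label. The names and label values here are illustrative, not necessarily the ones RMS uses:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus-k8s
  # Watch ServiceMonitors that carry this label...
  serviceMonitorSelector:
    matchLabels:
      team: rok
  # ...in namespaces that carry this label.
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled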

The frontend of the Rok Monitoring Stack is based on Grafana, an observability and analytics platform that allows you to query, visualize, alert on and understand collected metrics. It is data-driven, can connect to multiple backends and provides a huge variety of dashboards to explore your data.
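
Grafana typically discovers the stack's Prometheus through a provisioned data source. The sketch below uses Grafana's data source provisioning format and assumes the kube-prometheus convention of a prometheus-k8s Service in the monitoring namespace; adjust the URL to your installation:

apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  # Route queries through the Grafana backend (proxy mode).
  access: proxy
  # Assumed in-cluster address of the Prometheus server.
  url: http://prometheus-k8s.monitoring.svc:9090
  isDefault: true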

Note

Both Prometheus and Grafana provide a graphical user interface to query Prometheus's time series database and visualize collected metrics. For more details on accessing these UIs on EKF, see the Rok Monitoring Stack UIs user guide.

The Rok Monitoring Stack also supports but does not currently deploy the following components:

Alert Manager
    A system that handles alerts sent by client applications, such as the Prometheus server, and takes care of deduplicating, grouping, and routing them to the correct receiver integrations.
Prometheus Adapter
    A component that exposes custom, application-specific metrics via the Kubernetes Custom Metrics API, so that the HPA controller or some other entity can use them.

Note

By default, RMS does not set up Prometheus Alert Manager and Prometheus Adapter instances. These components are optional and their configuration depends on the characteristics and needs of each installation.

Monitoring Targets

If you have already deployed Arrikto EKF, then you have also deployed the Rok Monitoring Stack with all its components. In this section we describe how we configure Rok Prometheus to monitor physical nodes, Kubernetes, Rok external services (etcd and Redis), and Rok itself.

Note

The Rok Monitoring Stack creates RBAC resources that grant Prometheus sufficient permissions to perform get, list, and watch operations on Pods, Services, and Endpoints in the namespace where Rok is deployed.
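
A minimal sketch of what such a Role amounts to follows; the actual resource names RMS creates may differ:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  # Assumed name; paired with a RoleBinding to Prometheus's ServiceAccount.
  name: prometheus-k8s
  namespace: rok
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]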

Physical Nodes

To monitor the physical nodes (or cloud VMs) that host Kubernetes and, in turn, the applications running on it, we need a way to gather and export critical system metrics that expose the overall state of CPU usage, memory consumption, disk I/O, network traffic, and other resources. The standard way to achieve this in a Prometheus-based monitoring stack is to run a Node Exporter instance on each node, which, in turn, runs a set of collectors for both hardware and OS metrics exposed by the kernel.

The Rok Monitoring Stack deploys Node Exporter as a DaemonSet to retrieve system metrics from all nodes.
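
Abridged, and with release-specific details omitted, the DaemonSet looks roughly like this (a sketch based on upstream kube-prometheus conventions, not the exact RMS manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      # Share the host's network and PID namespaces so the exporter
      # observes the node itself rather than its own container.
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.0.1
        args:
        # Read node metrics from the host's root filesystem mount.
        - --path.rootfs=/host/root
        volumeMounts:
        - name: root
          mountPath: /host/root
          readOnly: true
      volumes:
      - name: root
        hostPath:
          path: /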

To configure Prometheus to periodically collect metrics from the Node Exporter, the Rok Monitoring Stack creates a ServiceMonitor custom resource that selects the Node Exporter Service (node-exporter.monitoring) and looks like:

node-exporter-serviceMonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: v1.0.1
  name: node-exporter
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 15s
    port: https
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter

Kubernetes

To monitor Kubernetes we need a way to gather and export metrics from core Kubernetes components, such as the Kubernetes API Server, Kubelet, the Kubernetes Scheduler, the Kubernetes Controller Manager, CoreDNS, etc. These components already collect and expose metrics in the Prometheus data format via Kubernetes Services.

Note

cAdvisor is an open-source agent, integrated into the kubelet binary, that monitors resource usage and analyzes the performance of containers. It collects statistics about CPU, memory, file, and network usage for all containers running on a given node (it does not operate at the pod level).

In addition, we need to gather and export metrics from Kubernetes API resources, such as StatefulSets, DaemonSets, PersistentVolumeClaims, etc. The standard way to achieve this in a Prometheus-based monitoring stack is to deploy Kube State Metrics in the Kubernetes cluster to be monitored. Kube State Metrics focuses on generating completely new metrics based on the state of Kubernetes API objects. It holds an entire snapshot of Kubernetes state in memory and continuously generates new metrics based on it.
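
As a concrete illustration, kube-state-metrics turns object state into simple gauges in the Prometheus text format. The metric names below are real kube-state-metrics metrics, while the label values are made up:

kube_pod_status_phase{namespace="rok",pod="rok-operator-0",phase="Running"} 1
kube_daemonset_status_number_ready{namespace="monitoring",daemonset="node-exporter"} 3
kube_persistentvolumeclaim_status_phase{namespace="kubeflow-user",persistentvolumeclaim="data",phase="Bound"} 1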

Warning

In managed cloud environments, such as EKS, GKE and AKS, metrics from Kubernetes components that are exclusively running on the master node (such as the scheduler and the controller manager) might not be available by default.

To configure Prometheus to periodically collect metrics from Kubernetes, the Rok Monitoring Stack creates multiple ServiceMonitor custom resources that select individual Kubernetes Services:

kube-state-metrics-serviceMonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.7
  name: kube-state-metrics
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    port: https-main
    relabelings:
    - action: labeldrop
      regex: (pod|service|endpoint|namespace)
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      insecureSkipVerify: true
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-self
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
prometheus-serviceMonitorKubelet.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kubelet
  name: kubelet
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
      sourceLabels:
      - __name__
    - action: drop
      regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
      sourceLabels:
      - __name__
    - action: drop
      regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
      sourceLabels:
      - __name__
    - action: drop
      regex: transformation_(transformation_latencies_microseconds|failures_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: (admission_quota_controller_adds|crd_autoregistration_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|AvailableConditionController_retries|crd_openapi_controller_unfinished_work_seconds|APIServiceRegistrationController_retries|admission_quota_controller_longest_running_processor_microseconds|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_unfinished_work_seconds|crd_openapi_controller_adds|crd_autoregistration_controller_retries|crd_finalizer_queue_latency|AvailableConditionController_work_duration|non_structural_schema_condition_controller_depth|crd_autoregistration_controller_unfinished_work_seconds|AvailableConditionController_adds|DiscoveryController_longest_running_processor_microseconds|autoregister_queue_latency|crd_autoregistration_controller_adds|non_structural_schema_condition_controller_work_duration|APIServiceRegistrationController_adds|crd_finalizer_work_duration|crd_naming_condition_controller_unfinished_work_seconds|crd_openapi_controller_longest_running_processor_microseconds|DiscoveryController_adds|crd_autoregistration_controller_longest_running_processor_microseconds|autoregister_unfinished_work_seconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|non_structural_schema_condition_controller_queue_latency|crd_naming_condition_controller_depth|AvailableConditionController_longest_running_processor_microseconds|crdEstablishing_depth|crd_finalizer_longest_running_processor_microseconds|crd_naming_condition_controller_adds|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_unfinished_work_seconds|crd_openapi_controller_depth|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|DiscoveryController_work_duration|autoregister_adds|crd_autoregistration_controller_queue_latency|crd_finalizer_retries|AvailableConditionController_unfinished_work_seconds|autoregister_longest_running_processor_microseconds|non_structural_schema_condition_controller_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_depth|AvailableConditionController_depth|DiscoveryController_retries|admission_quota_controller_depth|crdEstablishing_adds|APIServiceOpenAPIAggregationControllerQueue1_retries|crdEstablishing_queue_latency|non_structural_schema_condition_controller_longest_running_processor_microseconds|autoregister_work_duration|crd_openapi_controller_retries|APIServiceRegistrationController_work_duration|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_openapi_controller_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_queue_latency|crd_autoregistration_controller_depth|AvailableConditionController_queue_latency|admission_quota_controller_queue_latency|crd_naming_condition_controller_work_duration|crd_openapi_controller_work_duration|DiscoveryController_depth|crd_naming_condition_controller_longest_running_processor_microseconds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|crd_finalizer_unfinished_work_seconds|crdEstablishing_retries|admission_quota_controller_unfinished_work_seconds|non_structural_schema_condition_controller_adds|APIServiceRegistrationController_unfinished_work_seconds|admission_quota_controller_work_duration|autoregister_depth|autoregister_retries|kubeproxy_sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds|non_structural_schema_condition_controller_retries)
      sourceLabels:
      - __name__
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    honorTimestamps: false
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
      sourceLabels:
      - __name__
    - action: drop
      regex: (container_fs_.*|container_spec_.*|container_blkio_device_usage_total|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);;
      sourceLabels:
      - __name__
      - pod
      - namespace
    path: /metrics/cadvisor
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    path: /metrics/probes
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kubelet
prometheus-serviceMonitorApiserver.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: apiserver
  name: kube-apiserver
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
      sourceLabels:
      - __name__
    - action: drop
      regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
      sourceLabels:
      - __name__
    - action: drop
      regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
      sourceLabels:
      - __name__
    - action: drop
      regex: transformation_(transformation_latencies_microseconds|failures_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: (admission_quota_controller_adds|crd_autoregistration_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|AvailableConditionController_retries|crd_openapi_controller_unfinished_work_seconds|APIServiceRegistrationController_retries|admission_quota_controller_longest_running_processor_microseconds|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_unfinished_work_seconds|crd_openapi_controller_adds|crd_autoregistration_controller_retries|crd_finalizer_queue_latency|AvailableConditionController_work_duration|non_structural_schema_condition_controller_depth|crd_autoregistration_controller_unfinished_work_seconds|AvailableConditionController_adds|DiscoveryController_longest_running_processor_microseconds|autoregister_queue_latency|crd_autoregistration_controller_adds|non_structural_schema_condition_controller_work_duration|APIServiceRegistrationController_adds|crd_finalizer_work_duration|crd_naming_condition_controller_unfinished_work_seconds|crd_openapi_controller_longest_running_processor_microseconds|DiscoveryController_adds|crd_autoregistration_controller_longest_running_processor_microseconds|autoregister_unfinished_work_seconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|non_structural_schema_condition_controller_queue_latency|crd_naming_condition_controller_depth|AvailableConditionController_longest_running_processor_microseconds|crdEstablishing_depth|crd_finalizer_longest_running_processor_microseconds|crd_naming_condition_controller_adds|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_unfinished_work_seconds|crd_openapi_controller_depth|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|DiscoveryController_work_duration|autoregister_adds|crd_autoregistration_controller_queue_latency|crd_finalizer_retries|AvailableConditionController_unfinished_work_seconds|autoregister_longest_running_processor_microseconds|non_structural_schema_condition_controller_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_depth|AvailableConditionController_depth|DiscoveryController_retries|admission_quota_controller_depth|crdEstablishing_adds|APIServiceOpenAPIAggregationControllerQueue1_retries|crdEstablishing_queue_latency|non_structural_schema_condition_controller_longest_running_processor_microseconds|autoregister_work_duration|crd_openapi_controller_retries|APIServiceRegistrationController_work_duration|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_openapi_controller_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_queue_latency|crd_autoregistration_controller_depth|AvailableConditionController_queue_latency|admission_quota_controller_queue_latency|crd_naming_condition_controller_work_duration|crd_openapi_controller_work_duration|DiscoveryController_depth|crd_naming_condition_controller_longest_running_processor_microseconds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|crd_finalizer_unfinished_work_seconds|crdEstablishing_retries|admission_quota_controller_unfinished_work_seconds|non_structural_schema_condition_controller_adds|APIServiceRegistrationController_unfinished_work_seconds|admission_quota_controller_work_duration|autoregister_depth|autoregister_retries|kubeproxy_sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds|non_structural_schema_condition_controller_retries)
      sourceLabels:
      - __name__
    - action: drop
      regex: etcd_(debugging|disk|server).*
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_admission_controller_admission_latencies_seconds_.*
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_admission_step_admission_latencies_seconds_.*
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)
      sourceLabels:
      - __name__
      - le
    port: https
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
  jobLabel: component
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      component: apiserver
      provider: kubernetes
prometheus-serviceMonitorKubeControllerManager.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
      sourceLabels:
      - __name__
    - action: drop
      regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
      sourceLabels:
      - __name__
    - action: drop
      regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
      sourceLabels:
      - __name__
    - action: drop
      regex: transformation_(transformation_latencies_microseconds|failures_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: (admission_quota_controller_adds|crd_autoregistration_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|AvailableConditionController_retries|crd_openapi_controller_unfinished_work_seconds|APIServiceRegistrationController_retries|admission_quota_controller_longest_running_processor_microseconds|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_unfinished_work_seconds|crd_openapi_controller_adds|crd_autoregistration_controller_retries|crd_finalizer_queue_latency|AvailableConditionController_work_duration|non_structural_schema_condition_controller_depth|crd_autoregistration_controller_unfinished_work_seconds|AvailableConditionController_adds|DiscoveryController_longest_running_processor_microseconds|autoregister_queue_latency|crd_autoregistration_controller_adds|non_structural_schema_condition_controller_work_duration|APIServiceRegistrationController_adds|crd_finalizer_work_duration|crd_naming_condition_controller_unfinished_work_seconds|crd_openapi_controller_longest_running_processor_microseconds|DiscoveryController_adds|crd_autoregistration_controller_longest_running_processor_microseconds|autoregister_unfinished_work_seconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|non_structural_schema_condition_controller_queue_latency|crd_naming_condition_controller_depth|AvailableConditionController_longest_running_processor_microseconds|crdEstablishing_depth|crd_finalizer_longest_running_processor_microseconds|crd_naming_condition_controller_adds|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_unfinished_work_seconds|crd_openapi_controller_depth|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|DiscoveryController_work_duration|autoregister_adds|crd_autoregistration_controller_queue_latency|crd_finalizer_retries|AvailableConditionController_unfinished_work_seconds|autoregister_longest_running_processor_microseconds|non_structural_schema_condition_controller_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_depth|AvailableConditionController_depth|DiscoveryController_retries|admission_quota_controller_depth|crdEstablishing_adds|APIServiceOpenAPIAggregationControllerQueue1_retries|crdEstablishing_queue_latency|non_structural_schema_condition_controller_longest_running_processor_microseconds|autoregister_work_duration|crd_openapi_controller_retries|APIServiceRegistrationController_work_duration|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_openapi_controller_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_queue_latency|crd_autoregistration_controller_depth|AvailableConditionController_queue_latency|admission_quota_controller_queue_latency|crd_naming_condition_controller_work_duration|crd_openapi_controller_work_duration|DiscoveryController_depth|crd_naming_condition_controller_longest_running_processor_microseconds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|crd_finalizer_unfinished_work_seconds|crdEstablishing_retries|admission_quota_controller_unfinished_work_seconds|non_structural_schema_condition_controller_adds|APIServiceRegistrationController_unfinished_work_seconds|admission_quota_controller_work_duration|autoregister_depth|autoregister_retries|kubeproxy_sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds|non_structural_schema_condition_controller_retries)
      sourceLabels:
      - __name__
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
prometheus-serviceMonitorKubeScheduler.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler

EKF Istio

Arrikto EKF uses Istio as the service mesh for microservices. Monitoring Istio with the Rok Monitoring Stack is a work in progress.

Rok Etcd

Rok uses etcd as a key-value store to save Rok data and metadata. To make Prometheus aware of Rok etcd and configure it to periodically scrape metrics from it, the Rok Monitoring Stack creates a ServiceMonitor custom resource that selects the Rok etcd Service (rok-etcd.rok):

prometheus-serviceMonitorRokEtcd.yaml
# This file is part of Rok.
#
# Copyright © 2020, 2022 Arrikto Inc. All Rights Reserved.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rok-etcd-metrics
  namespace: rok
spec:
  endpoints:
  - interval: 15s
    port: tcp-client
  namespaceSelector:
    matchNames:
    - rok
  selector:
    matchLabels:
      app: etcd
      app.kubernetes.io/part-of: rok

See also

For detailed information on how to monitor Rok etcd, see the Etcd Monitoring user guide.

Rok Redis

Rok uses Redis as an in-memory data structure store to cache metadata. To make Prometheus aware of Rok Redis and configure it to periodically scrape metrics from it, the Rok Monitoring Stack creates a ServiceMonitor custom resource that selects the Rok Redis metrics Service (rok-redis-metrics.rok):

prometheus-serviceMonitorRokRedis.yaml
# This file is part of Rok.
#
# Copyright © 2020 Arrikto Inc. All Rights Reserved.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rok-redis-metrics
  namespace: rok
spec:
  endpoints:
  - interval: 15s
    port: http-metrics
  namespaceSelector:
    matchNames:
    - rok
  selector:
    matchLabels:
      app: redis
      app.kubernetes.io/part-of: rok

Rok

Rok is natively integrated with Prometheus, as it serves the /metrics HTTP endpoint and exposes metrics to the outside world using Prometheus's data model and text-based format.
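
The text-based format in question is the standard Prometheus exposition format: one sample per line, optionally preceded by HELP and TYPE metadata. The snippet below is a generic example of that format, not actual Rok output:

# HELP http_requests_total The total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="200"} 3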

To make Prometheus aware of Rok and configure it to periodically scrape metrics from it, the Rok Monitoring Stack creates a ServiceMonitor custom resource that selects the Rok Service (rok.rok):

prometheus-serviceMonitorRok.yaml
# This file is part of Rok.
#
# Copyright © 2020 Arrikto Inc. All Rights Reserved.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rok-metrics
  namespace: rok
spec:
  endpoints:
  - interval: 15s
    port: http-rok
  namespaceSelector:
    matchNames:
    - rok
  selector:
    matchLabels:
      rok-cluster: rok
      app.kubernetes.io/part-of: rok

See also

For detailed information on how to monitor Rok, see the Rok Monitoring user guide.

Summary

In this guide you gained insight into the architecture, components, and default configuration of the Rok Monitoring Stack.

What’s Next

The next step is to learn how to access the UIs of the Rok Monitoring Stack.