Gather Logs for Troubleshooting¶

This section describes how to gather any EKF-related logs in order to troubleshoot your deployment. At the end of this guide you will end up with a tarball that you can send to the Arrikto Support Team.

Choose one of the following options to gather logs for troubleshooting:

Option 1: Gather Logs Automatically (preferred).
Option 2: Gather Logs Manually.

Overview

What You’ll Need
Option 1: Gather Logs Automatically (preferred)
Option 2: Gather Logs Manually
- Procedure
Summary
What’s Next

What You’ll Need ¶

A configured management environment.
An existing Kubernetes cluster.
An existing Rok deployment.
An existing Kubeflow deployment.

Option 1: Gather Logs Automatically (preferred)¶

Gather logs by following the on-screen instructions on the rok-gather-logs user interface.

Start the rok-gather-logs CLI tool:

root@rok-tools:~# rok-gather-logs
Wait until the script has finished successfully.
Troubleshooting
The script takes too long to complete and/or runs out of space
If the script takes too long to complete and/or runs out of space, you can do the following:

Press Ctrl+C to stop the script.

Start the rok-gather-logs tool again:

root@rok-tools:~# rok-gather-logs

(Optional) When prompted, select a smaller limit for the size of the logs of the Rok and Rok CLI Node Pods that the tool will gather from each node of your cluster. The recommended value is 100 MiB.

(Optional) When prompted, select a subset of the cluster nodes to gather logs from, if you have pinpointed the problematic nodes of your cluster.
The script reports timeout errors

Inspect the last lines of ~/.rok/log/gather-logs.log. If they report a warning message similar to this:

socket.timeout: timed out ... urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbd39c60d30>, 'Connection to 0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com timed out. (connect timeout=1.0)') ... urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com', port=443): Max retries exceeded with url: /api/v1/nodes?labelSelector= (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbd39c60d30>, 'Connection to 0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com timed out. (connect timeout=1.0)'))

it means that the Kubernetes API server takes a while to respond. Change the timeouts accordingly and rerun the script. If you want to give the server more time to respond, try increasing the timeouts. For example:

root@rok-tools:~# rok-gather-logs \ > --connection-timeout "2 minutes" \ > --read-timeout "10 minutes"

The script reported warnings

If the above command reports a warning message similar to this:

it means that some of the steps in the log gathering process couldn’t complete successfully, probably due to a degraded cluster.

Proceed with the tarball that rok-gather-logs produced, as it reports the steps that have failed and the reason for their failure. To see which steps have failed for yourself, inspect the file(s) that rok-gather-logs reports.

The script crashed

If the above command crashes with an error, please report it to the Arrikto Support Team.

Contact Arrikto

Open the logs under ~/.rok/log/gather-logs.log, go to the end of the file, and send us the lines that show an error.
Find the tarball under $HOME. Copy the output to your clipboard, as you are going to use it later:

root@rok-tools:~# export TARBALL=$(ls -t ~/*tar.gz | head -1) && echo ${TARBALL?} /root/ekf-logs-20211221-155217.tar.gz

Proceed to the Summary section.

Option 2: Gather Logs Manually ¶

If you want to gather logs manually, follow the instructions below.

Note

The manual instructions do not support limiting the size of the logs or selecting specific nodes to gather logs from. If you need these options, follow Option 1: Gather Logs Automatically (preferred).

Procedure ¶

Switch to your management environment.
Get the current timestamp:

root@rok-tools:~# export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
Use the timestamp to specify a name for the logs directory:

root@rok-tools:~# export NAME=ekf-logs-${TIMESTAMP?}
Set the working directory:

root@rok-tools:~# export WORKDIR=~/ops

Note

You may also store any files under $HOME so that everything is persistent and easily accessible.
Set the directory to use for saving the logs:

root@rok-tools:~# export LOGDIR=$(realpath ${WORKDIR?}/${NAME?})
Create the directory and enter it:

root@rok-tools:~# mkdir -p ${LOGDIR?} && cd ${LOGDIR?}
Get Kubernetes nodes:

root@rok-tools:~/ops/ekf-logs# kubectl get nodes -o wide > nodes.txt

root@rok-tools:~/ops/ekf-logs# kubectl get nodes -o yaml > nodes.yaml
Get Pods in all namespaces:

root@rok-tools:~/ops/ekf-logs# kubectl get pods -A -o wide > pods.txt

root@rok-tools:~/ops/ekf-logs# kubectl get pods -A -o yaml > pods.yaml
Get events in all namespaces:

root@rok-tools:~/ops/ekf-logs# kubectl get events --sort-by=.lastTimestamp -A -o wide > events.txt
Get RokCluster status:

root@rok-tools:~/ops/ekf-logs# kubectl get rokcluster -n rok rok > rokcluster.txt

root@rok-tools:~/ops/ekf-logs# kubectl get rokcluster -n rok rok -o yaml > rokcluster.yaml
Get Pods in the rok-system namespace:

root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok-system -o wide > rok-system-pods.txt

root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok-system -o yaml > rok-system-pods.yaml
Get the logs of all Pods in the rok-system namespace:

root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok-system \ > -o custom-columns=NAME:.metadata.name --no-headers \ > | while read pod; do > kubectl logs -n rok-system ${pod} --all-containers > ${pod}.log > done
Get Pods in the rok namespace:

root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok -o wide > rok-pods.txt

root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok -o yaml > rok-pods.yaml
Get the logs of all Pods in the rok namespace:

root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok \ > -o custom-columns=NAME:.metadata.name --no-headers \ > | while read pod; do > kubectl logs -n rok ${pod} --all-containers > ${pod}.log > done
Find the master Rok Pod:

root@rok-tools:~/ops/ekf-logs# kubectl get pods -l role=master,app=rok -n rok -o wide > rok-master.txt
Get the logs of the master Rok Pod:

root@rok-tools:~/ops/ekf-logs# kubectl logs -n rok svc/rok --all-containers > rok-master.log

Troubleshooting

kubectl hangs

This means that the service has no endpoints because the RokCluster has problems with master election. Press Ctrl+C to interrupt the command and proceed with the next steps.
Get the logs of all Rok Pods:

root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok \ > -l app=rok \ > -o custom-columns=NAME:.metadata.name --no-headers \ > | while read pod; do > kubectl cp --retries -1 rok/${pod}:/var/log/rok ${pod}-var-log-rok > kubectl cp --retries -1 rok/${pod}:/root/.rok/log ${pod}-root-rok-log > done
Get RokCluster members:
1. Get the configured members of the cluster:
  
  root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-cluster member-list" > rok-cluster-member-list.txt
2. Get the runtime members of the cluster:
  
  root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-election-ctl member-list" > rok-election-ctl-member-list.txt
Get Rok locks:
1. Get the composition-related locks:
  
  root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-list" > rok-dlm-locks-composer.txt
2. Get the election-related locks:
  
  root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace election lock-list" > rok-dlm-locks-election.txt
Get PVCs in all namespaces:

root@rok-tools:~/ops/ekf-logs# kubectl get pvc -A > pvc.txt

root@rok-tools:~/ops/ekf-logs# kubectl get pvc -A -o yaml > pvc.yaml
Get PVs:

root@rok-tools:~/ops/ekf-logs# kubectl get pv > pv.txt

root@rok-tools:~/ops/ekf-logs# kubectl get pv -o yaml > pv.yaml
Inspect Rok local storage on all nodes:

root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok-system \ > -l name=rok-disk-manager \ > -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers \ > | while read rdm node; do > kubectl exec -n rok-system ${rdm} -- vgs > ${node}-vgs.txt > kubectl exec -n rok-system ${rdm} -- lvs > ${node}-lvs.txt > kubectl exec -n rok-system ${rdm} -- df -h /mnt/data/rok > ${node}-df.txt > done
Get the logs of the Dex Pod:

root@rok-tools:~/ops/ekf-logs# kubectl get svc -n auth dex &> /dev/null \ > && kubectl logs -n auth svc/dex --all-containers > dex.log
Get the logs of the AuthService Pod:

root@rok-tools:~/ops/ekf-logs# kubectl logs -n istio-system svc/authservice --all-containers > authservice.log
Get the logs of the Reception Server Pod:

root@rok-tools:~/ops/ekf-logs# kubectl logs -n kubeflow svc/kubeflow-reception --all-containers > kubeflow-reception.log
Get the logs of the Profile Controller Pod:

root@rok-tools:~/ops/ekf-logs# kubectl logs -n kubeflow svc/profiles-kfam --all-containers > profiles-kfam.log
Get the logs of the Cluster Autoscaler Pod:

root@rok-tools:~/ops/ekf-logs# kubectl get deploy -n kube-system cluster-autoscaler &> /dev/null \ > && kubectl logs -n kube-system deploy/cluster-autoscaler --all-containers > cluster-autoscaler.log
Get Services in all namespaces:

root@rok-tools:~/ops/ekf-logs# kubectl get svc -A -o wide > services.txt

root@rok-tools:~/ops/ekf-logs# kubectl get svc -A -o yaml > services.yaml
Set the tarball name:

root@rok-tools:~/ops/ekf-logs# export TARBALL=$(realpath ~/${NAME}.tar.gz)
Create tarball:

root@rok-tools:~/ops/ekf-logs# env GZIP=-1 tar -C ${WORKDIR?} -czvf ${TARBALL?} ${NAME?}
Find the tarball under $HOME. Copy the output to your clipboard, as you are going to use it later:

root@rok-tools:~/ops/ekf-logs# cd && echo ${TARBALL?} /root/ekf-logs-20211221-155217.tar.gz

Summary ¶

You have successfully gathered any EKF-related logs of your deployment.

What’s Next ¶

The next step is to send the tarball to the Arrikto Support Team.

Send Logs to Arrikto for Troubleshooting

Previous Next

Gather Logs for Troubleshooting¶

What You’ll Need¶

Option 1: Gather Logs Automatically (preferred)¶

Option 2: Gather Logs Manually¶

Procedure¶

Summary¶