Gather Logs for Troubleshooting

This section describes how to gather any EKF-related logs in order to troubleshoot your deployment. At the end of this guide you will end up with a tarball that you can send to the Arrikto Support Team.

Choose one of the following options to gather logs for troubleshooting:

Option 1: Gather Logs Automatically (preferred)

Gather logs by following the on-screen instructions on the rok-gather-logs user interface.

  1. Start the rok-gather-logs CLI tool:

    root@rok-tools:~# rok-gather-logs
    ../../_images/gather-welcome.png
  2. Wait until the script has finished successfully.

    ../../_images/gather-success.png

    Troubleshooting

    The script takes too long to complete and/or runs out of space

    If the script takes too long to complete and/or runs out of space, you can do the following:

    1. Press Ctrl+C to stop the script.

    2. Start the rok-gather-logs tool again:

      root@rok-tools:~# rok-gather-logs
    3. (Optional) When prompted, select a smaller limit for the size of the logs of the Rok and Rok CLI Node Pods that the tool will gather from each node of your cluster. The recommended value is 100 MiB.

    4. (Optional) When prompted, select a subset of the cluster nodes to gather logs from, if you have pinpointed the problematic nodes of your cluster.

    The script reports timeout errors

    Inspect the last lines of ~/.rok/log/gather-logs.log. If they report a warning message similar to this:

    socket.timeout: timed out ... urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbd39c60d30>, 'Connection to 0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com timed out. (connect timeout=1.0)') ... urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com', port=443): Max retries exceeded with url: /api/v1/nodes?labelSelector= (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbd39c60d30>, 'Connection to 0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com timed out. (connect timeout=1.0)'))

    it means that the Kubernetes API server takes a while to respond. Change the timeouts accordingly and rerun the script. If you want to give the server more time to respond, try increasing the timeouts. For example:

    root@rok-tools:~# rok-gather-logs \ > --connection-timeout "2 minutes" \ > --read-timeout "10 minutes"

    The script reported warnings

    If the above command reports a warning message similar to this:

    ../../_images/gather-errors.png

    it means that some of the steps in the log gathering process couldn’t complete successfully, probably due to a degraded cluster.

    Proceed with the tarball that rok-gather-logs produced, as it reports the steps that have failed and the reason for their failure. To see which steps have failed for yourself, inspect the file(s) that rok-gather-logs reports.

    The script crashed

    If the above command crashes with an error, please report it to the Arrikto Support Team.

    Contact Arrikto

    Open the logs under ~/.rok/log/gather-logs.log, go to the end of the file, and send us the lines that show an error.

  3. Find the tarball under $HOME. Copy the output to your clipboard, as you are going to use it later:

    root@rok-tools:~# export TARBALL=$(ls -t ~/*tar.gz | head -1) && echo ${TARBALL?} /root/ekf-logs-20211221-155217.tar.gz

Proceed to the Summary section.

Option 2: Gather Logs Manually

If you want to gather logs manually, follow the instructions below.

Note

The manual instructions do not support limiting the size of the logs or selecting specific nodes to gather logs from. If you need these options, follow Option 1: Gather Logs Automatically (preferred).

Procedure

  1. Switch to your management environment.

  2. Get the current timestamp:

    root@rok-tools:~# export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
  3. Use the timestamp to specify a name for the logs directory:

    root@rok-tools:~# export NAME=ekf-logs-${TIMESTAMP?}
  4. Set the working directory:

    root@rok-tools:~# export WORKDIR=~/ops

    Note

    You may also store any files under $HOME so that everything is persistent and easily accessible.

  5. Set the directory to use for saving the logs:

    root@rok-tools:~# export LOGDIR=$(realpath ${WORKDIR?}/${NAME?})
  6. Create the directory and enter it:

    root@rok-tools:~# mkdir -p ${LOGDIR?} && cd ${LOGDIR?}
  7. Get Kubernetes nodes:

    root@rok-tools:~/ops/ekf-logs# kubectl get nodes -o wide > nodes.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get nodes -o yaml > nodes.yaml
  8. Get Pods in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -A -o wide > pods.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get pods -A -o yaml > pods.yaml
  9. Get events in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get events --sort-by=.lastTimestamp -A -o wide > events.txt
  10. Get RokCluster status:

    root@rok-tools:~/ops/ekf-logs# kubectl get rokcluster -n rok rok > rokcluster.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get rokcluster -n rok rok -o yaml > rokcluster.yaml
  11. Get Pods in the rok-system namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok-system -o wide > rok-system-pods.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok-system -o yaml > rok-system-pods.yaml
  12. Get the logs of all Pods in the rok-system namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok-system \ > -o custom-columns=NAME:.metadata.name --no-headers \ > | while read pod; do > kubectl logs -n rok-system ${pod} --all-containers > ${pod}.log > done
  13. Get Pods in the rok namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok -o wide > rok-pods.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok -o yaml > rok-pods.yaml
  14. Get the logs of all Pods in the rok namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok \ > -o custom-columns=NAME:.metadata.name --no-headers \ > | while read pod; do > kubectl logs -n rok ${pod} --all-containers > ${pod}.log > done
  15. Find the master Rok Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -l role=master,app=rok -n rok -o wide > rok-master.txt
  16. Get the logs of the master Rok Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n rok svc/rok --all-containers > rok-master.log

    Troubleshooting

    kubectl hangs

    This means that the service has no endpoints because the RokCluster has problems with master election. Press Ctrl+C to interrupt the command and proceed with the next steps.

  17. Get the logs of all Rok Pods:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok \ > -l app=rok \ > -o custom-columns=NAME:.metadata.name --no-headers \ > | while read pod; do > kubectl cp --retries -1 rok/${pod}:/var/log/rok ${pod}-var-log-rok > kubectl cp --retries -1 rok/${pod}:/root/.rok/log ${pod}-root-rok-log > done
  18. Get RokCluster members:

    1. Get the configured members of the cluster:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-cluster member-list" > rok-cluster-member-list.txt
    2. Get the runtime members of the cluster:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-election-ctl member-list" > rok-election-ctl-member-list.txt
  19. Get Rok locks:

    1. Get the composition-related locks:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-list" > rok-dlm-locks-composer.txt
    2. Get the election-related locks:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \ > -ti -n rok ds/rok \ > -- bash \ > -i -c "rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace election lock-list" > rok-dlm-locks-election.txt
  20. Get PVCs in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get pvc -A > pvc.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get pvc -A -o yaml > pvc.yaml
  21. Get PVs:

    root@rok-tools:~/ops/ekf-logs# kubectl get pv > pv.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get pv -o yaml > pv.yaml
  22. Inspect Rok local storage on all nodes:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \ > -n rok-system \ > -l name=rok-disk-manager \ > -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers \ > | while read rdm node; do > kubectl exec -n rok-system ${rdm} -- vgs > ${node}-vgs.txt > kubectl exec -n rok-system ${rdm} -- lvs > ${node}-lvs.txt > kubectl exec -n rok-system ${rdm} -- df -h /mnt/data/rok > ${node}-df.txt > done
  23. Get the logs of the Dex Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl get svc -n auth dex &> /dev/null \ > && kubectl logs -n auth svc/dex --all-containers > dex.log
  24. Get the logs of the AuthService Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n istio-system svc/authservice --all-containers > authservice.log
  25. Get the logs of the Reception Server Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n kubeflow svc/kubeflow-reception --all-containers > kubeflow-reception.log
  26. Get the logs of the Profile Controller Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n kubeflow svc/profiles-kfam --all-containers > profiles-kfam.log
  27. Get the logs of the Cluster Autoscaler Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl get deploy -n kube-system cluster-autoscaler &> /dev/null \ > && kubectl logs -n kube-system deploy/cluster-autoscaler --all-containers > cluster-autoscaler.log
  28. Get Services in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get svc -A -o wide > services.txt
    root@rok-tools:~/ops/ekf-logs# kubectl get svc -A -o yaml > services.yaml
  29. Set the tarball name:

    root@rok-tools:~/ops/ekf-logs# export TARBALL=$(realpath ~/${NAME}.tar.gz)
  30. Create tarball:

    root@rok-tools:~/ops/ekf-logs# env GZIP=-1 tar -C ${WORKDIR?} -czvf ${TARBALL?} ${NAME?}
  31. Find the tarball under $HOME. Copy the output to your clipboard, as you are going to use it later:

    root@rok-tools:~/ops/ekf-logs# cd && echo ${TARBALL?} /root/ekf-logs-20211221-155217.tar.gz

Summary

You have successfully gathered any EKF-related logs of your deployment.

What’s Next

The next step is to send the tarball to the Arrikto Support Team.