Gather Logs for Troubleshooting

This section describes how to gather EKF-related logs in order to troubleshoot your deployment. By the end of this guide you will have produced a tarball that you can send to the Arrikto Support Team.

Choose one of the following options to gather logs for troubleshooting:

Option 1: Gather Logs Automatically (preferred)

Gather logs by following the on-screen instructions in the rok-gather-logs user interface.

  1. Start the rok-gather-logs CLI tool:

    root@rok-tools:~# rok-gather-logs
    
    ../../_images/gather-welcome.png
  2. Wait until the script has finished successfully.

    ../../_images/gather-success.png

    Troubleshooting

    The script reports timeout errors

    Inspect the last lines of ~/.rok/log/gather-logs.log. If they report a timeout error similar to the following:

    socket.timeout: timed out
    ...
    urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbd39c60d30>, 'Connection to 0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com timed out. (connect timeout=1.0)')
    ...
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com', port=443): Max retries exceeded with url: /api/v1/nodes?labelSelector= (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbd39c60d30>, 'Connection to 0b15546e6290295416ca651a36d6b692.gr7.eu-central-1.eks.amazonaws.com timed out. (connect timeout=1.0)'))
    

    it means that the Kubernetes API server is taking a while to respond. Give the server more time to respond by increasing the timeouts, then rerun the script. For example:

    root@rok-tools:~# rok-gather-logs \
    > --connection-timeout "2 minutes" \
    > --read-timeout "10 minutes"
    

    The script reported warnings

    If the above command reports a warning message similar to this:

    ../../_images/gather-errors.png

    it means that some of the steps in the log gathering process could not complete successfully, most likely because the cluster is degraded.

    Proceed with the tarball that rok-gather-logs produced, since it records the steps that failed and the reason for each failure. To see which steps failed for yourself, inspect the file(s) that rok-gather-logs reports.
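
    For example, to locate these files you can list the contents of the most recent tarball under $HOME. This is a minimal sketch that reuses the naming pattern shown in the next step:

    root@rok-tools:~# tar -tzf $(ls -t ~/*tar.gz | head -1)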

    The script crashed

    If the above command crashes with an error, please report it to the Arrikto Support Team.

    Contact Arrikto

    Open the log file ~/.rok/log/gather-logs.log, go to the end of the file, and send us the lines that show the error.
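
    For example, to grab the last lines of the log you can run something like the following (a minimal sketch using standard tail; adjust the number of lines as needed):

    root@rok-tools:~# tail -n 100 ~/.rok/log/gather-logs.log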

  3. Find the tarball under $HOME. Copy the output to your clipboard, as you are going to use it later:

    root@rok-tools:~# export TARBALL=$(ls -t ~/*tar.gz | head -1) && echo ${TARBALL?}
    /root/ekf-logs-20211221-155217.tar.gz
    

Proceed to the Summary section.

Option 2: Gather Logs Manually

If you want to gather logs manually, follow the instructions below.

Procedure

  1. Switch to your management environment.

  2. Get the current timestamp:

    root@rok-tools:~# export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    
  3. Use the timestamp to specify a name for the logs directory:

    root@rok-tools:~# export NAME=ekf-logs-${TIMESTAMP?}
    
  4. Set the working directory:

    root@rok-tools:~# export WORKDIR=~/ops
    

    Note

    You may also store the files directly under $HOME so that everything is persistent and easily accessible.
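
    For example, a minimal way to do this is to point the working directory at your home directory instead:

    root@rok-tools:~# export WORKDIR=~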

  5. Set the directory to use for saving the logs:

    root@rok-tools:~# export LOGDIR=$(realpath ${WORKDIR?}/${NAME?})
    
  6. Create the directory and enter it:

    root@rok-tools:~# mkdir -p ${LOGDIR?} && cd ${LOGDIR?}
    
  7. Get Kubernetes nodes:

    root@rok-tools:~/ops/ekf-logs# kubectl get nodes -o wide > nodes.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get nodes -o yaml > nodes.yaml
    
  8. Get Pods in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -A -o wide > pods.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get pods -A -o yaml > pods.yaml
    
  9. Get events in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get events --sort-by=.lastTimestamp -A -o wide > events.txt
    
  10. Get RokCluster status:

    root@rok-tools:~/ops/ekf-logs# kubectl get rokcluster -n rok rok > rokcluster.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get rokcluster -n rok rok -o yaml > rokcluster.yaml
    
  11. Get Pods in the rok-system namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok-system -o wide > rok-system-pods.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok-system -o yaml > rok-system-pods.yaml
    
  12. Get the logs of all Pods in the rok-system namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \
    > -n rok-system \
    > -o custom-columns=NAME:.metadata.name --no-headers \
    > | while read pod; do
    >     kubectl logs -n rok-system ${pod} --all-containers > ${pod}.log
    >   done
    
  13. Get Pods in the rok namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok -o wide > rok-pods.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get pods -n rok -o yaml > rok-pods.yaml
    
  14. Get the logs of all Pods in the rok namespace:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \
    > -n rok \
    > -o custom-columns=NAME:.metadata.name --no-headers \
    > | while read pod; do
    >     kubectl logs -n rok ${pod} --all-containers > ${pod}.log
    >   done
    
  15. Find the master Rok Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods -l role=master,app=rok -n rok -o wide > rok-master.txt
    
  16. Get the logs of the master Rok Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n rok svc/rok --all-containers > rok-master.log
    

    Troubleshooting

    kubectl hangs

    This means that the service has no endpoints because the RokCluster has problems with master election. Press Ctrl-C to interrupt the command and proceed with the next steps.
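
    To confirm that the Service indeed has no endpoints, you can optionally run a check like the following (a minimal sketch using standard kubectl; the rok Service in the rok namespace is the one used in the previous command). If the ENDPOINTS column shows <none>, no master has been elected:

    root@rok-tools:~/ops/ekf-logs# kubectl get endpoints -n rok rok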

  17. Get the logs of all Rok Pods:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \
    > -n rok \
    > -l app=rok \
    > -o custom-columns=NAME:.metadata.name --no-headers \
    > | while read pod; do
    >     kubectl cp rok/${pod}:/var/log/rok ${pod}-var-log-rok
    >   done
    
  18. Get RokCluster members:

    1. Get the configured members of the cluster:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \
      > -ti -n rok ds/rok \
      > -- bash \
      > -i -c "rok-cluster member-list" > rok-cluster-member-list.txt
      
    2. Get the runtime members of the cluster:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \
      > -ti -n rok ds/rok \
      > -- bash \
      > -i -c "rok-election-ctl member-list" > rok-election-ctl-member-list.txt
      
  19. Get Rok locks:

    1. Get the composition-related locks:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \
      > -ti -n rok ds/rok \
      > -- bash \
      > -i -c "rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-list" > rok-dlm-locks-composer.txt
      
    2. Get the election-related locks:

      root@rok-tools:~/ops/ekf-logs# kubectl exec \
      > -ti -n rok ds/rok \
      > -- bash \
      > -i -c "rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace election lock-list" > rok-dlm-locks-election.txt
      
  20. Get PVCs in all namespaces:

    root@rok-tools:~/ops/ekf-logs# kubectl get pvc -A > pvc.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get pvc -A -o yaml > pvc.yaml
    
  21. Get PVs:

    root@rok-tools:~/ops/ekf-logs# kubectl get pv > pv.txt
    
    root@rok-tools:~/ops/ekf-logs# kubectl get pv -o yaml > pv.yaml
    
  22. Inspect Rok local storage on all nodes:

    root@rok-tools:~/ops/ekf-logs# kubectl get pods \
    > -n rok-system \
    > -l name=rok-disk-manager \
    > -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers \
    > | while read rdm node; do
    >     kubectl exec -n rok-system ${rdm} -- vgs > ${node}-vgs.txt
    >     kubectl exec -n rok-system ${rdm} -- df -h /mnt/data/rok > ${node}-df.txt
    >   done
    
  23. Get the logs of the Dex Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl get svc -n auth dex &> /dev/null && \
    > kubectl logs -n auth svc/dex --all-containers > dex.log
    
  24. Get the logs of the AuthService Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n istio-system svc/authservice --all-containers > authservice.log
    
  25. Get the logs of the Reception Server Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n kubeflow svc/kubeflow-reception --all-containers > kubeflow-reception.log
    
  26. Get the logs of the Profile Controller Pod:

    root@rok-tools:~/ops/ekf-logs# kubectl logs -n kubeflow svc/profiles-kfam --all-containers > profiles-kfam.log
    
  27. Set the tarball name:

    root@rok-tools:~/ops/ekf-logs# export TARBALL=$(realpath ~/${NAME?}.tar.gz)
    
  28. Create tarball:

    root@rok-tools:~/ops/ekf-logs# env GZIP=-1 tar -C ${WORKDIR?} -czvf ${TARBALL?} ${NAME?}
    
  29. Find the tarball under $HOME. Copy the output to your clipboard, as you are going to use it later:

    root@rok-tools:~/ops/ekf-logs# cd && echo ${TARBALL?}
    /root/ekf-logs-20211221-155217.tar.gz
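
    Optionally, before sending it, you can verify that the tarball contains the logs directory. This is a minimal check using standard tar options:

    root@rok-tools:~# tar -tzf ${TARBALL?} | head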
    

Summary

You have successfully gathered the EKF-related logs of your deployment.

What's Next

The next step is to send the tarball to the Arrikto Support Team.