Kubernetes Deployments give users a better approach to managing applications within a K8s cluster. Instead of manually creating and updating containerized applications, a Deployment lets users define an automated, repeatable rollout process. This process covers creating Pods, replacing old versions with new ones, scaling containers, and even rolling back to a previous revision in case of a failure. In this post, we discuss how to approach troubleshooting an errored deployment.
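For context, a minimal Deployment manifest might look like the following sketch. The name, labels, and image are placeholders, not taken from any specific application:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app         # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: nginx:1.25   # placeholder image
        ports:
        - containerPort: 80 # port the container exposes
```

The Deployment controller keeps three replicas of this Pod template running and handles rolling updates when the template changes.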
How to Detect a Failed Deployment
A deployment can complete without reporting any errors while underlying issues in a Pod, Service, or configuration still break the application. The first thing to check is whether the application is accessible from outside the cluster. Next, look at the Pods to determine whether they are up and running, or whether their status is stuck on Pending, Waiting, CrashLoopBackOff, or another error.
This approach lets users isolate the problem area. If the application is unreachable while the Pods appear to be up and running without obvious issues, the error most likely lies in the networking side of the cluster. If, on the other hand, a Pod reports an error status, the problem is more likely related to the container itself. These are not the only two causes of deployment failures, however: there may also be scenarios where both the Pods and the networking fail, or where other misconfigurations break the deployment.
The command-line tool kubectl is the easiest way to dig deeper into any object within a Kubernetes cluster. Run kubectl describe to see the events on a Pod, and kubectl get svc to list the Services. A good understanding of kubectl commands is essential if you are dealing with Kubernetes in any capacity; a kubectl cheat sheet is a quick way to get familiar with the CLI.
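As a sketch, the inspection commands mentioned above look like this. These assume a running cluster; the namespace and Pod name are placeholders you would replace with your own:

```shell
# List Pods and check their STATUS column
kubectl get pods -n my-namespace

# Show events and container state for one Pod (name is a placeholder)
kubectl describe pod demo-app-5c9f7d8b6-abcde -n my-namespace

# List Services with their cluster IPs and ports
kubectl get svc -n my-namespace

# Read the logs of the previous (crashed) container instance
kubectl logs demo-app-5c9f7d8b6-abcde -n my-namespace --previous
```

The Events section at the bottom of the describe output is usually where scheduling failures, image pull errors, and probe failures first appear.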
Troubleshooting a Kubernetes Deployment
When a deployment issue is identified, the deployment should, if possible, be rolled back immediately to the previous configuration to minimize the impact on end-users and keep the application available. You can troubleshoot within the production environment only if the deployment was done within a predetermined maintenance window and enough time remains in that window. Otherwise, roll back, deploy the new configuration in a test environment, and try to reproduce the issue there.
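A rollback can be performed with the kubectl rollout family of commands. The deployment name below is a placeholder, and the commands assume a live cluster:

```shell
# Watch the rollout and see whether it is stuck
kubectl rollout status deployment/demo-app

# Inspect the revision history of the Deployment
kubectl rollout history deployment/demo-app

# Revert to the previous revision
kubectl rollout undo deployment/demo-app

# Or revert to a specific revision from the history
kubectl rollout undo deployment/demo-app --to-revision=2
```

Because a Deployment keeps old ReplicaSets around (up to its revision history limit), the undo command can restore a known-good version within seconds.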
When troubleshooting the network side of a deployment, the first step is to verify that connectivity between the container, the Service, and the Ingress is correctly configured. The container should expose the relevant ports, and the Service should point to the correct containers and their exposed ports.
Additionally, the Service's selector must match the relevant labels of the Pods for the Service to route traffic to the containers. You can use the kubectl port-forward command to test this connectivity. Next, verify that the connectivity between the Service and the Ingress is correctly configured: the service.name and service.port fields of the Ingress backend must match the Service's name and port. This can also be tested with kubectl port-forward, this time run against the ingress controller instead of the Service. If everything looks fine and you still cannot access the application, quickly provision a simple Pod and test connectivity against it. This eliminates any reliance on the deployment's Pods, which may themselves be the cause of the connectivity issue. Once all the variables that can affect connectivity have been ruled out, users can confidently focus their troubleshooting on the Pods.
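The label and field matching described above can be sketched in a pair of hypothetical manifests. All names and ports are placeholders; the comments mark the fields that must line up with each other:

```yaml
# Service: the selector must match the Pod labels,
# and targetPort must match the container's exposed port.
apiVersion: v1
kind: Service
metadata:
  name: demo-app-svc
spec:
  selector:
    app: demo-app           # must match the Pod template labels
  ports:
  - port: 80                # port the Service exposes
    targetPort: 8080        # containerPort of the Pod
---
# Ingress: service.name and service.port must match the Service above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app-ingress
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: demo-app-svc   # must match the Service name
            port:
              number: 80         # must match the Service port
```

To test the Service layer directly, a command such as `kubectl port-forward svc/demo-app-svc 8080:80` forwards a local port to the Service, bypassing the Ingress entirely.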
The Pod's event log should be the first place to inspect when identifying why a Pod is failing, as it lets users pinpoint the issue. If an exit code is available, it can be used to drill down into the issue further. In instances where a Pod is running yet unresponsive, the best course of action is to log into the container, which can be done with the kubectl exec command to gain shell access. The most probable cause of such unresponsiveness is a failed process within the container due to an invalid configuration or insufficient resources. One common reason for process failures is improper privilege configuration when role-based access control (RBAC) is enabled, which can leave Pods unable to access other resources such as volumes and secrets, leading to issues in the containers.
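The checks above can be sketched with the following commands. The Pod name, namespace, and service account are placeholders, and a running cluster is assumed:

```shell
# Review events and the container's last exit code
kubectl describe pod demo-app-5c9f7d8b6-abcde

# Open a shell inside the running container to inspect its processes
kubectl exec -it demo-app-5c9f7d8b6-abcde -- /bin/sh

# Check whether RBAC allows the Pod's service account to read secrets
kubectl auth can-i get secrets \
  --as=system:serviceaccount:default:demo-sa
```

If the container image lacks /bin/sh, exec will fail even though the Pod is healthy; in that case the exit code and events from describe are the main evidence available.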
Isolating individual components and eliminating them one by one is the ideal way to approach troubleshooting a Kubernetes deployment.
As mentioned previously, the first step in troubleshooting Kubernetes deployments is to isolate the components. This makes it easy to identify the component or resource at the root of the problem, after which users are free to drill down and pinpoint the issue. The ability to roll back deployments offers an additional layer of flexibility, letting users revert to a previous configuration and troubleshoot outside the production environment.