However, the migration process is not always straightforward, and sometimes the journey doesn’t go as smoothly as expected. Our Python Django application, which had been running flawlessly on a VM, suddenly turned sluggish and unresponsive after the migration.
Timeouts became a frustratingly common occurrence, and the overall performance of the application deteriorated significantly.
This unexpected slowdown was a major concern: it hurt the user experience and could lead to lost revenue and customer dissatisfaction.
In this blog post, we take you through the steps we followed to track down the performance issues and identify the root cause of our application’s slowdown in the Kubernetes environment.
Steps to Resolve Timeout Issues in Python Django on Kubernetes
Even after adjusting configurations and scaling our application, the problem persisted, leading us to delve deeper into the underlying infrastructure. Here are the steps that we followed to identify and fix the issues:
Fine-Tuning Kubernetes Resource Allocation: We started by reviewing the CPU and memory requests and limits assigned to the application and comparing them against the minimum the application needs to run, adjusting them where they were too tight.
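As a rough illustration (the names and values here are placeholders, not our production figures), the requests and limits sit on the container spec of the Deployment:

```yaml
# Hypothetical Deployment excerpt for the Django app; tune the numbers to your workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: django-app                      # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: django-app
  template:
    metadata:
      labels:
        app: django-app
    spec:
      containers:
        - name: django
          image: registry.example.com/django-app:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"               # guaranteed baseline used for scheduling
              memory: "512Mi"
            limits:
              cpu: "1"                  # ceiling before CPU throttling kicks in
              memory: "1Gi"
```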
Readiness & Liveness Probes: With resource usage optimized, we extended the liveness and readiness probe timeouts so that the probes could respond before the deadline was exceeded. Research on Stack Overflow highlighted that under heavy request loads the probes might struggle to respond promptly, which is exactly why a longer probe timeout helps. This adjustment significantly reduced the frequency of timeout issues in our application: after doubling the timeout setting, we observed a 25% decrease in application timeouts.
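For illustration (the endpoints, port, and timings are assumptions rather than our exact values), the fields that matter are timeoutSeconds, periodSeconds, and failureThreshold on the container:

```yaml
# Hypothetical probe settings inside the Django container spec of the Deployment above.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health-check endpoint
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 15
  timeoutSeconds: 10        # doubled from our earlier value (illustrative numbers)
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz           # assumed readiness endpoint
    port: 8000
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
```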
Gunicorn Configuration: Even after doubling the liveness and readiness timeouts, the problem was still there. So we added Gunicorn in front of our Django app: its worker processes handle concurrent requests much better than the built-in server, addressing load problems that the probe changes alone could not fix, which makes things smoother and prevents timeouts. We started from the commonly recommended settings:
Number of workers = (2 * #cores) + 1
Worker class: gthread
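In a Gunicorn config file this translates to roughly the following; it is only a sketch, and the bind address, thread count, and timeout are illustrative assumptions:

```python
# gunicorn.conf.py -- hypothetical starting configuration for the Django app
import multiprocessing

# (2 * cores) + 1 workers, following the sizing rule above
workers = (2 * multiprocessing.cpu_count()) + 1
worker_class = "gthread"   # threaded synchronous workers
threads = 4                # threads per worker; tune for your workload
bind = "0.0.0.0:8000"
timeout = 60               # seconds before a silent worker is killed and restarted
```

Gunicorn is then started with something like gunicorn -c gunicorn.conf.py myproject.wsgi:application, where myproject stands in for the real Django project name.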
Changing the Gunicorn worker class and number of threads: Even with Gunicorn running on the usual settings and the longer liveness and readiness checks, the problem stayed. So we discussed it with our Python developer and decided to switch Gunicorn's worker class to "gevent". Because gevent workers are asynchronous, a single worker can serve many concurrent requests without blocking, which let the application handle lots of simultaneous requests without causing problems.
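The switch itself is a small change to the config file. As a sketch (it assumes the gevent package is installed in the image, and worker_connections is an illustrative value):

```python
# gunicorn.conf.py -- switching to asynchronous gevent workers (illustrative values)
import multiprocessing

workers = (2 * multiprocessing.cpu_count()) + 1
worker_class = "gevent"      # async workers; requires the gevent package in the image
worker_connections = 1000    # max simultaneous connections each worker will handle
bind = "0.0.0.0:8000"
timeout = 60
```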
Upgrading the Postgres Master Server Configuration: After making all these application-level changes, we checked how heavily the PostgreSQL master was using its node's resources. The CPU was running close to saturation, which could itself be causing the timeouts, so we moved the PostgreSQL master to a larger node. But even after doing that, the problem still persisted.
Setting up monitoring for Postgres and the Ingress Controller: Since many changes had not resolved the problem, we set up monitoring for the Nginx ingress controller and for our Postgres database using the Postgres exporter. With the dashboards in place, the correlation became clear: whenever too many requests arrived at the same time and the application timed out, Postgres tables were also getting locked.
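For reference, the Postgres exporter can run as a small Deployment (or sidecar) pointed at the database master; everything below (the names, the Secret, and the connection string) is a placeholder sketch rather than our exact manifest:

```yaml
# Hypothetical postgres_exporter Deployment, scraped by Prometheus on port 9187.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-exporter               # placeholder name
  labels:
    app: postgres-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres-exporter
  template:
    metadata:
      labels:
        app: postgres-exporter
    spec:
      containers:
        - name: exporter
          image: quay.io/prometheuscommunity/postgres-exporter:latest
          ports:
            - containerPort: 9187       # default metrics port of the exporter
          env:
            - name: DATA_SOURCE_NAME    # connection string for the Postgres master
              valueFrom:
                secretKeyRef:
                  name: postgres-exporter-secret   # placeholder Secret
                  key: data-source-name
```

Prometheus (or a ServiceMonitor, if the Prometheus Operator is in use) then scrapes port 9187, and the lock contention itself can also be confirmed directly in Postgres by querying pg_locks together with pg_stat_activity.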
You can find more information here: Python Django on Kubernetes.