My First P0 Incident

Some time ago, our company was undergoing PCI DSS certification. One of the requirements was to sign and verify all images to be deployed to the production environment before deployment. Since I was in charge of the company’s private Harbor registry, it was “logical” that I would take on this task.

PCI DSS, short for Payment Card Industry Data Security Standard, is a set of security standards developed by the Payment Card Industry Security Standards Council (PCI SSC). It aims to protect cardholder data through a series of strict security controls.

When the colleague leading the PCI DSS certification in the company came to talk to me and my leader about this, there were only about 10 days left. In other words, we had to design, develop, test, and deploy a system for image signing and verification within 10 days. Moreover, I was not working on this full-time, as I had other tasks. This set the stage for what followed and was the source of the pressure I felt over the past two months.

Some colleagues on the team had already done some research on this, and in the end we chose the Sigstore solution, that is, using Cosign for container image signing and verification.

Sigstore is an open-source project that aims to provide a keyless signing solution for the software supply chain. Its main goal is to simplify software signing and verification, thereby improving the security of the software supply chain. Sigstore is hosted by the OpenSSF (Open Source Security Foundation) under the Linux Foundation and is supported by several industry leaders.

Cosign is a tool in the Sigstore project designed specifically for signing and verifying container images. It simplifies the container image signing process and keeps signing and verification secure. Cosign supports multiple signing methods, including traditional private-key signing and keyless signing (using Sigstore).

Due to user requirements and some security considerations, we couldn’t simply let users run the Cosign command to sign and verify images themselves. Instead, we needed to develop a backend service and expose an API, and users would sign and verify by calling that API.

In the end, I used Django + Django REST Framework + K8S as the infrastructure to develop and run the service (hereinafter referred to as DIS, for Docker Image Sign).
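
To make the shape of the service concrete, here is a minimal sketch of what such an endpoint could look like. It is only an illustration under my assumptions: the view name, the key path, and the key-based (non-keyless) signing flow are hypothetical, not the actual DIS code.

```python
# Minimal sketch of a DIS-style signing endpoint (illustrative assumptions,
# not the actual DIS code): a DRF view that shells out to the Cosign CLI.
import subprocess

from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

COSIGN_KEY_PATH = "/secrets/cosign.key"  # hypothetical path to the signing key


class SignImageView(APIView):
    def post(self, request):
        image = request.data.get("image")  # e.g. "harbor.example.com/app/foo:1.2.3"
        if not image:
            return Response({"detail": "image is required"},
                            status=status.HTTP_400_BAD_REQUEST)

        # Cosign runs as an external process; the worker is blocked for the
        # whole run (roughly 7 seconds in our environment).
        result = subprocess.run(
            ["cosign", "sign", "--key", COSIGN_KEY_PATH, image],
            capture_output=True, text=True, timeout=60,
        )
        if result.returncode != 0:
            return Response({"detail": result.stderr},
                            status=status.HTTP_502_BAD_GATEWAY)
        return Response({"image": image, "signed": True})
```

The detail that matters later is that each request shells out to Cosign and blocks a worker until the command finishes.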

As mentioned at the beginning, the PCI DSS requirement was: “Sign and verify all images to be deployed to the production environment before deployment.” If you are familiar with CI/CD, you may have already spotted the problem.
It means that image signing and verification sit directly in the CD path: if signing and verification do not complete, the image cannot be deployed to production. DIS therefore had to be available 24 hours a day, 365 days a year, and be able to handle concurrent requests. And DIS was developed by me alone in about three or four days, not even full-time.

About 5 days after going live, the first incident occurred.

I configured the k8s pods running the DIS backend with a 1-core CPU and 1 GB of memory. This was a public template provided by our infrastructure colleagues and could not be customized. My idea was that since I couldn’t change the per-pod compute resources, I would increase the number of pod replicas to improve the backend’s processing capacity. So I set up 10 pods.
Due to the tight schedule, no load testing was done before going live. After going live, I was busy guiding users on how to use DIS and still had no time for load testing. About 5 days after going live, we hit an unusually high level of concurrency.

The “unusually high” concurrency was actually only around 600. Normally there would be only around 200 concurrent requests, because at that point DIS had not been widely promoted and only the most urgent projects were using it. The extra 400 or so concurrent requests came from a misconfigured CI/CD pipeline in one project team. Every time they merged a branch in the code repository, the pipeline would trigger and sign and verify all of their images. They had nearly 400 images, even though in practice only two or three had changed and actually needed to be signed and verified. Their branch-merging frequency was also very high, at peak merging a branch every few minutes. In other words, this one project alone was sending 400 requests to DIS every few minutes.

Later, when I checked the monitoring and logs, I found that this went on for about half an hour before the DIS backend could no longer hold on and crashed.

The reason a DIS backend with 10 pods of 1C1G could not even handle 600 concurrent requests is that the DIS service essentially executes a Cosign shell command, and that command takes about 7 seconds to run, which cannot be optimized. Each request therefore occupies a worker for more than 7 seconds, which makes it hard to scale up concurrency.
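
A rough back-of-envelope calculation shows why this setup couldn’t keep up. The number of workers per pod below is my assumption for illustration; the pod count and the 7-second execution time are the real figures from the incident.

```python
# Back-of-envelope math for why ~600 near-simultaneous requests overwhelmed DIS.
PODS = 10
WORKERS_PER_POD = 3          # assumed blocking (sync) workers per 1-core pod
SECONDS_PER_REQUEST = 7      # each request blocks a worker while cosign runs

max_in_flight = PODS * WORKERS_PER_POD            # 30 requests in flight at once
throughput = max_in_flight / SECONDS_PER_REQUEST  # ~4.3 requests per second
burst = 400                                       # one misconfigured pipeline run

print(f"~{throughput:.1f} req/s; a {burst}-image burst needs "
      f"~{burst / throughput:.0f}s to drain")
# A single 400-image burst takes well over a minute to drain, new bursts
# arrived every few minutes, and the normal ~200 concurrent requests were
# still coming in, so the queue only grew until the pods fell over.
```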

After discovering the problem, I restarted the service immediately and asked the infrastructure colleagues to adjust the computing resources. I also contacted the project leader mentioned above and asked them to suspend their CI/CD. After a few minutes, the service was restored.

After the incident, we added monitoring items, such as the number of concurrent requests, computing resource utilization, and specific error messages in the logs.
Yes, before this problem occurred, DIS had no monitoring at all.

About 40 days after DIS went live, another incident occurred, and this one was ultimately classified as a P0-level incident.

The DIS backend depends on a private service called Rekor, which was run by other colleagues in our department. Since they already had this service, I talked to them about sharing it once I learned of it, and they agreed.

Rekor is an important part of the Sigstore project: a transparency log that immutably records the signatures and other metadata of software artifacts.

This incident was caused by a change to Rekor: after the change, its FQDN no longer pointed to the correct backend service, so the DIS backend could not reach it.
More importantly, the owner of Rekor did not notify me before the change and did no testing afterwards. I had no idea any of this had happened.
Even more importantly, DIS’s monitoring was not comprehensive enough, so the failure was not detected in time.
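
Even a very simple dependency probe, run periodically and wired into alerting, would likely have surfaced the problem within minutes rather than hours. The sketch below is only an illustration under my assumptions: the Rekor URL is hypothetical, and how the probe is scheduled is up to the operator.

```python
# Minimal sketch of a dependency health check for the Rekor service DIS relies on.
# The URL is hypothetical; Rekor exposes GET /api/v1/log, which returns the
# current state of its transparency log.
import requests

REKOR_URL = "https://rekor.internal.example.com"  # hypothetical internal Rekor FQDN


def rekor_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the Rekor log endpoint answers with HTTP 200."""
    try:
        resp = requests.get(f"{REKOR_URL}/api/v1/log", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False


# Run this on a schedule (cron, a Kubernetes probe, or a monitoring job) and
# alert on failure, instead of waiting for a blocked deployment to reveal it.
if __name__ == "__main__":
    print("rekor ok" if rekor_is_healthy() else "rekor unreachable")
```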

This failure lasted 17.5 hours before it was discovered. Unfortunately, during that window our company’s most important SaaS product needed to fix a bug in production. Because DIS was down, that product’s development team could not deploy the fix until we restored DIS.

Later, I heard through the grapevine that the company paid out several hundred thousand dollars in compensation because of that bug.

After the incident, our team conducted a post-incident review. The final conclusion was: “We should not have rashly promised the colleague in charge of PCI that we could deliver the DIS service in such a short time.” That was the root cause.

I’m very glad that the company and my leader did not hold me accountable for this incident. After weighing many factors, the conclusion was that no one on our team was confident they could have done a better job with DIS in such a short time. I’d like to thank my leader here.

I believe our team is, on the whole, a responsible and proactive one. So when we first learned that PCI required an image signing and verification service, both my leader and I promised to do our best to provide it. But at that time we did not comprehensively assess the risks; we did not realize that DIS would block the CD process. If we had realized it then, we definitely would not have made such a rash promise. And even if we had, it would not have been built by me part-time; instead, we would have temporarily pulled in a considerable amount of resources to work on it.

Let me share a little story. About 15 years ago, a colleague in our company’s Philippines office made a wrong configuration change, and a Japanese railway was shut down for several hours because the railway’s IT systems used our company’s products. The company’s senior management held a press conference to apologize to the Japanese public, but in the end they chose not to disclose the employee’s identity in order to protect him, and took no punitive measures against him.

What I want to say is that perhaps it is because of that incident many years ago, and the way the company handled it, that people here can take the initiative without being overly hesitant.

In fact, even without the PCI DSS requirement, our department had already planned to develop DIS as a project two months later.

I’m recording my first P0 incident in seven years of work in some detail so that I can look back on it in the future.
I hope I never encounter a P0 incident again.