A major outage disrupted service for Proton users worldwide while the company was carrying out an infrastructure upgrade intended to improve reliability. The outage stemmed from complications in Proton's migration to Kubernetes-based infrastructure, compounded by an ill-timed software update.
What Happened
The incident began during Proton's planned migration to Kubernetes, which ran in parallel with the existing systems. During this transition, a software change triggered an unexpected surge in database connections that exceeded what the database servers could handle. The situation worsened around 4:00 PM Zurich time, when rising user activity pushed the failure rate to nearly 50% of requests.
Technical Details
The outage resulted from several compounding factors:
- The parallel operation of legacy and new Kubernetes systems created load balancing challenges
- A software update caused an initial spike in system load
- Database servers became overwhelmed by a sudden surge in connection requests
- The new infrastructure struggled to scale appropriately under pressure
- Peak user activity amplified existing system strain
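One common way to keep a connection surge like the one described above from reaching the database is to cap concurrent connections on the client side. The following is a minimal sketch of that idea, not Proton's actual implementation: the class name `BoundedConnectionPool`, its parameters, and the placeholder connection object are all hypothetical.

```python
import threading
from contextlib import contextmanager


class BoundedConnectionPool:
    """Caps concurrent database connections (hypothetical illustration)."""

    def __init__(self, max_connections, acquire_timeout=0.5):
        # Each semaphore slot represents one allowed open connection.
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = acquire_timeout
        self._lock = threading.Lock()
        self.in_use = 0
        self.rejected = 0

    @contextmanager
    def connection(self):
        # Fail fast instead of opening yet another connection when the
        # pool is already at its limit.
        if not self._slots.acquire(timeout=self._timeout):
            with self._lock:
                self.rejected += 1
            raise TimeoutError("connection pool exhausted")
        with self._lock:
            self.in_use += 1
        try:
            yield object()  # stand-in for a real database connection
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()
```

With a cap in place, a burst of requests produces fast, visible rejections at the application layer rather than an unbounded number of connections on the database servers. The right limit and timeout depend on the database's real capacity, which this sketch does not model.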
Recovery Actions
To restore service, Proton's technical team rolled back the problematic software change to normalize database load. The episode highlighted the risk of deploying updates during a major infrastructure transition.
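A rollback decision of this kind is often tied to an observed failure rate. As a rough illustration only, the sketch below tracks recent request outcomes in a sliding window and signals when the failure rate crosses a threshold. The class `FailureRateMonitor`, the window size, and the 25% default threshold are assumptions made for the example and do not describe Proton's tooling.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks recent request outcomes (hypothetical illustration)."""

    def __init__(self, window_size=100, threshold=0.25):
        # Only the most recent `window_size` outcomes are kept.
        self._results = deque(maxlen=window_size)
        self._threshold = threshold

    def record(self, succeeded):
        self._results.append(bool(succeeded))

    def failure_rate(self):
        if not self._results:
            return 0.0
        failures = sum(1 for ok in self._results if not ok)
        return failures / len(self._results)

    def should_roll_back(self):
        # Wait for a full window so a few early failures do not
        # trigger a rollback on their own.
        window_full = len(self._results) == self._results.maxlen
        return window_full and self.failure_rate() >= self._threshold
```

In practice such a signal would feed an alert or a deployment pipeline rather than act alone, and the threshold would be tuned to the service's normal error rate.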
Impact on Users
Although Proton maintained sufficient server capacity throughout the incident, the complex interaction between the old and new systems left services only intermittently available. Many users experienced failed connection attempts and service disruptions during peak usage periods.
Looking Forward
The outage exposed several areas requiring attention in Proton's infrastructure strategy:
- Better coordination between system migrations and software updates
- Enhanced load balancing mechanisms
- Improved scaling capabilities for database connections
- More robust redundancy systems
The incident is a reminder of how difficult it is to modernize technical infrastructure while keeping services stable. As Proton continues refining its systems, the experience gained from this outage should help it avoid similar disruptions in future upgrades.