Liwati menyang isi utama
OpenAI

22 Januari 2026

Rekayasa

Menskalakan PostgreSQL untuk 800 yuta pangguna ChatGPT

Dening Bohan Zhang, Anggota Staf Teknis

Lagi dimuat…

Wis pirang-pirang taun, PostgreSQL dadi salah siji sistem data paling kritis ing balik layar sing ndhukung produk inti kaya ChatGPT lan API OpenAI. Bareng basis pangguna saya cepet tuwuh, tuntutan marang basis data kita uga mundhak kanthi eksponensial. Sajrone setaun kepungkur, beban PostgreSQL kita wis mundhak luwih saka 10x, lan isih terus munggah kanthi cepet.

Upaya kita kanggo ngembangake infrastruktur produksi supaya bisa nahan pertumbuhan iki nggawa wawasan anyar: PostgreSQL bisa diskalakake kanggo ndhukung workload sing dominan diwaca kanthi andal ing ukuran sing luwih gedhe tinimbang sing biyen dianggep bisa. Sistem iki (sing wiwitane digawe tim ilmuwan saka University of California, Berkeley) wis ndadekake kita bisa ndhukung lalu lintas global gedhe nganggo siji instance server fleksibel Azure PostgreSQL(mbukak ing jendhela anyar) primer lan meh 50 replika baca sing sumebar ing macem-macem region ing donya. Iki crita carane kita menskalakan PostgreSQL ing OpenAI kanggo ndhukung mayuta-yuta query per detik kanggo 800 yuta pangguna liwat optimisasi sing disiplin lan rekayasa sing kuwat; kita uga bakal nuduhake pelajaran penting sing kita entuk sajrone proses iki.

Celah ing desain awal kita

Sawise peluncuran ChatGPT, lalu lintas tuwuh kanthi laju sing durung tau ana sadurunge. Kanggo ndhukung iki, kita cepet ngetrapake optimisasi ekstensif ing lapisan aplikasi lan basis data PostgreSQL, scale up kanthi nambah ukuran instance, lan scale out kanthi nambah luwih akeh replika baca. Arsitektur iki wis migunani banget kanggo kita sajrone wektu suwe. Kanthi perbaikan sing terus lumaku, arsitektur iki isih menehi ruang cukup kanggo pertumbuhan ing mangsa ngarep.

Bisa uga katon nggumunake yen arsitektur primer tunggal bisa nyukupi tuntutan skala OpenAI; nanging, nggawe iki bisa mlaku ing praktik ora gampang. Kita wis ndeleng sawetara SEV amarga Postgres kakehan beban, lan polahe kerep padha: masalah hulu nyebabake lonjakan beban basis data kanthi dadakan, kayata cache miss massal amarga kegagalan lapisan cache, ledhakan multi-way join sing larang nganti nggawe CPU kebak, utawa badai tulis saka peluncuran fitur anyar. Nalika panggunaan sumber daya munggah, latensi query uga mundhak lan panjaluk wiwit timeout. Retry banjur malah nambah beban, memicu siklus ala sing bisa ngganggu kabeh layanan ChatGPT lan API.

diagram penskalaan beban

Sanadyan PostgreSQL bisa diskalakake kanthi apik kanggo workload kita sing dominan diwaca, kita isih nemoni tantangan nalika periode lalu lintas tulis sing dhuwur. Iki umume amarga implementasi multiversion concurrency control (MVCC) ing PostgreSQL, sing ndadekake sistem iki kurang efisien kanggo workload sing dominan nulis. Contone, nalika sawijining query nganyari tuple utawa malah mung siji field, kabeh row disalin kanggo nggawe versi anyar. Ing beban tulis sing abot, iki nyebabake amplifikasi tulis sing signifikan. Iki uga nambah amplifikasi baca, amarga query kudu mindhai pirang-pirang versi tuple (dead tuple) kanggo njupuk versi paling anyar. MVCC uga nambah tantangan liya kayata bloat tabel lan indeks, overhead pangopènan indeks sing luwih gedhe, lan tuning autovacuum sing kompleks. (Sampeyan bisa maca pembahasan jero babagan masalah iki ing blog sing tak tulis bareng Prof. Andy Pavlo saka Carnegie Mellon University kanthi judhul Bagian PostgreSQL yang Paling Kami Benci(mbukak ing jendhela anyar), sing dikutip(mbukak ing jendhela anyar) ing kaca Wikipedia PostgreSQL.)

Menskalakan PostgreSQL nganti mayuta-yuta QPS

Kanggo nyuda watesan iki lan ngurangi tekanan tulis, kita wis migrasi lan terus migrasi workload sing bisa di-shard (yaiku workload sing bisa dipartisi sacara horisontal), sing dominan nulis, menyang sistem ter-shard kaya Azure Cosmos DB, bareng ngoptimalake logika aplikasi supaya nyilikake tulis sing ora perlu. Kita uga wis ora ngidini nambah tabel anyar menyang deployment PostgreSQL saiki. Workload anyar kanthi gawan mlebu menyang sistem ter-shard.

Sanadyan infrastruktur kita wis berkembang, PostgreSQL tetep ora di-shard, kanthi siji instance primer sing nangani kabeh tulis. Alasan utamane yaiku amarga nge-shard workload aplikasi sing wis ana bakal rumit banget lan mbutuhake wektu suwe, amarga kudu ngowahi atusan titik pungkasan aplikasi lan bisa mbutuhake wulan utawa malah taun. Amarga workload kita utamane dominan diwaca, lan kita wis ngetrapake optimisasi sing ekstensif, arsitektur saiki isih menehi headroom sing cukup kanggo ndhukung pertumbuhan lalu lintas sing terus lumaku. Sanadyan kita ora nutup kemungkinan nge-shard PostgreSQL ing mangsa ngarep, iki dudu prioritas jangka cedhak amarga ruang tumbuh kanggo pertumbuhan saiki lan mengko isih cukup.

Ing bagean sabanjure, kita bakal mbahas tantangan sing kita temoni lan optimisasi ekstensif sing kita terapkan kanggo ngatasi lan nyegah outage ing mangsa ngarep, nyurung PostgreSQL tekan watese lan menskalakake nganti mayuta-yuta query per detik (QPS).

Ngurangi beban ing primer

Tantangan: Amarga mung ana siji writer, setup primer tunggal ora bisa menskalakan tulis. Lonjakan tulis abot bisa cepet mbebani primer lan mengaruhi layanan kaya ChatGPT lan API kita.

Solusi: Kita nyilikake beban ing primer sak maksimal mungkin—baca lan tulis—supaya kapasitasé cukup kanggo nangani lonjakan tulis. Lalu lintas baca dialihake menyang replika yen bisa. Nanging, sawetara query baca kudu tetep ana ing primer amarga dadi bagean saka transaksi tulis. Kanggo kuwi, kita fokus supaya query kasebut efisien lan ora dadi query alon. Kanggo lalu lintas tulis, kita wis migrasi workload sing bisa di-shard lan dominan nulis menyang sistem ter-shard kaya Azure CosmosDB. Workload sing luwih angel di-shard nanging isih ngasilake volume tulis dhuwur butuh wektu luwih suwe kanggo dimigrasi, lan proses kasebut isih terus lumaku. Kita uga kanthi agresif ngoptimalake aplikasi kanggo nyuda beban tulis; contone, kita ndandani bug aplikasi sing nyebabake tulis redundan lan ngenalake lazy write yen cocog, kanggo nglembutake lonjakan lalu lintas. Kajaba iku, nalika nindakake backfill field tabel, kita ngetrapake rate limit sing ketat kanggo nyegah tekanan tulis sing berlebihan.

Optimisasi query

Tantangan: Kita ngenali sawetara query larang ing PostgreSQL. Ing jaman kepungkur, lonjakan volume dadakan ing query iki ngonsumsi CPU gedhe, nganti alonake panjaluk ChatGPT lan API.

Solusi: Sawetara query larang, kayata sing nggabungake akeh tabel sekaligus, bisa banget ngrusak utawa malah njatuhake kabeh layanan. Kita kudu terus-terusan ngoptimalake query PostgreSQL supaya efisien lan ngindhari anti-pola Online Transaction Processing (OLTP) sing umum. Contone, kita tau ngenali query sing larange nemen lan nggabungake 12 tabel, lan lonjakan query iki dadi sebab SEV parah ing jaman kepungkur. Yen bisa, kita kudu ngindhari join multi-tabel sing kompleks. Yen join pancen perlu, kita sinau kanggo nimbang mecah query kasebut lan mindhah logika join kompleks menyang lapisan aplikasi. Akeh query bermasalah iki digawe dening framework Object-Relational Mapping (ORM), mula penting kanggo nliti kanthi teliti SQL sing diasilake lan mesthekake prilakune sesuai pangarepan. Uga umum ditemokake query idle sing mlaku suwe ing PostgreSQL. Nata timeout kaya idle_in_transaction_session_timeout iku penting kanggo nyegah query kasebut ngalangi autovacuum.

Mitigasi single point of failure

Tantangan: Yen replika baca mati, lalu lintas isih bisa dialihake menyang replika liyane. Nanging, gumantung marang siji writer ateges duwe single point of failure—yen mati, kabeh layanan kena dampake.

Solusi: Umume panjaluk kritis mung nglibatake query baca. Kanggo nyuda single point of failure ing primer, kita mindhah beban baca kasebut saka writer menyang replika, supaya panjaluk kuwi isih bisa dilayani sanajan primer mati. Sanadyan operasi tulis tetep bakal gagal, dampake dadi luwih cilik; iki ora maneh dadi SEV0 amarga baca isih kasedhiya.

Kanggo nyuda kegagalan primer, kita mbukak primer ing mode High-Availability (HA) kanthi hot standby, yaiku replika sing terus disinkronake lan tansah siap njupuk alih layanan lalu lintas. Yen primer mati utawa kudu dipateni sementara kanggo maintenance, kita bisa cepet ningkatake standby kanggo nyilikake downtime. Tim Azure PostgreSQL wis nindakake akèh karya penting kanggo mesthekake failover iki tetep aman lan andal sanajan ing beban sing dhuwur banget. Kanggo nangani kegagalan replika baca, kita nyebarake pirang-pirang replika ing saben region kanthi headroom kapasitas sing cukup, supaya kegagalan siji replika ora nyebabake outage regional.

Isolasi workload

Tantangan: Kita kerep nemoni kahanan nalika panjaluk tartamtu ngonsumsi sumber daya kanthi ora proporsional ing instance PostgreSQL. Iki bisa nyebabake degradasi kinerja kanggo workload liyane sing mlaku ing instance sing padha. Contone, peluncuran fitur anyar bisa ngenalake query sing ora efisien lan ngentekake CPU PostgreSQL kanthi gedhe, nganti ngalonake panjaluk kanggo fitur kritis liyane.

Solution: To mitigate the “noisy neighbor” problem, we isolate workloads onto dedicated instances to ensure that sudden spikes in resource-intensive requests don’t impact other traffic. Specifically, we split requests into low-priority and high-priority tiers and route them to separate instances. This way, even if a low-priority workload becomes resource-intensive, it won’t degrade the performance of high-priority requests. We apply the same strategy across different products and services as well, so that activity from one product does not affect the performance or reliability of another.

Connection pooling

Challenge: Each instance has a maximum connection limit (5,000 in Azure PostgreSQL). It’s easy to run out of connections or accumulate too many idle ones. We’ve previously had incidents caused by connection storms that exhausted all available connections.

Solution: We deployed PgBouncer as a proxy layer to pool database connections. Running it in statement or transaction pooling mode allows us to efficiently reuse connections, greatly reducing the number of active client connections. This also cuts connection setup latency: in our benchmarks, the average connection time dropped from 50 milliseconds (ms) to 5 ms. Inter-region connections and requests can be expensive, so we co-locate the proxy, clients, and replicas in the same region to minimize network overhead and connection use time. Moreover, PgBouncer must be configured carefully. Settings like idle timeouts are critical to prevent connection exhaustion.

diagram proksi PostgreSQL

Saben replika baca nduweni deployment Kubernetes dhewe sing mbukak pirang-pirang pod PgBouncer. Kita mbukak pirang-pirang deployment Kubernetes ing mburi Kubernetes Service sing padha, sing mbagi lalu lintas kanthi seimbang ing antarane pod.

Caching

Challenge: A sudden spike in cache misses can trigger a surge of reads on the PostgreSQL database, saturating CPU and slowing user requests.

Solution: To reduce read pressure on PostgreSQL, we use a caching layer to serve most of the read traffic. However, when cache hit rates drop unexpectedly, the burst of cache misses can push a large volume of requests directly to PostgreSQL. This sudden increase in database reads consumes significant resources, slowing down the service. To prevent overload during cache-miss storms, we implement a cache locking (and leasing) mechanism so that only a single reader that misses on a particular key fetches the data from PostgreSQL. When multiple requests miss on the same cache key, only one request acquires the lock and proceeds to retrieve the data and repopulate the cache. All other requests wait for the cache to be updated rather than all hitting PostgreSQL at once. This significantly reduces redundant database reads and protects the system from cascading load spikes.

Scaling read replicas

Challenge: The primary streams Write Ahead Log (WAL) data to every read replica. As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on both network bandwidth and CPU. This causes higher and more unstable replica lag, which makes the system harder to scale reliably.

Solution: We operate nearly 50 read replicas across multiple geographic regions to minimize latency. However, with the current architecture, the primary must stream WAL to every replica. Although it currently scales well with very large instance types and high-network bandwidth, we can’t keep adding replicas indefinitely without eventually overloading the primary. To address this, we’re collaborating with the Azure PostgreSQL team on cascading replication(mbukak ing jendhela anyar), where intermediate replicas relay WAL to downstream replicas. This approach allows us to scale to potentially over a hundred replicas without overwhelming the primary. However, it also introduces additional operational complexity, particularly around failover management. The feature is still in testing; we’ll ensure it’s robust and can fail over safely before rolling it out to production.

diagram replikasi berantai PostgreSQL

Rate limit

Challenge: A sudden traffic spike on specific endpoints, a surge of expensive queries, or a retry storm can quickly exhaust critical resources such as CPU, I/O, and connections, which causes widespread service degradation.

Solution: We implemented rate-limiting across multiple layers—application, connection pooler, proxy, and query—to prevent sudden traffic spikes from overwhelming database instances and triggering cascading failures. It’s also crucial to avoid overly short retry intervals, which can trigger retry storms. We also enhanced the ORM layer to support rate limiting and when necessary, fully block specific query digests. This targeted form of load shedding enables rapid recovery from sudden surges of expensive queries.

Schema Management

Challenge: Even a small schema change, such as altering a column type, can trigger a full table rewrite(mbukak ing jendhela anyar). We therefore apply schema changes cautiously—limiting them to lightweight operations and avoiding any that rewrite entire tables.

Solution: Only lightweight schema changes are permitted, such as adding or removing certain columns that do not trigger a full table rewrite. We enforce a strict 5-second timeout on schema changes. Creating and dropping indexes concurrently is allowed. Schema changes are restricted to existing tables. If a new feature requires additional tables, they must be in alternative sharded systems such as Azure CosmosDB rather than PostgreSQL. When backfilling a table field, we apply strict rate limits to prevent write spikes. Although this process can sometimes take over a week, it ensures stability and avoids any production impact.

Results and the road ahead

This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads. PostgreSQL handles millions of QPS for read-heavy workloads, powering OpenAI’s most critical products like ChatGPT and the API platform. We added nearly 50 read replicas, while keeping replication lag near zero, maintained low-latency reads across geo-distributed regions, and built sufficient capacity headroom to support future growth.

This scaling works while still minimizing latency and improving reliability. We consistently deliver low double-digit millisecond p99 client-side latency and five-nines availability in production. And over the past 12 months, we’ve had only one SEV-0 PostgreSQL incident (it occurred during the viral launch(mbukak ing jendhela anyar) of ChatGPT ImageGen, when write traffic suddenly surged by more than 10x as over 100 million new users signed up within a week.)

While we’re happy with how far PostgreSQL has taken us, we continue to push its limits to ensure we have sufficient runway for future growth. We’ve already migrated the shardable write-heavy workloads to our sharded systems like CosmosDB. The remaining write-heavy workloads are more challenging to shard—we’re actively migrating those as well to further offload writes from the PostgreSQL primary. We’re also working with Azure to enable cascading replication so we can safely scale to significantly more read replicas.

Looking ahead, we’ll continue to explore additional approaches to further scale, including sharded PostgreSQL or alternative distributed systems, as our infrastructure demands continue to grow.

Pangarang

Bohan Zhang

Ucapan matur nuwun

Matur nuwun khusus kanggo Jon Lee, Sicheng Liu, Chaomin Yu, lan Chenglong Hao, sing wis menehi kontribusi kanggo tulisan iki, uga kanggo kabeh tim sing mbantu menskalakan PostgreSQL. Kita uga arep matur nuwun marang tim Azure PostgreSQL kanggo kemitraan sing kuwat.