Niskalaw PostgreSQL biex iħaddem 800 miljun utent ta’ ChatGPT
Minn Bohan Zhang, Membru tal-Persunal Tekniku
Għal snin sħaħ, PostgreSQL kien wieħed mis-sistemi tad-data l-aktar kritiċi li jaħdmu fil-qalba ta’ prodotti ewlenin bħal ChatGPT u l-API ta’ OpenAI. Hekk kif il-bażi tal-utenti tagħna qed tikber malajr, it-talbiet fuq id-databases tagħna żdiedu b’mod esponenzjali wkoll. Matul l-aħħar sena, it-tagħbija tagħna fuq PostgreSQL kibret b’aktar minn 10 darbiet, u qed tkompli tiżdied malajr.
L-isforzi tagħna biex navvanzaw l-infrastruttura tal-produzzjoni tagħna biex inżommu dan it-tkabbir żvelaw għarfien ġdid: PostgreSQL jista’ jiġi skalat biex b’mod affidabbli jappoġġa ammonti ta’ xogħol ħafna akbar b’enfasi fuq qari milli ħafna kienu jaħsbu qabel li hu possibbli. Is-sistema (maħluqa inizjalment minn tim ta’ xjenzati fl-Università ta’ California, Berkeley) ippermettietilna nappoġġaw traffiku globali enormi b’istanza waħda primarja ta’ Azure PostgreSQL flexible server(jinfetaħ f’tieqa ġdida) u kważi 50 replika tal-qari mifruxa fuq diversi reġjuni madwar id-dinja. Din hija l-istorja ta’ kif skalejna PostgreSQL f’OpenAI biex nappoġġaw miljuni ta’ queries fis-sekonda għal 800 miljun utent permezz ta’ ottimizzazzjonijiet rigorużi u inġinerija soda; se nkopru wkoll it-tagħlimiet ewlenin li ksibna tul it-triq.
Wara t-tnedija ta’ ChatGPT, it-traffiku kiber b’rata bla preċedent. Biex nappoġġawh, implimentajna malajr ottimizzazzjonijiet estensivi kemm fis-saff tal-applikazzjoni kif ukoll fis-saff tad-database PostgreSQL, skalejna ’l fuq billi żidna d-daqs tal-istanza, u skalejna ’l barra billi żidna aktar repliki tal-qari. Din l-arkitettura qdietna tajjeb għal żmien twil. B’titjib kontinwu, tkompli tipprovdi spazju biżżejjed għal tkabbir futur.
Jista’ jidher sorprendenti li arkitettura b’primary waħda tista’ tlaħħaq mat-talbiet tal-iskala ta’ OpenAI; madankollu, li dan jaħdem fil-prattika mhuwiex sempliċi. Rajna diversi SEVs ikkawżati minn tagħbija żejda fuq Postgres, u ħafna drabi jsegwu l-istess mudell: problema upstream tikkawża żieda f’daqqa fit-tagħbija tad-database, bħal cache misses mifruxa minħabba falliment fis-saff tal-caching, żieda qawwija ta’ multi-way joins għaljin li jissaturaw is-CPU, jew write storm minn tnedija ta’ karatteristika ġdida. Hekk kif jogħla l-użu tar-riżorsi, il-latenza tal-queries tiżdied u t-talbiet jibdew jispiċċaw fi timeout. Ir-retries imbagħad ikomplu jkabbru t-tagħbija, u jqajmu ċiklu vizzjuż bil-potenzjal li jiddegrada s-servizzi kollha ta’ ChatGPT u tal-API.
Għalkemm PostgreSQL jiskala tajjeb għall-ammonti ta’ xogħol tagħna b’enfasi fuq qari, xorta niltaqgħu ma’ sfidi waqt perjodi ta’ traffiku għoli ta’ kitba. Dan huwa fil-biċċa l-kbira minħabba l-implimentazzjoni ta’ multiversion concurrency control (MVCC) ta’ PostgreSQL, li tagħmlu inqas effiċjenti għal ammonti ta’ xogħol b’enfasi fuq kitba. Pereżempju, meta query taġġorna tuple jew saħansitra kamp wieħed, ir-ringiela kollha tiġi kkupjata biex tinħoloq verżjoni ġdida. Taħt tagħbijiet kbar ta’ kitba, dan jirriżulta f’amplifikazzjoni sinifikanti tal-kitba. Iżid ukoll l-amplifikazzjoni tal-qari, billi l-queries iridu jiskennjaw diversi verżjonijiet ta’ tuples (dead tuples) biex jirkupraw l-aktar waħda reċenti. MVCC jintroduċi sfidi addizzjonali bħal nefħa ta’ tabelli u indiċijiet, żieda fl-overhead tal-manutenzjoni tal-indiċijiet, u tuning kumpless ta’ autovacuum. (Tista’ ssib analiżi fil-fond ta’ dawn il-kwistjonijiet f’blog li ktibt ma’ Prof. Andy Pavlo fl-Università Carnegie Mellon bl-isem Il-Parti ta’ PostgreSQL li Nistmerru l-Aktar(jinfetaħ f’tieqa ġdida), iċċitat(jinfetaħ f’tieqa ġdida) fil-paġna ta’ PostgreSQL fuq il-Wikipedija.)
Biex innaqqsu dawn il-limitazzjonijiet u nnaqqsu l-pressjoni tal-kitba, immigrajna, u qed inkomplu nimigraw, ammonti ta’ xogħol shardable (jiġifieri, ammonti ta’ xogħol li jistgħu jinqasmu orizzontalment), b’enfasi fuq kitba, lejn sistemi mqassma f’shards bħal Azure Cosmos DB, filwaqt li nottimizzaw il-loġika tal-applikazzjoni biex innaqqsu kitbiet mhux meħtieġa. Ma nħallux aktar ukoll li jiżdiedu tabelli ġodda mad-deployment attwali ta’ PostgreSQL. Ammonti ta’ xogħol ġodda b’mod awtomatiku jmorru fuq is-sistemi mqassma f’shards.
Anke hekk kif l-infrastruttura tagħna evolviet, PostgreSQL baqa’ mhux maqsum f’shards, b’istanza primarja waħda li sservi l-kitbiet kollha. Ir-raġuni ewlenija hija li t-tqassim f’shards tal-ammonti ta’ xogħol eżistenti tal-applikazzjoni jkun kumpless ħafna u jieħu ħafna ħin, u jkun jeħtieġ bidliet f’mijiet ta’ endpoints tal-applikazzjoni u jista’ jieħu xhur jew anke snin. Peress li l-ammonti ta’ xogħol tagħna huma prinċipalment b’enfasi fuq qari, u implimentajna ottimizzazzjonijiet estensivi, l-arkitettura attwali għadha tipprovdi marġni biżżejjed biex tappoġġa tkabbir kontinwu tat-traffiku. Filwaqt li mhux qed neskludu tqassim f’shards ta’ PostgreSQL fil-futur, mhijiex prijorità fil-qrib minħabba l-ispazju biżżejjed li għandna għat-tkabbir attwali u futur.
Fit-taqsimiet li ġejjin, se nidħlu fil-fond fl-isfidi li affrontajna u l-ottimizzazzjonijiet estensivi li implimentajna biex nindirizzawhom u nevitaw outages futuri, billi nimbuttaw PostgreSQL sal-limiti tiegħu u niskalawh għal miljuni ta’ queries fis-sekonda (QPS).
Sfida: B’writer wieħed biss, setup b’primary waħda ma jistax jiskala l-kitbiet. Żidiet qawwija fil-kitba jistgħu malajr jgħabbu żżejjed il-primary u jħallu impatt fuq servizzi bħal ChatGPT u l-API tagħna.
Soluzzjoni: Innaqqsu t-tagħbija fuq il-primary kemm jista’ jkun—kemm qari kif ukoll kitba—biex niżguraw li jkollha kapaċità biżżejjed biex tlaħħaq ma’ żidiet fil-kitba. It-traffiku tal-qari jiġi mċaqlaq lejn repliki kull fejn hu possibbli. Madankollu, xi read queries iridu jibqgħu fuq il-primary għax huma parti minn transazzjonijiet ta’ kitba. Għal dawn, niffokaw biex niżguraw li jkunu effiċjenti u nevitaw slow queries. Għat-traffiku tal-kitba, immigrajna ammonti ta’ xogħol shardable, b’enfasi fuq kitba, lejn sistemi mqassma f’shards bħal Azure CosmosDB. Ammonti ta’ xogħol li huma aktar diffiċli biex jinqasmu f’shards iżda xorta jiġġeneraw volum għoli ta’ kitba jieħdu aktar żmien biex jiġu migrati, u dak il-proċess għadu għaddej. Ottimizzajna wkoll l-applikazzjonijiet tagħna b’mod aggressiv biex innaqqsu t-tagħbija tal-kitba; pereżempju, irranġajna bugs fl-applikazzjoni li kienu jikkawżaw kitbiet ridondanti u introduċejna lazy writes, fejn xieraq, biex intaffu żidiet fit-traffiku. Barra minn hekk, meta nagħmlu backfilling ta’ fields fit-tabelli, ninfurzaw limiti stretti tar-rata biex nevitaw pressjoni eċċessiva tal-kitba.
Sfida: Identifikajna diversi queries għaljin f’PostgreSQL. Fil-passat, żidiet f’daqqa fil-volum ta’ dawn il-queries kienu jikkunsmaw ammonti kbar ta’ CPU, u jnaqqsu kemm it-talbiet ta’ ChatGPT kif ukoll dawk tal-API.
Soluzzjoni: Ftit queries għaljin, bħal dawk li jgħaqqdu ħafna tabelli flimkien, jistgħu jnaqqsu b’mod sinifikanti jew saħansitra jwaqqgħu s-servizz kollu. Irridu nottimizzaw kontinwament il-queries ta’ PostgreSQL biex niżguraw li huma effiċjenti u nevitaw anti-patterns komuni ta’ Online Transaction Processing (OLTP). Pereżempju, darba identifikajna query għalja immens li kienet tgħaqqad 12-il tabella, fejn żidiet f’din il-query kienu responsabbli għal SEVs ta’ severità għolja fil-passat. Għandna nevitaw multi-table joins kumplessi kull fejn hu possibbli. Jekk il-joins huma meħtieġa, tgħallimna nikkunsidraw li nkissru l-query u minflok nimxu l-loġika kumplessa tal-join lejn is-saff tal-applikazzjoni. Ħafna minn dawn il-queries problematiċi jiġu ġġenerati minn frameworks ta’ Object-Relational Mapping (ORMs), għalhekk huwa importanti li nirrevedu bir-reqqa l-SQL li jipproduċu u niżguraw li jaġixxi kif mistenni. Huwa komuni wkoll li ssib idle queries li jdumu ħafna f’PostgreSQL. Il-konfigurazzjoni ta’ timeouts bħal idle_in_transaction_session_timeout hija essenzjali biex ma jħalluhomx jimblokkaw l-autovacuum.
Sfida: Jekk replika tal-qari tieqaf taħdem, it-traffiku xorta jista’ jintbagħat lejn repliki oħra. Madankollu, dipendenza fuq writer wieħed tfisser punt uniku ta’ falliment—jekk dan jieqaf, is-servizz kollu jintlaqat.
Soluzzjoni: Il-biċċa l-kbira tat-talbiet kritiċi jinvolvu biss read queries. Biex innaqqsu l-punt uniku ta’ falliment fil-primary, mexxejna dawk il-qari mill-writer lejn repliki, u niżguraw li dawk it-talbiet ikunu jistgħu jkomplu jservu anke jekk il-primary jieqaf. Filwaqt li l-operazzjonijiet ta’ kitba xorta jonqsu, l-impatt jonqos; ma jibqax SEV0 peress li l-qari jibqa’ disponibbli.
Biex innaqqsu l-fallimenti tal-primary, inħaddmu l-primary fil-modalità High-Availability (HA) bi hot standby, replika sinkronizzata kontinwament li dejjem tkun lesta tieħu f’idejha s-servizz tat-traffiku. Jekk il-primary tieqaf jew ikollha tittieħed offline għall-manutenzjoni, nistgħu malajr nippromwovu l-standby biex innaqqsu d-downtime. It-tim ta’ Azure PostgreSQL għamel ħafna xogħol biex jiżgura li dawn il-failovers jibqgħu siguri u affidabbli anke taħt tagħbija għolja ħafna. Biex nittrattaw fallimenti ta’ repliki tal-qari, nedeployjaw diversi repliki f’kull reġjun b’marġni ta’ kapaċità biżżejjed, u b’hekk niżguraw li falliment ta’ replika waħda ma jwassalx għal outage reġjonali.
Sfida: Spiss niltaqgħu ma’ sitwazzjonijiet fejn ċerti talbiet jikkunsmaw ammont sproporzjonat ta’ riżorsi fuq l-istanzi ta’ PostgreSQL. Dan jista’ jwassal għal prestazzjoni degradata għal ammonti ta’ xogħol oħra li jkunu qed jaħdmu fuq l-istess istanzi. Pereżempju, tnedija ta’ karatteristika ġdida tista’ tintroduċi queries ineffiċjenti li jikkunsmaw ħafna CPU ta’ PostgreSQL, u jnaqqsu t-talbiet għal karatteristiċi kritiċi oħra.
Solution: To mitigate the “noisy neighbor” problem, we isolate workloads onto dedicated instances to ensure that sudden spikes in resource-intensive requests don’t impact other traffic. Specifically, we split requests into low-priority and high-priority tiers and route them to separate instances. This way, even if a low-priority workload becomes resource-intensive, it won’t degrade the performance of high-priority requests. We apply the same strategy across different products and services as well, so that activity from one product does not affect the performance or reliability of another.
Challenge: Each instance has a maximum connection limit (5,000 in Azure PostgreSQL). It’s easy to run out of connections or accumulate too many idle ones. We’ve previously had incidents caused by connection storms that exhausted all available connections.
Solution: We deployed PgBouncer as a proxy layer to pool database connections. Running it in statement or transaction pooling mode allows us to efficiently reuse connections, greatly reducing the number of active client connections. This also cuts connection setup latency: in our benchmarks, the average connection time dropped from 50 milliseconds (ms) to 5 ms. Inter-region connections and requests can be expensive, so we co-locate the proxy, clients, and replicas in the same region to minimize network overhead and connection use time. Moreover, PgBouncer must be configured carefully. Settings like idle timeouts are critical to prevent connection exhaustion.
Kull replika tal-qari għandha d-deployment Kubernetes tagħha stess li jħaddem diversi pods ta’ PgBouncer. Aħna nħaddmu diversi deployments ta’ Kubernetes wara l-istess Kubernetes Service, li jqassam it-traffiku bejn il-pods.
Challenge: A sudden spike in cache misses can trigger a surge of reads on the PostgreSQL database, saturating CPU and slowing user requests.
Solution: To reduce read pressure on PostgreSQL, we use a caching layer to serve most of the read traffic. However, when cache hit rates drop unexpectedly, the burst of cache misses can push a large volume of requests directly to PostgreSQL. This sudden increase in database reads consumes significant resources, slowing down the service. To prevent overload during cache-miss storms, we implement a cache locking (and leasing) mechanism so that only a single reader that misses on a particular key fetches the data from PostgreSQL. When multiple requests miss on the same cache key, only one request acquires the lock and proceeds to retrieve the data and repopulate the cache. All other requests wait for the cache to be updated rather than all hitting PostgreSQL at once. This significantly reduces redundant database reads and protects the system from cascading load spikes.
Challenge: The primary streams Write Ahead Log (WAL) data to every read replica. As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on both network bandwidth and CPU. This causes higher and more unstable replica lag, which makes the system harder to scale reliably.
Solution: We operate nearly 50 read replicas across multiple geographic regions to minimize latency. However, with the current architecture, the primary must stream WAL to every replica. Although it currently scales well with very large instance types and high-network bandwidth, we can’t keep adding replicas indefinitely without eventually overloading the primary. To address this, we’re collaborating with the Azure PostgreSQL team on cascading replication(jinfetaħ f’tieqa ġdida), where intermediate replicas relay WAL to downstream replicas. This approach allows us to scale to potentially over a hundred replicas without overwhelming the primary. However, it also introduces additional operational complexity, particularly around failover management. The feature is still in testing; we’ll ensure it’s robust and can fail over safely before rolling it out to production.
Challenge: A sudden traffic spike on specific endpoints, a surge of expensive queries, or a retry storm can quickly exhaust critical resources such as CPU, I/O, and connections, which causes widespread service degradation.
Solution: We implemented rate-limiting across multiple layers—application, connection pooler, proxy, and query—to prevent sudden traffic spikes from overwhelming database instances and triggering cascading failures. It’s also crucial to avoid overly short retry intervals, which can trigger retry storms. We also enhanced the ORM layer to support rate limiting and when necessary, fully block specific query digests. This targeted form of load shedding enables rapid recovery from sudden surges of expensive queries.
Challenge: Even a small schema change, such as altering a column type, can trigger a full table rewrite(jinfetaħ f’tieqa ġdida). We therefore apply schema changes cautiously—limiting them to lightweight operations and avoiding any that rewrite entire tables.
Solution: Only lightweight schema changes are permitted, such as adding or removing certain columns that do not trigger a full table rewrite. We enforce a strict 5-second timeout on schema changes. Creating and dropping indexes concurrently is allowed. Schema changes are restricted to existing tables. If a new feature requires additional tables, they must be in alternative sharded systems such as Azure CosmosDB rather than PostgreSQL. When backfilling a table field, we apply strict rate limits to prevent write spikes. Although this process can sometimes take over a week, it ensures stability and avoids any production impact.
This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads. PostgreSQL handles millions of QPS for read-heavy workloads, powering OpenAI’s most critical products like ChatGPT and the API platform. We added nearly 50 read replicas, while keeping replication lag near zero, maintained low-latency reads across geo-distributed regions, and built sufficient capacity headroom to support future growth.
This scaling works while still minimizing latency and improving reliability. We consistently deliver low double-digit millisecond p99 client-side latency and five-nines availability in production. And over the past 12 months, we’ve had only one SEV-0 PostgreSQL incident (it occurred during the viral launch(jinfetaħ f’tieqa ġdida) of ChatGPT ImageGen, when write traffic suddenly surged by more than 10x as over 100 million new users signed up within a week.)
While we’re happy with how far PostgreSQL has taken us, we continue to push its limits to ensure we have sufficient runway for future growth. We’ve already migrated the shardable write-heavy workloads to our sharded systems like CosmosDB. The remaining write-heavy workloads are more challenging to shard—we’re actively migrating those as well to further offload writes from the PostgreSQL primary. We’re also working with Azure to enable cascading replication so we can safely scale to significantly more read replicas.
Looking ahead, we’ll continue to explore additional approaches to further scale, including sharded PostgreSQL or alternative distributed systems, as our infrastructure demands continue to grow.
Awtur
Ringrazzjamenti
Ringrazzjamenti speċjali lil Jon Lee, Sicheng Liu, Chaomin Yu u Chenglong Hao, li kkontribwew għal din il-kitba, u lit-tim kollu li għen jiskala PostgreSQL. Nixtiequ nirringrazzjaw ukoll lit-tim ta’ Azure PostgreSQL għas-sħubija qawwija tagħhom.


