Evaluating AI's ability to perform scientific research tasks
We introduce FrontierScience, a new benchmark that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.

Reasoning is at the core of scientific work. Beyond recalling facts, scientists generate hypotheses, test and refine them, and synthesize ideas across fields. As our models become more capable, the central question is how deeply they can reason in order to contribute to scientific research.
Over the past year, our models have reached major milestones, including gold-medal-level performance at the International Math Olympiad and the International Olympiad in Informatics. In parallel, we are starting to see our most capable models, such as GPT‑5, meaningfully accelerate real scientific workflows. Researchers are using these systems for tasks such as literature search across disciplines and languages and working through complex mathematical proofs. In many cases, the model shortens work that might have taken days or weeks to hours. This progress is documented in our paper Early science acceleration experiments with GPT‑5, released in November 2025, which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.
Because accelerating scientific progress is one of the most promising opportunities for AI to benefit humanity, we are improving our models on difficult math and science tasks and working on tools that will help scientists get the most out of them.
When GPQA, a "Google-proof" science benchmark of questions written by PhD experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scored 92%. As models' reasoning and knowledge capabilities continue to scale, more difficult benchmarks will be important for measuring and forecasting models' ability to accelerate scientific research. Prior scientific benchmarks largely focus on multiple-choice questions, are saturated, or are not centrally focused on science.
To bridge this gap, we are introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience is written and verified by experts across physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful. FrontierScience includes two tracks of questions: Olympiad, which measures Olympiad-style scientific reasoning capabilities, and Research, which measures real-world scientific research abilities. Providing more insight into models' scientific capabilities helps us track progress and advance AI-accelerated science.
In our initial evaluations, GPT‑5.2 is our top-performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. We have seen substantial progress on solving expert-level questions while leaving headroom for further progress, especially on open-ended research tasks. For scientists, this suggests that current models can already support the parts of research that involve structured reasoning, while highlighting that significant work remains to improve their ability to carry out open-ended thinking. These results align with how scientists are using today's models: to accelerate research workflows while relying on human judgment for problem framing and validation, and increasingly to explore ideas and connections that would otherwise take much longer to uncover, including, in some cases, contributing new insights that experts then evaluate and test.
In the end, the most important measure of AI's scientific capabilities is the novel discoveries it helps generate; those are what ultimately matter to science and society. FrontierScience sits upstream of that. It gives us a north star for expert-level scientific reasoning, letting us test models on a standardized set of questions, see where they succeed or fail, and identify where we need to improve them. FrontierScience is narrow and has limitations in key respects (for example, it focuses on expert-written problems) and does not capture everything scientists do in their everyday work. But the field needs science benchmarks that are more difficult, original, and meaningful, and FrontierScience offers a step forward in this direction.
The full FrontierScience evaluation spans over 700 textual questions (with 160 in the gold set) covering subfields across physics, chemistry, and biology. The benchmark is composed of an Olympiad split and a Research split. FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists to assess scientific reasoning in a constrained, short-answer format. The Olympiad set was designed to contain theoretical questions at least as difficult as problems at international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks designed by PhD scientists (doctoral candidates, professors, or postdoctoral researchers) that are graded using a 10-point rubric. The Research set was created to contain self-contained, multi-step subtasks at the level of difficulty that a PhD scientist might encounter during their research.
Each task in FrontierScience is written and verified by a domain expert in physics, chemistry, or biology. For the Olympiad set, all experts were awarded a medal in at least one (and often more than one) international olympiad competition. For the Research set, all experts hold a relevant PhD degree.
The Olympiad questions were created in collaboration with 42 former international medalists or national team coaches in the relevant domains, totaling 109 olympiad medals. The Research questions were created in collaboration with 45 qualified scientists and domain experts. All scientists were either doctoral candidates, postdoctoral researchers, or professors. Their areas of expertise spanned an array of specialized and important scientific disciplines, from quantum electrodynamics to synthetic organic chemistry to evolutionary biology.
The task creation process for both sets included selection against OpenAI internal models (for example, discarding tasks that the models answered correctly, so we expect the evaluation to be somewhat biased against these models relative to others). We are releasing the Olympiad gold set of 100 questions and the Research gold set of 60 questions, and holding out the remaining questions to track contamination.

Tasks go through four stages: Creation, Review, Resolution, Revision. Independent experts review each other's tasks to verify that they meet the criteria.
The Olympiad set is gradable with a short answer: either a number, an expression, or a fuzzy string match, which helps with verifying correctness. However, this kind of verification often trades off against the expressivity and open-endedness of the problem. For the Research set, we introduce a rubric-based architecture for grading more open-ended tasks. Each question includes a scoring rubric with multiple independent and objectively assessable items, totaling 10 points. The grading rubric assesses not only the accuracy of the final answer but also the correctness of intermediate reasoning steps, allowing for nuanced analysis of model performance and failures. A solution is considered "correct" if it is awarded at least 7 of the 10 rubric points.
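The rubric rule above (items summing to 10 points, with a solution counted as correct at 7 or more) can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual grading code: the `RubricItem` class, the function names, and the item labels are hypothetical, and only the 10-point total and the 7-point threshold come from the text.

```python
from dataclasses import dataclass
from typing import List

PASS_THRESHOLD = 7.0  # from the text: at least 7 of 10 rubric points


@dataclass
class RubricItem:
    name: str          # hypothetical label for one rubric item
    max_points: float  # points available for this item
    awarded: float     # points the grader assigned


def total_score(items: List[RubricItem]) -> float:
    """Sum awarded points, rejecting awards outside [0, max_points]."""
    for item in items:
        if not 0.0 <= item.awarded <= item.max_points:
            raise ValueError(f"award out of range for item: {item.name}")
    return sum(item.awarded for item in items)


def is_correct(items: List[RubricItem], threshold: float = PASS_THRESHOLD) -> bool:
    """A response counts as correct when its rubric total meets the threshold."""
    return total_score(items) >= threshold


def accuracy(graded_tasks: List[List[RubricItem]]) -> float:
    """Fraction of graded tasks whose rubric total meets the threshold."""
    if not graded_tasks:
        return 0.0
    return sum(is_correct(task) for task in graded_tasks) / len(graded_tasks)


def make_task(awards: List[float]) -> List[RubricItem]:
    """Build a task of 1-point items (an assumed layout) from a list of awards."""
    return [RubricItem(f"item-{i}", 1.0, a) for i, a in enumerate(awards)]


passing = make_task([1, 1, 1, 1, 0.5, 0.5, 1, 1, 0, 0])    # totals 7.0
failing = make_task([1, 0.5, 0, 0.5, 0.5, 0.5, 0, 1, 1, 1])  # totals 6.0
print(is_correct(passing), is_correct(failing), accuracy([passing, failing]))
```

Under this rule, partial credit on intermediate steps still contributes to the total, but a response only counts toward accuracy once the sum reaches the threshold.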
Responses are evaluated by a model-based grader (GPT‑5) against either the short-answer criteria or the rubric. While we would ideally have a human expert grade every response, that approach does not scale, so we designed the rubrics to be checkable by a model grader. We built a verification pipeline to help ensure that rubrics and questions are well calibrated for correctness and difficulty.
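For intuition about what a short-answer check involves, the sketch below shows a deterministic stand-in for two of the three answer types the Olympiad set uses (a number and a fuzzy string match). It is not the model-based grader described above: the function names, the 1% relative tolerance, and the 0.9 similarity cutoff are all assumptions chosen for illustration, and symbolic expression equivalence is deliberately left out.

```python
import math
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def numeric_match(predicted: str, reference: str, rel_tol: float = 1e-2) -> bool:
    """True if both strings parse as floats within a relative tolerance (assumed 1%)."""
    try:
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        return False


def fuzzy_match(predicted: str, reference: str, min_ratio: float = 0.9) -> bool:
    """True if the normalized strings are similar enough (assumed cutoff 0.9)."""
    ratio = SequenceMatcher(None, normalize(predicted), normalize(reference)).ratio()
    return ratio >= min_ratio


def short_answer_match(predicted: str, reference: str) -> bool:
    """Try a numeric comparison first, then fall back to fuzzy string matching."""
    return numeric_match(predicted, reference) or fuzzy_match(predicted, reference)


print(short_answer_match("3.14", "3.1400"))
print(short_answer_match("  Benzoic  Acid", "benzoic acid"))
print(short_answer_match("ethanol", "benzene"))
```

A fixed string-similarity cutoff can accept near-miss names of different compounds and reject equivalent answers written differently, which is one reason a model grader applied to written criteria may be preferable to pure string matching.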
Sample chemistry research problem
Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.
Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.
Sample response
1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?
Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions, especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.
2. Effect on Electronic Structure and Aromaticity
- π-Electron Count and Ring Currents
Phthalocyanines are classical 18-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.
- Charge Distribution and Frontier Orbitals
Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.
3. Spectroscopic Consequences
- UV-Vis Absorption (Q and B Bands)
The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm).
Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:
- Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
- Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.
- NMR Spectroscopy and Aromatic Ring Currents
Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts.
More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.
4. Reactivity and Coordination Chemistry
Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:
- Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.
(... shortened for the purposes of this figure)
Sample grading rubric
Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)
0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.
0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.
Thiolate-Mediated Tetramerization Process (1 point)
1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.
0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.
[Fail] 0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.
Analysis of NMR Spectroscopic Features (1 point)
1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.
[Pass] 0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.
0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.
Electrochemical Property Analysis (1 point)
1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.
[Pass] 0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.
0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.
Analysis of Absorption Spectroscopy (1 point)
1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.
[Pass] 0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.
0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.
Reactivity Analysis of Antiaromatic System (1 point)
1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.
0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.
[Fail] 0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.
(... and more)
Each task in the Research set is graded using a rubric totaling 10 points that can be applied by either an expert or a model grader. To scale our ability to evaluate models, we use another model to grade responses.
We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at "high" reasoning effort, with the exception of GPT‑5.2 at "xhigh". In our initial evaluations, GPT‑5.2 is our top-performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. Gemini 3 Pro is comparable to GPT‑5.2 on the Olympiad set (scoring 76%).
We have seen substantial progress on solving expert-level questions while leaving headroom for further progress, especially on open-ended research-style tasks. There is still more room to grow: in analyzing transcripts of failures, we found that frontier models sometimes made reasoning, logic, and calculation errors, did not understand niche scientific concepts, and made factual errors.
We compare accuracy across several frontier models. GPT‑5.2 is our highest-performing model on both FrontierScience-Research and the Olympiad set.
We compare accuracy across reasoning efforts for GPT‑5.2 and o3. Longer thinking time leads to improved accuracy.
While FrontierScience represents a step forward in the difficulty of scientific benchmarks, many limitations remain. FrontierScience is composed of questions with a constrained problem statement, which focus on evaluating either the final answer (Olympiad) or the reasoning used to complete a research task (Research). In addition, using rubrics with multiple components on longer tasks is less objective than checking a final answer.
FrontierScience offers a snapshot of how models reason on difficult, expert-written questions, but not a full picture of how science is done in practice. In particular, it does not assess a significant part of scientific research: how models generate genuinely novel hypotheses, or how they interact with multiple modalities, including video data and real experimental systems in the physical world.
Looking ahead, we expect progress in scientific reasoning to come from both better general-purpose reasoning systems and focused effort on improving scientific capabilities. FrontierScience is one tool among many, and as models improve, we plan to iterate on this benchmark, expand it to new domains, and pair it with more real-world evaluations that look at what these systems actually enable scientists to do. Benchmarks like FrontierScience help us understand the weaknesses of today's AI systems so that we can focus our work on making models reliable partners in scientific discovery.


