September 12, 2024

Learning to reason with LLMs

Meet OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers: it produces a long internal chain of thought before responding to the user.

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on GPQA, a benchmark of physics, biology, and chemistry problems. The work needed to make this new model as easy to use as current models is still ongoing. In the meantime, we are releasing an early version of the model, OpenAI o1‑preview, for immediate use in ChatGPT and to trusted API users.

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

The figure shows two scatter plots comparing "o1 AIME accuracy" during training and at test time. In both plots the y-axis shows pass@1 accuracy and the x-axis shows compute on a logarithmic scale. The points show accuracy rising as compute increases.

o1 performance improves steadily with both train-time and test-time compute.

Evaluations

To highlight the reasoning improvement over GPT‑4o, we tested our models on both human exams and ML benchmarks. o1 significantly outperforms GPT‑4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.

Competition evaluations for Math (AIME 2024), Code (Codeforces), and PhD-Level Science Questions (GPQA Diamond): o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy, and the shaded region shows the performance of majority vote (consensus) with 64 samples.

Breakdown of accuracy and raw scores for GPT-4o and o1 on ML benchmarks and exams: o1 improves over GPT-4o on a wide range of benchmarks, including 54 of 57 MMLU subcategories. Seven are shown for illustration.

On many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models¹ do so well on MATH² and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT‑4o solved on average only 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples (taking the most frequently produced answer), and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
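The three ways of turning samples into a final answer described above (a single sample, consensus voting, and re-ranking) can be sketched in Python as follows. This is a minimal illustration, not the evaluation code: the function names, the sample answers, and the stand-in scorer are all assumptions, and the learned scoring function used in the evaluation is not described in this post.

```python
from collections import Counter
from typing import Callable, Sequence


def pass_at_1(samples: Sequence[str], correct: str) -> float:
    """Fraction of individual samples that match the correct answer."""
    return sum(s == correct for s in samples) / len(samples)


def consensus_answer(samples: Sequence[str]) -> str:
    """Majority vote: the most frequently produced answer wins."""
    return Counter(samples).most_common(1)[0][0]


def rerank_answer(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Pick the sample that a scoring function rates highest."""
    return max(samples, key=score)


# Hypothetical samples for one problem whose correct answer is "204".
samples = ["204", "204", "117", "204", "096", "117", "204", "204"]

# Hypothetical stand-in for a learned scoring function.
toy_scores = {"204": 0.9, "117": 0.4, "096": 0.1}

print(pass_at_1(samples, "204"))               # 0.625
print(consensus_answer(samples))               # 204
print(rerank_answer(samples, toy_scores.get))  # 204
```

Consensus needs no extra model, only more samples, while re-ranking depends on how good the scorer is at recognizing a correct answer.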

We also evaluated o1 on GPQA Diamond, a difficult intelligence benchmark that tests for expertise in chemistry, physics, and biology. To compare models with humans, we recruited experts with PhDs to answer GPQA Diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects, only that the model is more proficient at solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 also improved over the state of the art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT‑4o on 54 out of 57 MMLU subcategories.

Chain of Thought

Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes, to break down tricky steps into simpler ones, and to try a different approach when the current one is not working. This process dramatically improves the model's ability to reason. To illustrate this progress, we showcase the chain of thought from o1‑preview on several difficult problems below.

GPT-4o

User

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Let’s break this down step by step based on the example:

1. Example given:
• Input: oyfjdnisdr rtqwainr acxz mynzbhhx
• Output: Think step by step

By examining the words:
• The pattern involves selecting specific letters or transforming them.

2. Now, let’s decode the new phrase:
• Input: oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Step-by-step decoding process:

Identify possible word divisions.

The first phrase might follow the same structure as the example. It seems like the input text could be broken into groups based on natural separations or patterns.

A possible division could be:

oyekaijzdf
aaptcg
suaokybhai
ouow
aqht
mynznvaatzacdfoulxxz

However, to make a meaningful decoding, I would need more context about the transformations or letter shifting that might be involved.

Could you provide any additional decoding rules or transformations used in this cipher?

OpenAI o1-preview

User

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Thought for 5 seconds

First, what is going on here?

We are given:

First, an example:

"oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step"

We can see that when "oyfjdnisdr rtqwainr acxz mynzbhhx" is transformed to "Think step by step"

Our task is: Use the example above to decode: "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

So the first part is to figure out how "oyfjdnisdr rtqwainr acxz mynzbhhx" is decoded into "Think step by step"

Observation 1: The phrase "Think step by step" is 18 letters (excluding spaces) . Let's count the letters.

Similarly, let's count the letters in "oyfjdnisdr rtqwainr acxz mynzbhhx"

Original Ciphertext:

oyfjdnisdr rtqwainr acxz mynzbhhx

Plaintext:

Think step by step

First, break down the letters:

Ciphertext letters: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Total letters: Let's count the letters in the ciphertext.

First word: oyfjdnisdr (10 letters)

Second word: rtqwainr (8 letters)

Third word: acxz (4 letters)

Fourth word: mynzbhhx (8 letters)

Total letters in ciphertext: 10 + 8 + 4 + 8 = 30 letters

Plaintext: Think step by step

First word: Think (5 letters)

Second word: step (4 letters)

Third word: by (2 letters)

Fourth word: step (4 letters)

Total letters in plaintext: 5 + 4 + 2 + 4 = 15 letters

Hmm.

But actually in the problem it says the example:

[Ciphertext] -> Think step by step

So is the decoding process that the ciphertext maps to "Think step by step"?

Our goal is to use the example above to decode:

"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

First, the cipher seems connected to the plaintext.

Given the time constraints, perhaps the easiest way is to try to see patterns.

Option 1: Try to find mapping from letters to letters.

Do any letters match?

First, let's write down the ciphertext and plaintext letters on top of each other.

Ciphertext: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Plaintext: T h i n k s t e p b y s t e p

Wait a minute.

I think maybe there is an anagram or substitution cipher here.

Alternatively, I think that we can notice that each group of ciphertext corresponds to a plaintext word.

Check the number of letters.

First ciphertext word: oyfjdnisdr (10 letters)

First plaintext word: Think (5 letters)

Hmm.

Second ciphertext word: rtqwainr (8 letters)

Second plaintext word: step (4 letters)

Third ciphertext word: acxz (4 letters)

Third plaintext word: by (2 letters)

Fourth ciphertext word: mynzbhhx (8 letters)

Fourth plaintext word: step (4 letters)

Interesting.

It seems that the ciphertext words are exactly twice as long as the plaintext words.

(10 vs 5, 8 vs 4, 4 vs 2, 8 vs 4)

Idea: Maybe we need to take every other letter or rebuild the plaintext from the ciphertext accordingly.

Let's test this theory.

If we take every other letter from the ciphertext words, maybe we'll get the plaintext.

Let's try this for the first word:

Ciphertext word: o y f j d n i s d r (10 letters)

Alternate letters:

First letters (positions 1,3,5,7,9): o f d i d

Second letters (positions 2,4,6,8,10): y j n s r

Wait: Let's try taking every second letter or first letter.

Option 1: Take the odd positions:

Positions: 1 3 5 7 9

Letters: o f d i d

Plaintext word is "Think", letters are T h i n k

Hmm.

Perhaps we can map letters:

o f d i d --> Think

Not directly

Option 2: Try mapping as per an assigned code: perhaps columns of letters?

Alternatively, perhaps the cipher is more complex.

Alternatively, notice that "oyfjdnisdr" has 10 letters and "Think" has 5 letters.

What if each plaintext letter is being represented by two ciphertext letters?

Similarly for other words.

Suppose that each plaintext letter corresponds to pair of letters in the ciphertext.

Meaning that the first two letters of ciphertext map to the first plaintext letter, and so on.

So for the first word, partition the ciphertext letters into pairs:

oy fj dn is dr

Corresponding to plaintext letters T h i n k

So we can attempt to map:

oy -> T

fj -> h

dn -> i

is -> n

dr -> k

If this is the mapping, perhaps each pair of letters maps to one plaintext letter.

Similarly for other words.

Let's test this mapping.

Is there a consistent mapping from pairs to letters?

Let's make a table of pairs and corresponding letters.

First, make the pairs from the first word and their corresponding plaintext letters:

Ciphertext Pairs -> Plaintext Letter

oy -> T

fj -> h

dn -> i

is -> n

dr -> k

Now, what's the mapping from "oy" to 'T'?

Is there a way to map 'oy' to 'T'?

Similarly for others.

One possible way is that 'oy' represents 'T'.

But how?

Option 1: Sum of the letters.

Convert letters to numbers (A=1, B=2, ..., Z=26)

So let's try to get to 'T' from 'o' and 'y'

o = 15

y = 25

Sum: 15 +25 = 40

But 'T' is 20.

Alternatively, perhaps subtract: 25 -15 = 10.

No.

Alternatively, perhaps combine the numbers in some way.

Alternatively, think about their positions in the alphabet.

Alternatively, perhaps the letters are encrypted via a code.

Alternatively, perhaps if we overlay the word 'Think' over the cipher pairs 'oy', 'fj', etc., the cipher is formed by substituting each plaintext letter with two letters.

Alternatively, perhaps consider the 'original' letters.

Let's look at the letters between the word "Think" and "oyfjdnisdr"

Perhaps we can spot that "oyfjdnisdr" is an anagram for "Thinkfiles"?

But that seems unlikely.

Alternatively, think of the pairs.

Suppose we assign numbers to letters: a=1, b=2, ..., z=26

Consider the numbers in 'oy':

o = 15

y = 25

Add them: 15 + 25 = 40

Now, for 'T' we have:

'T' is number 20

What if we take the sum of the cipher pair mod 26.

15 +25 = 40

Then 40 mod 26 = 14

14 corresponds to 'N'

Doesn't match.

Alternatively, see if the cipher pairs correspond to something else.

Alternatively, perhaps the pair 'oy' corresponds to 'T' via key.

Alternatively, perhaps it's as simple as 'oy' is 'T'.

But 'o' corresponds to 'T', subtracting 'o' ASCII code from 'T'.

Now, let's think in terms of symmetric letters.

Alternatively, perhaps we can spot that the letters correspond to the first letters of words.

But perhaps an easier way is to look at the word positions.

Alternatively, perhaps we can see that the ciphertext letters are offsets from the plaintext letters.

So let's try to map the pairs to the plaintext letters.

First pair 'oy' and plaintext letter 'T'

Letter 'o' is 15, 'T' is 20

Letter 'o' + 5 = 'T' (15 +5=20)

But 'y' is 25, which is 5 letters after 'T' (20 +5=25)

Interesting, seems that 'o' +5 = 'T'

Similarly, 'y' +1 = 'Z', which is not 'T'.

Alternatively, maybe the average of the two letters corresponds to 'T'

(15 +25)/2 =20, which is 'T'

Wait a minute, that seems promising.

First pair: 'o' (15) + 'y' (25) = 40

40 /2 =20

20 corresponds to 'T'

So perhaps the average of the letters in the pair corresponds to the plaintext letter.

Let's check this with the second pair.

Second pair: 'fj' corresponding to 'h'

Letters 'f'=6, 'j'=10

Sum: 6+10=16

Average:16/2=8

8 corresponds to 'h' (since 'h' is 8)

Success!

Third pair: 'dn' to 'i'

'd'=4, 'n'=14

Sum:4+14=18

Average:18/2=9

9 corresponds to 'i'(9='i')

But 'i' is 9, so that seems off by 1.

So perhaps we need to think carefully about letters.

Wait, 18/2=9, 9 corresponds to 'I'

So this works.

Fourth pair: 'is' corresponding to 'n'

'i'=9, 's'=19

Sum:9+19=28

Average:28/2=14

14 corresponds to 'n'(14='N')

Yes!

Fifth pair: 'dr' corresponds to 'k'

'd'=4, 'r'=18

Sum:4+18=22

Average:22/2=11

11 corresponds to 'k'(11='K')

Perfect!

So our code is: For each pair, sum their numeric values, divide by 2, get the corresponding letter.

Alternatively, the average of the numeric values of the letters corresponds to the plaintext letter.

This seems to be a valid method.

Now let's test this with the other words.

Second word ciphertext: rtqwainr

Ciphertext pairs:

rt -> s

qw -> ?

ai -> ?

nr -> ?

Corresponding to plaintext letters 's', 't', 'e', 'p'

So let's take the pairs:

1. 'rt' corresponds to 's'

r=18, t=20

Sum:38

Average:19

19 corresponds to 's' (19='S')

Good

2. 'qw' corresponds to 't'

q=17, w=23

Sum:17+23=40

Average:20

20 corresponds to 't' (20='T')

Good

3. 'ai' corresponds to 'e'

a=1, i=9

Sum:10

Average:5

5 corresponds to 'e' (5='E')

Good

4. 'nr' corresponds to 'p'

n=14, r=18

Sum:32

Average:16

16 corresponds to 'p' (16='P')

Perfect!

Third word: acxz (4 letters)

Plaintext word: 'by' (2 letters)

Ciphertext pairs:

a c

x z

Plaintext letters: 'b', 'y'

First pair: 'a', 'c'

a=1, c=3

Sum:4

Average:2

2 corresponds to 'b'(2='B')

Second pair: 'x' and 'z'

x=24, z=26

Sum:50

Average:25

25 corresponds to 'y'(25='Y')

Yes!

Fourth word: mynzbhhx (8 letters)

Plaintext word: 'step' (4 letters)

Ciphertext pairs:

my

nz

bh

hx

Plaintext letters 's','t','e','p'

First pair 'm','y'

m=13, y=25

Sum:38

Average:19

19 corresponds to 's'(19='S')

Second pair: 'n','z'

n=14, z=26

Sum: 40

Average:20

20='t'

Third pair: 'b','h'

b=2, h=8

Sum:10

Average:5

5='e'

Fourth pair: 'h','x'

h=8, x=24

Sum:32

Average:16

16='p'

Got it!

Now, given this, we can decode the given message.

Ciphertext to decode:

"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

First, we need to divide into words.

Given that the ciphertext in the example had words of length:

-10 letters (5 pairs)
-8 letters (4 pairs)
-4 letters (2 pairs)
-8 letters (4 pairs)

Transported to plaintext words of:

-5 letters
-4 letters
-2 letters
-4 letters

So perhaps in our supplied ciphertext, we need to divide the letters into words accordingly.

Ciphertext: o y e k a i j z d f a a p t c g s u a o k y b h a i o u o w a q h t m y n z n v a a t z a c d f o u l x x z

First, let's write the letters in order:

1) oyekaijzdf (10 letters) -> word of 5 letters

2) aaptcg (6 letters) -> word of 3 letters

3) suaokybhai (10 letters) -> word of 5 letters

4) ouow (4 letters) -> word of 2 letters

5) aqht (4 letters) -> word of 2 letters

6) mynznvaatzacdfoulxxz (22 letters) -> word of 11 letters

Totals:

So the letters are:

First word: oyekaijzdf (10 letters = 5 pairs)

Second word: aaptcg (6 letters=3 pairs)

Third word: suaokybhai (10 letters=5 pairs)

Fourth word: ouow (4 letters=2 pairs)

Fifth word: aqht (4 letters=2 pairs)

Sixth word: mynznvaatzacdfoulxxz (22 letters=11 pairs)

So we can suppose that the plaintext words have lengths:

1) 5 letters

2) 3 letters

3) 5 letters

4) 2 letters

5) 2 letters

6)11 letters

Now we can proceed to decode.

First word: oyekaijzdf

Divide into pairs:

oy

ek

ai

jz

df

So we can find the plaintext letters by averaging the numeric values of the letters in each pair.

First pair: 'o','y'

o=15, y=25

Sum=40

Average=20

20='T'

Second pair: 'e','k'

e=5, k=11

Sum=16

Average=8

8='H'

Third pair: 'a','i'

a=1, i=9

Sum=10

Average=5

5='E'

Fourth pair: 'j','z'

j=10, z=26

Sum=36

Average=18

18='R'

Fifth pair: 'd','f'

d=4, f=6

Sum=10

Average=5

5='E'

So the first word is 'T','H','E','R','E'

So 'THERE'

Second word: aaptcg

Pairs:

aa

pt

cg

First pair: 'a','a'

a=1, a=1

Sum=2

Average=1

1='A'

Second pair: 'p','t'

p=16, t=20

Sum=36

Average=18

18='R'

Third pair: 'c','g'

c=3, g=7

Sum=10

Average=5

5='E'

So second word: 'A','R','E'

Third word: suaokybhai

Pairs:

su

ao

ky

bh

ai

First pair: s=19, u=21

Sum=40, Average=20

20='T'

Second pair: a=1, o=15

Sum=16, Average=8

8='H'

Third pair: k=11, y=25

Sum=36, Average=18

18='R'

Fourth pair: b=2, h=8

Sum=10, Average=5

5='E'

Fifth pair: a=1, i=9

Sum=10, Average=5

5='E'

Third word: 'T','H','R','E','E'

Fourth word: ouow

Pairs: ou, ow

First pair: o=15, u=21

Sum=36, Average=18

18='R'

Second pair: o=15, w=23

Sum=38, Average=19

19='S'

Fourth word: 'R','S'

Fifth word: aqht

Pairs: a q, h t

First pair: a=1, q=17

Sum=18, Average=9

9='I'

Second pair: h=8, t=20

Sum=28, Average=14

14='N'

Fifth word: 'I','N'

Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)

Pairs:

m y

n z

n v

a a

t z

a c

d f

o u

l x

x z

Let's list them properly.

Wait, earlier I missed some letters there.

Let's re-express the sixth word letters:

m y n z n v a a t z a c d f o u l x x z

Total letters: 22 letters

Pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

Wait, that's 21 letters.

But in total, we have 22 letters; I think I have a mismatch.

Wait, in the problem the sixth word is:

mynznvaatzacdfoulxxz

Let's separate the letters:

m y n z n v a a t z a c d f o u l x x z

Total letters: 22 letters

So pairs would be:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

Wait, 'z' is only one letter at the end.

So maybe the letters are:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z (left alone?)

But then we're missing a letter.

Alternatively, perhaps the 'z' at the end is an extra letter or it's paired with the previous 'x'.

Alternatively, perhaps there is a mistake.

Alternatively, perhaps when we reach the last 'x x z', we should write:

9. l x

10. x z

Total pairs: 11

So updated pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x z

Now total pairs: 10 pairs. But we are supposed to have 11 letters in the plaintext word.

So perhaps I have miscounted and not considered including 'f' or 'z'.

Wait, is there any chance that the sequence is:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

We have 22 letters.

Perhaps the 'x x' and 'z' are overlapping.

But we need 11 pairs.

I think for this to work, we need to create 11 pairs from the 22 letters.

So let's try pairing letters in order, taking two letters at a time.

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z ?

But again, last pair only has one letter 'z'

Alternatively, we might need to adjust the pairs.

Perhaps it's 'ou', 'lx', 'xz', and there is no extra 'x'.

Maybe I miswrote.

Wait, let's write the letters as they come:

m y n z n v a a t z a c d f o u l x x z

Positions:

1 m

2 y

3 n

4 z

5 n

6 v

7 a

8 a

9 t

10 z

11 a

12 c

13 d

14 f

15 o

16 u

17 l

18 x

19 x

20 z

So 20 letters.

Now that makes sense.

Now we can divide into pairs:

Pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x z

Total of 10 pairs, which will correspond to 10 letters.

But earlier we thought the word was 11 letters.

But 10 letters may be acceptable.

Maybe we should proceed with the 10 pairs.

First pair: m y

m=13, y=25

Sum=38

Average=19

19='S'

Second pair: n z

n=14, z=26

Sum=40

Average=20

20='T'

Third pair: n v

n=14, v=22

Sum=36

Average=18

18='R'

Fourth pair: a a

a=1, a=1

Sum=2

Average=1

1='A'

Fifth pair: t z

t=20, z=26

Sum=46

Average=23

23='W'

Sixth pair: a c

a=1, c=3

Sum=4

Average=2

2='B'

Seventh pair: d f

d=4, f=6

Sum=10

Average=5

5='E'

Eighth pair: o u

o=15, u=21

Sum=36

Average=18

18='R'

Ninth pair: l x

l=12, x=24

Sum=36

Average=18

18='R'

Tenth pair: x z

x=24, z=26

Sum=50

Average=25

25='Y'

So the tenth word is 'S', 'T', 'R', 'A', 'W', 'B', 'E', 'R', 'R', 'Y'

Wait that's 10 letters

So the sixth word is 'STRAWBERRY', which makes sense.

So putting all these translated letters together:

First word: 'THERE'

Second word: 'ARE'

Third word: 'THREE'

Fourth word: 'RS' (? Doesn't make sense)

Wait, the fourth word, the pairs were:

ou ow

Pairs:

First pair: o u

o=15, u=21

Sum=36, Average=18

18='R'

Second pair: o w

o=15, w=23

Sum=38, Average=19

19='S'

So we get 'R','S'

Similarly, fifth word 'IN'

'R', 'S' 'I','N'

So the message so far is

'THERE ARE THREE RS IN'

And then 'STRAWBERRY'

Wait, let's put the words together:

'THERE ARE THREE Rs IN STRAWBERRY'

I think 'RS' is 'R's', meaning 'Rs'

So the message is:

'THERE ARE THREE R'S IN STRAWBERRY'

Which is true.

So the decoded message is:

"THERE ARE THREE R'S IN STRAWBERRY"

To decode the given ciphertext using the example provided, we’ll use the pattern identified in the example.

Ciphertext Example:

oyfjdnisdr rtqwainr acxz mynzbhhx

Decoded as:

Think step by step

Decoding Method:

1. Pair the letters in the ciphertext.
2. Convert each letter to its numerical position in the alphabet (A=1, B=2, …, Z=26).
3. Sum the numerical values of each pair.
4. Compute the average of the sum (divide by 2).
5. Convert the average back to a letter (1=A, 2=B, …, 26=Z).

Apply this method to the given ciphertext:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Break down the ciphertext into pairs:

1. First word: “oyekaijzdf”
• Pairs: oy, ek, ai, jz, df
• Decoded letters:
  • oy → (15+25)/2 = 20 → T
  • ek → (5+11)/2 = 8 → H
  • ai → (1+9)/2 = 5 → E
  • jz → (10+26)/2 = 18 → R
  • df → (4+6)/2 = 5 → E
• Decoded word: THERE
2. Second word: “aaptcg”
• Pairs: aa, pt, cg
• Decoded letters:
  • aa → (1+1)/2 = 1 → A
  • pt → (16+20)/2 = 18 → R
  • cg → (3+7)/2 = 5 → E
• Decoded word: ARE
3. Third word: “suaokybhai”
• Pairs: su, ao, ky, bh, ai
• Decoded letters:
  • su → (19+21)/2 = 20 → T
  • ao → (1+15)/2 = 8 → H
  • ky → (11+25)/2 = 18 → R
  • bh → (2+8)/2 = 5 → E
  • ai → (1+9)/2 = 5 → E
• Decoded word: THREE
4. Fourth word: “ouow”
• Pairs: ou, ow
• Decoded letters:
  • ou → (15+21)/2 = 18 → R
  • ow → (15+23)/2 = 19 → S
• Decoded letters: R’S
5. Fifth word: “aqht”
• Pairs: aq, ht
• Decoded letters:
  • aq → (1+17)/2 = 9 → I
  • ht → (8+20)/2 = 14 → N
• Decoded word: IN
6. Sixth word: “mynznvaatzacdfoulxxz”
• Pairs: my, nz, nv, aa, tz, ac, df, ou, lx, xz
• Decoded letters:
  • my → (13+25)/2 = 19 → S
  • nz → (14+26)/2 = 20 → T
  • nv → (14+22)/2 = 18 → R
  • aa → (1+1)/2 = 1 → A
  • tz → (20+26)/2 = 23 → W
  • ac → (1+3)/2 = 2 → B
  • df → (4+6)/2 = 5 → E
  • ou → (15+21)/2 = 18 → R
  • lx → (12+24)/2 = 18 → R
  • xz → (24+26)/2 = 25 → Y
• Decoded word: STRAWBERRY

Final Decoded Message:

THERE ARE THREE R'S IN STRAWBERRY

Answer:
THERE ARE THREE R’S IN STRAWBERRY
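
The decoding method that o1-preview arrives at in the transcript above (split each cipher word into letter pairs, average the alphabet positions of each pair, map the average back to a letter) is short enough to check in code. The sketch below is an illustration written for this purpose; the function name is hypothetical, and it assumes lowercase a to z input with even-length words.

```python
def decode_pair_average(ciphertext: str) -> str:
    """Decode by averaging the alphabet positions of each letter pair."""
    words = []
    for word in ciphertext.split():
        if len(word) % 2:
            raise ValueError(f"odd-length cipher word: {word!r}")
        letters = []
        for i in range(0, len(word), 2):
            a = ord(word[i]) - ord("a") + 1      # a=1 ... z=26
            b = ord(word[i + 1]) - ord("a") + 1
            total = a + b
            if total % 2:
                raise ValueError(f"pair {word[i:i + 2]!r} has no integer average")
            letters.append(chr(total // 2 + ord("A") - 1))
        words.append("".join(letters))
    return " ".join(words)


print(decode_pair_average("oyfjdnisdr rtqwainr acxz mynzbhhx"))
# THINK STEP BY STEP
print(decode_pair_average(
    "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"))
# THERE ARE THREE RS IN STRAWBERRY
```

The code returns "RS" for the fourth word. Reading it as "R'S" is an interpretation the model adds on top of the mechanical decoding.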

Coding

We trained a model by initializing from o1 and training to further improve programming skills. It scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI). The model competed under the same conditions as the human contestants: it had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, performance on model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have scored only 156 points on average, which suggests that this strategy was worth nearly 60 points under competition constraints.
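
A test-time selection strategy of this kind can be sketched as ranking candidates by a combined score and keeping the top 50. The post names the three signals but not how they were combined, so the weighted sum below, the class and function names, and the toy values are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Candidate:
    code: str
    public_pass_rate: float      # share of public test cases passed
    generated_pass_rate: float   # share of model-generated test cases passed
    learned_score: float         # output of a learned scoring function


def select_submissions(
    candidates: Sequence[Candidate],
    budget: int = 50,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[Candidate]:
    """Return the `budget` highest-ranked candidates.

    The weighted sum is an assumed combination rule, not the one
    used in the actual system.
    """
    w_pub, w_gen, w_learned = weights

    def combined(c: Candidate) -> float:
        return (w_pub * c.public_pass_rate
                + w_gen * c.generated_pass_rate
                + w_learned * c.learned_score)

    return sorted(candidates, key=combined, reverse=True)[:budget]


# Hypothetical pool of three sampled programs.
pool = [
    Candidate("solution_a", 1.0, 0.8, 0.7),
    Candidate("solution_b", 0.5, 0.9, 0.2),
    Candidate("solution_c", 1.0, 1.0, 0.9),
]
print([c.code for c in select_submissions(pool, budget=2)])
# ['solution_c', 'solution_a']
```

Random submission corresponds to ignoring all three signals, which is the baseline the 156-point figure refers to.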

With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14, above the gold medal threshold, even without any test-time selection strategy.

Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill. Our evaluations closely matched competition rules and allowed 10 submissions per problem. GPT‑4o achieved an Elo rating³ of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT‑4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.
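
For a sense of what a rating gap of this size means, the standard Elo model gives the expected score of one player against another. The sketch below uses the textbook Elo formula as an assumption; the Codeforces rating system has its own details (see footnote 3), and the percentiles quoted above come from the rating distribution of competitors, not from this formula.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Ratings reported in the text: 1807 for the fine-tuned model, 808 for GPT-4o.
print(round(elo_expected_score(1807, 808), 3))
print(elo_expected_score(1500, 1500))  # equal ratings give 0.5
```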

The figure shows a bar chart comparing Codeforces Elo percentile rankings for different models. GPT-4o has an Elo rating of 808 (11th percentile), o1-preview 1258 (62nd percentile), o1 1673 (89th percentile), and o1-ioi 1807 (93rd percentile).

Further fine-tuning on programming competitions improves o1. The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules.

Human preference evaluation

In addition to exams and academic benchmarks, we also evaluated human preference for o1‑preview versus GPT‑4o on challenging, open-ended prompts across a broad range of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1‑preview and GPT‑4o and voted for the response they preferred. o1‑preview was preferred to GPT‑4o by a large margin in reasoning-heavy categories such as data analysis, coding, and math. However, o1‑preview was not preferred on some natural language tasks, which suggests that it is not well suited to every use case.
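
A pairwise preference result of this kind is usually summarized as a win rate with a confidence interval. The post does not say how its intervals were computed, so the sketch below uses a simple normal-approximation interval as an assumption, and the vote counts are hypothetical rather than figures from the evaluation.

```python
import math
from typing import Tuple


def win_rate_with_ci(wins: int, total: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Win rate with a normal-approximation confidence interval.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical tally: one response preferred in 140 of 200 comparisons.
rate, low, high = win_rate_with_ci(wins=140, total=200)
print(f"win rate {rate:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

An interval that lies entirely above 50% indicates a preference that is unlikely to be explained by sampling noise alone.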

The figure shows a horizontal bar chart comparing scores for five models, with error bars indicating confidence intervals. The x-axis runs from 0 to 100, and a dashed line marks the reference point for performance.

Safety

Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence that reasoning capability directly benefits model robustness: o1‑preview achieved substantially improved performance on key jailbreak evaluations and on our hardest internal benchmarks for evaluating the model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it lets us observe the model's thinking in a legible way, and (2) the model's reasoning about safety rules makes it more robust to out-of-distribution scenarios.

To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.

Metric	GPT-4o	o1-preview
% Safe completions on harmful prompts, Standard	0.990	0.995
% Safe completions on harmful prompts, Challenging: jailbreaks and edge cases	0.714	0.934
↳ Harassment (severe)	0.845	0.900
↳ Exploitative sexual content	0.483	0.949
↳ Sexual content involving minors	0.707	0.931
↳ Advice about non-violent wrongdoing	0.688	0.961
↳ Advice about violent wrongdoing	0.778	0.963
% Safe completions for top 200 with highest Moderation API scores per category in WildChat (Zhao et al. 2024)	0.945	0.971
Goodness@0.1 StrongREJECT jailbreak eval (Souly et al. 2024)	0.220	0.840
Human-sourced jailbreak eval	0.770	0.960
% Compliance on internal benign edge cases, "not over-refusal"	0.910	0.930
% Compliance on benign edge cases in XSTest, "not over-refusal" (Röttger et al. 2023)	0.924	0.976

Hiding the Chains of Thought

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Provided it is faithful and legible, the hidden chain of thought lets us "read the mind" of the model and understand how it thinks. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. For this to work, however, the model must be free to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge that this decision has disadvantages. We aim to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in its answer. For the o1 model series, we show a model-generated summary of the chain of thought.

Conclusion

o1 significantly advances the state of the art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities to improve our ability to align models with human values and principles. We believe o1 and its successors will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work.

Appendix A

Dataset	Metric	gpt-4o	o1-preview	o1
Competition Math AIME (2024)	cons@64	13.4	56.7	83.3
Competition Math AIME (2024)	pass@1	9.3	44.6	74.4
Competition Code CodeForces	Elo	808	1258	1673
Competition Code CodeForces	Percentile	11.0	62.0	89.0
GPQA Diamond	cons@64	56.1	78.3	78.0
GPQA Diamond	pass@1	50.6	73.3	77.3
Biology	cons@64	63.2	73.7	68.4
Biology	pass@1	61.6	65.9	69.2
Chemistry	cons@64	43.0	60.2	65.6
Chemistry	pass@1	40.2	59.9	64.7
Physics	cons@64	68.6	89.5	94.2
Physics	pass@1	59.5	89.4	92.8
MATH	pass@1	60.3	85.5	94.8
MMLU	pass@1	88.0	92.3	90.8
MMMU (val)	pass@1	69.1	n/a	78.2
MathVista (testmini)	pass@1	63.8	n/a	73.9

Authors

OpenAI

Citations

1
https://www.anthropic.com/news/claude-3-5-sonnet, https://deepmind.google/technologies/gemini/pro
2
Our evaluations used the same 500-problem test split found in https://arxiv.org/abs/2305.20050.
3
https://codeforces.com/blog/entry/68288