September 12, 2024

Learning to reason with LLMs


OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

[Figure: two scatter plots of o1 AIME pass@1 accuracy against train-time compute and test-time compute (log scale); accuracy rises as compute increases.]

o1 performance smoothly improves with both train-time and test-time compute

Evals

To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.


Competition evals for Math (AIME 2024), Code (CodeForces), and PhD-Level Science Questions (GPQA Diamond): o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.
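The two metrics in this caption can be illustrated with a short sketch. It assumes each sampled solution has already been reduced to a final answer string. The function names and the sample answers are hypothetical, and this is not OpenAI's evaluation harness. pass@1 averages correctness over individual samples, while consensus submits only the single most frequent answer among the samples (64 of them for cons@64).

```python
from collections import Counter


def pass_at_1(samples: list[str], reference: str) -> float:
    """Estimate single-sample accuracy: the fraction of samples whose final
    answer matches the reference answer."""
    return sum(answer == reference for answer in samples) / len(samples)


def consensus_correct(samples: list[str], reference: str) -> bool:
    """Majority vote (cons@k with k = len(samples)): submit the most common
    answer. Ties go to the answer seen first, an arbitrary choice here."""
    majority_answer, _count = Counter(samples).most_common(1)[0]
    return majority_answer == reference


# Hypothetical answers sampled for one AIME problem whose answer is "204".
samples = ["204", "17", "204", "33", "204"]
print(pass_at_1(samples, "204"))          # 0.6
print(consensus_correct(samples, "204"))  # True
```

Averaging either quantity over every problem in a benchmark gives the benchmark-level score. Majority voting helps most when a model's wrong answers are scattered across many different values while its correct answer recurs.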

Breakdown of the accuracy and raw score of GPT-4o vs. o1 on various competition evals: o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Seven are shown for illustration.


On many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models¹ do so well on MATH² and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT-4o solved on average only 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
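Re-ranking differs from majority voting because a separate scorer rates each candidate and the top-rated one is submitted. The sketch below makes a few assumptions. The scoring function is a hypothetical stand-in, here a plain dictionary lookup, for the learned scorer mentioned above. The helper converts the average number of problems solved into the percentages quoted in the paragraph.

```python
from typing import Callable


def rerank(samples: list[str], score: Callable[[str], float]) -> str:
    """Best-of-n selection: return the candidate the scoring function rates highest."""
    return max(samples, key=score)


def percent_solved(avg_solved: float, total_problems: int = 15) -> float:
    """Convert an average number of AIME problems solved (out of 15) to a percentage."""
    return 100 * avg_solved / total_problems


# Hypothetical scores standing in for a learned scoring function.
toy_scores = {"204": 0.91, "17": 0.12, "33": 0.40}
print(rerank(["17", "204", "33"], toy_scores.__getitem__))  # 204
print(round(percent_solved(13.9)))                          # 93
```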

We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark that tests for expertise in chemistry, physics, and biology. To compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects, only that the model is more proficient at solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state of the art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories.

Chain of Thought

Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.

GPT-4o

User

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Let’s break this down step by step based on the example:

1. Example given:
• Input: oyfjdnisdr rtqwainr acxz mynzbhhx
• Output: Think step by step

By examining the words:
• The pattern involves selecting specific letters or transforming them.

2. Now, let’s decode the new phrase:
• Input: oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Step-by-step decoding process:

Identify possible word divisions.

The first phrase might follow the same structure as the example. It seems like the input text could be broken into groups based on natural separations or patterns.

A possible division could be:

oyekaijzdf
aaptcg
suaokybhai
ouow
aqht
mynznvaatzacdfoulxxz

However, to make a meaningful decoding, I would need more context about the transformations or letter shifting that might be involved.

Could you provide any additional decoding rules or transformations used in this cipher?

OpenAI o1-preview

User

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Thought for 5 seconds

First, what is going on here?

We are given:

First, an example:

"oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step"

We can see that when "oyfjdnisdr rtqwainr acxz mynzbhhx" is transformed to "Think step by step"

Our task is: Use the example above to decode: "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

So the first part is to figure out how "oyfjdnisdr rtqwainr acxz mynzbhhx" is decoded into "Think step by step"

Observation 1: The phrase "Think step by step" is 18 letters (excluding spaces) . Let's count the letters.

Similarly, let's count the letters in "oyfjdnisdr rtqwainr acxz mynzbhhx"

Original Ciphertext:

oyfjdnisdr rtqwainr acxz mynzbhhx

Plaintext:

Think step by step

First, break down the letters:

Ciphertext letters: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Total letters: Let's count the letters in the ciphertext.

First word: oyfjdnisdr (10 letters)

Second word: rtqwainr (8 letters)

Third word: acxz (4 letters)

Fourth word: mynzbhhx (8 letters)

Total letters in ciphertext: 10 + 8 + 4 + 8 = 30 letters

Plaintext: Think step by step

First word: Think (5 letters)

Second word: step (4 letters)

Third word: by (2 letters)

Fourth word: step (4 letters)

Total letters in plaintext: 5 + 4 + 2 + 4 = 15 letters

Hmm.

But actually in the problem it says the example:

[Ciphertext] -> Think step by step

So is the decoding process that the ciphertext maps to "Think step by step"?

Our goal is to use the example above to decode:

"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

First, the cipher seems connected to the plaintext.

Given the time constraints, perhaps the easiest way is to try to see patterns.

Option 1: Try to find mapping from letters to letters.

Do any letters match?

First, let's write down the ciphertext and plaintext letters on top of each other.

Ciphertext: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Plaintext: T h i n k s t e p b y s t e p

Wait a minute.

I think maybe there is an anagram or substitution cipher here.

Alternatively, I think that we can notice that each group of ciphertext corresponds to a plaintext word.

Check the number of letters.

First ciphertext word: oyfjdnisdr (10 letters)

First plaintext word: Think (5 letters)

Hmm.

Second ciphertext word: rtqwainr (8 letters)

Second plaintext word: step (4 letters)

Third ciphertext word: acxz (4 letters)

Third plaintext word: by (2 letters)

Fourth ciphertext word: mynzbhhx (8 letters)

Fourth plaintext word: step (4 letters)

Interesting.

It seems that the ciphertext words are exactly twice as long as the plaintext words.

(10 vs 5, 8 vs 4, 4 vs 2, 8 vs 4)

Idea: Maybe we need to take every other letter or rebuild the plaintext from the ciphertext accordingly.

Let's test this theory.

If we take every other letter from the ciphertext words, maybe we'll get the plaintext.

Let's try this for the first word:

Ciphertext word: o y f j d n i s d r (10 letters)

Alternate letters:

First letters (positions 1,3,5,7,9): o f d i d

Second letters (positions 2,4,6,8,10): y j n s r

Wait: Let's try taking every second letter or first letter.

Option 1: Take the odd positions:

Positions: 1 3 5 7 9

Letters: o f d i d

Plaintext word is "Think", letters are T h i n k

Hmm.

Perhaps we can map letters:

o f d i d --> Think

Not directly

Option 2: Try mapping as per an assigned code: perhaps columns of letters?

Alternatively, perhaps the cipher is more complex.

Alternatively, notice that "oyfjdnisdr" has 10 letters and "Think" has 5 letters.

What if each plaintext letter is being represented by two ciphertext letters?

Similarly for other words.

Suppose that each plaintext letter corresponds to pair of letters in the ciphertext.

Meaning that the first two letters of ciphertext map to the first plaintext letter, and so on.

So for the first word, partition the ciphertext letters into pairs:

oy fj dn is dr

Corresponding to plaintext letters T h i n k

So we can attempt to map:

oy -> T

fj -> h

dn -> i

is -> n

dr -> k

If this is the mapping, perhaps each pair of letters maps to one plaintext letter.

Similarly for other words.

Let's test this mapping.

Is there a consistent mapping from pairs to letters?

Let's make a table of pairs and corresponding letters.

First, make the pairs from the first word and their corresponding plaintext letters:

Ciphertext Pairs -> Plaintext Letter

oy -> T

fj -> h

dn -> i

is -> n

dr -> k

Now, what's the mapping from "oy" to 'T'?

Is there a way to map 'oy' to 'T'?

Similarly for others.

One possible way is that 'oy' represents 'T'.

But how?

Option 1: Sum of the letters.

Convert letters to numbers (A=1, B=2, ..., Z=26)

So let's try to get to 'T' from 'o' and 'y'

o = 15

y = 25

Sum: 15 +25 = 40

But 'T' is 20.

Alternatively, perhaps subtract: 25 -15 = 10.

No.

Alternatively, perhaps combine the numbers in some way.

Alternatively, think about their positions in the alphabet.

Alternatively, perhaps the letters are encrypted via a code.

Alternatively, perhaps if we overlay the word 'Think' over the cipher pairs 'oy', 'fj', etc., the cipher is formed by substituting each plaintext letter with two letters.

Alternatively, perhaps consider the 'original' letters.

Let's look at the letters between the word "Think" and "oyfjdnisdr"

Perhaps we can spot that "oyfjdnisdr" is an anagram for "Thinkfiles"?

But that seems unlikely.

Alternatively, think of the pairs.

Suppose we assign numbers to letters: a=1, b=2, ..., z=26

Consider the numbers in 'oy':

o = 15

y = 25

Add them: 15 + 25 = 40

Now, for 'T' we have:

'T' is number 20

What if we take the sum of the cipher pair mod 26.

15 +25 = 40

Then 40 mod 26 = 14

14 corresponds to 'N'

Doesn't match.

Alternatively, see if the cipher pairs correspond to something else.

Alternatively, perhaps the pair 'oy' corresponds to 'T' via key.

Alternatively, perhaps it's as simple as 'oy' is 'T'.

But 'o' corresponds to 'T', subtracting 'o' ASCII code from 'T'.

Now, let's think in terms of symmetric letters.

Alternatively, perhaps we can spot that the letters correspond to the first letters of words.

But perhaps an easier way is to look at the word positions.

Alternatively, perhaps we can see that the ciphertext letters are offsets from the plaintext letters.

So let's try to map the pairs to the plaintext letters.

First pair 'oy' and plaintext letter 'T'

Letter 'o' is 15, 'T' is 20

Letter 'o' + 5 = 'T' (15 +5=20)

But 'y' is 25, which is 5 letters after 'T' (20 +5=25)

Interesting, seems that 'o' +5 = 'T'

Similarly, 'y' +1 = 'Z', which is not 'T'.

Alternatively, maybe the average of the two letters corresponds to 'T'

(15 +25)/2 =20, which is 'T'

Wait a minute, that seems promising.

First pair: 'o' (15) + 'y' (25) = 40

40 /2 =20

20 corresponds to 'T'

So perhaps the average of the letters in the pair corresponds to the plaintext letter.

Let's check this with the second pair.

Second pair: 'fj' corresponding to 'h'

Letters 'f'=6, 'j'=10

Sum: 6+10=16

Average:16/2=8

8 corresponds to 'h' (since 'h' is 8)

Success!

Third pair: 'dn' to 'i'

'd'=4, 'n'=14

Sum:4+14=18

Average:18/2=9

9 corresponds to 'i'(9='i')

But 'i' is 9, so that seems off by 1.

So perhaps we need to think carefully about letters.

Wait, 18/2=9, 9 corresponds to 'I'

So this works.

Fourth pair: 'is' corresponding to 'n'

'i'=9, 's'=19

Sum:9+19=28

Average:28/2=14

14 corresponds to 'n'(14='N')

Yes!

Fifth pair: 'dr' corresponds to 'k'

'd'=4, 'r'=18

Sum:4+18=22

Average:22/2=11

11 corresponds to 'k'(11='K')

Perfect!

So our code is: For each pair, sum their numeric values, divide by 2, get the corresponding letter.

Alternatively, the average of the numeric values of the letters corresponds to the plaintext letter.

This seems to be a valid method.

Now let's test this with the other words.

Second word ciphertext: rtqwainr

Ciphertext pairs:

rt -> s

qw -> ?

ai -> ?

nr -> ?

Corresponding to plaintext letters 's', 't', 'e', 'p'

So let's take the pairs:

1. 'rt' corresponds to 's'

r=18, t=20

Sum:38

Average:19

19 corresponds to 's' (19='S')

Good

2. 'qw' corresponds to 't'

q=17, w=23

Sum:17+23=40

Average:20

20 corresponds to 't' (20='T')

Good

3. 'ai' corresponds to 'e'

a=1, i=9

Sum:10

Average:5

5 corresponds to 'e' (5='E')

Good

4. 'nr' corresponds to 'p'

n=14, r=18

Sum:32

Average:16

16 corresponds to 'p' (16='P')

Perfect!

Third word: acxz (4 letters)

Plaintext word: 'by' (2 letters)

Ciphertext pairs:

a c

x z

Plaintext letters: 'b', 'y'

First pair: 'a', 'c'

a=1, c=3

Sum:4

Average:2

2 corresponds to 'b'(2='B')

Second pair: 'x' and 'z'

x=24, z=26

Sum:50

Average:25

25 corresponds to 'y'(25='Y')

Yes!

Fourth word: mynzbhhx (8 letters)

Plaintext word: 'step' (4 letters)

Ciphertext pairs:

my

nz

bh

hx

Plaintext letters 's','t','e','p'

First pair 'm','y'

m=13, y=25

Sum:38

Average:19

19 corresponds to 's'(19='S')

Second pair: 'n','z'

n=14, z=26

Sum: 40

Average:20

20='t'

Third pair: 'b','h'

b=2, h=8

Sum:10

Average:5

5='e'

Fourth pair: 'h','x'

h=8, x=24

Sum:32

Average:16

16='p'

Got it!

Now, given this, we can decode the given message.

Ciphertext to decode:

"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

First, we need to divide into words.

Given that the ciphertext in the example had words of length:

-10 letters (5 pairs)
-8 letters (4 pairs)
-4 letters (2 pairs)
-8 letters (4 pairs)

Transported to plaintext words of:

-5 letters
-4 letters
-2 letters
-4 letters

So perhaps in our supplied ciphertext, we need to divide the letters into words accordingly.

Ciphertext: o y e k a i j z d f a a p t c g s u a o k y b h a i o u o w a q h t m y n z n v a a t z a c d f o u l x x z

First, let's write the letters in order:

1) oyekaijzdf (10 letters) -> word of 5 letters

2) aaptcg (6 letters) -> word of 3 letters

3) suaokybhai (10 letters) -> word of 5 letters

4) ouow (4 letters) -> word of 2 letters

5) aqht (4 letters) -> word of 2 letters

6) mynznvaatzacdfoulxxz (22 letters) -> word of 11 letters

Totals:

So the letters are:

First word: oyekaijzdf (10 letters = 5 pairs)

Second word: aaptcg (6 letters=3 pairs)

Third word: suaokybhai (10 letters=5 pairs)

Fourth word: ouow (4 letters=2 pairs)

Fifth word: aqht (4 letters=2 pairs)

Sixth word: mynznvaatzacdfoulxxz (22 letters=11 pairs)

So we can suppose that the plaintext words have lengths:

1) 5 letters

2) 3 letters

3) 5 letters

4) 2 letters

5) 2 letters

6)11 letters

Now we can proceed to decode.

First word: oyekaijzdf

Divide into pairs:

oy

ek

ai

jz

df

So we can find the plaintext letters by averaging the numeric values of the letters in each pair.

First pair: 'o','y'

o=15, y=25

Sum=40

Average=20

20='T'

Second pair: 'e','k'

e=5, k=11

Sum=16

Average=8

8='H'

Third pair: 'a','i'

a=1, i=9

Sum=10

Average=5

5='E'

Fourth pair: 'j','z'

j=10, z=26

Sum=36

Average=18

18='R'

Fifth pair: 'd','f'

d=4, f=6

Sum=10

Average=5

5='E'

So the first word is 'T','H','E','R','E'

So 'THERE'

Second word: aaptcg

Pairs:

aa

pt

cg

First pair: 'a','a'

a=1, a=1

Sum=2

Average=1

1='A'

Second pair: 'p','t'

p=16, t=20

Sum=36

Average=18

18='R'

Third pair: 'c','g'

c=3, g=7

Sum=10

Average=5

5='E'

So second word: 'A','R','E'

Third word: suaokybhai

Pairs:

su

ao

ky

bh

ai

First pair: s=19, u=21

Sum=40, Average=20

20='T'

Second pair: a=1, o=15

Sum=16, Average=8

8='H'

Third pair: k=11, y=25

Sum=36, Average=18

18='R'

Fourth pair: b=2, h=8

Sum=10, Average=5

5='E'

Fifth pair: a=1, i=9

Sum=10, Average=5

5='E'

Third word: 'T','H','R','E','E'

Fourth word: ouow

Pairs: ou, ow

First pair: o=15, u=21

Sum=36, Average=18

18='R'

Second pair: o=15, w=23

Sum=38, Average=19

19='S'

Fourth word: 'R','S'

Fifth word: aqht

Pairs: a q, h t

First pair: a=1, q=17

Sum=18, Average=9

9='I'

Second pair: h=8, t=20

Sum=28, Average=14

14='N'

Fifth word: 'I','N'

Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)

Pairs:

m y

n z

n v

a a

t z

a c

d f

o u

l x

x z

Let's list them properly.

Wait, earlier I missed some letters there.

Let's re-express the sixth word letters:

m y n z n v a a t z a c d f o u l x x z

Total letters: 22 letters

Pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

Wait, that's 21 letters.

But in total, we have 22 letters; I think I have a mismatch.

Wait, in the problem the sixth word is:

mynznvaatzacdfoulxxz

Let's separate the letters:

m y n z n v a a t z a c d f o u l x x z

Total letters: 22 letters

So pairs would be:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

Wait, 'z' is only one letter at the end.

So maybe the letters are:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z (left alone?)

But then we're missing a letter.

Alternatively, perhaps the 'z' at the end is an extra letter or it's paired with the previous 'x'.

Alternatively, perhaps there is a mistake.

Alternatively, perhaps when we reach the last 'x x z', we should write:

9. l x

10. x z

Total pairs: 11

So updated pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x z

Now total pairs: 10 pairs. But we are supposed to have 11 letters in the plaintext word.

So perhaps I have miscounted and not considered including 'f' or 'z'.

Wait, is there any chance that the sequence is:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

We have 22 letters.

Perhaps the 'x x' and 'z' are overlapping.

But we need 11 pairs.

I think for this to work, we need to create 11 pairs from the 22 letters.

So let's try pairing letters in order, taking two letters at a time.

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z ?

But again, last pair only has one letter 'z'

Alternatively, we might need to adjust the pairs.

Perhaps it's 'ou', 'lx', 'xz', and there is no extra 'x'.

Maybe I miswrote.

Wait, let's write the letters as they come:

m y n z n v a a t z a c d f o u l x x z

Positions:

1 m

2 y

3 n

4 z

5 n

6 v

7 a

8 a

9 t

10 z

11 a

12 c

13 d

14 f

15 o

16 u

17 l

18 x

19 x

20 z

So 20 letters.

Now that makes sense.

Now we can divide into pairs:

Pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x z

Total of 10 pairs, which will correspond to 10 letters.

But earlier we thought the word was 11 letters.

But 10 letters may be acceptable.

Maybe we should proceed with the 10 pairs.

First pair: m y

m=13, y=25

Sum=38

Average=19

19='S'

Second pair: n z

n=14, z=26

Sum=40

Average=20

20='T'

Third pair: n v

n=14, v=22

Sum=36

Average=18

18='R'

Fourth pair: a a

a=1, a=1

Sum=2

Average=1

1='A'

Fifth pair: t z

t=20, z=26

Sum=46

Average=23

23='W'

Sixth pair: a c

a=1, c=3

Sum=4

Average=2

2='B'

Seventh pair: d f

d=4, f=6

Sum=10

Average=5

5='E'

Eighth pair: o u

o=15, u=21

Sum=36

Average=18

18='R'

Ninth pair: l x

l=12, x=24

Sum=36

Average=18

18='R'

Tenth pair: x z

x=24, z=26

Sum=50

Average=25

25='Y'

So the tenth word is 'S', 'T', 'R', 'A', 'W', 'B', 'E', 'R', 'R', 'Y'

Wait that's 10 letters

So the sixth word is 'STRAWBERRY', which makes sense.

So putting all these translated letters together:

First word: 'THERE'

Second word: 'ARE'

Third word: 'THREE'

Fourth word: 'RS' (? Doesn't make sense)

Wait, the fourth word, the pairs were:

ou ow

Pairs:

First pair: o u

o=15, u=21

Sum=36, Average=18

18='R'

Second pair: o w

o=15, w=23

Sum=38, Average=19

19='S'

So we get 'R','S'

Similarly, fifth word 'IN'

'R', 'S' 'I','N'

So the message so far is

'THERE ARE THREE RS IN'

And then 'STRAWBERRY'

Wait, let's put the words together:

'THERE ARE THREE Rs IN STRAWBERRY'

I think 'RS' is 'R's', meaning 'Rs'

So the message is:

'THERE ARE THREE R'S IN STRAWBERRY'

Which is true.

So the decoded message is:

"THERE ARE THREE R'S IN STRAWBERRY"

To decode the given ciphertext using the example provided, we’ll use the pattern identified in the example.

Ciphertext Example:

oyfjdnisdr rtqwainr acxz mynzbhhx

Decoded as:

Think step by step

Decoding Method:

Pair the letters in the ciphertext.
Convert each letter to its numerical position in the alphabet (A=1, B=2, …, Z=26).
Sum the numerical values of each pair.
Compute the average of the sum (divide by 2).
Convert the average back to a letter (1=A, 2=B, …, 26=Z).

Apply this method to the given ciphertext:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Break down the ciphertext into pairs:

1. First word: “oyekaijzdf”
  1. Pairs: oy, ek, ai, jz, df
  2. Decoded letters:
    1. oy → (15+25)/2 = 20 → T
    2. ek → (5+11)/2 = 8 → H
    3. ai → (1+9)/2 = 5 → E
    4. jz → (10+26)/2 = 18 → R
    5. df → (4+6)/2 = 5 → E
  3. Decoded word: THERE
2. Second word: “aaptcg”
  1. Pairs: aa, pt, cg
  2. Decoded letters:
    1. aa → (1+1)/2 = 1 → A
    2. pt → (16+20)/2 = 18 → R
    3. cg → (3+7)/2 = 5 → E
  3. Decoded word: ARE
3. Third word: “suaokybhai”
  1. Pairs: su, ao, ky, bh, ai
  2. Decoded letters:
    1. su → (19+21)/2 = 20 → T
    2. ao → (1+15)/2 = 8 → H
    3. ky → (11+25)/2 = 18 → R
    4. bh → (2+8)/2 = 5 → E
    5. ai → (1+9)/2 = 5 → E
  3. Decoded word: THREE
4. Fourth word: “ouow”
  1. Pairs: ou, ow
  2. Decoded letters:
    1. ou → (15+21)/2 = 18 → R
    2. ow → (15+23)/2 = 19 → S
  3. Decoded letters: R’S
5. Fifth word: “aqht”
  1. Pairs: aq, ht
  2. Decoded letters:
    1. aq → (1+17)/2 = 9 → I
    2. ht → (8+20)/2 = 14 → N
  3. Decoded word: IN
6. Sixth word: “mynznvaatzacdfoulxxz”
  1. Pairs: my, nz, nv, aa, tz, ac, df, ou, lx, xz
  2. Decoded letters:
    1. my → (13+25)/2 = 19 → S
    2. nz → (14+26)/2 = 20 → T
    3. nv → (14+22)/2 = 18 → R
    4. aa → (1+1)/2 = 1 → A
    5. tz → (20+26)/2 = 23 → W
    6. ac → (1+3)/2 = 2 → B
    7. df → (4+6)/2 = 5 → E
    8. ou → (15+21)/2 = 18 → R
    9. lx → (12+24)/2 = 18 → R
    10. xz → (24+26)/2 = 25 → Y
  3. Decoded word: STRAWBERRY

Final Decoded Message:

THERE ARE THREE R'S IN STRAWBERRY

Answer:
THERE ARE THREE R’S IN STRAWBERRY
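The decoding rule that o1-preview arrives at can be written as a short Python function. This is our own sketch of the rule, not part of the model's output. It assumes lowercase a-z ciphertext with words separated by spaces. It rejects odd-length words and pairs whose alphabet positions do not average to a whole number, since the example never produces either case.

```python
def position(letter: str) -> int:
    """Alphabet position with a=1, b=2, ..., z=26."""
    return ord(letter) - ord("a") + 1


def decode(ciphertext: str) -> str:
    """Decode each word two letters at a time: the plaintext letter sits at
    the average alphabet position of the pair."""
    decoded_words = []
    for word in ciphertext.lower().split():
        if len(word) % 2 != 0:
            raise ValueError(f"odd-length cipher word: {word!r}")
        letters = []
        for i in range(0, len(word), 2):
            total = position(word[i]) + position(word[i + 1])
            if total % 2 != 0:
                raise ValueError(f"pair {word[i:i + 2]!r} has no whole-number average")
            # Average position back to an uppercase letter (1 -> A, 26 -> Z).
            letters.append(chr(ord("A") + total // 2 - 1))
        decoded_words.append("".join(letters))
    return " ".join(decoded_words)


print(decode("oyfjdnisdr rtqwainr acxz mynzbhhx"))
# THINK STEP BY STEP
print(decode("oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"))
# THERE ARE THREE RS IN STRAWBERRY
```

The ciphertext carries no apostrophe, so the function returns "RS". Reading it as "R'S" is an interpretation step on top of the mechanical decoding. Decoding in code also sidesteps the letter-count slips that the chain of thought has to recover from: the sixth word has 20 letters, which form 10 pairs.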

Coding

We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have scored only 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
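The post names the three signals used to choose the 50 submissions but does not say how they were combined. The sketch below assumes a simple lexicographic ranking: public tests first, then model-generated tests, then the learned score. The `Candidate` fields and this ordering are hypothetical illustrations, not the actual system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str
    public_pass_rate: float     # fraction of public IOI test cases passed
    generated_pass_rate: float  # fraction of model-generated test cases passed
    learned_score: float        # output of a learned scoring function (hypothetical)


def select_submissions(candidates: list[Candidate], budget: int = 50) -> list[Candidate]:
    """Return up to `budget` candidates. The ranking is lexicographic on
    (public tests, generated tests, learned score), which is an assumed ordering."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.public_pass_rate, c.generated_pass_rate, c.learned_score),
        reverse=True,
    )
    return ranked[:budget]


pool = [
    Candidate("A", 1.0, 0.5, 0.1),
    Candidate("B", 1.0, 0.9, 0.0),
    Candidate("C", 0.5, 1.0, 1.0),
]
print([c.code for c in select_submissions(pool, budget=2)])  # ['B', 'A']
```

A lexicographic key is the simplest choice that never lets a candidate failing public tests outrank one that passes them. A weighted sum would be an equally plausible alternative. Compared with random selection at 156 points on average, the reported 213 points is a gain of 57, the "nearly 60 points" above.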

With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14, above the gold medal threshold, even without any test-time selection strategy.

Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating³ of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.

[Figure: bar chart of Codeforces Elo ratings and percentiles: GPT-4o 808 (11th percentile), o1-preview 1258 (62nd percentile), o1 1673 (89th percentile), o1-ioi 1807 (93rd percentile).]

Further fine-tuning on programming competitions improves o1. The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules.
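The textbook Elo model gives some intuition for the rating gaps quoted above. It gives the expected score of a player rated R_A against one rated R_B as 1 / (1 + 10^((R_B - R_A) / 400)). Codeforces' own rating system is Elo-like but differs in its details, so treat this sketch as illustrative only. It is not how the ratings in this post were computed.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo: expected score (roughly, win probability) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


print(round(elo_expected_score(1807, 808), 3))   # 0.997  (o1-ioi vs. GPT-4o)
print(round(elo_expected_score(1807, 1673), 3))  # 0.684  (o1-ioi vs. o1)
```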

Human preference evaluation

In addition to exams and academic benchmarks, we also evaluated human preference of o1-preview vs. GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred. o1-preview is preferred to GPT-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.
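This pairwise setup reduces to a per-category win rate. The sketch below uses made-up votes. The category names and vote records are hypothetical, and ties or skipped comparisons, which the post does not discuss, are not modeled.

```python
from collections import Counter


def win_rates(votes: list[tuple[str, str]], model: str = "o1-preview") -> dict[str, float]:
    """Per-category fraction of pairwise votes won by `model`.
    Each vote is a (category, preferred_model) pair."""
    totals = Counter(category for category, _ in votes)
    wins = Counter(category for category, winner in votes if winner == model)
    return {category: wins[category] / totals[category] for category in totals}


# Hypothetical votes, not data from the evaluation.
votes = [
    ("math", "o1-preview"), ("math", "o1-preview"), ("math", "gpt-4o"),
    ("personal writing", "gpt-4o"), ("personal writing", "o1-preview"),
]
print(win_rates(votes))
```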

Safety

Chain-of-thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence that reasoning capability directly benefits model robustness. o1-preview achieved substantially improved performance on key jailbreak evaluations and on our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model's reasoning about safety rules is more robust to out-of-distribution scenarios.

To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain-of-thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.

Metric	GPT-4o	o1-preview
Safe-completion rate on harmful prompts: Standard	0.990	0.995
Safe-completion rate on harmful prompts: Challenging (jailbreaks and edge cases)	0.714	0.934
↳ Harassment (severe)	0.845	0.900
↳ Exploitative sexual content	0.483	0.949
↳ Sexual content involving minors	0.707	0.931
↳ Advice about non-violent wrongdoing	0.688	0.961
↳ Advice about violent wrongdoing	0.778	0.963
Safe-completion rate for the top 200 prompts with the highest Moderation API scores per category in WildChat (Zhao, et al. 2024)	0.945	0.971
Goodness@0.1 on the StrongREJECT jailbreak eval (Souly et al. 2024)	0.220	0.840
Human-sourced jailbreak eval	0.770	0.960
Compliance rate on internal benign edge cases ("not over-refusal")	0.910	0.930
Compliance rate on benign edge cases in XSTest ("not over-refusal") (Röttger, et al. 2023)	0.924	0.976

Hiding the Chains of Thought

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have the freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue chain-of-thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge that this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series, we show a model-generated summary of the chain of thought.

Conclusion

o1 significantly advances the state of the art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities will improve our ability to align models to human values and principles. We believe o1, and its successors, will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work.

Appendix A

Dataset	Metric	GPT-4o	o1-preview	o1
Competition Math: AIME (2024)	cons@64	13.4	56.7	83.3
Competition Math: AIME (2024)	pass@1	9.3	44.6	74.4
Competition Code: CodeForces	Elo	808	1,258	1,673
Competition Code: CodeForces	Percentile	11.0	62.0	89.0
GPQA Diamond	cons@64	56.1	78.3	78.0
GPQA Diamond	pass@1	50.6	73.3	77.3
Biology	cons@64	63.2	73.7	68.4
Biology	pass@1	61.6	65.9	69.2
Chemistry	cons@64	43.0	60.2	65.6
Chemistry	pass@1	40.2	59.9	64.7
Physics	cons@64	68.6	89.5	94.2
Physics	pass@1	59.5	89.4	92.8
MATH	pass@1	60.3	85.5	94.8
MMLU	pass@1	88.0	92.3	90.8
MMMU (val)	pass@1	69.1	n/a	78.2
MathVista (testmini)	pass@1	63.8	n/a	73.9

Authors

OpenAI


Citations

1. https://www.anthropic.com/news/claude-3-5-sonnet, https://deepmind.google/technologies/gemini/pro
2. Our evaluations used the same 500-problem test split found in https://arxiv.org/abs/2305.20050
3. https://codeforces.com/blog/entry/68288