September 12, 2024

Learning to reason with LLMs

We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning tasks. o1 thinks before it answers: it produces a long internal chain of thought before responding to the user.

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of the model, OpenAI o1-preview, for immediate use in ChatGPT and for trusted API users.

Our large-scale reinforcement learning algorithm teaches the model how to reason productively using its chain of thought, in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

The image shows two scatter plots of "o1 AIME accuracy", one during training and one at test time. Both plots have "pass@1 accuracy" on the y-axis and "compute (log scale)" on the x-axis. The points show accuracy increasing as compute grows.

o1 performance improves smoothly with both train-time and test-time compute

Evals

To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and machine learning (ML) benchmarks. The results show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, o1 was evaluated at the maximal test-time compute setting.

Competition evals for math (AIME 2024), code (CodeForces), and PhD-level science questions (GPQA Diamond): o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy, and the shaded region shows the performance of majority vote (consensus) with 64 samples.

Breakdown of accuracy and raw score for GPT-4o versus o1 on ML benchmarks and exams: o1 improves over GPT-4o on a wide range of benchmarks, including 54 of 57 MMLU subcategories. Seven are shown for illustration.

On many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models¹ do so well on MATH² and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school students in the US. On the 2024 AIME exams, GPT-4o solved on average only 12% (1.8/15) of the problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1,000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
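The three AIME figures above correspond to three ways of turning many sampled answers into one result. The following is a minimal sketch of those aggregation rules, assuming answers are compared as exact strings; the function names and the toy sample data are hypothetical, and the learned scoring function is represented only by an arbitrary callable.

```python
from collections import Counter
from typing import Callable, Sequence


def pass_at_1(samples: Sequence[str], correct: str) -> float:
    """Expected accuracy when one sample is drawn uniformly at random."""
    return sum(s == correct for s in samples) / len(samples)


def consensus(samples: Sequence[str]) -> str:
    """Majority vote: return the most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


def rerank(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the sample that a scoring function rates highest."""
    return max(samples, key=score)


# Hypothetical data: 64 sampled answers to one problem whose answer is "204".
samples = ["204"] * 30 + ["210"] * 20 + ["96"] * 14
p1 = pass_at_1(samples, "204")
voted = consensus(samples)

# The percentages in the text are average scores out of 15 problems.
percentages = [round(100 * s / 15) for s in (1.8, 11.1, 12.5, 13.9)]
```

In this toy case fewer than half of the samples are correct, yet the majority vote still recovers the right answer, which is the effect the consensus figure measures.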

We also evaluated o1 on GPQA diamond, a difficult benchmark that tests expertise in chemistry, physics, and biology. To compare models with humans, we recruited experts with PhDs to answer GPQA diamond questions. We found that o1 surpassed the performance of those experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects, only that the model is more proficient at solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state of the art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 of the 57 MMLU subcategories.

Chain of Thought

Just as a person may take time to think before answering a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes, to break down tricky steps into simpler ones, and to try a different approach when the current one is not working. This process significantly improves the model's ability to reason. To illustrate this progress, we show below the chain of thought from o1-preview on several difficult problems.

GPT-4o

User

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Let’s break this down step by step based on the example:

1. Example given:
• Input: oyfjdnisdr rtqwainr acxz mynzbhhx
• Output: Think step by step

By examining the words:
• The pattern involves selecting specific letters or transforming them.

2. Now, let’s decode the new phrase:
• Input: oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Step-by-step decoding process:

Identify possible word divisions.

The first phrase might follow the same structure as the example. It seems like the input text could be broken into groups based on natural separations or patterns.

A possible division could be:

oyekaijzdf
aaptcg
suaokybhai
ouow
aqht
mynznvaatzacdfoulxxz

However, to make a meaningful decoding, I would need more context about the transformations or letter shifting that might be involved.

Could you provide any additional decoding rules or transformations used in this cipher?

OpenAI o1-preview

User

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Thought for 5 seconds

First, what is going on here?

We are given:

First, an example:

"oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step"

We can see that when "oyfjdnisdr rtqwainr acxz mynzbhhx" is transformed to "Think step by step"

Our task is: Use the example above to decode: "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

So the first part is to figure out how "oyfjdnisdr rtqwainr acxz mynzbhhx" is decoded into "Think step by step"

Observation 1: The phrase "Think step by step" is 18 letters (excluding spaces) . Let's count the letters.

Similarly, let's count the letters in "oyfjdnisdr rtqwainr acxz mynzbhhx"

Original Ciphertext:

oyfjdnisdr rtqwainr acxz mynzbhhx

Plaintext:

Think step by step

First, break down the letters:

Ciphertext letters: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Total letters: Let's count the letters in the ciphertext.

First word: oyfjdnisdr (10 letters)

Second word: rtqwainr (8 letters)

Third word: acxz (4 letters)

Fourth word: mynzbhhx (8 letters)

Total letters in ciphertext: 10 + 8 + 4 + 8 = 30 letters

Plaintext: Think step by step

First word: Think (5 letters)

Second word: step (4 letters)

Third word: by (2 letters)

Fourth word: step (4 letters)

Total letters in plaintext: 5 + 4 + 2 + 4 = 15 letters

Hmm.

But actually in the problem it says the example:

[Ciphertext] -> Think step by step

So is the decoding process that the ciphertext maps to "Think step by step"?

Our goal is to use the example above to decode:

"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

First, the cipher seems connected to the plaintext.

Given the time constraints, perhaps the easiest way is to try to see patterns.

Option 1: Try to find mapping from letters to letters.

Do any letters match?

First, let's write down the ciphertext and plaintext letters on top of each other.

Ciphertext: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Plaintext: T h i n k s t e p b y s t e p

Wait a minute.

I think maybe there is an anagram or substitution cipher here.

Alternatively, I think that we can notice that each group of ciphertext corresponds to a plaintext word.

Check the number of letters.

First ciphertext word: oyfjdnisdr (10 letters)

First plaintext word: Think (5 letters)

Hmm.

Second ciphertext word: rtqwainr (8 letters)

Second plaintext word: step (4 letters)

Third ciphertext word: acxz (4 letters)

Third plaintext word: by (2 letters)

Fourth ciphertext word: mynzbhhx (8 letters)

Fourth plaintext word: step (4 letters)

Interesting.

It seems that the ciphertext words are exactly twice as long as the plaintext words.

(10 vs 5, 8 vs 4, 4 vs 2, 8 vs 4)

Idea: Maybe we need to take every other letter or rebuild the plaintext from the ciphertext accordingly.

Let's test this theory.

If we take every other letter from the ciphertext words, maybe we'll get the plaintext.

Let's try this for the first word:

Ciphertext word: o y f j d n i s d r (10 letters)

Alternate letters:

First letters (positions 1,3,5,7,9): o f d i d

Second letters (positions 2,4,6,8,10): y j n s r

Wait: Let's try taking every second letter or first letter.

Option 1: Take the odd positions:

Positions: 1 3 5 7 9

Letters: o f d i d

Plaintext word is "Think", letters are T h i n k

Hmm.

Perhaps we can map letters:

o f d i d --> Think

Not directly

Option 2: Try mapping as per an assigned code: perhaps columns of letters?

Alternatively, perhaps the cipher is more complex.

Alternatively, notice that "oyfjdnisdr" has 10 letters and "Think" has 5 letters.

What if each plaintext letter is being represented by two ciphertext letters?

Similarly for other words.

Suppose that each plaintext letter corresponds to pair of letters in the ciphertext.

Meaning that the first two letters of ciphertext map to the first plaintext letter, and so on.

So for the first word, partition the ciphertext letters into pairs:

oy fj dn is dr

Corresponding to plaintext letters T h i n k

So we can attempt to map:

oy -> T

fj -> h

dn -> i

is -> n

dr -> k

If this is the mapping, perhaps each pair of letters maps to one plaintext letter.

Similarly for other words.

Let's test this mapping.

Is there a consistent mapping from pairs to letters?

Let's make a table of pairs and corresponding letters.

First, make the pairs from the first word and their corresponding plaintext letters:

Ciphertext Pairs -> Plaintext Letter

oy -> T

fj -> h

dn -> i

is -> n

dr -> k

Now, what's the mapping from "oy" to 'T'?

Is there a way to map 'oy' to 'T'?

Similarly for others.

One possible way is that 'oy' represents 'T'.

But how?

Option 1: Sum of the letters.

Convert letters to numbers (A=1, B=2, ..., Z=26)

So let's try to get to 'T' from 'o' and 'y'

o = 15

y = 25

Sum: 15 +25 = 40

But 'T' is 20.

Alternatively, perhaps subtract: 25 -15 = 10.

No.

Alternatively, perhaps combine the numbers in some way.

Alternatively, think about their positions in the alphabet.

Alternatively, perhaps the letters are encrypted via a code.

Alternatively, perhaps if we overlay the word 'Think' over the cipher pairs 'oy', 'fj', etc., the cipher is formed by substituting each plaintext letter with two letters.

Alternatively, perhaps consider the 'original' letters.

Let's look at the letters between the word "Think" and "oyfjdnisdr"

Perhaps we can spot that "oyfjdnisdr" is an anagram for "Thinkfiles"?

But that seems unlikely.

Alternatively, think of the pairs.

Suppose we assign numbers to letters: a=1, b=2, ..., z=26

Consider the numbers in 'oy':

o = 15

y = 25

Add them: 15 + 25 = 40

Now, for 'T' we have:

'T' is number 20

What if we take the sum of the cipher pair mod 26.

15 +25 = 40

Then 40 mod 26 = 14

14 corresponds to 'N'

Doesn't match.

Alternatively, see if the cipher pairs correspond to something else.

Alternatively, perhaps the pair 'oy' corresponds to 'T' via key.

Alternatively, perhaps it's as simple as 'oy' is 'T'.

But 'o' corresponds to 'T', subtracting 'o' ASCII code from 'T'.

Now, let's think in terms of symmetric letters.

Alternatively, perhaps we can spot that the letters correspond to the first letters of words.

But perhaps an easier way is to look at the word positions.

Alternatively, perhaps we can see that the ciphertext letters are offsets from the plaintext letters.

So let's try to map the pairs to the plaintext letters.

First pair 'oy' and plaintext letter 'T'

Letter 'o' is 15, 'T' is 20

Letter 'o' + 5 = 'T' (15 +5=20)

But 'y' is 25, which is 5 letters after 'T' (20 +5=25)

Interesting, seems that 'o' +5 = 'T'

Similarly, 'y' +1 = 'Z', which is not 'T'.

Alternatively, maybe the average of the two letters corresponds to 'T'

(15 +25)/2 =20, which is 'T'

Wait a minute, that seems promising.

First pair: 'o' (15) + 'y' (25) = 40

40 /2 =20

20 corresponds to 'T'

So perhaps the average of the letters in the pair corresponds to the plaintext letter.

Let's check this with the second pair.

Second pair: 'fj' corresponding to 'h'

Letters 'f'=6, 'j'=10

Sum: 6+10=16

Average:16/2=8

8 corresponds to 'h' (since 'h' is 8)

Success!

Third pair: 'dn' to 'i'

'd'=4, 'n'=14

Sum:4+14=18

Average:18/2=9

9 corresponds to 'i'(9='i')

But 'i' is 9, so that seems off by 1.

So perhaps we need to think carefully about letters.

Wait, 18/2=9, 9 corresponds to 'I'

So this works.

Fourth pair: 'is' corresponding to 'n'

'i'=9, 's'=19

Sum:9+19=28

Average:28/2=14

14 corresponds to 'n'(14='N')

Yes!

Fifth pair: 'dr' corresponds to 'k'

'd'=4, 'r'=18

Sum:4+18=22

Average:22/2=11

11 corresponds to 'k'(11='K')

Perfect!

So our code is: For each pair, sum their numeric values, divide by 2, get the corresponding letter.

Alternatively, the average of the numeric values of the letters corresponds to the plaintext letter.

This seems to be a valid method.

Now let's test this with the other words.

Second word ciphertext: rtqwainr

Ciphertext pairs:

rt -> s

qw -> ?

ai -> ?

nr -> ?

Corresponding to plaintext letters 's', 't', 'e', 'p'

So let's take the pairs:

1. 'rt' corresponds to 's'

r=18, t=20

Sum:38

Average:19

19 corresponds to 's' (19='S')

Good

2. 'qw' corresponds to 't'

q=17, w=23

Sum:17+23=40

Average:20

20 corresponds to 't' (20='T')

Good

3. 'ai' corresponds to 'e'

a=1, i=9

Sum:10

Average:5

5 corresponds to 'e' (5='E')

Good

4. 'nr' corresponds to 'p'

n=14, r=18

Sum:32

Average:16

16 corresponds to 'p' (16='P')

Perfect!

Third word: acxz (4 letters)

Plaintext word: 'by' (2 letters)

Ciphertext pairs:

a c

x z

Plaintext letters: 'b', 'y'

First pair: 'a', 'c'

a=1, c=3

Sum:4

Average:2

2 corresponds to 'b'(2='B')

Second pair: 'x' and 'z'

x=24, z=26

Sum:50

Average:25

25 corresponds to 'y'(25='Y')

Yes!

Fourth word: mynzbhhx (8 letters)

Plaintext word: 'step' (4 letters)

Ciphertext pairs:

my

nz

bh

hx

Plaintext letters 's','t','e','p'

First pair 'm','y'

m=13, y=25

Sum:38

Average:19

19 corresponds to 's'(19='S')

Second pair: 'n','z'

n=14, z=26

Sum: 40

Average:20

20='t'

Third pair: 'b','h'

b=2, h=8

Sum:10

Average:5

5='e'

Fourth pair: 'h','x'

h=8, x=24

Sum:32

Average:16

16='p'

Got it!

Now, given this, we can decode the given message.

Ciphertext to decode:

"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"

First, we need to divide into words.

Given that the ciphertext in the example had words of length:

-10 letters (5 pairs)
-8 letters (4 pairs)
-4 letters (2 pairs)
-8 letters (4 pairs)

Transported to plaintext words of:

-5 letters
-4 letters
-2 letters
-4 letters

So perhaps in our supplied ciphertext, we need to divide the letters into words accordingly.

Ciphertext: o y e k a i j z d f a a p t c g s u a o k y b h a i o u o w a q h t m y n z n v a a t z a c d f o u l x x z

First, let's write the letters in order:

1) oyekaijzdf (10 letters) -> word of 5 letters

2) aaptcg (6 letters) -> word of 3 letters

3) suaokybhai (10 letters) -> word of 5 letters

4) ouow (4 letters) -> word of 2 letters

5) aqht (4 letters) -> word of 2 letters

6) mynznvaatzacdfoulxxz (22 letters) -> word of 11 letters

Totals:

So the letters are:

First word: oyekaijzdf (10 letters = 5 pairs)

Second word: aaptcg (6 letters=3 pairs)

Third word: suaokybhai (10 letters=5 pairs)

Fourth word: ouow (4 letters=2 pairs)

Fifth word: aqht (4 letters=2 pairs)

Sixth word: mynznvaatzacdfoulxxz (22 letters=11 pairs)

So we can suppose that the plaintext words have lengths:

1) 5 letters

2) 3 letters

3) 5 letters

4) 2 letters

5) 2 letters

6)11 letters

Now we can proceed to decode.

First word: oyekaijzdf

Divide into pairs:

oy

ek

ai

jz

df

So we can find the plaintext letters by averaging the numeric values of the letters in each pair.

First pair: 'o','y'

o=15, y=25

Sum=40

Average=20

20='T'

Second pair: 'e','k'

e=5, k=11

Sum=16

Average=8

8='H'

Third pair: 'a','i'

a=1, i=9

Sum=10

Average=5

5='E'

Fourth pair: 'j','z'

j=10, z=26

Sum=36

Average=18

18='R'

Fifth pair: 'd','f'

d=4, f=6

Sum=10

Average=5

5='E'

So the first word is 'T','H','E','R','E'

So 'THERE'

Second word: aaptcg

Pairs:

aa

pt

cg

First pair: 'a','a'

a=1, a=1

Sum=2

Average=1

1='A'

Second pair: 'p','t'

p=16, t=20

Sum=36

Average=18

18='R'

Third pair: 'c','g'

c=3, g=7

Sum=10

Average=5

5='E'

So second word: 'A','R','E'

Third word: suaokybhai

Pairs:

su

ao

ky

bh

ai

First pair: s=19, u=21

Sum=40, Average=20

20='T'

Second pair: a=1, o=15

Sum=16, Average=8

8='H'

Third pair: k=11, y=25

Sum=36, Average=18

18='R'

Fourth pair: b=2, h=8

Sum=10, Average=5

5='E'

Fifth pair: a=1, i=9

Sum=10, Average=5

5='E'

Third word: 'T','H','R','E','E'

Fourth word: ouow

Pairs: ou, ow

First pair: o=15, u=21

Sum=36, Average=18

18='R'

Second pair: o=15, w=23

Sum=38, Average=19

19='S'

Fourth word: 'R','S'

Fifth word: aqht

Pairs: a q, h t

First pair: a=1, q=17

Sum=18, Average=9

9='I'

Second pair: h=8, t=20

Sum=28, Average=14

14='N'

Fifth word: 'I','N'

Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)

Pairs:

m y

n z

n v

a a

t z

a c

d f

o u

l x

x z

Let's list them properly.

Wait, earlier I missed some letters there.

Let's re-express the sixth word letters:

m y n z n v a a t z a c d f o u l x x z

Total letters: 22 letters

Pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

Wait, that's 21 letters.

But in total, we have 22 letters; I think I have a mismatch.

Wait, in the problem the sixth word is:

mynznvaatzacdfoulxxz

Let's separate the letters:

m y n z n v a a t z a c d f o u l x x z

Total letters: 22 letters

So pairs would be:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

Wait, 'z' is only one letter at the end.

So maybe the letters are:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z (left alone?)

But then we're missing a letter.

Alternatively, perhaps the 'z' at the end is an extra letter or it's paired with the previous 'x'.

Alternatively, perhaps there is a mistake.

Alternatively, perhaps when we reach the last 'x x z', we should write:

9. l x

10. x z

Total pairs: 11

So updated pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x z

Now total pairs: 10 pairs. But we are supposed to have 11 letters in the plaintext word.

So perhaps I have miscounted and not considered including 'f' or 'z'.

Wait, is there any chance that the sequence is:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z

We have 22 letters.

Perhaps the 'x x' and 'z' are overlapping.

But we need 11 pairs.

I think for this to work, we need to create 11 pairs from the 22 letters.

So let's try pairing letters in order, taking two letters at a time.

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x x

11. z ?

But again, last pair only has one letter 'z'

Alternatively, we might need to adjust the pairs.

Perhaps it's 'ou', 'lx', 'xz', and there is no extra 'x'.

Maybe I miswrote.

Wait, let's write the letters as they come:

m y n z n v a a t z a c d f o u l x x z

Positions:

1 m

2 y

3 n

4 z

5 n

6 v

7 a

8 a

9 t

10 z

11 a

12 c

13 d

14 f

15 o

16 u

17 l

18 x

19 x

20 z

So 20 letters.

Now that makes sense.

Now we can divide into pairs:

Pairs:

1. m y

2. n z

3. n v

4. a a

5. t z

6. a c

7. d f

8. o u

9. l x

10. x z

Total of 10 pairs, which will correspond to 10 letters.

But earlier we thought the word was 11 letters.

But 10 letters may be acceptable.

Maybe we should proceed with the 10 pairs.

First pair: m y

m=13, y=25

Sum=38

Average=19

19='S'

Second pair: n z

n=14, z=26

Sum=40

Average=20

20='T'

Third pair: n v

n=14, v=22

Sum=36

Average=18

18='R'

Fourth pair: a a

a=1, a=1

Sum=2

Average=1

1='A'

Fifth pair: t z

t=20, z=26

Sum=46

Average=23

23='W'

Sixth pair: a c

a=1, c=3

Sum=4

Average=2

2='B'

Seventh pair: d f

d=4, f=6

Sum=10

Average=5

5='E'

Eighth pair: o u

o=15, u=21

Sum=36

Average=18

18='R'

Ninth pair: l x

l=12, x=24

Sum=36

Average=18

18='R'

Tenth pair: x z

x=24, z=26

Sum=50

Average=25

25='Y'

So the tenth word is 'S', 'T', 'R', 'A', 'W', 'B', 'E', 'R', 'R', 'Y'

Wait that's 10 letters

So the sixth word is 'STRAWBERRY', which makes sense.

So putting all these translated letters together:

First word: 'THERE'

Second word: 'ARE'

Third word: 'THREE'

Fourth word: 'RS' (? Doesn't make sense)

Wait, the fourth word, the pairs were:

ou ow

Pairs:

First pair: o u

o=15, u=21

Sum=36, Average=18

18='R'

Second pair: o w

o=15, w=23

Sum=38, Average=19

19='S'

So we get 'R','S'

Similarly, fifth word 'IN'

'R', 'S' 'I','N'

So the message so far is

'THERE ARE THREE RS IN'

And then 'STRAWBERRY'

Wait, let's put the words together:

'THERE ARE THREE Rs IN STRAWBERRY'

I think 'RS' is 'R's', meaning 'Rs'

So the message is:

'THERE ARE THREE R'S IN STRAWBERRY'

Which is true.

So the decoded message is:

"THERE ARE THREE R'S IN STRAWBERRY"

To decode the given ciphertext using the example provided, we’ll use the pattern identified in the example.

Ciphertext Example:

oyfjdnisdr rtqwainr acxz mynzbhhx

Decoded as:

Think step by step

Decoding Method:

Pair the letters in the ciphertext.
Convert each letter to its numerical position in the alphabet (A=1, B=2, …, Z=26).
Sum the numerical values of each pair.
Compute the average of the sum (divide by 2).
Convert the average back to a letter (1=A, 2=B, …, 26=Z).

Apply this method to the given ciphertext:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

Break down the ciphertext into pairs:

1. First word: “oyekaijzdf”
   1. Pairs: oy, ek, ai, jz, df
   2. Decoded letters:
      1. oy → (15+25)/2 = 20 → T
      2. ek → (5+11)/2 = 8 → H
      3. ai → (1+9)/2 = 5 → E
      4. jz → (10+26)/2 = 18 → R
      5. df → (4+6)/2 = 5 → E
   3. Decoded word: THERE
2. Second word: “aaptcg”
   1. Pairs: aa, pt, cg
   2. Decoded letters:
      1. aa → (1+1)/2 = 1 → A
      2. pt → (16+20)/2 = 18 → R
      3. cg → (3+7)/2 = 5 → E
   3. Decoded word: ARE
3. Third word: “suaokybhai”
   1. Pairs: su, ao, ky, bh, ai
   2. Decoded letters:
      1. su → (19+21)/2 = 20 → T
      2. ao → (1+15)/2 = 8 → H
      3. ky → (11+25)/2 = 18 → R
      4. bh → (2+8)/2 = 5 → E
      5. ai → (1+9)/2 = 5 → E
   3. Decoded word: THREE
4. Fourth word: “ouow”
   1. Pairs: ou, ow
   2. Decoded letters:
      1. ou → (15+21)/2 = 18 → R
      2. ow → (15+23)/2 = 19 → S
   3. Decoded letters: R’S
5. Fifth word: “aqht”
   1. Pairs: aq, ht
   2. Decoded letters:
      1. aq → (1+17)/2 = 9 → I
      2. ht → (8+20)/2 = 14 → N
   3. Decoded word: IN
6. Sixth word: “mynznvaatzacdfoulxxz”
   1. Pairs: my, nz, nv, aa, tz, ac, df, ou, lx, xz
   2. Decoded letters:
      1. my → (13+25)/2 = 19 → S
      2. nz → (14+26)/2 = 20 → T
      3. nv → (14+22)/2 = 18 → R
      4. aa → (1+1)/2 = 1 → A
      5. tz → (20+26)/2 = 23 → W
      6. ac → (1+3)/2 = 2 → B
      7. df → (4+6)/2 = 5 → E
      8. ou → (15+21)/2 = 18 → R
      9. lx → (12+24)/2 = 18 → R
      10. xz → (24+26)/2 = 25 → Y
   3. Decoded word: STRAWBERRY

Final Decoded Message:

THERE ARE THREE R'S IN STRAWBERRY

Answer:
THERE ARE THREE R’S IN STRAWBERRY
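The decoding method that o1-preview arrives at in the transcript above is mechanical enough to express in a few lines. The sketch below is an illustration written for this article, not part of the model's output; the function name `decode` is hypothetical, and it assumes lowercase a-z input in which every word has even length and every pair sums to an even number, as is the case for both ciphertexts shown.

```python
def decode(ciphertext: str) -> str:
    """Decode by averaging the alphabet positions (a=1 ... z=26) of each letter pair."""
    words = []
    for word in ciphertext.lower().split():
        if len(word) % 2:
            raise ValueError(f"odd-length ciphertext word: {word!r}")
        letters = []
        for i in range(0, len(word), 2):
            first = ord(word[i]) - ord("a") + 1
            second = ord(word[i + 1]) - ord("a") + 1
            # Assumes the pair sum is even, so integer division is exact.
            letters.append(chr((first + second) // 2 + ord("a") - 1))
        words.append("".join(letters))
    return " ".join(words).upper()


example = decode("oyfjdnisdr rtqwainr acxz mynzbhhx")
answer = decode("oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz")
```

The function returns the fourth word of the answer as the bare letters RS; reading it as R'S is an interpretation step that the transcript makes separately.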

Coding

We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants: it had ten hours to solve six challenging algorithmic problems and was allowed up to 50 submissions per problem.

For each problem, the system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, on model-generated test cases, and on a learned scoring function. Had the submissions been chosen at random, the average score would have been only 156 points, which suggests that this strategy was worth nearly 60 points under competition constraints.
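The text names three selection signals but does not say how they were combined. As a minimal sketch, the example below assumes a simple lexicographic ranking (public tests first, then model-generated tests, then the learned score); that ordering, the `Candidate` type, and `select_submissions` are all hypothetical illustrations, not the method actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    code: str
    public_pass_rate: float     # fraction of public test cases passed
    generated_pass_rate: float  # fraction of model-generated test cases passed
    learned_score: float        # output of a learned scoring function


def select_submissions(candidates: list[Candidate], limit: int = 50) -> list[Candidate]:
    """Keep the `limit` best candidates under an assumed lexicographic ranking."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.public_pass_rate, c.generated_pass_rate, c.learned_score),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical pool of 100 sampled solutions with made-up signal values.
pool = [Candidate(f"sol{i}", i / 100, 0.5, 0.1) for i in range(100)]
picked = select_submissions(pool)
```

A weighted sum of the three signals would be an equally plausible reading; the lexicographic form is used here only because it needs no invented weights.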

With a relaxed submission constraint, model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14, above the gold medal threshold, even without any test-time selection strategy.

Finally, we simulated a competitive programming contest hosted on Codeforces to demonstrate this model's coding skill. Our evaluations closely followed the competition rules and allowed 10 submissions. GPT-4o achieved an Elo rating³ of 808, which places it in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.

The image shows a bar chart comparing the Codeforces Elo percentile rankings of different models. GPT-4o has an Elo rating of 808 (11th percentile), o1-preview 1258 (62nd percentile), o1 1673 (89th percentile), and o1-ioi 1807 (93rd percentile).

Further fine-tuning on programming competitions improves o1. The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules.
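One way to get a feel for the gap between the Elo ratings reported above is the textbook Elo expected-score formula. The sketch below assumes that standard logistic model with a 400-point scale; the article does not state whether the Codeforces rating system matches it in every detail, and `expected_score` is a hypothetical helper name.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings reported in the text: the fine-tuned model (1807) versus GPT-4o (808).
print(f"{expected_score(1807, 808):.4f}")
```

Equal ratings give an expected score of 0.5, and each additional 400 points of advantage multiplies the odds by ten.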

Human preference evaluation

In addition to exams and academic benchmarks, we also evaluated human preference for o1-preview versus GPT-4o on challenging, open-ended prompts across a broad range of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o and voted for the response they preferred. o1-preview was preferred to GPT-4o by a large margin in reasoning-heavy categories such as data analysis, coding, and math. However, o1-preview was not preferred on some natural language tasks, which suggests that it is not well suited to all use cases.

The image shows a horizontal bar chart comparing the scores of five models, with error bars representing confidence intervals. The x-axis runs from 0 to 100 and includes a dashed line as a reference point for performance.
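A pairwise preference vote of this kind reduces to a win rate with an uncertainty band. The article does not say how its confidence intervals were computed, so the sketch below assumes a Wilson score interval at roughly 95% confidence (z = 1.96); the function name and the vote counts are hypothetical.

```python
import math


def win_rate_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win rate, lower bound, upper bound) using the Wilson score interval."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, center - half, center + half


# Hypothetical tally: one model's response preferred in 70 of 100 comparisons.
rate, low, high = win_rate_interval(70, 100)
```

If the lower bound stays above 0.5, the preference is unlikely to be a coin flip at the chosen confidence level.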

Safety

Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found that reasoning capability directly benefits model robustness: o1-preview achieved substantially better performance on key jailbreak evaluations and on our hardest internal benchmarks for evaluating safety refusal boundaries. We believe that using a chain of thought is a significant advance for safety and alignment because (1) it lets us observe the model's reasoning in a legible way, and (2) the model's reasoning about safety rules is more robust to out-of-distribution scenarios.

To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the System Card.

Metric	GPT-4o	o1-preview
% Safe completions on harmful prompts (Standard)	0.990	0.995
% Safe completions on harmful prompts (Challenging: jailbreaks and edge cases)	0.714	0.934
↳ Harassment (severe)	0.845	0.900
↳ Exploitative sexual content	0.483	0.949
↳ Sexual content involving minors	0.707	0.931
↳ Advice about non-violent wrongdoing	0.688	0.961
↳ Advice about violent wrongdoing	0.778	0.963
% Safe completions for top 200 with highest Moderation API scores per category in WildChat (Zhao et al. 2024)	0.945	0.971
Goodness@0.1 StrongREJECT jailbreak eval (Souly et al. 2024)	0.220	0.840
Human-sourced jailbreak eval	0.770	0.960
% Compliance on internal benign edge cases ("not over-refusal")	0.910	0.930
% Compliance on benign edge cases in XSTest ("not over-refusal") (Röttger et al. 2023)	0.924	0.976

Hiding the Chains of Thought

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its reasoning process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. For this to work, however, the model must be free to express its thoughts in unaltered form, so we cannot apply any policy compliance or user preferences directly to the chain of thought. We also do not want to make an unaligned chain of thought visible to users.

Therefore, after weighing multiple factors, including user experience, competitive advantage, and the option to pursue chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge that this decision has disadvantages, and we try to partially make up for them by teaching the model to reproduce in its answers any useful ideas from the chain of thought. For the o1 model series, we show a model-generated summary of the chain of thought.

Conclusion

o1 significantly advances the state of the art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities to improve the alignment of models with human values and principles. We believe o1 and its successors will unlock many new use cases for AI in science, coding, math, and related fields. We look forward to users and API developers discovering how it can improve their daily work.

Appendix A

Dataset	Metric	gpt-4o	o1-preview	o1
Competition Math AIME (2024)	cons@64	13.4	56.7	83.3
Competition Math AIME (2024)	pass@1	9.3	44.6	74.4
Competition Code CodeForces	Elo	808	1258	1673
Competition Code CodeForces	Percentile	11.0	62.0	89.0
GPQA diamond	cons@64	56.1	78.3	78.0
GPQA diamond	pass@1	50.6	73.3	77.3
Biology	cons@64	63.2	73.7	68.4
Biology	pass@1	61.6	65.9	69.2
Chemistry	cons@64	43.0	60.2	65.6
Chemistry	pass@1	40.2	59.9	64.7
Physics	cons@64	68.6	89.5	94.2
Physics	pass@1	59.5	89.4	92.8
MATH	pass@1	60.3	85.5	94.8
MMLU	pass@1	88.0	92.3	90.8
MMMU (val)	pass@1	69.1	n/a	78.2
MathVista (testmini)	pass@1	63.8	n/a	73.9

Authors

OpenAI

Citations

1
https://www.anthropic.com/news/claude-3-5-sonnet, https://deepmind.google/technologies/gemini/pro
2
Our evaluations used the same 500-problem test split found in https://arxiv.org/abs/2305.20050
3
https://codeforces.com/blog/entry/68288