June 6, 2024

Extracting Concepts from GPT‑4

We used new scalable methods to decompose GPT‑4's internal representations into 16 million patterns that are often interpretable.

We currently don't understand how to make sense of the neural activity within language models. Today, we are sharing improved methods for finding a large number of "features": patterns of activity that we hope are human interpretable. Our methods scale better than existing work, and we use them to find 16 million features in GPT‑4. To encourage further exploration, we are sharing a paper, code, and feature visualizations with the research community.

The challenge of interpreting neural networks

Unlike with most human creations, we don't really understand the inner workings of neural networks. For example, engineers can directly design, assess, and fix cars based on the specifications of their components, ensuring safety and performance. Neural networks, however, are not designed directly; we design only the algorithms that train them. The resulting networks are not well understood and cannot easily be decomposed into identifiable parts. This means we cannot reason about AI safety the same way we reason about something like car safety.

To understand and interpret neural networks, we first need to find useful building blocks for neural computations. Unfortunately, the neural activations inside a language model activate with unpredictable patterns, seemingly representing many concepts simultaneously. They also activate densely, meaning each activation is always firing on each input. But real-world concepts are very sparse: in any given context, only a small fraction of all concepts are relevant. This motivates the use of sparse autoencoders, a method for identifying a handful of "features" in the neural network that are important to producing any given output, akin to the small set of concepts a person might have in mind when reasoning about a situation. Even without a direct incentive for interpretability, their features display sparse activation patterns that naturally align with concepts that are easy for humans to understand.

[Diagram: a sparse autoencoder that encodes dense neural activations into sparse features and decodes them back.]
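
The encode and decode steps in the diagram can be illustrated with a minimal sketch. It is not the released code: the class name `SparseAutoencoder`, the toy sizes, the random untrained weights, and the choice to enforce sparsity by keeping only the `k` largest feature values are all assumptions made for illustration. A real autoencoder would be trained so that the reconstruction stays close to the original activations.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top_k_sparsify(values, k):
    """Keep the k largest entries (clamped at zero) and zero out the rest."""
    keep = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    sparse = [0.0] * len(values)
    for i in keep:
        sparse[i] = max(values[i], 0.0)
    return sparse


class SparseAutoencoder:
    """Hypothetical toy autoencoder: dense activations -> sparse features -> dense."""

    def __init__(self, n_inputs, n_features, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder has one row per feature; weights are random, not trained.
        self.encoder = [
            [rng.gauss(0, 0.1) for _ in range(n_inputs)] for _ in range(n_features)
        ]
        # Decoder starts as the transpose of the encoder (an assumed initialization).
        self.decoder = [
            [self.encoder[f][i] for f in range(n_features)] for i in range(n_inputs)
        ]

    def encode(self, activations):
        return top_k_sparsify(matvec(self.encoder, activations), self.k)

    def decode(self, features):
        return matvec(self.decoder, features)


rng = random.Random(1)
dense = [rng.gauss(0, 1) for _ in range(8)]   # stand-in for model activations
sae = SparseAutoencoder(n_inputs=8, n_features=32, k=4)
features = sae.encode(dense)                  # at most 4 of 32 entries are nonzero
reconstruction = sae.decode(features)
active = sum(1 for f in features if f != 0.0)
```

The point of the sketch is the shape of the computation: many more features than input dimensions, with only a few of them active for any one input.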

However, there are still serious challenges to training sparse autoencoders. Large language models represent a huge number of concepts, and our autoencoders may need to be correspondingly huge to get close to full coverage of the concepts in a frontier model. Learning a large number of sparse features is challenging, and past work has not been shown to scale well.

Our research progress: large-scale autoencoder training

We developed new state-of-the-art methodologies that allow us to scale our sparse autoencoders to many millions of features on frontier AI models. We find that our methodology demonstrates smooth and predictable scaling, with better returns to scale than prior techniques. We also introduce several new metrics for evaluating feature quality.

We used this recipe to train a variety of autoencoders on GPT‑2 small and GPT‑4 activations, including a 16 million feature autoencoder on GPT‑4. To check the interpretability of features, we visualize each feature by showing documents where it activates. Here are some interpretable features we found:

GPT-4 feature: phrases relating to things (especially humans) being flawed


most people, it isn’t. We all have wonderful days, glimpses of what we perceive to be perfection, but we can also all have truly shit-tastic ones, and I can assure you that you’re not alone. So toddler of mine, and most other toddlers out there, remember; Don’t be a

has warts. What system that is used to build real world software doesn't? I've built systems in a number of languages and frameworks and they all had warts and issues. How much research has the author done to find other solutions? The plea at the end seemed very lazywebish to me

often put our hope in the wrong places – in the world, in other people, in our abilities or finances – but all of that is like sinking sand. The only place we can find hope is in Jesus Christ. These words by Kutless tell us just where we need to go to find hope. I lift my

churches since the last Great Reformation has also become warped. I state again, while churches are formed and planted with the most Holy and Divine of inspirations, they are not free from the corruption of humanity. While they are of our great and perfect Father, they are on an imperfect Earth. And we Rogues are

perfect. If anyone does not believe that let them say so. You really do appear to be just about a meter away from me. But you are actually in my brain. What artistry! What perfection! Not the slightest blurring. And in 3-D. Sound is also 3-D. And images.

We found many other interesting features, which you can browse in the feature visualizer.
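
The idea of visualizing a feature by the documents on which it activates can be sketched as a simple ranking. This is an illustration only: the function name `top_activating_documents`, the input layout (one list of per-token feature vectors per document), and the toy activation values are invented for the example and do not come from the released visualizer.

```python
def top_activating_documents(activations_by_document, feature_index, n=3):
    """Return up to n (peak activation, document) pairs for one feature.

    activations_by_document maps a document to a list of per-token
    feature vectors. Documents where the feature never fires are skipped.
    """
    scored = []
    for text, token_features in activations_by_document.items():
        peak = max(vector[feature_index] for vector in token_features)
        if peak > 0.0:
            scored.append((peak, text))
    # Strongest activation first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Invented toy data: three documents, two features, a few tokens each.
toy_activations = {
    "doc A": [[0.0, 0.2], [0.9, 0.0]],
    "doc B": [[0.0, 0.0], [0.0, 0.5]],
    "doc C": [[0.3, 0.0]],
}
ranking = top_activating_documents(toy_activations, feature_index=0, n=2)
```

Reading the top-ranked documents side by side, as in the excerpts above, is what lets a person guess which concept a feature tracks.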

Limitations

We are excited for interpretability to eventually increase model trustworthiness and steerability. However, this is still early work with many limitations:

As in previous work, many of the discovered features are still difficult to interpret: some activate with no clear pattern, and others show spurious activations unrelated to the concept they usually seem to encode. Furthermore, we don't yet have good ways to check the validity of interpretations.
The sparse autoencoder does not capture all the behavior of the original model. Currently, passing GPT‑4's activations through the sparse autoencoder results in performance equivalent to a model trained with roughly 10x less compute. To fully map the concepts in frontier LLMs, we may need to scale to billions or trillions of features, which would be challenging even with our improved scaling techniques.
Sparse autoencoders can find features at one point in the model, but that is only one step towards interpreting the model. Much further work is required to understand how the model computes those features and how they are used downstream in the rest of the model.
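
One simple way to see that an autoencoder loses information is to compare activations with their reconstruction. The sketch below uses a normalized squared error, which is an assumption chosen for illustration: the comparison described above instead measures the downstream performance of GPT‑4 when reconstructed activations are substituted in, which this toy metric does not do. The function name and the example vectors are hypothetical.

```python
def normalized_mse(original, reconstruction):
    """Squared reconstruction error divided by the squared norm of the original.

    0.0 means a perfect reconstruction; 1.0 is what an all-zero
    reconstruction would score.
    """
    error = sum((o - r) ** 2 for o, r in zip(original, reconstruction))
    scale = sum(o ** 2 for o in original)
    return error / scale if scale > 0 else 0.0


# Invented example vectors, not real model activations.
original = [1.0, 2.0, 2.0]
lossy = [1.0, 2.0, 1.0]
loss = normalized_mse(original, lossy)        # 1 / 9
perfect = normalized_mse(original, original)  # 0.0
```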

Looking ahead, and open sourcing our research

Sparse autoencoder research is exciting, but there is a long road ahead with many unresolved challenges. In the short term, we hope the features we have found can be practically useful for monitoring and steering language model behaviors, and we plan to test this in our frontier models. Ultimately, we hope that interpretability will one day give us new ways to reason about model safety and robustness, and significantly increase our trust in powerful AI models by giving strong assurances about their behavior.

Today, we are sharing a paper detailing our experiments and methods, which we hope will make it easier for researchers to train autoencoders at scale. We are releasing a full suite of autoencoders for GPT‑2 small, along with code for using them, and a feature visualizer to give a sense of what the GPT‑2 and GPT‑4 features may correspond to.

Authors

Jeffrey Wu, Leo Gao, Tom Dupré la Tour, and Henk Tillman

Acknowledgments

Taya Christianson, Elizabeth Proehl, Yo Shavit, Niko Felix, Cathy Yeh, Gabriel Goh, Rajan Troll, Alec Radford, Jan Leike, Ilya Sutskever, David Robinson, Greg Brockman