September 21, 2022

Introducing Whisper

We have trained and are open-sourcing a neural network called Whisper that approaches human-level robustness and accuracy on English speech recognition.

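Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that using such a large and diverse dataset improves robustness to accents, background noise, and technical language. It also enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing the models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.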

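The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into the encoder. The decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.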
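The chunking step and the idea of steering one model with special tokens can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the 16 kHz sample rate, the zero-padding of the final chunk, the function names, and the exact token spellings are assumptions made for the example. The log-Mel spectrogram step is omitted.

```python
from typing import List

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
CHUNK_SECONDS = 30     # from the text: audio is split into 30-second chunks


def split_into_chunks(samples: List[float],
                      sample_rate: int = SAMPLE_RATE,
                      chunk_seconds: int = CHUNK_SECONDS) -> List[List[float]]:
    """Split a waveform into fixed-length chunks, zero-padding the last one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        # Pad the final, shorter chunk with silence so every chunk has equal length.
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks


def build_task_prefix(language: str,
                      task: str = "transcribe",
                      timestamps: bool = True) -> List[str]:
    """Build an illustrative special-token prefix that selects the decoder's task.

    The token spellings here are assumptions for illustration only.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix
```

For example, `build_task_prefix("fr", task="translate")` yields a prefix asking for French speech to be rendered as English text, while the same audio chunk with `task="transcribe"` would be rendered as French text.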

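Other existing approaches frequently use smaller, more closely paired audio-text training datasets (¹ ² ³), or use broad but unsupervised audio pretraining (⁴ ⁵ ⁶). Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.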
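To make a claim like "50% fewer errors" concrete, the sketch below computes word error rate (WER) and the relative reduction between two systems. Using WER as the metric is an assumption here, since the text only speaks of errors. The function names are hypothetical, and the sentences in the usage note are made up for illustration, not evaluation data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (0.5 means 50% fewer errors)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a usage example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words, and `relative_error_reduction(0.2, 0.1)` corresponds to halving the error rate.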

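About a third of Whisper's audio dataset is non-English, and during training the model is alternately given the task of transcribing in the original language or translating into English. We find this approach particularly effective for learning speech-to-text translation: it outperforms the supervised state of the art on CoVoST2 to-English translation in a zero-shot setting.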

We hope Whisper's high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. See the paper, model card, and code to learn more and to try out Whisper.
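For readers who want to try the released code, a minimal sketch follows. It assumes the open-source `whisper` Python package and its `load_model` and `transcribe` calls. The helper names, the `"base"` model choice, and the audio path in the usage note are hypothetical. Treat this as a starting point under those assumptions, not as reference usage.

```python
from typing import Any, Dict, Optional


def build_decode_options(translate_to_english: bool = False,
                         language: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword options selecting transcription or to-English translation."""
    options: Dict[str, Any] = {
        "task": "translate" if translate_to_english else "transcribe",
    }
    if language is not None:
        # Optional language hint such as "ja"; omit it to let the model detect the language.
        options["language"] = language
    return options


def run_whisper(audio_path: str,
                model_name: str = "base",
                translate_to_english: bool = False,
                language: Optional[str] = None) -> Dict[str, Any]:
    """Transcribe (or translate into English) one audio file with the whisper package."""
    # Imported lazily so the option helper above works without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    options = build_decode_options(translate_to_english, language)
    return model.transcribe(audio_path, **options)
```

Calling `run_whisper("audio.mp3", translate_to_english=True)["text"]` would return an English rendering of the speech in the (hypothetical) file `audio.mp3`, provided the package and its audio dependencies are installed.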

References

1
Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi, M. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133, 2021.
2
Galvez, D., Diamos, G., Torres, J. M. C., Achorn, K., Gopi, A., Kanter, D., Lam, M., Mazumder, M., and Reddi, V. J. The People's Speech: A large-scale diverse English speech recognition dataset for commercial usage. arXiv preprint arXiv:2111.09344, 2021.
3
Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.-Q., Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J., et al. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.
4
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
5
Baevski, A., Hsu, W.-N., Conneau, A., and Auli, M. Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34:27826–27839, 2021.
6
Zhang, Y., Park, D. S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, A., Xu, Y., Huang, Y., Wang, S., et al. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2109.13226, 2021.
