=== Omnilingual ASR (Meta, 2025-11-10): An In-Depth Guide ===

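==== What the research aims to do ====

Omnilingual ASR is an open-source automatic speech recognition (ASR) suite covering 1,600+ languages. Its main contribution is to move beyond ASR centered on major languages and bring in a large number of languages that ASR had effectively not supported before. It also puts forward a design in which '''a new language can be added with only a small amount of paired data (audio and transcription)''', so that language coverage can grow through broad community participation.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>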

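: As a practical benchmark, the repository states that the 7B LLM-ASR model achieves '''CER<10% on 78% of the 1,600+ languages'''. CER is the character error rate (source: GitHub repository).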
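
To make the CER figure concrete, the following is a minimal sketch of how character error rate is commonly computed: the character-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The function names <code>edit_distance</code> and <code>cer</code> are illustrative, and the sketch assumes no text normalization; the repository's own evaluation may normalize text differently.

<syntaxhighlight lang="python">
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
</syntaxhighlight>

Under this definition, a 10-character reference with one wrong character gives a CER of exactly 10%, which sits just outside a "CER<10%" threshold.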

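==== Overview of the model suite (three families) ====

Omnilingual ASR publishes three model families to choose from according to use case and available resources. Each family also comes in several sizes (300M / 1B / 3B / 7B).<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
# SSL encoders (Wav2Vec 2.0 family)
#* Role: extract robust speech representations through multilingual self-supervised pretraining.
#* Example: omniASR_W2V_7B (about 6.49B parameters, 25 GiB in FP32).<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
# CTC-based ASR
#* Role: a lightweight decoder that favors speed.
#* Example: omniASR_CTC_7B (25 GiB in FP32, roughly 15 GiB of VRAM for inference, RTF≈0.006). The accompanying "16×" figure appears to be speed relative to the 7B LLM-ASR model rather than a real-time multiple; an RTF of 0.006 on its own corresponds to roughly 167× faster than real time.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
# LLM-based ASR (Wav2Vec2 + LLM)
#* Role: adapt flexibly to writing systems and transcription conventions through language conditioning and '''context examples (in-context)'''.
#* Example: omniASR_LLM_7B (about 7.80B parameters) and the zero-shot variant omniASR_LLM_7B_ZS (about 7.81B). Inference needs roughly 17–20 GiB of VRAM, and the listed speed is about 1× (about 0.5× for ZS), which reads as a relative speed rather than an RTF value.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>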

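: Code and models for all families are distributed under Apache-2.0, a license that is easy to work with for both research and commercial use (source: GitHub repository).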
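
The real-time factor (RTF) quoted for these models is conventionally defined as processing time divided by audio duration, so lower is faster. The sketch below shows the arithmetic under that standard definition; the helper names are illustrative, and the numbers depend on the hardware and batch settings behind the repository's measurements.

<syntaxhighlight lang="python">
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time for a given amount of audio at a given RTF."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


# One hour of audio at RTF = 0.006 (the figure quoted for omniASR_CTC_7B).
one_hour = 3600.0
estimate = processing_seconds(one_hour, 0.006)
print(f"about {estimate:.1f} s of compute for one hour of audio")
print(f"about {speedup_over_real_time(0.006):.0f}x faster than real time")
</syntaxhighlight>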

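==== Inputs, outputs, and language specification ====
* Audio length limit: the current reference implementation accepts only audio shorter than 40 seconds (the repository states that this will be extended).<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* Language specification: tags take the form {lang}_{script} (for example, eng_Latn for English in Latin script and cmn_Hans for Mandarin Chinese in Simplified script). The tag can be passed at inference time to condition the model.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>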
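
As a small illustration of the {lang}_{script} format, the sketch below checks the shape of a tag before it is passed to a pipeline. It assumes the language part is a three-letter lowercase code (ISO 639-3 style) and the script part is a four-letter title-case code (ISO 15924 style), which matches the examples above; <code>parse_lang_tag</code> is a hypothetical helper, not part of the library, and it does not check whether a model actually supports the language.

<syntaxhighlight lang="python">
import re
from typing import NamedTuple

# Assumed shape: three lowercase letters, underscore, four letters in title case.
_TAG_RE = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


class LangTag(NamedTuple):
    lang: str    # e.g. "eng"
    script: str  # e.g. "Latn"


def parse_lang_tag(tag: str) -> LangTag:
    """Split a tag such as 'eng_Latn' into its language and script parts."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a {{lang}}_{{script}} tag: {tag!r}")
    return LangTag(lang=match.group(1), script=match.group(2))


print(parse_lang_tag("eng_Latn"))
print(parse_lang_tag("cmn_Hans"))
</syntaxhighlight>

Validating tags up front turns a typo such as <code>eng_latn</code> into an immediate error rather than a silent mismatch later in a batch job.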

==== Performance highlights and positioning ====
* CER<10% on 78% of the 1,600+ languages (7B LLM-ASR), a set that includes many long-tail languages.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* The official blog and outside press emphasize that the model can generalize to unseen languages through in-context learning from a few examples. Some outlets report that it "can generalize to 5,400+ languages"; treat that number as a press-reported figure.<ref>{{cite web|title=AI Meta|url=https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition/|publisher=AI Meta|access-date=2025-11-13}}</ref>

: For reference, Meta's earlier release MMS (Massively Multilingual Speech) covered on the order of 1,000+ languages. Omnilingual ASR is positioned as a further extension of that work (source: Hugging Face).

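==== A design for adding new languages from a few examples ====

The lead text of the GitHub repository explicitly claims that '''new languages can be added with a few paired examples, without specialized expertise or large datasets'''. This rests on the LLM-ASR family's design of '''language conditioning plus context examples'''.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* Concretely, the idea is that the inference pipeline can be given a language tag and, for LLM-ASR, a few known audio-to-text pairs as context, which steers the model toward the desired spelling variants, orthography, and proper nouns (see the inference guide and code for details).<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/inference/README.md|publisher=github.com|access-date=2025-11-13}}</ref>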

==== Dataset: Omnilingual ASR Corpus ====
* Scope: spontaneous speech and its transcriptions for 348 underserved languages. It is published on Hugging Face under CC-BY-4.0.<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/datasets/facebook/omnilingual-asr-corpus/blob/main/README.md|publisher=Hugging Face|access-date=2025-11-13}}</ref>
* The data card uses a schema that combines a language code (ISO 639-3), a script code (ISO 15924), and a Glottolog identifier. It also documents a tag set for laughter, restarts, noise, and similar events, so the corpus deliberately preserves the disfluencies of natural conversation.<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/datasets/facebook/omnilingual-asr-corpus/blob/main/README.md|publisher=Hugging Face|access-date=2025-11-13}}</ref>
* As an indication of size, the dataset page shows about 492 GB in total.<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/datasets/facebook/omnilingual-asr-corpus/tree/main|publisher=Hugging Face|access-date=2025-11-13}}</ref>

: A Hugging Face Space with a live demo of the models is also available (source: GIGAZINE).

==== Choosing a model (practical view) ====
* You want the lightest, fastest option: the CTC family (300M/1B). It has a low RTF and comparatively small VRAM use.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* You want to match transcription conventions or specialized vocabulary, or add a new language at small scale: the LLM-ASR family (3B/7B). Language conditioning plus context examples lets it adapt flexibly.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* You want to attempt zero-shot transcription of unseen languages: omniASR_LLM_7B_ZS. Expect higher VRAM use and slower inference.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>

==== Minimal inference example (conceptual) ====

The minimal example in the official README calls the pipeline as follows (Python). See the repository for details.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>

<syntaxhighlight lang="python">
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load the 7B LLM-ASR model by its model card name.
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B")

# Placeholder paths; each file must be shorter than 40 seconds.
audio_files = ["/path/to/eng.flac", "/path/to/jpn.wav"]
lang = ["eng_Latn", "jpn_Jpan"]  # one language tag per file, used for conditioning
texts = pipeline.transcribe(audio_files, lang=lang, batch_size=2)
</syntaxhighlight>

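: Note: the current reference implementation assumes audio shorter than 40 seconds. Longer recordings need preprocessing such as splitting (source: GitHub repository).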
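
One simple way to work within the 40-second limit is to cut a long recording into fixed-length spans that overlap slightly. The sketch below only computes the span boundaries in seconds; <code>split_spans</code>, the 39-second default, and the 1-second overlap are illustrative choices rather than anything the repository prescribes. Fixed-length cuts can fall mid-word, so splitting on silence may work better in practice, and merging the overlapping transcripts is not handled here.

<syntaxhighlight lang="python">
def split_spans(
    total_seconds: float,
    max_seconds: float = 39.0,
    overlap_seconds: float = 1.0,
) -> list[tuple[float, float]]:
    """Return (start, end) spans that cover the audio, each at most max_seconds long."""
    if not 0 <= overlap_seconds < max_seconds:
        raise ValueError("overlap must be non-negative and shorter than a span")
    if total_seconds <= 0:
        return []
    spans = []
    start = 0.0
    while True:
        end = min(start + max_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        # Step back by the overlap so words at a boundary appear in both spans.
        start = end - overlap_seconds
    return spans


# A 100-second recording becomes three spans, each under 40 seconds.
for start, end in split_spans(100.0):
    print(f"{start:6.1f} s to {end:6.1f} s")
</syntaxhighlight>

Each span can then be cut from the source audio with any audio tool and passed to the pipeline as its own file.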

==== License and distribution ====
* Code and models: Apache-2.0.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* Corpus: CC-BY-4.0.<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/datasets/facebook/omnilingual-asr-corpus/tree/main|publisher=Hugging Face|access-date=2025-11-13}}</ref>

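==== Limitations and caveats ====
* Variation across languages: overall, 78% of languages reach CER<10%, but press coverage also notes that error rates tend to rise for lower-resource languages (the breakdown of figures varies by outlet). Validating on your own target language before deployment is essential in practice.<ref>{{cite web|title=チョソンビズ|url=https://biz.chosun.com/jp/jp-it/2025/11/11/LED4DQNBBZAA5O57RZMMZ7X7HI/|publisher=biz.chosun.com|date=2025-11-11|access-date=2025-11-13}}</ref>
* Compute: the 7B LLM-ASR models need roughly 17–20 GiB of VRAM. Choose the CTC family if you need something lighter.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* Long audio: the reference implementation has a 40-second limit. Long recordings such as meetings require a workflow that splits the audio and processes it in batches.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>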

==== Summary of research significance ====
# It offers extremely broad language coverage (1,600+) as open source and makes a serious attempt to reach the long tail of language resources.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
# Its LLM-ASR design, built around few-example language addition and context adaptation, changes the cost structure of building an ASR system.<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
# On the data side, it releases a spontaneous-speech corpus for 348 languages under CC-BY-4.0, with working specifications for orthography and annotation tags, which improves reproducibility for follow-up research and field deployments.<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/datasets/facebook/omnilingual-asr-corpus/blob/main/README.md|publisher=Hugging Face|access-date=2025-11-13}}</ref>

Main resources:
* GitHub (models, code, usage, figures)<ref>{{cite web|title=GitHub|url=https://github.com/facebookresearch/omnilingual-asr|publisher=github.com|access-date=2025-11-13}}</ref>
* Corpus (Hugging Face, CC-BY-4.0)<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/datasets/facebook/omnilingual-asr-corpus/tree/main|publisher=Hugging Face|access-date=2025-11-13}}</ref>
* Meta official blog (overview and positioning)<ref>{{cite web|title=AI Meta|url=https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition/|publisher=AI Meta|access-date=2025-11-13}}</ref>
* Press coverage and commentary (GIGAZINE, VentureBeat, and others)<ref>{{cite web|title=GIGAZINE|url=https://gigazine.net/news/20251111-meta-omnilingual-asr/|publisher=gigazine.net|access-date=2025-11-13}}</ref>