What Is Llama 4? Meta’s Multimodal Open AI Family — Scout, Maverick, and Behemoth Explained


What Is Llama 4?

Llama 4 is the latest generation of Meta’s open-weight large language model family, introduced in April 2025. The Llama 4 herd marks two major firsts for Meta: a Mixture-of-Experts (MoE) architecture and native multimodal (text and image) capability. It succeeds Llama 3 as the centerpiece of Meta’s AI strategy and powers Meta AI, plus the AI features embedded in Facebook, Instagram, and WhatsApp.

A useful mental model for MoE: think of a firm staffed with dozens of specialists where only the few specialists relevant to a specific task actually work on any given problem. Llama 4 Scout has 109B total parameters but activates just 17B per input token, so you get the quality benefits of a large model with the runtime cost of a much smaller one.

Keep in mind that Meta Superintelligence Labs announced Muse Spark in April 2026 as a proprietary successor to Llama — a significant strategic pivot away from Meta’s open-model lineage. That shift positions Llama 4 as the last major release in the open-weight era for Meta’s flagship family, making it historically important even as it is displaced in production.

How to Pronounce Llama 4

  • LAH-muh four (/ˈlɑː.mə fɔːr/)
  • YAH-mah four (Spanish-style, /ˈjɑː.mɑː fɔːr/)

Llama the animal is native to the Andes, and in English the word is pronounced “LAH-muh” rather than the Spanish “YAH-mah.” Meta’s branding, logos, and marketing all use the anglicized pronunciation consistently, and the “4” is read as “four” — so the canonical reading is “LAH-muh four.”

How Llama 4 Works

The defining architectural choice in Llama 4 is the shift to Mixture of Experts. In a dense Transformer, every parameter participates in every token. In an MoE layer, a routing network picks a handful of expert subnetworks to handle each token. Only the chosen experts’ parameters activate per token, so the effective compute per step stays manageable even when total parameter count is enormous.
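
The routing step described above can be sketched in a few lines. This is a toy illustration rather than Llama 4's actual router: the weights below are random, whereas a real router is a small learned network trained jointly with the experts.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(hidden, expert_vecs, top_k=2):
    """Pick the top-k experts for one token and return normalized gate weights.

    hidden:      the token's hidden-state vector
    expert_vecs: one routing vector per expert (stands in for the learned router)
    """
    # Router logits: dot product of the hidden state with each expert's routing vector.
    logits = [sum(h * w for h, w in zip(hidden, vec)) for vec in expert_vecs]
    probs = softmax(logits)
    # Keep only the top-k experts; only their parameters would run for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]

# 16 experts (Scout-sized expert count), 8-dim toy hidden state.
experts = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
token = [random.gauss(0, 1) for _ in range(8)]
print(route_token(token, experts))  # [(expert_id, gate_weight), ...]
```

The chosen experts' outputs are then combined using the gate weights; every other expert's parameters sit idle for that token, which is where the compute savings come from.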

Llama 4 also incorporates iRoPE (interleaved Rotary Position Embedding), a positional encoding scheme designed to extend usable context dramatically without quality degradation. This is the mechanism that enables Scout’s 10-million-token context window — position encoding for long sequences is a surprisingly subtle problem, and the technique Meta adopted here is the difference between a context window that works in theory and one that produces useful answers in practice. Keep in mind that very long contexts still put heavy pressure on KV-cache memory during inference, so the theoretical ceiling is almost never the practical ceiling for a given hardware budget.
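
That KV-cache pressure is easy to quantify with a back-of-envelope formula: the cache stores one key and one value tensor per layer for every token seen so far. The layer count, KV-head count, and head dimension below are illustrative placeholders, not Llama 4's published configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size for one sequence: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config (NOT Llama 4's real numbers): 48 layers, 8 KV heads,
# head_dim 128, BF16 cache (2 bytes per element).
for tokens in (8_000, 1_000_000, 10_000_000):
    gib = kv_cache_bytes(tokens, 48, 8, 128) / 2**30
    print(f"{tokens:>10,} tokens -> ~{gib:,.1f} GiB of KV cache")
```

Even under these modest placeholder numbers, a single 10M-token sequence needs well over a terabyte of cache, which is why the theoretical context ceiling rarely matches the practical one.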

The Llama 4 Family

Model     Total params  Active params / experts  Context window
Scout     109B          17B / 16 experts         10M tokens
Maverick  400B          17B / 128 experts        1M tokens
Behemoth  ~2T           288B / 16 experts        Unreleased

Native Multimodality

Llama 4 ingests images and text side by side, trained jointly rather than bolted on. You can point it at a UI screenshot and ask for a review, hand it a chart and request interpretation, or submit medical imagery for preliminary analysis (always with appropriate human verification). It supports 12 languages natively, making it a usable starting point for non-English workflows.

Native multimodality differs meaningfully from the “vision-adapter” approach used by some earlier models, where a separate visual encoder was trained and then connected to a pre-existing language model. Training the image and text modalities together from the start tends to produce tighter grounding between visual content and linguistic reasoning. In practice, this shows up as better performance on tasks that require the model to reference specific regions of an image, count objects, or follow visual instructions step by step. Important to note: multimodal capability adds its own cost at inference time because image tokens consume context budget. You should measure carefully when deploying multimodal Llama 4 under tight latency or memory constraints.

10M-Token Context

Scout’s 10-million-token context window was genuinely industry-leading at launch. In practice, it enables loading complete code monorepos, whole books, or multi-hour transcript dumps into a single request. Important: while the architecture supports 10M tokens, memory consumption scales with sequence length, so production deployments almost always lean on INT4 or INT8 quantization to keep GPU requirements reasonable.
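
The memory arithmetic behind that advice is straightforward: weight storage scales with bits per parameter, so halving precision halves the footprint. A rough sketch that counts weights only, ignoring KV cache, activations, and runtime overhead:

```python
def weight_gib(n_params, bits_per_param):
    """Approximate weight storage only; ignores KV cache, activations, and overhead."""
    return n_params * bits_per_param / 8 / 2**30

scout_params = 109e9  # Scout's total (not active) parameter count
for name, bits in (("BF16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"Scout weights at {name}: ~{weight_gib(scout_params, bits):.0f} GiB")
# Only the INT4 figure (~51 GiB) fits within a single 80 GiB H100.
```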

Training Data and Tokenizer

Llama 4 was pre-trained on a mixture of publicly available text, licensed data, and data from Meta’s products, with deduplication and filtering applied before training. The tokenizer vocabulary was expanded relative to Llama 3 to better accommodate non-English text, which contributes to the broader language coverage. You should note that Meta has not publicly disclosed the exact composition of the training set, which is a point of criticism from transparency advocates but consistent with industry practice for major model releases.

Routing and Expert Specialization

The router in a Mixture-of-Experts model learns during training to send each token to the experts most likely to produce a good prediction for it. Over time, experts specialize informally — some tend to handle syntactic structure, others domain-specific vocabulary, others numeric reasoning. Llama 4’s 16-expert Scout and 128-expert Maverick represent two different points in the compute-versus-expressivity trade-off. More experts allow finer specialization but increase the burden on routing quality; fewer experts route more deterministically but cap the potential for specialization. Keep in mind that MoE models also introduce load-balancing considerations during training and inference — without care, a few “favorite” experts would receive almost all the traffic, which defeats the point of the architecture.
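
The load-balancing concern can be made concrete with a toy auxiliary-loss calculation in the style of the Switch Transformer loss (a generic MoE training technique, not Meta's published recipe): the loss is minimized at 1.0 when routing is uniform and grows as traffic concentrates on a few experts.

```python
def load_balance_loss(token_fracs, mean_probs):
    """Switch-Transformer-style auxiliary loss: n_experts * sum(f_i * p_i).

    token_fracs: fraction of tokens routed to each expert
    mean_probs:  mean router probability assigned to each expert
    Equals 1.0 for perfectly uniform routing; larger when traffic concentrates.
    """
    n = len(token_fracs)
    return n * sum(f * p for f, p in zip(token_fracs, mean_probs))

n = 4
uniform = [1 / n] * n
skewed = [0.85, 0.05, 0.05, 0.05]

print(load_balance_loss(uniform, uniform))  # 1.0 -- balanced routing
print(load_balance_loss(skewed, skewed))    # ~2.92 -- a "favorite" expert is penalized
```

Adding a term like this to the training objective nudges the router toward spreading tokens across experts, which is what keeps all those parameters earning their memory footprint.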

Llama 4 Usage and Examples

Because Llama 4 is open-weight, you can download the model and run it on your own hardware or pick from a wide range of hosted providers. Hugging Face, Ollama, vLLM, AWS Bedrock, Azure AI Foundry, and Groq all support it.

Downloading from Hugging Face

# Download with the Hugging Face CLI (the repo is gated: accept the license
# on the model page and run `huggingface-cli login` first)
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --local-dir ./llama-4-scout

# Run inference with Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./llama-4-scout")
model = AutoModelForCausalLM.from_pretrained(
    "./llama-4-scout",
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Explain the Mixture-of-Experts architecture in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using Cloud Providers

AWS Bedrock, Azure AI Foundry, Google Cloud Vertex AI, Groq, Together AI, and others all expose Llama 4 as a managed API. You should reach for these when self-hosting is infeasible — GPU scarcity, operational overhead, or regulatory compliance concerns that are easier to satisfy through an established cloud.

Meta AI Apps

Llama 4 powers the Meta AI assistant available at meta.ai and inside Facebook, Instagram, and WhatsApp. These consumer surfaces expose Llama 4 without requiring a developer to wire anything up, which makes them the fastest way to benchmark Llama 4 on everyday prompts.

Quantization Options

Running Llama 4 outside of hyperscale environments almost always involves quantization. INT8 quantization typically preserves nearly all quality and halves memory footprint compared to BF16. INT4 (via GPTQ, AWQ, or bitsandbytes) pushes memory usage lower still at the cost of some accuracy, particularly on tasks that stress precise numeric reasoning. For most chat and code-assistance workloads, INT4 Scout running on a single consumer-grade workstation GPU produces results that are hard to distinguish from the full-precision model. Important to note: quantization is lossy, so you should evaluate on your own validation set before committing to a specific bit width in production.
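
The lossiness mentioned above shows up in even the simplest quantization scheme. Below is a minimal symmetric per-tensor INT8 roundtrip; real methods like GPTQ and AWQ are more sophisticated (calibration data, per-group scales), but the core round-off error is the same phenomenon.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: the max magnitude maps to 127."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.82, -1.27, 0.003, 2.54, -0.51]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)

# The roundtrip is close but not exact -- quantization is lossy.
errors = [abs(a - b) for a, b in zip(w, restored)]
print(f"scale={scale:.4f}, max roundtrip error={max(errors):.4f}")
```

Small weights like 0.003 round to zero entirely, which hints at why precision-sensitive tasks (numeric reasoning in particular) suffer first as bit width shrinks.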

Serving with vLLM

The vLLM project has become the default serving stack for self-hosted Llama 4 deployments. vLLM implements PagedAttention, which dramatically improves throughput for concurrent requests, and supports tensor parallelism across GPUs for Maverick. A typical deployment exposes an OpenAI-compatible HTTP endpoint so that existing client libraries work unchanged. You should plan for at least one H100 or equivalent per Scout replica, and a multi-GPU node per Maverick replica, plus headroom for peak traffic.
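
Because the endpoint is OpenAI-compatible, any client that can POST a chat-completions payload works unchanged. A standard-library sketch of the request shape (the URL is a placeholder for your own deployment, and the send itself is omitted so the snippet runs without a live server):

```python
import json
import urllib.request

# Placeholder endpoint for a self-hosted vLLM server; adjust host/port for your deployment.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize PagedAttention in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send the request; omitted here so the
# sketch stays runnable without a running server.
print(req.get_full_url(), len(req.data), "bytes")
```

In practice you would use the official `openai` client pointed at the self-hosted base URL, which is exactly the compatibility the paragraph above describes.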

Advantages and Disadvantages of Llama 4

Advantages

Open weights unlock self-hosting, a necessity for organizations that can’t ship sensitive data to third-party APIs. Finance, healthcare, and government deployments routinely cite this as the deciding factor over Claude or GPT. MoE keeps inference costs bounded even at large total parameter counts. Fine-tuning — full, LoRA, or QLoRA — is unconstrained, letting teams build truly specialized variants. Note that the vibrant ecosystem around Llama (tuned derivatives, quantized releases, evaluation harnesses) keeps it useful even as newer frontier models ship.

Disadvantages

Llama 4 is open-weight but not OSI-approved open source — the Llama 4 Community License restricts commercial use for services with more than 700 million monthly active users. Self-hosting requires premium GPUs; a single H100 can serve Scout only in quantized form, and Maverick demands multi-GPU clusters. Japanese and other non-English language quality trails Claude and Gemini. Keep in mind that with Meta’s 2026 pivot to Muse Spark, future large-scale updates to Llama are uncertain.

There are additional operational costs that open-weight adopters sometimes underestimate. You need engineers who understand inference-server tuning, observability for latency and quality drift, and the ability to respond when a hardware node dies at 3 AM. Hosted APIs hide all of that behind a black box; self-hosting surfaces it as real engineering work. Organizations that choose Llama 4 should budget for this ongoing operational overhead rather than treating weight downloads as a one-time cost. Important to note: this is still cheaper at scale than per-token API billing for many workloads, but it is not free.

Llama 4 vs Claude vs GPT

Llama 4 is frequently benchmarked against Claude and GPT, but the trade-off is fundamentally about openness vs convenience.

Aspect        Llama 4            Claude           GPT
Distribution  Open weights       Closed (API)     Closed (API)
Self-host     Yes                No               No
Fine-tuning   Unrestricted       Limited          Limited
Context       Up to 10M (Scout)  200K             Hundreds of K
Architecture  MoE (disclosed)    Undisclosed      Undisclosed
Strength      Control & on-prem  Agent workflows  Generalist

The short form: pick Llama 4 when you need to own the model end-to-end — data residency, fine-tuning, offline operation — and pick Claude or GPT when you prioritize capability and operational simplicity. You should also consider the ecosystem fit: Claude’s agent-first design, GPT’s mass-market integrations, or Llama’s deep customization all encode different priorities.

Performance Considerations

On public benchmarks (MMLU, HumanEval, GSM8K, and others), Llama 4 Maverick competes with leading closed-weight models of its era but is generally not the absolute leader. The comparison is further complicated because Claude and GPT evolve through iterative point releases, while Llama 4 was released in a single milestone and its quality has remained fixed since. Fine-tuned derivatives of Llama 4 can exceed the base model on specific tasks, which is one of the strategic reasons to adopt an open-weight model. Important to note: quality comparisons age quickly in this field, and the right way to reason about model choice is to benchmark on your actual workload with recent builds of each candidate rather than relying on static snapshots of leaderboard data.

Ecosystem Maturity

The Llama ecosystem is measurably more mature than any competing open-weight family. Hugging Face hosts tens of thousands of Llama-based derivatives, quantizations, and fine-tuned adapters. Tooling for serving (vLLM, TGI), fine-tuning (Axolotl, Unsloth, PEFT), and evaluation (lm-evaluation-harness, helm) has been optimized against Llama’s architecture for multiple generations. Llama 4 inherits this ecosystem almost unchanged, which means a team adopting it can stand up production infrastructure faster than with any less-established open-weight alternative. Keep in mind that ecosystem inertia cuts both ways — if Meta does pivot away from Llama entirely, the community may sustain the family through forks and derivatives even without first-party support.

Common Misconceptions

Misconception 1: Llama 4 Is Fully Open Source

The Llama 4 Community License is not OSI-certified open source. Terms include a 700M monthly-active-users threshold for large deployments that requires a separate commercial agreement with Meta. “Open weights” is the most accurate label.

Misconception 2: Anyone Can Run Llama 4 on a Laptop

Scout fits on a single H100 (80GB) only when quantized — at BF16 its 109B parameters need roughly 200GB for the weights alone — and consumer GPUs can only handle heavily quantized builds. Maverick requires a multi-GPU cluster. Keep in mind that even quantized Scout requires far more VRAM than most laptops provide.

Misconception 3: Behemoth Has Shipped

As of April 2026, Behemoth (approximately 2T total parameters) remains unreleased. Meta described it as “still training” when Scout and Maverick launched, and the 2026 pivot to Muse Spark has put Behemoth’s open-weight release status in question.

Misconception 4: Llama 4 Replaces Llama 3 Everywhere

Many production deployments still use Llama 3 or its derivatives because migration is non-trivial — prompt templates differ, fine-tuned adapters are not cross-compatible, and Llama 3 variants remain competitive for certain tasks. You should not assume that a newer generation is automatically the right choice; the decision depends on how much of your infrastructure has been tuned to the older model.

Misconception 5: MoE Means Faster Than Dense Models

MoE reduces compute per token relative to a dense model of the same total parameter count, but that does not mean MoE is always faster in wall-clock terms. Routing overhead, sparse expert loading, and inter-expert communication can add latency. In practice, MoE shines for throughput (many requests at once) more than for single-request latency. Keep in mind that vendor claims about MoE speed often cite aggregate throughput rather than single-sequence latency, so it is important to benchmark the specific metric that matters for your deployment.

Real-World Use Cases

Llama 4 earns its keep in environments that can’t send data to third-party APIs: banking, health systems, defense, legal. Teams also fine-tune Llama 4 to produce domain-specialized variants — legal-Llama, medical-Llama, finance-Llama — that outperform general models on narrow tasks. The open architecture makes Llama 4 a preferred platform for LLM research and for cost-optimized bulk inference pipelines where API-per-token pricing would be prohibitive.

Representative Use Cases

1. On-prem AI: regulated industries where external APIs are off-limits.
2. Domain-specialized models: legal, medical, manufacturing.
3. Edge inference: quantized Scout for constrained deployments.
4. Research: interpretability and benchmarking work.
5. Cost-optimized bulk jobs: where per-token API costs would dominate.

Regional Deployments and Sovereign AI

Several national governments have deployed Llama 4 as the foundation of sovereign AI initiatives — the idea that critical infrastructure should run on models whose weights are locally controlled rather than accessed through foreign APIs. Japan, France, Saudi Arabia, India, and others have either publicly announced or are widely rumored to be building national LLM stacks on top of Llama 4, often fine-tuned on locally curated datasets for language and cultural alignment. Important to note: this has become a strategic reason the open-weight ecosystem matters beyond the purely technical arguments. A hosted API model cannot satisfy sovereignty requirements regardless of how favorable its contract terms are.

Academic and Interpretability Research

The open weights have made Llama 4 the model of choice for interpretability research. Researchers can attach probes to any layer, visualize the router decisions of MoE experts, and measure how activations change in response to interventions. Closed-weight models offer nothing comparable; even if a vendor publishes a research paper, outside researchers cannot reproduce or extend it without access to the weights. This explains why a disproportionate share of mechanistic-interpretability papers from 2025 and 2026 use Llama 4 Scout as their test model.

Frequently Asked Questions (FAQ)

Q. Is Llama 4 free?

A. Weight downloads are free (Hugging Face access gate required). Meta AI apps are free for end users. Self-hosting incurs GPU costs, and hyperscaler services carry their own pricing.

Q. How is Llama 4’s non-English capability?

A. Improved over Llama 3, but trailing Claude and Gemini on many non-English benchmarks. Community derivatives (e.g., ELYZA Llama for Japanese) can close the gap meaningfully.

Q. How does Llama 4 relate to Muse Spark?

A. Muse Spark is Meta Superintelligence Labs’ 2026 proprietary replacement for Llama. It is not open-weight and cannot be self-hosted, which makes Llama 4 the last open-weight flagship from Meta for the foreseeable future.

Q. Is fine-tuning hard?

A. LoRA and QLoRA methods let you fine-tune Scout with just a few GPUs. Hugging Face’s PEFT library and tools like Axolotl make the workflow reasonably approachable.

Q. Does Llama 4 support tool use and function calling?

A. Yes — Llama 4 was trained to emit tool-call JSON blocks compatible with common function-calling schemas. Major inference servers (vLLM, TGI, Ollama) expose this as a first-class feature, and client libraries translate between Llama 4’s native format and OpenAI-style tool-call APIs. Agentic frameworks such as LangChain, LlamaIndex, and Haystack all support Llama 4 as a backend.
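
A dispatch loop for such tool calls takes only a few lines of standard-library code. The JSON shape below mirrors the common OpenAI-style schema the answer refers to; the exact field names your inference server emits may differ, and `get_weather` is a made-up example function, so treat this as a sketch.

```python
import json

# A registry of callable tools; `get_weather` is a hypothetical example.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub -- a real tool would call an API

TOOLS = {"get_weather": get_weather}

# Example of an OpenAI-style tool-call block a server might return (shape is illustrative).
model_output = json.dumps({
    "tool_calls": [
        {"function": {"name": "get_weather",
                      "arguments": json.dumps({"city": "Tokyo"})}}
    ]
})

def dispatch(raw: str):
    """Parse the model's tool-call block and invoke the matching local function."""
    results = []
    for call in json.loads(raw).get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append(fn(**args))
    return results

print(dispatch(model_output))  # ['Sunny in Tokyo']
```

The tool results are then appended to the conversation and sent back to the model, which composes its final answer from them.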

Q. What hardware do I need for Maverick?

A. Maverick’s 400B total parameters require multi-GPU inference. A realistic minimum is eight H100 GPUs with high-bandwidth interconnect (NVLink, NVSwitch). Cloud providers also offer Maverick behind managed endpoints for customers who cannot afford the capital expense of owning the hardware. You should model out your expected request volume before deciding between self-hosted Maverick and managed inference.
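
The arithmetic behind that recommendation: at 16-bit precision the weights alone for 400B parameters exceed the memory of eight 80GB GPUs, which is why reduced precision plus tensor parallelism is the usual recipe. A rough estimate that ignores KV cache and activations:

```python
def per_gpu_gib(n_params, bits_per_param, n_gpus):
    """Weight memory per GPU when parameters are sharded evenly across n_gpus."""
    return n_params * bits_per_param / 8 / 2**30 / n_gpus

maverick_params = 400e9
for bits in (16, 8):
    need = per_gpu_gib(maverick_params, bits, 8)
    verdict = "fits" if need < 80 else "does not fit"
    print(f"{bits}-bit across 8 GPUs: ~{need:.0f} GiB/GPU "
          f"({verdict} in 80 GiB, before KV cache)")
```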

Q. Can I combine Llama 4 with RAG?

A. Yes — Scout’s 10M context window is often paired with RAG pipelines to pre-filter documents before letting the model reason over a large but focused subset. This combination is particularly effective for enterprise search and knowledge-management use cases where the full corpus is too large to fit even in a 10M context.
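
The pre-filter-then-reason pattern is simple to sketch: score every document against the query, keep the top-k, and pack the survivors into the prompt. The scoring below is naive keyword overlap purely for illustration; production pipelines use embedding similarity instead.

```python
def score(query: str, doc: str) -> int:
    """Naive relevance: count of shared lowercase words (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def prefilter(query, corpus, top_k=2):
    """Keep only the top_k most relevant documents before building the prompt."""
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]

corpus = [
    "Llama 4 Scout supports a 10M token context window",
    "Quarterly sales figures for the retail division",
    "MoE routing sends each token to a subset of experts",
]
query = "how large is the Scout context window"
context = prefilter(query, corpus)
prompt = "Answer using only these documents:\n" + "\n".join(context) + f"\n\nQ: {query}"
print(context[0])
```

Even with a 10M-token window, filtering first keeps the model focused and keeps KV-cache costs proportional to what the question actually needs.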

Q. What is the Llama 4 Community License?

A. The Llama 4 Community License is Meta’s custom license, distinct from standard open-source licenses like Apache 2.0 or MIT. It permits broad use including commercial, but requires a separate agreement with Meta once a service exceeds 700 million monthly active users. You should review the full license text before deploying in a commercial product, especially if the product is expected to scale.

Conclusion

Llama 4 is the most capable open-weight model family Meta has released to date and marks a turning point where Mixture of Experts moved from research prototype to mainstream deployment. Its combination of open weights, native multimodality, and million-plus-token context has made it the foundation of sovereign-AI programs, regulated-industry deployments, and academic interpretability work. Whether Meta continues its open-weight strategy beyond Llama 4 under the Muse Spark banner is an open question, but Llama 4 itself will remain a reference point for years because its weights are freely downloadable and its ecosystem of derivatives continues to grow.

  • Llama 4 is Meta’s latest open-weight LLM family (April 2025).
  • Three variants: Scout, Maverick, and the unreleased Behemoth.
  • First Meta model to combine MoE architecture with native multimodality.
  • Scout ships with an industry-leading 10M-token context window.
  • Licensed under the Llama 4 Community License (not OSI open source).
  • Meta’s 2026 pivot to Muse Spark marks a strategic shift away from open weights.
  • Ideal when you need self-hosting, fine-tuning, or on-prem deployment.
