developer safety skill risk: low
LlamaGuard Content Moderation Deployment Guide
Provides installation instructions, Python code samples, and workflows for using Meta's LlamaGuard model to classify inputs and outputs into six safety categories, plus vLLM, FastA…
SKILL 1 file
SKILL.md
---
name: llamaguard
description: "Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails."
---
# LlamaGuard - AI Content Moderation
## Quick start
LlamaGuard is a 7-8B parameter model specialized for content safety classification.
**Installation**:
```bash
pip install transformers torch
# Login to HuggingFace (required)
huggingface-cli login
```
**Basic usage**:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
def moderate(chat):
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids=input_ids, max_new_tokens=100)
return tokenizer.decode(output[0], skip_special_tokens=True)
# Check user input
result = moderate([
{"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Output: "unsafe\nS3" (Criminal Planning)
```
## Common workflows
### Workflow 1: Input filtering (prompt moderation)
**Check user prompts before LLM**:
```python
def check_input(user_message):
result = moderate([{"role": "user", "content": user_message}])
if result.startswith("unsafe"):
category = result.split("\n")[1]
return False, category # Blocked
else:
return True, None # Safe
# Example
safe, category = check_input("How do I hack a website?")
if not safe:
print(f"Request blocked: {category}")
# Return error to user
else:
# Send to LLM
response = llm.generate(user_message)
```
**Safety categories**:
- **S1**: Violence & Hate
- **S2**: Sexual Content
- **S3**: Guns & Illegal Weapons
- **S4**: Regulated Substances
- **S5**: Suicide & Self-Harm
- **S6**: Criminal Planning
### Workflow 2: Output filtering (response moderation)
**Check LLM responses before showing to user**:
```python
def check_output(user_message, bot_response):
conversation = [
{"role": "user", "content": user_message},
{"role": "assistant", "content": bot_response}
]
result = moderate(conversation)
if result.startswith("unsafe"):
category = result.split("\n")[1]
return False, category
else:
return True, None
# Example
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)
safe, category = check_output(user_msg, bot_msg)
if not safe:
print(f"Response blocked: {category}")
# Return generic response
return "I cannot provide that information."
else:
return bot_msg
```
### Workflow 3: vLLM deployment (fast inference)
**Production-ready serving**:
```python
from vllm import LLM, SamplingParams
# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)
# Sampling params
sampling_params = SamplingParams(
temperature=0.0, # Deterministic
max_tokens=100
)
def moderate_vllm(chat):
# Format prompt
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
# Generate
output = llm.generate([prompt], sampling_params)
return output[0].outputs[0].text
# Batch moderation
chats = [
[{"role": "user", "content": "How to make bombs?"}],
[{"role": "user", "content": "What's the weather?"}],
[{"role": "user", "content": "Tell me about drugs"}]
]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)
for i, result in enumerate(results):
print(f"Chat {i}: {result.outputs[0].text}")
```
**Throughput**: ~50-100 requests/sec on single A100
### Workflow 4: API endpoint (FastAPI)
**Serve as moderation API**:
```python
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams
app = FastAPI()
llm = LLM(model="meta-llama/LlamaGuard-7b")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)
class ModerationRequest(BaseModel):
messages: list # [{"role": "user", "content": "..."}]
@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
output = llm.generate([prompt], sampling_params)[0]
result = output.outputs[0].text
is_safe = result.startswith("safe")
category = None if is_safe else result.split("\n")[1] if "\n" in result else None
return {
"safe": is_safe,
"category": category,
"full_output": result
}
# Run: uvicorn api:app --host 0.0.0.0 --port 8000
```
**Usage**:
```bash
curl -X POST http://localhost:8000/moderate \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "How to hack?"}]}'
# Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}
```
### Workflow 5: NeMo Guardrails integration
**Use with NVIDIA Guardrails**:
```python
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard
# Configure NeMo Guardrails
config = RailsConfig.from_content("""
models:
- type: main
engine: openai
model: gpt-4
rails:
input:
flows:
- llamaguard check input
output:
flows:
- llamaguard check output
""")
# Add LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")
# Use with automatic moderation
response = rails.generate(messages=[
{"role": "user", "content": "How do I make weapons?"}
])
# Automatically blocked by LlamaGuard
```
## When to use vs alternatives
**Use LlamaGuard when**:
- Need pre-trained moderation model
- Want high accuracy (94-95%)
- Have GPU resources (7-8B model)
- Need detailed safety categories
- Building production LLM apps
**Model versions**:
- **LlamaGuard 1** (7B): Original, 6 categories
- **LlamaGuard 2** (8B): Improved, 6 categories
- **LlamaGuard 3** (8B): Latest (2024), enhanced
**Use alternatives instead**:
- **OpenAI Moderation API**: Simpler, API-based, free
- **Perspective API**: Google's toxicity detection
- **NeMo Guardrails**: More comprehensive safety framework
- **Constitutional AI**: Training-time safety
## Common issues
**Issue: Model access denied**
Login to HuggingFace:
```bash
huggingface-cli login
# Enter your token
```
Accept license on model page:
https://huggingface.co/meta-llama/LlamaGuard-7b
**Issue: High latency (>500ms)**
Use vLLM for 10× speedup:
```python
from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
# Latency: 500ms → 50ms
```
Enable tensor parallelism:
```python
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
# 2× faster on 2 GPUs
```
**Issue: False positives**
Use threshold-based filtering:
```python
# Get probability of "unsafe" token
logits = model(..., return_dict_in_generate=True, output_scores=True)
unsafe_prob = torch.softmax(logits.scores[0][0], dim=-1)[unsafe_token_id]
if unsafe_prob > 0.9: # High confidence threshold
return "unsafe"
else:
return "safe"
```
**Issue: OOM on GPU**
Use 8-bit quantization:
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quantization_config,
device_map="auto"
)
# Memory: 14GB → 7GB
```
## Advanced topics
**Custom categories**: See [references/custom-categories.md](references/custom-categories.md) for fine-tuning LlamaGuard with domain-specific safety categories.
**Performance benchmarks**: See [references/benchmarks.md](references/benchmarks.md) for accuracy comparison with other moderation APIs and latency optimization.
**Deployment guide**: See [references/deployment.md](references/deployment.md) for Sagemaker, Kubernetes, and scaling strategies.
## Hardware requirements
- **GPU**: NVIDIA T4/A10/A100
- **VRAM**:
- FP16: 14GB (7B model)
- INT8: 7GB (quantized)
- INT4: 4GB (QLoRA)
- **CPU**: Possible but slow (10× latency)
- **Throughput**: 50-100 req/sec (A100)
**Latency** (single GPU):
- HuggingFace Transformers: 300-500ms
- vLLM: 50-100ms
- Batched (vLLM): 20-50ms per request
## Resources
- HuggingFace:
- V1: https://huggingface.co/meta-llama/LlamaGuard-7b
- V2: https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B
- V3: https://huggingface.co/meta-llama/Meta-Llama-Guard-3-8B
- Paper: https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
- Integration: vLLM, Sagemaker, NeMo Guardrails
- Accuracy: 94.5% (prompts), 95.3% (responses)
EXPECTED OUTPUT
- Format
- markdown
- Constraints
- include code snippets
- list safety categories
- cover multiple deployment methods
EXAMPLES
Includes numerous code examples for installation, basic usage, input/output filtering workflows, vLLM deployment, FastAPI endpoint, and NeMo Guardrails integration.
CAVEATS
- Ambiguities
- References external files (references/custom-categories.md, references/benchmarks.md, references/deployment.md) that are not included.
QUALITY
- OVERALL
- 0.58
- CLARITY
- 0.88
- SPECIFICITY
- 0.82
- REUSABILITY
- 0.25
- COMPLETENESS
- 0.78
IMPROVEMENT SUGGESTIONS
- Add explicit placeholders or variables for model_id, API keys, and deployment parameters to increase reusability as a template.
- Clarify whether the document is intended as LLM instructions, reference material, or code scaffolding.
USAGE
Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
MORE FOR DEVELOPER
- Constitutional AI Training Workflowsdevelopersafety
- NeMo Guardrails LLM Safety Setupdevelopersafety
- Context7 Library Documentation Expertdevelopercoding
- Structured Python Production Code Generatordevelopercoding
- Minimax Music API Generation Agentdevelopercreative
- Angular Standalone Directive Generatordevelopercoding
- Pytest Unit Test Suite Generatordevelopercoding
- Unity Architecture Specialistdevelopercoding
- iOS App Store Compliance Auditordeveloperanalysis
- Reusable Enterprise Website Template Architectdevelopersoftware_engineering
- Web Typography CSS Generatordevelopercoding
- DevOps CI/CD Pipeline Automatordeveloperoperations
- Cascading System Failure Simulatordeveloperoperations
- Python Security Code Auditordevelopersecurity
- WCAG Web Accessibility Auditor Remediatordeveloperevaluation
- VSCode CodeTour File Expertdevelopercoding
- 6-Day Sprint Feature Prioritizerdeveloperplanning
- Senior Python Code Reviewerdevelopercoding
- Micro-SaaS Vibecoder MVP Blueprint Architectdeveloperproduct
- Structured Cross-Language Code Translatordevelopercoding
- Playwright Web App Testing Toolkitdeveloperoperations
- Multi-DB SQL Query Optimizer and Builderdevelopercoding
- Base R Programming Reference Guidedevelopercoding
- IT Project Backlog Generator from Docsdeveloperplanning
- Flutter Map SDK Layer Bug Fixerdevelopercoding