Set up production-ready 3-tier speculative decoding in minutes. This guide walks you through configuring the pyramid architecture for 92% quality at just $5-10/month.
```bash
# Install momo-kiji with 3-tier support
pip install "momo-kiji[3tier]"

# Download required models
momo-kiji download-models --preset 3tier
# This downloads:
# - Llama 2B (draft model)
# - Llama 8B (qualifier model)
# - Configuration files
```
```bash
# Set OpenRouter API key for cloud fallback
export OPENROUTER_API_KEY="your-key-here"

# Or add to your .env file
echo 'OPENROUTER_API_KEY=your-key-here' >> .env
```
```python
import momo_kiji as mk

# Initialize with Hybrid Config 4
config = mk.ThreeTierConfig(
    # Tier 1: Draft model on ANE
    draft_model="llama-2b-ane",
    draft_device="ane",

    # Tier 2: Qualifier on GPU
    qualifier_model="llama-8b",
    qualifier_device="gpu",

    # Tier 3: Cloud fallback
    cloud_provider="openrouter",
    cloud_model="anthropic/claude-3-opus",

    # Performance settings
    max_draft_tokens=256,
    qualification_threshold=0.85,
    cloud_fallback_threshold=0.7,
)

# Create the 3-tier pipeline
pipeline = mk.SpeculativePipeline(config)

# Simple generation
response = pipeline.generate("Explain quantum computing")

# With streaming
for chunk in pipeline.stream("Write a story about AI"):
    print(chunk, end="", flush=True)

# Check tier usage
stats = pipeline.get_stats()
print(f"Draft accepted: {stats.draft_acceptance_rate:.1%}")
print(f"Cloud usage: {stats.cloud_usage_rate:.1%}")
print(f"Estimated cost: {stats.estimated_cost:.2f}")
```
```python
# server.py
from momo_kiji import create_app

app = create_app(
    config_file="3tier.yaml",
    enable_monitoring=True,
    enable_caching=True,
)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

```yaml
version: 1.0

tiers:
  draft:
    model: llama-2b-ane
    device: ane
    max_tokens: 256
    temperature: 0.7
  qualifier:
    model: llama-8b
    device: gpu
    threshold: 0.85
  cloud:
    provider: openrouter
    model: anthropic/claude-3-opus
    fallback_threshold: 0.7
    budget_limit: 10.00  # Monthly USD

monitoring:
  enable: true
  webhook: https://your-webhook.com/alerts

caching:
  enable: true
  ttl: 3600  # 1 hour
```

The 3-tier system includes built-in monitoring to track performance and costs:
```python
# Enable detailed logging
pipeline.enable_logging(level="DEBUG")

# Get real-time metrics
metrics = pipeline.get_metrics()
print(f"Avg latency: {metrics.avg_latency_ms}ms")
print(f"P95 latency: {metrics.p95_latency_ms}ms")
print(f"Tier distribution: {metrics.tier_distribution}")

# Export to monitoring service
pipeline.export_metrics(format="prometheus")
```

Check your qualification threshold: if it is set too high, fewer draft tokens are accepted and more requests fall through to the cloud. Start with 0.85 and adjust based on your quality requirements.
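To build intuition for how the two thresholds interact, here is an illustrative sketch of the per-token routing decision. The actual routing logic lives inside momo-kiji; the function and its exact decision rule below are assumptions for illustration only.

```python
# Hypothetical sketch of 3-tier routing -- not momo-kiji's real internals.

def route_token(qualifier_confidence: float,
                qualification_threshold: float = 0.85,
                cloud_fallback_threshold: float = 0.7) -> str:
    """Decide which tier serves a drafted token.

    - At or above qualification_threshold: the draft token is accepted.
    - Below cloud_fallback_threshold: escalate to the cloud model.
    - In between: the on-device qualifier regenerates the token.
    """
    if qualifier_confidence >= qualification_threshold:
        return "draft"
    if qualifier_confidence < cloud_fallback_threshold:
        return "cloud"
    return "qualifier"

print(route_token(0.9))   # draft
print(route_token(0.75))  # qualifier
print(route_token(0.5))   # cloud
```

Raising `qualification_threshold` shifts tokens out of the cheap draft tier, so more work lands on the qualifier and, ultimately, the cloud — which is why lowering it is the first lever to pull when cloud usage spikes.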
Use model preloading: call `pipeline.preload_models()` at application start to avoid cold starts.
The system automatically falls back to GPU. Check ANE availability with `mk.check_ane_status()`.
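If you want to mirror that fallback explicitly in your own startup code, the decision reduces to a small helper like the sketch below. This is purely illustrative — momo-kiji performs the check internally, and the GPU model name (`llama-2b`) is an assumption, not a documented identifier.

```python
# Illustrative ANE -> GPU fallback for the draft tier; model names for the
# GPU path are assumptions, not part of momo-kiji's documented API.

def draft_tier_settings(ane_available: bool) -> dict:
    """Return draft-tier keyword arguments based on ANE availability."""
    if ane_available:
        return {"draft_model": "llama-2b-ane", "draft_device": "ane"}
    # Same 2B draft tier, scheduled on the GPU instead.
    return {"draft_model": "llama-2b", "draft_device": "gpu"}

print(draft_tier_settings(False)["draft_device"])  # gpu
```

In practice you would feed the result of `mk.check_ane_status()` into a helper like this when constructing `ThreeTierConfig`.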
Join our Discord for help with production deployments and performance optimization.
Learn more about 3-tier architecture →