3-Tier Setup Guide

Set up production-ready 3-tier speculative decoding in minutes. This guide walks you through configuring the pyramid architecture for 92% quality at just $5-10/month.
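The routing idea behind the pyramid can be sketched in a few lines: each draft token carries a confidence score, and that score decides which tier handles it. This is an illustrative sketch of the decision logic, not momo-kiji internals; the thresholds are the defaults assumed throughout this guide.

```python
QUALIFICATION_THRESHOLD = 0.85   # assumed default from this guide
CLOUD_FALLBACK_THRESHOLD = 0.7   # assumed default from this guide

def route_token(draft_confidence: float) -> str:
    """Pick the tier that should verify or produce a token."""
    if draft_confidence >= QUALIFICATION_THRESHOLD:
        return "draft"       # Tier 1: accept the ANE draft token as-is
    if draft_confidence >= CLOUD_FALLBACK_THRESHOLD:
        return "qualifier"   # Tier 2: re-verify on the local 8B GPU model
    return "cloud"           # Tier 3: defer to the cloud model

print(route_token(0.92))  # draft
print(route_token(0.80))  # qualifier
print(route_token(0.50))  # cloud
```

Most tokens fall in the top band, which is why the bulk of generation stays on-device and the cloud tier only sees the hard cases.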

Prerequisites

  • macOS 14+ with Apple Silicon (M1/M2/M3)
  • Python 3.9 or higher
  • momo-kiji installed (pip install momo-kiji)
  • OpenRouter API key (for cloud fallback)

Step 1: Install Dependencies

# Install momo-kiji with 3-tier support
pip install momo-kiji[3tier]

# Download required models
momo-kiji download-models --preset 3tier

# This downloads:
# - Llama 2B (draft model)
# - Llama 8B (qualifier model)
# - Configuration files

Step 2: Configure API Keys

# Set OpenRouter API key for cloud fallback
export OPENROUTER_API_KEY="your-key-here"

# Or add to your .env file
echo "OPENROUTER_API_KEY=your-key-here" >> .env
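Because Tier 3 silently degrades without a key, it can help to fail fast at startup. A minimal sketch (the helper name is our own, not a momo-kiji API):

```python
import os
from typing import Mapping

def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Fail fast if the Tier 3 key is missing, before the pipeline starts."""
    key = env.get("OPENROUTER_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; cloud fallback (Tier 3) will fail"
        )
    return key
```

Call it once at application start so a missing key surfaces as a clear error rather than failed cloud fallbacks at request time.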

Step 3: Initialize 3-Tier Configuration

import momo_kiji as mk

# Initialize with Hybrid Config 4
config = mk.ThreeTierConfig(
    # Tier 1: Draft model on ANE
    draft_model="llama-2b-ane",
    draft_device="ane",
    
    # Tier 2: Qualifier on GPU
    qualifier_model="llama-8b",
    qualifier_device="gpu",
    
    # Tier 3: Cloud fallback
    cloud_provider="openrouter",
    cloud_model="anthropic/claude-3-opus",
    
    # Performance settings
    max_draft_tokens=256,
    qualification_threshold=0.85,
    cloud_fallback_threshold=0.7
)

# Create the 3-tier pipeline
pipeline = mk.SpeculativePipeline(config)

Step 4: Use the Pipeline

# Simple generation
response = pipeline.generate("Explain quantum computing")

# With streaming
for chunk in pipeline.stream("Write a story about AI"):
    print(chunk, end="", flush=True)

# Check tier usage
stats = pipeline.get_stats()
print(f"Draft accepted: {stats.draft_acceptance_rate:.1%}")
print(f"Cloud usage: {stats.cloud_usage_rate:.1%}")
print(f"Estimated cost: ${stats.estimated_cost:.2f}")
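If you want to keep spend inside the $5-10/month envelope, `estimated_cost` can drive a simple guard. This guard is our own sketch, not a built-in momo-kiji feature:

```python
def should_allow_cloud(estimated_cost: float, budget: float = 10.0,
                       headroom: float = 0.9) -> bool:
    """Allow Tier 3 fallback only while spend is under 90% of the budget."""
    return estimated_cost < budget * headroom

print(should_allow_cloud(4.50))   # True
print(should_allow_cloud(9.20))   # False
```

When the guard trips, requests that would have gone to the cloud can instead be served by the Tier 2 qualifier at somewhat lower quality.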

Step 5: Production Deployment

# server.py
from momo_kiji import create_app

app = create_app(
    config_file="3tier.yaml",
    enable_monitoring=True,
    enable_caching=True
)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
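The `enable_caching=True` flag above presumably keys responses by prompt with a TTL. A minimal sketch of such a cache (our own illustration; momo-kiji's cache internals may differ):

```python
import time
from typing import Dict, Optional, Tuple

class TTLCache:
    """Minimal prompt -> response cache with expiry, mirroring ttl: 3600."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store: Dict[str, Tuple[float, str]] = {}

    def put(self, prompt: str, response: str) -> None:
        self._store[prompt] = (time.monotonic(), response)

    def get(self, prompt: str) -> Optional[str]:
        entry = self._store.get(prompt)
        if entry is None:
            return None
        stored_at, response = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[prompt]  # expired; drop it
            return None
        return response

cache = TTLCache(ttl=3600)
cache.put("Explain quantum computing", "cached answer")
print(cache.get("Explain quantum computing"))  # cached answer
```

Cached hits skip all three tiers entirely, which is why caching is one of the cheapest wins for repeated prompts.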

Configuration Reference

3tier.yaml

version: 1.0
tiers:
  draft:
    model: llama-2b-ane
    device: ane
    max_tokens: 256
    temperature: 0.7
  
  qualifier:
    model: llama-8b
    device: gpu
    threshold: 0.85
    
  cloud:
    provider: openrouter
    model: anthropic/claude-3-opus
    fallback_threshold: 0.7
    budget_limit: 10.00  # Monthly USD

monitoring:
  enable: true
  webhook: https://your-webhook.com/alerts
  
caching:
  enable: true
  ttl: 3600  # 1 hour
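The thresholds in this file have an ordering constraint: the cloud fallback threshold must sit below the qualifier threshold, or requests can never reach Tier 2. A small validation sketch over the parsed config (the checks are our own assumptions, not part of momo-kiji):

```python
from typing import Dict, List

def validate_tier_config(cfg: Dict) -> List[str]:
    """Return a list of problems with the tier thresholds and budget."""
    errors = []
    q = cfg["tiers"]["qualifier"]["threshold"]
    c = cfg["tiers"]["cloud"]["fallback_threshold"]
    if not (0.0 < c < q <= 1.0):
        errors.append(f"expected 0 < fallback ({c}) < qualifier ({q}) <= 1")
    if cfg["tiers"]["cloud"]["budget_limit"] <= 0:
        errors.append("budget_limit must be a positive monthly USD amount")
    return errors

# The values from 3tier.yaml above, as a parsed dict:
cfg = {"tiers": {"qualifier": {"threshold": 0.85},
                 "cloud": {"fallback_threshold": 0.7, "budget_limit": 10.0}}}
print(validate_tier_config(cfg))  # []
```

Running this at startup catches a misordered threshold pair before it silently inflates cloud usage.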

Performance Tuning

For Lower Latency

  • Reduce max_draft_tokens to 128
  • Increase qualification_threshold to 0.9
  • Use a smaller draft model (1B params)
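End-to-end latency is roughly a weighted average over the share of tokens each tier handles, which is why shifting the mix toward Tier 1 matters more than speeding up any single tier. A back-of-envelope model (per-tier latencies and the token mix are assumed numbers, not benchmarks):

```python
from typing import Dict

def expected_latency_ms(tier_mix: Dict[str, float],
                        tier_latency_ms: Dict[str, float]) -> float:
    """Weighted average latency over the share of tokens each tier handles."""
    return sum(tier_mix[t] * tier_latency_ms[t] for t in tier_mix)

# Assumed illustrative numbers:
mix = {"draft": 0.80, "qualifier": 0.14, "cloud": 0.06}
lat = {"draft": 20.0, "qualifier": 120.0, "cloud": 900.0}
print(round(expected_latency_ms(mix, lat), 1))  # 86.8
```

Note how the 6% of cloud-handled tokens contributes over half the average: trimming cloud usage is usually the biggest latency lever.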

For Lower Cost

  • Decrease qualification_threshold to 0.8
  • Enable aggressive caching
  • Use a smaller cloud model for fallback
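The cost side of the trade-off is simple arithmetic: monthly cloud spend is the request volume times the share that falls through to Tier 3 times the per-request cost. All numbers below are illustrative assumptions, not measured momo-kiji rates:

```python
def monthly_cloud_cost(requests_per_month: int,
                       cloud_usage_rate: float,
                       cost_per_cloud_request: float) -> float:
    """Tier 3 spend = total requests x share hitting the cloud x unit cost."""
    return requests_per_month * cloud_usage_rate * cost_per_cloud_request

# Lowering qualification_threshold from 0.85 to 0.80 might cut cloud
# usage from, say, 8% to 5% of requests (illustrative rates):
at_085 = monthly_cloud_cost(10_000, 0.08, 0.01)  # ~$8/month
at_080 = monthly_cloud_cost(10_000, 0.05, 0.01)  # ~$5/month
print(round(at_085, 2), round(at_080, 2))
```

Under these assumptions, a 0.05 threshold change moves the bill by a few dollars a month, which is most of the $5-10 range this guide targets.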

Monitoring & Observability

The 3-tier system includes built-in monitoring to track performance and costs:

# Enable detailed logging
pipeline.enable_logging(level="DEBUG")

# Get real-time metrics
metrics = pipeline.get_metrics()
print(f"Avg latency: {metrics.avg_latency_ms}ms")
print(f"P95 latency: {metrics.p95_latency_ms}ms")
print(f"Tier distribution: {metrics.tier_distribution}")

# Export to monitoring service
pipeline.export_metrics(format="prometheus")
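For a sense of what a Prometheus export involves, here is a sketch of the text exposition format (the metric names are our own assumptions, not momo-kiji's actual output):

```python
from typing import Dict

def to_prometheus(metrics: Dict[str, float]) -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = [f"momo_kiji_{name} {value}" for name, value in metrics.items()]
    return "\n".join(lines) + "\n"

print(to_prometheus({"avg_latency_ms": 42.0, "cloud_usage_rate": 0.06}), end="")
```

A scrape endpoint serving text like this is all Prometheus needs to start graphing tier distribution and latency over time.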

Troubleshooting

High cloud usage?

Check your qualification threshold: setting it too high rejects more draft tokens and routes more requests to the cloud tier. Start with 0.85 and adjust based on your quality requirements.

Slow startup?

Use model preloading: call pipeline.preload_models() at application start to avoid cold starts.

ANE not available?

The system automatically falls back to the GPU. Check ANE availability with mk.check_ane_status().
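The fallback order described above can be pictured as a simple preference chain (a sketch of the behavior, not momo-kiji's actual device-selection code):

```python
def pick_device(ane_available: bool, gpu_available: bool) -> str:
    """Prefer ANE, then GPU, then CPU as a last resort."""
    if ane_available:
        return "ane"
    if gpu_available:
        return "gpu"
    return "cpu"

print(pick_device(False, True))  # gpu
```

The practical implication: the pipeline keeps working on machines without a usable ANE, just with higher draft-model latency.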

Ready to deploy?

Join our Discord for help with production deployments and performance optimization.

Learn more about 3-tier architecture →