Why it matters
- Python-native syntax (`@app.function(gpu="A100")`) eliminates Docker, Kubernetes, and infrastructure configuration — the simplest possible GPU cloud interface.
- Scale-to-zero by default means LLM inference endpoints cost nothing when idle — critical for early-stage products with unpredictable traffic.
- Hot reloading during development (`modal serve`) makes iterating on ML code fast — changes deploy in seconds, not minutes.
- First-class web endpoint support (`@asgi_app()`) turns any FastAPI app into a serverless, GPU-backed API with one command.
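The points above can be sketched in a minimal app definition. This is an illustrative sketch, not a complete example: the app name and the function body are placeholders, and it assumes the `modal` package is installed and an account is configured.

```python
import modal

app = modal.App("llm-inference")  # hypothetical app name

@app.function(gpu="A100")
def generate(prompt: str) -> str:
    # Runs in Modal's cloud on an A100; the container scales to zero when idle.
    return prompt.upper()  # placeholder for real model inference

@app.local_entrypoint()
def main():
    # .remote() executes the function in the cloud rather than locally.
    print(generate.remote("hello"))
```

Running `modal run app.py` executes `main()` locally while `generate` runs remotely; `modal serve app.py` gives the hot-reloading loop described above.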
Key capabilities
- GPU functions: `@app.function(gpu=...)` with `gpu` set to `"T4"`, `"A10G"`, `"A100"`, or `"H100"` — any Python function on any GPU class.
- Auto-scaling: Scale from 0 to N instances based on request volume; scale back to zero when idle.
- Container management: Automatic container building from Python requirements — no Dockerfile needed.
- Web endpoints: Deploy FastAPI or any ASGI app as a serverless web endpoint with `@asgi_app()`.
- Scheduled jobs: Cron-style scheduling with `@app.function(schedule=modal.Period(hours=1))`.
- Parallel jobs: Map a function across thousands of inputs in parallel with `Function.map()`.
- Persistent volumes: Mount NFS volumes for model weights and dataset caching.
- Secrets: Secure secrets management for API keys and credentials.
- CLI: `modal run`, `modal serve`, and `modal deploy` for development, serving, and production.
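Several of these capabilities compose in one file. The sketch below is illustrative — the app name, function bodies, and endpoint path are assumptions — and it requires the `modal` package plus a configured account to deploy.

```python
import modal

app = modal.App("capability-demo")  # hypothetical app name
web_image = modal.Image.debian_slim().pip_install("fastapi[standard]")

# Scheduled job: runs once an hour with no cron infrastructure to manage.
@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    print("refreshing cache")

# Parallel fan-out: each input can run in its own auto-scaled container.
@app.function()
def tokenize(text: str) -> int:
    return len(text.split())

# Serverless web endpoint: any ASGI app, deployed with `modal deploy`.
@app.function(image=web_image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    api = FastAPI()

    @api.get("/health")
    def health():
        return {"ok": True}

    return api

@app.local_entrypoint()
def main():
    # .map() streams results back as the parallel inputs complete.
    print(list(tokenize.map(["one", "one two", "one two three"])))
```

`modal deploy` publishes the schedule and the web endpoint; `modal run` drives the parallel map from the local entrypoint.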
Technical notes
- Language: Python (primary); REST API for other languages
- GPUs: T4, A10G, A100 (40/80GB), H100
- Pricing: Free $30/mo credits; T4 ~$0.59/hr, A10G ~$1.10/hr, A100 ~$3.72/hr, H100 ~$7.20/hr
- Containers: Custom Python environments built from `modal.Image`; supports pip, conda, and Docker
- Secrets: `modal.Secret` for environment variables; integrations with AWS, GCP, Cloudflare
- Founded: 2021; New York; raised $67M (Redpoint, Andreessen Horowitz)
- Team: Ex-Stripe, ex-Google, MIT engineers
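The `modal.Image` and `modal.Secret` notes above can be sketched together. In this illustrative fragment, the app name, the Secret name `huggingface`, and the `HF_TOKEN` key are all assumptions — a real Secret is created in the Modal dashboard and its key/value pairs are injected as environment variables at runtime.

```python
import modal

# Container spec defined in Python: no Dockerfile required.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers"
)

app = modal.App("weights-demo", image=image)  # hypothetical app name

# "huggingface" is an assumed Secret name configured in the dashboard.
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def fetch_weights():
    import os

    token = os.environ.get("HF_TOKEN")  # key name is an assumption
    print("token present:", token is not None)
```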
Ideal for
- ML engineers who want GPU access for LLM inference, diffusion models, or training without managing Kubernetes or Docker.
- Python developers building AI-powered APIs or batch processing pipelines on serverless infrastructure.
- Startups and researchers who need powerful GPUs for experiments but don't want to pay for idle VMs.
Not ideal for
- Non-Python workloads — Modal is built around Python; Go, Rust, or Node.js backend work needs another solution.
- Sustained high-throughput inference at scale — RunPod or dedicated GPU instances may be cheaper at constant load.
- Teams that need fine-grained GPU memory sharing across concurrent users — the territory of dedicated serving stacks such as LoRAX or vLLM.