Test NVIDIA NIM API speed in real-time. Compare 100+ AI models — LLM, code generation, text-to-image, text-to-video, embedding, and more. Measure tokens per second, time to first token (TTFT), and API latency. Free online NVIDIA NIM speed test tool.
Get your free API key from build.nvidia.com
Select a category to filter models. Click Test Speed on any model to measure its real-time performance — TTFT, tokens/sec, and total response time.
Or models will load automatically if you've saved your key before.
NVIDIA NIM (NVIDIA Inference Microservices) is a groundbreaking platform that provides performance-optimized, portable AI inference microservices. Launched by NVIDIA as part of their AI Enterprise platform, NIM API gives developers instant access to over 100 pre-optimized AI models through a simple, OpenAI-compatible API endpoint.
Unlike traditional AI API providers that charge per-token fees, NVIDIA NIM free tier offers generous access to cutting-edge models including Meta Llama, DeepSeek, Qwen, Mistral, Google Gemma, NVIDIA Nemotron, and many more — all without requiring a credit card. The NIM API endpoint at integrate.api.nvidia.com/v1 follows the OpenAI Chat Completions format, making it incredibly easy to migrate existing applications.
The platform runs on NVIDIA DGX Cloud infrastructure, ensuring enterprise-grade reliability and performance. Whether you're building a chatbot, code assistant, image generator, or multimodal AI application, NVIDIA NIM provides the inference backbone. The NIM API supports streaming responses, tool calling, and system prompts — everything modern AI applications need.
For developers looking to test AI model performance before committing to a provider, our NVIDIA NIM speed test tool above lets you benchmark any model in real-time. Measure tokens per second, time to first token, and total API latency to find the perfect model for your use case.
Follow these simple steps to test NVIDIA NIM API speed and compare model performance:
nvapi-).Drop-in replacement for OpenAI API. Change base_url and your existing code works with 100+ models instantly.
40+ models completely free, no credit card required. ~40 requests per minute — more than enough for development and testing.
Powered by NVIDIA DGX Cloud and TensorRT-LLM. Optimized for speed with streaming support and fast time-to-first-token.
Access Llama, DeepSeek, Qwen, Mistral, Gemma, Nemotron, FLUX, Stable Diffusion, and more — all from one API.
Built on NVIDIA DGX Cloud infrastructure. Continuous vulnerability fixes, SOC2 compliance, and enterprise support available.
NIM API endpoints are distributed globally for minimal latency. Test from anywhere in the world with consistent performance.
Here's a comprehensive list of popular NVIDIA NIM models available through the API catalog. Use our speed test tool above to benchmark their real-time performance.
| Model | Provider | Category | Description |
|---|---|---|---|
| meta/llama-3.3-70b-instruct | Meta | LLM | 70B parameter flagship instruct model |
| meta/llama-4-maverick-17b-128e-instruct | Meta | Multimodal | Latest Llama 4 with vision capabilities |
| deepseek-ai/deepseek-v4-pro | DeepSeek | LLM | 1M-token context window MoE model |
| nvidia/llama-3.1-nemotron-ultra-253b-v1 | NVIDIA | LLM | 253B parameter reasoning model |
| qwen/qwen3-coder-480b-a35b-instruct | Qwen | Code | 480B MoE code generation model |
| qwen/qwq-32b | Qwen | LLM | 32B reasoning model |
| mistralai/mistral-nemotron | Mistral | LLM | Mistral-NVIDIA collaboration model |
| google/gemma-4-31b-it | LLM | Latest Gemma instruction-tuned model | |
| black-forest-labs/flux.1-dev | Black Forest Labs | Image | High-quality text-to-image generation |
| stabilityai/stable-diffusion-xl | Stability AI | Image | SDXL text-to-image model |
| stabilityai/stable-video-diffusion | Stability AI | Video | Image-to-video generation |
| nvidia/cosmos-predict1-7b | NVIDIA | Video | World model for video prediction |
| nvidia/nv-embedqa-e5-v5 | NVIDIA | Embedding | Text embedding for RAG and search |
| baai/bge-m3 | BAAI | Embedding | Multi-lingual embedding model |
| nvidia/llama-3.2-11b-vision-instruct | Meta | Multimodal | Vision-language model |
| microsoft/phi-4-multimodal-instruct | Microsoft | Multimodal | Small multimodal instruct model |
| openai/gpt-oss-120b | OpenAI | LLM | Open-source 120B parameter model |
| nvidia/nemotron-mini-4b-instruct | NVIDIA | LLM | Compact 4B instruction model |
| moonshotai/kimi-k2-instruct | Moonshot AI | LLM | Moonshot's instruction-tuned model |
| arc/evo2-40b | Arc | Healthcare | DNA sequence generation model |
Understanding NVIDIA NIM pricing helps you choose the right tier for your project. Here's a detailed comparison of the NIM free tier vs paid offerings.
| Feature | NIM Free Tier | NIM Paid (DGX Cloud) |
|---|---|---|
| Price | Free forever | Pay-per-use (varies by model) |
| Models Available | 40+ models | 100+ models |
| Rate Limit | ~40 requests/min | Custom (higher limits) |
| Credit Card Required | No | Yes |
| SLA Guarantee | None | Yes (99.9%+ uptime) |
| Support | Community | Priority enterprise support |
| Production Ready | Dev/testing only | Yes |
| Fine-tuning | Not available | Available |
| API Format | OpenAI-compatible | OpenAI-compatible |
| Best For | Prototyping, learning, personal projects | Production apps, enterprise, high-scale |
Pro tip: Start with the NVIDIA NIM free tier to prototype and benchmark. Use our speed test tool above to find the fastest model for your use case, then upgrade to paid when you need production reliability.
Getting a NVIDIA NIM API key is free and takes less than 2 minutes. Follow these steps to start building with 100+ AI models:
nvapi-) in our speed test tool above and start benchmarking models instantly!Important: Your NVIDIA NIM API key provides access to the free tier with ~40 requests per minute. No credit card is required. The key works with the OpenAI Python SDK, JavaScript SDK, curl, and any HTTP client.
Testing NVIDIA NIM API speed before deploying is critical for building responsive AI applications. Our speed test tool measures three key metrics that directly impact user experience:
Time to First Token (TTFT) — This measures how quickly the API starts streaming a response. For chat applications, a low TTFT (under 500ms) creates the feeling of instant responsiveness. Large models like Llama 3.3 70B may have higher TTFT due to model initialization, while smaller models like Nemotron Mini 4B respond almost instantly.
Tokens per Second — This determines how fast the full response streams in. Higher tokens/sec means users see the complete answer faster. This metric is crucial for code generation, long-form writing, and any application where response length matters. Our tool measures this in real-time using the NIM streaming API.
Total Response Time — The complete end-to-end latency from request to final token. This is the "wall clock time" users actually experience. By benchmarking NIM API response time across different models, you can find the optimal balance between model capability and speed for your specific use case.
Different NVIDIA NIM models have vastly different performance characteristics. A 70B parameter model produces higher quality outputs but runs slower than a 7B model. Our NVIDIA NIM benchmark tool helps you make data-driven decisions about which model to use, saving development time and ensuring the best user experience for your AI-powered application.
https://integrate.api.nvidia.com/v1 with your API key, and NIM returns model responses. It supports chat completions, text completions, streaming, and tool calling — just like OpenAI's API.nvapi- and is displayed only once — copy it immediately. See our step-by-step guide above for detailed instructions.https://integrate.api.nvidia.com/v1 and use your NVIDIA API key for authentication.