Nvidia-Developer-Blog | Daily Tech Articles Feed

Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes

2026-03-12 16:30

Running AI clusters on Kubernetes is now easier with AI Cluster Runtime. 🖥️ This open-source project provides validated and reproducible Kubernetes configurations as recipes. It supports GPU clusters in various environments by publishing tested combinations of drivers, runtimes, and more. Users can capture their cluster's state and generate tailored recipes for deployment. For more info, explore the repository or use the aicr CLI! 🔧📦 #Kubernetes #AI #OpenSource #GPU...

Source: Nvidia Developer Blog

Mark Chmarny

Educational

Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics

2026-03-12 16:00

🚗✨ Physical AI is advancing rapidly, transforming autonomous vehicles and robotics. The focus is shifting to enabling high-fidelity reasoning and real-time interactions while managing power and latency. NVIDIA’s TensorRT Edge-LLM provides a solution with enhanced capabilities for its DRIVE AGX Thor and Jetson Thor platforms. Key features include advanced edge architectures and optimized support for open models. This technology allows developers to enhance autonomous systems while keeping...

Source: Nvidia Developer Blog

Lin Chai

Product Announcements

Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning

2026-03-11 16:00

Introducing Nemotron 3 Super! 🚀 This new AI model is designed for agentic reasoning, offering enhanced efficiency for complex tasks like software development and cybersecurity. With 120B total and 12B active parameters, it addresses key challenges like "context explosion" and the "thinking tax." The model features a 1M-token context window, allowing for long-term memory and improved accuracy. Its innovative architecture enhances throughput by over 5x compared to previous models....

Source: Nvidia Developer Blog

Chris Alexiuk

Product Announcements

NVIDIA RTX Innovations Are Powering the Next Era of Game Development

2026-03-10 15:30

NVIDIA is leading a revolution in game development with its RTX ray tracing and AI-powered technologies. 🌟 At GDC 2026, they showcased innovations that enhance visual fidelity and performance. Key highlights include:
- A new system for dense, path-traced foliage in NVIDIA RTX Mega Geometry.
- Path-traced indirect lighting and RTX Hair for streamlined graphics in UE5.
- Advanced on-device AI models for engaging gameplay and character interactions.

These developments promise to transform how...

Source: Nvidia Developer Blog

Ike Nnoli

Product Announcements

Reliable AI Coding for Unreal Engine: Improving Accuracy and Reducing Token Costs

2026-03-10 15:30

Transforming game development with AI! 🎮✨ Agentic code assistants are now integral in building expansive worlds and supporting distributed teams in Unreal Engine 5. These tools enhance efficiency by generating code, refactoring, and answering specific engine queries. A key focus is bridging the context gap, ensuring AI understands unique studio coding patterns. NVIDIA collaborates with studios to improve AI reliability, aiming to streamline production workflows. Reducing documentation...

Source: Nvidia Developer Blog

Paul Logan

Educational

CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features

2026-03-09 21:13

🚀 CUDA 13.2 has arrived with significant updates! NVIDIA CUDA Tile is now supported on Ampere, Ada, and Blackwell architectures. This release enhances developer productivity with new Python features, including profiling in CUDA Python and improved debugging for Numba kernels. The cuTile Python DSL has received updates like support for recursive functions, closures, and custom reduction functions. For easy installation, use the command: `pip install cuda-tile[tileiras]` to get started! #CUDA...

Source: Nvidia Developer Blog

Jonathan Bentz

Product Announcements

Implementing Falcon-H1 Hybrid Architecture in NVIDIA Megatron Core

2026-03-09 19:30

NVIDIA Megatron Core is a key framework for large language model development, offering advanced parallelism and GPU performance. The Technology Innovation Institute (TII) has integrated the Falcon-H1 hybrid architecture into Megatron Bridge, addressing the challenges of coordinating diverse layers. This innovative design features parallel processing of attention mechanisms and SSM. Additionally, TII's integration of BitNet into Megatron Core enhances training efficiency through the use of...

Source: Nvidia Developer Blog

Mireille Fares

Technical Deep Dives

Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library

2026-03-09 17:00

🚀 Large language models (LLMs) are increasingly relying on large-scale distributed inference, utilizing multiple GPUs to improve user experience and reduce latency. Key techniques include disaggregated serving, efficient KV cache loading, and wide expert parallelism. These methods help maximize performance by optimizing data transfers and enabling dynamic resource allocation. The NVIDIA Inference Transfer Library (NIXL) is introduced as a solution for managing diverse hardware environments,...

Source: Nvidia Developer Blog

Seonghee Lee

Technical Deep Dives

Removing the Guesswork from Disaggregated Serving

2026-03-09 16:00

🚀 Deploying large language models (LLMs) can be complex and time-consuming. AIConfigurator simplifies this process by optimizing configurations without needing extensive hardware tests. It breaks down LLM operations and provides latency estimates based on real measurements, allowing developers to find the best setups quickly. This tool also supports continuous batching and handles unique challenges like expert parallelism. Explore how AIConfigurator can streamline your deployment process! 💻✨...

Source: Nvidia Developer Blog

Tianhao Xu

Technical Deep Dives

NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

2026-03-05 18:00

NVIDIA's Blackwell has achieved a record in LLM inference for finance, showcasing the power of large language models in analyzing unstructured data for trading insights. 📈💼 The STAC-AI benchmark was developed to assess LLM performance, focusing on the LANG6 tests with models like Llama 3.1 across various datasets. These tests analyze financial documents for investment strategies. Results include batch and interactive mode scenarios, measuring throughput and response times. ⚙️📊 #NVIDIA...

Source: Nvidia Developer Blog

Dan Blanaru

Industry Analysis

Controlling Floating-Point Determinism in NVIDIA CCCL

2026-03-05 17:00

Controlling floating-point determinism can be challenging in parallel programming. NVIDIA's CCCL 3.1 introduces a new single-phase API in CUB, allowing users to customize algorithm behavior for determinism. This feature enables configurations for the reduce algorithm's determinism property, enhancing performance and reliability. For a detailed code example, check the full article! #NVIDIA #CUDA #ParallelProgramming #Computing #CUB
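
Why reduction order matters can be seen without a GPU. The sketch below is plain Python, not the CUB API: it only illustrates that floating-point addition is not associative, so a parallel reduce that combines partial sums in a varying order can return different results for the same input. The function name and the sample values are illustrative assumptions, not taken from the article.

```python
import math

def reduce_in_order(values):
    """Sum values strictly in the order given, like a sequential reduce."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, combined in two different orders.
values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

a = reduce_in_order(values)      # the 1.0 is lost when added to 1e16
b = reduce_in_order(reordered)   # the large terms cancel first, so 1.0 survives
exact = math.fsum(values)        # correctly rounded sum, independent of order

print(a, b, exact)
```

A deterministic reduction fixes the combination order (or uses an order-independent accumulation scheme) so that repeated runs agree bit for bit, usually at some cost in speed.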

Source: Nvidia Developer Blog

Nader Al Awar

Technical Deep Dives

Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile

2026-03-05 17:00

Unlock the potential of AI with Flash Attention! 🌟 This article explores implementing Flash Attention using NVIDIA cuTile, providing a complete code walkthrough for production readiness. It discusses the "trap and rescue" optimization journey, highlighting pitfalls of naive optimizations. Discover advanced techniques like FMA patterns, fast math, loop splitting, and adaptive tiling to maximize performance. 🚀 For implementation, ensure you have CUDA 13.1, NVIDIA Blackwell architecture, and...
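
The idea that makes Flash Attention tileable is an online softmax: scores are processed one tile at a time while a running maximum and a running denominator are kept, and earlier partial results are rescaled whenever the maximum grows. The sketch below shows this for a single query row in plain Python. It is not cuTile code, it omits the usual 1/sqrt(d) scaling, and all names and sample data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_tiled(q, keys, values, tile=2):
    """Attention output for one query row, streaming over key/value tiles."""
    m = float("-inf")               # running maximum of the scores seen so far
    l = 0.0                         # running softmax denominator
    acc = [0.0] * len(values[0])    # running weighted sum of value rows
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale earlier partial results
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def attention_row_reference(q, keys, values):
    """Same result computed with the full score vector held at once."""
    scores = [dot(q, k) for k in keys]
    mx = max(scores)
    ps = [math.exp(s - mx) for s in scores]
    z = sum(ps)
    return [sum(p * v[j] for p, v in zip(ps, values)) / z
            for j in range(len(values[0]))]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

tiled = attention_row_tiled(q, keys, values, tile=2)
reference = attention_row_reference(q, keys, values)
```

Because the tiled version never needs the full score vector, the working set per step is bounded by the tile size, which is what lets a kernel keep it in fast on-chip memory.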

Source: Nvidia Developer Blog

Alessandro Morari

Technical Deep Dives

Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile

2026-03-04 17:00

Unlock the potential of AI with Flash Attention! 🌟 This article explores implementing Flash Attention using NVIDIA cuTile, providing a complete code walkthrough for production readiness. It discusses the "trap and rescue" optimization journey, highlighting pitfalls of naive optimizations. Discover advanced techniques like FMA patterns, fast math, loop splitting, and adaptive tiling to maximize performance. 🚀 For implementation, ensure you have CUDA 13.1, NVIDIA Blackwell architecture, and...

Source: Nvidia Developer Blog

Alessandro Morari

Technical Deep Dives

How to Minimize Game Runtime Inference Costs with Coding Agents

2026-03-03 19:49

🚀 NVIDIA ACE is revolutionizing AI in gaming with its suite of technologies. It offers cloud and on-device models for in-game characters, enhancing aspects like speech and animation. The NVIDIA In-Game Inferencing SDK 1.5 introduces a new code agent sample, streamlining AI interactions in games. It focuses on reducing GPU contention by minimizing inference calls while maximizing their effectiveness. However, using AI agents poses challenges, such as potential security risks when they can...

Source: Nvidia Developer Blog

Brandon Rowlett

Technical Deep Dives

cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia

2026-03-03 19:48

🚀 Exciting news for Julia developers! cuTile.jl now brings NVIDIA CUDA Tile-based programming to the Julia language. This enables automatic access to tensor cores and simplifies the creation of high-performance GPU kernels. Building on NVIDIA's earlier release for Python, cuTile.jl allows developers to focus on operations at the tile level, reducing complexity in coding. Learn how this new tool can enhance your GPU programming experience! #JuliaLang #NVIDIA #CUDA #GPUProgramming #cuTile

Source: Nvidia Developer Blog

Tim Besard

Product Announcements

5 New Digital Twin Products Developers Can Use to Build 6G Networks

2026-03-01 07:00

🌐 The telecom industry is gearing up for 6G, facing the challenge of designing AI-native networks. NVIDIA's Aerial Omniverse Digital Twin (AODT) offers a solution by enabling a continuous integration workflow that simulates and validates RAN software in a realistic environment. Its modular design supports easy integration, fostering an ecosystem of commercial products to enhance network planning and testing. Discover how AODT accelerates innovation in building AI-native 6G networks. #6G...

Source: Nvidia Developer Blog

Cindy Goh

Product Announcements

Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo

2026-03-01 07:00

🚀 Autonomous networks are a key focus in telecommunications, as highlighted by the NVIDIA State of AI report. 📊 65% of operators see AI driving automation, but many face challenges in AI expertise, hindering network scalability. 🤖 Tech Mahindra and NVIDIA are working together to bridge this skills gap by providing resources for building AI models that emulate NOC engineers. 🔧 Their approach includes generating realistic incident data and translating expert procedures into structured...

Source: Nvidia Developer Blog

Aiden Chang

Educational

Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints

2026-02-27 17:30

🚀 Alibaba has launched the Qwen3.5 series, an open-source model designed for native multimodal agents. This model features a ~400B parameter vision-language architecture that enhances user interface navigation. It's suitable for various applications, including coding, visual reasoning, chat, and complex searches. Developers can access GPU-accelerated endpoints for experimentation through NVIDIA's platform. #Qwen3 #ArtificialIntelligence #NVIDIA #OpenSource #MachineLearning

Source: Nvidia Developer Blog

Anu Srivastava

Product Announcements

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM

2026-02-27 17:00

Organizations deploying large language models (LLMs) face challenges with varying inference workloads. Small models may need limited GPU memory, while larger ones require multiple GPUs, leading to low utilization and high costs. Intelligent scheduling is key to enhancing GPU performance without over- or underprovisioning. The article discusses NVIDIA NIM's containerized microservices and Run:ai's scheduling strategies that improve GPU utilization by nearly 2x and reduce latency significantly....

Source: Nvidia Developer Blog

Shwetha Krishnamurthy

Educational

Making Softmax More Efficient with NVIDIA Blackwell Ultra

2026-02-25 17:00

NVIDIA Blackwell Ultra addresses the bottleneck in large language models (LLMs) caused by the softmax function. As LLMs evolve with complex attention schemes, the softmax function can slow down processing. Blackwell Ultra improves efficiency by doubling the throughput of Special Function Units (SFUs), reducing pipeline stalls. The article details how attention mechanisms work and provides benchmarks for measuring performance improvements. 📊💻 #NVIDIA #AI #MachineLearning #Softmax #BlackwellUltra
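
For reference, a minimal softmax in plain Python makes the cost visible: every score needs one exponential, which is the kind of transcendental operation that special function units handle. This is a generic, numerically stable formulation and not code from the article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)                            # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]   # one exponential per score
    denom = sum(exps)
    return [e / denom for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

In attention, this runs once per query row over all key positions, so the number of exponentials grows with the product of the query and key sequence lengths.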

Source: Nvidia Developer Blog

Jamie Li

Technical Deep Dives

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

2026-02-23 18:00

AI model training faces challenges due to increasing model sizes and dataset demands. Higher-precision BF16 training is becoming insufficient. Lower-precision training can enhance efficiency and reduce costs. The article compares three low-precision formats—FP8-CS, MXFP8, and NVFP4—showing up to 1.6x higher throughput and similar model quality. Low-precision training uses fewer bits, lowering memory demands and boosting GPU operations. Adoption of these formats can lead to significant...

Source: Nvidia Developer Blog

Aditya Vavre

Technical Deep Dives

Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization

2026-02-19 17:30

NVIDIA's latest GPUs, including Ampere, Hopper, and Blackwell, exhibit non-uniform memory access (NUMA) behavior while presenting a unified memory space. The article explores how increased bandwidth in newer models can enhance performance and efficiency through compute and data locality. It highlights the benefits of using Multi-Instance GPU (MIG) mode for better data localization. Key insights include the impact of memory hierarchy and the significance of minimizing data transfer latency...

Source: Nvidia Developer Blog

Mukul Joshi

Technical Deep Dives

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

2026-02-18 18:00

Unlocking high throughput in AI workloads is essential. NVIDIA Run:ai tackles this with intelligent scheduling and dynamic GPU fractioning, enhancing resource efficiency. 🚀 A recent benchmark with Nebius shows that fractional GPU allocation can significantly boost large language model (LLM) inference performance. Results include reaching 77% of full-GPU throughput with just a 0.5 GPU fraction. 📊 This approach allows enterprises to run multiple LLMs efficiently, meeting user...

Source: Nvidia Developer Blog

Boskey Savla

Technical Deep Dives

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute

2026-02-18 17:00

NVIDIA's cuda.compute library is transforming GPU programming for Python developers. Historically, achieving fast GPU performance required C++ expertise. However, cuda.compute offers a high-level Python API that simplifies access to optimized CUDA primitives. This innovation helped the NVIDIA CCCL team excel in the GPU MODE leaderboard, achieving top finishes across various architectures. 🏆💻 The library enables custom data types, supports rapid development with JIT compilation, and maintains...

Source: Nvidia Developer Blog

Daniel Rodriguez

Technical Deep Dives

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models

2026-02-18 16:00

🌐 As AI adoption grows, developers face challenges in delivering performance for large language models (LLMs) while managing latency and costs. Sarvam AI, based in Bengaluru, is tackling this by creating multilingual models with a focus on data sovereignty. They partnered with NVIDIA to optimize hardware and software, achieving a significant 4x boost in inference performance. 🚀 This collaboration involved using NVIDIA’s latest technology, enabling the development of models supporting 22...

Source: Nvidia Developer Blog

Utkarsh Uppal

Technical Deep Dives

Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities

2026-02-17 18:00

Unlock the potential of your enterprise data with the NVIDIA Enterprise RAG Blueprint! 📊 This framework enhances retrieval-augmented generation (RAG) systems by integrating multimodal capabilities. It processes complex documents (text, tables, images, and more), ensuring accurate insights. Key features include:
- Baseline multimodal RAG pipeline
- Reasoning capabilities
- Query decomposition
- Efficient metadata filtering
- Visual reasoning for rich data

Transform your traditional data...

Source: Nvidia Developer Blog

Shruthii Sathyanarayanan

Technical Deep Dives

R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab

2026-02-10 18:30

🚀 Building intelligent robots requires effective testing in complex environments. Traditional methods of gathering real-world data can be costly and slow, often leaving robots unprepared for unexpected situations. 💻 NVIDIA’s Isaac Lab offers a solution with its open-source, GPU-native simulation framework. It addresses key challenges in robot learning, including scaling simulations and integrating multiple sensor modalities. 🔍 This unified platform allows researchers to train robots...

Source: Nvidia Developer Blog

Oyindamola Omotuyi

Technical Deep Dives

Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities

2026-02-10 17:30

🚀 Scientists and engineers at NVIDIA are addressing challenges in massive research facilities, focusing on the need to manage high data rates effectively. By utilizing GPU-accelerated computing, they enhance real-time experiment steering. Notable projects include the Vera C. Rubin Observatory and SLAC’s LCLS-II, which are revolutionizing astrophysics and X-ray science. These advancements allow for unprecedented data collection, capturing thousands of new asteroids nightly and producing...

Source: Nvidia Developer Blog

Quynh L. Nguyen

Technical Deep Dives

Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy

2026-02-09 18:30

🚀 NVIDIA has introduced AutoDeploy as a beta feature in TensorRT LLM, streamlining the process of deploying large language models (LLMs). This tool automates the conversion of PyTorch models into optimized inference graphs, reducing manual effort and deployment time. It supports various architectures and simplifies challenges related to inference optimization. Key features include seamless model translation, a single source of truth, and the ability to deploy new models quickly while...

Source: Nvidia Developer Blog

Lucas Liebenwein

Product Announcements

3 Ways NVFP4 Accelerates AI Training and Inference

2026-02-06 16:00

🚀 NVIDIA's latest innovation, NVFP4, is transforming AI training and inference. As AI models grow in complexity, NVIDIA's extreme codesign approach enhances performance across chips and software. NVFP4, starting with Blackwell GPUs, offers 4-bit precision that maintains accuracy while boosting energy efficiency. Key points about NVFP4:
1️⃣ It enables significant performance improvements for training and inference.
2️⃣ Blackwell Ultra GPUs achieve up to 15 petaFLOPS, tripling FP8 performance....

Source: Nvidia Developer Blog

Ashraf Eassa

Product Announcements

How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation

2026-02-05 18:00

🚀 Building specialized AI models can be challenging due to limited high-quality domain data, unclear licensing, high compute costs, and slow iterations. This article provides a guide to overcoming these challenges with a license-compliant synthetic data pipeline. It highlights open source tools like OpenRouter and NVIDIA NeMo Data Designer, which streamline model access and data generation. By following the tutorial, developers can create scalable and compliant data pipelines, even with...

Source: Nvidia Developer Blog

Alex Steiner

Educational

How Painkiller RTX Uses Generative AI to Modernize Game Assets at Scale

2026-02-05 14:00

Painkiller RTX demonstrates how small teams can leverage generative AI to enhance game assets efficiently. By upscaling thousands of legacy textures into high-quality materials, the team reduces repetitive tasks and allows creativity to flourish. Key insights from team members highlight the blend of automation with artistic judgment across 35 levels, showcasing a new production pipeline approach. This innovation opens doors for those without traditional modding backgrounds to focus on...

Source: Nvidia Developer Blog

Phillip Singh

Technical Deep Dives

Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints

2026-02-04 19:46

🚀 Introducing Kimi K2.5, the latest open vision language model (VLM) from the Kimi family! This multimodal model excels in various tasks, including AI workflows, chat, and coding. 🔧 Built on the Megatron-LM framework, Kimi K2.5 utilizes advanced GPU optimization and parallelism for efficient training. 📊 Key specs include 1 trillion total parameters, 384 experts, and a unique ability to handle text, images, and video. #AI #MachineLearning #NVIDIA #KimiK25 #VLM

Source: Nvidia Developer Blog

Anu Srivastava

Product Announcements

How to Build a Document Processing Pipeline for RAG with Nemotron

2026-02-04 16:00

Unlock the power of AI with NVIDIA Nemotron RAG! 🌟 This article explores how to build a document processing pipeline capable of handling complex PDFs and extracting structured data. It guides users through utilizing the NVIDIA NeMo Retriever library and integrating it with Nemotron RAG models for accurate results. Check out the tutorial resources and maximize your document processing capabilities! 📄💻 #AI #DocumentProcessing #NVIDIA #DataExtraction #MachineLearning

Source: Nvidia Developer Blog

Moon Chung

Educational

Accelerating Long-Context Model Training in JAX and XLA

2026-02-03 17:30

🚀 Large language models (LLMs) are pushing the boundaries with context windows reaching over 256K tokens. However, training these models brings computational challenges due to memory and communication overhead. 🔍 A recent study highlights how integrating NVSHMEM with the XLA compiler enhances context parallelism, yielding a 36% speedup for the Llama 3 8B model during long-context training. 📈 Context parallelism, particularly with ring attention, optimizes memory usage and communication...

Source: Nvidia Developer Blog

Sevin Fide Varoglu

Technical Deep Dives

Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

2026-02-02 18:43

Optimizing communication in mixture-of-experts (MoE) training is crucial for large language models (LLMs). The article introduces Hybrid-EP, a solution for improving Expert Parallel communication, particularly in NVIDIA's Megatron frameworks. This addresses challenges like communication bottlenecks and load imbalances in models like DeepSeek-V3. The new approach enhances training efficiency by integrating advanced parallelism strategies and optimizing resource usage. 🔍💻✨ #MachineLearning...

Source: Nvidia Developer Blog

Fan Yu

Technical Deep Dives

Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton

2026-01-30 20:01

NVIDIA is advancing GPU programming with the integration of CUDA Tile as a backend for OpenAI Triton. This development targets portability for NVIDIA Tensor Cores, enhancing GPU performance. CUDA Tile allows developers to express computations at a higher abstraction level by working with data blocks (tiles). This reduces programming complexity and enables better compiler optimizations. The Triton-to-TileIR backend connects Triton with CUDA Tile IR, allowing developers to compile GPU kernels...

Source: Nvidia Developer Blog

Jie Xin

Technical Deep Dives

Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

2026-01-30 18:00

🔍 Exploring the world of sparse tensors! Sparse tensors, which are essential in fields like scientific computing and deep learning, help optimize storage and computation. However, managing them can be challenging due to existing limitations. The Universal Sparse Tensor (UST) offers a solution by separating tensor sparsity from its memory representation. Developers can use a domain-specific language (DSL) to define and optimize sparse storage formats to fit their applications. This innovative...

Source: Nvidia Developer Blog

Aart J.C. Bik

Technical Deep Dives

Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk

2026-01-30 16:13

AI coding agents enhance developer productivity by automating tasks and facilitating test-driven development. However, they pose security risks due to indirect prompt injection from malicious sources. ⚠️ To mitigate these risks, the NVIDIA AI Red Team recommends several controls, including:
- **Network egress controls** to block unauthorized site access.
- **File write restrictions** to prevent unauthorized persistence and code execution.
- **Sandboxing techniques** to isolate development...

Source: Nvidia Developer Blog

Rich Harang

Security Compliance

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

2026-01-28 17:00

🚀 NVIDIA Run:ai v2.24 introduces time-based fairshare for Kubernetes clusters, enhancing GPU resource allocation. This new scheduling mode addresses challenges in shared GPU systems by considering historical resource usage, ensuring fair access for teams with varying job sizes. Teams that frequently utilize resources receive lower scores, while those waiting get a boost. Time-based fairshare promotes balanced compute time over days and weeks, allowing for efficient resource planning and...
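
The general idea of usage-aware scoring can be sketched in a few lines. The formula below is a hypothetical illustration, not Run:ai's actual scheduler: it assumes an exponentially decayed usage history with a made-up half-life, and all function and parameter names are invented for the example.

```python
def decayed_usage(gpu_hours_by_day, half_life_days=7.0):
    """Exponentially decayed GPU-hours; index 0 is today, 1 is yesterday, and so on."""
    return sum(hours * 0.5 ** (age / half_life_days)
               for age, hours in enumerate(gpu_hours_by_day))

def fairshare_score(weight, gpu_hours_by_day):
    """Higher score means higher scheduling priority (hypothetical formula)."""
    return weight / (1.0 + decayed_usage(gpu_hours_by_day))

heavy_team = fairshare_score(1.0, [80.0, 80.0, 80.0])   # used GPUs all week
waiting_team = fairshare_score(1.0, [0.0, 0.0, 0.0])    # has been waiting
recent_burst = fairshare_score(1.0, [10.0, 0.0, 0.0])   # usage today
older_burst = fairshare_score(1.0, [0.0, 0.0, 10.0])    # same usage, two days ago

print(heavy_team, waiting_team, recent_burst, older_burst)
```

With a decay term like this, a team's past usage lowers its priority for a while and then fades, so heavy users are not penalized indefinitely.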

Source: Nvidia Developer Blog

Ekin Karabulut

Product Announcements
