2026-03-12 16:30
Streamlining AI clusters on Kubernetes is now easier with the AI Cluster Runtime. 🖥️ This open-source project provides validated and reproducible Kubernetes configurations as recipes. It supports GPU clusters in various environments by publishing tested combinations of drivers, runtimes, and more. Users can easily capture their cluster's state and generate tailored recipes for deployment. For more info, explore the repository or use the aicr CLI! 🔧📦 #Kubernetes #AI #OpenSource #GPU...
Source: Nvidia Developer Blog
Mark Chmarny
2026-03-12 16:00
🚗✨ Physical AI is advancing rapidly, transforming autonomous vehicles and robotics. The focus is shifting to enabling high-fidelity reasoning and real-time interactions while managing power and latency. NVIDIA’s TensorRT Edge-LLM provides a solution with enhanced capabilities for its DRIVE AGX Thor and Jetson Thor platforms. Key features include advanced edge architectures and optimized support for open models. This technology allows developers to enhance autonomous systems while keeping...
Source: Nvidia Developer Blog
Lin Chai
2026-03-11 16:00
Introducing the Nemotron 3 Super! 🚀 This new AI model is designed for agentic reasoning, offering enhanced efficiency for complex tasks like software development and cybersecurity. With 120B total and 12B active parameters, it addresses key challenges like "context explosion" and the "thinking tax." The model features a 1M-token context window, allowing for long-term memory and improved accuracy. Its innovative architecture enhances throughput by over 5x compared to previous models....
Source: Nvidia Developer Blog
Chris Alexiuk
2026-03-10 15:30
NVIDIA is leading a revolution in game development with its RTX ray tracing and AI-powered technologies. 🌟 At GDC 2026, they showcased innovations that enhance visual fidelity and performance. Key highlights include:
- A new system for dense, path-traced foliage in NVIDIA RTX Mega Geometry.
- Path-traced indirect lighting and RTX Hair for streamlined graphics in UE5.
- Advanced on-device AI models for engaging gameplay and character interactions.
These developments promise to transform how...
Source: Nvidia Developer Blog
Ike Nnoli
2026-03-10 15:30
Transforming game development with AI! 🎮✨ Agentic code assistants are now integral to building expansive worlds and supporting distributed teams in Unreal Engine 5. These tools enhance efficiency by generating code, refactoring, and answering specific engine queries. A key focus is bridging the context gap, ensuring AI understands unique studio coding patterns. NVIDIA collaborates with studios to improve AI reliability, aiming to streamline production workflows. Reducing documentation...
Source: Nvidia Developer Blog
Paul Logan
2026-03-09 21:13
🚀 CUDA 13.2 has arrived with significant updates! NVIDIA CUDA Tile is now supported on Ampere, Ada, and Blackwell architectures. This release enhances developer productivity with new Python features, including profiling in CUDA Python and improved debugging for Numba kernels. The cuTile Python DSL has received updates like support for recursive functions, closures, and custom reduction functions. For easy installation, use the command: `pip install cuda-tile[tileiras]` to get started! #CUDA...
Source: Nvidia Developer Blog
Jonathan Bentz
2026-03-09 19:30
NVIDIA Megatron Core is a key framework for large language model development, offering advanced parallelism and GPU performance. The Technology Innovation Institute (TII) has integrated the Falcon-H1 hybrid architecture into Megatron Bridge, addressing the challenges of coordinating diverse layers. This design processes attention and state space model (SSM) layers in parallel. Additionally, TII's integration of BitNet into Megatron Core enhances training efficiency through the use of...
Source: Nvidia Developer Blog
Mireille Fares
2026-03-09 17:00
🚀 Large language models (LLMs) increasingly rely on large-scale distributed inference, using multiple GPUs to improve user experience and reduce latency. Key techniques include disaggregated serving, efficient KV cache loading, and wide expert parallelism. These methods help maximize performance by optimizing data transfers and enabling dynamic resource allocation. The NVIDIA Inference Transfer Library (NIXL) is introduced as a solution for managing diverse hardware environments,...
Source: Nvidia Developer Blog
Seonghee Lee
2026-03-09 16:00
🚀 Deploying large language models (LLMs) can be complex and time-consuming. AIConfigurator simplifies this process by optimizing configurations without needing extensive hardware tests. It breaks down LLM operations and provides latency estimates based on real measurements, allowing developers to find the best setups quickly. This tool also supports continuous batching and handles unique challenges like expert parallelism. Explore how AIConfigurator can streamline your deployment process! 💻✨...
Source: Nvidia Developer Blog
Tianhao Xu
2026-03-05 18:00
NVIDIA's Blackwell has achieved a record in LLM inference for finance, showcasing the power of large language models in analyzing unstructured data for trading insights. 📈💼 The STAC-AI benchmark was developed to assess LLM performance, focusing on the LANG6 tests with models like Llama 3.1 across various datasets. These tests analyze financial documents for investment strategies. Results include batch and interactive mode scenarios, measuring throughput and response times. ⚙️📊 #NVIDIA...
Source: Nvidia Developer Blog
Dan Blanaru
2026-03-05 17:00
Controlling floating-point determinism can be challenging in parallel programming. NVIDIA's CCCL 3.1 introduces a new single-phase API in CUB that lets users customize algorithm behavior for determinism. With it, users can configure the determinism property of the reduce algorithm for better control over performance and reliability. For a detailed code example, check the full article! #NVIDIA #CUDA #ParallelProgramming #Computing #CUB
Source: Nvidia Developer Blog
Nader Al Awar
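Determinism matters for a parallel reduce because floating-point addition is not associative: the order in which partial sums are combined can change the result. The sketch below shows this in plain Python. It does not use the CUB API, and the helper name `sequential_sum` is made up for illustration.

```python
import math

def sequential_sum(values):
    """Add values strictly left to right, with one rounding step per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [1e16, -1e16, 1.0]

# Same numbers, two different summation orders.
forward = sequential_sum(values)          # (1e16 - 1e16) + 1.0
backward = sequential_sum(values[::-1])   # (1.0 - 1e16) + 1e16; the 1.0 is lost to rounding
exact = math.fsum(values)                 # correctly rounded sum, independent of order

print("forward: ", forward)
print("backward:", backward)
print("fsum:    ", exact)
```

A parallel reduction that combines partial sums in a scheduling-dependent order can hit the same effect, which is why a configurable determinism guarantee is useful.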
2026-03-05 17:00
Unlock the potential of AI with Flash Attention! 🌟 This article explores implementing Flash Attention using NVIDIA cuTile, providing a complete code walkthrough for production readiness. It discusses the "trap and rescue" optimization journey, highlighting pitfalls of naive optimizations. Discover advanced techniques like FMA patterns, fast math, loop splitting, and adaptive tiling to maximize performance. 🚀 For implementation, ensure you have CUDA 13.1, NVIDIA Blackwell architecture, and...
Source: Nvidia Developer Blog
Alessandro Morari
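The idea behind Flash Attention is to process keys and values one tile at a time while keeping a running maximum and running sum, so the full score row never has to be stored. The sketch below shows that "online softmax" bookkeeping for a single query in plain Python. It is not cuTile code; all function names and the `tile_size` parameter are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Plain softmax attention for one query: materializes every score."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]

def attention_tiled(q, keys, values, tile_size=2):
    """Same result, but keys/values are visited tile by tile."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [dot(q, k) * scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile_size=2)
print(ref)
print(tiled)
```

The two functions should agree up to floating-point rounding for any tile size. A GPU kernel applies the same recurrence to whole tiles of queries at once.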
2026-03-03 19:49
🚀 NVIDIA ACE is revolutionizing AI in gaming with its suite of technologies. It offers cloud and on-device models for in-game characters, enhancing aspects like speech and animation. The NVIDIA In-Game Inferencing SDK 1.5 introduces a new code agent sample, streamlining AI interactions in games. It focuses on reducing GPU contention by minimizing inference calls while maximizing their effectiveness. However, using AI agents poses challenges, such as potential security risks when they can...
Source: Nvidia Developer Blog
Brandon Rowlett
2026-03-03 19:48
🚀 Exciting news for Julia developers! cuTile.jl now brings NVIDIA CUDA Tile-based programming to the Julia language. This enables automatic access to tensor cores and simplifies the creation of high-performance GPU kernels. Building on NVIDIA's earlier release for Python, cuTile.jl allows developers to focus on operations at the tile level, reducing complexity in coding. Learn how this new tool can enhance your GPU programming experience! #JuliaLang #NVIDIA #CUDA #GPUProgramming #cuTile
Source: Nvidia Developer Blog
Tim Besard
2026-03-01 07:00
🌐 The telecom industry is gearing up for 6G, facing the challenge of designing AI-native networks. NVIDIA's Aerial Omniverse Digital Twin (AODT) offers a solution by enabling a continuous integration workflow that simulates and validates RAN software in a realistic environment. Its modular design supports easy integration, fostering an ecosystem of commercial products to enhance network planning and testing. Discover how AODT accelerates innovation in building AI-native 6G networks. #6G...
Source: Nvidia Developer Blog
Cindy Goh
2026-03-01 07:00
🚀 Autonomous networks are a key focus in telecommunications, as highlighted by the NVIDIA State of AI report. 📊 65% of operators see AI driving automation, but many face challenges in AI expertise, hindering network scalability. 🤖 Tech Mahindra and NVIDIA are working together to bridge this skills gap by providing resources for building AI models that emulate NOC engineers. 🔧 Their approach includes generating realistic incident data and translating expert procedures into structured...
Source: Nvidia Developer Blog
Aiden Chang
2026-02-27 17:30
🚀 Alibaba has launched the Qwen3.5 series of open-source models, designed for native multimodal agents. It features a ~400B-parameter vision-language architecture that enhances user interface navigation. It's suitable for various applications, including coding, visual reasoning, chat, and complex searches. Developers can access GPU-accelerated endpoints for experimentation through NVIDIA's platform. #Qwen3 #ArtificialIntelligence #NVIDIA #OpenSource #MachineLearning
Source: Nvidia Developer Blog
Anu Srivastava
2026-02-27 17:00
Organizations deploying large language models (LLMs) face challenges with varying inference workloads. Small models may need limited GPU memory, while larger ones require multiple GPUs, leading to low utilization and high costs. Intelligent scheduling is key to enhancing GPU performance without over- or underprovisioning. The article discusses NVIDIA NIM's containerized microservices and Run:ai's scheduling strategies that improve GPU utilization by nearly 2x and reduce latency significantly....
Source: Nvidia Developer Blog
Shwetha Krishnamurthy
2026-02-25 17:00
NVIDIA Blackwell Ultra addresses the bottleneck in large language models (LLMs) caused by the softmax function. As LLMs evolve with complex attention schemes, the softmax function can slow down processing. Blackwell Ultra improves efficiency by doubling the throughput of Special Function Units (SFUs), reducing pipeline stalls. The article details how attention mechanisms work and provides benchmarks for measuring performance improvements. 📊💻 #NVIDIA #AI #MachineLearning #Softmax #BlackwellUltra
Source: Nvidia Developer Blog
Jamie Li
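To see where the softmax cost comes from, note that each attention score needs one exponential, which is the kind of transcendental operation that special function units typically handle. The sketch below is a numerically stable softmax in plain Python, written for illustration only and not taken from the article.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # one exp() per score
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)

# Subtracting the max keeps large scores from overflowing exp().
print(softmax([1000.0, 1000.0]))
```

For a sequence of length n, attention evaluates roughly n exponentials per query, so exponential throughput can become the limiting factor once the matrix multiplications are fast enough.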
2026-02-23 18:00
AI model training faces challenges due to increasing model sizes and dataset demands. Higher-precision BF16 training is becoming insufficient. Lower-precision training can enhance efficiency and reduce costs. The article compares three low-precision formats—FP8-CS, MXFP8, and NVFP4—showing up to 1.6x higher throughput and similar model quality. Low-precision training uses fewer bits, lowering memory demands and boosting GPU operations. Adoption of these formats can lead to significant...
Source: Nvidia Developer Blog
Aditya Vavre
2026-02-19 17:30
NVIDIA's latest GPUs, including Ampere, Hopper, and Blackwell, exhibit non-uniform memory access (NUMA) behavior while presenting a unified memory space. The article explores how increased bandwidth in newer models can enhance performance and efficiency through compute and data locality. It highlights the benefits of using Multi-Instance GPU (MIG) mode for better data localization. Key insights include the impact of memory hierarchy and the significance of minimizing data transfer latency...
Source: Nvidia Developer Blog
Mukul Joshi
2026-02-18 18:00
Unlocking high throughput in AI workloads is essential. NVIDIA Run:ai tackles this with intelligent scheduling and dynamic GPU fractioning, enhancing resource efficiency. 🚀 A recent benchmark with Nebius shows that fractional GPU allocation can significantly boost large language model (LLM) inference performance. Reported results include 77% GPU throughput with just a 0.5 GPU fraction. 📊 This approach allows enterprises to run multiple LLMs efficiently, meeting user...
Source: Nvidia Developer Blog
Boskey Savla
2026-02-18 17:00
NVIDIA's cuda.compute library is transforming GPU programming for Python developers. Historically, achieving fast GPU performance required C++ expertise. However, cuda.compute offers a high-level Python API that simplifies access to optimized CUDA primitives. This innovation helped the NVIDIA CCCL team excel on the GPU MODE leaderboard, achieving top finishes across various architectures. 🏆💻 The library enables custom data types, supports rapid development with JIT compilation, and maintains...
Source: Nvidia Developer Blog
Daniel Rodriguez
2026-02-18 16:00
🌐 As AI adoption grows, developers face challenges in delivering performance for large language models (LLMs) while managing latency and costs. Sarvam AI, based in Bengaluru, is tackling this by creating multilingual models with a focus on data sovereignty. They partnered with NVIDIA to optimize hardware and software, achieving a significant 4x boost in inference performance. 🚀 This collaboration involved using NVIDIA’s latest technology, enabling the development of models supporting 22...
Source: Nvidia Developer Blog
Utkarsh Uppal
2026-02-17 18:00
Unlock the potential of your enterprise data with the NVIDIA Enterprise RAG Blueprint! 📊 This framework enhances retrieval-augmented generation (RAG) systems by integrating multimodal capabilities. It processes complex documents (text, tables, images, and more), ensuring accurate insights. Key features include:
- Baseline multimodal RAG pipeline
- Reasoning capabilities
- Query decomposition
- Efficient metadata filtering
- Visual reasoning for rich data
Transform your traditional data...
Source: Nvidia Developer Blog
Shruthii Sathyanarayanan
2026-02-10 18:30
🚀 Building intelligent robots requires effective testing in complex environments. Traditional methods of gathering real-world data can be costly and slow, often leaving robots unprepared for unexpected situations. 💻 NVIDIA’s Isaac Lab offers a solution with its open-source, GPU-native simulation framework. It addresses key challenges in robot learning, including scaling simulations and integrating multiple sensor modalities. 🔍 This unified platform allows researchers to train robots...
Source: Nvidia Developer Blog
Oyindamola Omotuyi
2026-02-10 17:30
🚀 Scientists and engineers at NVIDIA are addressing challenges in massive research facilities, focusing on the need to manage high data rates effectively. By utilizing GPU-accelerated computing, they enhance real-time experiment steering. Notable projects include the Vera C. Rubin Observatory and SLAC’s LCLS-II, which are revolutionizing astrophysics and X-ray science. These advancements allow for unprecedented data collection, capturing thousands of new asteroids nightly and producing...
Source: Nvidia Developer Blog
Quynh L. Nguyen
2026-02-09 18:30
🚀 NVIDIA has introduced AutoDeploy as a beta feature in TensorRT LLM, streamlining the process of deploying large language models (LLMs). This tool automates the conversion of PyTorch models into optimized inference graphs, reducing manual effort and deployment time. It supports various architectures and simplifies challenges related to inference optimization. Key features include seamless model translation, a single source of truth, and the ability to deploy new models quickly while...
Source: Nvidia Developer Blog
Lucas Liebenwein
2026-02-06 16:00
🚀 NVIDIA's latest innovation, NVFP4, is transforming AI training and inference. As AI models grow in complexity, NVIDIA's extreme codesign approach enhances performance across chips and software. NVFP4, starting with Blackwell GPUs, offers 4-bit precision that maintains accuracy while boosting energy efficiency. Key points about NVFP4: 1️⃣ It enables significant performance improvements for training and inference. 2️⃣ Blackwell Ultra GPUs achieve up to 15 petaFLOPS, tripling FP8 performance....
Source: Nvidia Developer Blog
Ashraf Eassa
2026-02-05 18:00
🚀 Building specialized AI models can be challenging due to limited high-quality domain data, unclear licensing, high compute costs, and slow iterations. This article provides a guide to overcoming these challenges with a license-compliant synthetic data pipeline. It highlights open source tools like OpenRouter and NVIDIA NeMo Data Designer, which streamline model access and data generation. By following the tutorial, developers can create scalable and compliant data pipelines, even with...
Source: Nvidia Developer Blog
Alex Steiner
2026-02-05 14:00
Painkiller RTX demonstrates how small teams can leverage generative AI to enhance game assets efficiently. By upscaling thousands of legacy textures into high-quality materials, the team reduces repetitive tasks and allows creativity to flourish. Key insights from team members highlight the blend of automation with artistic judgment across 35 levels, showcasing a new production pipeline approach. This innovation opens doors for those without traditional modding backgrounds to focus on...
Source: Nvidia Developer Blog
Phillip Singh
2026-02-04 19:46
🚀 Introducing Kimi K2.5, the latest open vision language model (VLM) from the Kimi family! This multimodal model excels in various tasks, including AI workflows, chat, and coding. 🔧 Built on the Megatron-LM framework, Kimi K2.5 utilizes advanced GPU optimization and parallelism for efficient training. 📊 Key specs include 1 trillion total parameters, 384 experts, and a unique ability to handle text, images, and video. #AI #MachineLearning #NVIDIA #KimiK25 #VLM
Source: Nvidia Developer Blog
Anu Srivastava
2026-02-04 16:00
Unlock the power of AI with NVIDIA Nemotron RAG! 🌟 This article explores how to build a document processing pipeline capable of handling complex PDFs and extracting structured data. It guides users through utilizing the NVIDIA NeMo Retriever library and integrating it with Nemotron RAG models for accurate results. Check out the tutorial resources and maximize your document processing capabilities! 📄💻 #AI #DocumentProcessing #NVIDIA #DataExtraction #MachineLearning
Source: Nvidia Developer Blog
Moon Chung
2026-02-03 17:30
🚀 Large language models (LLMs) are pushing the boundaries with context windows reaching over 256K tokens. However, training these models brings computational challenges due to memory and communication overhead. 🔍 A recent study highlights how integrating NVSHMEM with the XLA compiler enhances context parallelism, yielding a 36% speedup for the Llama 3 8B model during long-context training. 📈 Context parallelism, particularly with ring attention, optimizes memory usage and communication...
Source: Nvidia Developer Blog
Sevin Fide Varoglu
2026-02-02 18:43
Optimizing communication in mixture-of-experts (MoE) training is crucial for large language models (LLMs). The article introduces Hybrid-EP, a solution for improving Expert Parallel communication, particularly in NVIDIA's Megatron frameworks. This addresses challenges like communication bottlenecks and load imbalances in models like DeepSeek-V3. The new approach enhances training efficiency by integrating advanced parallelism strategies and optimizing resource usage. 🔍💻✨ #MachineLearning...
Source: Nvidia Developer Blog
Fan Yu
2026-01-30 20:01
NVIDIA is advancing GPU programming with the integration of CUDA Tile as a backend for OpenAI Triton. This development targets portability for NVIDIA Tensor Cores, enhancing GPU performance. CUDA Tile allows developers to express computations at a higher abstraction level by working with data blocks (tiles). This reduces programming complexity and enables better compiler optimizations. The Triton-to-TileIR backend connects Triton with CUDA Tile IR, allowing developers to compile GPU kernels...
Source: Nvidia Developer Blog
Jie Xin
2026-01-30 18:00
🔍 Exploring the world of sparse tensors! Sparse tensors, which are essential in fields like scientific computing and deep learning, help optimize storage and computation. However, managing them can be challenging due to existing limitations. The Universal Sparse Tensor (UST) offers a solution by separating tensor sparsity from its memory representation. Developers can use a domain-specific language (DSL) to define and optimize sparse storage formats to fit their applications. This innovative...
Source: Nvidia Developer Blog
Aart J.C. Bik
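Separating a tensor's sparsity from its memory representation means the same nonzeros can be stored in different layouts without changing the math. The sketch below stores one matrix in two common formats, COO and CSR, in plain Python. It is not the UST DSL, and all function names are illustrative.

```python
def dense_to_coo(dense):
    """COO: a list of (row, col, value) triples in row-major order."""
    return [(i, j, x)
            for i, row in enumerate(dense)
            for j, x in enumerate(row) if x != 0]

def coo_to_csr(coo, num_rows):
    """CSR: row pointers plus column indices and values (input sorted by row)."""
    row_ptr = [0] * (num_rows + 1)
    for i, _, _ in coo:
        row_ptr[i + 1] += 1            # count nonzeros per row
    for i in range(num_rows):
        row_ptr[i + 1] += row_ptr[i]   # prefix sum gives row start offsets
    col_idx = [j for _, j, _ in coo]
    vals = [x for _, _, x in coo]
    return row_ptr, col_idx, vals

def coo_matvec(coo, num_rows, x):
    y = [0] * num_rows
    for i, j, v in coo:
        y[i] += v * x[j]
    return y

def csr_matvec(row_ptr, col_idx, vals, x):
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(vals[k] * x[col_idx[k]] for k in range(start, end)))
    return y

dense = [[0, 2, 0],
         [1, 0, 3],
         [0, 0, 0]]
x = [1, 2, 3]

coo = dense_to_coo(dense)
row_ptr, col_idx, vals = coo_to_csr(coo, num_rows=len(dense))

print(coo)
print(row_ptr, col_idx, vals)
print(coo_matvec(coo, len(dense), x), csr_matvec(row_ptr, col_idx, vals, x))
```

Both layouts give the same matrix-vector product. Which one is faster depends on the access pattern, which is the kind of choice a storage-format DSL is meant to expose.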
2026-01-30 16:13
AI coding agents enhance developer productivity by automating tasks and facilitating test-driven development. However, they pose security risks due to indirect prompt injection from malicious sources. ⚠️ To mitigate these risks, the NVIDIA AI Red Team recommends several controls, including:
- **Network egress controls** to block unauthorized site access.
- **File write restrictions** to prevent unauthorized persistence and code execution.
- **Sandboxing techniques** to isolate development...
Source: Nvidia Developer Blog
Rich Harang
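A network egress control can be as simple as checking every outbound URL against an allowlist before an agent's tool is allowed to fetch it. The sketch below assumes such a check. The host list and the function name `egress_allowed` are hypothetical and are not taken from the NVIDIA AI Red Team guidance.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would load this from policy.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}

def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for HTTPS URLs whose exact hostname is allowlisted."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    # Exact match: a lookalike such as pypi.org.evil.example must not pass.
    return host in allowed_hosts

for candidate in (
    "https://pypi.org/simple/requests/",
    "https://pypi.org.evil.example/payload",
    "http://pypi.org/simple/requests/",
):
    print(candidate, egress_allowed(candidate))
```

An application-level check like this is only one layer. Enforcing the same policy at the network boundary keeps it effective even if the agent process itself is compromised.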
2026-01-28 17:00
🚀 NVIDIA Run:ai v2.24 introduces time-based fairshare for Kubernetes clusters, enhancing GPU resource allocation. This new scheduling mode addresses challenges in shared GPU systems by considering historical resource usage, ensuring fair access for teams with varying job sizes. Teams that frequently utilize resources receive lower scores, while those waiting get a boost. Time-based fairshare promotes balanced compute time over days and weeks, allowing for efficient resource planning and...
Source: Nvidia Developer Blog
Ekin Karabulut
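One way to picture time-based fairshare is a score that shrinks as a team's recent GPU usage grows, so teams that have been waiting move ahead of heavy recent users. The sketch below is a toy model only. The decay formula, the half-life, and all names are assumptions for illustration, not the Run:ai scheduler's actual scoring.

```python
def decayed_usage(gpu_hours_by_day, half_life_days=7.0):
    """Weight recent usage more: index 0 is today, index n is n days ago."""
    return sum(hours * 0.5 ** (age / half_life_days)
               for age, hours in enumerate(gpu_hours_by_day))

def fairshare_score(quota_weight, gpu_hours_by_day, half_life_days=7.0):
    """Higher score means higher scheduling priority (hypothetical formula)."""
    return quota_weight / (1.0 + decayed_usage(gpu_hours_by_day, half_life_days))

def pick_next(teams):
    """teams maps a name to (quota_weight, gpu_hours_by_day)."""
    return max(teams, key=lambda name: fairshare_score(*teams[name]))

teams = {
    "heavy-user": (1.0, [40.0, 40.0, 40.0]),  # busy for the last three days
    "waiting":    (1.0, [0.0, 0.0, 5.0]),     # small job two days ago
}

for name, (weight, history) in teams.items():
    print(name, round(fairshare_score(weight, history), 4))
print("next:", pick_next(teams))
```

With equal quota weights, the team with less recent usage is picked next, and old usage counts for less as it ages out.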