Nvidia-Developer-Blog | Daily Tech Articles Feed

Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer

2025-09-16 17:35

🚀 Deploying large language models (LLMs) can be challenging due to cold start delays, which hinder performance and scalability. 🖥️ The article discusses the NVIDIA Run:ai Model Streamer, an open-source SDK that reduces loading times by concurrently streaming model weights into GPU memory. 📊 Benchmark tests show significant improvements in cold start latency, especially in cloud environments, while maintaining compatibility with the Safetensors format. #AI #MachineLearning #NVIDIA #Inference...
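The idea of concurrent streaming can be illustrated with a minimal sketch. It is not the Run:ai Model Streamer API: `read_chunk` and `stream_file` are hypothetical names, and a `bytearray` stands in for a preallocated GPU buffer. It only shows several workers filling disjoint byte ranges of one destination buffer in parallel.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, length, dest):
    """Read one byte range of the file into its slot in the shared buffer."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    dest[offset:offset + len(data)] = data
    return len(data)


def stream_file(path, chunk_size=1 << 20, workers=4):
    """Load a file by reading fixed-size byte ranges concurrently."""
    size = os.path.getsize(path)
    dest = bytearray(size)  # stand-in for a preallocated device buffer (assumption)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(read_chunk, path, off, min(chunk_size, size - off), dest)
            for off in range(0, size, chunk_size)
        ]
        total = sum(f.result() for f in futures)
    if total != size:
        raise IOError("short read while streaming file")
    return bytes(dest)


if __name__ == "__main__":
    # Write a throwaway file and load it back with four concurrent readers.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(3 * (1 << 20) + 123))
    loaded = stream_file(tmp.name)
    print(len(loaded))
    os.remove(tmp.name)
```

Whether concurrency helps depends on the storage backend; the summary notes that the gains are largest in cloud environments.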

Source: Nvidia Developer Blog

Omer Dayan

Technical Deep Dives

What’s New in PyNvVideoCodec 2.0 for Python GPU-Accelerated Video Processing

2025-09-16 17:32

🚀 Exciting news for video processing developers! PyNvVideoCodec 2.0 is an upgraded NVIDIA library for GPU-accelerated video encoding, decoding, and transcoding using Python. This lightweight, easy-to-install library offers performance on par with the native SDK. It supports projects in video analytics, AI preprocessing, media transcoding, and real-time streaming, combining the speed of C++ with the ease of Python. Discover the enhanced features and performance improvements in this latest...

Source: Nvidia Developer Blog

Abhijit Patait

Product Announcements

Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200

2025-09-16 15:00

🚀 Autodesk Research has made strides in computational fluid dynamics (CFD) with its Accelerated Lattice Boltzmann (XLB) library. This open-source solver bridges the gap between traditional CAE and AI/ML ecosystems. By leveraging NVIDIA Warp and the GH200 Superchip, XLB achieves an ~8x speedup in performance, allowing for high-fidelity simulations at scale. This advancement demonstrates the potential of Python in high-performance scenarios. #CFD #AutodeskResearch #NVIDIAWarp...

Source: Nvidia Developer Blog

Mehdi Ataei

Technical Deep Dives

Build a Report Generator AI Agent with NVIDIA Nemotron on OpenRouter

2025-09-15 19:31

🚀 Discover how to build an AI report generator with NVIDIA Nemotron! This self-paced workshop covers essential topics including the four core considerations for AI agents, creating a document generation agent, and utilizing LangGraph and OpenRouter. Participants will have access to a portable development environment and can share their customized agents as NVIDIA Launchables. #AI #NVIDIA #MachineLearning #OpenSource #TechWorkshop

Source: Nvidia Developer Blog

Edward Li

Educational

New Open Source Qwen3-Next Models Preview Hybrid MoE Architecture Delivering Improved Accuracy and Accelerated Parallel Processing across NVIDIA Platform

2025-09-15 13:00

🚀 Alibaba has unveiled two new open-source models: Qwen3-Next 80B-A3B-Thinking and Qwen3-Next 80B-A3B-Instruct. These models feature a hybrid Mixture of Experts (MoE) architecture designed for improved efficiency and accuracy. 🔍 The Qwen3-Next-80B-A3B-Thinking model is now available on build.nvidia.com, allowing developers to explore its advanced reasoning capabilities. 💡 Of the 80 billion parameters, only a fraction is activated per token, optimizing processing for longer context lengths. The...

Source: Nvidia Developer Blog

Anu Srivastava

Product Announcements

Modeling Attacks on AI-Powered Apps with the AI Kill Chain Framework

2025-09-11 16:00

AI-powered applications are facing new security challenges that traditional models may not address. The AI Kill Chain framework, developed by NVIDIA, outlines how adversaries target these systems. This framework emphasizes the stages of an attack: recon, poison, hijack, persist, and impact. It aims to help defenders identify where they can intervene effectively. Learn more about the evolving landscape of AI security! 🔐💻🛡️ #AI #CyberSecurity #NVIDIA #AIKillChain #TechTrends

Source: Nvidia Developer Blog

Rich Harang

Security Compliance

How Quantization Aware Training Enables Low-Precision Accuracy Recovery

2025-09-11 15:00

Optimizing AI models for deployment involves various compression techniques. Post-training quantization (PTQ) is common, but quantization aware training (QAT) and quantization aware distillation (QAD) provide significant advantages. These methods prepare models for lower precision by simulating quantization effects, enhancing accuracy recovery. Learn more about these techniques and their impact on model performance! 📊🤖 #AI #Quantization #MachineLearning #ModelOptimization #TechTrends
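The phrase "simulating quantization effects" can be made concrete with a small sketch of a quantize-then-dequantize step, often called fake quantization. This is an illustration under simple assumptions (symmetric, per-tensor, signed integer grid), not the method from the article or any NVIDIA library; `fake_quantize` is a hypothetical helper, and a real QAT setup would apply it inside the training forward pass.

```python
def fake_quantize(values, num_bits=4):
    """Snap floats to a signed integer grid, then map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1            # largest integer level, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    levels = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    # Dequantize: the result carries the rounding error the model must tolerate.
    return [q * scale for q in levels]


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0]
    for original, simulated in zip(weights, fake_quantize(weights)):
        print(original, simulated)
```

Training against the dequantized values lets the model adapt to the rounding error before deployment, which is the accuracy-recovery effect the summary describes.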

Source: Nvidia Developer Blog

Eduardo Alvarez

Technical Deep Dives

Accelerate Protein Structure Inference Over 100x with NVIDIA RTX PRO 6000 Blackwell Server Edition

2025-09-10 16:48

Unlocking the future of protein structure analysis is now possible with the NVIDIA RTX PRO 6000 Blackwell Server Edition. This new GPU significantly accelerates protein structure inference, enhancing research efficiency and reducing costs for organizations. 🧬💻 With advancements from NVIDIA's Digital Biology Research labs, researchers can now utilize OpenFold for rapid analysis without sacrificing accuracy compared to AlphaFold2. Discover how this technology can transform large-scale protein...

Source: Nvidia Developer Blog

Kyle Tretina

Product Announcements

Deploy Scalable AI Inference with NVIDIA NIM Operator 3.0.0

2025-09-10 16:30

Unlock the potential of AI with NVIDIA NIM Operator 3.0.0! 🚀 This latest release enhances the deployment of NVIDIA NIM and NeMo microservices in Kubernetes environments, making it easier to manage complex AI inference pipelines. Key features include efficient resource utilization and seamless integration with existing infrastructures, including KServe. 🤖 Collaboration with Red Hat further streamlines NIM deployment, supporting model caching and trusted AI capabilities. #NVIDIA #AI #Kubernetes...

Source: Nvidia Developer Blog

Meenakshi Kaushik

Product Announcements

Developers Can Now Get CUDA Directly from Their Favorite Third-Party Platforms

2025-09-10 16:00

🚀 Developers can now access CUDA directly through popular third-party platforms, making application deployment easier. NVIDIA is collaborating with Canonical, CIQ, and others to simplify installation and maintain compatibility across various OS and package managers. This initiative helps streamline the integration of GPU support in applications like PyTorch and OpenCV. Key benefits include consistent CUDA naming, timely updates, and continued free access to CUDA. #NVIDIA #CUDA #DeveloperTools...

Source: Nvidia Developer Blog

Jonathan Bentz

Product Announcements

Maximizing Low-Latency Networking Performance for Financial Services with NVIDIA Rivermax and NEIO FastSocket

2025-09-10 16:00

Ultra-low latency and reliable packet delivery are essential in sectors like financial services, cloud gaming, and media. Delays or packet losses can lead to significant issues, including financial losses and poor user experiences. NVIDIA Rivermax offers a high-performance solution for these challenges. It utilizes GPU-accelerated technologies to ensure high throughput, low latency, and minimal CPU usage, making it ideal for demanding applications. Learn more about how Rivermax is...

Source: Nvidia Developer Blog

Simon Raviv

Technical Deep Dives

How to Connect Distributed Data Centers Into Large AI Factories with Scale-Across Networking

2025-09-09 17:00

AI scaling faces challenges due to physical limitations in data centers, such as power and cooling capacity. 🌐 Traditional long-haul Ethernet solutions can lead to high latency and unpredictable data delivery, which is problematic for AI workloads. NVIDIA's Spectrum-XGS Ethernet technology introduces scale-across networking, allowing multiple data centers to function as one large AI factory, enhancing performance for training and inference tasks. 🚀 #ArtificialIntelligence #DataCenters...

Source: Nvidia Developer Blog

Taylor Allison

Technical Deep Dives

NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

2025-09-09 15:00

🚀 NVIDIA's Blackwell Ultra architecture has made a significant impact in the latest MLPerf Inference v5.1 benchmarks. New models like DeepSeek-R1 and Llama 3.1 have set high performance standards, with impressive token processing speeds. The benchmarks highlight the growing need for advanced compute power as large language models evolve. NVIDIA continues to lead with record-breaking results across all tested scenarios. #NVIDIA #MLPerf #AI #MachineLearning #TechNews

Source: Nvidia Developer Blog

Ashwin Nanjappa

Industry Analysis

NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads

2025-09-09 15:00

NVIDIA is addressing the increasing complexity of AI inference with its new Rubin CPX GPU. This technology supports workloads requiring extensive context, like software development and long-form video generation. The NVIDIA SMART framework optimizes inference across various dimensions, allowing for better resource allocation. This disaggregated approach separates the context and generation phases, improving efficiency and reducing latency. Discover how NVIDIA is redefining AI infrastructure....

Source: Nvidia Developer Blog

Joe DeLaere

Product Announcements

How to Build AI Systems In House with Outerbounds and DGX Cloud Lepton

2025-09-08 16:00

Building production-grade AI systems involves managing numerous components. Companies are increasingly opting to develop in-house solutions for better security and compliance. Outerbounds offers a cloud-native platform that simplifies this process, utilizing open-source Metaflow for efficient orchestration. Key to success is leveraging NVIDIA DGX Cloud Lepton for GPU access, enabling scalable AI operations. Explore how to create customized AI products while navigating the complex GPU cloud...

Source: Nvidia Developer Blog

Ville Tuulos

Educational

Register for the Global Webinar: How to Prepare for NVIDIA Generative AI Certification

2025-09-07 15:00

🌍 Join the global webinar on October 7 to learn how to prepare for the NVIDIA Generative AI Certification exams. Get insights into the new professional level certification and tips for success. Don't miss this opportunity to enhance your skills! #NVIDIA #GenerativeAI #Webinar #Certification #ProfessionalDevelopment

Source: Nvidia Developer Blog

Shara Tibken

Event

Just Released: NVIDIA PhysicsNeMo 25.08

2025-09-05 17:37

🚀 Exciting news from NVIDIA! The latest release of PhysicsNeMo 25.08 introduces powerful workflows and recipes specifically designed for CAE application developers. This update aims to enhance simulations and streamline development processes. Explore the new features and boost your CAE projects with NVIDIA's advanced tools! #NVIDIA #PhysicsNeMo #CAE #TechUpdates #Simulation

Source: Nvidia Developer Blog

Bhoomi Gadhia

Product Announcements

Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing

2025-09-05 17:24

Large Language Models (LLMs) like Llama 3 70B and Llama 4 Scout 109B are pushing AI boundaries but pose memory challenges for inference efficiency. In half precision (FP16), the weights alone need around 140 GB for Llama 3 70B and about 218 GB for Llama 4 Scout 109B. The key-value (KV) cache also demands additional memory as context and batch sizes increase. NVIDIA's Grace Hopper and Grace Blackwell architectures use NVLink-C2C, allowing CPU-GPU memory sharing. This innovation enhances data access and...
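The memory figures above follow from simple arithmetic, sketched here. The weight estimate assumes 2 bytes per parameter (half precision). The KV cache function uses the common formula of one key and one value tensor per layer; the layer count, KV head count, and head size in the usage lines are illustrative assumptions, not values taken from the article.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights in GB (10^9 bytes), 2 bytes per parameter by default."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV cache size in GB: keys and values for every layer, token, and sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9


if __name__ == "__main__":
    print(weight_memory_gb(70e9))    # Llama 3 70B weights
    print(weight_memory_gb(109e9))   # Llama 4 Scout 109B weights
    # Hypothetical configuration: 80 layers, 8 KV heads of size 128, 8,192 tokens.
    print(kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=8192, batch=1))
```

Because the KV cache term scales linearly with both sequence length and batch size, it can outgrow GPU memory even when the weights fit.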

Source: Nvidia Developer Blog

Afroze Syed

Technical Deep Dives

Accelerate Autonomous Vehicle Development with the NVIDIA DRIVE AGX Thor Developer Kit

2025-09-03 17:30

🚗🔍 The NVIDIA DRIVE AGX Thor Developer Kit is now available, enhancing the development of autonomous vehicle technology. This platform supports advanced AI models for better perception and decision-making, enabling a comprehensive in-vehicle experience. With powerful Blackwell GPUs and next-gen Arm CPUs, it meets high safety and security standards. The DRIVE AGX Thor is designed to empower automotive OEMs and developers in scaling performance and efficiency for future demands. #NVIDIA...

Source: Nvidia Developer Blog

Abhinaw Priyadershi

Product Announcements

How to Run AI-Powered CAE Simulations

2025-09-03 16:09

🚀 In modern engineering, accelerated simulations are crucial for innovation. Computer-aided engineering (CAE) helps design reliable products by verifying performance and safety. Traditional simulations take time, often hindering exploration of design options. Physics-based AI models serve as surrogates, predicting outcomes in seconds or minutes, thus enhancing the design process. This article outlines a modular workflow for automotive aerodynamics, leveraging NVIDIA technologies. It covers...

Source: Nvidia Developer Blog

Abouzar Ghasemi

Educational

North–South Networks: The Key to Faster Enterprise AI Workloads

2025-09-03 15:04

In the realm of AI infrastructure, data movement is crucial for performance. As enterprises adopt advanced AI systems, they face challenges in quickly and reliably moving data. NVIDIA’s Enterprise Reference Architectures (RAs) provide guidance on optimizing north-south networks, essential for tasks like model loading and inference queries. By utilizing NVIDIA Spectrum-X Ethernet, organizations can enhance data flow, particularly for data-intensive AI applications. Legacy networks often...

Source: Nvidia Developer Blog

Shashank Sabhlok

Technical Deep Dives

Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap

2025-09-02 18:44

Deploying large language models (LLMs) at scale involves balancing fast responsiveness and GPU costs. Organizations often face tough choices: over-provisioning GPUs or risking user experience with latency spikes. NVIDIA's GPU memory swap, or model hot-swapping, offers a solution. This innovation allows multiple models to share GPUs, dynamically offloading inactive models to CPU memory, enabling rapid activation when needed. Benchmark tests show promising results with lower costs and improved...

Source: Nvidia Developer Blog

Ekin Karabulut

Technical Deep Dives

Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2

2025-09-02 17:00

🚀 Selecting the optimal GEMM kernel for specific hardware is challenging due to the many performance-determining parameters. NVIDIA introduces nvMatmulHeuristics to enhance the process. This module identifies a small set of top-performing kernel configurations, simplifying the tuning workflow and saving time. ⏱️ With nvMatmulHeuristics and CUTLASS 4.2, users can quickly generate and auto-tune kernels, leading to faster model compilation and better performance. #NVIDIA #GEMM #CUDA...

Source: Nvidia Developer Blog

Harrison Barclay

Technical Deep Dives

What’s New in CUDA Toolkit 13.0 for Jetson Thor: Unified Arm Ecosystem and More

2025-09-02 16:00

🚀 Exciting advancements are on the horizon with CUDA Toolkit 13.0 for Jetson Thor! This release introduces a unified toolkit for Arm platforms, eliminating the need for separate installations. Developers can build applications once and deploy them seamlessly across various systems. Enhanced features like Unified Virtual Memory and improved developer tools streamline workflows and enhance performance for edge AI applications. #NVIDIA #CUDA #JetsonThor #EdgeComputing #AI

Source: Nvidia Developer Blog

Rekha Mukund

Product Announcements

How Small Language Models Are Key to Scalable Agentic AI

2025-08-29 18:00

The rise of agentic AI is transforming how businesses approach automation and productivity. 🤖 Recent insights highlight the potential of small language models (SLMs) as efficient alternatives to large language models (LLMs) in agentic applications. SLMs can reduce costs and improve operational flexibility while maintaining performance. This shift enables enterprises to utilize SLMs for specific tasks, reserving LLMs for more complex scenarios. Tools like NVIDIA’s Nemotron demonstrate the...

Source: Nvidia Developer Blog

Peter Belcak

Industry Analysis

Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training

2025-08-29 14:47

OpenAI's gpt-oss model has made waves in the AI community with its innovative architecture and performance capabilities. 📈🧠 It features a mixture-of-experts architecture and a 128K context length, competing closely with OpenAI's closed-source models. However, deploying foundation models like gpt-oss in critical fields requires careful fine-tuning. The article discusses employing Supervised Fine-Tuning (SFT) and Quantization-Aware Training (QAT) to enhance model accuracy while maintaining...

Source: Nvidia Developer Blog

Eduardo Alvarez

Technical Deep Dives

Getting Started with NVIDIA Isaac for Healthcare Using the Telesurgery Workflow

2025-08-28 16:00

🚀 Telesurgery is transforming healthcare delivery as the shortage of surgeons rises. With advancements in 5G and AI, experts can now operate remotely, shifting from experimental to essential. 🌍 NVIDIA Isaac for Healthcare offers a modular workflow that includes video streaming, robot control, and simulation tools. This enables seamless training and clinical deployment. Learn how this technology is paving the way for the next generation of surgical robotics. 🤖💡 #Telesurgery...

Source: Nvidia Developer Blog

Michael Zephyr

Educational

How to Improve CUDA Kernel Performance with Shared Memory Register Spilling

2025-08-27 16:30

🚀 New in CUDA Toolkit 13.0: Shared Memory Register Spilling! This feature helps improve CUDA kernel performance by allowing the compiler to spill registers to shared memory instead of local memory. This reduces spill latency and L2 pressure for register-heavy kernels. To enable shared memory spilling, add the enable_smem_spilling PTX pragma via inline assembly inside your kernel definition. With this optimization, kernels can perform better, especially in critical regions where registers are heavily used. Learn more about how...

Source: Nvidia Developer Blog

Divya Shanmughan

Technical Deep Dives

How to Scale Your LangGraph Agents in Production From A Single User to 1,000 Coworkers

2025-08-27 16:00

📈 Scaling your AI agent for production use? This article explores the deployment of a deep-research agent built with the AI-Q NVIDIA Blueprint and outlines how NVIDIA tackled the challenge of sharing the agent with up to 1,000 coworkers. The focus was on using the NeMo Agent Toolkit to ensure scalability and security while accessing internal data. It details the architecture that supports document processing and web search capabilities. Learn more about the techniques...

Source: Nvidia Developer Blog

Sean Lopp

Educational

How Industry Collaboration Fosters NVIDIA Co-Packaged Optics

2025-08-26 17:00

NVIDIA is transforming data-center connectivity by merging optical and electrical components through strong industry partnerships. 🤝 Their networking platform integrates advanced technologies from top partners, focusing on scalable and efficient optical systems. Key innovations include the Micro Ring Modulator, allowing high data throughput with a compact design. Collaboration with TSMC has addressed manufacturing challenges, ensuring reliable performance essential for modern data centers....

Source: Nvidia Developer Blog

Ashkan Seyedi

Industry Analysis

NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit

2025-08-25 17:59

🚀 NVIDIA has introduced NVFP4, a 4-bit format designed to enhance AI workloads during pretraining of large language models (LLMs). This innovation aims to improve training efficiency and throughput while maintaining accuracy. The shift from higher precision formats to 4-bit is set to redefine scalability in AI development. Collaboration with major organizations like Google Cloud and OpenAI is ongoing to explore this technology's full potential. #AI #NVIDIA #MachineLearning #LLMs #Innovation

Source: Nvidia Developer Blog

Kirthi Devleker

Product Announcements

Introducing NVIDIA Jetson Thor, the Ultimate Platform for Physical AI

2025-08-25 17:57

🚀 Robotics is evolving! The shift from specialist machines to adaptable robots marks a new era in generalist robotics. These robots are designed to learn and perform various tasks, enhancing efficiency across industries. With NVIDIA's Jetson Thor platform, developers can create flexible robots that streamline operations without constant reprogramming. Key components include hardware integration, real-time control, perception, and high-level reasoning to facilitate complex interactions....

Source: Nvidia Developer Blog

Shashank Maheshwari

Product Announcements

How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows

2025-08-22 19:54

Are you facing slow data loads and memory issues in your pandas workflows? 🐍💻 This article highlights five common performance bottlenecks in pandas, including slow CSV parsing and memory-intensive joins. It offers practical solutions to improve your workflow efficiency, such as using the PyArrow engine for faster CSV reads and exploring the cudf.pandas library for GPU acceleration. Don't have a GPU? You can use cudf.pandas for free in Google Colab! 🚀📊 #DataScience #Python #Pandas #Performance...
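The PyArrow suggestion above amounts to one extra argument to `pandas.read_csv`. The sketch below assumes pandas is installed and treats pyarrow as optional; `read_csv_fast` is a hypothetical wrapper name, and the CSV contents are made up for the demonstration.

```python
import os
import tempfile

import pandas as pd


def read_csv_fast(path):
    """Read a CSV with the PyArrow engine, falling back to the default parser."""
    try:
        return pd.read_csv(path, engine="pyarrow")
    except ImportError:
        # pyarrow is an optional dependency; the default engine still works.
        return pd.read_csv(path)


if __name__ == "__main__":
    # Write a tiny CSV so the example runs without external data.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        tmp.write("id,value\n1,0.5\n2,1.5\n3,2.5\n")
    frame = read_csv_fast(tmp.name)
    print(frame.shape)
    os.remove(tmp.name)
```

For the GPU route, cudf.pandas is typically enabled without code changes, for example with `%load_ext cudf.pandas` in a notebook or `python -m cudf.pandas script.py` on the command line, assuming a RAPIDS installation and a supported GPU.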

Source: Nvidia Developer Blog

Jamil Semaan

Educational

Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era

2025-08-22 17:58

Introducing the NVIDIA Blackwell Ultra GPU, the latest member of the Blackwell architecture family, built for AI training and reasoning. Key features include a dual-reticle design with 208 billion transistors, high bandwidth, and energy-efficient performance, providing significant scalability for AI tasks. With up to 15 petaFLOPS of dense NVFP4 compute and improved memory access, the Blackwell Ultra sets a new standard for accelerated computing. #NVIDIA #AI #BlackwellUltra...

Source: Nvidia Developer Blog

Kyle Aubrey

Technical Deep Dives

NVIDIA Hardware Innovations and Open Source Contributions Are Shaping AI

2025-08-22 15:00

NVIDIA is supporting AI progress around open source models like Cosmos, DeepSeek, and Llama. 🌐 These models offer free access to AI methodologies, enabling innovation across the globe. NVIDIA's Blackwell GPU architecture enhances AI performance with advanced features like NVFP4 and high-bandwidth interconnects. ⚡️ Additionally, NVIDIA provides a wealth of open source tools and libraries, fostering an environment for developers to build and scale AI efficiently. 💻 Discover more about these...

Source: Nvidia Developer Blog

George Chellapa

Industry Analysis

Less Coding, More Science: Simplify Ocean Modeling on GPUs With OpenACC and Unified Memory

2025-08-21 16:53

🚀 Exciting advancements in ocean modeling are here! NVIDIA HPC SDK v25.7 simplifies GPU programming for high-performance computing applications. This update automates data movement between CPU and GPU, reducing manual management and enhancing developer productivity. Notable systems like the NVIDIA GH200 Grace Hopper Superchip are leading the way. With unified memory programming, developers can focus more on science and less on coding complexities. This change is already benefiting projects,...

Source: Nvidia Developer Blog

Anastasia Stulova

Product Announcements

Improve Data Integrity and Security with Accelerated Hash Functions and Merkle Trees in cuPQC 0.4

2025-08-21 15:00

🔒 As data sizes grow, ensuring security and integrity is vital. The cuPQC SDK v0.4 offers advanced cryptographic techniques, including inclusion proofs and digital signatures, to enhance data protection. New features include expanded hash function support and efficient Merkle tree calculations, improving performance in data verification. 🌳 Discover how these updates can benefit your cryptographic tasks! #DataIntegrity #Cryptography #cuPQC #MerkleTrees #CyberSecurity
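How a Merkle tree supports inclusion proofs can be shown with a small CPU sketch. It uses SHA-256 from Python's standard library and is not the cuPQC API; the function names and the choice to duplicate the last node on odd-sized levels are assumptions made for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from hashed leaves up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                    # pad odd levels with the last node
            level = level + [level[-1]]
            levels[-1] = level
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Collect the sibling hash at each level on the path from a leaf to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1                   # neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from one leaf and its proof, then compare."""
    node = _h(leaf)
    for sibling, sibling_is_left in proof:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == root


if __name__ == "__main__":
    blocks = [b"block-0", b"block-1", b"block-2"]
    levels = build_levels(blocks)
    root = levels[-1][0]
    print(verify(blocks[2], inclusion_proof(levels, 2), root))
```

The proof contains one hash per tree level, so verifying a single item touches a logarithmic number of hashes instead of the whole data set.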

Source: Nvidia Developer Blog

Yarkin Doroz

Technical Deep Dives
