Video: "Hermes AI Agent + Nemotron 3 Nano Omni (FREE + OpenSource!)" by Julian Goldie on YouTube.
What Nemotron 3 Nano Omni actually is
NVIDIA released Nemotron 3 Nano Omni as a free, open-weight model built specifically to handle multiple types of input in one go. Text, audio, images, and video all go through the same model — you do not need to chain separate specialist models together. NVIDIA claim it runs nine times more efficiently than comparable multimodal models, which in practice means you can run it on consumer-grade hardware rather than needing a data-centre GPU.
The "Nano" in the name is worth noting. It is a smaller model than the full Nemotron family, optimised for inference speed and lower memory footprint. That trade-off means it handles practical tasks well — transcription, image description, video summarisation — but is not competing with GPT or Claude on complex reasoning.
How the Hermes integration works
Hermes Agent supports pluggable model providers. You point it at Nemotron running locally via an Ollama-compatible server, and from that point Hermes uses Nemotron as its intelligence layer. The agent still handles memory, skills, tool calls, and multi-step task management. Nemotron handles the actual understanding of whatever media you feed in.
Julian's demo showed a workflow where Hermes took a meeting recording, passed it to Nemotron for transcription and summary, pulled out action items, and filed them into a structured note. The whole thing ran on a local machine, offline, at no API cost. That is genuinely useful for businesses that deal with sensitive meeting content they would rather not send to a cloud.
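The "pull out action items" step in that workflow reduces to parsing the model's summary. A minimal sketch, assuming the prompt asked the model to prefix each action item with "ACTION:" (a made-up convention for this example, not a Hermes built-in):

```python
import re

def extract_action_items(summary: str) -> list[str]:
    """Pull action items out of a model-generated meeting summary.

    Relies on the prompt instructing the model to mark each item
    with an 'ACTION:' prefix on its own line.
    """
    return [m.group(1).strip() for m in re.finditer(r"ACTION:\s*(.+)", summary)]

summary = """\
The team agreed the Q3 launch slips by two weeks.
ACTION: Dana to update the launch timeline by Friday.
ACTION: Sam to email the client about the new date.
"""
print(extract_action_items(summary))
# → ['Dana to update the launch timeline by Friday.', 'Sam to email the client about the new date.']
```

Keeping the extraction in plain code rather than asking the model to file the note directly makes the pipeline cheaper to re-run and easier to audit.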
Where this is actually useful
The obvious use case is any business that produces a lot of mixed media: agencies recording client calls, consultants with video interviews, teams with hours of product demos to work through. You can hand the files to Hermes and get back structured summaries, extracted quotes, or tagged action items without manual review.
Document processing is another solid fit. Nemotron can read PDFs, extract data from images of forms or invoices, and pass the structured output to Hermes for further processing. Done locally, this sidesteps the data-handling questions that come up whenever you pipe sensitive documents through a third-party API.
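For the structured-output handoff, it helps to parse the model's reply defensively before Hermes acts on it. A sketch under two assumptions: the prompt asked for JSON with specific fields (the names `vendor`, `total`, `due_date` here are illustrative, not a fixed Nemotron schema), and the model may wrap its JSON in a markdown fence, which local models often do:

```python
import json
import re

def parse_invoice_reply(reply: str) -> dict:
    """Parse the JSON a model returns for an invoice image.

    Strips an optional markdown code fence, then validates that the
    fields the prompt asked for are actually present.
    """
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())
    data = json.loads(cleaned)
    required = {"vendor", "total", "due_date"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"model reply missing fields: {sorted(missing)}")
    return data

reply = '```json\n{"vendor": "Acme Ltd", "total": 1249.50, "due_date": "2025-02-28"}\n```'
print(parse_invoice_reply(reply)["vendor"])
# → Acme Ltd
```

Failing loudly on missing fields is the point: a local pipeline has no vendor-side validation, so the guard rails have to live in your own glue code.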
The setup and the limits
Running this locally is not plug-and-play. You need a machine with a decent discrete GPU — NVIDIA obviously, given the model — and a working Ollama setup. On an older GPU, inference on large media files will be slow. It is fine for batch processing overnight; less comfortable if you need a fast turnaround during a live meeting.
In practice the quality on complex reasoning tasks sits below Claude or GPT. For straight transcription and summarisation it is more than adequate. The free-and-local story is compelling enough that many businesses will find the quality acceptable for the tasks that matter most.
Where this connects to NordSys
Setting up a local AI agent stack — Hermes, a model like Nemotron, memory, and tools — takes more than watching a tutorial. The configuration needs to match your actual workflows, your hardware, and your data-handling requirements. We work with clients to build and maintain exactly these setups. Our AI Agents service covers the full build.
See our AI Agents service →