Video: "Deepseek v4: Best Opensource Model Ever? (Fully Tested)" by Julian Goldie on YouTube.
What's in the release
DeepSeek V4 comes in two flavours. V4 Pro is a Mixture-of-Experts model with 1.6 trillion parameters, 49 billion of which are active on any given request. V4 Flash is the smaller option: 284 billion parameters, 13 billion active, designed for speed and lower cost. Both support a 1 million token context window — roughly 750,000 words of input in a single prompt.
Both models are open-source, meaning you can run them yourself rather than paying per token to a third party. The API is free at the standard tier. That combination — large context, open weights, no per-token charge — is genuinely unusual for a model at this performance level.
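If you do use the hosted API, DeepSeek's existing endpoint is OpenAI-compatible, so the standard client works against it. A minimal sketch; the model identifier below is an assumption, since the video doesn't name V4's API string:

```python
# Minimal sketch of calling DeepSeek through its OpenAI-compatible API.
# The model name follows DeepSeek's existing naming convention; the exact
# V4 identifier may differ, so check the docs before relying on it.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # from your DeepSeek account
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",                # assumed identifier; verify for V4
    messages=[{"role": "user", "content": "Summarise this contract clause: ..."}],
)
print(response.choices[0].message.content)
```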
DeepSeek is a Chinese AI company, which matters to some organisations for data sovereignty and supply-chain reasons. Worth considering if you're working with anything sensitive.
Pro vs Flash — which to reach for
The distinction is practical. Use Pro when the task involves serious reasoning, complex coding, research synthesis, or long documents. Use Flash when you need fast turnaround, repeated API calls, or lightweight agent workflows where speed matters more than depth.
In practice, most automation workflows are best served by Flash for the high-volume, low-stakes steps, with Pro reserved for the bits that actually need to think. Running everything through Pro on a busy workflow adds up in time and, if you're on a hosted plan rather than self-hosting, in cost.
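As a rough sketch of that split, a pipeline can map each step to a model tier and route accordingly. The step names and model identifiers below are placeholders rather than anything DeepSeek publishes:

```python
# Hypothetical step-to-model routing for an automation pipeline.
# Model identifiers are placeholders -- substitute whatever your
# provider or self-hosted server actually exposes.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

MODEL_FOR_STEP = {
    "extract": "deepseek-v4-flash",    # high-volume, low-stakes: use Flash
    "classify": "deepseek-v4-flash",
    "synthesise": "deepseek-v4-pro",   # reasoning-heavy: pay for Pro
}

def run_step(step: str, prompt: str) -> str:
    """Route a pipeline step to the cheapest model that can handle it."""
    response = client.chat.completions.create(
        model=MODEL_FOR_STEP[step],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The routing table being a single dictionary is the point: promoting a step from Flash to Pro once its outputs start to matter is a one-line change.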
The practical test results
On the benchmark side, DeepSeek V4 looks competitive with GPT-5.5 and ahead of earlier Claude and Gemini versions on coding, world knowledge, and long-context tasks. That's impressive, and the benchmark numbers are not marketing fiction.
The real-world test told a different story on one specific task. When Julian asked the model to produce a landing page design from a brief, the output was functional: the HTML worked and the structure was sensible, but the visual design felt dated. Compared side by side, GPT-5.5 produced something more modern, more complete, and with better visual judgement. For pure coding tasks the gap was narrower; for anything requiring aesthetic output, the gap showed.
That said, "functional but visually dated" is fine for internal tools, prototypes, and anything where you're styling it yourself. If you're building a customer-facing product and design quality matters, factor in a design pass from someone who cares about that.
What makes this interesting for open-source users
The 1 million token context window is the standout feature for practical use. It means you can feed an entire codebase — or a large document corpus — into a single prompt and get coherent analysis back. Most models cap out well below that, which forces you to chunk and summarise rather than just handing it everything at once.
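To make that concrete, here's a minimal sketch of packing a project into one prompt under a token budget. It uses a crude four-characters-per-token heuristic rather than a real tokenizer, and the paths and file suffixes are illustrative:

```python
# Concatenate a codebase into a single prompt, staying under a token budget.
# The chars-per-token ratio is a rough heuristic; a real tokenizer is more
# accurate, so treat the budget check here as approximate.
from pathlib import Path

TOKEN_BUDGET = 900_000   # leave headroom under the 1M-token window
CHARS_PER_TOKEN = 4      # rough heuristic; varies by language and content

def pack_codebase(root: str, suffixes=(".py", ".md")) -> str:
    """Concatenate matching files under root into one labelled prompt string."""
    parts, chars = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        if (chars + len(text)) / CHARS_PER_TOKEN > TOKEN_BUDGET:
            break        # budget reached: stop rather than truncate mid-file
        parts.append(f"### {path}\n{text}")
        chars += len(text)
    return "\n\n".join(parts)

prompt = pack_codebase("./my_project") + "\n\nExplain how the modules fit together."
```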
For businesses building AI-assisted automation — scraping pipelines, document processors, code review bots — a model you can run locally with that context size is genuinely useful. Self-hosting means no API costs, no data leaving your network, and no dependency on a third party's pricing changes. The trade-off is infrastructure: you need the hardware to run it, and DeepSeek V4 Pro is large enough that consumer-grade machines won't cut it.
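On the self-hosting side, most local inference servers (vLLM, llama.cpp's server, Ollama) expose an OpenAI-compatible endpoint, so swapping a hosted API for your own hardware is mostly a change of base URL. A sketch, with the address and model name as placeholders:

```python
# Same client, self-hosted backend: with an OpenAI-compatible local server,
# only the connection details change. Address and model name below are
# placeholders for whatever your deployment actually registers.
from openai import OpenAI

local = OpenAI(
    api_key="unused",                     # local servers typically ignore the key
    base_url="http://localhost:8000/v1",  # placeholder: your server's address
)

# The call itself is identical to the hosted version; `model` takes whatever
# name your server registers for the weights you loaded.
reply = local.chat.completions.create(
    model="deepseek-v4-flash",            # placeholder model name
    messages=[{"role": "user", "content": "Review this diff for obvious bugs: ..."}],
)
print(reply.choices[0].message.content)
```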
Worth knowing: DeepSeek also has an "expert mode" that improves performance on structured reasoning tasks. If you're using it for SEO planning, topic clustering, or content hierarchy work, turning that on makes the output noticeably more useful.
Where this connects to NordSys
Open-source models like DeepSeek V4 are increasingly viable as the backbone of custom automation builds — document processors, internal dashboards, batch analysis jobs — particularly for businesses that want to keep costs predictable and data in-house. The tricky part is the setup: choosing the right model size, provisioning the hardware or managed hosting, and wiring it into your actual workflow. That's the kind of build we do on our programming service. If you've got a process that's currently manual and expensive in time, it's worth a conversation about whether an open model can handle it.
See our programming service →