DeepSeek V4 Can Now See — The Last Pure-Text Top Model Finally Catches Up

Just dropped 1M context, then immediately added image mode

DeepSeek’s update pace is honestly unreasonable.

V4 with its 1M context window barely had time to settle in the community before image mode quietly appeared. No press conference, no PR blast — a researcher posted a message on social media, deleted it, and the feature showed up in the app.

Classic DeepSeek.

Not OCR. It actually understood.

The test was simple: upload a photo of Guilin’s Elephant Trunk Hill with zero text on it.

DeepSeek V4 named the landmark, described its distinctive shape, and inferred where it is located.

This isn’t “there’s text in the image, let me read it for you.” With no text to read, the model had to recognize the scene itself and connect it to what it already knows about the place.

Put simply: the last major Chinese LLM without vision support has finally closed that gap.

Why Didn’t It Have This Before?

DeepSeek took a different path from the start.

Tongyi Qianwen, ERNIE, Kimi, and Zhipu GLM all added multimodal input early on. DeepSeek focused its energy on text reasoning and coding, pushing a pure-text model into the top tier.

That choice was controversial at the time. Many felt that not supporting images in 2025 meant the model was “crippled.” But DeepSeek’s logic might have been: max out text capability first, add vision incrementally.

Looking back, that strategy worked. V4’s text prowess is proven across multiple benchmarks, and image mode removes the last obvious gap.

The Benefits of Incremental Multimodal

DeepSeek didn’t build a multimodal model from scratch. It added a visual encoder on top of the existing architecture, which brings three benefits.

Unified experience. No need to switch products or modes — text and images in the same dialog box.

Faster iteration. No need to wait for V5 — existing architecture extends to new capabilities.

Better cost control. Incremental training costs far less than training a multimodal model from zero.

Of course, this incremental approach may have limits: complex visual reasoning tasks might need more iterations to match dedicated multimodal models. But the direction is right.
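
Nothing above says how the visual encoder is actually attached, so the following is only a toy sketch of one common adapter-style pattern, not DeepSeek's design. It assumes a frozen text model (represented here by an embedding table), a vision encoder that emits one feature vector per image patch, and a small projector that maps those features into the text model's embedding space. All names and sizes (`TEXT_DIM`, `VISION_DIM`, `project`, `build_input_sequence`) are hypothetical and chosen for illustration.

```python
import random

# All sizes and names are illustrative assumptions, not DeepSeek's.
TEXT_DIM = 8      # hypothetical width of the existing text model's embeddings
VISION_DIM = 4    # hypothetical width of the vision encoder's patch features
VOCAB_SIZE = 100  # hypothetical vocabulary size


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(patch_features, projector):
    """Map each patch vector (VISION_DIM) into text embedding space (TEXT_DIM)."""
    columns = list(zip(*projector))  # TEXT_DIM columns, each of length VISION_DIM
    return [
        [sum(f * w for f, w in zip(patch, column)) for column in columns]
        for patch in patch_features
    ]


def build_input_sequence(image_patches, text_token_ids, embedding_table, projector):
    """Place projected image patches in front of the embedded text tokens."""
    image_tokens = project(image_patches, projector)
    text_tokens = [embedding_table[t] for t in text_token_ids]
    return image_tokens + text_tokens


rng = random.Random(0)
embedding_table = make_matrix(VOCAB_SIZE, TEXT_DIM, rng)  # stands in for the frozen text model
projector = make_matrix(VISION_DIM, TEXT_DIM, rng)        # the only new weights in this sketch
patches = make_matrix(3, VISION_DIM, rng)                 # stands in for vision encoder output

sequence = build_input_sequence(patches, [5, 17, 42], embedding_table, projector)
new_params = VISION_DIM * TEXT_DIM
frozen_params = VOCAB_SIZE * TEXT_DIM
print(len(sequence), len(sequence[0]), new_params, frozen_params)
```

In this toy setup the text model sees one sequence of six vectors (three image patches, three text tokens), which is the "same dialog box" idea in miniature, and the newly introduced weights are a small fraction of what already exists. Real systems differ in many ways, including whether the base model is later fine-tuned.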

Still in Limited Rollout

Image mode is currently in a limited, staged rollout (a "gray release"), so some users may not see the entry point yet. The official recommendation is to upgrade the app to the latest version.

If you already see the “Image Mode” icon in your app — congratulations, your DeepSeek V4 just unlocked its final piece.
