Tech
Three layers, signal to application
01 / Signal
Early Fusion multimodal: understanding you, not just what you said
Mainstream multimodal AI uses Late Fusion: each modality is compressed to text first, then concatenated. Temporal relations, intensity contrasts, "heart rate rose 3 seconds after they said that" — all lost. Conclusions are coarse, safe, identical for everyone.
Late Fusion (mainstream)
audio  ──→ ASR     ──→ "boss called me over"     ─┐
HR     ──→ numeric ──→ "HR rose from 75 to 110"  ─┼──→ LLM concat:
vision ──→ caption ──→ "person in a suit"        ─┘    "you seem to have
                                                        something at work"
Early Fusion (Yunjue)
audio  ┐
HR     ┤
IMU    ┼──→ aligned at raw signal layer → multimodal LLM
vision ┤    "you got nervous the moment your boss called you over"
chat   ┘
The model builds a causal chain at the raw-signal layer, for example "HR rose 3 seconds before that sentence was spoken." Conclusions become precise to the event, the moment, the individual.
Why mainstream models can't
Not unwilling — there's no data
Mainstream models cover: video → text (VL), audio → text (ASR / gpt-audio), image → text (most have it natively).
But for the combination of HR (dense numeric) + IMU (dense numeric) + HRV (sparse) + audio (continuous waveform) + image (sparse frames) + personal profile (text) + knowledge graph, there is no public-internet training data at scale, because only people wearing always-on multimodal capture devices generate it.
This is Yunjue's flank: a new training arena that bypasses the frontal battle with mainstream models. Today we use the alignment capabilities of mainstream multimodal models to collect dense "unusual modalities → human behavior outcome" labels; long-term, we train our own Human-Centric World Model.
Early Fusion boundary
The things Late Fusion can never catch, Early Fusion can
All three capabilities below require heart rate, vision, voiceprint, and dialogue to be precisely aligned at the raw-signal layer. Compress each modality to text first, and this information is gone.
Self-report vs. body
How you describe your state in words and what your heart rate, HRV, and gait actually reveal often diverge. Yunjue aligns both streams across the same window and surfaces the gap pure language never catches.
Same activity, across days
The same sit-and-type, the same meeting, the same commute can produce very different physiological curves. The multimodal timeline turns "today vs. baseline" into a measurable quantity.
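One way to make "today vs. baseline" concrete is a z-score: how many standard deviations today's reading sits from the mean of earlier days for the same activity. The sketch below is a minimal illustration under that assumption, not Yunjue's actual metric; the function name `baseline_deviation` and the heart-rate values are hypothetical.

```python
from statistics import mean, stdev


def baseline_deviation(today: float, history: list[float]) -> float:
    """Return how many sample standard deviations `today` sits from the history mean."""
    if len(history) < 2:
        raise ValueError("need at least two baseline days")
    spread = stdev(history)
    if spread == 0:
        raise ValueError("baseline has no variation, so no scale to measure against")
    return (today - mean(history)) / spread


# Hypothetical mean heart rate (bpm) during the same recurring meeting on earlier days.
meeting_history = [72.0, 74.0, 76.0, 74.0, 74.0]

# Today's hypothetical reading for that meeting.
z = baseline_deviation(88.0, meeting_history)
print(f"today is {z:.1f} standard deviations above baseline")
```

A per-activity baseline matters here: comparing a meeting only against earlier meetings keeps a commute or a workout from distorting the reference.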
Multimodal causal chains
"Heart rate rose 3 seconds before that sentence was spoken" — heart rate, vision, voiceprint shifts, and dialogue aligned at the raw-signal layer construct causal chains language alone never produces.
These are Early Fusion's current capabilities — and the entry point for accumulating "human behavior outcome" labels that will form the training ground truth of the future Human-Centric World Model.
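The "3 seconds before" ordering can be illustrated with a small timing sketch. It assumes heart-rate samples and the utterance timestamp share one clock, and it uses a simple fixed-threshold onset rule; `Sample`, `rise_onset`, `lead_seconds`, the 10 bpm threshold, and the data are all hypothetical. It measures temporal ordering only, which is an input to a causal chain rather than proof of causation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    t: float    # seconds since the start of the window
    bpm: float  # heart rate at time t


def rise_onset(samples: list[Sample], baseline: float, threshold: float = 10.0) -> Optional[float]:
    """Return the time of the first sample at or above baseline + threshold, else None."""
    for s in samples:
        if s.bpm >= baseline + threshold:
            return s.t
    return None


def lead_seconds(samples: list[Sample], baseline: float, utterance_start: float) -> Optional[float]:
    """Positive result: heart rate rose before the utterance began."""
    onset = rise_onset(samples, baseline)
    return None if onset is None else utterance_start - onset


# Hypothetical 1 Hz heart-rate trace; the sentence starts at t = 8 s on the same clock.
bpm_trace = [75, 75, 76, 75, 77, 86, 95, 104, 110, 110, 109]
samples = [Sample(t=float(i), bpm=float(b)) for i, b in enumerate(bpm_trace)]
lead = lead_seconds(samples, baseline=75.0, utterance_start=8.0)
print(lead)
```

Once each stream is reduced to a text summary, the per-second timestamps this comparison relies on are no longer available.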
02 / System
Self-Evolving Agent: smarter the longer you use it
5 benchmarks · 3 SOTA · 2 runner-up
Across HLE, DeepSearchQA, FinSearchComp, xBench-ScienceQA, and xBench-DeepSearch — measured against GPT-5.2 Pro, GPT-5 Pro, Gemini 3 Pro, Claude 4.5 Opus and other frontier baselines — Yunjue Agent takes state-of-the-art on three and second place on the other two (trailing only the closed-source frontier):
- In-Situ Self-Evolving paradigm: traditional agents draw a hard line between offline training and online deployment. We propose "inference IS evolution": every inference mutates the system's configuration and immediately feeds the next one.
- Tabula Rasa experiment: the agent starts with an empty toolkit and builds tools entirely through inference-time generate / verify / induce.
- Tool-library convergence: the tools the agent authors converge to a reusable set — only 97 tools synthesized across 2,500 HLE queries — evidence that "general problem-solving" is a learnable, finite, distillable pattern.
- Warm-start transfer: an HLE-evolved toolset bootstraps the other benchmarks; new-tool growth drops to zero on xSciQA / xDS, showing the skills transfer cross-domain.
Code, benchmark scripts, and the versioned tool-generation / modification / merge traces are all open-sourced under CC BY 4.0, so the results are auditable and reproducible. The full work is described in the Yunjue Agent post.
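The tabula-rasa start and the convergence of the tool library can be pictured with a toy loop. This is a sketch under heavy assumptions, not the Yunjue Agent implementation: tool generation is stubbed by a fixed catalogue (`CANDIDATES`) instead of a model, verification is a single input/output check (`CHECKS`), the induce / merge step is not modeled, and every name is hypothetical.

```python
from typing import Callable

Tool = Callable[[str], str]

# Stand-in for model-driven tool generation: a fixed catalogue of candidates.
CANDIDATES: dict[str, Tool] = {
    "upper": lambda s: s.upper(),
    "reverse": lambda s: s[::-1],
}

# One known input/output pair per capability, used as the verify step.
CHECKS: dict[str, tuple[str, str]] = {
    "upper": ("ab", "AB"),
    "reverse": ("ab", "ba"),
}


def solve(library: dict[str, Tool], capability: str, payload: str) -> tuple[str, bool]:
    """Answer one query; return (result, whether a new tool was added)."""
    added = False
    if capability not in library:
        tool = CANDIDATES[capability]           # generate (stubbed)
        probe, expected = CHECKS[capability]
        if tool(probe) != expected:             # verify before the tool is trusted
            raise RuntimeError(f"candidate for {capability!r} failed verification")
        library[capability] = tool              # keep it for later queries
        added = True
    return library[capability](payload), added


library: dict[str, Tool] = {}  # tabula rasa: the toolkit starts empty
queries = [("upper", "hle"), ("reverse", "abc"), ("upper", "xds"),
           ("reverse", "xyz"), ("upper", "q")]

results, growth = [], []
for capability, payload in queries:
    output, _ = solve(library, capability, payload)
    results.append(output)
    growth.append(len(library))  # library size after each query

print(growth)
```

In this toy run the library stops growing after the second query because later queries reuse existing tools, which is the shape of the convergence and warm-start claims above at a much smaller scale.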
Self-evolution in product
A self-evolution sandbox, once a night
Every night · end-of-day data assembled
│
▼
For each user, run one self-evolution sandbox
│
├─ Audit today        Analyze the event narrative and product-side signals,
│                     re-examine suspect events (mislabel / composite / boundary)
│
├─ Incremental model  profile / glossary / personal KG / relationship graph /
│                     tracked items, appended as mode statements, never overwritten
│
├─ Find the gaps      Identify capabilities the user's current Skill library
│                     does not yet cover
│
└─ Author tools       Emit Python tool skeletons and design briefs, validate,
                      and promote verified ones into the shared library
│
▼
Every decision is audited and replayable, node by node
Smallest unit of action
User-facing card unit
Self-evolution comes in two stages: private evolution (fully automatic, inside a sandbox) + shared extraction (high-value generic tools are de-identified and promoted into the shared library). Skills and tools are private by default; sharing only happens during the "non-personal generic tool" extraction step.
03 / Application
Zero-Skill: every user's app is different
Mainstream "AI assistant" products preload features and ask the user to pick. Result: every user sees the same product, varying only in usage.
Yunjue starts with Zero Skills:
- Day-one feed is empty — the system knows nothing about you, so shows nothing
- The system observes throughout the day and slices activity into event narratives
- After a few days, it identifies "what you do at which times" and writes Skills
- By around two weeks, no two users share the same card library
Slow cold start is a disadvantage — but it's the price of the moat. Each card is a standalone HTML mini-program, which means the same feed shell can host tools, retrospectives, companion dialogue, even bespoke games. The more user-specific Skills accumulate, the higher the switching cost.
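The step from observed events to "what you do at which times" can be sketched as a recurrence count. This is only an illustration under assumptions: events are reduced to hypothetical `(day, hour, activity)` tuples, a pattern qualifies once it appears on `min_days` distinct days, and the function name `propose_skills` is invented here. How Yunjue actually writes Skills is not specified in this page.

```python
from collections import defaultdict


def propose_skills(events: list[tuple[int, int, str]], min_days: int = 3) -> list[tuple[int, str]]:
    """Return (hour, activity) pairs observed on at least `min_days` distinct days."""
    days_seen: dict[tuple[int, str], set[int]] = defaultdict(set)
    for day, hour, activity in events:
        days_seen[(hour, activity)].add(day)
    return sorted(key for key, days in days_seen.items() if len(days) >= min_days)


# Hypothetical event narratives from the first three days of use.
events = [
    (1, 9, "standup"), (2, 9, "standup"), (3, 9, "standup"),
    (1, 20, "guitar"), (3, 20, "guitar"),
    (2, 13, "walk"),
]

candidates = propose_skills(events)
print(candidates)
```

With no events the function returns an empty list, which matches the empty day-one feed: nothing is proposed until a pattern has actually repeated.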
Internal-trial observations
Same system, completely different Skill libraries
Yunjue's self-evolution sandbox runs once a night — reading event narratives, voiceprint profiles, and behavioral patterns to identify capability gaps and author new Skills. There is no fixed feature list; what each user ends up with diverges over time. Typical directions:
Creators
Creative-energy curves, inspiration-flash replays, decision-style retrospectives, long-form writing reviews — capabilities organized around "why did I do it this way" get identified and authored automatically.
Professional workers
Effective-work-hour tallies, collaboration-dialogue summaries, information-utilization rates, expert-interview digests — capabilities organized around "where does my output actually come from" get identified and authored automatically.
Daily life & companionship
Parenting journals, long-horizon growth archives, scene-by-scene field notes, relationship graphs — capabilities organized around "how do everyday details accumulate into a long-term story" get identified and authored automatically.
On-location creators
In-the-moment shoot / performance notes, live state curves, inspiration ledgers — capabilities organized around "what was I actually thinking at the moment of creation" get identified and authored automatically.
All of these directions emerge from a Zero-Skill start — they are not preloaded features and not user-picked from a list.
Comparison
Mainstream LLM / domain-bounded Agent / Yunjue
| Dimension | Mainstream LLM assistant | Domain-bounded Agent | Yunjue |
|---|---|---|---|
| Openness | ✅ Any input | ❌ Fixed intent | ✅ |
| Safety / controllability | ❌ Probabilistic | ✅ FSM | ✅ Sandbox + tool verification |
| Cost | ❌ Full inference | ✅ Fixed path | ✅ Cached path + deep-reason on demand |
| Personalization | ❌ One-size-fits-all | ❌ No personalization | ✅ Individual-scale |
| Multimodal depth | ❌ Late Fusion | ❌ Single-modal | ✅ Early Fusion |
| Self-evolution | ❌ Frozen after training | ❌ Hand-edited by engineers | ✅ Automatic, nightly |
Traditional approaches can hit only two corners of the commercial impossible triangle of openness, controllability, and economics. We believe a self-evolving agent is the viable path across all three. Full argument in Why dynamic self-evolution is the right path for consumer services.
Privacy architecture
Whether raw data leaves your device is your choice
Listening, watching, and measuring heart rate 24/7 looks, from the outside, like always-on capture hardware. This is Yunjue's biggest product burden — and a question engineering has to answer head-on.
Consumer edition (Yunjue's in-house multimodal hardware + iPhone, mass-market user)
├─ Option 1: edge-side small model compresses to text summary before upload
└─ Option 2: edge-side fusion adapter — raw frames / raw data never leave;
what's uploaded is a fusion vector
Geek edition (developers / power users)
├─ Private-server bundle: raw data uploaded, but to YOUR own server, token billing
└─ Lightweight backend open-sourced: deploy your own IP, hardware-only billing
- Today (deep internal trial): raw data is processed in the cloud under per-device encryption, strict isolation, and no cross-user mixing. This is the stage where we and our early peers chose to harden the main loop first.
- Mid-term roadmap (after in-house multimodal hardware + edge adapter ship): the consumer edition defaults to an on-device / edge path, where raw data never leaves the device or edge and the cloud only receives a fusion vector. Uploading raw data stops being the default, and "whether raw data leaves your device" becomes an explicit user choice.
- Geek-edition options (shipped in parallel): private deployment or lightweight backend open source — raw data fully under your control.
- Tools / Skills are private by default: publishing to the community is an explicit user action. Pre-publish de-identification removes user-specific terms, names, and locations.
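Pre-publish de-identification could look roughly like the following. It is a minimal sketch that assumes a per-user mapping from private terms to placeholders already exists; the function name `deidentify`, the placeholder labels, and the sample text are hypothetical, and a real pipeline would need more than exact-term matching.

```python
import re


def deidentify(text: str, private_terms: dict[str, str]) -> str:
    """Replace each private term with its placeholder, matching whole words only."""
    if not private_terms:
        return text
    # Longest terms first so "Alice Chen" wins over "Alice" at the same position.
    ordered = sorted(private_terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): placeholder for term, placeholder in private_terms.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Hypothetical per-user glossary of names, locations, and project terms.
private_terms = {
    "Alice Chen": "<NAME>",
    "Alice": "<NAME>",
    "Shenzhen": "<LOCATION>",
    "Project Kite": "<TERM>",
}

brief = "Summarize Alice Chen's notes on Project Kite from the Shenzhen office."
cleaned = deidentify(brief, private_terms)
print(cleaned)
```

The word-boundary anchors keep a short term such as "Alice" from matching inside an unrelated longer word.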
Zooming out
Three mutually-prerequisite long-horizon tracks
The three layers above all serve three long-horizon tracks Yunjue invests in. Remove any one and the remaining two no longer hold.
Human-Centric World Model
Yunjue's long-horizon goal is to close that gap: aligning heart rate, IMU, audio, vision, dialogue, profile, and relationship graphs at the raw-signal layer to train a truly human-centric multimodal foundation model whose training target is not "general world knowledge" but "the state, intent, and needs of a specific person at a specific moment."
Once this model exists, embodied agents, humanoid robots, and consumer personal AI will have a foundation that genuinely understands people — not just a language model that can hold a conversation.
Self-Evolving Agent
Always-on multimodal capture · External sensory system
Yunjue positions this hardware as an external sensory system: a dual-mic array that always listens, a camera that captures key frames on trigger, and heart rate / HRV / IMU / skin temperature running continuously — perceiving signals you can't perceive yourself, and handing them to the cloud-side external prefrontal cortex for analysis and reflection.
Yunjue today validates the main loop on Apple Watch; in-house multimodal hardware v1 is in progress, organized around "dual-mic + HR / IMU + privacy indicator + hardware kill-switch." Future form factors stay open.