Tech

Three layers, signal to application

Yunjue's tech stack runs bottom-up across three layers: Early Fusion multimodal at the bottom; a Self-Evolving Agent in the middle that runs a sandbox every night — auditing events, refreshing the user model, and authoring its own tools / Skills; and Zero-Skill personalization on top. Each layer depends on the one below. The long-term arc leads to a self-trained Human-Centric World Model.

01 / Signal

Early Fusion multimodal: understanding you, not just what you said

Mainstream multimodal AI uses Late Fusion: each modality is compressed to text first, then concatenated. Temporal relations, intensity contrasts, "heart rate rose 3 seconds after they said that" — all lost. Conclusions are coarse, safe, identical for everyone.

Late Fusion (mainstream)
  audio  ──→  ASR        ──→ "boss called me over"     ─┐
  HR     ──→  numeric    ──→ "HR rose from 75 to 110"  ─┼──→  LLM concat:
  vision ──→  caption    ──→ "person in a suit"        ─┘  "you seem to have
                                                            something at work"

Early Fusion (Yunjue)
  audio  ┐
  HR     ┤
  IMU    ┼──→  aligned at raw signal layer → multimodal LLM
  vision ┤      "you got nervous the moment your boss called you over"
  chat   ┘

The model builds a causal chain at the raw-signal layer — "HR rose 3 seconds before that sentence was spoken." Conclusions become precise to the event, the moment, the individual.
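To make "aligned at the raw-signal layer" concrete, here is a minimal sketch of putting differently sampled streams on one shared time grid before any model sees them. The stream names, the sample values, and the zero-order-hold resampling are illustrative assumptions, not Yunjue's actual pipeline.

```python
from bisect import bisect_right

def sample_at(stream, t):
    """Return the latest value at or before time t (zero-order hold), or None."""
    times = [ts for ts, _ in stream]
    i = bisect_right(times, t)
    return stream[i - 1][1] if i > 0 else None

def align(streams, start, stop, step):
    """Resample every stream onto one shared grid of timestamps (seconds)."""
    grid = []
    t = start
    while t <= stop:
        row = {"t": t}
        for name, stream in streams.items():
            row[name] = sample_at(stream, t)
        grid.append(row)
        t += step
    return grid

# Hypothetical data: (timestamp_seconds, value), each stream sorted by time.
streams = {
    "hr_bpm": [(0, 75), (2, 78), (4, 96), (6, 110)],
    "speech": [(0, ""), (5, "boss called me over")],
}
timeline = align(streams, start=0, stop=6, step=1)
```

Each row of `timeline` keeps heart rate and speech side by side at the same instant, so ordering information (the rise at t=4 precedes the sentence at t=5) survives. A Late Fusion pipeline that first summarizes each stream as text would discard it.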

Why mainstream models can't

Not unwilling — there's no data

Mainstream models cover: video → text (VL), audio → text (ASR / gpt-audio), image → text (most have it natively).

But for the combination of HR (dense numeric) + IMU (dense numeric) + HRV (sparse) + audio (continuous waveform) + image (sparse frames) + personal profile (text) + knowledge graph, there is no public-internet training data at scale, because only people wearing always-on multimodal capture devices generate it.

This is Yunjue's flank: a new training arena that bypasses the frontal battle with mainstream models. Today we use the alignment capabilities of mainstream multimodal models to collect dense "unusual modalities → human behavior outcome" labels; long-term, we train our own Human-Centric World Model.

Early Fusion boundary

The things Late Fusion can never catch, Early Fusion can

All three capabilities below require heart rate, vision, voiceprint, and dialogue to be precisely aligned at the raw-signal layer. Compress each modality to text first, and this information is gone.

CAPABILITY 01

Self-report vs. body

How you describe your state in words and what your heart rate, HRV, and gait actually reveal often diverge. Yunjue aligns both streams across the same window and surfaces the gap pure language never catches.
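A rough sketch of how the gap between self-report and body might be scored over one window. The HRV values, the list of "calm" words, and the 1.5-sigma threshold are assumptions chosen for illustration; treating unusually low HRV as a sign of strain is a common heuristic, not a claim about Yunjue's model.

```python
from statistics import mean, pstdev

def z_score(value, history):
    """Distance of `value` from the user's own history, in standard deviations."""
    sigma = pstdev(history)
    return 0.0 if sigma == 0 else (value - mean(history)) / sigma

def report_body_gap(self_report, hrv_ms, hrv_history, threshold=1.5):
    """Flag a window where the words say calm but HRV is unusually low."""
    z = z_score(hrv_ms, hrv_history)
    body_strained = z <= -threshold
    says_calm = self_report in {"fine", "calm", "relaxed"}   # hypothetical vocabulary
    return {"hrv_z": round(z, 2), "diverges": says_calm and body_strained}

# Hypothetical HRV readings (ms) for the same window on past days.
history = [62, 58, 65, 60, 61, 59, 63]
gap = report_body_gap("fine", hrv_ms=48, hrv_history=history)
```

The comparison is against the individual's own history rather than a population norm, which is what makes the result specific to one person.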

CAPABILITY 02

Same activity, across days

The same sit-and-type, the same meeting, the same commute can produce very different physiological curves. The multimodal timeline turns "today vs. baseline" into a measurable quantity.
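One simple way to turn "today vs. baseline" into a number is sketched below: average the curves from past days for the same activity, then measure how far today's curve sits from that average. The metric (mean absolute difference) and the sample curves are assumptions for illustration only.

```python
from statistics import mean

def baseline_curve(past_days):
    """Point-wise mean of past-day curves for the same activity."""
    return [mean(samples) for samples in zip(*past_days)]

def deviation_from_baseline(today, past_days):
    """Mean absolute difference (bpm) between today's curve and the baseline."""
    base = baseline_curve(past_days)
    return mean(abs(t - b) for t, b in zip(today, base))

# Hypothetical heart-rate curves (bpm) sampled at the same four points of a meeting.
past_meetings = [
    [72, 74, 76, 75],
    [70, 74, 78, 75],
    [74, 74, 74, 75],
]
today_meeting = [80, 86, 92, 85]
score = deviation_from_baseline(today_meeting, past_meetings)
```

This assumes the curves are already aligned to the same points within the activity; in practice that alignment is the hard part.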

CAPABILITY 03

Multimodal causal chains

"Heart rate rose 3 seconds before that sentence was spoken" — heart rate, vision, voiceprint shifts, and dialogue aligned at the raw-signal layer construct causal chains language alone never produces.

These are Early Fusion's current capabilities — and the entry point for accumulating "human behavior outcome" labels that will form the training ground truth of the future Human-Centric World Model.
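The "heart rate rose 3 seconds before that sentence" observation under Capability 03 can be sketched as a lead-time calculation between two aligned streams. The onset rule (first sample at least 15 bpm above baseline) and the data are assumptions; temporal precedence alone is evidence for a causal chain, not proof of one.

```python
def rise_onset(hr, baseline_bpm, delta_bpm):
    """First timestamp at which HR exceeds baseline by delta_bpm, else None."""
    for t, bpm in hr:
        if bpm - baseline_bpm >= delta_bpm:
            return t
    return None

def lead_seconds(hr, utterance_start, baseline_bpm, delta_bpm=15):
    """Positive result: HR rose before the utterance began."""
    onset = rise_onset(hr, baseline_bpm, delta_bpm)
    return None if onset is None else utterance_start - onset

# Hypothetical (timestamp_seconds, bpm) samples and utterance start time.
hr = [(0.0, 75), (1.0, 76), (2.0, 92), (3.0, 104), (4.0, 110)]
lead = lead_seconds(hr, utterance_start=5.0, baseline_bpm=75)
```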

02 / System

Self-Evolving Agent: smarter the longer you use it

Benchmarks · arXiv 2601.18226

5 benchmarks · 3 SOTA · 2 runner-up

Across HLE, DeepSearchQA, FinSearchComp, xBench-ScienceQA, and xBench-DeepSearch — measured against GPT-5.2 Pro, GPT-5 Pro, Gemini 3 Pro, Claude 4.5 Opus and other frontier baselines — Yunjue Agent takes state-of-the-art on three and second place on the other two (trailing only the closed-source frontier):

Benchmark	Score	Standing
HLE	48.0	#2 · only behind GPT-5.2 Pro
DSQA	73.5	SOTA · +16.9 vs Gemini 3 Pro
FSC	65.0	SOTA · +15.1 vs Gemini 3 Pro
xSciQA	76.5	SOTA
xDS	59.7	#2 · only behind GPT-5 Pro

arXiv paper · GitHub repo · Full evolution traces

In-Situ Self-Evolving paradigm: traditional agents draw a hard line between offline training and online deployment. We propose "inference IS evolution": every inference mutates the system's configuration and immediately feeds the next one.
Tabula Rasa experiment: the agent starts with an empty toolkit and builds tools entirely through inference-time generate / verify / induce.
Tool-library convergence: the tools the agent authors converge to a reusable set — only 97 tools synthesized across 2,500 HLE queries — evidence that "general problem-solving" is a learnable, finite, distillable pattern.
Warm-start transfer: an HLE-evolved toolset bootstraps the other benchmarks; new-tool growth drops to zero on xSciQA / xDS, showing the skills transfer cross-domain.
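The four ideas above can be illustrated with a toy loop: the toolkit starts empty, a tool is synthesized only when no existing tool covers a query, and it is kept only after verification. The `synthesize` and `verify` functions here are hypothetical stand-ins for the LLM-driven generate / verify / induce steps described in the paper; this is a sketch of the control flow, not the published implementation.

```python
def solve(query, tools, synthesize, verify):
    """Answer one query; grow the toolkit only when no existing tool fits."""
    kind, payload = query
    if kind not in tools:                      # capability gap found at inference time
        candidate = synthesize(kind)           # generate
        if not verify(kind, candidate):        # verify before keeping
            raise ValueError(f"no verified tool for {kind!r}")
        tools[kind] = candidate                # induce: keep for later queries
    return tools[kind](payload)

# Hypothetical stand-ins for model-driven generation and verification.
TEMPLATES = {"sum": sum, "max": max, "count": len}
CHECKS = {"sum": ([1, 2], 3), "max": ([1, 2], 2), "count": ([1, 2], 2)}

def synthesize(kind):
    return TEMPLATES[kind]

def verify(kind, tool):
    sample, expected = CHECKS[kind]
    return tool(sample) == expected

tools = {}                                     # tabula rasa: empty toolkit
queries = [("sum", [1, 2, 3]), ("max", [4, 9]), ("sum", [5, 5]),
           ("count", [7, 7, 7]), ("max", [0])]
answers = [solve(q, tools, synthesize, verify) for q in queries]
```

Five queries yield three tools, and any further query of a known kind adds none. That is the toy analogue of tool-library convergence and of warm-start transfer, where a toolkit evolved on one workload is reused on another.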

Code, benchmark scripts, and the versioned tool-generation / modification / merge traces are all released under CC BY 4.0, so they are auditable and reproducible. The full write-up is in the Yunjue Agent post.

Self-evolution in product

A self-evolution sandbox, once a night

Every night · end-of-day data assembled
   │
   ▼
For each user, run one self-evolution sandbox
   │
   ├─ Audit today        Analyze the event narrative and product-side signals,
   │                     re-examine suspect events (mislabel / composite / boundary)
   │
   ├─ Incremental model  profile / glossary / personal KG / relationship graph /
   │                     tracked items — appended as mode statements, never overwritten
   │
   ├─ Find the gaps      Identify capabilities the user's current Skill library
   │                     does not yet cover
   │
   └─ Author tools       Emit Python tool skeletons and design briefs, validate,
                         and promote verified ones into the shared library
   │
   ▼
Every decision is audited and replayable, node by node
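The nightly flow in the diagram can be sketched as four steps that each write to an audit log, with the user model kept append-only. The field names, the event shape, and the gap rule (event kinds with no matching Skill) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class NightlyRun:
    user_id: str
    events: list                                  # today's event narratives
    skills: set                                   # kinds the Skill library already covers
    model: list = field(default_factory=list)     # append-only user-model statements
    audit_log: list = field(default_factory=list)

    def step(self, name, result):
        """Record every decision so the run can be replayed node by node."""
        self.audit_log.append({"step": name, "result": result})
        return result

def run_nightly(run):
    run.step("audit_today", [e for e in run.events if e.get("suspect")])
    new_facts = [f"{e['kind']} at {e['hour']}:00"
                 for e in run.events if not e.get("suspect")]
    run.model.extend(new_facts)                   # appended, never overwritten
    run.step("incremental_model", new_facts)
    gaps = run.step("find_gaps",
                    sorted({e["kind"] for e in run.events} - run.skills))
    run.step("author_tools", [f"tool_for_{g}" for g in gaps])
    return run

# Hypothetical end-of-day data for one user.
events = [
    {"kind": "meeting", "hour": 10},
    {"kind": "commute", "hour": 18},
    {"kind": "meeting", "hour": 15, "suspect": True},
]
nightly = run_nightly(NightlyRun("u1", events, skills={"meeting"}))
```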

Tool

Smallest unit of action

A standalone executable code unit that defines "what to do." Map queries, KOL aggregation, HR analysis. Every night the sandbox writes them, runs them, validates them; verified tools are promoted into the shared tool library.

Skill

User-facing card unit

An execution + render contract bundled with dedicated tools. The agent first emits a long report, then renders it into HTML cards in the app.
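As a sketch of what an "execution + render contract" could look like, the class below bundles tools with an execute step that produces a report and a render step that turns it into an HTML card. The class shape, field names, and markup are hypothetical; the source only states that a Skill bundles tools, emits a report, and renders cards.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable

@dataclass
class Skill:
    name: str
    tools: dict          # tool name -> callable, bundled with this Skill
    execute: Callable    # (tools, data) -> report text

    def render(self, report):
        """Render the report into a standalone HTML card (hypothetical markup)."""
        return (f'<article class="card"><h2>{escape(self.name)}</h2>'
                f'<p>{escape(report)}</p></article>')

    def run(self, data):
        return self.render(self.execute(self.tools, data))

def avg(values):
    return sum(values) / len(values)

skill = Skill(
    name="Meeting HR recap",
    tools={"avg": avg},
    execute=lambda tools, data: f"Average HR was {tools['avg'](data):.0f} bpm during meetings.",
)
card = skill.run([80, 90, 100])
```

Keeping execution and rendering as separate steps means the same report could be re-rendered into a different card layout without re-running the tools.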

Self-evolution comes in two stages: private evolution (fully automatic, inside a sandbox) + shared extraction (high-value generic tools are de-identified and promoted into the shared library). Skills and tools are private by default; sharing only happens during the "non-personal generic tool" extraction step.
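The shared-extraction stage can be sketched as a filter plus a de-identification pass over tool source text. The `is_generic` rule and the simple term replacement below are deliberately naive placeholders; real de-identification would need far more than string matching.

```python
import re

def deidentify(source, private_terms):
    """Replace user-specific terms with a neutral placeholder."""
    for term in sorted(private_terms, key=len, reverse=True):   # longest first
        source = re.sub(re.escape(term), "<REDACTED>", source, flags=re.IGNORECASE)
    return source

def extract_shared(private_tools, private_terms, is_generic):
    """Copy generic tools into a shared library; private originals stay untouched."""
    shared = {}
    for name, source in private_tools.items():
        if is_generic(name, source):
            shared[name] = deidentify(source, private_terms)
    return shared

# Hypothetical private tool library for one user.
private_tools = {
    "hr_zone": "def hr_zone(bpm): return 'high' if bpm > 100 else 'normal'  # tuned for Alice",
    "alice_commute": "def route(): return 'Alice home -> Maple Street office'",
}
shared = extract_shared(
    private_tools,
    private_terms={"Alice", "Maple Street"},
    is_generic=lambda name, source: not name.startswith("alice_"),  # placeholder rule
)
```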

03 / Application

Zero-Skill: every user's app is different

Mainstream "AI assistant" products preload features and ask the user to pick. Result: every user sees the same product, varying only in usage.

Yunjue starts with Zero Skills:

Day-one feed is empty — the system knows nothing about you, so shows nothing
The system observes throughout the day and slices activity into event narratives
After a few days, it identifies "what you do at which times" and writes Skills
By around two weeks, no two users share the same card library
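The progression above, from an empty feed to user-specific Skills, can be sketched as a recurrence check over observed events. The event tuple shape and the three-day threshold are assumptions for illustration, not the product's actual rule.

```python
from collections import Counter

def propose_skills(event_log, min_days=3):
    """Return (activity, hour) patterns seen on at least min_days distinct days."""
    seen = set(event_log)                                   # dedupe repeats within a day
    counts = Counter((activity, hour) for _, activity, hour in seen)
    return sorted(p for p, n in counts.items() if n >= min_days)

# Hypothetical observations: (day, activity, hour_of_day).
day_one = [(1, "commute", 8)]
week = day_one + [(2, "commute", 8), (3, "commute", 8),
                  (3, "writing", 22), (4, "writing", 22)]

first_feed = propose_skills(day_one)
later_feed = propose_skills(week)
```

On day one nothing clears the threshold, so the feed stays empty. After a few days only patterns this particular user repeats become candidates, which is why two users' libraries diverge.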

Slow cold start is a disadvantage — but it's the price of the moat. Each card is a standalone HTML mini-program, which means the same feed shell can host tools, retrospectives, companion dialogue, even bespoke games. The more user-specific Skills accumulate, the higher the switching cost.

Internal-trial observations

Same system, completely different Skill libraries

Yunjue's self-evolution sandbox runs once a night — reading event narratives, voiceprint profiles, and behavioral patterns to identify capability gaps and author new Skills. There is no fixed feature list; what each user ends up with diverges over time. Typical directions:

Creators

Creative-energy curves, inspiration-flash replays, decision-style retrospectives, long-form writing reviews — capabilities organized around "why did I do it this way" get identified and authored automatically.

Professional workers

Effective-work-hour tallies, collaboration-dialogue summaries, information-utilization rates, expert-interview digests — capabilities organized around "where does my output actually come from" get identified and authored automatically.

Daily life & companionship

Parenting journals, long-horizon growth archives, scene-by-scene field notes, relationship graphs — capabilities organized around "how do everyday details accumulate into a long-term story" get identified and authored automatically.

On-location creators

In-the-moment shoot / performance notes, live state curves, inspiration ledgers — capabilities organized around "what was I actually thinking at the moment of creation" get identified and authored automatically.

All of these directions emerge from a Zero-Skill start — they are not preloaded features and not user-picked from a list.

Comparison

Mainstream LLM / domain-bounded Agent / Yunjue

Dimension	Mainstream LLM assistant	Domain-bounded Agent	Yunjue
Openness	✅ Any input	❌ Fixed intent	✅
Safety / controllability	❌ Probabilistic	✅ FSM	✅ Sandbox + tool verification
Cost	❌ Full inference	✅ Fixed path	✅ Cached path + deep-reason on demand
Personalization	❌ One-size-fits-all	❌ No personalization	✅ Individual-scale
Multimodal depth	❌ Late Fusion	❌ Single-modal	✅ Early Fusion
Self-evolution	❌ Frozen after training	❌ Hand-edited by engineers	✅ Automatic, nightly

Traditional approaches can hit only two corners of the commercial impossible triangle of openness / controllability / economics. We believe a self-evolving agent is the viable path across all three. The full argument is in Why dynamic self-evolution is the right path for consumer services.

Privacy architecture

Whether raw data leaves your device is your choice

Listening, watching, and measuring heart rate 24/7 looks, from the outside, like always-on capture hardware. This is Yunjue's biggest product burden — and a question engineering has to answer head-on.

Consumer edition (Yunjue's in-house multimodal hardware + iPhone, mass-market user)
  ├─ Option 1: edge-side small model compresses to text summary before upload
  └─ Option 2: edge-side fusion adapter — raw frames / raw data never leave;
              what's uploaded is a fusion vector

Geek edition (developers / power users)
  ├─ Private-server bundle: raw data uploaded, but to YOUR own server, token billing
  └─ Lightweight backend open-sourced: deploy your own IP, hardware-only billing
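One way to read the options above is as a policy that decides what payload leaves the device and where it goes. The enum names, destinations, and the `summarize` / `embed` stand-ins below are hypothetical; in particular, `embed` is a placeholder for an edge-side fusion adapter, not a real one.

```python
from enum import Enum

class UploadMode(Enum):
    TEXT_SUMMARY = "consumer option 1"     # edge model compresses to text
    FUSION_VECTOR = "consumer option 2"    # edge adapter emits a fusion vector
    SELF_HOSTED = "geek edition"           # raw data goes to a server the user controls

def build_upload(mode, raw, summarize, embed):
    """Decide what leaves the device under each option."""
    if mode is UploadMode.TEXT_SUMMARY:
        return {"dest": "vendor_cloud", "payload": summarize(raw), "raw_leaves_device": False}
    if mode is UploadMode.FUSION_VECTOR:
        return {"dest": "vendor_cloud", "payload": embed(raw), "raw_leaves_device": False}
    if mode is UploadMode.SELF_HOSTED:
        return {"dest": "user_server", "payload": raw, "raw_leaves_device": True}
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical stand-ins for the edge-side small model and fusion adapter.
def summarize(raw):
    return f"{len(raw)} samples, mean HR {sum(raw) / len(raw):.0f} bpm"

def embed(raw):
    return [min(raw), max(raw)]

raw = [75, 96, 109]
vector_upload = build_upload(UploadMode.FUSION_VECTOR, raw, summarize, embed)
self_hosted_upload = build_upload(UploadMode.SELF_HOSTED, raw, summarize, embed)
```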

Today (deep internal trial): raw data is processed in the cloud under per-device encryption, strict isolation, and no cross-user mixing — the stage where we and our early peers chose to harden the main loop first.
Mid-term roadmap (after in-house multimodal hardware + edge adapter ship): the consumer edition defaults to an on-device / edge path, where raw data never leaves the device or edge and the cloud receives only a fusion vector. Uploading raw data becomes an explicit user choice, no longer the default.
Geek-edition options (shipped in parallel): private deployment or lightweight backend open source — raw data fully under your control.
Tools / Skills are private by default: publishing to the community is an explicit user action. Pre-publish de-identification removes user-specific terms, names, and locations.

Zooming out

Three mutually-prerequisite long-horizon tracks

The three layers above all serve three long-horizon tracks Yunjue invests in. Remove any one and the remaining two no longer hold.

Human-Centric World Model

Every mainstream large model is trained on the public internet. These models learn general world knowledge, but none has been systematically pretrained on the human itself. No model truly understands that, for the same person, the heart-rate curve while typing differs from the curve during a meeting, or that the body reacts before a word is spoken.

Yunjue's long-horizon goal is to close that gap: aligning heart rate, IMU, audio, vision, dialogue, profile, and relationship graphs at the raw-signal layer to train a truly human-centric multimodal foundation model whose training target is not "general world knowledge" but "the state, intent, and needs of a specific person at a specific moment."

Once this model exists, embodied agents, humanoid robots, and consumer personal AI will have a foundation that genuinely understands people — not just a language model that can hold a conversation.

Self-Evolving Agent

Today's assistants are hand-built scenario by scenario, an approach that cannot serve the long tail of individual users. The Yunjue Agent paper demonstrates that an agent can extend itself through inference-as-evolution — authoring its own tools, writing its own Skills, and identifying its own capability gaps inside a sandbox. We believe this is the viable path through the openness / controllability / economics impossible-triangle of consumer AI.

Always-on multimodal capture · External sensory system

Understanding a person requires being with them continuously, ambiently, with low intrusion and verifiable privacy controls — a workload for always-on multimodal hardware, not a phone app or smart speaker.

Yunjue positions this hardware as an external sensory system: a dual-mic array that always listens, a camera that captures key frames on trigger, and heart rate / HRV / IMU / skin temperature running continuously — perceiving signals you can't perceive yourself, and handing them to the cloud-side external prefrontal cortex for analysis and reflection.

Yunjue today validates the main loop on Apple Watch; in-house multimodal hardware v1 is in progress, organized around "dual-mic + HR / IMU + privacy indicator + hardware kill-switch." Future form factors stay open.

See the product cadence or join us — or read the three-track narrative in full

Early Fusion → Self-Evolving Agent → Human-Centric World Model: all three tracks have corresponding delivery milestones on the product page.

Product · Join us