EDITH

Method

Hardware System

We build a hardware system that streams the human's first-person view, gaze, and speech in real time, and synchronizes them with the robot's observations.

[Figure: hardware system. Panels: capturing human signals via smartglasses; bimanual robot.]

Why first-person view and gaze?

First-person view and gaze represent human nonverbal signals by capturing what the human is doing and where their attention is focused. Together, these streams provide cues for a robot policy to infer the human's underlying intent and needs.

Capturing human signals via smartglasses.

Using Project Aria glasses, EDITH streams first-person RGB, gaze, and speech to the robot server, transcribes the speech into \(\ell_t\), and synchronizes these signals with the robot observations \(o_t\). At each timestep this yields a tuple \((C_t^{\mathrm{ego}}, \ell_t, o_t)\), where \(C_t^{\mathrm{ego}}\) denotes the egocentric context (first-person RGB and gaze).
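One simple way to picture the synchronization step is nearest-timestamp matching. The sketch below is illustrative only and is not EDITH's implementation: it assumes every stream delivers time-sorted samples on a shared clock, that the latest transcript is held until a new one arrives, and it does not use the Project Aria SDK. The names `Sample`, `nearest`, `synchronize`, and the 50 ms tolerance are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Sample:
    t: float   # timestamp in seconds on a shared clock (assumption)
    data: Any  # payload: RGB frame, gaze point, transcript, or robot observation


def nearest(stream: List[Sample], t: float, tol: float) -> Optional[Sample]:
    """Return the sample of a time-sorted stream closest to t, or None if it is farther than tol."""
    if not stream:
        return None
    i = bisect_left([s.t for s in stream], t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = stream[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda s: abs(s.t - t))
    return best if abs(best.t - t) <= tol else None


def synchronize(
    rgb: List[Sample],
    gaze: List[Sample],
    speech: List[Sample],
    robot: List[Sample],
    tol: float = 0.05,  # hypothetical tolerance, not a value from the source
) -> List[Tuple[Tuple[Any, Any], str, Any]]:
    """Build one (C_ego, l, o) tuple per robot observation that has matching RGB and gaze."""
    out = []
    for obs in robot:
        frame = nearest(rgb, obs.t, tol)
        point = nearest(gaze, obs.t, tol)
        if frame is None or point is None:
            continue  # drop timesteps without fresh egocentric context
        # Assumption: l_t is the most recent transcript at or before obs.t.
        heard = [s for s in speech if s.t <= obs.t]
        text = heard[-1].data if heard else ""
        out.append(((frame.data, point.data), text, obs.data))
    return out
```

Anchoring on the robot's observation timestamps keeps the output aligned with the control loop; a real system would also need to handle clock offset between the glasses and the robot server.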

Policy Design

Human signals contain rich information but are often noisy and transient. To process such signals effectively, we propose a hierarchical policy that decouples inferring the human's intent from producing low-level actions. It consists of a high-level policy and a low-level policy.

Overall Design

EDITH converts verbal instructions and egocentric context into instruction-keyframe subtasks, stores them in \(Q\), and executes each subtask with the robot policy.
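The flow described above can be sketched as a small queue-driven loop. This is a minimal illustration under stated assumptions, not EDITH's code: it assumes a subtask is an (instruction, keyframe) pair, that \(Q\) is served first-in first-out, and that the low-level policy runs one subtask to completion per call. `Subtask`, `run_step`, `high_level`, and `low_level` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, Optional


@dataclass
class Subtask:
    instruction: str  # short language command for the low-level policy
    keyframe: Any     # egocentric frame that grounds the instruction


def run_step(
    Q: Deque[Subtask],
    ego_context: Any,
    speech: str,
    high_level: Callable[[Any, str], Optional[Subtask]],
    low_level: Callable[[Subtask], str],
) -> Optional[str]:
    """One step: enqueue a newly inferred subtask (if any), then execute the oldest one."""
    proposed = high_level(ego_context, speech)
    if proposed is not None:
        Q.append(proposed)
    if not Q:
        return None  # nothing to do yet
    # Assumption: FIFO order, and the low-level policy finishes the subtask in this call.
    return low_level(Q.popleft())


# Toy stand-ins: emit a subtask only when the human says something.
def toy_high_level(ctx: Any, text: str) -> Optional[Subtask]:
    return Subtask(text, ctx) if text else None


def toy_low_level(subtask: Subtask) -> str:
    return f"done: {subtask.instruction}"


if __name__ == "__main__":
    Q: Deque[Subtask] = deque()
    print(run_step(Q, "frame0", "", toy_high_level, toy_low_level))
    print(run_step(Q, "frame1", "hand me the tumbler", toy_high_level, toy_low_level))
```

Because a subtask stays in the queue once it has been inferred, it remains available to the low-level policy even after the signal that triggered it has disappeared from the current frame.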

Results

EDITH improves task success while reducing the user's instruction burden.

We report success rate (SR) and task progress (TP) over 48 trials per task per method, along with a user study on instruction workload.
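For concreteness, the two metrics can be computed as below. This is a sketch under assumptions: the source does not define TP precisely, so the code assumes each trial records a success flag and a progress score in [0, 1] (for example, the fraction of task stages completed). The function name `summarize` and the trial values are made up for illustration and are not EDITH's data.

```python
from typing import List, Tuple


def summarize(trials: List[Tuple[bool, float]]) -> Tuple[float, float]:
    """Return (SR %, TP %) for a list of (success, progress in [0, 1]) trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("at least one trial is required")
    sr = 100.0 * sum(1 for ok, _ in trials if ok) / n
    tp = 100.0 * sum(progress for _, progress in trials) / n
    return round(sr, 1), round(tp, 1)


if __name__ == "__main__":
    # Synthetic example: 24 full successes and 24 half-finished failures out of 48.
    synthetic = [(True, 1.0)] * 24 + [(False, 0.5)] * 24
    print(summarize(synthetic))
    # A single success in 48 trials moves SR by 100 / 48 percentage points.
    print(summarize([(True, 1.0)] + [(False, 0.0)] * 47))
```

With 48 trials, SR can only change in steps of about 2.1 percentage points, which is useful to keep in mind when comparing small differences between methods.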

Baselines

\(\pi_l^{\mathrm{lang}}\): \(\pi_{0.5}\) finetuned with task-specific demonstration data.
\(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\): Hierarchical policy employing a VLM as the high-level planner, similar to Hi Robot.
\(\pi_l^{\mathrm{ego+lang}}\): \(\pi_{0.5}\) finetuned with task-specific demonstration data, additionally conditioned on egocentric context.

Main Results

EDITH achieves an average success rate of 59.7% and an average task progress of 84.7% by translating nonverbal human signals into keyframe-grounded subtasks.

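| Task | Metric | \(\pi_l^{\mathrm{lang}}\) | \(\pi_l^{\mathrm{ego+lang}}\) | \(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\) | EDITH |
| --- | --- | --- | --- | --- | --- |
| Muffin-Serving | SR (%) | 0.0 | 4.2 | 2.1 | 50.0 |
| Muffin-Serving | TP (%) | 12.2 | 25.3 | 12.2 | 80.6 |
| Tumbler-Sorting | SR (%) | 0.0 | 2.5 | 12.5 | 45.8 |
| Tumbler-Sorting | TP (%) | 26.2 | 54.5 | 28.5 | 83.0 |
| Tool-Passing | SR (%) | 12.5 | 12.5 | 14.6 | 83.3 |
| Tool-Passing | TP (%) | 31.3 | 27.0 | 40.6 | 90.6 |
| Average | SR (%) | 4.2 | 6.4 | 9.7 | 59.7 |
| Average | TP (%) | 23.2 | 35.6 | 27.1 | 84.7 |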
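The "Average" entries are consistent with an unweighted mean over the three tasks. The short check below recomputes them from the per-task values, assuming the four numbers in each row follow the order of the method legend (\(\pi_l^{\mathrm{lang}}\), \(\pi_l^{\mathrm{ego+lang}}\), \(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\), EDITH). The helper name `column_means` is introduced here for illustration.

```python
from typing import Dict, List

# Per-task values as reported, one entry per method in legend order (assumption).
sr: Dict[str, List[float]] = {
    "Muffin-Serving": [0.0, 4.2, 2.1, 50.0],
    "Tumbler-Sorting": [0.0, 2.5, 12.5, 45.8],
    "Tool-Passing": [12.5, 12.5, 14.6, 83.3],
}
tp: Dict[str, List[float]] = {
    "Muffin-Serving": [12.2, 25.3, 12.2, 80.6],
    "Tumbler-Sorting": [26.2, 54.5, 28.5, 83.0],
    "Tool-Passing": [31.3, 27.0, 40.6, 90.6],
}


def column_means(table: Dict[str, List[float]]) -> List[float]:
    """Unweighted mean over tasks for each method, rounded to one decimal."""
    rows = list(table.values())
    return [round(sum(col) / len(col), 1) for col in zip(*rows)]


if __name__ == "__main__":
    print(column_means(sr))  # [4.2, 6.4, 9.7, 59.7]
    print(column_means(tp))  # [23.2, 35.6, 27.1, 84.7]
```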

Language-only baselines, \(\pi_l^{\mathrm{lang}}\) and \(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\), do not use egocentric context and perform poorly, with average success rates below 10%. This highlights the importance of egocentric context for grounding nonverbal intent.

Directly conditioning an end-to-end policy on the current egocentric context yields inconsistent benefits: it helps only when gaze remains on the target, but degrades when attention is intermittent. EDITH handles both cases more consistently by monitoring intent separately through the high-level policy.

User Study on 16 Participants

EDITH reduces the workload humans face in conveying their intent to the robot.

We conducted an IRB-approved user study with 16 participants to evaluate the workload of conveying intent to the robot.