EDITH

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

Verbal Instruction + Nonverbal Signals \(C^{\mathrm{ego}}, \ell \rightarrow Q \rightarrow a_t\)

Dongjun Lee*1, Juheon Choi*1, Dong Kyu Shin*2, Sinjae Kang1, Kimin Lee1,3

* Equal contribution

1 KAIST, 2 Seoul National University, 3 Config

Motivation & Key Idea

Overview figure showing EDITH's hardware system, hierarchical policy, and robot execution flow.

Language-conditioned policies require humans to fully verbalize their intent, which is often cumbersome and imprecise.

To enable robots to understand human nonverbal signals as well, we use the human's egocentric view and gaze as inputs to the robot control policy.

Method

Hardware System

We build a hardware system that streams the human's first-person view, gaze, and speech in real time, and synchronizes them with the robot's observations.

Figure: hardware system, with smartglasses capturing human signals and a bimanual robot.

Why first-person view and gaze?

First-person view and gaze represent human nonverbal signals by capturing what the human is doing and where their attention is focused. Together, these streams provide cues for a robot policy to infer the human's underlying intent and needs.

Capturing human signals via smartglasses.

Using Project Aria glasses, EDITH streams first-person RGB, gaze, and speech to the robot server, transcribes speech into \(\ell_t\), and synchronizes the signals with robot observations \(o_t\) to produce \((C_t^{\mathrm{ego}}, \ell_t, o_t)\) at each timestep.
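
The synchronization step can be pictured as nearest-timestamp matching. The following is a minimal sketch, assuming each stream arrives as a time-sorted list of (timestamp, value) pairs. The names `nearest`, `synchronize`, and `max_skew` are hypothetical and are not taken from the EDITH codebase or the Project Aria SDK.

```python
from bisect import bisect_left
from typing import Any, List, Optional, Tuple

Sample = Tuple[float, Any]  # (timestamp in seconds, payload)


def nearest(stream: List[Sample], t: float, max_skew: float) -> Optional[Any]:
    """Return the payload closest in time to t, or None if none is within max_skew."""
    if not stream:
        return None
    times = [ts for ts, _ in stream]
    i = bisect_left(times, t)
    # Candidates: the sample just before and the sample at or after t.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(stream)]
    best = min(candidates, key=lambda j: abs(times[j] - t))
    return stream[best][1] if abs(times[best] - t) <= max_skew else None


def synchronize(
    ego: List[Sample],
    speech: List[Sample],
    robot_obs: List[Sample],
    max_skew: float = 0.05,  # assumed tolerance, not a value from the paper
) -> List[Tuple[Any, Optional[Any], Any]]:
    """Align egocentric context and transcribed speech to each robot observation.

    Returns (C_ego, ell, o) triples; ell is None when nothing was said near that timestep.
    """
    triples = []
    for t, o in robot_obs:
        c_ego = nearest(ego, t, max_skew)
        if c_ego is None:
            continue  # skip timesteps with no usable egocentric sample
        ell = nearest(speech, t, max_skew)
        triples.append((c_ego, ell, o))
    return triples
```

In this sketch the robot observation clock drives the loop, so every emitted triple corresponds to one control timestep.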

Policy Design

Human signals contain rich information but are often noisy and transient. To process such signals effectively, we propose a hierarchical policy that decouples inferring the human's intent from producing low-level actions. It consists of a high-level policy \(\pi_h\) and a low-level policy \(\pi_l\).

EDITH method architecture

Overall Design

EDITH converts verbal instructions and egocentric context into instruction-keyframe subtasks, stores them in \(Q\), and executes each subtask with the robot policy.
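
The flow above can be sketched as a producer-consumer loop around the subtask queue \(Q\). This is an illustrative sketch only: `Subtask`, `run_episode`, and the callable interfaces are hypothetical stand-ins, and the real \(\pi_h\) and \(\pi_l\) are learned models rather than the placeholder callables assumed here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, List, Optional, Tuple


@dataclass
class Subtask:
    instruction: str  # short text instruction for the low-level policy
    keyframe: Any     # egocentric frame that grounds the instruction


def run_episode(
    steps: List[Tuple[Any, Optional[str], Any]],  # (C_ego, ell, o) per timestep
    high_level: Callable[[Any, Optional[str]], Optional[Subtask]],
    low_level: Callable[[Subtask, Any], Tuple[Any, bool]],  # returns (action, done)
) -> List[Any]:
    queue: Deque[Subtask] = deque()  # Q: pending subtasks in request order
    current: Optional[Subtask] = None
    actions: List[Any] = []
    for c_ego, ell, o in steps:
        # High level: watch the human signals and enqueue a subtask when intent is detected.
        new = high_level(c_ego, ell)
        if new is not None:
            queue.append(new)
        # Low level: pick up the oldest pending subtask once the robot is free.
        if current is None and queue:
            current = queue.popleft()
        if current is not None:
            action, done = low_level(current, o)
            actions.append(action)
            if done:
                current = None
    return actions
```

In this sketch, a request made while the robot is still executing an earlier subtask waits in the queue instead of being dropped.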

Results

EDITH improves task success while reducing the user's instruction burden.

Results are reported with success rate (SR) and task progress (TP) over 48 trials per task per method, plus a user study on instruction workload.
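
For concreteness, the two metrics can be computed as follows. This sketch assumes that task progress is the mean fraction of subtasks completed per trial and that a trial succeeds only when every subtask is completed. The page does not spell out these definitions, and the `Trial` record is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trial:
    completed: int  # subtasks completed in this trial
    total: int      # subtasks the task consists of


def success_rate_and_progress(trials: List[Trial]) -> Tuple[float, float]:
    """Return (SR, TP) in percent over a list of trials."""
    if not trials:
        raise ValueError("need at least one trial")
    # A trial counts as a success only if all of its subtasks were completed.
    successes = sum(t.completed == t.total for t in trials)
    # Progress is the per-trial fraction of completed subtasks, averaged over trials.
    progress = sum(t.completed / t.total for t in trials)
    n = len(trials)
    return 100.0 * successes / n, 100.0 * progress / n
```

Under these assumed definitions, TP can never fall below SR, because every successful trial contributes full progress.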

Baselines

\(\pi_l^{\mathrm{lang}}\): \(\pi_{0.5}\) fine-tuned with task-specific demonstration data.
\(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\): Hierarchical policy employing a VLM as the high-level planner, similar to Hi Robot.
\(\pi_l^{\mathrm{ego+lang}}\): \(\pi_{0.5}\) fine-tuned with task-specific demonstration data, additionally conditioned on egocentric context.

Main Results

EDITH achieves 59.7% average success rate and 84.7% task progress by translating nonverbal human signals into keyframe-grounded subtasks.

Figure: main results comparing \(\pi_l^{\mathrm{lang}}\), \(\pi_l^{\mathrm{ego+lang}}\), \(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\), and EDITH.

Language-only baselines, \(\pi_l^{\mathrm{lang}}\) and \(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\), do not use egocentric context and perform poorly, highlighting the importance of egocentric context for grounding nonverbal intent.

Directly conditioning an end-to-end policy on the current egocentric context yields inconsistent benefits: it helps only when gaze remains on the target, but degrades when attention is intermittent. EDITH handles both cases more consistently by monitoring intent separately through the high-level policy.

User Study on 16 Participants

EDITH reduces the workload humans face in conveying their intent to the robot.

We conducted an IRB-approved user study with 16 participants to evaluate the workload of conveying intent to the robot.

Figure: Instruction Workload (lower is better ↓).

EDITH reduces the effort users must spend to convey their intent. Compared with the \(\pi_h^{\mathrm{lang}} + \pi_l^{\mathrm{lang}}\) baseline, EDITH substantially lowers instruction workload on both Muffin-Serving and Tool-Passing, and the reduction is statistically significant (\(p < 0.001\)).

The baseline places much of the burden on the user: participants must verbally describe each target object precisely enough to disambiguate it, including attributes such as color, position, and surrounding objects. EDITH instead lets users pair brief utterances with nonverbal expressions such as gaze and pointing, removing the need to fully verbalize the target.

Analysis

Q: How does EDITH perform when humans are distracted?

A: In natural interaction, human attention can shift to unrelated activities such as briefly checking a phone. We test this in Muffin-Serving by having the human alternate between checking a text message and requesting muffins until all three are requested. EDITH's SR and TP remain comparable to the non-distracted setting, with only a 0.5% relative drop in task progress.
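
Assuming the usual definition, the relative drop is measured against the non-distracted task progress rather than as a difference in percentage points:

\[
\text{relative drop} = \frac{\mathrm{TP} - \mathrm{TP}_{\text{distract}}}{\mathrm{TP}} \times 100\%
\]

Here \(\mathrm{TP}_{\text{distract}}\) is notation introduced for this illustration and denotes task progress in the distracted setting.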

Figure: EDITH under distracting human activity, comparing EDITH, EDITH w/ distract, \(\pi_l^{\mathrm{ego+lang}}\), and \(\pi_l^{\mathrm{ego+lang}}\) w/ distract.

Q: What is the effect of using a keyframe as the subtask representation?

A: To isolate the keyframe's contribution, we compare EDITH with a hierarchical no-keyframe baseline: \(\pi_h\) still maps egocentric context and language into subtasks, but \(\pi_l\) receives only the subtask instruction. Removing the keyframe drops SR/TP by 49.9/51.5 points on average. This happens because \(\pi_h\) can hallucinate the wrong target when verbalizing nonverbal signals, and semantically correct subtasks can be phrased in ways unseen during \(\pi_l\)'s training. Since EDITH retrieves the keyframe from the egocentric stream instead of generating it as text, it avoids both hallucination and ambiguous phrasing.
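
The contrast between retrieving and generating can be made concrete with a small sketch. It assumes the high-level policy selects a keyframe by choosing an identifier in a rolling buffer of recent egocentric frames. The buffer size, the class name `EgoBuffer`, and the id-based interface are hypothetical and only illustrate why a retrieved keyframe is always a frame that was actually observed.

```python
from collections import deque
from typing import Any, Deque, Tuple


class EgoBuffer:
    """Rolling buffer of recent egocentric frames, addressed by frame id."""

    def __init__(self, maxlen: int = 64) -> None:
        # Oldest frames are evicted automatically once maxlen is reached.
        self._frames: Deque[Tuple[int, Any]] = deque(maxlen=maxlen)
        self._next_id = 0

    def push(self, frame: Any) -> int:
        """Store a frame and return the id it can later be retrieved by."""
        frame_id = self._next_id
        self._frames.append((frame_id, frame))
        self._next_id += 1
        return frame_id

    def retrieve(self, frame_id: int) -> Any:
        # Retrieval returns an observed frame verbatim or fails loudly;
        # unlike free-form text generation, it cannot produce unseen content.
        for fid, frame in self._frames:
            if fid == frame_id:
                return frame
        raise KeyError(f"frame {frame_id} is not in the buffer")
```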

Figure: EDITH w/o \(C^{\mathrm{key}}\) compared with EDITH.

Evaluation Tasks

Three tasks require grounding underspecified language in nonverbal signals.

In each task, the language instruction does not fully specify the target: the robot must interpret the human's eye gaze or gestures to identify the target objects and complete the requested task.

Muffin-Serving task setup

Task 01

Muffin-Serving

Six muffins are densely arranged. The human requests three muffins while consecutively pointing at each one.

Tumbler-Sorting task setup

Task 02

Tumbler-Sorting

Five tumblers and two baskets are placed on the table. The human points to tumblers and baskets to specify a multi-step sorting request.

Tool-Passing task setup

Task 03

Tool-Passing

While the human is assembling something with both hands, they request the tool they need through a brief utterance and a glance, and the robot hands it over.

Resources

Read the paper on arXiv.

Citation

If you find our work useful, please cite the paper using the BibTeX entry below.

@article{lee2026hierarchicalpoliciesverbalegocentric,
      title={Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction},
      author={Dongjun Lee and Juheon Choi and Dong Kyu Shin and Sinjae Kang and Kimin Lee},
      year={2026},
      url={https://arxiv.org/abs/2606.10276},
}

Why are human signals noisy?

Eye gaze can shift unstably frame by frame, and the egocentric view changes frequently as the human moves. These temporal changes make the signal rich but noisy.
