GDM
36.48M · 2026-03-29
0x00 Abstract
0x01 HITL
0x02 MCP flow
0x03 INFO action
0x04 The auto_reply function
0x05 Special analysis
0xFF References
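At the end of 2025, StepFun (阶跃星辰) released Step-GUI, a new family of AI Agent models. The release includes the cloud model Step-GUI and GUI-MCP (Graphical User Interface - Model Context Protocol), described as the first MCP protocol aimed at GUI Agents: an MCP implementation designed specifically for GUI automation that balances standardization with privacy protection.
So we will walk through this MCP protocol and, along the way, look at how an on-device Agent is architected. This is the sixth article in the series; it mainly covers HITL in Step-GUI, plus some other special aspects.
Because this is a reverse-engineered reading done under limited time, it may contain mistakes of various kinds; corrections are welcome.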
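Human-in-the-loop (HITL) is a system design philosophy that redraws the boundary between human cognition and machine capability and amplifies the strengths of both. Its value can be examined along three core dimensions:
Therefore, for humans to effectively steer where a task goes, putting HITL into practice comes down to two points:
In Step-GUI, the HITL information-gathering capabilities are as follows:
Context awareness:
- The auto_reply function uses the current screenshot and the task information to answer the agent's clarifying question;
Diverse information types:
- The value field carries the specific question
The flowchart below shows the complete flow of a task from start to finish: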
hitl-1
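The sequence diagram follows. It shows how the client, MCP server, Agent server, device, and human interact. The core logic has three phases:
Enter the execution loop, which runs until the task ends or the step limit is reached:
The MCP server passes the session ID and the screenshot to the Agent server and calls the automate_step interface to obtain the Agent's action;
Branch handling:
If the action is INFO (human intervention is needed):
If the action is not INFO (click, type, and so on):
On a COMPLETE action, the MCP server returns a "task completed successfully" result to the client;
hitl-2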
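The loop in the sequence above can be sketched as follows. This is a minimal illustration under stated assumptions, not Step-GUI's actual code: `run_task_loop` and its three callable parameters are hypothetical stand-ins for the MCP server's collaborators, and only the action types INFO and COMPLETE and the stop-reason strings come from the article.

```python
from typing import Any, Callable, Dict, Optional


def run_task_loop(
    get_screenshot: Callable[[], str],
    automate_step: Callable[[str, str], Dict[str, Any]],
    act_on_device: Callable[[Dict[str, Any]], None],
    session_id: str,
    max_steps: int = 20,
) -> Dict[str, Any]:
    """Hypothetical per-step loop: screenshot -> agent action -> branch."""
    action: Optional[Dict[str, Any]] = None
    for step_idx in range(max_steps):
        screenshot = get_screenshot()
        action = automate_step(session_id, screenshot)
        kind = action["action_type"].upper()
        if kind == "INFO":
            # Human input is needed: stop and hand the question back to the client.
            return {"stop_reason": "INFO_ACTION_NEEDS_REPLY", "final_action": action,
                    "session_id": session_id, "steps": step_idx + 1}
        if kind == "COMPLETE":
            return {"stop_reason": "TASK_COMPLETED_SUCCESSFULLY", "final_action": action,
                    "session_id": session_id, "steps": step_idx + 1}
        act_on_device(action)  # click, type, and so on
    return {"stop_reason": "MAX_STEPS_REACHED", "final_action": action,
            "session_id": session_id, "steps": max_steps}


# Demo with scripted actions instead of a real Agent server and device.
scripted = iter([
    {"action_type": "CLICK", "point": [10, 20]},
    {"action_type": "INFO", "value": "Which gift do you prefer?"},
])
executed = []
demo_result = run_task_loop(
    get_screenshot=lambda: "<screenshot>",
    automate_step=lambda sid, shot: next(scripted),
    act_on_device=executed.append,
    session_id="demo",
)
```

In the demo, the first action is executed on the (fake) device and the second, an INFO action, ends the loop so the client can ask the human.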
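The business-logic differences between ask_agent_continue and ask_agent_start_new_task are as follows:
| Property | ask_agent_start_new_task | ask_agent_continue |
|---|---|---|
| Environment reset | reset_environment = True | reset_environment = False |
| Session state | Creates a new session | Continues the existing session |
| Goal | Start a new task | Continue an existing task |
| Device state | Reset to the initial state | Kept as it is |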
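ask_agent_start_new_task
@mcp.tool
def ask_agent_start_new_task(
    # ... parameters
):
    # Start a new task and reset the environment:
    # reset the device to its initial state (press the HOME key)
    # and create a brand-new session.
    # Suited to independent, brand-new tasks.
    reset_environment = True  # reset the environment
    return_log = execute_task(
        device_id=device_id,
        task=task,
        reset_environment=reset_environment,  # reset the environment
    )
    return return_log
ask_agent_continue
@mcp.tool
def ask_agent_continue(
    # ... parameters
):
    # Continue the task without resetting the environment:
    # keep the device in its current state
    # and carry on from the previous context.
    # Suited to tasks that need continuity.
    reset_environment = False  # do not reset the environment
    return_log = execute_task(
        device_id=device_id,
        task=task,
        reset_environment=reset_environment,  # do not reset the environment
    )
    return return_log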
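Usage scenario for ask_agent_start_new_task
graph TD
    A[User starts a new task] --> B{Related to the previous task?}
    B -->|Unrelated / new task| C[Use ask_agent_start_new_task]
    C --> D[Reset the environment to its initial state]
    D --> E[Start a brand-new task]
Applicable scenarios:
Usage scenario for ask_agent_continue
graph TD
    A[User continues a task] --> B{Related to the previous task?}
    B -->|Related / continue| C[Use ask_agent_continue]
    C --> D[Keep the current environment state]
    D --> E[Continue the task from the existing context]
Applicable scenarios: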
def execute_task(
    # ...
    reset_environment: bool,  # key parameter
    # ...
):
    if reset_environment and session_id is None and task is not None:
        press_home_key(device_id, print_command=True)  # reset the device
    # When session_id is None, a new session is created
    # When session_id is provided, the existing session continues
def gui_agent_loop(
    # ...
    reset_environment: bool = True,
    session_id: str = None,
    # ...
):
    if reset_environment and session_id is None and task is not None:
        press_home_key(device_id, print_command=True)  # reset the environment
    if session_id is None:
        # create a new session
        session_id = agent_server.get_session({...})
    else:
        # continue the existing session
        print(f"Continue Session ID: {session_id}")
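To make the two conditions above easier to reason about, here is a small sketch that isolates them as a pure function. `SessionPlan` and `plan_session` are hypothetical names introduced for illustration; the boolean expressions are copied from the excerpt, and nothing here touches a real device or Agent server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionPlan:
    press_home: bool      # would the HOME key be pressed to reset the device?
    create_session: bool  # would a new session be requested?


def plan_session(reset_environment: bool,
                 session_id: Optional[str],
                 task: Optional[str]) -> SessionPlan:
    """Mirror the two checks in the excerpt without side effects."""
    press_home = reset_environment and session_id is None and task is not None
    return SessionPlan(press_home=press_home, create_session=session_id is None)


# New task, no session yet: reset the device and create a session.
new_task_plan = plan_session(True, None, "open the map app")
# Existing session: never reset, never create, even if reset_environment is True.
resumed_plan = plan_session(True, "xxx", None)
```

One consequence worth noting: the reset only happens when no session_id is supplied, so passing an existing session_id protects the device state regardless of the reset_environment flag.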
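Continuing after a task is interrupted
# Step 1: start the task; the agent hits an INFO action
result = ask_agent_start_new_task(
    device_id=device_id,
    task="去淘宝帮我选一个生日礼物",  # "Go to Taobao and pick a birthday gift for me"
    # ...
)
# Returns: stop_reason="INFO_ACTION_NEEDS_REPLY", session_id="xxx"
# Step 2: continue once the user has replied
result = ask_agent_continue(
    device_id=device_id,
    task=None,  # no need to restate the task
    session_id="xxx",  # reuse the earlier session ID
    reply_from_client="铜苹果",  # the user's reply ("copper apple")
    # ...
)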
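From the MCP client's side, the two-step exchange above generalizes to a loop that keeps asking the human until the agent stops returning INFO. The sketch below is an assumption-laden illustration: `run_with_human`, `fake_start`, and `fake_continue` are hypothetical, the keyword arguments mirror the example above (whether the deployed tool accepts session_id and reply_from_client is not confirmed here), and reading the question from `final_action["value"]` is an assumption based on the return keys and the INFO handling code quoted in this article.

```python
from typing import Any, Callable, Dict


def run_with_human(
    start_new_task: Callable[..., Dict[str, Any]],
    continue_task: Callable[..., Dict[str, Any]],
    ask_user: Callable[[str], str],
    device_id: str,
    task: str,
    max_rounds: int = 3,
) -> Dict[str, Any]:
    """Start a task, then relay INFO questions to the user until it finishes."""
    result = start_new_task(device_id=device_id, task=task)
    rounds = 0
    while result.get("stop_reason") == "INFO_ACTION_NEEDS_REPLY" and rounds < max_rounds:
        question = (result.get("final_action") or {}).get("value", "")
        reply = ask_user(question)
        result = continue_task(
            device_id=device_id,
            task=None,                        # the task is not restated
            session_id=result["session_id"],  # resume the same session
            reply_from_client=reply,
        )
        rounds += 1
    return result


# Stand-in tools so the sketch runs without a device or an MCP server.
def fake_start(device_id: str, task: str) -> Dict[str, Any]:
    return {"stop_reason": "INFO_ACTION_NEEDS_REPLY", "session_id": "xxx",
            "final_action": {"action_type": "INFO", "value": "Which gift?"}}


def fake_continue(device_id: str, task: Any, session_id: str,
                  reply_from_client: str) -> Dict[str, Any]:
    return {"stop_reason": "TASK_COMPLETED_SUCCESSFULLY", "session_id": session_id,
            "reply_seen": reply_from_client}


demo = run_with_human(fake_start, fake_continue, lambda question: "铜苹果",
                      device_id="device-1", task="去淘宝帮我选一个生日礼物")
```

The max_rounds cap is a design choice of this sketch: it prevents an agent that keeps asking questions from trapping the client in an endless loop.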
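Switching between tasks
# Start task A
result_a = ask_agent_start_new_task(
    device_id=device_id,
    task="打开微信并发送消息",  # "Open WeChat and send a message"
    # ...
)
# Once it finishes, start the unrelated task B
result_b = ask_agent_start_new_task(  # start_new_task resets the environment
    device_id=device_id,
    task="打开高德地图导航到公司",  # "Open Amap and navigate to the office"
    # ...
)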
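Environment state:
Session management:
When to use:
Context continuity:
This design lets the system handle both independent, discrete tasks and complex tasks that need continuity, which makes task execution more flexible and efficient.
The code of ask_agent_start_new_task is as follows: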
@mcp.tool
def ask_agent_start_new_task(
device_id: Annotated[str, Field(description="ID of the device to perform the task on. listed by list_connected_devices tool.")],
task: Annotated[str | None, Field(description="The task that the agent needs to perform on the mobile device. if this is not None, the agent will try to perform this task. if None, the session_id must be provided to continue the previous session.")],
# reset_environment: Annotated[bool, Field(description="Whether to reset the environment before executing the task, close current app, and back to home screen. If you want to execute a independent task, set this to True will make it easy to execute. If you want to continue the previous session, set this to False.")] = False,
max_steps: Annotated[int, Field(description="Maximum number of steps the agent can take to complete the task.")] = 20,
# session_id: Annotated[str | None, Field(description="Optional, session ID must provide when the last task endwith INFO action and you want to reply, the session id and device id and the reply from client must be provided.")] = None,
# When the INFO action is called, how to handle it.
# 1. "auto_reply": the INFO action will be handled automatically by calling the caption model to generate image captions.
# 2. "no_reply": the INFO action will be ignored. THE AGENT MAY GET STUCK IF THE INFO ACTION IS IGNORED.
# 3. "manual_reply": the INFO action will cause an interruption, and the user needs to provide the reply manually by input things in server's console.
# 4. "pass_to_client": the INFO action will be returned to the MCP client to handle it.
# reply_mode: Annotated[str, Field(description='''
# How to handle the INFO action during task execution.
# Options:
# - "auto_reply": Automatically generate image captions for INFO actions.
# - "no_reply": Ignore INFO actions (may cause the agent to get stuck).
# - "manual_reply": Interrupt and require user input for INFO actions.
# - "pass_to_client": Pass INFO actions to the MCP client for handling.
# ''')] = "auto_reply",
# reply_from_client: Annotated[str | None, Field(description="If the last task is ended with INFO action, and you want to give GUI agent a reply, provide the reply here. If you do so, you must provide last session id and last device id.")] = None,
) -> dict:
"""
# Ask GUI Agent to start performing a new task on a connected device.
Ask the GUI agent to perform the specified task on a connected device.
The GUI Agent can be able to understand natural language instructions and interact with the device accordingly.
The agent will be able to execute a high-level task description,if you have any additional requirements, write them down in detail at tast string.
This function will reset the environment before executing the task, close current app, and back to home screen.
if you have
## The agent has the below limited capabilities:
1. The task must be related to an app that is already installed on the device. for example, "打开微信,帮我发一条消息给张三,说今天下午三点开会"; "帮我在淘宝上搜索一款性价比高的手机,并加入购物车"; "to purchase an ea on Amazon".
2. The task must be simple and specific. for example, "do yyy in xxx app"; "find xxx information in xxx app". ONE THING AT ONE APP AT A TIME.
3. The agent may not be able to handle complex tasks that require multi-step reasoning or planning. for example. You may need to break down complex tasks into simpler sub-tasks and ask the agent to perform them sequentially. For example, instead of asking the agent to "plan a trip to Paris for xxx", you can ask it to "search for flights to Paris on xxx app", "find hotels in Paris on xxx app", make the plan yourself and ask agent to "sent the plan to xxx via IM app like wechat".
4. The agent connot accept multimodal inputs now. if you want to provide additional information like screenshot captions, please include them in the task description.
## Usage guidance:
1. you should never directly ask an Agent to pay or order anything. If user want to make a purchase, you should ask agent to stop brfore ordering/paying, and let user to order/pay.
2. tell the agent, if human verification is appeared during the task execution, the agent should ask Client. when the you see the INFO, you should ask user to handle the verification manually. after user says "done", you can continue the task with the session_id and device_id and ask the agent to continue in reply_from_client.
3. IF the last agentic call is failed or you want to perform a new task in different app, you should always use this function to start a new task, so that the environment will be reset before executing the task.
Returns:
dict: Execution log containing details of the task execution.
with keys including
- device_info: Information about the device used for task execution.
- final_action: The final action taken by the agent to complete the task.
- global_step_idx: The total number of steps taken during the task execution.
- local_step_idx: The number of steps taken in the current session.
- session_id: The session ID for maintaining context across multiple tasks.
- stop_reason: The reason for stopping the task execution (e.g., TASK_COMPLETED_SUCCESSFULLY).
- task: The original task description provided to the agent.
"""
    reply_mode = "pass_to_client"
    # if task is not None:
    #     assert session_id is None, "If task is provided, session_id must be None."
    #     # New task, so reset_environment is True
    #     reset_environment = True
    # else:
    #     assert session_id is not None, "If task is None, session_id must be provided to continue the previous session."
    #     # Continuing previous session, so reset_environment is False
    #     reset_environment = False
    reset_environment = True
    return_log = execute_task(
        device_id=device_id,
        task=task,
        reset_environment=reset_environment,
        max_steps=max_steps,
        # enable_intermediate_logs=False,
        # enable_intermediate_image_caption=False,
        #
        enable_intermediate_logs=True,
        # enable_intermediate_image_caption=False,
        enable_intermediate_image_caption=True,
        enable_intermediate_screenshots=False,
        enable_final_screenshot=False,
        # enable_final_image_caption=False,
        enable_final_image_caption=True,
        reply_mode=reply_mode,
        session_id=None,
        # session_id=session_id,
        reply_from_client=None,
        # reply_from_client=reply_from_client,
    )
    return return_log
The code of ask_agent_continue is as follows:
@mcp.tool
def ask_agent_continue(
device_id: Annotated[str, Field(description="ID of the device to perform the task on. listed by list_connected_devices tool.")],
task: Annotated[str | None, Field(description="The task that the agent needs to perform on the mobile device. if this is not None, the agent will try to perform this task. if None, the session_id must be provided to continue the previous session.")],
# reset_environment: Annotated[bool, Field(description="Whether to reset the environment before executing the task, close current app, and back to home screen. If you want to execute a independent task, set this to True will make it easy to execute. If you want to continue the previous session, set this to False.")] = False,
max_steps: Annotated[int, Field(description="Maximum number of steps the agent can take to complete the task.")] = 20,
# session_id: Annotated[str | None, Field(description="Optional, session ID must provide when the last task endwith INFO action and you want to reply, the session id and device id and the reply from client must be provided.")] = None,
# When the INFO action is called, how to handle it.
# 1. "auto_reply": the INFO action will be handled automatically by calling the caption model to generate image captions.
# 2. "no_reply": the INFO action will be ignored. THE AGENT MAY GET STUCK IF THE INFO ACTION IS IGNORED.
# 3. "manual_reply": the INFO action will cause an interruption, and the user needs to provide the reply manually by input things in server's console.
# 4. "pass_to_client": the INFO action will be returned to the MCP client to handle it.
# reply_mode: Annotated[str, Field(description='''
# How to handle the INFO action during task execution.
# Options:
# - "auto_reply": Automatically generate image captions for INFO actions.
# - "no_reply": Ignore INFO actions (may cause the agent to get stuck).
# - "manual_reply": Interrupt and require user input for INFO actions.
# - "pass_to_client": Pass INFO actions to the MCP client for handling.
# ''')] = "auto_reply",
# reply_from_client: Annotated[str | None, Field(description="If the last task is ended with INFO action, and you want to give GUI agent a reply, provide the reply here. If you do so, you must provide last session id and last device id.")] = None,
) -> dict:
"""
# Ask GUI Agent to continue performing a task on a connected device, using previous context.
Ask the GUI agent to perform the specified task on a connected device.
The GUI Agent can be able to understand natural language instructions and interact with the device accordingly.
The agent will be able to execute a high-level task description,if you have any additional requirements, write them down in detail at tast string.
This function will **NOT** reset the environment before executing the task, so that the agent can continue the previous session.
if you have
## The agent has the below limited capabilities:
1. The task must be related to an app that is already installed on the device. for example, "打开微信,帮我发一条消息给张三,说今天下午三点开会"; "帮我在淘宝上搜索一款性价比高的手机,并加入购物车"; "to purchase an ea on Amazon".
2. The task must be simple and specific. for example, "do yyy in xxx app"; "find xxx information in xxx app". ONE THING AT ONE APP AT A TIME.
3. The agent may not be able to handle complex tasks that require multi-step reasoning or planning. for example. You may need to break down complex tasks into simpler sub-tasks and ask the agent to perform them sequentially. For example, instead of asking the agent to "plan a trip to Paris for xxx", you can ask it to "search for flights to Paris on xxx app", "find hotels in Paris on xxx app", make the plan yourself and ask agent to "sent the plan to xxx via IM app like wechat".
4. The agent connot accept multimodal inputs now. if you want to provide additional information like screenshot captions, please include them in the task description.
## Usage guidance:
1. you should never directly ask an Agent to pay or order anything. If user want to make a purchase, you should ask agent to stop brfore ordering/paying, and let user to order/pay.
2. tell the agent, if human verification is appeared during the task execution, the agent should ask Client. when the you see the INFO, you should ask user to handle the verification manually. after user says "done", you can continue the task with the session_id and device_id and ask the agent to continue in reply_from_client.
3. IF the last agentic call is successful or the last action is INFO or the new task is related to the previous task, you can use this function to continue the task, so that the agent can finish the task faster by leveraging the previous context.
dict: Execution log containing details of the task execution.
with keys including
- device_info: Information about the device used for task execution.
- final_action: The final action taken by the agent to complete the task.
- global_step_idx: The total number of steps taken during the task execution.
- local_step_idx: The number of steps taken in the current session.
- session_id: The session ID for maintaining context across multiple tasks.
- stop_reason: The reason for stopping the task execution (e.g., TASK_COMPLETED_SUCCESSFULLY).
- task: The original task description provided to the agent.
"""
    reply_mode = "pass_to_client"
    # if task is not None:
    #     assert session_id is None, "If task is provided, session_id must be None."
    #     # New task, so reset_environment is True
    #     reset_environment = True
    # else:
    #     assert session_id is not None, "If task is None, session_id must be provided to continue the previous session."
    #     # Continuing previous session, so reset_environment is False
    #     reset_environment = False
    reset_environment = False
    return_log = execute_task(
        device_id=device_id,
        task=task,
        reset_environment=reset_environment,
        max_steps=max_steps,
        # enable_intermediate_logs=False,
        # enable_intermediate_image_caption=False,
        #
        enable_intermediate_logs=True,
        enable_intermediate_image_caption=True,
        enable_intermediate_screenshots=False,
        enable_final_screenshot=False,
        # enable_final_image_caption=False,
        enable_final_image_caption=True,
        reply_mode=reply_mode,
        session_id=None,
        # session_id=session_id,
        reply_from_client=None,
        # reply_from_client=reply_from_client,
    )
    return return_log
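What makes the INFO interaction mode special:
An INFO action can be handled by several strategies, selected through reply_mode:
- auto_reply: call a model automatically to generate the reply
- no_reply: ignore the INFO action (the code substitutes a canned reply telling the agent to continue), which may leave the agent stuck
- manual_reply: the reply is typed in manually
- pass_to_client: hand the INFO action to the MCP client to handle
Where is reply_mode set? As follows:
- execute_task defines the handling mode
- gui_agent_loop runs the corresponding logic according to reply_mode
Details of the auto-reply mechanism:
- auto_reply combines the current task, the screenshot, and the content of the INFO action
Details of manual reply handling:
- In manual_reply mode, the program pauses and waits for user input
The flow-control mechanism for INFO is as follows:
Session interruption and resumption:
- stop_reason is set to INFO_ACTION_NEEDS_REPLY
- session_id
- session_id is used to continue execution
Reply delivery mechanism:
- passed through the reply_from_client parameter
- passed to the agent through the query field
The information flow of an INFO action is as follows:
From agent to user:
From user to agent:
- passed through the reply_from_client parameter
- the reply_info variable stores the user's reply
- passed through the query field to the next automate_step call
Possible application scenarios for the INFO action: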
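Human-machine collaboration scenarios
CAPTCHA handling:
Confirmation of sensitive operations:
Information supplement scenarios
Gathering personalized information:
Decision support:
The INFO-related code is as follows: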
def gui_agent_loop(  # code omitted
):
    """
    Evaluate a task on a device using the provided frontend action converter and action function.
    """
    # code omitted (including the header of the per-step loop)
        action = uiTars_to_frontend_action(action)
        if action['action_type'].upper() == "INFO":
            if reply_mode == "auto_reply":
                print(f"AUTO REPLY INFO FROM MODEL!")
                reply_info = auto_reply(image_b64_url, task, action, model_provider=agent_loop_config['model_config']['model_provider'], model_name=agent_loop_config['model_config']['model_name'])
                print(f"info: {reply_info}")
            elif reply_mode == "no_reply":
                print(f"INFO action ignored as per reply_mode=no_reply. Agent may get stuck.")
                reply_info = "Please follow the task and continue. Don't ask further questions."
                # do nothing, agent may get stuck
            elif reply_mode == "manual_reply":
                print(f"EN: Agent asks: {action['value']} Please Reply: ")
                print(f"ZH: Agent 问你: {action['value']} 回复一下:")
                reply_info = input("Your reply:")
                print(f"Replied info action: {reply_info}")
            elif reply_mode == "pass_to_client":
                print(f"Passing INFO action to client for reply.")
                # break the loop and return to client for handling
                stop_reason = "INFO_ACTION_NEEDS_REPLY"
                break
            else:
                raise ValueError(f"Unknown reply_mode: {reply_mode}")
        # code omitted
        act_on_device(action, device_id, device_wm_size, print_command=True, reflush_app=reflush_app)
        history_actions.append(action)
    # code omitted
    if stop_reason in ['MANUAL_STOP_SCREEN_OFF', 'INFO_ACTION_NEEDS_REPLY', "NOT_STARTED"]:
        pass
    elif action['action_type'].upper() == 'COMPLETE':
        stop_reason = "TASK_COMPLETED_SUCCESSFULLY"
    elif action['action_type'].upper() == 'ABORT':
        stop_reason = "TASK_ABORTED_BY_AGENT"
    elif step_idx == max_steps - 1:
        stop_reason = "MAX_STEPS_REACHED"
    return return_log
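The stop_reason branch at the end of the excerpt is easier to test when pulled out as a pure function. The sketch below does that; `resolve_stop_reason` and `PRESERVED_REASONS` are hypothetical names, while the strings and the order of the checks are taken from the excerpt. Returning the incoming value when no branch matches is an assumption, since the surrounding code is omitted.

```python
PRESERVED_REASONS = ("MANUAL_STOP_SCREEN_OFF", "INFO_ACTION_NEEDS_REPLY", "NOT_STARTED")


def resolve_stop_reason(stop_reason: str, action_type: str,
                        step_idx: int, max_steps: int) -> str:
    """Apply the excerpt's precedence: preserved reasons, COMPLETE, ABORT, step limit."""
    if stop_reason in PRESERVED_REASONS:
        return stop_reason  # already decided earlier, keep it
    kind = action_type.upper()
    if kind == "COMPLETE":
        return "TASK_COMPLETED_SUCCESSFULLY"
    if kind == "ABORT":
        return "TASK_ABORTED_BY_AGENT"
    if step_idx == max_steps - 1:
        return "MAX_STEPS_REACHED"
    return stop_reason  # assumption: otherwise leave the value unchanged
```

The ordering matters: a COMPLETE action on the very last allowed step is reported as success rather than as MAX_STEPS_REACHED, because the action check comes before the step-limit check.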
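The role of the auto_reply function is as follows:
Information processing
Input handling
Output generation
The relationship between auto_reply and the other components is as follows:
Relationship with gui_agent_loop
Relationship with execute_task
Relationship with ask_llm_anything
Relationship with the MCP server
Workflow
Trigger conditions
Processing
Output
Significance
The auto_reply code is as follows: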
def auto_reply(current_image_url, task, info_action, model_provider, model_name):
    """
    Reply with information action.
    """
    # The prompt (in Chinese) tells the model to play the user of the GUI Agent,
    # read the task goal and the agent's question, and answer briefly and directly
    # with no extra explanation, quoting, or politeness.
    messages_to_ask = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"""# 角色
你将扮演一个正在使用GUI Agent完成任务的用户。
# 任务
阅读下方提供的所有背景信息,针对[Agent的澄清问题],生成一个提供关键信息的、简短直接的回答。
# 背景信息
- **任务目标:** {task}
- **agent 问的问题:** {json.dumps(info_action, ensure_ascii=False)}
# 输出要求
- 你的回答必须极其简短和明确。
- 你的回答应直接命中问题的核心,解决Agent的疑惑。
- 不要进行任何额外的解释、对话或使用礼貌用语。
- 只输出回答本身,不要添加任何引号或其他修饰。
以下是当前页面内容:
""",
                },
                {
                    'type': "image_url",
                    'image_url': {
                        'url': current_image_url
                    }
                },
                {
                    "type": "text",
                    "text": '请基于以上信息,简洁直接地回答Agent的问题。'
                }
            ]
        }
    ]
    response = ask_llm_anything(
        model_provider=model_provider,
        model_name=model_name,
        messages=messages_to_ask,
        args={
            "max_tokens": 1024,
            "temperature": 0.5,
            "top_p": 1.0,
            "frequency_penalty": 0.0,
        }
    )
    # Drop any reasoning trace and keep only the text after the last </think>
    if "</think>" in response:
        response = response.split("</think>")[-1].strip()
    return response
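The last step of auto_reply, removing a reasoning trace from the model output, can be checked in isolation. In the sketch below, `strip_reasoning` is a hypothetical helper name; its body reproduces the two lines from the listing, including the detail that a response without the marker is returned untouched (not even stripped of whitespace).

```python
def strip_reasoning(response: str, marker: str = "</think>") -> str:
    """Keep only the text after the last reasoning marker, as in the listing."""
    if marker in response:
        return response.split(marker)[-1].strip()
    return response


with_trace = strip_reasoning("<think>the user wants a gift</think>\n铜苹果 ")
without_trace = strip_reasoning(" plain answer ")
```

Splitting on the closing marker and taking the last piece means that even a response containing several reasoning blocks yields only the final answer text.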
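Next we analyze some of the system's special aspects.
Plugin-based architecture
Configuration-driven management
- mcp_server_config.yaml manages the service parameters
- model_config.yaml manages the model parameters
mcp-扩展
Action type enum
Action validation mechanism
- action_assertion: validates the action format and parameter completeness
Action conversion mapping
Action implementation
- add the handling logic for a new action in the act_on_device function
Device operation layer
Coordinate handling
- convert_point_to_realworld_point: the coordinate conversion function
Defining a new action
- add the new action to _ACTION_TYPE_ENUM in action_tools.py
- add parameter validation for the new action in action_assertion
Implementing the action conversion
- add the new action mapping to action_type_map in pu_frontend_executor.py
- add conversion handling for the new action in step_api_to_frontend_action
Implementing the device operation
- add the ADB command implementation for the new action in act_on_device
Parser support
- add parsing support for the new action in parser_0920_summary.py
Adding a SCROLL action
- add "SCROLL" to _ACTION_TYPE_ENUM
- add SCROLL parameter validation in action_assertion
- add the conversion logic in step_api_to_frontend_action
- implement the SCROLL ADB command in act_on_device
Extended parameter support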
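The SCROLL steps listed above could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ACTION_TYPES` stands in for `_ACTION_TYPE_ENUM` (whose real contents are not shown in this article), `assert_scroll_action` and `scroll_to_adb_command` are hypothetical helpers, and the `start`/`end` field names are invented. The only external fact relied on is the Android shell command `input swipe <x1> <y1> <x2> <y2> [duration_ms]`. The command is built as a list and not executed.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for _ACTION_TYPE_ENUM, extended with the new action.
ACTION_TYPES = {"CLICK", "TYPE", "INFO", "COMPLETE", "ABORT"}
ACTION_TYPES.add("SCROLL")


def assert_scroll_action(action: Dict[str, Any]) -> None:
    """Validate the (assumed) SCROLL parameters: two 2-element points."""
    if str(action.get("action_type", "")).upper() != "SCROLL":
        raise ValueError("not a SCROLL action")
    for key in ("start", "end"):
        point = action.get(key)
        if not (isinstance(point, (list, tuple)) and len(point) == 2):
            raise ValueError(f"SCROLL needs a 2-element '{key}' point")


def scroll_to_adb_command(action: Dict[str, Any], device_id: str,
                          duration_ms: int = 300) -> List[str]:
    """Translate a validated SCROLL action into an `adb shell input swipe` command."""
    assert_scroll_action(action)
    (x1, y1), (x2, y2) = action["start"], action["end"]
    return ["adb", "-s", device_id, "shell", "input", "swipe",
            str(int(x1)), str(int(y1)), str(int(x2)), str(int(y2)), str(duration_ms)]


demo_command = scroll_to_adb_command(
    {"action_type": "SCROLL", "start": [100, 800], "end": [100, 200]},
    device_id="emulator-5554",
)
```

Building the command as an argument list keeps the sketch side-effect free; an actual implementation inside act_on_device would also have to convert model coordinates to real screen coordinates first, which is what convert_point_to_realworld_point is described as doing.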
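CopilotClientRolloutRunner exists in the code but is never instantiated or called, so we infer its purpose directly from the code.
The "rollout" in CopilotClientRolloutRunner does come from reinforcement learning, where a rollout is the complete trajectory obtained by executing a sequence of actions from the current state until a terminal state.
CopilotClientRolloutRunner is in practice a framework for running tasks on multiple devices in parallel. Its main job is to manage GUI agent task execution across devices, running tasks on several devices at once to improve throughput. It is conceptually related to reinforcement learning but does not implement any reinforcement learning algorithm.
Main uses:
Of course, the execution trajectories of GUI agent tasks can be used to train or evaluate agents, so the CopilotClientRolloutRunner class may be used to collect reinforcement learning training data, with the recorded trajectories feeding later training.
It uses the multiprocessing module for parallelism, with four main kinds of process:
- logger_runner: the logging process
- reader_runner: the task reading and dispatching process
- work_runner: one worker process per device
- writer_runner: the result writing process
Task queues and load balancing
CopilotClientRolloutRunner maintains a separate task queue, task_queue, for each device
When dispatching tasks it checks the result file so that tasks already completed are not run again
It supports retrying failed tasks: a failed task is put back on the queue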
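The three queue behaviors just described (a per-device queue, skipping finished tasks, and requeueing failures) can be illustrated with a single-process sketch. Note the substitution: the article says the real runner uses multiprocessing, whereas this sketch uses the standard-library `queue.Queue` in one process so that it stays deterministic. `drain_device_queue`, the task dictionaries, and the `max_retries` parameter are hypothetical.

```python
import queue
from typing import Any, Callable, Dict, List, Set


def drain_device_queue(tasks: List[Dict[str, Any]], done_ids: Set[str],
                       run_task: Callable[[Dict[str, Any]], Any],
                       max_retries: int = 2) -> List[Dict[str, Any]]:
    """Run one device's tasks, skipping finished ones and retrying failures."""
    task_queue: "queue.Queue" = queue.Queue()
    for task in tasks:
        if task["id"] not in done_ids:      # already in the result file: skip
            task_queue.put((task, 0))       # (task, number of failed attempts so far)
    results = []
    while not task_queue.empty():
        task, failed = task_queue.get()
        try:
            outcome = run_task(task)
            results.append({"id": task["id"], "result": outcome, "attempts": failed + 1})
        except Exception:
            if failed < max_retries:
                task_queue.put((task, failed + 1))  # put the failed task back
            else:
                results.append({"id": task["id"], "result": None, "attempts": failed + 1})
    return results


# Demo: task "b" fails once, task "c" is already done.
calls = {"b": 0}


def flaky_run(task: Dict[str, Any]) -> str:
    if task["id"] == "b":
        calls["b"] += 1
        if calls["b"] == 1:
            raise RuntimeError("transient device error")
    return "ok:" + task["id"]


demo_results = drain_device_queue([{"id": "a"}, {"id": "b"}, {"id": "c"}],
                                  done_ids={"c"}, run_task=flaky_run)
```

In the demo, "c" never runs, "a" succeeds on its first attempt, and "b" succeeds on its second after being put back on the queue.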
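Result collection and persistence
- done_queue collects the execution results from all devices
- the writer_runner process writes the results to the specified output file
Device management and monitoring
- device_name_map supports naming devices
Integration with the GUI agent system
- the evaluate_task_on_device function executes the actual GUI agent task
- LocalServer integration runs tasks with the specified configuration
Data collection and processing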
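Feedback data types
Feedback data conversion
Feedback data delivery
- automate_step interface: passes the feedback data to the AI model
- session_id keeps the context continuous
Task execution loop
Feedback-driven decisions
- ask_llm_anything processes the feedback data and generates the next action
- parser.str2action converts the model output into an executable action
- act_on_device executes the parsed action
User interaction loop
Execution logging
Session state tracking
- session_id management: maintains the unique session identifier
Exception handling and termination conditions
Loop termination conditions
stop_reason mechanism: defines several termination conditions
- TASK_COMPLETED_SUCCESSFULLY: the task completed successfully
- TASK_ABORTED_BY_AGENT: the agent aborted on its own initiative
- MAX_STEPS_REACHED: the maximum number of steps was reached
- INFO_ACTION_NEEDS_REPLY: a reply from the user is needed
Exception handling mechanism
- detect_screen_on makes sure the device is in a usable state
从HITL(Human In The Loop) 实践出发看Agent与设计模式的对跖点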