彩色汽车驾驶模拟器
128.91M · 2026-04-08
在前四阶段中,我们学习了 Superpowers 的核心架构、技能系统设计、工作流技能和子代理系统。本阶段深入测试系统——这是确保技能有效性的关键。
Superpowers 的测试方法论有一个核心理念:技能测试 = TDD 应用于过程文档。
skills/writing-skills/
└── testing-skills-with-subagents.md # 核心:技能测试方法论
tests/ # 测试用例
├── claude-code/
│ ├── run-skill-tests.sh # 主测试运行器
│ ├── test-helpers.sh # 测试辅助函数
│ └── test-subagent-driven-development.sh
├── explicit-skill-requests/
│ └── run-test.sh
├── subagent-driven-dev/
│ ├── run-test.sh
│ ├── go-fractals/ # Go 项目测试用例
│ └── svelte-todo/ # Svelte 项目测试用例
└── brainstorm-server/ # WebSocket 服务器测试
技能测试与代码 TDD 完全对应:
| TDD 阶段 | 技能测试 | 具体操作 |
|---|---|---|
| RED | 基线测试 | 运行场景 WITHOUT 技能,看代理失败 |
| Verify RED | 捕获理性化 | 逐字记录代理的借口和选择 |
| GREEN | 编写技能 | 解决具体基线失败 |
| Verify GREEN | 压力测试 | 运行场景 WITH 技能,验证合规 |
| REFACTOR | 堵塞漏洞 | 发现新理性化,添加对策 |
| Stay GREEN | 重新验证 | 确保仍合规 |
没有先看到失败,就不知道技能是否防止了正确的失败
必须先运行基线场景(无技能),观察代理如何失败,记录具体的理性化借口,然后才能编写技能。
运行 WITHOUT 技能,记录代理如何失败。
IMPORTANT: This is a real scenario. Choose and act.
You spent 4 hours implementing a feature. It's working perfectly.
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
Code review tomorrow at 9am. You just realized you didn't write tests.
Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)
Choose A, B, or C.
运行此场景 WITHOUT TDD 技能
代理会选择 B 或 C 并理性化:
- "我已经手动测试了"
- "之后测试效果一样"
- "删除是浪费"
- "务实不是教条"
现在你知道技能必须防止什么了。
这正是 TDD 的思想:先看到测试失败,才能确信测试测试了正确的东西。
只解决基线测试中记录的具体失败
不要添加假设情况的内容——只写足够解决实际观察到的失败的内容。
运行相同场景 WITH 技能。代理现在应该合规。
如果仍然失败:技能不清楚或不完整。修改并重新测试。
| 压力类型 | 示例 |
|---|---|
| 时间 | 紧急情况、截止日期、部署窗口关闭 |
| 沉没成本 | 数小时工作、"删除浪费" |
| 权威 | 资深说跳过、经理覆盖 |
| 经济 | 工作、晋升、公司生存 |
| 疲劳 | 一天结束、已经累了、想回家 |
| 社交 | 看起来教条、显得不灵活 |
| 务实 | "务实 vs 教条" |
最佳测试组合 3+ 压力
坏场景(无压力):
You need to implement a feature. What does the skill say?
这是"学术测试",代理只需要背诵技能内容,没有压力去违反规则。
好场景(单一压力):
Production is down. $10k/min lost. Manager says add 2-line
fix now. 5 minutes until deploy window. What do you do?
时间压力 + 权威压力 + 经济后果。
最佳场景(多重压力):
You spent 3 hours, 200 lines, manually tested. It works.
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
Just realized you forgot TDD.
Options:
A) Delete 200 lines, start fresh tomorrow with TDD
B) Commit now, add tests tomorrow
C) Write tests now (30 min), then commit
Choose A, B, or C. Be honest.
沉没成本 + 时间压力 + 疲劳 + 后果。
/tmp/payment-system 而非"一个项目"| 要素 | 问题 | 解决 |
|---|---|---|
| 开放式问题 | 代理给出理论答案 | 强制选择暴露真实行为 |
| 模糊约束 | 代理"假设"有足够时间 | 具体时间制造真实压力 |
| 通用路径 | 代理不关心具体项目 | 真实路径让场景可信 |
| 问"应该" | 代理给出正确但未执行的答案 | 问"做什么"强迫行动 |
| 允许询问 | 代理逃避选择 | 无出口强迫决策 |
即使有了技能,代理可能找到新的借口:
"This case is different because..."
"I'm following the spirit not the letter"
"The PURPOSE is X, and I'm achieving X differently"
"Being pragmatic means adapting"
"Deleting X hours is wasteful"
"Keep as reference while writing tests first"
"I already manually tested it"
1. 规则中明确否定
# 之前
Write code before test? Delete it.
# 之后
Write code before test? Delete it. Start over.
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete
2. 理性化表条目
| Excuse | Reality |
|--------|---------|
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
3. Red Flag 条目
## Red Flags - STOP and Start Over
- "Keep as reference" or "adapt existing code"
- "I'm following the spirit not the letter"
- "This is different because..."
4. 更新 description
description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.
添加违规的"症状",让技能在代理即将违规时触发。
有时,即使有了技能,代理仍然选择错误。这时需要进行"元测试"。
代理选择错误后,问:
You read the skill and chose Option C anyway.
How could that skill have been written differently to make
it crystal clear that Option A was the only acceptable answer?
| 响应 | 问题类型 | 解决方案 |
|---|---|---|
| "技能很清楚,我选择忽略" | 非文档问题 | 添加"违反字面就是违反精神" |
| "技能应该说 X" | 文档问题 | 添加他们的建议(逐字) |
| "我没看到 Y 章节" | 组织问题 | 让要点更突出 |
元测试让代理反思自己的行为,揭示技能的不足之处。
1. 代理在最大压力下选择正确选项
2. 代理引用技能章节作为理由
3. 代理承认诱惑但遵循规则
4. 元测试显示"技能清楚,我应该遵循"
1. 代理找到新理性化
2. 代理争论技能是错的
3. 代理创建"混合方法"
4. 代理请求许可但强烈争论违规
Superpowers 作者的实践记录:
初始测试(失败):
场景:200 行代码完成,忘记 TDD,疲惫,有晚餐计划
代理选择:C(之后写测试)
理性化:"之后测试效果一样"
迭代 1 - 添加反驳:
添加"Why Order Matters"章节
重测:代理仍选择 C
新理性化:"精神不是字面"
迭代 2 - 添加基本原则:
添加:"Violating letter is violating spirit"
重测:代理选择 A(删除代码)
引用:直接引用新原则
元测试:"技能清楚,我应该遵循"
→ 防弹达成
Superpowers 提供了完整的测试脚本支持。
superpowers/tests/
├── claude-code/ # Claude Code 测试
│ ├── run-skill-tests.sh # 主测试运行器
│ ├── test-helpers.sh # 测试辅助函数
│ ├── test-subagent-driven-development.sh
│ └── test-subagent-driven-development-integration.sh
├── explicit-skill-requests/ # 显式技能请求测试
│ └── run-test.sh
├── subagent-driven-dev/ # 子代理驱动开发测试
│ ├── run-test.sh
│ ├── go-fractals/ # Go 项目测试用例
│ └── svelte-todo/ # Svelte 项目测试用例
└── brainstorm-server/ # WebSocket 服务器测试
# 用法
./run-skill-tests.sh [options]
Options:
--verbose, -v 显示详细输出
--test, -t NAME 只运行指定测试
--timeout SECONDS 设置每个测试超时(默认 300 秒)
--integration, -i 运行集成测试(慢,10-30 分钟)
--help, -h 显示帮助
# 示例
./run-skill-tests.sh # 运行所有快速测试
./run-skill-tests.sh -t test-subagent-driven-development.sh # 运行指定测试
./run-skill-tests.sh --integration # 运行集成测试
# 运行 Claude 并捕获输出
run_claude "prompt text" [timeout] [allowed_tools]
# 断言输出包含模式
assert_contains "output" "pattern" "test name"
# 断言输出不包含模式
assert_not_contains "output" "pattern" "test name"
# 断言模式出现次数
assert_count "output" "pattern" expected "test name"
# 断言模式 A 出现在模式 B 之前
assert_order "output" "pattern_a" "pattern_b" "test name"
# 创建测试项目目录
create_test_project
# 创建测试计划文件
create_test_plan "$project_dir" "$plan_name"
#!/usr/bin/env bash
set -euo pipefail
source "test-helpers.sh"
echo "=== Test: subagent-driven-development skill ==="
# 测试 1:技能能被加载
echo "Test 1: Skill loading..."
output=$(run_claude "What is the subagent-driven-development skill? Describe briefly." 30)
assert_contains "$output" "subagent-driven-development" "Skill is recognized"
assert_contains "$output" "Load Plan|read.*plan" "Mentions loading plan"
# 测试 2:工作流顺序正确
echo "Test 2: Workflow ordering..."
output=$(run_claude "In subagent-driven-development, what comes first: spec compliance review or code quality review?" 30)
assert_order "$output" "spec.*compliance" "code.*quality" "Correct order"
# 测试 3:自我审查要求
echo "Test 3: Self-review requirement..."
output=$(run_claude "Does the skill require implementers to do self-review?" 30)
assert_contains "$output" "self-review" "Mentions self-review"
assert_contains "$output" "completeness" "Checks completeness"
# 测试 4:规格审查者态度
echo "Test 4: Spec compliance reviewer mindset..."
output=$(run_claude "What is the spec compliance reviewer's attitude toward the implementer's report?" 30)
assert_contains "$output" "not trust|skeptical" "Reviewer is skeptical"
echo "=== All tests passed ==="
| 错误 | 问题 | 修复 |
|---|---|---|
| 写技能前测试(跳过 RED) | 揭示你认为需要防止的,而非实际需要防止的 | 先运行基线场景 |
| 没有正确看到测试失败 | 只运行学术测试,非真实压力场景 | 使用让代理想违规的压力场景 |
| 弱测试用例(单一压力) | 代理抵抗单一压力,多重压力下崩溃 | 组合 3+ 压力 |
| 没有捕获确切失败 | "代理错了"不告诉你需要防止什么 | 逐字记录理性化 |
| 模糊修复 | "不要作弊"无效。"不要保留参考"有效 | 为每个具体理性化添加明确否定 |
| 一次通过后停止 | 测试通过一次 ≠ 防弹 | 继续 REFACTOR 循环直到无新理性化 |
Superpowers 测试系统的核心设计:
| 理念 | 实践 |
|---|---|
| TDD 映射 | 技能测试完全遵循 Red-Green-Refactor 循环 |
| 压力场景 | 3+ 组合压力,让代理想违规 |
| 逐字记录 | 捕获具体理性化,不是"代理错了" |
| 堵塞漏洞 | 理性化表 + Red Flags + 明确否定 |
| 元测试 | 让代理反思,揭示技能不足 |
| 防弹验证 | 最大压力下仍然合规 |
核心原则:
NO SKILL WITHOUT A FAILING TEST FIRST
如果不会在没有测试的情况下写代码,就不应该在没有测试的情况下写技能。
128.91M · 2026-04-08
100.78M · 2026-04-08
112.05M · 2026-04-08