AI Agent on Steve Sun

How I Use Hermes Agent to Write Code

Sun, 07 Jun 2026 10:00:00 +0800

I run two Hermes Agents in parallel.

One is called Super Juaner and handles daily conversation and information retrieval. The other is called Code Juaner and works exclusively on software engineering. Two independent Telegram bots, independent configurations, independent session databases.

Running two Agents separately is for context and environment isolation.

Core discipline: Code Juaner never writes code.

All coding is delegated to Codex CLI for execution. Code Juaner only does the Product Owner work: writing requirement definitions, making architecture decisions, and accepting deliverables. Codex is the implementer: reads the spec, writes code, runs tests.

Role	Output	Responsibility
Code Juaner (Hermes PO)	Feature doc + acceptance	Requirement definition, architecture decisions, quality control
Codex CLI (Implementer)	Code + tests	Technical solution, coding implementation, self-testing

The flow is simple. The user says “I need a feature.” Code Juaner writes a feature doc containing only the requirement description and verifiable acceptance criteria. Every acceptance criterion must be verifiable—“clicking changes the URL to /zh/monaco” rather than “navigation is correct.” Then Codex reads the doc, plan mode produces a technical spec, and build mode implements plus tests. Code Juaner accepts each criterion one by one; if all pass, deploy; if there are issues, summarize and send back.

Code Juaner never reads code files, never finishes reading code to tell Codex how to write. For project-level questions, delegate to Codex plan mode for investigation. Codex timeout means reporting back directly and waiting for next instructions.

The acceptance standard is that npm run build passing is only the minimum bar. For behavior changes, walk through the full user path in the browser before marking complete.

The Three-Layer Gate

Discipline sounds simple but is easy to forget in practice. The model drifts in long conversations—after thirty turns, it may think it can write code and start modifying files directly. I built three layers of defense.

Layer 1: System prompt (hardest, unbypassable)

Locked in config.yaml: all coding must be delegated to Codex CLI for execution, never write code yourself, report and wait after Codex fails. Injected every turn, can’t be avoided.

The system prompt’s lifecycle in Hermes is longer than SOUL.md; it doesn’t reset between conversations, so it better prevents forgetting after long conversations. Even if the model starts drifting after dozens of turns, this terminal instruction is still there.

Layer 2: SOUL.md + personality (loaded at session start)

SOUL.md defines personality and creed: terse, conclusion-first, type-safety > runtime correctness > performance optimization. But its effective range is the start of each conversation, weaker than the system prompt.

SOUL loads a development-workflow skill I co-developed with Super Juaner, which defines pre-checks when picking up code: must load the workflow skill first before making decisions, don’t skip. This file hosts my entire coding discipline as an initial activation guide.

Layer 3: Plugin gate (physical interception, unbypassable)

Two plugins intercept Code Juaner’s source code read/write. write-code-gate blocks write_file and patch on source files like .ts, .tsx, .py and returns refusal. read-code-gate blocks read_file and search_files on source files. Non-source files like .md, .yaml, .toml pass through.

The blocked extensions cover 30+ common languages. Exempt paths include .hermes/, node_modules/, .next/ and other non-project directories.

Plugins are loaded via Hermes’s pre_tool_call hook, take effect at session start, and remain available after gateway restarts.

The three layers increase in hardness from outside to inside; when rules conflict, the hardest layer wins. The system prompt is a sticky note; the plugin is a locked door.

These practices aren’t new. Most of them are decades of accumulated software engineering—separation of roles, clear responsibilities, acceptance-first—just wearing a new skin in the AI era and landing in a new way.

Dual-Track Workflow

Take different tracks based on task nature.

Track A: New project or feature from zero to one

Full pipeline: Feature Doc → Plan → Build → Verify → Fix (loop).

Phase one: Code Juaner writes the feature doc. Phase two: Codex plan produces the technical spec, including file list, implementation plan, data structures, and test cases. Phase three: Codex build reads the spec, implements all files, runs tests. Phase four: Code Juaner accepts each criterion.

Acceptance walks through five layers: data layer (type definitions, state management), logic layer (hooks, reducers), presentation layer (component rendering, interaction binding), config layer (static config, data loading), browser verification (route checks, localStorage confirmation, screenshot comparison).

Track B: Bug fixes or feature modifications

Most projects go through Track B. Skipping the feature doc only applies to single-file, pure styling, copy changes, or minor config tweaks. Multi-file changes, those involving data persistence, state machines, or new components must write a feature doc plus spec before dispatching Codex.

When delegating to Codex, give only the goal and acceptance criteria, not the implementation plan. Codex reads the code and designs by itself.

Preview Tunnel

In-progress websites need to be previewed on mobile. I use Cloudflare Tunnel to expose the local Next.js dev server to the public internet; Cloudflare assigns a temporary trycloudflare.com domain, open it in a mobile browser to preview.

Early on I stopped and restarted the tunnel after every change, which triggered Cloudflare rate limits. Each stop and restart assigned a new URL, requiring me to reopen it on the phone—annoying.

Later I wrote a tunnel-manager.sh script. The first start runs cloudflared as a background daemon; subsequent builds only restart the next server, not touching the tunnel. The tunnel daemon is reused across preview sessions; it only rebuilds on machine restart or cloudflared crash. The same URL persists through the entire development cycle, eliminating a lot of repetitive operations.

Git plus Vercel Deployment

Hit one silent pitfall: Vercel’s GitHub integration checks whether the commit author email is a real GitHub account email. When it doesn’t match, the CLI reports “Your deployment failed” and vercel inspect shows Builds: [0ms]—the build never started, no logs at all.

Fix: use the system git identity for commit, and rebase to reset the author if there’s a mismatch.

Deployment uses Vercel CLI with --no-wait to avoid timeout blocking. GitHub Actions’ deploy.yml triggers automatically on push to the main branch.

Some Reflections

After all those specific practices, let me share some thoughts.

AI writing code has developed too fast. At the start of the year I was still manually writing every line; by mid-year I already had a pipeline that can independently complete features, acceptance, and deployment. What I do has shifted from writing code to defining rules, setting boundaries, and accepting results.

The engineer’s role is changing. You used to be a bricklayer, placing bricks one by one. Now you’re a rancher: set up the fences, put out enough feed, and let the herd graze and grow. The software industry is shifting from construction to animal husbandry.

This trend will only accelerate. Better models mean simpler harnesses, cheaper execution costs. You don’t need to be the best programmer—you need to be the best rule-maker. Define clearly what’s allowed, what’s not allowed, and what counts as done—leave the rest to the agent.

The time I spend defining boundaries has a much higher return than the time I spend on actual coding. That’s the most surprising discovery.

How to Design an AI Agent

Tue, 26 May 2026 16:11:00 +0800

AI Agents will most likely be the paradigm for future AI software design, so for most developers and non-technical people just getting into vibe coding, understanding how they’re designed and the principles behind them will help you design next-generation application software more effectively.

This post tries to use plain language to help you understand what an AI Agent is, what problems it solves, and which protocols and tools will come into play as part of that infrastructure.

Target audience:

Vibe coders (rapid prototyping, build and iterate)
Programmers
Non-technical users just starting to code

First Principles: What Problems Does an Agent Framework Actually Solve?

Models are powerful, but unreliable

Large Language Models (LLMs) “guess”; they don’t “guarantee.” So you can’t treat them as deterministic programs (same input always produces same output).

Problems to solve:

How to bring unstable output into a controllable flow
How to know where the failure is when something breaks

Real-world task results are usually not simple answers, but complete workflow outputs

Real tasks typically involve:

Read information
Make decisions
Call tools
Continue deciding based on tool results
Ultimately produce documents, code, or other artifacts

This means an Agent’s design goal isn’t limited to “question and answer,” but is a “cyclic decision system.”

Users don’t want to wait until the end to see results

When interacting with AI, users typically want:

Visible process (streaming)
Ability to interrupt (abort the current task)
Ability to add instructions mid-task (steer, guide while executing)

So the system must natively support real-time interaction, not one-shot black-box execution.

Context keeps growing, costs keep growing

The longer the conversation, the larger the input, the slower the speed, the higher the cost, and it may even exceed limits. There must be a mechanism to “compress history while preserving key information.”

One core serving multiple interaction modes

The same Agent must run on:

Terminal UI (TUI)
Remote Procedure Call (RPC)
Future Web or App interfaces

So the “intelligent core” and “presentation layer” must be decoupled (independent, not bound together).

From Problems to Requirements, Then to Design

Requirements Checklist

A usable Agent framework must at minimum satisfy:

Looping: supports “think → call tool → think again”
Observable: every step is visible to UI or logging
Controllable: can pause, cancel, interrupt, resume
Recoverable: retry on failure, can continue from the last session
Extensible: add new tools, new models, new frontends
Governable: clear boundaries on cost, context, and permissions

End-to-End Flowchart

Going from problems to requirements, and requirements to design, we get the following flowchart:

flowchart TD A[User states goal] --> B[Agent understands current task] B --> C{Need a tool?} C -- No --> D[Give answer directly] C -- Yes --> E[Generate tool call request] E --> F[Execute tool] F --> G[Get tool result] G --> H{Result sufficient?} H -- No --> B H -- Yes --> D D --> I[Stream back to user] I --> J[User can interrupt/add requirements] J --> B

This diagram expresses:

An Agent is a closed-loop system, not a single function call.
“Tools” are capability amplifiers, not accessories.
The user is in the loop, not outside it.

Overall Architecture Diagram

graph LR subgraph Interaction Layer UI1[TUI/CLI] UI2[RPC/API] UI3[Web/App] end subgraph Runtime Layer SESSION[Session Orchestrator] POLICY[Policy Center: Retry/Compression/Budget] end subgraph Core Layer LOOP[Agent Decision Loop] STATE[State Management] EVENTS[Event Bus] end subgraph Capability Layer TOOLS[Tool System] MODEL[Model Adapter] MEMORY[Memory and Context Management] end UI1 --> SESSION UI2 --> SESSION UI3 --> SESSION SESSION --> LOOP SESSION --> POLICY LOOP <--> STATE LOOP --> EVENTS LOOP --> TOOLS LOOP --> MODEL POLICY <--> MEMORY MODEL --> LLM[External Model Service]

Component Diagram (Understanding “Who Owns What”)

flowchart LR USER[User] ORCH[Session Orchestrator] CORE[Agent Core] ADAPTER[Model Adapter] TOOLRUN[Tool Executor] OBS[Observation and Events] USER <--> ORCH ORCH <--> CORE CORE <--> ADAPTER CORE <--> TOOLRUN CORE --> OBS OBS --> ORCH

Responsibility split:

Session Orchestrator: handles user input, session state, retry and compression policies.
Agent Core: only does the “thinking loop” and “state advancement.”
Model Adapter: shields differences between model providers.
Tool Executor: uniformly executes local or remote tools.
Observation and Events: turns the process into visible signals for UI/log systems.

To Land These Designs, What Protocols and Foundational Patterns Are Required?

This section is the “minimum necessities” to complete the design above. We need to consider which engineering practices to introduce from a protocols and design-patterns standpoint. (Like building a skyscraper, you need to define the materials, the common engineering designs you can reuse, and how to make the structure mechanically stand the test of time.)

Most of these protocols are currently designed and implemented by developers on demand, but standards will likely emerge in the near future.

Required Protocols (Skipping Any Causes Loss of Control)

Message Protocol

Unifies how user messages, assistant messages, and tool results are described.

Event Protocol

Unifies how start, update, end, error, and tool execution status are described.
Purpose: lets UI and logs see the “process,” not just the “outcome.”

Tool Contract

Tool name, parameter structure (Schema), and execution return format must be fixed.

Streaming Contract

Supports incremental output (delta) to guarantee real-time user feedback.

Cancellation Contract

Any link in the chain should respond to abort signals, avoiding “can’t stop.”

Error Contract

Failures must be structured (machine-processable), not just string error messages.

Foundational Design Patterns to Understand

For readers without programming experience, you’ll need to learn about these basic programming design patterns from other sources first.

State Machine

An Agent has state transitions at every step (e.g., waiting for input → generating output → tool execution → back to generating).

Publish/Subscribe (Pub/Sub)

Core emits events, UI/logs subscribe to events.
Benefit: core logic doesn’t depend on specific interfaces.

Adapter

Wraps different model interfaces into a unified calling convention.

Strategy

Retry strategies, tool concurrency strategies, compression strategies are interchangeable.

Pipeline

Input preprocessing → model call → tool execution → post-processing is a pluggable chain.

Idempotency and Recoverability

Repeating the same operation should not produce catastrophic side effects; failure should be recoverable.

Case Study: PI Agent’s Design Philosophy and Architecture

The above covers “general Agent framework design.” Now let’s ground it in the recently popular minimalist framework PI Agent.

Let’s look at how this framework designs an Agent.

Design Philosophy

Minimal Core

Core only handles the loop, state, events, and tool orchestration.

Pluggable Periphery

Models, tools, retries, and context handling are all replaceable.

Process Over Outcome

First ensure the process is visible and controllable, then pursue “smart output.”

Session Over Request

Treat the Agent as a long-term session system, not a single API call.

Agent Core Logic Flowchart

flowchart TD START[Start a session turn] --> TURN[Open turn] TURN --> CALL[Call model and stream output] CALL --> CHECK{Tool call in output?} CHECK -- No --> STOPCHECK{Stop?} CHECK -- Yes --> TOOL[Execute tool batch] TOOL --> MERGE[Write tool results back to context] MERGE --> STOPCHECK STOPCHECK -- Stop --> END[End and emit end event] STOPCHECK -- Continue --> NEXT[Enter next turn] NEXT --> TURN

Agent Core Component Diagram

graph TD CORE[Agent Core] S[State Storage] L[Loop Turn Cycle] E[Events Emission] T[Tool Executor] M[Model Stream Call] Q[Queue: steer/followUp] CORE --> S CORE --> L L --> M L --> T L --> E L --> Q T --> E M --> E E --> S

The value of this structure:

The interaction layer only sees events, doesn’t touch core state.
Model replacement doesn’t change the loop skeleton.
Tool extension doesn’t break the core control flow.

Summary

Agent architecture isn’t “making the model smarter”—it’s “making an uncertain model work reliably inside a controllable system.”

You can remember it as this formula:

$$ \text{Usable Agent} = \text{Model Capability} \times \text{Engineering Control Capability} $$

Where engineering control capability mainly comes from:

Loop design
Protocol design
Event observability
Recoverability and extensibility

Judging by current trends, this will very likely be the foundational paradigm for the next generation of application software.

AI Agent Tool Comparison: Why MCP Is Just a Transitional Solution

Mon, 20 Apr 2026 12:37:00 +0800

If you’ve built AI Agents, you’ve likely used MCP and may have come across the concept of Agent Skills. We don’t need to re-explain what they are—the question this post answers is: when both can achieve “letting AI call tools,” why I think MCP is a transitional approach that will be phased out.

The Fundamental Limitation of MCP: The Protocol Layer Can’t Carry Semantics

MCP’s design logic is: give AI a structured tool-calling protocol where tool discovery, invocation, and parsing all follow a fixed format.

The problem is that this protocol is designed for humans, not for AI.

JSON Schema can define parameter types and return value structures, but it cannot convey things like: why this parameter usually takes this value, what prerequisites a tool needs to work, or what its failure modes look like. This context is what AI most needs when making decisions in real scenarios, but the MCP protocol layer simply cannot carry it.

The result: tool-calling capabilities built with MCP depend heavily on prompt engineering—you need to supplement in the prompt what the protocol definition doesn’t include. This shows the protocol layer has a gap, and that gap can’t be fixed by improving the schema, because it’s fundamentally a semantic-loss problem, not a format problem.

Another practical issue is maintenance cost. Every new capability requires a separate MCP server—you need to maintain server code, schema definitions, and network connections. When the protocol version updates, all servers may need to change too. For individual developers or small teams, this complexity is a significant burden.

Why Skill Is Closer to What AI Actually Needs

Agent Skill uses Markdown as the core format for capability packaging—using human language to describe what a capability is, when to use it, and how to use it, with scripts and reference templates attached.

When AI reads a Skill document, it gets more than “this tool’s name and accepted parameters”—it gets the full decision context: when to use it, when not to, how to handle edge cases. This information was always meant for human developers; now it goes directly to AI without secondary translation.

For engineers, using the file system as the foundation for Skill management brings an extra benefit: this workflow aligns perfectly with an engineer’s daily work. Git manages the Skill directory, naturally supporting version control, branching, and PR reviews. AI reads whatever documentation it needs, with no need to understand any protocol layer or maintain a running server process.

For ordinary users, the advantage of Skill is even more direct. Today’s Agents are getting more complete, and “harness engineering” has emerged—users don’t need to understand technical details, just describe what capabilities they need. Installing a Skill might be a one-liner: AI automatically reads the Skill document, understands what the capability does and how to configure it, then automatically handles dependency installation, API configuration, and permission verification—tasks that previously required a technical person. For users, a Skill is a manual for a capability, and the Agent is the executor who reads the manual and gets it done. MCP can’t do this because it requires users to first understand servers, schemas, and protocol versions—these are engineer language, not user language.

Skill’s semantic packaging makes this “zero-barrier installation” possible. When capabilities are described in human-language documents, the Agent can truly make those technical decisions on the user’s behalf. The thicker the protocol layer, the higher this delegation cost—and Skill compresses this layer to the minimum.

	MCP	Agent Skill
Adding a new tool	Write server + schema + config	Write a Markdown file
Semantic expressiveness	Limited by JSON Schema	Free-form Markdown
Context information	Needs prompt engineering supplement	Written in the doc, read directly by AI
Protocol version maintenance	Required	Not required
Installation for ordinary users	Need to understand server and protocol concepts	Agent reads doc, auto-configures
Caller configuration	Need to configure server connection	Read files directly

MCP’s Cloud Advantage Is a False Premise

One often-cited advantage of MCP is cloud deployment—the server runs independently and multiple Agents can share it.

This advantage is real, but it belongs to the “network call” capability category, not to the MCP protocol itself. Agent Skill can be built entirely on REST API calls, and scripts in Skill documents can call any HTTP endpoint. On cloud deployment, Skill doesn’t fall short.

For SaaS services that already have REST APIs, the comparison becomes even clearer:

With MCP: write a server that wraps the REST API, maintain the schema, keep in sync with MCP protocol versions
With Skill: write a Markdown document that clearly describes what the API does and how to call it, and AI reads it and uses it directly

MCP requires you to maintain an extra protocol system and server process, while Skill covers all these needs at lower cost. When a simpler solution can do all the same things, the complex one should step aside.

Skill’s Form Is Also Evolving

But I have to admit that Markdown + file system may not be the end state either.

This approach has an unsolved problem: the dynamism of Skill. When a Skill’s external API dependencies change, or when it needs real-time state, how do static documents in the file system keep up? The current solution is to rely on scripts and templates, but the script execution environment, security boundaries, and state management all lack standard answers.

Additionally, dependency relationships between Skills, priority ordering, and decision logic when multiple Skills apply to one request are all open questions.

My judgment: Skill will evolve, possibly no longer centered on static files in the file system, with some more dynamic mechanism for capability registration and discovery emerging. But that mechanism will most likely be a continuation of the Skill design philosophy, not a return to MCP’s protocol design direction.

Closing Thoughts

MCP isn’t a bad design. It took an important first step in AI tool calling, turning “AI connecting to the external world” from impossible to possible.

But there’s still distance between “possible” and “right.” When we discover a way to package capabilities that better fits how AI thinks, the transitional approach should exit the stage.

Is Agent Skill the final answer? I’m not sure. But it’s closer to what AI truly needs—semantics, context, flexibility, and a calling experience with no extra protocol layer. This packaging is friendly to engineers and even more friendly to ordinary users, because it hides technical complexity inside the document layer, letting the Agent handle things users shouldn’t have to think about.

This direction of exploration deserves to be taken seriously.

Finally, here’s my 2025 dissection (roast) of the MCP concept.

AI Agent + Product Manager = QA Test Engineer

Wed, 08 Oct 2025 20:39:15 +0800

In September, our company organized a discussion on applying AI in the workplace. I happened to be researching end-to-end testing at the time, so I tried OpenCode with Playwright, and the results were astonishingly good.

I chose OpenCode over other AI agent frameworks (such as Claude Code) because it can integrate with the company’s enterprise GitHub Copilot account, which means we can use models like GPT-4 and Claude Sonnet without limits on the corporate intranet.

Playwright, built by Microsoft, is an automation testing framework that can drive browser APIs. Compared to Selenium, it is lighter, the community is more actively maintained, and it pairs better with large language models (there is an official MCP server). Playwright also bundles a webdriver, sparing a lot of environment configuration.

With OpenCode and the Playwright MCP server, and a few well-crafted prompt templates, you can run a complete set of web UI end-to-end test cases without writing a single line of test code. That would have been unthinkable in the past.

I have long believed that asking programmers to write E2E test code is laborious and more harmful than helpful. For edge cases and performance, unit tests and API tests cover more than 90% of the needs. The real value of E2E testing is in catching issues in UI interaction and integration. Using automated E2E test code to cover integration and UI scenarios carries an extremely high maintenance cost — every tiny UI tweak can break the test code — and statistically, more than half of the failing test cases in a test suite are not caused by functional defects at all, but by UI load latency, renamed frontend variables, slow test environments, and so on. For the real corner cases that threaten the integrated environment — for example, request retries caused by network interruptions, or out-of-range parameters from interface changes — writing E2E tests is less efficient than unit tests and API tests. For these reasons I have always encouraged the team to hire a full-time test engineer rather than reserving part of every sprint for developers to maintain E2E tests.

On the other hand, as a project lead, I care more about whether requirements are truly understood and delivered, and how to verify what the engineers actually built.

The arrival of AI agents has changed the agile workflow. With a combination like OpenCode + Playwright MCP server, the AI only needs to read user documentation to pick up the basics of UI operations. It can then open a browser, follow the natural-language description of a test case, and click through page elements step by step to complete an entire business flow. With a bit of guidance it can also produce the exact steps it took, the results, the issues it hit, and a complete test report. This is not far from hiring a junior QA engineer.

Because the maintenance cost drops dramatically (you only maintain a Markdown file describing the test cases), a lot of detailed UI test scenarios that were previously impractical can now be covered by an AI agent. Most importantly, this work does not depend on engineers at all — product managers, POs, or BAs can write test cases directly in natural language, closing the loop between writing user stories and verifying features, and removing the ambiguity that comes from requirements being relayed between business, engineering, and QA.

The Toyota Production System lists several sources of waste in production:

Overproduction
Waiting
Unnecessary transport
Over-processing
Excess inventory
Unnecessary motion
Defects

AI agents address, to some extent, three of these wastes: “overproduction” (writing test code over and over), “waiting” (waiting from requirements to implementation to test cases before a feature can be verified), and “unnecessary transport” (business requirements being passed between different people).