Posts on Steve Sun

How I Use Hermes Agent to Write Code

Sun, 07 Jun 2026 10:00:00 +0800

I run two Hermes Agents in parallel.

One is called Super Juaner and handles daily conversation and information retrieval. The other is called Code Juaner and works exclusively on software engineering. Two independent Telegram bots, independent configurations, independent session databases.

Running two Agents separately is for context and environment isolation.

Core discipline: Code Juaner never writes code.

All coding is delegated to Codex CLI for execution. Code Juaner only does the Product Owner work: writing requirement definitions, making architecture decisions, and accepting deliverables. Codex is the implementer: reads the spec, writes code, runs tests.

Role	Output	Responsibility
Code Juaner (Hermes PO)	Feature doc + acceptance	Requirement definition, architecture decisions, quality control
Codex CLI (Implementer)	Code + tests	Technical solution, coding implementation, self-testing

The flow is simple. The user says “I need a feature.” Code Juaner writes a feature doc containing only the requirement description and verifiable acceptance criteria. Every acceptance criterion must be verifiable—“clicking changes the URL to /zh/monaco” rather than “navigation is correct.” Then Codex reads the doc, plan mode produces a technical spec, and build mode implements plus tests. Code Juaner accepts each criterion one by one; if all pass, deploy; if there are issues, summarize and send back.

Code Juaner never reads code files, never finishes reading code to tell Codex how to write. For project-level questions, delegate to Codex plan mode for investigation. Codex timeout means reporting back directly and waiting for next instructions.

The acceptance standard is that npm run build passing is only the minimum bar. For behavior changes, walk through the full user path in the browser before marking complete.

The Three-Layer Gate

Discipline sounds simple but is easy to forget in practice. The model drifts in long conversations—after thirty turns, it may think it can write code and start modifying files directly. I built three layers of defense.

Layer 1: System prompt (hardest, unbypassable)

Locked in config.yaml: all coding must be delegated to Codex CLI for execution, never write code yourself, report and wait after Codex fails. Injected every turn, can’t be avoided.

The system prompt’s lifecycle in Hermes is longer than SOUL.md; it doesn’t reset between conversations, so it better prevents forgetting after long conversations. Even if the model starts drifting after dozens of turns, this terminal instruction is still there.

Layer 2: SOUL.md + personality (loaded at session start)

SOUL.md defines personality and creed: terse, conclusion-first, type-safety > runtime correctness > performance optimization. But its effective range is the start of each conversation, weaker than the system prompt.

SOUL loads a development-workflow skill I co-developed with Super Juaner, which defines pre-checks when picking up code: must load the workflow skill first before making decisions, don’t skip. This file hosts my entire coding discipline as an initial activation guide.

Layer 3: Plugin gate (physical interception, unbypassable)

Two plugins intercept Code Juaner’s source code read/write. write-code-gate blocks write_file and patch on source files like .ts, .tsx, .py and returns refusal. read-code-gate blocks read_file and search_files on source files. Non-source files like .md, .yaml, .toml pass through.

The blocked extensions cover 30+ common languages. Exempt paths include .hermes/, node_modules/, .next/ and other non-project directories.

Plugins are loaded via Hermes’s pre_tool_call hook, take effect at session start, and remain available after gateway restarts.

The three layers increase in hardness from outside to inside; when rules conflict, the hardest layer wins. The system prompt is a sticky note; the plugin is a locked door.

These practices aren’t new. Most of them are decades of accumulated software engineering—separation of roles, clear responsibilities, acceptance-first—just wearing a new skin in the AI era and landing in a new way.

Dual-Track Workflow

Take different tracks based on task nature.

Track A: New project or feature from zero to one

Full pipeline: Feature Doc → Plan → Build → Verify → Fix (loop).

Phase one: Code Juaner writes the feature doc. Phase two: Codex plan produces the technical spec, including file list, implementation plan, data structures, and test cases. Phase three: Codex build reads the spec, implements all files, runs tests. Phase four: Code Juaner accepts each criterion.

Acceptance walks through five layers: data layer (type definitions, state management), logic layer (hooks, reducers), presentation layer (component rendering, interaction binding), config layer (static config, data loading), browser verification (route checks, localStorage confirmation, screenshot comparison).

Track B: Bug fixes or feature modifications

Most projects go through Track B. Skipping the feature doc only applies to single-file, pure styling, copy changes, or minor config tweaks. Multi-file changes, those involving data persistence, state machines, or new components must write a feature doc plus spec before dispatching Codex.

When delegating to Codex, give only the goal and acceptance criteria, not the implementation plan. Codex reads the code and designs by itself.

Preview Tunnel

In-progress websites need to be previewed on mobile. I use Cloudflare Tunnel to expose the local Next.js dev server to the public internet; Cloudflare assigns a temporary trycloudflare.com domain, open it in a mobile browser to preview.

Early on I stopped and restarted the tunnel after every change, which triggered Cloudflare rate limits. Each stop and restart assigned a new URL, requiring me to reopen it on the phone—annoying.

Later I wrote a tunnel-manager.sh script. The first start runs cloudflared as a background daemon; subsequent builds only restart the next server, not touching the tunnel. The tunnel daemon is reused across preview sessions; it only rebuilds on machine restart or cloudflared crash. The same URL persists through the entire development cycle, eliminating a lot of repetitive operations.

Git plus Vercel Deployment

Hit one silent pitfall: Vercel’s GitHub integration checks whether the commit author email is a real GitHub account email. When it doesn’t match, the CLI reports “Your deployment failed” and vercel inspect shows Builds: [0ms]—the build never started, no logs at all.

Fix: use the system git identity for commit, and rebase to reset the author if there’s a mismatch.

Deployment uses Vercel CLI with --no-wait to avoid timeout blocking. GitHub Actions’ deploy.yml triggers automatically on push to the main branch.

Some Reflections

After all those specific practices, let me share some thoughts.

AI writing code has developed too fast. At the start of the year I was still manually writing every line; by mid-year I already had a pipeline that can independently complete features, acceptance, and deployment. What I do has shifted from writing code to defining rules, setting boundaries, and accepting results.

The engineer’s role is changing. You used to be a bricklayer, placing bricks one by one. Now you’re a rancher: set up the fences, put out enough feed, and let the herd graze and grow. The software industry is shifting from construction to animal husbandry.

This trend will only accelerate. Better models mean simpler harnesses, cheaper execution costs. You don’t need to be the best programmer—you need to be the best rule-maker. Define clearly what’s allowed, what’s not allowed, and what counts as done—leave the rest to the agent.

The time I spend defining boundaries has a much higher return than the time I spend on actual coding. That’s the most surprising discovery.

How to Design an AI Agent

Tue, 26 May 2026 16:11:00 +0800

AI Agents will most likely be the paradigm for future AI software design, so for most developers and non-technical people just getting into vibe coding, understanding how they’re designed and the principles behind them will help you design next-generation application software more effectively.

This post tries to use plain language to help you understand what an AI Agent is, what problems it solves, and which protocols and tools will come into play as part of that infrastructure.

Target audience:

Vibe coders (rapid prototyping, build and iterate)
Programmers
Non-technical users just starting to code

First Principles: What Problems Does an Agent Framework Actually Solve?

Models are powerful, but unreliable

Large Language Models (LLMs) “guess”; they don’t “guarantee.” So you can’t treat them as deterministic programs (same input always produces same output).

Problems to solve:

How to bring unstable output into a controllable flow
How to know where the failure is when something breaks

Real-world task results are usually not simple answers, but complete workflow outputs

Real tasks typically involve:

Read information
Make decisions
Call tools
Continue deciding based on tool results
Ultimately produce documents, code, or other artifacts

This means an Agent’s design goal isn’t limited to “question and answer,” but is a “cyclic decision system.”

Users don’t want to wait until the end to see results

When interacting with AI, users typically want:

Visible process (streaming)
Ability to interrupt (abort the current task)
Ability to add instructions mid-task (steer, guide while executing)

So the system must natively support real-time interaction, not one-shot black-box execution.

Context keeps growing, costs keep growing

The longer the conversation, the larger the input, the slower the speed, the higher the cost, and it may even exceed limits. There must be a mechanism to “compress history while preserving key information.”

One core serving multiple interaction modes

The same Agent must run on:

Terminal UI (TUI)
Remote Procedure Call (RPC)
Future Web or App interfaces

So the “intelligent core” and “presentation layer” must be decoupled (independent, not bound together).

From Problems to Requirements, Then to Design

Requirements Checklist

A usable Agent framework must at minimum satisfy:

Looping: supports “think → call tool → think again”
Observable: every step is visible to UI or logging
Controllable: can pause, cancel, interrupt, resume
Recoverable: retry on failure, can continue from the last session
Extensible: add new tools, new models, new frontends
Governable: clear boundaries on cost, context, and permissions

End-to-End Flowchart

Going from problems to requirements, and requirements to design, we get the following flowchart:

flowchart TD A[User states goal] --> B[Agent understands current task] B --> C{Need a tool?} C -- No --> D[Give answer directly] C -- Yes --> E[Generate tool call request] E --> F[Execute tool] F --> G[Get tool result] G --> H{Result sufficient?} H -- No --> B H -- Yes --> D D --> I[Stream back to user] I --> J[User can interrupt/add requirements] J --> B

This diagram expresses:

An Agent is a closed-loop system, not a single function call.
“Tools” are capability amplifiers, not accessories.
The user is in the loop, not outside it.

Overall Architecture Diagram

graph LR subgraph Interaction Layer UI1[TUI/CLI] UI2[RPC/API] UI3[Web/App] end subgraph Runtime Layer SESSION[Session Orchestrator] POLICY[Policy Center: Retry/Compression/Budget] end subgraph Core Layer LOOP[Agent Decision Loop] STATE[State Management] EVENTS[Event Bus] end subgraph Capability Layer TOOLS[Tool System] MODEL[Model Adapter] MEMORY[Memory and Context Management] end UI1 --> SESSION UI2 --> SESSION UI3 --> SESSION SESSION --> LOOP SESSION --> POLICY LOOP <--> STATE LOOP --> EVENTS LOOP --> TOOLS LOOP --> MODEL POLICY <--> MEMORY MODEL --> LLM[External Model Service]

Component Diagram (Understanding “Who Owns What”)

flowchart LR USER[User] ORCH[Session Orchestrator] CORE[Agent Core] ADAPTER[Model Adapter] TOOLRUN[Tool Executor] OBS[Observation and Events] USER <--> ORCH ORCH <--> CORE CORE <--> ADAPTER CORE <--> TOOLRUN CORE --> OBS OBS --> ORCH

Responsibility split:

Session Orchestrator: handles user input, session state, retry and compression policies.
Agent Core: only does the “thinking loop” and “state advancement.”
Model Adapter: shields differences between model providers.
Tool Executor: uniformly executes local or remote tools.
Observation and Events: turns the process into visible signals for UI/log systems.

To Land These Designs, What Protocols and Foundational Patterns Are Required?

This section is the “minimum necessities” to complete the design above. We need to consider which engineering practices to introduce from a protocols and design-patterns standpoint. (Like building a skyscraper, you need to define the materials, the common engineering designs you can reuse, and how to make the structure mechanically stand the test of time.)

Most of these protocols are currently designed and implemented by developers on demand, but standards will likely emerge in the near future.

Required Protocols (Skipping Any Causes Loss of Control)

Message Protocol

Unifies how user messages, assistant messages, and tool results are described.

Event Protocol

Unifies how start, update, end, error, and tool execution status are described.
Purpose: lets UI and logs see the “process,” not just the “outcome.”

Tool Contract

Tool name, parameter structure (Schema), and execution return format must be fixed.

Streaming Contract

Supports incremental output (delta) to guarantee real-time user feedback.

Cancellation Contract

Any link in the chain should respond to abort signals, avoiding “can’t stop.”

Error Contract

Failures must be structured (machine-processable), not just string error messages.

Foundational Design Patterns to Understand

For readers without programming experience, you’ll need to learn about these basic programming design patterns from other sources first.

State Machine

An Agent has state transitions at every step (e.g., waiting for input → generating output → tool execution → back to generating).

Publish/Subscribe (Pub/Sub)

Core emits events, UI/logs subscribe to events.
Benefit: core logic doesn’t depend on specific interfaces.

Adapter

Wraps different model interfaces into a unified calling convention.

Strategy

Retry strategies, tool concurrency strategies, compression strategies are interchangeable.

Pipeline

Input preprocessing → model call → tool execution → post-processing is a pluggable chain.

Idempotency and Recoverability

Repeating the same operation should not produce catastrophic side effects; failure should be recoverable.

Case Study: PI Agent’s Design Philosophy and Architecture

The above covers “general Agent framework design.” Now let’s ground it in the recently popular minimalist framework PI Agent.

Let’s look at how this framework designs an Agent.

Design Philosophy

Minimal Core

Core only handles the loop, state, events, and tool orchestration.

Pluggable Periphery

Models, tools, retries, and context handling are all replaceable.

Process Over Outcome

First ensure the process is visible and controllable, then pursue “smart output.”

Session Over Request

Treat the Agent as a long-term session system, not a single API call.

Agent Core Logic Flowchart

flowchart TD START[Start a session turn] --> TURN[Open turn] TURN --> CALL[Call model and stream output] CALL --> CHECK{Tool call in output?} CHECK -- No --> STOPCHECK{Stop?} CHECK -- Yes --> TOOL[Execute tool batch] TOOL --> MERGE[Write tool results back to context] MERGE --> STOPCHECK STOPCHECK -- Stop --> END[End and emit end event] STOPCHECK -- Continue --> NEXT[Enter next turn] NEXT --> TURN

Agent Core Component Diagram

graph TD CORE[Agent Core] S[State Storage] L[Loop Turn Cycle] E[Events Emission] T[Tool Executor] M[Model Stream Call] Q[Queue: steer/followUp] CORE --> S CORE --> L L --> M L --> T L --> E L --> Q T --> E M --> E E --> S

The value of this structure:

The interaction layer only sees events, doesn’t touch core state.
Model replacement doesn’t change the loop skeleton.
Tool extension doesn’t break the core control flow.

Summary

Agent architecture isn’t “making the model smarter”—it’s “making an uncertain model work reliably inside a controllable system.”

You can remember it as this formula:

$$ \text{Usable Agent} = \text{Model Capability} \times \text{Engineering Control Capability} $$

Where engineering control capability mainly comes from:

Loop design
Protocol design
Event observability
Recoverability and extensibility

Judging by current trends, this will very likely be the foundational paradigm for the next generation of application software.

AI Agent Tool Comparison: Why MCP Is Just a Transitional Solution

Mon, 20 Apr 2026 12:37:00 +0800

If you’ve built AI Agents, you’ve likely used MCP and may have come across the concept of Agent Skills. We don’t need to re-explain what they are—the question this post answers is: when both can achieve “letting AI call tools,” why I think MCP is a transitional approach that will be phased out.

The Fundamental Limitation of MCP: The Protocol Layer Can’t Carry Semantics

MCP’s design logic is: give AI a structured tool-calling protocol where tool discovery, invocation, and parsing all follow a fixed format.

The problem is that this protocol is designed for humans, not for AI.

JSON Schema can define parameter types and return value structures, but it cannot convey things like: why this parameter usually takes this value, what prerequisites a tool needs to work, or what its failure modes look like. This context is what AI most needs when making decisions in real scenarios, but the MCP protocol layer simply cannot carry it.

The result: tool-calling capabilities built with MCP depend heavily on prompt engineering—you need to supplement in the prompt what the protocol definition doesn’t include. This shows the protocol layer has a gap, and that gap can’t be fixed by improving the schema, because it’s fundamentally a semantic-loss problem, not a format problem.

Another practical issue is maintenance cost. Every new capability requires a separate MCP server—you need to maintain server code, schema definitions, and network connections. When the protocol version updates, all servers may need to change too. For individual developers or small teams, this complexity is a significant burden.

Why Skill Is Closer to What AI Actually Needs

Agent Skill uses Markdown as the core format for capability packaging—using human language to describe what a capability is, when to use it, and how to use it, with scripts and reference templates attached.

When AI reads a Skill document, it gets more than “this tool’s name and accepted parameters”—it gets the full decision context: when to use it, when not to, how to handle edge cases. This information was always meant for human developers; now it goes directly to AI without secondary translation.

For engineers, using the file system as the foundation for Skill management brings an extra benefit: this workflow aligns perfectly with an engineer’s daily work. Git manages the Skill directory, naturally supporting version control, branching, and PR reviews. AI reads whatever documentation it needs, with no need to understand any protocol layer or maintain a running server process.

For ordinary users, the advantage of Skill is even more direct. Today’s Agents are getting more complete, and “harness engineering” has emerged—users don’t need to understand technical details, just describe what capabilities they need. Installing a Skill might be a one-liner: AI automatically reads the Skill document, understands what the capability does and how to configure it, then automatically handles dependency installation, API configuration, and permission verification—tasks that previously required a technical person. For users, a Skill is a manual for a capability, and the Agent is the executor who reads the manual and gets it done. MCP can’t do this because it requires users to first understand servers, schemas, and protocol versions—these are engineer language, not user language.

Skill’s semantic packaging makes this “zero-barrier installation” possible. When capabilities are described in human-language documents, the Agent can truly make those technical decisions on the user’s behalf. The thicker the protocol layer, the higher this delegation cost—and Skill compresses this layer to the minimum.

	MCP	Agent Skill
Adding a new tool	Write server + schema + config	Write a Markdown file
Semantic expressiveness	Limited by JSON Schema	Free-form Markdown
Context information	Needs prompt engineering supplement	Written in the doc, read directly by AI
Protocol version maintenance	Required	Not required
Installation for ordinary users	Need to understand server and protocol concepts	Agent reads doc, auto-configures
Caller configuration	Need to configure server connection	Read files directly

MCP’s Cloud Advantage Is a False Premise

One often-cited advantage of MCP is cloud deployment—the server runs independently and multiple Agents can share it.

This advantage is real, but it belongs to the “network call” capability category, not to the MCP protocol itself. Agent Skill can be built entirely on REST API calls, and scripts in Skill documents can call any HTTP endpoint. On cloud deployment, Skill doesn’t fall short.

For SaaS services that already have REST APIs, the comparison becomes even clearer:

With MCP: write a server that wraps the REST API, maintain the schema, keep in sync with MCP protocol versions
With Skill: write a Markdown document that clearly describes what the API does and how to call it, and AI reads it and uses it directly

MCP requires you to maintain an extra protocol system and server process, while Skill covers all these needs at lower cost. When a simpler solution can do all the same things, the complex one should step aside.

Skill’s Form Is Also Evolving

But I have to admit that Markdown + file system may not be the end state either.

This approach has an unsolved problem: the dynamism of Skill. When a Skill’s external API dependencies change, or when it needs real-time state, how do static documents in the file system keep up? The current solution is to rely on scripts and templates, but the script execution environment, security boundaries, and state management all lack standard answers.

Additionally, dependency relationships between Skills, priority ordering, and decision logic when multiple Skills apply to one request are all open questions.

My judgment: Skill will evolve, possibly no longer centered on static files in the file system, with some more dynamic mechanism for capability registration and discovery emerging. But that mechanism will most likely be a continuation of the Skill design philosophy, not a return to MCP’s protocol design direction.

Closing Thoughts

MCP isn’t a bad design. It took an important first step in AI tool calling, turning “AI connecting to the external world” from impossible to possible.

But there’s still distance between “possible” and “right.” When we discover a way to package capabilities that better fits how AI thinks, the transitional approach should exit the stage.

Is Agent Skill the final answer? I’m not sure. But it’s closer to what AI truly needs—semantics, context, flexibility, and a calling experience with no extra protocol layer. This packaging is friendly to engineers and even more friendly to ordinary users, because it hides technical complexity inside the document layer, letting the Agent handle things users shouldn’t have to think about.

This direction of exploration deserves to be taken seriously.

Finally, here’s my 2025 dissection (roast) of the MCP concept.

The Externalized J-Type

Mon, 03 Nov 2025 22:07:40 +0800

In life, there’s a kind of person who likes to urge others to make plans. As long as something is still up in the air, they feel uneasy.

These people aren’t necessarily company bosses, nor are they necessarily dominant in personality. They just enjoy asking “When can we decide this?” or “What’s the plan for the weekend—where are we going, really?"—seeking certainty from the outside world in that kind of way.

Let’s tentatively call these people the “Externalized J-Type” or “Ex-J.” A J-Type in MBTI is someone with a Judging personality, who likes to predict the future and loves making plans. The opposite is the P-Type, who is free-spirited and easygoing.

An Ex-J is someone who externalizes this J-trait onto others—a J-Type who assimilates other people into becoming J-Types themselves.

The Ex-J keeps certainty for themselves and passes anxiety out to the world.

From my observation, these people usually have a strong need for control. A need for control can grow inward—for example, keeping work and life neatly organized, full of a sense of order. It can also grow outward—for example, trying to manipulate others and demanding they stay in tight alignment with you.

Ex-Js are the outward-spilling type, but most of the time their need for control grows inward. Once their internal need for control gets too much, it inadvertently spills onto others. When their J-desires fall through and life and work can’t be neatly organized, they offload the negative emotions onto the person they see as the source of the uncertainty, and a sense of pressure radiates out.

Recently, in both my work and personal life, I’ve run into two such people. After reflecting, I also discovered through my own self-audit mechanism (the I-Type’s inner search engine) that I’ve inadvertently become an Ex-J myself from time to time.

How do you avoid becoming an Ex-J?

I think the first principle is: don’t keep asking the same question. Working at a foreign company, when I communicate with European colleagues, I notice a big difference from Chinese colleagues: beyond “yes” and “no,” they have a third response—“no response.”

If they haven’t figured it out, they have the right not to respond. Not responding is itself a response. It means “I haven’t decided, let me think about it and get back to you.” But in our language culture, not responding comes across as impolite, lacking confidence…

You can see it from our exam-oriented education system too—we mass-produce J-Types, and some of them gradually degenerate into more extreme Ex-Js, drifting toward the dark side of harming society (just kidding, it’s not actually that serious).

So, when someone doesn’t respond, treat it as another form of response. Then you won’t be an Ex-J.

How do you communicate with an Ex-J?

I don’t know the answer either, so I humbly went to Chat for advice and asked GPT. The answer is: it takes two steps. First, soothe the other person’s emotions and express understanding. Then use vague steps in place of precise conclusions—for example, if you can’t make a plan, just say you can’t decide right now, but if X happens, here’s how we can respond. Let them see your thought process.

This way, by using rough steps or fuzzy time points instead of exact answers, you ease their need for control without letting them pull you into their pace.

Sigh, being a person is exhausting. Finally, I hope you become a P-Type. And if you really can’t, please promise me—at least be a good J-Type, okay?

Omarchy: Some Setup Tweaks for a Chinese-Language Environment

Wed, 22 Oct 2025 23:21:21 +0800

I recently installed DHH’s Omarchy (an Arch Linux distribution based on the Hyprland desktop environment) on my home computer.

After installation, there were a few configuration tweaks I needed to make. I’m recording them in this post for reference.

4K Monitor Settings

Modify the system menu Setup - Monitor, and set the parameters based on your own display’s resolution, following the comments in the configuration file. For example, my 27-inch 4K display:

env = GDK_SCALE,1.75
monitor=,preferred,auto,1.875

Add additional scaling settings for QT applications:

env = QT_AUTO_SCREEN_SCALE_FACTOR,1
env = QT_SCALE_FACTOR,1.75

Chinese Input Method

Refer to Fcitx Best Configuration Practices. The section titled “Installing emacs-rime” in that article can be skipped if you don’t use emacs.

Terminal Font Settings

The default system fonts aren’t very friendly to Chinese in the terminal. I like the Maple Mono font, which you can install via the AUR package maple-mono-nf-cn. Then modify the font settings in ~/.config/alacritty/alacritty.toml:

[font]
normal = { family = "Maple Mono NF CN" }
bold = { family = "Maple Mono NF CN" }
italic = { family = "Maple Mono NF CN" }
size = 12

Disable NumLock by Default

Omarchy enables NumLock on the numpad by default after installation. You can change this in the system menu Setup - Input:

numlock_by_default = false

A Few Neovim Settings

Neovim uses Lazyvim by default. I added a few lines to ~/.config/nvim/lua/config/options.lua—feel free to pick what you need based on the comments:

-- Fix the issue of Chinese characters showing underlines in the terminal
vim.opt.spelllang = { "en", "cjk" }

-- Disable syntax checking in markdown files
vim.api.nvim_create_autocmd("FileType", {
 pattern = "markdown",
 callback = function()
 vim.diagnostic.enable(false)
 end,
})

Then create ~/.config/nvim/lua/plugins/flush.lua with the following content to restore Vim’s default behavior for the s key in normal mode:

return {
 {
 "folke/flash.nvim",
 keys = {
 { "s", mode = { "n", "x", "o" }, false },
 },
 },
}

AI Agent + Product Manager = QA Test Engineer

Wed, 08 Oct 2025 20:39:15 +0800

In September, our company organized a discussion on applying AI in the workplace. I happened to be researching end-to-end testing at the time, so I tried OpenCode with Playwright, and the results were astonishingly good.

I chose OpenCode over other AI agent frameworks (such as Claude Code) because it can integrate with the company’s enterprise GitHub Copilot account, which means we can use models like GPT-4 and Claude Sonnet without limits on the corporate intranet.

Playwright, built by Microsoft, is an automation testing framework that can drive browser APIs. Compared to Selenium, it is lighter, the community is more actively maintained, and it pairs better with large language models (there is an official MCP server). Playwright also bundles a webdriver, sparing a lot of environment configuration.

With OpenCode and the Playwright MCP server, and a few well-crafted prompt templates, you can run a complete set of web UI end-to-end test cases without writing a single line of test code. That would have been unthinkable in the past.

I have long believed that asking programmers to write E2E test code is laborious and more harmful than helpful. For edge cases and performance, unit tests and API tests cover more than 90% of the needs. The real value of E2E testing is in catching issues in UI interaction and integration. Using automated E2E test code to cover integration and UI scenarios carries an extremely high maintenance cost — every tiny UI tweak can break the test code — and statistically, more than half of the failing test cases in a test suite are not caused by functional defects at all, but by UI load latency, renamed frontend variables, slow test environments, and so on. For the real corner cases that threaten the integrated environment — for example, request retries caused by network interruptions, or out-of-range parameters from interface changes — writing E2E tests is less efficient than unit tests and API tests. For these reasons I have always encouraged the team to hire a full-time test engineer rather than reserving part of every sprint for developers to maintain E2E tests.

On the other hand, as a project lead, I care more about whether requirements are truly understood and delivered, and how to verify what the engineers actually built.

The arrival of AI agents has changed the agile workflow. With a combination like OpenCode + Playwright MCP server, the AI only needs to read user documentation to pick up the basics of UI operations. It can then open a browser, follow the natural-language description of a test case, and click through page elements step by step to complete an entire business flow. With a bit of guidance it can also produce the exact steps it took, the results, the issues it hit, and a complete test report. This is not far from hiring a junior QA engineer.

Because the maintenance cost drops dramatically (you only maintain a Markdown file describing the test cases), a lot of detailed UI test scenarios that were previously impractical can now be covered by an AI agent. Most importantly, this work does not depend on engineers at all — product managers, POs, or BAs can write test cases directly in natural language, closing the loop between writing user stories and verifying features, and removing the ambiguity that comes from requirements being relayed between business, engineering, and QA.

The Toyota Production System lists several sources of waste in production:

Overproduction
Waiting
Unnecessary transport
Over-processing
Excess inventory
Unnecessary motion
Defects

AI agents address, to some extent, three of these wastes: “overproduction” (writing test code over and over), “waiting” (waiting from requirements to implementation to test cases before a feature can be verified), and “unnecessary transport” (business requirements being passed between different people).

What to Do When Your Linux System Hits a Kernel Panic

Sat, 19 Jul 2025 01:06:32 +0800

When I turned on my Beelink Ser6 at home one evening, I was greeted with a Kernel Panic 😱.

A lot of people panic at this point, but there’s really no need. All you need is a LiveUSB boot stick to fix it.

However, my Ubuntu install had been running stably for over a year and I didn’t have a LiveUSB lying around at home. Helplessly, I dug out an old computer that had been gathering dust for years, and spent half an hour guessing the login password before finally getting in… then downloaded the Ubuntu ISO and made a LiveUSB.

Below is the troubleshooting process I used after booting from the LiveUSB and entering Try Ubuntu’s Terminal, for your reference.

1. Find the Root Partition and EFI Partition

lsblk -f

This returns something like the following, where the vfat format is the EFI partition and ext4 is the system root partition:

NAME FSTYPE LABEL UUID MOUNTPOINT
nvme0n1 
├─nvme0n1p1 vfat 1234-5678 /boot/efi
└─nvme0n1p2 ext4 955b06a9-983d-4e04-b2ef-60b559db46e6

2. Use `fsck` to Repair Partition Errors

Note: in this step and beyond, the partition path needs to be replaced with the one from your system found in the previous step.

First, repair the root partition:

sudo fsck -f /dev/nvme0n1p2

When prompted, enter y to allow, or a to allow all. I discovered several errors at this step and successfully fixed them.

Next, check and repair the EFI partition:

sudo fsck -f /dev/nvme0n1p1

At this step I got the following prompt:

there are different between boot sector and it's backup：
1) Copy original to backup
2) Copy backup to original
3) No action

Based on what I found online, if the system can boot into GRUB normally, the original sector is good, so I chose 1) Copy original to backup to copy the original boot sector to the backup sector.

3. Mount the Original System and Rebuild initramfs

In this step, mount the original system’s root partition into the current LiveUSB system. To run the necessary commands, also bind-mount four key directories from the LiveUSB system onto the original system.

sudo mkdir -p /mnt/ubuntu
sudo mount /dev/nvme0n1p2 /mnt/ubuntu


sudo mount --bind /dev /mnt/ubuntu/dev
sudo mount --bind /proc /mnt/ubuntu/proc
sudo mount --bind /sys /mnt/ubuntu/sys
sudo mount --bind /run /mnt/ubuntu/run

sudo mount /dev/nvme0n1p1 /mnt/ubuntu/boot/efi

After this, you can switch into the original system’s root shell:

sudo chroot /mnt/ubuntu

Then install GRUB and regenerate the initramfs boot image for the system kernel. Adjust the arguments to grub-install to match your system.

grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
update-initramfs -c -k all
update-grub

Finally, exit the original system’s root shell and reboot:

exit

sudo reboot

Pull out the USB stick and boot into the original system—I was able to log in normally at this point.

Summary

Don’t panic.
Keep a LiveUSB at home.
Use the fsck command to repair partition errors.
Use mount to bind the necessary filesystems onto the original system, enter it, and rebuild initramfs.

How to Get Along with the 'Old Guard'

Sun, 13 Jul 2025 12:40:24 +0800

Thou Shalt Not

Every day, we inevitably have to deal with some “old guard”—people with a bit more seniority (not necessarily referring to age). Most of us, before retirement, have to live in a world where the rules of the game are set by the old guard.

How do you navigate such a world? Here’s one reference, not advice—please don’t imitate.

In 1934, the Motion Picture Producers and Distributors of America introduced the most stringent set of film production rules in history—the Hays Code. The code was intended to raise the moral standards of audiences.

“Prohibit nudity, suggestive content, and lustful kisses on screen. Prohibit depictions of religion, drugs, interracial romance, and revenge plots.”

Just as intended, the code promoted the development of the Hollywood film industry… in a different form.

Before long, Hollywood directors discovered that these strict prohibitions could help them sell tickets, as long as they cleverly evaded the letter of the rules.

The facts proved that audiences did, in fact, still love watching naughty things. The young guns of Hollywood, caught between compliance and defiance, chose to skirt the edges.

Observe its letter and violate its spirit as much as possible.

Comply with the rules in form, and deviate from their spirit as much as possible.

This is a philosophy of compromise with reality, yet not entirely a compromise.

Paramount photographer Whitey Schafer shot a large satirical photograph (the header image for this post) titled “Thou Shalt Not,” depicting “The Ten Things a Producer Must Absolutely Never Do”—and then depicted all ten things within the frame.

Law Defeated
Inside of Thigh
Lace Lingerie
Dead Man
Narcotics
Drinking
Exposed Bosom
Gambling
Pointing Gun
Tommy Gun

Years later, Hollywood scrapped these regulations, replacing them with a more relaxed film rating system. The photo became a classic work of art satirizing that era.

Returning to the original question: in a world where the old guard sets the rules, how do you take care of yourself?

The answer is: play an infinite game (?).

Finite games are played within boundaries. Infinite games are played with boundaries.

How AI Coding Tools Like Cursor Work Under the Hood

Mon, 02 Jun 2025 07:58:17 +0800

In my previous post, How DeepWiki Works, I shared one possible way DeepWiki is implemented. I left a question there: how does DeepWiki chunk a source code repository?

The answer is AST chunking.

In this post I want to analyze how two software development aids — Cursor and Cline — implement “code indexing.” In fact, they are not fundamentally different from DeepWiki; all of them use AST chunking.

AST

An Abstract Syntax Tree (AST) is a tree representation of source code that reflects the code’s syntactic structure. When chunking code, ASTs help us better understand the semantic boundaries of the code.

ASTs are widely used in compilers and source code analysis tools. For example, in the frontend world, Babel and the TypeScript compiler (TSC) use ASTs to transform ES6 or TypeScript code into JavaScript that browsers can run.

Below is a simple example showing how an AST converts TypeScript code into a tree structure. Suppose we have this TypeScript function:

function greet(name: string) {
 return "Hello, " + name;
}

After being processed by an AST tool, it is abstracted into the following tree:

SourceFile
- FunctionDeclaration
  - Identifier: “greet”
  - Parameter
    - Identifier: “name”
  - Block
    - ReturnStatement
      - BinaryExpression
        
        StringLiteral: “Hello, "
        
        Identifier: “name”

A compiler can then walk this tree and translate it node by node into JavaScript code.

Once you understand ASTs, you roughly understand how DeepWiki — and even code editors like Cursor — build code indexes.

Cursor

In Cursor’s official documentation, you can find a description of how it indexes user code.

Cursor scans the user’s repository, computes file hashes, and builds a Merkle tree. Similar to the way Git compares file diffs, Cursor uses the Merkle tree to detect file changes in the user’s workspace and incrementally uploads modified files to Cursor’s servers.

Uploaded files are then chunked and embedded, and stored in a Turbopuffer database. This is the process of building a RAG over the source code.

The chunking step uses an AST tool to structure the code into a syntax tree, then cuts the serialized tree nodes into small chunks, and finally embeds them as vectors for storage.

Turbopuffer does not only store the vectorized code; it also stores metadata such as the line numbers and source file paths of the code segments.

When Cursor tries to autocomplete user code or generate new code from context, it queries the Turbopuffer database, finds the vectors with the highest similarity, and gets the file path and line numbers for that segment. Cursor then reads the corresponding source code from the user’s repository and puts it into the LLM’s system context. Finally, the LLM returns the newly generated code to Cursor.

A user on X put together this flow diagram:

Cline

Cline’s official blog offers a glimpse of how it is implemented.

Cline is an AI agent that helps with coding. Cline does not upload code and build a RAG; instead, it takes a safer and more reliable approach to managing the user’s repository.

Here is the developers’ description of how Cline works:

When you point Cline at a codebase, it doesn’t immediately try to read every file. Instead, it begins by understanding the architecture. Using Abstract Syntax Trees (ASTs), Cline extracts a high-level map of your code – the classes, functions, methods, and their relationships. This happens through our list_code_definition_names tool, which provides structural understanding without requiring full implementation details.

Cline uses its list_code_definition_names tool to convert source code into an AST. Cline treats that AST as a “map” of the entire codebase.

When Cline runs a task automatically, it analyzes the file that needs to be modified, builds an AST for that file, and converts the AST into natural-language context (similar to how DeepWiki turns code into documents). It feeds this context to the LLM, letting the LLM decide whether to modify the file or look at another file to gather more context.

If Cursor compares similarity between vector-space code snippets, Cline converts code snippets into natural-language descriptions and lets the LLM, through semantic understanding, hunt for clues across the repository and compare the semantic similarity of code segments.

Cline’s approach is clearly safer — enterprise users don’t have to worry about Cline abusing the source code. The side effect, however, is higher token consumption. Constantly fetching context across files also takes more time. In some edge cases, Cline may even bounce back and forth between two files, falling into a loop.

In my own experience, Cline performs better than Cursor’s Agent mode on certain models (Deepseek-r1, OpenAI-4o), because Cline’s semantic understanding makes better use of these models’ natural-language abilities than vector similarity does.

For programming-optimized Claude Sonnet, though, there is no significant difference, so users need to choose between higher security and faster response time.

Summary

This post mainly covered how code editors use Abstract Syntax Trees (ASTs) to build code indexes and implement code completion.

In general, ASTs are an important tool for understanding the syntactic structure of code, and different implementations have their own trade-offs.

How DeepWiki Works

Sat, 24 May 2025 12:50:40 +0800

DeepWiki is an AI agent project, provided by Devin.ai, that generates detailed documentation from a source code repository. Ever since it went viral, I have been curious about how it works.

I combed through online resources and several open-source projects and arrived at a relatively clear picture of the workflow. For the harder parts, I will follow up with my findings in later posts.

Building a Map of the Code Structure

At its core, DeepWiki is a RAG system. It takes a source code repository as input, parses the code, and converts it into two parts: metadata representing the syntactic structure and file structure and vector data representing code descriptions and snippets. The metadata is stored in a relational database, while the corresponding code snippets are stored in a vector database for later LLM retrieval.

Generating WIKI Pages

The process of generating a WIKI page is essentially a RAG query:

The program recursively reads the project structure.
It queries the metadata database for the current file’s metadata, then searches the vector database for the most relevant code and description IDs.
It uses those IDs to look up the descriptions in the metadata database, and the corresponding code snippets in the project files.
It assembles all of the above as context, picks an appropriate prompt based on the metadata type (architecture, components, etc.), and feeds it to the LLM.
A front-end rendering engine then renders the LLM output into a documentation page.
Repeat from step 1.

Image from https://www.gptsecurity.info/2024/05/26/RAG/

Difficulty 1: The Chunking Strategy

A particularly interesting part of the process above is how to chunk code before embedding. For natural language, chunking is usually based on paragraphs, sentences, and punctuation, so each chunk contains the full context of a sentence or paragraph.

For code, it is different. A function body, for example, is wrapped in { and }. If you tokenize it with a natural-language tokenizer, the context will be split across different chunks, which hurts the accuracy of vector retrieval.

There are currently two approaches. The first is to chunk the whole file. In that case, the file size cannot exceed the chunk-size limit, and the chunks lack the real call-relationship context. We know that the unit of code organization is not the file (the file tree is just a human-friendly organization) — it is a graph of class- and function-level dependencies.

The second approach is to first use a syntax tool to perform static analysis on the code file, and then split the code along the syntactic structure based on the analysis. This is more complex to implement, and I could not find much material on it online. Fortunately, I came across RAG for a Codebase with 10k Repos, which describes how to use static syntax analysis to chunk code and build an efficient RAG system for a code repository. The article does not provide an open-source implementation, though. Considering that this is a core technology of a commercial product, it is well worth digging deeper into. I will keep following this area of research.

Difficulty 2: Parsing the Syntax Structure

Parsing metadata is somewhat simpler than vector data. I found some clues in another open-source project, Repo Graph.

That project uses tree-sitter to analyze the project’s syntax structure and produces three types of metadata files:

tag.json: basic information such as the path, line number, and description of a file, function, or class.
tree_structure.json: the project’s file tree structure.
*.pkl: a graph of object dependencies.

*.pkl is a graph of object relations that the syntax analyzer obtains by scanning the project’s files, then serializes the Python graph object to disk using the pickle library.

From this implementation, it looks like the embedding process in Difficulty 1 could also use the code metadata generated by tree-sitter to chunk the code by line.

Prompt Engineering

In the RAG query phase, you need to assemble different prompts based on the type of metadata being processed.

The Agent as a Judge project has plenty of prompts worth referencing:

Prompt for generating an overview:

Provide a concise overview of this repository focused primarily on:
* Purpose and Scope: What is this project's main purpose?
* Core Features: What are the key features and capabilities?
* Target audience/users
* Main technologies or frameworks used

Prompt for generating an architecture document:

Create a comprehensive architecture overview for this repository. Include:
* A high-level description of the system architecture
* Main components and their roles
* Data flow between components
* External dependencies and integrations

Prompt for generating a components document:

Provide a comprehensive analysis of all key components in this codebase. For each component:
* Name of the component
* Purpose and main responsibility
* How it interacts with other components
* Design patterns or techniques used
* Key characteristics
* File paths that implement this component

For the rest, please refer to the project files; I won’t enumerate them all here.

Summary

DeepWiki is a code documentation generation tool built on a RAG system. It works through the following steps:

Perform syntactic analysis on the repository to produce metadata and vector data.
Query that data through the RAG system to generate documentation.
Render the results into readable documentation pages with a front-end engine.

There are two main difficulties in implementation:

The code chunking strategy: it must consider the syntactic structure of the code, not just split it the way you would split natural language.
Parsing the syntax structure: tools like tree-sitter can be used to parse the code’s structure.

Although there are some open-source projects to reference, the core chunking strategy implementation still needs to be studied in depth.

Posts on Steve Sun

How I Use Hermes Agent to Write Code

The Three-Layer Gate

Dual-Track Workflow

Preview Tunnel

Git plus Vercel Deployment

Some Reflections

How to Design an AI Agent

First Principles: What Problems Does an Agent Framework Actually Solve?

Models are powerful, but unreliable

Real-world task results are usually not simple answers, but complete workflow outputs

Users don’t want to wait until the end to see results

Context keeps growing, costs keep growing

One core serving multiple interaction modes

From Problems to Requirements, Then to Design

Requirements Checklist

End-to-End Flowchart

Overall Architecture Diagram

Component Diagram (Understanding “Who Owns What”)

To Land These Designs, What Protocols and Foundational Patterns Are Required?

Required Protocols (Skipping Any Causes Loss of Control)

Foundational Design Patterns to Understand

Case Study: PI Agent’s Design Philosophy and Architecture

Design Philosophy

Agent Core Logic Flowchart

Agent Core Component Diagram

Summary

AI Agent Tool Comparison: Why MCP Is Just a Transitional Solution

The Fundamental Limitation of MCP: The Protocol Layer Can’t Carry Semantics

Why Skill Is Closer to What AI Actually Needs

MCP’s Cloud Advantage Is a False Premise

Skill’s Form Is Also Evolving

Closing Thoughts

The Externalized J-Type

Omarchy: Some Setup Tweaks for a Chinese-Language Environment

4K Monitor Settings

Chinese Input Method

Terminal Font Settings

Disable NumLock by Default

A Few Neovim Settings

AI Agent + Product Manager = QA Test Engineer

What to Do When Your Linux System Hits a Kernel Panic

1. Find the Root Partition and EFI Partition

2. Use fsck to Repair Partition Errors

3. Mount the Original System and Rebuild initramfs

Summary

How to Get Along with the 'Old Guard'

How AI Coding Tools Like Cursor Work Under the Hood

AST

Cursor

Cline

Summary

Further Reading

How DeepWiki Works

Building a Map of the Code Structure

Generating WIKI Pages

Difficulty 1: The Chunking Strategy

Difficulty 2: Parsing the Syntax Structure

Prompt Engineering

Summary

Reference Projects

2. Use `fsck` to Repair Partition Errors