AI on Steve Sun

How I Use Hermes Agent to Write Code

Sun, 07 Jun 2026 10:00:00 +0800

I run two Hermes Agents in parallel.

One is called Super Juaner and handles daily conversation and information retrieval. The other is called Code Juaner and works exclusively on software engineering. Two independent Telegram bots, independent configurations, independent session databases.

Running two Agents separately is for context and environment isolation.

Core discipline: Code Juaner never writes code.

All coding is delegated to Codex CLI for execution. Code Juaner only does the Product Owner work: writing requirement definitions, making architecture decisions, and accepting deliverables. Codex is the implementer: reads the spec, writes code, runs tests.

Role	Output	Responsibility
Code Juaner (Hermes PO)	Feature doc + acceptance	Requirement definition, architecture decisions, quality control
Codex CLI (Implementer)	Code + tests	Technical solution, coding implementation, self-testing

The flow is simple. The user says “I need a feature.” Code Juaner writes a feature doc containing only the requirement description and verifiable acceptance criteria. Every acceptance criterion must be verifiable—“clicking changes the URL to /zh/monaco” rather than “navigation is correct.” Then Codex reads the doc, plan mode produces a technical spec, and build mode implements plus tests. Code Juaner accepts each criterion one by one; if all pass, deploy; if there are issues, summarize and send back.
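To make the hand-off concrete, here is a minimal sketch of a feature doc as data, plus the decision Code Juaner makes after an acceptance pass. The types `FeatureDoc`, `AcceptanceCriterion`, and `NextStep`, the function `nextStep`, and the sample requirement are all hypothetical; neither Hermes nor Codex prescribes this format.

```typescript
// Hypothetical types: neither Hermes nor Codex defines this format.
interface AcceptanceCriterion {
  id: string;
  // A concrete, checkable statement, e.g. an exact URL after a click.
  statement: string;
  passed: boolean | null; // null means "not checked yet"
}

interface FeatureDoc {
  requirement: string;
  criteria: AcceptanceCriterion[];
}

type NextStep =
  | { action: "deploy" }
  | { action: "keep-verifying"; pending: string[] }
  | { action: "send-back"; failed: string[] };

// Decide what the Product Owner does after checking the criteria.
function nextStep(featureDoc: FeatureDoc): NextStep {
  const failed = featureDoc.criteria.filter((c) => c.passed === false).map((c) => c.id);
  if (failed.length > 0) return { action: "send-back", failed };
  const pending = featureDoc.criteria.filter((c) => c.passed === null).map((c) => c.id);
  if (pending.length > 0) return { action: "keep-verifying", pending };
  return { action: "deploy" };
}

// Made-up sample data for illustration.
const exampleDoc: FeatureDoc = {
  requirement: "Language switcher on the city page",
  criteria: [
    { id: "AC1", statement: "Clicking 'ZH' changes the URL to /zh/monaco", passed: true },
    { id: "AC2", statement: "Reloading /zh/monaco keeps the Chinese locale", passed: false },
  ],
};
const exampleStep = nextStep(exampleDoc);
```

Here `exampleStep` is a `send-back` carrying `AC2`, which is the summary that goes back to Codex.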

Code Juaner never reads code files, and never reads code first in order to tell Codex how to write it. Project-level questions are delegated to Codex plan mode for investigation. If Codex times out, Code Juaner reports back directly and waits for further instructions.

A passing npm run build is only the minimum bar for acceptance. For behavior changes, walk through the full user path in the browser before marking the work complete.

The Three-Layer Gate

Discipline sounds simple but is easy to forget in practice. The model drifts in long conversations—after thirty turns, it may think it can write code and start modifying files directly. I built three layers of defense.

Layer 1: System prompt (hardest, unbypassable)

Locked in config.yaml: all coding must be delegated to Codex CLI for execution, never write code yourself, report and wait after Codex fails. Injected every turn, can’t be avoided.

The system prompt’s lifecycle in Hermes is longer than SOUL.md; it doesn’t reset between conversations, so it better prevents forgetting after long conversations. Even if the model starts drifting after dozens of turns, this terminal instruction is still there.

Layer 2: SOUL.md + personality (loaded at session start)

SOUL.md defines personality and creed: terse, conclusion-first, type-safety > runtime correctness > performance optimization. But its effective range is the start of each conversation, weaker than the system prompt.

SOUL.md loads a development-workflow skill I co-developed with Super Juaner. It defines the pre-checks for picking up coding work: the workflow skill must be loaded before any decision is made, with no skipping. This skill file holds my entire coding discipline and serves as the initial activation guide.

Layer 3: Plugin gate (physical interception, unbypassable)

Two plugins intercept Code Juaner’s source code read/write. write-code-gate blocks write_file and patch on source files like .ts, .tsx, .py and returns refusal. read-code-gate blocks read_file and search_files on source files. Non-source files like .md, .yaml, .toml pass through.

The blocked extensions cover 30+ common languages. Exempt paths include .hermes/, node_modules/, .next/ and other non-project directories.
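The decision logic of the two gates can be sketched as one function. This is a minimal sketch under stated assumptions, not the actual plugin code: the `ToolCall` and `GateResult` shapes and the `gate` function are hypothetical, the extension list is a three-item subset of the real one, and Hermes's real hook signature is not shown in this post.

```typescript
// Hypothetical sketch of the gate's decision logic; the real plugins and
// Hermes's pre_tool_call hook signature may look quite different.
type ToolCall = { tool: string; path: string };
type GateResult = { allowed: true } | { allowed: false; reason: string };

// A small subset; the real list covers 30+ languages.
const SOURCE_EXTENSIONS = new Set([".ts", ".tsx", ".py"]);
const EXEMPT_SEGMENTS = new Set([".hermes", "node_modules", ".next"]);
const WRITE_TOOLS = new Set(["write_file", "patch"]);
const READ_TOOLS = new Set(["read_file", "search_files"]);

// Return the lowercase extension of the last path segment, or "" if none.
function extensionOf(path: string): string {
  const fileName = path.split("/").pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot).toLowerCase() : "";
}

// A path is gated when it is a source file outside every exempt directory.
function isGatedSource(path: string): boolean {
  const segments = path.split("/");
  if (segments.some((segment) => EXEMPT_SEGMENTS.has(segment))) return false;
  return SOURCE_EXTENSIONS.has(extensionOf(path));
}

function gate(call: ToolCall): GateResult {
  const gatedTool = WRITE_TOOLS.has(call.tool) || READ_TOOLS.has(call.tool);
  if (gatedTool && isGatedSource(call.path)) {
    return {
      allowed: false,
      reason: `${call.tool} on source files must be delegated to Codex CLI`,
    };
  }
  return { allowed: true };
}
```

With this sketch, `gate({ tool: "write_file", path: "src/app/page.tsx" })` is refused, while the same call on `docs/feature.md` or on anything under `node_modules/` passes through.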

Plugins are loaded via Hermes’s pre_tool_call hook, take effect at session start, and remain available after gateway restarts.

The three layers increase in hardness from outside to inside; when rules conflict, the hardest layer wins. The system prompt is a sticky note; the plugin is a locked door.

These practices aren’t new. Most of them are decades of accumulated software engineering—separation of roles, clear responsibilities, acceptance-first—just wearing a new skin in the AI era and landing in a new way.

Dual-Track Workflow

Take different tracks based on task nature.

Track A: New project or feature from zero to one

Full pipeline: Feature Doc → Plan → Build → Verify → Fix (loop).

Phase one: Code Juaner writes the feature doc. Phase two: Codex plan produces the technical spec, including file list, implementation plan, data structures, and test cases. Phase three: Codex build reads the spec, implements all files, runs tests. Phase four: Code Juaner accepts each criterion.

Acceptance walks through five layers: data layer (type definitions, state management), logic layer (hooks, reducers), presentation layer (component rendering, interaction binding), config layer (static config, data loading), browser verification (route checks, localStorage confirmation, screenshot comparison).

Track B: Bug fixes or feature modifications

Most projects go through Track B. Skipping the feature doc is allowed only for single-file changes, pure styling, copy changes, or minor config tweaks. Changes that span multiple files or involve data persistence, state machines, or new components need a feature doc plus spec before Codex is dispatched.

When delegating to Codex, give only the goal and acceptance criteria, not the implementation plan. Codex reads the code and designs by itself.

Preview Tunnel

In-progress websites need to be previewed on mobile. I use Cloudflare Tunnel to expose the local Next.js dev server to the public internet. Cloudflare assigns a temporary trycloudflare.com domain, which I open in a mobile browser to preview.

Early on I stopped and restarted the tunnel after every change, which triggered Cloudflare rate limits. Each stop and restart assigned a new URL, requiring me to reopen it on the phone—annoying.

Later I wrote a tunnel-manager.sh script. The first start runs cloudflared as a background daemon; subsequent builds only restart the next server, not touching the tunnel. The tunnel daemon is reused across preview sessions; it only rebuilds on machine restart or cloudflared crash. The same URL persists through the entire development cycle, eliminating a lot of repetitive operations.

Git plus Vercel Deployment

Hit one silent pitfall: Vercel’s GitHub integration checks whether the commit author email is a real GitHub account email. When it doesn’t match, the CLI reports “Your deployment failed” and vercel inspect shows Builds: [0ms]—the build never started, no logs at all.

Fix: use the system git identity for commit, and rebase to reset the author if there’s a mismatch.

Deployment uses Vercel CLI with --no-wait to avoid timeout blocking. GitHub Actions’ deploy.yml triggers automatically on push to the main branch.

Some Reflections

After all those specific practices, let me share some thoughts.

AI writing code has developed too fast. At the start of the year I was still manually writing every line; by mid-year I already had a pipeline that can independently complete features, acceptance, and deployment. What I do has shifted from writing code to defining rules, setting boundaries, and accepting results.

The engineer’s role is changing. You used to be a bricklayer, placing bricks one by one. Now you’re a rancher: set up the fences, put out enough feed, and let the herd graze and grow. The software industry is shifting from construction to animal husbandry.

This trend will only accelerate. Better models mean simpler harnesses, cheaper execution costs. You don’t need to be the best programmer—you need to be the best rule-maker. Define clearly what’s allowed, what’s not allowed, and what counts as done—leave the rest to the agent.

The time I spend defining boundaries has a much higher return than the time I spend on actual coding. That’s the most surprising discovery.

How AI Coding Tools Like Cursor Work Under the Hood

Mon, 02 Jun 2025 07:58:17 +0800

In my previous post, How DeepWiki Works, I shared one possible way DeepWiki is implemented. I left a question there: how does DeepWiki chunk a source code repository?

The answer is AST chunking.

In this post I want to analyze how two software development aids — Cursor and Cline — implement “code indexing.” In fact, they are not fundamentally different from DeepWiki; all of them use AST chunking.

AST

An Abstract Syntax Tree (AST) is a tree representation of source code that reflects the code’s syntactic structure. When chunking code, ASTs help us better understand the semantic boundaries of the code.

ASTs are widely used in compilers and source code analysis tools. For example, in the frontend world, Babel and the TypeScript compiler (TSC) use ASTs to transform ES6 or TypeScript code into JavaScript that browsers can run.

Below is a simple example showing how a parser turns TypeScript code into a tree structure. Suppose we have this TypeScript function:

function greet(name: string) {
 return "Hello, " + name;
}

After being processed by an AST tool, it is abstracted into the following tree:

SourceFile
- FunctionDeclaration
  - Identifier: "greet"
  - Parameter
    - Identifier: "name"
  - Block
    - ReturnStatement
      - BinaryExpression
        - StringLiteral: "Hello, "
        - Identifier: "name"

A compiler can then walk this tree and translate it node by node into JavaScript code.
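The node-by-node walk can be sketched with a hand-written tree. This is an illustration only: the node types below are simplified stand-ins I made up to mirror the tree above, and the real TypeScript compiler uses a much richer AST with different fields.

```typescript
// Hand-written node types for illustration; not the TypeScript compiler's AST.
type Expr =
  | { kind: "StringLiteral"; value: string }
  | { kind: "Identifier"; name: string }
  | { kind: "BinaryExpression"; operator: "+"; left: Expr; right: Expr };

interface FunctionDeclaration {
  kind: "FunctionDeclaration";
  name: string;
  parameters: { name: string; type: string }[];
  body: { kind: "ReturnStatement"; expression: Expr }[];
}

// Emit JavaScript text for one expression node, recursing into children.
function emitExpr(node: Expr): string {
  switch (node.kind) {
    case "StringLiteral":
      return JSON.stringify(node.value);
    case "Identifier":
      return node.name;
    case "BinaryExpression":
      return `${emitExpr(node.left)} ${node.operator} ${emitExpr(node.right)}`;
  }
}

// Emit the whole function; parameter type annotations are simply dropped.
function emitFunction(fn: FunctionDeclaration): string {
  const params = fn.parameters.map((p) => p.name).join(", ");
  const body = fn.body.map((s) => `  return ${emitExpr(s.expression)};`).join("\n");
  return `function ${fn.name}(${params}) {\n${body}\n}`;
}

const greetAst: FunctionDeclaration = {
  kind: "FunctionDeclaration",
  name: "greet",
  parameters: [{ name: "name", type: "string" }],
  body: [
    {
      kind: "ReturnStatement",
      expression: {
        kind: "BinaryExpression",
        operator: "+",
        left: { kind: "StringLiteral", value: "Hello, " },
        right: { kind: "Identifier", name: "name" },
      },
    },
  ],
};
const emittedJs = emitFunction(greetAst);
```

`emittedJs` is the original function with the `: string` annotation removed, which is the essence of what a TypeScript-to-JavaScript transform does.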

Once you understand ASTs, you roughly understand how DeepWiki — and even code editors like Cursor — build code indexes.

Cursor

In Cursor’s official documentation, you can find a description of how it indexes user code.

Cursor scans the user’s repository, computes file hashes, and builds a Merkle tree. Similar to the way Git compares file diffs, Cursor uses the Merkle tree to detect file changes in the user’s workspace and incrementally uploads modified files to Cursor’s servers.
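Why a Merkle tree helps can be seen in a small sketch: if two directory hashes match, the whole subtree is skipped without looking at any file in it. This is not Cursor's code. `FileTree`, `MerkleNode`, `buildMerkle`, and `changedPaths` are hypothetical names, and FNV-1a stands in for a cryptographic hash such as SHA-256 only to keep the example dependency-free.

```typescript
// FNV-1a (32-bit) as a stand-in for a real cryptographic hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// In-memory stand-in for a repository: a string is a file's content.
type FileTree = { [name: string]: string | FileTree };

interface MerkleNode {
  hash: string;
  children?: Map<string, MerkleNode>; // present only for directories
}

// A file's hash covers its content; a directory's hash covers its children.
function buildMerkle(tree: FileTree | string): MerkleNode {
  if (typeof tree === "string") return { hash: fnv1a(tree) };
  const children = new Map<string, MerkleNode>();
  const parts: string[] = [];
  for (const name of Object.keys(tree).sort()) {
    const child = buildMerkle(tree[name]);
    children.set(name, child);
    parts.push(`${name}:${child.hash}`);
  }
  return { hash: fnv1a(parts.join("|")), children };
}

// List new or modified files. Deleted files are not reported in this sketch.
function changedPaths(oldNode: MerkleNode | undefined, newNode: MerkleNode, path = ""): string[] {
  if (oldNode && oldNode.hash === newNode.hash) return []; // identical subtree
  if (!newNode.children) return [path];
  const result: string[] = [];
  newNode.children.forEach((child, name) => {
    const childPath = path ? `${path}/${name}` : name;
    result.push(...changedPaths(oldNode?.children?.get(name), child, childPath));
  });
  return result;
}

const treeBefore = buildMerkle({ src: { "a.ts": "let a = 1;", "b.ts": "let b = 2;" }, "README.md": "hello" });
const treeAfter = buildMerkle({ src: { "a.ts": "let a = 1;", "b.ts": "let b = 3;" }, "README.md": "hello" });
const filesToUpload = changedPaths(treeBefore, treeAfter);
```

Only `src/b.ts` changed between the two snapshots, so it is the only path that would need to be uploaded again.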

Uploaded files are then chunked and embedded, and stored in a Turbopuffer database. This is the process of building a RAG over the source code.

The chunking step uses an AST tool to structure the code into a syntax tree, then cuts the serialized tree nodes into small chunks, and finally embeds them as vectors for storage.

Turbopuffer does not only store the vectorized code; it also stores metadata such as the line numbers and source file paths of the code segments.
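A rough sketch of chunking along syntax boundaries while keeping the metadata looks like this. It assumes the parser has already produced line spans for top-level nodes; `NodeSpan`, `Chunk`, `chunkByNodes`, and the `maxLines` limit are hypothetical and not taken from Cursor.

```typescript
// Hypothetical shapes: a real parser (for example tree-sitter) would supply NodeSpan.
interface NodeSpan {
  kind: string;
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
}

interface Chunk {
  filePath: string;
  startLine: number;
  endLine: number;
  text: string;
}

// One chunk per syntax node; oversized nodes are cut into windows of maxLines.
function chunkByNodes(filePath: string, source: string, nodes: NodeSpan[], maxLines = 40): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  for (const node of nodes) {
    for (let start = node.startLine; start <= node.endLine; start += maxLines) {
      const end = Math.min(start + maxLines - 1, node.endLine);
      chunks.push({
        filePath,
        startLine: start,
        endLine: end,
        text: lines.slice(start - 1, end).join("\n"),
      });
    }
  }
  return chunks;
}

const exampleSource = [
  "function greet(name: string) {",
  '  return "Hello, " + name;',
  "}",
  "",
  "function bye(name: string) {",
  '  return "Bye, " + name;',
  "}",
].join("\n");
const exampleNodes: NodeSpan[] = [
  { kind: "FunctionDeclaration", startLine: 1, endLine: 3 },
  { kind: "FunctionDeclaration", startLine: 5, endLine: 7 },
];
const exampleChunks = chunkByNodes("src/greet.ts", exampleSource, exampleNodes);
```

Each function becomes its own chunk, and every chunk carries the path and line range needed to find the original code again after a vector search.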

When Cursor tries to autocomplete user code or generate new code from context, it queries the Turbopuffer database, finds the vectors with the highest similarity, and gets the file path and line numbers for that segment. Cursor then reads the corresponding source code from the user’s repository and puts it into the LLM’s system context. Finally, the LLM returns the newly generated code to Cursor.
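The lookup step can be sketched as a brute-force cosine-similarity search. This is a toy under stated assumptions: a real vector database does not scan every entry, real embeddings have hundreds or thousands of dimensions, and `IndexedChunk`, `cosine`, `topK`, and the file paths are made up for illustration.

```typescript
// Hypothetical index entry: metadata plus the embedding of the chunk.
interface IndexedChunk {
  filePath: string;
  startLine: number;
  endLine: number;
  vector: number[];
}

// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the k chunks most similar to the query, best match first.
function topK(query: number[], index: IndexedChunk[], k: number): IndexedChunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// Toy three-dimensional vectors in place of real embeddings.
const toyIndex: IndexedChunk[] = [
  { filePath: "src/auth.ts", startLine: 10, endLine: 42, vector: [0, 1, 0] },
  { filePath: "src/greet.ts", startLine: 1, endLine: 3, vector: [1, 0, 0] },
  { filePath: "src/welcome.ts", startLine: 5, endLine: 20, vector: [0.9, 0.1, 0] },
];
const bestMatches = topK([1, 0, 0], toyIndex, 2);
```

The result contains only paths and line ranges. The client then reads those lines from the local repository before building the prompt, as described above.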

A user on X put together this flow diagram:

Cline

Cline’s official blog offers a glimpse of how it is implemented.

Cline is an AI agent that helps with coding. Cline does not upload code and build a RAG; instead, it takes a safer and more reliable approach to managing the user’s repository.

Here is the developers’ description of how Cline works:

When you point Cline at a codebase, it doesn’t immediately try to read every file. Instead, it begins by understanding the architecture. Using Abstract Syntax Trees (ASTs), Cline extracts a high-level map of your code – the classes, functions, methods, and their relationships. This happens through our list_code_definition_names tool, which provides structural understanding without requiring full implementation details.

Cline uses its list_code_definition_names tool to convert source code into an AST. Cline treats that AST as a “map” of the entire codebase.
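To give a feel for what a definitions-only map contains, here is a small sketch. It deliberately swaps the AST step for a regular expression, so it illustrates the shape of the output rather than how list_code_definition_names is implemented; `listDefinitions` and its output format are my own invention.

```typescript
// A regex stand-in for AST parsing: it only catches simple declarations
// that start at the beginning of a line.
const DEFINITION = /^(?:export\s+)?(?:async\s+)?(function|class|interface)\s+([A-Za-z_$][\w$]*)/;

// Return "line: kind name" entries, leaving out all implementation details.
function listDefinitions(source: string): string[] {
  const found: string[] = [];
  source.split("\n").forEach((line, i) => {
    const match = DEFINITION.exec(line);
    if (match) found.push(`${i + 1}: ${match[1]} ${match[2]}`);
  });
  return found;
}

const sampleFile = [
  "export class Greeter {",
  '  greet(name: string) { return "Hello, " + name; }',
  "}",
  "",
  "async function main() {}",
  "interface Options { loud: boolean }",
].join("\n");
const codeMap = listDefinitions(sampleFile);
```

The map lists `Greeter`, `main`, and `Options` with their line numbers but none of their bodies, which keeps the context handed to the LLM small.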

When Cline runs a task automatically, it analyzes the file that needs to be modified, builds an AST for that file, and converts the AST into natural-language context (similar to how DeepWiki turns code into documents). It feeds this context to the LLM, letting the LLM decide whether to modify the file or look at another file to gather more context.

Where Cursor compares code snippets by similarity in vector space, Cline converts code snippets into natural-language descriptions and lets the LLM use its semantic understanding to hunt for clues across the repository and compare the semantic similarity of code segments.

Cline’s approach is clearly safer — enterprise users don’t have to worry about Cline abusing the source code. The side effect, however, is higher token consumption. Constantly fetching context across files also takes more time. In some edge cases, Cline may even bounce back and forth between two files, falling into a loop.

In my own experience, Cline performs better than Cursor’s Agent mode on certain models (DeepSeek-R1, GPT-4o), because Cline’s semantic understanding makes better use of these models’ natural-language abilities than vector similarity does.

For programming-optimized Claude Sonnet, though, there is no significant difference, so users need to choose between higher security and faster response time.

Summary

This post mainly covered how code editors use Abstract Syntax Trees (ASTs) to build code indexes and implement code completion.

In general, ASTs are an important tool for understanding the syntactic structure of code, and different implementations have their own trade-offs.

Pairing with AI for Programming — Testing

Wed, 11 Dec 2024 17:02:43 +0800

The future paradigm of software development will be human-AI collaborative programming. This is already an indisputable fact in the software industry. Programming tools like Windsurf, Cursor, and Copilot have, on one hand, improved development efficiency; on the other hand, they’ve made code more black-boxed, less readable, and harder to maintain.

In this series I try to briefly discuss which software development practices are better suited to improving the observability and maintainability of AI-generated code. All articles titled “Pairing with AI for Programming” are just thoughts to get the conversation started, not a systematic methodology. I welcome readers’ corrections for any mistakes.

What Are the Common Problems with Using AI to Write Code?

Observability problem: AI’s implementation is incomplete and often requires manual modification of fragments

The biggest problem with AI-generated code is that it often introduces subtle errors that humans do not easily notice. When humans modify code through prompts, the AI’s behavior is hard to observe, so even a successful bug fix may introduce regressions that break existing logic.

Context problem: lacking global context, fragmented code lacks connections

Due to token limits or economic considerations, many editors will optimize the input content, which can easily lead large models to misunderstand local context. They are unable to handle business logic across functional modules. Especially when the project becomes large, complex modules often depend on other modules, and adjusting business logic requires refactoring several code files.

Solution Approach

The core problems of AI-written code can be summarized as low maintainability caused by lack of observability and lack of context. To address these two problems, we need to first review how traditional software processes make code more observable and maintainable.

Human-Led Unit Testing

Unit tests are the specification for code. Complex business logic usually requires reading a lot of code to understand, but experienced programmers look at the unit tests first. Good unit tests spell out the module’s expected inputs and outputs in the test cases. In Unit Testing: Principles, Practices, and Patterns, Vladimir Khorikov argues that good unit tests should have:

Protection against regressions. That is, tests can prevent previously fixed issues from recurring in regression testing.
Resistance to refactoring. That is, after code refactoring, tests can correctly identify whether the refactoring has affected existing functionality.
Fast feedback. That is, unit tests are easy to run, and when issues are found, they can be quickly located.
Easy to maintain. The maintainability of tests, unlike business code, is reflected in correctly handling dependencies and shared code.

The ultimate purpose of these principles is to ensure that the system under test behaves as expected.

When AI and humans collaborate on code, I personally believe that in writing unit tests, humans should lead (80%) and AI assist (20%), because unit tests define “the behavior I expect.”

Once unit tests are complete, they in turn guide the AI to implement the actual business code. At this point, human involvement decreases and AI takes the lead. Humans repeatedly run unit tests, while passing the test results along with the prompt to the AI, helping AI fix program issues.
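The loop of running the tests and handing the results back can be sketched as follows. This is a minimal sketch, not a real test framework: `SpecCase`, `runSpec`, and the slug example are hypothetical, and the failure strings are simply what I would paste into the next prompt.

```typescript
// Human-written spec: each case states an input and the expected output.
interface SpecCase<I, O> {
  name: string;
  input: I;
  expected: O;
}

// Run every case against an implementation and describe each failure in text.
function runSpec<I, O>(cases: SpecCase<I, O>[], impl: (input: I) => O): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = impl(c.input);
    if (JSON.stringify(actual) !== JSON.stringify(c.expected)) {
      failures.push(
        `${c.name}: expected ${JSON.stringify(c.expected)}, got ${JSON.stringify(actual)}`
      );
    }
  }
  return failures;
}

// Made-up example: the human defines what a URL slug should look like.
const slugSpec: SpecCase<string, string>[] = [
  { name: "lowercases", input: "Hello World", expected: "hello-world" },
  { name: "trims", input: "  hi  ", expected: "hi" },
];

// A first AI draft that forgets to lowercase.
const draftSlug = (s: string): string => s.trim().replace(/\s+/g, "-");
const promptFeedback = runSpec(slugSpec, draftSlug);

// The revised draft after the failure text was sent back with the prompt.
const fixedSlug = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, "-");
const remainingFailures = runSpec(slugSpec, fixedSlug);
```

The human owns `slugSpec`; the AI only ever changes the implementation until `runSpec` returns an empty list.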

Writing AI-Friendly Tests Requires Good Module Design

When writing good tests, you also need to pay attention to correctly splitting modules. A good test typically gives an input and verifies whether the expected result is output. If a module depends on too many external environments for branching logic, the test output will heavily depend on external state. This reduces the module’s observability.

The following two pieces of experience can help you write good code:

When writing tests, test the result of the behavior, not the steps. When writing business code, ask AI to clearly write out the steps.

The “unit” of a unit test doesn’t have to be a single class or function. It can be a group of operations completing an atomic piece of business logic. (Of course, there are different schools of thought supporting class-level testing, but that’s not the focus of this article.) To make AI-generated business code refactor-resistant, you should verify the result of the AI’s behavior, not every implementation step. Coupling test code with implementation steps means that business modifications will break existing tests, making the “expected behavior” constantly have to be modified along with the “specific implementation.”
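The difference between testing results and testing steps can be shown with a tiny example. The `Cart` class and both checks are hypothetical; the point is only which of the two survives a refactor.

```typescript
// Made-up unit under test.
class Cart {
  private items: { sku: string; cents: number }[] = [];

  add(sku: string, cents: number): void {
    this.items.push({ sku, cents });
  }

  totalCents(): number {
    return this.items.reduce((sum, item) => sum + item.cents, 0);
  }
}

// Result-focused check: survives any internal refactor that keeps the total right.
function checkTotal(): boolean {
  const cart = new Cart();
  cart.add("tea", 300);
  cart.add("cup", 700);
  return cart.totalCents() === 1000;
}

// Step-coupled check: reaches into private storage, so it breaks as soon as
// the array is replaced by, say, a Map, even though behavior is unchanged.
function checkInternals(): boolean {
  const cart = new Cart();
  cart.add("tea", 300);
  return (cart as unknown as { items: unknown[] }).items.length === 1;
}
```

Both checks pass today. Only `checkTotal` would keep passing if the AI rewrote how `Cart` stores its items, which is why it is the kind of test worth keeping.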

When AI starts writing business logic, you should drive it step by step, so that humans can correct the AI’s code logic at a particular step. But be careful not to break the test logic.

Stateless code (functional) is the easiest to test

This is because its output is determined entirely by its input. Core code should be kept as stateless as possible, with state and external system dependencies placed in the application service layer. Deep and hard-to-understand core logic should be placed in the domain service layer. For details, see DDD (Domain-Driven Design).
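A minimal sketch of that split, with a made-up refund rule: the core function takes everything it needs as arguments, and the service layer is the only place that touches a store or a clock. `Order`, `OrderStore`, `isRefundable`, and `canRefund` are hypothetical names.

```typescript
// Pure core: no clock, no database, no network. Same input, same output.
interface Order {
  totalCents: number;
  placedAt: number; // epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

function isRefundable(order: Order, now: number, windowDays = 30): boolean {
  const ageMs = now - order.placedAt;
  return ageMs >= 0 && ageMs <= windowDays * DAY_MS;
}

// Application service layer: gathers state, then delegates to the core.
interface OrderStore {
  load(id: string): Order | undefined;
}

function canRefund(store: OrderStore, id: string, clock: () => number): boolean {
  const order = store.load(id);
  return order !== undefined && isRefundable(order, clock());
}

// In a test, the store and the clock are trivial fakes.
const fakeStore: OrderStore = {
  load: (id) => (id === "o1" ? { totalCents: 1000, placedAt: 0 } : undefined),
};
const refundOnDayTen = canRefund(fakeStore, "o1", () => 10 * DAY_MS);
const refundOnDayForty = canRefund(fakeStore, "o1", () => 40 * DAY_MS);
```

`isRefundable` can be tested with plain values, and the service layer needs only two tiny fakes, so an AI-written change to either part is easy to verify.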

Figure: functional core (functional_core.png)

Summary

This article, as the beginning of a series on human-AI collaborative programming, attempts from a testing perspective to alleviate the observability issues of AI-generated code.

In future articles, I hope to discuss, from an architectural design perspective, how to design AI-friendly architectures in which context is easy to maintain.

The content of the article will continue to be updated over time, and discussion is welcome.
