Vibe Coding on Steve Sun

AI Agent + Product Manager = QA Test Engineer

Wed, 08 Oct 2025 20:39:15 +0800

In September, our company organized a discussion on applying AI in the workplace. I happened to be researching end-to-end testing at the time, so I tried OpenCode with Playwright, and the results were astonishingly good.

I chose OpenCode over other AI agent frameworks (such as Claude Code) because it can integrate with the company’s enterprise GitHub Copilot account, which means we can use models like GPT-4 and Claude Sonnet without limits on the corporate intranet.

Playwright, built by Microsoft, is a browser automation and testing framework. Compared with Selenium, it is lighter, its community is more active, and it pairs better with large language models (there is an official MCP server). Playwright also downloads and manages its own browser binaries, so no separate WebDriver setup is needed, which saves a lot of environment configuration.

With OpenCode and the Playwright MCP server, and a few well-crafted prompt templates, you can run a complete set of web UI end-to-end test cases without writing a single line of test code. That would have been unthinkable in the past.

I have long believed that asking programmers to write E2E test code is laborious and does more harm than good. For edge cases and performance, unit tests and API tests cover more than 90% of the needs. The real value of E2E testing is in catching issues in UI interaction and integration. But covering integration and UI scenarios with automated E2E test code carries an extremely high maintenance cost: every tiny UI tweak can break the test code. Statistically, more than half of the failing cases in a test suite are not caused by functional defects at all, but by UI load latency, renamed frontend variables, slow test environments, and so on.

For the real corner cases that threaten the integrated environment (for example, request retries caused by network interruptions, or out-of-range parameters introduced by interface changes), writing E2E tests is less efficient than writing unit tests and API tests. For these reasons I have always encouraged the team to hire a full-time test engineer rather than reserve part of every sprint for developers to maintain E2E tests.

On the other hand, as a project lead, I care more about whether requirements are truly understood and delivered, and how to verify what the engineers actually built.

The arrival of AI agents has changed the agile workflow. With a combination like OpenCode + Playwright MCP server, the AI only needs to read user documentation to pick up the basics of UI operations. It can then open a browser, follow the natural-language description of a test case, and click through page elements step by step to complete an entire business flow. With a bit of guidance it can also produce the exact steps it took, the results, the issues it hit, and a complete test report. This is not far from hiring a junior QA engineer.

Because the maintenance cost drops dramatically (you only maintain a Markdown file describing the test cases), a lot of detailed UI test scenarios that were previously impractical can now be covered by an AI agent. Most importantly, this work does not depend on engineers at all — product managers, POs, or BAs can write test cases directly in natural language, closing the loop between writing user stories and verifying features, and removing the ambiguity that comes from requirements being relayed between business, engineering, and QA.
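To give a feel for how little there is to maintain, here is a minimal sketch that splits a Markdown file of test cases into one prompt per case. The file layout (one `## ` heading per case followed by free-form steps), the `CaseSpec` type, and the prompt wording are all assumptions made for illustration; they are not an OpenCode or Playwright MCP convention.

```python
from dataclasses import dataclass


@dataclass
class CaseSpec:
    title: str
    steps: str


def parse_test_cases(markdown: str) -> list[CaseSpec]:
    """Split a Markdown document into test cases, one per '## ' heading."""
    cases: list[CaseSpec] = []
    title = None
    body: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if title is not None:
                cases.append(CaseSpec(title, "\n".join(body).strip()))
            title, body = line[3:].strip(), []
        elif title is not None:
            body.append(line)
    if title is not None:
        cases.append(CaseSpec(title, "\n".join(body).strip()))
    return cases


def build_prompt(case: CaseSpec) -> str:
    """Wrap one test case in a hypothetical instruction template for the agent."""
    return (
        "Use the browser tools to execute the following test case step by step.\n"
        "Record each action, its result, and any issue you hit.\n\n"
        f"Test case: {case.title}\n{case.steps}"
    )


SAMPLE = """# Checkout tests

## Add item to cart
1. Open the product page.
2. Click "Add to cart".
3. The cart badge shows 1.

## Empty cart message
1. Open the cart with no items.
2. The page shows "Your cart is empty".
"""

cases = parse_test_cases(SAMPLE)
prompts = [build_prompt(c) for c in cases]
```

Each prompt can then be handed to the agent as a separate task, so a product manager only ever edits the Markdown file.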

The Toyota Production System lists several sources of waste in production:

Overproduction
Waiting
Unnecessary transport
Over-processing
Excess inventory
Unnecessary motion
Defects

AI agents address, to some extent, three of these wastes: “overproduction” (writing test code over and over), “waiting” (waiting from requirements to implementation to test cases before a feature can be verified), and “unnecessary transport” (business requirements being passed between different people).

How DeepWiki Works

Sat, 24 May 2025 12:50:40 +0800

DeepWiki is an AI agent project, provided by Devin.ai, that generates detailed documentation from a source code repository. Ever since it went viral, I have been curious about how it works.

I combed through online resources and several open-source projects and arrived at a relatively clear picture of the workflow. For the harder parts, I will follow up with my findings in later posts.

Building a Map of the Code Structure

At its core, DeepWiki is a RAG system. It takes a source code repository as input, parses the code, and converts it into two parts: metadata representing the syntactic structure and file structure, and vector data representing code descriptions and snippets. The metadata is stored in a relational database, while the embeddings of the corresponding code snippets are stored in a vector database for later retrieval by the LLM.
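A minimal sketch of this two-store layout, assuming SQLite for the metadata and a plain dictionary standing in for the vector database. The table schema, the `toy_embed` function, and the sample symbol are hypothetical; DeepWiki's actual storage is not public.

```python
import sqlite3

# Relational side: one row per code symbol (file, class, or function).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE symbols ("
    "id INTEGER PRIMARY KEY, path TEXT, kind TEXT, name TEXT, "
    "start_line INTEGER, end_line INTEGER, description TEXT)"
)

# Vector side: symbol id -> embedding. A real system would use a vector
# database and a learned embedding model; a dict stands in for both here.
vectors: dict[int, list[float]] = {}


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dims] += 1.0
    return vec


def index_symbol(path, kind, name, start_line, end_line, description, snippet):
    """Store the metadata row and the embedding under the same id."""
    cur = db.execute(
        "INSERT INTO symbols (path, kind, name, start_line, end_line, description) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (path, kind, name, start_line, end_line, description),
    )
    vectors[cur.lastrowid] = toy_embed(description + " " + snippet)
    return cur.lastrowid


sym_id = index_symbol(
    "app/auth.py", "function", "login", 10, 24,
    "Validate credentials and create a session",
    "def login(user, password): ...",
)
```

Sharing one id between the two stores is what lets a vector hit be resolved back to a path and line range later.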

Generating WIKI Pages

The process of generating a WIKI page is essentially a RAG query:

1. The program recursively reads the project structure.
2. It queries the metadata database for the current file’s metadata, then searches the vector database for the most relevant code and description IDs.
3. It uses those IDs to look up the descriptions in the metadata database, and the corresponding code snippets in the project files.
4. It assembles all of the above as context, picks an appropriate prompt based on the metadata type (architecture, components, etc.), and feeds it to the LLM.
5. A front-end rendering engine then renders the LLM output into a documentation page.
6. Repeat from step 1.
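The retrieval and prompt-assembly steps above (steps 2 to 4) can be sketched roughly as follows. Everything here is an assumption for illustration: the in-memory `metadata` and `vectors` stores, the `fake_embed` and `fake_llm` stand-ins, and the two-dimensional vectors. A real pipeline would call an embedding model, a vector database, and an LLM API instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_page(file_meta, metadata, vectors, embed, prompts, llm, top_k=2):
    """One pass of steps 2-4 for a single file; returns the raw LLM output."""
    # Step 2: rank stored vectors against the current file's description.
    query = embed(file_meta["description"])
    ranked = sorted(vectors, key=lambda i: cosine(query, vectors[i]), reverse=True)
    # Step 3: resolve the IDs back to descriptions and snippets.
    hits = [metadata[i] for i in ranked[:top_k]]
    context = "\n\n".join(f"{h['description']}\n{h['snippet']}" for h in hits)
    # Step 4: choose the prompt by metadata type and call the model.
    prompt = f"{prompts[file_meta['type']]}\n\nContext:\n{context}"
    return llm(prompt)


metadata = {
    1: {"description": "login handler", "snippet": "def login(): ..."},
    2: {"description": "logout handler", "snippet": "def logout(): ..."},
    3: {"description": "render invoice", "snippet": "def render(): ..."},
}
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
prompts = {"components": "Describe the key components."}


def fake_embed(text):  # hypothetical embedding: auth-related text points along x
    return [1.0, 0.0] if "auth" in text else [0.0, 1.0]


def fake_llm(prompt):  # hypothetical model: echoes the prompt so it can be inspected
    return prompt


page = generate_page(
    {"description": "auth module", "type": "components"},
    metadata, vectors, fake_embed, prompts, fake_llm,
)
```

Passing the embedding function and the model in as parameters keeps the loop itself independent of any particular vendor.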

Image from https://www.gptsecurity.info/2024/05/26/RAG/

Difficulty 1: The Chunking Strategy

A particularly interesting part of the process above is how to chunk code before embedding. For natural language, chunking is usually based on paragraphs, sentences, and punctuation, so each chunk contains the full context of a sentence or paragraph.

For code, it is different. A function body, for example, is wrapped in { and }. If you split it with a natural-language splitter, the body will be scattered across different chunks and lose its context, which hurts the accuracy of vector retrieval.
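A small demonstration of the problem, using Python source and a deliberately naive splitter that cuts every few lines. Python delimits blocks with indentation rather than braces, but the effect is the same; the sample functions and the chunk size of 3 are arbitrary choices for illustration.

```python
SOURCE = '''def area(width, height):
    if width < 0 or height < 0:
        raise ValueError("negative size")
    return width * height


def perimeter(width, height):
    return 2 * (width + height)
'''


def chunk_by_lines(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` lines, ignoring syntax."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


chunks = chunk_by_lines(SOURCE, 3)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---\n{chunk}")
```

The first chunk holds the signature of `area` without its return statement, and the second holds a lone `return` with no function around it, so neither embeds into anything a query about `area` would reliably match.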

There are currently two approaches. The first is to treat each file as a single chunk. In that case, the file cannot exceed the chunk-size limit, and the chunks lack the real call-relationship context. The unit of code organization is not the file (the file tree is just a human-friendly layout) but a graph of class- and function-level dependencies.

The second approach is to first use a syntax tool to perform static analysis on the code file, and then split the code along the syntactic structure based on the analysis. This is more complex to implement, and I could not find much material on it online. Fortunately, I came across RAG for a Codebase with 10k Repos, which describes how to use static syntax analysis to chunk code and build an efficient RAG system for a code repository. The article does not provide an open-source implementation, though. Considering that this is a core technology of a commercial product, it is well worth digging deeper into. I will keep following this area of research.
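As a rough sketch of the second approach, the example below chunks Python source along top-level function and class boundaries. It uses Python's built-in `ast` module as a stand-in for a multi-language parser, which is a substitution on my part and not the method from the article mentioned above; the sample source and the chunk fields are also illustrative.

```python
import ast

SOURCE = '''import math


def area(radius):
    return math.pi * radius ** 2


class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return area(self.radius)
'''


def chunk_by_syntax(source: str) -> list[dict]:
    """Emit one chunk per top-level function or class, using AST line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # lineno is 1-based and end_lineno is inclusive.
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks


chunks = chunk_by_syntax(SOURCE)
```

Every chunk is now a complete definition, at the cost of needing a parser per language and a fallback for definitions that exceed the chunk-size limit.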

Difficulty 2: Parsing the Syntax Structure

Extracting the metadata is somewhat simpler than producing the vector data. I found some clues in another open-source project, Repo Graph.

That project uses tree-sitter to analyze the project’s syntax structure and produces three types of metadata files:

tag.json: basic information such as the path, line number, and description of a file, function, or class.
tree_structure.json: the project’s file tree structure.
*.pkl: a graph of object dependencies.

The *.pkl file holds a graph of object relations. The syntax analyzer builds it by scanning the project’s files, then serializes the Python graph object to disk with the pickle library.
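The serialization step might look something like this. The graph here is a plain dictionary of sets with made-up symbol names, chosen so the sketch needs only the standard library; the actual project may use a dedicated graph type and a different file layout.

```python
import os
import pickle
import tempfile

# Hypothetical dependency graph: each symbol maps to the symbols it references.
graph = {
    "app.main": {"app.auth.login", "app.billing.charge"},
    "app.auth.login": {"app.db.get_user"},
    "app.billing.charge": {"app.db.get_user"},
    "app.db.get_user": set(),
}

path = os.path.join(tempfile.mkdtemp(), "repo_graph.pkl")
with open(path, "wb") as fh:
    pickle.dump(graph, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)


def dependents_of(graph: dict, target: str) -> list[str]:
    """Return the symbols that reference `target` directly."""
    return sorted(src for src, deps in graph.items() if target in deps)


callers = dependents_of(restored, "app.db.get_user")
```

Pickle files can execute arbitrary code when loaded, so they should only be read back from a source you trust.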

From this implementation, it looks like the embedding process in Difficulty 1 could also use the code metadata generated by tree-sitter to chunk the code by line range.
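If that guess is right, chunking reduces to slicing the file by the line numbers recorded in the metadata. The sketch below assumes each tag carries a 1-based, inclusive `start_line` and `end_line`; these field names and the sample records are hypothetical, and the real tag.json schema may differ.

```python
import json

SOURCE_LINES = [
    "import math",
    "",
    "def area(radius):",
    "    return math.pi * radius ** 2",
    "",
    "def diameter(radius):",
    "    return 2 * radius",
]

# Hypothetical tag records; the real tag.json schema may differ.
TAGS_JSON = """[
  {"name": "area", "kind": "function", "start_line": 3, "end_line": 4},
  {"name": "diameter", "kind": "function", "start_line": 6, "end_line": 7}
]"""


def chunks_from_tags(lines, tags):
    """Slice the file into one chunk per tag, using 1-based inclusive line ranges."""
    return {
        tag["name"]: "\n".join(lines[tag["start_line"] - 1:tag["end_line"]])
        for tag in tags
    }


chunks = chunks_from_tags(SOURCE_LINES, json.loads(TAGS_JSON))
```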

Prompt Engineering

In the RAG query phase, you need to assemble different prompts based on the type of metadata being processed.

The Agent as a Judge project has plenty of prompts worth referencing:

Prompt for generating an overview:

Provide a concise overview of this repository focused primarily on:
* Purpose and Scope: What is this project's main purpose?
* Core Features: What are the key features and capabilities?
* Target audience/users
* Main technologies or frameworks used

Prompt for generating an architecture document:

Create a comprehensive architecture overview for this repository. Include:
* A high-level description of the system architecture
* Main components and their roles
* Data flow between components
* External dependencies and integrations

Prompt for generating a components document:

Provide a comprehensive analysis of all key components in this codebase. For each component:
* Name of the component
* Purpose and main responsibility
* How it interacts with other components
* Design patterns or techniques used
* Key characteristics
* File paths that implement this component

For the rest, please refer to the project files; I won’t enumerate them all here.
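Selecting a prompt by page type can be as simple as a lookup table. In this sketch the templates are shortened to the opening sentence of each prompt quoted above, and the `assemble_prompt` helper and the context separator are my own illustrative choices, not code from the Agent as a Judge project.

```python
PROMPT_TEMPLATES = {
    "overview": "Provide a concise overview of this repository.",
    "architecture": "Create a comprehensive architecture overview for this repository.",
    "components": "Provide a comprehensive analysis of all key components in this codebase.",
}


def assemble_prompt(page_type: str, context_snippets: list[str]) -> str:
    """Pick the template for the page type and append the retrieved context."""
    try:
        instruction = PROMPT_TEMPLATES[page_type]
    except KeyError:
        raise ValueError(f"no prompt template for page type {page_type!r}") from None
    context = "\n\n".join(context_snippets)
    return f"{instruction}\n\n--- Retrieved context ---\n{context}"


prompt = assemble_prompt("architecture", ["def login(): ...", "class Session: ..."])
```

Failing loudly on an unknown page type is deliberate: silently falling back to a generic prompt would produce a page that looks plausible but answers the wrong question.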

Summary

DeepWiki is a code documentation generation tool built on a RAG system. It works through the following steps:

Perform syntactic analysis on the repository to produce metadata and vector data.
Query that data through the RAG system to generate documentation.
Render the results into readable documentation pages with a front-end engine.

There are two main difficulties in implementation:

The code chunking strategy: it must consider the syntactic structure of the code, not just split it the way you would split natural language.
Parsing the syntax structure: tools like tree-sitter can be used to parse the code’s structure.

Although there are some open-source projects to reference, the core chunking strategy implementation still needs to be studied in depth.

Why You Shouldn't Let AI Generate Your Unit Tests

Thu, 01 May 2025 09:27:36 +0800

Recently I heard Hailong Zhang, founder of Gru.ai, mention in a podcast that automatically generating unit tests is the main direction they are pursuing in AI coding.

Gru.ai’s website has these two lines:

Forget about unit testing – get covered automatically
Harness the expertise of AI engineers to boost your team’s testing efficiency while reducing costs and ensuring top-notch quality.

Zhang’s insights on AI coding are inspiring. I am skeptical, though, of the claim that using AI to write tests cuts cost and boosts efficiency. I think they themselves weren’t fully confident when writing the second line — they couldn’t help tacking on “ensuring top-notch quality” for reassurance.

Unit tests are the concretization of requirements. They are the smallest-grained, closest-to-the-code constraint tool in the entire testing system. Unit tests are used not only to check whether code meets requirements, but more often to detect corner cases — because what makes a program reliable is that it doesn’t break at the boundaries. That is also what distinguishes an experienced engineer from a junior one.

But what Gru.ai is doing is using AI to raise unit test coverage. As we all know, higher coverage does not equal higher testing efficiency, let alone higher quality.
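A toy example of the gap between coverage and quality. The `can_withdraw` function and both checks are invented for illustration: the first check executes every line of the function, so line coverage reports 100%, yet only the second check, which encodes the boundary a human cares about, exposes the bug.

```python
def can_withdraw(balance: int, amount: int) -> bool:
    """Intended rule: a withdrawal is allowed when it does not exceed the balance."""
    return amount < balance  # Bug: should be `<=`, so withdrawing everything fails.


def check_happy_path():
    # Executes the only line in the function, so line coverage is 100%.
    assert can_withdraw(balance=100, amount=30) is True


def check_boundary():
    # Withdrawing the exact balance is the boundary the requirement cares about.
    assert can_withdraw(balance=100, amount=100) is True


check_happy_path()  # passes despite the bug
try:
    check_boundary()
    boundary_passed = True
except AssertionError:
    boundary_passed = False
print("boundary check passed:", boundary_passed)
```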

Letting AI automatically write runnable unit tests from a single prompt is very tempting for junior developers. It’s like a shooter trying to improve their accuracy by firing the gun first and then drawing the bullseye around the bullet hole.

The purpose of improving test coverage is to push human engineers to think carefully about edge cases. Using AI to help humans generate tests as a time-saver is perfectly fine, but Gru.ai instead tells us to “forget about unit testing, get covered automatically.” The AI usually doesn’t know the edge cases unless a human explicitly tells it. So how does the AI infer the edge cases on its own? And how do we know the AI’s inferred edge cases are correct? If the AI tests the code, who tests the AI?

If products like Cursor embody Silicon Valley’s imagination of vibe coding, then Gru.ai embodies Chinese programmers’ “rosy expectations” of vibe testing.

Pairing with AI for Coding—Pain Points

Sun, 23 Mar 2025 00:00:01 +0800

When pairing with AI for coding, you often run into situations where large models fail to execute tasks correctly. The most common are:

Endless task loops
Models unable to fix environment issues
Models losing context in the second half of a long task

A Few Lessons from My Use

For my own use, I often pair Cline with GitHub Copilot. What I really like about Cline is its Checkpoint restore feature, which lets you re-edit the prompt and resume execution at the point where something went wrong. This lets me call different models for the same task and observe how each handles the problem.

For planning (Plan), I usually use DeepSeek-R1, Gemini 2.0 Flash Thinking, or Claude 3.7. Among these, only Claude 3.7 can produce a relatively accurate plan; the others are more or less prone to going off-track. For example, DeepSeek-R1 likes to do extra work: when you ask it to translate Chinese, it calls an MCP translation service instead of translating the text itself.

From a cost perspective, Gemini 2.0 Flash Thinking is a fast and economical model for simple problems. For complex problems, going straight to Claude 3.7 may make it easier to keep costs under control.

For task execution (Act), DeepSeek-V3 is very inconsistent: it often gets stuck in loops or loses context. Claude is too expensive, while Gemini 2.0 Flash is relatively accurate and cost-effective. The Qwen models from China don’t fully support function calling, and Cline doesn’t support them either, so I can’t test them for now.

Tackling the Tricky Problems of AI Coding

I recently read the article AI Blindspots. The author systematically catalogs the problems encountered in AI programming and shares his thinking. It was very inspiring to me. I used an Agent to translate it into Chinese and then polished it by hand—you can read it here: AI Programming Blindspots.

In summary, the three keys to solving problems in AI coding are still: more accurate prompts, more complete context, and a smaller problem scope.

I believe that as technology evolves, programming paradigms will undergo earth-shaking changes. If refactoring becomes this easy, should Martin Fowler’s Refactoring get a new edition for the AI era? If documentation is no longer read by people but fed to models as context, what should documentation look like? Will providing a vectorized documentation interface for large models to call become the new normal in future programming frameworks?

I look forward to the future.