Software Architecture on Steve Sun

How AI Coding Tools Like Cursor Work Under the Hood

Mon, 02 Jun 2025 07:58:17 +0800

In my previous post, How DeepWiki Works, I shared one possible way DeepWiki is implemented. I left a question there: how does DeepWiki chunk a source code repository?

The answer is AST chunking.

In this post I want to analyze how two software development aids — Cursor and Cline — implement “code indexing.” In fact, they are not fundamentally different from DeepWiki; all of them use AST chunking.

AST

An Abstract Syntax Tree (AST) is a tree representation of source code that reflects the code’s syntactic structure. When chunking code, ASTs help us better understand the semantic boundaries of the code.

ASTs are widely used in compilers and source code analysis tools. For example, in the frontend world, Babel and the TypeScript compiler (TSC) use ASTs to transform ES6 or TypeScript code into JavaScript that browsers can run.

Below is a simple example showing how a parser converts TypeScript code into an AST. Suppose we have this TypeScript function:

function greet(name: string) {
 return "Hello, " + name;
}

After being parsed, it is represented as the following tree:

SourceFile
- FunctionDeclaration
  - Identifier: "greet"
  - Parameter
    - Identifier: "name"
  - Block
    - ReturnStatement
      - BinaryExpression
        - StringLiteral: "Hello, "
        - Identifier: "name"

A compiler can then walk this tree and translate it node by node into JavaScript code.

Once you understand ASTs, you roughly understand how DeepWiki — and even code editors like Cursor — build code indexes.

Cursor

In Cursor’s official documentation, you can find a description of how it indexes user code.

Cursor scans the user’s repository, computes file hashes, and builds a Merkle tree. Similar to the way Git compares file diffs, Cursor uses the Merkle tree to detect file changes in the user’s workspace and incrementally uploads modified files to Cursor’s servers.
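To make the change-detection idea concrete, here is a minimal sketch of a Merkle root over file hashes. It is an illustration under simplifying assumptions, not Cursor's implementation: files live in an in-memory map, the tree is a plain binary tree over sorted paths rather than a mirror of the directory layout, and the names leafHashes and merkleRoot are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// leafHashes hashes each file (path plus content) and returns the hashes in sorted path order.
func leafHashes(files map[string]string) [][]byte {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	leaves := make([][]byte, 0, len(paths))
	for _, p := range paths {
		h := sha256.Sum256([]byte(p + "\x00" + files[p]))
		leaves = append(leaves, h[:])
	}
	return leaves
}

// merkleRoot folds a level of hashes pairwise until a single root hash remains.
func merkleRoot(level [][]byte) string {
	if len(level) == 0 {
		return ""
	}
	for len(level) > 1 {
		var next [][]byte
		for i := 0; i < len(level); i += 2 {
			j := i + 1
			if j == len(level) {
				j = i // odd count: pair the last node with itself
			}
			h := sha256.Sum256(append(append([]byte{}, level[i]...), level[j]...))
			next = append(next, h[:])
		}
		level = next
	}
	return hex.EncodeToString(level[0])
}

func main() {
	files := map[string]string{"main.go": "package main", "util.go": "package main // v1"}
	before := merkleRoot(leafHashes(files))
	files["util.go"] = "package main // v2" // simulate an edit in the workspace
	after := merkleRoot(leafHashes(files))
	fmt.Println("changed:", before != after) // prints "changed: true"
}
```

Comparing two roots answers "did anything change?" in one step. A tree that follows the directory structure additionally lets the client descend only into subtrees whose hashes differ, which is what makes incremental upload cheap.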

Uploaded files are then chunked and embedded, and stored in a Turbopuffer database. This is the process of building a RAG over the source code.

The chunking step uses an AST tool to structure the code into a syntax tree, then cuts the serialized tree nodes into small chunks, and finally embeds them as vectors for storage.

Turbopuffer does not only store the vectorized code; it also stores metadata such as the line numbers and source file paths of the code segments.
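As a rough illustration of AST chunking with metadata, the sketch below splits a source file into one chunk per function and records the path and line range of each. It uses Go's standard go/parser on Go source purely as a stand-in; the parser Cursor actually uses and its chunk boundaries are not documented here, and the Chunk type and chunkFile function are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Chunk is one function declaration plus the metadata needed to find it again.
type Chunk struct {
	Path      string
	Name      string
	StartLine int
	EndLine   int
}

// chunkFile parses Go source and emits one chunk per function or method declaration.
func chunkFile(path, src string) ([]Chunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, path, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var chunks []Chunk
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue // this sketch ignores types, constants, and variables
		}
		chunks = append(chunks, Chunk{
			Path:      path,
			Name:      fn.Name.Name,
			StartLine: fset.Position(fn.Pos()).Line,
			EndLine:   fset.Position(fn.End()).Line,
		})
	}
	return chunks, nil
}

const sample = `package demo

func Greet(name string) string {
	return "Hello, " + name
}

func Add(a, b int) int {
	return a + b
}
`

func main() {
	chunks, err := chunkFile("demo.go", sample)
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		// prints "demo.go:3-5 Greet" and "demo.go:7-9 Add"
		fmt.Printf("%s:%d-%d %s\n", c.Path, c.StartLine, c.EndLine, c.Name)
	}
}
```

Because each chunk ends on a syntactic boundary, an embedding computed from it describes a whole function rather than an arbitrary window of text, and the stored line range is enough to read the original code back later.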

When Cursor tries to autocomplete user code or generate new code from context, it queries the Turbopuffer database, finds the vectors with the highest similarity, and gets the file path and line numbers for that segment. Cursor then reads the corresponding source code from the user’s repository and puts it into the LLM’s system context. Finally, the LLM returns the newly generated code to Cursor.

A user on X put together this flow diagram:

Cline

Cline’s official blog offers a glimpse of how it is implemented.

Cline is an AI agent that helps with coding. Cline does not upload code and build a RAG; instead, it takes a safer and more reliable approach to managing the user’s repository.

Here is the developers’ description of how Cline works:

When you point Cline at a codebase, it doesn’t immediately try to read every file. Instead, it begins by understanding the architecture. Using Abstract Syntax Trees (ASTs), Cline extracts a high-level map of your code – the classes, functions, methods, and their relationships. This happens through our list_code_definition_names tool, which provides structural understanding without requiring full implementation details.

Cline uses its list_code_definition_names tool to convert source code into an AST. Cline treats that AST as a “map” of the entire codebase.
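To give a feel for what such a "map" contains, here is a small sketch that lists the top-level definitions in a source file. It is not Cline's tool: it uses Go's standard go/parser on Go source as a stand-in, and the function name definitionNames and the output format are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// definitionNames returns "kind name" for each top-level type, function, and method.
func definitionNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			kind := "func"
			if d.Recv != nil {
				kind = "method" // a receiver means the function belongs to a type
			}
			names = append(names, kind+" "+d.Name.Name)
		case *ast.GenDecl:
			for _, spec := range d.Specs {
				if ts, ok := spec.(*ast.TypeSpec); ok {
					names = append(names, "type "+ts.Name.Name)
				}
			}
		}
	}
	return names, nil
}

const sample = `package demo

type Greeter struct{ Prefix string }

func (g Greeter) Greet(name string) string { return g.Prefix + name }

func NewGreeter() Greeter { return Greeter{Prefix: "Hello, "} }
`

func main() {
	names, err := definitionNames(sample)
	if err != nil {
		panic(err)
	}
	for _, n := range names {
		fmt.Println(n) // prints "type Greeter", "method Greet", "func NewGreeter"
	}
}
```

A listing like this is far smaller than the file itself, so an agent can scan many files cheaply and open only the ones whose definitions look relevant to the task.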

When Cline runs a task automatically, it analyzes the file that needs to be modified, builds an AST for that file, and converts the AST into natural-language context (similar to how DeepWiki turns code into documents). It feeds this context to the LLM, letting the LLM decide whether to modify the file or look at another file to gather more context.

Where Cursor compares code snippets by similarity in vector space, Cline converts code snippets into natural-language descriptions and lets the LLM use its semantic understanding to hunt for clues across the repository and judge which code segments are related.

Cline’s approach is clearly safer — enterprise users don’t have to worry about Cline abusing the source code. The side effect, however, is higher token consumption. Constantly fetching context across files also takes more time. In some edge cases, Cline may even bounce back and forth between two files, falling into a loop.

In my own experience, Cline performs better than Cursor’s Agent mode on certain models (DeepSeek-R1, GPT-4o), because Cline’s semantic understanding makes better use of these models’ natural-language abilities than vector similarity does.

For programming-optimized Claude Sonnet, though, there is no significant difference, so users need to choose between higher security and faster response time.

Summary

This post mainly covered how code editors use Abstract Syntax Trees (ASTs) to build code indexes and implement code completion.

In general, ASTs are an important tool for understanding the syntactic structure of code, and different implementations have their own trade-offs.

A General Approach to Server-Side Performance Issues in Go

Tue, 06 May 2025 10:35:41 +0800

I recently ran into a performance issue. A customer reported that two Go-based service processes running in the background of their IPC (industrial PC) device kept climbing in memory usage, peaking at 40% of total memory. One of those processes was our log-collection agent.

My first suspicion was a memory leak, because we’d had memory leaks caused by goroutine blocking in the past (I discussed this in Common Patterns of Memory Leaks in Go), so I started by reviewing everywhere we created and released goroutines.

After the last incident, we’d added goroutine leak detection at the unit-test level using go.uber.org/goleak. You only need to add a single line at the start of a test:

func TestXXX(t *testing.T) {
 defer goleak.VerifyNone(t)
 // ...
}

It automatically checks for lingering goroutines after the test finishes. For background goroutines that run on a delay, you can use wait or sleep in the test to wait for them to be released before the test case ends.

The first round of investigation ruled out problems caused by goroutines in the code itself. So I turned my attention to another suspect: scheduled tasks.

According to the customer, memory would slowly climb even with no foreground activity.

In our code, we use the third-party package github.com/robfig/cron/v3, which orchestrates scheduled tasks. Usage looks like this:

c := cron.New()
c.AddFunc("@every 10s", callbackFunc) // returns (cron.EntryID, error)
c.Start() // run the scheduler in its own goroutine

This structure defines a scheduled task. Its implementation is also based on goroutines, so I added Go’s built-in pprof to the dependencies in main.go, rebuilt the project binary, and deployed it to a test environment using the same hardware configuration as the customer. This way, once the project started, I could pull memory information from a specific port. (For more on pprof, see Profiling Go Programs.)

I used pprof’s interface to grab heap data at different intervals:

curl -o heap.1.out http://127.0.0.1:6060/debug/pprof/heap

Then used:

go tool pprof -http=:8099 -base heap.1.out heap.2.out

to compare the two results. In the web UI, I selected the In Use Space option, which let me see what memory hadn’t been released.

Even after this second round, I still didn’t find a memory leak. But this time I noticed that one of the scheduled tasks ran every 10 seconds, and CPU usage clearly spiked during execution. Looking at the code for this task, it used the third-party library github.com/shirou/gopsutil/process to query system process IDs and process names.

Looking at the library’s source code, I found that the way it queries process IDs is by loading all process information on the system into memory and then matching the ID or name. So, if the customer’s device has a lot of processes, each query consumes a large amount of memory.

Calling this library from a scheduled task that runs every 10 seconds is clearly very inefficient.

After further communication with the customer, we discovered that of the two high-memory processes, the other one also showed high CPU usage. So we had the customer send us a screenshot of the top command. The moment I saw the screenshot, the truth came into focus:

The customer’s IPC device was a lower-performance version—while it had plenty of memory, the CPU was struggling. When multiple processes run background tasks simultaneously, the CPU periodically maxes out, causing tasks to block. And the third-party library we use implements scheduled tasks on top of goroutines. When the previous task is blocked, the next task still creates a new background goroutine, causing goroutines to pile up in memory.

This was a goroutine blocking problem caused by high CPU usage and overly short intervals in the scheduled tasks.
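One possible mitigation (a sketch, not necessarily the fix that shipped) is to drop a tick when the previous run of the same task has not finished, so blocked runs cannot accumulate. The version below uses only the standard library, and skipIfRunning is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// skipIfRunning wraps job so that an invocation is dropped, and false returned,
// when the previous invocation is still in progress.
func skipIfRunning(job func()) func() bool {
	var running int32
	return func() bool {
		if !atomic.CompareAndSwapInt32(&running, 0, 1) {
			return false // previous run still active: skip this tick
		}
		defer atomic.StoreInt32(&running, 0)
		job()
		return true
	}
}

func main() {
	started := make(chan struct{}, 2)
	release := make(chan struct{})
	tick := skipIfRunning(func() {
		started <- struct{}{}
		<-release // simulate a task stuck on a starved CPU
	})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		tick() // first tick starts and blocks
	}()
	<-started

	fmt.Println("overlapping tick ran:", tick()) // prints "overlapping tick ran: false"
	close(release)
	wg.Wait()
	fmt.Println("later tick ran:", tick()) // prints "later tick ran: true"
}
```

robfig/cron v3 ships a job wrapper for the same purpose, cron.SkipIfStillRunning, which can be installed with cron.WithChain when constructing the scheduler. Lengthening the interval and making the task itself cheaper remain the more fundamental fixes.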

Once we knew the cause, the rest was straightforward: optimize the code logic, ship a new version, and explain the issue to the customer…

That’s the full process of debugging this Go service performance issue. If you run into something similar, I hope this helps.

Pairing with AI for Programming — Testing

Wed, 11 Dec 2024 17:02:43 +0800

The future paradigm of software development will be human-AI collaborative programming. This is already an indisputable fact in the software industry. Programming tools like Windsurf, Cursor, and Copilot have, on one hand, improved development efficiency; on the other hand, they’ve made code more black-boxed, less readable, and harder to maintain.

I try to briefly discuss which software development practices are more suitable for improving the observability and maintainability of AI-generated code in the AI era. All articles titled “Pairing with AI for Programming” are just thoughts to get the conversation started, not a systematic methodology. I welcome readers’ corrections for any mistakes.

What Are the Common Problems with Using AI to Write Code?

Observability problem: AI’s implementation is incomplete and often requires manual modification of fragments

The biggest problem with AI-generated code is that it often introduces subtle errors that humans do not easily notice. Because the AI’s behavior is hard to observe, fixing one bug through prompts can introduce regressions that break existing logic.

Context problem: lacking global context, fragmented code lacks connections

Due to token limits or economic considerations, many editors will optimize the input content, which can easily lead large models to misunderstand local context. They are unable to handle business logic across functional modules. Especially when the project becomes large, complex modules often depend on other modules, and adjusting business logic requires refactoring several code files.

Solution Approach

The core problems of AI-written code can be summarized as low maintainability caused by lack of observability and lack of context. To address these two problems, we need to first review how traditional software processes make code more observable and maintainable.

Human-Led Unit Testing

Unit tests are the specification for code. Complex business logic usually requires reading a lot of code to understand, but experienced programmers look at the unit tests first. Good unit tests spell out the module’s expected inputs and outputs in the test cases. In Unit Testing Principles, Practices, and Patterns, the author argues that good unit tests should have four properties:

Protection against regressions. That is, tests can prevent previously fixed issues from recurring in regression testing.
Resistance to refactoring. That is, after code refactoring, tests can correctly identify whether the refactoring has affected existing functionality.
Fast feedback. That is, unit tests are easy to run, and when issues are found, they can be quickly located.
Easy to maintain. The maintainability of tests, unlike business code, is reflected in correctly handling dependencies and shared code.

The ultimate purpose of these principles is to ensure that the system under test behaves as expected.

When AI and humans collaborate on code, I personally believe that in writing unit tests, humans should lead (80%) and AI assist (20%), because unit tests define “the behavior I expect.”

Once unit tests are complete, they in turn guide the AI to implement the actual business code. At this point, human involvement decreases and AI takes the lead. Humans repeatedly run unit tests, while passing the test results along with the prompt to the AI, helping AI fix program issues.

Writing AI-Friendly Tests Requires Good Module Design

When writing good tests, you also need to pay attention to correctly splitting modules. A good test typically gives an input and verifies whether the expected result is output. If a module depends on too many external environments for branching logic, the test output will heavily depend on external state. This reduces the module’s observability.

The following two pieces of experience can help you write good code:

When writing tests, test the result of the behavior, not the steps. When writing business code, ask AI to clearly write out the steps.

The “unit” of a unit test doesn’t have to be a single class or function. It can be a group of operations completing an atomic piece of business logic. (Of course, there are different schools of thought supporting class-level testing, but that’s not the focus of this article.) To make AI-generated business code refactor-resistant, you should verify the result of the AI’s behavior, not every implementation step. Coupling test code with implementation steps means that business modifications will break existing tests, making the “expected behavior” constantly have to be modified along with the “specific implementation.”

When AI starts writing business logic, you should drive it step by step, during which humans can correct the AI’s code logic for a particular step. But be careful not to break the test logic.

Stateless code (functional) is the easiest to test

Stateless code is easy to test because its output depends only on its input. Core code should be kept as stateless as possible, with state and external system dependencies placed in the application service layer. Deep and hard-to-understand core logic should be placed in the domain service layer. For details, refer to Domain-Driven Design (DDD).
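As a small illustration of a stateless core with a thin stateful shell, consider the sketch below. The domain (classifying a CPU reading into an alert level), the thresholds, and the names Classify and report are all hypothetical; the point is only that the core can be checked by comparing inputs to results, without mocking anything.

```go
package main

import (
	"errors"
	"fmt"
)

// Level is the alert level derived from a CPU usage percentage.
type Level string

const (
	OK       Level = "ok"
	Warning  Level = "warning"
	Critical Level = "critical"
)

// Classify is the stateless core: the same input always gives the same output, with no I/O.
func Classify(cpuPercent float64) (Level, error) {
	switch {
	case cpuPercent < 0 || cpuPercent > 100:
		return "", errors.New("cpu percent out of range")
	case cpuPercent >= 90:
		return Critical, nil
	case cpuPercent >= 70:
		return Warning, nil
	default:
		return OK, nil
	}
}

// report is the thin shell: it obtains a reading from the outside world
// and delegates the decision to Classify.
func report(read func() float64) string {
	level, err := Classify(read())
	if err != nil {
		return "invalid reading: " + err.Error()
	}
	return "cpu level: " + string(level)
}

func main() {
	// Each case states an input and the expected result; no implementation steps are checked.
	cases := []struct {
		in   float64
		want Level
	}{{10, OK}, {75, Warning}, {95, Critical}}
	for _, c := range cases {
		got, _ := Classify(c.in)
		fmt.Println(c.in, got == c.want)
	}
	fmt.Println(report(func() float64 { return 42 }))
}
```

If the AI later rewrites Classify with a lookup table or a different control flow, these checks keep passing as long as the results are unchanged, which is the refactoring resistance discussed above.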

functional_core.png

Summary

This article, as the beginning of a series on human-AI collaborative programming, attempts from a testing perspective to alleviate the observability issues of AI-generated code.

In future articles, I hope to discuss, from an architectural design perspective, how to design AI-friendly architectures that are easy to maintain context for.

The content of the article will continue to be updated over time, and discussion is welcome.

Dependency Inversion in Go

Thu, 21 Nov 2024 11:26:22 +0800

This article is fairly basic; it was material I used when training Java programmers in Go.

Why Dependency Inversion Principle (DIP)?

The Dependency Inversion Principle (DIP) is a very important design principle in software development. Many programmers have never learned about it, or only know the general idea from Java Spring. In this brief article I use a simple Go example to show how to implement dependency inversion in the simplest way.

If you don’t yet know what it is, you can refer to the description in Wikipedia, or read Martin Fowler’s article on DIP.

The Dependency Inversion Principle addresses a common risk in software development: dependency.

Try to recall:

When you try to use mocks to shield underlying details for testing, you find that the class you want to test references a large number of framework-provided interfaces, requiring you to mock many underlying implementations.
When you try to modify an old low-level class, but there are too many upper-layer service classes depending on it, you worry about side effects while refactoring the upper-layer code at every dependency point.

Let’s analyze these two scenarios:

In scenario 1, the application class depends on the implementation provided by the framework, making it difficult to separate the application class from the framework. The industry method for dealing with this problem is called Inversion of Control (IoC). The application class should not depend on the framework; instead, the framework provides slots, registering the application class with the framework, and the framework uniformly dispatches the application to execute the corresponding methods.

In scenario 2, the service class depends on the low-level class, making modifications to the low-level class increasingly difficult. The solution is Dependency Injection (DI). The upper-layer class does not directly reference the low-level class, but instead, the low-level class on which the upper-layer class depends is injected at the point of use.

Combining these two scenarios captures the core of the Dependency Inversion Principle:

High-level modules should not depend on low-level modules; both should depend on abstractions.
Abstractions should not depend on details. Details should depend on abstractions.

These two principles ensure high cohesion and low coupling among modules, while also creating the conditions for mocking and iteratively updating modules.

Implementing It in Go

Suppose we need to query user information from a user service. There are two interfaces: UserRepository serves as the data layer responsible for querying the database, and UserService handles business logic and depends on UserRepository. At the same time, to facilitate testing, we also need to write a mock data layer implementation. The entire structure is shown in the figure below.

Go example

Next, we define the two interfaces and write their implementations. We also write a NewUserService constructor alongside the UserService implementation to inject the UserRepository implementation it depends on.

// Defined in user_repository.go: the data-layer abstraction
type UserRepository interface {
	GetByID(id int) (*User, error)
	Save(user *User) error
}

// ... concrete implementation of UserRepository omitted

// Defined in user_service.go: the business-logic abstraction
type UserService interface {
	GetUser(id int) (*User, error)
	CreateUser(name string, age int) error
}

// ... concrete implementation of UserService (UserServiceImpl) omitted

// NewUserService injects the repository that the service depends on.
func NewUserService(repo UserRepository) UserService {
	return &UserServiceImpl{
		repo: repo,
	}
}

So the question arises: can user_service.go directly reference the concrete repository implementation (for example, MySQLUserRepository)? Obviously not, because this would couple the service module directly to the data-layer module.

This is the core of dependency inversion: the upper-layer module does not directly reference the lower-layer module; instead, the executing class initializes the Service and injects the dependent lower-layer service.

// In main.go
func main() {
	repo := &MySQLUserRepository{}
	userService := NewUserService(repo)
	_ = userService // hand the service to the rest of the application here
}

This way, when writing test mock code, you don’t need to modify any code logic. You can simply replace the parameter of NewUserService in the test with a fake test instance.

// In user_service_test.go
func TestUserService(t *testing.T) {
	repo := &MockTestUserRepository{}
	userService := NewUserService(repo)
	_ = userService // call userService methods and assert on the results
}

In addition, if the data layer changes its implementation or migrates to another database, you only need to modify two places: the data layer’s implementer and the dependency injector. The caller UserService is completely unaffected. The entire project won’t form a dependency trap.
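Putting the pieces together, a complete runnable version might look like the sketch below. The interfaces and NewUserService follow the snippets above. The User fields, the body of UserServiceImpl, the name validation, and the InMemoryUserRepository fake are assumptions added so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
)

type User struct {
	ID   int
	Name string
	Age  int
}

type UserRepository interface {
	GetByID(id int) (*User, error)
	Save(user *User) error
}

type UserService interface {
	GetUser(id int) (*User, error)
	CreateUser(name string, age int) error
}

// UserServiceImpl depends only on the UserRepository interface, never on a concrete store.
type UserServiceImpl struct {
	repo UserRepository
}

func NewUserService(repo UserRepository) UserService {
	return &UserServiceImpl{repo: repo}
}

func (s *UserServiceImpl) GetUser(id int) (*User, error) { return s.repo.GetByID(id) }

func (s *UserServiceImpl) CreateUser(name string, age int) error {
	if name == "" {
		return errors.New("name is required") // illustrative business rule
	}
	return s.repo.Save(&User{Name: name, Age: age})
}

// InMemoryUserRepository is a hypothetical fake for tests; no database is needed.
type InMemoryUserRepository struct {
	users  map[int]*User
	nextID int
}

func NewInMemoryUserRepository() *InMemoryUserRepository {
	return &InMemoryUserRepository{users: map[int]*User{}, nextID: 1}
}

func (r *InMemoryUserRepository) GetByID(id int) (*User, error) {
	u, ok := r.users[id]
	if !ok {
		return nil, errors.New("user not found")
	}
	return u, nil
}

func (r *InMemoryUserRepository) Save(u *User) error {
	u.ID = r.nextID
	r.users[u.ID] = u
	r.nextID++
	return nil
}

func main() {
	svc := NewUserService(NewInMemoryUserRepository()) // the injector picks the implementation
	if err := svc.CreateUser("Alice", 30); err != nil {
		panic(err)
	}
	u, err := svc.GetUser(1)
	fmt.Println(u.Name, err) // prints "Alice <nil>"
}
```

Swapping the in-memory fake for a MySQL-backed repository changes only the line in main that constructs the repository; UserServiceImpl compiles unchanged because it only knows the interface.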

Summary

The two core principles of the Dependency Inversion Principle:

High-level modules do not depend directly on low-level modules; both depend on abstract interfaces.
Abstract interfaces do not depend on implementations; implementations depend on abstract interfaces.

Implementing these two principles in Go isn’t difficult: you just need to transform the original caller-implementer relationship into a registrar-caller-implementer relationship. There are also some libraries and frameworks in Go that implement dependency inversion, but the core idea is the same.

Monitoring System Project Retrospective

Thu, 24 Oct 2024 15:52:22 +0800

This article is a retrospective of a large chunk of my work over the past 3 years. As the project’s architect, I’ll also reflect on some issues left over from the early stages of the project and share my personal approach to solving them.

The Project Relies Heavily on an Open-Source Component, Over-Customized

Our project is a collection/monitoring system for logs and software/hardware performance metrics that runs on edge devices. Considering the performance of edge computing devices (IPCs), when selecting open-source components, we emphasized being lightweight and supporting a rich set of output standards. In the early days, the department architect chose Fluent-Bit as the core component. Fluent-Bit is an open-source, lightweight, mildly extensible data collector written in C. It was originally used for log collection and has gradually evolved into a full-featured Agent. Compared to the popular OpenTelemetry, Fluent-Bit is more out-of-the-box and lighter, but harder to modify and extend.

At the very beginning, the whole team had no experience with monitoring systems, so we created quite a few pitfalls for ourselves when designing the system. First, users had to perform overly cumbersome operations in the UI: they had to configure, in sequence, the output target (address, port, protocol, format, encryption method, etc.) and the type of metrics to collect, and then click “Apply” manually.

After several iterations, we simplified the operational logic. But as with most programs running on industrial PCs, users typically don’t modify the configuration in the UI after initial setup; end users care more about system resource usage and stability. Designing the project as an interaction-heavy, consumer-facing product was therefore a mistake, and a lesson learned.

Second, to accommodate the UI design flow (for example, allowing users to create multiple different configuration items to different target addresses), the backend developers came up with complex workarounds. Because Fluent-Bit is a single-process event-driven model with only a single configuration file, every time the configuration file is modified, the Fluent-Bit process must be restarted. This created a risk of data loss during restarts for a stable-running monitoring system. Additionally, if a newly added configuration item is wrong, it can cause the entire generated configuration file to fail, leading to issues like the Fluent-Bit process hanging.

To solve these problems, the backend engineers came up with various tricks using Fluent-Bit’s parameters. For example, using different tags to route different user configuration items, configuring parameters and filter rules separately for each configuration item. Another example was setting the cache data packet size and cache timeout to 0, so that after Fluent-Bit restarts, it would first try to resend the data cached in the file system, indirectly preventing user data loss.

These tricks not only increased maintenance difficulty but, from the user’s perspective, did not bring any real value improvement.

In retrospect, if the early UI design had been changed to a separate configuration page, it would have simplified the operational flow and reduced the complexity of business code.

Third, the core project’s dependency on Fluent-Bit made it very difficult to migrate to other open-source components. Combined with Fluent-Bit’s high update frequency, the company’s security compliance requirements meant our team had to upgrade Fluent-Bit periodically, while also doing regression testing for all configuration options. Adding to this, Fluent-Bit has very poor customizability; while it supports implementing Output plugins in Go, Input plugins can only be written in C. As a result, to collect data from internal applications, we had to use its TCP and HTTP plugins as intermediaries, deploying multiple Agents to collect data from different internal services. This made later integration testing even more difficult.

Overall, Fluent-Bit’s performance basically met expectations, but various small bugs (for example, the pgsql plugin would block the entire process when the target was unreachable) were not taken seriously by the open-source community maintainers, and the code we submitted to the open-source community was rejected for various reasons. If I had to choose again, I would lean toward using other more extensible open-source components.

Unfamiliarity with Go Led to Chaotic Project Structure

The second challenge the team faced was unfamiliarity with Go. Most of the development members only had Java development experience, so naturally, they wrote Go like Java. Due to the limitations of the framework (Go-Gin), problems arose frequently during development.

The first problem came from object orientation and dependency inversion. Dependency inversion is not unfamiliar to those using Java Spring, but implementing dependency inversion in Go requires using Interface encapsulation, combined with the Go-Mock library for unit testing. Team members unfamiliar with the language’s features often incorrectly encapsulated abstractions, or simply nested functions inside functions, writing spaghetti code. This fully exposed the fact that most domestic Java programmers have not actually received good OOP training. Engineering practices like unit testing and integration testing are also formalistic. Software quality in most enterprises still relies on manual verification by testers.

The second problem is that Go discourages over-abstraction. For things like generics and exception handling, you have to repeat trivial code snippets step by step, which causes Sonar static checks to fail often. Inexperienced colleagues would then use various clever tricks to evade static checks. This also demonstrates the necessity of regular code reviews for development teams.

The third problem is that Go’s ecosystem is still less complete than Java’s. Many of its frameworks are maintained by individuals (the most popular, gorm, is actually a personal project), and tools that are mature in the Java toolchain, like Flyway, have to be replaced in Go by combining multiple open-source projects. So Go is only suitable for developing projects of medium or smaller scale, or for performance-critical platform core components. (Domestically) it isn’t suitable for complex business scenarios.

API Granularity Too Fine, Resource Objects Not Properly Abstracted

In the early days, the team was plagued by management chaos: architecturally, business models weren’t properly abstracted, and resource objects were broken into too many small pieces; management-wise, tasks were decomposed too simplistically, with each colleague individually responsible for a module, leading to dedicated APIs designed for every business process, creating heavy maintenance pressure. Fortunately, with few business scenarios, automated testing could ensure interface reliability to some extent.

At first, when doing automated integration testing, we still used BDD form, writing tests based on business operations. Later, we gradually realized that for this kind of monitoring system, the real user operation logic is actually very simple—what’s complex are the exceptions that may arise from different types of data, different Input, and Output configurations. So we switched to data-driven testing, using configuration files to comprehensively test different types of Fluent-Bit configurations.

In summary, the modifications to Fluent-Bit configuration could actually be implemented entirely with 3~4 broad APIs. In addition to the over-designed flow mentioned earlier, the uncertainty in the early project stages caused developers to over-focus on loose coupling while ignoring maintainability.

Flawed Pipeline Design

Initially, the project followed the integration testing and deployment pattern of other teams in the department, putting Python-written test cases and project deployment scripts in a separate Gitlab repo. The result was that every time the project was deployed, someone had to manually go to a webpage to modify the version number to trigger the pipeline. From a continuous integration perspective, having business code and test cases separated meant that every commit had to be submitted to a different repo, and in case of conflicts, multiple integration tests had to be run separately (long time, slow feedback).

Later, we made some adjustments, merging multiple small modules into a Monorepo, while putting some API-related integration tests inside the backend code to reduce the number of commits and make atomic commits easier.

However, the deployment problem remained unresolved, because there were too many modules on the edge platform, system integration required cooperation from multiple teams, deployment and release cycles were long, and there were too many points of failure. For this situation, the department’s technical lead set strict processes for code submission, testing, review, and documentation updates, but the root problem was still ambiguous team responsibilities, the department’s teams spanning multiple countries and time zones, and the lack of a unified scheduling and communication mechanism. These problems can only be gradually alleviated by management, or as the business converges, reducing and diverting project teams.

Summary

Overall, many of the problems our team encountered stemmed from a lack of project and technical team management experience in the project’s early stages. Not understanding the business vision, they brought their experience in making consumer-facing SaaS products to the industrial sector, applying familiar development paradigms to manufacturing. Of course, to be honest, on the business side, the department has many long processes, and business leaders can only feel around blindly; user feedback has to first reach the Support team, then be reported upward, and only finally reach the development team. This meant that the products we developed took at least 3-6 months to receive effective feedback. Iteration cycles were too long, and R&D worked in isolation.

Notes on the RESTful Web Services Cookbook

Sat, 13 Jul 2024 16:12:34 +0800

RESTful Web Services Cookbook is a short, concise guide to designing RESTful APIs. This post (notes) records the key points from the book.

Since RESTful conventions are second nature to most backend developers, I will skip the well-known parts and focus on the details in the book that many developers tend to overlook.

HTTP Methods

GET

Performs safe and idempotent retrieval of information.

POST

The target of execution is a collection of resources (a factory), not a specific URI.

Use cases:

Create a new resource by treating a resource as a factory.
Modify one or more resources through a controller resource.
Execute queries that require a large data input (many parameters).
When no other HTTP method seems appropriate, perform an unsafe or non-idempotent operation.

Approach:

Designate an existing resource as the factory for creating new resources. Although any resource can be used as a factory, the common practice is to use a collection resource.
Have the client submit a POST request to the factory resource, attaching a representation of the resource to be created. Through the optional Slug header, the client can suggest a name to the server as part of the URI of the created resource.
After the resource is created, return response code 201 (Created) and include the URI of the new resource in the Location header.
If the response body contains a full representation of the new resource, include the URI of the new resource in the Content-Location header.
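
The steps above can be sketched as a small in-memory handler. This is an illustration rather than code from the book: the function names, the `/users` collection URI, and the way the Slug suggestion is turned into a path segment are all assumptions.

```python
import re

def slugify(text):
    """Turn a client-suggested name into a URI-safe path segment."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def create_in_factory(store, collection_uri, representation, slug=None):
    """Handle POST on a collection (factory) resource.

    `store` maps URI -> representation. Returns (status, headers, body).
    """
    base = slugify(slug) if slug else ""
    if not base:
        base = str(len(store) + 1)
    uri = f"{collection_uri}/{base}"
    # The Slug header is only a suggestion: pick another name on collision.
    suffix = 2
    while uri in store:
        uri = f"{collection_uri}/{base}-{suffix}"
        suffix += 1
    store[uri] = dict(representation)
    headers = {
        "Location": uri,          # where the new resource lives
        "Content-Location": uri,  # the body returned below is that resource
    }
    return 201, headers, store[uri]
```

Because the server owns the URI, the client only learns the final name from the Location header, which is the main difference from creating a resource with PUT.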

PUT

Only use PUT to create a new resource when the client controls the structure of the URI. In other words, PUT can also create a resource, but only when the client specifies the URI.

Determining the Granularity of Resource Objects

Resources should be designed to match the client’s usage patterns, not based on existing database or object models.

Factors to weigh when choosing the granularity:

Cacheability
Frequency of change
Mutability: separate mutable data from immutable data

How to Design Composite Resources?

Composite resources reduce the visibility of the uniform interface because their representations contain data that overlaps with other resources.

If composite resources are used infrequently, consider using caching instead.
Consider the network overhead — would a composite resource reduce server throughput and increase latency?

HTTP Body

Taking a JSON body as an example:

It is best to include a self-referential link.
If the results are paginated, it is best to include a link to the next page.
If the results are paginated, indicate the size of the collection (the total).
If the queried object is localized, add a property to indicate the language of the localized content.

{
  "name": "John",
  "id": "urn:example:user:1234",
  "link": {
    "rel": "self",
    "href": "http://www.example.org/person/john"
  },
  "address": {
    "id": "urn:example:address:4567",
    "link": {
      "rel": "self",
      "href": "http://www.example.org/person/john/address"
    }
  }
}
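
The JSON above shows self links on a single object. For the pagination and localization points, a minimal sketch of a collection representation could look like this. The field names (`total`, `language`, `links`, `items`) and the `page` query parameter are assumptions chosen for illustration, not a format prescribed by the book.

```python
def collection_page(base_uri, items, page, page_size, language="en"):
    """Build one page of a collection representation as a JSON-ready dict."""
    total = len(items)
    start = (page - 1) * page_size
    # Always include a self link; add a next link only if more items remain.
    links = [{"rel": "self", "href": f"{base_uri}?page={page}"}]
    if start + page_size < total:
        links.append({"rel": "next", "href": f"{base_uri}?page={page + 1}"})
    return {
        "total": total,        # size of the whole collection, not of this page
        "language": language,  # language of the localized content
        "links": links,
        "items": items[start:start + page_size],
    }
```

A client can follow the `next` link until it disappears, without having to construct page URIs itself.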

HTTP Response

For client errors, return a 4xx status code plus a Date (the time the error occurred).
For server errors, return a 5xx status code plus a Date (the time the error occurred).
The body should describe the error. If there are external documents and links for reference, provide a Link header or include the link directly in the body.
To support later tracing and analysis, errors are logged on the server. Provide an identifier or link that can be used to locate the error.
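
As a rough sketch of these rules, the helper below builds an error response with a Date header, a descriptive body, an optional documentation link, and an identifier that can be matched against the server log. The function name, the body field names, and the use of a UUID as the error identifier are assumptions.

```python
import json
import uuid
from email.utils import formatdate

def error_response(status, message, doc_url=None):
    """Build (status, headers, body) for a 4xx or 5xx error."""
    if not 400 <= status <= 599:
        raise ValueError("error responses use 4xx or 5xx status codes")
    error_id = str(uuid.uuid4())  # write the same ID to the server-side log
    headers = {
        "Date": formatdate(usegmt=True),  # time of the error as an HTTP date
        "Content-Type": "application/json",
    }
    body = {"error": message, "error_id": error_id}
    if doc_url:
        # Point to external documentation in both the header and the body.
        headers["Link"] = f'<{doc_url}>; rel="help"'
        body["help"] = doc_url
    return status, headers, json.dumps(body)
```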

Designing the Query Structure

Designing Query Requests

To improve caching and performance, avoid range queries where possible. Alternatives include:
- Use predefined queries
- Use the HTTP Range header
Avoid general-purpose query languages (SQL, XPath).
Avoid tight coupling between the URI and the underlying data storage; clients should not be able to treat the API as a thin front end to the database.
For requests with many parameters, consider using POST, since URIs have a practical maximum length.
- The downside of a POST interface is that it loses caching
- Responses to POST are generally not cached, so the Cache-Control and Expires headers have little effect
- To solve the caching problem, have the POST create a temporary resource, return its link to the client, and let the client fetch that resource with GET afterwards
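
The temporary-resource workaround can be sketched with an in-memory store. The `/queries/` URI prefix, the function names, the 60-second cache lifetime, and the choice of 303 for a repeated query are assumptions made for this example.

```python
import hashlib
import json

def post_query(store, params, run_query):
    """POST /queries: run a large query once and keep the result as a resource."""
    # Derive the URI from the canonical parameters so that identical
    # queries map to the same temporary resource.
    canonical = json.dumps(params, sort_keys=True).encode("utf-8")
    uri = "/queries/" + hashlib.sha256(canonical).hexdigest()[:16]
    created = uri not in store
    if created:
        store[uri] = run_query(params)
    # 201 for a new resource, 303 (See Other) if it already exists.
    return (201 if created else 303), {"Location": uri}

def get_query(store, uri):
    """GET on the temporary resource: a plain read that caches can serve."""
    if uri not in store:
        return 404, {}, None
    return 200, {"Cache-Control": "max-age=60"}, store[uri]
```

The expensive query runs once; every later read is an ordinary GET, so the cache headers become useful again.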

Designing Query Response Results

Return a collection. Add appropriate cache expiration headers.
If there are no results, return an empty collection.

How to Design an Industry-Standard Audit System

Mon, 15 Apr 2024 16:44:40 +0800

An audit trail is a service within a system that records critical security information such as user behavior logs and control component activity logs. Logs are typically arranged in chronological order, recording “who did what and when.”

Below is the Kubernetes official documentation’s description of its audit service:

Kubernetes auditing provides a security-relevant, time-ordered set of records documenting the sequence of activities that affected the system by individual users, by applications using the Kubernetes API, and by the control plane itself.

The audit feature enables cluster administrators to answer the following questions:

What happened?

When did it happen?

Who initiated it?

On what (which) objects did the activity occur?

Where was it observed?

Where was it initiated from?

To where was it going?

What Capabilities Should an Audit System Have?

Log content is tamper-proof.
Log chain structure is complete: individual log entries cannot be arbitrarily added or removed.
Compatibility: collecting logs should not require invasive changes to the clients that send them.
The system’s encryption service should be initialized as early as possible to reduce unprotected log exposure.
Service restart/shutdown should not cause audit log inconsistency. If a service is shut down under emergency conditions, the audit logs should remain verifiable.
Key security: encryption keys (used to compute integrity checks) should be stored in a dedicated key store and reside in memory for the shortest possible time.
Performance: ability to verify protected logs within seconds.
Log rotation friendliness: audit logs should be compatible with typical log rotation strategies of distributed systems.
Observability: logs should be both machine-readable and human-readable, compatible with the formats of mainstream log processors, and structured with fields that make later filtering easy.

Related Industry Standards

Common industry standards related to auditing include IEC 62443 and NIST SP 800-92. Below are the audit-related requirements in IEC 62443-4-2.

Industry Standard	Section	Security Level
IEC 62443-4-2:2019	CR2.8	SL-C 1
IEC 62443-4-2:2019	CR6.1	SL-C 1
IEC 62443-4-2:2019	CR6.2	SL-C 2
IEC 62443-4-2:2019	CR1.13	SL-C 1
IEC 62443-4-2:2019	CR2.9	SL-C 1
IEC 62443-4-2:2019	CR2.10	SL-C 1
IEC 62443-4-2:2019	CR3.7	SL-C 1
IEC 62443-4-2:2019	CR3.9	SL-C 2

What Protocols or Standards Should Audit Log Format Follow?

For locally running software, Syslog typically has better system compatibility. For projects using ELK to collect logs, CEF is more suitable. In other cases, custom JSON is recommended.

Below is a comparison of the three formats (protocols).

Common Event Format (CEF)

A text-based event format originally defined by ArcSight and supported by the Elastic Stack. Each record carries little redundant information, which makes it suitable for building monitoring systems together with the ELK stack. It is usually transported over the Syslog protocol and extends the message with readable key-value pairs. Because it is plain text, CEF-format logs can also be written to files. Overall, it is the most balanced of these formats in terms of readability, efficiency, and standardization.
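
To make the structure concrete, here is a minimal sketch that formats one event as a CEF line: seven pipe-separated header fields followed by key=value extensions. The vendor and product names are placeholders, and the helper names are my own; the escaping follows the commonly documented CEF rules (pipe and backslash in the header, equals sign and backslash in extension values).

```python
def _escape_header(value):
    # In header fields, backslash and pipe must be escaped.
    return str(value).replace("\\", "\\\\").replace("|", "\\|")

def _escape_extension(value):
    # In extension values, backslash and equals sign are escaped,
    # and a line break is written as the two characters \n.
    return (str(value).replace("\\", "\\\\")
                      .replace("=", "\\=")
                      .replace("\n", "\\n"))

def format_cef(vendor, product, version, event_class_id, name, severity,
               extension=None):
    """Format one event as a CEF version 0 line."""
    header = "|".join(
        _escape_header(field)
        for field in ("CEF:0", vendor, product, version,
                      event_class_id, name, severity)
    )
    ext = " ".join(
        f"{key}={_escape_extension(value)}"
        for key, value in (extension or {}).items()
    )
    return f"{header}|{ext}"
```

For example, `format_cef("Example", "AuditService", "1.0", "100", "User login", 3, {"suser": "john"})` produces a single line that can be sent as the message part of a Syslog record or appended to a file.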

Syslog

Syslog is the default audit log format for Linux operating systems, typically using its RFC 5424 version. Most SIEM¹ systems support importing this format. The Syslog protocol has great adaptability, and mTLS-based Syslog transport can maximize system security while remaining compatible with traditional software. However, for microservices, implementing and maintaining the standard protocol is costly. Therefore, services like AWS CloudTrail and OpenTelemetry have opted for the simpler HTTPS + JSON format.
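
For comparison, a minimal sketch of building an RFC 5424 message by hand is shown below. The host name, application name, and the structured-data ID `audit@32473` are illustrative assumptions (32473 is the enterprise number reserved for documentation examples); a real service would normally rely on a Syslog library or agent instead.

```python
from datetime import datetime, timezone

def format_rfc5424(facility, severity, hostname, app_name, msg,
                   timestamp=None, procid="-", msgid="-",
                   sd_id=None, sd_params=None):
    """Format one RFC 5424 syslog message (timestamp must be timezone-aware)."""
    pri = facility * 8 + severity  # PRI combines facility and severity
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    if sd_id and sd_params:
        def esc(value):
            # Inside parameter values, backslash, quote and ] are escaped.
            return (str(value).replace("\\", "\\\\")
                              .replace('"', '\\"')
                              .replace("]", "\\]"))
        params = " ".join(f'{k}="{esc(v)}"' for k, v in sd_params.items())
        structured = f"[{sd_id} {params}]"
    else:
        structured = "-"  # "-" marks an absent field
    return f"<{pri}>1 {ts} {hostname} {app_name} {procid} {msgid} {structured} {msg}"
```

Facility 13 is the "log audit" facility and severity 5 is "notice", so an audit event with those values starts with `<109>1`.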

JSON Lines

Most SaaS products use JSON because it is simple and efficient. JSON carries more redundant information, since field names repeat in every record, but the structure is easy to parse. For example, below are the fields in the log model described in the OpenTelemetry official documentation:

Field Name	Description
Timestamp	Time when the event occurred.
ObservedTimestamp	Time when the event was observed.
TraceId	Request trace id.
SpanId	Request span id.
TraceFlags	W3C trace flag.
SeverityText	The severity text (also known as log level).
SeverityNumber	Numerical value of the severity.
Body	The body of the log record.
Resource	Describes the source of the log.
Attributes	Additional structured information.
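
A minimal sketch of writing such a record as one JSON Lines entry follows. The field names come from the table above; the function name, the default values, and the use of nanosecond epoch timestamps are assumptions for illustration, and this is not the OpenTelemetry SDK.

```python
import json
import time

def to_json_line(body, severity_text="INFO", severity_number=9,
                 resource=None, attributes=None,
                 trace_id=None, span_id=None, trace_flags=0):
    """Serialize one log record as a single line of JSON."""
    now = time.time_ns()
    record = {
        "Timestamp": now,          # when the event occurred
        "ObservedTimestamp": now,  # when the collector saw it
        "TraceId": trace_id,
        "SpanId": span_id,
        "TraceFlags": trace_flags,
        "SeverityText": severity_text,
        "SeverityNumber": severity_number,
        "Body": body,
        "Resource": resource or {},
        "Attributes": attributes or {},
    }
    # json.dumps escapes line breaks inside strings, so each record
    # stays on exactly one line.
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)
```

Appending each returned string plus a newline to a file yields a JSON Lines log that can be processed line by line.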

What Security Requirements Apply to Audit Logs?

For audit logs, security requirements are higher than for general log systems.

Security can typically be considered from three dimensions: Confidentiality, Integrity, and Availability.

Confidentiality

Attackers can exploit system security vulnerabilities to obtain special privileges and then view certain audit logs.

The following measures can be taken:

Encrypt logs: use encryption technology to protect logs, ensuring that only authorized users can access and modify them.
Access control: restrict access to log-sending and log-receiving interfaces.
Sensitive information filtering: do not record user sensitive information in logs, such as passwords, certificates, etc.
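
Sensitive information filtering can be sketched as a recursive redaction step applied before an event is written. The key list and the mask string below are assumptions; a real deployment would maintain its own list.

```python
SENSITIVE_KEYS = {"password", "token", "secret", "certificate", "private_key"}

def redact(event, sensitive_keys=SENSITIVE_KEYS, mask="[REDACTED]"):
    """Return a copy of `event` with the values of sensitive keys masked."""
    if isinstance(event, dict):
        return {
            key: mask if str(key).lower() in sensitive_keys
            else redact(value, sensitive_keys, mask)
            for key, value in event.items()
        }
    if isinstance(event, list):
        return [redact(item, sensitive_keys, mask) for item in event]
    return event  # scalars are returned unchanged
```

Matching on key names only catches structured fields; secrets embedded in free-text messages need separate handling.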

Integrity

Attackers can exploit system security vulnerabilities to modify or delete certain audit logs.

In addition to the encryption and access control mentioned above, the following measures can also be taken:

Integrity checks²: add hash values to log entries so that any tampering or truncation can be quickly detected during log verification.
Regular backups: regularly back up logs to prevent attackers from deleting or modifying all log entries.
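
One common way to implement such integrity checks is a keyed hash chain, where each entry’s MAC covers the previous entry’s MAC. The sketch below uses HMAC-SHA256 from the Python standard library; the entry layout, the all-zero genesis value, and the function names are assumptions, and key management is out of scope here.

```python
import hashlib
import hmac

GENESIS = "0" * 64  # stands in for the MAC "before" the first entry

def _mac(key, prev_mac, message):
    data = (prev_mac + message).encode("utf-8")
    return hmac.new(key, data, hashlib.sha256).hexdigest()

def append_entry(chain, key, message):
    """Append a log entry whose MAC also covers the previous entry's MAC."""
    prev_mac = chain[-1]["mac"] if chain else GENESIS
    mac = _mac(key, prev_mac, message)
    chain.append({"message": message, "mac": mac})
    return mac

def verify_chain(chain, key, expected_last_mac=None):
    """Detect modified, inserted or removed entries.

    Cutting entries off the end is only detectable when the latest MAC
    is stored separately and passed as `expected_last_mac`.
    """
    prev_mac = GENESIS
    for entry in chain:
        if not hmac.compare_digest(_mac(key, prev_mac, entry["message"]),
                                   entry["mac"]):
            return False
        prev_mac = entry["mac"]
    if expected_last_mac is not None:
        return hmac.compare_digest(prev_mac, expected_last_mac)
    return True
```

Changing or removing any entry breaks every MAC after it, so verification is a single linear pass over the log.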

Log file limitations: in addition to limiting the size of log files, it’s typically necessary to limit the number of backups, maximum backup days, etc. Below are the parameters in Kubernetes for log file storage:

--audit-log-path specifies the log file path used to write audit events. If this flag is not specified, the log backend is disabled.
--audit-log-maxage defines the maximum number of days to retain old audit log files.
--audit-log-maxbackup defines the maximum number of audit log files to retain.
--audit-log-maxsize defines the maximum size (in megabytes) of audit log files before rotation.
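
To illustrate what an age limit and a backup-count limit mean together, here is a small sketch of a retention decision over rotated files. It is not the Kubernetes implementation; the function name, the input shape, and the convention that 0 disables a limit are assumptions.

```python
def files_to_delete(backups, now, max_age_days=0, max_backups=0):
    """Pick rotated log files to remove.

    `backups` is a list of (name, mtime_in_epoch_seconds); `now` is the
    current time in epoch seconds. A limit of 0 is treated as disabled.
    """
    newest_first = sorted(backups, key=lambda item: item[1], reverse=True)
    doomed = set()
    if max_backups > 0:
        # Keep only the newest `max_backups` files.
        doomed.update(name for name, _ in newest_first[max_backups:])
    if max_age_days > 0:
        # Drop anything older than the cutoff, regardless of count.
        cutoff = now - max_age_days * 86400
        doomed.update(name for name, mtime in newest_first if mtime < cutoff)
    return sorted(doomed)
```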

Availability

Attackers can attack the audit trail service, causing the audit trail service to run out of memory, disk space, etc.

The audit service should cache audit-related context, such as the mapping between service names and IDs, event IDs and descriptions, etc. When different services send messages to the audit service, the message structure should be designed with minimal length as a principle. The audit service’s policy should allow users to configure log levels, filter rules, etc., to reduce system burden.

Log Export

In addition to exporting logs as files, the audit service usually needs to support export to third-party systems. Third-party services that analyze and store logs are typically called SIEM (Security Information and Event Management) systems. In Kubernetes, the backend that sends audit events to an external HTTP API is called the webhook backend.

Exporting to third-party systems can typically use the standard Syslog format or JSON Lines, which has the widest support. Additionally, you need to consider log truncation, and the configuration of third-party systems’ batch and stream processing. You can refer to this Kubernetes document.

Architecture Designs of Open-Source Projects

Due to different design focuses, each of the following open-source projects needs careful consideration of its advantages and disadvantages, whether its features meet your needs, and whether the system environment is distributed or monolithic.

Auditd

[Figure: auditd architecture]

auditd is the default audit service on most Linux systems. Paired with tools like rsyslog, it covers log collection, viewing, and filtering on a local device. rsyslog’s string-template-based log format configuration can meet the integration needs of users with different SIEM systems.

Advantages: process-based communication, standard log format, easy export. Excellent performance.
Disadvantages: the process model is not suitable for network services.

AWS CloudTrail and Kubernetes

[Figure: AWS log architecture]

AWS CloudTrail adopts a model in which application services actively push audit events. Users configure what to track through policies, and the collected logs flow as needed into downstream batch and stream processing toolchains.

Kubernetes’s log collection is similar to AWS’s implementation, also based on a centralized service, but this architecture is not designed solely for audit logs. It follows many Kubernetes declarative design philosophies and is well worth studying.

[Figure: Kubernetes log architecture]

For example, Kubernetes has stages specifically designed for auditing:

Each request can be recorded with its associated stages. The defined stages are:

RequestReceived - The stage corresponds to the event generated when the audit handler receives a request, and before delegating to the remaining handlers.

ResponseStarted - The event is generated after the response message headers are sent, but before the response message body is sent. Only long-running requests (such as watch) generate this stage.

ResponseComplete - Generated when the response message body is complete and no more data will be sent.

Panic - Generated when a panic occurs.

Kubernetes audit events use a different message structure from the Event API³.

In summary, the audit service design of cloud platforms has the following trade-offs:

Advantages: microservice design, more flexible JSON-format logs, and a centralized log collection service that integrates easily with more application services and exports to open-source data processing tools.
Disadvantages: the distributed architecture places higher demands on storage, server-side encryption, communication security, and integrity.

OpenTelemetry

[Figure: OpenTelemetry architecture]

OpenTelemetry is currently one of the most widely adopted observability frameworks in cloud-native environments. It supports both invasive (SDK) and non-invasive (Agent) log collection modes. The Collector design allows some log processing work to be done on the log sender side.

Advantages: microservice design, supports infrastructure like Kubernetes, multi-language and multi-platform SDK and extensibility. Comprehensive security and integrity considerations. Suitable for small and medium-sized enterprises.
Disadvantages: in most cases, log collection still requires invasive modification of code inside the app. Log collection tools don’t support languages like Go well (as of this writing).

Summary

An audit trail refers to the time-ordered records of all operations or events affecting the system, used to track system activity and verify whether violations have occurred.

Audit logs should have the following characteristics:

Tamper-proof (encrypted storage, integrity verification)
High performance (fast verification)
Observability (machine/human readable)
Security (confidentiality, availability, integrity)

Common audit log formats include Syslog, CEF, and JSON, with the main differences being redundant information, readability, and compatibility with log collection systems.

Audit logs have high security requirements:

Confidentiality: only authorized users can access, achieved through access control
Availability: prevent deletion or destruction by attackers, achieved through resource limits, multi-replica storage, etc.
Integrity: prevent tampering or truncation, achieved through encryption, integrity verification, etc.

Some typical audit log system architectures:

Native Linux log programs such as Auditd, rsyslog
Cloud products like AWS
OpenTelemetry

¹ SIEM stands for Security Information and Event Management. https://www.microsoft.com/en-us/security/business/security-101/what-is-siem
² For log encryption, the server typically adds an additional checksum chain to logs for verification. You can refer to Amazon’s implementation of server-side encryption (SSE-S3).
³ Kubernetes audit event structure definition
