Steve Sun

How to Design an Industry-Standard Audit System

An audit trail is a service within a system that records critical security information such as user behavior logs and control component activity logs. Logs are typically arranged in chronological order, recording “who did what and when.”

Below is the Kubernetes official documentation’s description of its audit service:

Kubernetes auditing provides a security-relevant, time-ordered set of records documenting the sequence of activities that affected the system by individual users, by applications using the Kubernetes API, and by the control plane itself.

The audit feature enables cluster administrators to answer the following questions:

  • What happened?
  • When did it happen?
  • Who initiated it?
  • On what (which) objects did the activity occur?
  • Where was it observed?
  • Where was it initiated from?
  • What are the subsequent actions taken on the activity?

What Capabilities Should an Audit System Have?

  1. Log content is tamper-proof.
  2. Log chain structure is complete: individual log entries cannot be arbitrarily added or removed.
  3. Compatibility: clients should be able to send logs without invasive changes to their own design.
  4. The system’s encryption service should be initialized as early as possible to reduce unprotected log exposure.
  5. Service restart/shutdown should not cause audit log inconsistency. If a service is shut down under emergency conditions, the audit logs should remain verifiable.
  6. Key security: encryption keys (used to compute integrity checks) should be stored in a dedicated key store and reside in memory for the shortest possible time.
  7. Performance: ability to verify protected logs within seconds.
  8. Log rotation friendliness: audit logs should be compatible with typical log rotation strategies of distributed systems.
  9. Observability: logs should be easily parsed (machine-readable) and human-readable. Compatible with mainstream log processor formats, with dimensions designed to facilitate future filtering and screening.
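The first two capabilities (tamper-proof content and a complete log chain) are often approached with a keyed hash chain, where each entry's MAC also covers the previous entry's MAC. The following is a minimal sketch under that assumption, not a reference implementation. The names `seal`, `append`, `verify`, and `GENESIS` are hypothetical, and the key is hard-coded only for the demo; in a real system it would come from a dedicated key store.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical fixed seed used as the "previous MAC" of the first entry


def seal(key: bytes, prev_mac: str, entry: dict) -> dict:
    """Return a copy of entry with a MAC over (previous MAC + entry content)."""
    payload = json.dumps(entry, sort_keys=True).encode()
    mac = hmac.new(key, prev_mac.encode() + payload, hashlib.sha256).hexdigest()
    return {**entry, "mac": mac}


def append(key: bytes, chain: list, entry: dict) -> None:
    """Seal entry against the last MAC in the chain and append it."""
    prev = chain[-1]["mac"] if chain else GENESIS
    chain.append(seal(key, prev, entry))


def verify(key: bytes, chain: list) -> bool:
    """Recompute every MAC in order; any edit, insertion, or mid-chain removal fails."""
    prev = GENESIS
    for sealed in chain:
        entry = {k: v for k, v in sealed.items() if k != "mac"}
        expected = seal(key, prev, entry)["mac"]
        if not hmac.compare_digest(expected, sealed["mac"]):
            return False
        prev = sealed["mac"]
    return True


key = b"demo-key"  # assumption: a real key is fetched from a key store, not hard-coded
chain = []
append(key, chain, {"ts": "2024-01-01T00:00:00Z", "who": "alice", "what": "login"})
append(key, chain, {"ts": "2024-01-01T00:01:00Z", "who": "alice", "what": "delete pod"})
append(key, chain, {"ts": "2024-01-01T00:02:00Z", "who": "bob", "what": "login"})
print(verify(key, chain))
```

A plain chain like this does not detect removal of the most recent entries. Detecting that requires an external anchor, for example storing the latest MAC somewhere the attacker cannot reach.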

Common industry standards related to auditing include IEC 62443 and NIST SP 800-92. Below are the audit-related requirements in IEC 62443-4-2.

| Industry Standard | Section | Security Level |
|---|---|---|
| IEC 62443-4-2:2019 | CR2.8 | SL-C 1 |
| IEC 62443-4-2:2019 | CR6.1 | SL-C 1 |
| IEC 62443-4-2:2019 | CR6.2 | SL-C 2 |
| IEC 62443-4-2:2019 | CR1.13 | SL-C 1 |
| IEC 62443-4-2:2019 | CR2.9 | SL-C 1 |
| IEC 62443-4-2:2019 | CR2.10 | SL-C 1 |
| IEC 62443-4-2:2019 | CR3.7 | SL-C 1 |
| IEC 62443-4-2:2019 | CR3.9 | SL-C 2 |

What Protocols or Standards Should Audit Log Format Follow?

For locally running software, Syslog typically has better system compatibility. For projects using ELK to collect logs, CEF is more suitable. In other cases, custom JSON is recommended.

Below is a comparison of the three formats (protocols).

Common Event Format (CEF)

A text-based log format originally developed by ArcSight and supported by many SIEM and log pipelines, including the ELK stack. It models each log line as a single event and carries little redundant information, which suits monitoring systems built on the ELK stack. CEF messages are usually transported over the Syslog protocol and extend it with readable key-value pairs. Because the format is plain text, CEF logs can also be written to files. Overall, it is the most balanced of these formats in terms of readability, efficiency, and standardization.
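To make the layout concrete, here is a rough sketch of a CEF line builder. A CEF record is a pipe-separated header (`CEF:Version|Device Vendor|Device Product|Device Version|Device Event Class ID|Name|Severity|`) followed by a key-value extension. The function name `to_cef` and the vendor and product values are made up for illustration, and the escaping is simplified (newlines in values are not handled).

```python
def _escape_header(value: str) -> str:
    # In CEF header fields, backslash and pipe are escaped with a backslash.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def _escape_extension(value: str) -> str:
    # In extension values, backslash and equals sign are escaped with a backslash.
    return value.replace("\\", "\\\\").replace("=", "\\=")


def to_cef(vendor, product, version, event_class_id, name, severity, extension):
    """Build one CEF:0 line from header fields and an extension dict."""
    header = "|".join(
        _escape_header(str(field))
        for field in (vendor, product, version, event_class_id, name, severity)
    )
    ext = " ".join(f"{k}={_escape_extension(str(v))}" for k, v in extension.items())
    return f"CEF:0|{header}|{ext}"


line = to_cef(
    "ExampleCorp", "audit-service", "1.0", "100", "User login", 3,
    {"suser": "alice", "src": "10.0.0.5", "outcome": "success"},
)
print(line)
# CEF:0|ExampleCorp|audit-service|1.0|100|User login|3|suser=alice src=10.0.0.5 outcome=success
```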

Syslog

Syslog is the default logging format on Linux operating systems, typically in its RFC 5424 version. Most SIEM [1] systems can import it. The protocol is highly adaptable, and Syslog transported over mTLS strengthens security while remaining compatible with traditional software. For microservices, however, implementing and maintaining the standard protocol is costly, so services such as AWS CloudTrail and OpenTelemetry have opted for the simpler HTTPS + JSON approach.
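As an illustration of what an RFC 5424 message looks like, the sketch below assembles the header fields in order: `<PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG`, where PRI is `facility * 8 + severity`. The helper name `to_rfc5424` and the host and application names are hypothetical, and structured data is left as the nil value `-` to keep the example short.

```python
from datetime import datetime, timezone

FACILITY_LOG_AUDIT = 13  # RFC 5424 facility code "log audit"
SEVERITY_INFO = 6        # RFC 5424 severity "Informational"


def to_rfc5424(hostname, app_name, procid, msgid, msg, when,
               facility=FACILITY_LOG_AUDIT, severity=SEVERITY_INFO):
    """Format a single RFC 5424 syslog line without structured data."""
    pri = facility * 8 + severity
    # Millisecond-precision UTC timestamp, e.g. 2024-01-01T12:00:00.000Z
    timestamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    structured_data = "-"  # nil value: no structured data elements
    return (f"<{pri}>1 {timestamp} {hostname} {app_name} "
            f"{procid} {msgid} {structured_data} {msg}")


when = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
line = to_rfc5424("host1", "audit-service", "4711", "LOGIN", "user alice logged in", when)
print(line)
# <110>1 2024-01-01T12:00:00.000Z host1 audit-service 4711 LOGIN - user alice logged in
```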

JSON Lines

Most SaaS products use JSON because it is simple and efficient. JSON carries more redundant information than the other formats, but its structure is easy to parse. For example, below are the fields in the log model described in the official OpenTelemetry documentation:

| Field Name | Description |
|---|---|
| Timestamp | Time when the event occurred. |
| ObservedTimestamp | Time when the event was observed. |
| TraceId | Request trace id. |
| SpanId | Request span id. |
| TraceFlags | W3C trace flag. |
| SeverityText | The severity text (also known as log level). |
| SeverityNumber | Numerical value of the severity. |
| Body | The body of the log record. |
| Resource | Describes the source of the log. |
| Attributes | Additional structured information. |
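A record with these fields can be serialized as one JSON object per line. The sketch below uses a plain dataclass rather than the OpenTelemetry SDK; the class `LogRecord`, the helper `to_json_line`, and the sample values are assumptions for illustration only. Timestamps are given as nanoseconds since the Unix epoch, and severity number 9 corresponds to INFO in the OpenTelemetry log data model.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LogRecord:
    Timestamp: int            # nanoseconds since the Unix epoch
    ObservedTimestamp: int    # nanoseconds since the Unix epoch
    SeverityText: str
    SeverityNumber: int
    Body: str
    TraceId: str = ""
    SpanId: str = ""
    TraceFlags: int = 0
    Resource: dict = field(default_factory=dict)
    Attributes: dict = field(default_factory=dict)


def to_json_line(record: LogRecord) -> str:
    """Serialize one record as a compact single-line JSON object."""
    return json.dumps(asdict(record), separators=(",", ":"))


record = LogRecord(
    Timestamp=1704110400000000000,
    ObservedTimestamp=1704110400000500000,
    SeverityText="INFO",
    SeverityNumber=9,
    Body="user alice deleted pod web-1",
    Resource={"service.name": "audit-service"},
    Attributes={"user": "alice", "verb": "delete"},
)
line = to_json_line(record)
print(line)
```

Because every record is a complete JSON object on its own line, a consumer can parse the file line by line without loading it whole.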

What Security Requirements Apply to Audit Logs?

For audit logs, security requirements are higher than for general log systems.

Security can typically be considered from three dimensions: Confidentiality, Integrity, and Availability.

Confidentiality

Attackers can exploit system security vulnerabilities to obtain special privileges and then view certain audit logs.

Typical countermeasures are encrypting the logs [2] and restricting who can read them through access control.

Integrity

Attackers can exploit system security vulnerabilities to modify or delete certain audit logs.

In addition to the encryption and access control mentioned above, the following measures can also be taken:

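Log file limits: besides capping the size of each log file, you typically also need to limit the number of backups and the maximum number of days backups are retained. Below are the kube-apiserver parameters in Kubernetes for log file storage:

  • --audit-log-path: the file the log backend writes audit events to.
  • --audit-log-maxage: the maximum number of days to retain old audit log files.
  • --audit-log-maxbackup: the maximum number of audit log files to retain.
  • --audit-log-maxsize: the maximum size in megabytes of an audit log file before it is rotated.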
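The same size and backup limits can be sketched with Python's standard `logging.handlers.RotatingFileHandler`. This is only an analogy to the Kubernetes settings, not how Kubernetes implements them. The helper `build_audit_logger` and the logger name are hypothetical, the byte limit is tiny so rotation is visible in a demo, and a maximum age would need separate cleanup because this handler does not provide one.

```python
import logging
import logging.handlers
import os
import tempfile


def build_audit_logger(path, max_bytes, backup_count):
    """Create a logger that rotates at max_bytes and keeps backup_count old files."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger = logging.getLogger("audit-demo")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep demo output out of the root logger
    return logger


log_dir = tempfile.mkdtemp()
audit_path = os.path.join(log_dir, "audit.log")
logger = build_audit_logger(audit_path, max_bytes=200, backup_count=2)

for i in range(50):
    logger.info("user=alice action=read object=pod-%d", i)

# Only the active file plus two backups survive, regardless of how much was written.
files = sorted(os.listdir(log_dir))
print(files)
```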

Availability

Attackers can attack the audit trail service, causing the audit trail service to run out of memory, disk space, etc.

The audit service should cache audit-related context, such as the mapping between service names and IDs or between event IDs and descriptions. Messages that other services send to the audit service should be kept as short as possible. The audit service's policy should let users configure log levels, filter rules, and similar settings to reduce the load on the system.
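One way to picture this is shown below, as a sketch under assumed names: senders transmit only numeric IDs, the audit service expands them from cached lookup tables, and a policy object filters by level and by service before anything is stored. `CompactEvent`, `AuditPolicy`, `expand`, and the table contents are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical lookup tables that the audit service caches once at startup.
SERVICE_NAMES = {1: "auth-service", 2: "billing-service"}
EVENT_DESCRIPTIONS = {100: "user login", 200: "invoice deleted"}
LEVELS = {"debug": 10, "info": 20, "warning": 30, "critical": 40}


@dataclass(frozen=True)
class CompactEvent:
    """Short wire message: IDs instead of names and descriptions."""
    service_id: int
    event_id: int
    level: str
    user: str


class AuditPolicy:
    """User-configurable filter: minimum level plus a list of ignored services."""

    def __init__(self, min_level: str, ignored_services=()):
        self.min_level = LEVELS[min_level]
        self.ignored_services = set(ignored_services)

    def accepts(self, event: CompactEvent) -> bool:
        if event.service_id in self.ignored_services:
            return False
        return LEVELS[event.level] >= self.min_level


def expand(event: CompactEvent) -> dict:
    """Resolve IDs to readable text using the cached tables."""
    return {
        "service": SERVICE_NAMES.get(event.service_id, "unknown"),
        "event": EVENT_DESCRIPTIONS.get(event.event_id, "unknown"),
        "level": event.level,
        "user": event.user,
    }


policy = AuditPolicy(min_level="info", ignored_services=[2])
incoming = [
    CompactEvent(1, 100, "info", "alice"),
    CompactEvent(1, 100, "debug", "bob"),        # dropped: below the minimum level
    CompactEvent(2, 200, "critical", "carol"),   # dropped: service is ignored
]
stored = [expand(e) for e in incoming if policy.accepts(e)]
print(stored)
```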

Log Export

In addition to exporting file-format logs, the audit service usually needs to support export to third-party systems. We typically refer to third-party services that analyze and store logs as SIEM (Security Information and Event Management). In Kubernetes, the module that exports logs to third-party web services is called a webhook.

Exports to third-party systems typically use the standard Syslog format or JSON Lines, the two most widely supported options. You also need to consider log truncation and how the third-party system's batch and stream processing is configured. You can refer to this Kubernetes document.
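Batching and truncation on the export path might look roughly like the sketch below. It is not the Kubernetes webhook implementation. The class `BatchExporter`, its limits, and the choice to drop a `body` field when an event is too large are assumptions made for the example, and `send` stands in for whatever transport delivers the payload.

```python
import json


class BatchExporter:
    """Buffer events, truncate oversized ones, and send them as JSON Lines batches."""

    def __init__(self, send, max_batch_size=3, max_event_bytes=200):
        self.send = send                      # callable that receives one payload string
        self.max_batch_size = max_batch_size
        self.max_event_bytes = max_event_bytes
        self.buffer = []

    def _truncate(self, event: dict) -> dict:
        if len(json.dumps(event).encode()) <= self.max_event_bytes:
            return event
        # Drop the bulky payload but keep the metadata and mark the event.
        trimmed = {k: v for k, v in event.items() if k != "body"}
        trimmed["truncated"] = True
        return trimmed

    def add(self, event: dict) -> None:
        self.buffer.append(self._truncate(event))
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send("\n".join(json.dumps(e) for e in self.buffer))
            self.buffer = []


sent = []  # stands in for an HTTP POST to a SIEM endpoint
exporter = BatchExporter(sent.append)
exporter.add({"id": 1, "user": "alice", "body": "small"})
exporter.add({"id": 2, "user": "bob", "body": "x" * 1000})  # exceeds the size limit
exporter.add({"id": 3, "user": "carol", "body": "small"})   # third event triggers a flush
exporter.add({"id": 4, "user": "dave", "body": "small"})
exporter.flush()                                             # send the remainder
print(len(sent))
```

A real exporter would also flush on a timer so that a slow trickle of events does not sit in the buffer indefinitely.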

Architecture Designs of Open-Source Projects

The following projects have different design focuses. For each one, weigh its advantages and disadvantages, whether its features meet your needs, and whether your system environment is distributed or monolithic.

Auditd

[Figure: auditd architecture]

auditd is the default audit service on most Linux systems. Paired with tools such as rsyslog, it covers collecting, viewing, and filtering logs on the local device. rsyslog's string-template-based log format configuration can meet the integration needs of users working with different SIEM systems.

AWS CloudTrail and Kubernetes

[Figure: AWS log]

AWS CloudTrail adopts a model in which application services actively push audit events. Users define policies that control what is tracked, and the collected logs flow as needed into downstream batch and stream processing toolchains.

Kubernetes’s log collection is similar to AWS’s implementation, also based on a centralized service, but this architecture is not designed solely for audit logs. It follows many Kubernetes declarative design philosophies and is well worth studying.

[Figure: Kubernetes log]

For example, Kubernetes has stages specifically designed for auditing:

Each request can be recorded with its associated stages. The defined stages are:

  • RequestReceived - The stage corresponds to the event generated when the audit handler receives a request, and before delegating to the remaining handlers.
  • ResponseStarted - The event is generated after the response message headers are sent, but before the response message body is sent. Only long-running requests (such as watch) generate this stage.
  • ResponseComplete - When the response message body is complete and no more data needs to be transmitted.
  • Panic - Generated when a panic occurs.
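The stage model can be imitated in a few lines to show when each event would be emitted. This is a toy sketch, not Kubernetes code: the wrapper `audited` and the in-memory `sink` list are hypothetical, and `ResponseStarted` is defined but never emitted because the toy handler has no long-running requests.

```python
from enum import Enum


class Stage(str, Enum):
    REQUEST_RECEIVED = "RequestReceived"
    RESPONSE_STARTED = "ResponseStarted"   # only relevant for long-running requests
    RESPONSE_COMPLETE = "ResponseComplete"
    PANIC = "Panic"


def audited(handler, sink):
    """Wrap a request handler so that it records one audit event per stage."""
    def wrapper(request):
        sink.append({"stage": Stage.REQUEST_RECEIVED.value, "request": request})
        try:
            response = handler(request)
        except Exception as exc:
            sink.append({"stage": Stage.PANIC.value, "request": request, "error": str(exc)})
            raise
        sink.append({"stage": Stage.RESPONSE_COMPLETE.value, "request": request})
        return response
    return wrapper


events = []
list_pods = audited(lambda request: "200 OK", events)
list_pods("GET /api/v1/pods")
print([e["stage"] for e in events])
# ['RequestReceived', 'ResponseComplete']
```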

Kubernetes audit events use a different message structure from the Event API [3].

The audit service design of these cloud platforms can be summarized as:

OpenTelemetry

[Figure: OpenTelemetry]

OpenTelemetry is currently one of the most widely adopted observability frameworks in cloud-native environments, and it covers logs as well. It supports both invasive (SDK) and non-invasive (Agent) log collection modes. The Collector design allows some log processing work to be done on the log sender side.
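The idea of processing logs on the sender side before export can be sketched as a chain of processors in front of an exporter. The code below does not use the OpenTelemetry SDK or Collector. The functions `drop_below`, `redact`, and `run_pipeline` are hypothetical stand-ins that only mirror the processor-then-exporter shape.

```python
def drop_below(min_severity):
    """Processor: discard records whose SeverityNumber is below min_severity."""
    def processor(record):
        return record if record.get("SeverityNumber", 0) >= min_severity else None
    return processor


def redact(keys):
    """Processor: mask the values of sensitive attribute keys."""
    def processor(record):
        attrs = {
            k: ("***" if k in keys else v)
            for k, v in record.get("Attributes", {}).items()
        }
        return {**record, "Attributes": attrs}
    return processor


def run_pipeline(records, processors, exporter):
    """Pass each record through every processor; export it unless one drops it."""
    for record in records:
        for process in processors:
            record = process(record)
            if record is None:
                break
        else:
            exporter(record)


records = [
    {"SeverityNumber": 9, "Body": "login",
     "Attributes": {"user": "alice", "password": "hunter2"}},
    {"SeverityNumber": 5, "Body": "debug noise", "Attributes": {}},
]
exported = []
run_pipeline(records, [drop_below(9), redact({"password"})], exported.append)
print(exported)
```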

Summary

An audit trail refers to the time-ordered records of all operations or events affecting the system, used to track system activity and verify whether violations have occurred.

Audit logs should have the following characteristics:

Common audit log formats include Syslog, CEF, and JSON, with the main differences being redundant information, readability, and compatibility with log collection systems.

Audit logs have high security requirements:

Some typical audit log system architectures:


  1. SIEM stands for Security Information and Event Management. https://www.microsoft.com/en-us/security/business/security-101/what-is-siem

  2. For log encryption, the server typically adds an additional checksum chain to logs for verification. You can refer to Amazon's implementation of server-side encryption (SSE-S3).

  3. Kubernetes audit event structure definition.