Semtest

The llms.txt file is a single reference document designed to be fed into an LLM assistant (Claude, ChatGPT, Gemini, etc.) so it has full context on how to create, configure, and run semtests in your project. Instead of explaining semtest from scratch each session, drop this file into your assistant's context and it can immediately start writing spec files, setting up config, and helping you build out your test suite.
# semtest

> Semantic testing for codebases. Natural-language test specs evaluated by LLM CLIs.

semtest is a CLI tool that runs natural-language test cases against a codebase using LLM CLI tools (Claude Code, Gemini CLI, Codex, Aider, and 10 others). You write spec files describing what should be true about the code, and semtest orchestrates the LLM to verify each expectation.

semtest does not call LLM APIs directly. It invokes locally installed CLI tools as subprocesses in the project directory. The LLM reads the codebase through its own workspace access. Your code never passes through semtest — the tool just coordinates.

---

## Installation

npm install -g @westopp/semtest
# or
pnpm add -g @westopp/semtest

Requires Node >= 20 and at least one supported LLM CLI installed on the PATH. Verify with:

semtest list        # shows all available model keys
which claude        # or gemini, codex, aider, opencode, goose, etc.

---

## Quick setup

semtest init        # creates semtest.config.ts with defaults
mkdir semtests      # conventional directory for spec files

---

## Configuration

semtest looks for a config file by walking up the directory tree from the current working directory. The first match wins:

1. semtest.config.ts
2. semtest.config.js
3. semtest.config.mjs
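
The lookup described above can be sketched roughly as follows. This is an illustration, not semtest's actual implementation: `findConfig` and its `exists` parameter are hypothetical names, paths are assumed to be POSIX-style, and the sketch assumes all three file names are checked in one directory before moving to its parent.

```typescript
// Candidate file names in the priority order listed above.
const CONFIG_NAMES = ["semtest.config.ts", "semtest.config.js", "semtest.config.mjs"];

// Hypothetical helper: walks from startDir up to "/" and returns the first
// candidate path for which `exists` returns true, or null if none is found.
// In a real script `exists` would typically be fs.existsSync.
function findConfig(startDir: string, exists: (path: string) => boolean): string | null {
  // "/repo/packages/app" -> ["repo", "packages", "app"]
  const parts = startDir.split("/").filter((p) => p.length > 0);
  for (let depth = parts.length; depth >= 0; depth--) {
    const dir = "/" + parts.slice(0, depth).join("/");
    for (const name of CONFIG_NAMES) {
      const candidate = dir === "/" ? "/" + name : dir + "/" + name;
      if (exists(candidate)) return candidate;
    }
  }
  return null;
}
```

Injecting `exists` keeps the sketch free of filesystem access, which makes the search order easy to check in isolation.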

Use the defineConfig helper for type safety:

import { defineConfig } from "@westopp/semtest";

export default defineConfig({
  testMatch: ["**/*.spec.md", "**/*.test.md"],
  testPathIgnorePatterns: ["node_modules", "dist", ".git", "vendor"],
  output: "semtest-results/",
  llm: "claude-code-sonnet-4-6",
  timeout: 0,
  repeat: 1,
  bail: false,
  verbose: false,
  strict: false,
  skipValidation: false,
  debug: false,
  timestamp: false,
  includePassing: false,
  junit: false,
  skipPermissionsIfPossible: false,
});

### Config properties

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| testMatch | string[] | ["**/*.spec.md", "**/*.test.md"] | Glob patterns for discovering test files |
| testPathIgnorePatterns | string[] | ["node_modules", "dist", ".git", "vendor"] | Patterns to exclude from discovery |
| output | string | "semtest-results/" | Report output directory |
| llm | string | "claude-code-sonnet-4-6" | Model key (tool + model). Run semtest list for options |
| timeout | number | 0 | Per-test timeout in ms. 0 = no timeout |
| repeat | number | 1 | Run each test N times |
| bail | boolean or number | false | true = stop after first failure. number = stop after N failures |
| verbose | boolean | false | Detailed per-test output |
| strict | boolean | false | Exit code 2 on validation issues |
| skipValidation | boolean | false | Skip post-run validation |
| debug | boolean | false | Log raw LLM output to debug directory |
| timestamp | boolean | false | Save timestamped copy of Markdown report |
| includePassing | boolean | false | Include passing tests in Markdown report |
| junit | boolean | false | Generate JUnit XML report |
| skipPermissionsIfPossible | boolean | false | Skip LLM CLI permission prompts where supported |
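
The `bail` property accepts either a boolean or a number, which is easy to misread. A minimal sketch of the documented semantics, assuming a hypothetical `shouldBail` helper and assuming that a numeric value of 0 means "never stop early" (the table does not say):

```typescript
// Hypothetical helper mirroring the documented `bail` semantics:
//   false  -> never stop early
//   true   -> stop after the first failure
//   number -> stop once that many failures have occurred
// Treating 0 as "never stop" is an assumption of this sketch.
function shouldBail(bail: boolean | number, failures: number): boolean {
  if (typeof bail === "number") return bail > 0 && failures >= bail;
  return bail && failures >= 1;
}
```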

CLI flags override config values.
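
One way to picture this precedence is a three-level fallback: CLI flag, then config file, then built-in default. The sketch below uses hypothetical names (`RunOptions`, `pick`, `resolveOptions`) and only a subset of the properties; how semtest merges values internally is not documented here.

```typescript
// Subset of the documented config properties, for illustration only.
interface RunOptions {
  llm: string;
  timeout: number;
  repeat: number;
  junit: boolean;
}

// Defaults taken from the config table above.
const DEFAULTS: RunOptions = { llm: "claude-code-sonnet-4-6", timeout: 0, repeat: 1, junit: false };

// Returns the first defined value: CLI flag, then config value, then default.
function pick<T>(cli: T | undefined, config: T | undefined, fallback: T): T {
  if (cli !== undefined) return cli;
  if (config !== undefined) return config;
  return fallback;
}

// Hypothetical resolver: flags that were not passed are undefined and
// therefore do not mask values from the config file.
function resolveOptions(config: Partial<RunOptions>, cli: Partial<RunOptions>): RunOptions {
  return {
    llm: pick(cli.llm, config.llm, DEFAULTS.llm),
    timeout: pick(cli.timeout, config.timeout, DEFAULTS.timeout),
    repeat: pick(cli.repeat, config.repeat, DEFAULTS.repeat),
    junit: pick(cli.junit, config.junit, DEFAULTS.junit),
  };
}
```

For example, a config with `timeout: 60000` combined with `--timeout 120000` on the command line resolves to 120000, while `llm` and `repeat` fall back to their defaults.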

---

## Writing spec files

Spec files are Markdown files describing expectations about the codebase. The LLM reads each file, identifies the distinct scenarios within it, examines the codebase, and returns a pass/fail verdict for each scenario.

### File naming

Default discovery patterns: **/*.spec.md and **/*.test.md. Customizable via testMatch.

### Structure

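Markdown headings create clear scenario boundaries. Each lowest-level heading (one with no subheadings beneath it) becomes a separate test scenario with its own verdict; parent headings only group the scenarios under them.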

# Authentication

## Session Handling

User sessions should expire after 30 minutes of inactivity.
The session store should be configured in src/config/.

## Password Hashing

Passwords must be hashed using bcrypt before storage.
Plain text passwords should never be written to the database.

This produces two scenarios: "Session Handling" and "Password Hashing".

Single-scenario files work too:

# License File

The repository root should contain a LICENSE or LICENSE.md file.
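
To preview which scenarios a spec file is likely to produce, the heading structure can be approximated with a small heuristic. This is only a sketch: semtest leaves scenario identification to the LLM, `previewScenarios` is a hypothetical helper, and it ignores frontmatter and fenced code, so a line starting with `#` in either would be misread as a heading.

```typescript
// Heuristic preview only: a heading counts as a scenario when the next
// heading (if any) is not deeper, i.e. it has no subheadings of its own.
function previewScenarios(markdown: string): string[] {
  const headings: { level: number; title: string }[] = [];
  for (const line of markdown.split("\n")) {
    // Matches ATX headings such as "## Session Handling".
    const match = /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (match) headings.push({ level: match[1].length, title: match[2] });
  }
  return headings
    .filter((h, i) => i === headings.length - 1 || headings[i + 1].level <= h.level)
    .map((h) => h.title);
}
```

Applied to the two examples above, the heuristic yields "Session Handling" and "Password Hashing" for the first file and "License File" for the second.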

### Frontmatter

Spec files support YAML frontmatter to override config per-test:

---
tags: security, ci
timeout: 120000
llm: claude-code-opus-4-6
skipPermissionsIfPossible: true
---

# Security Audit

All API routes require authentication middleware.

| Field | Type | Description |
|-------|------|-------------|
| tags | string[] or comma-separated string | Tags for filtering with --tag |
| timeout | number | Per-test timeout in ms (overrides config) |
| llm | string | Model key for this test (overrides config) |
| skipPermissionsIfPossible | boolean | Permission bypass override |
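
Because `tags` may be written either as a list or as a comma-separated string, tooling that reads frontmatter has to normalize both forms. A minimal sketch with hypothetical helpers follows; note that `matchesTagFilter` assumes a spec is selected when it carries at least one of the requested tags, which this document does not confirm.

```typescript
// Normalizes both documented forms of `tags` into a clean string array.
function normalizeTags(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  const raw = Array.isArray(value) ? value : value.split(",");
  return raw.map((t) => t.trim()).filter((t) => t.length > 0);
}

// Hypothetical matcher for a --tag style filter.
// ASSUMPTION: any-match (OR) semantics; the real behavior may differ.
function matchesTagFilter(specTags: string[], filter: string): boolean {
  const wanted = normalizeTags(filter);
  return wanted.some((t) => specTags.indexOf(t) !== -1);
}
```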

### Directory structure

Subdirectories become group labels in reports:

semtests/
├── auth/
│   ├── login.spec.md
│   └── session.spec.md
├── api/
│   ├── routes.spec.md
│   └── validation.spec.md
└── infra/
    └── config.spec.md
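
The mapping from path to group label can be pictured with a short sketch. `groupLabel` is a hypothetical helper, and it assumes the label is the directory portion of the path relative to a root folder; how semtest labels nested subdirectories or files at the root is not stated here.

```typescript
// Hypothetical helper: derives a report group label from a spec path.
function groupLabel(specPath: string, root: string): string {
  const rootParts = root.split("/").filter((p) => p.length > 0);
  const parts = specPath.split("/").filter((p) => p.length > 0);
  // Skip the root prefix when the path starts with it.
  let start = 0;
  while (start < rootParts.length && parts[start] === rootParts[start]) start++;
  if (start !== rootParts.length) start = 0; // not under root: keep the full directory path
  // Drop the file name; what remains is the group label.
  return parts.slice(start, parts.length - 1).join("/");
}
```

With the tree above, `groupLabel("semtests/auth/login.spec.md", "semtests")` gives `auth`, and a spec placed directly in `semtests/` gives an empty label.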

### Writing effective specs

- Be specific about what, not how. Describe the expected behavior, not implementation details.
- Reference concrete paths when helpful. Name specific files or modules when testing something about them.
- One concern per scenario. Each scenario should test one thing for a clear pass/fail verdict.
- Know what the LLM can verify. The LLM reads code statically — it does not execute it. Test structure, configuration, naming, presence of logic, and content. Do not test runtime behavior.

Good:
API error responses should include a "message" field with a human-readable description.

Too vague:
Error handling should be good.

Too prescriptive:
The catch block on line 45 of src/api/handler.ts should call formatError().

### Example spec files

Security:
# Security

No sensitive information is hardcoded. All secrets and config
values are pulled from environment variables or appConfig.

Content accuracy:
# Content accuracy

The contact information displayed in our application matches
the contact details published on our main site https://example.com.

Internationalization:
# Internationalization

All provided translations are close in meaning to each other
and to the English source strings.

Copy quality:
# Copy quality

There are no spelling mistakes in any user-facing text
across the application.

---

## CLI commands

### semtest run [files...] [options]

Run semantic tests and generate reports.

| Option | Description |
|--------|-------------|
| [files...] | Specific test files or directories. Omit to run all |
| --tag <tags> | Comma-separated tag filter |
| -t, --testNamePattern <pattern> | Regex filter on test name/path |
| --repeat <n> | Run each test N times (default: 1) |
| --bail | Stop after first failure |
| --maxfail <n> | Stop after N failures |
| --verbose | Detailed per-test output |
| --timestamp | Save timestamped Markdown report |
| --include-passing | Include passing tests in Markdown report |
| --strict | Exit code 2 on validation issues |
| --skip-validation | Skip post-run validation |
| --debug | Log raw LLM output to debug directory |
| --timeout <ms> | Per-test timeout in milliseconds |
| --junit | Generate JUnit XML report |
| --skip-permissions-if-possible | Skip LLM CLI permission prompts |
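
When semtest is launched from a script rather than typed by hand, the argument list can be assembled programmatically. The sketch below uses only flags from the table above; `RunFlags` and `buildRunArgs` are hypothetical names, and files are placed before options to match the `semtest run [files...] [options]` synopsis.

```typescript
// Illustrative subset of the documented run options.
interface RunFlags {
  files?: string[];
  tags?: string[];
  testNamePattern?: string;
  maxfail?: number;
  timeoutMs?: number;
  junit?: boolean;
  strict?: boolean;
}

// Builds the argument vector for `semtest run`.
function buildRunArgs(flags: RunFlags): string[] {
  const args: string[] = ["run"];
  if (flags.files) args.push(...flags.files);
  if (flags.tags && flags.tags.length > 0) args.push("--tag", flags.tags.join(","));
  if (flags.testNamePattern !== undefined) args.push("--testNamePattern", flags.testNamePattern);
  if (flags.maxfail !== undefined) args.push("--maxfail", String(flags.maxfail));
  if (flags.timeoutMs !== undefined) args.push("--timeout", String(flags.timeoutMs));
  if (flags.junit) args.push("--junit");
  if (flags.strict) args.push("--strict");
  return args;
}
```

The resulting array can be passed to a process launcher such as Node's `child_process.spawnSync("semtest", args)`.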

### semtest init

Scaffolds a new semtest.config.ts with defaults.

### semtest list [--json]

Lists all available model keys grouped by tool. Use --json for machine-readable output.

### semtest uninstall

Prints uninstall instructions.

### Exit codes

| Code | Meaning |
|------|---------|
| 0 | All tests passed |
| 1 | One or more tests failed |
| 2 | Error (LLM failure, timeout, or validation issue with --strict) |

Precedence: error > fail > pass. Any error in the run means exit 2, regardless of passing tests.
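
The precedence rule can be expressed as a small function over the per-scenario statuses (pass, fail, error, invalid, skip). This is a sketch of the documented rule, not semtest's source; `exitCode` is a hypothetical name.

```typescript
type Status = "pass" | "fail" | "error" | "invalid" | "skip";

// Applies the documented precedence: error > fail > pass.
// An invalid spec only affects the exit code when strict mode is on.
function exitCode(statuses: Status[], strict: boolean): 0 | 1 | 2 {
  const has = (s: Status) => statuses.indexOf(s) !== -1;
  if (has("error")) return 2;
  if (strict && has("invalid")) return 2;
  if (has("fail")) return 1;
  return 0;
}
```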

---

## Supported models

Model keys generally follow the pattern {tool}-{model}, although Gemini CLI and Qwen keys are just the model name. Set the key via llm in config or per-test frontmatter.

### Claude Code

| Key | Model |
|-----|-------|
| claude-code-opus-4-6 | claude-opus-4-6 |
| claude-code-sonnet-4-6 | claude-sonnet-4-6 (default) |
| claude-code-sonnet-4-5 | claude-sonnet-4-5-20250929 |
| claude-code-haiku-4-5 | claude-haiku-4-5-20251001 |

### Gemini CLI

| Key | Model |
|-----|-------|
| gemini-2.5-pro | gemini-2.5-pro |
| gemini-2.5-flash | gemini-2.5-flash |
| gemini-2.5-flash-lite | gemini-2.5-flash-lite |
| gemini-2.0-flash | gemini-2.0-flash |

### Codex CLI

| Key | Model |
|-----|-------|
| codex-o3 | o3 |
| codex-o4-mini | o4-mini |
| codex-gpt-4.1 | gpt-4.1 |
| codex-gpt-4.1-mini | gpt-4.1-mini |
| codex-gpt-4.1-nano | gpt-4.1-nano |

### Aider

| Key | Model |
|-----|-------|
| aider-claude-opus-4-6 | claude-opus-4-6 |
| aider-claude-sonnet-4-6 | claude-sonnet-4-6 |
| aider-claude-sonnet-4-5 | claude-sonnet-4-5 |
| aider-claude-haiku-4-5 | claude-haiku-4-5 |
| aider-gpt-4.1 | gpt-4.1 |
| aider-gpt-4.1-mini | gpt-4.1-mini |
| aider-o3 | o3 |
| aider-o4-mini | o4-mini |
| aider-gemini-2.5-pro | gemini-2.5-pro |
| aider-gemini-2.5-flash | gemini-2.5-flash |
| aider-deepseek-r1 | deepseek-r1 |
| aider-deepseek-v3 | deepseek-v3 |

### OpenCode, Goose, Crush, Forge

These tools share the same model key pattern and support: Claude (Opus, Sonnet 4.6, Sonnet 4.5, Haiku), GPT-4.1, GPT-4.1-mini, o3, o4-mini, Gemini 2.5 Pro, Gemini 2.5 Flash.

Key format: {tool}-{model} (e.g., opencode-claude-sonnet-4-6, goose-gpt-4.1, crush-o3, forge-gemini-2.5-pro).

### GitHub Copilot

| Key | Model |
|-----|-------|
| copilot-claude-opus-4-6 | claude-opus-4-6 |
| copilot-claude-sonnet-4-6 | claude-sonnet-4-6 |
| copilot-claude-sonnet-4-5 | claude-sonnet-4-5-20250929 |
| copilot-gpt-4.1 | gpt-4.1 |
| copilot-gpt-4.1-mini | gpt-4.1-mini |
| copilot-o3 | o3 |
| copilot-o4-mini | o4-mini |
| copilot-gemini-2.5-pro | gemini-2.5-pro |

### Qwen

| Key | Model |
|-----|-------|
| qwen3-coder-plus | qwen3-coder-plus |
| qwen3-coder | qwen3-coder |
| qwen3-coder-fast | qwen3-coder-fast |

### Default-only tools

These tools use their built-in model configuration and do not accept a model parameter:

| Key | Tool |
|-----|------|
| plandex-default | Plandex |
| openhands-default | OpenHands |
| cursor-default | Cursor |
| amp-default | Amp |

Run semtest list for the full, up-to-date list.

---

## Reports and output

All reports are written to the output directory (default: semtest-results/).

| File | When | Format |
|------|------|--------|
| latest.md | Always | Markdown. Summary table + per-test results grouped by directory. Failures include expectation, observed, location, resolution |
| ci-results.json | Always | JSON. Fields: status, summary (total, passed, failed, errored, invalid, skipped), tests array, validation |
| junit-results.xml | --junit | JUnit XML. Compatible with GitHub Actions, GitLab CI, Jenkins, CircleCI |
| YYYY-MM-DDTHH-MM-SS.md | --timestamp | Timestamped copy of latest.md |
| debug/*.json | --debug | Per-test raw LLM output with all attempts |

### Test statuses

| Status | Meaning | Exit code |
|--------|---------|-----------|
| pass | Expectation met | 0 |
| fail | Expectation not met | 1 |
| error | LLM failure or timeout | 2 |
| invalid | Not a testable spec | 0 (2 with --strict) |
| skip | Skipped | 0 |

---

## CI integration

Use --skip-permissions-if-possible and --junit for CI pipelines:

semtest run --skip-permissions-if-possible --junit

Exit codes integrate directly with CI: 0 = green, non-zero = red.

Most CI test reporters (GitHub Actions, GitLab, Jenkins) can consume the JUnit XML report written to semtest-results/junit-results.xml.

For machine-readable results, parse semtest-results/ci-results.json.
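
A minimal sketch of reading that file, based on the fields listed in the reports table. The interface names and `summarize` are hypothetical, the shapes of `tests` and `validation` are not documented here and are left as `unknown`, and the possible values of `status` are not assumed.

```typescript
// Field names taken from the reports table; everything else is illustrative.
interface CiSummary {
  total: number;
  passed: number;
  failed: number;
  errored: number;
  invalid: number;
  skipped: number;
}

interface CiResults {
  status: string;
  summary: CiSummary;
  tests: unknown[];    // per-test shape is not documented here
  validation: unknown; // shape is not documented here
}

// Turns the JSON text of ci-results.json into a one-line summary.
function summarize(jsonText: string): string {
  const results = JSON.parse(jsonText) as CiResults;
  const s = results.summary;
  return `${results.status}: ${s.passed}/${s.total} passed, ${s.failed} failed, ` +
    `${s.errored} errored, ${s.invalid} invalid, ${s.skipped} skipped`;
}
```

In a Node script, the input would typically come from `fs.readFileSync("semtest-results/ci-results.json", "utf8")`.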

---

## Key concepts

- Spec file: A .spec.md or .test.md file containing test expectations in natural language.
- Scenario: A single testable expectation within a spec file. Identified by the LLM from headings, numbered items, or structural markers.
- Tool: The LLM CLI that evaluates the code (Claude Code, Gemini CLI, Codex, etc.).
- Model key: A string like claude-code-sonnet-4-6 identifying both tool and model.
- Status: The verdict for each scenario: pass, fail, error, invalid, or skip.