# semtest

> Semantic testing for codebases. Natural-language test specs evaluated by LLM CLIs.

semtest is a CLI tool that runs natural-language test cases against a codebase using LLM CLI tools (Claude Code, Gemini CLI, Codex, Aider, and 10 others). You write spec files describing what should be true about the code, and semtest orchestrates the LLM to verify each expectation.

semtest does not call LLM APIs directly. It invokes locally installed CLI tools as subprocesses in the project directory. The LLM reads the codebase through its own workspace access. Your code never passes through semtest — the tool just coordinates.

---

## Installation

```bash
npm install -g @westopp/semtest
# or
pnpm add -g @westopp/semtest
```

Requires Node >= 20 and at least one supported LLM CLI installed on the PATH. Verify with:

```bash
semtest list    # shows all available model keys
which claude    # or gemini, codex, aider, opencode, goose, etc.
```

---

## Quick setup

```bash
semtest init     # creates semtest.config.ts with defaults
mkdir semtests   # conventional directory for spec files
```

---

## Configuration

The config file is searched by walking up the directory tree from CWD. First match wins:

1. `semtest.config.ts`
2. `semtest.config.js`
3. `semtest.config.mjs`

Use the `defineConfig` helper for type safety:

```ts
import { defineConfig } from "@westopp/semtest";

export default defineConfig({
  testMatch: ["**/*.spec.md", "**/*.test.md"],
  testPathIgnorePatterns: ["node_modules", "dist", ".git", "vendor"],
  output: "semtest-results/",
  llm: "claude-code-sonnet-4-6",
  timeout: 0,
  repeat: 1,
  bail: false,
  verbose: false,
  strict: false,
  skipValidation: false,
  debug: false,
  timestamp: false,
  includePassing: false,
  junit: false,
  skipPermissionsIfPossible: false,
});
```

### Config properties

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `testMatch` | `string[]` | `["**/*.spec.md", "**/*.test.md"]` | Glob patterns for discovering test files |
| `testPathIgnorePatterns` | `string[]` | `["node_modules", "dist", ".git", "vendor"]` | Patterns to exclude from discovery |
| `output` | `string` | `"semtest-results/"` | Report output directory |
| `llm` | `string` | `"claude-code-sonnet-4-6"` | Model key (tool + model). Run `semtest list` for options |
| `timeout` | `number` | `0` | Per-test timeout in ms. 0 = no timeout |
| `repeat` | `number` | `1` | Run each test N times |
| `bail` | `boolean \| number` | `false` | `true` = stop after first failure; a number = stop after N failures |
| `verbose` | `boolean` | `false` | Detailed per-test output |
| `strict` | `boolean` | `false` | Exit code 2 on validation issues |
| `skipValidation` | `boolean` | `false` | Skip post-run validation |
| `debug` | `boolean` | `false` | Log raw LLM output to debug directory |
| `timestamp` | `boolean` | `false` | Save timestamped copy of Markdown report |
| `includePassing` | `boolean` | `false` | Include passing tests in Markdown report |
| `junit` | `boolean` | `false` | Generate JUnit XML report |
| `skipPermissionsIfPossible` | `boolean` | `false` | Skip LLM CLI permission prompts where supported |

CLI flags override config values.
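The walk-up lookup is straightforward to replicate in a wrapper script. A minimal sketch of the documented search order — `findConfig` is a hypothetical helper, not semtest's actual implementation:

```typescript
import { existsSync } from "node:fs";
import { dirname, join } from "node:path";

// Filenames checked in each directory, in the documented priority order.
const CONFIG_NAMES = ["semtest.config.ts", "semtest.config.js", "semtest.config.mjs"];

// Walk up from startDir; the first matching filename wins.
function findConfig(startDir: string): string | null {
  let dir = startDir;
  for (;;) {
    for (const name of CONFIG_NAMES) {
      const candidate = join(dir, name);
      if (existsSync(candidate)) return candidate;
    }
    const parent = dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}
```

Note that the priority is by filename, not by directory depth within a single directory: a `semtest.config.ts` beats a sibling `semtest.config.js`, but any config in a closer directory beats both.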
---

## Writing spec files

Spec files are Markdown files describing expectations about the codebase. The LLM reads each file, identifies the distinct scenarios within it, examines the codebase, and returns a pass/fail verdict for each scenario.

### File naming

Default discovery patterns: `**/*.spec.md` and `**/*.test.md`. Customizable via `testMatch`.

### Structure

Markdown headings create clear scenario boundaries. Each heading becomes a separate test scenario with its own verdict.

```markdown
# Authentication

## Session Handling

User sessions should expire after 30 minutes of inactivity. The session store should be configured in src/config/.

## Password Hashing

Passwords must be hashed using bcrypt before storage. Plain text passwords should never be written to the database.
```

This produces two scenarios: "Session Handling" and "Password Hashing".

Single-scenario files work too:

```markdown
# License File

The repository root should contain a LICENSE or LICENSE.md file.
```

### Frontmatter

Spec files support YAML frontmatter to override config per-test:

```markdown
---
tags: security, ci
timeout: 120000
llm: claude-code-opus-4-6
skipPermissionsIfPossible: true
---

# Security Audit

All API routes require authentication middleware.
```

| Field | Type | Description |
|-------|------|-------------|
| `tags` | `string[]` or comma-separated | Tags for filtering with `--tag` |
| `timeout` | `number` | Per-test timeout in ms (overrides config) |
| `llm` | `string` | Model key for this test (overrides config) |
| `skipPermissionsIfPossible` | `boolean` | Permission bypass override |

### Directory structure

Subdirectories become group labels in reports:

```
semtests/
├── auth/
│   ├── login.spec.md
│   └── session.spec.md
├── api/
│   ├── routes.spec.md
│   └── validation.spec.md
└── infra/
    └── config.spec.md
```

### Writing effective specs

- **Be specific about what, not how.** Describe the expected behavior, not implementation details.
- **Reference concrete paths when helpful.** Name specific files or modules when testing something about them.
- **One concern per scenario.** Each scenario should test one thing for a clear pass/fail verdict.
- **Know what the LLM can verify.** The LLM reads code statically — it does not execute it. Test structure, configuration, naming, presence of logic, and content. Do not test runtime behavior.

Good:

```markdown
API error responses should include a "message" field with a human-readable description.
```

Too vague:

```markdown
Error handling should be good.
```

Too prescriptive:

```markdown
The catch block on line 45 of src/api/handler.ts should call formatError().
```

### Example spec files

Security:

```markdown
# Security

No sensitive information is hardcoded. All secrets and config values are pulled from environment variables or appConfig.
```

Content accuracy:

```markdown
# Content accuracy

The contact information displayed in our application matches the contact details published on our main site https://example.com.
```

Internationalization:

```markdown
# Internationalisation

Each of our provided translations is close in meaning to the others and to the English source strings.
```

Copy quality:

```markdown
# Copy quality

There are no spelling mistakes in any user-facing text across the application.
```

---

## CLI commands

### `semtest run [files...] [options]`

Run semantic tests and generate reports.

| Option | Description |
|--------|-------------|
| `[files...]` | Specific test files or directories. Omit to run all |
| `--tag <tags>` | Comma-separated tag filter |
| `-t, --testNamePattern <regex>` | Regex filter on test name/path |
| `--repeat <n>` | Run each test N times (default: 1) |
| `--bail` | Stop after first failure |
| `--maxfail <n>` | Stop after N failures |
| `--verbose` | Detailed per-test output |
| `--timestamp` | Save timestamped Markdown report |
| `--include-passing` | Include passing tests in Markdown report |
| `--strict` | Exit code 2 on validation issues |
| `--skip-validation` | Skip post-run validation |
| `--debug` | Log raw LLM output to debug directory |
| `--timeout <ms>` | Per-test timeout in milliseconds |
| `--junit` | Generate JUnit XML report |
| `--skip-permissions-if-possible` | Skip LLM CLI permission prompts |

### `semtest init`

Scaffolds a new `semtest.config.ts` with defaults.

### `semtest list [--json]`

Lists all available model keys grouped by tool. Use `--json` for machine-readable output.

### `semtest uninstall`

Prints uninstall instructions.

### Exit codes

| Code | Meaning |
|------|---------|
| 0 | All tests passed |
| 1 | One or more tests failed |
| 2 | Error (LLM failure, timeout, or validation issue with `--strict`) |

Precedence: error > fail > pass. Any error in the run means exit 2, regardless of passing tests.

---

## Supported models

Model keys follow the pattern `{tool}-{model}`. Set via `llm` in config or per-test frontmatter.
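Because tool names themselves contain hyphens (`claude-code`, for example), splitting a key back into tool and model requires a known-tool list. A sketch of the idea — illustrative only, with an abbreviated hypothetical `KNOWN_TOOLS` registry; semtest maintains its own:

```typescript
// Illustrative only: resolve a key against known tool prefixes,
// longest prefix first, so "claude-code" wins over any shorter match.
const KNOWN_TOOLS = ["claude-code", "gemini", "codex", "aider", "opencode"]; // abbreviated

function splitModelKey(key: string): { tool: string; model: string } | null {
  const tool = KNOWN_TOOLS
    .filter((t) => key.startsWith(t + "-"))
    .sort((a, b) => b.length - a.length)[0];
  return tool ? { tool, model: key.slice(tool.length + 1) } : null;
}
```

For example, `splitModelKey("claude-code-sonnet-4-6")` yields tool `claude-code` and model `sonnet-4-6`.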
### Claude Code

| Key | Model |
|-----|-------|
| `claude-code-opus-4-6` | `claude-opus-4-6` |
| `claude-code-sonnet-4-6` | `claude-sonnet-4-6` (default) |
| `claude-code-sonnet-4-5` | `claude-sonnet-4-5-20250929` |
| `claude-code-haiku-4-5` | `claude-haiku-4-5-20251001` |

### Gemini CLI

| Key | Model |
|-----|-------|
| `gemini-2.5-pro` | `gemini-2.5-pro` |
| `gemini-2.5-flash` | `gemini-2.5-flash` |
| `gemini-2.5-flash-lite` | `gemini-2.5-flash-lite` |
| `gemini-2.0-flash` | `gemini-2.0-flash` |

### Codex CLI

| Key | Model |
|-----|-------|
| `codex-o3` | `o3` |
| `codex-o4-mini` | `o4-mini` |
| `codex-gpt-4.1` | `gpt-4.1` |
| `codex-gpt-4.1-mini` | `gpt-4.1-mini` |
| `codex-gpt-4.1-nano` | `gpt-4.1-nano` |

### Aider

| Key | Model |
|-----|-------|
| `aider-claude-opus-4-6` | `claude-opus-4-6` |
| `aider-claude-sonnet-4-6` | `claude-sonnet-4-6` |
| `aider-claude-sonnet-4-5` | `claude-sonnet-4-5` |
| `aider-claude-haiku-4-5` | `claude-haiku-4-5` |
| `aider-gpt-4.1` | `gpt-4.1` |
| `aider-gpt-4.1-mini` | `gpt-4.1-mini` |
| `aider-o3` | `o3` |
| `aider-o4-mini` | `o4-mini` |
| `aider-gemini-2.5-pro` | `gemini-2.5-pro` |
| `aider-gemini-2.5-flash` | `gemini-2.5-flash` |
| `aider-deepseek-r1` | `deepseek-r1` |
| `aider-deepseek-v3` | `deepseek-v3` |

### OpenCode, Goose, Crush, Forge

These tools share the same model key pattern and support: Claude (Opus, Sonnet 4.6, Sonnet 4.5, Haiku), GPT-4.1, GPT-4.1-mini, o3, o4-mini, Gemini 2.5 Pro, Gemini 2.5 Flash.

Key format: `{tool}-{model}` (e.g., `opencode-claude-sonnet-4-6`, `goose-gpt-4.1`, `crush-o3`, `forge-gemini-2.5-pro`).
### GitHub Copilot

| Key | Model |
|-----|-------|
| `copilot-claude-opus-4-6` | `claude-opus-4-6` |
| `copilot-claude-sonnet-4-6` | `claude-sonnet-4-6` |
| `copilot-claude-sonnet-4-5` | `claude-sonnet-4-5-20250929` |
| `copilot-gpt-4.1` | `gpt-4.1` |
| `copilot-gpt-4.1-mini` | `gpt-4.1-mini` |
| `copilot-o3` | `o3` |
| `copilot-o4-mini` | `o4-mini` |
| `copilot-gemini-2.5-pro` | `gemini-2.5-pro` |

### Qwen

| Key | Model |
|-----|-------|
| `qwen3-coder-plus` | `qwen3-coder-plus` |
| `qwen3-coder` | `qwen3-coder` |
| `qwen3-coder-fast` | `qwen3-coder-fast` |

### Default-only tools

These tools use their built-in model configuration and do not accept a model parameter:

| Key | Tool |
|-----|------|
| `plandex-default` | Plandex |
| `openhands-default` | OpenHands |
| `cursor-default` | Cursor |
| `amp-default` | Amp |

Run `semtest list` for the full, up-to-date list.

---

## Reports and output

All reports are written to the output directory (default: `semtest-results/`).

| File | When | Format |
|------|------|--------|
| `latest.md` | Always | Markdown. Summary table + per-test results grouped by directory. Failures include expectation, observed, location, resolution |
| `ci-results.json` | Always | JSON. Fields: status, summary (total, passed, failed, errored, invalid, skipped), tests array, validation |
| `junit-results.xml` | `--junit` | JUnit XML. Compatible with GitHub Actions, GitLab CI, Jenkins, CircleCI |
| `YYYY-MM-DDTHH-MM-SS.md` | `--timestamp` | Timestamped copy of latest.md |
| `debug/*.json` | `--debug` | Per-test raw LLM output with all attempts |

### Test statuses

| Status | Meaning | Exit code |
|--------|---------|-----------|
| `pass` | Expectation met | 0 |
| `fail` | Expectation not met | 1 |
| `error` | LLM failure or timeout | 2 |
| `invalid` | Not a testable spec | 0 (2 with `--strict`) |
| `skip` | Skipped | 0 |

---

## CI integration

Use `--skip-permissions-if-possible` and `--junit` for CI pipelines:

```bash
semtest run --skip-permissions-if-possible --junit
```

Exit codes integrate directly with CI: 0 = green, non-zero = red. JUnit XML is picked up automatically by most CI test reporters (GitHub Actions, GitLab, Jenkins). For machine-readable results, parse `semtest-results/ci-results.json`.

---

## Key concepts

- **Spec file**: A `.spec.md` or `.test.md` file containing test expectations in natural language.
- **Scenario**: A single testable expectation within a spec file. Identified by the LLM from headings, numbered items, or structural markers.
- **Tool**: The LLM CLI that evaluates the code (Claude Code, Gemini CLI, Codex, etc.).
- **Model key**: A string like `claude-code-sonnet-4-6` identifying both tool and model.
- **Status**: The verdict for each scenario: pass, fail, error, invalid, or skip.
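
For custom CI logic, the `ci-results.json` structure described under Reports can be consumed from a script. A minimal sketch, assuming only the fields listed in this document (the file may carry more detail than these interfaces capture):

```typescript
// Minimal ci-results.json consumer. Field names follow the Reports
// section of this README; everything else here is an assumption.
interface CiSummary {
  total: number;
  passed: number;
  failed: number;
  errored: number;
  invalid: number;
  skipped: number;
}
interface CiResults {
  status: string;
  summary: CiSummary;
  tests: unknown[];
  validation: unknown;
}

// Mirror the documented precedence: error > fail > pass,
// with `invalid` counting as an error only under --strict.
function exitCodeFor(summary: CiSummary, strict = false): number {
  if (summary.errored > 0) return 2;
  if (strict && summary.invalid > 0) return 2;
  if (summary.failed > 0) return 1;
  return 0;
}

function report(results: CiResults): string {
  const s = results.summary;
  return `${s.passed}/${s.total} passed (exit ${exitCodeFor(s)})`;
}
```

Feed it the parsed file, e.g. `report(JSON.parse(readFileSync("semtest-results/ci-results.json", "utf8")))`.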