Under the hood

Execution model

When you run semtest run, here's what happens step by step:

  1. Config loaded. Semtest walks up the directory tree from the current working directory looking for semtest.config.ts (or .js, .mjs). The first one found is used.

  2. Specs discovered. The project is walked recursively. Files matching the testMatch glob patterns are collected, paths matching testPathIgnorePatterns are excluded, and results are sorted alphabetically. (A sample config sketch follows this list.)

  3. Prompt built. For each spec file, semtest constructs a prompt containing the file's content and instructions for the LLM to identify scenarios, evaluate them against the codebase, and return structured JSON verdicts.

  4. LLM subprocess spawned. The configured LLM CLI (e.g., claude) is spawned as a child process in the current working directory. The prompt is passed as a command-line argument or via stdin, depending on the runner.

  5. LLM reads the codebase. The LLM CLI has access to the project directory through its own file-reading capabilities — the same access it would have in an interactive session. Semtest does not send your source code in the prompt.

  6. Response parsed. The LLM returns a JSON array of verdicts. Semtest extracts the JSON, handling fenced code blocks, bracket-matching, and other formatting variations.

  7. Retries if needed. If the LLM returns an empty response, semtest retries up to 3 times. Parse failures and timeouts are not retried.

  8. Reports generated. Results from all spec files are aggregated into Markdown, JSON, and optionally JUnit XML reports.
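
For reference, here's a minimal config sketch tying steps 1 and 2 together. Only testMatch and testPathIgnorePatterns appear in this document; the runner and outputDir fields are assumptions added for illustration.

// semtest.config.ts, a minimal sketch. testMatch and testPathIgnorePatterns
// are the fields described in steps 1 and 2; runner and outputDir are
// hypothetical names used here for illustration only.
export default {
  testMatch: ["**/*.spec.md"],               // globs used during spec discovery
  testPathIgnorePatterns: ["node_modules/"], // paths excluded from the walk
  runner: "claude",                          // hypothetical: which LLM CLI to spawn
  outputDir: "semtest-output",               // hypothetical: where reports land
};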

What the LLM receives

The prompt for each spec file looks like this:

You are a semantic test evaluator. Your task is to evaluate whether
the codebase in the current directory meets ALL test scenarios
described in the file below.

## Semantic Test File: auth.spec.md

[full contents of your spec file]

## Instructions

1. Examine the codebase in the current working directory.
2. Identify ALL distinct test scenarios or expectations in the file above.
3. For each test scenario, extract an ID or slug that identifies it.
4. Evaluate each test scenario against the codebase.
5. Respond with ONLY a JSON array.

[output format instructions for pass/fail/invalid/skip]

The LLM gets: the spec content, evaluation instructions, and the expected output format. Your code is not in the prompt — the LLM accesses it via the CLI's workspace.
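
For illustration, a response might carry entries like the sketch below. The exact field names aren't specified in this document, so id, status, and reason are assumptions; the status values are the four outcomes named in the output format.

// A sketch of the verdict shape. Field names are assumptions; the status
// values are the pass/fail/invalid/skip outcomes the prompt asks for.
type Verdict = {
  id: string;                                   // scenario ID or slug
  status: "pass" | "fail" | "invalid" | "skip"; // evaluation outcome
  reason: string;                               // short justification
};

const exampleResponse: Verdict[] = [
  { id: "rejects-expired-tokens", status: "pass", reason: "Middleware validates the exp claim." },
  { id: "locks-account-after-failures", status: "fail", reason: "No lockout logic found." },
];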

Retry and timeout behavior

Retries: If the LLM subprocess returns an empty response (zero-length stdout), semtest retries up to 3 times per spec file. Only empty responses trigger retries; parse failures, timeouts, and non-zero exit codes do not.
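
In code terms, the retry policy amounts to a loop like the sketch below, where invokeLLM is a hypothetical helper that runs the LLM CLI and resolves with its stdout:

// Hypothetical helper: spawns the configured LLM CLI, resolves with stdout.
declare function invokeLLM(prompt: string): Promise<string>;

// A sketch of the retry policy: only an empty response loops again.
// Parsing happens after this function returns, so parse failures
// never re-enter the loop.
async function runWithRetries(prompt: string, maxAttempts = 3): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const stdout = await invokeLLM(prompt);
    if (stdout.length > 0) return stdout; // non-empty: hand off to the parser
  }
  throw new Error("empty LLM response after all attempts");
}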

Timeouts: When --timeout is set, the LLM subprocess is killed with SIGTERM after the specified duration. If the process doesn't exit promptly, SIGKILL follows. Timed-out tests receive an error status with a message indicating the timeout.
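
The kill escalation can be sketched with Node's child_process API. The grace period between SIGTERM and SIGKILL is an assumed value; the document only says SIGKILL follows if the process doesn't exit promptly.

import { spawn, type ChildProcess } from "node:child_process";

// A sketch of timeout handling: SIGTERM at the deadline, SIGKILL if the
// process lingers. The 5-second grace period is an assumption.
function spawnWithTimeout(cmd: string, args: string[], timeoutMs: number): ChildProcess {
  const child = spawn(cmd, args, { cwd: process.cwd() });
  const term = setTimeout(() => {
    child.kill("SIGTERM");                        // ask the CLI to exit
    const kill = setTimeout(() => child.kill("SIGKILL"), 5_000);
    child.once("exit", () => clearTimeout(kill)); // exited within the grace period
  }, timeoutMs);
  child.once("exit", () => clearTimeout(term));   // finished before the deadline
  return child;
}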

Security and privacy

When semtest invokes an LLM CLI, that CLI runs in your project directory with the same file access it would have in an interactive session:

  • Your spec content is sent to the LLM provider as part of the prompt
  • Your source code is read by the LLM CLI through its own file access — semtest does not transmit it, but the LLM does read it to evaluate your specs
  • Network requests are made by the underlying LLM CLI to its provider's API

Semtest itself stores nothing remotely. All output stays local in the output directory.

If you're working on proprietary or sensitive code, review your LLM provider's data handling and retention policies. The relevant policies depend on which runner you've configured — each provider (Anthropic, OpenAI, Google) has its own terms.

Cost

Each spec file triggers one LLM invocation. The total cost depends on:

  • Number of spec files — More files = more invocations
  • Spec and codebase size — Larger specs and codebases consume more tokens
  • Model choice — The model determines per-token pricing

Model size   Examples                                    Cost      Best for
Large        Opus, o3, 2.5-pro                           Highest   Critical specs, complex codebases
Medium       Sonnet, o4-mini, 2.5-flash, gpt-4.1         Moderate  Daily use (default)
Small        Haiku, gpt-4.1-mini/nano, 2.5-flash-lite    Lowest    Large test suites, quick iteration

Start with a medium model like claude-code-sonnet-4-6. Move to a larger model if you need better accuracy on complex evaluations, or a smaller one if you're running many tests and want to keep costs down. See Tools & Models for all available model keys.
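
As a back-of-envelope model, total cost is roughly spec files times tokens per run times price per token. The sketch below makes that arithmetic explicit; every number in the example call is a placeholder, not real provider pricing.

// A rough cost model: one LLM invocation per spec file. All token
// counts and prices below are placeholders, not real provider rates.
function estimateCostUSD(opts: {
  specFiles: number;          // one invocation each
  avgInputTokens: number;     // prompt plus whatever the CLI reads
  avgOutputTokens: number;    // the JSON verdict array
  inputPricePerMTok: number;  // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
}): number {
  const perRun =
    (opts.avgInputTokens / 1e6) * opts.inputPricePerMTok +
    (opts.avgOutputTokens / 1e6) * opts.outputPricePerMTok;
  return opts.specFiles * perRun;
}

// Example: 20 spec files at assumed medium-model rates.
console.log(estimateCostUSD({
  specFiles: 20,
  avgInputTokens: 50_000,
  avgOutputTokens: 1_000,
  inputPricePerMTok: 3,
  outputPricePerMTok: 15,
}));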