Semtests: An Introduction
What are semtests?
semtest, short for semantic test, is a natural language testing tool. Think Jest, but you write your test cases in plain English (or any language). Your coding agent — Claude Code, OpenCode, or whatever you use — conducts the testing and generates a report for you.
More broadly, semantic tests are a new door opened by the rise of AI coding tools. They expand the testing landscape into territory where we can leverage bash, web search, skills and an LLM's semantic reasoning to make assertions we simply couldn't automate before. Used appropriately, that's a powerful tool at our disposal, one that can greatly increase the quality of our output. Semtests are bound to a CI process. They're repeatable and programmable. Test runs return a report identifying which tests passed, which failed, and the reasoning behind each failure.
Where are they useful?
The best way I've found to illustrate the use cases for semtests is through the semtests themselves. Here are a few examples to showcase what they are, what they aren't and where they can be useful.
# Spelling & Grammar
Review all client-facing copy across the project for spelling and grammatical errors. Flag any issues found with their file location and the suggested correction.
# No Hardcoded Secrets
Scan the codebase for hardcoded secrets, API keys, tokens and credentials.
All sensitive values must be referenced from environment variables.
# Ingress Validation
Compare all API routes defined in the application source code against the ingress configuration.
Verify that every route defined in the application has a corresponding ingress rule exposing it.
Ensure that every ingress rule maps to a route that exists in the application — there should be no stale or orphaned ingress rules.
No v1 routes should be exposed — v1 has been deprecated and all traffic should be routed to v2 or above.
# Accessibility
Review all client-facing pages and components.
Ensure all images have alt text that accurately and meaningfully describes the image content.
Verify that all form inputs have associated labels that clearly describe the expected input.
Confirm that semantic HTML tags (nav, main, article, aside, header, footer) are used correctly based on the content they contain.
Ensure screen reader experience is coherent — elements should be labelled and ordered in a way that makes sense when read sequentially.
Whenever you need repeated assertions about a set of files — especially judgments outside the scope of linting and current testing tools — semtests are a good option. They make it possible to programmatically automate aspects of quality assurance that we've previously been forced to do manually.
Getting started
Install
semtest is available as an npm package and through Homebrew.
npm install -g @westopp/semtest
brew install westopp/semtest/semtest
Requires Node 20 or above.
Initialise
Run semtest init at the root of your project. This generates a semtest.config.ts file:
import { defineConfig } from "@westopp/semtest";

export default defineConfig({
  output: "semtest-results/",
  llm: "claude-code-sonnet-4-6",
});
Configuration
The config file controls how semtest discovers and runs your tests. Here are some of the key properties:
| Property | Default | Description |
|---|---|---|
| `output` | `"semtest-results/"` | Directory where reports are written |
| `llm` | `"claude-code-sonnet-4-6"` | The LLM model to use for test evaluation |
| `testMatch` | `["**/*.spec.md", "**/*.test.md"]` | Glob patterns for test file discovery |
| `timeout` | `0` | Per-test timeout in milliseconds (`0` = no timeout) |
| `repeat` | `1` | Number of times to repeat each test for confidence |
| `bail` | `false` | Stop on first failure |
| `verbose` | `false` | Show detailed per-test output in the terminal |
| `junit` | `false` | Generate a JUnit XML report |
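For example, a config that tightens these defaults might look like the following sketch (property names come from the table above; the values are purely illustrative):

import { defineConfig } from "@westopp/semtest";

export default defineConfig({
  output: "semtest-results/",
  llm: "claude-code-sonnet-4-6",
  // Only discover tests in a dedicated directory
  testMatch: ["semtests/**/*.spec.md"],
  // Abort any single test after 5 minutes
  timeout: 300_000,
  // Run each test twice to gauge the consistency of the judgment
  repeat: 2,
  // Emit a JUnit XML report alongside the default reports
  junit: true,
});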
semtest supports 60+ model configurations across Claude Code, Gemini CLI, Codex, Aider, OpenCode, Goose, GitHub Copilot and more. Run semtest list to see all available models.
For the full configuration reference, visit semtest.westopp.com.
Writing a test
Create a .spec.md or .test.md file anywhere in your project. A semtest is a markdown file containing natural language that describes what you expect from your codebase.
# Spelling & Grammar
Review all client-facing copy across the project for spelling and grammatical errors.
Flag any issues found with their file location and the suggested correction.
That's it. No imports, no setup, no boilerplate.
You can optionally add YAML frontmatter to override config on a per-test basis:
---
tags: quality, content
timeout: 60000
llm: claude-code-opus-4-6
---
# Translation Quality
Scan all translation files in this project. The source language is English.
Verify that translations are accurate representations of the English source.
Ensure no translations contain offensive, inappropriate or culturally insensitive language.
Confirm that no translation keys are missing across any supported locale.
A single file can contain multiple test scenarios. semtest will identify distinct tests based on headings and structural markers and evaluate each one independently.
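For instance, a single security.spec.md (a hypothetical file; both scenarios are illustrative) could hold two tests that semtest evaluates separately:

# No Hardcoded Secrets
Scan the codebase for hardcoded secrets, API keys, tokens and credentials.
All sensitive values must be referenced from environment variables.

# Stale Feature Flags
Find all feature flags defined in the codebase.
Flag any that are enabled everywhere or referenced nowhere, with their file location.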
Running tests
# Run all tests
semtest run
# Run specific files or directories
semtest run semtests/security/
semtest run semtests/spelling.spec.md
# Filter by name
semtest run --testNamePattern "security"
# Filter by tags
semtest run --tag "critical,quality"
# Repeat tests for confidence
semtest run --repeat 3
What to expect
When you run semtest run, each test file is passed to your configured LLM, which evaluates the test against your codebase and returns a pass or fail judgment.
In the terminal you'll see a progress spinner per test followed by a summary:
Results: 4 passed, 1 failed, 5 total
semtest generates two report files in your output directory:
latest.md — A human-readable markdown report with a summary table and detailed results for each failing test, including what was expected, what was observed, the relevant file location and a suggested resolution.
ci-results.json — A machine-readable JSON report suitable for CI/CD pipelines.
The process exits with code 0 if all tests pass and 1 if any test fails — so you can integrate semtest directly into your CI pipeline.
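As a sketch, a GitHub Actions job gating pull requests on semtests might look like this (adapt the steps to your own CI system; whichever coding agent you've configured will also need its credentials available in the environment):

name: semtests
on: [pull_request]
jobs:
  semtest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g @westopp/semtest
      # A non-zero exit code from semtest fails the job
      - run: semtest run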
Trade-offs
The shortcomings of semtests mirror the shortcomings of AI in general.
Slow. LLM inference takes time. A semtest will never be as fast as a unit test. Factor this into when and how often you run them.
Expensive at scale. Some semtests require injecting an entire codebase into context. A few tests like that, a large suite, or frequent runs can cause a noticeable spike in inference costs.
Non-deterministic. Prone to producing different results across runs. This is the most significant limitation and needs to be accounted for.
Considerations
The above are real cons that need to be acknowledged and factored into how you use semtests. Here are some things to consider.
Write precise tests. Avoid overly broad, vague or ambiguous tests that are open to interpretation. The more specific your assertions, the more consistent your results. For additional confidence, use the repeat property to run tests multiple times. We're also working on mechanisms to reduce LLM response temperature to produce more deterministic results.
AI is improving. Models are getting faster, cheaper and smarter on an almost monthly cadence. As they improve, so will your semtests. As has become a common saying in the AI space — this is the worst it's going to be.
Review the output. As with much of the work AI produces, reviewing it is key. Analyse the output for correctness to better understand these shortcomings and improve your results over time.
What's next
Semtest Driven Development
The idea behind Semtest Driven Development (SDD) is marrying test-driven development and spec-driven development into a single approach.
You define your specification as semtest files. Because semtests are markdown files containing both natural language and code snippets, they act as a descriptive specification of what needs to be built. Because they also define what success looks like and are bound to a CI process, they double as an architectural test suite.
This creates a red-green workflow for your coding agents. Failing semtests output the reasons for failure, which can be fed as context to coding agents to resolve the issues and better comply with the spec. Passing semtests confirm that the given unit of work was completed as required.
Your spec is your test suite. One artifact, two purposes.
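A rough sketch of that loop in shell (the specs/ path and the prompt are illustrative, and claude -p here stands in for whatever non-interactive agent invocation you use):

# Red: run the spec suite. Green: the loop exits once every semtest passes.
until semtest run specs/; do
  # Feed the failure report back to the agent and let it attempt a fix
  claude -p "Read semtest-results/latest.md and resolve the failures it describes"
done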
Semantic Assertions
Semtests aren't intended to compete with conventional testing methods — they're intended to complement them as part of a rich test suite. Taking this a step further: what if semtests could be embedded directly within conventional testing tools to enable more capable unit, integration and end-to-end testing?
Imagine if Vitest exposed a semantic assertion method that could make AI-based assertions within a test case:
describe('UserProfile', () => {
  it('should display an accessible and informative error state', async () => {
    const result = render(<UserProfile userId="invalid" />)
    await semanticExpect(result.container.innerHTML)
      .toSatisfy('The error message is user-friendly, non-technical, and suggests a next step')
  })
})
This tighter integration would bridge the gap between deterministic and semantic testing, allowing developers to make AI-powered assertions without leaving their existing test runner or workflow. The same speed, cost and determinism considerations apply — but scoped to individual assertions rather than entire test files, giving developers fine-grained control over where they trade precision for judgment.