Use the LLM Where It Matters, Not for Everything: A Hybrid Workflow for Rule Checking (Part 2/5)
In Part 1, I introduced the main idea: take messy stakeholder input, check it against requirement-writing rules, generate a violation report, and use that report to rewrite it into a clearer system requirement.
That sounds clean in principle. But once you try to implement it, one question shows up almost immediately: Do we really need an LLM to detect every rule violation?
LLMs are flexible and great at rewriting, suggestion generation, and ambiguity resolution. But they are also the most expensive part of the workflow, the least deterministic, and honestly not always the best tool for enforcing strict rules. They can hallucinate issues or over-flag things that a simpler check would handle more reliably.
So this part focuses on a more engineering-oriented question: where should we actually use an LLM, and where can simpler methods do better?
Also worth saying upfront: this part is model-agnostic. The point is not that one specific model solves requirements engineering. It is that an LLM should play a targeted role inside a larger workflow, not own the whole thing.
Why "LLM for everything" breaks down fast
If every requirement is checked against every rule using an LLM, the workflow becomes more expensive, slower, and harder to trust. That is especially true because many violations are not subtle. If a statement contains phrases like "as required" or "etc.", or uses a number without a unit, there is no strong reason to spend an LLM call just to notice that.
Some checks are pattern-matching problems, not reasoning problems. That led us to split rule checking into three layers: programmatic checks for deterministic rules, NLP for lightweight linguistic structure, and LLM calls only where semantic judgment is actually needed.
In the current INCOSE rule catalog, every rule is assigned a primary engine: `python`, `spacy`, or `llm`. The idea is simple: each method should do the kind of work it is actually good at.
Three types of rule violations:
Type 1: Surface-level and deterministic
Vague terms, escape clauses, open-ended phrases, superfluous infinitives, absolutes, acronym formatting, and missing units. These can be detected through word lists, regular expressions, or simple structural heuristics. Programmatic checks are usually enough.
Type 2: Linguistic structure problems
Passive voice, bundled actions in one sentence, pronoun usage, and indefinite temporal markers. The problem is not one forbidden word; it is how the sentence is built. NLP helps more than plain string matching here.
Type 3: Semantic and contextual issues
Whether the subject and verb match the intended entity. Whether a condition is only implied rather than stated. Whether the wording constrains the solution before the design is ready. Whether performance is actually measurable. These require interpretation, not just detection. That is where the LLM earns its place.
What programmatic checks handle surprisingly well
A large share of rules can be checked with straightforward logic. Vague terms, escape clauses, open-ended phrases, superfluous infinitives, punctuation patterns, acronym forms, and decimal formatting can all be flagged without touching a language model.
Take this requirement: "The robot shall be able to pick and place components as fast as required."
A programmatic layer already catches several things before any LLM call:
- "to be able to" flagged as a superfluous infinitive
- "as required" flagged as an escape clause
- "fast" marked as a likely vague performance term
Or this one: "The robot shall operate at 300."
No semantic reasoning needed to spot the missing unit. Programmatic checks are fast, predictable, and easy to explain. That last point matters a lot. Engineers don't just want a score. They want to know exactly why something was flagged.
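A minimal sketch of what such a programmatic layer can look like. The word lists, rule names, and the unit heuristic here are illustrative placeholders, not the project's actual rule catalog; a real implementation would map each phrase to its INCOSE rule ID.

```python
import re

# Hypothetical word lists; a real catalog maps each phrase to an INCOSE rule ID.
ESCAPE_CLAUSES = ["as required", "if necessary", "as appropriate", "where possible"]
OPEN_ENDED = ["etc.", "and so on", "including but not limited to"]
SUPERFLUOUS_INFINITIVES = ["be able to", "be capable of"]
VAGUE_TERMS = ["fast", "user-friendly", "adequate", "sufficient"]

# Rough heuristic: a bare number followed directly by end punctuation, i.e. no unit.
MISSING_UNIT = re.compile(r"\b\d+(?:\.\d+)?\s*(?=[.,;]|$)")

def run_basic_checks(req_text: str) -> list[dict]:
    """Flag surface-level violations using word lists and regular expressions."""
    violations = []
    lowered = req_text.lower()
    for rule, phrases in [
        ("escape_clause", ESCAPE_CLAUSES),
        ("open_ended", OPEN_ENDED),
        ("superfluous_infinitive", SUPERFLUOUS_INFINITIVES),
    ]:
        for phrase in phrases:
            if phrase in lowered:
                violations.append({"rule": rule, "evidence": phrase})
    for term in VAGUE_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            violations.append({"rule": "vague_term", "evidence": term})
    if MISSING_UNIT.search(req_text):
        violations.append({"rule": "missing_unit", "evidence": "number without a unit"})
    return violations
```

Run against the two example requirements above, this flags the escape clause, the superfluous infinitive, the vague term, and the missing unit, each with the exact evidence that triggered it.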
What NLP adds that pure logic misses
The next layer uses spaCy, not because it "solves language", but because it gives just enough structure to go beyond raw keyword matching. In practice, this means tokenizing the sentence, identifying parts of speech, detecting grammatical dependencies, and finding subjects, verbs, and clause markers.
That is useful when the problem is not one forbidden word but the grammatical form of the sentence. Passive voice, bundled actions, pronoun usage, temporal dependency words like "before", "after", or "until": these are structure problems, not vocabulary problems.
Lightweight NLP handles them without spending a full LLM call on a sentence where the issue is just grammatical.
Where the LLM still adds real value
Once the obvious and structural issues are filtered out, what remains are the rules where language understanding actually matters. Take appropriate subject-verb usage. A sentence can be grammatically correct and still assign an action to the wrong entity. Pattern matching cannot reliably catch that.
The same goes for explicit vs. implied conditions, solution-free wording, measurable performance, and enumeration ambiguity. Consider: "The user shall use an Arduino-based controller to control the robot precisely."
The deeper issue is not just the words. Is this expressing a real system requirement? Is it mixing user behavior with system behavior? Is it prematurely locking in an implementation choice? That kind of judgment is exactly what the LLM is useful for. So the LLM still matters, but in a narrower role: semantic checks that require genuine interpretation, rewrite suggestions, and the final consolidation pass.
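In this narrower role, the LLM call itself is the least interesting part; what matters is scoping the prompt to the semantic rules only and parsing the reply defensively. A sketch, with illustrative function names and prompt wording (the model call is deliberately left out, since it can be a local model or a hosted API):

```python
import json

def build_semantic_prompt(req_text: str, rules: dict[str, str]) -> str:
    """Assemble a prompt covering only the semantic rules routed to the LLM.

    `rules` maps rule IDs to short descriptions, e.g.
    {"R3": "The subject must be the system, not the user or operator."}
    """
    rule_lines = "\n".join(f"- {rid}: {desc}" for rid, desc in rules.items())
    return (
        "Check the requirement below against these rules only. "
        "Report genuine violations; do not speculate.\n\n"
        f"Rules:\n{rule_lines}\n\n"
        f"Requirement: {req_text}\n\n"
        'Respond as JSON: {"violations": [{"rule": "...", "reason": "..."}]}'
    )

def parse_llm_response(raw: str) -> list[dict]:
    """Parse defensively: an LLM reply is never guaranteed to be valid JSON."""
    try:
        return json.loads(raw).get("violations", [])
    except (json.JSONDecodeError, AttributeError):
        return []
```

Keeping the rule list in the prompt small is the whole point: the LLM never re-checks what the cheaper layers already settled.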
What this looks like in the code
The split is reflected directly in the implementation. Rules are seeded with an engine assignment (`python`, `spacy`, or `llm`), and the workflow routes each rule accordingly:
```python
rules_for_programmatic = [
    "R1", "R4", "R6", "R7", "R8", "R9", "R10", "R13", "R14", "R16",
    "R17", "R20", "R21", "R26", "R32", "R36", "R37", "R38", "R39", "R40",
]
coded_violations = run_coded_checks(req_text, allowed_rules=rules_for_programmatic)

rules_for_spacy = ["R2", "R5", "R11", "R12", "R18", "R19", "R24", "R35"]
spacy_violations = run_spacy_checks(req_text, allowed_rules=rules_for_spacy)

rules_for_llm = ["R3", "R15", "R27", "R31", "R34", "R41"]
llm_violations = llm_check_rules(req_text, rules_for_llm)
```
Simple checks stay local. NLP handles structure. LLM handles semantics and can be routed to a local model or a hosted API, depending on the setup.
The new workflow
Step 1: Run programmatic checks to catch obvious textual and formatting violations.
Step 2: Run lightweight NLP checks to inspect sentence structure, voice, pronouns, and clause boundaries.
Step 3: Send only the semantically difficult rules to the LLM.
Step 4: Consolidate into one transparent violation report, with source labels so each issue stays traceable back to which engine caught it.
Step 5: Use that report as input to rewriting, instead of asking the LLM to rewrite blindly. This shifts the LLM from an all-purpose checker to a targeted semantic assistant.
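The consolidation in step 4 can be sketched as a small merge that attaches a source label to every finding. The function name and ordering are illustrative, assuming each engine returns a list of violation dicts:

```python
ENGINE_ORDER = {"python": 0, "spacy": 1, "llm": 2}

def consolidate(coded, spacy_based, llm_based):
    """Merge the three engines' findings into one report with source labels,
    so every issue stays traceable back to the engine that caught it."""
    report = []
    for source, violations in (
        ("python", coded),
        ("spacy", spacy_based),
        ("llm", llm_based),
    ):
        for v in violations:
            report.append({**v, "source": source})
    # Deterministic findings first, so the cheapest, most explainable evidence leads.
    report.sort(key=lambda v: (ENGINE_ORDER[v["source"]], v.get("rule", "")))
    return report
```

The labeled report is exactly what the rewrite step consumes, which is why each entry keeps its rule ID and source rather than being collapsed into a single score.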
Why this hybrid approach is better
The obvious benefit is cost. But it goes further than that.
It improves transparency. If a requirement was flagged because it contains "as required", that should come from a deterministic rule with a clear explanation, not from a black-box judgment.

It improves repeatability. Programmatic and NLP-based checks are easier to test, compare across runs, and debug.

And it builds trust. Engineers are more willing to accept AI assistance when the workflow does not pretend that one model is magically doing everything. A layered pipeline feels like an engineering system, not an oracle.
What this still does not solve
A hybrid workflow is better, but it is not complete.
Some rules depend on project context: glossary terms, allowed units, naming conventions, and domain wording. That becomes especially important when the system moves from detecting violations to rewriting requirements in a way that is actually useful for a specific project.
Some rules are set-level rather than sentence-level. You cannot fully judge unique expression or grouping from one isolated sentence. And some cases will still need human review, especially when domain intent is unclear. The takeaway is not "replace the LLM." It is to be more deliberate about where each method helps and where it only adds cost or noise.
What comes next...
Once this hybrid checking workflow was in place, a different question became more pressing:
Even if we detect violations efficiently, how do we guide the system toward producing better requirement statements in the first place?
That is where patterns, TBD handling, and project context come in.
In Part 3, I'll look at how predefined requirement patterns, controlled placeholders, and project context can improve consistency, reduce ambiguity, and help the LLM generate stronger draft system requirements — without pretending that missing information is already known.