Use the LLM Where It Matters, Not for Everything: A Hybrid Workflow for Rule Checking (Part 2/5)

In Part 1, I introduced the main idea behind our prototype: take messy stakeholder input, check it against requirement-writing rules, generate a violation report, use that report to rewrite it into a clearer system requirement, and export the result into a SysML v2-compatible form.

That sounds good in principle. But once you try to implement it, one practical question shows up almost immediately:

Do we really need an LLM to detect every rule violation, and is it even the most efficient way to do that?

LLMs are flexible: they understand natural language, and they are especially good at rewriting, generating suggestions, and resolving ambiguity. But they are also the most expensive part of the workflow, the least deterministic, and not always the best tool for enforcing strict rules. They can also hallucinate issues or over-flag statements that a simpler check could handle more reliably.

So Part 2 focuses on a more engineering-oriented question: where should we actually use an LLM, and where can simpler methods do the job better?

Also, this part is intentionally model-agnostic. The point is not that one specific model solves requirements engineering. The point is that an LLM can play a targeted role inside a larger workflow. In the current implementation, the project can route LLM calls to different backends, including locally hosted models and hosted APIs. Here, when I say "rules," I mainly mean the INCOSE requirement-writing rules that form the basis of our checking and rewriting workflow.

Why "LLM for everything" becomes inefficient fast

If every requirement is checked against every rule with an LLM, the workflow becomes more expensive, slower, and harder to make transparent.

That is especially true in requirements engineering, because many violations are not subtle at all. If a statement contains phrases such as "as required", "etc.", "to be able to", or uses a quantity without a unit, there is no strong reason to spend an LLM call just to notice that.

In other words, some checks are basically pattern-matching problems, not reasoning problems.

That led us to split the rule-checking task into three layers:

  1. programmatic checks for deterministic rules

  2. NLP for lightweight linguistic structure analysis

  3. LLM calls only for the rules that actually need semantic judgment

This hybrid approach is now reflected directly in the project structure. In the current rule catalog, the seeded rules are assigned a primary engine: programmatic checks, spaCy-based NLP, or LLM-based checking. The point is not that one method is universally better. The point is that each method should do the kind of work it is actually good at.
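As a rough sketch of what this routing looks like (the function names and dict fields here are illustrative assumptions, not the project's actual API), each rule record carries an `engine` field and the checker simply dispatches on it:

```python
def check_requirement(req_text, rules, engines):
    """Route each rule to its engine; `engines` maps an engine name
    ("python", "spacy", "llm") to a check function."""
    violations = []
    for rule in rules:
        checker = engines.get(rule["engine"])
        if checker is not None:  # skip engines not registered in this run
            violations.extend(checker(req_text, rule))
    return violations

# Stub engine for illustration: flags the open-ended phrase "etc."
def python_engine(text, rule):
    return [{"rule": rule["id"], "source": "python"}] if "etc." in text else []
```

An LLM engine plugs in the same way, which keeps the routing logic independent of which backend actually serves the call.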

Three types of rule violations

Once we stopped looking at rule checking as one monolithic task, the design became much clearer. There are three main kinds of violations:

Type 1: Surface-level and deterministic

These are violations that can often be detected through explicit word lists, regular expressions, or simple structural heuristics.

Examples include vague terms, escape clauses, open-ended phrases, superfluous infinitives, absolutes, acronym formatting issues, decimal formatting issues, and missing or inconsistent units.

For these, programmatic checks are usually enough.

Type 2: Linguistic structure problems

These are cases where the wording may not be wrong because of one forbidden word, but because of how the sentence is built.

Examples include passive voice, indefinite articles where a defined entity is intended, multiple clause structures in one statement, multiple main actions in one sentence, pronoun usage, and indefinite temporal markers.

For these, lightweight NLP helps more than plain string matching.

Type 3: Semantic and contextual issues

These are the harder ones. They often require interpretation rather than simple detection.

Examples include whether the subject and verb are appropriate for the entity, whether the logic is truly ambiguous, whether a condition is only implied rather than stated explicitly, whether the wording is solution-driven rather than solution-free, and whether performance is actually measurable.

For these, an LLM is still useful, because the issue is not just in the words themselves, but in the meaning behind them.

What programmatic checks can do surprisingly well

A large share of rules can be checked using straightforward programmatic logic. In our prototype, this includes things like vague terms, escape clauses, open-ended clauses, superfluous infinitives, absolutes, punctuation-related patterns, acronym forms, abbreviations, style-guide conformance, and decimal formatting.

That may sound almost too simple, but simplicity is exactly the advantage here.

If the text says: "The robot shall be able to pick and place components fast as required", then a programmatic layer can already catch several issues before we ever ask the LLM anything:

  • "be able to" can be flagged as a superfluous infinitive

  • "as required" can be flagged as an escape clause

  • "fast" can be marked as a likely performance problem candidate

Likewise, if a requirement says: "The robot shall operate at 300", we do not need semantic reasoning to notice that the number is missing a unit.
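A minimal sketch of such a layer might look like the following (the phrase lists and function name are illustrative, not the project's actual rule catalog):

```python
import re

# Illustrative phrase lists; a real catalog would seed these per INCOSE rule.
ESCAPE_CLAUSES = ["as required", "if necessary", "where appropriate"]
SUPERFLUOUS = ["be able to", "be capable of"]
VAGUE_TERMS = ["fast", "user-friendly", "robust"]

def run_basic_checks(text: str) -> list:
    """Flag surface-level violations, pointing at the exact offending phrase."""
    found = []
    for phrases, rule in ((ESCAPE_CLAUSES, "escape clause"),
                          (SUPERFLUOUS, "superfluous infinitive"),
                          (VAGUE_TERMS, "vague term")):
        for phrase in phrases:
            # Word boundaries avoid false hits such as "fast" inside "breakfast".
            if re.search(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE):
                found.append({"rule": rule, "phrase": phrase})
    # A bare number at the end of the statement is a missing-unit candidate.
    if re.search(r"\b\d+(\.\d+)?\s*[.,]?\s*$", text):
        found.append({"rule": "missing unit", "phrase": "bare number"})
    return found
```

Every finding carries the exact phrase that triggered it, which is what makes this layer easy to explain to the user.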

Programmatic checks are attractive for three reasons:

  • They are cheap.

  • They are predictable.

  • They are easy to explain to the user.

That matters in requirements engineering, because people do not just want a score or a yes/no answer. They want to know exactly why something was flagged, and programmatic checks can point directly to the offending phrase.

What NLP adds that pure logic misses

The next layer in the workflow is lightweight NLP. In our case, we use spaCy, not because it "solves language", but because it gives us just enough structure to go beyond raw keyword matching.

When people hear NLP, it can sound more mysterious than it really is. Here it mainly means:

  • tokenizing the sentence

  • identifying parts of speech

  • detecting grammatical dependencies

  • finding subjects, verbs, clause markers, and conjunctions

This is useful when the problem is not one forbidden word, but the grammatical form of the sentence. For example, NLP can help detect:

  • passive voice instead of active voice

  • whether a sentence has a clear subject and main verb

  • whether two actions are bundled into one requirement

  • whether conjunctions are joining multiple thoughts

  • whether pronouns such as "it" or "they" make the requirement less self-contained

  • whether words like "before", "after", or "until" are functioning as temporal dependency markers

This is the middle ground between simple pattern matching and full semantic reasoning. It gives more linguistic awareness than plain regex checks, but is still much cheaper and more stable than calling an LLM for every sentence. In practice, it is especially useful for rules that are about sentence structure rather than domain meaning.
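In the real pipeline, spaCy derives these signals from POS tags and dependency labels (passive voice, for instance, shows up as `auxpass`/`nsubjpass` dependencies). As a rough stand-in to show the kind of output this layer produces, here is a crude regex approximation; it is deliberately simplistic and will miss or over-flag cases the dependency parse handles correctly:

```python
import re

# Crude approximations of signals spaCy derives from POS tags and
# dependency labels (e.g. passive voice via `auxpass`/`nsubjpass`).
PASSIVE = re.compile(r"\b(is|are|was|were|be|been|being)\s+\w+(ed|en)\b", re.I)
PRONOUN = re.compile(r"\b(it|they|them|its|their)\b", re.I)
TEMPORAL = re.compile(r"\b(before|after|until|while)\b", re.I)

def structural_signals(text: str) -> dict:
    """Return coarse structural flags for one requirement statement."""
    return {
        "passive_voice": bool(PASSIVE.search(text)),
        "pronoun": bool(PRONOUN.search(text)),
        "temporal_marker": bool(TEMPORAL.search(text)),
    }
```

The point is the shape of the output: per-sentence structural flags that downstream rules can consume, not a judgment about meaning.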

Where the LLM still adds real value

Once the obvious and structural issues are filtered out, the remaining rules are the ones where language understanding really matters.

Take a rule like appropriate subject-verb usage. A sentence may be grammatically correct and still assign the action to the wrong entity. That is not something a simple pattern check can reliably decide.

The same goes for rules such as logical expressions, explicit versus implied conditions, solution-free wording, measurable performance, and enumeration or set-related ambiguity.

These cases require the system to interpret intent, not just syntax. For instance, consider a requirement like: "The user shall use an Arduino-based controller to control the robot precisely."

The problem is not only the words. The deeper question is whether the statement is expressing a real requirement, mixing user behavior with system behavior, or prematurely constraining the solution by naming a specific implementation choice. That is exactly the kind of judgment where an LLM is useful.

So the LLM still matters, but in a narrower and more deliberate role:

  • semantic judgment for the harder rules

  • generation of higher-quality suggestions

  • rewriting based on the consolidated violation report
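One way to keep that role narrow is to build the LLM prompt only from the semantically hard rules, with an explicit instruction not to invent issues. A sketch of such a prompt builder (the schema and field names are assumptions for illustration, not the project's actual prompt):

```python
def build_semantic_prompt(req_text: str, rules: list) -> str:
    """Assemble a check prompt covering only the semantically hard rules.

    Each entry in `rules` is a dict like {"id": "R31", "summary": "..."}.
    """
    rule_lines = "\n".join(f"- {r['id']}: {r['summary']}" for r in rules)
    return (
        "Check ONE requirement statement against the rules below.\n"
        "Report only genuine violations; do not invent issues.\n\n"
        f"Rules:\n{rule_lines}\n\n"
        f"Requirement: {req_text}\n\n"
        'Respond as JSON: {"violations": [{"rule_id": "...", "reason": "..."}]}'
    )
```

Because the prompt is assembled from the rule catalog, the same builder works regardless of which backend the call is routed to.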

What this looks like in the implementation

This split is not just a conceptual idea. It is reflected directly in the code.

At a high level, the workflow first runs programmatic checks, then NLP checks, then sends only selected rules to the LLM layer:

```python
coded_violations = run_coded_checks(req_text, allowed_rules=allowed)
spacy_violations = run_spacy_checks(req_text, allowed_rules=allowed)

rules_for_llm = ["R3", "R4", "R15", "R31", "R34", "R36"]
llm_violations = llm_check_rules(req_text, rules_for_llm, allowed_rules=allowed)
```

The rule catalog also makes the split explicit. Some rules are seeded with `engine = "python"`, some with `engine = "spacy"`, and some with `engine = "llm"`. That keeps simple checks local, uses NLP for structure, and leaves semantic checks flexible enough to run on either local or hosted LLM backends.

The new workflow

This leads to a workflow that is much more practical than an LLM-only pipeline.

Step 1: Run programmatic checks to catch obvious textual and formatting violations.

Step 2: Run lightweight NLP checks to inspect sentence structure, clause boundaries, voice, pronouns, and related grammatical signals.

Step 3: Send only the remaining semantically difficult rules to the LLM.

Step 4: Consolidate the results into one transparent violation report while preserving source labels, so each issue remains traceable.

Step 5: Use that violation report as input to rewriting instead of asking the LLM to rewrite blindly.
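Steps 4 and 5 hinge on keeping the source labels attached. A minimal consolidation sketch (the function and field names are illustrative):

```python
def consolidate(coded, spacy_based, llm_based):
    """Merge violations from all three layers into one report,
    recording which layer flagged each issue for traceability."""
    report = []
    for source, items in (("python", coded),
                          ("spacy", spacy_based),
                          ("llm", llm_based)):
        for v in items:
            report.append({**v, "source": source})
    # Deterministic findings first, so the cheapest explanations lead.
    order = {"python": 0, "spacy": 1, "llm": 2}
    return sorted(report, key=lambda v: order[v["source"]])
```

The rewriting step then receives this single report rather than raw model output, so every suggested change can be traced back to a specific, labeled finding.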

This shifts the LLM from all-purpose checker to targeted semantic assistant.

Why this hybrid approach is better

The main benefit is cost reduction, but that is not the only benefit.

It also improves transparency. If a requirement was flagged because it contains "as required", that should come from a deterministic rule with a clear explanation, not from a black-box judgment that merely says the sentence is vague.

It improves repeatability too. Programmatic and NLP-based checks are easier to test, compare across runs, and debug when they behave unexpectedly.

And it improves trust. Engineers are usually more willing to accept AI assistance when the workflow is not pretending that one model is magically doing everything. A layered pipeline feels more like an engineering system and less like an oracle.

What this still does not solve

A hybrid workflow is better, but it is not a silver bullet.

Some rules still depend on project context, such as glossary terms, allowed units, naming conventions, and domain-specific wording. That becomes especially important when the system moves from detecting violations to rewriting requirements in a way that is actually useful for a specific project.

Some rules are really set-level rather than sentence-level. You cannot fully judge things like unique expression, grouping, or structured requirement sets from one isolated sentence.

And some cases will still need human review, especially when domain intent is unclear or when a statement is technically valid but poorly aligned with project conventions.

So the takeaway is not "replace the LLM." It is to be more deliberate about where each method actually helps and where it only adds cost or noise.

The real takeaway is simpler:

  • Use deterministic logic where logic is enough.

  • Use NLP where sentence structure matters.

  • Use the LLM where semantic judgment is actually needed.

What comes next

Once we had this hybrid rule-checking workflow, another question became more important:

Even if we detect violations efficiently, how do we guide the system toward producing better requirement statements in the first place?

That is where patterns, TBD handling, and project context come in.

In Part 3, I will look at how predefined requirement patterns, controlled placeholders, and project context can improve consistency, reduce ambiguity, and help the LLM generate stronger draft system requirements without pretending that missing information is already known.
