At Comand AI, we are building the next-generation suite of AI-native software for the military. In this post, we explain our current approach to building workflows that natively integrate the latest advances in AI. Our systems are designed to quickly harness the power of the latest ML models while delivering field-ready performance on specific workflows.
The pace of advances in AI over the last few years has been tremendous. Every week, AI companies release new foundation models that break benchmark records. In particular, Large Language Models (LLMs) have been at the core of the AI boom since the launch of ChatGPT in 2022.
An LLM (Large Language Model) is a probabilistic machine learning model that learns the distribution of words in written language. We use two types of LLMs: generative models and embedding models. Through their training, LLMs learn - within their billions of parameters - the relationships between words. To generalize to unseen text, they also acquire grammar, vocabulary, and abstract concepts that help them predict words more accurately.
Generative models (decoder-only) are trained on billions of sentences to predict the next word given the preceding words. Thanks to this training, generative LLMs can analyze text, extract information, and produce summaries or answers that follow given instructions.
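To make the autoregressive idea concrete, here is a minimal sketch using a toy bigram model instead of a neural network: it counts which word follows which in a tiny corpus, then generates text one token at a time by feeding each prediction back in. The corpus and all names are illustrative, not anything from a real model.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the billions of sentences a real LLM is trained on.
corpus = "the unit moves north the unit holds position the convoy moves north".split()

# Count bigram frequencies: how often each word follows another.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most probable next word given the previous one."""
    candidates = bigrams[word]
    return candidates.most_common(1)[0][0] if candidates else None

# Autoregressive generation: each predicted token becomes the next input.
token, generated = "the", ["the"]
for _ in range(3):
    token = predict_next(token)
    generated.append(token)
print(" ".join(generated))  # "the unit moves north"
```

A real LLM replaces the bigram table with a transformer conditioned on the entire preceding context, but the generation loop - predict, append, repeat - is the same.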
Embedding models (encoder-only) are trained to predict missing words in a text (e.g. 15% of the words are masked during training). These models generate semantically rich vector representations of texts but lack the text-generation capability of generative models.
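The vector representations produced by embedding models can be compared with cosine similarity: related texts end up close together in the vector space. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; a real encoder would output vectors with hundreds of dimensions.

```python
import math

# Hypothetical embeddings, hand-picked so related words point in similar directions.
embeddings = {
    "tank":     [0.9, 0.1, 0.2],
    "armor":    [0.8, 0.2, 0.3],
    "sandwich": [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words score higher than unrelated ones.
print(cosine(embeddings["tank"], embeddings["armor"]))     # high
print(cosine(embeddings["tank"], embeddings["sandwich"]))  # low
```

This is the mechanism behind semantic search: embed a query, embed the documents, and rank by similarity.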
At Comand AI, our objective is to integrate the huge automation potential of LLMs into military workflows to dramatically decrease execution time (e.g. a 100x reduction) while substantially increasing the quality of the outputs (e.g. world-class expert performance).
So, can pre-trained LLMs handle military workflows out of the box? The answer is both yes and no.
On one hand, we use powerful open-source models available to the public. One of our objectives is to keep an edge and benefit from the latest advances, so we regularly integrate new generations of models after internal evaluations and comparisons against international benchmarks. This strategy ensures that we continuously benefit from the most effective models, and if necessary, we can quickly switch to another model - there is no exclusive dependency.
Out-of-the-box performance on military texts is good because modern LLMs have been exposed to a wide variety of material, including many texts related to the military field: doctrines, manuals, press articles, official reports, academic publications, etc. They understand multiple languages (French, English, Russian, Chinese, Ukrainian, etc.) and already have a solid command of the vocabulary, acronyms, and language structures specific to the military environment. This makes LLMs a tool that can be plugged in out of the box to support military workflows.
On the other hand, while pre-trained LLMs are very useful components in our processes, we cannot rely on them alone to execute our workflows with sufficient performance. These models achieve acceptable performance on tasks that are well represented in their training data (for example, general knowledge) or that have been specifically optimized for strong results on certain benchmarks (such as programming tasks).
In general, when a pre-trained model is applied to specialized processes - whether because of the nature of the tasks or the high degree of precision required - out-of-the-box performance is not sufficient for production use. This is especially important for unlocking productivity at the per-user level.
There are several technical approaches to adapting an open-source model and improving its performance on a specific process.
These adaptation techniques all treat the target process as a black box: the LLM is asked to solve the task end-to-end, without relying on a specific problem-solving method or a defined operational workflow as a guide.
Many modern prompt-tuning techniques - such as chain-of-thought prompting - include instructions that push the LLM to reason step by step. These techniques allow models to handle more complex problems with greater accuracy. Agents are a natural evolution of this capacity to reason and to produce structured outputs that call external tools.
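As a concrete illustration, chain-of-thought prompting amounts to wrapping the task in instructions that elicit intermediate reasoning before the final answer. The helper and prompt wording below are hypothetical, shown only to make the pattern tangible - not a production prompt.

```python
# Hypothetical helper that wraps a question in chain-of-thought style instructions.
def build_cot_prompt(question: str) -> str:
    return (
        "Answer the question below. Think step by step: first list the "
        "relevant facts, then reason through them, then state the answer "
        "on a final line starting with 'ANSWER:'.\n\n"
        f"Question: {question}"
    )

prompt = build_cot_prompt(
    "How many vehicles are needed to move 120 troops if each vehicle carries 30?"
)
print(prompt)
```

Forcing the model to emit its reasoning before the answer gives it more tokens to "think" with, and the fixed `ANSWER:` prefix makes the final result easy to parse programmatically.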
However, these techniques remain general-purpose, since they must work across a wide variety of processes that are not known in advance. Furthermore, the task is still treated as a single block, even if the LLM is encouraged to decompose it into steps internally.
In practice, for production systems, the state of the art for many workflows consists of breaking down the overall process into simple, modular, and testable sub-tasks. Each sub-task is formulated to maximize model performance — for example, by providing only the necessary context for that specific sub-task.
The inputs and outputs of each sub-task are structured and deterministically controlled, ensuring testability, traceability, and overall reliability far superior to what a standalone LLM can achieve on the whole process.
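The decomposition pattern can be sketched as a pipeline of typed sub-tasks. In this hypothetical two-stage example, each stage has structured inputs and outputs plus a deterministic check at the boundary; the function bodies are simple stand-ins for what would be narrowly scoped LLM calls in a real system.

```python
from dataclasses import dataclass

# Typed, structured interfaces between sub-tasks make each stage testable.
@dataclass
class ExtractedFacts:
    source_id: str
    facts: list[str]

@dataclass
class Summary:
    source_id: str
    text: str

def extract_facts(source_id: str, document: str) -> ExtractedFacts:
    # Stand-in for an LLM call scoped to a single, narrow sub-task.
    facts = [line.strip() for line in document.splitlines() if line.strip()]
    if not facts:
        # Deterministic guardrail: fail loudly at the stage boundary.
        raise ValueError(f"no facts extracted from {source_id}")
    return ExtractedFacts(source_id, facts)

def summarize(extracted: ExtractedFacts) -> Summary:
    # Stand-in for a second, independent LLM call given only the context it needs.
    return Summary(extracted.source_id, "; ".join(extracted.facts[:3]))

result = summarize(extract_facts("doc-1", "Bridge intact.\nRoad blocked."))
print(result.text)  # "Bridge intact.; Road blocked."
```

Because each stage's output is a typed value rather than free text, failures are caught where they occur, and each sub-task can be evaluated against its own dataset.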
Because the workflow is scoped in advance, we can build guardrails that drastically reduce errors caused by hallucinations. Specifically, generative LLMs work auto-regressively: text is generated one token at a time, predicting the most probable next token given the previous ones. This intrinsic mechanism of current LLMs (based on decoder-style transformer architectures) makes them prone to hallucinate - that is, to produce fluent but erroneous content by following an incorrect line of reasoning.
The more complex and lengthy the task, the higher the likelihood of the model “derailing” from correct reasoning. For this reason, decomposing the process into smaller, verifiable stages is essential to ensure reliability.
Additionally, an LLM alone — such as ChatGPT — is a simple system: it takes text as input and produces text as output, token by token.
The amount of text that can be processed in one pass is limited (in theory, over 200k tokens for the most recent models, but in practice often fewer than 50k useful tokens[1]), as is the length of the output (typically a few thousand words before hallucinations appear).
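One practical consequence is that inputs must be budgeted and chunked before any model call. The sketch below uses a rough rule of thumb (about 0.75 words per token; the real count depends on the tokenizer) and an illustrative 50k-token usable budget - both are assumptions, not fixed properties of any model.

```python
# Conservative usable-context budget, per the discussion above (an assumption).
USABLE_CONTEXT_TOKENS = 50_000

def estimate_tokens(text: str) -> int:
    # Rule of thumb: roughly 0.75 words per token in English text.
    return int(len(text.split()) / 0.75)

def chunk_words(text: str, max_tokens: int) -> list[str]:
    """Split text into word-boundary chunks that each fit the token budget."""
    words = text.split()
    step = int(max_tokens * 0.75)  # approximate words per chunk
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

doc = "word " * 120_000  # a document far larger than the usable context
if estimate_tokens(doc) > USABLE_CONTEXT_TOKENS:
    chunks = chunk_words(doc, USABLE_CONTEXT_TOKENS)
    print(len(chunks))  # the document is split before any model call
```

Production systems would use the model's actual tokenizer for exact counts, but the principle is the same: never send the model more than it can usefully attend to.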
For instance, it is not possible to synthesize hundreds of documents with ChatGPT. For now, combinations of other machine learning techniques are better suited to this purpose - such as extracting and normalizing the documents, performing topic modeling, and then synthesizing the documents into a coherent and exhaustive summary.
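The shape of such a combined approach is map-reduce: group documents by topic, summarize each group, then merge the group summaries. In this hypothetical sketch, trivial stand-ins replace the real topic-modeling and LLM-summarization steps; only the structure is the point.

```python
from collections import defaultdict

# Illustrative inputs; a real corpus would be hundreds of extracted documents.
docs = [
    "logistics: fuel resupply delayed",
    "logistics: ammunition stocks low",
    "weather: heavy rain expected",
]

def topic_of(doc: str) -> str:
    return doc.split(":")[0]  # stand-in for a real topic-modeling step

def summarize_group(topic: str, group: list[str]) -> str:
    return f"{topic}: {len(group)} reports"  # stand-in for an LLM summary call

# Map: assign each document to a topic group.
groups = defaultdict(list)
for doc in docs:
    groups[topic_of(doc)].append(doc)

# Reduce: summarize each group, then merge into one synthesis.
final = "; ".join(summarize_group(t, g) for t, g in groups.items())
print(final)  # "logistics: 2 reports; weather: 1 reports"
```

Each group summary fits comfortably in a model's context even when the full corpus does not, which is what makes the decomposition scale.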
An LLM is only a tool - a very large matrix of numerical parameters. The real risk lies primarily in how it is deployed. In our products, the actions that an LLM can perform are strictly controlled by typed outputs: it can perform only predefined tasks and cannot act outside the boundaries strictly defined by Comand AI's code.
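A minimal sketch of this kind of guardrail, assuming a hypothetical whitelist of actions: the model's raw output must parse into one of a fixed set of typed values, and anything else is rejected before it can have any effect.

```python
import json
from enum import Enum

# Hypothetical whitelist: the only actions the surrounding code will ever execute.
class Action(Enum):
    SUMMARIZE = "summarize"
    EXTRACT = "extract"
    TRANSLATE = "translate"

def parse_action(llm_output: str) -> Action:
    payload = json.loads(llm_output)  # must be valid JSON, or this raises
    return Action(payload["action"])  # must be a whitelisted value, or this raises

print(parse_action('{"action": "summarize"}'))  # Action.SUMMARIZE
try:
    parse_action('{"action": "delete_files"}')  # not whitelisted -> rejected
except ValueError as exc:
    print("rejected:", exc)
```

The key property is that the model never acts directly: it only produces data, and deterministic code decides whether that data maps to a permitted operation.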
In the context of our on-premises deployments:
Comand AI is developing a complete system to perform military planning and lessons learned workflows end-to-end using AI techniques. LLMs are part of this platform, but they are used as specialized subsystems to accomplish certain sub-processes, in combination with other AI methods when more appropriate. In particular, we do not use LLMs to execute the entire process from start to finish, as systems like ChatGPT might attempt to do.
We have analyzed the workflows and decomposed them into sub-processes. Each sub-process is validated and tested using datasets; inputs and outputs are optimized, structured, and verified at every stage.
Certain workflow stages — such as producing synthesis notes over a large number of documents — are simply not feasible with LLMs alone. For these, we use complementary techniques alongside LLMs to accomplish the required processes.
[1] YaRN: Efficient Context Window Extension of Large Language Models