Unlocking knowledge — without the cloud and without LLMs
Case Study: NLP-based knowledge extraction from three decades of research documentation
Putting it in context: Why not simply use an LLM?
Large Language Models and Agentic AI are fascinating technologies — and we deploy them where they deliver the greatest value. Yet reaching reflexively for an LLM as a first step is in many cases unnecessary and overpriced, and it introduces uncertainties: hallucinations, a lack of traceability, data-protection risks from cloud dependency, and high ongoing costs.
This project demonstrates the alternative: classical NLP methods deliver better, more reproducible results for structured tasks such as keyword extraction, clustering and summarisation — with full data protection, full control and no ongoing API costs. They run entirely locally, are transparent and deterministic.
This does not mean forgoing modern methods. It means choosing the right method for the right problem. Every component used here can be extended with Large Language Models at any time — ideally open source and operated locally. The architecture is designed for it.
Starting position
The research and development department of an internationally leading automotive group held an extensive archive: more than 10,000 documents from three decades of research. Reports, studies, technical documentation — written by different authors, in various formats, with heterogeneous quality and structure.
This archive represented an enormous knowledge asset. Yet it was effectively inaccessible.
Challenge
The existing search function was rudimentary and returned insufficient results. Anyone researching a specific topic had to know exactly what they were looking for — exploratory discovery was impossible. Thematic connections between documents remained invisible. Manual review of the entire archive was out of the question at this scale.
A central constraint compounded the challenge: the documents were confidential. Cloud-based solutions or external APIs were ruled out entirely on data protection grounds. Everything had to run on-premise. For security reasons, the client could not grant direct access to production data — development took place on a representative test corpus.
The solution also had to function as a self-service tool: after project completion, the client needed to be able to continue working independently, without permanent external dependency.
Objectives
Three core objectives defined the project:
- Automatic content indexing — Every document was to be machine-enriched with keywords, summaries and thematic classifications
- Thematic exploration — Users should be able to browse the archive like a library, discovering connections between documents that were previously invisible
- Independent continued use — The tool had to be built so the client could operate and extend it without external help
Approach
The solution deliberately relied on robust, proven computational linguistics methods — no experimental approaches, no cloud dependencies, no proprietary models. Three core NLP components were developed and combined:
Keyword Extraction
Three different algorithms were implemented and compared in parallel: RAKE (frequency-based, fast), PMI (distinguishes document-specific from general terms) and Gensim/TextRank (ranking by information content). Combining multiple approaches delivered significantly more robust results than any single method.
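The case study does not specify how the three keyword rankings were merged; one simple and robust combination scheme is reciprocal rank fusion, sketched here in plain Python. The keyword lists below are invented for illustration — they stand in for the per-document output of the RAKE-, PMI- and TextRank-style extractors.

```python
from collections import defaultdict

def fuse_keyword_rankings(rankings, k=60):
    """Combine ranked keyword lists via reciprocal rank fusion (RRF).

    Each ranking is a list of keywords ordered best-first. A keyword's
    fused score is the sum of 1 / (k + rank) over all rankings that
    contain it, so terms that several extractors agree on rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, keyword in enumerate(ranking, start=1):
            scores[keyword] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of three extractors for one document
rake_like = ["battery degradation", "thermal model", "test bench"]
pmi_like = ["thermal model", "battery degradation", "cell chemistry"]
textrank_like = ["thermal model", "test bench", "battery degradation"]

fused = fuse_keyword_rankings([rake_like, pmi_like, textrank_like])
print(fused[:2])  # → ['thermal model', 'battery degradation']
```

A term proposed by only one extractor ("cell chemistry") survives but ranks last, which matches the observation above: the ensemble is more robust than any single method.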
Text Summarisation
Automatic summaries of varying lengths were generated for each document. The TextRank algorithm identifies sentences that capture the most central concepts of a document — domain-independent and without training data.
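A minimal sketch of extractive TextRank summarisation, assuming TF-IDF cosine similarity between sentences and a hand-rolled PageRank power iteration; the production pipeline's exact implementation may differ, and the sample text is invented:

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_summary(text, n_sentences=2, damping=0.85, iters=50):
    """Extractive summary: rank sentences by centrality in a similarity graph."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= n_sentences:
        return sentences
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()       # pairwise sentence similarities
    np.fill_diagonal(sim, 0.0)
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0           # guard isolated sentences
    trans = sim / row_sums                  # transition probabilities
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                  # PageRank power iteration
        scores = (1 - damping) / n + damping * (trans.T @ scores)
    top = sorted(np.argsort(scores)[-n_sentences:])  # keep document order
    return [sentences[i] for i in top]

text = (
    "The archive holds thirty years of research reports. "
    "Many reports describe battery tests on shared test benches. "
    "Other documents cover aerodynamic simulations and wind tunnel work. "
    "Battery test reports and test bench data dominate the collection."
)
summary = textrank_summary(text, n_sentences=2)
```

Because the method only scores sentences against each other, it needs no training data and no domain knowledge — exactly the property the project relied on.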
Clustering
K-Means clustering on TF-IDF vectors grouped the 10,000-plus documents into thematic clusters. Silhouette scores served as quality measures. Visualisation via dimensionality reduction (TruncatedSVD) made the thematic landscape of the archive graspable at a glance.
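The clustering step can be sketched with scikit-learn on a toy corpus; the document texts and cluster count below are illustrative, while the real pipeline ran over the 10,000-plus document archive:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import silhouette_score

docs = [
    "battery cell chemistry and electrode materials",
    "electrode coating for lithium battery cells",
    "aerodynamic drag simulation in the wind tunnel",
    "wind tunnel tests of vehicle aerodynamics",
    "machine learning for sensor data fusion",
    "neural networks for sensor signal processing",
]

# TF-IDF vectors -> K-Means clusters, quality-checked via silhouette score
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km.labels_)

# 2-D projection of the thematic landscape for visualisation
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
```

TruncatedSVD is used rather than PCA because it operates directly on the sparse TF-IDF matrix without densifying it — relevant at archive scale.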
All results were stored in an efficient data format (Parquet), enabling fast access and easy extension.
Implementation
Development followed a strictly iterative process in close coordination with the client. Regular interim reviews ensured that the results stayed aligned with actual user needs rather than drifting away from them.
Data quality posed a particular challenge: documents spanning three decades brought wildly different formats, writing styles and structures. Some documents were very short and provided little context for the algorithms. The NLP pipeline had to be robust enough to deliver meaningful results even for these edge cases.
The centrepiece became the NLP Inspector — an interactive user interface that brought all components together. Users could search by text, but — and this was the real breakthrough — they could also simply navigate the thematic landscape. Like browsing the shelves of a library and stumbling upon unexpected connections.
The cluster view showed at a glance which subject areas the archive contained and how they related. Clicking a cluster revealed the associated documents with their keywords and summaries. Named entity recognition complemented the indexing with people, organisations and places.
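The source does not name the NER library used; as an illustration, here is a rule-based sketch with spaCy's EntityRuler — in practice a pretrained statistical pipeline would do this job, and every name and pattern below is invented:

```python
import spacy

# Blank English pipeline with a rule-based entity component
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Dr. Weber"},
    {"label": "ORG", "pattern": "Materials Lab"},
    {"label": "GPE", "pattern": "Munich"},
])

doc = nlp("Dr. Weber from the Materials Lab presented the results in Munich.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```

The extracted people, organisations and places then feed the document index alongside keywords and summaries.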
Results
For the first time, the knowledge accumulated over three decades of research was not merely archived but genuinely accessible and explorable. Contexts and connections that had previously been invisible became transparent.
The 10,859 documents were grouped into 9 thematic clusters representing different research priorities — from biodiversity and climate research to materials science and algorithmic methods.
Additional research expenditure was avoided because existing knowledge could be built upon instead of unknowingly duplicating work. The tool enabled systematic checks before starting new research projects to see what already existed on a topic.
The project was continued and expanded in-house by the client after completion — the strongest signal of a solution's value: the client invests their own budget in further development.
What changed
Before the project: an archive nobody used. Institutional knowledge that walked out the door whenever a staff member left. Research that took days — or simply never happened, because nobody knew what already existed.
After the project: access to three decades of research documentation within seconds. Internal research that previously required asking the right expert is now self-service. Fully on-premise, fully operated in-house — no external dependency after project completion. Continued and expanded by the client ever since.
Success factors and lessons learned
The strongest impact came not from algorithmic sophistication, but from how results were presented. The library metaphor — browsing through knowledge rather than searching for it — made the tool intuitively usable for non-technical staff and generated immediate enthusiasm.
This project was completed entirely without cloud services, without Large Language Models and without proprietary AI platforms. It demonstrates that robust, classical NLP methods — properly combined and well visualised — can deliver transformative business value.
From the outset, the tool was built so the client could operate and extend it independently. No ongoing licence costs, no external dependency. This investment in self-sufficiency paid off: the client continued developing the solution after project completion.
Close collaboration with the client in regular review cycles ensured the solution remained aligned with actual needs — not with technical possibilities that look impressive but miss the user.
"We had decades of knowledge in our archive but couldn't use it. The NLP solution enabled us for the first time to discover thematic connections that had previously been hidden — completely securely, without a single document leaving our network."
— Head of Research & Development
This case study describes an anonymised client project. Industry and company context are accurately represented; identifying details have been changed.