Mujtaba Ayub

Document Intelligence, Semantic Search

What do fifteen years of journal entries say?

Conversing with 15 Years of Journal Entries

Using natural language processing to tag and semantically analyse unstructured content, then making it queryable through LLMs.

Personal build

Journal archive pipeline: fifteen years of raw entries are tagged, embedded, and made searchable and queryable through an AI assistant. Three stages, left to right: a stack of dated raw entries (3,500+ entries, with sample dates 2011-03-14, 2017-08-02, and 2024-11-19); a tagging and embedding step (themes such as career, travel, family, and health, plus semantic vectors); and a query interface returning a cited answer (example question: "When did I first write about moving abroad?"; example answer: "First mention: 2013", citing 3 source entries).

The problem

Fifteen years of writing, drafts, entries, and notes, locked in formats that can't be queried. It is the same problem every organization has with its customer feedback, meeting notes, and research archives.

The approach

An ingestion pipeline pulls and semantically buckets the material, an LLM tags each entry against an evolving theme taxonomy, and embeddings make the whole archive semantically searchable. The tagged, embedded archive is then made accessible through a dashboard and conversational AI assistants.
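The stages described above can be illustrated with a minimal sketch. It is not the project's actual code: the keyword-based `tag_entry` stands in for the LLM tagging step, the bag-of-words `embed` stands in for a sentence-transformers model, and the taxonomy, table schema, function names, and sample entries are all invented for illustration. Only the overall shape (store in SQLite, tag, embed, rank by similarity) follows the description.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical theme taxonomy. The real pipeline is described as using an
# LLM and an evolving taxonomy; fixed keyword sets are a stand-in.
TAXONOMY = {
    "travel": {"abroad", "flight", "trip", "moving"},
    "career": {"job", "promotion", "office"},
    "family": {"mother", "brother", "family"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag_entry(text):
    """Stand-in for LLM tagging: a theme applies if any of its keywords appear."""
    words = set(tokenize(text))
    return sorted(theme for theme, keys in TAXONOMY.items() if words & keys)


def embed(text):
    """Stand-in for a sentence-transformers embedding: a word-count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_archive(entries):
    """Store (date, body) pairs in an in-memory SQLite table along with their tags."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "id INTEGER PRIMARY KEY, written_on TEXT, body TEXT, tags TEXT)"
    )
    for written_on, body in entries:
        conn.execute(
            "INSERT INTO entries (written_on, body, tags) VALUES (?, ?, ?)",
            (written_on, body, ",".join(tag_entry(body))),
        )
    conn.commit()
    return conn


def search(conn, query, top_k=3):
    """Rank stored entries by similarity to the query and drop non-matches."""
    q = embed(query)
    rows = conn.execute("SELECT written_on, body, tags FROM entries").fetchall()
    scored = [(cosine(q, embed(body)), written_on, body, tags)
              for written_on, body, tags in rows]
    scored.sort(key=lambda row: row[0], reverse=True)
    return [row for row in scored[:top_k] if row[0] > 0]


# Invented sample entries, not taken from the actual archive.
SAMPLE = [
    ("2013-05-01", "Thinking about moving abroad for a new job."),
    ("2016-02-10", "Long trip home to see family."),
    ("2020-09-03", "Quiet day at the office."),
]

if __name__ == "__main__":
    archive = build_archive(SAMPLE)
    for score, written_on, body, tags in search(archive, "moving abroad"):
        print(f"{written_on} [{tags}] {score:.2f} {body}")
```

In a fuller version, `embed` would presumably call a sentence-transformers model and the ranked entries would be passed to an LLM as cited context for the answer. Those parts are left out here because the page does not specify them.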

What it revealed

Themes and sentiment trace arcs that are invisible at the level of a single entry. With this architecture, any archive of unstructured text, notes, charts, or visuals can become a meaningful, queryable asset.

The same architecture applies to research repositories, presentation deck archives, customer feedback, meeting notes, or field survey notes. It reanimates years of unstructured material that would otherwise slip into forgotten archives.
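One way theme arcs of the kind mentioned above might be surfaced is by counting tags per year once every entry carries a date and a list of themes. The sketch below assumes that input shape, which is a guess at how tagged entries could be represented. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def theme_counts_by_year(tagged_entries):
    """Count how often each theme appears per year.

    tagged_entries: iterable of (iso_date, list_of_themes) pairs,
    an assumed representation of the tagging step's output.
    """
    counts = defaultdict(Counter)
    for written_on, tags in tagged_entries:
        year = written_on[:4]  # ISO dates start with the four-digit year
        counts[year].update(tags)
    return {year: dict(themes) for year, themes in sorted(counts.items())}


# Invented sample data, not taken from the actual archive.
SAMPLE_TAGGED = [
    ("2013-05-01", ["travel", "career"]),
    ("2013-09-12", ["travel"]),
    ("2020-01-01", ["family"]),
]

if __name__ == "__main__":
    for year, themes in theme_counts_by_year(SAMPLE_TAGGED).items():
        print(year, themes)
```

Plotting these yearly counts on a dashboard would make a rise or fall in a theme visible even when no single entry shows it.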

Under the hood

Python · Gmail API · SQLite · LLM tagging · sentence-transformers · Natural Language Processing