Agentic Process Discovery (v2)
Architecture Diagram
Indexing Flow

Query / Inference Flow

Introduction
This system implements Agentic Process Discovery (APD), a technique for reverse-engineering recurring administrative processes (e.g., “expense approvals”, “meeting scheduling”, “project handoffs”) from unstructured communication data such as Gmail messages and WhatsApp chats. The user provides a loose natural-language topic, and the system uses a local LLM and a vector database to synthesize a structured process narrative broken down into ordered, discrete subtasks. PII (personal names and phone numbers) is anonymised at display time using spaCy NER and regular expressions.
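The display-time masking described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `mask_pii` helper, the `[PERSON]` and `[PHONE]` placeholders, and the phone pattern are all assumptions, and the list of names is passed in by hand where the real system would take it from spaCy NER.

```python
import re

# Assumed, deliberately loose phone pattern: an optional "+", then digits that
# may be separated by spaces, dots, hyphens, or parentheses. It can over-match
# other number-like strings (for example ISO dates), so treat it as a sketch.
PHONE_RE = re.compile(r"(?<!\w)\+?\(?\d[\d\s().-]{6,}\d(?!\w)")


def mask_pii(text, person_names):
    """Mask phone numbers and the given person names in `text`.

    `person_names` stands in for the PERSON entities an NER model would return.
    """
    masked = PHONE_RE.sub("[PHONE]", text)
    # Replace longer names first so a full name is masked before its first name.
    for name in sorted(set(person_names), key=len, reverse=True):
        masked = re.sub(rf"\b{re.escape(name)}\b", "[PERSON]", masked)
    return masked


if __name__ == "__main__":
    sample = "Priya Nair asked Tom to call +44 20 7946 0958 about the invoice."
    print(mask_pii(sample, ["Priya Nair", "Tom"]))
```

With spaCy installed and an English pipeline loaded as `nlp`, the names could plausibly be collected with `[ent.text for ent in nlp(text).ents if ent.label_ == "PERSON"]` before calling the helper.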
Workflow
The workflow has two main phases:
- Indexing Phase:
  - Data Ingestion: Fetch emails from Gmail using the Gmail API (filtering replies) and parse WhatsApp chat exports into conversation windows.
  - Vector Embedding: Create L2-normalised embeddings using the `BAAI/bge-base-en-v1.5` model.
  - Vector Database: Store documents in ChromaDB in separate collections (`emails` and `whatsapp`).
- Query & Inference Phase:
  - Query Generation: Use a local LLM (`llama3.2` via Ollama) to generate 4-5 stage-targeted retrieval queries representing different parts of a workflow based on the user’s topic.
  - Evidence Retrieval: Search ChromaDB per query to retrieve top-N results and deduplicate them by subject and sender.
  - Narrative Generation: Use the LLM to synthesize a structured process narrative with ordered subtasks, triggers, owners, outputs, and friction points.
  - PII Anonymization: Apply spaCy NER and regex to mask names and phone numbers before displaying the narrative and evidence in the Gradio UI.
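To make the embedding and evidence-retrieval steps above concrete, here is a small plain-Python sketch. It is a stand-in under stated assumptions rather than the project's implementation: in-memory dictionaries replace the ChromaDB collections, the toy two-dimensional vectors replace real model embeddings, and the function names and the case-insensitive deduplication key are hypothetical choices.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_n(query_vec, docs, n):
    """Rank docs by dot product; on unit vectors this equals cosine similarity."""
    q = l2_normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, d["embedding"])), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:n]]


def dedupe(hits):
    """Keep the first hit for each (subject, sender) pair, preserving order."""
    seen, unique = set(), []
    for hit in hits:
        # Assumption: subjects and senders are compared case-insensitively.
        key = (hit["subject"].strip().lower(), hit["sender"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


def retrieve(query_vecs, docs, n=3):
    """Run every stage query, pool the hits, then deduplicate them."""
    pooled = [hit for q in query_vecs for hit in top_n(q, docs, n)]
    return dedupe(pooled)


if __name__ == "__main__":
    docs = [
        {"subject": "Expense claim", "sender": "a@x.com", "embedding": l2_normalize([1.0, 0.0])},
        {"subject": "expense claim ", "sender": "A@x.com", "embedding": l2_normalize([0.9, 0.1])},
        {"subject": "Approval", "sender": "b@x.com", "embedding": l2_normalize([0.0, 1.0])},
    ]
    for hit in retrieve([[1.0, 0.0], [0.0, 1.0]], docs, n=2):
        print(hit["subject"], hit["sender"])
```

Because the stored vectors are L2-normalised, a plain dot product is enough for ranking, and pooling before deduplication means a message retrieved by several stage queries appears only once in the evidence.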
Apart from Gmail ingestion, the system uses only local inference and local vector storage, so sensitive administrative workflows and correspondence remain private.
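One way to keep the query-generation step independent of any particular model backend is to pass the model call in as a plain callable. The sketch below assumes exactly that: `build_prompt`, `parse_queries`, and `generate_queries` are hypothetical names, the prompt wording and the example stage labels are illustrative, and `llm` can be any function that takes a prompt string and returns text (for instance a wrapper around a locally served model).

```python
import re


def build_prompt(topic, n_queries=5):
    """Ask for one retrieval query per line, each aimed at a different stage."""
    return (
        f"Topic: {topic}\n"
        f"Write {n_queries} short search queries, one per line, each targeting a "
        "different stage of this process (e.g. request, review, approval, follow-up)."
    )


def parse_queries(raw, limit=5):
    """Strip bullets and numbering from model output; drop blanks and duplicates."""
    queries = []
    for line in raw.splitlines():
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip().strip('"')
        if cleaned and cleaned.lower() not in (q.lower() for q in queries):
            queries.append(cleaned)
    return queries[:limit]


def generate_queries(topic, llm, n_queries=5):
    """`llm` is any callable mapping a prompt string to the model's text reply."""
    return parse_queries(llm(build_prompt(topic, n_queries)), limit=n_queries)


if __name__ == "__main__":
    # A canned reply stands in for the local model so the sketch runs offline.
    def fake_llm(prompt):
        return "1. expense claim submitted\n2) manager approval of expenses\n- reimbursement payment sent"

    print(generate_queries("expense approvals", fake_llm))
```

Parsing the reply defensively matters here because small local models do not always follow a requested output format; anything that survives parsing can be sent to the vector store as a separate stage query.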
Technologies:
Python, ChromaDB, Ollama, Llama3.2, Gradio, Hugging Face Transformers, spaCy, Gmail API
Concepts / Algorithms:
Agentic Process Discovery, Vector Embeddings, RAG, Named Entity Recognition (NER), Semantic Search, LLM Inference
