Agentic Process Discovery (v2)

Architecture Diagram

[Figure: Indexing Flow Architecture]

[Figure: Query / Inference Flow Architecture]

Introduction

This system implements Agentic Process Discovery (APD), a technique for reverse-engineering recurring administrative processes (e.g., “expense approvals”, “meeting scheduling”, “project handoffs”) from unstructured communication data such as Gmail messages and WhatsApp chats. The user provides a loose natural-language topic, and the system uses a local LLM and a vector database to synthesize a structured process narrative decomposed into ordered, discrete subtasks. PII (personal names and phone numbers) is anonymized at display time using spaCy NER and regular expressions.
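
To make the display-time masking concrete, here is a minimal sketch of how name and phone-number masking could be combined. The phone pattern, the placeholder strings, and the function names are assumptions for illustration and are not taken from the project's code. The name spans are passed in by hand so the example runs without spaCy installed.

```python
import re
from typing import Iterable, Tuple

# Assumed phone pattern: an optional "+", then a run of at least eight
# characters that starts and ends with a digit and contains only digits,
# spaces, dots, hyphens, or parentheses in between.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{6,}\d(?!\w)")


def mask_spans(text: str, spans: Iterable[Tuple[int, int]],
               placeholder: str = "[NAME]") -> str:
    """Replace each (start, end) character span with a placeholder.

    Spans are applied right to left so that earlier offsets stay valid.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


def anonymize(text: str, person_spans: Iterable[Tuple[int, int]]) -> str:
    """Mask name spans first (offsets refer to the raw text), then phones."""
    return PHONE_RE.sub("[PHONE]", mask_spans(text, person_spans))


# With spaCy installed, the spans could be derived from an NER pass, e.g.
#   [(e.start_char, e.end_char) for e in nlp(text).ents if e.label_ == "PERSON"]
# Here they are supplied by hand so the sketch runs without a model download.
message = "Call Priya on +44 20 7946 0958 or 555-010-1234 before 5 pm."
start = message.index("Priya")
masked = anonymize(message, [(start, start + len("Priya"))])
print(masked)  # Call [NAME] on [PHONE] or [PHONE] before 5 pm.
```

The pattern is deliberately broad, so it would also mask date-like digit runs such as 2024-01-15. A real deployment would likely need a stricter pattern or a digit-count check.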

Workflow

The workflow has two main phases:

  1. Indexing Phase:
    • Data Ingestion: Fetch emails from Gmail using the Gmail API (filtering replies) and parse WhatsApp chat exports into conversation windows.
    • Vector Embedding: Create L2-normalized embeddings using the BAAI/bge-base-en-v1.5 model.
    • Vector Database: Store documents in ChromaDB in separate collections (emails and whatsapp).
  2. Query & Inference Phase:
    • Query Generation: Use a local LLM (llama3.2 via Ollama) to turn the user’s topic into 4-5 retrieval queries, each targeting a different stage of the workflow.
    • Evidence Retrieval: Search ChromaDB once per query, retrieve the top-N results for each, and deduplicate the merged results by subject and sender.
    • Narrative Generation: Use the LLM to synthesize a structured process narrative with ordered subtasks, triggers, owners, outputs, and friction points.
    • PII Anonymization: Apply spaCy NER and regex to mask names and phone numbers before displaying the narrative and evidence in the Gradio UI.
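
The embedding and evidence-retrieval steps above can be sketched without any external services. The example below swaps ChromaDB for a toy in-memory index and uses hand-written two-dimensional vectors instead of real BAAI/bge-base-en-v1.5 embeddings. The document shape (`subject`, `sender`, `text`), the function names, and the case-insensitive deduplication key are all assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Doc = Dict[str, str]                   # assumed shape: subject, sender, text
Index = List[Tuple[List[float], Doc]]  # toy stand-in for a vector collection


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_n(query_vec: Sequence[float], index: Index, n: int) -> List[Tuple[float, Doc]]:
    """Rank documents by dot product, which equals cosine similarity
    when both the query and the stored vectors are L2-normalized."""
    q = l2_normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for vec, doc in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def retrieve_deduplicated(query_vecs: Sequence[Sequence[float]],
                          index: Index, n: int = 3) -> List[Doc]:
    """Merge per-query hits, keeping the first hit per (subject, sender)."""
    seen = set()
    merged: List[Doc] = []
    for query_vec in query_vecs:
        for _score, doc in top_n(query_vec, index, n):
            # Assumption: subject and sender are compared case-insensitively.
            key = (doc["subject"].strip().lower(), doc["sender"].strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged


index: Index = [
    (l2_normalize([1.0, 0.0]),
     {"subject": "Expense report", "sender": "a@example.com", "text": "Please approve."}),
    (l2_normalize([0.9, 0.1]),
     {"subject": "expense report", "sender": "A@example.com", "text": "Reminder."}),
    (l2_normalize([0.0, 2.0]),
     {"subject": "Reimbursement paid", "sender": "b@example.com", "text": "Done."}),
]

# Two stage-targeted queries, represented directly as vectors.
evidence = retrieve_deduplicated([[1.0, 0.0], [0.0, 1.0]], index, n=2)
print([doc["subject"] for doc in evidence])  # ['Expense report', 'Reimbursement paid']
```

Normalizing at indexing time means a plain dot product is enough for ranking. Deduplicating after merging keeps a thread that matches several queries from dominating the evidence passed to the narrative step.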

Apart from Gmail ingestion, the system runs inference and vector storage entirely locally, so sensitive administrative workflows and correspondence remain private.
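
As an illustration of what local inference looks like in practice, the sketch below builds a query-generation request for an Ollama server on its default local endpoint, so the prompt never leaves the machine. The prompt wording, the function names, and the line-based output parsing are assumptions and not the project's actual prompts. The network call is wrapped in a function and is not executed at import time.

```python
import json
import re
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_query_prompt(topic: str, n_queries: int = 5) -> str:
    """Hypothetical prompt asking for one retrieval query per workflow stage."""
    return (
        f"List {n_queries} short search queries, one per line, that each target "
        f"a different stage of the following workflow: {topic}"
    )


def build_request(topic: str, model: str = "llama3.2") -> urllib.request.Request:
    """Prepare a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": build_query_prompt(topic), "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_queries(raw: str) -> List[str]:
    """Split model output into one query per non-empty line, dropping list markers."""
    queries = []
    for line in raw.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


def generate_queries(topic: str) -> List[str]:
    """Send the request to the local server (requires Ollama to be running)."""
    with urllib.request.urlopen(build_request(topic), timeout=120) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return parse_queries(body["response"])


# Parsing can be checked offline with a canned model reply.
print(parse_queries("1. expense submitted\n2) manager approval\n\n- reimbursement paid\n"))
# ['expense submitted', 'manager approval', 'reimbursement paid']
```

Calling `generate_queries("expense approvals")` would return the parsed queries once an Ollama server with the `llama3.2` model is available locally.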

Technologies:
Python, ChromaDB, Ollama, Llama3.2, Gradio, Hugging Face Transformers, spaCy, Gmail API

Concepts / Algorithms:
Agentic Process Discovery, Vector Embeddings, RAG, Named Entity Recognition (NER), Semantic Search, LLM Inference