Sherlock-docs - Intelligent Legal Document Processing

Sherlock-docs is an intelligent processing system for judicial documents (tutelas and habeas corpus) that combines OCR, named entity recognition (NER), and duplicate detection. All processing is 100% local, with no data sent to external APIs.

Key Features

Hybrid OCR: PaddleOCR + Tesseract for scanned and digital documents
Legal NER: Entity extraction with spaCy, reaching an F1 score of 85.3% on human-validated data
Multi-Level Duplicate Detection: SHA-256 hash, LZJD fuzzy hash, TF-IDF, Sentence-Transformers
Advanced Search: SQLite FTS5 full-text search with advanced filtering
Complete REST API: 22 FastAPI endpoints with JWT authentication and Swagger documentation
Graphical Interface: 9 Streamlit pages for interactive document management
CLI: 20 commands for batch operations
Active Learning: Interface for human validation of NER entities
Export: Excel report generation
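
The multi-level duplicate detection listed above can be pictured as a cascade: a cheap exact check first, then a more expensive similarity check only when needed. The following is a minimal sketch of two of those levels (SHA-256 and TF-IDF), written with the standard library only. The function names, the 0.8 threshold, and the choice to hash the text rather than the original file bytes are assumptions for illustration, not the project's actual implementation; the LZJD and Sentence-Transformers levels are omitted.

```python
import hashlib
import math
import re
from collections import Counter


def sha256_hex(text: str) -> str:
    """Level 1: exact fingerprint. Identical input gives an identical digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build smoothed TF-IDF vectors for a small in-memory corpus."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare(doc_a: str, doc_b: str, threshold: float = 0.8) -> str:
    """Return 'exact', 'near', or 'distinct' (threshold is a hypothetical value)."""
    if sha256_hex(doc_a) == sha256_hex(doc_b):
        return "exact"
    vec_a, vec_b = tfidf_vectors([doc_a, doc_b])
    return "near" if cosine(vec_a, vec_b) >= threshold else "distinct"


if __name__ == "__main__":
    original = "El accionante solicita la protección del derecho a la salud"
    refiled = "El accionante solicita la protección del derecho fundamental a la salud"
    print(compare(original, refiled))
```

Running the hash check first means byte-identical refilings never reach the costlier text comparison, which is the usual reason for ordering the levels this way.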

Technologies Used

Backend: Python 3.12.4, FastAPI, Pydantic
OCR: PaddleOCR + Tesseract
NLP/NER: spaCy (Spanish legal model)
Database: SQLite + FTS5 (33 columns, 11 indexes, WAL mode)
Frontend: Streamlit (9 pages)
Architecture: Layered Clean Architecture, Result-Oriented Programming (via the returns library)
Security: JWT, RBAC, rate limiting, security headers
Deployment: Docker, Easypanel, automated GitHub Webhook
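
To illustrate the SQLite + FTS5 entry above, here is a minimal sketch of full-text search using Python's built-in sqlite3 module. The table name, the two columns, and the sample case identifiers are hypothetical; the real schema (33 columns, 11 indexes) is not shown in this document. The sketch also assumes the local SQLite build was compiled with FTS5, which is the case for most Python distributions.

```python
import sqlite3


def build_index(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load (case_id, body) rows into it."""
    conn = sqlite3.connect(":memory:")
    # case_id is stored but not tokenized; only body is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE documents_fts USING fts5(case_id UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO documents_fts (case_id, body) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return matching case ids, best match first (FTS5 built-in rank)."""
    cursor = conn.execute(
        "SELECT case_id FROM documents_fts "
        "WHERE documents_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    conn = build_index([
        ("T-001", "accion de tutela por derecho a la salud"),
        ("HC-002", "habeas corpus por detencion ilegal"),
    ])
    print(search(conn, "tutela"))
    print(search(conn, "habeas AND corpus"))
```

Passing the query as a bound parameter keeps user input out of the SQL string, although the query text itself is still interpreted using FTS5 query syntax (AND, OR, NOT, quoted phrases).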

Architecture

Five layers, wired together by a ServiceContainer that exposes 14 lazy-loaded properties:

Core: Entities, value objects
Application: Use cases, DTOs, ServiceContainer
Infrastructure: OCR, NER, deduplication, logging
Persistence: SQLite + FTS5 with ISP ports
Interfaces: Streamlit GUI, CLI (20 commands), FastAPI REST (22 endpoints)
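
The lazy-loaded ServiceContainer mentioned above can be sketched with functools.cached_property, which builds each service on first access and reuses it afterwards. The service classes, property names, and the created list below are hypothetical stand-ins; the project's container and its 14 properties may be implemented differently.

```python
from functools import cached_property


class OcrService:
    """Hypothetical stand-in for an expensive-to-load OCR engine."""

    def extract_text(self, path: str) -> str:
        return f"text from {path}"


class NerService:
    """Hypothetical stand-in for a NER model wrapper."""

    def extract_entities(self, text: str) -> list[str]:
        # Placeholder logic: treat capitalized words as entities.
        return [word for word in text.split() if word.istitle()]


class ServiceContainer:
    """Builds each service only when it is first requested."""

    def __init__(self) -> None:
        # Records construction order, only to make the laziness observable.
        self.created: list[str] = []

    @cached_property
    def ocr(self) -> OcrService:
        self.created.append("ocr")
        return OcrService()

    @cached_property
    def ner(self) -> NerService:
        self.created.append("ner")
        return NerService()


if __name__ == "__main__":
    container = ServiceContainer()
    print(container.created)   # nothing has been built yet
    container.ocr.extract_text("tutela.pdf")
    print(container.created)   # only the OCR service has been built
```

With this shape, an interface that only needs search or export never pays the startup cost of loading OCR or NER models.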

Achievements

1,832 tests (1,770 unit + 62 integration) with 87% coverage
NER F1 85.3% with data validated by human operators
22 REST endpoints documented and functional
Security: all 16 SEC items plus 4 SEC-API items implemented
9 sprints completed (S23-S28) with 30 planning documents
mypy 0 errors in strict mode across 130 files
Certified SDD audit: 39 conforming, 0 defects
Code quality: 9.1/10

Impact

Sherlock-docs processes about 100 documents per day, with performance targets of under 2 minutes for a 20-page document and under 100 ms for a digital document. It eliminates manual classification of judicial documents and automatically detects duplicate filings.

This system was born from the need to automate the registration and classification of tutelas and habeas corpus in the Colombian judicial system, where duplicate detection and accurate extraction of procedural party information are critical to judicial office efficiency.