Sherlock-docs - Intelligent Legal Document Processing

Dec 1, 2025

Sherlock-docs is an intelligent processing system for judicial documents (tutelas and habeas corpus) that combines OCR, named entity recognition (NER), and duplicate detection. All processing is 100% local, with no data sent to external APIs.

Key Features

  • Hybrid OCR: PaddleOCR + Tesseract for scanned and digital documents
  • Legal NER: Entity extraction with spaCy, reaching an F1 score of 85.3% on human-validated data
  • Multi-Level Duplicate Detection: SHA-256 hash, LZJD fuzzy hash, TF-IDF, Sentence-Transformers
  • Advanced Search: SQLite FTS5 full-text search with advanced filtering
  • Complete REST API: 22 FastAPI endpoints with JWT authentication and Swagger documentation
  • Graphical Interface: 9 Streamlit pages for interactive document management
  • CLI: 20 commands for batch operations
  • Active Learning: Interface for human validation of NER entities
  • Export: Excel report generation
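
The first and third levels of the duplicate detection listed above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the 0.8 threshold, and the pure-Python TF-IDF (standing in for whatever library the project uses) are all assumptions, and the LZJD and Sentence-Transformers levels are left out.

```python
import hashlib
import math
import re
from collections import Counter


def sha256_digest(data: bytes) -> str:
    """Level 1: byte-identical documents share the same SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) for a small corpus."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF, so terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(docs: list[str], threshold: float = 0.8) -> list[tuple[int, int, str]]:
    """Return (i, j, kind) for each pair that is an exact or near duplicate."""
    digests = [sha256_digest(doc.encode("utf-8")) for doc in docs]
    vectors = tfidf_vectors(docs)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if digests[i] == digests[j]:
                pairs.append((i, j, "exact"))  # cheap check first
            elif cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j, "near"))  # lexical near-duplicate
    return pairs
```

Checking the hash before the similarity score keeps the cheap, exact test ahead of the more expensive fuzzy one, which is the usual motivation for a multi-level design.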

Technologies Used

  • Backend: Python 3.12.4, FastAPI, Pydantic
  • OCR: PaddleOCR + Tesseract
  • NLP/NER: spaCy (Spanish legal model)
  • Database: SQLite + FTS5 (33 columns, 11 indexes, WAL mode)
  • Frontend: Streamlit (9 pages)
  • Architecture: Layered Clean Architecture, Result-Oriented Programming (returns)
  • Security: JWT, RBAC, rate limiting, security headers
  • Deployment: Docker, Easypanel, automated GitHub Webhook

Architecture

Five layers, with a ServiceContainer that exposes 14 lazy-loaded properties:

  • Core: Entities, value objects
  • Application: Use cases, DTOs, ServiceContainer
  • Infrastructure: OCR, NER, deduplication, logging
  • Persistence: SQLite + FTS5 with ISP ports
  • Interfaces: Streamlit GUI, CLI (20 commands), FastAPI REST (22 endpoints)
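
The lazy-loaded ServiceContainer mentioned above could look roughly like this. It is a sketch under assumptions: the two service classes are hypothetical stand-ins, only 2 of the 14 properties are shown, and the use of functools.cached_property is one common way to get lazy loading, not necessarily what the project does.

```python
from functools import cached_property


class OcrService:
    """Hypothetical stand-in for an OCR engine that is expensive to build."""

    instances = 0  # counts constructions, to make the laziness observable

    def __init__(self) -> None:
        OcrService.instances += 1

    def extract_text(self, path: str) -> str:
        return f"text extracted from {path}"


class NerService:
    """Hypothetical stand-in for a NER model wrapper."""

    def extract_entities(self, text: str) -> list[str]:
        # Placeholder logic: treat title-cased words as entities.
        return [word for word in text.split() if word.istitle()]


class ServiceContainer:
    """Builds each service on first access, then reuses the same instance."""

    @cached_property
    def ocr(self) -> OcrService:
        return OcrService()

    @cached_property
    def ner(self) -> NerService:
        return NerService()
```

With this shape, a CLI command that only needs search never pays the cost of loading the OCR or NER models, because nothing is constructed until its property is first read.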

Achievements

  • 1,832 tests (1,770 unit + 62 integration) with 87% coverage
  • NER F1 85.3% with data validated by human operators
  • 22 REST endpoints documented and functional
  • 16/16 security items (SEC) plus 4 API security items (SEC-API) implemented
  • 9 sprints completed (S23-S28) with 30 planning documents
  • 0 mypy errors in strict mode across 130 files
  • Certified SDD audit: 39 conforming, 0 defects
  • Code quality: 9.1/10
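
For readers unfamiliar with the F1 metric cited above, the following sketch shows how an entity-level F1 score is typically computed from predicted and human-validated entities. The source does not say whether its 85.3% figure is entity-level or token-level, micro- or macro-averaged, so exact-match scoring over (text, label) pairs is an assumption, and the function name is hypothetical.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Score predicted entities against gold ones, counting exact matches only."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 3 predicted entities, 4 validated ones, and 2 in common, precision is 2/3, recall is 2/4, and F1 is 4/7.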

Impact

Sherlock-docs processes ~100 documents/day, with performance targets of under 2 minutes for 20-page documents and under 100 ms for digital documents. It eliminates manual classification of judicial documents and automatically detects duplicate filings.

This system was born from the need to automate the registration and classification of tutelas and habeas corpus in the Colombian judicial system, where duplicate detection and accurate extraction of procedural party information are critical to judicial office efficiency.