<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>OCR | Daniel Arbelaez Alvarez</title><link>https://portfolio.sprintjudicial.com/en/tags/ocr/</link><atom:link href="https://portfolio.sprintjudicial.com/en/tags/ocr/index.xml" rel="self" type="application/rss+xml"/><description>OCR</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 01 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://portfolio.sprintjudicial.com/media/icon_hu7729264130191091259.png</url><title>OCR</title><link>https://portfolio.sprintjudicial.com/en/tags/ocr/</link></image><item><title>Sherlock-docs - Intelligent Legal Document Processing</title><link>https://portfolio.sprintjudicial.com/en/project/sherlock-docs/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://portfolio.sprintjudicial.com/en/project/sherlock-docs/</guid><description>&lt;p>&lt;strong>Sherlock-docs&lt;/strong> is an intelligent processing system for judicial documents (tutelas and habeas corpus) that combines OCR, named entity recognition (NER), and duplicate detection. All processing is 100% local, with no data sent to external APIs.&lt;/p>
&lt;h2 id="key-features">Key Features&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Hybrid OCR&lt;/strong>: PaddleOCR + Tesseract for scanned and digital documents&lt;/li>
&lt;li>&lt;strong>Legal NER&lt;/strong>: Entity extraction with spaCy, reaching an F1 score of 85.3% on human-validated data&lt;/li>
&lt;li>&lt;strong>Multi-Level Duplicate Detection&lt;/strong>: SHA-256 hash, LZJD fuzzy hash, TF-IDF, Sentence-Transformers&lt;/li>
&lt;li>&lt;strong>Advanced Search&lt;/strong>: SQLite FTS5 full-text search with advanced filtering&lt;/li>
&lt;li>&lt;strong>Complete REST API&lt;/strong>: 22 FastAPI endpoints with JWT authentication and Swagger documentation&lt;/li>
&lt;li>&lt;strong>Graphical Interface&lt;/strong>: 9 Streamlit pages for interactive document management&lt;/li>
&lt;li>&lt;strong>CLI&lt;/strong>: 20 commands for batch operations&lt;/li>
&lt;li>&lt;strong>Active Learning&lt;/strong>: Interface for human validation of NER entities&lt;/li>
&lt;li>&lt;strong>Export&lt;/strong>: Excel report generation&lt;/li>
&lt;/ul>
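&lt;p>The multi-level duplicate check listed above can be illustrated with a minimal sketch that covers only two of the levels: an exact SHA-256 comparison followed by TF-IDF cosine similarity. It uses the Python standard library only; the function names and the 0.8 threshold are assumptions made for this example, not values from Sherlock-docs, and the LZJD and Sentence-Transformers levels are left out.&lt;/p>

```python
import hashlib
import math
import re
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Level 1: byte-identical inputs share the same SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors with a smoothed IDF term."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pair(a: str, b: str, threshold: float = 0.8) -> str:
    """Return 'exact', 'near-duplicate' or 'distinct' for two texts.

    The threshold is a hypothetical value chosen for illustration.
    """
    if sha256_hex(a.encode("utf-8")) == sha256_hex(b.encode("utf-8")):
        return "exact"
    vec_a, vec_b = tfidf_vectors([a, b])
    return "near-duplicate" if cosine(vec_a, vec_b) >= threshold else "distinct"
```

&lt;p>Running the cheap hash comparison first means the more expensive similarity step is only needed for texts that are not byte-identical.&lt;/p>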
&lt;h2 id="technologies-used">Technologies Used&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Backend&lt;/strong>: Python 3.12.4, FastAPI, Pydantic&lt;/li>
&lt;li>&lt;strong>OCR&lt;/strong>: PaddleOCR + Tesseract&lt;/li>
&lt;li>&lt;strong>NLP/NER&lt;/strong>: spaCy (Spanish legal model)&lt;/li>
&lt;li>&lt;strong>Database&lt;/strong>: SQLite + FTS5 (33 columns, 11 indexes, WAL mode)&lt;/li>
&lt;li>&lt;strong>Frontend&lt;/strong>: Streamlit (9 pages)&lt;/li>
&lt;li>&lt;strong>Architecture&lt;/strong>: Layered Clean Architecture, Result-Oriented Programming (returns)&lt;/li>
&lt;li>&lt;strong>Security&lt;/strong>: JWT, RBAC, rate limiting, security headers&lt;/li>
&lt;li>&lt;strong>Deployment&lt;/strong>: Docker, Easypanel, automated deployment via GitHub webhook&lt;/li>
&lt;/ul>
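&lt;p>As a rough illustration of the SQLite + FTS5 combination with WAL mode, the sketch below creates a full-text index and queries it with Python's built-in &lt;code>sqlite3&lt;/code> module. The table name &lt;code>doc_fts&lt;/code>, its two columns, and the helper functions are hypothetical and do not reflect the project's actual 33-column schema; the example also assumes the local SQLite build ships with FTS5 enabled.&lt;/p>

```python
import sqlite3
from pathlib import Path


def build_index(db_path: Path, docs: dict[str, str]) -> sqlite3.Connection:
    """Create a small FTS5 index in a WAL-mode database (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    # WAL lets readers keep working while a writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS doc_fts USING fts5(name, body)"
    )
    conn.executemany(
        "INSERT INTO doc_fts (name, body) VALUES (?, ?)",
        list(docs.items()),
    )
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return the names of matching documents, best match first."""
    rows = conn.execute(
        "SELECT name FROM doc_fts WHERE doc_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [name for (name,) in rows]
```

&lt;p>Ordering by the built-in &lt;code>rank&lt;/code> column returns results by FTS5's default BM25 relevance score.&lt;/p>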
&lt;h2 id="architecture">Architecture&lt;/h2>
&lt;p>5 layers with ServiceContainer (14 lazy-loaded properties):&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core&lt;/strong>: Entities, value objects&lt;/li>
&lt;li>&lt;strong>Application&lt;/strong>: Use cases, DTOs, ServiceContainer&lt;/li>
&lt;li>&lt;strong>Infrastructure&lt;/strong>: OCR, NER, deduplication, logging&lt;/li>
&lt;li>&lt;strong>Persistence&lt;/strong>: SQLite + FTS5 with ISP ports&lt;/li>
&lt;li>&lt;strong>Interfaces&lt;/strong>: Streamlit GUI, CLI (20 commands), FastAPI REST (22 endpoints)&lt;/li>
&lt;/ul>
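&lt;p>One common way to get lazy-loaded container properties in Python is &lt;code>functools.cached_property&lt;/code>. The sketch below shows that pattern with two hypothetical services; the class bodies and the &lt;code>created&lt;/code> list are assumptions for illustration and are not taken from the project's real ServiceContainer.&lt;/p>

```python
from functools import cached_property


class OcrService:
    """Hypothetical stand-in for an OCR adapter that is costly to build."""

    def extract_text(self, path: str) -> str:
        return f"text extracted from {path}"


class NerService:
    """Hypothetical stand-in for an NER adapter."""

    def extract_entities(self, text: str) -> list[str]:
        # Toy heuristic: treat title-cased words as entities.
        return [word for word in text.split() if word.istitle()]


class ServiceContainer:
    """Builds each service on first access and reuses it afterwards."""

    def __init__(self) -> None:
        # Records construction order, only to make the laziness visible.
        self.created: list[str] = []

    @cached_property
    def ocr(self) -> OcrService:
        self.created.append("ocr")
        return OcrService()

    @cached_property
    def ner(self) -> NerService:
        self.created.append("ner")
        return NerService()
```

&lt;p>With this pattern, an interface that only needs one service (for example a CLI command that never runs OCR) does not pay the start-up cost of the others.&lt;/p>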
&lt;h2 id="achievements">Achievements&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>1,832 tests&lt;/strong> (1,770 unit + 62 integration) with &lt;strong>87% coverage&lt;/strong>&lt;/li>
&lt;li>&lt;strong>NER F1 85.3%&lt;/strong> with data validated by human operators&lt;/li>
&lt;li>&lt;strong>22 REST endpoints&lt;/strong> documented and functional&lt;/li>
&lt;li>&lt;strong>16/16 security SECs&lt;/strong> + 4 SEC-API implemented&lt;/li>
&lt;li>&lt;strong>9 sprints&lt;/strong> completed (S23-S28) with 30 planning documents&lt;/li>
&lt;li>&lt;strong>mypy 0 errors&lt;/strong> in strict mode across 130 files&lt;/li>
&lt;li>&lt;strong>Certified SDD audit&lt;/strong>: 39 conforming, 0 defects&lt;/li>
&lt;li>&lt;strong>Code quality&lt;/strong>: 9.1/10&lt;/li>
&lt;/ul>
&lt;h2 id="impact">Impact&lt;/h2>
&lt;p>Sherlock-docs processes ~100 documents/day, with performance targets of &amp;lt;2 minutes for 20-page documents and &amp;lt;100 ms for digital documents. It eliminates manual classification of judicial documents and automatically detects duplicate filings.&lt;/p>
&lt;p>This system was born from the need to automate the registration and classification of tutelas and habeas corpus in the Colombian judicial system, where duplicate detection and accurate extraction of procedural party information are critical to judicial office efficiency.&lt;/p></description></item></channel></rss>