Compare commits

38 Commits

Author SHA1 Message Date
Aaron Roberts
02185bef46 Adding missed files 2026-06-30 12:16:16 +01:00
Aaron Roberts
04bbbebd5a Remove Freeform and Find from UI. Allow Description to be added to Reviewed job 2026-06-29 13:09:01 +01:00
Aaron Roberts
48f958de6c Added job review toggle 2026-06-23 10:43:44 +01:00
Aaron Roberts
91c134faa7 Add updated_at column and trigger for Qdrant re-sync detection
Adds updated_at TIMESTAMPTZ to ocr_jobs, stamped automatically by a
BEFORE UPDATE trigger. The sync process can use updated_at > qdrant_synced_at
to detect jobs that need re-ingestion after edits or reviews.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 23:12:33 +01:00
Aaron Roberts
38ac36b18e Add qdrant_synced_at column 2026-06-19 17:47:53 +01:00
Aaron Roberts
ab19725e0b Remove AnimatePresence mode=wait to fix blank screen on view transitions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 22:04:52 +01:00
Aaron Roberts
a511db78cb Fix blank screen on Analyze; add mode selector to result view
showResultView now only activates after results exist (not during loading),
preventing AnimatePresence from blanking the screen mid-transition.
Adds a mode selector + Analyze button at the top of the result view so
additional modes can be run without leaving the page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 21:55:23 +01:00
Aaron Roberts
07b2f2b6bc Fix stale editedOcrText reference in handleDownload dependency array
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 21:44:36 +01:00
Aaron Roberts
ae0ac3af59 Store all mode results (OCR, Describe, Freeform) in a single job record
- DB: add describe_text and freeform_text columns (ALTER TABLE IF NOT EXISTS)
- Backend: commit and review endpoints accept/persist all three text fields
- App: accumulate results per mode in state; tabs appear when >1 mode run;
  all results sent on Commit Job
- JobDetail: tabbed text panel shows whichever fields are populated, all editable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 12:28:01 +01:00
Aaron Roberts
4ab87d2e6f Extend commit workflow to Describe and Freeform modes
All text-output modes (plain_ocr, describe, freeform) now show the
full-screen editable result view with metadata fields and Commit Job
button. The textarea label reflects the active mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 10:38:27 +01:00
Aaron Roberts
cc5ce0c6be Fix suggestions fetch using wrong API base URL
Fallback was http://localhost:8000/api instead of /api, causing silent
failure in containerized deployments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:37:13 +01:00
Aaron Roberts
02e3099388 Add delete job functionality with confirmation step
Adds DELETE /api/jobs/{id} endpoint (removes DB record and image file),
and a two-step Delete / Confirm button on the review page that returns
to the job list on success.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:33:46 +01:00
Aaron Roberts
dc5a1a4ff5 Add book title to autocomplete suggestions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:29:14 +01:00
Aaron Roberts
5ea18d76d6 Add autocomplete suggestions for Author, Chapter, and Reviewer fields
Adds a GET /api/jobs/suggestions endpoint that returns distinct values for
author, chapter, and reviewer_name from the database, and wires them into
HTML datalist elements on the New Job, result view, and Browse Jobs pages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:24:49 +01:00
Aaron Roberts
1d15b5f0c1 Add unique constraint to prevent duplicate (author, chapter, page) submissions
Adds a PostgreSQL partial unique index on (author, chapter, page) where all
three fields are non-null, and returns HTTP 409 when a duplicate is detected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:19:54 +01:00
Aaron Roberts
cb704a2f27 Double image/text section height to 130vh
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:13:11 +01:00
Aaron Roberts
3ca40a2255 Revert to 50/50 image/text column split
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:10:51 +01:00
Aaron Roberts
6f86f872a9 Make image display significantly taller
Give the image+text row an explicit 65vh height instead of flex-1 inside
a viewport-locked container. Remove the overall height constraint so
metadata and commit rows sit naturally below with scroll if needed.
Image and textarea containers now use h-full to fill the fixed row height.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:10:39 +01:00
Aaron Roberts
7381ecd12e Increase image display size to 60% of the split layout
Change image/text column ratio from 50/50 to 60/40 (3fr 2fr) on both
the New Job result view and the Browse Jobs detail view.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 18:05:09 +01:00
Aaron Roberts
247a5e4b0e Full-screen side-by-side layout for New Job and Browse Jobs
New Job (plain_ocr):
- After OCR completes, the entire main area becomes a flex-column view
  pinned to viewport height: image and editable textarea side by side at
  top (filling available space), metadata fields in a compact row below,
  Commit Job button at the bottom
- "New Analysis" button in the header returns to the upload view
- ResultPanel reverted to simple rendered-output only (no commit logic)

Browse Jobs:
- Selecting a job replaces the search list with a full-screen detail view
  using the same layout: image | editable textarea on top, all metadata
  fields + Reviewer name + action button in a single row below
- "Back to results" button returns to the search/list grid
- Search results now display as a responsive card grid

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:57:11 +01:00
Aaron Roberts
9356ba6d1b Side-by-side image/text layout and editable metadata on review
New Job page:
- OCR result now shows source image and editable textarea side by side
- Grounding-box overlay preview moved into the non-commit branch

Browse Jobs / Review page:
- JobDetail uses a 2-column layout: image + read-only info on left,
  all editable fields on right
- Author, book, chapter, and page are now editable inputs (not read-only)
- Text textarea is always editable (for both unreviewed and reviewed jobs)
- Reviewer name pre-filled for reviewed jobs; button becomes "Save Changes"
- Outer grid changed to 1/3 list + 2/3 detail for more review space

Backend:
- PUT /api/jobs/{id}/review now accepts and saves author, book,
  chapter, page alongside reviewed_text and reviewer_name

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:38:36 +01:00
Aaron Roberts
da7957d7d5 Fix commit job and OCR text editing
- OCR text is now shown in an editable textarea (plain_ocr mode) so
  users can correct it before committing
- editedOcrText state tracks edits; commit job sends the edited value
  instead of the original result.text
- Remove silent early-return guard that blocked commit when text was empty
- Copy and download also use the edited text

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:11:49 +01:00
Aaron Roberts
fd747e6c23 Add job tracking with PostgreSQL, image storage, and review workflow
- Add PostgreSQL service to docker-compose with health check and postgres_data volume
- Mount ./ocr_images as bind volume for persistent image storage
- Add backend/database.py with schema init and get_db() context manager
- Add 5 new API endpoints: POST /api/jobs, GET /api/jobs (search), GET /api/jobs/{id},
  GET /api/jobs/{id}/image, PUT /api/jobs/{id}/review
- Jobs are saved with author/book/chapter/page metadata, auto UUID, and submitted_at timestamp
- Jobs start as 'unreviewed'; review captures edited text, reviewer name, and reviewed_at
- Add MetadataForm.jsx (author/book/chapter/page inputs) to the New Job panel
- Add JobsPanel.jsx with search/filter, paginated list, and detail pane with review form
- Add "Commit Job" button to ResultPanel (plain_ocr mode only) with success/error feedback
- Add "New Job" / "Browse Jobs" navigation to the app header

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 16:48:12 +01:00
Aaron Roberts
68147eb97c .env 2026-06-09 15:10:25 +01:00
Aaron Roberts
ba313ee808 stack.env 2026-06-09 15:06:02 +01:00
Aaron Roberts
bd19e09630 Adding .env for portainer 2026-06-09 14:15:34 +01:00
Ray Dumasia
3dac0741b1 Fix RCE vulnerability and harden security
- Replace eval() with ast.literal_eval() in pdf_utils.py to fix
  unauthenticated remote code execution via crafted PDF uploads
  (reported by OX Security)
- Sanitize HTML output with DOMPurify to prevent XSS
- Restrict CORS origins (configurable via CORS_ORIGINS env var)
- Suppress raw exception details in API error responses
- Cap Image.MAX_IMAGE_PIXELS to prevent decompression bomb DoS
- Add security regression test suite

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 09:01:52 +01:00
Ray Dumasia
e24f064042 Add CTRL-V support as suggested by @p-xiexin 2025-11-15 23:32:33 +00:00
rdumasia303
e82cd2abf0 Merge pull request #22 from rdumasia303/claude/add-pdf-support-016ikhUYeakWY2dah4X9STAX
Claude/add pdf support 016ikh u yeak wy2dah4 x9 stax
2025-11-15 23:00:51 +00:00
rdumasia303
7b7d368c94 Update latest updates section to November 2025 2025-11-15 22:58:28 +00:00
Claude
efa2bd265b Enhance README with comprehensive PDF processing documentation
- Add prominent "What's New" section highlighting v2.2.0 features
- Add detailed "How to Use" guide for both Image OCR and PDF Processing
- Include output format comparison table
- Add use cases and tips for best results
- Expand tech stack section with new dependencies
- Better structure with clear sections for new users
2025-11-15 22:55:43 +00:00
Claude
e33e9be75a Fix Dockerfile to copy all Python files including pdf_utils and format_converter 2025-11-15 14:38:54 +00:00
Claude
e578276d3e Add PDF processing and multi-format document conversion
Features added:
- PDF to image conversion with configurable DPI
- Multi-page PDF processing with OCR
- Export to Markdown, HTML, DOCX, and JSON formats
- Automatic image extraction from PDFs
- Formula and formatting preservation
- Real-time progress tracking for multi-page documents

Backend changes:
- New /api/process-pdf endpoint for PDF processing
- pdf_utils.py: PDF conversion and image extraction utilities
- format_converter.py: Document format conversion (MD, HTML, DOCX)
- Updated dependencies: PyMuPDF, img2pdf, python-docx, markdown

Frontend changes:
- File type toggle (Image OCR / PDF Processing)
- PDFProcessor component with format selection
- Updated ImageUpload to support both images and PDFs
- Progress bars for multi-page processing
- Download options for converted documents

Documentation:
- Updated README with PDF processing features
- Added API documentation for /api/process-pdf endpoint
- Added format conversion examples
2025-11-15 14:25:09 +00:00
rdumasia303
5ba45f7db2 Update README.md with new content 2025-10-23 01:14:24 +01:00
rdumasia303
fd063c0e71 Add MIT License to the project 2025-10-23 01:06:22 +01:00
rdumasia303
0fb5760b11 Merge pull request #11 from dnnspaul/main
Fix incorrect OCR instructions + show advanced settings
2025-10-22 23:52:30 +01:00
Dennis Paul
23bbd1fc8d show advanced settings toggle 2025-10-23 00:05:24 +02:00
Dennis Paul
225655d02c (#10) Fix incorrect OCR instruction 2025-10-23 00:05:00 +02:00
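
Commit 91c134faa7 in the list above says the sync process can compare `updated_at > qdrant_synced_at` to find jobs that need re-ingestion. A minimal sketch of that comparison follows. The function name `needs_resync` and the handling of NULL timestamps are assumptions for illustration, not code from this branch. Because the trigger is `BEFORE UPDATE`, a row that was inserted and never updated would have no `updated_at` value, which the sketch treats as "unchanged since insert".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_resync(updated_at: Optional[datetime],
                 qdrant_synced_at: Optional[datetime]) -> bool:
    """Return True when a job row should be re-ingested (hypothetical helper)."""
    if qdrant_synced_at is None:
        # Assumption: a job that was never synced always needs ingestion.
        return True
    if updated_at is None:
        # Synced, and no UPDATE has stamped the row since.
        return False
    # The comparison named in the commit message.
    return updated_at > qdrant_synced_at


synced = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)
edited_later = synced + timedelta(minutes=5)
```

One thing to watch, under the same assumptions: the trigger also fires when the sync process itself writes `qdrant_synced_at`. If that write uses `NOW()` in the same statement, both columns receive the same transaction timestamp and the strict `>` comparison stays false, which is the behaviour the sketch relies on.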
25 changed files with 4010 additions and 505 deletions
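
Commit 3dac0741b1 in the list above replaces `eval()` with `ast.literal_eval()` in `pdf_utils.py`. The sketch below shows why that closes the code-execution path: `ast.literal_eval` accepts only Python literals and raises on anything else, such as a function call. The helper name `parse_literal` and the sample strings are hypothetical and are not taken from `pdf_utils.py`.

```python
import ast


def parse_literal(text: str):
    """Parse a Python literal without executing code; return None if rejected."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Anything that is not a plain literal (calls, names, attribute access)
        # ends up here instead of being executed.
        return None


# A coordinate list of the kind the README describes parses normally.
boxes = parse_literal("[[10, 20, 110, 220], [30, 40, 130, 240]]")

# A payload that eval() would have executed is rejected.
rejected = parse_literal("__import__('os').system('echo pwned')")
```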

23
.env Normal file

@@ -0,0 +1,23 @@
# DeepSeek OCR Application Configuration
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
# Frontend Configuration
FRONTEND_PORT=3000
# Model Configuration
MODEL_NAME=deepseek-ai/DeepSeek-OCR
HF_HOME=/models
# CORS Configuration (comma-separated origins, defaults to http://localhost:3000)
CORS_ORIGINS=http://localhost:3000
# Upload Configuration
MAX_UPLOAD_SIZE_MB=100
# Processing Configuration
BASE_SIZE=1024
IMAGE_SIZE=640
CROP_MODE=true


@@ -11,9 +11,34 @@ FRONTEND_PORT=3000
MODEL_NAME=deepseek-ai/DeepSeek-OCR
HF_HOME=/models
# OCR model selection
# Register the local DeepSeek-OCR model (set to false for an Ollama-only deployment)
ENABLE_DEEPSEEK_LOCAL=true
# External Ollama host the backend should call (no trailing slash)
OLLAMA_BASE_URL=http://host.docker.internal:11434
# Comma-separated Ollama vision model tags to surface in the UI.
# Pull these on the Ollama host first, e.g. `ollama pull glm-ocr`.
OLLAMA_MODELS=glm-ocr,llama3.2-vision,minicpm-v,qwen2.5vl
# Default model id selected in the UI (deepseek-local or ollama:<tag>)
DEFAULT_OCR_MODEL=deepseek-local
# Per-request timeout (seconds) for Ollama calls
OLLAMA_TIMEOUT=300
# CORS Configuration (comma-separated origins, defaults to http://localhost:3000)
CORS_ORIGINS=http://localhost:3000
# Upload Configuration
MAX_UPLOAD_SIZE_MB=100
# PostgreSQL Configuration
POSTGRES_USER=ocr_user
POSTGRES_PASSWORD=ocr_password
POSTGRES_DB=ocr_db
DATABASE_URL=postgresql://ocr_user:ocr_password@postgres:5432/ocr_db
# OCR Image Storage (host path mounted into container)
OCR_IMAGES_DIR=/data/ocr_images
# Processing Configuration
BASE_SIZE=1024
IMAGE_SIZE=640

2
.gitignore vendored

@@ -46,7 +46,7 @@ yarn.lock
pnpm-lock.yaml
# Environment
.env
#.env
.env.local
.env.development.local
.env.test.local

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 rdumasia303
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

257
README.md

@@ -1,10 +1,54 @@
# 🚀 DeepSeek OCR - React + FastAPI
Modern OCR web application powered by DeepSeek-OCR with a stunning React frontend and FastAPI backend.
Modern OCR web application powered by DeepSeek-OCR with a stunning React frontend and FastAPI backend. **Now with PDF processing and multi-format document conversion!**
![DeepSeek OCR in Action](assets/multi-bird.png)
> **Recent Updates (v2.1.1)**
## ✨ What's New in v2.2.0 - PDF Processing & Document Conversion
We've added powerful PDF processing capabilities based on community feedback! Here's what you can do now:
### 📄 Process Entire PDF Documents
- Upload PDF files up to 100MB
- Automatic multi-page OCR processing
- Real-time progress tracking for large documents
- Extract text from scanned PDFs or image-based documents
### 🔄 Convert to Multiple Formats
Export your OCR results in the format you need:
- **Markdown (.md)** - Clean, structured text perfect for documentation
- **HTML (.html)** - Styled documents with embedded images and tables
- **Word (.docx)** - Professional documents with formatting, tables, and images
- **JSON** - Structured data for programmatic access
### 🖼️ Automatic Image Extraction
- Detects and extracts images from PDF pages
- Embeds images in exported documents
- Preserves image placement and context
### 📐 Formula & Formatting Preservation
- Maintains mathematical formulas (LaTeX syntax)
- Preserves tables, headings, and document structure
- Cleans up special characters while keeping formatting intact
### 🎯 Use Cases
- **Document Digitization** - Convert scanned PDFs to editable formats
- **Data Extraction** - Pull structured data from forms and invoices
- **Content Migration** - Convert PDFs to Markdown for wikis/documentation
- **Academic Papers** - Extract text and formulas from research papers
- **Business Documents** - Convert reports to Word for editing
---
> **Latest Updates (v2.2.0)** - November 2025
> - 🎉 **NEW: PDF Processing** - Upload PDFs and extract text from all pages
> - 🎉 **NEW: Multi-Format Export** - Convert to Markdown, HTML, DOCX, or JSON
> - 🎉 **NEW: Automatic Image Extraction** - Extract and preserve images from PDFs
> - 🎉 **NEW: Progress Tracking** - Real-time progress for multi-page documents
> - ✅ Dual mode: Image OCR + PDF Processing with format conversion
> - ✅ Enhanced document processing with formula and formatting preservation
>
> **Previous Updates (v2.1.1)**
> - ✅ Fixed image removal button - now properly clears and allows re-upload
> - ✅ Fixed multiple bounding boxes parsing - handles `[[x1,y1,x2,y2], [x1,y1,x2,y2]]` format
> - ✅ Simplified to 4 core working modes for better stability
@@ -37,24 +81,80 @@ Modern OCR web application powered by DeepSeek-OCR with a stunning React fronten
- **Backend API**: http://localhost:8000 (or your configured API_PORT)
- **API Docs**: http://localhost:8000/docs
## 🎓 How to Use
### Processing Images (Single Image OCR)
1. Select **"Image OCR"** mode in the toggle
2. Upload an image (PNG, JPG, WEBP, etc.)
3. Choose your OCR mode:
   - **Plain OCR** - Extract all text
   - **Describe** - Get image description
   - **Find** - Locate specific terms
   - **Freeform** - Use custom prompts
4. Click **"Analyze Image"**
5. View results with bounding boxes (if enabled)
6. Copy or download the extracted text
### Processing PDFs (Multi-Page Documents) - NEW!
1. Select **"PDF Processing"** mode in the toggle
2. Upload a PDF file (up to 100MB)
3. Choose your OCR mode (same as above)
4. Select **output format**:
   - 📝 **Markdown** - For documentation, wikis, GitHub
   - 🌐 **HTML** - For web publishing, styled viewing
   - 📄 **DOCX** - For Word editing, professional documents
   - 📊 **JSON** - For programmatic access, data extraction
5. Click **"Process PDF"**
6. Watch the progress bar as pages are processed
7. Your file downloads automatically when complete!
### Tips for Best Results
- **For scanned documents**: Use higher DPI (144-300) in advanced settings
- **For tables**: The model excels at extracting structured data
- **For formulas**: Mathematical notation is preserved in output
- **For images in PDFs**: Enable "Extract Images" to include them in output
- **For large PDFs**: JSON format is fastest, DOCX takes longer due to formatting
### Output Format Comparison
| Format | Best For | Features | File Size |
|--------|----------|----------|-----------|
| **Markdown** | Documentation, GitHub, wikis | Clean text, tables, code blocks | Smallest |
| **HTML** | Web viewing, sharing | Styled output, embedded images, tables | Medium |
| **DOCX** | Editing, professional docs | Full formatting, images, tables | Largest |
| **JSON** | Data processing, APIs | Structured data, metadata, page info | Small |
## Features
### 4 Core OCR Modes
### Dual Processing Modes
#### 📸 **Image OCR** (4 Core Modes)
- **Plain OCR** - Raw text extraction from any image
- **Describe** - Generate intelligent image descriptions
- **Find** - Locate specific terms with visual bounding boxes
- **Freeform** - Custom prompts for specialized tasks
#### 📄 **PDF Processing** (NEW!)
- **Multi-Page Processing** - Process entire PDF documents page by page
- **Format Conversion** - Export to Markdown, HTML, DOCX, or JSON
- **Image Extraction** - Automatically extract and preserve embedded images
- **Formula Preservation** - Maintain mathematical formulas and special formatting
- **Progress Tracking** - Real-time progress updates for large documents
### UI Features
- 🎨 Glass morphism design with animated gradients
- 🎯 Drag & drop file upload (up to 100MB by default)
- 🗑️ Easy image removal and re-upload
- 🎯 Drag & drop file upload (Images up to 10MB, PDFs up to 100MB)
- 🔄 Easy file removal and re-upload
- 📦 Grounding box visualization with proper coordinate scaling
- ✨ Smooth animations (Framer Motion)
- 📋 Copy/Download results
- 📋 Copy/Download results in multiple formats
- 🎛️ Advanced settings dropdown
- 📝 HTML and Markdown rendering for formatted output
- 🔍 Multiple bounding box support (handles multiple instances of found terms)
- 📊 Progress bars for multi-page PDF processing
- 💾 Direct download for converted documents (MD, HTML, DOCX)
## Configuration
@@ -72,6 +172,13 @@ FRONTEND_PORT=3000
MODEL_NAME=deepseek-ai/DeepSeek-OCR
HF_HOME=/models
# OCR model selection (DeepSeek + Ollama)
ENABLE_DEEPSEEK_LOCAL=true # register the local GPU model
OLLAMA_BASE_URL=http://host.docker.internal:11434 # external Ollama host
OLLAMA_MODELS=glm-ocr,llama3.2-vision,minicpm-v,qwen2.5vl
DEFAULT_OCR_MODEL=deepseek-local # deepseek-local or ollama:<tag>
OLLAMA_TIMEOUT=300 # per-request timeout (seconds)
# Upload Configuration
MAX_UPLOAD_SIZE_MB=100 # Maximum file upload size
@@ -86,19 +193,68 @@ CROP_MODE=true # Enable dynamic cropping for large images
- `API_HOST`: Backend API host (default: 0.0.0.0)
- `API_PORT`: Backend API port (default: 8000)
- `FRONTEND_PORT`: Frontend port (default: 3000)
- `MODEL_NAME`: HuggingFace model identifier
- `MODEL_NAME`: HuggingFace model identifier for the local DeepSeek-OCR model
- `HF_HOME`: Model cache directory
- `ENABLE_DEEPSEEK_LOCAL`: Register the local DeepSeek-OCR model (set `false` for an Ollama-only deployment with no GPU model loaded)
- `OLLAMA_BASE_URL`: URL of an external Ollama server the backend calls for non-DeepSeek models
- `OLLAMA_MODELS`: Comma-separated Ollama vision model tags to expose in the UI (pull them on the Ollama host first, e.g. `ollama pull glm-ocr`)
- `DEFAULT_OCR_MODEL`: Model id selected by default (`deepseek-local` or `ollama:<tag>`)
- `OLLAMA_TIMEOUT`: Per-request timeout in seconds for Ollama calls
- `MAX_UPLOAD_SIZE_MB`: Maximum file upload size in megabytes
- `BASE_SIZE`: Base image processing size (affects memory usage)
- `IMAGE_SIZE`: Tile size for dynamic cropping
- `CROP_MODE`: Enable/disable dynamic image cropping
### Choosing an OCR Model
The **Model** selector (next to the Mode selector) chooses which backend runs the OCR:
- **DeepSeek-OCR (local GPU)** — the default. Loaded lazily on first use. Supports
every mode including grounding/bounding-box modes (Find), plus the Advanced
Settings (base size, crop mode, etc.).
- **Ollama models** — any vision model pulled on your Ollama host and listed in
`OLLAMA_MODELS` (e.g. `glm-ocr`, `llama3.2-vision`). These run remotely on the
Ollama server. They return **plain text only**: bounding boxes are not produced,
so grounding modes (Find) and the DeepSeek-specific Advanced Settings are ignored
/ disabled when an Ollama model is selected.
Setup for Ollama models:
```bash
# On the machine running Ollama
ollama pull glm-ocr
ollama pull llama3.2-vision
# Point the backend at it (in .env), then restart
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODELS=glm-ocr,llama3.2-vision
```
`GET /api/models` returns the registered models and their capabilities; the UI
populates the selector from it. The model used for each job is stored on the job
record (`ocr_model`) and shown in the Browse Jobs view.
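
The model ids described above take two forms, `deepseek-local` and `ollama:<tag>`. As a rough sketch of how such ids could be built from `OLLAMA_MODELS` and split back into backend and tag, assuming nothing about the real registry code (`model_ids_from_env`, `parse_model_id`, and `ModelRef` are hypothetical names):

```python
from typing import List, NamedTuple, Optional

DEEPSEEK_LOCAL_ID = "deepseek-local"
OLLAMA_PREFIX = "ollama:"


class ModelRef(NamedTuple):
    backend: str         # "deepseek" or "ollama"
    tag: Optional[str]   # Ollama model tag; None for the local model


def model_ids_from_env(ollama_models: str,
                       enable_deepseek_local: bool = True) -> List[str]:
    """Build the list of model ids from the comma-separated OLLAMA_MODELS value."""
    ids = [DEEPSEEK_LOCAL_ID] if enable_deepseek_local else []
    for tag in ollama_models.split(","):
        tag = tag.strip()
        if tag:  # skip empty entries such as a trailing comma
            ids.append(OLLAMA_PREFIX + tag)
    return ids


def parse_model_id(model_id: str) -> ModelRef:
    """Split a model id into backend and tag, rejecting unknown forms."""
    if model_id == DEEPSEEK_LOCAL_ID:
        return ModelRef("deepseek", None)
    if model_id.startswith(OLLAMA_PREFIX) and len(model_id) > len(OLLAMA_PREFIX):
        return ModelRef("ollama", model_id[len(OLLAMA_PREFIX):])
    raise ValueError(f"Unknown model id: {model_id!r}")
```

With `OLLAMA_MODELS=glm-ocr,llama3.2-vision`, this sketch would yield `deepseek-local`, `ollama:glm-ocr`, and `ollama:llama3.2-vision`.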
## Tech Stack
- **Frontend**: React 18 + Vite 5 + TailwindCSS 3 + Framer Motion 11
- **Backend**: FastAPI + PyTorch + Transformers 4.46 + DeepSeek-OCR
### Frontend
- **Framework**: React 18 + Vite 5
- **Styling**: TailwindCSS 3 + Custom Glass Morphism
- **Animations**: Framer Motion 11
- **HTTP Client**: Axios
- **File Upload**: React Dropzone
### Backend
- **API Framework**: FastAPI (async Python web framework)
- **ML/AI**: PyTorch + Transformers 4.46 + DeepSeek-OCR
- **PDF Processing**: PyMuPDF (fitz) + img2pdf
- **Document Conversion**:
  - python-docx (Word documents)
  - markdown (Markdown processing)
  - Custom HTML generator
- **Configuration**: python-decouple for environment management
- **Server**: Nginx (reverse proxy)
### Infrastructure
- **Server**: Nginx (reverse proxy & static file serving)
- **Container**: Docker + Docker Compose with multi-stage builds
- **GPU**: NVIDIA CUDA support (tested on RTX 3090, RTX 5090)
@@ -106,19 +262,26 @@ CROP_MODE=true # Enable dynamic cropping for large images
```
deepseek-ocr/
├── backend/ # FastAPI backend
│ ├── main.py
├── backend/ # FastAPI backend
│ ├── main.py # Main API with OCR and PDF endpoints
│ ├── pdf_utils.py # PDF processing utilities (NEW)
│ ├── format_converter.py # Document format conversion (NEW)
│ ├── requirements.txt
│ └── Dockerfile
├── frontend/ # React frontend
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/
│ │ ├── App.jsx
│ │ │ ├── ImageUpload.jsx # File upload (images & PDFs)
│ │ │ ├── PDFProcessor.jsx # PDF processing UI (NEW)
│ │ │ ├── ModeSelector.jsx
│ │ │ ├── ResultPanel.jsx
│ │ │ └── AdvancedSettings.jsx
│ │ ├── App.jsx # Main app with dual mode support
│ │ └── main.jsx
│ ├── package.json
│ ├── nginx.conf
│ └── Dockerfile
├── models/ # Model cache
├── models/ # Model cache
└── docker-compose.yml
```
@@ -255,6 +418,7 @@ For large images, the model uses dynamic cropping:
**Parameters:**
- `image` (file, required) - Image file to process (up to 100MB)
- `model` (string) - OCR model id from `GET /api/models` (default: registry default). Grounding/Advanced settings apply to DeepSeek only.
- `mode` (string) - OCR mode: `plain_ocr` | `describe` | `find_ref` | `freeform`
- `prompt` (string) - Custom prompt for freeform mode
- `grounding` (bool) - Enable bounding boxes (auto-enabled for find_ref)
@@ -288,6 +452,64 @@ For large images, the model uses dynamic cropping:
- **Supports multiple boxes**: When finding multiple instances, format is `[[x1,y1,x2,y2], [x1,y1,x2,y2], ...]`
- Frontend automatically displays all boxes overlaid on the image with unique colors
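
A client that consumes these boxes may want to treat the single-box and multi-box cases uniformly before drawing them. The following is a sketch only: it assumes a single result may arrive as a flat `[x1, y1, x2, y2]` list, and that boxes are expressed in a known source size that must be scaled to the displayed image size. Neither the function names nor the coordinate space are taken from this repository.

```python
from typing import List, Sequence, Tuple

Box = List[float]


def normalize_boxes(raw: Sequence) -> List[Box]:
    """Return a list of [x1, y1, x2, y2] boxes from a flat or nested input."""
    if not raw:
        return []
    if isinstance(raw[0], (int, float)):
        raw = [raw]  # a single flat box becomes a one-element list
    boxes = []
    for box in raw:
        if len(box) != 4:
            raise ValueError(f"Expected 4 coordinates, got {len(box)}")
        boxes.append([float(v) for v in box])
    return boxes


def scale_boxes(boxes: List[Box],
                src_size: Tuple[float, float],
                dst_size: Tuple[float, float]) -> List[Box]:
    """Scale boxes from a (width, height) source space to a display space."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]
```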
### POST /api/process-pdf (NEW!)
Process PDF documents with OCR and export to various formats.
**Parameters:**
- `pdf_file` (file, required) - PDF file to process (up to 100MB)
- `model` (string) - OCR model id from `GET /api/models` (default: registry default)
- `mode` (string) - OCR mode: `plain_ocr` | `describe` | `find_ref` | `freeform`
- `prompt` (string) - Custom prompt for freeform mode
- `output_format` (string) - Output format: `markdown` | `html` | `docx` | `json`
- `grounding` (bool) - Enable bounding boxes (default: false)
- `include_caption` (bool) - Add image descriptions (default: false)
- `extract_images` (bool) - Extract embedded images from PDF (default: true)
- `dpi` (int) - PDF rendering resolution (default: 144)
- `base_size` (int) - Base processing size (default: 1024)
- `image_size` (int) - Tile size for cropping (default: 640)
- `crop_mode` (bool) - Enable dynamic cropping (default: true)
**Response Formats:**
**JSON Format** (`output_format=json`):
```json
{
  "success": true,
  "total_pages": 5,
  "pages": [
    {
      "page_number": 1,
      "text": "Extracted and cleaned text...",
      "raw_text": "Raw model output with tags...",
      "boxes": [{"label": "field", "box": [x1, y1, x2, y2]}],
      "images": ["base64_encoded_image_data..."],
      "image_dims": {"w": 1920, "h": 1080}
    }
  ],
  "metadata": {
    "mode": "plain_ocr",
    "grounding": false,
    "extract_images": true,
    "dpi": 144
  }
}
```
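
A caller that receives the JSON shape above might combine the per-page text into one string. The sketch below uses only the field names shown in the example response; the helper name `join_pages` and the sample data are illustrative assumptions.

```python
from typing import Any, Dict


def join_pages(response: Dict[str, Any]) -> str:
    """Concatenate the cleaned text of every page, ordered by page_number."""
    if not response.get("success"):
        raise ValueError("PDF processing did not succeed")
    pages = sorted(response.get("pages", []), key=lambda p: p["page_number"])
    return "\n\n".join(p.get("text", "") for p in pages)


# Hand-written sample in the documented shape (not real model output).
sample = {
    "success": True,
    "total_pages": 2,
    "pages": [
        {"page_number": 2, "text": "Second"},
        {"page_number": 1, "text": "First"},
    ],
    "metadata": {"mode": "plain_ocr", "dpi": 144},
}
combined = join_pages(sample)
```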
**File Downloads** (`output_format=markdown|html|docx`):
- Returns the document as a downloadable file
- Markdown: `.md` file with preserved formatting
- HTML: `.html` file with embedded styling and images
- DOCX: `.docx` Word document with tables and formatting
**Features:**
- 📄 Multi-page processing with progress tracking
- 🖼️ Automatic image extraction and embedding
- 📐 Formula and formatting preservation
- 🎨 Styled HTML output with tables and code blocks
- 📝 Clean Markdown with proper structure
- 📋 Professional DOCX with headings and tables
## Examples
Here are some example images showcasing different OCR capabilities:
@@ -325,3 +547,8 @@ docker-compose build frontend
## License
This project uses the DeepSeek-OCR model. Refer to the model's license terms.
<!-- Small note and direct link to license at the bottom -->
<!-- MIT License: this repository is licensed under the MIT License. See the full text in the LICENSE file. -->
Note: Licensed under the MIT License. View the full license: [LICENSE](./LICENSE)


@@ -12,7 +12,7 @@ COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt
# Copy backend code
COPY main.py .
COPY *.py .
EXPOSE 8000

115
backend/database.py Normal file

@@ -0,0 +1,115 @@
import os
import psycopg2
import psycopg2.extras
from contextlib import contextmanager
from decouple import config as env_config

DATABASE_URL = env_config(
    "DATABASE_URL",
    default="postgresql://ocr_user:ocr_password@postgres:5432/ocr_db"
)


def _get_conn():
    return psycopg2.connect(DATABASE_URL, cursor_factory=psycopg2.extras.RealDictCursor)


def init_db():
    """Create tables if they don't exist. Called once at startup."""
    conn = None
    try:
        conn = _get_conn()
        with conn.cursor() as cur:
            cur.execute("""
                CREATE TABLE IF NOT EXISTS ocr_jobs (
                    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
                    author TEXT,
                    book TEXT,
                    chapter TEXT,
                    page TEXT,
                    submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
                    image_path TEXT NOT NULL,
                    original_filename TEXT,
                    ocr_text TEXT,
                    status TEXT NOT NULL DEFAULT 'unreviewed',
                    reviewed_text TEXT,
                    reviewer_name TEXT,
                    reviewed_at TIMESTAMPTZ,
                    mode TEXT
                )
            """)
            # Indexes for filtering by status and sorting by submission time
            cur.execute("""
                CREATE INDEX IF NOT EXISTS ocr_jobs_status_idx ON ocr_jobs(status)
            """)
            cur.execute("""
                CREATE INDEX IF NOT EXISTS ocr_jobs_submitted_at_idx ON ocr_jobs(submitted_at DESC)
            """)
            # Add columns introduced after initial schema (safe to run repeatedly)
            cur.execute("""
                ALTER TABLE ocr_jobs
                ADD COLUMN IF NOT EXISTS describe_text TEXT
            """)
            cur.execute("""
                ALTER TABLE ocr_jobs
                ADD COLUMN IF NOT EXISTS freeform_text TEXT
            """)
            cur.execute("""
                ALTER TABLE ocr_jobs
                ADD COLUMN IF NOT EXISTS qdrant_synced_at TIMESTAMPTZ
            """)
            cur.execute("""
                ALTER TABLE ocr_jobs
                ADD COLUMN IF NOT EXISTS updated_at TIMESTAMPTZ
            """)
            # Which OCR model produced this job (e.g. "deepseek-local", "ollama:glm-ocr")
            cur.execute("""
                ALTER TABLE ocr_jobs
                ADD COLUMN IF NOT EXISTS ocr_model TEXT
            """)
            # Trigger function: stamp updated_at on every row update
            cur.execute("""
                CREATE OR REPLACE FUNCTION set_updated_at()
                RETURNS TRIGGER AS $$
                BEGIN
                    NEW.updated_at = NOW();
                    RETURN NEW;
                END;
                $$ LANGUAGE plpgsql
            """)
            cur.execute("""
                CREATE OR REPLACE TRIGGER ocr_jobs_set_updated_at
                BEFORE UPDATE ON ocr_jobs
                FOR EACH ROW EXECUTE FUNCTION set_updated_at()
            """)
            # Unique constraint: prevent duplicate (author, chapter, page) submissions.
            # Applies only when all three fields are non-null.
            cur.execute("""
                CREATE UNIQUE INDEX IF NOT EXISTS ocr_jobs_author_chapter_page_unique
                ON ocr_jobs (author, chapter, page)
                WHERE author IS NOT NULL AND chapter IS NOT NULL AND page IS NOT NULL
            """)
        conn.commit()
        print("Database initialized.")
    except Exception as exc:
        print(f"Database init failed: {exc}")
        if conn:
            conn.rollback()
        raise
    finally:
        if conn:
            conn.close()


@contextmanager
def get_db():
    """Yield a connection and auto-commit/rollback."""
    conn = _get_conn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
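
The partial unique index created in `init_db()` only covers rows where author, chapter, and page are all non-null, and the commit message for 1d15b5f0c1 says a duplicate is reported as HTTP 409. The sketch below reproduces that behaviour with an in-memory SQLite database standing in for PostgreSQL, since SQLite accepts the same partial-index syntax. The `submit_job` helper, the simplified table, and the 201 success code are assumptions for illustration; the real endpoint is not part of this excerpt.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with the same partial unique index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ocr_jobs ("
        "id INTEGER PRIMARY KEY, author TEXT, chapter TEXT, page TEXT)"
    )
    conn.execute("""
        CREATE UNIQUE INDEX ocr_jobs_author_chapter_page_unique
        ON ocr_jobs (author, chapter, page)
        WHERE author IS NOT NULL AND chapter IS NOT NULL AND page IS NOT NULL
    """)
    return conn


def submit_job(conn: sqlite3.Connection, author, chapter, page) -> int:
    """Insert a job; return 201 on success or 409 when the index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ocr_jobs (author, chapter, page) VALUES (?, ?, ?)",
                (author, chapter, page),
            )
        return 201
    except sqlite3.IntegrityError:
        return 409


conn = make_db()
first = submit_job(conn, "Austen", "1", "12")
duplicate = submit_job(conn, "Austen", "1", "12")
# Rows with a missing field fall outside the index, so repeats are allowed.
no_author_a = submit_job(conn, None, "1", "12")
no_author_b = submit_job(conn, None, "1", "12")
```

With psycopg2 the analogous exception to catch would be `psycopg2.errors.UniqueViolation`, assuming the backend maps it to the 409 response in the same way.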

326
backend/format_converter.py Normal file

@@ -0,0 +1,326 @@
"""
Document Format Conversion Utilities
Handles conversion to Markdown, HTML, DOCX while preserving formatting
"""
import re
from typing import List, Dict, Any
from io import BytesIO
from docx import Document
from docx.shared import Pt, Inches, RGBColor
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
import markdown
import base64
from PIL import Image


class DocumentConverter:
    """Handles conversion of OCR results to various document formats"""

    def __init__(self):
        self.page_separator = '<--- Page Split --->'

    def to_markdown(self, pages_content: List[Dict[str, Any]], include_images: bool = True) -> str:
        """
        Convert OCR results to Markdown format

        Args:
            pages_content: List of page dictionaries with text and metadata
            include_images: Whether to include image references

        Returns:
            Markdown formatted string
        """
        md_content = []
        for idx, page in enumerate(pages_content):
            # Add page header
            md_content.append(f"# Page {idx + 1}\n")
            text = page.get('text', '')
            # Process and clean the text
            if include_images and 'images' in page:
                # Replace image placeholders with actual markdown image syntax
                for img_idx, img_data in enumerate(page.get('images', [])):
                    placeholder = f"[IMAGE_{img_idx}]"
                    img_ref = f"![Image {img_idx + 1}](data:image/jpeg;base64,{img_data})"
                    text = text.replace(placeholder, img_ref)
            md_content.append(text)
            md_content.append("\n\n---\n\n")  # Page separator
        return "\n".join(md_content)

    def to_html(self, pages_content: List[Dict[str, Any]], include_images: bool = True) -> str:
        """
        Convert OCR results to HTML format

        Args:
            pages_content: List of page dictionaries with text and metadata
            include_images: Whether to include images

        Returns:
            HTML formatted string
        """
        html_parts = []
        # HTML header
        html_parts.append("""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>OCR Results</title>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
max-width: 900px;
margin: 40px auto;
padding: 20px;
line-height: 1.6;
background-color: #f5f5f5;
}
.page {
background: white;
padding: 40px;
margin-bottom: 30px;
box-shadow: 0 2px 8px rgba(0,0,0,0.1);
border-radius: 8px;
}
.page-header {
color: #333;
border-bottom: 2px solid #4CAF50;
padding-bottom: 10px;
margin-bottom: 20px;
}
table {
border-collapse: collapse;
width: 100%;
margin: 20px 0;
}
th, td {
border: 1px solid #ddd;
padding: 12px;
text-align: left;
}
th {
background-color: #4CAF50;
color: white;
}
tr:nth-child(even) {
background-color: #f9f9f9;
}
img {
max-width: 100%;
height: auto;
margin: 15px 0;
border-radius: 4px;
}
code {
background-color: #f4f4f4;
padding: 2px 6px;
border-radius: 3px;
font-family: 'Courier New', monospace;
}
pre {
background-color: #f4f4f4;
padding: 15px;
border-radius: 5px;
overflow-x: auto;
}
</style>
</head>
<body>
<h1>DeepSeek OCR Results</h1>
""")
# Process each page
for idx, page in enumerate(pages_content):
html_parts.append(f' <div class="page">')
html_parts.append(f' <h2 class="page-header">Page {idx + 1}</h2>')
text = page.get('text', '')
# Handle images if present
if include_images and 'images' in page:
for img_idx, img_data in enumerate(page.get('images', [])):
placeholder = f"[IMAGE_{img_idx}]"
img_tag = f'<img src="data:image/jpeg;base64,{img_data}" alt="Image {img_idx + 1}" />'
text = text.replace(placeholder, img_tag)
# Convert markdown to HTML if the text appears to be markdown
if self._is_markdown(text):
html_content = markdown.markdown(text, extensions=['tables', 'fenced_code'])
else:
# Otherwise, preserve the HTML or wrap in paragraph
html_content = text if '<' in text else f'<p>{text.replace(chr(10), "<br>")}</p>'
html_parts.append(f' {html_content}')
html_parts.append(' </div>')
# HTML footer
html_parts.append("""
</body>
</html>
""")
return "\n".join(html_parts)
def to_docx(self, pages_content: List[Dict[str, Any]], include_images: bool = True) -> BytesIO:
"""
Convert OCR results to DOCX format
Args:
pages_content: List of page dictionaries with text and metadata
include_images: Whether to include images
Returns:
BytesIO object containing the DOCX file
"""
doc = Document()
# Set default font
style = doc.styles['Normal']
font = style.font
font.name = 'Calibri'
font.size = Pt(11)
# Add title
title = doc.add_heading('DeepSeek OCR Results', 0)
title.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
# Process each page
for idx, page in enumerate(pages_content):
# Add page heading
page_heading = doc.add_heading(f'Page {idx + 1}', level=1)
page_heading.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT
text = page.get('text', '')
# Handle images
if include_images and 'images' in page:
for img_idx, img_data in enumerate(page.get('images', [])):
placeholder = f"[IMAGE_{img_idx}]"
# Add image to document
try:
img_bytes = base64.b64decode(img_data)
img_stream = BytesIO(img_bytes)
doc.add_picture(img_stream, width=Inches(5))
text = text.replace(placeholder, '')
except Exception as e:
print(f"Error adding image to DOCX: {e}")
# Process text content
self._add_formatted_text_to_doc(doc, text)
# Add page break (except for last page)
if idx < len(pages_content) - 1:
doc.add_page_break()
# Save to BytesIO
docx_buffer = BytesIO()
doc.save(docx_buffer)
docx_buffer.seek(0)
return docx_buffer
    def _is_markdown(self, text: str) -> bool:
        """Check if text appears to be markdown formatted"""
        markdown_patterns = [
            r'^#+\s',          # Headers
            r'\*\*.*\*\*',     # Bold
            r'\*.*\*',         # Italic
            r'^\*\s',          # Lists
            r'^\d+\.\s',       # Numbered lists
            r'\[.*\]\(.*\)',   # Links
            r'```',            # Code blocks
        ]
        # re.MULTILINE lets the ^ anchors match at the start of every line
        return any(re.search(pattern, text, re.MULTILINE) for pattern in markdown_patterns)
    def _add_formatted_text_to_doc(self, doc: Document, text: str):
        """
        Add formatted text to document, preserving structure

        Args:
            doc: Document object
            text: Text to add
        """
        # Split into paragraphs
        paragraphs = text.split('\n\n')
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            # Check for headers. Slice off the marker instead of using
            # str.replace, which would also remove '# ' inside the heading text.
            if para.startswith('# '):
                doc.add_heading(para[2:].strip(), level=1)
            elif para.startswith('## '):
                doc.add_heading(para[3:].strip(), level=2)
            elif para.startswith('### '):
                doc.add_heading(para[4:].strip(), level=3)
            # Check for tables (simple detection)
            elif '|' in para and para.count('|') > 2:
                self._add_table_to_doc(doc, para)
            # Check for code blocks
            elif para.startswith('```'):
                # str.strip takes a set of characters, so '`' removes both fences
                code_text = para.strip('`').strip()
                p = doc.add_paragraph()
                run = p.add_run(code_text)
                run.font.name = 'Courier New'
                run.font.size = Pt(10)
            else:
                # Regular paragraph
                doc.add_paragraph(para)
def _add_table_to_doc(self, doc: Document, table_text: str):
"""
Add a table to the document from markdown-style table text
Args:
doc: Document object
table_text: Table in markdown format
"""
rows = [row.strip() for row in table_text.split('\n') if row.strip()]
# Filter out separator rows
data_rows = [row for row in rows if not re.match(r'^[\|\s\-:]+$', row)]
if not data_rows:
return
# Parse table data
table_data = []
for row in data_rows:
cells = [cell.strip() for cell in row.split('|')]
cells = [c for c in cells if c] # Remove empty cells
if cells:
table_data.append(cells)
if not table_data:
return
# Create table
max_cols = max(len(row) for row in table_data)
table = doc.add_table(rows=len(table_data), cols=max_cols)
table.style = 'Light Grid Accent 1'
# Populate table
for i, row_data in enumerate(table_data):
row = table.rows[i]
for j, cell_text in enumerate(row_data):
if j < len(row.cells):
row.cells[j].text = cell_text
# Make header row bold
if i == 0:
for paragraph in row.cells[j].paragraphs:
for run in paragraph.runs:
run.font.bold = True
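
One behaviour of the table parsing above is worth illustrating: filtering out every empty cell also drops intentionally blank interior cells, which shifts later values into the wrong column. The sketch below is a standalone variant, not the converter's own code; `parse_markdown_table` is a hypothetical helper that reuses the same separator-row regex but removes only the outer pipes.

```python
import re
from typing import List

# Same separator-row test as the converter: rows made only of |, -, : and spaces.
SEPARATOR_ROW = re.compile(r'^[\|\s\-:]+$')


def parse_markdown_table(table_text: str) -> List[List[str]]:
    """Parse a pipe table into rows of cells, keeping blank interior cells."""
    rows = [row.strip() for row in table_text.split('\n') if row.strip()]
    table: List[List[str]] = []
    for row in rows:
        if SEPARATOR_ROW.match(row):
            continue
        # Drop one leading and one trailing pipe only, so empty cells survive.
        if row.startswith('|'):
            row = row[1:]
        if row.endswith('|'):
            row = row[:-1]
        table.append([cell.strip() for cell in row.split('|')])
    return table


sample = "| Name | Qty | Note |\n|------|-----|------|\n| Pen |  | blue |\n| Ink | 3 | |"
parsed = parse_markdown_table(sample)
print(parsed)
# [['Name', 'Qty', 'Note'], ['Pen', '', 'blue'], ['Ink', '3', '']]
```

With this shape every row keeps the same column count, so a DOCX table built from it would not need the `j < len(row.cells)` guard to cope with short rows.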

File diff suppressed because it is too large

215
backend/pdf_utils.py Normal file

@@ -0,0 +1,215 @@
"""
PDF Processing Utilities for DeepSeek OCR
Handles PDF to image conversion and batch processing
"""
import ast
import io
import re
from typing import List, Tuple, Dict, Any
import fitz # PyMuPDF
import img2pdf
from PIL import Image
import numpy as np
def pdf_to_images_high_quality(pdf_bytes: bytes, dpi: int = 144) -> List[Image.Image]:
    """
    Convert PDF pages to high-quality PIL images

    Args:
        pdf_bytes: PDF file as bytes
        dpi: Resolution for rendering (default: 144)

    Returns:
        List of PIL Image objects, one per page
    """
    images = []
    # Allow reasonably large images (200 megapixels) but not decompression bombs.
    # This is a module-wide Pillow setting, so set it once rather than per page.
    Image.MAX_IMAGE_PIXELS = 200_000_000
    # Open PDF from bytes
    pdf_document = fitz.open(stream=pdf_bytes, filetype="pdf")
    # Calculate zoom factor from DPI (PDF user space is 72 points per inch)
    zoom = dpi / 72.0
    matrix = fitz.Matrix(zoom, zoom)
    # Process each page
    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        # Render page to pixmap
        pixmap = page.get_pixmap(matrix=matrix, alpha=False)
        # Convert to PIL Image
        img_data = pixmap.tobytes("png")
        img = Image.open(io.BytesIO(img_data))
        # Ensure RGB mode, flattening any alpha channel onto a white background
        if img.mode in ('RGBA', 'LA'):
            background = Image.new('RGB', img.size, (255, 255, 255))
            # Convert first so LA images are composited with their alpha band too
            rgba = img.convert('RGBA')
            background.paste(rgba, mask=rgba.split()[-1])
            img = background
        elif img.mode != 'RGB':
            img = img.convert('RGB')
        images.append(img)
    pdf_document.close()
    return images
def images_to_pdf(pil_images: List[Image.Image]) -> bytes:
"""
Convert list of PIL images to PDF bytes
Args:
pil_images: List of PIL Image objects
Returns:
PDF file as bytes
"""
if not pil_images:
return b''
image_bytes_list = []
for img in pil_images:
# Ensure RGB mode
if img.mode != 'RGB':
img = img.convert('RGB')
# Convert to JPEG bytes
img_buffer = io.BytesIO()
img.save(img_buffer, format='JPEG', quality=95)
img_bytes = img_buffer.getvalue()
image_bytes_list.append(img_bytes)
# Convert to PDF
pdf_bytes = img2pdf.convert(image_bytes_list)
return pdf_bytes
def extract_ref_patterns(text: str) -> Tuple[List[Tuple], List[str], List[str]]:
"""
Extract reference patterns from OCR output
Args:
text: OCR output text with reference tags
Returns:
Tuple of (all_matches, image_matches, other_matches)
"""
pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
matches = re.findall(pattern, text, re.DOTALL)
matches_image = []
matches_other = []
for match in matches:
if '<|ref|>image<|/ref|>' in match[0]:
matches_image.append(match[0])
else:
matches_other.append(match[0])
return matches, matches_image, matches_other
def parse_coordinates(ref_text: Tuple, image_width: int, image_height: int) -> Dict[str, Any]:
"""
Parse coordinates from reference text
Args:
ref_text: Tuple of (full_match, label, coordinates)
image_width: Image width in pixels
image_height: Image height in pixels
Returns:
Dictionary with label and scaled coordinates
"""
try:
label_type = ref_text[1]
cor_list = ast.literal_eval(ref_text[2])
# Scale coordinates from 0-999 to actual pixels
scaled_boxes = []
for points in cor_list:
x1, y1, x2, y2 = points
scaled_box = [
int(x1 / 999 * image_width),
int(y1 / 999 * image_height),
int(x2 / 999 * image_width),
int(y2 / 999 * image_height)
]
scaled_boxes.append(scaled_box)
return {
'label': label_type,
'boxes': scaled_boxes
}
except Exception as e:
print(f"Error parsing coordinates: {e}")
return None
def crop_images_from_refs(image: Image.Image, refs: List[Tuple]) -> List[Image.Image]:
"""
Crop images based on reference bounding boxes
Args:
image: Source PIL Image
refs: List of reference tuples
Returns:
List of cropped PIL Images
"""
cropped_images = []
image_width, image_height = image.size
for ref in refs:
coord_data = parse_coordinates(ref, image_width, image_height)
if coord_data and coord_data['label'] == 'image':
for box in coord_data['boxes']:
x1, y1, x2, y2 = box
try:
cropped = image.crop((x1, y1, x2, y2))
cropped_images.append(cropped)
except Exception as e:
print(f"Error cropping image: {e}")
continue
return cropped_images
def clean_markdown_content(content: str, image_refs: List[str], other_refs: List[str]) -> str:
"""
Clean markdown content by removing reference tags
Args:
content: Raw OCR output with tags
image_refs: List of image reference tags
other_refs: List of other reference tags
Returns:
Cleaned markdown content
"""
cleaned = content
# Remove image reference tags (will be replaced with markdown images)
for ref in image_refs:
cleaned = cleaned.replace(ref, '')
# Remove other reference tags and clean up formatting
for ref in other_refs:
cleaned = cleaned.replace(ref, '')
# Clean up LaTeX and formatting
cleaned = (cleaned
.replace('\\coloneqq', ':=')
.replace('\\eqqcolon', '=:')
.replace('\n\n\n\n', '\n\n')
.replace('\n\n\n', '\n\n'))
return cleaned
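
The coordinate handling in `parse_coordinates` comes down to two steps: parse the bracketed list with `ast.literal_eval`, then map each value from the 0-999 normalized grid to pixels. The sketch below works through that arithmetic on its own; `scale_boxes` is a hypothetical helper written for illustration, and the page sizes are made-up examples.

```python
import ast
from typing import List


def scale_boxes(coords_text: str, width: int, height: int) -> List[List[int]]:
    """Scale 0-999 normalized [x1, y1, x2, y2] boxes to pixel coordinates."""
    boxes = ast.literal_eval(coords_text)  # literals only, never executes code
    return [
        [
            int(x1 / 999 * width),
            int(y1 / 999 * height),
            int(x2 / 999 * width),
            int(y2 / 999 * height),
        ]
        for x1, y1, x2, y2 in boxes
    ]


# A box covering the whole grid maps to the full image.
print(scale_boxes("[[0, 0, 999, 999]]", 1224, 1584))
# [[0, 0, 1224, 1584]]

# int() truncates, so 500 / 999 * 1000 = 500.5... becomes 500.
print(scale_boxes("[[500, 250, 750, 500]]", 1000, 2000))
# [[500, 500, 750, 1001]]
```

The 1224 x 1584 size is what a 612 x 792 point page would render to at the default 144 DPI, since the zoom factor is 144 / 72 = 2.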

489
backend/providers.py Normal file

@@ -0,0 +1,489 @@
"""
OCR provider abstraction.
Each provider knows how to turn an image + a semantic OCR request (mode, prompt,
options) into raw model text. DeepSeek-specific prompt tokens and grounding-box
parsing live here too so the FastAPI routes stay model-agnostic.
Two providers ship today:
- DeepSeekLocalProvider -> the local HF transformers DeepSeek-OCR model (GPU)
- OllamaProvider -> any vision model served by an external Ollama host
The registry is built from environment variables at startup (see build_registry()).
"""
import os
import re
import base64
import tempfile
import shutil
from abc import ABC, abstractmethod
from typing import List, Dict, Any, Optional
from decouple import config as env_config
# httpx is only needed when an Ollama model is actually used; treat it as an optional
# import so the backend can run DeepSeek-only without the dependency installed.
try:
    import httpx
except ImportError:  # pragma: no cover - exercised only when httpx is missing
    httpx = None
# =============================================================================
# Prompt builders
# =============================================================================
def build_prompt(
mode: str,
user_prompt: str,
grounding: bool,
find_term: Optional[str],
schema: Optional[str],
include_caption: bool,
) -> str:
"""Build the DeepSeek-OCR prompt (with its special tokens) based on mode."""
parts: List[str] = ["<image>"]
mode_requires_grounding = mode in {"find_ref", "layout_map", "pii_redact"}
if grounding or mode_requires_grounding:
parts.append("<|grounding|>")
parts.append(_instruction_for_mode(mode, user_prompt, find_term, schema, include_caption))
return "\n".join(parts)
def build_ollama_prompt(
mode: str,
user_prompt: str,
find_term: Optional[str],
schema: Optional[str],
include_caption: bool,
) -> str:
"""Build a plain natural-language prompt for a generic vision model.
No DeepSeek grounding tokens — Ollama vision models receive the image
separately and respond in plain text.
"""
if mode == "plain_ocr":
instruction = (
"Transcribe all of the text in this image exactly as it appears, "
"preserving line breaks and reading order. Output only the transcribed "
"text with no commentary."
)
elif mode == "markdown":
instruction = (
"Convert this document image to clean GitHub-flavored Markdown, "
"preserving headings, lists, and tables. Output only the Markdown."
)
elif mode == "tables_csv":
instruction = (
"Extract every table in this image and output CSV only. Use commas with "
"minimal quoting. If there are multiple tables, separate them with a line "
"containing '---'. Output only the CSV."
)
elif mode == "tables_md":
instruction = (
"Extract every table in this image as GitHub-flavored Markdown tables. "
"Output only the tables."
)
elif mode == "kv_json":
schema_text = schema.strip() if schema else "{}"
instruction = (
"Extract the key fields from this image and return strict JSON only "
f"(no prose). Use this schema, filling in the values: {schema_text}"
)
elif mode == "figure_chart":
instruction = (
"Parse the figure in this image. First extract any numeric series as a "
"two-column table (x,y). Then add a line containing '---' followed by a "
"two-sentence summary of the chart."
)
elif mode == "find_ref":
key = (find_term or "").strip() or "Total"
instruction = (
f"Find every occurrence of '{key}' in this image and quote the surrounding "
"text for each match. If it does not appear, say so."
)
elif mode == "layout_map":
instruction = (
'Identify the layout blocks in this image and return a JSON array of '
'objects {"type": one of ["title","paragraph","table","figure"]}. '
"Do not include the text content."
)
elif mode == "pii_redact":
instruction = (
"Find all emails, phone numbers, postal addresses, and IBANs in this image. "
'Return a JSON array of objects {"label", "text"}.'
)
elif mode == "multilingual":
instruction = (
"Transcribe all of the text in this image exactly, detecting the language "
"automatically and preserving the original script. Output only the text."
)
elif mode == "describe":
instruction = "Describe this image, focusing on the key visible elements."
elif mode == "freeform":
instruction = user_prompt.strip() if user_prompt else "Transcribe the text in this image."
else:
instruction = "Transcribe the text in this image."
if include_caption and mode != "describe":
instruction += "\nThen add a one-paragraph description of the image."
return instruction
def _instruction_for_mode(
mode: str,
user_prompt: str,
find_term: Optional[str],
schema: Optional[str],
include_caption: bool,
) -> str:
"""The DeepSeek instruction text (without the <image>/<|grounding|> prefix tokens)."""
if mode == "plain_ocr":
instruction = "Free OCR."
elif mode == "markdown":
instruction = "Convert the document to markdown."
elif mode == "tables_csv":
instruction = (
"Extract every table and output CSV only. "
"Use commas, minimal quoting. If multiple tables, separate with a line containing '---'."
)
elif mode == "tables_md":
instruction = "Extract every table as GitHub-flavored Markdown tables. Output only the tables."
elif mode == "kv_json":
schema_text = schema.strip() if schema else "{}"
instruction = (
"Extract key fields and return strict JSON only. "
f"Use this schema (fill the values): {schema_text}"
)
elif mode == "figure_chart":
instruction = (
"Parse the figure. First extract any numeric series as a two-column table (x,y). "
"Then summarize the chart in 2 sentences. Output the table, then a line '---', then the summary."
)
elif mode == "find_ref":
key = (find_term or "").strip() or "Total"
instruction = f"Locate <|ref|>{key}<|/ref|> in the image."
elif mode == "layout_map":
instruction = (
'Return a JSON array of blocks with fields {"type":["title","paragraph","table","figure"],'
'"box":[x1,y1,x2,y2]}. Do not include any text content.'
)
elif mode == "pii_redact":
instruction = (
'Find all occurrences of emails, phone numbers, postal addresses, and IBANs. '
'Return a JSON array of objects {label, text, box:[x1,y1,x2,y2]}.'
)
elif mode == "multilingual":
instruction = "Free OCR. Detect the language automatically and output in the same script."
elif mode == "describe":
instruction = "Describe this image. Focus on visible key elements."
elif mode == "freeform":
instruction = user_prompt.strip() if user_prompt else "OCR this image."
else:
instruction = "OCR this image."
if include_caption and mode != "describe":
instruction = instruction + "\nThen add a one-paragraph description of the image."
return instruction
# =============================================================================
# Grounding parser (DeepSeek-specific; no-op on plain text)
# =============================================================================
# Lazy quantifiers keep each match inside a single <|det|>...<|/det|> pair, so
# output with several grounded blocks yields one match per block.
DET_BLOCK = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>\s*(?P<coords>\[.*?\])\s*<\|/det\|>",
    re.DOTALL,
)


def clean_grounding_text(text: str) -> str:
    """Remove grounding tags from text for display, keeping labels."""
    cleaned = re.sub(
        r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>\s*\[.*?\]\s*<\|/det\|>",
        r"\1",
        text,
        flags=re.DOTALL,
    )
    cleaned = re.sub(r"<\|grounding\|>", "", cleaned)
    return cleaned.strip()


def parse_detections(text: str, image_width: int, image_height: int) -> List[Dict[str, Any]]:
    """Parse grounding boxes from text and scale 0-999 normalized coords to pixels."""
    import ast  # local import, hoisted out of the loop

    boxes: List[Dict[str, Any]] = []
    for m in DET_BLOCK.finditer(text or ""):
        label = m.group("label").strip()
        coords_str = m.group("coords").strip()
        try:
            parsed = ast.literal_eval(coords_str)
            if (
                isinstance(parsed, list)
                and len(parsed) == 4
                and all(isinstance(n, (int, float)) for n in parsed)
            ):
                # A single flat box: wrap it so the loop below sees a list of boxes
                box_coords = [parsed]
            elif isinstance(parsed, list):
                box_coords = parsed
            else:
                raise ValueError("Unsupported coords structure")
            for box in box_coords:
                if isinstance(box, (list, tuple)) and len(box) >= 4:
                    x1 = int(float(box[0]) / 999 * image_width)
                    y1 = int(float(box[1]) / 999 * image_height)
                    x2 = int(float(box[2]) / 999 * image_width)
                    y2 = int(float(box[3]) / 999 * image_height)
                    boxes.append({"label": label, "box": [x1, y1, x2, y2]})
        except Exception as e:
            print(f"❌ Grounding parse failed: {e}")
            continue
    return boxes
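
The quantifier used for the coordinates group matters once the model output contains more than one grounded block. The sketch below compares a greedy and a lazy version of a simplified `<|det|>` pattern on a made-up two-block sample; the pattern names and the sample text are illustrative assumptions, not taken from real model output.

```python
import re

# Simplified patterns covering only the <|det|> part of a grounded block.
GREEDY = re.compile(r"<\|det\|>\s*(?P<coords>\[.*\])\s*<\|/det\|>", re.DOTALL)
LAZY = re.compile(r"<\|det\|>\s*(?P<coords>\[.*?\])\s*<\|/det\|>", re.DOTALL)

sample = (
    "<|ref|>title<|/ref|><|det|>[[10, 20, 300, 60]]<|/det|>\n"
    "<|ref|>table<|/ref|><|det|>[[10, 80, 900, 400]]<|/det|>"
)

greedy_hits = [m.group("coords") for m in GREEDY.finditer(sample)]
lazy_hits = [m.group("coords") for m in LAZY.finditer(sample)]

# The greedy form runs from the first '[' to the last ']' and swallows both blocks.
print(len(greedy_hits))  # 1
# The lazy form stops at the first ']' that is followed by <|/det|>.
print(lazy_hits)  # ['[[10, 20, 300, 60]]', '[[10, 80, 900, 400]]']
```

A swallowed span like the greedy one is not a valid Python literal, so `ast.literal_eval` would reject it and every box in that output would be lost.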
# =============================================================================
# Providers
# =============================================================================
GROUNDING_MODES = {"find_ref", "layout_map", "pii_redact"}
class ProviderError(Exception):
"""Raised when a provider cannot fulfil a request (e.g. backend unreachable)."""
class OCRProvider(ABC):
"""Turns an image + OCR request into raw model text."""
id: str
label: str
capabilities: Dict[str, Any]
@abstractmethod
def run(
self,
image_path: str,
*,
mode: str,
prompt: str,
grounding: bool,
find_term: Optional[str],
schema: Optional[str],
include_caption: bool,
options: Dict[str, Any],
) -> str:
"""Return the raw text output of the model for this image/request."""
def info(self) -> Dict[str, Any]:
return {"id": self.id, "label": self.label, "capabilities": self.capabilities}
class DeepSeekLocalProvider(OCRProvider):
"""Local HF transformers DeepSeek-OCR model. Loaded lazily on first use."""
def __init__(self):
self.id = "deepseek-local"
self.label = "DeepSeek-OCR (local GPU)"
self.capabilities = {"grounding": True, "advanced_settings": True}
self._model = None
self._tokenizer = None
@property
def loaded(self) -> bool:
return self._model is not None and self._tokenizer is not None
def _ensure_loaded(self):
if self.loaded:
return
# Heavy imports kept local so an Ollama-only deployment never needs torch.
import torch
from transformers import AutoModel, AutoTokenizer
os.environ.pop("TRANSFORMERS_CACHE", None)
model_name = env_config("MODEL_NAME", default="deepseek-ai/DeepSeek-OCR")
hf_home = env_config("HF_HOME", default="/models")
os.makedirs(hf_home, exist_ok=True)
print(f"🚀 Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_name,
trust_remote_code=True,
use_safetensors=True,
attn_implementation="eager",
torch_dtype=torch.bfloat16,
).eval().to("cuda")
try:
if getattr(tokenizer, "pad_token_id", None) is None and getattr(tokenizer, "eos_token_id", None) is not None:
tokenizer.pad_token = tokenizer.eos_token
if getattr(model.config, "pad_token_id", None) is None and getattr(tokenizer, "pad_token_id", None) is not None:
model.config.pad_token_id = tokenizer.pad_token_id
except Exception:
pass
self._model = model
self._tokenizer = tokenizer
print("✅ DeepSeek-OCR loaded and ready!")
def run(self, image_path, *, mode, prompt, grounding, find_term, schema, include_caption, options):
self._ensure_loaded()
prompt_text = build_prompt(
mode=mode,
user_prompt=prompt,
grounding=grounding,
find_term=find_term,
schema=schema,
include_caption=include_caption,
)
out_dir = tempfile.mkdtemp(prefix="dsocr_")
try:
res = self._model.infer(
self._tokenizer,
prompt=prompt_text,
image_file=image_path,
output_path=out_dir,
base_size=int(options.get("base_size", 1024)),
image_size=int(options.get("image_size", 640)),
crop_mode=bool(options.get("crop_mode", True)),
save_results=False,
test_compress=bool(options.get("test_compress", False)),
eval_mode=True,
)
if isinstance(res, str):
text = res.strip()
elif isinstance(res, dict) and "text" in res:
text = str(res["text"]).strip()
elif isinstance(res, (list, tuple)):
text = "\n".join(map(str, res)).strip()
else:
text = ""
if not text:
mmd = os.path.join(out_dir, "result.mmd")
if os.path.exists(mmd):
with open(mmd, "r", encoding="utf-8") as fh:
text = fh.read().strip()
return text
finally:
shutil.rmtree(out_dir, ignore_errors=True)
class OllamaProvider(OCRProvider):
"""A single vision model served by an external Ollama host."""
def __init__(self, tag: str, base_url: str, label: Optional[str] = None):
self.tag = tag
self.base_url = base_url.rstrip("/")
self.id = f"ollama:{tag}"
self.label = label or f"{tag} (Ollama)"
# Generic vision models don't emit DeepSeek grounding tokens.
self.capabilities = {"grounding": False, "advanced_settings": False}
def run(self, image_path, *, mode, prompt, grounding, find_term, schema, include_caption, options):
if httpx is None:
raise ProviderError("httpx is not installed; cannot reach Ollama.")
prompt_text = build_ollama_prompt(
mode=mode,
user_prompt=prompt,
find_term=find_term,
schema=schema,
include_caption=include_caption,
)
with open(image_path, "rb") as f:
img_b64 = base64.b64encode(f.read()).decode("utf-8")
payload = {
"model": self.tag,
"prompt": prompt_text,
"images": [img_b64],
"stream": False,
}
timeout = float(env_config("OLLAMA_TIMEOUT", default=300.0, cast=float))
try:
resp = httpx.post(f"{self.base_url}/api/generate", json=payload, timeout=timeout)
resp.raise_for_status()
data = resp.json()
except httpx.HTTPStatusError as e:
detail = ""
try:
detail = e.response.json().get("error", "")
except Exception:
detail = e.response.text[:200]
raise ProviderError(f"Ollama returned {e.response.status_code}: {detail}") from e
except httpx.HTTPError as e:
raise ProviderError(f"Could not reach Ollama at {self.base_url}: {e}") from e
return (data.get("response") or "").strip()
# =============================================================================
# Registry
# =============================================================================
class ModelRegistry:
def __init__(self, providers: List[OCRProvider], default_id: str):
self._providers: Dict[str, OCRProvider] = {p.id: p for p in providers}
# Fall back to the first registered provider if the configured default is gone.
self.default_id = default_id if default_id in self._providers else (
next(iter(self._providers), None)
)
def get(self, model_id: Optional[str]) -> OCRProvider:
chosen = model_id or self.default_id
provider = self._providers.get(chosen)
if provider is None:
raise ProviderError(f"Unknown model '{chosen}'.")
return provider
def list_models(self) -> List[Dict[str, Any]]:
out = []
for p in self._providers.values():
entry = p.info()
entry["default"] = (p.id == self.default_id)
out.append(entry)
return out
def build_registry() -> ModelRegistry:
"""Build the provider registry from environment variables.
Env:
ENABLE_DEEPSEEK_LOCAL - register the local DeepSeek-OCR model (default: true)
OLLAMA_BASE_URL - Ollama host (default: http://host.docker.internal:11434)
OLLAMA_MODELS - comma-separated tags to surface (e.g. "glm-ocr,llama3.2-vision")
DEFAULT_OCR_MODEL - id to select by default (default: deepseek-local)
"""
providers: List[OCRProvider] = []
enable_deepseek = env_config("ENABLE_DEEPSEEK_LOCAL", default="true").strip().lower() in {"1", "true", "yes"}
if enable_deepseek:
providers.append(DeepSeekLocalProvider())
base_url = env_config("OLLAMA_BASE_URL", default="http://host.docker.internal:11434")
raw_tags = env_config("OLLAMA_MODELS", default="")
tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
for tag in tags:
providers.append(OllamaProvider(tag=tag, base_url=base_url))
default_id = env_config("DEFAULT_OCR_MODEL", default="deepseek-local")
if not providers:
# Defensive: nothing configured. Register DeepSeek so the app still starts.
providers.append(DeepSeekLocalProvider())
default_id = "deepseek-local"
registry = ModelRegistry(providers, default_id)
print(f"🧠 OCR models registered: {[p.id for p in providers]} (default: {registry.default_id})")
return registry
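
The registry rules in `build_registry()` and `ModelRegistry` can be hard to follow across the two definitions, so here is a minimal sketch that reduces them to plain ids. `resolve_model_ids` is a hypothetical helper that needs neither torch nor httpx; it assumes the same `ollama:<tag>` id scheme and the same fallbacks as the code above.

```python
from typing import Dict, List, Union


def resolve_model_ids(raw_tags: str, enable_deepseek: bool,
                      default_id: str) -> Dict[str, Union[List[str], str]]:
    """Return the registered ids and the effective default id."""
    ids: List[str] = []
    if enable_deepseek:
        ids.append("deepseek-local")
    # OLLAMA_MODELS is comma-separated; blank entries are ignored.
    ids += [f"ollama:{tag.strip()}" for tag in raw_tags.split(",") if tag.strip()]
    if not ids:
        # Nothing configured: register DeepSeek so the app still starts.
        ids.append("deepseek-local")
        default_id = "deepseek-local"
    if default_id not in ids:
        # Configured default is not registered: fall back to the first provider.
        default_id = ids[0]
    return {"ids": ids, "default": default_id}


print(resolve_model_ids("glm-ocr, llama3.2-vision", False, "deepseek-local"))
# {'ids': ['ollama:glm-ocr', 'ollama:llama3.2-vision'], 'default': 'ollama:glm-ocr'}
```

In an Ollama-only deployment the default therefore moves to the first listed tag unless `DEFAULT_OCR_MODEL` names a registered id.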


@@ -11,3 +11,9 @@ pillow
safetensors
torch
python-decouple>=3.8
PyMuPDF>=1.23.0
img2pdf>=0.5.0
python-docx>=1.1.0
markdown>=3.5.0
psycopg2-binary>=2.9.0
httpx>=0.27.0

150
backend/test_security.py Normal file

@@ -0,0 +1,150 @@
"""
Security regression tests for the eval() RCE vulnerability (OX Security disclosure).
The vulnerability allowed arbitrary code execution via crafted OCR output
that was passed to eval() in parse_coordinates(). The fix uses ast.literal_eval()
which only allows literal data structures.
This test is self-contained and does not require backend dependencies.
Run: python test_security.py
"""
import ast
def parse_coordinates(ref_text, image_width, image_height):
"""
Minimal reproduction of pdf_utils.parse_coordinates using the patched code.
This mirrors the fixed version that uses ast.literal_eval() instead of eval().
"""
try:
label_type = ref_text[1]
cor_list = ast.literal_eval(ref_text[2])
scaled_boxes = []
for points in cor_list:
x1, y1, x2, y2 = points
scaled_box = [
int(x1 / 999 * image_width),
int(y1 / 999 * image_height),
int(x2 / 999 * image_width),
int(y2 / 999 * image_height)
]
scaled_boxes.append(scaled_box)
return {
'label': label_type,
'boxes': scaled_boxes
}
except Exception as e:
print(f" [Blocked] {type(e).__name__}: {e}")
return None
def test_legitimate_coordinates():
"""Verify that normal coordinate parsing still works."""
ref_text = ("full_match", "text", "[[312, 339, 480, 681]]")
result = parse_coordinates(ref_text, 1000, 1000)
assert result is not None, "Legitimate coordinates should parse successfully"
assert result['label'] == 'text'
assert len(result['boxes']) == 1
print("PASS: Legitimate coordinates parse correctly")
def test_multiple_boxes():
"""Verify multiple bounding boxes still work."""
ref_text = ("full_match", "image", "[[100, 200, 300, 400], [500, 600, 700, 800]]")
result = parse_coordinates(ref_text, 1000, 1000)
assert result is not None, "Multiple boxes should parse successfully"
assert len(result['boxes']) == 2
print("PASS: Multiple bounding boxes parse correctly")
def test_rce_blocked_import_os():
"""The original exploit: __import__('os').system('...') must be blocked."""
malicious = "__import__('os').system('echo HACKED')"
ref_text = ("full_match", "exploit", malicious)
result = parse_coordinates(ref_text, 1000, 1000)
assert result is None, "Code execution payload should be rejected"
print("PASS: __import__('os').system() payload is blocked")
def test_rce_blocked_exec():
"""exec() based payloads must be blocked."""
malicious = "exec('import os; os.system(\"echo HACKED\")')"
ref_text = ("full_match", "exploit", malicious)
result = parse_coordinates(ref_text, 1000, 1000)
assert result is None, "exec() payload should be rejected"
print("PASS: exec() payload is blocked")
def test_rce_blocked_eval():
"""Nested eval() payloads must be blocked."""
malicious = "eval('__import__(\"os\").popen(\"id\").read()')"
ref_text = ("full_match", "exploit", malicious)
result = parse_coordinates(ref_text, 1000, 1000)
assert result is None, "Nested eval() payload should be rejected"
print("PASS: Nested eval() payload is blocked")
def test_rce_blocked_lambda():
"""Lambda-based payloads must be blocked."""
malicious = "(lambda: __import__('os').system('echo HACKED'))()"
ref_text = ("full_match", "exploit", malicious)
result = parse_coordinates(ref_text, 1000, 1000)
assert result is None, "Lambda payload should be rejected"
print("PASS: Lambda payload is blocked")
def test_rce_blocked_comprehension():
"""List comprehension code execution must be blocked."""
malicious = "[__import__('os').system('echo HACKED') for x in [1]]"
ref_text = ("full_match", "exploit", malicious)
result = parse_coordinates(ref_text, 1000, 1000)
assert result is None, "List comprehension payload should be rejected"
print("PASS: List comprehension payload is blocked")
if __name__ == "__main__":
print("=" * 60)
print("Security Regression Tests (OX Security RCE disclosure)")
print("=" * 60)
print()
tests = [
test_legitimate_coordinates,
test_multiple_boxes,
test_rce_blocked_import_os,
test_rce_blocked_exec,
test_rce_blocked_eval,
test_rce_blocked_lambda,
test_rce_blocked_comprehension,
]
passed = 0
failed = 0
for test in tests:
try:
test()
passed += 1
except AssertionError as e:
print(f"FAIL: {test.__name__}: {e}")
failed += 1
except Exception as e:
print(f"ERROR: {test.__name__}: {e}")
failed += 1
print()
print(f"Results: {passed} passed, {failed} failed out of {len(tests)} tests")
if failed == 0:
print("All security tests passed - RCE vulnerability is patched.")
else:
print("WARNING: Some tests failed!")
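
The tests above rely on one property of `ast.literal_eval`: it accepts only Python literals and raises on anything that would need evaluation, such as a function call. The sketch below shows that property in isolation; `safe_parse` is a hypothetical helper, and the harmless `getcwd()` call stands in for a malicious payload.

```python
import ast
from typing import Any, Optional


def safe_parse(payload: str) -> Optional[Any]:
    """Parse a literal, returning None for anything that is not pure data."""
    try:
        return ast.literal_eval(payload)
    except (ValueError, SyntaxError):
        # Calls, lambdas, comprehensions and names are rejected without being run.
        return None


print(safe_parse("[[312, 339, 480, 681]]"))     # [[312, 339, 480, 681]]
print(safe_parse("__import__('os').getcwd()"))  # None
print(safe_parse("[x for x in [1]]"))           # None
```

Note that a valid literal of the wrong shape, such as `"42"`, still parses here; in `parse_coordinates` that case is caught later, when unpacking the boxes fails inside the same `try` block.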


@@ -1,4 +1,19 @@
services:
  postgres:
    image: postgres:16-alpine
    container_name: deepseek-ocr-postgres
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-ocr_user}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-ocr_password}
      POSTGRES_DB: ${POSTGRES_DB:-ocr_db}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-ocr_user} -d ${POSTGRES_DB:-ocr_db}"]
      interval: 5s
      timeout: 5s
      retries: 10
  backend:
    build: ./backend
    container_name: deepseek-ocr-backend
@@ -10,8 +25,23 @@ services:
      API_HOST: ${API_HOST:-0.0.0.0}
      API_PORT: ${API_PORT:-8000}
      MAX_UPLOAD_SIZE_MB: ${MAX_UPLOAD_SIZE_MB:-100}
      DATABASE_URL: ${DATABASE_URL:-postgresql://ocr_user:ocr_password@postgres:5432/ocr_db}
      OCR_IMAGES_DIR: ${OCR_IMAGES_DIR:-/data/ocr_images}
      ENABLE_DEEPSEEK_LOCAL: ${ENABLE_DEEPSEEK_LOCAL:-true}
      OLLAMA_BASE_URL: ${OLLAMA_BASE_URL:-http://host.docker.internal:11434}
      OLLAMA_MODELS: ${OLLAMA_MODELS:-}
      DEFAULT_OCR_MODEL: ${DEFAULT_OCR_MODEL:-deepseek-local}
      OLLAMA_TIMEOUT: ${OLLAMA_TIMEOUT:-300}
    # Lets the container reach an Ollama server running on the Docker host
    # (works out of the box on Docker Desktop; required for Linux engines).
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./models:/models
      - ./ocr_images:/data/ocr_images
    depends_on:
      postgres:
        condition: service_healthy
    deploy:
      resources:
        reservations:
@@ -22,8 +52,6 @@ services:
    shm_size: "4g"
    ports:
      - "${API_PORT:-8000}:${API_PORT:-8000}"
    networks:
      - ocr-network
  frontend:
    build: ./frontend
@@ -32,9 +60,10 @@ services:
      - "${FRONTEND_PORT:-3000}:80"
    depends_on:
      - backend
    networks:
      - ocr-network
volumes:
  postgres_data:
networks:
  ocr-network:
    driver: bridge
  default:
    name: rw-research


@@ -10,6 +10,7 @@
},
"dependencies": {
"axios": "^1.6.5",
"dompurify": "^3.3.3",
"framer-motion": "^11.0.0",
"lucide-react": "^0.344.0",
"react": "^18.3.1",


@@ -1,66 +1,118 @@
import { useState, useCallback } from 'react'
import { useState, useCallback, useEffect } from 'react'
import { useSuggestions } from './hooks/useSuggestions'
import { useModels } from './hooks/useModels'
import { motion, AnimatePresence } from 'framer-motion'
import { Sparkles, Zap, Loader2 } from 'lucide-react'
import {
Sparkles, Zap, Loader2, Settings, Image as ImageIcon, FileText,
Layers, ChevronLeft, CheckCircle2, Database,
} from 'lucide-react'
import ImageUpload from './components/ImageUpload'
import ModeSelector from './components/ModeSelector'
import ModelSelector from './components/ModelSelector'
import ResultPanel from './components/ResultPanel'
import AdvancedSettings from './components/AdvancedSettings'
import PDFProcessor from './components/PDFProcessor'
import MetadataForm from './components/MetadataForm'
import JobsPanel from './components/JobsPanel'
import axios from 'axios'
const API_BASE = import.meta.env.VITE_API_URL || '/api'
const INPUT_CLASS =
'w-full bg-white/5 border border-white/10 rounded-lg px-3 py-2 text-sm text-gray-200 ' +
'placeholder-gray-600 focus:outline-none focus:border-purple-500/50 transition-colors'
function App() {
const [view, setView] = useState('new_job')
// OCR state
const { models, loading: modelsLoading } = useModels()
const [model, setModel] = useState(null)
const [mode, setMode] = useState('plain_ocr')
const [fileType, setFileType] = useState('image')
const [image, setImage] = useState(null)
const [imagePreview, setImagePreview] = useState(null)
const [result, setResult] = useState(null)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
// Form state
const [showAdvanced, setShowAdvanced] = useState(false)
const [includeCaption, setIncludeCaption] = useState(false)
const [prompt, setPrompt] = useState('')
const [findTerm, setFindTerm] = useState('')
const [advancedSettings, setAdvancedSettings] = useState({
base_size: 1024,
image_size: 640,
crop_mode: true,
test_compress: false
base_size: 1024, image_size: 640, crop_mode: true, test_compress: false,
})
const suggestions = useSuggestions()
const [metadata, setMetadata] = useState({ author: '', book: '', chapter: '', page: '' })
// Results accumulated per committable mode: { plain_ocr: 'text', describe: 'text' }
const [modeResults, setModeResults] = useState({})
const [editedResults, setEditedResults] = useState({})
const [activeResultMode, setActiveResultMode] = useState(null)
const [commitLoading, setCommitLoading] = useState(false)
const [commitResult, setCommitResult] = useState(null)
// Modes that produce editable text output and can be committed to the DB
const COMMITTABLE_MODES = new Set(['plain_ocr', 'describe'])
const MODE_LABELS = { plain_ocr: 'OCR Text', describe: 'Description' }
// Pick the default model once the list loads
useEffect(() => {
if (!model && models.length > 0) {
setModel((models.find(m => m.default) || models[0]).id)
}
}, [models, model])
// Show the full-screen result view once at least one committable mode has a result
const showResultView = view === 'new_job' && Object.keys(modeResults).length > 0
const handleFileTypeChange = useCallback((newType) => {
setImage(null)
if (imagePreview) URL.revokeObjectURL(imagePreview)
setImagePreview(null)
setError(null)
setResult(null)
setFileType(newType)
}, [imagePreview])
const handleImageSelect = useCallback((file) => {
if (file === null) {
// Clear everything when removing image
setImage(null)
if (imagePreview) {
URL.revokeObjectURL(imagePreview)
}
if (imagePreview && fileType === 'image') URL.revokeObjectURL(imagePreview)
setImagePreview(null)
setError(null)
setResult(null)
setModeResults({})
setEditedResults({})
setActiveResultMode(null)
setCommitResult(null)
} else {
setImage(file)
setImagePreview(URL.createObjectURL(file))
setImagePreview(fileType === 'image' ? URL.createObjectURL(file) : file)
setError(null)
setResult(null)
setModeResults({})
setEditedResults({})
setActiveResultMode(null)
setCommitResult(null)
}
}, [imagePreview])
}, [imagePreview, fileType])
const handleSubmit = async () => {
if (!image) {
setError('Please upload an image first')
return
}
if (!image) { setError('Please upload an image first'); return }
setLoading(true)
setError(null)
setCommitResult(null)
try {
const formData = new FormData()
formData.append('image', image)
if (model) formData.append('model', model)
formData.append('mode', mode)
formData.append('prompt', prompt)
// Enable grounding only for find mode
formData.append('grounding', mode === 'find_ref')
formData.append('include_caption', false)
formData.append('include_caption', includeCaption)
formData.append('find_term', findTerm)
formData.append('schema', '')
formData.append('base_size', advancedSettings.base_size)
@@ -69,12 +121,16 @@ function App() {
formData.append('test_compress', advancedSettings.test_compress)
const response = await axios.post(`${API_BASE}/ocr`, formData, {
headers: {
'Content-Type': 'multipart/form-data',
},
headers: { 'Content-Type': 'multipart/form-data' },
})
setResult(response.data)
if (COMMITTABLE_MODES.has(mode)) {
const text = response.data.text || ''
setModeResults(prev => ({ ...prev, [mode]: text }))
setEditedResults(prev => ({ ...prev, [mode]: text }))
setActiveResultMode(mode)
}
setCommitResult(null)
} catch (err) {
setError(err.response?.data?.detail || err.message || 'An error occurred')
} finally {
@@ -82,31 +138,61 @@ function App() {
}
}
const handleCopy = useCallback(() => {
if (result?.text) {
navigator.clipboard.writeText(result.text)
const handleNewAnalysis = () => {
setResult(null)
setModeResults({})
setEditedResults({})
setActiveResultMode(null)
setCommitResult(null)
}
const handleCommitJob = useCallback(async () => {
if (!image) return
setCommitLoading(true)
setCommitResult(null)
try {
const formData = new FormData()
formData.append('image', image)
formData.append('author', metadata.author)
formData.append('book', metadata.book)
formData.append('chapter', metadata.chapter)
formData.append('page', metadata.page)
formData.append('ocr_text', editedResults.plain_ocr || '')
formData.append('describe_text', editedResults.describe || '')
formData.append('freeform_text', editedResults.freeform || '')
formData.append('mode', mode)
if (model) formData.append('ocr_model', model)
const response = await axios.post(`${API_BASE}/jobs`, formData, {
headers: { 'Content-Type': 'multipart/form-data' },
})
setCommitResult({ success: true, job: response.data })
} catch (err) {
setCommitResult({ success: false, error: err.response?.data?.detail || err.message })
} finally {
setCommitLoading(false)
}
}, [result])
}, [image, editedResults, metadata, mode, model])
const handleCopy = useCallback(() => {
const text = (activeResultMode && editedResults[activeResultMode]) || result?.text
if (text) navigator.clipboard.writeText(text)
}, [activeResultMode, editedResults, result])
const handleDownload = useCallback(() => {
if (!result?.text) return
const extensions = {
plain_ocr: 'txt',
describe: 'txt',
find_ref: 'txt',
freeform: 'txt',
}
const ext = extensions[mode] || 'txt'
const blob = new Blob([result.text], { type: 'text/plain' })
const text = (activeResultMode && editedResults[activeResultMode]) || result?.text
if (!text) return
const ext = { plain_ocr: 'txt', describe: 'txt', find_ref: 'txt', freeform: 'txt' }[mode] || 'txt'
const blob = new Blob([text], { type: 'text/plain' })
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
a.download = `deepseek-ocr-result.${ext}`
a.click()
URL.revokeObjectURL(url)
}, [result, mode])
}, [activeResultMode, editedResults, result, mode])
const metaField = (key) => (e) => setMetadata(m => ({ ...m, [key]: e.target.value }))
return (
<div className="min-h-screen relative overflow-hidden">
@@ -116,27 +202,13 @@ function App() {
<div className="absolute inset-0 bg-[url('data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNjAiIGhlaWdodD0iNjAiIHZpZXdCb3g9IjAgMCA2MCA2MCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj48ZyBmaWxsPSJub25lIiBmaWxsLXJ1bGU9ImV2ZW5vZGQiPjxwYXRoIGQ9Ik0zNiAxOGMzLjMxIDAgNiAyLjY5IDYgNnMtMi42OSA2LTYgNi02LTIuNjktNi02IDIuNjktNiA2LTZ6TTI0IDZjMy4zMSAwIDYgMi42OSA2IDZzLTIuNjkgNi02IDYtNi0yLjY5LTYtNiAyLjY5LTYgNi02ek00OCAzNmMzLjMxIDAgNiAyLjY5IDYgNnMtMi42OSA2LTYgNi02LTIuNjktNi02IDIuNjktNiA2LTZ6IiBzdHJva2U9InJnYmEoMTQ3LCA1MSwgMjM0LCAwLjEpIiBzdHJva2Utd2lkdGg9IjIiLz48L2c+PC9zdmc+')] opacity-30" />
<motion.div
className="absolute top-20 left-20 w-96 h-96 bg-purple-500/10 rounded-full blur-3xl"
animate={{
scale: [1, 1.2, 1],
opacity: [0.3, 0.5, 0.3],
}}
transition={{
duration: 8,
repeat: Infinity,
ease: "easeInOut"
}}
animate={{ scale: [1, 1.2, 1], opacity: [0.3, 0.5, 0.3] }}
transition={{ duration: 8, repeat: Infinity, ease: 'easeInOut' }}
/>
<motion.div
className="absolute bottom-20 right-20 w-96 h-96 bg-cyan-500/10 rounded-full blur-3xl"
animate={{
scale: [1.2, 1, 1.2],
opacity: [0.5, 0.3, 0.5],
}}
transition={{
duration: 8,
repeat: Infinity,
ease: "easeInOut"
}}
animate={{ scale: [1.2, 1, 1.2], opacity: [0.5, 0.3, 0.5] }}
transition={{ duration: 8, repeat: Infinity, ease: 'easeInOut' }}
/>
</div>
@@ -144,11 +216,7 @@ function App() {
<header className="sticky top-0 z-50 glass border-b border-white/10">
<div className="max-w-7xl mx-auto px-6 py-4">
<div className="flex items-center justify-between">
<motion.div
className="flex items-center gap-3"
initial={{ opacity: 0, x: -20 }}
animate={{ opacity: 1, x: 0 }}
>
<motion.div className="flex items-center gap-3" initial={{ opacity: 0, x: -20 }} animate={{ opacity: 1, x: 0 }}>
<div className="relative">
<div className="absolute inset-0 bg-gradient-to-r from-purple-500 to-cyan-500 rounded-xl blur-lg opacity-75" />
<div className="relative bg-gradient-to-br from-purple-600 to-cyan-500 p-2 rounded-xl">
@@ -160,97 +228,353 @@ function App() {
<p className="text-xs text-gray-400">Next-Gen Vision AI</p>
</div>
</motion.div>
<nav className="flex gap-2">
{showResultView && (
<motion.button
onClick={handleNewAnalysis}
className="flex items-center gap-2 px-4 py-2 rounded-xl text-sm font-medium glass text-gray-400 hover:bg-white/5 transition-all"
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<ChevronLeft className="w-4 h-4" />
New Analysis
</motion.button>
)}
<motion.button
onClick={() => setView('new_job')}
className={`flex items-center gap-2 px-4 py-2 rounded-xl text-sm font-medium transition-all ${view === 'new_job' ? 'bg-gradient-to-r from-purple-600 to-cyan-600 text-white' : 'glass text-gray-400 hover:bg-white/5'}`}
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<Zap className="w-4 h-4" />
New Job
</motion.button>
<motion.button
onClick={() => setView('jobs')}
className={`flex items-center gap-2 px-4 py-2 rounded-xl text-sm font-medium transition-all ${view === 'jobs' ? 'bg-gradient-to-r from-purple-600 to-cyan-600 text-white' : 'glass text-gray-400 hover:bg-white/5'}`}
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<Layers className="w-4 h-4" />
Browse Jobs
</motion.button>
</nav>
</div>
</div>
</header>
{/* Main Content */}
<main className="max-w-7xl mx-auto px-6 py-8">
<div className="grid lg:grid-cols-2 gap-6">
{/* Left Panel - Upload & Controls */}
<motion.div
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
transition={{ delay: 0.1 }}
className="space-y-6"
>
{/* Mode Selector with integrated inputs */}
<ModeSelector
mode={mode}
onModeChange={setMode}
prompt={prompt}
onPromptChange={setPrompt}
findTerm={findTerm}
onFindTermChange={setFindTerm}
/>
<main className="max-w-7xl mx-auto px-6 py-6">
<AnimatePresence>
{/* Image Upload */}
<ImageUpload
onImageSelect={handleImageSelect}
preview={imagePreview}
/>
{/* Action Button */}
<motion.button
onClick={handleSubmit}
disabled={!image || loading}
className={`w-full relative overflow-hidden rounded-2xl p-[2px] ${
!image || loading ? 'opacity-50 cursor-not-allowed' : ''
}`}
whileHover={!loading && image ? { scale: 1.02 } : {}}
whileTap={!loading && image ? { scale: 0.98 } : {}}
{/* ── Full-screen OCR result view ── */}
{showResultView ? (
<motion.div
key="ocr_result"
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
className="flex flex-col gap-4"
>
<div className="absolute inset-0 bg-gradient-to-r from-purple-600 via-pink-600 to-cyan-600 animate-gradient" />
<div className="relative bg-dark-100 px-8 py-4 rounded-2xl flex items-center justify-center gap-3">
{loading ? (
<>
<Loader2 className="w-5 h-5 animate-spin" />
<span className="font-semibold">Processing Magic...</span>
</>
) : (
<>
<Zap className="w-5 h-5" />
<span className="font-semibold">Analyze Image</span>
</>
)}
{/* Run additional modes */}
<div className="glass p-4 rounded-2xl flex-shrink-0">
<div className="mb-3">
<ModelSelector
models={models} value={model} onChange={setModel} loading={modelsLoading}
/>
</div>
<ModeSelector mode={mode} onModeChange={setMode} />
<div className="flex items-center gap-3 mt-3">
<motion.button
onClick={handleSubmit}
disabled={loading}
className={`flex items-center gap-2 px-5 py-2 rounded-xl font-medium text-sm transition-all ${loading ? 'opacity-50 cursor-not-allowed bg-white/5' : 'bg-gradient-to-r from-purple-600 to-cyan-600'}`}
whileHover={!loading ? { scale: 1.02 } : {}}
whileTap={!loading ? { scale: 0.98 } : {}}
>
{loading
? <><Loader2 className="w-4 h-4 animate-spin" /> Processing...</>
: <><Zap className="w-4 h-4" /> Analyze</>}
</motion.button>
{error && <p className="text-sm text-red-400">{error}</p>}
</div>
</div>
</motion.button>
{error && (
<motion.div
initial={{ opacity: 0, y: -10 }}
animate={{ opacity: 1, y: 0 }}
className="glass p-4 rounded-2xl border-red-500/50 bg-red-500/10"
>
<p className="text-sm text-red-400">{error}</p>
</motion.div>
)}
</motion.div>
{/* Image + Text */}
<div className="grid gap-6" style={{ gridTemplateColumns: '1fr 1fr', height: '130vh' }}>
{imagePreview && typeof imagePreview === 'string' ? (
<div className="glass rounded-2xl overflow-hidden flex items-center justify-center bg-black/20 h-full">
<img
src={imagePreview}
alt="Source"
className="w-full h-full object-contain"
/>
</div>
) : (
<div className="glass rounded-2xl flex items-center justify-center h-full">
<p className="text-gray-500 text-sm">No preview</p>
</div>
)}
<div className="glass rounded-2xl p-4 flex flex-col h-full">
{/* Mode tabs — only shown when multiple modes have results */}
{Object.keys(modeResults).length > 1 && (
<div className="flex gap-1 mb-3 flex-shrink-0">
{Object.keys(modeResults).map(m => (
<button
key={m}
onClick={() => setActiveResultMode(m)}
className={`px-3 py-1 rounded-lg text-xs font-medium transition-colors ${
activeResultMode === m
? 'bg-purple-600 text-white'
: 'bg-white/5 text-gray-400 hover:bg-white/10'
}`}
>
{MODE_LABELS[m] || m}
</button>
))}
</div>
)}
<p className="text-xs text-gray-400 mb-2 flex-shrink-0">
{MODE_LABELS[activeResultMode] || 'Result'}
<span className="text-purple-400 ml-1">(edit before committing)</span>
</p>
{loading && COMMITTABLE_MODES.has(mode) ? (
<div className="flex-1 flex items-center justify-center">
<Loader2 className="w-8 h-8 animate-spin text-purple-400" />
</div>
) : (
<textarea
value={activeResultMode ? (editedResults[activeResultMode] ?? '') : ''}
onChange={e => setEditedResults(prev => ({ ...prev, [activeResultMode]: e.target.value }))}
className="flex-1 w-full bg-transparent text-sm text-gray-200 font-mono resize-none focus:outline-none min-h-0"
placeholder="Run a mode to see results here..."
/>
)}
</div>
</div>
{/* Right Panel - Results */}
<motion.div
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
transition={{ delay: 0.2 }}
>
<ResultPanel
result={result}
loading={loading}
imagePreview={imagePreview}
onCopy={handleCopy}
onDownload={handleDownload}
/>
</motion.div>
</div>
{/* Metadata row */}
<div className="glass p-4 rounded-2xl flex-shrink-0">
<datalist id="rv-authors">
{suggestions.authors.map(a => <option key={a} value={a} />)}
</datalist>
<datalist id="rv-books">
{(suggestions.books || []).map(b => <option key={b} value={b} />)}
</datalist>
<datalist id="rv-chapters">
{suggestions.chapters.map(c => <option key={c} value={c} />)}
</datalist>
<div className="grid grid-cols-4 gap-4">
{[
{ key: 'author', label: 'Author', placeholder: 'Author name', list: 'rv-authors' },
{ key: 'book', label: 'Book', placeholder: 'Book title', list: 'rv-books' },
{ key: 'chapter', label: 'Chapter', placeholder: 'Chapter', list: 'rv-chapters' },
{ key: 'page', label: 'Page', placeholder: 'Page number', list: undefined },
].map(({ key, label, placeholder, list }) => (
<div key={key}>
<label className="text-xs text-gray-400 mb-1 block">{label}</label>
<input
type="text"
list={list}
value={metadata[key]}
onChange={metaField(key)}
placeholder={placeholder}
className={INPUT_CLASS}
/>
</div>
))}
</div>
</div>
{/* Commit row */}
<div className="flex items-center gap-4 flex-shrink-0">
<AnimatePresence>
{commitResult?.success && (
<motion.div
initial={{ opacity: 0, x: -10 }} animate={{ opacity: 1, x: 0 }} exit={{ opacity: 0 }}
className="flex-1 glass p-3 rounded-xl bg-green-500/10 border border-green-500/20"
>
<p className="text-xs text-green-400">
Job saved &mdash; ID: <span className="font-mono">{commitResult.job?.id}</span>
</p>
</motion.div>
)}
{commitResult && !commitResult.success && (
<motion.div
initial={{ opacity: 0, x: -10 }} animate={{ opacity: 1, x: 0 }} exit={{ opacity: 0 }}
className="flex-1 glass p-3 rounded-xl bg-red-500/10 border border-red-500/20"
>
<p className="text-xs text-red-400">{commitResult.error}</p>
</motion.div>
)}
</AnimatePresence>
<motion.button
onClick={handleCommitJob}
disabled={commitLoading || commitResult?.success}
className={`flex items-center gap-2 px-6 py-3 rounded-xl font-medium text-sm transition-all flex-shrink-0 ${
commitLoading || commitResult?.success
? 'opacity-50 cursor-not-allowed bg-white/5'
: 'bg-gradient-to-r from-blue-600 to-indigo-600 hover:from-blue-500 hover:to-indigo-500'
}`}
whileHover={!commitLoading && !commitResult?.success ? { scale: 1.02 } : {}}
whileTap={!commitLoading && !commitResult?.success ? { scale: 0.98 } : {}}
>
{commitLoading ? (
<><Loader2 className="w-4 h-4 animate-spin" /> Committing...</>
) : commitResult?.success ? (
<><CheckCircle2 className="w-4 h-4" /> Committed</>
) : (
<><Database className="w-4 h-4" /> Commit Job</>
)}
</motion.button>
</div>
</motion.div>
) : view === 'jobs' ? (
<motion.div
key="jobs"
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
>
<JobsPanel />
</motion.div>
) : (
/* ── Upload / Controls layout ── */
<motion.div
key="new_job"
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
>
<div className="grid lg:grid-cols-2 gap-6">
{/* Left Panel */}
<motion.div
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
transition={{ delay: 0.1 }}
className="space-y-6"
>
{/* File Type Toggle */}
<div className="glass p-4 rounded-2xl">
<div className="grid grid-cols-2 gap-2">
<motion.button
onClick={() => handleFileTypeChange('image')}
className={`p-3 rounded-xl text-sm font-medium transition-all flex items-center justify-center gap-2 ${fileType === 'image' ? 'bg-gradient-to-r from-purple-600 to-cyan-600 text-white' : 'glass text-gray-400 hover:bg-white/5'}`}
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<ImageIcon className="w-4 h-4" /> Image OCR
</motion.button>
<motion.button
onClick={() => handleFileTypeChange('pdf')}
className={`p-3 rounded-xl text-sm font-medium transition-all flex items-center justify-center gap-2 ${fileType === 'pdf' ? 'bg-gradient-to-r from-purple-600 to-cyan-600 text-white' : 'glass text-gray-400 hover:bg-white/5'}`}
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<FileText className="w-4 h-4" /> PDF Processing
</motion.button>
</div>
</div>
<MetadataForm metadata={metadata} onChange={setMetadata} suggestions={suggestions} />
<ModelSelector
models={models} value={model} onChange={setModel} loading={modelsLoading}
/>
<ModeSelector mode={mode} onModeChange={setMode} />
<ImageUpload onImageSelect={handleImageSelect} preview={imagePreview} fileType={fileType} />
<motion.button
onClick={() => setShowAdvanced(!showAdvanced)}
className="w-full glass px-4 py-3 rounded-2xl flex items-center justify-between hover:bg-white/5 transition-colors"
whileHover={{ scale: 1.01 }} whileTap={{ scale: 0.99 }}
>
<div className="flex items-center gap-2">
<Settings className="w-4 h-4 text-purple-400" />
<span className="text-sm font-medium text-gray-300">Advanced Settings</span>
</div>
<motion.div animate={{ rotate: showAdvanced ? 180 : 0 }} transition={{ duration: 0.3 }}>
<svg className="w-4 h-4 text-gray-400" fill="none" stroke="currentColor" viewBox="0 0 24 24">
<path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M19 9l-7 7-7-7" />
</svg>
</motion.div>
</motion.button>
<AnimatePresence>
{showAdvanced && (
<AdvancedSettings
settings={advancedSettings} onSettingsChange={setAdvancedSettings}
includeCaption={includeCaption} onIncludeCaptionChange={setIncludeCaption}
/>
)}
</AnimatePresence>
{fileType === 'pdf' ? (
<PDFProcessor
pdfFile={image} mode={mode} prompt={prompt} model={model}
advancedSettings={advancedSettings} includeCaption={includeCaption}
/>
) : (
<>
<motion.button
onClick={handleSubmit}
disabled={!image || loading}
className={`w-full relative overflow-hidden rounded-2xl p-[2px] ${!image || loading ? 'opacity-50 cursor-not-allowed' : ''}`}
whileHover={!loading && image ? { scale: 1.02 } : {}}
whileTap={!loading && image ? { scale: 0.98 } : {}}
>
<div className="absolute inset-0 bg-gradient-to-r from-purple-600 via-pink-600 to-cyan-600 animate-gradient" />
<div className="relative bg-dark-100 px-8 py-4 rounded-2xl flex items-center justify-center gap-3">
{loading ? (
<><Loader2 className="w-5 h-5 animate-spin" /><span className="font-semibold">Processing Magic...</span></>
) : (
<><Zap className="w-5 h-5" /><span className="font-semibold">Analyze Image</span></>
)}
</div>
</motion.button>
{error && (
<motion.div
initial={{ opacity: 0, y: -10 }} animate={{ opacity: 1, y: 0 }}
className="glass p-4 rounded-2xl border-red-500/50 bg-red-500/10"
>
<p className="text-sm text-red-400">{error}</p>
</motion.div>
)}
</>
)}
</motion.div>
{/* Right Panel - Results (non-committable modes or loading) */}
<motion.div
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
transition={{ delay: 0.2 }}
>
<ResultPanel
result={result}
loading={loading}
imagePreview={imagePreview}
onCopy={handleCopy}
onDownload={handleDownload}
/>
</motion.div>
</div>
</motion.div>
)}
</AnimatePresence>
</main>
{/* Footer */}
<footer className="mt-20 border-t border-white/10 glass">
<div className="max-w-7xl mx-auto px-6 py-8 text-center">
<div className="max-w-7xl mx-auto px-6 py-8 text-center space-y-2">
<p className="text-sm text-gray-400">
Powered by <span className="gradient-text font-semibold">DeepSeek-OCR</span>
Powered by <span className="gradient-text font-semibold">DeepSeek-OCR</span> &bull;
Built with <span className="text-pink-400"></span> using React + FastAPI
</p>
<p className="text-xs text-gray-500">
Thanks to <a href="https://github.com/p-xiexin" target="_blank" rel="noopener noreferrer" className="text-purple-400 hover:text-purple-300 transition-colors">@p-xiexin</a> for the clipboard paste idea!
</p>
</div>
</footer>
</div>
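
The per-mode accumulation that `handleSubmit` performs in the hunks above can be sketched as a pure function outside React. This is a minimal illustration, not code from this change: `recordModeResult` is a hypothetical helper, and the state shape simply bundles the `modeResults`, `editedResults`, and `activeResultMode` values that the component keeps in separate `useState` hooks.

```javascript
// Hypothetical helper (not part of this change): mirrors how handleSubmit
// accumulates results per mode, written as a pure function for clarity.
const COMMITTABLE_MODES = new Set(['plain_ocr', 'describe'])

function recordModeResult(state, mode, text) {
  // Non-committable modes leave the accumulated state untouched.
  if (!COMMITTABLE_MODES.has(mode)) return state
  const value = text || ''
  return {
    modeResults: { ...state.modeResults, [mode]: value },
    editedResults: { ...state.editedResults, [mode]: value },
    activeResultMode: mode,
  }
}

const empty = { modeResults: {}, editedResults: {}, activeResultMode: null }
const afterOcr = recordModeResult(empty, 'plain_ocr', 'page text')
const afterDescribe = recordModeResult(afterOcr, 'describe', 'a scanned page')
console.log(Object.keys(afterDescribe.modeResults)) // both modes are retained
```

As in the diff, re-running a mode replaces that mode's entry in `editedResults`, so any manual edits to that tab are overwritten by the fresh result.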


@@ -1,18 +1,54 @@
import { useCallback } from 'react'
import { useCallback, useEffect } from 'react'
import { motion } from 'framer-motion'
import { useDropzone } from 'react-dropzone'
import { Upload, Image as ImageIcon, X } from 'lucide-react'
import { Upload, Image as ImageIcon, X, FileText, Clipboard } from 'lucide-react'
export default function ImageUpload({ onImageSelect, preview }) {
export default function ImageUpload({ onImageSelect, preview, fileType = 'image' }) {
const onDrop = useCallback((acceptedFiles) => {
if (acceptedFiles?.[0]) {
onImageSelect(acceptedFiles[0])
}
}, [onImageSelect])
const isPDF = fileType === 'pdf'
// Handle clipboard paste
useEffect(() => {
// Only enable paste for images, not PDFs
if (isPDF) return
const handlePaste = async (e) => {
const items = e.clipboardData?.items
if (!items) return
for (let i = 0; i < items.length; i++) {
const item = items[i]
if (item.type.indexOf('image') !== -1) {
e.preventDefault()
const blob = item.getAsFile()
if (blob) {
// Create a File object with a proper name
const file = new File([blob], `pasted-image-${Date.now()}.png`, {
type: blob.type,
})
onImageSelect(file)
}
break
}
}
}
document.addEventListener('paste', handlePaste)
return () => document.removeEventListener('paste', handlePaste)
}, [onImageSelect, isPDF])
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop,
accept: {
accept: isPDF ? {
'application/pdf': ['.pdf']
} : {
'image/*': ['.png', '.jpg', '.jpeg', '.webp', '.gif', '.bmp']
},
multiple: false
@@ -21,8 +57,14 @@ export default function ImageUpload({ onImageSelect, preview }) {
return (
<div className="glass p-6 rounded-2xl space-y-4">
<div className="flex items-center justify-between">
<h3 className="font-semibold text-gray-200">Upload Image</h3>
<ImageIcon className="w-5 h-5 text-purple-400" />
<h3 className="font-semibold text-gray-200">
{isPDF ? 'Upload PDF' : 'Upload Image'}
</h3>
{isPDF ? (
<FileText className="w-5 h-5 text-purple-400" />
) : (
<ImageIcon className="w-5 h-5 text-purple-400" />
)}
</div>
{!preview ? (
@@ -59,11 +101,25 @@ export default function ImageUpload({ onImageSelect, preview }) {
<div>
<p className="text-lg font-medium text-gray-200">
{isDragActive ? 'Drop it like it\'s hot! 🔥' : 'Drag & drop your image'}
{isDragActive
? 'Drop it like it\'s hot! 🔥'
: isPDF
? 'Drag & drop your PDF'
: 'Drag & drop your image'
}
</p>
<p className="text-sm text-gray-400 mt-1">
or click to browse PNG, JPG, WEBP up to 10MB
{isPDF
? 'or click to browse • PDF files up to 100MB'
: 'or click to browse • PNG, JPG, WEBP up to 10MB'
}
</p>
{!isPDF && (
<p className="text-xs text-purple-400 mt-2 flex items-center justify-center gap-1.5">
<Clipboard className="w-3.5 h-3.5" />
<span>Press Ctrl+V to paste from clipboard</span>
</p>
)}
</div>
</div>
</motion.div>
@@ -73,11 +129,21 @@ export default function ImageUpload({ onImageSelect, preview }) {
animate={{ opacity: 1, scale: 1 }}
className="relative group rounded-2xl overflow-hidden"
>
<img
src={preview}
alt="Preview"
className="w-full rounded-2xl border border-white/10"
/>
{isPDF ? (
<div className="flex items-center justify-center p-12 bg-white/5 border border-white/10 rounded-2xl">
<div className="text-center">
<FileText className="w-16 h-16 mx-auto mb-3 text-purple-400" />
<p className="text-sm text-gray-300 font-medium">PDF Ready</p>
<p className="text-xs text-gray-500 mt-1">{preview?.name || 'Document loaded'}</p>
</div>
</div>
) : (
<img
src={preview}
alt="Preview"
className="w-full rounded-2xl border border-white/10"
/>
)}
<div className="absolute top-3 right-3 flex gap-2">
<motion.button
onClick={(e) => {
@@ -87,7 +153,7 @@ export default function ImageUpload({ onImageSelect, preview }) {
className="bg-red-500/90 backdrop-blur-sm px-3 py-2 rounded-full opacity-100 hover:bg-red-600 transition-colors flex items-center gap-2 shadow-lg"
whileHover={{ scale: 1.05 }}
whileTap={{ scale: 0.95 }}
title="Remove image"
title={isPDF ? "Remove PDF" : "Remove image"}
>
<X className="w-4 h-4" />
<span className="text-sm font-medium">Remove</span>
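
The selection step inside the paste handler added above (scan the clipboard items and take the first image) can be sketched separately from the DOM event. This is an illustrative sketch under stated assumptions, not the component's code: `pickPastedImage` is a hypothetical name, the items in the usage lines are plain-object stand-ins for `DataTransferItem`, and the match uses `startsWith('image/')`, which is slightly stricter than the `indexOf('image')` check in the diff.

```javascript
// Hypothetical helper (not part of this change): the selection logic of the
// paste handler, separated from the DOM event so it can be tested in isolation.
function pickPastedImage(items) {
  if (!items) return null
  for (let i = 0; i < items.length; i++) {
    const item = items[i]
    if (item.type.startsWith('image/')) {
      // getAsFile() may return null when the item is not backed by a file.
      return item.getAsFile()
    }
  }
  return null
}

// Stand-ins for DataTransferItem objects, for illustration only.
const fakeItems = [
  { type: 'text/plain', getAsFile: () => null },
  { type: 'image/png', getAsFile: () => ({ name: 'clip.png' }) },
]
console.log(pickPastedImage(fakeItems)) // the image entry's file stand-in
```

Keeping this logic free of `document` and `File` makes it straightforward to unit test; the component would still wrap the returned blob in a named `File` and call `onImageSelect`, as the diff does.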


@@ -0,0 +1,665 @@
import { useState, useEffect, useCallback } from 'react'
import { useSuggestions } from '../hooks/useSuggestions'
import { useModels } from '../hooks/useModels'
import { motion, AnimatePresence } from 'framer-motion'
import {
Search, ChevronLeft, ChevronRight, CheckCircle2, Clock,
FileText, Loader2, Save, RefreshCw, Trash2, Sparkles,
} from 'lucide-react'
import axios from 'axios'
const API_BASE = import.meta.env.VITE_API_URL || '/api'
const INPUT_CLASS =
'w-full bg-white/5 border border-white/10 rounded-lg px-3 py-2 text-sm text-gray-200 ' +
'placeholder-gray-600 focus:outline-none focus:border-purple-500/50 transition-colors'
const STATUS_COLORS = {
unreviewed: 'text-amber-400 bg-amber-400/10 border-amber-400/30',
reviewed: 'text-green-400 bg-green-400/10 border-green-400/30',
}
function StatusBadge({ status }) {
const Icon = status === 'reviewed' ? CheckCircle2 : Clock
return (
<span className={`inline-flex items-center gap-1 px-2 py-0.5 rounded-full text-xs border ${STATUS_COLORS[status] || 'text-gray-400'}`}>
<Icon className="w-3 h-3" />
{status}
</span>
)
}
// ─────────────────────────────────────────────────────────────
// Full-screen Job Detail
// ─────────────────────────────────────────────────────────────
function JobDetail({ jobId, onClose, onReviewed, onDeleted, suggestions = {} }) {
const { models } = useModels()
const [job, setJob] = useState(null)
const [loading, setLoading] = useState(true)
const [error, setError] = useState(null)
const [describeModel, setDescribeModel] = useState('')
const [generatingDescribe, setGeneratingDescribe] = useState(false)
const [editedText, setEditedText] = useState('')
const [editDescribeText, setEditDescribeText] = useState('')
const [editFreeformText, setEditFreeformText] = useState('')
const [activeTab, setActiveTab] = useState('ocr')
const [editAuthor, setEditAuthor] = useState('')
const [editBook, setEditBook] = useState('')
const [editChapter, setEditChapter] = useState('')
const [editPage, setEditPage] = useState('')
const [reviewerName, setReviewerName] = useState('')
const [submitting, setSubmitting] = useState(false)
const [saveResult, setSaveResult] = useState(null)
const [confirmDelete, setConfirmDelete] = useState(false)
const [deleting, setDeleting] = useState(false)
const [togglingStatus, setTogglingStatus] = useState(false)
useEffect(() => {
let cancelled = false
setLoading(true)
setError(null)
setSaveResult(null)
axios.get(`${API_BASE}/jobs/${jobId}`)
.then(res => {
if (!cancelled) {
const d = res.data
setJob(d)
setEditedText(d.reviewed_text ?? d.ocr_text ?? '')
setEditDescribeText(d.describe_text ?? '')
setEditFreeformText(d.freeform_text ?? '')
setEditAuthor(d.author || '')
setEditBook(d.book || '')
setEditChapter(d.chapter || '')
setEditPage(d.page || '')
setReviewerName(d.reviewer_name || '')
// Default to the OCR tab when there's OCR text, otherwise Description
if (d.reviewed_text || d.ocr_text) setActiveTab('ocr')
else setActiveTab('describe')
}
})
.catch(err => {
if (!cancelled) setError(err.response?.data?.detail || err.message)
})
.finally(() => { if (!cancelled) setLoading(false) })
return () => { cancelled = true }
}, [jobId])
// Default the Describe model to the job's original model (if available) or the registry default; wait for the job to load so its model is not pre-empted by the default
useEffect(() => {
if (!describeModel && job && models.length > 0) {
const def = models.find(m => m.default) || models[0]
const fromJob = job?.ocr_model && models.some(m => m.id === job.ocr_model) ? job.ocr_model : null
setDescribeModel(fromJob || def.id)
}
}, [models, job, describeModel])
const handleGenerateDescribe = async () => {
setGeneratingDescribe(true)
setSaveResult(null)
try {
const res = await axios.post(`${API_BASE}/jobs/${jobId}/describe`, {
model: describeModel || null,
})
setJob(res.data)
setEditDescribeText(res.data.describe_text || '')
onReviewed(res.data)
} catch (err) {
setSaveResult({ success: false, error: err.response?.data?.detail || err.message })
} finally {
setGeneratingDescribe(false)
}
}
const handleSave = async () => {
if (!reviewerName.trim()) {
setSaveResult({ success: false, error: 'Reviewer name is required.' })
return
}
setSubmitting(true)
setSaveResult(null)
try {
const res = await axios.put(`${API_BASE}/jobs/${jobId}/review`, {
reviewed_text: editedText,
reviewer_name: reviewerName.trim(),
author: editAuthor,
book: editBook,
chapter: editChapter,
page: editPage,
describe_text: editDescribeText || null,
freeform_text: editFreeformText || null,
})
setJob(res.data)
setSaveResult({ success: true })
onReviewed(res.data)
} catch (err) {
setSaveResult({ success: false, error: err.response?.data?.detail || err.message })
} finally {
setSubmitting(false)
}
}
const handleToggleStatus = async () => {
// Marking reviewed accepts BOTH the reviewed document text and the description,
// so it goes through the full review save (not a status-only flip).
if (!isReviewed) {
setTogglingStatus(true)
try {
await handleSave()
} finally {
setTogglingStatus(false)
}
return
}
// Reverting to unreviewed preserves the saved reviewed text and description.
setTogglingStatus(true)
setSaveResult(null)
try {
const res = await axios.put(`${API_BASE}/jobs/${jobId}/status`, {
status: 'unreviewed',
reviewer_name: reviewerName.trim() || null,
})
setJob(res.data)
setReviewerName(res.data.reviewer_name || '')
onReviewed(res.data)
} catch (err) {
setSaveResult({ success: false, error: err.response?.data?.detail || err.message })
} finally {
setTogglingStatus(false)
}
}
const handleDelete = async () => {
setDeleting(true)
try {
await axios.delete(`${API_BASE}/jobs/${jobId}`)
onDeleted(jobId)
} catch (err) {
setSaveResult({ success: false, error: err.response?.data?.detail || err.message })
setConfirmDelete(false)
} finally {
setDeleting(false)
}
}
const isReviewed = job?.status === 'reviewed'
return (
<motion.div
key={jobId}
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
className="flex flex-col gap-4"
>
{/* Top bar */}
<div className="flex items-center gap-4 flex-shrink-0">
<motion.button
onClick={onClose}
className="flex items-center gap-2 glass glass-hover px-4 py-2 rounded-xl text-sm text-gray-300"
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<ChevronLeft className="w-4 h-4" />
Back to results
</motion.button>
{job && (
<>
<StatusBadge status={job.status} />
<motion.button
onClick={handleToggleStatus}
disabled={togglingStatus}
title={isReviewed ? 'Revert to unreviewed' : 'Mark as reviewed'}
className={`flex items-center gap-1 px-3 py-1.5 rounded-lg text-xs font-medium transition-colors disabled:opacity-50 ${
isReviewed
? 'glass glass-hover text-amber-400 hover:bg-amber-500/10'
: 'glass glass-hover text-green-400 hover:bg-green-500/10'
}`}
whileHover={!togglingStatus ? { scale: 1.02 } : {}}
whileTap={!togglingStatus ? { scale: 0.98 } : {}}
>
{togglingStatus ? (
<Loader2 className="w-3.5 h-3.5 animate-spin" />
) : isReviewed ? (
<Clock className="w-3.5 h-3.5" />
) : (
<CheckCircle2 className="w-3.5 h-3.5" />
)}
{isReviewed ? 'Mark Unreviewed' : 'Mark Reviewed'}
</motion.button>
<span className="text-xs text-gray-500 font-mono hidden sm:block">{job.id}</span>
</>
)}
<div className="ml-auto flex items-center gap-2">
{confirmDelete ? (
<>
<span className="text-xs text-red-400">Delete this job permanently?</span>
<motion.button
onClick={handleDelete}
disabled={deleting}
className="flex items-center gap-1 px-3 py-2 rounded-xl text-sm font-medium bg-red-600 hover:bg-red-500 disabled:opacity-50"
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
{deleting ? <Loader2 className="w-4 h-4 animate-spin" /> : <Trash2 className="w-4 h-4" />}
Confirm
</motion.button>
<motion.button
onClick={() => setConfirmDelete(false)}
className="px-3 py-2 rounded-xl text-sm glass glass-hover text-gray-300"
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
Cancel
</motion.button>
</>
) : (
<motion.button
onClick={() => setConfirmDelete(true)}
className="flex items-center gap-2 px-3 py-2 rounded-xl text-sm glass glass-hover text-red-400 hover:bg-red-500/10"
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<Trash2 className="w-4 h-4" />
Delete
</motion.button>
)}
</div>
</div>
{loading && (
<div className="flex-1 flex items-center justify-center">
<Loader2 className="w-8 h-8 animate-spin text-purple-400" />
</div>
)}
{error && (
<div className="glass p-4 rounded-xl border-red-500/30 bg-red-500/10 flex-shrink-0">
<p className="text-sm text-red-400">{error}</p>
</div>
)}
{job && !loading && (
<>
{/* Image + Text */}
<div className="grid gap-6" style={{ gridTemplateColumns: '1fr 1fr', height: '130vh' }}>
<div className="glass rounded-2xl overflow-hidden flex items-center justify-center bg-black/20 h-full">
<img
src={`${API_BASE}/jobs/${job.id}/image`}
alt="Job source"
className="w-full h-full object-contain"
onError={e => { e.target.style.display = 'none' }}
/>
</div>
<div className="glass rounded-2xl p-4 flex flex-col h-full">
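{/* Tabs: OCR appears only when the job has OCR or reviewed text; Description is always available. The tab bar is hidden when only one tab remains. */}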
{(() => {
const tabs = [
job.ocr_text || job.reviewed_text ? { id: 'ocr', label: 'OCR Text' } : null,
{ id: 'describe', label: 'Description' },
].filter(Boolean)
return tabs.length > 1 ? (
<div className="flex gap-1 mb-3 flex-shrink-0">
{tabs.map(t => (
<button
key={t.id}
onClick={() => setActiveTab(t.id)}
className={`px-3 py-1 rounded-lg text-xs font-medium transition-colors ${
activeTab === t.id
? 'bg-purple-600 text-white'
: 'bg-white/5 text-gray-400 hover:bg-white/10'
}`}
>
{t.label}
</button>
))}
</div>
) : null
})()}
<p className="text-xs text-gray-400 mb-2 flex-shrink-0">
{{ ocr: isReviewed ? 'Reviewed Text' : 'OCR Text', describe: 'Description' }[activeTab]}
<span className="text-purple-400 ml-1">(editable)</span>
</p>
{activeTab === 'ocr' && (
<>
<textarea
value={editedText}
onChange={e => setEditedText(e.target.value)}
className="flex-1 w-full bg-transparent text-sm text-gray-200 font-mono resize-none focus:outline-none min-h-0"
placeholder="OCR text..."
/>
{isReviewed && job.ocr_text && (
<details className="flex-shrink-0 mt-2 border-t border-white/10 pt-2">
<summary className="cursor-pointer text-xs text-gray-500 hover:text-gray-400 transition-colors">
Original OCR Text
</summary>
<pre className="text-xs text-gray-600 whitespace-pre-wrap font-mono mt-1 max-h-28 overflow-y-auto">
{job.ocr_text}
</pre>
</details>
)}
</>
)}
{activeTab === 'describe' && (
<>
<div className="flex items-center gap-2 mb-2 flex-shrink-0">
<select
value={describeModel}
onChange={e => setDescribeModel(e.target.value)}
disabled={generatingDescribe || models.length === 0}
className="bg-white/5 border border-white/10 rounded-lg px-2 py-1.5 text-xs text-gray-200 focus:outline-none focus:border-purple-500/50"
>
{models.length === 0 && <option value="">No models</option>}
{models.map(m => (
<option key={m.id} value={m.id}>{m.label}{m.default ? ' (default)' : ''}</option>
))}
</select>
<motion.button
onClick={handleGenerateDescribe}
disabled={generatingDescribe || !describeModel}
className={`flex items-center gap-1.5 px-3 py-1.5 rounded-lg text-xs font-medium transition-all ${
generatingDescribe || !describeModel
? 'opacity-50 cursor-not-allowed bg-white/5'
: 'bg-gradient-to-r from-violet-600 to-purple-600 hover:from-violet-500 hover:to-purple-500'
}`}
whileHover={!generatingDescribe && describeModel ? { scale: 1.02 } : {}}
whileTap={!generatingDescribe && describeModel ? { scale: 0.98 } : {}}
title="Run Describe on this job's image and save it"
>
{generatingDescribe
? <><Loader2 className="w-3.5 h-3.5 animate-spin" /> Generating</>
: <><Sparkles className="w-3.5 h-3.5" /> Generate Description</>}
</motion.button>
</div>
<textarea
value={editDescribeText}
onChange={e => setEditDescribeText(e.target.value)}
className="flex-1 w-full bg-transparent text-sm text-gray-200 font-mono resize-none focus:outline-none min-h-0"
placeholder="No description yet — pick a model and click Generate Description, or type one here."
/>
</>
)}
</div>
</div>
{/* Metadata + reviewer row */}
<div className="glass p-4 rounded-2xl flex-shrink-0">
<datalist id="jd-authors">
{(suggestions.authors || []).map(a => <option key={a} value={a} />)}
</datalist>
<datalist id="jd-books">
{(suggestions.books || []).map(b => <option key={b} value={b} />)}
</datalist>
<datalist id="jd-chapters">
{(suggestions.chapters || []).map(c => <option key={c} value={c} />)}
</datalist>
<datalist id="jd-reviewers">
{(suggestions.reviewers || []).map(r => <option key={r} value={r} />)}
</datalist>
<div className="grid grid-cols-6 gap-4">
<div>
<label className="text-xs text-gray-400 mb-1 block">Author</label>
<input type="text" list="jd-authors" value={editAuthor} onChange={e => setEditAuthor(e.target.value)} placeholder="Author" className={INPUT_CLASS} />
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Book</label>
<input type="text" list="jd-books" value={editBook} onChange={e => setEditBook(e.target.value)} placeholder="Book title" className={INPUT_CLASS} />
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Chapter</label>
<input type="text" list="jd-chapters" value={editChapter} onChange={e => setEditChapter(e.target.value)} placeholder="Chapter" className={INPUT_CLASS} />
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Page</label>
<input type="text" value={editPage} onChange={e => setEditPage(e.target.value)} placeholder="Page" className={INPUT_CLASS} />
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Reviewer</label>
<input type="text" list="jd-reviewers" value={reviewerName} onChange={e => setReviewerName(e.target.value)} placeholder="Your name" className={INPUT_CLASS} />
</div>
<div className="flex flex-col justify-end">
<motion.button
onClick={handleSave}
disabled={submitting || !reviewerName.trim()}
className={`w-full flex items-center justify-center gap-2 px-4 py-2 rounded-lg font-medium text-sm transition-all ${
submitting || !reviewerName.trim()
? 'opacity-50 cursor-not-allowed bg-white/5'
: isReviewed
? 'bg-gradient-to-r from-blue-600 to-indigo-600 hover:from-blue-500 hover:to-indigo-500'
: 'bg-gradient-to-r from-green-600 to-emerald-600 hover:from-green-500 hover:to-emerald-500'
}`}
whileHover={!submitting && reviewerName.trim() ? { scale: 1.02 } : {}}
whileTap={!submitting && reviewerName.trim() ? { scale: 0.98 } : {}}
>
{submitting ? (
<><Loader2 className="w-4 h-4 animate-spin" /> Saving...</>
) : isReviewed ? (
<><Save className="w-4 h-4" /> Save Changes</>
) : (
<><CheckCircle2 className="w-4 h-4" /> Mark Reviewed</>
)}
</motion.button>
</div>
</div>
{!isReviewed && (
<p className="text-xs text-gray-500 mt-2">
Marking reviewed accepts both the reviewed document text and the description.
</p>
)}
{saveResult && (
<motion.div
initial={{ opacity: 0, y: -4 }} animate={{ opacity: 1, y: 0 }}
className={`mt-3 p-2 rounded-lg text-xs ${saveResult.success ? 'bg-green-500/10 text-green-400' : 'bg-red-500/10 text-red-400'}`}
>
{saveResult.success
? (isReviewed ? 'Changes saved!' : 'Job marked as reviewed!')
: saveResult.error}
</motion.div>
)}
{/* Read-only info row */}
<div className="flex gap-6 mt-3 pt-3 border-t border-white/10">
{job.submitted_at && (
<span className="text-xs text-gray-500">Submitted: {new Date(job.submitted_at).toLocaleString()}</span>
)}
{isReviewed && job.reviewed_at && (
<span className="text-xs text-gray-500">Last reviewed: {new Date(job.reviewed_at).toLocaleString()}</span>
)}
{job.mode && <span className="text-xs text-gray-500">Mode: {job.mode}</span>}
{job.ocr_model && <span className="text-xs text-gray-500">Model: {job.ocr_model}</span>}
</div>
</div>
</>
)}
</motion.div>
)
}
// ─────────────────────────────────────────────────────────────
// Search / List view
// ─────────────────────────────────────────────────────────────
export default function JobsPanel() {
const suggestions = useSuggestions()
const [search, setSearch] = useState('')
const [filterStatus, setFilterStatus] = useState('')
const [filterAuthor, setFilterAuthor] = useState('')
const [filterBook, setFilterBook] = useState('')
const [jobs, setJobs] = useState([])
const [total, setTotal] = useState(0)
const [page, setPage] = useState(0)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [selectedJobId, setSelectedJobId] = useState(null)
const LIMIT = 20
const fetchJobs = useCallback(async (pageNum = 0) => {
setLoading(true)
setError(null)
try {
const params = new URLSearchParams()
if (search.trim()) params.set('search', search.trim())
if (filterStatus) params.set('status', filterStatus)
if (filterAuthor.trim()) params.set('author', filterAuthor.trim())
if (filterBook.trim()) params.set('book', filterBook.trim())
params.set('limit', LIMIT)
params.set('offset', pageNum * LIMIT)
const res = await axios.get(`${API_BASE}/jobs?${params}`)
setJobs(res.data.jobs)
setTotal(res.data.total)
setPage(pageNum)
} catch (err) {
setError(err.response?.data?.detail || err.message)
} finally {
setLoading(false)
}
}, [search, filterStatus, filterAuthor, filterBook])
useEffect(() => { fetchJobs(0) }, []) // eslint-disable-line react-hooks/exhaustive-deps
const handleReviewed = (updatedJob) => {
setJobs(prev => prev.map(j => j.id === updatedJob.id ? { ...j, ...updatedJob } : j))
}
const totalPages = Math.ceil(total / LIMIT)
// When a job is selected show full-screen detail
if (selectedJobId) {
return (
<AnimatePresence mode="wait">
<JobDetail
key={selectedJobId}
jobId={selectedJobId}
onClose={() => setSelectedJobId(null)}
onReviewed={handleReviewed}
onDeleted={(id) => {
setJobs(prev => prev.filter(j => j.id !== id))
setTotal(prev => Math.max(0, prev - 1))
setSelectedJobId(null)
}}
suggestions={suggestions}
/>
</AnimatePresence>
)
}
return (
<motion.div
key="job_list"
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
className="space-y-4"
>
{/* Search form */}
<div className="glass p-4 rounded-2xl space-y-3">
<form onSubmit={e => { e.preventDefault(); fetchJobs(0) }} className="flex gap-2">
<input
type="text"
value={search}
onChange={e => setSearch(e.target.value)}
placeholder="Search all fields..."
className={`${INPUT_CLASS} flex-1`}
/>
<motion.button
type="submit"
className="flex items-center gap-2 px-4 py-2 rounded-lg bg-gradient-to-r from-purple-600 to-cyan-600 text-sm font-medium"
whileHover={{ scale: 1.02 }} whileTap={{ scale: 0.98 }}
>
<Search className="w-4 h-4" /> Search
</motion.button>
</form>
<datalist id="jp-authors">
{(suggestions.authors || []).map(a => <option key={a} value={a} />)}
</datalist>
<datalist id="jp-books">
{(suggestions.books || []).map(b => <option key={b} value={b} />)}
</datalist>
<div className="grid grid-cols-3 gap-2">
<select value={filterStatus} onChange={e => setFilterStatus(e.target.value)} className={INPUT_CLASS}>
<option value="">All statuses</option>
<option value="unreviewed">Unreviewed</option>
<option value="reviewed">Reviewed</option>
</select>
<input type="text" list="jp-authors" value={filterAuthor} onChange={e => setFilterAuthor(e.target.value)} placeholder="Author..." className={INPUT_CLASS} />
<input type="text" list="jp-books" value={filterBook} onChange={e => setFilterBook(e.target.value)} placeholder="Book..." className={INPUT_CLASS} />
</div>
<div className="flex items-center justify-between">
<span className="text-xs text-gray-500">{total} job{total !== 1 ? 's' : ''} found</span>
<button onClick={() => fetchJobs(page)} className="flex items-center gap-1 text-xs text-gray-400 hover:text-gray-200 transition-colors">
<RefreshCw className="w-3 h-3" /> Refresh
</button>
</div>
</div>
{loading && <div className="flex justify-center py-8"><Loader2 className="w-6 h-6 animate-spin text-purple-400" /></div>}
{error && (
<div className="glass p-4 rounded-xl border-red-500/30 bg-red-500/10">
<p className="text-sm text-red-400">{error}</p>
</div>
)}
{!loading && !error && jobs.length === 0 && (
<div className="glass p-8 rounded-2xl text-center">
<FileText className="w-10 h-10 mx-auto mb-3 text-gray-600" />
<p className="text-gray-400">No jobs found</p>
<p className="text-xs text-gray-500 mt-1">Commit your first OCR job from the New Job tab</p>
</div>
)}
{/* Results grid */}
<div className="grid grid-cols-1 sm:grid-cols-2 lg:grid-cols-3 xl:grid-cols-4 gap-3">
<AnimatePresence>
{jobs.map(job => (
<motion.button
key={job.id}
onClick={() => setSelectedJobId(job.id)}
className="text-left glass p-4 rounded-xl border border-white/5 hover:border-white/20 hover:bg-white/5 transition-all"
initial={{ opacity: 0, y: 10 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0 }}
whileHover={{ scale: 1.02 }}
whileTap={{ scale: 0.98 }}
layout
>
<div className="flex items-start justify-between gap-2 mb-2">
<StatusBadge status={job.status} />
</div>
{job.book && <p className="text-sm font-medium text-gray-200 truncate">{job.book}</p>}
<div className="flex items-center gap-2 mt-0.5">
{job.chapter && <span className="text-xs text-gray-500">Ch. {job.chapter}</span>}
{job.page && <span className="text-xs text-gray-500">p. {job.page}</span>}
</div>
{job.author && <p className="text-xs text-gray-400 mt-1">{job.author}</p>}
<div className="flex items-center justify-between mt-2">
<p className="text-xs text-gray-600 font-mono">{new Date(job.submitted_at).toLocaleDateString()}</p>
{job.ocr_model && <span className="text-[10px] text-gray-500 truncate ml-2">{job.ocr_model}</span>}
</div>
</motion.button>
))}
</AnimatePresence>
</div>
{totalPages > 1 && (
<div className="flex items-center justify-center gap-3">
<button onClick={() => fetchJobs(page - 1)} disabled={page === 0} className="glass glass-hover p-2 rounded-lg disabled:opacity-30">
<ChevronLeft className="w-4 h-4" />
</button>
<span className="text-sm text-gray-400">Page {page + 1} of {totalPages}</span>
<button onClick={() => fetchJobs(page + 1)} disabled={page >= totalPages - 1} className="glass glass-hover p-2 rounded-lg disabled:opacity-30">
<ChevronRight className="w-4 h-4" />
</button>
</div>
)}
</motion.div>
)
}

@@ -0,0 +1,77 @@
import { BookOpen } from 'lucide-react'
export default function MetadataForm({ metadata, onChange, suggestions = {} }) {
const { author = '', book = '', chapter = '', page = '' } = metadata // defaults keep the inputs controlled when a field is missing
const { authors = [], books = [], chapters = [] } = suggestions
const field = (key) => (e) => onChange({ ...metadata, [key]: e.target.value })
const inputClass =
'w-full bg-white/5 border border-white/10 rounded-lg px-3 py-2 text-sm text-gray-200 ' +
'placeholder-gray-600 focus:outline-none focus:border-purple-500/50 transition-colors'
return (
<div className="glass p-4 rounded-2xl space-y-3">
<div className="flex items-center gap-2">
<BookOpen className="w-4 h-4 text-purple-400" />
<h3 className="text-sm font-medium text-gray-300">Job Metadata</h3>
</div>
<datalist id="mf-authors">
{authors.map(a => <option key={a} value={a} />)}
</datalist>
<datalist id="mf-books">
{books.map(b => <option key={b} value={b} />)}
</datalist>
<datalist id="mf-chapters">
{chapters.map(c => <option key={c} value={c} />)}
</datalist>
<div className="grid grid-cols-2 gap-3">
<div>
<label className="text-xs text-gray-400 mb-1 block">Author</label>
<input
type="text"
list="mf-authors"
value={author}
onChange={field('author')}
placeholder="Author name"
className={inputClass}
/>
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Book</label>
<input
type="text"
list="mf-books"
value={book}
onChange={field('book')}
placeholder="Book title"
className={inputClass}
/>
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Chapter</label>
<input
type="text"
list="mf-chapters"
value={chapter}
onChange={field('chapter')}
placeholder="Chapter"
className={inputClass}
/>
</div>
<div>
<label className="text-xs text-gray-400 mb-1 block">Page</label>
<input
type="text"
value={page}
onChange={field('page')}
placeholder="Page number"
className={inputClass}
/>
</div>
</div>
</div>
)
}

@@ -1,41 +1,30 @@
 import { motion } from 'framer-motion'
-import { FileText, Eye, Search, Wand2 } from 'lucide-react'
+import { FileText, Eye } from 'lucide-react'
 const modes = [
-{ id: 'plain_ocr', name: 'Plain OCR', icon: FileText, color: 'from-blue-500 to-cyan-500', desc: 'Extract raw text', needsInput: false },
-{ id: 'describe', name: 'Describe', icon: Eye, color: 'from-violet-500 to-purple-500', desc: 'Image description', needsInput: false },
-{ id: 'find_ref', name: 'Find', icon: Search, color: 'from-yellow-500 to-orange-500', desc: 'Locate specific terms', needsInput: 'findTerm' },
-{ id: 'freeform', name: 'Freeform', icon: Wand2, color: 'from-fuchsia-500 to-pink-500', desc: 'Custom prompt', needsInput: 'prompt' },
+{ id: 'plain_ocr', name: 'Plain OCR', icon: FileText, color: 'from-blue-500 to-cyan-500', desc: 'Extract raw text' },
+{ id: 'describe', name: 'Describe', icon: Eye, color: 'from-violet-500 to-purple-500', desc: 'Image description' },
 ]
-export default function ModeSelector({
-mode,
-onModeChange,
-prompt,
-onPromptChange,
-findTerm,
-onFindTermChange
-}) {
-const selectedMode = modes.find(m => m.id === mode)
-const needsInput = selectedMode?.needsInput
+export default function ModeSelector({ mode, onModeChange }) {
 return (
 <div className="glass p-4 rounded-2xl space-y-3">
 <h3 className="text-sm font-semibold text-gray-200">Mode</h3>
-<div className="grid grid-cols-4 gap-2">
+<div className="grid grid-cols-2 gap-2">
 {modes.map((m) => {
 const Icon = m.icon
 const isSelected = mode === m.id
 return (
 <motion.button
 key={m.id}
 onClick={() => onModeChange(m.id)}
 title={m.desc}
 className={`
 relative p-2 rounded-xl text-center transition-all
-${isSelected
-? 'glass border-white/20 shadow-lg'
+${isSelected
+? 'glass border-white/20 shadow-lg'
 : 'bg-white/5 border border-white/10 hover:border-white/20'
 }
 `}
@@ -49,12 +38,12 @@ export default function ModeSelector({
 transition={{ type: "spring", bounce: 0.2, duration: 0.6 }}
 />
 )}
 <div className="relative space-y-1">
 <div className={`
 w-8 h-8 mx-auto rounded-lg flex items-center justify-center
-${isSelected
-? `bg-gradient-to-br ${m.color}`
+${isSelected
+? `bg-gradient-to-br ${m.color}`
 : 'bg-white/10'
 }
 `}>
@@ -68,38 +57,6 @@ export default function ModeSelector({
 )
 })}
 </div>
-{needsInput === 'findTerm' && (
-<motion.div
-initial={{ opacity: 0, height: 0 }}
-animate={{ opacity: 1, height: 'auto' }}
-exit={{ opacity: 0, height: 0 }}
->
-<input
-type="text"
-value={findTerm}
-onChange={(e) => onFindTermChange(e.target.value)}
-placeholder="Enter term to find (e.g., Total, Invoice #)"
-className="w-full bg-white/5 border border-white/10 rounded-xl px-3 py-2 text-sm focus:outline-none focus:border-purple-500 transition-colors"
-/>
-</motion.div>
-)}
-{needsInput === 'prompt' && (
-<motion.div
-initial={{ opacity: 0, height: 0 }}
-animate={{ opacity: 1, height: 'auto' }}
-exit={{ opacity: 0, height: 0 }}
->
-<textarea
-value={prompt}
-onChange={(e) => onPromptChange(e.target.value)}
-placeholder="Enter your custom prompt..."
-className="w-full bg-white/5 border border-white/10 rounded-xl px-3 py-2 text-sm focus:outline-none focus:border-purple-500 transition-colors resize-none"
-rows={2}
-/>
-</motion.div>
-)}
 </div>
 )
 }

@@ -0,0 +1,33 @@
import { Cpu } from 'lucide-react'
const SELECT_CLASS =
'w-full bg-white/5 border border-white/10 rounded-lg px-3 py-2 text-sm text-gray-200 ' +
'focus:outline-none focus:border-purple-500/50 transition-colors'
// Dropdown to pick which OCR model runs the analysis.
// `models` comes from the useModels() hook; `value` is the selected model id.
export default function ModelSelector({ models, value, onChange, loading }) {
return (
<div className="glass p-4 rounded-2xl space-y-3">
<div className="flex items-center gap-2">
<Cpu className="w-4 h-4 text-purple-400" />
<h3 className="text-sm font-semibold text-gray-200">Model</h3>
</div>
<select
value={value || ''}
onChange={e => onChange(e.target.value)}
disabled={loading || models.length === 0}
className={SELECT_CLASS}
>
{loading && <option value="">Loading models</option>}
{!loading && models.length === 0 && <option value="">No models available</option>}
{models.map(m => (
<option key={m.id} value={m.id}>
{m.label}{m.default ? ' (default)' : ''}
</option>
))}
</select>
</div>
)
}

@@ -0,0 +1,234 @@
import { useState, useCallback } from 'react'
import { motion, AnimatePresence } from 'framer-motion'
import { FileText, Download, Loader2, CheckCircle2, AlertCircle } from 'lucide-react'
import axios from 'axios'
const API_BASE = import.meta.env.VITE_API_URL || '/api'
function PDFProcessor({ pdfFile, mode, prompt, model, advancedSettings, includeCaption }) {
const [processing, setProcessing] = useState(false)
const [progress, setProgress] = useState(0)
const [result, setResult] = useState(null)
const [error, setError] = useState(null)
const [outputFormat, setOutputFormat] = useState('markdown')
const formats = [
{ value: 'markdown', label: 'Markdown', ext: 'md', icon: '📝' },
{ value: 'html', label: 'HTML', ext: 'html', icon: '🌐' },
{ value: 'docx', label: 'Word', ext: 'docx', icon: '📄' },
{ value: 'json', label: 'JSON', ext: 'json', icon: '📊' }
]
const handleProcess = useCallback(async () => {
if (!pdfFile) return
setProcessing(true)
setError(null)
setProgress(0)
try {
const formData = new FormData()
formData.append('pdf_file', pdfFile)
if (model) formData.append('model', model)
formData.append('mode', mode)
formData.append('prompt', prompt ?? '') // FormData would otherwise serialize a missing prompt as the string "undefined"
formData.append('output_format', outputFormat)
formData.append('grounding', mode === 'find_ref')
formData.append('include_caption', includeCaption)
formData.append('extract_images', true)
formData.append('dpi', 144)
formData.append('base_size', advancedSettings.base_size)
formData.append('image_size', advancedSettings.image_size)
formData.append('crop_mode', advancedSettings.crop_mode)
const response = await axios.post(`${API_BASE}/process-pdf`, formData, {
headers: {
'Content-Type': 'multipart/form-data',
},
responseType: outputFormat === 'json' ? 'json' : 'blob',
onUploadProgress: (progressEvent) => {
const percentCompleted = progressEvent.total ? Math.round((progressEvent.loaded * 100) / progressEvent.total) : 0 // total is undefined when the upload length is not computable
setProgress(percentCompleted)
}
})
if (outputFormat === 'json') {
setResult(response.data)
} else {
// For file downloads (markdown, html, docx)
const format = formats.find(f => f.value === outputFormat)
const blob = new Blob([response.data], {
type: response.headers['content-type']
})
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
a.download = `ocr_result.${format.ext}`
a.click()
URL.revokeObjectURL(url)
setResult({
success: true,
message: `Document downloaded as ${format.label}`,
format: outputFormat
})
}
setProgress(100)
} catch (err) {
console.error('PDF processing error:', err)
setError(err.response?.data?.detail || err.message || 'Failed to process PDF')
} finally {
setProcessing(false)
}
}, [pdfFile, mode, prompt, model, outputFormat, includeCaption, advancedSettings])
const handleDownloadJSON = useCallback(() => {
if (!result || outputFormat !== 'json') return
const blob = new Blob([JSON.stringify(result, null, 2)], { type: 'application/json' })
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
a.download = 'ocr_result.json'
a.click()
URL.revokeObjectURL(url)
}, [result, outputFormat])
return (
<div className="space-y-4">
{/* Format Selector */}
<div className="glass p-6 rounded-2xl space-y-3">
<label className="block text-sm font-medium text-gray-300 mb-3">
Output Format
</label>
<div className="grid grid-cols-2 gap-2">
{formats.map((format) => (
<motion.button
key={format.value}
onClick={() => setOutputFormat(format.value)}
className={`p-3 rounded-xl text-sm font-medium transition-all ${
outputFormat === format.value
? 'bg-gradient-to-r from-purple-600 to-cyan-600 text-white'
: 'glass text-gray-400 hover:bg-white/5'
}`}
whileHover={{ scale: 1.02 }}
whileTap={{ scale: 0.98 }}
>
<span className="mr-2">{format.icon}</span>
{format.label}
</motion.button>
))}
</div>
</div>
{/* Process Button */}
<motion.button
onClick={handleProcess}
disabled={!pdfFile || processing}
className={`w-full relative overflow-hidden rounded-2xl p-[2px] ${
!pdfFile || processing ? 'opacity-50 cursor-not-allowed' : ''
}`}
whileHover={!processing && pdfFile ? { scale: 1.02 } : {}}
whileTap={!processing && pdfFile ? { scale: 0.98 } : {}}
>
<div className="absolute inset-0 bg-gradient-to-r from-purple-600 via-pink-600 to-cyan-600 animate-gradient" />
<div className="relative bg-dark-100 px-8 py-4 rounded-2xl flex items-center justify-center gap-3">
{processing ? (
<>
<Loader2 className="w-5 h-5 animate-spin" />
<span className="font-semibold">Processing PDF...</span>
</>
) : (
<>
<FileText className="w-5 h-5" />
<span className="font-semibold">Process PDF</span>
</>
)}
</div>
</motion.button>
{/* Progress Bar */}
<AnimatePresence>
{processing && progress > 0 && (
<motion.div
initial={{ opacity: 0, height: 0 }}
animate={{ opacity: 1, height: 'auto' }}
exit={{ opacity: 0, height: 0 }}
className="glass p-4 rounded-2xl"
>
<div className="flex items-center justify-between mb-2">
<span className="text-sm text-gray-400">Processing...</span>
<span className="text-sm font-medium text-purple-400">{progress}%</span>
</div>
<div className="h-2 bg-dark-200 rounded-full overflow-hidden">
<motion.div
className="h-full bg-gradient-to-r from-purple-600 to-cyan-600"
initial={{ width: 0 }}
animate={{ width: `${progress}%` }}
transition={{ duration: 0.3 }}
/>
</div>
</motion.div>
)}
</AnimatePresence>
{/* Error Display */}
<AnimatePresence>
{error && (
<motion.div
initial={{ opacity: 0, y: -10 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -10 }}
className="glass p-4 rounded-2xl border-red-500/50 bg-red-500/10 flex items-start gap-3"
>
<AlertCircle className="w-5 h-5 text-red-400 flex-shrink-0 mt-0.5" />
<div>
<p className="text-sm font-medium text-red-400">Processing Failed</p>
<p className="text-xs text-red-300 mt-1">{error}</p>
</div>
</motion.div>
)}
</AnimatePresence>
{/* Success Display */}
<AnimatePresence>
{result && !error && (
<motion.div
initial={{ opacity: 0, y: -10 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -10 }}
className="glass p-6 rounded-2xl border-green-500/50 bg-green-500/10"
>
<div className="flex items-start gap-3">
<CheckCircle2 className="w-5 h-5 text-green-400 flex-shrink-0 mt-0.5" />
<div className="flex-1">
<p className="text-sm font-medium text-green-400">
{result.message || 'PDF processed successfully!'}
</p>
{outputFormat === 'json' && result.pages && (
<div className="mt-3 space-y-2">
<p className="text-xs text-gray-400">
Processed {result.total_pages} page{result.total_pages !== 1 ? 's' : ''}
</p>
<motion.button
onClick={handleDownloadJSON}
className="glass px-4 py-2 rounded-xl text-sm font-medium hover:bg-white/5 transition-colors flex items-center gap-2"
whileHover={{ scale: 1.02 }}
whileTap={{ scale: 0.98 }}
>
<Download className="w-4 h-4" />
Download JSON
</motion.button>
</div>
)}
</div>
</div>
</motion.div>
)}
</AnimatePresence>
</div>
)
}
export default PDFProcessor

@@ -2,6 +2,7 @@ import { useEffect, useRef, useState, useCallback } from 'react'
import { motion, AnimatePresence } from 'framer-motion'
import { Copy, Download, Sparkles, Loader2, CheckCircle2, ChevronDown } from 'lucide-react'
import ReactMarkdown from 'react-markdown'
+ import DOMPurify from 'dompurify'
export default function ResultPanel({ result, loading, imagePreview, onCopy, onDownload }) {
const canvasRef = useRef(null)
@@ -204,20 +205,20 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
exit={{ opacity: 0, y: -20 }}
className="space-y-4"
>
- {/* Preview with boxes */}
+ {/* Preview with boxes (grounding modes) */}
{imagePreview && result.boxes && result.boxes.length > 0 && (
<div className="relative rounded-xl overflow-hidden border border-white/10 bg-black">
- <img
+ <img
ref={imgRef}
- src={imagePreview}
- alt="Result"
- className="w-full block"
+ src={imagePreview}
+ alt="Result"
+ className="w-full block"
onLoad={() => {
console.log('🖼️ Image loaded, triggering draw')
setImageLoaded(true)
}}
/>
- <canvas
+ <canvas
ref={canvasRef}
className="absolute top-0 left-0 w-full h-full pointer-events-none"
style={{ display: 'block' }}
@@ -225,15 +226,13 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
</div>
)}
- {/* Text result */}
+ {/* Rendered text result */}
<div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-96 overflow-y-auto">
{isHTML ? (
- <div
+ <div
className="prose prose-invert prose-sm max-w-none"
- dangerouslySetInnerHTML={{ __html: result.text }}
- style={{
- color: '#e5e7eb',
- }}
+ dangerouslySetInnerHTML={{ __html: DOMPurify.sanitize(result.text) }}
+ style={{ color: '#e5e7eb' }}
/>
) : isMarkdown ? (
<div className="prose prose-invert prose-sm max-w-none">

@@ -0,0 +1,24 @@
import { useState, useEffect } from 'react'
const API_BASE = import.meta.env.VITE_API_URL || '/api'
// Fetches the OCR models available for selection. Returns { models, loading }.
// Each model: { id, label, capabilities: { grounding, advanced_settings }, default }
export function useModels() {
const [models, setModels] = useState([])
const [loading, setLoading] = useState(true)
useEffect(() => {
let cancelled = false
fetch(`${API_BASE}/models`)
.then(r => (r.ok ? r.json() : null))
.then(data => {
if (!cancelled && data?.models) setModels(data.models)
})
.catch(() => {})
.finally(() => { if (!cancelled) setLoading(false) })
return () => { cancelled = true }
}, [])
return { models, loading }
}
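
A consumer of `useModels` will usually need an initial selection once the list arrives. The following is a minimal sketch, assuming the model shape documented in the hook's comment (`{ id, label, capabilities, default }`). `pickDefaultModel` is a hypothetical helper for illustration and is not part of this diff.

```javascript
// Hypothetical helper (not part of this diff): choose the initially selected model.
// Assumes each entry looks like { id, label, capabilities, default }.
function pickDefaultModel(models) {
  // Nothing to select while the list is still loading or the fetch failed.
  if (!Array.isArray(models) || models.length === 0) return null
  // Prefer the entry flagged as default; otherwise fall back to the first one.
  return models.find(m => m.default) ?? models[0]
}
```

Because the hook leaves `models` as an empty array on any fetch failure, returning `null` for the empty case lets the caller handle "loading" and "failed" in the same way.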

@@ -0,0 +1,16 @@
import { useState, useEffect } from 'react'
const API_BASE = import.meta.env.VITE_API_URL || '/api'
export function useSuggestions() {
const [suggestions, setSuggestions] = useState({ authors: [], books: [], chapters: [], reviewers: [] })
useEffect(() => {
fetch(`${API_BASE}/jobs/suggestions`)
.then(r => r.ok ? r.json() : null)
.then(data => { if (data) setSuggestions(data) })
.catch(() => {})
}, [])
return suggestions
}
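
`useSuggestions` stores the response as-is, so a payload that omits one of the four lists would leave that key `undefined` for consumers. Below is a minimal sketch of a defensive merge, assuming the endpoint returns a plain object whose fields may be missing. `normalizeSuggestions` and `EMPTY_SUGGESTIONS` are hypothetical names and are not part of this diff.

```javascript
// Hypothetical helper (not part of this diff): guarantee all four lists exist.
const EMPTY_SUGGESTIONS = { authors: [], books: [], chapters: [], reviewers: [] }

function normalizeSuggestions(data) {
  const out = {}
  for (const key of Object.keys(EMPTY_SUGGESTIONS)) {
    // Keep the server's array when present; otherwise use an empty list.
    out[key] = Array.isArray(data?.[key]) ? data[key] : []
  }
  return out
}
```

Under that assumption, the hook could call `setSuggestions(normalizeSuggestions(data))` instead of `setSuggestions(data)`.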