Note

: This was a quickly vibe-coded project to test out DeepSeek-OCR! It basically works quite nice on an RTX 5090. The "Find" mode grounding boxes aren't quite working yet - probably my fault in not interpreting the dimensions correctly, but the core OCR functionality is pretty nice so far.

Quick Start

docker compose up --build

Then open:

Frontend: http://localhost:3000
Backend API: http://localhost:8000
API Docs: http://localhost:8000/docs

Features

4 OCR Modes

Plain OCR - Raw text extraction
Describe - Generate image descriptions
Find - Locate specific terms (grounding boxes WIP)
Freeform - Custom prompts for anything

UI Features

🎨 Glass morphism design with animated gradients
🎯 Drag & drop file upload
📦 Grounding box visualization (WIP - dimensions need fixing)
✨ Smooth animations (Framer Motion)
📋 Copy/Download results
🎛️ Advanced settings dropdown
📝 Markdown rendering for formatted output

Tech Stack

Frontend: React 18 + Vite 5 + TailwindCSS 3 + Framer Motion 11
Backend: FastAPI + PyTorch + Transformers 4.46 + DeepSeek-OCR
Server: Nginx (reverse proxy)
Container: Docker + Docker Compose with multi-stage builds
GPU: NVIDIA CUDA support (tested on RTX 3090)

Project Structure

deepseek-ocr/
├── backend/           # FastAPI backend
│   ├── main.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/          # React frontend
│   ├── src/
│   │   ├── components/
│   │   ├── App.jsx
│   │   └── main.jsx
│   ├── package.json
│   ├── nginx.conf
│   └── Dockerfile
├── models/            # Model cache
└── docker-compose.yml

Development

Backend

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Frontend

cd frontend
npm install
npm run dev

Requirements

Docker & Docker Compose
NVIDIA GPU with CUDA support (tested on RTX 3090)
nvidia-docker runtime
~8-12GB VRAM for model

Known Issues

📦 Find mode grounding boxes: Not rendering correctly - likely dimension scaling issue in the canvas overlay logic. Boxes are detected and returned by the backend, but the frontend visualization needs work.

API Usage

POST /api/ocr

Parameters:

image (file, required)
mode (string): plain_ocr | describe | find_ref | freeform
prompt (string): Custom prompt for freeform mode
grounding (bool): Enable bounding boxes (auto-enabled for find_ref)
find_term (string): Term to locate in find_ref mode
base_size (int): Base processing size (default: 1024)
image_size (int): Image size (default: 640)
crop_mode (bool): Enable crop mode (default: true)

Response:

{
  "success": true,
  "text": "Extracted text...",
  "boxes": [{"label": "field", "box": [x1, y1, x2, y2]}],
  "image_dims": {"w": 1920, "h": 1080},
  "metadata": {...}
}

Troubleshooting

GPU not detected

nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Port conflicts

sudo lsof -i :3000
sudo lsof -i :8000

Frontend build issues

cd frontend
rm -rf node_modules package-lock.json
docker-compose build frontend

License

This project uses the DeepSeek-OCR model. Refer to the model's license terms.

Languages

JavaScript 55.1%

Python 42.3%

CSS 0.9%

TypeScript 0.7%

Dockerfile 0.6%

Other 0.4%