Add .env.example for setting ports, fix upload limit, fix bounding boxes, allow dismissing the previous image, expect HTML output instead of Markdown. Update README with NVIDIA driver/container instructions
README.md
@@ -2,43 +2,103 @@
|
|||||||
|
|
||||||
Modern OCR web application powered by DeepSeek-OCR with a stunning React frontend and FastAPI backend.
|
Modern OCR web application powered by DeepSeek-OCR with a stunning React frontend and FastAPI backend.
|
||||||
|
|
||||||
> **Note**: This was a quickly vibe-coded project to test out DeepSeek-OCR! It basically works quite nice on an RTX 5090. The "Find" mode grounding boxes aren't quite working yet - probably my fault in not interpreting the dimensions correctly, but the core OCR functionality is pretty nice so far.
|
> **Recent Updates (v2.1.1)**
|
||||||
|
> - ✅ Fixed image removal button - now properly clears and allows re-upload
|
||||||
|
> - ✅ Fixed multiple bounding boxes parsing - handles `[[x1,y1,x2,y2], [x1,y1,x2,y2]]` format
|
||||||
|
> - ✅ Simplified to 4 core working modes for better stability
|
||||||
|
> - ✅ Fixed bounding box coordinate scaling (normalized 0-999 → actual pixels)
|
||||||
|
> - ✅ Fixed HTML rendering (model outputs HTML, not Markdown)
|
||||||
|
> - ✅ Increased file upload limit to 100MB (configurable)
|
||||||
|
> - ✅ Added .env configuration support
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
```bash
|
1. **Clone and configure:**
|
||||||
docker compose up --build
|
```bash
|
||||||
```
|
git clone <repository-url>
|
||||||
|
cd deepseek_ocr_app
|
||||||
|
|
||||||
|
# Copy and customize environment variables
|
||||||
|
cp .env.example .env
|
||||||
|
# Edit .env to configure ports, upload limits, etc.
|
||||||
|
```
|
||||||
|
|
||||||
Then open:
|
2. **Start the application:**
|
||||||
- **Frontend**: http://localhost:3000
|
```bash
|
||||||
- **Backend API**: http://localhost:8000
|
docker compose up --build
|
||||||
- **API Docs**: http://localhost:8000/docs
|
```
|
||||||
|
|
||||||
|
The first run will download the model (~5-10GB), which may take some time.
|
||||||
|
|
||||||
|
3. **Access the application:**
|
||||||
|
- **Frontend**: http://localhost:3000 (or your configured FRONTEND_PORT)
|
||||||
|
- **Backend API**: http://localhost:8000 (or your configured API_PORT)
|
||||||
|
- **API Docs**: http://localhost:8000/docs
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
|
|
||||||
### 4 OCR Modes
|
### 4 Core OCR Modes
|
||||||
- **Plain OCR** - Raw text extraction
|
- **Plain OCR** - Raw text extraction from any image
|
||||||
- **Describe** - Generate image descriptions
|
- **Describe** - Generate intelligent image descriptions
|
||||||
- **Find** - Locate specific terms (grounding boxes WIP)
|
- **Find** - Locate specific terms with visual bounding boxes
|
||||||
- **Freeform** - Custom prompts for anything
|
- **Freeform** - Custom prompts for specialized tasks
|
||||||
|
|
||||||
### UI Features
|
### UI Features
|
||||||
- 🎨 Glass morphism design with animated gradients
|
- 🎨 Glass morphism design with animated gradients
|
||||||
- 🎯 Drag & drop file upload
|
- 🎯 Drag & drop file upload (up to 100MB by default)
|
||||||
- 📦 Grounding box visualization (WIP - dimensions need fixing)
|
- 🗑️ Easy image removal and re-upload
|
||||||
|
- 📦 Grounding box visualization with proper coordinate scaling
|
||||||
- ✨ Smooth animations (Framer Motion)
|
- ✨ Smooth animations (Framer Motion)
|
||||||
- 📋 Copy/Download results
|
- 📋 Copy/Download results
|
||||||
- 🎛️ Advanced settings dropdown
|
- 🎛️ Advanced settings dropdown
|
||||||
- 📝 Markdown rendering for formatted output
|
- 📝 HTML and Markdown rendering for formatted output
|
||||||
|
- 🔍 Multiple bounding box support (handles multiple instances of found terms)
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
The application can be configured via the `.env` file:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# API Configuration
|
||||||
|
API_HOST=0.0.0.0
|
||||||
|
API_PORT=8000
|
||||||
|
|
||||||
|
# Frontend Configuration
|
||||||
|
FRONTEND_PORT=3000
|
||||||
|
|
||||||
|
# Model Configuration
|
||||||
|
MODEL_NAME=deepseek-ai/DeepSeek-OCR
|
||||||
|
HF_HOME=/models
|
||||||
|
|
||||||
|
# Upload Configuration
|
||||||
|
MAX_UPLOAD_SIZE_MB=100 # Maximum file upload size
|
||||||
|
|
||||||
|
# Processing Configuration
|
||||||
|
BASE_SIZE=1024 # Base processing resolution
|
||||||
|
IMAGE_SIZE=640 # Tile processing resolution
|
||||||
|
CROP_MODE=true # Enable dynamic cropping for large images
|
||||||
|
```
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
- `API_HOST`: Backend API host (default: 0.0.0.0)
|
||||||
|
- `API_PORT`: Backend API port (default: 8000)
|
||||||
|
- `FRONTEND_PORT`: Frontend port (default: 3000)
|
||||||
|
- `MODEL_NAME`: HuggingFace model identifier
|
||||||
|
- `HF_HOME`: Model cache directory
|
||||||
|
- `MAX_UPLOAD_SIZE_MB`: Maximum file upload size in megabytes
|
||||||
|
- `BASE_SIZE`: Base image processing size (affects memory usage)
|
||||||
|
- `IMAGE_SIZE`: Tile size for dynamic cropping
|
||||||
|
- `CROP_MODE`: Enable/disable dynamic image cropping
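As a rough illustration of how these variables resolve to typed values, the sketch below reads them with defaults and casts using only the standard library. The backend itself reads them through python-decouple; the `Settings` dataclass and `load_settings` helper are hypothetical names used only for this example, and the defaults are the ones listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_host: str
    api_port: int
    frontend_port: int
    model_name: str
    hf_home: str
    max_upload_size_mb: int
    base_size: int
    image_size: int
    crop_mode: bool


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; everything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        frontend_port=int(env.get("FRONTEND_PORT", "3000")),
        model_name=env.get("MODEL_NAME", "deepseek-ai/DeepSeek-OCR"),
        hf_home=env.get("HF_HOME", "/models"),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "100")),
        base_size=int(env.get("BASE_SIZE", "1024")),
        image_size=int(env.get("IMAGE_SIZE", "640")),
        crop_mode=_as_bool(env.get("CROP_MODE", "true")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to try overrides, for example `load_settings({"API_PORT": "9000"})`.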
|
||||||
|
|
||||||
## Tech Stack
|
## Tech Stack
|
||||||
|
|
||||||
- **Frontend**: React 18 + Vite 5 + TailwindCSS 3 + Framer Motion 11
|
- **Frontend**: React 18 + Vite 5 + TailwindCSS 3 + Framer Motion 11
|
||||||
- **Backend**: FastAPI + PyTorch + Transformers 4.46 + DeepSeek-OCR
|
- **Backend**: FastAPI + PyTorch + Transformers 4.46 + DeepSeek-OCR
|
||||||
|
- **Configuration**: python-decouple for environment management
|
||||||
- **Server**: Nginx (reverse proxy)
|
- **Server**: Nginx (reverse proxy)
|
||||||
- **Container**: Docker + Docker Compose with multi-stage builds
|
- **Container**: Docker + Docker Compose with multi-stage builds
|
||||||
- **GPU**: NVIDIA CUDA support (tested on RTX 5090)
|
- **GPU**: NVIDIA CUDA support (tested on RTX 3090, RTX 5090)
|
||||||
|
|
||||||
## Project Structure
|
## Project Structure
|
||||||
|
|
||||||
@@ -62,56 +122,170 @@ deepseek-ocr/
|
|||||||
|
|
||||||
## Development
|
## Development
|
||||||
|
|
||||||
### Backend
|
Docker compose cycle to test:
|
||||||
```bash
|
```bash
|
||||||
cd backend
|
docker compose down
|
||||||
pip install -r requirements.txt
|
docker compose up --build
|
||||||
uvicorn main:app --reload --host 0.0.0.0 --port 8000
|
|
||||||
```
|
|
||||||
|
|
||||||
### Frontend
|
|
||||||
```bash
|
|
||||||
cd frontend
|
|
||||||
npm install
|
|
||||||
npm run dev
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Requirements
|
## Requirements
|
||||||
|
|
||||||
- Docker & Docker Compose
|
### Hardware
|
||||||
- NVIDIA GPU with CUDA support (tested on RTX 5090)
|
- NVIDIA GPU with CUDA support
|
||||||
- nvidia-docker runtime
|
- Recommended: RTX 3090, RTX 4090, RTX 5090, or newer
|
||||||
- ~8-12GB VRAM for model
|
- Minimum: 8-12GB VRAM for the model
|
||||||
|
- More VRAM allows for larger batch sizes and higher resolution images
|
||||||
|
|
||||||
## Known Issues
|
### Software
|
||||||
|
- **Docker & Docker Compose** (latest version recommended)
|
||||||
|
|
||||||
- 📦 **Find mode grounding boxes**: Not rendering correctly - likely dimension scaling issue in the canvas overlay logic. Boxes are detected and returned by the backend, but the frontend visualization needs work.
|
- **NVIDIA Driver** - Installing NVIDIA Drivers on Ubuntu (Blackwell/RTX 5090)
|
||||||
|
|
||||||
|
**Note**: Getting NVIDIA drivers working on Blackwell GPUs can be a pain! Here's what worked:
|
||||||
|
|
||||||
|
The key requirements for RTX 5090 on Ubuntu 24.04:
|
||||||
|
- Use the open-source driver (nvidia-driver-570-open or newer, like nvidia-driver-580-open)
|
||||||
|
- Upgrade to kernel 6.11+ (6.14+ recommended for best stability)
|
||||||
|
- Enable Resize Bar in BIOS/UEFI (critical!)
|
||||||
|
|
||||||
|
**Step-by-Step Instructions:**
|
||||||
|
|
||||||
|
1. Install NVIDIA Open Driver (580 or newer)
|
||||||
|
```bash
|
||||||
|
sudo add-apt-repository ppa:graphics-drivers/ppa
|
||||||
|
sudo apt update
|
||||||
|
sudo apt remove --purge 'nvidia*'  # Quote the glob so the shell does not expand it
|
||||||
|
sudo nvidia-installer --uninstall # If you have it
|
||||||
|
sudo apt autoremove
|
||||||
|
sudo apt install nvidia-driver-580-open
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Upgrade Linux Kernel to 6.11+ (for Ubuntu 24.04 LTS)
|
||||||
|
```bash
|
||||||
|
sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
|
||||||
|
sudo update-initramfs -u
|
||||||
|
sudo apt autoremove
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Reboot
|
||||||
|
```bash
|
||||||
|
sudo reboot
|
||||||
|
```
|
||||||
|
|
||||||
|
4. Enable Resize Bar in UEFI/BIOS
|
||||||
|
- Restart and enter UEFI (usually F2, Del, or F12 during boot)
|
||||||
|
- Find and enable "Resize Bar" or "Smart Access Memory"
|
||||||
|
- This will also enable "Above 4G Decoding" and disable "CSM" (Compatibility Support Module)—that's expected!
|
||||||
|
- Save and exit
|
||||||
|
|
||||||
|
5. Verify Installation
|
||||||
|
```bash
|
||||||
|
nvidia-smi
|
||||||
|
```
|
||||||
|
You should see your RTX 5090 listed!
|
||||||
|
|
||||||
|
💡 **Why open drivers?** NVIDIA supports Blackwell GPUs only through its open kernel modules, so the `-open` driver packages are the ones to use. Without Resize Bar enabled, you'll get a black screen even with correct drivers!
|
||||||
|
|
||||||
|
Credit: Solution adapted from [this Reddit thread](https://www.reddit.com/r/linux_gaming/comments/1i3h4gn/blackwell_on_linux/).
|
||||||
|
|
||||||
|
- **NVIDIA Container Toolkit** (required for GPU access in Docker)
|
||||||
|
- Installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
|
||||||
|
|
||||||
|
### System Requirements
|
||||||
|
- ~20GB free disk space (for model weights and Docker images)
|
||||||
|
- 16GB+ system RAM recommended
|
||||||
|
- Fast internet connection for initial model download (~5-10GB)
|
||||||
|
|
||||||
|
## Known Issues & Fixes
|
||||||
|
|
||||||
|
### ✅ FIXED: Image removal and re-upload (v2.1.1)
|
||||||
|
- **Issue**: Couldn't remove uploaded image and upload a new one
|
||||||
|
- **Fix**: Added prominent "Remove" button that clears image state and allows fresh upload
|
||||||
|
|
||||||
|
### ✅ FIXED: Multiple bounding boxes (v2.1.1)
|
||||||
|
- **Issue**: Only a single bounding box worked; multiple boxes like `[[x1,y1,x2,y2], [x1,y1,x2,y2]]` failed
|
||||||
|
- **Fix**: Updated the parser to handle both a single box and a list of boxes using `ast.literal_eval`
|
||||||
|
|
||||||
|
### ✅ FIXED: Grounding box coordinate scaling (v2.1)
|
||||||
|
- **Issue**: Bounding boxes weren't displaying correctly
|
||||||
|
- **Cause**: Model outputs coordinates normalized to 0-999, not actual pixel dimensions
|
||||||
|
- **Fix**: Backend now properly scales coordinates using the formula: `actual_coord = (normalized_coord / 999) * image_dimension`
|
||||||
|
|
||||||
|
### ✅ FIXED: HTML vs Markdown rendering (v2.1)
|
||||||
|
- **Issue**: Output was being rendered as Markdown when model outputs HTML
|
||||||
|
- **Cause**: Model is trained to output HTML (especially for tables)
|
||||||
|
- **Fix**: Frontend now detects and renders HTML properly using `dangerouslySetInnerHTML`
|
||||||
|
|
||||||
|
### ✅ FIXED: Limited upload size (v2.1)
|
||||||
|
- **Issue**: Large images couldn't be uploaded
|
||||||
|
- **Fix**: Increased nginx `client_max_body_size` to 100MB (configurable via .env)
|
||||||
|
|
||||||
|
### ⚠️ Simplified Mode Selection (v2.1.1)
|
||||||
|
- **Change**: Reduced from 12 modes to 4 core working modes
|
||||||
|
- **Reason**: Advanced modes (tables, layout, PII, multilingual) need additional testing
|
||||||
|
- **Working modes**: Plain OCR, Describe, Find, Freeform
|
||||||
|
- **Future**: Additional modes will be re-enabled after thorough testing
|
||||||
|
|
||||||
|
## How the Model Works
|
||||||
|
|
||||||
|
### Coordinate System
|
||||||
|
The DeepSeek-OCR model uses a normalized coordinate system (0-999) for bounding boxes:
|
||||||
|
- All coordinates are output in range [0, 999]
|
||||||
|
- Backend scales: `pixel_coord = (model_coord / 999) * actual_dimension`
|
||||||
|
- This ensures consistency across different image sizes
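The scaling formula above can be sketched as a small helper. This is a minimal illustration, not the backend's exact code: `scale_box` and its `grid_max` parameter are hypothetical names, and truncating with `int()` mirrors the rounding used in the backend's parser.

```python
def scale_box(box, image_width, image_height, grid_max=999):
    """Scale [x1, y1, x2, y2] from the 0-999 grid to pixel coordinates."""
    x1, y1, x2, y2 = (float(v) for v in box)
    return [
        int(x1 / grid_max * image_width),   # x values scale with the width
        int(y1 / grid_max * image_height),  # y values scale with the height
        int(x2 / grid_max * image_width),
        int(y2 / grid_max * image_height),
    ]


# A box covering the whole normalized grid maps to the full image.
full_frame = scale_box([0, 0, 999, 999], 1920, 1080)  # -> [0, 0, 1920, 1080]
```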
|
||||||
|
|
||||||
|
### Dynamic Cropping
|
||||||
|
For large images, the model uses dynamic cropping:
|
||||||
|
- Images ≤640x640: Direct processing
|
||||||
|
- Larger images: Split into tiles based on aspect ratio
|
||||||
|
- Global view (BASE_SIZE) + Local views (IMAGE_SIZE tiles)
|
||||||
|
- See `process/image_process.py` for implementation details
|
||||||
|
|
||||||
|
### Output Format
|
||||||
|
- Plain text modes: Return raw text
|
||||||
|
- Table modes: Return HTML tables or CSV
|
||||||
|
- JSON modes: Return structured JSON
|
||||||
|
- Grounding modes: Return text with `<|ref|>label<|/ref|><|det|>[[coords]]<|/det|>` tags
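As an illustration of the grounding tag format, the sketch below pulls labels and boxes out of a raw model response. It is a simplified stand-in for the backend parser: `DET_RE` and `extract_boxes` are hypothetical names, and the returned coordinates are still on the model's 0-999 grid (no pixel scaling is applied here).

```python
import ast
import re

# Lazy capture: stop at the first "]" that is directly followed by <|/det|>,
# so several detection blocks in one response are matched one at a time.
DET_RE = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>\s*(?P<coords>\[.*?\])\s*<\|/det\|>",
    re.DOTALL,
)


def extract_boxes(text):
    """Return (label, [x1, y1, x2, y2]) pairs in model (0-999) coordinates."""
    results = []
    for match in DET_RE.finditer(text or ""):
        label = match.group("label").strip()
        try:
            parsed = ast.literal_eval(match.group("coords"))
        except (ValueError, SyntaxError):
            continue  # skip malformed coordinate lists
        # Accept a flat [x1, y1, x2, y2] as well as a list of such boxes.
        if (
            isinstance(parsed, list)
            and len(parsed) == 4
            and all(isinstance(n, (int, float)) for n in parsed)
        ):
            parsed = [parsed]
        for box in parsed:
            if isinstance(box, (list, tuple)) and len(box) == 4:
                results.append((label, list(box)))
    return results
```

Using `ast.literal_eval` rather than `eval` keeps the parsing limited to Python literals, which matters because the input is untrusted model output.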
|
||||||
|
|
||||||
## API Usage
|
## API Usage
|
||||||
|
|
||||||
### POST /api/ocr
|
### POST /api/ocr
|
||||||
|
|
||||||
**Parameters:**
|
**Parameters:**
|
||||||
- `image` (file, required)
|
- `image` (file, required) - Image file to process (up to 100MB)
|
||||||
- `mode` (string): plain_ocr | describe | find_ref | freeform
|
- `mode` (string) - OCR mode: `plain_ocr` | `describe` | `find_ref` | `freeform`
|
||||||
- `prompt` (string): Custom prompt for freeform mode
|
- `prompt` (string) - Custom prompt for freeform mode
|
||||||
- `grounding` (bool): Enable bounding boxes (auto-enabled for find_ref)
|
- `grounding` (bool) - Enable bounding boxes (auto-enabled for find_ref)
|
||||||
- `find_term` (string): Term to locate in find_ref mode
|
- `find_term` (string) - Term to locate in find_ref mode (supports multiple matches)
|
||||||
- `base_size` (int): Base processing size (default: 1024)
|
- `base_size` (int) - Base processing size (default: 1024)
|
||||||
- `image_size` (int): Image size (default: 640)
|
- `image_size` (int) - Tile size for cropping (default: 640)
|
||||||
- `crop_mode` (bool): Enable crop mode (default: true)
|
- `crop_mode` (bool) - Enable dynamic cropping (default: true)
|
||||||
|
- `include_caption` (bool) - Add image description (default: false)
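A client could assemble the non-file form fields for this endpoint roughly as follows. This is a sketch under stated assumptions: `build_ocr_form` and `VALID_MODES` are hypothetical names, and sending booleans as lowercase strings is assumed from how the frontend's `FormData` serializes them.

```python
VALID_MODES = {"plain_ocr", "describe", "find_ref", "freeform"}


def build_ocr_form(mode, prompt="", find_term="", base_size=1024,
                   image_size=640, crop_mode=True, include_caption=False):
    """Build the non-file form fields for POST /api/ocr."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    def flag(value):
        # Booleans are sent as lowercase strings (assumption, see lead-in).
        return "true" if value else "false"

    return {
        "mode": mode,
        "prompt": prompt,
        "find_term": find_term,
        # Grounding is only switched on for find mode, as in the frontend.
        "grounding": flag(mode == "find_ref"),
        "base_size": str(base_size),
        "image_size": str(image_size),
        "crop_mode": flag(crop_mode),
        "include_caption": flag(include_caption),
    }
```

With the third-party `requests` package, these fields could then be sent as `requests.post("http://localhost:8000/api/ocr", data=fields, files={"image": open(path, "rb")})`, where the URL assumes the default `API_PORT`.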
|
||||||
|
|
||||||
**Response:**
|
**Response:**
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"success": true,
|
"success": true,
|
||||||
"text": "Extracted text...",
|
"text": "Extracted text or HTML output...",
|
||||||
"boxes": [{"label": "field", "box": [x1, y1, x2, y2]}],
|
"boxes": [{"label": "field", "box": [x1, y1, x2, y2]}],
|
||||||
"image_dims": {"w": 1920, "h": 1080},
|
"image_dims": {"w": 1920, "h": 1080},
|
||||||
"metadata": {...}
|
"metadata": {
|
||||||
|
"mode": "find_ref",
|
||||||
|
"grounding": true,
|
||||||
|
"base_size": 1024,
|
||||||
|
"image_size": 640,
|
||||||
|
"crop_mode": true
|
||||||
|
}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Note on Bounding Boxes:**
|
||||||
|
- The model outputs coordinates normalized to 0-999
|
||||||
|
- The backend automatically scales them to actual image dimensions
|
||||||
|
- Coordinates are in [x1, y1, x2, y2] format (top-left, bottom-right)
|
||||||
|
- **Supports multiple boxes**: When finding multiple instances, format is `[[x1,y1,x2,y2], [x1,y1,x2,y2], ...]`
|
||||||
|
- Frontend automatically displays all boxes overlaid on the image with unique colors
|
||||||
|
|
||||||
## Troubleshooting
|
## Troubleshooting
|
||||||
|
|
||||||
### GPU not detected
|
### GPU not detected
|
||||||
|
|||||||
@@ -12,6 +12,7 @@ import torch
|
|||||||
from transformers import AutoModel, AutoTokenizer
|
from transformers import AutoModel, AutoTokenizer
|
||||||
from PIL import Image
|
from PIL import Image
|
||||||
import uvicorn
|
import uvicorn
|
||||||
|
from decouple import config as env_config
|
||||||
|
|
||||||
# -----------------------------
|
# -----------------------------
|
||||||
# Lifespan context for model loading
|
# Lifespan context for model loading
|
||||||
@@ -26,8 +27,8 @@ async def lifespan(app: FastAPI):
|
|||||||
|
|
||||||
# Environment setup
|
# Environment setup
|
||||||
os.environ.pop("TRANSFORMERS_CACHE", None)
|
os.environ.pop("TRANSFORMERS_CACHE", None)
|
||||||
MODEL_NAME = os.environ.get("MODEL_NAME", "deepseek-ai/DeepSeek-OCR")
|
MODEL_NAME = env_config("MODEL_NAME", default="deepseek-ai/DeepSeek-OCR")
|
||||||
HF_HOME = os.environ.get("HF_HOME", "/models")
|
HF_HOME = env_config("HF_HOME", default="/models")
|
||||||
os.makedirs(HF_HOME, exist_ok=True)
|
os.makedirs(HF_HOME, exist_ok=True)
|
||||||
|
|
||||||
# Load model
|
# Load model
|
||||||
@@ -138,7 +139,7 @@ def build_prompt(
|
|||||||
elif mode == "multilingual":
|
elif mode == "multilingual":
|
||||||
instruction = "Free OCR. Detect the language automatically and output in the same script."
|
instruction = "Free OCR. Detect the language automatically and output in the same script."
|
||||||
elif mode == "describe":
|
elif mode == "describe":
|
||||||
instruction = "Describe this image concisely in 2-3 sentences. Focus on visible key elements."
|
instruction = "Describe this image. Focus on visible key elements."
|
||||||
elif mode == "freeform":
|
elif mode == "freeform":
|
||||||
instruction = user_prompt.strip() if user_prompt else "OCR this image."
|
instruction = user_prompt.strip() if user_prompt else "OCR this image."
|
||||||
else:
|
else:
|
||||||
@@ -153,36 +154,82 @@ def build_prompt(
|
|||||||
# -----------------------------
|
# -----------------------------
|
||||||
# Grounding parser
|
# Grounding parser
|
||||||
# -----------------------------
|
# -----------------------------
|
||||||
|
# Match a full detection block and capture the coordinates as the entire list expression
|
||||||
|
# Examples of captured coords (including outer brackets):
|
||||||
|
# - [[312, 339, 480, 681]]
|
||||||
|
# - [[504, 700, 625, 910], [771, 570, 996, 996]]
|
||||||
|
# - [[110, 310, 255, 800], [312, 343, 479, 680], ...]
|
||||||
|
# A lazy bracket capture stops at the first ']' directly followed by <|/det|>, so each detection block is matched separately
|
||||||
DET_BLOCK = re.compile(
|
DET_BLOCK = re.compile(
|
||||||
r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>\s*\[\s*\[\s*(?P<coords>[^\]]+?)\s*\]\s*\]\s*<\|/det\|>",
|
r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>\s*(?P<coords>\[.*?\])\s*<\|/det\|>",
|
||||||
re.DOTALL,
|
re.DOTALL,
|
||||||
)
|
)
|
||||||
|
|
||||||
def clean_grounding_text(text: str) -> str:
|
def clean_grounding_text(text: str) -> str:
|
||||||
"""Remove grounding tags from text for display, keeping labels"""
|
"""Remove grounding tags from text for display, keeping labels"""
|
||||||
# Replace <|ref|>label<|/ref|><|det|>[[...]]<|/det|> with just "label"
|
# Replace <|ref|>label<|/ref|><|det|>[...any nested lists...]<|/det|> with just the label
|
||||||
cleaned = re.sub(
|
cleaned = re.sub(
|
||||||
r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>\s*\[\s*\[[^\]]+\]\s*\]\s*<\|/det\|>",
|
r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>\s*\[.*?\]\s*<\|/det\|>",
|
||||||
r"\1",
|
r"\1",
|
||||||
text,
|
text,
|
||||||
flags=re.DOTALL
|
flags=re.DOTALL,
|
||||||
)
|
)
|
||||||
# Also remove any standalone grounding tags
|
# Also remove any standalone grounding tags
|
||||||
cleaned = re.sub(r"<\|grounding\|>", "", cleaned)
|
cleaned = re.sub(r"<\|grounding\|>", "", cleaned)
|
||||||
return cleaned.strip()
|
return cleaned.strip()
|
||||||
|
|
||||||
def parse_detections(text: str) -> List[Dict[str, Any]]:
|
def parse_detections(text: str, image_width: int, image_height: int) -> List[Dict[str, Any]]:
|
||||||
"""Parse grounding boxes from text"""
|
"""Parse grounding boxes from text and scale from 0-999 normalized coords to actual image dimensions
|
||||||
|
|
||||||
|
Handles both single and multiple bounding boxes:
|
||||||
|
- Single: <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>
|
||||||
|
- Multiple: <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2], [x1,y1,x2,y2], ...]<|/det|>
|
||||||
|
"""
|
||||||
boxes: List[Dict[str, Any]] = []
|
boxes: List[Dict[str, Any]] = []
|
||||||
for m in DET_BLOCK.finditer(text or ""):
|
for m in DET_BLOCK.finditer(text or ""):
|
||||||
label = m.group("label").strip()
|
label = m.group("label").strip()
|
||||||
coords = [c.strip() for c in m.group("coords").split(",")]
|
coords_str = m.group("coords").strip()
|
||||||
|
|
||||||
|
print(f"🔍 DEBUG: Found detection for '{label}'")
|
||||||
|
print(f"📦 Raw coords string (with brackets): {coords_str}")
|
||||||
|
|
||||||
try:
|
try:
|
||||||
nums = list(map(float, coords[:4]))
|
import ast
|
||||||
except Exception:
|
|
||||||
|
# Parse the full bracket expression directly (handles single and multiple)
|
||||||
|
parsed = ast.literal_eval(coords_str)
|
||||||
|
|
||||||
|
# Normalize to a list of lists
|
||||||
|
if (
|
||||||
|
isinstance(parsed, list)
|
||||||
|
and len(parsed) == 4
|
||||||
|
and all(isinstance(n, (int, float)) for n in parsed)
|
||||||
|
):
|
||||||
|
# Single box provided as [x1,y1,x2,y2]
|
||||||
|
box_coords = [parsed]
|
||||||
|
print("📦 Single box (flat list) detected")
|
||||||
|
elif isinstance(parsed, list):
|
||||||
|
box_coords = parsed
|
||||||
|
print(f"📦 Boxes detected: {len(box_coords)}")
|
||||||
|
else:
|
||||||
|
raise ValueError("Unsupported coords structure")
|
||||||
|
|
||||||
|
# Process each box
|
||||||
|
for idx, box in enumerate(box_coords):
|
||||||
|
if isinstance(box, (list, tuple)) and len(box) >= 4:
|
||||||
|
x1 = int(float(box[0]) / 999 * image_width)
|
||||||
|
y1 = int(float(box[1]) / 999 * image_height)
|
||||||
|
x2 = int(float(box[2]) / 999 * image_width)
|
||||||
|
y2 = int(float(box[3]) / 999 * image_height)
|
||||||
|
print(f" Box {idx+1}: {box} → [{x1}, {y1}, {x2}, {y2}]")
|
||||||
|
boxes.append({"label": label, "box": [x1, y1, x2, y2]})
|
||||||
|
else:
|
||||||
|
print(f" ⚠️ Skipping invalid box: {box}")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"❌ Parsing failed: {e}")
|
||||||
continue
|
continue
|
||||||
if len(nums) == 4:
|
|
||||||
boxes.append({"label": label, "box": nums})
|
print(f"🎯 Total boxes parsed: {len(boxes)}")
|
||||||
return boxes
|
return boxes
|
||||||
|
|
||||||
# -----------------------------
|
# -----------------------------
|
||||||
@@ -289,8 +336,8 @@ async def ocr_inference(
|
|||||||
if not text:
|
if not text:
|
||||||
text = "No text returned by model."
|
text = "No text returned by model."
|
||||||
|
|
||||||
# Parse grounding boxes
|
# Parse grounding boxes with proper coordinate scaling
|
||||||
boxes = parse_detections(text) if ("<|det|>" in text or "<|ref|>" in text) else []
|
boxes = parse_detections(text, orig_w or 1, orig_h or 1) if ("<|det|>" in text or "<|ref|>" in text) else []
|
||||||
|
|
||||||
# Clean grounding tags from display text, but keep the labels
|
# Clean grounding tags from display text, but keep the labels
|
||||||
display_text = clean_grounding_text(text) if ("<|ref|>" in text or "<|grounding|>" in text) else text
|
display_text = clean_grounding_text(text) if ("<|ref|>" in text or "<|grounding|>" in text) else text
|
||||||
@@ -302,6 +349,7 @@ async def ocr_inference(
|
|||||||
return JSONResponse({
|
return JSONResponse({
|
||||||
"success": True,
|
"success": True,
|
||||||
"text": display_text,
|
"text": display_text,
|
||||||
|
"raw_text": text, # Include raw model output for debugging
|
||||||
"boxes": boxes,
|
"boxes": boxes,
|
||||||
"image_dims": {"w": orig_w, "h": orig_h},
|
"image_dims": {"w": orig_w, "h": orig_h},
|
||||||
"metadata": {
|
"metadata": {
|
||||||
@@ -326,4 +374,6 @@ async def ocr_inference(
|
|||||||
shutil.rmtree(out_dir, ignore_errors=True)
|
shutil.rmtree(out_dir, ignore_errors=True)
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
uvicorn.run(app, host="0.0.0.0", port=8000)
|
host = env_config("API_HOST", default="0.0.0.0")
|
||||||
|
port = env_config("API_PORT", default=8000, cast=int)
|
||||||
|
uvicorn.run(app, host=host, port=port)
|
||||||
|
|||||||
@@ -10,3 +10,4 @@ easydict
|
|||||||
pillow
|
pillow
|
||||||
safetensors
|
safetensors
|
||||||
torch
|
torch
|
||||||
|
python-decouple>=3.8
|
||||||
|
|||||||
@@ -2,9 +2,14 @@ services:
|
|||||||
backend:
|
backend:
|
||||||
build: ./backend
|
build: ./backend
|
||||||
container_name: deepseek-ocr-backend
|
container_name: deepseek-ocr-backend
|
||||||
|
env_file:
|
||||||
|
- .env
|
||||||
environment:
|
environment:
|
||||||
MODEL_NAME: deepseek-ai/DeepSeek-OCR
|
MODEL_NAME: ${MODEL_NAME:-deepseek-ai/DeepSeek-OCR}
|
||||||
HF_HOME: /models
|
HF_HOME: ${HF_HOME:-/models}
|
||||||
|
API_HOST: ${API_HOST:-0.0.0.0}
|
||||||
|
API_PORT: ${API_PORT:-8000}
|
||||||
|
MAX_UPLOAD_SIZE_MB: ${MAX_UPLOAD_SIZE_MB:-100}
|
||||||
volumes:
|
volumes:
|
||||||
- ./models:/models
|
- ./models:/models
|
||||||
deploy:
|
deploy:
|
||||||
@@ -16,7 +21,7 @@ services:
|
|||||||
capabilities: [gpu]
|
capabilities: [gpu]
|
||||||
shm_size: "4g"
|
shm_size: "4g"
|
||||||
ports:
|
ports:
|
||||||
- "8000:8000"
|
- "${API_PORT:-8000}:${API_PORT:-8000}"
|
||||||
networks:
|
networks:
|
||||||
- ocr-network
|
- ocr-network
|
||||||
|
|
||||||
@@ -24,7 +29,7 @@ services:
|
|||||||
build: ./frontend
|
build: ./frontend
|
||||||
container_name: deepseek-ocr-frontend
|
container_name: deepseek-ocr-frontend
|
||||||
ports:
|
ports:
|
||||||
- "3000:80"
|
- "${FRONTEND_PORT:-3000}:80"
|
||||||
depends_on:
|
depends_on:
|
||||||
- backend
|
- backend
|
||||||
networks:
|
networks:
|
||||||
|
|||||||
@@ -4,6 +4,9 @@ server {
|
|||||||
root /usr/share/nginx/html;
|
root /usr/share/nginx/html;
|
||||||
index index.html;
|
index index.html;
|
||||||
|
|
||||||
|
# Allow larger file uploads (100MB)
|
||||||
|
client_max_body_size 100M;
|
||||||
|
|
||||||
# Gzip compression
|
# Gzip compression
|
||||||
gzip on;
|
gzip on;
|
||||||
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
|
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
|
||||||
|
|||||||
@@ -27,11 +27,22 @@ function App() {
|
|||||||
})
|
})
|
||||||
|
|
||||||
const handleImageSelect = useCallback((file) => {
|
const handleImageSelect = useCallback((file) => {
|
||||||
setImage(file)
|
if (file === null) {
|
||||||
setImagePreview(URL.createObjectURL(file))
|
// Clear everything when removing image
|
||||||
setError(null)
|
setImage(null)
|
||||||
setResult(null)
|
if (imagePreview) {
|
||||||
}, [])
|
URL.revokeObjectURL(imagePreview)
|
||||||
|
}
|
||||||
|
setImagePreview(null)
|
||||||
|
setError(null)
|
||||||
|
setResult(null)
|
||||||
|
} else {
|
||||||
|
setImage(file)
|
||||||
|
setImagePreview(URL.createObjectURL(file))
|
||||||
|
setError(null)
|
||||||
|
setResult(null)
|
||||||
|
}
|
||||||
|
}, [imagePreview])
|
||||||
|
|
||||||
const handleSubmit = async () => {
|
const handleSubmit = async () => {
|
||||||
if (!image) {
|
if (!image) {
|
||||||
@@ -47,7 +58,8 @@ function App() {
|
|||||||
formData.append('image', image)
|
formData.append('image', image)
|
||||||
formData.append('mode', mode)
|
formData.append('mode', mode)
|
||||||
formData.append('prompt', prompt)
|
formData.append('prompt', prompt)
|
||||||
formData.append('grounding', mode === 'find_ref') // Auto-enable for find mode
|
// Enable grounding only for find mode
|
||||||
|
formData.append('grounding', mode === 'find_ref')
|
||||||
formData.append('include_caption', false)
|
formData.append('include_caption', false)
|
||||||
formData.append('find_term', findTerm)
|
formData.append('find_term', findTerm)
|
||||||
formData.append('schema', '')
|
formData.append('schema', '')
|
||||||
@@ -81,12 +93,9 @@ function App() {
|
|||||||
|
|
||||||
const extensions = {
|
const extensions = {
|
||||||
plain_ocr: 'txt',
|
plain_ocr: 'txt',
|
||||||
markdown: 'md',
|
describe: 'txt',
|
||||||
tables_csv: 'csv',
|
find_ref: 'txt',
|
||||||
tables_md: 'md',
|
freeform: 'txt',
|
||||||
kv_json: 'json',
|
|
||||||
layout_map: 'json',
|
|
||||||
pii_redact: 'json',
|
|
||||||
}
|
}
|
||||||
|
|
||||||
const ext = extensions[mode] || 'txt'
|
const ext = extensions[mode] || 'txt'
|
||||||
|
|||||||
@@ -71,27 +71,28 @@ export default function ImageUpload({ onImageSelect, preview }) {
|
|||||||
<motion.div
|
<motion.div
|
||||||
initial={{ opacity: 0, scale: 0.9 }}
|
initial={{ opacity: 0, scale: 0.9 }}
|
||||||
animate={{ opacity: 1, scale: 1 }}
|
animate={{ opacity: 1, scale: 1 }}
|
||||||
className="relative group"
|
className="relative group rounded-2xl overflow-hidden"
|
||||||
>
|
>
|
||||||
<img
|
<img
|
||||||
src={preview}
|
src={preview}
|
||||||
alt="Preview"
|
alt="Preview"
|
||||||
className="w-full rounded-2xl border border-white/10"
|
className="w-full rounded-2xl border border-white/10"
|
||||||
/>
|
/>
|
||||||
<motion.button
|
<div className="absolute top-3 right-3 flex gap-2">
|
||||||
onClick={() => onImageSelect(null)}
|
<motion.button
|
||||||
className="absolute top-3 right-3 bg-red-500/80 backdrop-blur-sm p-2 rounded-full opacity-0 group-hover:opacity-100 transition-opacity"
|
onClick={(e) => {
|
||||||
whileHover={{ scale: 1.1 }}
|
e.stopPropagation()
|
||||||
whileTap={{ scale: 0.9 }}
|
onImageSelect(null)
|
||||||
>
|
}}
|
||||||
<X className="w-4 h-4" />
|
className="bg-red-500/90 backdrop-blur-sm px-3 py-2 rounded-full opacity-100 hover:bg-red-600 transition-colors flex items-center gap-2 shadow-lg"
|
||||||
</motion.button>
|
whileHover={{ scale: 1.05 }}
|
||||||
|
whileTap={{ scale: 0.95 }}
|
||||||
{/* Grounding overlay canvas */}
|
title="Remove image"
|
||||||
<canvas
|
>
|
||||||
id="preview-canvas"
|
<X className="w-4 h-4" />
|
||||||
className="absolute top-0 left-0 w-full h-full pointer-events-none"
|
<span className="text-sm font-medium">Remove</span>
|
||||||
/>
|
</motion.button>
|
||||||
|
</div>
|
||||||
</motion.div>
|
</motion.div>
|
||||||
)}
|
)}
|
||||||
</div>
|
</div>
|
||||||
|
|||||||
@@ -14,7 +14,7 @@ export default function ModeSelector({
|
|||||||
prompt,
|
prompt,
|
||||||
onPromptChange,
|
onPromptChange,
|
||||||
findTerm,
|
findTerm,
|
||||||
onFindTermChange
|
onFindTermChange
|
||||||
}) {
|
}) {
|
||||||
const selectedMode = modes.find(m => m.id === mode)
|
const selectedMode = modes.find(m => m.id === mode)
|
||||||
const needsInput = selectedMode?.needsInput
|
const needsInput = selectedMode?.needsInput
|
||||||
|
|||||||
@@ -9,8 +9,19 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
|
|||||||
const [showAdvanced, setShowAdvanced] = useState(false)
|
const [showAdvanced, setShowAdvanced] = useState(false)
|
||||||
const [imageLoaded, setImageLoaded] = useState(false)
|
const [imageLoaded, setImageLoaded] = useState(false)
|
||||||
|
|
||||||
// Check if text looks like markdown
|
// Check if text looks like HTML (model outputs HTML, not markdown)
|
||||||
const isMarkdown = result?.text && (
|
const isHTML = result?.text && (
|
||||||
|
result.text.includes('<table') ||
|
||||||
|
result.text.includes('<tr>') ||
|
||||||
|
result.text.includes('<td>') ||
|
||||||
|
result.text.includes('<div') ||
|
||||||
|
result.text.includes('<p>') ||
|
||||||
|
result.text.includes('<h1') ||
|
||||||
|
result.text.includes('<h2')
|
||||||
|
)
|
||||||
|
|
||||||
|
// Also check if it looks like markdown (for backwards compatibility)
|
||||||
|
const isMarkdown = result?.text && !isHTML && (
|
||||||
result.text.includes('##') ||
|
result.text.includes('##') ||
|
||||||
result.text.includes('**') ||
|
result.text.includes('**') ||
|
||||||
result.text.includes('```') ||
|
result.text.includes('```') ||
|
||||||
@@ -216,7 +227,15 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
|
|||||||
|
|
||||||
{/* Text result */}
|
{/* Text result */}
|
||||||
<div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-96 overflow-y-auto">
|
<div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-96 overflow-y-auto">
|
||||||
{isMarkdown ? (
|
{isHTML ? (
|
||||||
|
<div
|
||||||
|
className="prose prose-invert prose-sm max-w-none"
|
||||||
|
dangerouslySetInnerHTML={{ __html: result.text }}
|
||||||
|
style={{
|
||||||
|
color: '#e5e7eb',
|
||||||
|
}}
|
||||||
|
/>
|
||||||
|
) : isMarkdown ? (
|
||||||
<div className="prose prose-invert prose-sm max-w-none">
|
<div className="prose prose-invert prose-sm max-w-none">
|
||||||
<ReactMarkdown>{result.text}</ReactMarkdown>
|
<ReactMarkdown>{result.text}</ReactMarkdown>
|
||||||
</div>
|
</div>
|
||||||
@@ -227,10 +246,39 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
|
|||||||
)}
|
)}
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
{/* Raw Response Viewer */}
|
||||||
|
{result.raw_text && (
|
||||||
|
<details className="glass rounded-xl overflow-hidden">
|
||||||
|
<summary className="px-4 py-3 cursor-pointer flex items-center justify-between hover:bg-white/5 transition-colors">
|
||||||
|
<span className="text-sm font-medium text-gray-300">🔍 Raw Model Response</span>
|
||||||
|
<ChevronDown className="w-4 h-4 text-gray-400" />
|
||||||
|
</summary>
|
||||||
|
<div className="px-4 py-3 border-t border-white/10 space-y-2">
|
||||||
|
<p className="text-xs text-gray-400 mb-2">Unprocessed output from the model (useful for debugging)</p>
|
||||||
|
<div className="bg-black/30 rounded-lg p-3 max-h-64 overflow-y-auto">
|
||||||
|
<pre className="text-xs text-green-400 font-mono whitespace-pre-wrap break-words select-all">
|
||||||
|
{result.raw_text}
|
||||||
|
</pre>
|
||||||
|
</div>
|
||||||
|
<div className="flex gap-2 mt-2">
|
||||||
|
<button
|
||||||
|
onClick={() => navigator.clipboard.writeText(result.raw_text)}
|
||||||
|
className="text-xs px-3 py-1 bg-white/5 hover:bg-white/10 rounded-lg transition-colors"
|
||||||
|
>
|
||||||
|
Copy Raw
|
||||||
|
</button>
|
||||||
|
<span className="text-xs text-gray-500 py-1">
|
||||||
|
{result.raw_text.length} characters
|
||||||
|
</span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</details>
|
||||||
|
)}
|
||||||
|
|
||||||
{/* Advanced Settings Dropdown */}
|
{/* Advanced Settings Dropdown */}
|
||||||
<details className="glass rounded-xl overflow-hidden">
|
<details className="glass rounded-xl overflow-hidden">
|
||||||
<summary className="px-4 py-3 cursor-pointer flex items-center justify-between hover:bg-white/5 transition-colors">
|
<summary className="px-4 py-3 cursor-pointer flex items-center justify-between hover:bg-white/5 transition-colors">
|
||||||
<span className="text-sm font-medium text-gray-300">Advanced Settings & Metadata</span>
|
<span className="text-sm font-medium text-gray-300">⚙️ Metadata & Debug Info</span>
|
||||||
<ChevronDown className="w-4 h-4 text-gray-400" />
|
<ChevronDown className="w-4 h-4 text-gray-400" />
|
||||||
</summary>
|
</summary>
|
||||||
<div className="px-4 py-3 border-t border-white/10 space-y-3">
|
<div className="px-4 py-3 border-t border-white/10 space-y-3">
|
||||||
@@ -244,14 +292,21 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
|
|||||||
)}
|
)}
|
||||||
{result.boxes?.length > 0 && (
|
{result.boxes?.length > 0 && (
|
||||||
<div>
|
<div>
|
||||||
<p className="text-xs text-gray-400 mb-2">Detected Regions ({result.boxes.length})</p>
|
<p className="text-xs text-gray-400 mb-2">Parsed Bounding Boxes ({result.boxes.length})</p>
|
||||||
<div className="space-y-1">
|
<div className="bg-black/30 rounded-lg p-2 space-y-1 max-h-32 overflow-y-auto">
|
||||||
{result.boxes.map((box, idx) => (
|
{result.boxes.map((box, idx) => (
|
||||||
<div key={idx} className="text-xs text-gray-500">
|
<div key={idx} className="text-xs font-mono">
|
||||||
{box.label}: [{box.box.map(n => Math.round(n)).join(', ')}]
|
<span className="text-cyan-400">Box {idx + 1}:</span>{' '}
|
||||||
|
<span className="text-purple-400">{box.label}</span>{' '}
|
||||||
|
<span className="text-gray-500">
|
||||||
|
[{box.box.map(n => Math.round(n)).join(', ')}]
|
||||||
|
</span>
|
||||||
</div>
|
</div>
|
||||||
))}
|
))}
|
||||||
</div>
|
</div>
|
||||||
|
<p className="text-xs text-gray-500 mt-2">
|
||||||
|
Coordinates are scaled from model output (0-999) to image pixels
|
||||||
|
</p>
|
||||||
</div>
|
</div>
|
||||||
)}
|
)}
|
||||||
</div>
|
</div>
|
||||||
|
|||||||
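Note on the coordinate caption added in this hunk: the UI text says box coordinates are scaled from the model's 0-999 range to image pixels, but the scaling code itself is not part of this file's diff. A minimal sketch of what that mapping might look like follows. The helper names `scaleBox` and `formatBox` are hypothetical, and the choice of 999 as the divisor (so that 999 maps to the full image width or height) is an assumption rather than something the commit confirms.

```javascript
// Hypothetical helpers, not taken from the commit.
// Assumption: the model emits [x1, y1, x2, y2] with each value in 0..999,
// where 999 corresponds to the full image width (x) or height (y).
const MODEL_COORD_MAX = 999;

// Map one normalized box to pixel coordinates for an image of the given size.
function scaleBox(box, imageWidth, imageHeight) {
  const [x1, y1, x2, y2] = box;
  return [
    (x1 / MODEL_COORD_MAX) * imageWidth,
    (y1 / MODEL_COORD_MAX) * imageHeight,
    (x2 / MODEL_COORD_MAX) * imageWidth,
    (y2 / MODEL_COORD_MAX) * imageHeight,
  ];
}

// Format a pixel box the same way the panel renders it: rounded, comma-separated.
function formatBox(box) {
  return `[${box.map(n => Math.round(n)).join(', ')}]`;
}
```

Rounding only at display time, as the panel's `Math.round` call does, keeps the stored pixel values unrounded for drawing overlays.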