Add PDF processing and multi-format document conversion

Features added:
- PDF to image conversion with configurable DPI
- Multi-page PDF processing with OCR
- Export to Markdown, HTML, DOCX, and JSON formats
- Automatic image extraction from PDFs
- Formula and formatting preservation
- Real-time progress tracking for multi-page documents

Backend changes:
- New /api/process-pdf endpoint for PDF processing
- pdf_utils.py: PDF conversion and image extraction utilities
- format_converter.py: Document format conversion (MD, HTML, DOCX)
- Updated dependencies: PyMuPDF, img2pdf, python-docx, markdown

Frontend changes:
- File type toggle (Image OCR / PDF Processing)
- PDFProcessor component with format selection
- Updated ImageUpload to support both images and PDFs
- Progress bars for multi-page processing
- Download options for converted documents

Documentation:
- Updated README with PDF processing features
- Added API documentation for /api/process-pdf endpoint
- Added format conversion examples
2025-11-15 14:25:09 +00:00
parent 5ba45f7db2
commit e578276d3e
8 changed files with 1220 additions and 65 deletions
--- a/README.md
+++ b/README.md
@@ -4,7 +4,15 @@ Modern OCR web application powered by DeepSeek-OCR with a stunning React fronten

 ![DeepSeek OCR in Action](assets/multi-bird.png)

-> **Recent Updates (v2.1.1)**
+> **Recent Updates (v2.2.0)**
+> - 🎉 **NEW: PDF Processing** - Upload PDFs and extract text from all pages
+> - 🎉 **NEW: Multi-Format Export** - Convert to Markdown, HTML, DOCX, or JSON
+> - 🎉 **NEW: Automatic Image Extraction** - Extract and preserve images from PDFs
+> - 🎉 **NEW: Progress Tracking** - Real-time progress for multi-page documents
+> - ✅ Dual mode: Image OCR + PDF Processing with format conversion
+> - ✅ Enhanced document processing with formula and formatting preservation
+>
+> **Previous Updates (v2.1.1)**
 > - ✅ Fixed image removal button - now properly clears and allows re-upload
 > - ✅ Fixed multiple bounding boxes parsing - handles `[[x1,y1,x2,y2], [x1,y1,x2,y2]]` format
 > - ✅ Simplified to 4 core working modes for better stability
@@ -39,22 +47,32 @@ Modern OCR web application powered by DeepSeek-OCR with a stunning React fronten

 ## Features

-### 4 Core OCR Modes
+### Dual Processing Modes
+#### 📸 **Image OCR** (4 Core Modes)
 - **Plain OCR** - Raw text extraction from any image
 - **Describe** - Generate intelligent image descriptions
 - **Find** - Locate specific terms with visual bounding boxes
 - **Freeform** - Custom prompts for specialized tasks

+#### 📄 **PDF Processing** (NEW!)
+- **Multi-Page Processing** - Process entire PDF documents page by page
+- **Format Conversion** - Export to Markdown, HTML, DOCX, or JSON
+- **Image Extraction** - Automatically extract and preserve embedded images
+- **Formula Preservation** - Maintain mathematical formulas and special formatting
+- **Progress Tracking** - Real-time progress updates for large documents
+
 ### UI Features
 - 🎨 Glass morphism design with animated gradients
-- 🎯 Drag & drop file upload (up to 100MB by default)
-- 🗑️ Easy image removal and re-upload
+- 🎯 Drag & drop file upload (Images up to 10MB, PDFs up to 100MB)
+- 🔄 Easy file removal and re-upload
 - 📦 Grounding box visualization with proper coordinate scaling
 - ✨ Smooth animations (Framer Motion)
-- 📋 Copy/Download results
+- 📋 Copy/Download results in multiple formats
 - 🎛️ Advanced settings dropdown
 - 📝 HTML and Markdown rendering for formatted output
 - 🔍 Multiple bounding box support (handles multiple instances of found terms)
+- 📊 Progress bars for multi-page PDF processing
+- 💾 Direct download for converted documents (MD, HTML, DOCX)

 ## Configuration

@@ -106,19 +124,26 @@ CROP_MODE=true         # Enable dynamic cropping for large images

 ```
 deepseek-ocr/
-├── backend/           # FastAPI backend
-│   ├── main.py
+├── backend/                  # FastAPI backend
+│   ├── main.py              # Main API with OCR and PDF endpoints
+│   ├── pdf_utils.py         # PDF processing utilities (NEW)
+│   ├── format_converter.py  # Document format conversion (NEW)
 │   ├── requirements.txt
 │   └── Dockerfile
-├── frontend/          # React frontend
+├── frontend/                 # React frontend
 │   ├── src/
 │   │   ├── components/
-│   │   ├── App.jsx
+│   │   │   ├── ImageUpload.jsx    # File upload (images & PDFs)
+│   │   │   ├── PDFProcessor.jsx   # PDF processing UI (NEW)
+│   │   │   ├── ModeSelector.jsx
+│   │   │   ├── ResultPanel.jsx
+│   │   │   └── AdvancedSettings.jsx
+│   │   ├── App.jsx           # Main app with dual mode support
 │   │   └── main.jsx
 │   ├── package.json
 │   ├── nginx.conf
 │   └── Dockerfile
-├── models/            # Model cache
+├── models/                   # Model cache
 └── docker-compose.yml
 ```

@@ -288,6 +313,63 @@ For large images, the model uses dynamic cropping:
 - **Supports multiple boxes**: When finding multiple instances, format is `[[x1,y1,x2,y2], [x1,y1,x2,y2], ...]`
 - Frontend automatically displays all boxes overlaid on the image with unique colors

+### POST /api/process-pdf (NEW!)
+
+Process PDF documents with OCR and export to various formats.
+
+**Parameters:**
+- `pdf_file` (file, required) - PDF file to process (up to 100MB)
+- `mode` (string) - OCR mode: `plain_ocr` | `describe` | `find_ref` | `freeform`
+- `prompt` (string) - Custom prompt for freeform mode
+- `output_format` (string) - Output format: `markdown` | `html` | `docx` | `json`
+- `grounding` (bool) - Enable bounding boxes (default: false)
+- `include_caption` (bool) - Add image descriptions (default: false)
+- `extract_images` (bool) - Extract embedded images from PDF (default: true)
+- `dpi` (int) - PDF rendering resolution (default: 144)
+- `base_size` (int) - Base processing size (default: 1024)
+- `image_size` (int) - Tile size for cropping (default: 640)
+- `crop_mode` (bool) - Enable dynamic cropping (default: true)
+
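The parameter list above can be turned into a small client-side helper. This is a minimal sketch, not code from this commit: `build_form_fields` and `render_scale` are hypothetical names, sending booleans as the strings `"true"`/`"false"` is an assumption about how the backend parses form values, and `mode` and `output_format` are passed explicitly because the README does not state their defaults.

```python
# Hypothetical client-side helpers; not part of the repository.
ALLOWED_MODES = {"plain_ocr", "describe", "find_ref", "freeform"}
ALLOWED_FORMATS = {"markdown", "html", "docx", "json"}


def build_form_fields(mode, output_format, prompt=None, grounding=False,
                      include_caption=False, extract_images=True,
                      dpi=144, base_size=1024, image_size=640, crop_mode=True):
    """Return the non-file form fields for /api/process-pdf as strings."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")

    def flag(value):
        # Assumption: the backend accepts "true"/"false" for bool fields.
        return "true" if value else "false"

    fields = {
        "mode": mode,
        "output_format": output_format,
        "grounding": flag(grounding),
        "include_caption": flag(include_caption),
        "extract_images": flag(extract_images),
        "dpi": str(dpi),
        "base_size": str(base_size),
        "image_size": str(image_size),
        "crop_mode": flag(crop_mode),
    }
    if prompt is not None:
        # Only meaningful for freeform mode according to the README.
        fields["prompt"] = prompt
    return fields


def render_scale(dpi):
    """Zoom factor relative to PDF user space, which is 72 points per inch."""
    return dpi / 72
```

With the documented default of 144 DPI, `render_scale` gives a factor of 2, so a 612 x 792 pt page would render at 1224 x 1584 px if the backend scales this way (an assumption). The returned dictionary could be sent with any HTTP client alongside the `pdf_file` upload, for example as the `data` argument of `requests.post` with the PDF passed through `files`.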
+**Response Formats:**
+
+**JSON Format** (`output_format=json`):
+```json
+{
+  "success": true,
+  "total_pages": 5,
+  "pages": [
+    {
+      "page_number": 1,
+      "text": "Extracted and cleaned text...",
+      "raw_text": "Raw model output with tags...",
+      "boxes": [{"label": "field", "box": [x1, y1, x2, y2]}],
+      "images": ["base64_encoded_image_data..."],
+      "image_dims": {"w": 1920, "h": 1080}
+    }
+  ],
+  "metadata": {
+    "mode": "plain_ocr",
+    "grounding": false,
+    "extract_images": true,
+    "dpi": 144
+  }
+}
+```
+
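A client consuming the JSON format above might reassemble the document text as follows. This is a hedged sketch: `join_page_text` is a hypothetical helper, and raising on `"success": false` is an assumption, since the README documents only the success shape.

```python
import json


def join_page_text(payload):
    """Concatenate the cleaned text of all pages in page order."""
    if not payload.get("success"):
        # Assumption: failures are signalled by a falsy "success" field.
        raise ValueError("PDF processing did not succeed")
    pages = sorted(payload["pages"], key=lambda page: page["page_number"])
    # Separate pages with a blank line so paragraphs do not run together.
    return "\n\n".join(page["text"] for page in pages)


if __name__ == "__main__":
    # Tiny stand-in for a real response body, using only documented keys.
    body = json.dumps({
        "success": True,
        "total_pages": 2,
        "pages": [
            {"page_number": 2, "text": "Second page"},
            {"page_number": 1, "text": "First page"},
        ],
    })
    print(join_page_text(json.loads(body)))
```

Sorting by `page_number` keeps the output stable even if pages arrive out of order. The `raw_text`, `boxes`, and `images` fields are ignored here and can be read from each page entry the same way.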
+**File Downloads** (`output_format=markdown|html|docx`):
+- Returns the document as a downloadable file
+- Markdown: `.md` file with preserved formatting
+- HTML: `.html` file with embedded styling and images
+- DOCX: `.docx` Word document with tables and formatting
+
+**Features:**
+- 📄 Multi-page processing with progress tracking
+- 🖼️ Automatic image extraction and embedding
+- 📐 Formula and formatting preservation
+- 🎨 Styled HTML output with tables and code blocks
+- 📝 Clean Markdown with proper structure
+- 📋 Professional DOCX with headings and tables
+
 ## Examples

 Here are some example images showcasing different OCR capabilities: