Add in .env.example for setting ports, fix upload limit, fix bounding box, can now dismiss previous image, change markdown expectation to HTML - not MD. updated README with nvidia driver/container instructions

2025-10-21 21:35:17 +01:00
parent e02338436b
commit 3efc4da7ff
9 changed files with 399 additions and 101 deletions
--- a/README.md
+++ b/README.md
@@ -2,43 +2,103 @@

 Modern OCR web application powered by DeepSeek-OCR with a stunning React frontend and FastAPI backend.

-> **Note**: This was a quickly vibe-coded project to test out DeepSeek-OCR! It basically works quite nice on an RTX 5090. The "Find" mode grounding boxes aren't quite working yet - probably my fault in not interpreting the dimensions correctly, but the core OCR functionality is pretty nice so far.
+> **Recent Updates (v2.1.1)**
+> - ✅ Fixed image removal button - now properly clears and allows re-upload
+> - ✅ Fixed multiple bounding boxes parsing - handles `[[x1,y1,x2,y2], [x1,y1,x2,y2]]` format
+> - ✅ Simplified to 4 core working modes for better stability
+> - ✅ Fixed bounding box coordinate scaling (normalized 0-999 → actual pixels)
+> - ✅ Fixed HTML rendering (model outputs HTML, not Markdown)
+> - ✅ Increased file upload limit to 100MB (configurable)
+> - ✅ Added .env configuration support

 ## Quick Start

-```bash
-docker compose up --build
-```
+1. **Clone and configure:**
+   ```bash
+   git clone <repository-url>
+   cd deepseek_ocr_app
   
-Then open:
- **Frontend**: http://localhost:3000
- **Backend API**: http://localhost:8000
- **API Docs**: http://localhost:8000/docs
+   # Copy and customize environment variables
+   cp .env.example .env
+   # Edit .env to configure ports, upload limits, etc.
+   ```
+
+2. **Start the application:**
+   ```bash
+   docker compose up --build
+   ```
+
+   The first run will download the model (~5-10GB), which may take some time.
+
+3. **Access the application:**
+   - **Frontend**: http://localhost:3000 (or your configured FRONTEND_PORT)
+   - **Backend API**: http://localhost:8000 (or your configured API_PORT)
+   - **API Docs**: http://localhost:8000/docs

 ## Features

-### 4 OCR Modes
- **Plain OCR** - Raw text extraction
- **Describe** - Generate image descriptions
- **Find** - Locate specific terms (grounding boxes WIP)
- **Freeform** - Custom prompts for anything
+### 4 Core OCR Modes
+- **Plain OCR** - Raw text extraction from any image
+- **Describe** - Generate intelligent image descriptions
+- **Find** - Locate specific terms with visual bounding boxes
+- **Freeform** - Custom prompts for specialized tasks

 ### UI Features
 - 🎨 Glass morphism design with animated gradients
- 🎯 Drag & drop file upload
- 📦 Grounding box visualization (WIP - dimensions need fixing)
+- 🎯 Drag & drop file upload (up to 100MB by default)
+- 🗑️ Easy image removal and re-upload
+- 📦 Grounding box visualization with proper coordinate scaling
 - ✨ Smooth animations (Framer Motion)
 - 📋 Copy/Download results
 - 🎛️ Advanced settings dropdown
- 📝 Markdown rendering for formatted output
+- 📝 HTML and Markdown rendering for formatted output
+- 🔍 Multiple bounding box support (handles multiple instances of found terms)
+
+## Configuration
+
+The application can be configured via the `.env` file:
+
+```bash
+# API Configuration
+API_HOST=0.0.0.0
+API_PORT=8000
+
+# Frontend Configuration
+FRONTEND_PORT=3000
+
+# Model Configuration
+MODEL_NAME=deepseek-ai/DeepSeek-OCR
+HF_HOME=/models
+
+# Upload Configuration
+MAX_UPLOAD_SIZE_MB=100  # Maximum file upload size
+
+# Processing Configuration
+BASE_SIZE=1024         # Base processing resolution
+IMAGE_SIZE=640         # Tile processing resolution
+CROP_MODE=true         # Enable dynamic cropping for large images
+```
+
+### Environment Variables
+
+- `API_HOST`: Backend API host (default: 0.0.0.0)
+- `API_PORT`: Backend API port (default: 8000)
+- `FRONTEND_PORT`: Frontend port (default: 3000)
+- `MODEL_NAME`: HuggingFace model identifier
+- `HF_HOME`: Model cache directory
+- `MAX_UPLOAD_SIZE_MB`: Maximum file upload size in megabytes
+- `BASE_SIZE`: Base image processing size (affects memory usage)
+- `IMAGE_SIZE`: Tile size for dynamic cropping
+- `CROP_MODE`: Enable/disable dynamic image cropping

 ## Tech Stack

 - **Frontend**: React 18 + Vite 5 + TailwindCSS 3 + Framer Motion 11
 - **Backend**: FastAPI + PyTorch + Transformers 4.46 + DeepSeek-OCR
+- **Configuration**: python-decouple for environment management
 - **Server**: Nginx (reverse proxy)
 - **Container**: Docker + Docker Compose with multi-stage builds
- **GPU**: NVIDIA CUDA support (tested on RTX 5090)
+- **GPU**: NVIDIA CUDA support (tested on RTX 3090, RTX 5090)

 ## Project Structure

@@ -62,56 +122,170 @@ deepseek-ocr/

 ## Development

-### Backend
+Docker compose cycle to test:
 ```bash
-cd backend
-pip install -r requirements.txt
-uvicorn main:app --reload --host 0.0.0.0 --port 8000
-```
-
-### Frontend
-```bash
-cd frontend
-npm install
-npm run dev
+docker compose down
+docker compose up --build
 ```

 ## Requirements

- Docker & Docker Compose
- NVIDIA GPU with CUDA support (tested on RTX 5090)
- nvidia-docker runtime
- ~8-12GB VRAM for model
+### Hardware
+- NVIDIA GPU with CUDA support
+  - Recommended: RTX 3090, RTX 4090, RTX 5090, or newer
+  - Minimum: 8-12GB VRAM for the model
+  - More VRAM allows for larger batch sizes and higher resolution images

-## Known Issues
+### Software
+- **Docker & Docker Compose** (latest version recommended)

- 📦 **Find mode grounding boxes**: Not rendering correctly - likely dimension scaling issue in the canvas overlay logic. Boxes are detected and returned by the backend, but the frontend visualization needs work.
+- **NVIDIA Driver** - Installing NVIDIA Drivers on Ubuntu (Blackwell/RTX 5090)
+
+  **Note**: Getting NVIDIA drivers working on Blackwell GPUs can be a pain! Here's what worked:
+
+  The key requirements for RTX 5090 on Ubuntu 24.04:
+  - Use the open-source driver (nvidia-driver-570-open or newer, like nvidia-driver-580-open)
+  - Upgrade to kernel 6.11+ (6.14+ recommended for best stability)
+  - Enable Resize Bar in BIOS/UEFI (critical!)
+
+  **Step-by-Step Instructions:**
+
+  1. Install NVIDIA Open Driver (580 or newer)
+     ```bash
+     sudo add-apt-repository ppa:graphics-drivers/ppa
+     sudo apt update
+     sudo apt remove --purge nvidia*
+     sudo nvidia-installer --uninstall  # If you have it
+     sudo apt autoremove
+     sudo apt install nvidia-driver-580-open
+     ```
+
+  2. Upgrade Linux Kernel to 6.11+ (for Ubuntu 24.04 LTS)
+     ```bash
+     sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
+     sudo update-initramfs -u
+     sudo apt autoremove
+     ```
+
+  3. Reboot
+     ```bash
+     sudo reboot
+     ```
+
+  4. Enable Resize Bar in UEFI/BIOS
+     - Restart and enter UEFI (usually F2, Del, or F12 during boot)
+     - Find and enable "Resize Bar" or "Smart Access Memory"
+     - This will also enable "Above 4G Decoding" and disable "CSM" (Compatibility Support Module)—that's expected!
+     - Save and exit
+
+  5. Verify Installation
+     ```bash
+     nvidia-smi
+     ```
+     You should see your RTX 5090 listed!
+
+  💡 **Why open drivers?** I dunno, but the open drivers have better support for Blackwell GPUs. Without Resize Bar enabled, you'll get a black screen even with correct drivers!
+
+  Credit: Solution adapted from [this Reddit thread](https://www.reddit.com/r/linux_gaming/comments/1i3h4gn/blackwell_on_linux/).
+
+- **NVIDIA Container Toolkit** (required for GPU access in Docker)
+  - Installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
+
+### System Requirements
+- ~20GB free disk space (for model weights and Docker images)
+- 16GB+ system RAM recommended
+- Fast internet connection for initial model download (~5-10GB)
+
+## Known Issues & Fixes
+
+### ✅ FIXED: Image removal and re-upload (v2.1.1)
+- **Issue**: Couldn't remove uploaded image and upload a new one
+- **Fix**: Added prominent "Remove" button that clears image state and allows fresh upload
+
+### ✅ FIXED: Multiple bounding boxes (v2.1.1)
+- **Issue**: Only single bounding box worked, multiple boxes like `[[x1,y1,x2,y2], [x1,y1,x2,y2]]` failed
+- **Fix**: Updated parser to handle both single and array of coordinate arrays using `ast.literal_eval`
+
+### ✅ FIXED: Grounding box coordinate scaling (v2.1)
+- **Issue**: Bounding boxes weren't displaying correctly
+- **Cause**: Model outputs coordinates normalized to 0-999, not actual pixel dimensions
+- **Fix**: Backend now properly scales coordinates using the formula: `actual_coord = (normalized_coord / 999) * image_dimension`
+
+### ✅ FIXED: HTML vs Markdown rendering (v2.1)
+- **Issue**: Output was being rendered as Markdown when model outputs HTML
+- **Cause**: Model is trained to output HTML (especially for tables)
+- **Fix**: Frontend now detects and renders HTML properly using `dangerouslySetInnerHTML`
+
+### ✅ FIXED: Limited upload size (v2.1)
+- **Issue**: Large images couldn't be uploaded
+- **Fix**: Increased nginx `client_max_body_size` to 100MB (configurable via .env)
+
+### ⚠️ Simplified Mode Selection (v2.1.1)
+- **Change**: Reduced from 12 modes to 4 core working modes
+- **Reason**: Advanced modes (tables, layout, PII, multilingual) need additional testing
+- **Working modes**: Plain OCR, Describe, Find, Freeform
+- **Future**: Additional modes will be re-enabled after thorough testing
+
+## How the Model Works
+
+### Coordinate System
+The DeepSeek-OCR model uses a normalized coordinate system (0-999) for bounding boxes:
+- All coordinates are output in range [0, 999]
+- Backend scales: `pixel_coord = (model_coord / 999) * actual_dimension`
+- This ensures consistency across different image sizes
+
+### Dynamic Cropping
+For large images, the model uses dynamic cropping:
+- Images ≤640x640: Direct processing
+- Larger images: Split into tiles based on aspect ratio
+- Global view (BASE_SIZE) + Local views (IMAGE_SIZE tiles)
+- See `process/image_process.py` for implementation details
+
+### Output Format
+- Plain text modes: Return raw text
+- Table modes: Return HTML tables or CSV
+- JSON modes: Return structured JSON
+- Grounding modes: Return text with `<|ref|>label<|/ref|><|det|>[[coords]]<|/det|>` tags

 ## API Usage

 ### POST /api/ocr

 **Parameters:**
- `image` (file, required)
- `mode` (string): plain_ocr | describe | find_ref | freeform
- `prompt` (string): Custom prompt for freeform mode
- `grounding` (bool): Enable bounding boxes (auto-enabled for find_ref)
- `find_term` (string): Term to locate in find_ref mode
- `base_size` (int): Base processing size (default: 1024)
- `image_size` (int): Image size (default: 640)
- `crop_mode` (bool): Enable crop mode (default: true)
+- `image` (file, required) - Image file to process (up to 100MB)
+- `mode` (string) - OCR mode: `plain_ocr` | `describe` | `find_ref` | `freeform`
+- `prompt` (string) - Custom prompt for freeform mode
+- `grounding` (bool) - Enable bounding boxes (auto-enabled for find_ref)
+- `find_term` (string) - Term to locate in find_ref mode (supports multiple matches)
+- `base_size` (int) - Base processing size (default: 1024)
+- `image_size` (int) - Tile size for cropping (default: 640)
+- `crop_mode` (bool) - Enable dynamic cropping (default: true)
+- `include_caption` (bool) - Add image description (default: false)

 **Response:**
 ```json
 {
  "success": true,
-  "text": "Extracted text...",
+  "text": "Extracted text or HTML output...",
  "boxes": [{"label": "field", "box": [x1, y1, x2, y2]}],
  "image_dims": {"w": 1920, "h": 1080},
-  "metadata": {...}
+  "metadata": {
+    "mode": "layout_map",
+    "grounding": true,
+    "base_size": 1024,
+    "image_size": 640,
+    "crop_mode": true
+  }
 }
 ```

+**Note on Bounding Boxes:**
+- The model outputs coordinates normalized to 0-999
+- The backend automatically scales them to actual image dimensions
+- Coordinates are in [x1, y1, x2, y2] format (top-left, bottom-right)
+- **Supports multiple boxes**: When finding multiple instances, format is `[[x1,y1,x2,y2], [x1,y1,x2,y2], ...]`
+- Frontend automatically displays all boxes overlaid on the image with unique colors
+
 ## Troubleshooting

 ### GPU not detected
--- a/backend/main.py
+++ b/backend/main.py
@@ -12,6 +12,7 @@ import torch
 from transformers import AutoModel, AutoTokenizer
 from PIL import Image
 import uvicorn
+from decouple import config as env_config

 # -----------------------------
 # Lifespan context for model loading
@@ -26,8 +27,8 @@ async def lifespan(app: FastAPI):
    
    # Environment setup
    os.environ.pop("TRANSFORMERS_CACHE", None)
-    MODEL_NAME = os.environ.get("MODEL_NAME", "deepseek-ai/DeepSeek-OCR")
-    HF_HOME = os.environ.get("HF_HOME", "/models")
+    MODEL_NAME = env_config("MODEL_NAME", default="deepseek-ai/DeepSeek-OCR")
+    HF_HOME = env_config("HF_HOME", default="/models")
    os.makedirs(HF_HOME, exist_ok=True)
    
    # Load model
@@ -138,7 +139,7 @@ def build_prompt(
    elif mode == "multilingual":
        instruction = "Free OCR. Detect the language automatically and output in the same script."
    elif mode == "describe":
-        instruction = "Describe this image concisely in 2-3 sentences. Focus on visible key elements."
+        instruction = "Describe this image. Focus on visible key elements."
    elif mode == "freeform":
        instruction = user_prompt.strip() if user_prompt else "OCR this image."
    else:
@@ -153,36 +154,82 @@ def build_prompt(
 # -----------------------------
 # Grounding parser
 # -----------------------------
+# Match a full detection block and capture the coordinates as the entire list expression
+# Examples of captured coords (including outer brackets):
+#  - [[312, 339, 480, 681]]
+#  - [[504, 700, 625, 910], [771, 570, 996, 996]]
+#  - [[110, 310, 255, 800], [312, 343, 479, 680], ...]
+# Using a greedy bracket capture ensures we include all inner lists up to the last ']' before </|det|>
 DET_BLOCK = re.compile(
-    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>\s*\[\s*\[\s*(?P<coords>[^\]]+?)\s*\]\s*\]\s*<\|/det\|>",
+    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>\s*(?P<coords>\[.*\])\s*<\|/det\|>",
    re.DOTALL,
 )

 def clean_grounding_text(text: str) -> str:
    """Remove grounding tags from text for display, keeping labels"""
-    # Replace <|ref|>label<|/ref|><|det|>[[...]]<|/det|> with just "label"
+    # Replace <|ref|>label<|/ref|><|det|>[...any nested lists...]<|/det|> with just the label
    cleaned = re.sub(
-        r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>\s*\[\s*\[[^\]]+\]\s*\]\s*<\|/det\|>",
+        r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>\s*\[.*\]\s*<\|/det\|>",
        r"\1",
        text,
-        flags=re.DOTALL
+        flags=re.DOTALL,
    )
    # Also remove any standalone grounding tags
    cleaned = re.sub(r"<\|grounding\|>", "", cleaned)
    return cleaned.strip()

-def parse_detections(text: str) -> List[Dict[str, Any]]:
-    """Parse grounding boxes from text"""
+def parse_detections(text: str, image_width: int, image_height: int) -> List[Dict[str, Any]]:
+    """Parse grounding boxes from text and scale from 0-999 normalized coords to actual image dimensions
+    
+    Handles both single and multiple bounding boxes:
+    - Single: <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>
+    - Multiple: <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2], [x1,y1,x2,y2], ...]<|/det|>
+    """
    boxes: List[Dict[str, Any]] = []
    for m in DET_BLOCK.finditer(text or ""):
        label = m.group("label").strip()
-        coords = [c.strip() for c in m.group("coords").split(",")]
+        coords_str = m.group("coords").strip()
+
+        print(f"🔍 DEBUG: Found detection for '{label}'")
+        print(f"📦 Raw coords string (with brackets): {coords_str}")
+
        try:
-            nums = list(map(float, coords[:4]))
-        except Exception:
+            import ast
+
+            # Parse the full bracket expression directly (handles single and multiple)
+            parsed = ast.literal_eval(coords_str)
+
+            # Normalize to a list of lists
+            if (
+                isinstance(parsed, list)
+                and len(parsed) == 4
+                and all(isinstance(n, (int, float)) for n in parsed)
+            ):
+                # Single box provided as [x1,y1,x2,y2]
+                box_coords = [parsed]
+                print("📦 Single box (flat list) detected")
+            elif isinstance(parsed, list):
+                box_coords = parsed
+                print(f"📦 Boxes detected: {len(box_coords)}")
+            else:
+                raise ValueError("Unsupported coords structure")
+
+            # Process each box
+            for idx, box in enumerate(box_coords):
+                if isinstance(box, (list, tuple)) and len(box) >= 4:
+                    x1 = int(float(box[0]) / 999 * image_width)
+                    y1 = int(float(box[1]) / 999 * image_height)
+                    x2 = int(float(box[2]) / 999 * image_width)
+                    y2 = int(float(box[3]) / 999 * image_height)
+                    print(f"  Box {idx+1}: {box} → [{x1}, {y1}, {x2}, {y2}]")
+                    boxes.append({"label": label, "box": [x1, y1, x2, y2]})
+                else:
+                    print(f"  ⚠️ Skipping invalid box: {box}")
+        except Exception as e:
+            print(f"❌ Parsing failed: {e}")
            continue
-        if len(nums) == 4:
-            boxes.append({"label": label, "box": nums})
+    
+    print(f"🎯 Total boxes parsed: {len(boxes)}")
    return boxes

 # -----------------------------
@@ -289,8 +336,8 @@ async def ocr_inference(
        if not text:
            text = "No text returned by model."
        
-        # Parse grounding boxes
-        boxes = parse_detections(text) if ("<|det|>" in text or "<|ref|>" in text) else []
+        # Parse grounding boxes with proper coordinate scaling
+        boxes = parse_detections(text, orig_w or 1, orig_h or 1) if ("<|det|>" in text or "<|ref|>" in text) else []
        
        # Clean grounding tags from display text, but keep the labels
        display_text = clean_grounding_text(text) if ("<|ref|>" in text or "<|grounding|>" in text) else text
@@ -302,6 +349,7 @@ async def ocr_inference(
        return JSONResponse({
            "success": True,
            "text": display_text,
+            "raw_text": text,  # Include raw model output for debugging
            "boxes": boxes,
            "image_dims": {"w": orig_w, "h": orig_h},
            "metadata": {
@@ -326,4 +374,6 @@ async def ocr_inference(
            shutil.rmtree(out_dir, ignore_errors=True)

 if __name__ == "__main__":
-    uvicorn.run(app, host="0.0.0.0", port=8000)
+    host = env_config("API_HOST", default="0.0.0.0")
+    port = env_config("API_PORT", default=8000, cast=int)
+    uvicorn.run(app, host=host, port=port)
--- a/backend/requirements.txt
+++ b/backend/requirements.txt
@@ -10,3 +10,4 @@ easydict
 pillow
 safetensors
 torch
+python-decouple>=3.8
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -2,9 +2,14 @@ services:
  backend:
    build: ./backend
    container_name: deepseek-ocr-backend
+    env_file:
+      - .env
    environment:
-      MODEL_NAME: deepseek-ai/DeepSeek-OCR
-      HF_HOME: /models
+      MODEL_NAME: ${MODEL_NAME:-deepseek-ai/DeepSeek-OCR}
+      HF_HOME: ${HF_HOME:-/models}
+      API_HOST: ${API_HOST:-0.0.0.0}
+      API_PORT: ${API_PORT:-8000}
+      MAX_UPLOAD_SIZE_MB: ${MAX_UPLOAD_SIZE_MB:-100}
    volumes:
      - ./models:/models
    deploy:
@@ -16,7 +21,7 @@ services:
              capabilities: [gpu]
    shm_size: "4g"
    ports:
-      - "8000:8000"
+      - "${API_PORT:-8000}:${API_PORT:-8000}"
    networks:
      - ocr-network

@@ -24,7 +29,7 @@ services:
    build: ./frontend
    container_name: deepseek-ocr-frontend
    ports:
-      - "3000:80"
+      - "${FRONTEND_PORT:-3000}:80"
    depends_on:
      - backend
    networks:
--- a/frontend/nginx.conf
+++ b/frontend/nginx.conf
@@ -4,6 +4,9 @@ server {
    root /usr/share/nginx/html;
    index index.html;

+    # Allow larger file uploads (100MB)
+    client_max_body_size 100M;
+
    # Gzip compression
    gzip on;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
--- a/frontend/src/App.jsx
+++ b/frontend/src/App.jsx
@@ -27,11 +27,22 @@ function App() {
  })

  const handleImageSelect = useCallback((file) => {
+    if (file === null) {
+      // Clear everything when removing image
+      setImage(null)
+      if (imagePreview) {
+        URL.revokeObjectURL(imagePreview)
+      }
+      setImagePreview(null)
+      setError(null)
+      setResult(null)
+    } else {
      setImage(file)
      setImagePreview(URL.createObjectURL(file))
      setError(null)
      setResult(null)
-  }, [])
+    }
+  }, [imagePreview])

  const handleSubmit = async () => {
    if (!image) {
@@ -47,7 +58,8 @@ function App() {
      formData.append('image', image)
      formData.append('mode', mode)
      formData.append('prompt', prompt)
-      formData.append('grounding', mode === 'find_ref') // Auto-enable for find mode
+      // Enable grounding only for find mode
+      formData.append('grounding', mode === 'find_ref')
      formData.append('include_caption', false)
      formData.append('find_term', findTerm)
      formData.append('schema', '')
@@ -81,12 +93,9 @@ function App() {
    
    const extensions = {
      plain_ocr: 'txt',
-      markdown: 'md',
-      tables_csv: 'csv',
-      tables_md: 'md',
-      kv_json: 'json',
-      layout_map: 'json',
-      pii_redact: 'json',
+      describe: 'txt',
+      find_ref: 'txt',
+      freeform: 'txt',
    }
    
    const ext = extensions[mode] || 'txt'
--- a/frontend/src/components/ImageUpload.jsx
+++ b/frontend/src/components/ImageUpload.jsx
@@ -71,27 +71,28 @@ export default function ImageUpload({ onImageSelect, preview }) {
        <motion.div
          initial={{ opacity: 0, scale: 0.9 }}
          animate={{ opacity: 1, scale: 1 }}
-          className="relative group"
+          className="relative group rounded-2xl overflow-hidden"
        >
          <img 
            src={preview} 
            alt="Preview" 
            className="w-full rounded-2xl border border-white/10"
          />
+          <div className="absolute top-3 right-3 flex gap-2">
            <motion.button
-            onClick={() => onImageSelect(null)}
-            className="absolute top-3 right-3 bg-red-500/80 backdrop-blur-sm p-2 rounded-full opacity-0 group-hover:opacity-100 transition-opacity"
-            whileHover={{ scale: 1.1 }}
-            whileTap={{ scale: 0.9 }}
+              onClick={(e) => {
+                e.stopPropagation()
+                onImageSelect(null)
+              }}
+              className="bg-red-500/90 backdrop-blur-sm px-3 py-2 rounded-full opacity-100 hover:bg-red-600 transition-colors flex items-center gap-2 shadow-lg"
+              whileHover={{ scale: 1.05 }}
+              whileTap={{ scale: 0.95 }}
+              title="Remove image"
            >
              <X className="w-4 h-4" />
+              <span className="text-sm font-medium">Remove</span>
            </motion.button>
-          
-          {/* Grounding overlay canvas */}
-          <canvas 
-            id="preview-canvas" 
-            className="absolute top-0 left-0 w-full h-full pointer-events-none"
-          />
+          </div>
        </motion.div>
      )}
    </div>
--- a/frontend/src/components/ResultPanel.jsx
+++ b/frontend/src/components/ResultPanel.jsx
@@ -9,8 +9,19 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
  const [showAdvanced, setShowAdvanced] = useState(false)
  const [imageLoaded, setImageLoaded] = useState(false)

-  // Check if text looks like markdown
-  const isMarkdown = result?.text && (
+  // Check if text looks like HTML (model outputs HTML, not markdown)
+  const isHTML = result?.text && (
+    result.text.includes('<table') || 
+    result.text.includes('<tr>') || 
+    result.text.includes('<td>') ||
+    result.text.includes('<div') ||
+    result.text.includes('<p>') ||
+    result.text.includes('<h1') ||
+    result.text.includes('<h2')
+  )
+
+  // Also check if it looks like markdown (for backwards compatibility)
+  const isMarkdown = result?.text && !isHTML && (
    result.text.includes('##') || 
    result.text.includes('**') || 
    result.text.includes('```') ||
@@ -216,7 +227,15 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD

            {/* Text result */}
            <div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-96 overflow-y-auto">
-              {isMarkdown ? (
+              {isHTML ? (
+                <div 
+                  className="prose prose-invert prose-sm max-w-none"
+                  dangerouslySetInnerHTML={{ __html: result.text }}
+                  style={{
+                    color: '#e5e7eb',
+                  }}
+                />
+              ) : isMarkdown ? (
                <div className="prose prose-invert prose-sm max-w-none">
                  <ReactMarkdown>{result.text}</ReactMarkdown>
                </div>
@@ -227,10 +246,39 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
              )}
            </div>

+            {/* Raw Response Viewer */}
+            {result.raw_text && (
+              <details className="glass rounded-xl overflow-hidden">
+                <summary className="px-4 py-3 cursor-pointer flex items-center justify-between hover:bg-white/5 transition-colors">
+                  <span className="text-sm font-medium text-gray-300">🔍 Raw Model Response</span>
+                  <ChevronDown className="w-4 h-4 text-gray-400" />
+                </summary>
+                <div className="px-4 py-3 border-t border-white/10 space-y-2">
+                  <p className="text-xs text-gray-400 mb-2">Unprocessed output from the model (useful for debugging)</p>
+                  <div className="bg-black/30 rounded-lg p-3 max-h-64 overflow-y-auto">
+                    <pre className="text-xs text-green-400 font-mono whitespace-pre-wrap break-words select-all">
+                      {result.raw_text}
+                    </pre>
+                  </div>
+                  <div className="flex gap-2 mt-2">
+                    <button
+                      onClick={() => navigator.clipboard.writeText(result.raw_text)}
+                      className="text-xs px-3 py-1 bg-white/5 hover:bg-white/10 rounded-lg transition-colors"
+                    >
+                      Copy Raw
+                    </button>
+                    <span className="text-xs text-gray-500 py-1">
+                      {result.raw_text.length} characters
+                    </span>
+                  </div>
+                </div>
+              </details>
+            )}
+
            {/* Advanced Settings Dropdown */}
            <details className="glass rounded-xl overflow-hidden">
              <summary className="px-4 py-3 cursor-pointer flex items-center justify-between hover:bg-white/5 transition-colors">
-                <span className="text-sm font-medium text-gray-300">Advanced Settings & Metadata</span>
+                <span className="text-sm font-medium text-gray-300">⚙️ Metadata & Debug Info</span>
                <ChevronDown className="w-4 h-4 text-gray-400" />
              </summary>
              <div className="px-4 py-3 border-t border-white/10 space-y-3">
@@ -244,14 +292,21 @@ export default function ResultPanel({ result, loading, imagePreview, onCopy, onD
                )}
                {result.boxes?.length > 0 && (
                  <div>
-                    <p className="text-xs text-gray-400 mb-2">Detected Regions ({result.boxes.length})</p>
-                    <div className="space-y-1">
+                    <p className="text-xs text-gray-400 mb-2">Parsed Bounding Boxes ({result.boxes.length})</p>
+                    <div className="bg-black/30 rounded-lg p-2 space-y-1 max-h-32 overflow-y-auto">
                      {result.boxes.map((box, idx) => (
-                        <div key={idx} className="text-xs text-gray-500">
-                          {box.label}: [{box.box.map(n => Math.round(n)).join(', ')}]
+                        <div key={idx} className="text-xs font-mono">
+                          <span className="text-cyan-400">Box {idx + 1}:</span>{' '}
+                          <span className="text-purple-400">{box.label}</span>{' '}
+                          <span className="text-gray-500">
+                            [{box.box.map(n => Math.round(n)).join(', ')}]
+                          </span>
                        </div>
                      ))}
                    </div>
+                    <p className="text-xs text-gray-500 mt-2">
+                      Coordinates are scaled from model output (0-999) to image pixels
+                    </p>
                  </div>
                )}
              </div>