# Deep Learning Based Bengali Form Extraction System

## Goal

Build a production-ready AI pipeline that can:

* Read scanned Bengali PDFs/images
* Detect form regions
* Detect checkboxes
* Identify only the checked option
* Extract Bengali text
* Map checkbox → nearest label
* Return structured JSON
* Expose REST APIs using FastAPI

---

# Recommended Final Architecture

```text
PDF/Image
   ↓
Preprocessing
   ↓
Deep Learning Layout Detection
   ↓
Checkbox Detection Model
   ↓
OCR (Bengali)
   ↓
Spatial Mapping Engine
   ↓
JSON Output
   ↓
FastAPI Response
```
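
The data flow above can be sketched as a chain of stage functions that pass a shared context along. This is only a sketch of the idea, not the project's implementation: `Stage`, `run_pipeline` and the stub stages are hypothetical names, and each stub stands in for the real component (OpenCV preprocessing, YOLO detection, PaddleOCR, spatial mapping).

```python
from typing import Any, Callable, Dict, List

# A stage takes the shared context dict and returns an updated copy.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(page: Any, stages: List[Stage]) -> Dict[str, Any]:
    """Pass one page through every stage in order, accumulating results."""
    context: Dict[str, Any] = {"page": page}
    for stage in stages:
        context = stage(context)
    return context


# Stub stages (hypothetical): real versions would call OpenCV, YOLO, PaddleOCR.
def preprocess(ctx):
    return {**ctx, "clean_page": ctx["page"]}


def detect_checkboxes(ctx):
    return {**ctx, "checkboxes": [{"box": (10, 10, 30, 30), "checked": True}]}


def run_ocr(ctx):
    return {**ctx, "labels": [{"box": (40, 10, 120, 30), "text": "নিজস্ব জমি"}]}


def map_labels(ctx):
    checked = [c for c in ctx["checkboxes"] if c["checked"]]
    # Placeholder mapping: pair the first checked box with the first label.
    value = ctx["labels"][0]["text"] if checked and ctx["labels"] else None
    return {**ctx, "result": {"land_ownership_type": value}}


result = run_pipeline("page-1", [preprocess, detect_checkboxes, run_ocr, map_labels])
print(result["result"])
```

Keeping every stage behind the same signature makes it easy to swap a classical OpenCV step for a DL model later without touching the rest of the chain.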

---

# Why Add Deep Learning?

Classical OpenCV methods alone work reliably only when:

* form alignment is fixed
* checkbox shape is consistent
* scan quality is clean

Deep learning improves robustness to:

* rotated scans
* noisy documents
* handwritten marks
* varying checkbox sizes
* mobile camera images
* partially broken forms

---

# Recommended DL Components

## 1. Layout Detection Model

Purpose: detect sections such as:

* title
* table
* checkbox group
* signature area
* text blocks

### Best Choices

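| Model        | Notes                                                        |
| ------------ | ------------------------------------------------------------ |
| YOLOv8       | Fast single-stage detector; good default                     |
| Detectron2   | Flexible framework (Faster R-CNN, Mask R-CNN); heavier setup |
| LayoutParser | High-level document layout toolkit                           |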

### Recommended

```text
YOLOv8n
```

Classes to train:

```text
checkbox_group
form_table
signature
header
```

---

## 2. Checkbox Detection Model

Use a trained detector instead of contour-only detection.

### Classes

```text
checked_checkbox
unchecked_checkbox
```

### Recommended Models

| Model        | Usage              |
| ------------ | ------------------ |
| YOLOv8       | Fast and accurate  |
| EfficientDet | High precision     |
| Faster R-CNN | Heavy but powerful |

### Best Option

```text
YOLOv8n or YOLOv8s (small variants)
```

---

## 3. OCR Layer

### Primary OCR

```text
PaddleOCR
```

Advantages:

* Bengali support
* often more robust than Tesseract on low-quality scans
* built-in text angle classification
* GPU support

### Secondary OCR Fallback

```text
Tesseract Bengali
```

Falling back to a second engine when the primary result is empty or low-confidence improves robustness.
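
One way to wire the fallback is shown below as a minimal sketch. It assumes each OCR engine is wrapped in a callable that returns `(text, confidence)` pairs; `ocr_with_fallback`, the `min_confidence` default and the two fake engines are hypothetical and only illustrate the control flow, not the PaddleOCR or Tesseract APIs.

```python
from typing import Callable, List, Tuple

# An OCR engine is modelled as a callable returning (text, confidence) pairs.
OcrResult = List[Tuple[str, float]]
OcrEngine = Callable[[object], OcrResult]


def ocr_with_fallback(image, primary: OcrEngine, secondary: OcrEngine,
                      min_confidence: float = 0.6) -> OcrResult:
    """Use the primary engine; fall back if it fails, finds nothing, or is unsure."""
    try:
        lines = primary(image)
    except Exception:
        lines = []
    if lines:
        mean_conf = sum(conf for _, conf in lines) / len(lines)
        if mean_conf >= min_confidence:
            return lines
    fallback = secondary(image)
    # Keep the (weak) primary result if the fallback found nothing at all.
    return fallback or lines


# Fake engines standing in for PaddleOCR and Tesseract wrappers.
def fake_paddle(_image):
    return [("নিজস্ব", 0.31)]


def fake_tesseract(_image):
    return [("নিজস্ব জমি", 0.88)]


print(ocr_with_fallback(None, fake_paddle, fake_tesseract))
```

The threshold should be tuned on real scans, since confidence scores from different engines are not directly comparable.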

---

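## 4. Semantic Mapping Layer (Important)

This layer decides which label belongs to each checkbox, so it determines whether the extracted values are correct.

Instead of hardcoding a position check such as:

```python
if abs(checkbox_x - label_x) < MAX_GAP_PX:  # fixed pixel threshold (hypothetical constant)
    selected_label = label_text
```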

Use:

* spatial reasoning
* nearest-neighbor mapping
* label embeddings
* geometric scoring
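
A geometric score can be as simple as weighting vertical misalignment more heavily than horizontal gap, since a checkbox and its label usually sit on the same text line. The sketch below assumes axis-aligned boxes in pixel coordinates; `Box`, `label_score`, `best_label` and the `row_weight` value are illustrative names and numbers, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""

    @property
    def cy(self) -> float:
        return (self.y1 + self.y2) / 2

    @property
    def height(self) -> float:
        return self.y2 - self.y1


def label_score(checkbox: Box, label: Box, row_weight: float = 3.0) -> float:
    """Lower is better. Distances are measured in units of checkbox height."""
    unit = max(checkbox.height, 1.0)
    dy = abs(checkbox.cy - label.cy) / unit
    # Horizontal gap between the nearest edges (0 if the boxes overlap in x).
    dx = max(label.x1 - checkbox.x2, checkbox.x1 - label.x2, 0.0) / unit
    return row_weight * dy + dx


def best_label(checkbox: Box, labels: List[Box]) -> Optional[Box]:
    return min(labels, key=lambda lb: label_score(checkbox, lb), default=None)


checkbox = Box(100, 200, 120, 220)
labels = [
    Box(90, 170, 200, 190, "লিজ নেওয়া জমি"),  # line above; its centre is closer
    Box(130, 200, 260, 220, "নিজস্ব জমি"),     # same line, to the right
]
print(best_label(checkbox, labels).text)  # নিজস্ব জমি
```

In this example a plain centre-to-centre distance would pick the label on the line above, because that label's centre is closer; the row-aware score picks the label on the same line.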

---

# Final Extraction Flow

## Step 1

Convert PDF → Images

Libraries:

```text
pdf2image
```

---

## Step 2

Image Preprocessing

Use OpenCV:

* denoise
* adaptive threshold
* skew correction
* contrast enhancement
* morphology

---

## Step 3

Layout Detection (DL)

YOLO detects:

```text
checkbox region
```

Crop only that region, so the checkbox detector and OCR run on a smaller, less cluttered image.
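
Cropping changes the coordinate origin, so boxes found inside the crop have to be shifted back before they are compared with anything on the full page. A minimal sketch, assuming `(x1, y1, x2, y2)` pixel boxes; `pad_and_clip`, `to_page_coords` and the 10-pixel margin are hypothetical.

```python
from typing import Tuple

BoxXYXY = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def pad_and_clip(region: BoxXYXY, page_w: int, page_h: int, pad: int = 10) -> BoxXYXY:
    """Grow the detected region by a margin while staying inside the page."""
    x1, y1, x2, y2 = region
    return (max(x1 - pad, 0), max(y1 - pad, 0),
            min(x2 + pad, page_w), min(y2 + pad, page_h))


def to_page_coords(box_in_crop: BoxXYXY, crop: BoxXYXY) -> BoxXYXY:
    """Shift a box found inside the crop back into full-page coordinates."""
    ox, oy = crop[0], crop[1]
    x1, y1, x2, y2 = box_in_crop
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


crop = pad_and_clip((5, 400, 800, 520), page_w=1000, page_h=1400)
print(crop)                                    # (0, 390, 810, 530)
print(to_page_coords((20, 15, 40, 35), crop))  # (20, 405, 40, 425)
```

With the page loaded as a NumPy array, the crop itself is `page[y1:y2, x1:x2]`.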

---

## Step 4

Checkbox Detection (DL)

YOLO detects:

```text
checked_checkbox
unchecked_checkbox
```

Returns coordinates.
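
Raw detections usually need a clean-up pass before mapping. Detectors that apply non-maximum suppression per class can report the same physical box as both `checked_checkbox` and `unchecked_checkbox`, so the sketch below keeps only the most confident detection per location. `Detection`, `resolve` and both thresholds are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    label: str                              # "checked_checkbox" or "unchecked_checkbox"
    confidence: float


def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def resolve(detections: List[Detection], min_conf: float = 0.5,
            iou_threshold: float = 0.5) -> List[Detection]:
    """Drop weak detections, then keep the most confident one per physical box."""
    kept: List[Detection] = []
    candidates = [d for d in detections if d.confidence >= min_conf]
    for det in sorted(candidates, key=lambda d: d.confidence, reverse=True):
        if all(iou(det.box, k.box) < iou_threshold for k in kept):
            kept.append(det)
    return kept


raw = [
    Detection((100, 200, 120, 220), "checked_checkbox", 0.91),
    Detection((101, 201, 121, 221), "unchecked_checkbox", 0.55),  # same box, other class
    Detection((100, 240, 120, 260), "unchecked_checkbox", 0.87),
    Detection((300, 300, 320, 320), "checked_checkbox", 0.20),    # below min_conf
]
for det in resolve(raw):
    print(det.label, det.box)
```

Here two detections survive: the checked box at the top and the unchecked box below it.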

---

## Step 5

OCR

Run PaddleOCR on the text line around each checkbox.

Example:

```text
নিজস্ব জমি
লিজ নেওয়া জমি
যৌথ মালিকানাধীন জমি
```
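
The area handed to OCR for one checkbox can be computed as a horizontal strip at the checkbox's height. This is a sketch under the assumption that labels sit on the same line as their checkbox, to the left or right; `label_window` and the padding ratio are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def label_window(checkbox: Box, region_width: int, region_height: int,
                 v_pad_ratio: float = 0.5) -> Box:
    """Return a horizontal strip covering the checkbox's text line.

    The strip spans the full region width and is somewhat taller than the
    checkbox, so marks above and below the baseline are not cut off.
    """
    x1, y1, x2, y2 = checkbox
    pad = int((y2 - y1) * v_pad_ratio)
    return (0, max(y1 - pad, 0), region_width, min(y2 + pad, region_height))


print(label_window((100, 200, 120, 220), region_width=800, region_height=600))
# (0, 190, 800, 230)
```

If several options share one row, the strip will contain several labels, and choosing between them is left to the mapping step.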

---

## Step 6

Mapping

Find the text nearest to the checked checkbox.

Return:

```json
{
  "land_ownership_type": "নিজস্ব জমি"
}
```
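
Once each checkbox is paired with its label, building the output is mostly a matter of handling ambiguity and serializing Bengali text correctly. A minimal sketch, assuming a single-choice field; `select_single` and `build_json` are hypothetical helper names.

```python
import json
from typing import Dict, List, Optional, Tuple

# (label_text, is_checked) for every option in one checkbox group.
Option = Tuple[str, bool]


def select_single(options: List[Option]) -> Optional[str]:
    """Return the label of the one checked option, or None if the group is
    ambiguous (no box checked, or more than one)."""
    checked = [text for text, is_checked in options if is_checked]
    return checked[0] if len(checked) == 1 else None


def build_json(fields: Dict[str, List[Option]]) -> str:
    data = {name: select_single(opts) for name, opts in fields.items()}
    # ensure_ascii=False keeps Bengali text readable instead of \uXXXX escapes.
    return json.dumps(data, ensure_ascii=False, indent=2)


fields = {
    "land_ownership_type": [
        ("নিজস্ব জমি", True),
        ("লিজ নেওয়া জমি", False),
        ("যৌথ মালিকানাধীন জমি", False),
    ]
}
print(build_json(fields))
```

Returning `None` for ambiguous groups lets the caller flag the form for manual review instead of silently guessing.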

---

# Suggested Project Structure

```text
project/
│
├── app/
│   ├── main.py
│   ├── routes/
│   ├── services/
│   ├── models/
│   ├── utils/
│   └── schemas/
│
├── dl_models/
│   ├── checkbox_detector/
│   ├── layout_detector/
│   └── trained_weights/
│
├── preprocessing/
│   ├── image_cleaner.py
│   ├── skew_corrector.py
│   └── pdf_converter.py
│
├── extraction/
│   ├── checkbox_extractor.py
│   ├── ocr_engine.py
│   ├── spatial_mapper.py
│   └── json_builder.py
│
├── training/
│   ├── dataset/
│   ├── train_checkbox.py
│   ├── train_layout.py
│   └── annotations/
│
├── tests/
│
├── requirements.txt
│
└── Dockerfile
```

---

# Training Dataset Strategy

## Checkbox Dataset

You need images that cover:

* checked boxes
* empty boxes
* tick marks
* cross marks
* blurred scans
* handwritten checks

Annotation tool:

```text
Label Studio
or
Roboflow
```
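
Both tools can export YOLO-format labels. If annotations arrive as pixel boxes instead, each one becomes a text line of the form `class cx cy w h`, normalised to the image size. The sketch below shows that conversion; the class-ID order in `CLASS_IDS` is an assumption and must match the order used in the training config.

```python
from typing import Tuple

CLASS_IDS = {"checked_checkbox": 0, "unchecked_checkbox": 1}  # order is an assumption


def to_yolo_line(label: str, box: Tuple[float, float, float, float],
                 img_w: int, img_h: int) -> str:
    """Convert a pixel (x1, y1, x2, y2) box to 'class cx cy w h', normalised to 0..1."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{CLASS_IDS[label]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


print(to_yolo_line("checked_checkbox", (100, 200, 120, 220), img_w=1000, img_h=2000))
# 0 0.110000 0.105000 0.020000 0.010000
```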

---

# Recommended Training

## Checkbox Model

```bash
pip install ultralytics
```

Train:

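```python
from ultralytics import YOLO

# Start from the pretrained nano checkpoint and fine-tune on the checkbox dataset.
model = YOLO('yolov8n.pt')

model.train(
    data='dataset.yaml',  # dataset config: image folders and class names
    epochs=50,
    imgsz=640,            # training image size in pixels
    batch=16
)
```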
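
The `dataset.yaml` file passed to `model.train` tells Ultralytics where the images are and what the classes are called. A minimal sketch that writes one is shown below; the folder layout under `training/dataset` and the class order are assumptions and should be adjusted to the real dataset.

```python
from pathlib import Path

# Hypothetical layout: training/dataset/images/{train,val} with matching labels/ folders.
DATASET_YAML = """\
path: training/dataset
train: images/train
val: images/val
names:
  0: checked_checkbox
  1: unchecked_checkbox
"""


def write_dataset_yaml(target: Path) -> Path:
    """Write the dataset config and return its path."""
    target.write_text(DATASET_YAML, encoding="utf-8")
    return target


config_path = write_dataset_yaml(Path("dataset.yaml"))
print(config_path.read_text(encoding="utf-8"))
```

The class indices under `names` have to match the IDs used in the label files.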

---

# Recommended API Response

```json
{
  "success": true,
  "data": {
    "land_ownership_type": "নিজস্ব জমি"
  }
}
```

---

# Recommended Technologies

| Purpose      | Technology |
| ------------ | ---------- |
| API          | FastAPI    |
| OCR          | PaddleOCR  |
| CV           | OpenCV     |
| DL Detection | YOLOv8     |
| PDF Parsing  | pdf2image  |
| Deployment   | Docker     |
| GPU          | CUDA       |

---

# Best Production Design

## Hybrid AI System

Use:

```text
OpenCV + Deep Learning + OCR
```

Why?

* OpenCV is fast for deterministic preprocessing
* DL is robust to layout and scan-quality variation
* OCR reads the text

Each component covers a weakness of the others.

---

# Future Improvements

You can later add:

## 1. Document Classification

Detect:

* consent form
* land form
* ID card
* invoice

---

## 2. Handwriting Recognition

Using:

* TrOCR
* Donut
* PARSeq

---

## 3. Full Form Understanding

Use Transformer models:

| Model      | Usage                               |
| ---------- | ----------------------------------- |
| LayoutLMv3 | Layout-aware document understanding |
| Donut      | OCR-free extraction                 |
| DocFormer  | Multimodal document understanding   |

---

# Most Recommended Final Stack

## My Strong Recommendation

```text
YOLOv8 + PaddleOCR + OpenCV + FastAPI
```

This gives:

* high accuracy
* mobile scan support
* Bengali OCR
* fast inference
* scalable APIs
* future DL expansion

---

# Suggested Development Phases

## Phase 1

* PDF reading
* preprocessing
* OCR
* basic checkbox extraction

## Phase 2

* YOLO checkbox detector
* checked/unchecked classifier

## Phase 3

* layout detection
* semantic mapping
* production API

## Phase 4

* transformer-based document AI
* auto field extraction
* multilingual support

---

# Final Recommendation

For your use case:

```text
Traditional OCR alone is NOT enough.
```

A hybrid Deep Learning document AI pipeline is the correct enterprise approach.

Especially because:

* Bengali forms vary
* scans are noisy
* mobile uploads are common
* checkbox positions can shift

This architecture is a sound basis for a scalable real-world system.


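---

# Setup and Running

Create a virtual environment and install the dependencies:

    python3 -m venv venv
    source venv/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt

To run the API with auto-reload during an SSH session:

    uvicorn api_filter:app --reload --port 8000

To keep the API running after the session closes:

    nohup uvicorn main:app --host 0.0.0.0 --port 8900 > app.log 2>&1 &

To stop it, find the process ID listening on port 8900 and kill it:

    sudo lsof -i :8900
    sudo kill -9 <PID>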