Getting Better OCR Results
Learn how to scan and prepare images for optimal OCR text extraction quality. The better your source images, the more accurate and searchable your digitized newspapers will be.
Scanning Best Practices
Optimal Resolution
For best OCR results, scan at 300-600 DPI (dots per inch):
- 300 DPI: Minimum for good OCR accuracy, suitable for most newspapers
- 400-600 DPI: Better for small print or degraded newspapers
- Above 600 DPI: Usually unnecessary and creates large file sizes
- Below 300 DPI: May significantly reduce OCR accuracy
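To see why small print benefits from higher resolution, it can help to estimate how many pixels tall a line of type ends up in the scan. The sketch below relies only on the fact that one typographic point is 1/72 inch. The 8 pt type size and the function name `text_height_px` are illustrative assumptions, not values taken from any particular newspaper.

```python
def text_height_px(point_size: float, dpi: int) -> float:
    """Approximate height in pixels of type set at `point_size` when scanned at `dpi`.

    One typographic point is 1/72 inch, so the nominal type size
    covers point_size / 72 inches of paper.
    """
    return point_size / 72 * dpi


# Assumed example: 8 pt body type at several scan resolutions.
for dpi in (150, 300, 400, 600):
    print(f"{dpi} DPI: 8 pt type is about {text_height_px(8, dpi):.0f} px tall")
```

The nominal point size includes space above and below the letters, so the actual letterforms are somewhat smaller than these figures suggest.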
Color vs Grayscale vs Black & White
- Grayscale: Best choice for most newspapers - preserves detail without large files
- Color: Only needed if preserving color is important (creates larger files)
- Black & White (1-bit): Can work but loses subtle detail that helps OCR
File Format Recommendations
- JPEG: Lossy compression with small file sizes; good for photographs and general scanned images
- PNG: Lossless, so text edges stay sharp, but files are larger; good for text-heavy documents
- PDF: Well suited to multi-page documents - multi-page PDFs are automatically split into individual pages
Lighting and Exposure
- Use even, diffused lighting to avoid harsh shadows
- Avoid glare and reflections from the scanner glass or newspaper
- Slightly overexpose rather than underexpose if you must choose
- Ensure consistent lighting across the entire page
Page Positioning
- Keep pages as flat as possible during scanning
- Align pages straight - rotation can reduce OCR accuracy
- For bound volumes, press the spine gently to flatten the pages
- Remove any obstructions (fingers, hands, other papers)
Image Preparation Tips
Cropping
Remove unnecessary borders and backgrounds:
- Crop to just the newspaper content area
- Remove dark edges left by the scanner bed
- Keep all text visible - don't crop too tight
- Maintain consistent margins across pages
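The idea of cropping to the content while leaving a safety margin can be sketched in a few lines. This is a simplified illustration, not how any particular editing tool works: it assumes the page is a grid of 8-bit grayscale values with a light background, and the names `content_box`, `ink_threshold`, and `margin` are hypothetical. Dark scanner-bed edges would be mistaken for content here, so they would need to be trimmed first.

```python
def content_box(rows, ink_threshold=128, margin=2):
    """Return (left, top, right, bottom) around all dark pixels, padded by `margin`.

    `rows` is a list of equal-length rows of grayscale values (0 = black,
    255 = white). `right` and `bottom` are exclusive. Returns None when
    no pixel is darker than `ink_threshold`.
    """
    height, width = len(rows), len(rows[0])
    # Rows and columns that contain at least one dark ("ink") pixel.
    ys = [y for y, row in enumerate(rows) if any(p < ink_threshold for p in row)]
    xs = [x for x in range(width) if any(row[x] < ink_threshold for row in rows)]
    if not ys:
        return None
    # Pad the box by the margin, but never go outside the image.
    return (
        max(xs[0] - margin, 0),
        max(ys[0] - margin, 0),
        min(xs[-1] + 1 + margin, width),
        min(ys[-1] + 1 + margin, height),
    )


# Assumed example: a 6x6 white page with two dark pixels in the middle.
page = [[255] * 6 for _ in range(6)]
page[2][2] = 0
page[3][3] = 0
print(content_box(page, margin=1))
```

A larger `margin` guards against clipping text at the edges, at the cost of keeping a little more background.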
Rotation
Ensure text is properly oriented:
- Text should read left-to-right, top-to-bottom
- Rotate images before upload if they're sideways or upside-down
- You can also rotate during batch upload review
- OCR works best with perfectly upright text
Contrast and Brightness
Make faded text more readable:
- Increase contrast to make text stand out from the background
- Adjust brightness if pages are too dark or too light
- Remove yellowing using color correction (if scanning in color)
- Most photo editing software has an "Auto-Contrast" feature
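As a rough picture of what a contrast adjustment does, the sketch below applies a simple linear stretch to 8-bit grayscale values so that the darkest pixel becomes black and the lightest becomes white. It is a minimal illustration under that assumption, not the algorithm of any specific editor; real tools may also ignore a small share of outlier pixels. The function name `stretch_contrast` and the sample values are hypothetical.

```python
def stretch_contrast(pixels):
    """Linearly rescale 8-bit grayscale values so they span 0..255."""
    if not pixels:
        return []
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: there is no contrast to stretch
    scale = 255 / (hi - lo)
    return [round((p - lo) * scale) for p in pixels]


# Assumed example: a faded scan where ink is mid-gray (90) and paper is light gray (160).
faded = [90, 100, 120, 150, 160]
print(stretch_contrast(faded))
```

After stretching, the gap between text and background is as wide as the 8-bit range allows, which is the effect the contrast tips above are aiming for.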
When to Use Photo Editing Software
Consider editing images before upload if:
- Pages are severely faded or yellowed
- There are significant stains or damage affecting text
- Contrast is very low between text and background
- Pages are skewed or need straightening
Recommended tools: GIMP (free), Photoshop, Photopea (web-based), or your scanner's built-in software.
File Size Considerations
- High resolution + color = larger files
- Use grayscale to reduce file size while keeping quality
- JPEG compression can reduce size (use a quality setting of 85-95)
- Split very large multi-page PDFs into smaller batches
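The relationship between resolution, color mode, and file size can be estimated with simple arithmetic. The sketch below computes the uncompressed size of a scan, assuming 24 bits per pixel for color, 8 for grayscale, and 1 for black and white. The 11 x 17 inch page size and the function name `uncompressed_megabytes` are illustrative assumptions; saved JPEG, PNG, or PDF files will normally be smaller because they are compressed.

```python
# Bits stored per pixel for each scan mode (uncompressed).
BITS_PER_PIXEL = {"color": 24, "grayscale": 8, "black_and_white": 1}


def uncompressed_megabytes(width_in, height_in, dpi, mode):
    """Estimate the uncompressed size of a scan in megabytes (10**6 bytes)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] / 8 / 1_000_000


# Assumed example: an 11 x 17 inch page at two resolutions.
for dpi in (300, 600):
    for mode in BITS_PER_PIXEL:
        size = uncompressed_megabytes(11, 17, dpi, mode)
        print(f"{dpi} DPI, {mode}: about {size:.1f} MB")
```

Doubling the DPI quadruples the pixel count, and color holds three times as much data as grayscale, which is why grayscale at a moderate resolution is usually the practical choice.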
Common OCR Issues and Solutions
| Issue | Solution |
|---|---|
| Faded or yellowed pages | Increase contrast using image editing software before upload |
| Handwritten annotations | OCR may skip or misread these - they're not printed text |
| Very small font sizes | Scan at higher resolution (400-600 DPI) to capture detail |
| Unusual or decorative fonts | May reduce accuracy - consider manual text entry for important sections |
| Stains and damage | Clean scans when possible, or edit images to reduce visual noise |
| Curved pages from book scans | Flatten pages during scanning, or use software to dewarp or de-skew the images |
| Low contrast (gray text on gray background) | Adjust contrast and brightness in your scanner settings or in post-processing |
| Blurry images | Re-scan with the camera or scanner held steady and make sure the page is in focus |
What Affects OCR Accuracy
OCR accuracy depends on multiple factors, some within your control and some not:
✓ Factors You Can Control:
- Scanning resolution - Use 300-600 DPI
- Image quality - Clear, well-lit, high-contrast scans
- File format - Use appropriate format for content
- Page orientation - Ensure text is upright
- Image preprocessing - Adjust contrast, remove noise
✗ Factors You Cannot Control:
- Original print quality - Some old newspapers had poor printing
- Age and condition - Deteriorated paper is harder to read
- Font choices - Historical typefaces vary in OCR-friendliness
- Paper texture and color - Newsprint yellows and becomes brittle
- Physical damage - Tears, holes, water damage can't be reversed
Testing Your Uploads
Review OCR Text After Processing
After your upload completes OCR processing:
- Navigate to the newspaper issue page
- Scroll down to the "OCR Extracted Text" section
- Read through the text to check accuracy
- Look for obvious errors or missing sections
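If you want a number rather than an impression, one common approach is to retype a short passage by hand and compare it with the OCR output using a character error rate (edit distance divided by the length of the correct text). The sketch below is a minimal, self-contained version of that idea. The function names and the sample strings are hypothetical, and no particular error rate is an official threshold for this site.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Fraction of reference characters the OCR output got wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Assumed example: OCR output compared with a hand-typed reference.
print(character_error_rate("Tbe town counci1", "The town council"))
```

A low rate on a few sample passages suggests manual edits will be quick; a high rate suggests a better scan will save more time than correcting by hand.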
Editing OCR Text to Correct Errors
You can manually correct OCR errors:
- Click "Edit" on the newspaper issue (you must be the owner or an admin)
- Find the "OCR Text" field
- Make corrections to the text
- Save your changes
- The corrected text will be immediately searchable
When to Re-upload vs Edit Manually
Re-upload If:
- Image quality is very poor
- Most text is unreadable
- Pages are rotated incorrectly
- You have a better quality scan
Edit Manually If:
- Only minor errors exist
- Specific words are wrong
- Small sections are missing
- Overall accuracy is good
Reprocessing Failed OCR Jobs
If OCR processing fails completely:
- Check the error message for clues
- Verify that the image is accessible and not corrupted
- Try clicking the "Reprocess OCR" button (if available)
- Contact an admin if problems persist
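A quick local check for a corrupted or mislabeled file is to look at its first few bytes, since JPEG, PNG, and PDF files each begin with a well-known signature. The sketch below is one way to do that on your own computer before re-uploading. The function names are hypothetical, and this is not how the site itself validates uploads. A file that passes can still be truncated or damaged further in.

```python
from typing import Optional

# Leading bytes ("magic numbers") that identify each format.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
}


def detect_format(header: bytes) -> Optional[str]:
    """Return the format whose signature `header` starts with, or None."""
    for name, magic in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def check_file(path: str) -> Optional[str]:
    """Read the first 8 bytes of a file and identify its format."""
    with open(path, "rb") as f:
        return detect_format(f.read(8))


# Assumed example: the first bytes of a PDF file.
print(detect_format(b"%PDF-1.7"))
```

If `check_file` returns `None`, or a format that does not match the file extension, re-exporting or re-scanning the page is likely a better next step than reprocessing.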
Quick Tips Summary
- ✓ Scan at 300-600 DPI
- ✓ Use grayscale for best balance
- ✓ Ensure even, diffused lighting
- ✓ Keep pages flat and straight
- ✓ Crop out unnecessary borders
- ✓ Increase contrast for faded pages
- ✓ Rotate images to upright orientation
- ✓ Review and correct OCR text after upload