Getting Better OCR Results

Learn how to scan and prepare images for optimal OCR text extraction quality. The better your source images, the more accurate and searchable your digitized newspapers will be.

Scanning Best Practices

Optimal Resolution

For best OCR results, scan at 300-600 DPI (dots per inch):

  • 300 DPI: Minimum for good OCR accuracy, suitable for most newspapers
  • 400-600 DPI: Better for small print or degraded newspapers
  • Above 600 DPI: Usually unnecessary and creates large file sizes
  • Below 300 DPI: May significantly reduce OCR accuracy
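
As a rough illustration of what these resolutions mean in practice, the sketch below converts a page size in inches to pixel dimensions, and works backwards from an existing image to its effective DPI. The 11 x 17 inch page size and the function names are assumptions chosen for the example; they are not values from this guide.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for a page scanned at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(width_px: int, width_in: float) -> float:
    """Estimate the DPI of an existing scan from its pixel width and the page width."""
    if width_in <= 0:
        raise ValueError("width_in must be positive")
    return width_px / width_in


# Assumed example page: 11 x 17 inches (hypothetical, for illustration only).
print(pixel_dimensions(11, 17, 300))  # (3300, 5100)
print(pixel_dimensions(11, 17, 600))  # (6600, 10200)
print(effective_dpi(3300, 11))        # 300.0
```

Doubling the DPI doubles each dimension, so the pixel count (and the uncompressed size) grows fourfold. This is why scanning above 600 DPI quickly produces very large files.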

Color vs Grayscale vs Black & White

Recommendation: Grayscale (8-bit) provides the best balance of file size and OCR accuracy.
  • Grayscale: Best choice for most newspapers - preserves detail without large files
  • Color: Only needed if preserving color is important (creates larger files)
  • Black & White (1-bit): Can work but loses subtle detail that helps OCR
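
To see why the color mode matters for file size, the sketch below estimates the uncompressed size of a scan from its pixel dimensions and bit depth. It assumes 24-bit RGB for color (a common default, not stated in this guide) and an example scan of 3300 x 5100 pixels; saved JPEG or PNG files will normally be smaller than these figures because of compression.

```python
# Bits stored per pixel for each mode. The 24-bit color value is an assumption
# (standard RGB); grayscale and black & white depths are the ones named above.
BITS_PER_PIXEL = {"bw": 1, "grayscale": 8, "color": 24}


def uncompressed_size_mb(width_px: int, height_px: int, mode: str) -> float:
    """Estimate raw image size in megabytes (1 MB = 1,000,000 bytes)."""
    total_bits = width_px * height_px * BITS_PER_PIXEL[mode]
    return total_bits / 8 / 1_000_000


# Assumed example scan: 3300 x 5100 pixels.
for mode in ("bw", "grayscale", "color"):
    print(mode, round(uncompressed_size_mb(3300, 5100, mode), 1))
# bw 2.1
# grayscale 16.8
# color 50.5
```

Under these assumptions a color scan carries three times the raw data of a grayscale scan of the same page.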

File Format Recommendations

  • JPEG: Lossy compression with smaller file sizes; good for photographs and scanned pages when saved at a high quality setting
  • PNG: Lossless, so text edges stay sharp, but files are larger; good for text-heavy documents
  • PDF: Best for multi-page documents - automatically splits into pages

Lighting and Exposure

  • Use even, diffused lighting to avoid harsh shadows
  • Avoid glare and reflections from the scanner glass or newspaper
  • Slightly overexpose rather than underexpose if you must choose
  • Ensure consistent lighting across the entire page

Page Positioning

  • Keep pages as flat as possible during scanning
  • Align pages straight - rotation can reduce OCR accuracy
  • For bound volumes, press the spine gently to flatten the pages
  • Remove any obstructions (fingers, hands, other papers)

Image Preparation Tips

Cropping

Remove unnecessary borders and backgrounds:

  • Crop to just the newspaper content area
  • Remove dark edges from scanner bed
  • Keep all text visible - don't crop too tight
  • Maintain consistent margins across pages
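
The idea of trimming dark scanner-bed edges can be sketched in a few lines. The example below treats an image as a list of rows of 8-bit grayscale values and finds the smallest box that excludes rows and columns that are dark all the way across. The threshold of 40 and the function names are hypothetical; real scans usually need a more tolerant rule, and an image editor's crop tool does the same job interactively.

```python
def content_box(image: list[list[int]], dark_threshold: int = 40):
    """Return (top, bottom, left, right) of the non-dark area, or None.

    bottom and right are exclusive. A row or column counts as "dark" only
    if every pixel in it is below dark_threshold (an assumed cutoff).
    """
    def is_dark(values) -> bool:
        return all(v < dark_threshold for v in values)

    rows = [i for i, row in enumerate(image) if not is_dark(row)]
    cols = [j for j, col in enumerate(zip(*image)) if not is_dark(col)]
    if not rows or not cols:
        return None  # the whole image is dark
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop(image: list[list[int]], box) -> list[list[int]]:
    """Cut the image down to the given (top, bottom, left, right) box."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


# Tiny example: a light page (values near 200) with dark text (30)
# surrounded by a one-pixel dark scanner-bed border (10).
scan = [
    [10, 10, 10, 10],
    [10, 200, 220, 10],
    [10, 210, 30, 10],
    [10, 10, 10, 10],
]
box = content_box(scan)
print(box)              # (1, 3, 1, 3)
print(crop(scan, box))  # [[200, 220], [210, 30]]
```

Because the box runs from the first to the last non-dark row and column, dark text inside the page is kept rather than cropped away.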

Rotation

Ensure text is properly oriented:

  • Text should read left-to-right, top-to-bottom
  • Rotate images before upload if they're sideways or upside-down
  • You can also rotate during batch upload review
  • OCR works best with perfectly upright text

Contrast and Brightness

To make faded text more readable:

  • Increase contrast to make text stand out from the background
  • Adjust brightness if pages are too dark or too light
  • Remove yellowing using color correction (if scanning in color)
  • Most photo editing software has an "Auto-Contrast" feature
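
For readers curious what a contrast increase does to the pixel values, here is a minimal sketch of a linear contrast stretch on 8-bit grayscale values: the darkest value is mapped to 0, the lightest to 255, and everything in between is scaled proportionally. This illustrates the general idea only; the "Auto-Contrast" feature in any particular editor may use a different method, and the function name is hypothetical.

```python
def stretch_contrast(pixels: list[int]) -> list[int]:
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    if not pixels:
        return []
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        # A completely flat image has no contrast to stretch.
        return list(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


# A faded scan: "black" ink reads as 90 and "white" paper as 180.
faded = [90, 120, 150, 180]
print(stretch_contrast(faded))  # [0, 85, 170, 255]
```

After stretching, ink and paper sit at opposite ends of the range, which is the separation that makes text stand out from the background.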

When to Use Photo Editing Software

Consider editing images before upload if:

  • Pages are severely faded or yellowed
  • There are significant stains or damage affecting text
  • Contrast is very low between text and background
  • Pages are skewed or need straightening

Recommended tools: GIMP (free), Photoshop, Photopea (web-based), or your scanner's built-in software.

File Size Considerations

File Size Limit: Maximum 50MB per file
  • High resolution + color = larger files
  • Use grayscale to reduce file size while keeping quality
  • JPEG compression can reduce size (use a quality setting of 85-95)
  • Split very large multi-page PDFs into smaller batches
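
Before uploading a batch, it can save time to check which files exceed the limit. The sketch below is one way to do that with the Python standard library. It assumes "50MB" means 50,000,000 bytes, the stricter of the two common readings, so a file that passes here should pass under either; the function name and file names are hypothetical.

```python
import os

# Assumption: 50 MB interpreted as 50 * 1,000,000 bytes (the conservative reading).
MAX_UPLOAD_BYTES = 50 * 1000 * 1000


def oversized_files(paths: list[str], limit_bytes: int = MAX_UPLOAD_BYTES) -> list[str]:
    """Return the paths whose size on disk is larger than limit_bytes."""
    return [p for p in paths if os.path.getsize(p) > limit_bytes]


# Example usage (file names are placeholders):
# too_big = oversized_files(["page_001.jpg", "issue_1923.pdf"])
# for path in too_big:
#     print(f"{path} is over the limit; rescan in grayscale or split it")
```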

Common OCR Issues and Solutions

Issue | Solution
Faded or yellowed pages | Increase contrast using image editing software before upload
Handwritten annotations | OCR may skip or misread these because they are not printed text
Very small font sizes | Scan at higher resolution (400-600 DPI) to capture detail
Unusual or decorative fonts | May reduce accuracy; consider manual text entry for important sections
Stains and damage | Clean scans when possible, or edit images to reduce visual noise
Curved pages from book scans | Flatten pages or use software to straighten (de-skew) images
Low contrast (gray text on gray background) | Adjust contrast and brightness before scanning or in post-processing
Blurry images | Re-scan with the camera or scanner held steady and properly focused

What Affects OCR Accuracy

OCR accuracy depends on multiple factors, some within your control and some not:

✓ Factors You Can Control:

  • Scanning resolution - Use 300-600 DPI
  • Image quality - Clear, well-lit, high-contrast scans
  • File format - Use appropriate format for content
  • Page orientation - Ensure text is upright
  • Image preprocessing - Adjust contrast, remove noise

✗ Factors You Cannot Control:

  • Original print quality - Some old newspapers had poor printing
  • Age and condition - Deteriorated paper is harder to read
  • Font choices - Historical typefaces vary in OCR-friendliness
  • Paper texture and color - Newsprint yellows and becomes brittle
  • Physical damage - Tears, holes, water damage can't be reversed

Realistic Expectations: Even with perfect scanning, OCR accuracy typically ranges from 85-99% depending on source quality. Very old, damaged, or poorly printed newspapers may have lower accuracy. This is normal and expected!
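
This guide does not say how the accuracy percentage is measured. One common definition, assumed here purely for illustration, is character accuracy: one minus the number of character edits (insertions, deletions, substitutions) needed to turn the OCR output into a hand-checked reference, divided by the length of the reference. A minimal sketch, with hypothetical function names:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of the reference text recovered correctly, floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1 - errors / len(reference))


# One misread character ("I" read as "1") in a 10-character headline.
accuracy = character_accuracy("DAILY NEWS", "DA1LY NEWS")
print(f"{accuracy:.0%}")  # 90%
```

Measured this way, a single misread character in a ten-character headline already means 90% accuracy, so even a high percentage can leave several errors on a full newspaper page.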

Testing Your Uploads

Review OCR Text After Processing

After your upload completes OCR processing:

  1. Navigate to the newspaper issue page
  2. Scroll down to the "OCR Extracted Text" section
  3. Read through the text to check accuracy
  4. Look for obvious errors or missing sections

Editing OCR Text to Correct Errors

You can manually correct OCR errors:

  • Click "Edit" on the newspaper issue (you must be the owner or an admin)
  • Find the "OCR Text" field
  • Make corrections to the text
  • Save your changes
  • The corrected text will be immediately searchable

When to Re-upload vs Edit Manually

Re-upload If:

  • Image quality is very poor
  • Most text is unreadable
  • Pages are rotated incorrectly
  • You have a better quality scan

Edit Manually If:

  • Only minor errors exist
  • Specific words are wrong
  • Small sections are missing
  • Overall accuracy is good

Reprocessing Failed OCR Jobs

If OCR processing fails completely:

  • Check the error message for clues
  • Verify the image is accessible and not corrupted
  • Try clicking the "Reprocess OCR" button (if available)
  • Contact an admin if problems persist

Quick Tips Summary

  • ✓ Scan at 300-600 DPI
  • ✓ Use grayscale for best balance
  • ✓ Ensure even, diffused lighting
  • ✓ Keep pages flat and straight
  • ✓ Crop out unnecessary borders
  • ✓ Increase contrast for faded pages
  • ✓ Rotate images to upright orientation
  • ✓ Review and correct OCR text after upload