Computer Vision Basics: First Attempts at OCR in Mobile Apps


The premise in early 2013 was sensible: employees submit expense reports by photographing receipts with their phones. The app reads the receipt, extracts the amount, date, and vendor, and auto-fills the expense form.

The execution required understanding Optical Character Recognition (OCR) at a level we hadn't anticipated.

Tesseract OCR: The Open Source Standard

Tesseract was developed at Hewlett-Packard between 1985 and 1994, open-sourced in 2005, and sponsored by Google from 2006. By 2013, Tesseract 3.02 was widely regarded as the most accurate open-source OCR engine available.

We compiled it for iOS (Tesseract had no official iOS port; there were community Objective-C wrappers):

// Using tesseract-ios wrapper (2013)
#import "Tesseract.h"

@implementation ReceiptOCRProcessor

- (NSString *)extractTextFromImage:(UIImage *)image {
    Tesseract *tesseract = [[Tesseract alloc] initWithLanguage:@"eng"];
    
    // Set page segmentation mode (passed as the numeric enum value)
    // PSM_AUTO (3) = Tesseract decides the layout
    // PSM_SINGLE_BLOCK (6) = Treat as a single uniform block of text (good for receipts)
    [tesseract setVariableValue:@"6" forKey:@"tessedit_pageseg_mode"];
    
    // Whitelist characters (receipts have numbers, punctuation, basic letters)
    // Restricting the character set improves accuracy significantly
    [tesseract setVariableValue:@"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,/$:-()" 
                        forKey:@"tessedit_char_whitelist"];
    
    [tesseract setImage:image];
    [tesseract recognize];
    
    NSString *text = [tesseract recognizedText];
    return text;
}

@end

Raw result on a real receipt photo (iPhone 5 camera, decent lighting):

TARGEf ST0RE #1823
1234 MAIN ST
ANYTOWN CA 90210

GROCERIES
MILK 2% 1GAL       3.49
BREAD WHOLE WHEAT  2.29
EGGS LARGE DOZ     4.19
APPLES FUJI 3LB    3.99

SUBTOTAL          13.96
TAX 8.5%           1.19
TOTAL            S15.15

VISA *4231
APPROVED

"TARGEf" instead of "TARGET". "S15.15" instead of "$15.15". 68% character accuracy on 20 test receipts - meaning roughly 1 in 3 characters was wrong or missing. Not usable as raw output.

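The article does not say how character accuracy was measured. A common definition is one minus the edit distance between the OCR output and a hand-typed ground truth, divided by the length of the ground truth. The sketch below assumes that definition; the function names `levenshtein` and `character_accuracy` are illustrative, not taken from the original project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters that the OCR output got right."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# First line of the sample receipt: two substituted characters out of 18
header_accuracy = character_accuracy("TARGEf ST0RE #1823", "TARGET STORE #1823")
```

Measured per receipt and averaged over a test set, a metric like this gives one number that can be compared before and after each pipeline change.
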
The Preprocessing Pipeline

OCR accuracy depends heavily on image quality. Tesseract was designed for scanned documents - horizontal, high-resolution, even lighting. Phone camera photos of receipts were none of these things: skewed, low-contrast, variable lighting, sometimes crumpled.

We built a preprocessing pipeline in OpenCV (which we also compiled for iOS):

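#import <opencv2/opencv.hpp>
// UIImageToMat / MatToUIImage live in this header in OpenCV 2.4.x
// (OpenCV 3.x and later moved it to <opencv2/imgcodecs/ios.h>)
#import <opencv2/highgui/ios.h>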

@implementation ImagePreprocessor

- (UIImage *)preprocessForOCR:(UIImage *)inputImage {
    // Convert UIImage to OpenCV Mat
    cv::Mat mat;
    UIImageToMat(inputImage, mat);
    
    // Step 1: Convert to grayscale
    cv::Mat gray;
    cv::cvtColor(mat, gray, cv::COLOR_RGB2GRAY);
    
    // Step 2: Increase resolution if too small
    // Tesseract works best at 300+ DPI; scale up small images
    if (gray.cols < 1000) {
        double scale = 1000.0 / gray.cols;
        cv::resize(gray, gray, cv::Size(), scale, scale, cv::INTER_CUBIC);
    }
    
    // Step 3: Deskewing - detect and correct rotation
    gray = [self deskewImage:gray];
    
    // Step 4: Adaptive thresholding
    // Converts grayscale to black-and-white, handling uneven lighting
    cv::Mat thresh;
    cv::adaptiveThreshold(
        gray, thresh, 255,
        cv::ADAPTIVE_THRESH_GAUSSIAN_C,
        cv::THRESH_BINARY,
        11,   // Block size: neighborhood area for threshold calculation
        2     // C: constant subtracted from mean
    );
    
    // Step 5: Noise removal
    cv::Mat denoised;
    cv::medianBlur(thresh, denoised, 3);
    
    return MatToUIImage(denoised);
}

- (cv::Mat)deskewImage:(cv::Mat)image {
    // Find edges
    cv::Mat edges;
    cv::Canny(image, edges, 50, 150, 3);
    
    // Hough line transform - find dominant lines
    std::vector<cv::Vec4i> lines;
    cv::HoughLinesP(edges, lines, 1, CV_PI/180, 50, 50, 10);
    
    if (lines.empty()) return image;
    
    // Calculate average angle of detected lines
    double totalAngle = 0;
    int count = 0;
    for (auto& line : lines) {
        double angle = atan2(line[3] - line[1], line[2] - line[0]) * 180 / CV_PI;
        // Only consider near-horizontal lines (receipt text runs in horizontal rows)
        if (fabs(angle) < 15) {
            totalAngle += angle;
            count++;
        }
    }
    
    if (count == 0) return image;
    
    double avgAngle = totalAngle / count;
    
    // Rotate to correct the skew
    cv::Point2f center(image.cols / 2.0, image.rows / 2.0);
    cv::Mat rotationMatrix = cv::getRotationMatrix2D(center, avgAngle, 1.0);
    cv::Mat rotated;
    cv::warpAffine(image, rotated, rotationMatrix, image.size(),
                   cv::INTER_CUBIC, cv::BORDER_REPLICATE);
    
    return rotated;
}

@end
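
The angle estimate inside `deskewImage:` is easier to follow without the OpenCV plumbing. The sketch below reproduces only that step in plain Python, taking line segments in the `(x1, y1, x2, y2)` form that `HoughLinesP` returns. The function name `estimate_skew_degrees` is hypothetical, and the endpoint swap is an addition that the Objective-C version does not perform: it keeps a segment reported right-to-left from showing up with an angle near 180 degrees and being discarded.

```python
import math


def estimate_skew_degrees(segments, max_abs_angle=15.0):
    """Mean angle, in degrees, of the near-horizontal segments.

    segments: iterable of (x1, y1, x2, y2) tuples in image coordinates.
    Returns 0.0 when no segment is close enough to horizontal.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        # Order endpoints left to right so the angle falls in (-90, 90]
        if x2 < x1:
            x1, y1, x2, y2 = x2, y2, x1, y1
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) < max_abs_angle:
            angles.append(angle)
    if not angles:
        return 0.0
    return sum(angles) / len(angles)


# Two text baselines tilted by about 2.9 degrees, plus a vertical receipt edge
skew = estimate_skew_degrees([
    (0, 0, 100, 5),
    (100, 15, 0, 10),   # same tilt, reported right to left
    (0, 0, 0, 50),      # vertical edge, ignored by the angle filter
])
```

The mean is sensitive to a few stray lines; taking the median of `angles` instead is a common variation when the detected segments are noisy.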

After preprocessing, character accuracy improved from 68% to 81%. Not perfect, but better.
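
Step 4 of the pipeline is the one that deals with uneven lighting, and a toy example shows why a single global cutoff is not enough. The sketch below is plain Python, not OpenCV: it uses an unweighted local mean (the pipeline uses the Gaussian-weighted variant), clips the neighborhood at the image border instead of padding, and the 3x8 test image and the value `c=12` are made up for illustration.

```python
def global_threshold(image, threshold=127):
    """Binarize with one cutoff for the whole image."""
    return [[255 if px > threshold else 0 for px in row] for row in image]


def adaptive_mean_threshold(image, block_size=3, c=12):
    """Binarize each pixel against the mean of its
    block_size x block_size neighborhood, minus the constant c."""
    half = block_size // 2
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighborhood is clipped at the image border (OpenCV pads instead)
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbors) / len(neighbors)
            row.append(255 if image[y][x] > local_mean - c else 0)
        output.append(row)
    return output


# Toy 3x8 "receipt": the background fades from 220 (left) to 80 (right),
# with one dark stroke pixel in the bright area and one in the shadow.
background = [220 - 20 * x for x in range(8)]
image = [list(background) for _ in range(3)]
image[1][1] -= 80  # stroke under good light
image[1][6] -= 80  # stroke in shadow

global_result = global_threshold(image)
adaptive_result = adaptive_mean_threshold(image)
```

With the global cutoff, the three rightmost columns turn black and the shadowed stroke disappears into them. With the local mean, only the two stroke pixels come out black. The toy gradient is steep (20 gray levels per pixel), which is why `c` is larger here than the 2 used in the pipeline.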

Extracting Structured Data

Raw OCR text was unstructured. We needed to extract specific fields: date, total amount, vendor name.

# Python post-processing (ran on server after uploading OCR text from iOS)
import re
from datetime import datetime

def extract_receipt_data(ocr_text):
    """Extract structured data from raw OCR text."""
    result = {
        'vendor': None,
        'date': None,
        'total': None,
        'subtotal': None,
        'items': []
    }
    
    lines = [l.strip() for l in ocr_text.split('\n') if l.strip()]
    
    # Vendor: typically the first non-empty line
    if lines:
        result['vendor'] = lines[0]
    
    # Date patterns: various formats on receipts
    date_patterns = [
        r'\b(\d{1,2})[/\-](\d{1,2})[/\-](\d{2,4})\b',  # 12/25/2013
        r'\b(\d{4})[/\-](\d{1,2})[/\-](\d{1,2})\b',    # 2013-12-25
        r'\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*\.?\s+(\d{1,2}),?\s+(\d{4})\b',
    ]
    
    for line in lines:
        for pattern in date_patterns:
            match = re.search(pattern, line, re.IGNORECASE)
            if match:
                result['date'] = line  # Store the matched line for review
                break
        if result['date']:
            break
    
    # Total amount: look for "TOTAL" keyword followed by dollar amount.
    # The leading \b stops "total" from matching inside "subtotal", which
    # would otherwise be found first and returned as the total.
    total_patterns = [
        r'\b(?:total|amount due|balance due)[:\s]+\$?\s*(\d+[\.,]\d{2})',
        r'\b(?:grand total)[:\s]+\$?\s*(\d+[\.,]\d{2})',
    ]
    
    full_text = ' '.join(lines).lower()
    for pattern in total_patterns:
        match = re.search(pattern, full_text, re.IGNORECASE)
        if match:
            amount_str = match.group(1).replace(',', '.')
            result['total'] = float(amount_str)
            break
    
    # If no explicit total found, look for largest dollar amount
    # (often the total on receipts). Compare against None so that a
    # legitimate 0.00 total is not treated as missing.
    if result['total'] is None:
        amounts = re.findall(r'\$?\s*(\d+\.\d{2})', full_text)
        if amounts:
            result['total'] = max(float(a) for a in amounts)
    
    return result
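
The function above stores the whole matching line in `result['date']`. If a normalized date is wanted for the expense form, one option is to try a list of formats with `datetime.strptime` on the matched substring. This is a sketch, not the original server code: `parse_receipt_date` and `CANDIDATE_FORMATS` are hypothetical names, and the format list assumes US-style month-first dates, as in the `12/25/2013` example.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats, month-first for the slash and dash variants
CANDIDATE_FORMATS = [
    "%m/%d/%Y", "%m/%d/%y",
    "%m-%d-%Y", "%m-%d-%y",
    "%Y-%m-%d", "%Y/%m/%d",
    "%b %d, %Y", "%b %d %Y",
    "%B %d, %Y", "%B %d %Y",
]


def parse_receipt_date(text: str) -> Optional[date]:
    """Return the first successful parse of text, or None if nothing fits."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # wrong format, try the next one
    return None


parsed = parse_receipt_date("12/25/2013")
```

Passing `match.group(0)` from the date regexes instead of the full line keeps surrounding text such as a register number out of the parse. A string that fits no format returns `None`, which the confirmation screen can treat as a field that needs review.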

The Error Correction Layer

81% OCR accuracy on characters meant some field extractions were still wrong. We added a human review step - not manual entry, but confirmation:

// iOS: Show extracted data for user confirmation
- (void)showExtractionConfirmation:(NSDictionary *)extracted {
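    // If the dictionary was parsed from the server's JSON, missing fields
    // arrive as NSNull (which is non-nil), so check types before use
    id vendor = extracted[@"vendor"];
    id date = extracted[@"date"];
    id total = extracted[@"total"];
    BOOL hasTotal = [total isKindOfClass:[NSNumber class]];
    
    // Pre-fill the form with extracted data
    self.vendorField.text = [vendor isKindOfClass:[NSString class]] ? vendor : @"";
    self.amountField.text = hasTotal ?
        [NSString stringWithFormat:@"%.2f", [total doubleValue]] : @"";
    self.dateField.text = [date isKindOfClass:[NSString class]] ? date : @"";
    
    // Highlight the amount field when no total could be extracted
    if (!hasTotal) {
        [self highlightFieldAsNeedsReview:self.amountField];
    }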
    
    // Show the original photo alongside for comparison
    self.receiptImageView.image = self.capturedImage;
    
    // User confirms or edits before submitting
    [self showConfirmationView];
}

This pattern - machine extraction + human confirmation - increased user acceptance. Users trusted the app because they could see the photo and verify the extracted data. The confirmation step also collected training data: corrections became labeled examples for improving the model.
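
The confirmation code above only flags a missing total. A cheap extra signal is arithmetic: when subtotal and tax were also extracted, they should add up to the total. The article does not describe such a check, so the sketch below is an assumption; `total_needs_review` is a hypothetical helper, and it presumes the server also extracts a `tax` field, which the function shown earlier does not.

```python
def total_needs_review(subtotal, tax, total, tolerance=0.005):
    """Return True when the extracted total should be highlighted for the user.

    All amounts are floats in currency units, or None when not extracted.
    """
    if total is None:
        return True  # nothing was extracted, the user must type it
    if subtotal is None or tax is None:
        return False  # nothing to cross-check against
    # Flag the total when the parts disagree by roughly a cent or more
    return abs((subtotal + tax) - total) > tolerance


# Values from the sample receipt: 13.96 + 1.19 = 15.15
sample_flag = total_needs_review(13.96, 1.19, 15.15)
```

A mismatch does not say which of the three numbers was misread, only that the user should look at the photo before submitting.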

What Came After

By 2016, cloud services such as the Google Cloud Vision API had made receipt OCR trivial, and Amazon Textract followed in 2019. Send a photo to an API endpoint, receive structured JSON with extracted text, bounding boxes, and confidence scores. Accuracy was 95%+.

By 2020, on-device ML models (Core ML on iOS, TensorFlow Lite on Android) could run receipt OCR entirely offline with accuracy matching the 2016 cloud APIs.

The 2013 system - Tesseract, OpenCV preprocessing, regex extraction, human confirmation - was replaced by one API call.

The learning wasn't wasted. Understanding image preprocessing taught us why input quality matters more than the choice of algorithm in computer vision tasks. Understanding Tesseract's failures (poor performance on cursive fonts, low contrast, skewed text) made us informed users of the APIs that replaced it. "Garbage in, garbage out" applies to ML systems at every level of sophistication.

Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.