Date: 2026-04-02
Author: Uplint Engineering
Type: engineering
Tags: #blank detection, #PDF validation, #content validation, #data quality
Read time: 9 min

How to Detect Blank PDFs Before They Pollute Your Storage

Blank PDFs are a silent data quality crisis. They pass every standard validation check but contain zero meaningful content. Here's how to detect them before they corrupt your datasets.


Your users upload a file. The filename is invoice_2026_Q1.pdf. It's 2.3 MB. Your validation code checks:

  • ✓ Extension is .pdf
  • ✓ MIME type is application/pdf
  • ✓ File size is within limits
  • ✓ File reaches your storage

Your system records the upload as successful. The file appears in your audit logs as a legitimate submission. Downstream processes treat it as real data.

Three months later, an accountant discovers the file contains zero readable words. It's a blank PDF — structurally valid, completely empty.

This happens routinely in production systems, and not only with PDFs: empty spreadsheets, single-color images, and whitespace-only text files slip through the same way. They pass every traditional validation check because those checks verify only metadata, not content.
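
To make the gap concrete, here is a minimal sketch of the metadata-only checks described above. The function name, the 10 MB limit, and the placeholder bytes are assumptions for illustration, not part of any real pipeline.

```python
import os

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit, for illustration only


def passes_metadata_checks(filename: str, declared_mime: str, data: bytes) -> bool:
    """Return True when an upload passes the traditional, metadata-only checks."""
    extension_ok = os.path.splitext(filename)[1].lower() == ".pdf"
    mime_ok = declared_mime == "application/pdf"
    size_ok = 0 < len(data) <= MAX_UPLOAD_BYTES
    magic_ok = data.startswith(b"%PDF-")  # PDF files begin with this signature
    return extension_ok and mime_ok and size_ok and magic_ok


# Placeholder bytes that merely start like a PDF. None of the checks above
# ever looks past the header, so the body can be anything.
blank_upload = b"%PDF-1.7\n" + b"\x00" * 2048 + b"\n%%EOF\n"

print(passes_metadata_checks("invoice_2026_Q1.pdf", "application/pdf", blank_upload))
# prints True
```

Every check succeeds even though nothing ever inspected what the file contains.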

Why Blank Files Exist

Blank files aren't random. They appear for a handful of recurring reasons:

User mistake — A user intends to upload a document but accidentally selects an empty template file. They don't notice until much later, if ever.

Corrupted upload — The file transfer fails midway but the browser reports success. Partial files or zero-byte uploads slip through.

Deliberate submission — A user uploads a blank file to "complete" a required field, planning to update it later. They forget.

Processing failure — A document converter or PDF generator fails silently, producing a structurally valid but content-free file.

Format confusion — A user exports data from one system into a format that results in headers-only files. Looks legitimate until opened.
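
Of these causes, the interrupted upload is the cheapest to screen for before any content analysis. The sketch below is a heuristic, not a parser: it assumes a complete PDF starts with the `%PDF-` signature and carries an `%%EOF` marker near its end, which is where PDF readers conventionally look for it. The function name and the 1024-byte window are illustrative choices.

```python
def looks_incomplete_pdf(data: bytes, tail_bytes: int = 1024) -> bool:
    """Heuristic check for zero-byte or truncated PDF uploads."""
    if len(data) == 0:
        return True  # zero-byte upload
    if not data.startswith(b"%PDF-"):
        return True  # missing PDF signature
    # A transfer that stopped midway usually never wrote the trailer
    return b"%%EOF" not in data[-tail_bytes:]


complete = b"%PDF-1.4\n" + b"x" * 100 + b"\n%%EOF\n"
truncated = complete[:50]  # simulate a transfer that stopped midway

print(looks_incomplete_pdf(complete), looks_incomplete_pdf(truncated))
```

A file that fails this check is better reported as a failed upload than as a blank document, since the user can simply retry.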

The Cost of Not Detecting Blank Files

In non-critical systems, blank files are just annoying. Your analytics count them as real submissions. Your storage bloats with useless data.

In regulated industries, blank files create compliance risk:

Healthcare: A patient uploads a "medical record" that's actually a blank PDF. Your system logs it as received and filed. If an auditor asks "where is this patient's record?" — your answer is "we have it, but it's blank." That's a compliance failure.

Finance: An accountant submits a "tax return" that contains zero financial data. Auditors later find the blank file in your logs and question your submission validation.

Legal: A contract submission that's blank creates liability. Did the party actually intend to sign a blank contract? Your system accepted it without question.

Insurance: Claims submissions that are blank suggest either poor validation or negligent processing. Regulators flag this during compliance reviews.

SaaS: Blank files pollute your data analytics. If you count "total documents uploaded," blank files inflate the metric and skew user engagement calculations.

How to Detect Blank PDFs

Detecting blank PDFs requires reading past the binary header and examining the actual content.

Method 1: Text Extraction

The simplest approach is to extract text from the PDF and check if anything meaningful exists:

from io import BytesIO

from PyPDF2 import PdfReader  # PyPDF2 is now maintained as "pypdf"; PdfReader works the same way there

def has_content_in_pdf(pdf_bytes):
    """Return True if the PDF has at least 10 characters of extractable text."""
    try:
        pdf = PdfReader(BytesIO(pdf_bytes))

        # Extract text from all pages (extract_text() may return None or "")
        text = "".join(page.extract_text() or "" for page in pdf.pages)

        # Strip whitespace; fewer than 10 characters is suspicious
        return len(text.strip()) >= 10
    except Exception:
        # Corrupt or unsupported PDF: no readable content
        return False

This approach works for text-based PDFs. It catches:

  • PDFs with zero pages
  • PDFs with blank pages
  • PDFs containing only whitespace
  • PDFs with only a short header or a few stray characters (under 10 in total)

Its weakness is image-only PDFs such as scanned documents: text extraction returns nothing, so a scan full of content is reported as blank.
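
A raw length check also counts punctuation, separators, and invisible characters as content. One possible refinement, sketched here with hypothetical helper names, is to count only letters and digits in the extracted text, using Unicode categories from the standard library.

```python
import unicodedata


def meaningful_char_count(text: str) -> int:
    """Count letters and digits; ignore whitespace, punctuation, and control characters."""
    return sum(1 for ch in text if unicodedata.category(ch)[0] in ("L", "N"))


def text_looks_blank(text: str, min_chars: int = 10) -> bool:
    """Apply the same 10-character idea as above, but to meaningful characters only."""
    return meaningful_char_count(text) < min_chars


print(meaningful_char_count("Page 1 of 3"))               # prints 8
print(text_looks_blank("- - - - - - - - - - - -"))        # prints True
print(text_looks_blank("Invoice total: 1,250.00 EUR"))    # prints False
```

The threshold is still a judgment call: a document that contains only "Page 1 of 3" has eight meaningful characters and would be treated as blank with the default of 10.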

Method 2: Page and Image Analysis

For image-based PDFs, check if pages contain meaningful visual content:

from io import BytesIO

import numpy as np
from pdf2image import convert_from_bytes  # requires the poppler utilities to be installed
from PyPDF2 import PdfReader

def has_visual_content(pdf_bytes):
    try:
        # Check page count first
        pdf = PdfReader(BytesIO(pdf_bytes))
        if len(pdf.pages) == 0:
            return False

        # Convert first page to image
        images = convert_from_bytes(pdf_bytes, first_page=1, last_page=1)
        if not images:
            return False

        img = images[0]
        img_array = np.array(img)

        # Check if image is blank (single color or nearly uniform)
        # Calculate color variance
        variance = np.var(img_array)

        # Blank images have extremely low variance
        # The threshold of 100 is a heuristic; tune it on your own documents
        if variance < 100:
            return False

        # Check if there's meaningful content (not mostly whitespace)
        # Convert to grayscale and check pixel distribution
        gray = img.convert('L')
        pixels = np.array(gray)

        # If mostly white (>95% pixels above 240), it's likely blank
        white_pixels = np.sum(pixels > 240)
        if white_pixels / pixels.size > 0.95:
            return False

        return True
    except Exception as e:
        return False

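This catches:

  • Blank scanned PDFs (images with no meaningful content)
  • Single-color pages
  • Mostly white pages

Two caveats: only the first page is rendered, so a document with a blank cover page is reported as blank, and the 95% white cutoff can flag sparse but legitimate pages. Tune both thresholds on your own documents.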
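
The function above inspects only the first page. A sketch of the same white-pixel test applied to every page follows. It is written in plain Python over flattened grayscale values (for example `pixels.ravel().tolist()` for each rendered page), and the helper names and the 5% ink cutoff are assumptions that mirror the 95% white threshold used above.

```python
from typing import Iterable, List, Sequence


def ink_ratio(gray_pixels: Iterable[int], white_cutoff: int = 240) -> float:
    """Fraction of pixels at or below white_cutoff (0 = black, 255 = white)."""
    total = 0
    inked = 0
    for value in gray_pixels:
        total += 1
        if value <= white_cutoff:
            inked += 1
    return inked / total if total else 0.0


def blank_page_indices(pages: Sequence[Sequence[int]], min_ink: float = 0.05) -> List[int]:
    """Indices of pages whose ink ratio falls below min_ink."""
    return [i for i, page in enumerate(pages) if ink_ratio(page) < min_ink]


def document_is_blank(pages: Sequence[Sequence[int]]) -> bool:
    """A document is blank only if every page is blank (zero pages counts as blank)."""
    return len(blank_page_indices(pages)) == len(pages)


white_page = [255] * 100
text_page = [255] * 90 + [0] * 10  # 10% dark pixels

print(blank_page_indices([white_page, text_page]))  # prints [0]
print(document_is_blank([white_page, text_page]))   # prints False
```

Rendering every page costs more than rendering one, so a pipeline might stop at the first page that shows content.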

Method 3: Comprehensive Approach

The most robust approach combines both methods:

from PyPDF2 import PdfReader
from pdf2image import convert_from_bytes
import numpy as np
from io import BytesIO

def detect_blank_pdf(pdf_bytes, threshold_text=50, threshold_variance=100):
    """
    Detect blank PDFs using multiple methods.
    Returns (is_blank, analysis); is_blank is None when the PDF could not be analyzed.
    """
    analysis = {
        "has_pages": False,
        "has_text": False,
        "has_visual_content": False,
        "total_chars": 0,
        "visual_variance": 0
    }

    try:
        # Check page count
        pdf = PdfReader(BytesIO(pdf_bytes))
        if len(pdf.pages) == 0:
            return (True, analysis)

        analysis["has_pages"] = True

        # Extract text
        text = ""
        for page in pdf.pages:
            text += page.extract_text() or ""

        text_content = text.strip()
        analysis["total_chars"] = len(text_content)

        if len(text_content) > threshold_text:
            analysis["has_text"] = True
            return (False, analysis)

        # Try visual analysis on first page
        try:
            images = convert_from_bytes(pdf_bytes, first_page=1, last_page=1)
            if images:
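                img_array = np.array(images[0])
                # Cast to a plain float so the analysis dict stays JSON-serializable
                variance = float(np.var(img_array))
                analysis["visual_variance"] = variance

                if variance > threshold_variance:
                    analysis["has_visual_content"] = True
                    return (False, analysis)
        except Exception as e:
            # Rendering failed (for example, poppler is missing), so the visual
            # check is inconclusive; report that instead of calling the PDF blank
            analysis["visual_error"] = str(e)
            return (None, analysis)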

        # If we get here, the PDF is likely blank
        return (True, analysis)

    except Exception as e:
        # Corrupt or unreadable PDF: return None so the caller decides what to do
        analysis["error"] = str(e)
        return (None, analysis)

This function returns a tuple: is_blank (True, False, or None when the PDF could not be analyzed) and an analysis dictionary explaining the verdict.
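
Because detect_blank_pdf returns None when it hits an error, callers have three outcomes to handle, and a plain truthiness test would treat None the same as "not blank". One way to make that explicit is sketched below. The Decision enum and decide function are hypothetical names, and quarantining unreadable files is an assumed policy.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPT = "accept"
    REJECT_BLANK = "reject_blank"
    QUARANTINE = "quarantine"  # unreadable file: hold for manual review


def decide(is_blank: Optional[bool]) -> Decision:
    """Map the tri-state is_blank value to an explicit upload decision."""
    if is_blank is None:
        return Decision.QUARANTINE
    return Decision.REJECT_BLANK if is_blank else Decision.ACCEPT


print(decide(None).value, decide(True).value, decide(False).value)
# prints quarantine reject_blank accept
```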

Detecting Other Blank Files

The same principle applies beyond PDFs:

Spreadsheets (XLSX):

from io import BytesIO

from openpyxl import load_workbook

def has_content_in_xlsx(xlsx_bytes):
    try:
        wb = load_workbook(BytesIO(xlsx_bytes))

        # wb.worksheets skips chart sheets, which have no cells to inspect
        for ws in wb.worksheets:
            # Skip row 1 (assumed to be headers) and look for any non-empty cell.
            # An empty sheet yields no rows here, so it needs no separate check
            # (openpyxl reports max_row as 1, not 0, for an empty sheet)
            for row in ws.iter_rows(min_row=2, values_only=True):
                if any(value is not None for value in row):
                    return True

        return False
    except Exception:
        return False
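
The same header-skipping idea carries over to CSV exports, which are a common source of headers-only files. The sketch below uses only the standard library and assumes UTF-8 input with the first row as headers. The function name is chosen to match the others in this post and is not from any library.

```python
import csv
import io


def has_content_in_csv(csv_bytes: bytes) -> bool:
    """Return True if any row after the header contains a non-empty cell."""
    # utf-8-sig drops a leading byte-order mark if one is present
    text = csv_bytes.decode("utf-8-sig", errors="replace")
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader, None)  # skip the header row, if there is one
    for row in reader:
        if any(cell.strip() for cell in row):
            return True
    return False


print(has_content_in_csv(b"id,name,amount\n"))            # prints False
print(has_content_in_csv(b"id,name,amount\n1,Ada,10\n"))  # prints True
```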

Images (PNG, JPEG):

from io import BytesIO

import numpy as np
from PIL import Image

def has_content_in_image(image_bytes):
    try:
        # Normalize to RGB so grayscale and palette images also give an (H, W, 3)
        # array; without this, indexing shape[2] fails and they are reported blank
        img = Image.open(BytesIO(image_bytes)).convert("RGB")
        img_array = np.array(img)

        # Check variance (blank images have near-zero variance)
        if np.var(img_array) < 50:
            return False

        # Count distinct colors; 1-4 unique colors means the image is likely blank.
        # Note that this also flags two-tone images such as 1-bit scans
        unique_colors = len(np.unique(img_array.reshape(-1, 3), axis=0))
        if unique_colors < 5:
            return False

        return True
    except Exception:
        return False

Text Files:

def has_content_in_text(text_bytes):
    try:
        # errors='ignore' drops bytes that are not valid UTF-8
        text = text_bytes.decode('utf-8', errors='ignore')

        # Remove all whitespace, including spaces, tabs, and newlines
        content = ''.join(text.split())

        return len(content) >= 10  # At least 10 non-whitespace characters
    except Exception:
        return False

Integration into Your Upload Pipeline

Once you have blank detection functions, integrate them into your validation:

async def validate_upload(file, file_content):
    # Standard checks
    if not validate_extension(file.name):
        return reject("invalid_extension")

    if not validate_size(len(file_content)):
        return reject("exceeds_size_limit")

    # Content checks
    file_type = detect_file_type(file_content)

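    if file_type == "pdf":
        is_blank, analysis = detect_blank_pdf(file_content)
        if is_blank is None:
            # Unreadable PDF: None is falsy, so handle it before the blank check
            return reject("unreadable_pdf", analysis=analysis)
        if is_blank:
            return reject("blank_pdf", analysis=analysis)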

    elif file_type == "spreadsheet":
        if not has_content_in_xlsx(file_content):
            return reject("blank_spreadsheet")

    elif file_type == "image":
        if not has_content_in_image(file_content):
            return reject("blank_image")

    # If we got here, accept
    return accept()
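
The pipeline above calls a detect_file_type helper without defining it. A minimal sketch based on file signatures is shown here. It is one possible implementation rather than the one the pipeline necessarily uses, and it returns the same "pdf", "spreadsheet", and "image" labels the pipeline branches on. It assumes an XLSX file is a ZIP archive that contains xl/workbook.xml, which is the conventional layout.

```python
import io
import zipfile
from typing import Optional


def detect_file_type(data: bytes) -> Optional[str]:
    """Classify an upload by its leading bytes rather than its filename."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n") or data.startswith(b"\xff\xd8\xff"):
        return "image"  # PNG or JPEG signature
    if data.startswith(b"PK\x03\x04"):
        # XLSX is a ZIP container; look for the workbook part inside it
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                if "xl/workbook.xml" in archive.namelist():
                    return "spreadsheet"
        except zipfile.BadZipFile:
            return None
    return None  # unknown type: the caller decides whether to accept it


print(detect_file_type(b"%PDF-1.7\n"), detect_file_type(b"plain text"))
# prints pdf None
```

Classifying by content also catches a renamed file, such as a PNG uploaded with a .pdf extension, before it reaches the wrong detector.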

Using a Service for Blank Detection

Building blank detection from scratch means handling edge cases across multiple file formats. One alternative is to outsource it to a service such as Uplint:

pip install uplint

from uplint import Uplint

uplint = Uplint(api_key="your_api_key")

async def validate_upload(file):
    result = await uplint.validate(file, {
        "detectBlanks": True  # Enable blank detection
    })

    if result.blank:
        return reject(f"File is blank: {result.blank_reason}")

    return accept()

Uplint detects blanks across PDF, spreadsheet, image, and text formats automatically.

Key Takeaways

  1. Blank files are common — They pass all standard validation checks
  2. They create compliance and data quality issues — Especially in regulated industries
  3. Detection requires content analysis — Not just metadata checks
  4. Different file types need different approaches — Text extraction, image analysis, etc.
  5. The complexity justifies a service — Managing blank detection across formats is non-trivial

Blank file detection is a small change with outsized impact on data quality and compliance.


Uplint automatically detects blank PDFs, empty spreadsheets, single-color images, and other files with zero content. Run the CLI on existing files or integrate the API for real-time detection on new uploads.
