
How to Scan S3 Buckets for Blank and Corrupt Files (Free CLI Tool)

Your S3 bucket probably contains files you've never actually looked at. This guide shows how to scan them, detect problems, and clean up data corruption.

Uplint Engineering · April 7, 2026 · 10 min read

Your S3 buckets are full of files. But do you actually know what's in them?

Odds are, they contain:

  • Blank PDFs that look valid but contain nothing
  • Corrupt image files from failed uploads
  • Duplicate files taking up space
  • Empty spreadsheets uploaded by accident
  • Files that never should have been stored in the first place

This isn't theoretical. Most production systems have significant data quality problems hidden in their buckets. You just haven't looked.

This guide shows how to scan your S3 buckets, identify problems, and clean them up.

The Problem: Blind Storage

S3 is content-blind. It stores bytes. It doesn't know if those bytes form a valid PDF, a blank document, or an executable in disguise.

# This command lists what's in your bucket
aws s3 ls s3://my-uploads/ --recursive

# Output:
# 2026-01-15 10:23:45       0 claim-2026-001.pdf
# 2026-01-15 10:24:12    2456 invoice-2026-002.pdf
# 2026-01-15 10:25:01  148923 receipt-2026-003.pdf
# ...

You can see file sizes, but not whether those files are:

  • Actually PDFs (or disguised executables; see the magic-byte sketch after this list)
  • Blank (zero meaningful content)
  • Corrupt (bad headers, truncated data)
  • Duplicates of other files
  • Retained outside their legal retention window
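
Some of these checks don't require a full validator. As a minimal sketch (assuming boto3 credentials are configured; the bucket and key names are illustrative), a magic-byte sniff fetches just the first 16 bytes of each object and catches type mismatches like an executable renamed to .pdf:

import boto3

# First-bytes signatures for a few common formats (and one red flag)
MAGIC = {
    b'%PDF': 'pdf',
    b'\x89PNG\r\n\x1a\n': 'png',
    b'\xff\xd8\xff': 'jpeg',
    b'MZ': 'windows_executable',  # a ".pdf" starting with MZ is a red flag
}

def sniff_type(bucket, key):
    s3 = boto3.client('s3')
    # Range request: read only the first 16 bytes, not the whole object
    head = s3.get_object(Bucket=bucket, Key=key, Range='bytes=0-15')['Body'].read()
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return 'unknown'

This won't tell you whether a structurally valid PDF is actually blank; that takes the deeper checks covered below.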

Why This Matters

Data Quality: Blank files inflate your "submissions" count. Analytics show 1,000 documents uploaded, but 200 are blank. Metrics are meaningless.

Compliance: In regulated industries, accepting blank files creates audit failures. "The patient uploaded a medical record" — but if it's blank, that claim doesn't hold up.

Storage Costs: Corrupt and duplicate files waste space and inflate cloud costs. A bucket where half the files are corrupt or blank could literally be cut in half.

Security: Files you've never examined might contain threats you don't know about.

Scanning Your S3 Bucket: Manual Approach

The basic approach is to download files from S3 and examine them:

import boto3
from PyPDF2 import PdfReader
from io import BytesIO

s3_client = boto3.client('s3')
BUCKET = 'my-uploads'

def check_pdf_blank(pdf_bytes):
    """Check if PDF has readable content."""
    try:
        pdf = PdfReader(BytesIO(pdf_bytes))

        if len(pdf.pages) == 0:
            return True, "zero_pages"

        # Extract text
        text = ''.join(page.extract_text() or '' for page in pdf.pages)
        text_content = text.strip()

        if len(text_content) < 50:
            return True, "insufficient_text"

        return False, "has_content"
    except Exception as e:
        return None, f"error: {str(e)}"

def scan_bucket():
    """Scan all PDFs in bucket."""
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=BUCKET, Prefix='documents/')

    blank_files = []
    corrupt_files = []
    valid_files = []

    for page in pages:
        if 'Contents' not in page:
            continue

        for obj in page['Contents']:
            key = obj['Key']
            size = obj['Size']

            if not key.lower().endswith('.pdf'):
                continue

            print(f"Checking {key}...")

            # Download file
            response = s3_client.get_object(Bucket=BUCKET, Key=key)
            file_bytes = response['Body'].read()

            # Check if blank
            is_blank, reason = check_pdf_blank(file_bytes)

            if is_blank is None:
                corrupt_files.append({'key': key, 'reason': reason})
            elif is_blank:
                blank_files.append({'key': key, 'reason': reason})
            else:
                valid_files.append({'key': key})

    # Report
    print(f"\nValid files: {len(valid_files)}")
    print(f"Blank files: {len(blank_files)}")
    print(f"Corrupt files: {len(corrupt_files)}")

    for blank in blank_files[:10]:  # Show first 10
        print(f"  BLANK: {blank['key']} ({blank['reason']})")

    for corrupt in corrupt_files[:10]:
        print(f"  CORRUPT: {corrupt['key']} ({corrupt['reason']})")

    return {
        'valid': len(valid_files),
        'blank': blank_files,
        'corrupt': corrupt_files
    }

if __name__ == '__main__':
    results = scan_bucket()

    # Optional: delete blank files (review the report before confirming)
    if input("Delete all blank files? [y/N] ").strip().lower() == 'y':
        for blank_file in results['blank']:
            print(f"Deleting {blank_file['key']}...")
            s3_client.delete_object(Bucket=BUCKET, Key=blank_file['key'])

Limitations:

  • Slow (downloads every file)
  • Expensive (S3 GET requests cost money)
  • Only checks PDFs
  • Requires setup and local processing
  • Hard to schedule and monitor

Using Uplint CLI for Scanning

Uplint provides a CLI tool that scans files locally or in S3:

pip install uplint

Scan Local Directory

uplint scan ./uploads --report json > scan_results.json

Output:

{
  "summary": {
    "total": 1243,
    "valid": 1050,
    "blank": 127,
    "corrupt": 43,
    "suspicious": 23
  },
  "blank_files": [
    {
      "path": "uploads/claim-2026-001.pdf",
      "type": "pdf",
      "reason": "zero_readable_text"
    },
    {
      "path": "uploads/receipt-2026-002.pdf",
      "type": "pdf",
      "reason": "insufficient_text",
      "text_chars": 12
    }
  ],
  "corrupt_files": [
    {
      "path": "uploads/image-2026-003.png",
      "type": "image",
      "reason": "invalid_png_header"
    }
  ]
}
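
Because the report is plain JSON, post-processing takes only a few lines. A quick sketch (reading the scan_results.json produced above; the field names follow the sample output):

import json

with open('scan_results.json') as f:
    report = json.load(f)

summary = report['summary']
flagged = summary['blank'] + summary['corrupt'] + summary['suspicious']
print(f"{flagged}/{summary['total']} files flagged ({flagged / summary['total']:.1%})")
# => 193/1243 files flagged (15.5%)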

Scan S3 Bucket

First, create an AWS IAM role with S3 read access. Then:

uplint scan s3://my-bucket/uploads --aws-role arn:aws:iam::123456789:role/uplint-scanner --report json

The CLI automatically:

  • Lists all objects in the bucket
  • Downloads and analyzes them
  • Detects blank files across all formats
  • Identifies corrupt structures
  • Scans for threats
  • Generates a comprehensive report

No local disk space needed.

Filter and Focus

Scan only specific file types or prefixes:

# Scan only PDFs in the claims folder
uplint scan s3://my-bucket/claims --filter "*.pdf" --report csv

# Scan files modified in the last 30 days
uplint scan s3://my-bucket/uploads --modified-since 30d

# Scan images only
uplint scan ./uploads --filter "*.{jpg,jpeg,png,gif}"

Processing Results

The scan report gives you raw data. What to do with it?

Option 1: Manual Review

For smaller buckets, manually review blank/corrupt files:

uplint scan s3://my-bucket --report json | jq '.blank_files[] | .path'

For each blank file, decide:

  • Delete it (probably the right move for most blanks)
  • Contact the uploader (maybe they intended it)
  • Keep it (rarely the right choice)
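
A middle ground between deleting and keeping is to tag flagged objects in place so reviewers can find them later. A sketch using standard boto3 object tagging (the bucket and key are illustrative):

import boto3

s3 = boto3.client('s3')

# Tag instead of delete: reviewers can filter on uplint-status later
s3.put_object_tagging(
    Bucket='my-bucket',
    Key='uploads/claim-2026-001.pdf',
    Tagging={'TagSet': [{'Key': 'uplint-status', 'Value': 'blank'}]},
)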

Option 2: Automated Cleanup

For larger buckets, automate deletion:

# Generate deletion list
uplint scan s3://my-bucket/uploads --report json > results.json

# Delete all blank PDFs
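# Tip: append --dryrun to the aws s3 rm line below to preview deletions first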
jq -r '.blank_files[] | select(.type == "pdf") | .path' results.json | while read -r file; do
  aws s3 rm "s3://my-bucket/$file"
  echo "Deleted: $file"
done

Option 3: Archive and Replace

Move problematic files to an archive bucket for further investigation:

jq -r '.corrupt_files[] | .path' results.json | while read -r file; do
  aws s3 cp "s3://my-bucket/$file" "s3://my-bucket-archive/corrupt/$file"
  aws s3 rm "s3://my-bucket/$file"
done
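
A single aws s3 mv would combine the copy and delete, but the explicit cp-then-rm pair makes it easy to confirm the copy landed before the original is removed.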

Real-World Scanning: Before and After

Before scanning a healthcare app's S3 bucket:

Claimed: 15,000 patient documents uploaded
Actual size: 87 GB
Cost: ~$2,000/month storage + retrieval

After scanning and cleanup:

Blank PDFs: 2,100 documents deleted
Corrupt files: 340 documents deleted
Duplicates: 1,800 documents consolidated
Actual valid documents: 10,760

New size: 34 GB
New cost: ~$800/month (60% reduction)

The organization reclaimed storage costs and regained compliance confidence.

Continuous Scanning

One-time scans are useful, but continuous monitoring catches problems as they emerge:

Using Uplint API

from uplint import Uplint
import boto3
from datetime import datetime, timedelta

uplint = Uplint(api_key="your_api_key")
s3 = boto3.client('s3')

def continuous_scan():
    """Run daily to catch newly uploaded blank/corrupt files."""
    bucket = 'my-uploads'

    # Get files uploaded in last 24 hours
    yesterday = datetime.utcnow() - timedelta(days=1)

    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix='documents/'
    )

    recent_files = []
    for page in pages:
        if 'Contents' not in page:
            continue

        for obj in page['Contents']:
            if obj['LastModified'].replace(tzinfo=None) > yesterday:
                recent_files.append(obj['Key'])

    # Validate each recent file
    problems = []

    for key in recent_files:
        response = s3.get_object(Bucket=bucket, Key=key)
        file_bytes = response['Body'].read()

        result = uplint.validate(file_bytes, {
            'detectBlanks': True,
            'scan': True
        })

        if not result.trusted:
            problems.append({
                'file': key,
                'reason': result.reason,
                'detected_at': datetime.utcnow().isoformat()
            })

    # Alert on problems
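    # (send_alert is a placeholder for your notification hook: SNS, Slack, email, etc.)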
    if problems:
        send_alert({
            'subject': f"Found {len(problems)} problematic files",
            'files': problems
        })

    return problems

# Schedule this daily via Lambda, Cron, or Airflow

Scheduled Scanning with Lambda

import json
import os
from datetime import datetime, timedelta, timezone
import boto3
from uplint import Uplint

uplint = Uplint(api_key=os.environ['UPLINT_API_KEY'])
s3 = boto3.client('s3')
sns = boto3.client('sns')

def lambda_handler(event, context):
    """Runs daily via EventBridge."""
    bucket = 'my-uploads'
    prefix = 'documents/'

    # Scan PDFs uploaded in the last 24 hours
    cutoff = datetime.now(timezone.utc) - timedelta(days=1)
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)

    blank_count = 0
    corrupt_count = 0

    for page in pages:
        if 'Contents' not in page:
            continue

        for obj in page['Contents']:
            if not obj['Key'].lower().endswith('.pdf'):
                continue
            if obj['LastModified'] < cutoff:
                continue

            response = s3.get_object(Bucket=bucket, Key=obj['Key'])
            file_bytes = response['Body'].read()

            result = uplint.validate(file_bytes, {
                'detectBlanks': True
            })

            if result.blank:
                blank_count += 1
            elif not result.trusted:
                corrupt_count += 1

    # Send summary
    sns.publish(
        TopicArn=os.environ['SNS_TOPIC'],
        Subject='Daily S3 Scan Report',
        Message=f"Blank: {blank_count}, Corrupt: {corrupt_count}"
    )

    return {
        'statusCode': 200,
        'body': json.dumps({
            'blank': blank_count,
            'corrupt': corrupt_count
        })
    }
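
Wiring up the daily trigger is a one-time step. A sketch using boto3 (the function name and ARN are placeholders for your deployed Lambda):

import boto3

events = boto3.client('events')
lam = boto3.client('lambda')

# 1. A rule that fires once a day
rule = events.put_rule(Name='daily-s3-scan', ScheduleExpression='rate(1 day)')

# 2. Allow EventBridge to invoke the function
lam.add_permission(
    FunctionName='s3-scan-report',  # placeholder function name
    StatementId='allow-eventbridge-daily-scan',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)

# 3. Point the rule at the function (placeholder ARN)
events.put_targets(
    Rule='daily-s3-scan',
    Targets=[{'Id': '1', 'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:s3-scan-report'}],
)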

Best Practices

1. Regular scanning (monthly minimum). Don't scan once; make it a recurring process. New problems will emerge.

2. Keep archives. Before deleting, move suspicious files to an archive bucket. You might need them later for a compliance investigation.

3. Monitor costs. Track storage costs over time. A sudden jump usually means an upload bug creating duplicates or temporary files that never get deleted.

4. Alert on changes. If the blank-file percentage jumps, something changed: the upload process broke, new users are uploading the wrong files, and so on.

5. Validate at upload, not just later. Scanning existing buckets is useful, but the better fix is to prevent new bad files by validating at upload time with the Uplint API.
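
As a minimal sketch of that last point, reusing the validate call from the examples above (option and attribute names follow those examples; the bucket and key are illustrative):

import boto3
from uplint import Uplint

uplint = Uplint(api_key="your_api_key")
s3 = boto3.client('s3')

def upload_if_valid(file_bytes, bucket, key):
    # Reject at the door instead of cleaning up later
    result = uplint.validate(file_bytes, {'detectBlanks': True, 'scan': True})
    if not result.trusted:
        raise ValueError(f"Rejected {key}: {result.reason}")
    s3.put_object(Bucket=bucket, Key=key, Body=file_bytes)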

Pricing Considerations

Manual Python script: The code is free, but engineering time isn't. AWS charges for every S3 GET request, and the scan is slow.

Uplint CLI: Free command-line tool. A one-time scan of 10,000 files incurs only minimal S3 request charges.

Uplint API: ~$0.01-0.05 per validation (lower at scale). For continuous scanning of new uploads.

For a bucket with 100,000 files:

  • One-time CLI scan: $0-5 (mostly S3 request and data-transfer costs; see the sketch after this list)
  • Ongoing validation of 10,000 new files/month: ~$100-500/month
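
The one-time figure is easy to sanity-check. A back-of-envelope sketch with illustrative us-east-1 list prices (assumptions, not quotes; check current AWS pricing):

GET_PER_1000 = 0.0004    # USD per 1,000 GET requests (illustrative)
EGRESS_PER_GB = 0.09     # USD per GB out to the internet (illustrative)

files = 100_000
avg_mb = 0.5             # assumed average object size

request_cost = files / 1000 * GET_PER_1000            # ~$0.04
egress_cost = files * avg_mb / 1024 * EGRESS_PER_GB   # ~$4.39; near zero if scanning in-region

print(f"~${request_cost + egress_cost:.2f} total")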

The cost is worth the compliance and storage savings.

Key Takeaways

  1. Your S3 bucket has problems you haven't seen
  2. Scanning is non-negotiable for regulated data
  3. Blank detection catches most issues
  4. Uplint CLI makes scanning easy
  5. Continuous validation prevents future problems

Start with a single scan. See what you find. Then decide whether one-time or continuous monitoring makes sense for your use case.


Uplint's free CLI scans S3 buckets for blank, corrupt, and malicious files. Run it once to see what's in your bucket, or integrate the API for continuous validation. Download the CLI →

Found this useful? Share it with your team.