Your S3 buckets are full of files. But do you actually know what's in them?
Odds are, they contain:
- Blank PDFs that look valid but contain nothing
- Corrupt image files from failed uploads
- Duplicate files taking up space
- Empty spreadsheets uploaded by accident
- Files that never should have been stored in the first place
This isn't theoretical. Most production systems have significant data quality problems hidden in their buckets. You just haven't looked.
This guide shows how to scan your S3 buckets, identify problems, and clean them up.
The Problem: Blind Storage
S3 is content-blind. It stores bytes. It doesn't know if those bytes form a valid PDF, a blank document, or an executable in disguise.
```shell
# This command lists what's in your bucket
aws s3 ls s3://my-uploads/ --recursive

# Output:
# 2026-01-15 10:23:45          0 claim-2026-001.pdf
# 2026-01-15 10:24:12       2456 invoice-2026-002.pdf
# 2026-01-15 10:25:01     148923 receipt-2026-003.pdf
# ...
```
You can see file sizes, but not whether those files are:
- Actually PDFs (or disguised executables)
- Blank (zero meaningful content)
- Corrupt (bad headers, truncated data)
- Duplicates of other files
- Still within their legal retention window
Why This Matters
Data Quality: Blank files inflate your "submissions" count. Analytics show 1,000 documents uploaded, but if 200 are blank, the metric is meaningless.
Compliance: In regulated industries, accepting blank files creates audit failures. "The patient uploaded a medical record" — but if it's blank, that claim doesn't hold up.
Storage Costs: Corrupt and duplicate files waste space and inflate your cloud bill. A bucket that is 50% corrupt or blank files could be cut in half.
Security: Files you've never examined might contain threats you don't know about.
Scanning Your S3 Bucket: Manual Approach
The basic approach is to download files from S3 and examine them:
```python
import boto3
from PyPDF2 import PdfReader
from io import BytesIO

s3_client = boto3.client('s3')
BUCKET = 'my-uploads'

def check_pdf_blank(pdf_bytes):
    """Check if a PDF has readable content."""
    try:
        pdf = PdfReader(BytesIO(pdf_bytes))
        if len(pdf.pages) == 0:
            return True, "zero_pages"
        # Extract text from every page
        text = ''.join(page.extract_text() or '' for page in pdf.pages)
        # Fewer than ~50 characters of text is effectively blank
        if len(text.strip()) < 50:
            return True, "insufficient_text"
        return False, "has_content"
    except Exception as e:
        return None, f"error: {e}"

def scan_bucket():
    """Scan all PDFs in the bucket."""
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=BUCKET, Prefix='documents/')

    blank_files = []
    corrupt_files = []
    valid_files = []

    for page in pages:
        if 'Contents' not in page:
            continue
        for obj in page['Contents']:
            key = obj['Key']
            if not key.endswith('.pdf'):
                continue
            print(f"Checking {key}...")

            # Download file
            response = s3_client.get_object(Bucket=BUCKET, Key=key)
            file_bytes = response['Body'].read()

            # Check if blank
            is_blank, reason = check_pdf_blank(file_bytes)
            if is_blank is None:
                corrupt_files.append({'key': key, 'reason': reason})
            elif is_blank:
                blank_files.append({'key': key, 'reason': reason})
            else:
                valid_files.append({'key': key})

    # Report
    print(f"\nValid files: {len(valid_files)}")
    print(f"Blank files: {len(blank_files)}")
    print(f"Corrupt files: {len(corrupt_files)}")
    for blank in blank_files[:10]:  # Show first 10
        print(f"  BLANK: {blank['key']} ({blank['reason']})")
    for corrupt in corrupt_files[:10]:
        print(f"  CORRUPT: {corrupt['key']} ({corrupt['reason']})")

    return {
        'valid': len(valid_files),
        'blank': blank_files,
        'corrupt': corrupt_files
    }

if __name__ == '__main__':
    results = scan_bucket()

    # Optional: delete the blank files
    for blank_file in results['blank']:
        print(f"Deleting {blank_file['key']}...")
        s3_client.delete_object(Bucket=BUCKET, Key=blank_file['key'])
```
Limitations:
- Slow (downloads every file)
- Expensive (S3 GET requests cost money)
- Only checks PDFs
- Requires setup and local processing
- Hard to schedule and monitor
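Two of those limitations, cost and speed, are easy to quantify before you run anything. A back-of-the-envelope sketch (the per-request and transfer rates are approximate us-east-1 S3 Standard prices and vary by region; check current pricing before relying on them):

```python
# Rough S3 cost of the download-everything approach.
# Assumed rates (approximate, region-dependent):
#   GET requests: ~$0.0004 per 1,000
#   Data transfer out to the internet: ~$0.09/GB (free within the same region)

def scan_cost_estimate(num_files, avg_size_mb, same_region=True):
    """Estimate the S3 cost of downloading every file once."""
    get_cost = num_files * 0.0004 / 1000
    transfer_cost = 0.0 if same_region else num_files * avg_size_mb / 1024 * 0.09
    return round(get_cost + transfer_cost, 2)

print(scan_cost_estimate(100_000, 0.5))                     # 0.04  (in-region: GET requests only)
print(scan_cost_estimate(100_000, 0.5, same_region=False))  # 4.43  (egress dominates)
```

The request charges are trivial; transfer out of the region is what makes repeated full-bucket scans expensive, which is an argument for running the scanner inside the same region as the bucket.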
Using Uplint CLI for Scanning
Uplint provides a CLI tool that scans files locally or in S3:
```shell
pip install uplint
```
Scan Local Directory
```shell
uplint scan ./uploads --report json > scan_results.json
```
Output:
```json
{
  "summary": {
    "total": 1243,
    "valid": 1050,
    "blank": 127,
    "corrupt": 43,
    "suspicious": 23
  },
  "blank_files": [
    {
      "path": "uploads/claim-2026-001.pdf",
      "type": "pdf",
      "reason": "zero_readable_text"
    },
    {
      "path": "uploads/receipt-2026-002.pdf",
      "type": "pdf",
      "reason": "insufficient_text",
      "text_chars": 12
    }
  ],
  "corrupt_files": [
    {
      "path": "uploads/image-2026-003.png",
      "type": "image",
      "reason": "invalid_png_header"
    }
  ]
}
```
Scan S3 Bucket
First, create an AWS IAM role with S3 read access. Then:
```shell
uplint scan s3://my-bucket/uploads \
  --aws-role arn:aws:iam::123456789:role/uplint-scanner \
  --report json
```
The CLI automatically:
- Lists all objects in the bucket
- Downloads and analyzes them
- Detects blank files across all formats
- Identifies corrupt structures
- Scans for threats
- Generates a comprehensive report
No local disk space needed.
Filter and Focus
Scan only specific file types or prefixes:
```shell
# Scan only PDFs in the claims folder
uplint scan s3://my-bucket/claims --filter "*.pdf" --report csv

# Scan files modified in the last 30 days
uplint scan s3://my-bucket/uploads --modified-since 30d

# Scan images only
uplint scan ./uploads --filter "*.{jpg,jpeg,png,gif}"
```
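If you want the same filtering in your own scripts, a pattern like `*.{jpg,jpeg,png,gif}` can be handled by expanding the brace group into plain glob patterns and matching with `fnmatch`. A minimal sketch (how the CLI implements its flag is an assumption; the matching logic itself is standard Python):

```python
import fnmatch
import re

def expand_braces(pattern):
    """Recursively expand {a,b,c} groups into plain glob patterns."""
    m = re.search(r'\{([^}]+)\}', pattern)
    if not m:
        return [pattern]
    prefix, suffix = pattern[:m.start()], pattern[m.end():]
    return [p for alt in m.group(1).split(',')
            for p in expand_braces(prefix + alt + suffix)]

def matches_filter(path, pattern):
    """True if the path matches any expansion of the brace pattern."""
    return any(fnmatch.fnmatch(path, p) for p in expand_braces(pattern))

print(matches_filter('uploads/photo.jpeg', '*.{jpg,jpeg,png,gif}'))  # True
print(matches_filter('uploads/report.pdf', '*.{jpg,jpeg,png,gif}'))  # False
```

Note that `fnmatch`'s `*` matches across `/`, which is what you want when filtering full S3 keys by extension.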
Processing Results
The scan report gives you raw data. What to do with it?
Option 1: Manual Review
For smaller buckets, manually review blank/corrupt files:
```shell
uplint scan s3://my-bucket --report json | jq '.blank_files[] | .path'
```
For each blank file, decide:
- Delete it (probably the right move for most blanks)
- Contact the uploader (maybe they intended it)
- Keep it (rarely the right choice)
Option 2: Automated Cleanup
For larger buckets, automate deletion:
```shell
# Generate the deletion list
uplint scan s3://my-bucket/uploads --report json > results.json

# Delete all blank PDFs
jq -r '.blank_files[] | select(.type == "pdf") | .path' results.json | while read -r file; do
  aws s3 rm "s3://my-bucket/$file"
  echo "Deleted: $file"
done
```
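The loop above issues one request per file. For thousands of deletions, S3's `DeleteObjects` API removes up to 1,000 keys per request, which is faster and cheaper. A boto3 sketch (the bucket name and report path are placeholders):

```python
def chunked(items, size=1000):
    """Split a list into lists of at most `size` items (DeleteObjects' per-request limit)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def delete_keys(bucket, keys):
    """Delete keys in batches of 1,000 via the DeleteObjects API."""
    import boto3  # imported here so chunked() stays usable without AWS
    s3 = boto3.client('s3')
    for batch in chunked(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={'Objects': [{'Key': k} for k in batch], 'Quiet': True},
        )

# Keys could come straight from the scan report, e.g.:
#   import json
#   keys = [f['path'] for f in json.load(open('results.json'))['blank_files']]
#   delete_keys('my-bucket', keys)
```

`Quiet: True` suppresses per-key success entries in the response, so you only get back errors.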
Option 3: Archive and Replace
Move problematic files to an archive bucket for further investigation:
```shell
jq -r '.corrupt_files[] | .path' results.json | while read -r file; do
  aws s3 cp "s3://my-bucket/$file" "s3://my-bucket-archive/corrupt/$file"
  aws s3 rm "s3://my-bucket/$file"
done
```
Real-World Scanning: Before and After
Before scanning a healthcare app's S3 bucket:

- Claimed: 15,000 patient documents uploaded
- Actual size: 87 GB
- Cost: ~$2,000/month storage + retrieval

After scanning and cleanup:

- Blank PDFs: 2,100 documents deleted
- Corrupt files: 340 documents deleted
- Duplicates: 1,800 documents consolidated
- Actual valid documents: 10,760
- New size: 34 GB
- New cost: ~$800/month (60% reduction)

The organization reclaimed storage costs and regained confidence in its compliance posture.
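That cleanup consolidated 1,800 duplicates, but none of the examples so far show how to find them. For non-multipart uploads, an object's ETag is the MD5 of its contents, so grouping a listing by (ETag, Size) flags likely duplicates without downloading a single byte. A sketch over the metadata that `list_objects_v2` returns (multipart-upload ETags are not plain MD5s, so treat matches there as hints to verify, not proof):

```python
from collections import defaultdict

def find_duplicates(objects):
    """Group S3 object listings by (ETag, Size); return groups with 2+ members.

    `objects` is a list of dicts shaped like list_objects_v2's 'Contents'
    entries: {'Key': ..., 'ETag': ..., 'Size': ...}.
    """
    groups = defaultdict(list)
    for obj in objects:
        groups[(obj['ETag'], obj['Size'])].append(obj['Key'])
    return [keys for keys in groups.values() if len(keys) > 1]

listing = [
    {'Key': 'a/claim.pdf',  'ETag': '"abc123"', 'Size': 2456},
    {'Key': 'b/claim2.pdf', 'ETag': '"abc123"', 'Size': 2456},
    {'Key': 'c/other.pdf',  'ETag': '"def456"', 'Size': 900},
]
print(find_duplicates(listing))  # [['a/claim.pdf', 'b/claim2.pdf']]
```

Because this works purely on listing metadata, it costs only LIST requests, no GETs, even on a very large bucket.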
Continuous Scanning
One-time scans are useful, but continuous monitoring catches problems as they emerge:
Using Uplint API
```python
import boto3
from datetime import datetime, timedelta
from uplint import Uplint

uplint = Uplint(api_key="your_api_key")
s3 = boto3.client('s3')

def continuous_scan():
    """Run daily to catch newly uploaded blank/corrupt files."""
    bucket = 'my-uploads'

    # Get files uploaded in the last 24 hours
    yesterday = datetime.utcnow() - timedelta(days=1)
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix='documents/')

    recent_files = []
    for page in pages:
        if 'Contents' not in page:
            continue
        for obj in page['Contents']:
            if obj['LastModified'].replace(tzinfo=None) > yesterday:
                recent_files.append(obj['Key'])

    # Validate each recent file
    problems = []
    for key in recent_files:
        response = s3.get_object(Bucket=bucket, Key=key)
        file_bytes = response['Body'].read()

        result = uplint.validate(file_bytes, {
            'detectBlanks': True,
            'scan': True
        })

        if not result.trusted:
            problems.append({
                'file': key,
                'reason': result.reason,
                'detected_at': datetime.utcnow().isoformat()
            })

    # Alert on problems (send_alert is your own notifier: email, Slack, PagerDuty, etc.)
    if problems:
        send_alert({
            'subject': f"Found {len(problems)} problematic files",
            'files': problems
        })

    return problems

# Schedule this daily via Lambda, cron, or Airflow
```
Scheduled Scanning with Lambda
```python
import json
import os

import boto3
from uplint import Uplint

uplint = Uplint(api_key=os.environ['UPLINT_API_KEY'])
s3 = boto3.client('s3')
sns = boto3.client('sns')

def lambda_handler(event, context):
    """Runs daily via EventBridge."""
    bucket = 'my-uploads'
    prefix = 'documents/'

    # Scan all PDFs under the prefix
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)

    blank_count = 0
    corrupt_count = 0

    for page in pages:
        if 'Contents' not in page:
            continue
        for obj in page['Contents']:
            if not obj['Key'].endswith('.pdf'):
                continue

            response = s3.get_object(Bucket=bucket, Key=obj['Key'])
            file_bytes = response['Body'].read()

            result = uplint.validate(file_bytes, {'detectBlanks': True})

            if result.blank:
                blank_count += 1
            elif not result.trusted:
                corrupt_count += 1

    # Send summary
    sns.publish(
        TopicArn=os.environ['SNS_TOPIC'],
        Subject='Daily S3 Scan Report',
        Message=f"Blank: {blank_count}, Corrupt: {corrupt_count}"
    )

    return {
        'statusCode': 200,
        'body': json.dumps({
            'blank': blank_count,
            'corrupt': corrupt_count
        })
    }
```
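The handler assumes an EventBridge schedule already exists. One way to provision it with the AWS CLI; the function name, account ID, and region here are placeholders for your own:

```shell
# Create a rule that fires once a day
aws events put-rule \
  --name daily-s3-scan \
  --schedule-expression "rate(1 day)"

# Allow EventBridge to invoke the Lambda function
aws lambda add-permission \
  --function-name s3-scan \
  --statement-id daily-s3-scan \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/daily-s3-scan

# Point the rule at the function
aws events put-targets \
  --rule daily-s3-scan \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:s3-scan"
```

A cron expression like `cron(0 6 * * ? *)` works in place of `rate(1 day)` if you want the scan at a fixed hour.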
Best Practices
1. Regular scanning (monthly minimum): Don't scan once. Make it a recurring process; new problems will emerge.
2. Keep archives: Before deleting, move suspicious files to an archive bucket. You might need them later for a compliance investigation.
3. Monitor costs: Track storage costs over time. A sudden spike usually means an upload bug creating duplicates or failing to delete temporary files.
4. Alert on changes: If the blank-file percentage jumps, something changed (a broken upload flow, new users uploading the wrong files, etc.).
5. Validate at upload, not just later: Scanning existing buckets is useful, but prevent new bad files by validating at upload time with the Uplint API.
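Practice 4 takes only a few lines once you record daily counts. A sketch that flags a day whose blank rate exceeds the trailing average by some margin (the 5-point default is arbitrary; tune it to your upload volume):

```python
def blank_rate_spike(daily_counts, margin=0.05):
    """Detect a jump in the blank-file rate.

    daily_counts: list of (total_uploads, blank_uploads) per day, oldest first.
    Returns True if the latest day's blank rate exceeds the trailing average
    of all previous days by more than `margin` (absolute).
    """
    if len(daily_counts) < 2:
        return False  # nothing to compare against yet
    rates = [blank / total for total, blank in daily_counts if total]
    latest, history = rates[-1], rates[:-1]
    baseline = sum(history) / len(history)
    return latest - baseline > margin

history = [(1000, 20), (900, 18), (1100, 25)]     # ~2% blank, steady
print(blank_rate_spike(history + [(1000, 150)]))  # True: jumped to 15%
print(blank_rate_spike(history + [(1000, 30)]))   # False: 3% is within margin
```

Feed it from the daily Lambda counts and wire the `True` case to your alerting channel.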
Pricing Considerations
Manual Python script: The code is free, but engineering time isn't; S3 GET requests are charged, and scans are slow.
Uplint CLI: Free command-line tool. A one-time scan of 10,000 files incurs only minimal S3 request charges.
Uplint API: ~$0.01-0.05 per validation (lower at scale). Best suited to continuous validation of new uploads.
For a bucket with 100,000 files:
- One-time CLI scan: $0-5 (mostly S3 request costs)
- Ongoing validation of 10,000 new files/month: ~$100-500/month
The cost is worth the compliance and storage savings.
Key Takeaways
- Your S3 bucket has problems you haven't seen
- Scanning is non-negotiable for regulated data
- Blank detection catches most issues
- Uplint CLI makes scanning easy
- Continuous validation prevents future problems
Start with a single scan. See what you find. Then decide whether one-time or continuous monitoring makes sense for your use case.
Uplint's free CLI scans S3 buckets for blank, corrupt, and malicious files. Run it once to see what's in your bucket, or integrate the API for continuous validation. Download the CLI →