FAQ

Frequently asked questions about tacozip.

General

What is tacozip?

Tacozip is a specialized ZIP library that adds a 165-byte metadata header at the beginning of ZIP archives. This enables single-request metadata retrieval - crucial for cloud storage scenarios where you need to know what's in an archive before downloading gigabytes of data.
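
For example, one small read of the header is enough to see where the interesting bytes live. A minimal sketch, assuming an existing archive.taco:

python
import tacozip

# The 165-byte header holds up to 7 (offset, length) metadata entries
entries = tacozip.read_header("archive.taco")
for offset, length in entries:
    print(f"chunk at byte {offset}, {length} bytes long")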

When should I use tacozip vs regular ZIP?

Use tacozip when:

  • Storing large files in cloud storage (S3, Azure, HTTP CDN)
  • Reading only parts of files (Parquet row groups, tiles, chunks)
  • Accessing data over slow or expensive networks
  • Reading metadata without downloading the entire archive

Use regular ZIP when:

  • Files are stored on the local filesystem only
  • You always extract the entire archive
  • Compression ratio is critical (TACO uses STORE only)
  • You need archives larger than 4 GB (TACO doesn't support ZIP64)

Is tacozip compatible with standard ZIP tools?

Yes! TACO archives ARE standard ZIP archives. You can:

  • Open with WinZip, 7-Zip, unzip, etc.
  • Extract all files normally
  • View contents in file explorers

However:

  • ⚠️ Modifying with standard tools will remove TACO header
  • ⚠️ Standard tools won't show/use metadata entries

What happens if I modify a TACO archive with 7-Zip/WinZip?

The TACO header will be lost or corrupted. The archive becomes a regular ZIP file. Run tacozip.validate() to detect this:

python
from tacozip import validate, TACOZ_INVALID_NO_TACO

result = validate("modified.taco")
if result == TACOZ_INVALID_NO_TACO:
    print("ERROR: Archive modified by external tool!")
    print("Recreate archive to restore TACO header")

Performance

What's the performance impact of STORE compression?

Trade-offs:

Aspect                 STORE (tacozip)               DEFLATE (regular ZIP)
Archive size           Larger (no compression)       Smaller (compressed)
Read speed             Faster (no decompression)     Slower (must decompress)
HTTP range requests    Work perfectly                Don't work (compressed data)
CPU usage              Minimal                       High (compression/decompression)

When STORE is better:

  • Cloud storage with fast network, slow CPU
  • Already-compressed data (JPEG, PNG, video, compressed Parquet)
  • Need random access to file segments (see the sketch below)

When DEFLATE is better:

  • Local storage (disk space limited)
  • Uncompressed source data (text files, CSV)
  • Always extracting full archive
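
Because STORE keeps each file's bytes uncompressed and contiguous, the (offset, length) pairs in the header map directly to byte ranges in the archive. A minimal sketch of fetching a single segment over HTTP (the URL is a placeholder):

python
import requests
import tacozip

url = "https://cdn.example.com/data.taco"

# One small request for the 165-byte header
r = requests.get(url, headers={"Range": "bytes=0-164"})
entries = tacozip.read_header(r.content)

# Fetch only the first segment - no decompression needed
offset, length = entries[0]
r = requests.get(url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
chunk = r.content  # exact bytes of the stored segment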

How fast is metadata access?

Regular ZIP (without TACO):

  1. Seek to the end of the file (~50ms for an S3 request)
  2. Read the End of Central Directory (~50ms)
  3. Read the Central Directory (~100ms for 1000 files)
  4. Total: ~200ms, and it grows with the number of files

TACO (with tacozip):

  1. Read first 165 bytes (~50ms for S3 request)
  2. Total: ~50ms regardless of file size

Speedup: ~4x faster metadata access, and the cost stays constant from MB-scale to GB-scale archives.
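
On S3 the same lookup is a single ranged GET. A minimal sketch using boto3 (bucket and key are placeholders):

python
import boto3
import tacozip

s3 = boto3.client("s3")

# One request fetches exactly the 165-byte TACO header
resp = s3.get_object(Bucket="my-bucket", Key="data.taco", Range="bytes=0-164")
entries = tacozip.read_header(resp["Body"].read())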

What's the overhead of the TACO header?

165 bytes = 0.000165 MB

Impact on common file sizes:

  • 1 MB file: ~0.017% overhead
  • 100 MB file: ~0.00017% overhead
  • 1 GB file: ~0.000017% overhead

Negligible overhead for any reasonably-sized file.

Usage

Can I have more than 7 metadata entries?

No, the maximum is 7 entries. This is a design constraint that keeps the header at a fixed 165 bytes.

Workarounds:

  1. Group data: Instead of individual rows, store row group ranges
  2. Multiple archives: Split data across multiple TACO archives
  3. External metadata: Store additional metadata in separate file
  4. Hierarchical metadata: Each entry points to a chunk containing sub-metadata

Example - hierarchical approach:

python
# Entry 0: Points to metadata chunk
# Entry 1-6: Point to actual data chunks

# Read metadata chunk first
entries = tacozip.read_header("data.taco")
meta_offset, meta_length = entries[0]

# Metadata chunk contains detailed info about entries 1-6
# Parse metadata to decide which data chunks to fetch

Can I store non-file data (just metadata)?

Yes! Create an archive with a dummy file:

python
import tacozip
from pathlib import Path

# Create dummy file
Path("dummy.txt").write_text("metadata only")

# Create archive with your metadata
tacozip.create(
    "metadata.taco",
    src_files=["dummy.txt"],
    entries=[
        (12345, 1000),  # Your actual metadata
        (67890, 2000),
        # ...
    ]
)

# Users only need to read header (165 bytes)
# They never need to download dummy.txt

How do I migrate from regular ZIP to TACO?

python
import shutil
import zipfile
import tacozip
from pathlib import Path

def migrate_to_taco(zip_path, taco_path):
    """Convert regular ZIP to TACO archive."""
    
    # Extract regular ZIP
    extract_dir = Path("./temp_extract")
    extract_dir.mkdir(exist_ok=True)
    
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    
    # Get all extracted files
    files = list(extract_dir.glob("**/*"))
    files = [str(f) for f in files if f.is_file()]
    
    # Create TACO archive
    tacozip.create(
        taco_path,
        src_files=files,
        entries=[(0, 0)]  # Add real metadata as needed
    )
    
    # Cleanup
    shutil.rmtree(extract_dir)

# Usage
migrate_to_taco("old.zip", "new.taco")

Can I update files in a TACO archive?

Metadata: Yes, use update_header() - very fast, since only the fixed-size header (~200 bytes) is rewritten. See the sketch below.

Files: No, file content cannot be updated in-place. You must:

  1. Extract archive
  2. Modify files
  3. Recreate archive

This is a design trade-off for simplicity and STORE-only compression.
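
A minimal sketch of a metadata update, assuming update_header() takes the archive path and a new list of (offset, length) entries like create() does:

python
import tacozip

# Rewrites only the fixed-size header; file contents are untouched
tacozip.update_header(
    "data.taco",
    entries=[
        (4096, 1000),   # new metadata for chunk 0
        (8192, 2000),   # new metadata for chunk 1
    ],
)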

Cloud storage

Which cloud providers does tacozip work with?

All major providers that support HTTP range requests:

  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage
  • Cloudflare R2
  • DigitalOcean Spaces
  • Backblaze B2
  • Any HTTP CDN (Cloudflare, Fastly, Akamai)

Do I need special S3 configuration?

No special configuration needed. Standard S3 features:

  • Range GET requests (standard S3 feature)
  • Public or private buckets (both work)
  • S3 Transfer Acceleration (compatible)
  • CloudFront CDN (compatible)

How much does it cost to read metadata from S3?

AWS S3 pricing (as of 2024):

  • GET request: $0.0004 per 1000 requests
  • Data transfer: $0.09 per GB

Reading TACO metadata:

  • 1 GET request + 165 bytes = ~$0.0000004 per read
  • Essentially free - reading metadata 1 million times costs ~$0.40

vs downloading 1GB file:

  • 1 GET request + 1GB transfer = ~$0.09

Savings: Reading metadata is ~225,000x cheaper than downloading full file!

Can I use tacozip with presigned URLs?

Yes! Presigned URLs work perfectly:

python
import boto3
import requests
import tacozip

s3 = boto3.client('s3')

# Generate presigned URL
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'data.taco'},
    ExpiresIn=3600
)

# Use with an HTTP range request
r = requests.get(url, headers={"Range": "bytes=0-164"})
entries = tacozip.read_header(r.content)

Errors and troubleshooting

"Native library not found" error

Cause: The .so/.dylib/.dll file is missing from the package.

Solutions:

bash
# Reinstall package
pip install --force-reinstall tacozip

# If building from source, ensure library is built
cmake --preset release
cmake --build --preset release -j
pip install -e clients/python/

# Verify installation
python -c "import tacozip; tacozip.self_check()"

"Buffer too small" error

Cause: Trying to read the header from a bytes buffer smaller than 165 bytes.

Solution: Ensure you read at least the full 165 bytes:

python
# Wrong
r = requests.get(url, headers={"Range": "bytes=0-100"})  # ❌ Too small
entries = tacozip.read_header(r.content)

# Correct
r = requests.get(url, headers={"Range": "bytes=0-164"})  # Minimum 165 bytes
entries = tacozip.read_header(r.content)

# Better (with margin)
r = requests.get(url, headers={"Range": "bytes=0-199"})  # Extra margin
entries = tacozip.read_header(r.content)

"Invalid header" or TACOZ_INVALID_NO_TACO error

Cause: File was modified by standard ZIP tool, removing TACO header.

Solution: Recreate archive from original files.

Prevention: Don't modify TACO archives with 7-Zip/WinZip/unzip. Only read/extract.

"Archive too large" (TACOZ_ERR_TOO_LARGE)

Cause: Archive exceeds 4GB limit (no ZIP64 support).

Solutions:

  1. Split data across multiple archives (see the sketch below)
  2. Use regular ZIP for archives >4GB
  3. Compress files before adding (outside of archive)
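
For the first option, a minimal sketch that spreads files across several archives (splitting by file count for brevity; real code should track cumulative size to stay under 4GB):

python
import tacozip

def create_split_archives(files, files_per_archive=100):
    """Write files across multiple TACO archives."""
    for i in range(0, len(files), files_per_archive):
        tacozip.create(
            f"data_{i // files_per_archive:03d}.taco",
            src_files=files[i:i + files_per_archive],
            entries=[(0, 0)],  # add real metadata per archive
        )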

Development

Can I use tacozip with multiprocessing/threads?

Yes! The library is thread-safe for reading operations.

python
from concurrent.futures import ThreadPoolExecutor
import tacozip

def process_chunk(offset, length):
    with open("data.taco", "rb") as f:
        f.seek(offset)
        chunk = f.read(length)
    return chunk  # parse/process the chunk here

# Read metadata once
entries = tacozip.read_header("data.taco")

# Process chunks in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(process_chunk, offset, length)
        for offset, length in entries
    ]
    results = [f.result() for f in futures]

Note: Write operations (create, update_header) should NOT be parallelized on the same archive.

How do I contribute to tacozip?

See Contributing Guidelines for:

  • Code style requirements
  • Testing requirements
  • PR process
  • Development setup

Is there a Julia/R/JavaScript client?

Current status:

  • ✅ Python client (official)
  • ✅ C library (official)
  • ⏳ Julia client (planned)
  • ⏳ R client (planned)
  • ❌ JavaScript client (not planned)

Community contributions welcome!
