TACO Header format

Purpose and scope

This document provides the complete technical specification of the TACO Header structure. It covers byte-level layout, serialization algorithms, update mechanisms, and implementation patterns.

Prerequisites: Familiarity with basic tacozip concepts from Overview. For practical usage examples, see Getting Started.

Header structure

The TACO Header is a 157-byte fixed structure comprising three sections that together form a valid ZIP local file entry (Local File Header, filename, and stored data):

┌───────────────────┬───────────┬──────────────────┐
│ Section           │ Size      │ Content          │
├───────────────────┼───────────┼──────────────────┤
│ Local File Header │ 30 bytes  │ ZIP LFH fields   │
│ Filename          │ 11 bytes  │ "TACO_HEADER"    │
│ Payload           │ 116 bytes │ Metadata entries │
└───────────────────┴───────────┴──────────────────┘
Total: 157 bytes at offset 0

Byte-level layout

| Offset | Size | Field | Description |
|--------|------|-------|-------------|
| 0-3 | 4 | LFH Signature | 0x04034b50 (ZIP magic bytes) |
| 4-5 | 2 | Version Needed | 20 (ZIP 2.0) |
| 6-7 | 2 | General Purpose Flag | 0x0000 or 0x0800 (UTF-8) |
| 8-9 | 2 | Compression Method | 0 (STORE - uncompressed) |
| 10-11 | 2 | Last Mod Time | DOS time format |
| 12-13 | 2 | Last Mod Date | DOS date format |
| 14-17 | 4 | CRC-32 | Checksum of 116-byte payload |
| 18-21 | 4 | Compressed Size | 116 |
| 22-25 | 4 | Uncompressed Size | 116 |
| 26-27 | 2 | Filename Length | 11 |
| 28-29 | 2 | Extra Field Length | 0 |
| 30-40 | 11 | Filename | "TACO_HEADER" (ASCII) |
| 41-156 | 116 | Payload | Metadata entries (see below) |
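
To illustrate the table above, the 30 fixed LFH bytes can be decoded with Python's struct module. This is a minimal sketch rather than tacozip code; LFH_FORMAT, LFH_FIELDS, and unpack_lfh are names invented here for the example.

```python
import struct

# Little-endian layout of the 30-byte ZIP Local File Header:
# signature (4), version (2), flags (2), method (2), time (2), date (2),
# CRC-32 (4), compressed size (4), uncompressed size (4),
# filename length (2), extra field length (2).
LFH_FORMAT = "<IHHHHHIIIHH"
LFH_FIELDS = (
    "signature", "version_needed", "flags", "method", "mod_time",
    "mod_date", "crc32", "compressed_size", "uncompressed_size",
    "filename_len", "extra_len",
)


def unpack_lfh(header: bytes) -> dict:
    """Decode the first 30 bytes of a header buffer into named fields."""
    values = struct.unpack_from(LFH_FORMAT, header, 0)
    return dict(zip(LFH_FIELDS, values))
```

The filename then sits at header[30:41] and the payload at header[41:157], matching the last two rows of the table.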

Payload structure

The 116-byte payload contains the actual metadata:

┌────────────────────────────────────────────────┐
│ Offset │ Size │ Field                          │
├────────┼──────┼────────────────────────────────┤
│ 0      │ 1    │ count (0-7)                    │
│ 1-3    │ 3    │ padding (reserved)             │
│ 4-19   │ 16   │ Entry 0 (offset:8 + length:8)  │
│ 20-35  │ 16   │ Entry 1 (offset:8 + length:8)  │
│ 36-51  │ 16   │ Entry 2 (offset:8 + length:8)  │
│ 52-67  │ 16   │ Entry 3 (offset:8 + length:8)  │
│ 68-83  │ 16   │ Entry 4 (offset:8 + length:8)  │
│ 84-99  │ 16   │ Entry 5 (offset:8 + length:8)  │
│ 100-115│ 16   │ Entry 6 (offset:8 + length:8)  │
└────────┴──────┴────────────────────────────────┘

Entry format

Each metadata entry is 16 bytes (little-endian):

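| Byte offset | Size | Field | Type | Description |
|-------------|------|-------|------|-------------|
| 0-7 | 8 | offset | uint64_t | Byte offset in external file |
| 8-15 | 8 | length | uint64_t | Length in bytes |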

Example: For count = 3, entries 0-2 contain valid data; entries 3-6 are ignored but still occupy space in the fixed array.
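
The payload layout and the count = 3 example can be reproduced with a short Python sketch. It assumes only the layout described above (count byte, 3 padding bytes, 7 little-endian entries); pack_payload and unpack_payload are hypothetical helpers, not part of tacozip.

```python
import struct

MAX_ENTRIES = 7
# count (1 byte), 3 padding bytes, then 7 x (offset: uint64, length: uint64)
PAYLOAD_FORMAT = "<B3x" + "QQ" * MAX_ENTRIES


def pack_payload(entries):
    """Pack up to 7 (offset, length) pairs into the 116-byte payload."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("at most 7 entries are allowed")
    # Unused slots still occupy space in the fixed array; fill them with zeros.
    padded = list(entries) + [(0, 0)] * (MAX_ENTRIES - len(entries))
    flat = [value for pair in padded for value in pair]
    return struct.pack(PAYLOAD_FORMAT, len(entries), *flat)


def unpack_payload(payload):
    """Return only the valid entries, i.e. the first `count` pairs."""
    count, *flat = struct.unpack(PAYLOAD_FORMAT, payload)
    if count > MAX_ENTRIES:
        raise ValueError("count must be in the range 0-7")
    pairs = list(zip(flat[0::2], flat[1::2]))
    return pairs[:count]


payload = pack_payload([(100, 50), (150, 25), (175, 10)])
valid = unpack_payload(payload)  # entries 0-2; entries 3-6 are ignored
```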

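C type definitions

c
#include <stdint.h>

typedef struct {
    uint64_t offset;  // Byte offset in external file
    uint64_t length;  // Length in bytes
} taco_meta_entry_t;

// In-memory representation only. The compiler may insert padding after
// count, so this struct is not guaranteed to be byte-identical to the
// 116-byte on-disk payload (count + 3 padding bytes + 7 entries).
typedef struct {
    uint8_t count;    // Valid entries (0-7)
    taco_meta_entry_t entries[7];  // Fixed array
} taco_meta_array_t;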
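
The difference between the in-memory struct and the on-disk payload can be checked from Python with ctypes. The classes below are stand-ins written for this example that mirror the C definitions above; they are not the ctypes definitions shipped with the Python bindings.

```python
import ctypes


class TacoMetaEntry(ctypes.Structure):
    # Mirrors taco_meta_entry_t: two 64-bit unsigned integers.
    _fields_ = [("offset", ctypes.c_uint64), ("length", ctypes.c_uint64)]


class TacoMetaArray(ctypes.Structure):
    # Mirrors taco_meta_array_t with the platform's default alignment.
    _fields_ = [("count", ctypes.c_uint8), ("entries", TacoMetaEntry * 7)]


entry_size = ctypes.sizeof(TacoMetaEntry)        # 16 bytes
array_size = ctypes.sizeof(TacoMetaArray)        # platform-dependent
entries_offset = TacoMetaArray.entries.offset    # padding after count
```

On platforms where uint64_t is 8-byte aligned, the entries typically start at offset 8 rather than 4, so the struct should be serialized field by field instead of being copied directly into the payload.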

Entry semantics

Metadata entries are application-defined (offset, length) pairs representing byte ranges. Tacozip does not interpret these values.

Common use cases:

| Scenario | Entry meaning |
|----------|---------------|
| Parquet files | Row group byte ranges |
| Chunked data | Chunk boundaries for parallel processing |
| Index structures | Pointers to indexed segments |
| Sharded datasets | External blob storage references |
| Tile pyramids | Geospatial tile level boundaries |

Constraints:

| Property | Value | Enforcement |
|----------|-------|-------------|
| Maximum entries | 7 | TACO_HEADER_MAX_ENTRIES |
| Entry count range | 0-7 | Validated in parse_header_payload() |
| Offset range | 0 to 2^64-1 | Not validated (application-defined) |
| Length range | 0 to 2^64-1 | Not validated (application-defined) |
| Padding bytes | Must be 0x00 | Set during write, not enforced during read |
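
Because offsets and lengths are not validated by the library, an application may want its own check before writing. The following is one possible sketch; validate_entries is a hypothetical helper, and the only rules it applies are the entry limit and the uint64 range from the table above.

```python
MAX_ENTRIES = 7          # same value as TACO_HEADER_MAX_ENTRIES
UINT64_MAX = 2**64 - 1


def validate_entries(entries):
    """Raise ValueError if the entries cannot be stored in a TACO Header."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError(f"{len(entries)} entries given, maximum is {MAX_ENTRIES}")
    for index, (offset, length) in enumerate(entries):
        for name, value in (("offset", offset), ("length", length)):
            # Each field is stored as an unsigned 64-bit integer.
            if not 0 <= value <= UINT64_MAX:
                raise ValueError(f"entry {index}: {name} does not fit in uint64")
```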

ZIP archive context

The TACO Header exists within a standard ZIP archive structure:

┌──────────────────────────────────────────┐
│ TACO_HEADER (LFH + filename + payload)   │ ← Offset 0 (157 bytes)
├──────────────────────────────────────────┤
│ File 1 (LFH + filename + data)           │
├──────────────────────────────────────────┤
│ File 2 (LFH + filename + data)           │
├──────────────────────────────────────────┤
│ ...                                      │
├──────────────────────────────────────────┤
│ Central Directory                        │
│   ├─ TACO_HEADER entry                   │
│   ├─ File 1 entry                        │
│   ├─ File 2 entry                        │
│   └─ ...                                 │
├──────────────────────────────────────────┤
│ End of Central Directory (EOCD)          │
└──────────────────────────────────────────┘

ZIP compliance

The header maintains full ZIP specification compliance:

| Requirement | Implementation |
|-------------|----------------|
| Valid LFH | First 30 bytes form standard Local File Header |
| Legal filename | "TACO_HEADER" (11 bytes, ASCII) |
| STORE method | Compression method 0 with matching sizes |
| CRC-32 | Calculated over 116-byte payload |
| Central Directory | TACO_HEADER appears as regular file entry |
| Standard extraction | Can be extracted with any ZIP tool |

When extracted with standard ZIP tools, TACO_HEADER appears as a file containing 116 bytes of binary metadata.
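
The compliance claims can be explored with Python's standard zipfile module, used here as a stand-in for tacozip. The sketch writes a stored TACO_HEADER entry first, then checks that the resulting bytes follow the layout described earlier; build_minimal_archive and the data.bin entry are illustrative names only.

```python
import io
import zipfile


def build_minimal_archive(payload: bytes) -> bytes:
    """Build an in-memory ZIP whose first entry is a stored TACO_HEADER."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # Written first, so its Local File Header starts at offset 0.
        zf.writestr(zipfile.ZipInfo("TACO_HEADER"), payload)
        zf.writestr("data.bin", b"example file contents")
    return buf.getvalue()


payload = bytes(116)  # count = 0, all entries zeroed
archive = build_minimal_archive(payload)

# A standard ZIP reader sees TACO_HEADER as a regular 116-byte file.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()
    extracted = zf.read("TACO_HEADER")
```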

Parsing and serialization

Parsing: Bytes → Structure

Function: tacozip_parse_header(buffer, buffer_size, &meta)

Algorithm:

  1. Validate buffer size: Must be ≥ 157 bytes
  2. Check LFH signature: Verify 0x04034b50 at offset 0
  3. Verify filename: Check "TACO_HEADER" at offset 30
  4. Extract payload: Read 116 bytes starting at offset 41
  5. Parse count: Read byte 41 (must be 0-7)
  6. Extract entries: Read 7 × 16 bytes in little-endian format

Error conditions:

| Check | Error code | Condition |
|-------|------------|-----------|
| Buffer size | TACOZ_ERR_PARAM | buffer_size < 157 |
| LFH signature | TACOZ_ERR_INVALID_HEADER | Not 0x04034b50 |
| Filename | TACOZ_ERR_INVALID_HEADER | Not "TACO_HEADER" |
| Entry count | TACOZ_ERR_INVALID_HEADER | count > 7 |
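
The six parsing steps and their error conditions can be sketched in Python as follows. This mirrors the algorithm as documented, not the C source; parse_header and the two exception classes are hypothetical stand-ins for tacozip_parse_header() and its error codes.

```python
import struct

HEADER_SIZE = 157
PAYLOAD_OFFSET = 41
MAX_ENTRIES = 7


class HeaderParamError(Exception):
    """Stand-in for TACOZ_ERR_PARAM."""


class InvalidHeaderError(Exception):
    """Stand-in for TACOZ_ERR_INVALID_HEADER."""


def parse_header(buffer: bytes):
    # 1. Validate buffer size
    if len(buffer) < HEADER_SIZE:
        raise HeaderParamError("buffer must hold at least 157 bytes")
    # 2. Check LFH signature at offset 0
    (signature,) = struct.unpack_from("<I", buffer, 0)
    if signature != 0x04034B50:
        raise InvalidHeaderError("bad LFH signature")
    # 3. Verify filename at offset 30
    if bytes(buffer[30:41]) != b"TACO_HEADER":
        raise InvalidHeaderError("bad filename")
    # 4. Extract the 116-byte payload starting at offset 41
    payload = buffer[PAYLOAD_OFFSET:HEADER_SIZE]
    # 5. Parse count (header byte 41, payload byte 0)
    count = payload[0]
    if count > MAX_ENTRIES:
        raise InvalidHeaderError("count must be in the range 0-7")
    # 6. Read 7 x 16 bytes as little-endian uint64 pairs, skipping the padding
    flat = struct.unpack_from("<14Q", payload, 4)
    pairs = list(zip(flat[0::2], flat[1::2]))
    return pairs[:count]
```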

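Little-endian reading:

c
#include <stdint.h>

// Assemble a uint64_t from 8 bytes, least significant byte first.
// Each byte is widened to 64 bits before shifting, so no shift overflows.
uint64_t read_le64(const unsigned char *buf) {
    return ((uint64_t)buf[0])       |
           ((uint64_t)buf[1] << 8)  |
           ((uint64_t)buf[2] << 16) |
           ((uint64_t)buf[3] << 24) |
           ((uint64_t)buf[4] << 32) |
           ((uint64_t)buf[5] << 40) |
           ((uint64_t)buf[6] << 48) |
           ((uint64_t)buf[7] << 56);
}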
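
For comparison, the same byte-by-byte assembly can be written in Python, where it should agree with the built-in int.from_bytes. The pos parameter is an addition made for this sketch and is not part of the C function.

```python
def read_le64(buf: bytes, pos: int = 0) -> int:
    """Combine 8 bytes starting at pos, least significant byte first."""
    value = 0
    for i in range(8):
        value |= buf[pos + i] << (8 * i)
    return value


raw = (5000).to_bytes(8, "little")
decoded = read_le64(raw)
builtin = int.from_bytes(raw, "little")  # equivalent standard-library call
```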

Serialization: Structure → Bytes

Function: tacozip_serialize_header(&meta, buffer, buffer_size)

Algorithm:

  1. Validate input: Check buffer_size >= 157 and count <= 7
  2. Build payload (116 bytes):
    • Write count byte
    • Write 3 padding bytes (0x00)
    • Write 7 entries in little-endian format
  3. Calculate CRC-32: Compute checksum over 116-byte payload using zlib
  4. Build LFH (30 bytes):
    • Signature: 0x04034b50
    • Version: 20
    • Flags: 0
    • Method: 0 (STORE)
    • CRC-32 from step 3
    • Sizes: both 116
    • Filename length: 11
  5. Write filename: "TACO_HEADER" (11 bytes)
  6. Write payload: 116 bytes from step 2

Output: Complete 157-byte header in buffer
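
The serialization steps can be sketched in Python with struct and zlib. This follows the algorithm as listed above and is not the library's implementation; serialize_header is a hypothetical name, and the DOS time and date values are assumptions, since the steps do not say which timestamp is written.

```python
import struct
import zlib

DOS_TIME = 0        # assumption: 00:00:00
DOS_DATE = 0x0021   # assumption: 1980-01-01, the earliest DOS date


def serialize_header(entries) -> bytes:
    # 1. Validate input
    if len(entries) > 7:
        raise ValueError("count must be in the range 0-7")
    # 2. Build payload: count, 3 padding bytes, 7 little-endian entries
    padded = list(entries) + [(0, 0)] * (7 - len(entries))
    flat = [value for pair in padded for value in pair]
    payload = struct.pack("<B3x14Q", len(entries), *flat)
    # 3. CRC-32 over the 116-byte payload
    crc = zlib.crc32(payload)
    # 4. Build the 30-byte LFH
    lfh = struct.pack(
        "<IHHHHHIIIHH",
        0x04034B50,          # signature
        20,                  # version needed
        0,                   # flags
        0,                   # method: STORE
        DOS_TIME, DOS_DATE,
        crc,
        116, 116,            # compressed and uncompressed size
        11,                  # filename length
        0,                   # extra field length
    )
    # 5 and 6. Append filename and payload
    return lfh + b"TACO_HEADER" + payload
```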

Update mechanism

A key design feature is efficient in-place metadata updates requiring only 3 writes:

Update process

┌─────────────────────────────────────────────┐
│ 1. Read existing header (157 bytes)         │
├─────────────────────────────────────────────┤
│ 2. Modify metadata entries in memory        │
├─────────────────────────────────────────────┤
│ 3. Calculate new CRC-32 of payload          │
├─────────────────────────────────────────────┤
│ 4. Write #1: New payload at offset 41       │ (116 bytes)
│ 5. Write #2: New CRC-32 at offset 14        │ (4 bytes)
│ 6. Write #3: New CRC-32 in Central Dir      │ (4 bytes)
└─────────────────────────────────────────────┘
Total: 3 writes, 124 bytes

Write operations detail

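| Write # | Location | Offset | Size | Data | Purpose |
|---------|----------|--------|------|------|---------|
| 1 | LFH payload | 41 | 116 bytes | Updated metadata | Replace entries |
| 2 | LFH CRC-32 | 14 | 4 bytes | New checksum | Update header CRC |
| 3 | Central Directory | Variable | 4 bytes | New checksum | Update CD entry CRC |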
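
The three writes can be sketched in Python against any seekable binary file object. This is an illustration of the technique, not a port of tacozip_update_header(): both function names are hypothetical, and the way the Central Directory entry is located (scanning back for the EOCD record, then walking the directory) is an assumption based on the standard ZIP layout, in which the CRC-32 sits 16 bytes into each Central Directory header.

```python
import os
import struct
import zlib

EOCD_SIG = b"PK\x05\x06"
CD_SIG = b"PK\x01\x02"


def find_cd_crc_offset(f) -> int:
    """Return the file position of the CRC-32 in TACO_HEADER's CD entry."""
    size = f.seek(0, os.SEEK_END)
    tail_len = min(size, 22 + 65535)  # EOCD record plus maximum comment
    f.seek(size - tail_len)
    tail = f.read(tail_len)
    eocd = tail.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("End of Central Directory record not found")
    total, cd_size, cd_offset = struct.unpack_from("<HII", tail, eocd + 10)
    f.seek(cd_offset)
    cd = f.read(cd_size)
    pos = 0
    for _ in range(total):
        if cd[pos:pos + 4] != CD_SIG:
            raise ValueError("corrupt Central Directory")
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", cd, pos + 28)
        (lfh_offset,) = struct.unpack_from("<I", cd, pos + 42)
        name = cd[pos + 46:pos + 46 + name_len]
        if name == b"TACO_HEADER" and lfh_offset == 0:
            return cd_offset + pos + 16
        pos += 46 + name_len + extra_len + comment_len
    raise ValueError("TACO_HEADER entry not found in Central Directory")


def update_header_in_place(f, payload: bytes) -> None:
    if len(payload) != 116:
        raise ValueError("payload must be exactly 116 bytes")
    crc = struct.pack("<I", zlib.crc32(payload))
    cd_crc_pos = find_cd_crc_offset(f)
    f.seek(41)
    f.write(payload)   # Write 1: payload (116 bytes)
    f.seek(14)
    f.write(crc)       # Write 2: CRC-32 in the LFH (4 bytes)
    f.seek(cd_crc_pos)
    f.write(crc)       # Write 3: CRC-32 in the Central Directory (4 bytes)
```

For a file on disk, the object would be opened with open(path, "r+b") so that the writes overwrite bytes in place without truncating the archive.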

Performance benefits:

  • ✅ No archive rewrite required
  • ✅ No file content rewrite required
  • ✅ No Central Directory relocation
  • ✅ O(1) time complexity regardless of archive size
  • ✅ Works on archives from bytes to gigabytes

Implementation: tacozip_update_header() in src/tacozip.c:618-671

Design rationale

Why offset 0?

Positioning at file start enables optimal cloud access patterns:

| Benefit | Explanation |
|---------|-------------|
| Minimal latency | The header position is known in advance, so no lookup request precedes the read |
| Single HTTP request | Range: bytes=0-156 retrieves metadata without seeking |
| S3 efficiency | Partial download avoids full object retrieval |
| Predictable access | No scanning or seeking required |

Why fixed 157 bytes?

| Advantage | Impact |
|-----------|--------|
| Constant-time access | O(1) read regardless of entry count |
| Predictable updates | Always 3 writes (124 bytes) |
| Static allocation | No malloc/free overhead |
| Network efficiency | The 157 bytes fit in a single TCP segment at a typical 1500-byte MTU |

Why 7 entries maximum?

Balances multiple constraints:

| Factor | Consideration |
|--------|---------------|
| Header size | 7 × 16 + 4 = 116-byte payload; with the 30-byte LFH and 11-byte filename, the header totals 157 bytes |
| Common use cases | Most applications need 1-5 chunks |
| Update efficiency | Small payload = fast CRC-32 computation |
| Future expansion | Reserved padding allows format evolution |

Why STORE compression?

| Reason | Benefit |
|--------|---------|
| Predictable offsets | File positions never shift |
| Fast access | No decompression overhead |
| Append efficiency | New files added without recompression |
| Simple CRC-32 | Direct calculation over uncompressed data |

Why 4GB archive limit?

| Constraint | Rationale |
|------------|-----------|
| No ZIP64 support | Uses standard 32-bit ZIP structures |
| 32-bit offsets | Central Directory uses 32-bit file positions |
| 32-bit sizes | LFH and CDH use 32-bit size fields |
| Simplicity | Avoids ZIP64 extended information complexity |

Usage patterns

Local file access

c
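#include <inttypes.h>  // PRIu64: portable printf format for uint64_t
#include <stdio.h>
#include "tacozip.h"

taco_meta_array_t meta;
int rc = tacozip_read_header("archive.taco", &meta);
if (rc == TACOZ_OK) {
    printf("Entries: %u\n", (unsigned)meta.count);
    for (int i = 0; i < meta.count; i++) {
        printf("  [%d] offset=%" PRIu64 " length=%" PRIu64 "\n",
               i, meta.entries[i].offset, meta.entries[i].length);
    }
} else {
    fprintf(stderr, "tacozip_read_header failed: %d\n", rc);
}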

HTTP range request

c
// http_get_range() is a placeholder for your HTTP client; the byte range is inclusive.
const char *url = "https://cdn.example.com/data.taco";
unsigned char buffer[TACO_HEADER_TOTAL_SIZE];  // 157 bytes
http_get_range(url, 0, TACO_HEADER_TOTAL_SIZE - 1, buffer);

taco_meta_array_t meta;
int rc = tacozip_parse_header(buffer, sizeof buffer, &meta);
if (rc == TACOZ_OK && meta.count > 0) {
    // Use metadata without downloading full archive
    uint64_t offset = meta.entries[0].offset;
    uint64_t length = meta.entries[0].length;

    // Download only specific segment; data_buffer must hold at least `length` bytes
    http_get_range(url, offset, offset + length - 1, data_buffer);
}

S3 partial download

python
import boto3
import tacozip

s3 = boto3.client('s3')

# Step 1: Get header (157 bytes)
response = s3.get_object(
    Bucket='my-bucket',
    Key='archive.taco',
    Range='bytes=0-156'
)
header_bytes = response['Body'].read()

# Step 2: Parse metadata
entries = tacozip.read_header(header_bytes)

# Step 3: Download specific entry (index 2 requires at least 3 valid entries)
offset, length = entries[2]
response = s3.get_object(
    Bucket='my-bucket',
    Key='archive.taco',
    Range=f'bytes={offset}-{offset+length-1}'
)
data = response['Body'].read()
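
Both the HTTP and S3 examples rely on inclusive byte ranges, where the last byte is offset + length - 1. A small helper makes that arithmetic explicit; range_header is a hypothetical name used only for this sketch.

```python
def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range value covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends.
    return f"bytes={offset}-{offset + length - 1}"


header_range = range_header(0, 157)       # the full TACO Header
segment_range = range_header(5000, 1000)  # a segment described by an entry
```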

Metadata update

c
taco_meta_array_t meta;
int rc = tacozip_read_header("archive.taco", &meta);

if (rc == TACOZ_OK) {
    // Modify metadata
    meta.entries[0].offset = 5000;
    meta.entries[0].length = 1000;
    meta.entries[1].offset = 6000;
    meta.entries[1].length = 2000;

    // Entries at index >= count are ignored, so count must cover both
    if (meta.count < 2) {
        meta.count = 2;
    }

    // Update in-place (only 124 bytes written)
    rc = tacozip_update_header("archive.taco", &meta);
}

Constants reference

Defined in include/tacozip.h:

| Constant | Value | Description |
|----------|-------|-------------|
| TACO_HEADER_MAX_ENTRIES | 7 | Maximum metadata entries |
| TACO_HEADER_PAYLOAD_SIZE | 116 | Payload size in bytes |
| TACO_HEADER_TOTAL_SIZE | 157 | Complete header size |
| TACO_HEADER_NAME | "TACO_HEADER" | Filename in ZIP |
| TACO_HEADER_NAME_LEN | 11 | Filename length |

Python equivalents in clients/python/tacozip/config.py.

Error codes

| Code | Value | Description | Returned by |
|------|-------|-------------|-------------|
| TACOZ_OK | 0 | Success | All functions |
| TACOZ_ERR_PARAM | -4 | Invalid parameters (NULL, buffer too small) | parse_header, serialize_header |
| TACOZ_ERR_INVALID_HEADER | -3 | Invalid signature, filename, or count > 7 | parse_header, read_header |
| TACOZ_ERR_IO | -1 | File I/O error | read_header, update_header |
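
When calling the C functions from another language, the numeric codes in the table can be mapped to readable names. The sketch below is one way to do that in Python; TacozStatus and check are hypothetical names, and the actual Python bindings may report errors differently.

```python
from enum import IntEnum


class TacozStatus(IntEnum):
    # Values copied from the error code table above.
    OK = 0
    ERR_IO = -1
    ERR_INVALID_HEADER = -3
    ERR_PARAM = -4


def check(rc: int) -> None:
    """Raise RuntimeError for any return code other than TACOZ_OK."""
    if rc == TacozStatus.OK:
        return
    try:
        name = "TACOZ_" + TacozStatus(rc).name
    except ValueError:
        # Codes not listed in the table are reported without a name.
        name = "unknown status"
    raise RuntimeError(f"tacozip call failed: {name} ({rc})")
```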

Implementation references

C Functions:

  • tacozip_parse_header() - Parse header from buffer
  • tacozip_serialize_header() - Serialize header to buffer
  • tacozip_read_header() - Read header from file
  • tacozip_update_header() - Update header in file
  • parse_header_payload() - Internal payload parser
  • read_le64() - Little-endian uint64 reader

Python bindings:

  • tacozip.read_header() - Python wrapper
  • TacoMetaArray - ctypes structure definition
  • See Python API Reference for details.

Released under the MIT License.