
Repository storage internals

This guide explains how StructureData stores and loads array properties, which optimisations are already available to developers, and what the current format's limitations are, together with suggested future improvements.


Storage layout

When a StructureData node is stored, its data is split across two backends:

Data                                                         Backend                        Queryable via QueryBuilder?
Scalars and summary statistics (cell, formula, n_sites, …)   Database attributes            ✅ Yes
Array properties (positions, charges, symbols, …)            AiiDA repository (.npz file)   ❌ No

The single repository file is called properties.npz (controlled by StructureData._properties_filename). It is a standard ZIP archive where every property is stored as an independent .npy entry — one entry per property name.


The .npz format in detail

A .npz file created by numpy.savez_compressed is a ZIP archive with zipfile.ZIP_DEFLATED (DEFLATE) compression. Each member of the archive is an independent .npy binary file containing exactly one numpy array.

properties.npz  (ZIP archive)
├── charges.npy
├── positions.npy
├── site_indices_flat.npy      ← CSR part 1
├── site_indices_offsets.npy   ← CSR part 2
└── symbols.npy

Keys are stored in sorted alphabetical order to ensure a deterministic binary output, which is required for AiiDA's content-addressable file hashing.
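The claim that an .npz file is an ordinary ZIP of .npy members can be verified directly with the standard library; the property names below are illustrative:

```python
import io
import zipfile

import numpy as np

# Build an archive the same way numpy.savez_compressed does.
buf = io.BytesIO()
np.savez_compressed(
    buf,
    charges=np.array([0.1, -0.1]),
    positions=np.zeros((2, 3)),
)
buf.seek(0)

# Inspect it with zipfile: one independent .npy member per property key.
with zipfile.ZipFile(buf) as zf:
    names = sorted(zf.namelist())
    print(names)  # ['charges.npy', 'positions.npy']
    info = zf.getinfo('positions.npy')
    print(info.compress_type == zipfile.ZIP_DEFLATED)  # True
```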

Special case: site_indices (CSR encoding)

site_indices is a ragged list-of-lists (one sub-list per kind, with a variable number of site indices). A homogeneous numpy array cannot represent it directly. It is therefore encoded using CSR (Compressed Sparse Row) format as two flat 1-D int64 arrays:

Key in .npz            Shape            Content
site_indices_flat      (total_sites,)   All indices concatenated
site_indices_offsets   (n_kinds + 1,)   Cumulative start positions

Example — 3 kinds with 2, 1, and 3 sites respectively:

site_indices = [[0, 1], [2], [3, 4, 5]]

→ site_indices_flat    = [0, 1, 2, 3, 4, 5]
→ site_indices_offsets = [0, 2, 3, 6]

Decoding: site_indices[i] = flat[offsets[i] : offsets[i+1]]
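A minimal round-trip sketch of this encoding (the function names are illustrative, not the StructureData internals):

```python
import numpy as np

def encode_csr(site_indices):
    """Flatten a ragged list-of-lists into the (flat, offsets) CSR pair."""
    flat = np.array([i for sub in site_indices for i in sub], dtype=np.int64)
    lengths = [len(sub) for sub in site_indices]
    offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int64)
    return flat, offsets

def decode_csr(flat, offsets):
    """Recover the ragged list: row i is flat[offsets[i]:offsets[i+1]]."""
    return [flat[offsets[i]:offsets[i + 1]].tolist()
            for i in range(len(offsets) - 1)]

site_indices = [[0, 1], [2], [3, 4, 5]]
flat, offsets = encode_csr(site_indices)
print(flat.tolist())      # [0, 1, 2, 3, 4, 5]
print(offsets.tolist())   # [0, 2, 3, 6]
print(decode_csr(flat, offsets) == site_indices)  # True
```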


Selective (per-property) loading

_load_properties_from_npz accepts an optional keys parameter that limits which properties are decompressed:

# Load only positions — charges, symbols, … are never touched
positions = node._load_properties_from_npz(keys=['positions'])['positions']

# Load two properties at once
data = node._load_properties_from_npz(keys=['positions', 'symbols'])

# site_indices — the CSR pair is expanded automatically
data = node._load_properties_from_npz(keys=['site_indices'])

Why this is efficient: numpy.lib.npyio.NpzFile.__getitem__ calls zipfile.ZipFile.open(key), which seeks directly to the requested entry using the ZIP central directory. It does not decompress any other entry. Accessing positions in a file that also contains charges, magmoms, and symbols only decompresses positions.npy.
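The lazy behaviour can be demonstrated with plain numpy, outside any AiiDA node:

```python
import io

import numpy as np

buf = io.BytesIO()
np.savez_compressed(
    buf,
    positions=np.random.rand(1000, 3),
    charges=np.random.rand(1000),
)
buf.seek(0)

# np.load on an .npz returns a lazy NpzFile; nothing is decompressed yet.
npz = np.load(buf)
print(sorted(npz.files))  # ['charges', 'positions']

# Only this __getitem__ call decompresses positions.npy;
# charges.npy is never touched.
positions = npz['positions']
print(positions.shape)    # (1000, 3)
```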

Note

The full load (no keys argument) is cached on the node object after the first call, so repeated access to .properties does not re-read the file.


Known limitation — no sliced or partial row access

It is not currently possible to read a subset of rows from a property array (e.g. the positions of only the first 100 atoms out of 1 000 000).

Root cause: DEFLATE is a streaming compression codec. To decompress byte offset N inside a compressed stream, every byte from offset 0 to N must be processed first. There is no random-access point in the middle of a compressed ZIP entry.

numpy.load(..., mmap_mode='r') does not help here — mmap_mode is silently ignored for .npz files. From the numpy source (npyio.py):

if magic.startswith(_ZIP_PREFIX) or magic.startswith(_ZIP_SUFFIX):
    # zip-file (assume .npz)
    ret = NpzFile(fid, ...)
    return ret          # mmap_mode is never consulted
elif magic == format.MAGIC_PREFIX:
    # .npy file
    if mmap_mode:       # only reached for bare .npy files
        return format.open_memmap(file, mode=mmap_mode, ...)

mmap_mode only works when loading a bare .npy file from disk (not from a stream and not from inside a ZIP).
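This behaviour is easy to confirm outside AiiDA (temporary files used purely for illustration):

```python
import tempfile
from pathlib import Path

import numpy as np

tmp = Path(tempfile.mkdtemp())
arr = np.arange(12.0).reshape(4, 3)

# Bare .npy on disk: mmap_mode is honoured and slicing is zero-copy.
np.save(tmp / 'positions.npy', arr)
mapped = np.load(tmp / 'positions.npy', mmap_mode='r')
print(isinstance(mapped, np.memmap))   # True
print(mapped[1:3].tolist())            # rows 1-2 only, no full read

# .npz archive: mmap_mode is silently ignored, a plain NpzFile comes back
# and every access decompresses the full entry into RAM.
np.savez_compressed(tmp / 'props.npz', positions=arr)
loaded = np.load(tmp / 'props.npz', mmap_mode='r')
print(isinstance(loaded['positions'], np.memmap))  # False
```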


Future improvements

Two approaches would enable true random / sliced access:

Option A — One .npy object per property

Store each array as a separate AiiDA repository object, e.g.:

node.base.repository.put_object_from_filelike(buf, 'positions.npy')
node.base.repository.put_object_from_filelike(buf, 'charges.npy')
# … one call per property

A .npy file holds exactly one array, so N properties → N repository objects (instead of the current single properties.npz).

Advantages:

  • Files stored directly on disk support np.load(..., mmap_mode='r'), which memory-maps the array and allows zero-copy slicing of any row range without loading the full file into RAM.

Disadvantages:

  • No compression → larger on-disk footprint.
  • N repository objects instead of 1 → more file-system entries and more AiiDA metadata overhead.

Option B — Zarr or HDF5 container

Replace properties.npz with a Zarr store or an HDF5 file (via h5py). Both formats:

  • Pack multiple arrays into one file.
  • Use chunked storage with selectable compression (gzip, Blosc, …).
  • Support direct chunk-level random access — reading row range [i:j] only decompresses the chunks that overlap that range.

This gives the best of both worlds: a single repository object, compression, and efficient partial reads.
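The chunk-level idea behind Zarr and HDF5 can be sketched with nothing but zlib: split the array into fixed-size row chunks, compress each chunk independently, and decompress only the chunks that overlap the requested range. This toy sketch illustrates the access pattern only; it is not either library's actual API or on-disk format:

```python
import zlib

import numpy as np

CHUNK_ROWS = 100  # rows per chunk (illustrative)

def write_chunked(arr):
    """Compress each CHUNK_ROWS-row block independently."""
    return [zlib.compress(arr[i:i + CHUNK_ROWS].tobytes())
            for i in range(0, len(arr), CHUNK_ROWS)]

def read_rows(chunks, dtype, ncols, start, stop):
    """Decompress only the chunks that overlap [start:stop)."""
    first, last = start // CHUNK_ROWS, (stop - 1) // CHUNK_ROWS
    parts = [np.frombuffer(zlib.decompress(chunks[c]),
                           dtype=dtype).reshape(-1, ncols)
             for c in range(first, last + 1)]
    block = np.concatenate(parts)
    offset = first * CHUNK_ROWS
    return block[start - offset:stop - offset]

positions = np.arange(3000.0).reshape(1000, 3)
chunks = write_chunked(positions)  # 10 independent compressed chunks

# Reading rows 150..250 decompresses only chunks 1 and 2, not all 10.
sub = read_rows(chunks, positions.dtype, 3, 150, 250)
print(np.array_equal(sub, positions[150:250]))  # True
```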

Format               Single file    Compressed    Sliced access
Current .npz         ✅             ✅ DEFLATE    ❌ streaming only
Per-property .npy    ❌ (N files)   ❌            ✅ via mmap_mode
Zarr / HDF5          ✅             ✅ chunked    ✅ chunk-level