# Storage Backends

This document explains how aiida-atomistic stores structure data in AiiDA's database and repository.

## Overview
`StructureData` stores properties in two locations:

- **Database attributes**: queryable metadata (formulas, statistics, small arrays)
- **Repository files**: large numeric arrays (positions, charges, magnetic moments)
> **Note**
> Storage location is determined at the code level, not at runtime.
> When you create a `StructureData`, the storage location for each property is already defined in the source code via metadata in `models.py`. You cannot choose where properties are stored when creating structures.
> Developers decide storage locations when adding new properties by setting this metadata in `models.py`.
## How Storage Works
Each property in `models.py` has metadata that determines its storage location:

```python
# From models.py - storage is defined here
class StructureBaseModel(BaseModel):

    # This property goes to the database
    cell: ArrayLike3x3 = Field(
        json_schema_extra={"store_in": "db"}
    )

    # This array goes to the repository
    @computed_field(json_schema_extra={"store_in": "repository", "singular_form": "position"})
    @property
    def positions(self) -> np.ndarray:
        return np.array([site.position for site in self.sites])
```
When you store a structure, the system applies these metadata settings automatically:

```python
structure = StructureData(
    cell=[[3, 0, 0], [0, 3, 0], [0, 0, 3]],
    pbc=[True, True, True],
    sites=[...]
)
structure.store()
# cell      → database   (as defined by store_in="db")
# positions → repository (as defined by store_in="repository")
```
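The dispatch step can be illustrated with a plain-Python sketch. The helper below is hypothetical (the real logic lives inside aiida-atomistic), but it shows the idea: walk each property's metadata, read `store_in`, and route the value to either the database-attributes dict or the repository payload.

```python
# Hypothetical sketch of metadata-driven storage dispatch.
# Field names and "store_in" values mirror this documentation;
# the actual aiida-atomistic implementation differs in detail.

DB_ALIASES = {"db", "attribute", "attributes"}
REPO_ALIASES = {"repository", "repo", "npz"}

# Per-field metadata, as declared via json_schema_extra in models.py.
FIELD_METADATA = {
    "cell": {"store_in": "db"},
    "pbc": {"store_in": "db"},
    "positions": {"store_in": "repository", "singular_form": "position"},
    "charges": {"store_in": "repository", "singular_form": "charge"},
}

def dispatch(properties: dict) -> tuple[dict, dict]:
    """Split properties into (database attributes, repository arrays)."""
    db_attrs, repo_arrays = {}, {}
    for name, value in properties.items():
        target = FIELD_METADATA.get(name, {}).get("store_in", "db")
        if target in REPO_ALIASES:
            repo_arrays[name] = value   # → properties.npz
        elif target in DB_ALIASES:
            db_attrs[name] = value      # → queryable attributes
        else:
            raise ValueError(f"Unknown store_in value: {target!r}")
    return db_attrs, repo_arrays
```

Because the routing table is fixed in the metadata, calling code never chooses a destination at runtime, which is exactly the behavior described above.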
## What Gets Stored Where
**Database attributes** (queryable):

- Global properties: `cell`, `pbc`, `tot_charge`, `tot_magnetization`
- Computed metadata: `formula`, `cell_volume`, `dimensionality`, `n_sites`
- Composition flags: `is_alloy`, `has_vacancies`
- Statistics: `max_charge`, `min_charge`, `max_magmom`, `min_magmom`, etc.
- Small arrays: `symbols`, `kind_names`

**Repository files** (`properties.npz`):

- Large numeric arrays: `positions`, `masses`, `charges`, `magmoms`, `magnetizations`, `weights`

**Not stored** (reconstructed on access):

- `kinds`: rebuilt from stored properties
- `sites` (internal): rebuilt from stored arrays
## Storage Metadata

### The `store_in` Key

Each field uses `json_schema_extra` metadata with a `store_in` key:

| `store_in` Value | Storage Location |
|---|---|
| `"db"`, `"attribute"`, `"attributes"` | Database attributes |
| `"repository"`, `"repo"`, `"npz"` | Repository `.npz` file |
### The `singular_form` Key (Required for Array Properties)

For array properties computed from the sites, `singular_form` maps the array back to the individual site field:

```python
@computed_field(
    json_schema_extra={
        "store_in": "repository",
        "singular_form": "charge"  # Maps the 'charges' array → 'charge' site property
    }
)
@property
def charges(self) -> np.ndarray:
    return np.array([site.charge for site in self.sites])
```
> **Note**
> Without `singular_form`, the system cannot reconstruct sites when loading from the database, causing a `KeyError` and making the data inaccessible.
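The reconstruction that `singular_form` enables can be sketched in plain Python. The helper and registry below are hypothetical stand-ins for the library's internals; they show how each plural array is unzipped back into per-site fields, and why a missing mapping surfaces as a `KeyError`.

```python
# Hypothetical sketch of site reconstruction via "singular_form".
# Mirrors the mapping described above; not the actual library code.

SINGULAR_FORMS = {
    "positions": "position",
    "charges": "charge",
    "magmoms": "magmom",
}

def rebuild_sites(arrays: dict) -> list[dict]:
    """Turn stored plural arrays back into a list of per-site dicts."""
    n_sites = len(next(iter(arrays.values())))
    sites = [{} for _ in range(n_sites)]
    for plural, values in arrays.items():
        try:
            singular = SINGULAR_FORMS[plural]
        except KeyError:
            # This is the failure mode the note above warns about:
            # without a singular_form the array cannot be mapped back.
            raise KeyError(f"No singular_form registered for {plural!r}")
        for site, value in zip(sites, values):
            site[singular] = value
    return sites
```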
## Querying Structures

### Database Properties Are Queryable

```python
from aiida.orm import QueryBuilder
from aiida_atomistic.data.structure import StructureData

qb = QueryBuilder()
qb.append(
    StructureData,
    filters={
        'attributes.formula': 'H2O',          # ✓ Queryable
        'attributes.cell_volume': {'<': 30},  # ✓ Queryable
        'attributes.max_charge': {'>': 1.0},  # ✓ Queryable
    }
)
```
### Repository Properties Are Not Queryable

Arrays in `.npz` files cannot be queried:

```python
qb.append(
    StructureData,
    filters={
        'attributes.positions': ...,  # ✗ Not queryable - stored in .npz
        'attributes.charges': ...,    # ✗ Not queryable - stored in .npz
    }
)
```
**Solution:** query the statistical properties stored in the database instead:

```python
qb.append(
    StructureData,
    filters={
        'attributes.max_charge': {'>': 1.0, '<=': 2.0},  # ✓ Works
        'attributes.n_sites': {'>': 10},                 # ✓ Works
    }
)
```
## Adding New Properties (Developer Guide)

When adding properties to aiida-atomistic, you decide where they are stored.

### Step 1: Add the Site Property

In `site.py`:

```python
class Site(BaseModel):
    my_property: t.Optional[float] = Field(
        default=None,
        json_schema_extra={"threshold": 1e-4, "default": 0.0}
    )
```
### Step 2: Add the Array Property with Storage Metadata

In `models.py`:

```python
@computed_field(
    json_schema_extra={
        "store_in": "repository",       # ← Choose storage location
        "singular_form": "my_property"  # ← Map array to site field
    }
)
@property
def my_properties(self) -> t.Optional[np.ndarray]:
    """Array of my_property values from all sites."""
    if all(site.my_property is None for site in self.sites):
        return None
    return np.array([
        site.my_property if site.my_property is not None
        else Site.get_default_values()['my_property']
        for site in self.sites
    ])
```
### Step 3: Add Statistics (Optional, for Querying)

```python
@computed_field(
    json_schema_extra={
        "store_in": "db",   # ← Store in the database for querying
        "statistic": "max"
    }
)
@property
def max_my_property(self) -> t.Optional[float]:
    if self.my_properties is None:
        return None
    return float(np.max(self.my_properties))
```
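A standalone sketch of the same pattern (placeholder names, independent of aiida-atomistic) shows why the `None` guard matters: when no site defines the property, both the array and the statistic propagate as `None` instead of raising.

```python
import numpy as np

# Standalone sketch of the optional-statistic pattern above.
# "my_property" is a placeholder name, as in the developer guide.

class SketchStructure:
    def __init__(self, site_values):
        self._site_values = site_values  # per-site my_property (may be None)

    @property
    def my_properties(self):
        """Plural array, or None when no site defines the property."""
        if all(v is None for v in self._site_values):
            return None
        default = 0.0  # matches the "default" metadata in Step 1
        return np.array([v if v is not None else default
                         for v in self._site_values])

    @property
    def max_my_property(self):
        """Database-stored statistic; None propagates cleanly."""
        if self.my_properties is None:
            return None
        return float(np.max(self.my_properties))
```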
## Choosing a Storage Location

Store in the **database** (`"db"`) when:

- the property is small (strings, numbers, flags)
- you need to query by this property
- it is metadata used for filtering and searching

Store in the **repository** (`"repository"`) when:

- the property is a large array
- you do not need to query the array contents
- storage efficiency matters

Example guidelines:
| Property Type | Size | Queryable? | Store In |
|---|---|---|---|
| Formula | String | Yes | Database |
| Cell volume | Float | Yes | Database |
| Statistics (max/min) | Float | Yes | Database |
| Positions array | N×3 floats | No | Repository |
| Charges array | N floats | No | Repository |
## Performance Considerations

**Database storage:**

- ✓ Fast queries via attribute filters
- ✗ The database grows with structure size

**Repository storage:**

- ✓ The database stays small
- ✓ Efficient for large structures
- ✗ Array contents cannot be queried

Recommendations:
| Structure Size | Notes |
|---|---|
| < 100 atoms | Database size usually fine |
| > 1000 atoms | Repository prevents database bloat |
| High-throughput | Repository scales better |
## Technical Details

### Storage Format

The repository uses NumPy's compressed `.npz` format:

```python
# Contents of properties.npz
{
    'positions': np.array([[0, 0, 0], [1, 1, 1]]),  # N×3 float64
    'charges': np.array([1.0, -1.0]),               # N float64
    'masses': np.array([1.008, 15.999])             # N float64
}
```
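The format itself is plain NumPy, so a round trip uses only the standard `savez_compressed`/`load` API (written to an in-memory buffer here instead of an actual repository file):

```python
import io
import numpy as np

# Round-trip through the compressed .npz format used by the repository.
payload = {
    "positions": np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]),
    "charges": np.array([1.0, -1.0]),
}

buffer = io.BytesIO()
np.savez_compressed(buffer, **payload)   # write properties.npz
buffer.seek(0)

with np.load(buffer) as data:            # NpzFile: arrays loaded on demand
    positions = data["positions"]
    charges = data["charges"]
```

Because `np.load` returns a lazy `NpzFile`, only the arrays that are actually accessed get decompressed, which keeps reads cheap for large structures.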
### Caching

Arrays are cached after the first load to avoid repeated file I/O.
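A minimal sketch of such a cache (hypothetical names; the actual implementation may differ) keeps loaded arrays in an instance dict so the `.npz` file is read at most once per array:

```python
# Minimal caching sketch: load each array once, then serve from memory.

class ArrayCache:
    def __init__(self, loader):
        self._loader = loader  # callable: name -> array (does the file I/O)
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._loader(name)  # first access: file I/O
        return self._cache[name]                    # later accesses: cached
```

With this pattern, repeated property accesses on a loaded structure cost a dictionary lookup rather than a repository read.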