Performance

This page documents the performance-oriented design decisions in the model: which tables are partitioned, why, and how partitions are created and maintained.


Partitioned Tables at a Glance

| Table | Strategy | Partition Key | Partition Granularity | Why |
|---|---|---|---|---|
| public.journal | Range | created_at | Monthly | High-volume audit log; drop old partitions instead of row-by-row DELETE |
| auth.user_event | Range | created_at | Monthly | Security audit trail; same retention pattern as journal |
| auth.resource_access | List | root_type | Per root resource type | Isolate grants per resource domain (project, folder, …); prune irrelevant domains at query time |
| auth.resource_role_assignment | List | root_type | Per root resource type | Mirrors resource_access; role grants for the same resource live next to flag grants |

Everything else in the schema is a regular (non-partitioned) table. The cases above were partitioned because they either grow unboundedly with activity (audit logs) or grow unboundedly per resource domain (per-resource ACL).


Time-Range Partitioning — public.journal and auth.user_event

Both audit tables use the same strategy:

-- public.journal (abridged DDL tail)
    primary key (journal_id, created_at)
) partition by range (created_at);

-- auth.user_event (abridged DDL tail)
    primary key (user_event_id, created_at)
) partition by range (created_at);

Defined in 014_tables_stage.sql:49-63 (journal) and 013_tables_auth.sql:363-384 (user_event).

Why range-by-created_at

  • Append-only write pattern. Every row is an audit record — writes hit only the newest partition, so Postgres keeps one small partition hot in cache while older partitions sit cold on disk.
  • Retention is a partition drop, not a DELETE. When old logs age out, the purge functions detach and drop the whole month's partition in one shot. No VACUUM bloat, no long-running transaction, no index fragmentation.
  • Queries are almost always time-bounded. "Last 24 hours", "this month", "around correlation-id X at timestamp Y" — the planner prunes to one or two partitions.

Monthly granularity

Partitions are named journal_YYYY_MM and user_event_YYYY_MM. One month per partition is a deliberate compromise:

  • Smaller (daily) → too many partitions, planner overhead on open-ended range queries.
  • Larger (yearly) → partitions get too big to drop comfortably, and retention gets coarse.

The model ships with 4–5 initial partitions (current month ± a few) created in the DDL itself (014_tables_stage.sql:90-119, 013_tables_auth.sql:586+).
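
Each monthly partition is a plain `partition of` child with half-open month bounds. A hypothetical example for January 2025, following the naming convention above:

```sql
-- Hypothetical month: bounds are half-open, [2025-01-01, 2025-02-01).
create table if not exists public.journal_2025_01
    partition of public.journal
    for values from ('2025-01-01') to ('2025-02-01');

create table if not exists auth.user_event_2025_01
    partition of auth.user_event
    for values from ('2025-01-01') to ('2025-02-01');
```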

Pre-creating future partitions

unsecure.ensure_audit_partitions(_months_ahead int default null), defined in 019_functions_unsecure.sql:2466-2515.

  • Reads const.sys_param (partition.months_ahead, default 3) to decide how many months to create.
  • Idempotent: skips months whose partition already exists.
  • Creates partitions for both journal and user_event in one call.
  • Called implicitly at the end of each purge run, and can be called on a schedule.
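
The same idempotent pattern can be sketched in a few lines (illustrative only; the real function additionally reads const.sys_param and covers user_event as well):

```sql
-- Sketch of the idempotent pre-creation loop; the real implementation is
-- unsecure.ensure_audit_partitions in 019_functions_unsecure.sql.
create or replace function ensure_partitions_sketch(_months_ahead int default 3)
returns void language plpgsql as $$
declare
    _month date;
begin
    for i in 0 .. _months_ahead loop
        _month := (date_trunc('month', now()) + make_interval(months => i))::date;
        -- "if not exists" is what makes the loop idempotent: existing months are skipped.
        execute format(
            $fmt$create table if not exists public.journal_%s
                 partition of public.journal
                 for values from (%L) to (%L)$fmt$,
            to_char(_month, 'YYYY_MM'), _month, _month + interval '1 month');
    end loop;
end $$;
```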

Retention / purge

unsecure.purge_journal(...) and unsecure.purge_user_events(...), defined in 019_functions_unsecure.sql:2517+ and :2577+.

  • Retention in days, read from const.sys_param (journal.retention_days, user_event.retention_days).
  • Enumerates partitions via pg_inherits, parses the date out of the partition name, and drops anything whose month starts before the cutoff.
  • Default partition gets a regular DELETE for rows that landed there unexpectedly (rare — see below).
  • Ends with perform unsecure.ensure_audit_partitions() so the future-window is always re-extended on every purge.
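
The enumerate-and-drop pattern looks roughly like this (simplified sketch with an assumed 90-day retention; the real purge functions read retention from const.sys_param, clean the default partition, and re-extend the future window):

```sql
do $$
declare
    _cutoff date := date_trunc('month', now() - interval '90 days')::date;  -- assumed retention
    _part   text;
begin
    for _part in
        select c.relname
        from pg_inherits i
        join pg_class c on c.oid = i.inhrelid
        where i.inhparent = 'public.journal'::regclass
          and c.relname ~ '^journal_\d{4}_\d{2}$'   -- skips journal_default
          and to_date(substring(c.relname from '\d{4}_\d{2}$'), 'YYYY_MM') < _cutoff
    loop
        execute format('alter table public.journal detach partition public.%I', _part);
        execute format('drop table public.%I', _part);
    end loop;
end $$;
```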

Default partition as safety net

Both tables have a _default child partition (journal_default, user_event_default). If an insert ever arrives with a created_at outside every defined monthly range (clock skew, backfill, misconfiguration), the row lands there instead of raising. Operations should watch _default for unexpected volume — nothing should end up there under normal use.
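
A quick operational check: row counts in both default partitions should stay at zero under normal use.

```sql
-- Non-zero counts here mean clock skew, backfill, or a partition-window gap.
select 'journal_default' as partition, count(*) as rows from public.journal_default
union all
select 'user_event_default', count(*) from auth.user_event_default;
```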

Primary key note

The PK on both tables is composite (id, created_at). Postgres requires the partition key to be present in every unique constraint on a partitioned table, so the surrogate id alone is not allowed. This is also why auth.token.user_event_id is a logical-only FK — see CHANGELOG.md:940. An enforced FK would have needed the full (user_event_id, created_at) pair on the referencing side, and on delete cascade would have cascaded token deletes every time an old event partition was dropped.


List Partitioning — auth.resource_access and auth.resource_role_assignment

Both resource-ACL tables partition by root resource type:

primary key (resource_access_id, root_type)
) partition by list (root_type);

primary key (resource_role_assignment_id, root_type)
) partition by list (root_type);

Defined in 034_tables_resource_access.sql:100-125 and 043_tables_resource_roles.sql:72-102.

What root_type is

The first ltree segment of resource_type, materialized as a plain text NOT NULL column so Postgres can route inserts to a partition without evaluating an expression. Derivation is split_part(resource_type, '.', 1).
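
The derivation is directly checkable in psql:

```sql
select split_part('project', '.', 1);            -- → project
select split_part('project.documents', '.', 1);  -- → project
select split_part('folder', '.', 1);             -- → folder
```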

| resource_type | root_type | Partition |
|---|---|---|
| project | project | resource_access_project |
| project.documents | project | resource_access_project |
| project.invoices | project | resource_access_project |
| folder | folder | resource_access_folder |

Hierarchical child types share their root's partition. A grant on project.documents physically lives next to all other project.* grants.

Why list-by-root_type

  • Physical locality per resource domain. An app that asks "does user X have read on this project?" scans only the project partition — never touches ACL rows for unrelated domains (folder, invoice, etc.).
  • Domain-scoped retention. Dropping a retired resource domain is a partition drop.
  • Independent growth. Different domains have very different grant-to-resource ratios; isolating them keeps index bloat in one domain from affecting others.

Partitioning by resource_type directly would fragment the hierarchy — project.documents would end up in a different partition from project, defeating cascade locality. Root-level partitioning keeps the hierarchy intact at the storage layer.
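
Pruning is visible in any plan that constrains root_type; a hypothetical query shape (the principal predicate is elided):

```sql
explain (costs off)
select access_flag
from auth.resource_access
where root_type     = 'project'            -- prunes to resource_access_project
  and resource_type = 'project.documents'; -- child types share the root's partition
```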

Partition creation

unsecure.ensure_resource_access_partition(_resource_type text), defined in 034_tables_resource_access.sql:175-194.

  • Called from auth.create_resource_type() / auth.ensure_resource_types() at resource-type registration time.
  • Creates both the resource_access_<root> and resource_role_assignment_<root> partitions in a single call if they don't yet exist.
  • Does nothing for child types — they're absorbed by the root's partition.
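
In effect, registering a root type such as project is equivalent to (sketch, names per the convention above):

```sql
create table if not exists auth.resource_access_project
    partition of auth.resource_access
    for values in ('project');

create table if not exists auth.resource_role_assignment_project
    partition of auth.resource_role_assignment
    for values in ('project');
```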

Default partition

Both tables have a _default partition to catch grants against unregistered root types. Registering a resource type with ensure_resource_types() at startup is the expected path; the default partition is just a safety net.

Index design

Every secondary index leads with root_type so the planner can both prune partitions and get locality within the chosen partition:

create unique index uq_ra_user_flag
    on auth.resource_access (root_type, resource_type, tenant_id, md5(resource_id::text), user_id, access_flag)
    where user_id is not null;

(See 034_tables_resource_access.sql:146-167 and 043_tables_resource_roles.sql:112-139 for the full set.)

Note the md5(resource_id::text) trick: resource_id is jsonb (composite keys), which can't back a btree unique constraint directly. Hashing to a fixed-width text value gives a traditional btree-unique while a separate GIN index (ix_ra_resource_id) handles containment queries (resource_id @> '{"project_id": 42}').
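
The two access paths side by side; casting through jsonb first canonicalizes key order and whitespace, so equal documents hash equally:

```sql
-- Uniqueness/equality path: fixed-width md5 over the canonical jsonb text.
select md5('{"project_id": 42}'::jsonb::text) as hashed_key;

-- Containment path: served by the GIN index ix_ra_resource_id instead.
select resource_access_id
from auth.resource_access
where root_type = 'project'
  and resource_id @> '{"project_id": 42}'::jsonb;
```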

Path-based rows (v3.1.0+)

For filesystem-style ACL, rows can carry a resource_path ext.ltree column. Two more index families support this:

  • ix_ra_resource_path — GiST index on resource_path (partial, where resource_path is not null). Backs the ancestor-walk query _target_path <@ resource_path used at check time. Partition pruning on root_type still gives locality within the relevant resource domain.
  • uq_ra_user_flag_path / uq_ra_group_flag_path — unique btree including resource_path (partial, where resource_path is not null) — prevents duplicate path grants.

The original md5-based unique indexes (uq_ra_user_flag / uq_ra_group_flag) are now scoped to where resource_path is null so composite-key rows and path-bearing rows live on different uniqueness axes without colliding (md5 of empty jsonb is constant and would otherwise cause spurious conflicts across different paths). Both auth.resource_role_assignment and auth.resource_access use the same pattern.
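
Sketched as a pair of partial unique indexes (the composite-key one mirrors uq_ra_user_flag shown above; the column order of the path variant is an assumption — the authoritative definitions are in 034_tables_resource_access.sql and 043_tables_resource_roles.sql):

```sql
-- Composite-key rows only: the md5 hash carries uniqueness, path rows excluded.
create unique index if not exists uq_ra_user_flag_sketch
    on auth.resource_access (root_type, resource_type, tenant_id,
                             md5(resource_id::text), user_id, access_flag)
    where user_id is not null and resource_path is null;

-- Path-bearing rows only: the ltree path carries uniqueness instead.
create unique index if not exists uq_ra_user_flag_path_sketch
    on auth.resource_access (root_type, tenant_id, resource_path,
                             user_id, access_flag)
    where user_id is not null and resource_path is not null;
```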

See Path Hierarchy for the check semantics.


Measured behavior at realistic scale

Numbers below come from the 999-examples-icons.sql harness in the source repo — the full Material Design Icons tree loaded into demo.fs_item (27,134 paths, mix of 12,979 folders and 14,155 files) with 13 users spread across direct-flag grants, group-flag grants, user-role grants, group-role grants, denies, and a hybrid role-plus-flag case. Run on a single developer laptop with default Postgres settings.

Single has_resource_access on a random leaf

| Scenario | Latency |
|---|---|
| Broad ancestor grant, hits on first ancestor | ~90–130 µs |
| Granted via specific subtree match | 300–600 µs |
| No matching grant — full walk across 5 check paths × 1 ancestor type | ~1.5–2 ms |
| Role-based grant, miss (extra join to resource_role_flag) | 600–750 µs |

Partition pruning is visible in EXPLAIN — the planner jumps straight to resource_access_fsitem, ignoring the default partition, and the uq_ra_user_flag_path btree is used instead of the GiST when the query constrains by user_id + flag (btree is more selective in that shape). GiST on resource_path shines in pure path-only queries without a principal filter.

Bulk filter_accessible_resources with 1000 random candidates

| Grant shape | Accessible | Wall time |
|---|---|---|
| Broad grant on root | 1000 / 1000 | ~2 ms |
| Root grant with 3 denies | 847 / 1000 | ~7 ms |
| 10 random second-level subtrees | 478 / 1000 | ~9 ms |
| Group role on one category | 122 / 1000 | ~5 ms |
| 50 random third-level folders | 18 / 1000 | ~10 ms |
| 100 leaf-only grants | 7 / 1000 | ~11 ms |
| No grants | 0 / 1000 | ~5 ms |

Per-candidate cost: 2–20 µs. The bulk filter does a correlated nested-loop against the ACL partition with per-row index seeks; because the partition is small (a few thousand grants), each seek is cheap.

Scaling projection to 1M items, 2000 users

Assuming roughly the same grant density per user as the demo (avg ~15 grants per user), at 2k users you'd expect:

  • ~30,000 rows in auth.resource_access_fsitem partition
  • ~2,000 rows in auth.resource_role_assignment_fsitem partition
  • ~20,000 rows with has_permissions = true on the domain table (heavily deduplicated by path)

Projected at that scale:

| Operation | Expected latency |
|---|---|
| has_resource_access, first-ancestor hit | 150–300 µs |
| has_resource_access, hit deeper in walk | 0.5–1 ms |
| has_resource_access, miss | 2–4 ms |
| filter_accessible_resources, 1000 candidates | 15–60 ms |
| filter_accessible_resources, typical UI call (10 candidates) | ~2 ms |
| Assign one grant (including sync trigger) | ~2 ms |

Nothing in the ACL check grows with fs_item size — only with grant count. Doubling your data volume doesn't slow the permission check; only adding more users or more grants does.

What to size for at 1M items

  • shared_buffers ≥ 1 GB — the domain table (~200 MB) plus its GiST path index (~50–80 MB) dominate. The ACL partitions themselves (a few MB) are noise.
  • Warm cache on deploy — first queries hit disk for the fs_item GiST index; expect ~100 ms for the very first call, then steady-state latencies.
  • Use prepared statements — planning time on partitioned tables is ~3 ms. At high throughput, unplanned queries waste most of their wall-clock on planning.
  • Paginate list results — get_user_accessible_resources with a broad ancestor grant can return hundreds of thousands of rows. Always LIMIT at the SQL layer.
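
For the prepared-statement point, the generic pattern looks like this. A direct ACL lookup is shown because the argument list of has_resource_access is not reproduced in this document; the path literal is hypothetical:

```sql
-- Plan once, execute many times; runtime partition pruning still applies.
prepare acl_path_lookup (text, ext.ltree) as
    select count(*)
    from auth.resource_access
    where root_type = $1
      and $2 <@ resource_path;   -- ancestor-walk shape described above

execute acl_path_lookup ('fsitem', 'icons.maps.pin'::ext.ltree);
deallocate acl_path_lookup;
```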

Where things would start to hurt

  • Very hot denies on a deeply-nested user (the deny check runs first in has_resource_access). If one user has thousands of denies, the initial exists-check slows. Rare in practice.
  • Fan-out via groups — if a typical user belongs to 50+ groups, user_group_id = any(cached_group_ids) grows the array operand. Still fast with btree, but double-digit group counts per user should be measured.
  • Index bloat on churn — if users are granted and revoked constantly (e.g., session-scoped access), btree indexes bloat over weeks. REINDEX CONCURRENTLY monthly resolves it.
  • Bulk grant operations — granting 10,000 paths at once fires the has_permissions sync trigger 10,000 times (~500 ms total). Acceptable for interactive use, avoid on the hot path.

Trigger-maintained domain flags

The has_permissions pattern — a boolean column on the app's domain table kept in sync with the ACL tables via trigger — is a display/audit denormalization, not a performance optimization for has_resource_access. The flag reflects direct entries only; inheritance still goes through the ancestor walk.

It costs ~50 µs per grant write (one index-backed UPDATE against the domain table via path = resource_path). Negligible for interactive grants. For bulk grant operations, plan on the full 10,000 × 50 µs = 500 ms of cumulative trigger time.
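
A sketch of that trigger pattern; all names here (demo.sync_has_permissions, the trigger name, the fs_item columns) are assumptions for illustration, not the shipped definitions:

```sql
-- One index-backed UPDATE against the domain table per ACL write,
-- matched via path = resource_path as described above.
create or replace function demo.sync_has_permissions() returns trigger
language plpgsql as $$
declare
    _path ext.ltree := case when tg_op = 'DELETE'
                            then old.resource_path
                            else new.resource_path end;
begin
    update demo.fs_item f
       set has_permissions = exists (
               select 1
               from auth.resource_access a
               where a.resource_path = f.path)
     where f.path = _path;
    return null;  -- AFTER trigger: return value is ignored
end $$;

create trigger trg_sync_has_permissions
    after insert or update or delete on auth.resource_access
    for each row execute function demo.sync_has_permissions();
```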

If check latency ever becomes a measured bottleneck, a more aggressive variant that tracks "this row or any ancestor has a grant" can short-circuit the ACL walk entirely — at the cost of O(descendants) writes per grant. Only consider this if profiling shows the ancestor walk is hot; it wasn't at 27k nodes or the projection to 1M.


Operational Checklist

  1. Schedule unsecure.ensure_audit_partitions() to run at least once per month (e.g., weekly via cron). It's idempotent and cheap.
  2. Schedule unsecure.purge_journal() and unsecure.purge_user_events() per the retention policy in const.sys_param. Each purge call also extends the future partition window.
  3. Monitor journal_default and user_event_default row counts. Non-zero values indicate clock skew or a misconfigured partition range.
  4. Monitor resource_access_default and resource_role_assignment_default row counts. Non-zero values indicate grants written for a resource type that was never registered via ensure_resource_types().
  5. Call ensure_resource_types() at app startup so all partitions exist before the first grant. Missing partitions route to _default, which defeats pruning.
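
If pg_cron is available (an assumption; any external scheduler works equally well), items 1 and 2 can live inside the database. Calls are shown without arguments; check the actual purge signatures in 019_functions_unsecure.sql:

```sql
-- Weekly partition pre-creation, monthly purges. Schedules are examples.
select cron.schedule('ensure-audit-partitions', '0 3 * * 0',
                     $$select unsecure.ensure_audit_partitions()$$);
select cron.schedule('purge-journal', '30 3 1 * *',
                     $$select unsecure.purge_journal()$$);
select cron.schedule('purge-user-events', '45 3 1 * *',
                     $$select unsecure.purge_user_events()$$);
```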

Not partitioned, but worth knowing about in the same context:

  • auth.user_permission_cache — per-(user, tenant) cached permission set. TTL via const.sys_param. Soft invalidation on permission/assignment changes (expiration_date := now()), hard invalidation on user disable/lock/delete. See Permission Model.
  • auth.user_group_id_cache — per-(user, tenant) cached group membership ids. Same TTL + invalidation pattern as the permission cache. Used by has_resource_access / role checks to avoid recomputing group membership on every call.
  • unsecure.clear_permission_cache(_deleted_by text, ...) — manual cache bust; the first argument is required (audit attribution).

See Resource-Level ACL for the check-time cascade logic that these partitions and caches support.