Concepts and Data Structures (Legacy)

Note

This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

This guide covers the core concepts behind phyloframe’s data model: the alife standard format, supplemental tree data structures, and the relationship between Pandas and Polars implementations.

The Alife Standard Format

Phyloframe stores phylogenies as DataFrames in the alife standard format, originally developed for the Artificial Life community. Each row represents a single organism or taxon.

Core Columns

id (int)

A unique, non-negative integer identifying this organism. In working format, IDs are contiguous and equal to row indices (i.e., id == row_number).

ancestor_list (str)

A JSON-encoded list of ancestor IDs. For asexual phylogenies: "[42]" (single ancestor) or "[None]" (root). For sexual phylogenies: "[3, 7]" (multiple ancestors).

ancestor_id (int)

An optimized representation for asexual phylogenies. Stores the single ancestor’s ID directly as an integer, avoiding repeated string parsing. Root nodes store their own ID: ancestor_id == id.

Note

The ancestor_list column is part of the original alife data standard. The ancestor_id column is an unofficial extension introduced by phyloframe for efficient asexual phylogeny operations. Use alifestd_try_add_ancestor_id_col to add it automatically.

Note

The alife data standard specifies ancestor_list as a string-encoded JSON list. A known defect in the standard is that root representations are ambiguous: "[None]", "[none]", and "[]" are all used to denote roots. The string encoding also incurs parsing overhead on every access. Phyloframe’s ancestor_id column avoids both issues and is the recommended representation for asexual phylogenies.

Root Representation

Roots are organisms with no ancestor.

  • ancestor_list: "[None]", "[none]", or "[]"

  • ancestor_id: equal to the organism’s own id

import pandas as pd

# Root at id=0, child at id=1
df = pd.DataFrame({
    "id": [0, 1],
    "ancestor_list": ["[None]", "[0]"],
})
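As a rough illustration of how the root spellings above map onto ancestor_id, the sketch below parses ancestor_list strings by hand. The helper name parse_ancestor_id is hypothetical and not part of phyloframe; in practice, alifestd_try_add_ancestor_id_col adds the column for you.

```python
def parse_ancestor_id(own_id: int, ancestor_list: str) -> int:
    """Hypothetical helper: derive ancestor_id from one ancestor_list string."""
    inner = ancestor_list.strip()[1:-1].strip()  # drop the surrounding brackets
    if inner.lower() in ("", "none"):  # "[]", "[None]", "[none]" all denote a root
        return own_id  # roots store their own id
    parts = [part.strip() for part in inner.split(",")]
    if len(parts) != 1:
        raise ValueError("more than one ancestor: not an asexual phylogeny")
    return int(parts[0])


# Root at id=0 with two children
ids = [0, 1, 2]
ancestor_lists = ["[None]", "[0]", "[0]"]
ancestor_ids = [parse_ancestor_id(i, a) for i, a in zip(ids, ancestor_lists)]
print(ancestor_ids)  # [0, 0, 0]
```

Every access to ancestor_list would need this kind of parsing, which is the overhead the integer ancestor_id column avoids.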

Working Format

Many operations run fastest when the DataFrame is in working format, satisfying three properties:

  1. Topologically sorted: every ancestor appears in an earlier row than its descendants, so iterating through rows in order always reaches a node’s parent before the node itself. This enables single-pass algorithms that process the tree from root to leaves (or, iterating in reverse, from leaves to root).

  2. Contiguous IDs: id values are 0, 1, 2, ..., matching row indices.

  3. ancestor_id column present: enables direct integer indexing instead of string parsing.
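The properties above can be checked in one pass. The following is a minimal sketch on plain Python lists standing in for the id and ancestor_id columns; the function name is_working_format is hypothetical, not a phyloframe function. Once IDs equal row indices, topological order reduces to ancestor_id <= id for every row, with equality only at roots.

```python
def is_working_format(ids, ancestor_ids):
    """Illustrative check of contiguity and topological order on plain lists."""
    for row, (id_, ancestor_id) in enumerate(zip(ids, ancestor_ids)):
        if id_ != row:  # contiguous IDs: id must equal the row index
            return False
        if ancestor_id > id_:  # topological order: parent row must not come later
            return False
    return True


print(is_working_format([0, 1, 2], [0, 0, 1]))  # True
print(is_working_format([0, 1, 2], [0, 2, 0]))  # False: node 1's parent is in a later row
print(is_working_format([0, 2, 1], [0, 0, 0]))  # False: ids do not match row indices
```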

Note

Polars implementations require data in working format.

Convert to working format with a single call:

from phyloframe import legacy as pfl

df = pfl.alifestd_to_working_format(df)

This applies alifestd_try_add_ancestor_id_col, alifestd_topological_sort, and alifestd_assign_contiguous_ids as needed.
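To make the sort-and-renumber idea concrete, here is a minimal sketch on plain lists. It assumes ancestor_id values are already available, and the function name to_working_format is hypothetical. It orders rows breadth-first from the roots, which is one valid topological order; the row order phyloframe actually produces may differ.

```python
from collections import deque


def to_working_format(ids, ancestor_ids):
    """Illustrative only: topologically sort rows, then renumber ids 0..n-1."""
    parent_of = dict(zip(ids, ancestor_ids))
    children_of = {id_: [] for id_ in ids}
    roots = []
    for id_, parent in parent_of.items():
        if parent == id_:  # roots store their own id
            roots.append(id_)
        else:
            children_of[parent].append(id_)

    # Breadth-first traversal guarantees each parent precedes its children.
    order = []
    queue = deque(roots)
    while queue:
        id_ = queue.popleft()
        order.append(id_)
        queue.extend(children_of[id_])

    # Renumber so that each id equals its new row index.
    new_id = {old: new for new, old in enumerate(order)}
    new_ids = list(range(len(order)))
    new_ancestor_ids = [new_id[parent_of[old]] for old in order]
    return new_ids, new_ancestor_ids


# 20 is the root, 10 is its child, 30 is a child of 10
new_ids, new_ancestor_ids = to_working_format([10, 30, 20], [20, 10, 20])
print(new_ids)  # [0, 1, 2]
print(new_ancestor_ids)  # [0, 0, 1]
```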

In working format, ancestor_id values can be used directly as array indices. This enables efficient NumPy and JIT-compiled operations:

import numpy as np

ancestor_ids = df["ancestor_id"].values
# ancestor_ids[i] gives the row index of node i's parent
# For root nodes, ancestor_ids[i] == i
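As a sketch of the single-pass style this enables, the example below computes node depths in a forward pass and strict descendant counts in a reverse pass. A plain list stands in for the ancestor_id column, and the variable names are illustrative only; whether the library's own columns count a node among its descendants is not assumed here.

```python
# Node 0 is the root; 1 and 2 are its children; 3 and 4 are children of 1.
ancestor_ids = [0, 0, 0, 1, 1]

# Forward pass: a parent's depth is always known before its children are reached.
node_depth = [0] * len(ancestor_ids)
for i, parent in enumerate(ancestor_ids):
    if parent != i:  # skip roots, which point at themselves
        node_depth[i] = node_depth[parent] + 1

# Reverse pass: a node's count is complete before it is added to its parent.
strict_descendants = [0] * len(ancestor_ids)
for i in reversed(range(len(ancestor_ids))):
    parent = ancestor_ids[i]
    if parent != i:
        strict_descendants[parent] += strict_descendants[i] + 1

print(node_depth)  # [0, 1, 1, 2, 2]
print(strict_descendants)  # [4, 2, 0, 0, 0]
```

Both loops touch each row once and need no recursion, which is why the same pattern carries over to NumPy arrays and JIT-compiled loops.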

Types of Phylogeny Data

Phyloframe focuses on asexual phylogenies. Partial support for sexual phylogenies is provided for compatibility with the alife data standard.

Asexual phylogenies have at most one ancestor per organism, so the ancestor_id column can represent the tree structure efficiently. Most phyloframe operations (especially those with the _asexual suffix) target this mode.

Sexual phylogenies (pedigrees) allow multiple ancestors per organism. In Pandas, these have fewer optimized operations available.

# Check mode
pfl.alifestd_is_asexual(df)  # True if all entries have <= 1 ancestor
pfl.alifestd_is_sexual(df)  # True if any entry has > 1 ancestor
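The distinction comes down to counting entries in each ancestor_list. A minimal sketch of that rule on plain strings follows; count_ancestors is a hypothetical helper, not the library's implementation.

```python
def count_ancestors(ancestor_list: str) -> int:
    """Hypothetical helper: number of ancestors named in one ancestor_list string."""
    inner = ancestor_list.strip()[1:-1].strip()  # drop the surrounding brackets
    if inner.lower() in ("", "none"):  # root spellings carry no ancestor
        return 0
    return len(inner.split(","))


asexual_lists = ["[None]", "[0]", "[0]"]
sexual_lists = ["[None]", "[None]", "[0, 1]"]

print(all(count_ancestors(a) <= 1 for a in asexual_lists))  # True: asexual
print(any(count_ancestors(a) > 1 for a in sexual_lists))  # True: sexual
```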

Note

Polars implementations exclusively support asexual phylogenies.

Function Naming Conventions

Phyloframe function names follow consistent suffix patterns:

  • No suffix (e.g., alifestd_mark_leaves): works with both asexual and sexual phylogenies in Pandas.

  • _asexual (e.g., alifestd_mark_node_depth_asexual): optimized for asexual phylogenies using ancestor_id. Raises an error or produces incorrect results if called on sexual phylogenies.

  • _polars (e.g., alifestd_mark_leaves_polars): Polars-based. Requires an asexual phylogeny with topological sorting and contiguous IDs.

When both an unsuffixed and an _asexual version exist, prefer the _asexual version for asexual phylogenies, as it will often be faster.

Validation

Use alifestd_validate to check that a DataFrame conforms to the alife standard format. It returns True if valid or False if problems are detected, issuing warnings describing each issue found:

from phyloframe import legacy as pfl

df = pfl.alifestd_from_newick("((A,B),(C,D));")
pfl.alifestd_validate(df)  # True

Topological Sensitivity

When topology-altering operations (e.g., pruning, collapsing, rerooting) modify the tree structure, previously computed columns like node_depth, branch_length, or num_descendants may become stale. Phyloframe’s topological sensitivity system detects this and warns you.

Operations that alter topology are decorated to automatically emit a warning listing any topology-dependent columns found in the DataFrame:

UserWarning: alifestd_collapse_unifurcations performs delete/update
operations that do not update topology-dependent columns, which may
be invalidated: ['node_depth', 'branch_length']. ...
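To illustrate the general pattern, and not phyloframe's actual implementation, the sketch below shows how a decorator might detect topology-dependent columns and warn before running an operation. All names here (SENSITIVE_COLUMNS, warns_topological_sensitivity, drop_last_row) are hypothetical, and a dict of lists stands in for a DataFrame.

```python
import functools
import warnings

# Hypothetical registry of columns that depend on tree topology.
SENSITIVE_COLUMNS = {"node_depth", "branch_length", "num_descendants"}


def warns_topological_sensitivity(func):
    """Hypothetical decorator: warn if the input has topology-dependent columns."""

    @functools.wraps(func)
    def wrapper(table, *args, **kwargs):
        stale = sorted(SENSITIVE_COLUMNS & set(table))  # set(table) yields column names
        if stale:
            warnings.warn(
                f"{func.__name__} may invalidate columns: {stale}",
                UserWarning,
            )
        return func(table, *args, **kwargs)

    return wrapper


@warns_topological_sensitivity
def drop_last_row(table):
    """Toy topology-altering operation on a dict-of-lists table."""
    return {name: values[:-1] for name, values in table.items()}


table = {"id": [0, 1], "ancestor_id": [0, 0], "node_depth": [0, 1]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning for inspection
    result = drop_last_row(table)

print(caught[0].message)
```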

To handle this:

  1. Drop sensitive columns before the operation using alifestd_drop_topological_sensitivity or alifestd_drop_topological_sensitivity_polars, then recompute them afterward.

  2. Pass drop_topological_sensitivity=True to the operation itself, which automatically drops topology-dependent columns as part of the call.

  3. Suppress the warning by passing ignore_topological_sensitivity=True to the operation, or by setting the HSTRAT_ALIFESTD_WARN_TOPOLOGICAL_SENSITIVITY_SUPPRESS environment variable.

from phyloframe import legacy as pfl

df = pfl.alifestd_from_newick("((A,B),(C,D));")
df = pfl.alifestd_mark_node_depth_asexual(df)

# Option 1: drop topology-dependent columns explicitly
df = pfl.alifestd_drop_topological_sensitivity(df)
df = pfl.alifestd_collapse_unifurcations(df)

# Recompute as needed
df = pfl.alifestd_mark_node_depth_asexual(df)

Option 2 lets the operation drop them automatically:

df = pfl.alifestd_from_newick("((A,B),(C,D));")
df = pfl.alifestd_mark_node_depth_asexual(df)

df = pfl.alifestd_collapse_unifurcations(
    df, drop_topological_sensitivity=True,
)

Supplemental Data Structures

For algorithms that need to navigate the tree beyond the parent pointer (ancestor_id), phyloframe provides two supplemental representations that can be added as columns. These structures optimize certain tree operations, and they will typically be automatically generated as needed.

CSR (Compressed Sparse Row)

The CSR format represents the parent-to-children mapping as two flat arrays, enabling O(1) lookup of any node’s children:

csr_offsets (array of int)

csr_offsets[i] is the index in csr_children where node i’s children begin.

csr_children (array of int)

A flat array of all child IDs, grouped by parent. Node i’s children are csr_children[csr_offsets[i] : csr_offsets[i] + num_children[i]].

df = pfl.alifestd_from_newick("((A,B),(C,D));")

# Add CSR columns
df = pfl.alifestd_mark_num_children_asexual(df)
df = pfl.alifestd_mark_csr_offsets_asexual(df)
df = pfl.alifestd_mark_csr_children_asexual(df)

# Access node 0's children
offsets = df["csr_offsets"].values
children = df["csr_children"].values
num_children = df["num_children"].values

node = 0
node_children = children[offsets[node]:offsets[node] + num_children[node]]

The CSR format is used internally by traversal algorithms (preorder, postorder) and distance matrix computation.
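For intuition about how these arrays relate to ancestor_id, here is a minimal sketch that builds them from a working-format parent array using plain lists. It illustrates the layout described above, not the library's implementation; in particular, how phyloframe sizes or pads the csr_children column is not assumed here.

```python
# Node 0 is the root; 1 and 2 are its children; 3 and 4 are children of 1.
ancestor_ids = [0, 0, 0, 1, 1]
n = len(ancestor_ids)

# Count children per node, skipping roots (which point at themselves).
num_children = [0] * n
for i, parent in enumerate(ancestor_ids):
    if parent != i:
        num_children[parent] += 1

# Exclusive prefix sum: where each node's block of children begins.
csr_offsets = [0] * n
for i in range(1, n):
    csr_offsets[i] = csr_offsets[i - 1] + num_children[i - 1]

# Fill each node's block, tracking the next free slot per parent.
csr_children = [0] * sum(num_children)
cursor = list(csr_offsets)
for i, parent in enumerate(ancestor_ids):
    if parent != i:
        csr_children[cursor[parent]] = i
        cursor[parent] += 1


def children_of(node):
    """Slice out one node's children from the flat array."""
    start = csr_offsets[node]
    return csr_children[start:start + num_children[node]]


print(csr_offsets)  # [0, 2, 4, 4, 4]
print(csr_children)  # [1, 2, 3, 4]
print(children_of(1))  # [3, 4]
```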

First-Child/Next-Sibling Linked List

An alternative child-navigation structure uses two integer columns to form a linked list:

first_child_id (int)

The smallest-ID child of this node, or the node’s own ID if it is a leaf.

next_sibling_id (int)

The next sibling (by ID order) sharing the same parent, or the node’s own ID if there is no next sibling.

df = pfl.alifestd_from_newick("((A,B),(C,D));")

# Add linked list columns
df = pfl.alifestd_mark_first_child_id_asexual(df)
df = pfl.alifestd_mark_next_sibling_id_asexual(df)

# Walk children of a node
first_child = df["first_child_id"].values
next_sibling = df["next_sibling_id"].values

def iter_children(node_id):
    """Iterate over children of node_id."""
    child = first_child[node_id]
    if child == node_id:  # leaf node, no children
        return
    while True:
        yield child
        nxt = next_sibling[child]
        if nxt == child:  # no more siblings
            break
        child = nxt

This representation uses less memory than CSR for sparse trees and is used by some traversal algorithms internally.
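The two columns can be derived from ancestor_id in a single forward pass. The sketch below does so on plain lists, following the conventions stated above (smallest-ID child first, own ID as the "none" sentinel); it is an illustration under those assumptions, not the library's code, and last_child is a hypothetical scratch array.

```python
# Node 0 is the root; 1 and 2 are its children; 3 and 4 are children of 1.
ancestor_ids = [0, 0, 0, 1, 1]
n = len(ancestor_ids)

first_child_id = list(range(n))  # default: own id, meaning "leaf"
next_sibling_id = list(range(n))  # default: own id, meaning "no next sibling"
last_child = [None] * n  # most recently linked child of each node

# Visiting rows in ascending id order links siblings in id order.
for i, parent in enumerate(ancestor_ids):
    if parent == i:  # roots have no parent to link under
        continue
    if last_child[parent] is None:
        first_child_id[parent] = i  # first child seen is the smallest-id child
    else:
        next_sibling_id[last_child[parent]] = i
    last_child[parent] = i

print(first_child_id)  # [1, 3, 2, 3, 4]
print(next_sibling_id)  # [0, 2, 2, 4, 4]
```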

Pandas vs. Polars

Phyloframe provides dual implementations for many operations:

  • Pandas — the default, supporting both asexual and sexual phylogenies.

  • Polars — available via operations with a _polars suffix (e.g., alifestd_mark_leaves_polars).

Polars Usage

Use the _polars-suffixed functions for Polars DataFrames:

import polars as pl
from phyloframe import legacy as pfl

df_pl = pfl.alifestd_from_newick_polars("((A,B),(C,D));")
df_pl = pfl.alifestd_mark_leaves_polars(df_pl)

Polars Restrictions

Polars implementations are more restrictive than Pandas:

  • Asexual only: sexual phylogenies are not supported.

  • Topological sortedness required: data must be sorted before use.

  • Contiguous IDs required: IDs must equal row indices.

  • No mutate parameter: Polars DataFrames are immutable by design.

When to Prefer Polars

  • Working with large trees (millions of nodes).

  • Query optimization via Polars’ lazy evaluation engine (predicate pushdown, projection pushdown, and other automatic rewrites).

  • Multithreaded operations.

  • CLI pipelines (the CLI is Polars-based, so using _polars entrypoints avoids conversion overhead).

User-extensible Columns

Because the underlying representation is a standard DataFrame, you can freely add custom columns for your analysis:

import numpy as np
import pandas as pd
from phyloframe import legacy as pfl

df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
df = pfl.alifestd_mark_node_depth_asexual(df)

# Add trait data
df["fitness"] = np.random.random(len(df))

# Filter using standard DataFrame operations
deep = df[df["node_depth"] > 1]

# Group, aggregate, join --- all standard operations work
clade_stats = df.groupby("node_depth").agg(
    count=("id", "count"),
    mean_fitness=("fitness", "mean"),
)

This extensibility is a key advantage of the DataFrame-based approach.