Tree Traversal (Legacy)

Note

This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

This guide covers phyloframe’s tree traversal operations and the supplemental data structures that enable them.

Traversal Orders

Phyloframe provides functions that return node IDs in standard traversal orders as NumPy arrays.

Preorder (Depth-first, Parent Before Children)

Visit each node before its descendants:

from phyloframe import legacy as pfl
import numpy as np

df = pfl.alifestd_from_newick("((A,B),(C,D));")

order = pfl.alifestd_unfurl_traversal_preorder_asexual(df)
# Returns np.ndarray of node IDs in preorder

Postorder (Depth-first, Children Before Parent)

Visit each node after all its descendants:

order = pfl.alifestd_unfurl_traversal_postorder_asexual(df)

Inorder (Left, Root, Right)

For binary trees, visit left subtree, then node, then right subtree:

order = pfl.alifestd_unfurl_traversal_inorder_asexual(df)

Level-order (Breadth-first)

Visit nodes level by level from root to leaves:

order = pfl.alifestd_unfurl_traversal_levelorder_asexual(df)

Topological Order

Visit nodes in topological order (ancestors before descendants). For data in working format, this is simply a forward iteration along rows and can be much faster than other traversal orderings:

order = pfl.alifestd_unfurl_traversal_topological_asexual(df)

Semiorder

A hybrid traversal that interleaves preorder and postorder visits:

order = pfl.alifestd_unfurl_traversal_semiorder_asexual(df)

Lineage Unfurling

Trace the lineage from a given node back to the root:

# Get the path from node 3 to the root
lineage_ids = pfl.alifestd_unfurl_lineage_asexual(df, 3)

Supplemental Data Structures for Traversal

The alife standard format stores only parent pointers (ancestor_id). Traversal algorithms need to navigate from parents to children. Phyloframe provides two supplemental structures for this purpose.

CSR (Compressed Sparse Row) Representation

The CSR format represents the parent-to-children mapping as two flat arrays. This is the most common structure used internally by traversal and distance algorithms.

df = pfl.alifestd_pipe_unary_ops(
    pfl.alifestd_from_newick("((A,B),(C,D));"),
    pfl.alifestd_mark_num_children_asexual,
    pfl.alifestd_mark_csr_offsets_asexual,
    pfl.alifestd_mark_csr_children_asexual,
)

# Access children of any node in O(1)
offsets = df["csr_offsets"].values
children = df["csr_children"].values
num_children = df["num_children"].values

def get_children(node_id):
    start = offsets[node_id]
    return children[start:start + num_children[node_id]]

How CSR works:

Consider a tree with nodes {0, 1, 2, 3, 4} where node 0 has children {1, 2} and node 1 has children {3, 4}:

csr_offsets:  [0, 2, 4, 4, 4]
csr_children: [1, 2, 3, 4]
num_children: [2, 2, 0, 0, 0]

Node 0's children: csr_children[0:0+2] = [1, 2]
Node 1's children: csr_children[2:2+2] = [3, 4]
Node 2's children: csr_children[4:4+0] = []  (leaf)

First-Child/Next-Sibling Linked List

An alternative structure that uses two integer columns to form a linked list through the tree. This uses less memory and avoids constructing auxiliary arrays.

df = pfl.alifestd_pipe_unary_ops(
    pfl.alifestd_from_newick("((A,B),(C,D));"),
    pfl.alifestd_mark_first_child_id_asexual,
    pfl.alifestd_mark_next_sibling_id_asexual,
)

first_child = df["first_child_id"].values
next_sibling = df["next_sibling_id"].values

def get_children_linked(node_id):
    """Walk the linked list of children."""
    result = []
    child = first_child[node_id]
    if child == node_id:  # leaf, no children
        return result
    while True:
        result.append(child)
        nxt = next_sibling[child]
        if nxt == child:  # no more siblings
            break
        child = nxt
    return result

Sentinel convention: a node points to itself when there is no first child (leaf) or no next sibling (last sibling).

How it works:

For the same tree as above:

first_child:  [1, 3, -, -, -]   (node 0 -> child 1, node 1 -> child 3)
next_sibling: [-, 2, -, 4, -]   (node 1 -> sibling 2, node 3 -> sibling 4)

('-' means self-reference, i.e., no child/sibling)

Choosing Between CSR and Linked List

CSR is better for algorithms that need random access to any node’s children (e.g., distance matrix computation).
Linked list is better for sequential traversals (e.g., DFS) where you visit children in order and want to minimize memory.

In practice, the traversal functions choose automatically based on what columns are already available.

Using Traversals for Custom Algorithms

import numpy as np
from phyloframe._auxlib import jit
from phyloframe import legacy as pfl

@jit(nopython=True, cache=False)
def sum_subtree_values(
    ancestor_ids: np.ndarray,
    values: np.ndarray,
) -> np.ndarray:
    """Compute sum of values in each node's subtree.

    Processes nodes in reverse order (postorder for topologically
    sorted, contiguous-ID data).
    """
    n = len(ancestor_ids)
    subtree_sums = values.copy()
    # Reverse iteration is postorder for topologically sorted data
    for i in range(n - 1, -1, -1):
        parent = ancestor_ids[i]
        if parent != i:  # not a root
            subtree_sums[parent] += subtree_sums[i]
    return subtree_sums

df = pfl.alifestd_make_balanced_bifurcating(depth=4)
df = pfl.alifestd_to_working_format(df)
df["value"] = np.ones(len(df))

sums = sum_subtree_values(
    df["ancestor_id"].values, df["value"].values,
)