Performance (Legacy)

Note

This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

This guide covers techniques for getting the best performance out of phyloframe: JIT compilation, Polars, working format, and writing custom high-performance operations.

Working Format

Converting to working format is the single most impactful optimization. Many operations have fast paths that activate only when the data is topologically sorted with contiguous IDs:

from phyloframe import legacy as pfl

df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

Convert once, then chain operations. In working format, ancestor_id values can be used directly as array indices.

Mutation for Pipeline Performance

Use mutate=True in Pandas-based pipelines to avoid unnecessary copies:

df = (
    df.pipe(pfl.alifestd_to_working_format, mutate=True)
    .pipe(pfl.alifestd_mark_leaves, mutate=True)
    .pipe(pfl.alifestd_mark_node_depth_asexual, mutate=True)
    .pipe(pfl.alifestd_mark_num_descendants_asexual, mutate=True)
)

JIT Compilation with Numba

Install with JIT support:

python3 -m pip install "phyloframe[jit]==0.10.0"

Many phyloframe operations automatically use JIT-compiled fast paths when Numba is available. No code changes are needed — just install the [jit] extra.

Writing Custom JIT Functions

Use phyloframe’s jit utility for custom native-speed operations:

import numpy as np
from phyloframe._auxlib import jit
from phyloframe import legacy as pfl

@jit(nopython=True, cache=False)
def count_deep_nodes(
    ancestor_ids: np.ndarray, threshold: int,
) -> int:
    """Count nodes deeper than `threshold`.

    Requires contiguous IDs and topological sorting.
    """
    n = len(ancestor_ids)
    depths = np.zeros(n, dtype=np.int64)
    count = 0
    for i in range(n):
        if ancestor_ids[i] != i:  # not a root
            depths[i] = depths[ancestor_ids[i]] + 1
        if depths[i] > threshold:
            count += 1
    return count

# Use with phyloframe data in working format
df = pfl.alifestd_make_balanced_bifurcating(depth=10)
df = pfl.alifestd_to_working_format(df)
n_deep = count_deep_nodes(df["ancestor_id"].values, threshold=5)

Note that in practice, this particular operation is more idiomatically done by marking node depths and then filtering with a DataFrame operation:

df = pfl.alifestd_mark_node_depth_asexual(df)
n_deep = (df["node_depth"] > 5).sum()

The JIT approach is most useful for custom algorithms that cannot be expressed as combinations of existing phyloframe operations.

The jit decorator:

  • Uses Numba when available for native-speed compilation.

  • Falls back to pure Python if Numba is not installed.

  • Automatically disables compilation during coverage runs.

  • Enables Numba’s function caching by default (pass cache=False to disable, e.g., for short-lived or dynamically defined functions).

Common Pattern: Array-based Algorithms

Working format enables a common pattern: extract NumPy arrays from the DataFrame, process with JIT-compiled functions, and store results back:

@jit(nopython=True, cache=False)
def compute_subtree_sizes(ancestor_ids: np.ndarray) -> np.ndarray:
    """Count nodes in each subtree (including self)."""
    n = len(ancestor_ids)
    sizes = np.ones(n, dtype=np.int64)
    # Reverse iteration is postorder for topologically sorted data
    for i in range(n - 1, -1, -1):
        parent = ancestor_ids[i]
        if parent != i:
            sizes[parent] += sizes[i]
    return sizes

df["subtree_size"] = compute_subtree_sizes(df["ancestor_id"].values)

Polars for Large-scale Data

Polars can outperform Pandas for large phylogenies due to multithreading, lazy evaluation, and memory-efficient representation.

Note

Set the POLARS_MAX_THREADS environment variable to control the number of threads Polars uses. Set POLARS_ENGINE_AFFINITY or use pl.Config.set_engine_affinity() to control the query engine optimization strategy. Polars streaming mode can also be enabled for larger-than-memory datasets via pl.Config.set_streaming_chunk_size().

import polars as pl
from phyloframe import legacy as pfl

# Direct Polars construction
df_pl = pfl.alifestd_from_newick_polars("((A,B),(C,D));")

# Use _polars suffixed functions for Polars DataFrames
df_pl = pfl.alifestd_mark_leaves_polars(df_pl)

Polars restrictions:

  • Asexual phylogenies only.

  • Requires data in working format (topological sortedness and contiguous IDs).

  • No mutate parameter (Polars DataFrames are immutable).

CLI Performance

For CLI pipelines on Parquet data, prefer Polars entrypoints to eliminate conversion overhead:

# Slower: converts Parquet -> Pandas -> process -> Pandas -> Parquet
python3 -m phyloframe.legacy._alifestd_mark_leaves "output.pqt" < "input.pqt"

# Faster: native Polars processing
python3 -m phyloframe.legacy._alifestd_mark_leaves_polars \
    "output.pqt" < "input.pqt"