Quickstart

This guide walks through the basics of phyloframe: creating phylogenies, inspecting tree structure, marking properties, transforming trees, and exporting results.

Installation

Install phyloframe with JIT acceleration (recommended):

python3 -m pip install "phyloframe[jit]==0.10.0"

Omit [jit] if you do not need Numba-based just-in-time compilation:

python3 -m pip install "phyloframe==0.10.0"

Import Convention

from phyloframe import legacy as pfl

The legacy module contains all current phyloframe operations. As phyloframe evolves, legacy will continue to be maintained for backward compatibility while new API designs are developed.

The Official Alife Standard Format

Phyloframe represents phylogenies as DataFrames in the alife standard format. Each row represents an organism (or taxon).

Required Columns

id (int)

Unique, non-negative identifier for this organism.

ancestor_list (str)

JSON-encoded list of ancestor IDs. For asexual phylogenies, this is a single-element list like "[0]". Root nodes use "[None]", "[none]", or "[]".

Note

The ambiguity of root representations ("[None]" vs "[none]" vs "[]") is a known defect in the current alife data standard. Neither None nor none is valid JSON, which spells the null value as null. The string encoding also incurs parsing overhead on every access. Phyloframe’s ancestor_id column avoids these issues.
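
To make the parsing cost and the root ambiguity concrete, here is a minimal sketch of a tolerant parser using only the standard library. The helper name parse_ancestor_list is hypothetical and is not part of phyloframe; it only illustrates the normalization that a consumer of ancestor_list has to perform.

```python
import json


def parse_ancestor_list(encoded):
    """Return a list of integer ancestor ids; an empty list marks a root.

    Hypothetical helper for illustration, not a phyloframe function.
    """
    # "[None]" and "[none]" are not valid JSON, so rewrite them as "[null]".
    normalized = encoded.strip().replace("None", "null").replace("none", "null")
    parsed = json.loads(normalized)
    # Drop null placeholders so that every root spelling yields [].
    return [int(item) for item in parsed if item is not None]


roots = [parse_ancestor_list(s) for s in ("[None]", "[none]", "[]")]
single = parse_ancestor_list("[0]")
pair = parse_ancestor_list("[3, 7]")
print(roots, single, pair)  # [[], [], []] [0] [3, 7]
```

Every row pays for this string handling on each access, which is the overhead an integer ancestor_id column sidesteps.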

Optional Columns (Official Standard)

origin_time (numeric)

Time at which this organism originated.

destruction_time (numeric)

Time at which this organism was destroyed or went extinct.

taxon_label (str)

Human-readable label or species name for this organism.

See the alife data standards specification for the full list of official optional columns.

Unofficial Extension: ancestor_id

ancestor_id (int)

Direct ancestor ID for asexual phylogenies. This is an optimized integer representation of ancestor_list that avoids repeated string parsing. Root nodes store their own ID as ancestor_id.

All phyloframe operations on asexual trees support ancestor_id in place of ancestor_list. Using ancestor_id is recommended unless interoperability with other alife standard ecosystem tools is needed. Use alifestd_try_add_ancestor_list_col to generate ancestor_list on demand when required:

df = pfl.alifestd_try_add_ancestor_list_col(df)

Additional user-defined columns (e.g., trait data, fitness values) can be freely added — the DataFrame is yours to extend.

Representing Roots

Root nodes have no ancestor. In ancestor_list, this is represented as "[None]", "[none]", or "[]". In ancestor_id, roots store their own ID (i.e., ancestor_id == id).

Example

import pandas as pd

# A simple three-node chain: root -> internal -> leaf
phylogeny_df = pd.DataFrame({
    "id": [0, 1, 2],
    "ancestor_list": ["[None]", "[0]", "[1]"],
})

This represents:

0 (root)
+-- 1 (internal)
    +-- 2 (leaf)
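
The same chain can also be written in ancestor_id form, where the root points to itself. The sketch below, in plain Python, derives ancestor_list strings from ancestor_id values. The function name is hypothetical, and the choice of "[None]" as the root token is an assumption made to match the example above; the library's own conversion may emit a different root spelling.

```python
def make_ancestor_list(ids, ancestor_ids, root_token="[None]"):
    """Build ancestor_list strings from ancestor_id values.

    Hypothetical helper for illustration; root_token is an assumed default.
    """
    return [
        root_token if ancestor == id_ else f"[{ancestor}]"
        for id_, ancestor in zip(ids, ancestor_ids)
    ]


ids = [0, 1, 2]
ancestor_ids = [0, 0, 1]  # root (id 0) stores its own id
ancestor_lists = make_ancestor_list(ids, ancestor_ids)
print(ancestor_lists)  # ['[None]', '[0]', '[1]']
```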

Asexual vs. Sexual Phylogenies

Asexual phylogenies have at most one ancestor per organism (i.e., single-element ancestor_list). Most phyloframe operations target asexual phylogenies, where the ancestor_id column enables fast integer-based lookups.

Sexual phylogenies allow multiple ancestors (e.g., "[3, 7]"). Sexual phylogeny support is limited to operations that work with ancestor_list directly (primarily in Pandas).

# Check phylogeny type
pfl.alifestd_is_asexual(phylogeny_df)  # True
pfl.alifestd_is_sexual(phylogeny_df)  # False
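
As an illustration of the distinction, the following sketch classifies a phylogeny by counting the ancestors in each ancestor_list entry. It uses only the standard library, and the helper names are hypothetical; it is not how phyloframe implements these checks.

```python
import json


def count_ancestors(encoded):
    """Count non-null ancestors in one ancestor_list string."""
    normalized = encoded.replace("None", "null").replace("none", "null")
    return sum(item is not None for item in json.loads(normalized))


def is_asexual(ancestor_lists):
    """True if no organism has more than one ancestor."""
    return all(count_ancestors(entry) <= 1 for entry in ancestor_lists)


chain = ["[None]", "[0]", "[1]"]
two_parents = ["[None]", "[None]", "[0, 1]"]
print(is_asexual(chain), is_asexual(two_parents))  # True False
```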

Creating Phylogenies

From Scratch

# Empty phylogeny
empty_df = pfl.alifestd_make_empty()

# Empty phylogeny with ancestor_id column
empty_df = pfl.alifestd_make_empty(ancestor_id=True)

Synthetic Trees

# Balanced bifurcating tree
#   depth=1 -> 1 node (root only)
#   depth=3 -> 7 nodes (4 leaves)
#   depth=n -> 2^n - 1 nodes, 2^(n-1) leaves
balanced_df = pfl.alifestd_make_balanced_bifurcating(depth=3)

# Comb (caterpillar) tree
comb_df = pfl.alifestd_make_comb(n_leaves=10)
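
The node and leaf counts quoted in the comments above can be checked with a short sketch. It builds ancestor ids for a full binary tree in heap order, which is an assumed layout chosen for simplicity; the id ordering that alifestd_make_balanced_bifurcating actually produces may differ.

```python
def balanced_bifurcating_ancestor_ids(depth):
    """Ancestor ids for a full binary tree with `depth` levels (heap order)."""
    num_nodes = 2**depth - 1
    # Node i's parent is (i - 1) // 2; the root (node 0) points to itself.
    return [0 if i == 0 else (i - 1) // 2 for i in range(num_nodes)]


def count_leaves(ancestor_ids):
    """Count nodes that never appear as another node's ancestor."""
    has_child = {a for i, a in enumerate(ancestor_ids) if a != i}
    return len(ancestor_ids) - len(has_child)


for depth in (1, 3, 5):
    tree = balanced_bifurcating_ancestor_ids(depth)
    assert len(tree) == 2**depth - 1
    assert count_leaves(tree) == 2 ** (depth - 1)

depth3 = balanced_bifurcating_ancestor_ids(3)
print(depth3)  # [0, 0, 0, 1, 1, 2, 2]
```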

From Newick Format

# Parse a Newick string
df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

# The result includes columns: id, ancestor_id, taxon_label,
# origin_time_delta, and branch_length
print(df.columns.tolist())

Working Format

Many phyloframe operations run fastest when the DataFrame is in working format:

  1. Topologically sorted: ancestors appear before descendants.

  2. Contiguous IDs: each organism’s id equals its row number.

  3. ancestor_id column: integer ancestor reference (asexual only).

Convert to working format once, then chain operations:

df = pfl.alifestd_make_balanced_bifurcating(depth=3)
df = pfl.alifestd_to_working_format(df)

# Verify properties
assert pfl.alifestd_is_topologically_sorted(df)
assert pfl.alifestd_has_contiguous_ids(df)
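
To show what the first two working-format properties mean in practice, here is a sketch of both checks over plain Python lists. The function names mirror the phyloframe ones but are independent stand-ins, written under the assumption that roots are encoded as ancestor_id == id.

```python
def has_contiguous_ids(ids):
    """True if each id equals its row number."""
    return all(id_ == row for row, id_ in enumerate(ids))


def is_topologically_sorted(ids, ancestor_ids):
    """True if every non-root ancestor appears in an earlier row."""
    seen = set()
    for id_, ancestor in zip(ids, ancestor_ids):
        if ancestor != id_ and ancestor not in seen:
            return False
        seen.add(id_)
    return True


# Chain 0 -> 1 -> 2 in working format
sorted_ok = is_topologically_sorted([0, 1, 2], [0, 0, 1])
contiguous_ok = has_contiguous_ids([0, 1, 2])

# Same chain with the leaf listed first
sorted_bad = is_topologically_sorted([2, 0, 1], [1, 0, 0])
contiguous_bad = has_contiguous_ids([2, 0, 1])
print(sorted_ok, contiguous_ok, sorted_bad, contiguous_bad)
# True True False False
```

When both properties hold, an ancestor_id doubles as a row index, so parent lookups become plain array indexing.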

Marking Properties

“Mark” functions add computed columns to a phylogeny DataFrame. The original data is preserved; a new column is appended.

df = pfl.alifestd_pipe_unary_ops(
    pfl.alifestd_from_newick("((A,B),(C,D));"),
    pfl.alifestd_mark_leaves,  # leaf detection
    pfl.alifestd_mark_node_depth_asexual,  # depth from root
    pfl.alifestd_mark_num_descendants_asexual,  # descendant count
    pfl.alifestd_mark_num_children_asexual,  # direct children count
    pfl.alifestd_mark_roots,  # root detection
)

print(df[["id", "ancestor_id", "is_leaf", "node_depth",
          "num_descendants", "num_children", "is_root"]])

Custom Column Names

All mark functions accept a mark_as parameter to customize the output column name:

df = pfl.alifestd_mark_leaves(df, mark_as="is_tip")
df = pfl.alifestd_mark_node_depth_asexual(df, mark_as="depth")

Counting and Querying

df = pfl.alifestd_from_newick("((A,B),(C,D));")

pfl.alifestd_count_leaf_nodes(df)  # 4
pfl.alifestd_count_inner_nodes(df)  # 3
pfl.alifestd_count_root_nodes(df)  # 1

pfl.alifestd_is_asexual(df)  # True
pfl.alifestd_is_topologically_sorted(df)  # True/False
pfl.alifestd_has_contiguous_ids(df)  # True/False

# Validate format compliance
pfl.alifestd_validate(df)  # True

Tree Transformations

df = pfl.alifestd_pipe_unary_ops(
    pfl.alifestd_from_newick("((A,B),(C,D));"),
    pfl.alifestd_collapse_unifurcations,  # remove single-child nodes
    pfl.alifestd_splay_polytomies,  # expand polytomies into bifurcations
    pfl.alifestd_add_global_root,  # add synthetic root above all roots
    pfl.alifestd_join_roots,  # join multiple roots to oldest root
)
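
As a rough picture of what collapsing unifurcations involves, the following sketch removes non-root nodes that have exactly one child and reattaches their children to the nearest retained ancestor. It assumes working format and keeps roots unconditionally; both the function and that root-handling choice are illustrative assumptions, not a description of the phyloframe implementation.

```python
def collapse_unifurcations(ancestor_ids):
    """Map each retained node id to its nearest retained ancestor id."""
    n = len(ancestor_ids)
    num_children = [0] * n
    for node, ancestor in enumerate(ancestor_ids):
        if ancestor != node:
            num_children[ancestor] += 1

    nearest_kept = [0] * n  # nearest retained node at or above each node
    collapsed = {}
    for node, ancestor in enumerate(ancestor_ids):
        is_root = ancestor == node
        if is_root or num_children[node] != 1:
            nearest_kept[node] = node
            collapsed[node] = node if is_root else nearest_kept[ancestor]
        else:
            # Single-child node: skip it and pass through its ancestor.
            nearest_kept[node] = nearest_kept[ancestor]
    return collapsed


# 0 -> 1 -> 2, where node 2 has two leaf children (3 and 4)
result = collapse_unifurcations([0, 0, 1, 2, 2])
print(result)  # {0: 0, 2: 0, 3: 2, 4: 2}
```

Node 1 is dropped because it has a single child, and node 2 is reattached directly to the root. The surviving ids are no longer contiguous, so a real pipeline would typically reassign them afterward.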

Composed Example: Downsampling with Combined Masks

A common workflow: select tips using multiple sampling criteria, combine them with boolean OR, and prune extinct lineages.

import numpy as np
import pandas as pd
from phyloframe import legacy as pfl

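# Create a tree with origin times computed from branch length deltas.
# The loop below assumes rows are topologically sorted with contiguous ids,
# so each ancestor_id is also the row index of an earlier row.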
df = pfl.alifestd_from_newick(
    "((A:1,B:2):3,(C:4,(D:5,E:6):7):8);",
)
ancestor_ids = df["ancestor_id"].values
deltas = df["origin_time_delta"].fillna(0).values
origin_time = np.zeros(len(df))
for i in range(len(df)):
    parent = ancestor_ids[i]
    if parent != i:
        origin_time[i] = origin_time[parent] + deltas[i]
df["origin_time"] = origin_time

# Strategy 1: keep the most recent tips (canopy sampling)
df = pfl.alifestd_mark_sample_tips_canopy_asexual(
    df, n_sample=2, mark_as="keep_canopy",
)

# Strategy 2: keep tips closest to a focal lineage
df = pfl.alifestd_mark_sample_tips_lineage_asexual(
    df, n_sample=2, mark_as="keep_lineage",
)

# Combine masks with boolean OR: keep tips matching either criterion
df["extant"] = df["keep_canopy"] | df["keep_lineage"]

# Prune lineages without any extant descendants
pruned_df = pfl.alifestd_prune_extinct_lineages_asexual(df)
print(pruned_df[["id", "ancestor_id"]])

The alifestd_mark_sample_tips_* functions add boolean columns indicating which tips to retain. Combining masks with | (OR), & (AND), or ~ (NOT) gives full control over tip selection. The alifestd_prune_extinct_lineages_asexual function then removes any lineages that have no descendants marked as extant via the "extant" column (configurable with the criterion parameter).
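
The pruning step can be pictured as a single reverse pass that propagates the extant flag from tips toward the root. The sketch below uses plain lists, an assumed working-format id layout, and hand-written masks in place of the sampling functions; it illustrates the idea and is not the library's code.

```python
def prune_extinct_lineages(ancestor_ids, extant):
    """Return ids of nodes that are extant or have an extant descendant."""
    keep = list(extant)
    # Children follow parents in working format, so walk rows in reverse.
    for node in reversed(range(len(ancestor_ids))):
        if keep[node]:
            keep[ancestor_ids[node]] = True
    return [node for node, flag in enumerate(keep) if flag]


# Assumed layout: 0 root, 1 and 2 inner nodes, 3..6 leaves
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
keep_canopy = [False, False, False, True, False, False, False]
keep_lineage = [False, False, False, False, True, False, False]

# Boolean OR of the two masks, element by element
extant = [a or b for a, b in zip(keep_canopy, keep_lineage)]
kept = prune_extinct_lineages(ancestor_ids, extant)
print(kept)  # [0, 1, 3, 4]
```

Node 2 and its leaves are removed because none of them is marked extant, while node 1 and the root survive as ancestors of the retained tips.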

Newick I/O

# Parse Newick
df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

# Export to Newick
newick_str = pfl.alifestd_as_newick_asexual(df)

# Use taxon labels from a column
newick_str = pfl.alifestd_as_newick_asexual(df, taxon_label="taxon_label")
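
For readers unfamiliar with the format, here is a minimal sketch of how a Newick string relates to ancestor ids. It omits branch lengths, uses recursion (so very deep trees would hit Python's recursion limit), and assumes a particular id layout; the function is a hypothetical stand-in, not alifestd_as_newick_asexual.

```python
def as_newick(ancestor_ids, labels):
    """Serialize an asexual tree to Newick, one line per root."""
    children = {node: [] for node in range(len(ancestor_ids))}
    roots = []
    for node, ancestor in enumerate(ancestor_ids):
        if ancestor == node:
            roots.append(node)
        else:
            children[ancestor].append(node)

    def render(node):
        if not children[node]:
            return labels[node]
        inner = ",".join(render(child) for child in children[node])
        return f"({inner}){labels[node]}"

    return "\n".join(render(root) + ";" for root in roots)


newick = as_newick(
    [0, 0, 0, 1, 1, 2, 2],
    ["", "", "", "A", "B", "C", "D"],  # inner nodes are unlabeled
)
print(newick)  # ((A,B),(C,D));
```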

CSV and Parquet I/O

Because phyloframe uses standard DataFrames, loading and saving are straightforward:

import pandas as pd

# CSV round-trip
df.to_csv("phylogeny.csv", index=False)
df = pd.read_csv("phylogeny.csv")

# Parquet round-trip (recommended for large trees)
df.to_parquet("phylogeny.pqt")
df = pd.read_parquet("phylogeny.pqt")

# Polars Parquet
import polars as pl
df_polars = pl.read_parquet("phylogeny.pqt")

Mutation Semantics

By default, operations return a new DataFrame without modifying the input:

original = df.copy()
result = pfl.alifestd_mark_leaves(df)
assert original.equals(df)  # input unchanged

Set mutate=True to allow in-place modification for better performance in pipelines. Even with mutate=True, always use the return value:

# Faster: allows reuse of input memory
df = pfl.alifestd_mark_leaves(df, mutate=True)
df = pfl.alifestd_mark_node_depth_asexual(df, mutate=True)
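
The contract can be illustrated with a toy stand-in that stores columns in a dict of lists. The function mark_is_root below is hypothetical. In this sketch mutate=True returns the same object that was passed in, but that is a property of the sketch only; phyloframe operations make no such promise, which is why the return value should always be used.

```python
def mark_is_root(table, mutate=False):
    """Toy mark function; `table` maps column names to lists of values."""
    if not mutate:
        # Copy every column so the caller's table stays untouched.
        table = {name: list(values) for name, values in table.items()}
    table["is_root"] = [
        ancestor == id_
        for id_, ancestor in zip(table["id"], table["ancestor_id"])
    ]
    return table


original = {"id": [0, 1, 2], "ancestor_id": [0, 0, 1]}

copied = mark_is_root(original)
assert "is_root" not in original  # default: input unchanged

mutated = mark_is_root(original, mutate=True)
assert "is_root" in original  # opt-in: input may be modified
print(mutated["is_root"])  # [True, False, False]
```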

Piping Operations

Pandas provides DataFrame.pipe() for chaining operations idiomatically:

result = (
    df.pipe(pfl.alifestd_collapse_unifurcations)
    .pipe(pfl.alifestd_mark_leaves)
    .pipe(pfl.alifestd_mark_node_depth_asexual)
)

Polars DataFrames also support .pipe():

import polars as pl

df_pl = pfl.alifestd_from_newick_polars("((A,B),(C,D));")
result_pl = (
    df_pl.pipe(pfl.alifestd_mark_leaves_polars)
)

Alternatively, alifestd_pipe_unary_ops accepts multiple operations:

result = pfl.alifestd_pipe_unary_ops(
    df,
    pfl.alifestd_collapse_unifurcations,
    pfl.alifestd_mark_leaves,
    lambda df: pfl.alifestd_mark_node_depth_asexual(df, mark_as="depth"),
)

Use tqdm for progress feedback on long pipelines:

from tqdm import tqdm

result = pfl.alifestd_pipe_unary_ops(
    df,
    pfl.alifestd_collapse_unifurcations,
    pfl.alifestd_mark_leaves,
    progress_wrap=tqdm,
)
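
Conceptually, a unary-op pipeline is a left fold: each operation receives the previous result. The sketch below shows that idea with a hypothetical pipe_unary_ops that wraps the sequence of operations in progress_wrap. Whether alifestd_pipe_unary_ops applies progress_wrap in exactly this way is an assumption.

```python
def pipe_unary_ops(value, *ops, progress_wrap=lambda ops: ops):
    """Apply each single-argument operation in order and return the result."""
    for op in progress_wrap(ops):
        value = op(value)
    return value


calls = []


def logging_wrap(ops):
    """Stand-in for tqdm: record each operation as it is reached."""
    for op in ops:
        calls.append(op.__name__)
        yield op


def add_one(x):
    return x + 1


def double(x):
    return x * 2


result = pipe_unary_ops(3, add_one, double, progress_wrap=logging_wrap)
print(result, calls)  # 8 ['add_one', 'double']
```

Operations that need extra arguments can be adapted with a lambda or functools.partial, as in the mark_as example above.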

Next Steps