Creating Phylogenies (Legacy)

Note

This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

This guide covers the different ways to create phylogeny DataFrames in phyloframe.

Empty Phylogenies

alifestd_make_empty creates a zero-row DataFrame with the correct column names and dtypes (id as int, ancestor_list as str, and optionally ancestor_id as int). This ensures downstream operations receive properly typed input:

from phyloframe import legacy as pfl

# Minimal empty DataFrame (id and ancestor_list columns)
df = pfl.alifestd_make_empty()

# With ancestor_id column pre-created
df = pfl.alifestd_make_empty(ancestor_id=True)

# Polars version (ancestor_id=True by default)
df_polars = pfl.alifestd_make_empty_polars()

From Scratch with Pandas

Build a phylogeny by constructing a DataFrame directly:

import pandas as pd

# A simple tree:
#       0 (root)
#      / \
#     1   2
#    / \
#   3   4
phylogeny_df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
})

# Validate the format
assert pfl.alifestd_validate(phylogeny_df)

You can include additional columns at creation time:

phylogeny_df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
    "origin_time": [0, 10, 10, 20, 20],
    "taxon_label": ["root", "A", "B", "C", "D"],
})
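
Each ancestor_list entry is a string that holds a one-element list: the parent's id, or None for a root. The sketch below decodes those strings with plain Python lists that mirror the columns above. The parse_ancestor_list helper is hypothetical (it is not part of phyloframe) and assumes the "[None]" spelling used in this guide:

```python
import ast

# Columns mirroring the five-node example tree (plain lists, no pandas needed)
ids = [0, 1, 2, 3, 4]
ancestor_lists = ["[None]", "[0]", "[0]", "[1]", "[1]"]


def parse_ancestor_list(text):
    """Hypothetical helper: return the parent id, or None for a root."""
    parsed = ast.literal_eval(text)  # "[0]" -> [0], "[None]" -> [None]
    if len(parsed) != 1:
        raise ValueError(f"expected exactly one entry, got {text!r}")
    return parsed[0]


# Map each node id to its parent id
parents = {i: parse_ancestor_list(s) for i, s in zip(ids, ancestor_lists)}

# Roots have no parent; leaves never appear as anyone's parent
roots = [i for i, p in parents.items() if p is None]
children = {i: [c for c, p in parents.items() if p == i] for i in ids}
leaves = [i for i in ids if not children[i]]

print(roots)   # [0]
print(leaves)  # [2, 3, 4]
```

This is only meant to make the string encoding concrete; for real checks, rely on pfl.alifestd_validate as shown earlier.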

Synthetic Trees

Balanced Bifurcating Trees

Create perfectly balanced binary trees by specifying depth:

# depth=0 -> empty tree
# depth=1 -> 1 node (root only)
# depth=2 -> 3 nodes (1 root, 2 leaves)
# depth=3 -> 7 nodes (3 internal, 4 leaves)
# depth=n (n >= 1) -> 2^n - 1 total nodes, 2^(n-1) leaves

df = pfl.alifestd_make_balanced_bifurcating(depth=4)  # 15 nodes
print(f"Nodes: {len(df)}")
print(f"Leaves: {pfl.alifestd_count_leaf_nodes(df)}")
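
The node counts in the comments above follow directly from the depth convention. As a quick sanity check, this standalone sketch reproduces the arithmetic without calling phyloframe; balanced_counts is a hypothetical helper name, not a library function:

```python
def balanced_counts(depth):
    """Return (total_nodes, leaf_nodes) under the depth convention above."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth == 0:
        return 0, 0  # depth=0 is the empty tree
    # A full binary tree with `depth` levels: 2^depth - 1 nodes in total,
    # of which the bottom level (2^(depth-1) nodes) are leaves.
    return 2**depth - 1, 2 ** (depth - 1)


for depth in range(5):
    total, leaves = balanced_counts(depth)
    print(f"depth={depth}: {total} nodes, {leaves} leaves")
```

Comparing these numbers against len(df) and pfl.alifestd_count_leaf_nodes(df) is a cheap way to confirm that a generated tree has the expected size.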

Comb (Caterpillar) Trees

Create maximally unbalanced trees by specifying the number of leaves:

df = pfl.alifestd_make_comb(n_leaves=10)
print(f"Nodes: {len(df)}")
print(f"Leaves: {pfl.alifestd_count_leaf_nodes(df)}")
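
To picture the comb shape, the sketch below builds one possible layout in plain Python: a spine where every internal node has one leaf child and one child that continues the spine. The comb_rows helper, its id ordering, and the use of None for the root's parent are all illustrative assumptions; the rows returned by alifestd_make_comb may be numbered or encoded differently.

```python
def comb_rows(n_leaves):
    """Hypothetical helper: (id, parent_id) pairs for a bifurcating comb."""
    if n_leaves < 1:
        return []
    rows = [(0, None)]  # root; None marks "no parent" in this sketch
    spine = 0           # node that will receive the next pair of children
    next_id = 1
    for _ in range(n_leaves - 1):
        leaf, inner = next_id, next_id + 1
        rows.append((leaf, spine))   # leaf hanging off the spine
        rows.append((inner, spine))  # continuation of the spine
        spine = inner
        next_id += 2
    return rows


rows = comb_rows(10)
parents = {parent for _, parent in rows if parent is not None}
leaves = [node for node, _ in rows if node not in parents]
print(len(rows), len(leaves))  # 19 10
```

In this layout a comb with n leaves has 2n - 1 nodes, since each added leaf also adds one spine node.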

Parsing Newick Format

Parse Newick-format strings into phyloframe DataFrames:

# Simple topology
df = pfl.alifestd_from_newick("((A,B),(C,D));")

# With branch lengths
df = pfl.alifestd_from_newick("((A:1.0,B:2.0):3.0,(C:4.0,D:5.0):6.0);")

# Parsed columns include:
#   id, ancestor_id, taxon_label, origin_time_delta, branch_length
print(df.columns.tolist())
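
To show how nested parentheses turn into rows, here is a deliberately minimal parser written in plain Python. It is a sketch, not the library's implementation: newick_rows is a hypothetical helper, ids are assigned in order of encounter, the root's parent is recorded as None, and quoting, comments, and malformed input are not handled. The output of alifestd_from_newick may differ on all of these points.

```python
def newick_rows(newick):
    """Hypothetical helper: parse simple Newick text into per-node dicts."""
    rows = []      # one dict per node; list index doubles as id
    stack = []     # ids of internal nodes whose ")" has not been seen yet
    closed = None  # id of an internal node just closed, awaiting label/length
    token = ""     # characters of the label[:length] currently being read

    def add_node():
        rows.append({
            "id": len(rows),
            "ancestor_id": stack[-1] if stack else None,
            "taxon_label": "",
            "branch_length": None,
        })
        return rows[-1]

    for char in newick.strip():
        if char == "(":
            stack.append(add_node()["id"])  # open a new internal node
        elif char in ",);":
            # The pending token belongs either to the internal node that was
            # just closed or to a new leaf under the current open node.
            if closed is not None:
                node = rows[closed]
            elif token or char != ";":
                node = add_node()
            else:
                node = None
            if node is not None:
                label, _, length = token.partition(":")
                node["taxon_label"] = label.strip()
                node["branch_length"] = float(length) if length else None
            token = ""
            closed = stack.pop() if char == ")" else None
        else:
            token += char
    return rows


for row in newick_rows("((A:1.0,B:2.0):3.0,(C:4.0,D:5.0):6.0);"):
    print(row)
```

Each "(" opens an internal node, each label creates a leaf under the innermost open node, and text following ")" is attached to the node that was just closed.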

Polars Newick Parsing

df_polars = pfl.alifestd_from_newick_polars("((A,B),(C,D));")

Options

# Integer branch lengths
df = pfl.alifestd_from_newick(
    "((A:1,B:2):3,(C:4,D:5):6);",
    branch_length_dtype=int,
)

# Include ancestor_list column (slower, for compatibility)
df = pfl.alifestd_from_newick(
    "((A,B),(C,D));",
    create_ancestor_list=True,
)

Loading from Files

Because phyloframe works on ordinary pandas and Polars DataFrames, loading from files uses those libraries' standard readers:

import pandas as pd

# From CSV
df = pd.read_csv("phylogeny.csv")

# From Parquet (recommended for large trees)
df = pd.read_parquet("phylogeny.pqt")

# From a URL
df = pd.read_csv("https://example.com/data/phylogeny.csv")

# From cloud storage (pandas needs the optional s3fs package for s3:// paths)
df = pd.read_parquet("s3://bucket/phylogeny.pqt")

import polars as pl

# Polars Parquet (efficient columnar reads)
df = pl.read_parquet("phylogeny.pqt")

# Selective column loading with Parquet
df = pl.read_parquet("phylogeny.pqt", columns=["id", "ancestor_id"])

After loading, convert to working format for analysis:

df = pfl.alifestd_to_working_format(df)