Creating Phylogenies (Legacy)
Note
This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.
This guide covers the different ways to create phylogeny DataFrames in phyloframe.
Empty Phylogenies
alifestd_make_empty creates a zero-row DataFrame with the correct column names and dtypes (id as int, ancestor_list as str, and optionally ancestor_id as int).
This ensures downstream operations receive properly typed input:
from phyloframe import legacy as pfl
# Minimal empty DataFrame (id and ancestor_list columns)
df = pfl.alifestd_make_empty()
# With ancestor_id column pre-created
df = pfl.alifestd_make_empty(ancestor_id=True)
# Polars version (ancestor_id=True by default)
df_polars = pfl.alifestd_make_empty_polars()
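To see why the dtypes matter, the sketch below compares an empty DataFrame built from column names alone with one whose columns are typed up front. It uses plain pandas and is only an illustration of the idea, not the implementation of alifestd_make_empty; the variable names are hypothetical.

```python
import pandas as pd

# An empty DataFrame built from column names alone carries no useful dtypes.
untyped = pd.DataFrame(columns=["id", "ancestor_list"])

# Declaring dtypes up front keeps "id" an integer column even with zero rows.
typed = pd.DataFrame(
    {
        "id": pd.Series(dtype=int),
        "ancestor_list": pd.Series(dtype=str),
    }
)

print(untyped.dtypes)
print(typed.dtypes)
```

The exact dtype reported for the string column depends on the installed pandas version; the point is that the id column is an integer column in the typed frame and not in the untyped one.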
From Scratch with Pandas
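Build a phylogeny by constructing a DataFrame directly. Each row is one node: id is the node's integer identifier, and ancestor_list holds the parent's id as a bracketed string, with "[None]" marking the root: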
import pandas as pd
# A simple tree:
#       0 (root)
#      / \
#     1   2
#    / \
#   3   4
phylogeny_df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
})
# Validate the format
assert pfl.alifestd_validate(phylogeny_df)
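The ancestor_list strings can also be inspected by hand. The following is a minimal sketch, assuming the bracketed-string format used above; parse_ancestor_list is a hypothetical helper written for this guide, not part of phyloframe, and it does not replace alifestd_validate.

```python
import ast


def parse_ancestor_list(text):
    """Return the ancestor ids in an ancestor_list string; roots give []."""
    if text.strip().lower() in ("[none]", "[]"):
        return []
    return [int(ancestor) for ancestor in ast.literal_eval(text)]


ids = [0, 1, 2, 3, 4]
ancestor_lists = ["[None]", "[0]", "[0]", "[1]", "[1]"]

# Map each node id to the ids of its ancestors.
parents = {
    node_id: parse_ancestor_list(text)
    for node_id, text in zip(ids, ancestor_lists)
}

# Roots have no ancestors; dangling references point at ids that do not exist.
roots = [node_id for node_id, ancestors in parents.items() if not ancestors]
dangling = [a for ancestors in parents.values() for a in ancestors if a not in parents]

print(f"Roots: {roots}")
print(f"Dangling ancestor references: {dangling}")
```

A quick check like this can help locate a typo in a hand-built table before passing it to the library.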
You can include additional columns at creation time:
phylogeny_df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
    "origin_time": [0, 10, 10, 20, 20],
    "taxon_label": ["root", "A", "B", "C", "D"],
})
Synthetic Trees
Balanced Bifurcating Trees
Create perfectly balanced binary trees by specifying depth:
# depth=0 -> empty tree
# depth=1 -> 1 node (root only)
# depth=2 -> 3 nodes (1 root, 2 leaves)
# depth=3 -> 7 nodes (3 internal, 4 leaves)
# depth=n -> 2^n - 1 total nodes, 2^(n-1) leaves
df = pfl.alifestd_make_balanced_bifurcating(depth=4) # 15 nodes
print(f"Nodes: {len(df)}")
print(f"Leaves: {pfl.alifestd_count_leaf_nodes(df)}")
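The node and leaf counts listed in the comments above follow directly from the depth. As a small worked check, the hypothetical helper below (not part of phyloframe) evaluates the same formulas in plain Python.

```python
def balanced_counts(depth):
    """Return (total nodes, leaves) for a balanced bifurcating tree."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    n_nodes = 2**depth - 1
    # depth=0 is the empty tree, so it has no leaves.
    n_leaves = 2 ** (depth - 1) if depth > 0 else 0
    return n_nodes, n_leaves


for depth in range(5):
    n_nodes, n_leaves = balanced_counts(depth)
    print(f"depth={depth}: {n_nodes} nodes, {n_leaves} leaves")
```

For depth=4 this gives 15 nodes and 8 leaves, matching the comment on the call above.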
Comb (Caterpillar) Trees
Create maximally unbalanced trees by specifying the number of leaves:
df = pfl.alifestd_make_comb(n_leaves=10)
print(f"Nodes: {len(df)}")
print(f"Leaves: {pfl.alifestd_count_leaf_nodes(df)}")
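To make the shape concrete, here is a minimal sketch that builds a strictly bifurcating comb by hand: a backbone of internal nodes, each with one leaf attached, ending in a final leaf. The helpers make_comb_rows and count_leaves are hypothetical, and the id assignment and node count are choices of this sketch; the source does not confirm that alifestd_make_comb numbers or counts nodes the same way.

```python
def make_comb_rows(n_leaves):
    """Return (id, ancestor_list) rows for a strictly bifurcating comb tree."""
    if n_leaves < 1:
        return []
    rows = [(0, "[None]")]  # the root starts the backbone
    backbone = 0
    for _ in range(n_leaves - 1):
        leaf_id = len(rows)
        next_backbone = leaf_id + 1
        rows.append((leaf_id, f"[{backbone}]"))        # leaf hanging off the backbone
        rows.append((next_backbone, f"[{backbone}]"))  # backbone continues downward
        backbone = next_backbone
    return rows


def count_leaves(rows):
    """Count nodes that never appear as another node's ancestor."""
    referenced = {ancestor_list for _, ancestor_list in rows}
    return sum(f"[{node_id}]" not in referenced for node_id, _ in rows)


rows = make_comb_rows(10)
print(f"Nodes: {len(rows)}")           # 19 in this sketch (2 * 10 - 1)
print(f"Leaves: {count_leaves(rows)}")  # 10
```

In this construction every split sends one child to a leaf and the other down the backbone, which is what makes the tree maximally unbalanced.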
Parsing Newick Format
Parse Newick-format strings into phyloframe DataFrames:
# Simple topology
df = pfl.alifestd_from_newick("((A,B),(C,D));")
# With branch lengths
df = pfl.alifestd_from_newick("((A:1.0,B:2.0):3.0,(C:4.0,D:5.0):6.0);")
# Parsed columns include:
# id, ancestor_id, taxon_label, origin_time_delta, branch_length
print(df.columns.tolist())
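For intuition about how a Newick string maps onto rows, the following is a simplified sketch of a parser in plain Python. It is not how alifestd_from_newick is implemented: parse_newick_rows is a hypothetical helper, the keys are borrowed from the column names listed above, ids are assigned in order of first appearance, and the root's ancestor_id is set to None, all of which may differ from the library's output. Quoted labels, comments, and unnamed leaves are not handled.

```python
def parse_newick_rows(newick):
    """Parse a simple Newick string into one dict per node."""
    rows = []       # node records; a node's id is its index in this list
    stack = []      # ids of internal nodes whose ")" has not been seen yet
    pending = None  # id of the node that the buffered text describes
    text = ""

    def new_node():
        parent = stack[-1] if stack else None
        rows.append({
            "id": len(rows),
            "ancestor_id": parent,
            "taxon_label": None,
            "branch_length": None,
        })
        return len(rows) - 1

    def flush():
        # Attach buffered "label:length" text to the pending node.
        if pending is not None and text:
            label, _, length = text.partition(":")
            rows[pending]["taxon_label"] = label or None
            rows[pending]["branch_length"] = float(length) if length else None

    for char in newick:
        if char.isspace():
            continue
        if char == "(":
            stack.append(new_node())  # open a new internal node
            pending, text = None, ""
        elif char in ",);":
            flush()
            text = ""
            # Text after ")" describes the internal node that just closed.
            pending = stack.pop() if char == ")" else None
            if char == ";":
                break
        else:
            if pending is None:
                pending = new_node()  # first character of a leaf
            text += char
    return rows


rows = parse_newick_rows("((A:1.0,B:2.0):3.0,(C:4.0,D:5.0):6.0);")
for row in rows:
    print(row)
```

The four-leaf example yields seven rows: the root, two internal nodes, and the leaves A through D.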
Polars Newick Parsing
df_polars = pfl.alifestd_from_newick_polars("((A,B),(C,D));")
Options
# Integer branch lengths
df = pfl.alifestd_from_newick(
    "((A:1,B:2):3,(C:4,D:5):6);",
    branch_length_dtype=int,
)
# Include ancestor_list column (slower, for compatibility)
df = pfl.alifestd_from_newick(
    "((A,B),(C,D));",
    create_ancestor_list=True,
)
Loading from Files
Because phyloframe phylogenies are ordinary DataFrames, you load them from files with the usual pandas and Polars readers:
import pandas as pd
# From CSV
df = pd.read_csv("phylogeny.csv")
# From Parquet (recommended for large trees)
df = pd.read_parquet("phylogeny.pqt")
# From a URL
df = pd.read_csv("https://example.com/data/phylogeny.csv")
# From cloud storage (s3:// paths need the optional s3fs package)
df = pd.read_parquet("s3://bucket/phylogeny.pqt")
import polars as pl
# Polars Parquet (efficient columnar reads)
df = pl.read_parquet("phylogeny.pqt")
# Selective column loading with Parquet
df = pl.read_parquet("phylogeny.pqt", columns=["id", "ancestor_id"])
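As a quick check that a phylogeny survives a trip through CSV, the sketch below writes a small table to an in-memory buffer and reads it back with plain pandas. The buffer stands in for a file on disk, and the variable names are hypothetical.

```python
import io

import pandas as pd

phylogeny_df = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "ancestor_list": ["[None]", "[0]", "[0]"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
phylogeny_df.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back; ids come back as integers, ancestor lists as strings.
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
print(reloaded)
```

Passing index=False keeps the pandas index from being written out as an extra unnamed column.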
After loading, convert the DataFrame to phyloframe's working format before analysis:
df = pfl.alifestd_to_working_format(df)