Tree Manipulation and Pruning (Legacy)
Note
This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.
This guide covers operations that transform the structure of a phylogeny: collapsing, splaying, pruning, downsampling, and masking.
Note
Structural transforms may invalidate previously computed columns (e.g., node_depth, num_descendants).
See Concepts and Data Structures (Legacy) for details on the topological sensitivity system.
Structural Transformations
Collapsing Unifurcations
Remove single-child (unifurcating) nodes, connecting their parent directly to their child:
from phyloframe import legacy as pfl
df = pfl.alifestd_from_newick("((A,B),(C,D));")
df = df.pipe(pfl.alifestd_collapse_unifurcations)
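The Newick string above contains no unifurcations, so the effect of collapsing is easier to see on a small hand-built tree. The following is a minimal pure-Python sketch of the idea, not phyloframe's implementation. It assumes a plain list standing in for the ancestor_id column, where the list index is the node id and a root points at itself. The helper name collapse_unifurcations is hypothetical, and keeping single-child roots is a choice made for this sketch.

```python
def collapse_unifurcations(ancestor_ids):
    """Illustrative helper (hypothetical, not part of phyloframe).

    Drops every non-root node that has exactly one child and reconnects
    that child to the nearest surviving ancestor. Returns the new
    ancestor list plus the original ids of the surviving nodes.
    """
    n = len(ancestor_ids)
    child_counts = [0] * n
    for node, parent in enumerate(ancestor_ids):
        if parent != node:  # assumption: roots point at themselves
            child_counts[parent] += 1

    def is_dropped(node):
        # Roots are kept here even if they have a single child.
        return child_counts[node] == 1 and ancestor_ids[node] != node

    kept = [node for node in range(n) if not is_dropped(node)]
    new_id = {old: new for new, old in enumerate(kept)}

    result = []
    for node in kept:
        parent = ancestor_ids[node]
        while is_dropped(parent):  # skip over removed nodes
            parent = ancestor_ids[parent]
        result.append(new_id[parent])
    return result, kept


# Chain 0 -> 1 -> 2, where node 2 has two leaf children (3 and 4).
collapsed, kept_ids = collapse_unifurcations([0, 0, 1, 2, 2])
# collapsed == [0, 0, 1, 1]; kept_ids == [0, 2, 3, 4]
```

In this sketch node 1 is removed and node 2 is attached directly to the root, while both leaves keep their parent.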
Splaying Polytomies
Expand polytomies (nodes with more than two children) into a cascade of bifurcations:
df = pfl.alifestd_splay_polytomies(df)
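To make the "cascade of bifurcations" concrete, here is a minimal pure-Python sketch of one way to splay a polytomy. It is an illustration under stated assumptions, not phyloframe's implementation: the tree is a plain list standing in for the ancestor_id column (index is node id, roots point at themselves), the helper name splay_polytomies is hypothetical, and the ids given to new internal nodes are this sketch's own choice.

```python
def splay_polytomies(ancestor_ids):
    """Illustrative helper (hypothetical, not part of phyloframe).

    For each node with more than two children, leave the first child in
    place and move the remaining children under a newly created internal
    node, repeating until every node has at most two children.
    """
    result = list(ancestor_ids)
    children = {}
    for node, parent in enumerate(ancestor_ids):
        if parent != node:  # assumption: roots point at themselves
            children.setdefault(parent, []).append(node)

    for parent, kids in children.items():
        attach_to = parent
        while len(kids) > 2:
            new_node = len(result)    # new ids are appended at the end
            result.append(attach_to)  # new internal node under attach point
            for kid in kids[1:]:
                result[kid] = new_node
            kids = kids[1:]
            attach_to = new_node
    return result


# A root (0) with three children (1, 2, 3).
splayed = splay_polytomies([0, 0, 0, 0])
# splayed == [0, 0, 4, 4, 0]: node 4 is new and holds children 2 and 3
```

Because new nodes are appended at the end, they receive larger ids than their children in this sketch, so the result is no longer ordered ancestor-first.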
Root Operations
# Add a synthetic global root above all existing roots
df = pfl.alifestd_add_global_root(df)
# Set attributes on the new root
df = pfl.alifestd_add_global_root(
    df, root_attrs={"origin_time": 0.0, "taxon_label": "global_root"},
)
# Join multiple roots to the oldest root
df = pfl.alifestd_join_roots(df)
Rerooting
Change the root of a tree to a specified node:
df_reroot = pfl.alifestd_from_newick(
    "((A,B),(C,D));", create_ancestor_list=True,
)
leaf_ids = pfl.alifestd_find_leaf_ids(df_reroot)
df_reroot = pfl.alifestd_reroot_at_id_asexual(
    df_reroot, int(leaf_ids[0]),
)
Ladderization
Reorder each node's children by subtree size so that trees render consistently:
df_ladder = pfl.alifestd_from_newick(
    "((A,B),(C,D));", create_ancestor_list=True,
)
df_ladder = pfl.alifestd_ladderize_asexual(df_ladder)
Trunk Operations
df_trunk = pfl.alifestd_make_comb(n_leaves=5)
df_trunk = pfl.alifestd_to_working_format(df_trunk)
# Delete unifurcating root nodes
df_trunk = pfl.alifestd_delete_unifurcating_roots_asexual(df_trunk)
Aggregating Multiple Phylogenies
Concatenate independent phylogenies into a single DataFrame, reassigning IDs so they remain unique:
tree1 = pfl.alifestd_from_newick(
    "(A,B);", create_ancestor_list=True,
)
tree2 = pfl.alifestd_from_newick(
    "(C,D);", create_ancestor_list=True,
)
combined = pfl.alifestd_aggregate_phylogenies([tree1, tree2])
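ID reassignment is needed because each input phylogeny numbers its nodes independently. The sketch below shows one simple scheme, shifting each tree's ids by the number of nodes already placed. This is an assumption for illustration and may differ from what phyloframe does internally. Trees are plain lists standing in for the ancestor_id column (index is node id, roots point at themselves), and the helper name aggregate is hypothetical.

```python
def aggregate(trees):
    """Illustrative helper (hypothetical, not part of phyloframe).

    Concatenates ancestor lists, offsetting ids so they stay unique.
    """
    combined = []
    for tree in trees:
        offset = len(combined)  # ids used so far
        combined.extend(parent + offset for parent in tree)
    return combined


# Two cherries, (A,B) and (C,D), each with ids 0 (root), 1, 2.
merged = aggregate([[0, 0, 0], [0, 0, 0]])
# merged == [0, 0, 0, 3, 3, 3]: the second tree's root is now node 3
```

The result holds two roots (nodes 0 and 3), so it is a forest rather than a single tree.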
Chronological Sorting
Sort rows by origin_time for time-based analyses:
import numpy as np
df_chrono = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
# Compute origin_time by accumulating origin_time_delta from the root.
# This loop assumes ids match row positions, each ancestor's row precedes
# its descendants, and a root's ancestor_id equals its own id.
deltas = df_chrono["origin_time_delta"].fillna(0).values
ancestor_ids = df_chrono["ancestor_id"].values
origin_time = np.empty(len(df_chrono))
for i in range(len(df_chrono)):
    parent = ancestor_ids[i]
    origin_time[i] = 0.0 if parent == i else origin_time[parent] + deltas[i]
df_chrono["origin_time"] = origin_time
df_chrono = pfl.alifestd_chronological_sort(df_chrono)
Tip Sampling (Mark Functions)
Tip sampling mark functions add a boolean column indicating which tips to retain. They do not remove any rows; apply a pruning step afterward (see Pruning).
Uniform Random Sampling
df_sample = pfl.alifestd_make_balanced_bifurcating(depth=5)
df_sample = df_sample.pipe(pfl.alifestd_to_working_format)
# Mark 10 randomly selected tips
df_sample = pfl.alifestd_mark_sample_tips_uniform_asexual(
    df_sample, n_sample=10, mark_as="keep", seed=42,
)
Canopy Sampling
Retain tips with the largest values in a criterion column (e.g., the most recent tips by origin_time):
df_sample = pfl.alifestd_mark_sample_tips_canopy_asexual(
    df_sample,
    criterion="origin_time",
    mark_as="keep_canopy",
    n_sample=5,
)
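The selection rule behind canopy sampling can be sketched in a few lines of pure Python. This is an illustration of the idea only, not phyloframe's implementation. It assumes plain lists standing in for the ancestor_id and criterion columns (index is node id, roots point at themselves), the helper name mark_canopy is hypothetical, and tie-breaking follows Python's stable sort rather than any documented phyloframe behavior.

```python
def mark_canopy(ancestor_ids, criterion_values, n_sample):
    """Illustrative helper (hypothetical, not part of phyloframe).

    Returns one boolean per node, True for the n_sample tips with the
    largest criterion values. No nodes are removed.
    """
    n = len(ancestor_ids)
    has_child = [False] * n
    for node, parent in enumerate(ancestor_ids):
        if parent != node:  # assumption: roots point at themselves
            has_child[parent] = True

    tips = [node for node in range(n) if not has_child[node]]
    ranked = sorted(tips, key=lambda t: criterion_values[t], reverse=True)
    chosen = set(ranked[:n_sample])
    return [node in chosen for node in range(n)]


# ((A,B),(C,D)) with ids: 0 root, 1 and 2 internal, 3..6 leaves.
origin_times = [0, 3, 6, 4, 5, 10, 11]
keep_canopy = mark_canopy([0, 0, 0, 1, 1, 2, 2], origin_times, n_sample=2)
# keep_canopy == [False, False, False, False, False, True, True]
```

As with the mark functions above, the output is only a mask; internal nodes are always False and must be handled by a later pruning step.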
Lineage Sampling
Retain tips closest to a focal lineage (the lineage of the tip with the largest criterion value):
df_sample = pfl.alifestd_mark_sample_tips_lineage_asexual(
    df_sample, n_sample=5, mark_as="keep_lineage",
)
Combining Sample Masks
Because sample marks are boolean columns, they compose naturally with standard boolean operations:
# Keep tips matching EITHER criterion (union)
df_sample["keep"] = (
    df_sample["keep_canopy"] | df_sample["keep_lineage"]
)
# Keep tips matching BOTH criteria (intersection)
df_sample["keep"] = (
    df_sample["keep_canopy"] & df_sample["keep_lineage"]
)
# Invert a selection
df_sample["keep"] = ~df_sample["keep_canopy"]
Pruning
Pruning Extinct Lineages
Remove lineages that have no extant descendants.
The criterion parameter specifies which boolean column marks extant taxa (default: "extant"):
df_prune = pfl.alifestd_make_balanced_bifurcating(depth=4)
df_prune = df_prune.pipe(pfl.alifestd_to_working_format)
# Mark which taxa are extant (e.g., those with id >= threshold)
threshold = len(df_prune) // 2
df_prune["extant"] = df_prune["id"] >= threshold
# Remove lineages without extant descendants
df_prune = pfl.alifestd_prune_extinct_lineages_asexual(df_prune)
You can also use a custom criterion column name:
df_prune2 = pfl.alifestd_make_balanced_bifurcating(depth=4)
df_prune2 = df_prune2.pipe(pfl.alifestd_to_working_format)
df_prune2["is_alive"] = True
df_prune2 = pfl.alifestd_prune_extinct_lineages_asexual(
    df_prune2, criterion="is_alive",
)
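The pruning rule can be pictured as: keep every taxon marked extant plus all of its ancestors, and drop everything else. The pure-Python sketch below illustrates that rule under stated assumptions and is not phyloframe's implementation. It uses plain lists standing in for the ancestor_id and criterion columns (index is node id, roots point at themselves), and the helper name prune_extinct is hypothetical.

```python
def prune_extinct(ancestor_ids, extant):
    """Illustrative helper (hypothetical, not part of phyloframe).

    Keeps nodes that are extant or have an extant descendant. Returns
    the new ancestor list plus the original ids of the kept nodes.
    """
    n = len(ancestor_ids)
    keep = [False] * n
    for node in range(n):
        if not extant[node]:
            continue
        # Walk rootward, stopping once an already-kept node is reached.
        cur = node
        while not keep[cur]:
            keep[cur] = True
            cur = ancestor_ids[cur]  # a root maps to itself, ending the loop

    kept = [node for node in range(n) if keep[node]]
    new_id = {old: new for new, old in enumerate(kept)}
    return [new_id[ancestor_ids[node]] for node in kept], kept


# ((A,B),(C,D)) with ids: 0 root, 1 and 2 internal, 3..6 leaves.
# Only leaf C (id 5) is marked extant.
extant = [False, False, False, False, False, True, False]
pruned, kept_ids = prune_extinct([0, 0, 0, 1, 1, 2, 2], extant)
# pruned == [0, 0, 1]; kept_ids == [0, 2, 5]
```

In this sketch the surviving lineage 0 -> 2 -> 5 consists of single-child nodes, which is the kind of structure the Collapsing Unifurcations step addresses.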
Coarsening with a Mask
Keep only rows matching a boolean mask, re-wiring ancestor relationships to maintain tree connectivity:
df_coarsen = (
    pfl.alifestd_from_newick("((A,B),(C,D));", create_ancestor_list=True)
    .pipe(pfl.alifestd_mark_node_depth_asexual)
    .pipe(pfl.alifestd_mark_leaves)
)
mask = (df_coarsen["node_depth"] % 2 == 0) | df_coarsen["is_leaf"]
df_coarsen = pfl.alifestd_coarsen_mask(df_coarsen, mask)
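"Re-wiring" here can be read as attaching each kept node to its nearest kept ancestor. The pure-Python sketch below illustrates that reading; it is not phyloframe's implementation, and the treatment of nodes with no kept ancestor (they become roots) is an assumption of the sketch. Trees are plain lists standing in for the ancestor_id column (index is node id, roots point at themselves), and the helper name coarsen is hypothetical.

```python
def coarsen(ancestor_ids, mask):
    """Illustrative helper (hypothetical, not part of phyloframe).

    Keeps nodes where mask is True and attaches each to its nearest
    kept ancestor. Returns the new ancestor list and kept original ids.
    """
    n = len(ancestor_ids)
    kept = [node for node in range(n) if mask[node]]
    new_id = {old: new for new, old in enumerate(kept)}

    result = []
    for node in kept:
        parent = ancestor_ids[node]
        # Climb past masked-out ancestors until a kept one or the root.
        while not mask[parent] and ancestor_ids[parent] != parent:
            parent = ancestor_ids[parent]
        # Assumption: with no kept ancestor, the node becomes a root.
        result.append(new_id[parent] if mask[parent] else new_id[node])
    return result, kept


# ((A,B),(C,D)) with ids: 0 root, 1 and 2 internal, 3..6 leaves.
# Mask mirrors "even node_depth or leaf": depths are 0, 1, 1, 2, 2, 2, 2.
mask = [True, False, False, True, True, True, True]
coarse, kept_ids = coarsen([0, 0, 0, 1, 1, 2, 2], mask)
# coarse == [0, 0, 0, 0, 0]; kept_ids == [0, 3, 4, 5, 6]
```

Under this sketch, dropping the two depth-1 internal nodes leaves all four leaves attached directly to the root.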
Composed Example: Multi-criteria Downsampling
A complete workflow combining multiple sampling strategies and pruning:
import pandas as pd
from phyloframe import legacy as pfl
# Load or create a phylogeny with origin times
df = (
    pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,(D:5,E:6):7):8);")
    .pipe(pfl.alifestd_mark_sample_tips_canopy_asexual, n_sample=2, mark_as="keep_canopy")
    .pipe(pfl.alifestd_mark_sample_tips_lineage_asexual, n_sample=2, mark_as="keep_lineage")
)
# Combine: OR the masks to keep tips matching either criterion
df["extant"] = df["keep_canopy"] | df["keep_lineage"]
# Prune lineages with no extant descendants
result = df.pipe(pfl.alifestd_prune_extinct_lineages_asexual)
Downsampling (Combined Mark + Prune)
For convenience, alifestd_downsample_tips_* functions combine the mark and prune steps:
df_ds = pfl.alifestd_make_balanced_bifurcating(depth=5)
df_ds = df_ds.pipe(pfl.alifestd_to_working_format)
# Uniform random downsampling
df_u = pfl.alifestd_downsample_tips_uniform_asexual(
    df_ds, n_downsample=10,
)
# Canopy downsampling
df_c = pfl.alifestd_downsample_tips_canopy_asexual(
    df_ds, n_downsample=5,
)
# Lineage-based downsampling
df_l = pfl.alifestd_downsample_tips_lineage_asexual(
    df_ds, n_downsample=5,
)
These are equivalent to calling the corresponding mark function, setting the "extant" column, and then pruning extinct lineages.
Use the separate mark + prune workflow (above) when you need to combine multiple sampling criteria.