Tree Properties and Metrics (Legacy)
Note
This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.
This guide covers phyloframe’s operations for marking node properties, counting tree features, and computing phylogenetic metrics.
Marking Properties
“Mark” functions add a new column to the phylogeny DataFrame. All mark functions share these conventions:
- mark_as parameter to customize the output column name.
- mutate parameter (default False) to control whether the input DataFrame is modified.
- Return the modified DataFrame (always use the return value).
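To make these conventions concrete, here is a minimal sketch of a function that follows them. It is not phyloframe code: mark_leaves_sketch is a hypothetical name, and a plain dict of column lists stands in for the DataFrame so the example needs only the standard library.

```python
import copy


def mark_leaves_sketch(table, mark_as="is_leaf", mutate=False):
    """Hypothetical mark function over a dict of equal-length column lists."""
    if not mutate:
        # Work on a copy so the caller's table is left untouched.
        table = copy.deepcopy(table)
    # A node is a leaf if no other node names it as its ancestor.
    # Assumption: a root lists itself as its own ancestor.
    parents = {
        anc for nid, anc in zip(table["id"], table["ancestor_id"]) if anc != nid
    }
    table[mark_as] = [nid not in parents for nid in table["id"]]
    return table


original = {"id": [0, 1, 2], "ancestor_id": [0, 0, 0]}
marked = mark_leaves_sketch(original, mark_as="tip")
print(marked["tip"])      # [False, True, True]
print("tip" in original)  # False: the input was not modified
```

Because the input is only guaranteed to be untouched when mutate is False, code that reads the new column should always go through the returned object.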
Leaf and Root Detection
from phyloframe import legacy as pfl
df = pfl.alifestd_from_newick("((A,B),(C,D));")
# Mark leaf nodes (no descendants)
df = pfl.alifestd_mark_leaves(df)
# df["is_leaf"]: True for A, B, C, D, False otherwise
# Mark root nodes (no ancestors)
df = pfl.alifestd_mark_roots(df)
# df["is_root"]: True for the root node, False otherwise
Node Depth
Number of edges between a node and the root:
df = pfl.alifestd_mark_node_depth_asexual(df)
# df["node_depth"]: 0 for root, 1 for root's children, etc.
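As an illustration of what node depth measures, the sketch below computes it by hand for the tree ((A,B),(C,D)). The ancestor list is written out manually and assumes that rows are topologically sorted (ancestors before descendants), that IDs equal row positions, and that the root is its own ancestor; it does not claim to reproduce the row order that alifestd_from_newick emits.

```python
# Hand-written layout for ((A,B),(C,D)):
# 0 = root, 1 = (A,B), 2 = (C,D), 3 = A, 4 = B, 5 = C, 6 = D
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]

node_depth = [0] * len(ancestor_ids)
for node, parent in enumerate(ancestor_ids):
    if parent != node:  # the root keeps depth 0
        # The parent was already visited, so its depth is final.
        node_depth[node] = node_depth[parent] + 1

print(node_depth)  # [0, 1, 1, 2, 2, 2, 2]
```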
Descendant and Children Counts
# Total descendants (excluding self)
df = pfl.alifestd_mark_num_descendants_asexual(df)
# Direct children count
df = pfl.alifestd_mark_num_children_asexual(df)
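The difference between the two counts can be seen in a small hand computation. This is a sketch under the same assumptions as a topologically sorted table with contiguous IDs and a self-referencing root, not phyloframe's implementation.

```python
# Hand-written layout for ((A,B),(C,D)):
# 0 = root, 1 = (A,B), 2 = (C,D), 3 = A, 4 = B, 5 = C, 6 = D
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
n = len(ancestor_ids)

num_children = [0] * n
num_descendants = [0] * n
# Visit nodes in reverse so every child is handled before its parent.
for node in reversed(range(n)):
    parent = ancestor_ids[node]
    if parent != node:
        num_children[parent] += 1
        # The parent gains this node plus everything below it.
        num_descendants[parent] += num_descendants[node] + 1

print(num_children)     # [2, 2, 2, 0, 0, 0, 0]
print(num_descendants)  # [6, 2, 2, 0, 0, 0, 0]
```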
Child Identification
For binary trees, identify left and right children:
# Requires strictly bifurcating tree
df = pfl.alifestd_mark_is_left_child_asexual(df)
df = pfl.alifestd_mark_is_right_child_asexual(df)
Time-based Properties
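These require an origin_time column in the input. The example first derives one by accumulating origin_time_delta (the Newick branch lengths) from the root down: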
import numpy as np
df_timed = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
ancestor_ids = df_timed["ancestor_id"].values
deltas = df_timed["origin_time_delta"].fillna(0).values
origin_time = np.zeros(len(df_timed))
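# Accumulate branch lengths from the root down. This relies on rows being
# topologically sorted and on ids matching row positions.
for i in range(len(df_timed)):
    parent = ancestor_ids[i]
    if parent != i:  # roots are their own ancestor and keep time 0
        origin_time[i] = origin_time[parent] + deltas[i]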
df_timed["origin_time"] = origin_time
# Time of this node's ancestor
df_timed = pfl.alifestd_mark_ancestor_origin_time_asexual(df_timed)
# Time elapsed since ancestor (branch length equivalent)
df_timed = pfl.alifestd_mark_origin_time_delta_asexual(df_timed)
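The relationship between the two time-based columns can be sketched with plain lists. The values below are the origin times implied by ((A:1,B:2):3,(C:4,D:5):6) under a hand-written node layout; how phyloframe fills the root's entries is not specified here, so the root values in this sketch are an assumption.

```python
# 0 = root, 1 = (A,B), 2 = (C,D), 3 = A, 4 = B, 5 = C, 6 = D
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
origin_time = [0, 3, 6, 4, 5, 10, 11]

# Each node's ancestor time is a lookup through ancestor_id.
ancestor_origin_time = [origin_time[parent] for parent in ancestor_ids]
# The delta is the time elapsed along the branch leading to the node.
origin_time_delta = [
    own - anc for own, anc in zip(origin_time, ancestor_origin_time)
]

print(ancestor_origin_time)  # [0, 0, 0, 3, 3, 6, 6]
print(origin_time_delta)     # [0, 3, 6, 1, 2, 4, 5]
```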
Counting Operations
Count functions return scalars:
df = pfl.alifestd_from_newick("((A,B),(C,(D,E)));")
pfl.alifestd_count_leaf_nodes(df) # 5
pfl.alifestd_count_inner_nodes(df) # 4
pfl.alifestd_count_root_nodes(df) # 1
pfl.alifestd_count_unifurcations(df) # 0
pfl.alifestd_count_polytomies(df) # 0
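To show how those numbers arise, the sketch below recounts the same tree by hand. It assumes a unifurcation is a node with exactly one child and a polytomy is a node with more than two; the ancestor list is a hand-written layout, not the output of alifestd_from_newick.

```python
from collections import Counter

# Hand-written layout for ((A,B),(C,(D,E))):
# 0 = root, 1 = (A,B), 2 = (C,(D,E)), 3 = A, 4 = B, 5 = C, 6 = (D,E), 7 = D, 8 = E
ancestor_ids = [0, 0, 0, 1, 1, 2, 2, 6, 6]

# Number of children per parent; the root's self-reference is skipped.
child_counts = Counter(
    parent for node, parent in enumerate(ancestor_ids) if parent != node
)

num_roots = sum(parent == node for node, parent in enumerate(ancestor_ids))
num_inner = len(child_counts)  # nodes with at least one child
num_leaves = len(ancestor_ids) - num_inner
num_unifurcations = sum(count == 1 for count in child_counts.values())
num_polytomies = sum(count > 2 for count in child_counts.values())

print(num_leaves, num_inner, num_roots, num_unifurcations, num_polytomies)
# 5 4 1 0 0
```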
Finding Specific Nodes
import numpy as np
# Get arrays of IDs
leaf_ids = pfl.alifestd_find_leaf_ids(df) # np.ndarray
root_ids = pfl.alifestd_find_root_ids(df) # np.ndarray
# Look up the id of a node by its taxon label
node_id = df.loc[df["taxon_label"] == "A", "id"].item()
Validation
Check that a DataFrame conforms to the alife standard format:
is_valid = pfl.alifestd_validate(df)
# Check specific structural properties
pfl.alifestd_is_asexual(df)
pfl.alifestd_is_topologically_sorted(df)
pfl.alifestd_has_contiguous_ids(df)
pfl.alifestd_is_strictly_bifurcating_asexual(df)
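Two of these properties matter for any code that indexes rows by ancestor ID. The sketch below spells out one plausible reading of them: "contiguous" is taken to mean each ID equals its row position, and "topologically sorted" to mean every ancestor appears before its descendants. The function names are hypothetical and the definitions are assumptions, not phyloframe's source.

```python
def has_contiguous_ids_sketch(ids):
    """True if each id equals its row position (0, 1, 2, ...)."""
    return all(node_id == position for position, node_id in enumerate(ids))


def is_topologically_sorted_sketch(ids, ancestor_ids):
    """True if every non-root node appears after its ancestor."""
    seen = set()
    for node_id, ancestor_id in zip(ids, ancestor_ids):
        if ancestor_id != node_id and ancestor_id not in seen:
            return False
        seen.add(node_id)
    return True


print(has_contiguous_ids_sketch([0, 1, 2]))                  # True
print(has_contiguous_ids_sketch([0, 2, 3]))                  # False
print(is_topologically_sorted_sketch([0, 1, 2], [0, 0, 1]))  # True
print(is_topologically_sorted_sketch([0, 1, 2], [1, 1, 1]))  # False
```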
Balance Metrics
df = pfl.alifestd_from_newick("((A,B),(C,D));")
# Colless balance index (per-node)
df = pfl.alifestd_mark_colless_index_asexual(df)
# Sackin balance index (per-node)
df = pfl.alifestd_mark_sackin_index_asexual(df)
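For orientation, here is a hand computation of the classical whole-tree definitions: Colless sums the absolute difference in leaf counts between the two children of each bifurcating node, and Sackin sums the depths of all leaves. The example uses the unbalanced tree (((A,B),C),D) so the values are nonzero. How the per-node columns above are defined and normalized is not covered here, so treat this as a sketch of the textbook indices only.

```python
# Hand-written layout for (((A,B),C),D):
# 0 = root, 1 = ((A,B),C), 2 = D, 3 = (A,B), 4 = C, 5 = A, 6 = B
ancestor_ids = [0, 0, 0, 1, 1, 3, 3]
n = len(ancestor_ids)

children = [[] for _ in range(n)]
depth = [0] * n
for node, parent in enumerate(ancestor_ids):
    if parent != node:
        children[parent].append(node)
        depth[node] = depth[parent] + 1

# Count leaves below each node, visiting children before parents.
num_leaves = [0] * n
for node in reversed(range(n)):
    if children[node]:
        num_leaves[node] = sum(num_leaves[child] for child in children[node])
    else:
        num_leaves[node] = 1

colless = sum(
    abs(num_leaves[kids[0]] - num_leaves[kids[1]])
    for kids in children
    if len(kids) == 2
)
sackin = sum(depth[node] for node in range(n) if not children[node])

print(colless)  # 3
print(sackin)   # 9
```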
MRCA (Most Recent Common Ancestor)
df_mrca = pfl.alifestd_from_newick("((A,B),(C,D));")
# Get the IDs of all leaf nodes
leaf_ids = pfl.alifestd_find_leaf_ids(df_mrca)
# MRCA of two specific nodes
mrca_id = pfl.alifestd_find_pair_mrca_id_asexual(
df_mrca, leaf_ids[0], leaf_ids[1],
)
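Conceptually, the MRCA of two nodes is the first node shared by their paths to the root. The sketch below shows that idea on a hand-written ancestor list with a self-referencing root; find_pair_mrca_sketch is a hypothetical helper, not the library routine.

```python
def find_pair_mrca_sketch(ancestor_ids, first, second):
    """Return the most recent common ancestor of two nodes, or None."""
    # Record every node on the path from `first` up to its root.
    lineage = set()
    node = first
    while True:
        lineage.add(node)
        if ancestor_ids[node] == node:
            break
        node = ancestor_ids[node]
    # Climb from `second` until that path is reached.
    node = second
    while node not in lineage:
        if ancestor_ids[node] == node:
            return None  # the nodes sit in different trees
        node = ancestor_ids[node]
    return node


# Hand-written layout for ((A,B),(C,D)):
# 0 = root, 1 = (A,B), 2 = (C,D), 3 = A, 4 = B, 5 = C, 6 = D
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
print(find_pair_mrca_sketch(ancestor_ids, 3, 4))  # 1
print(find_pair_mrca_sketch(ancestor_ids, 3, 5))  # 0
```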
Triplet Distance
Compare two trees using triplet-based distance:
tree1 = pfl.alifestd_from_newick("((A,B),(C,D));")
tree2 = pfl.alifestd_from_newick("((A,C),(B,D));")
dist = pfl.alifestd_calc_triplet_distance_asexual(tree1, tree2)
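A triplet distance looks at every set of three leaves and asks whether both trees group the same two as closest relatives. The sketch below computes the fraction of triplets that disagree for the two trees above. It is a brute-force illustration with hypothetical helper names; whether phyloframe normalizes the result or treats unresolved triplets the same way is an open assumption.

```python
from itertools import combinations


def depths(ancestor_ids):
    depth = [0] * len(ancestor_ids)
    for node, parent in enumerate(ancestor_ids):
        if parent != node:
            depth[node] = depth[parent] + 1
    return depth


def mrca(ancestor_ids, depth, a, b):
    # Repeatedly lift the deeper node until both paths meet.
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a
        a = ancestor_ids[a]
    return a


def closest_pair(ancestor_ids, depth, leaf_of, triplet):
    """Pair of labels with the deepest MRCA, or None if unresolved."""
    pair_depth = {
        pair: depth[mrca(ancestor_ids, depth, leaf_of[pair[0]], leaf_of[pair[1]])]
        for pair in combinations(triplet, 2)
    }
    deepest = max(pair_depth.values())
    winners = [pair for pair, d in pair_depth.items() if d == deepest]
    return winners[0] if len(winners) == 1 else None


def triplet_distance_sketch(tree1, tree2):
    (anc1, leaf1), (anc2, leaf2) = tree1, tree2
    d1, d2 = depths(anc1), depths(anc2)
    triplets = list(combinations(sorted(leaf1), 3))
    differing = sum(
        closest_pair(anc1, d1, leaf1, t) != closest_pair(anc2, d2, leaf2, t)
        for t in triplets
    )
    return differing / len(triplets)


# Both trees share one hand-written shape; only the leaf labels move.
shape = [0, 0, 0, 1, 1, 2, 2]
tree1 = (shape, {"A": 3, "B": 4, "C": 5, "D": 6})  # ((A,B),(C,D))
tree2 = (shape, {"A": 3, "C": 4, "B": 5, "D": 6})  # ((A,C),(B,D))
print(triplet_distance_sketch(tree1, tree2))  # 1.0
print(triplet_distance_sketch(tree1, tree1))  # 0.0
```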
Distance Matrix
Compute pairwise distances between leaf nodes:
df_dist = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
ancestor_ids = df_dist["ancestor_id"].values
deltas = df_dist["origin_time_delta"].fillna(0).values
origin_time = np.zeros(len(df_dist))
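# Accumulate branch lengths from the root down. This relies on rows being
# topologically sorted and on ids matching row positions.
for i in range(len(df_dist)):
    parent = ancestor_ids[i]
    if parent != i:  # roots are their own ancestor and keep time 0
        origin_time[i] = origin_time[parent] + deltas[i]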
df_dist["origin_time"] = origin_time
# Returns a NumPy distance matrix
dist_matrix = pfl.alifestd_calc_distance_matrix_asexual(df_dist)
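When distances are measured along branches, the distance between two nodes is the sum of their origin times minus twice the origin time of their MRCA. The sketch below applies that formula by hand to the same tree. The node layout and leaf order (A, B, C, D) are assumptions made for this example, and the row order of the matrix phyloframe returns may differ.

```python
# Hand-written layout for ((A:1,B:2):3,(C:4,D:5):6):
# 0 = root, 1 = (A,B), 2 = (C,D), 3 = A, 4 = B, 5 = C, 6 = D
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
branch_lengths = [0, 3, 6, 1, 2, 4, 5]

origin_time = [0] * len(ancestor_ids)
for node, parent in enumerate(ancestor_ids):
    if parent != node:
        origin_time[node] = origin_time[parent] + branch_lengths[node]


def lineage(node):
    """Path from a node up to and including its root."""
    path = [node]
    while ancestor_ids[node] != node:
        node = ancestor_ids[node]
        path.append(node)
    return path


def distance(a, b):
    on_path = set(lineage(a))
    common = next(node for node in lineage(b) if node in on_path)
    return origin_time[a] + origin_time[b] - 2 * origin_time[common]


leaves = [3, 4, 5, 6]  # A, B, C, D
matrix = [[distance(a, b) for b in leaves] for a in leaves]
for row in matrix:
    print(row)
# [0, 3, 14, 15]
# [3, 0, 15, 16]
# [14, 15, 0, 9]
# [15, 16, 9, 0]
```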