Tree Properties and Metrics (Legacy)

Note

This documentation covers the legacy API (from phyloframe import legacy). The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

This guide covers phyloframe’s operations for marking node properties, counting tree features, and computing phylogenetic metrics.

Marking Properties

“Mark” functions add a new column to the phylogeny DataFrame. All mark functions share these conventions:

  • The mark_as parameter sets the name of the output column.

  • The mutate parameter (default False) controls whether the input DataFrame may be modified in place.

  • Each function returns the DataFrame with the new column. Always use the return value, since with mutate=False the input is left unchanged.
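
The conventions above can be illustrated without phyloframe. The sketch below is a library-free stand-in: mark_leaves here is a hypothetical function operating on a plain dict of columns, not phyloframe's implementation, and the node ids are hand-written for illustration.

```python
import copy


def mark_leaves(table, mark_as="is_leaf", mutate=False):
    """Hypothetical stand-in for a mark function (not phyloframe's code)."""
    if not mutate:
        table = copy.deepcopy(table)  # leave the caller's table untouched
    has_child = set()
    for node_id, ancestor_id in zip(table["id"], table["ancestor_id"]):
        if ancestor_id != node_id:  # a root is its own ancestor
            has_child.add(ancestor_id)
    table[mark_as] = [node_id not in has_child for node_id in table["id"]]
    return table


# Hand-written table for ((A,B),(C,D)); the id order is illustrative only.
tree = {"id": [0, 1, 2, 3, 4, 5, 6], "ancestor_id": [0, 0, 0, 1, 1, 2, 2]}

marked = mark_leaves(tree, mark_as="tip")  # new column lives on the copy only
in_place = mark_leaves(tree, mutate=True)  # same object, now with "is_leaf"
```

In this sketch, the first call leaves tree without a "tip" column, which is why the return value has to be kept.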

Leaf and Root Detection

from phyloframe import legacy as pfl

df = pfl.alifestd_from_newick("((A,B),(C,D));")

# Mark leaf nodes (no descendants)
df = pfl.alifestd_mark_leaves(df)
# df["is_leaf"]: True for A, B, C, D, False otherwise

# Mark root nodes (no ancestors)
df = pfl.alifestd_mark_roots(df)
# df["is_root"]: True for the root node, False otherwise

Node Depth

Number of edges between a node and the root:

df = pfl.alifestd_mark_node_depth_asexual(df)
# df["node_depth"]: 0 for root, 1 for root's children, etc.
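
To make the definition concrete, here is a minimal library-free sketch of the depth computation on a hand-written ancestor list. It assumes ids equal list positions, ancestors appear before descendants, and the root is its own ancestor; the id order is illustrative and may differ from what alifestd_from_newick assigns.

```python
# ancestor_ids[i] is the parent of node i; the root (node 0) is its own parent.
# Tree: ((A,B),(C,D)) with ids 0=root, 1=(A,B), 2=(C,D), 3=A, 4=B, 5=C, 6=D.
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]

node_depth = [0] * len(ancestor_ids)
for node_id, parent in enumerate(ancestor_ids):
    if parent != node_id:
        # The parent was visited earlier, so its depth is already final.
        node_depth[node_id] = node_depth[parent] + 1

print(node_depth)  # [0, 1, 1, 2, 2, 2, 2]
```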

Descendant and Children Counts

# Total descendants (excluding self)
df = pfl.alifestd_mark_num_descendants_asexual(df)

# Direct children count
df = pfl.alifestd_mark_num_children_asexual(df)
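
The difference between the two counts can be sketched on a plain ancestor list. This is an illustration only, under the same assumptions as above (ids equal positions, ancestors precede descendants, root is its own ancestor); it is not phyloframe's implementation.

```python
# Tree: ((A,B),(C,D)) with ids 0=root, 1=(A,B), 2=(C,D), 3=A, 4=B, 5=C, 6=D.
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
n = len(ancestor_ids)

num_children = [0] * n
num_descendants = [0] * n
# Walk from the last node to the first so each node is complete
# before its totals are added to its parent.
for node_id in reversed(range(n)):
    parent = ancestor_ids[node_id]
    if parent != node_id:
        num_children[parent] += 1
        num_descendants[parent] += num_descendants[node_id] + 1

print(num_children)     # [2, 2, 2, 0, 0, 0, 0]
print(num_descendants)  # [6, 2, 2, 0, 0, 0, 0]
```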

Child Identification

For binary trees, identify left and right children:

# Requires a strictly bifurcating tree (every inner node has exactly two children)
df = pfl.alifestd_mark_is_left_child_asexual(df)
df = pfl.alifestd_mark_is_right_child_asexual(df)

Time-based Properties

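These require an origin_time column in the input. The example below derives one by accumulating origin_time_delta (the branch lengths read from the Newick string) from the root down to each node: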

import numpy as np

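df_timed = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
ancestor_ids = df_timed["ancestor_id"].values
# Branch lengths; missing values (e.g. at the root) are treated as 0
deltas = df_timed["origin_time_delta"].fillna(0).values
origin_time = np.zeros(len(df_timed))
# Assumes ids match row positions and ancestors precede descendants
for i in range(len(df_timed)):
    parent = ancestor_ids[i]
    if parent != i:  # the root is its own ancestor and stays at time 0
        origin_time[i] = origin_time[parent] + deltas[i]
df_timed["origin_time"] = origin_time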

# Time of this node's ancestor
df_timed = pfl.alifestd_mark_ancestor_origin_time_asexual(df_timed)

# Time elapsed since ancestor (branch length equivalent)
df_timed = pfl.alifestd_mark_origin_time_delta_asexual(df_timed)
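
The relationship between the two time-based columns can be sketched with plain lists. The values below are the origin times that the loop above would produce for this Newick string, assuming the hand-written id order shown; treating the root's ancestor time as its own time is an assumption of this sketch.

```python
# Tree: ((A:1,B:2):3,(C:4,D:5):6) with ids 0=root, 1=(A,B), 2=(C,D), 3=A, ...
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
origin_time = [0, 3, 6, 4, 5, 10, 11]

# Origin time of each node's ancestor (the root looks up itself).
ancestor_origin_time = [origin_time[parent] for parent in ancestor_ids]

# Time elapsed since the ancestor, which recovers the branch lengths.
origin_time_delta = [
    t - t_anc for t, t_anc in zip(origin_time, ancestor_origin_time)
]

print(ancestor_origin_time)  # [0, 0, 0, 3, 3, 6, 6]
print(origin_time_delta)     # [0, 3, 6, 1, 2, 4, 5]
```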

Counting Operations

Count functions return scalars:

df = pfl.alifestd_from_newick("((A,B),(C,(D,E)));")

pfl.alifestd_count_leaf_nodes(df)  # 5
pfl.alifestd_count_inner_nodes(df)  # 4
pfl.alifestd_count_root_nodes(df)  # 1
pfl.alifestd_count_unifurcations(df)  # 0
pfl.alifestd_count_polytomies(df)  # 0
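
The counts above can be reproduced by hand from each node's number of children. The sketch below assumes the usual definitions (a unifurcation is a node with exactly one child, a polytomy is a node with more than two, and the root counts as an inner node); the id order is hand-written for illustration.

```python
from collections import Counter

# Tree: ((A,B),(C,(D,E))) with ids
# 0=root, 1=(A,B), 2=(C,(D,E)), 3=A, 4=B, 5=C, 6=(D,E), 7=D, 8=E.
ancestor_ids = [0, 0, 0, 1, 1, 2, 2, 6, 6]

child_counts = Counter(
    parent for node_id, parent in enumerate(ancestor_ids) if parent != node_id
)
num_children = [child_counts[node_id] for node_id in range(len(ancestor_ids))]

leaf_nodes = sum(c == 0 for c in num_children)
inner_nodes = sum(c > 0 for c in num_children)
root_nodes = sum(p == i for i, p in enumerate(ancestor_ids))
unifurcations = sum(c == 1 for c in num_children)
polytomies = sum(c > 2 for c in num_children)

print(leaf_nodes, inner_nodes, root_nodes, unifurcations, polytomies)
# 5 4 1 0 0
```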

Finding Specific Nodes

import numpy as np

# Get arrays of IDs
leaf_ids = pfl.alifestd_find_leaf_ids(df)  # np.ndarray
root_ids = pfl.alifestd_find_root_ids(df)  # np.ndarray
# Look up the id of a node by its taxon label
# (.item() raises ValueError unless exactly one row matches)
node_id = df.loc[df["taxon_label"] == "A", "id"].item()

Validation

Check that a DataFrame conforms to the alife standard format:

is_valid = pfl.alifestd_validate(df)

# Check specific structural properties
pfl.alifestd_is_asexual(df)
pfl.alifestd_is_topologically_sorted(df)
pfl.alifestd_has_contiguous_ids(df)
pfl.alifestd_is_strictly_bifurcating_asexual(df)
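
As a rough guide to what the structural checks mean, here is a library-free sketch. The three helper functions are hypothetical and reflect one reading of the function names above (ids equal to row positions, ancestors listed before descendants, every node having zero or two children); phyloframe's exact criteria may differ.

```python
from collections import Counter


def has_contiguous_ids(ids):
    """True if ids are exactly 0, 1, ..., n-1 in row order."""
    return list(ids) == list(range(len(ids)))


def is_topologically_sorted(ids, ancestor_ids):
    """True if every non-root node appears after its ancestor."""
    seen = set()
    for node_id, parent in zip(ids, ancestor_ids):
        if parent != node_id and parent not in seen:
            return False
        seen.add(node_id)
    return True


def is_strictly_bifurcating(ids, ancestor_ids):
    """True if every node has either zero or exactly two children."""
    counts = Counter(p for i, p in zip(ids, ancestor_ids) if p != i)
    return all(counts[i] in (0, 2) for i in ids)


# ((A,B),(C,D)) in sorted order, then the same rows with a leaf moved first.
ids, ancestors = [0, 1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 2, 2]
shuffled_ids, shuffled_ancestors = [3, 0, 1, 2, 4, 5, 6], [1, 0, 0, 0, 1, 2, 2]
```

In this sketch the shuffled table still describes the same strictly bifurcating tree, but it fails the contiguity and ordering checks because leaf 3 is listed before its ancestor.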

Balance Metrics

df = pfl.alifestd_from_newick("((A,B),(C,D));")

# Colless balance index (per-node)
df = pfl.alifestd_mark_colless_index_asexual(df)

# Sackin balance index (per-node)
df = pfl.alifestd_mark_sackin_index_asexual(df)
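
For orientation, the sketch below computes both indices by hand under their textbook definitions: Colless sums, over inner nodes, the absolute difference in leaf counts between the two child subtrees, and Sackin sums the depths of all leaves. Reporting each node's value for the subtree rooted there, without normalization, is an assumption of this sketch and may not match phyloframe's columns exactly.

```python
def balance_indices(ancestor_ids):
    """Return per-node (colless, sackin) lists for a bifurcating tree.

    Assumes ids equal positions, ancestors precede descendants, and the
    root is its own ancestor.
    """
    n = len(ancestor_ids)
    children = [[] for _ in range(n)]
    for node_id, parent in enumerate(ancestor_ids):
        if parent != node_id:
            children[parent].append(node_id)

    num_leaves = [0] * n
    colless = [0] * n
    sackin = [0] * n
    for node_id in reversed(range(n)):  # children are finished before parents
        kids = children[node_id]
        if not kids:
            num_leaves[node_id] = 1
            continue
        num_leaves[node_id] = sum(num_leaves[k] for k in kids)
        # Every leaf below a child sits one edge deeper relative to this node.
        sackin[node_id] = sum(sackin[k] + num_leaves[k] for k in kids)
        left, right = kids  # exactly two children assumed
        colless[node_id] = (
            colless[left]
            + colless[right]
            + abs(num_leaves[left] - num_leaves[right])
        )
    return colless, sackin


balanced = [0, 0, 0, 1, 1, 2, 2]     # ((A,B),(C,D))
caterpillar = [0, 0, 0, 1, 1, 3, 3]  # (((A,B),C),D)

colless_bal, sackin_bal = balance_indices(balanced)
colless_cat, sackin_cat = balance_indices(caterpillar)
print(colless_bal[0], sackin_bal[0])  # 0 8
print(colless_cat[0], sackin_cat[0])  # 3 9
```

The balanced tree scores 0 on Colless, while the caterpillar tree with the same four leaves scores higher on both indices.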

MRCA (Most Recent Common Ancestor)

df_mrca = pfl.alifestd_from_newick("((A,B),(C,D));")

# Find leaf IDs
leaf_ids = pfl.alifestd_find_leaf_ids(df_mrca)

# MRCA of two specific nodes
mrca_id = pfl.alifestd_find_pair_mrca_id_asexual(
    df_mrca, leaf_ids[0], leaf_ids[1],
)
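
Conceptually, the MRCA of two nodes is the first node shared by both of their ancestor lineages. A minimal sketch of that idea on a plain ancestor list follows; find_pair_mrca is a hypothetical helper for illustration, not phyloframe's implementation.

```python
def find_pair_mrca(ancestor_ids, a, b):
    """Return the most recent common ancestor of a and b, or None."""
    # Record a and all of its ancestors up to the root.
    lineage = set()
    node = a
    while True:
        lineage.add(node)
        parent = ancestor_ids[node]
        if parent == node:  # reached a root
            break
        node = parent

    # Climb from b until we hit something on a's lineage.
    node = b
    while node not in lineage:
        parent = ancestor_ids[node]
        if parent == node:
            return None  # b belongs to a different tree
        node = parent
    return node


# ((A,B),(C,D)) with ids 0=root, 1=(A,B), 2=(C,D), 3=A, 4=B, 5=C, 6=D.
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
print(find_pair_mrca(ancestor_ids, 3, 4))  # 1
print(find_pair_mrca(ancestor_ids, 3, 5))  # 0
```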

Triplet Distance

Compare two trees using triplet-based distance:

tree1 = pfl.alifestd_from_newick("((A,B),(C,D));")
tree2 = pfl.alifestd_from_newick("((A,C),(B,D));")

dist = pfl.alifestd_calc_triplet_distance_asexual(tree1, tree2)
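
A triplet-based comparison looks at every set of three leaves and asks whether both trees group the same two of them most closely. The sketch below counts the triplets on which the two example trees disagree. It is a brute-force illustration with hypothetical helpers and hand-written ids; whether phyloframe reports a raw count or a normalized fraction is not specified here.

```python
from itertools import combinations


def node_depths(ancestor_ids):
    depth = [0] * len(ancestor_ids)
    for node_id, parent in enumerate(ancestor_ids):
        if parent != node_id:
            depth[node_id] = depth[parent] + 1
    return depth


def mrca(ancestor_ids, depth, a, b):
    # Repeatedly lift whichever node is deeper until the two meet.
    while a != b:
        if depth[a] >= depth[b]:
            a = ancestor_ids[a]
        else:
            b = ancestor_ids[b]
    return a


def closest_pair(ancestor_ids, leaf_ids, trio):
    """Return the pair in trio with the deepest MRCA, or None on a tie."""
    depth = node_depths(ancestor_ids)
    scores = {
        (x, y): depth[mrca(ancestor_ids, depth, leaf_ids[x], leaf_ids[y])]
        for x, y in combinations(trio, 2)
    }
    best = max(scores.values())
    winners = [pair for pair, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


def count_differing_triplets(tree_a, tree_b):
    """Return (number of triplets resolved differently, total triplets)."""
    labels = sorted(tree_a[1])
    differing = total = 0
    for trio in combinations(labels, 3):
        total += 1
        if closest_pair(*tree_a, trio) != closest_pair(*tree_b, trio):
            differing += 1
    return differing, total


# Each tree is (ancestor_ids, {leaf label: node id}).
tree1 = ([0, 0, 0, 1, 1, 2, 2], {"A": 3, "B": 4, "C": 5, "D": 6})  # ((A,B),(C,D))
tree2 = ([0, 0, 0, 1, 1, 2, 2], {"A": 3, "C": 4, "B": 5, "D": 6})  # ((A,C),(B,D))

print(count_differing_triplets(tree1, tree2))  # (4, 4)
print(count_differing_triplets(tree1, tree1))  # (0, 4)
```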

Distance Matrix

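Compute pairwise distances between leaf nodes. As in the time-based example, this example first accumulates origin_time from the branch lengths:

import numpy as np

df_dist = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")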
ancestor_ids = df_dist["ancestor_id"].values
deltas = df_dist["origin_time_delta"].fillna(0).values
origin_time = np.zeros(len(df_dist))
for i in range(len(df_dist)):
    parent = ancestor_ids[i]
    if parent != i:
        origin_time[i] = origin_time[parent] + deltas[i]
df_dist["origin_time"] = origin_time

# Returns a NumPy distance matrix
dist_matrix = pfl.alifestd_calc_distance_matrix_asexual(df_dist)
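
One common notion of leaf-to-leaf distance is the patristic distance: the total branch length along the path between two leaves, which equals t[a] + t[b] - 2 * t[mrca] when t is the origin time. The sketch below computes it by hand for the same tree. That phyloframe uses this definition, and this row order, is an assumption rather than something stated above.

```python
# ((A:1,B:2):3,(C:4,D:5):6) with ids 0=root, 1=(A,B), 2=(C,D), 3=A, 4=B, 5=C, 6=D.
ancestor_ids = [0, 0, 0, 1, 1, 2, 2]
origin_time = [0, 3, 6, 4, 5, 10, 11]
leaf_ids = [3, 4, 5, 6]  # A, B, C, D


def lineage(node):
    """Return the node followed by all of its ancestors, ending at the root."""
    path = [node]
    while ancestor_ids[node] != node:
        node = ancestor_ids[node]
        path.append(node)
    return path


def patristic(a, b):
    """Path length between a and b through their most recent common ancestor."""
    seen = set(lineage(a))
    mrca = next(node for node in lineage(b) if node in seen)
    return origin_time[a] + origin_time[b] - 2 * origin_time[mrca]


dist_matrix = [[patristic(a, b) for b in leaf_ids] for a in leaf_ids]
for row in dist_matrix:
    print(row)
# [0, 3, 14, 15]
# [3, 0, 15, 16]
# [14, 15, 0, 9]
# [15, 16, 9, 0]
```

For example, A and B are 1 + 2 = 3 apart, while A and C are 1 + 3 + 6 + 4 = 14 apart because their path passes through the root.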