========================================
Tree Manipulation and Pruning (Legacy)
========================================

.. note::

   This documentation covers the **legacy** API (``from phyloframe import legacy``).
   The legacy API is stable and will continue to be maintained for backward
   compatibility.
   A redesigned API will accompany phyloframe v1.0.0.


This guide covers operations that transform the structure of a phylogeny: collapsing, splaying, pruning, downsampling, and masking.

.. note::

   Structural transforms may invalidate previously computed columns (e.g., ``node_depth``, ``num_descendants``).
   See :doc:`concepts` for details on the topological sensitivity system.

Structural Transformations
==========================

Collapsing Unifurcations
------------------------

Remove single-child (unifurcating) nodes, connecting their parent directly to their child:

.. code-block:: python

   from phyloframe import legacy as pfl

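   # Note: this example tree is strictly bifurcating, so it contains no
   # unifurcations and the call below leaves its topology unchanged
   df = pfl.alifestd_from_newick("((A,B),(C,D));")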
   df = df.pipe(pfl.alifestd_collapse_unifurcations)
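
The rewiring can be pictured with a small pure-Python sketch. It is illustrative only and is not phyloframe's implementation: ``collapse_unifurcations_sketch`` is a hypothetical helper, the tree is a plain ``{id: ancestor_id}`` dict in which each root is listed as its own ancestor, and roots are always kept (an assumption of this sketch):

.. code-block:: python

   def collapse_unifurcations_sketch(ancestor_of):
       """Drop non-root nodes that have exactly one child (hypothetical helper)."""
       child_count = {node: 0 for node in ancestor_of}
       for node, ancestor in ancestor_of.items():
           if ancestor != node:  # a root is its own ancestor
               child_count[ancestor] += 1

       def is_removed(node):
           is_root = ancestor_of[node] == node
           return child_count[node] == 1 and not is_root

       collapsed = {}
       for node, ancestor in ancestor_of.items():
           if is_removed(node):
               continue
           # climb past removed ancestors; roots are never removed,
           # so the climb always terminates
           while is_removed(ancestor):
               ancestor = ancestor_of[ancestor]
           collapsed[node] = ancestor
       return collapsed


   # node 1 has a single child (node 2), so it is a unifurcation
   tree = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 0}
   collapsed = collapse_unifurcations_sketch(tree)

Here node 2 ends up attached directly to node 0, and node 1 no longer appears in the result.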

Splaying Polytomies
-------------------

Expand multi-child (polytomy) nodes into a cascade of bifurcations:

.. code-block:: python

   df = pfl.alifestd_splay_polytomies(df)
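
As a rough illustration of what "a cascade of bifurcations" means, the sketch below peels one child at a time off a polytomy and hangs the remaining children under a newly created internal node. It is not phyloframe's implementation: ``splay_polytomies_sketch`` is a hypothetical helper, the tree is a plain ``{id: ancestor_id}`` dict with roots as their own ancestors, and the choice of new ids (counting up from the largest existing id) is an assumption:

.. code-block:: python

   def splay_polytomies_sketch(ancestor_of):
       """Turn every node with more than two children into a cascade."""
       children = {}
       for node, ancestor in ancestor_of.items():
           if ancestor != node:  # skip roots, which are their own ancestor
               children.setdefault(ancestor, []).append(node)

       splayed = dict(ancestor_of)
       next_id = max(ancestor_of) + 1
       for parent, kids in children.items():
           attach_to = parent
           # peel off one child at a time until only two remain
           while len(kids) > 2:
               first, kids = kids[0], kids[1:]
               splayed[first] = attach_to
               splayed[next_id] = attach_to  # new internal node
               attach_to = next_id
               next_id += 1
           for kid in kids:
               splayed[kid] = attach_to
       return splayed


   # root 0 has four children: a polytomy
   star = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
   splayed = splay_polytomies_sketch(star)

   # count children per node in the result
   counts = {}
   for node, ancestor in splayed.items():
       if ancestor != node:
           counts[ancestor] = counts.get(ancestor, 0) + 1

In the result, nodes 5 and 6 are the inserted internal nodes, and no node has more than two children.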

Root Operations
---------------

Add a synthetic global root, or join multiple existing roots under the oldest one:

.. code-block:: python

   # Add a synthetic global root above all existing roots
   df = pfl.alifestd_add_global_root(df)

   # Alternatively, set attributes on the new root as it is added
   # (use one of these two calls, not both, to avoid stacking two roots)
   df = pfl.alifestd_add_global_root(
       df, root_attrs={"origin_time": 0.0, "taxon_label": "global_root"},
   )

   # Join multiple roots to the oldest root
   df = pfl.alifestd_join_roots(df)

Rerooting
---------

Change the root of a tree to a specified node:

.. code-block:: python

   df_reroot = pfl.alifestd_from_newick(
       "((A,B),(C,D));", create_ancestor_list=True,
   )
   leaf_ids = pfl.alifestd_find_leaf_ids(df_reroot)
   df_reroot = pfl.alifestd_reroot_at_id_asexual(
       df_reroot, int(leaf_ids[0]),
   )
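
One way to picture rerooting is as reversing the ancestor pointers along the path from the new root up to the old root, leaving every other node untouched. The sketch below shows that idea on a plain ``{id: ancestor_id}`` dict (roots are their own ancestors). ``reroot_at_sketch`` is a hypothetical helper, and the library's handling of details such as branch lengths may differ:

.. code-block:: python

   def reroot_at_sketch(ancestor_of, new_root):
       """Reverse ancestor pointers on the path from new_root to the old root."""
       rerooted = dict(ancestor_of)
       node, previous = new_root, new_root
       while True:
           ancestor = ancestor_of[node]
           # point this node at the node we arrived from
           # (the new root points at itself)
           rerooted[node] = previous
           if ancestor == node:  # reached the old root
               break
           previous, node = node, ancestor
       return rerooted


   tree = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
   rerooted = reroot_at_sketch(tree, new_root=3)

After the call, node 3 is its own ancestor, node 1 hangs below node 3, and the old root 0 hangs below node 1. Nodes 2 and 4 keep their original ancestors.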

Ladderization
-------------

Reorder children for consistent visual presentation:

.. code-block:: python

   df_ladder = pfl.alifestd_from_newick(
       "((A,B),(C,D));", create_ancestor_list=True,
   )
   df_ladder = pfl.alifestd_ladderize_asexual(df_ladder)

Trunk Operations
----------------

Delete unifurcating nodes at the root of a tree:

.. code-block:: python

   df_trunk = pfl.alifestd_make_comb(n_leaves=5)
   df_trunk = pfl.alifestd_to_working_format(df_trunk)

   # Delete unifurcating root nodes
   df_trunk = pfl.alifestd_delete_unifurcating_roots_asexual(df_trunk)

Aggregating Multiple Phylogenies
---------------------------------

Concatenate independent phylogenies with ID reassignment:

.. code-block:: python

   tree1 = pfl.alifestd_from_newick(
       "(A,B);", create_ancestor_list=True,
   )
   tree2 = pfl.alifestd_from_newick(
       "(C,D);", create_ancestor_list=True,
   )
   combined = pfl.alifestd_aggregate_phylogenies([tree1, tree2])
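
ID reassignment is needed because independently built trees usually reuse the same ids. A minimal sketch of the idea, assuming plain ``{id: ancestor_id}`` dicts with roots as their own ancestors, shifts each tree's ids by a running offset. ``aggregate_sketch`` is a hypothetical helper, and the exact ids the library assigns may differ:

.. code-block:: python

   def aggregate_sketch(trees):
       """Merge trees into one mapping, shifting ids so they never collide."""
       combined = {}
       offset = 0
       for ancestor_of in trees:
           for node, ancestor in ancestor_of.items():
               combined[node + offset] = ancestor + offset
           # the next tree starts just past the largest id used so far
           offset += max(ancestor_of, default=-1) + 1
       return combined


   first = {0: 0, 1: 0, 2: 0}
   second = {0: 0, 1: 0, 2: 0}
   merged = aggregate_sketch([first, second])

   # each root is still its own ancestor after shifting
   roots = [node for node, ancestor in merged.items() if node == ancestor]

The merged mapping holds two roots, one per input tree, which is why an aggregated phylogeny is a forest rather than a single tree.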

Chronological Sorting
---------------------

Sort rows by ``origin_time`` for time-based analyses:

.. code-block:: python

   import numpy as np

   df_chrono = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
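   # Compute origin_time by accumulating origin_time_delta from the root;
   # assumes ids match row positions and ancestors precede descendants,
   # with each root listed as its own ancestor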
   deltas = df_chrono["origin_time_delta"].fillna(0).values
   ancestor_ids = df_chrono["ancestor_id"].values
   origin_time = np.empty(len(df_chrono))
   for i in range(len(df_chrono)):
       parent = ancestor_ids[i]
       origin_time[i] = 0.0 if parent == i else origin_time[parent] + deltas[i]
   df_chrono["origin_time"] = origin_time
   df_chrono = pfl.alifestd_chronological_sort(df_chrono)

Tip Sampling (Mark Functions)
=============================

Tip sampling mark functions add a boolean column indicating which tips to retain.
They do **not** remove any rows; apply a pruning step afterward.

Uniform Random Sampling
------------------------

.. code-block:: python

   df_sample = pfl.alifestd_make_balanced_bifurcating(depth=5)
   df_sample = df_sample.pipe(pfl.alifestd_to_working_format)

   # Mark 10 randomly selected tips
   df_sample = pfl.alifestd_mark_sample_tips_uniform_asexual(
       df_sample, n_sample=10, mark_as="keep", seed=42,
   )
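
To make the "mark, don't remove" behavior concrete, here is a small standalone sketch of uniform tip marking. It is not the library's implementation: ``mark_sample_tips_uniform_sketch`` is a hypothetical helper operating on a plain ``{id: ancestor_id}`` dict (roots are their own ancestors), and it returns a ``{id: bool}`` mapping where the library adds a boolean column:

.. code-block:: python

   import random


   def mark_sample_tips_uniform_sketch(ancestor_of, n_sample, seed=None):
       """Return {id: bool} marking n_sample tips chosen uniformly at random."""
       has_child = {a for node, a in ancestor_of.items() if a != node}
       tips = sorted(node for node in ancestor_of if node not in has_child)
       rng = random.Random(seed)  # local generator, so the seed is reproducible
       chosen = set(rng.sample(tips, min(n_sample, len(tips))))
       # every node gets an entry; nothing is removed
       return {node: node in chosen for node in ancestor_of}


   tree = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
   marks = mark_sample_tips_uniform_sketch(tree, n_sample=2, seed=42)

The result has one entry per node, with exactly two tips marked ``True`` and all internal nodes marked ``False``.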

Canopy Sampling
---------------

Retain tips with the largest values in a criterion column (e.g., the most recent tips by ``origin_time``):

.. code-block:: python

   df_sample = pfl.alifestd_mark_sample_tips_canopy_asexual(
       df_sample,
       criterion="origin_time",
       mark_as="keep_canopy",
       n_sample=5,
   )
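
The selection rule itself is simple to sketch outside the library: find the tips, rank them by the criterion, and mark the top ``n_sample``. The helper below, ``mark_canopy_sketch``, is hypothetical and works on plain dicts. Ties are resolved by Python's stable sort, which may not match the library's tie-breaking:

.. code-block:: python

   def mark_canopy_sketch(ancestor_of, criterion, n_sample):
       """Return {id: bool} marking the n_sample tips with largest criterion."""
       has_child = {a for node, a in ancestor_of.items() if a != node}
       tips = [node for node in ancestor_of if node not in has_child]
       # largest criterion values first
       ranked = sorted(tips, key=lambda node: criterion[node], reverse=True)
       chosen = set(ranked[:n_sample])
       return {node: node in chosen for node in ancestor_of}


   tree = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
   origin_time = {0: 0.0, 1: 1.0, 2: 5.0, 3: 2.0, 4: 3.0}
   canopy = mark_canopy_sketch(tree, origin_time, n_sample=2)

With these values the tips are 2, 3, and 4, and the two most recent ones (2 and 4) are marked.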

Lineage Sampling
----------------

Retain tips closest to a focal lineage (the lineage of the tip with the largest criterion value):

.. code-block:: python

   df_sample = pfl.alifestd_mark_sample_tips_lineage_asexual(
       df_sample, n_sample=5, mark_as="keep_lineage",
   )

Combining Sample Masks
======================

Because sample marks are boolean columns, they compose naturally with standard boolean operations:

.. code-block:: python

   # Keep tips matching EITHER criterion (union)
   df_sample["keep"] = (
       df_sample["keep_canopy"] | df_sample["keep_lineage"]
   )

   # Keep tips matching BOTH criteria (intersection)
   df_sample["keep"] = (
       df_sample["keep_canopy"] & df_sample["keep_lineage"]
   )

   # Invert a selection
   df_sample["keep"] = ~df_sample["keep_canopy"]

Pruning
=======

Pruning Extinct Lineages
------------------------

Remove lineages that have no extant descendants.
The ``criterion`` parameter specifies which boolean column marks extant taxa (default: ``"extant"``):

.. code-block:: python

   df_prune = pfl.alifestd_make_balanced_bifurcating(depth=4)
   df_prune = df_prune.pipe(pfl.alifestd_to_working_format)

   # Mark which taxa are extant (e.g., those with id >= threshold)
   threshold = len(df_prune) // 2
   df_prune["extant"] = df_prune["id"] >= threshold

   # Remove lineages without extant descendants
   df_prune = pfl.alifestd_prune_extinct_lineages_asexual(df_prune)

You can also use a custom criterion column name:

.. code-block:: python

   df_prune2 = pfl.alifestd_make_balanced_bifurcating(depth=4)
   df_prune2 = df_prune2.pipe(pfl.alifestd_to_working_format)
   df_prune2["is_alive"] = True
   df_prune2 = pfl.alifestd_prune_extinct_lineages_asexual(
       df_prune2, criterion="is_alive",
   )
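
Conceptually, pruning keeps every extant taxon together with all of its ancestors and drops everything else. The sketch below illustrates that rule on a plain ``{id: ancestor_id}`` dict (roots are their own ancestors). ``prune_extinct_lineages_sketch`` is a hypothetical helper, not the library routine, and it does not collapse any unifurcations that pruning leaves behind:

.. code-block:: python

   def prune_extinct_lineages_sketch(ancestor_of, extant):
       """Keep extant nodes and their ancestors; drop all other nodes."""
       keep = set()
       for node in ancestor_of:
           if not extant[node]:
               continue
           # walk rootward, stopping once we reach an already-kept node
           # (a root is its own ancestor, so the walk ends there too)
           while node not in keep:
               keep.add(node)
               node = ancestor_of[node]
       return {n: a for n, a in ancestor_of.items() if n in keep}


   tree = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
   extant = {0: False, 1: False, 2: False, 3: True, 4: False, 5: False}
   pruned = prune_extinct_lineages_sketch(tree, extant)

Only node 3 is extant, so the result retains node 3 and its ancestors 1 and 0.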

Coarsening with a Mask
----------------------

Keep only rows matching a boolean mask, re-wiring ancestor relationships to maintain tree connectivity:

.. code-block:: python

   df_coarsen = (
       pfl.alifestd_from_newick("((A,B),(C,D));", create_ancestor_list=True)
       .pipe(pfl.alifestd_mark_node_depth_asexual)
       .pipe(pfl.alifestd_mark_leaves)
   )
   mask = (df_coarsen["node_depth"] % 2 == 0) | df_coarsen["is_leaf"]
   df_coarsen = pfl.alifestd_coarsen_mask(df_coarsen, mask)
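
The rewiring step can be sketched as "attach each retained node to its nearest retained ancestor". The helper below, ``coarsen_mask_sketch``, is hypothetical and works on plain dicts rather than DataFrames. Turning a retained node into a root when none of its ancestors are retained is an assumption of this sketch, not documented library behavior:

.. code-block:: python

   def coarsen_mask_sketch(ancestor_of, mask):
       """Keep masked-in nodes, attaching each to its nearest kept ancestor."""
       coarsened = {}
       for node, ancestor in ancestor_of.items():
           if not mask[node]:
               continue
           # climb past dropped ancestors, stopping at the root if need be
           while not mask[ancestor] and ancestor_of[ancestor] != ancestor:
               ancestor = ancestor_of[ancestor]
           # with no retained ancestor, the node becomes its own root
           coarsened[node] = ancestor if mask[ancestor] else node
       return coarsened


   tree = {0: 0, 1: 0, 2: 1, 3: 2, 4: 1}
   mask = {0: True, 1: False, 2: False, 3: True, 4: True}
   coarse = coarsen_mask_sketch(tree, mask)

Nodes 1 and 2 are dropped, so nodes 3 and 4 are both rewired to node 0.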

Composed Example: Multi-criteria Downsampling
=============================================

A complete workflow combining multiple sampling strategies and pruning:

.. code-block:: python

   from phyloframe import legacy as pfl

   # Load or create a phylogeny with origin times
   df = (
       pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,(D:5,E:6):7):8);")
       .pipe(pfl.alifestd_mark_sample_tips_canopy_asexual, n_sample=2, mark_as="keep_canopy")
       .pipe(pfl.alifestd_mark_sample_tips_lineage_asexual, n_sample=2, mark_as="keep_lineage")
   )

   # Combine: OR the masks to keep tips matching either criterion
   df["extant"] = df["keep_canopy"] | df["keep_lineage"]

   # Prune lineages with no extant descendants
   result = df.pipe(pfl.alifestd_prune_extinct_lineages_asexual)

Downsampling (Combined Mark + Prune)
=====================================

For convenience, ``alifestd_downsample_tips_*`` functions combine the mark and prune steps:

.. code-block:: python

   df_ds = pfl.alifestd_make_balanced_bifurcating(depth=5)
   df_ds = df_ds.pipe(pfl.alifestd_to_working_format)

   # Uniform random downsampling
   df_u = pfl.alifestd_downsample_tips_uniform_asexual(
       df_ds, n_downsample=10,
   )

   # Canopy downsampling
   df_c = pfl.alifestd_downsample_tips_canopy_asexual(
       df_ds, n_downsample=5,
   )

   # Lineage-based downsampling
   df_l = pfl.alifestd_downsample_tips_lineage_asexual(
       df_ds, n_downsample=5,
   )

These are equivalent to calling the corresponding mark function, setting the ``"extant"`` column, and then pruning extinct lineages.
Use the separate mark + prune workflow (above) when you need to combine multiple sampling criteria.