==========
Quickstart
==========

This guide walks through the basics of phyloframe: creating phylogenies, inspecting tree structure, marking properties, transforming trees, and exporting results.

Installation
============

Install phyloframe with JIT acceleration (recommended):

.. code-block:: bash

    python3 -m pip install "phyloframe[jit]==0.10.0"

Omit ``[jit]`` if you do not need Numba-based just-in-time compilation:

.. code-block:: bash

    python3 -m pip install "phyloframe==0.10.0"

Import Convention
=================

.. code-block:: python

    from phyloframe import legacy as pfl

The ``legacy`` module contains all current phyloframe operations. As phyloframe evolves, ``legacy`` will continue to be maintained for backward compatibility while new API designs are developed.

The Official Alife Standard Format
===================================

Phyloframe represents phylogenies as DataFrames in the **alife standard format**. Each row represents an organism (or taxon).

Required Columns
----------------

``id`` : int
    Unique, non-negative identifier for this organism.

``ancestor_list`` : str
    JSON-encoded list of ancestor IDs. For asexual phylogenies, this is a single-element list like ``"[0]"``. Root nodes use ``"[None]"``, ``"[none]"``, or ``"[]"``.

    .. note::

        The ambiguity of root representations (``"[None]"`` vs ``"[none]"`` vs ``"[]"``) is a known defect in the current alife data standard. The use of ``none`` also deviates from valid JSON. The string encoding additionally incurs parsing overhead on every access. Phyloframe's ``ancestor_id`` column avoids these issues.

Optional Columns (Official Standard)
-------------------------------------

``origin_time`` : numeric
    Time at which this organism originated.

``destruction_time`` : numeric
    Time at which this organism was destroyed or went extinct.

``taxon_label`` : str
    Human-readable label or species name for this organism.

See the alife data standards specification for the full list of official optional columns.

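
To make the root-encoding ambiguity of ``ancestor_list`` concrete, the sketch below decodes entries using only the Python standard library. The helper ``parse_ancestor_list`` is a hypothetical name introduced for illustration and is not part of phyloframe; it assumes each entry is either a bracketed list of integer IDs or one of the three root spellings.

.. code-block:: python

    import json


    def parse_ancestor_list(encoded):
        """Decode one ``ancestor_list`` entry into a list of integer IDs.

        Hypothetical helper for illustration; not part of phyloframe.
        """
        # "None" and "none" are not valid JSON, so map both to JSON null.
        normalized = encoded.replace("None", "null").replace("none", "null")
        # Drop null entries so every root spelling decodes to an empty list.
        return [int(x) for x in json.loads(normalized) if x is not None]


    # The three root spellings all decode to "no ancestors".
    root_spellings = ["[None]", "[none]", "[]"]
    decoded_roots = [parse_ancestor_list(s) for s in root_spellings]

    # Asexual entries hold one ancestor; sexual entries may hold several.
    decoded_asexual = parse_ancestor_list("[0]")
    decoded_sexual = parse_ancestor_list("[3, 7]")

With a pandas DataFrame, the same helper could be applied row by row with ``df["ancestor_list"].map(parse_ancestor_list)``, which also shows where the per-access parsing cost comes from.
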

Unofficial Extension: ``ancestor_id``
--------------------------------------

``ancestor_id`` : int
    Direct ancestor ID for asexual phylogenies. This is an optimized integer representation of ``ancestor_list`` that avoids repeated string parsing. Root nodes store their own ID as ``ancestor_id``.

All phyloframe operations on asexual trees support ``ancestor_id`` in place of ``ancestor_list``. Using ``ancestor_id`` is recommended unless interoperability with other alife standard ecosystem tools is needed. Use ``alifestd_try_add_ancestor_list_col`` to generate ``ancestor_list`` on demand when required:

.. code-block:: python

    df = pfl.alifestd_try_add_ancestor_list_col(df)

Additional user-defined columns (e.g., trait data, fitness values) can be freely added --- the DataFrame is yours to extend.

Representing Roots
------------------

Root nodes have no ancestor. In ``ancestor_list``, this is represented as ``"[None]"``, ``"[none]"``, or ``"[]"``. In ``ancestor_id``, roots store their own ID (i.e., ``ancestor_id == id``).

Example
-------

.. code-block:: python

    import pandas as pd

    # A simple three-node chain: root -> internal -> leaf
    phylogeny_df = pd.DataFrame({
        "id": [0, 1, 2],
        "ancestor_list": ["[None]", "[0]", "[1]"],
    })

This represents::

    0 (root)
    +-- 1 (internal)
        +-- 2 (leaf)

Asexual vs. Sexual Phylogenies
------------------------------

**Asexual** phylogenies have at most one ancestor per organism (i.e., single-element ``ancestor_list``). Most phyloframe operations target asexual phylogenies, where the ``ancestor_id`` column enables fast integer-based lookups.

**Sexual** phylogenies allow multiple ancestors (e.g., ``"[3, 7]"``). Sexual phylogeny support is limited to operations that work with ``ancestor_list`` directly (primarily in Pandas).

.. code-block:: python

    # Check phylogeny type
    pfl.alifestd_is_asexual(phylogeny_df)  # True
    pfl.alifestd_is_sexual(phylogeny_df)  # False

Creating Phylogenies
====================

From Scratch
------------

.. code-block:: python

    # Empty phylogeny
    empty_df = pfl.alifestd_make_empty()

    # Empty phylogeny with ancestor_id column
    empty_df = pfl.alifestd_make_empty(ancestor_id=True)

Synthetic Trees
---------------

.. code-block:: python

    # Balanced bifurcating tree
    #   depth=1 -> 1 node (root only)
    #   depth=3 -> 7 nodes (4 leaves)
    #   depth=n -> 2^n - 1 nodes, 2^(n-1) leaves
    balanced_df = pfl.alifestd_make_balanced_bifurcating(depth=3)

    # Comb (caterpillar) tree
    comb_df = pfl.alifestd_make_comb(n_leaves=10)

From Newick Format
------------------

.. code-block:: python

    # Parse a Newick string
    df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

    # The result includes columns: id, ancestor_id, taxon_label,
    # origin_time_delta, and branch_length
    print(df.columns.tolist())

Working Format
==============

Many phyloframe operations run fastest when the DataFrame is in **working format**:

1. **Topologically sorted** --- ancestors appear before descendants.
2. **Contiguous IDs** --- each organism's ``id`` equals its row number.
3. **``ancestor_id`` column** --- integer ancestor reference (asexual only).

Convert to working format once, then chain operations:

.. code-block:: python

    df = pfl.alifestd_make_balanced_bifurcating(depth=3)
    df = pfl.alifestd_to_working_format(df)

    # Verify properties
    assert pfl.alifestd_is_topologically_sorted(df)
    assert pfl.alifestd_has_contiguous_ids(df)

Marking Properties
==================

"Mark" functions add computed columns to a phylogeny DataFrame. The original data is preserved; a new column is appended.

.. code-block:: python

    df = pfl.alifestd_pipe_unary_ops(
        pfl.alifestd_from_newick("((A,B),(C,D));"),
        pfl.alifestd_mark_leaves,  # leaf detection
        pfl.alifestd_mark_node_depth_asexual,  # depth from root
        pfl.alifestd_mark_num_descendants_asexual,  # descendant count
        pfl.alifestd_mark_num_children_asexual,  # direct children count
        pfl.alifestd_mark_roots,  # root detection
    )
    print(df[["id", "ancestor_id", "is_leaf", "node_depth",
              "num_descendants", "num_children", "is_root"]])

Custom Column Names
-------------------

All mark functions accept a ``mark_as`` parameter to customize the output column name:

.. code-block:: python

    df = pfl.alifestd_mark_leaves(df, mark_as="is_tip")
    df = pfl.alifestd_mark_node_depth_asexual(df, mark_as="depth")

Counting and Querying
=====================

.. code-block:: python

    df = pfl.alifestd_from_newick("((A,B),(C,D));")

    pfl.alifestd_count_leaf_nodes(df)  # 4
    pfl.alifestd_count_inner_nodes(df)  # 3
    pfl.alifestd_count_root_nodes(df)  # 1
    pfl.alifestd_is_asexual(df)  # True
    pfl.alifestd_is_topologically_sorted(df)  # True/False
    pfl.alifestd_has_contiguous_ids(df)  # True/False

    # Validate format compliance
    pfl.alifestd_validate(df)  # True

Tree Transformations
====================

.. code-block:: python

    df = pfl.alifestd_pipe_unary_ops(
        pfl.alifestd_from_newick("((A,B),(C,D));"),
        pfl.alifestd_collapse_unifurcations,  # remove single-child nodes
        pfl.alifestd_splay_polytomies,  # expand polytomies into bifurcations
        pfl.alifestd_add_global_root,  # add synthetic root above all roots
        pfl.alifestd_join_roots,  # join multiple roots to oldest root
    )

Composed Example: Downsampling with Combined Masks
===================================================

A common workflow: select tips using multiple sampling criteria, combine them with boolean OR, and prune extinct lineages.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from phyloframe import legacy as pfl

    # Create a tree with origin times computed from branch length deltas
    df = pfl.alifestd_from_newick(
        "((A:1,B:2):3,(C:4,(D:5,E:6):7):8);",
    )
    ancestor_ids = df["ancestor_id"].values
    deltas = df["origin_time_delta"].fillna(0).values
    origin_time = np.zeros(len(df))
    for i in range(len(df)):
        parent = ancestor_ids[i]
        if parent != i:
            origin_time[i] = origin_time[parent] + deltas[i]
    df["origin_time"] = origin_time

    # Strategy 1: keep the most recent tips (canopy sampling)
    df = pfl.alifestd_mark_sample_tips_canopy_asexual(
        df, n_sample=2, mark_as="keep_canopy",
    )

    # Strategy 2: keep tips closest to a focal lineage
    df = pfl.alifestd_mark_sample_tips_lineage_asexual(
        df, n_sample=2, mark_as="keep_lineage",
    )

    # Combine masks with boolean OR --- keep tips matching either criterion
    df["extant"] = df["keep_canopy"] | df["keep_lineage"]

    # Prune lineages without any extant descendants
    pruned_df = pfl.alifestd_prune_extinct_lineages_asexual(df)
    print(pruned_df[["id", "ancestor_id"]])

The ``alifestd_mark_sample_tips_*`` functions add boolean columns indicating which tips to retain. Combining masks with ``|`` (OR), ``&`` (AND), or ``~`` (NOT) gives full control over tip selection. The ``alifestd_prune_extinct_lineages_asexual`` function then removes any lineages that have no descendants marked as extant via the ``"extant"`` column (configurable with the ``criterion`` parameter).

Newick I/O
==========

.. code-block:: python

    # Parse Newick
    df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

    # Export to Newick
    newick_str = pfl.alifestd_as_newick_asexual(df)

    # Use taxon labels from a column
    newick_str = pfl.alifestd_as_newick_asexual(df, taxon_label="taxon_label")

CSV and Parquet I/O
-------------------

Because phyloframe uses standard DataFrames, loading and saving is trivial:

.. code-block:: python

    import pandas as pd

    # CSV round-trip
    df.to_csv("phylogeny.csv", index=False)
    df = pd.read_csv("phylogeny.csv")

    # Parquet round-trip (recommended for large trees)
    df.to_parquet("phylogeny.pqt")
    df = pd.read_parquet("phylogeny.pqt")

    # Polars Parquet
    import polars as pl
    df_polars = pl.read_parquet("phylogeny.pqt")

Mutation Semantics
==================

By default, operations return a new DataFrame without modifying the input:

.. code-block:: python

    original = df.copy()
    result = pfl.alifestd_mark_leaves(df)
    assert original.equals(df)  # input unchanged

Set ``mutate=True`` to allow in-place modification for better performance in pipelines. Even with ``mutate=True``, always use the return value:

.. code-block:: python

    # Faster: allows reuse of input memory
    df = pfl.alifestd_mark_leaves(df, mutate=True)
    df = pfl.alifestd_mark_node_depth_asexual(df, mutate=True)

Piping Operations
=================

Pandas provides ``DataFrame.pipe()`` for chaining operations idiomatically:

.. code-block:: python

    result = (
        df.pipe(pfl.alifestd_collapse_unifurcations)
        .pipe(pfl.alifestd_mark_leaves)
        .pipe(pfl.alifestd_mark_node_depth_asexual)
    )

Polars DataFrames also support ``.pipe()``:

.. code-block:: python

    import polars as pl

    df_pl = pfl.alifestd_from_newick_polars("((A,B),(C,D));")
    result_pl = (
        df_pl.pipe(pfl.alifestd_mark_leaves_polars)
    )

Alternatively, ``alifestd_pipe_unary_ops`` accepts multiple operations:

.. code-block:: python

    result = pfl.alifestd_pipe_unary_ops(
        df,
        pfl.alifestd_collapse_unifurcations,
        pfl.alifestd_mark_leaves,
        lambda df: pfl.alifestd_mark_node_depth_asexual(df, mark_as="depth"),
    )

Use ``tqdm`` for progress feedback on long pipelines:

.. code-block:: python

    from tqdm import tqdm

    result = pfl.alifestd_pipe_unary_ops(
        df,
        pfl.alifestd_collapse_unifurcations,
        pfl.alifestd_mark_leaves,
        progress_wrap=tqdm,
    )

Next Steps
==========

- :doc:`legacy-guides/concepts` --- Data format, tree data structures, and design decisions
- :doc:`legacy-guides/tree_creation` --- Synthetic trees, parsing, and construction
- :doc:`legacy-guides/tree_properties` --- Marking, counting, and metrics
- :doc:`legacy-guides/tree_manipulation` --- Transformations, pruning, and downsampling
- :doc:`legacy-guides/traversals` --- Tree traversal and supplemental data structures
- :doc:`legacy-guides/io` --- Newick, CSV, and Parquet I/O
- :doc:`legacy-guides/cli` --- Command-line interface and pipe operations
- :doc:`legacy-guides/performance` --- JIT compilation, Polars, and optimization
- :doc:`api` --- Full API reference