=============================
Creating Phylogenies (Legacy)
=============================

.. note::

   This documentation covers the **legacy** API (``from phyloframe import legacy``).
   The legacy API is stable and will continue to be maintained for backward
   compatibility.
   A redesigned API will accompany phyloframe v1.0.0.


This guide covers the different ways to create phylogeny DataFrames in phyloframe.

Empty Phylogenies
=================

``alifestd_make_empty`` creates a zero-row DataFrame with the correct column names and dtypes (``id`` as int, ``ancestor_list`` as str, and optionally ``ancestor_id`` as int).
This ensures downstream operations receive properly typed input:

.. code-block:: python

   from phyloframe import legacy as pfl

   # Minimal empty DataFrame (id and ancestor_list columns)
   df = pfl.alifestd_make_empty()

   # With ancestor_id column pre-created
   df = pfl.alifestd_make_empty(ancestor_id=True)

   # Polars version (ancestor_id=True by default)
   df_polars = pfl.alifestd_make_empty_polars()
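
To see why a typed empty frame matters, the sketch below uses plain pandas only. It is not how ``alifestd_make_empty`` is implemented, and the explicit ``int64`` dtype is an assumption chosen for illustration. A frame built from column names alone typically falls back to generic ``object`` columns, whereas declaring dtypes up front keeps ``id`` numeric even with zero rows:

.. code-block:: python

   import pandas as pd

   # Column names only: no dtype information is attached to the columns.
   untyped = pd.DataFrame(columns=["id", "ancestor_list"])

   # Declaring dtypes up front preserves the schema even with zero rows.
   typed = pd.DataFrame({
       "id": pd.Series(dtype="int64"),
       "ancestor_list": pd.Series(dtype=str),
   })

   print(untyped.dtypes)
   print(typed.dtypes)

Rows appended to the typed frame keep an integer ``id`` column, so later joins and comparisons on ``id`` behave as expected.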

From Scratch with Pandas
========================

Build a phylogeny by constructing a DataFrame directly:

.. code-block:: python

   import pandas as pd
   from phyloframe import legacy as pfl

   # A simple tree:
   #       0 (root)
   #      / \
   #     1   2
   #    / \
   #   3   4
   phylogeny_df = pd.DataFrame({
       "id": [0, 1, 2, 3, 4],
       "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
   })

   # Validate the format
   assert pfl.alifestd_validate(phylogeny_df)

You can include additional columns at creation time:

.. code-block:: python

   phylogeny_df = pd.DataFrame({
       "id": [0, 1, 2, 3, 4],
       "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
       "origin_time": [0, 10, 10, 20, 20],
       "taxon_label": ["root", "A", "B", "C", "D"],
   })
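
The ``ancestor_list`` strings above encode each row's parent ids. As a rough illustration of that encoding, the pure-Python sketch below decodes the strings and checks for a single root and no dangling ancestors. It is not what ``alifestd_validate`` does internally, and ``parse_ancestor_list`` is a hypothetical helper written for this example:

.. code-block:: python

   def parse_ancestor_list(text):
       """Decode a string such as "[None]" or "[0]" into a list of int ids."""
       inner = text.strip()[1:-1].strip()  # drop the surrounding brackets
       if inner == "" or inner.lower() == "none":
           return []  # root: no ancestors
       return [int(token) for token in inner.split(",")]

   columns = {
       "id": [0, 1, 2, 3, 4],
       "ancestor_list": ["[None]", "[0]", "[0]", "[1]", "[1]"],
   }

   ids = set(columns["id"])
   parents = {
       node: parse_ancestor_list(text)
       for node, text in zip(columns["id"], columns["ancestor_list"])
   }

   # Roots have an empty ancestor list; dangling rows name an unknown id.
   roots = [node for node, anc in parents.items() if not anc]
   dangling = [node for node, anc in parents.items() if any(a not in ids for a in anc)]

   print("roots:", roots)
   print("dangling:", dangling)

The same ``columns`` dictionary can be passed straight to ``pd.DataFrame`` to obtain the phylogeny shown earlier.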

Synthetic Trees
===============

Balanced Bifurcating Trees
--------------------------

Create perfectly balanced binary trees by specifying depth:

.. code-block:: python

   # depth=0 -> empty tree
   # depth=1 -> 1 node (root only)
   # depth=2 -> 3 nodes (1 root, 2 leaves)
   # depth=3 -> 7 nodes (3 internal, 4 leaves)
   # depth=n -> 2**n - 1 total nodes, 2**(n-1) leaves

   df = pfl.alifestd_make_balanced_bifurcating(depth=4)  # 15 nodes
   print(f"Nodes: {len(df)}")
   print(f"Leaves: {pfl.alifestd_count_leaf_nodes(df)}")
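
The node and leaf counts listed above can be checked independently of phyloframe. The sketch below builds a perfect binary tree in pure Python using a heap-style id layout, which is an assumption made for illustration and may differ from the ids ``alifestd_make_balanced_bifurcating`` assigns. ``balanced_edges`` and ``count_leaves`` are hypothetical helpers:

.. code-block:: python

   def balanced_edges(depth):
       """Return (id, ancestor_id) pairs for a perfect binary tree with `depth` levels."""
       n_nodes = 2**depth - 1 if depth > 0 else 0
       # Heap layout: the parent of node i is (i - 1) // 2; the root has no parent.
       return [(i, (i - 1) // 2 if i else None) for i in range(n_nodes)]

   def count_leaves(edges):
       """Count nodes that never appear as another node's ancestor."""
       parents = {ancestor for _, ancestor in edges}
       return sum(node not in parents for node, _ in edges)

   for depth in range(5):
       edges = balanced_edges(depth)
       print(f"depth={depth}: nodes={len(edges)}, leaves={count_leaves(edges)}")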

Comb (Caterpillar) Trees
-------------------------

Create maximally unbalanced trees by specifying the number of leaves:

.. code-block:: python

   df = pfl.alifestd_make_comb(n_leaves=10)
   print(f"Nodes: {len(df)}")
   print(f"Leaves: {pfl.alifestd_count_leaf_nodes(df)}")
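
For intuition about the shape, here is a pure-Python sketch of a strictly bifurcating comb: each step splits the tip of the spine into one leaf and one new spine node. The id numbering and the resulting node count (``2 * n_leaves - 1``) are properties of this sketch and are assumptions, not a statement about what ``alifestd_make_comb`` returns. ``make_comb_edges`` is a hypothetical helper:

.. code-block:: python

   def make_comb_edges(n_leaves):
       """Return (id, ancestor_id) pairs for a strictly bifurcating comb tree."""
       if n_leaves < 1:
           return []
       edges = [(0, None)]  # the root starts out as the only (leaf) node
       spine, next_id = 0, 1
       for _ in range(n_leaves - 1):
           # Split the current spine tip into a leaf and a new spine tip.
           leaf, new_spine = next_id, next_id + 1
           edges += [(leaf, spine), (new_spine, spine)]
           spine, next_id = new_spine, next_id + 2
       return edges

   edges = make_comb_edges(10)
   parents = {ancestor for _, ancestor in edges}
   n_comb_leaves = sum(node not in parents for node, _ in edges)

   print(f"Nodes: {len(edges)}")
   print(f"Leaves: {n_comb_leaves}")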

Parsing Newick Format
=====================

Parse Newick-format strings into phyloframe DataFrames:

.. code-block:: python

   # Simple topology
   df = pfl.alifestd_from_newick("((A,B),(C,D));")

   # With branch lengths
   df = pfl.alifestd_from_newick("((A:1.0,B:2.0):3.0,(C:4.0,D:5.0):6.0);")

   # Parsed columns include:
   #   id, ancestor_id, taxon_label, origin_time_delta, branch_length
   print(df.columns.tolist())
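
To make the mapping from Newick text to rows concrete, the following is a minimal pure-Python sketch, not phyloframe's parser. ``parse_newick`` is a hypothetical helper that handles only unquoted labels with optional ``:length`` suffixes, and it assigns ids in order of first appearance, which may differ from the ids phyloframe produces:

.. code-block:: python

   import re

   def parse_newick(newick):
       """Parse a simple Newick string into one dict per node."""
       rows = []      # nodes in order of first appearance; id == list index
       stack = []     # ids of internal nodes whose ")" has not been seen yet
       closed = None  # id of the node just closed by ")", awaiting label/length

       def new_node():
           ancestor = stack[-1] if stack else None
           rows.append({"id": len(rows), "ancestor_id": ancestor,
                        "taxon_label": "", "branch_length": None})
           return rows[-1]

       for token in re.findall(r"[(),;]|[^(),;]+", newick):
           if not token.strip():
               continue  # ignore whitespace between tokens
           if token == "(":
               stack.append(new_node()["id"])
               closed = None
           elif token == ")":
               closed = stack.pop()
           elif token in ",;":
               closed = None
           else:
               # "label:length" belongs to the node just closed, else a new leaf.
               label, _, length = token.strip().partition(":")
               node = rows[closed] if closed is not None else new_node()
               node["taxon_label"] = label
               node["branch_length"] = float(length) if length else None
               closed = None
       return rows

   rows = parse_newick("((A:1.0,B:2.0):3.0,(C:4.0,D:5.0):6.0);")
   for row in rows:
       print(row)

Each internal node is created when its opening parenthesis is read, so every row's ``ancestor_id`` refers to an earlier row.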

Polars Newick Parsing
---------------------

.. code-block:: python

   df_polars = pfl.alifestd_from_newick_polars("((A,B),(C,D));")

Options
-------

.. code-block:: python

   # Integer branch lengths
   df = pfl.alifestd_from_newick(
       "((A:1,B:2):3,(C:4,D:5):6);",
       branch_length_dtype=int,
   )

   # Include ancestor_list column (slower, for compatibility)
   df = pfl.alifestd_from_newick(
       "((A,B),(C,D));",
       create_ancestor_list=True,
   )

Loading from Files
==================

Because phyloframe works on ordinary DataFrames, phylogenies are loaded with the usual pandas and Polars readers:

.. code-block:: python

   import pandas as pd

   # From CSV
   df = pd.read_csv("phylogeny.csv")

   # From Parquet (recommended for large trees)
   df = pd.read_parquet("phylogeny.pqt")

   # From a URL
   df = pd.read_csv("https://example.com/data/phylogeny.csv")

   # From cloud storage (needs an fsspec backend such as s3fs installed)
   df = pd.read_parquet("s3://bucket/phylogeny.pqt")

Polars offers equivalent readers:

.. code-block:: python

   import polars as pl

   # Polars Parquet (efficient columnar reads)
   df = pl.read_parquet("phylogeny.pqt")

   # Selective column loading with Parquet
   df = pl.read_parquet("phylogeny.pqt", columns=["id", "ancestor_id"])
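
Because CSV carries no type information, it can help to pin the key columns' dtypes when reading. The sketch below uses an in-memory CSV so it runs without touching disk. The file contents and the ``origin_time`` column are made up for illustration:

.. code-block:: python

   import io

   import pandas as pd

   # Stand-in for the contents of a phylogeny CSV file.
   csv_text = (
       "id,ancestor_list,origin_time\n"
       "0,[None],0\n"
       "1,[0],10\n"
       "2,[0],10\n"
   )

   # Pin dtypes explicitly rather than relying on inference.
   df = pd.read_csv(
       io.StringIO(csv_text),
       dtype={"id": "int64", "ancestor_list": str},
   )

   print(df.dtypes)
   print(df)

Parquet files store column types alongside the data, so this step is generally only needed for text formats.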

After loading, convert the DataFrame to working format for analysis:

.. code-block:: python

   df = pfl.alifestd_to_working_format(df)