Protein Folding Results - Tripstoph's Digital Garden

> [!warning]
> This content has not been peer reviewed.

# Protein Folding Barrier Crossing — Results (Feng et al. 2026 data)

RST script run on the **published** dataset of Feng et al., *Phys. Rev. Lett.* **136**, 108401 (2026): eight two-state proteins with folding times and transition-path times measured in aqueous solution (Zenodo 18354860; GitHub hoisunglab/FRET_TransitionPath).

## Dataset

- **Source:** Fig. 5 and Fig. 2(b) of the paper, exported to `data/feng_etal_2026_protein_folding_data.csv`. The folding time τ_fold is derived from the reported relaxation rate $k$ at the denaturation midpoint as $t_F = 2/k$, where $t_F$ denotes τ_fold (see `data/DATA_SOURCE.md`). For a two-state protein the relaxation rate is the sum of the folding and unfolding rates, which are equal at the midpoint, hence the factor of 2. The CSV includes a **contact_order** column (approximate values consistent with the folding-rate trend; see `data/DATA_SOURCE.md` for PDB-derived options).
- **Proteins:** Villin, WW domain, Protein A, λ repressor, gpW, ADA2h, Protein G, CspTm. τ_fold ranges from ~0.08 ms to ~3200 ms; τ_TP from 0.7 to 3.8 µs.

## RST run

Command: `rst_protein_folding_barrier.py --csv data/feng_etal_2026_protein_folding_data.csv` (with optional `--validate`, `--validate-sequence-only`, `--test-universality`, `--k-fold 3`).

With **contact_order** in the CSV and **joint-fit** (the default when scipy is available), the structure path uses n(contact_order) and τ_ref fitted together.

**LOO validation:** `--validate` yields out-of-sample $\tau_{\mathrm{TP}}$ from $\tau_{\mathrm{fold}}$ and writes `protein_folding_loo_results.csv`; the **mean residual (pred − obs)** is printed to expose scale bias.

Optional: `--use-published-coefficients`; columns `tau_fold_err_ms` and `tau_TP_err_us` for weighted fitting; columns `curvature_omega_sq` and `contact_order` for n-from-curvature and n-from-structure.

Outputs: `protein_folding_rst_results.csv`, `protein_folding_fidelity_curves.png`, `protein_folding_verification.png`; with `--validate`, also `protein_folding_loo_results.csv`; with `--validate-sequence-only`, also `protein_folding_sequence_only_validation.csv`.

## Verification: can RST do this?

1. **Inverse relation (paper and RST):** The paper reports that fast-folding proteins have *longer* τ_TP (broad barrier) and slow-folding proteins have *shorter* τ_TP (sharp barrier). RST encodes barrier shape as an effective $n$: long τ_TP ⇒ low $n$, short τ_TP ⇒ high $n$. On the real data, the **correlation of log₁₀(τ_fold) vs τ_TP is negative** (−0.53), consistent with this inverse relation. RST's phenomenological mapping (τ_TP ↔ barrier shape ↔ $n$) is therefore **consistent with the published trend**.
2. **n_eff from real data:** All eight proteins give **n_eff ≥ 1** (RST axiom). The fastest folders (Villin, WW domain) have the smallest n_eff (~1.05–1.29) and the longest τ_TP (3.1–3.8 µs); the slowest folders (Protein G, CspTm) have n_eff = 5 and the shortest τ_TP (0.8 µs). The **same** formula $\mu(\eta, n)$ and the same scaling n_eff = τ_ref/τ_TP thus produce a coherent ordering of barrier "sharpness" across the real dataset.
3. **Conclusion:** RST **can** parameterize barrier shape from experimental (τ_fold, τ_TP) data: one formula, one parameter $n$ per protein, and a negative correlation between folding time and transition-path time, as in the paper. With **contact_order** and **joint-fit**, in-sample structure-based prediction reaches **MAE ≈ 0.66 µs**, **corr ≈ 0.86**; the universality test (train/test) gives **test corr ≈ 0.97**. **Predictive use** (τ_TP from sequence or structure) is **exploratory**; only one dataset (N=8) was used. RST organizes the observed (τ_fold, τ_TP) pairs into a single fidelity-based framework and respects the reported inverse relation.
4. **LOO validation:** Running with `--validate` on this 8-protein set gives **out-of-sample** $\tau_{\mathrm{TP}}$ predictions from $\tau_{\mathrm{fold}}$; LOO MAE, correlation(pred, obs), and the **mean residual (pred − obs)** are reported (the mean residual exposes scale bias).

   **In-sample vs out-of-sample:** Structure-based MAE and correlation (e.g. "n from structure", "Recursive render (structure)", "Joint fit (structure)") are **in-sample**.
   LOO (τ_fold → τ_TP), sequence-only LOO, the universality test (train/test), and **k-fold** structure validation are **out-of-sample**. See **[[Protein Folding - Code]]** for CLI and formulas.
5. **Further workflow (solve kinetics):** The script now supports **RST-constrained n(structure)** (prior n ≥ 1, optional weak prior toward d_B), a **universality test** (`--test-universality`: train/test portability of τ_ref), an optional **sequence_descriptor** column for sequence → n → τ_TP when calibrated, and the curvature path when data exist. Next steps are broader validation (more proteins, other conditions) and a short outreach note summarising: one formula, one parameter, recursive render, LOO, and RST-constrained n (kinetics only; structure prediction is separate).
6. **Sequence-only / structure LOO validation:** With `--validate-sequence-only` (and a CSV that has kinetics and a structure descriptor such as contact_order or N_residues), the script runs **LOO structure validation**: for each protein, n(structure) and τ_ref are fitted on the others, then τ_TP is predicted from the descriptor only. On the Feng set (with contact_order): **LOO MAE ≈ 1.21 µs**, **corr ≈ 0.85**; **mean residual ≈ +1.21 µs** (systematic overprediction, since the residual is pred − obs). Reported: LOO MAE, correlation(pred, obs), **mean residual (pred − obs)**, and **applicability domain** (N_residues range). Output: `protein_folding_sequence_only_validation.csv`. See [[Protein Folding - Code]] for the full sequence-to-kinetics pipeline (CSV with `sequence` or `--fasta`, published coefficients, kinetics_summary label).
7. **K-fold and universality:** With `--k-fold 3`, structure-based K-fold validation reports **MAE ≈ 1.39 ± 0.55 µs**, **corr ≈ 0.98 ± 0.04** (mean ± std across folds). With `--test-universality`, train/test gives **test MAE ≈ 0.66 µs**, **test corr ≈ 0.97**; mean residual (pred − obs) ≈ −0.66 µs. Reproduce with `python reproduce_results.py` (includes `--k-fold 3`).
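
The scaling n_eff = τ_ref/τ_TP and the three LOO metrics (MAE, correlation, mean residual) can be illustrated with a minimal sketch. This is not the code of `rst_protein_folding_barrier.py`: the function names `n_eff` and `loo_metrics` are hypothetical, a plain straight-line fit stands in for the RST mapping (which this page does not spell out), and the arrays in the LOO example are synthetic placeholders rather than the Feng et al. measurements. The value τ_ref = 4.0 µs is an assumption inferred from the (τ_TP, n_eff) pairs quoted above; the page does not state it.

```python
import numpy as np


def n_eff(tau_ref_us, tau_tp_us):
    """Effective barrier-shape parameter, n_eff = tau_ref / tau_TP."""
    return tau_ref_us / np.asarray(tau_tp_us, dtype=float)


def loo_metrics(x, y):
    """Leave-one-out validation of a straight-line fit y = a*x + b.

    Each point is held out in turn, the line is fitted on the rest,
    and the held-out y is predicted. Returns
    (MAE, correlation(pred, obs), mean residual pred - obs).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pred = np.empty_like(y)
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # mask out the held-out point
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        pred[i] = slope * x[i] + intercept
    resid = pred - y                           # sign convention: pred - obs
    mae = float(np.mean(np.abs(resid)))
    corr = float(np.corrcoef(pred, y)[0, 1])
    return mae, corr, float(np.mean(resid))


# tau_TP values quoted on this page (in microseconds).
# tau_ref = 4.0 us is inferred from the quoted n_eff values, not stated.
print(np.round(n_eff(4.0, [3.8, 3.1, 0.8]), 2))

# Synthetic placeholder data (NOT the Feng et al. measurements).
log_tau_fold = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0, 3.5])
tau_tp = np.array([3.6, 3.4, 2.9, 2.6, 2.1, 1.6, 1.0, 0.9])
mae, corr, bias = loo_metrics(log_tau_fold, tau_tp)
print(f"LOO MAE={mae:.2f} us, corr={corr:.2f}, mean residual={bias:+.2f} us")
```

With the pred − obs convention used here, a positive mean residual means the predictions sit above the observations on average, and a mean residual equal in magnitude to the MAE means every prediction errs in the same direction.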

## Limitations and validation status

Validation is based on a **single dataset** (Feng et al. 2026, N=8). There is **no independent external validation** (no second lab or publication with τ_TP and N/structure). Predictive use (τ_TP from sequence or structure) should be considered **exploratory** until validated on additional data. See [[RST Protein Folding — Methods and Validation]] for the full limitations and validation status section.

## External / independent validation

- **Reference set (self-consistency):** The pipeline was run on the built-in reference set (`data/builtin_reference_set.csv`, 8 proteins, Protein_A–H), which is the same set used to define the published n(structure) coefficients and τ_ref. This is **not** an independent lab dataset; it tests self-consistency. Metrics (run with `--validate --validate-sequence-only --test-universality`): correlation log₁₀(τ_fold) vs τ_TP = −0.99; recursive-render τ_ref ≈ 0.47 µs; structure MAE ≈ 1.78 µs, corr ≈ 0.99; LOO (τ_fold → τ_TP) MAE ≈ 0.08 µs, corr ≈ 0.99; sequence-only LOO MAE ≈ 1.78 µs, corr ≈ 0.97; universality test MAE ≈ 1.31 µs, corr ≈ 1.0. Conclusion: RST ordering and calibration are self-consistent on this set. For validation on a **second independent dataset** (different lab or publication), see [[RST Protein Folding — Methods and Validation]] and `data/DATA_SOURCE.md`.
- **Feng et al. (current):** See the reported metrics table in [[RST Protein Folding — Methods and Validation]] for contact_order + joint-fit in-sample and out-of-sample (universality, k-fold, sequence-only LOO) numbers.

## Figures

- **protein_folding_fidelity_curves.png:** $\mu(\eta, n)$ for $n$ = 1, 1.25, 2, 4 (barrier sharpness interpretation).
- **protein_folding_verification.png:** τ_fold vs τ_TP (inverse relation) and τ_TP vs n_eff for the eight proteins.

## Links

- **Application:** [[Protein Folding Barrier Crossing (RST)]]
- **Code:** [[Protein Folding - Code]]
- **For researchers (standalone one-pager):** [Protein Folding — For Researchers](Protein%20Folding%20—%20For%20Researchers.md) (send to folding researchers; no RST background needed; includes verification graph)
- **Methods and validation summary (citation-ready):** [[RST Protein Folding — Methods and Validation]]
- **Data source:** `data/DATA_SOURCE.md`, `data/feng_etal_2026_protein_folding_data.csv`, `data/builtin_reference_set.csv`