Protein Folding - Code - Tripstoph's Digital Garden

> [!warning]
> This content has not been peer reviewed.

# Protein folding barrier — calculation

This note documents the **RST protein folding barrier** script used for the [[Protein Folding Barrier Crossing (RST)]] application. The script computes the RST **effective transition sharpness** $n_{\mathrm{eff}}$ and **fidelity at the barrier** $\mu(1, n)$ from folding time and transition-path time, using the same $\mu(\eta,n)$ curve as the rest of the theory.

---

## Purpose

- **Inputs:** Per protein: folding time $\tau_{\mathrm{fold}}$ (ms), transition-path time $\tau_{\mathrm{TP}}$ (µs). Optional: $N_{\mathrm{residues}}$, **sequence** (amino-acid string for sequence-only prediction), temperature (K or °C), **uncertainties** (`tau_fold_err_ms`, `tau_TP_err_us` or `_std_` variants), **curvature** (`curvature_omega_sq`, optionally `D`), **structure** (`contact_order`). Rows may omit kinetics and provide **name + sequence** (or name + N_residues) for **prediction-only** (sequence-to-kinetics). See DATA_SOURCE.md for column names, units, and sequence validation.
- **Outputs:** $n_{\mathrm{eff}}$, $\mu(1, n_{\mathrm{eff}})$, $N_{\mathrm{steps}}$; when input errors are given, **n_eff_err** and **mu_at_barrier_err** (first-order propagation). When ≥3 proteins: empirical $\tau_{\mathrm{TP}}$ from $\tau_{\mathrm{fold}}$ (and optional LOO validation with `--validate`). When curvature/structure columns are present: n from curvature, n from structure, and predicted $\tau_{\mathrm{TP}}$ with residuals.
- **Conditions:** $\tau_{\mathrm{fold}}$ and $\tau_{\mathrm{TP}}$ are measured at a given temperature (and ideally the same solvent). $n_{\mathrm{eff}}$ is the RST barrier-shape parameter **under those conditions**. For comparisons across datasets, use similar T (e.g. 25°C). The script does not model temperature dependence.

---

## Formulas

**Effective n (phenomenological):** Longer $\tau_{\mathrm{TP}}$ ⇒ broader barrier ⇒ lower $n$. Shorter $\tau_{\mathrm{TP}}$ ⇒ sharper barrier ⇒ higher $n$.
$$
n_{\mathrm{eff}} = \frac{\tau_{\mathrm{ref}}}{\tau_{\mathrm{TP}}}, \qquad n_{\mathrm{eff}} \geq 1 \quad \text{(RST axiom)}
$$

with $\tau_{\mathrm{ref}} = 4$ µs (upper end of the reported 0.7–4 µs range).

**Fidelity at the barrier ($\eta = 1$):**

$$
\mu(1, n) = \frac{1}{(1 + 1^n)^{1/n}} = \frac{1}{2^{1/n}}
$$

Implemented as `mu_rst(1.0, n_eff)` from the shared RST engine.

**Number of translation steps (RST-derived):** A4 identifies time with the refresh interval of relations. The effective number of translation steps during barrier crossing is

$$
N_{\mathrm{steps}} = \frac{\tau_{\mathrm{TP}}}{\tau_{\mathrm{substrate}}}
$$

with $\tau_{\mathrm{substrate}} = 1$ µs (nominal substrate scale from the 0.7–4 µs $\tau_{\mathrm{TP}}$ cluster). The script computes and reports $N_{\mathrm{steps}}$ for each protein.

**Uncertainty propagation (first-order):** Optional CSV columns `tau_fold_err_ms`, `tau_TP_err_us` (or `tau_fold_std_ms`, `tau_TP_std_us`) enable propagation to derived quantities. With $\delta\tau_{\mathrm{TP}}$ the given error in $\tau_{\mathrm{TP}}$, $\delta n_{\mathrm{eff}}/n_{\mathrm{eff}} \approx \delta\tau_{\mathrm{TP}}/\tau_{\mathrm{TP}}$; $\delta\mu \approx (\mathrm{d}\mu/\mathrm{d}n)\,\delta n_{\mathrm{eff}}$ with $\mathrm{d}\mu/\mathrm{d}n = \mu \ln 2 / n^2$. When errors are present, **weighted least-squares** (inverse-variance weights) is used for the $\tau_{\mathrm{fold}} \to \tau_{\mathrm{TP}}$ fit, the n(structure) fit, and the joint (n, $\tau_{\mathrm{ref}}$) fit. Assumptions: symmetric, uncorrelated errors.

**n from barrier curvature:** When the optional column `curvature_omega_sq` (e.g. in (rad/s)²) is present for ≥3 rows, the script fits $n_{\mathrm{eff}} = c_0 + c_1\,\omega^{*2}$ and computes $n_{\mathrm{curv}} = \max(1, c_0 + c_1\,\omega^{*2})$ and $\tau_{\mathrm{TP}}^{\mathrm{pred}} = \tau_{\mathrm{ref}}/n_{\mathrm{curv}}$. Residuals vs observed $\tau_{\mathrm{TP}}$ are reported. The fit is calibrated on the same dataset, using the rows where both $\tau_{\mathrm{TP}}$ and curvature are available (Kramers-type scaling).
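A minimal Python sketch of the per-protein formulas above, for orientation only. It is not the script's implementation: `barrier_quantities` and `mu_at_barrier` are hypothetical names (the latter stands in for `mu_rst(1.0, n_eff)`), enforcing $n_{\mathrm{eff}} \geq 1$ by clamping is an assumption, and the inputs in the usage note are made-up values, not measurements.

```python
import math

TAU_REF_US = 4.0        # tau_ref from this note (upper end of the 0.7-4 us range)
TAU_SUBSTRATE_US = 1.0  # nominal substrate scale from this note


def mu_at_barrier(n):
    """mu(1, n) = 1 / 2**(1/n); hypothetical stand-in for mu_rst(1.0, n)."""
    return 2.0 ** (-1.0 / n)


def barrier_quantities(tau_tp_us, tau_tp_err_us=None):
    """Return n_eff, mu(1, n_eff), N_steps and, if an error is given,
    their first-order uncertainties."""
    # Assumption: the RST constraint n_eff >= 1 is enforced by clamping.
    n_eff = max(1.0, TAU_REF_US / tau_tp_us)
    mu = mu_at_barrier(n_eff)
    out = {
        "n_eff": n_eff,
        "mu_at_barrier": mu,
        "N_steps": tau_tp_us / TAU_SUBSTRATE_US,
    }
    if tau_tp_err_us is not None:
        # delta n / n ~= delta tau_TP / tau_TP
        n_err = n_eff * tau_tp_err_us / tau_tp_us
        # delta mu ~= (d mu / d n) * delta n, with d mu / d n = mu ln2 / n^2
        out["n_eff_err"] = n_err
        out["mu_at_barrier_err"] = mu * math.log(2.0) / n_eff**2 * n_err
    return out
```

For example, `barrier_quantities(2.0, 0.2)` gives $n_{\mathrm{eff}} = 2$, $\mu \approx 0.707$, and $N_{\mathrm{steps}} = 2$, with $\delta n_{\mathrm{eff}} = 0.2$.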
**n from structure (RST-constrained):** When ≥3 rows have structure descriptors, the script fits n(structure) with **RST constraints** (n ≥ 1; optional weak prior toward $d_B \approx 1.22$, strength set by `--prior-strength`).

**Descriptor preference:** When both $N_{\mathrm{residues}}$ and **contact_order** are available (each with ≥3 rows), **contact_order is preferred** because it better reflects topology and is well documented in folding literature (e.g. Plaxco et al.). Otherwise: $N_{\mathrm{residues}}$ (log-log fit), then `sequence_descriptor`. Formula: $\log_{10}(n_{\mathrm{eff}}) = c_0 + c_1\,\log_{10}(N_{\mathrm{res}})$ for N_residues; $n_{\mathrm{eff}} = c_0 + c_1\,\mathrm{contact\_order}$ for contact_order. Reports **n_pred_from_structure**, **tau_TP_pred_from_structure**, and residual.

**τ_ref:** By default the script uses **joint-fit** (when scipy is available): $(c_0, c_1, \tau_{\mathrm{ref}})$ are fitted together to minimize prediction error; use `--two-step` to restore the previous two-step fit (n from structure, then calibrate τ_ref). Calibrated τ_ref is **clamped to [0.7, 4] µs** (experimental range); use `--fixed-tau-ref` to skip calibration and use fixed 4 µs for the structure path.

**Sequence-based descriptor:** When `sequence_descriptor` is not in the CSV but **sequence** is present, the script computes a **contact-order proxy** from sequence (length + composition: helix vs beta propensity); fallback remains $\log_{10}(N)$.

---

## RST predictions (what can be calculated)

- **Given n, RST predicts $\tau_{\mathrm{TP}}$:** $\tau_{\mathrm{TP}}^{\mathrm{pred}} = \tau_{\mathrm{ref}}/n$ (µs). Use `--predict-from-n N`.
- **Given $\tau_{\mathrm{TP}}$, RST calculates $N_{\mathrm{steps}}$:** $N_{\mathrm{steps}} = \tau_{\mathrm{TP}}/\tau_{\mathrm{substrate}}$ (table and CSV).
- **Published coefficients (tau_TP from tau_fold):** Fixed $(a, b)$ from the built-in 8-protein set: $\log_{10}(\tau_{\mathrm{TP}}) = a + b\,\log_{10}(\tau_{\mathrm{fold}})$.
  Use `--predict-from-tau-fold TAU_FOLD_MS` for a single prediction; use `--use-published-coefficients` to apply the same $(a,b)$ to a custom CSV instead of refitting.
- **LOO validation:** With `--validate`, leave-one-out prediction of $\tau_{\mathrm{TP}}$ from $\tau_{\mathrm{fold}}$; out-of-sample MAE and correlation are printed and `protein_folding_loo_results.csv` is written. Requires ≥4 proteins.
- **Empirical scaling (in-sample):** Without `--use-published-coefficients`, the script fits the log-log relation on the loaded data and reports predicted $\tau_{\mathrm{TP}}$ and residual per protein.
- **Structure → n → $\tau_{\mathrm{TP}}$ (RST-constrained n(structure)):** When $N_{\mathrm{residues}}$ (or contact_order) is present, fit $n_{\mathrm{eff}}$(descriptors) with RST constraints (n ≥ 1, optional weak prior toward $d_B \approx 1.22$); then $\tau_{\mathrm{TP}}^{\mathrm{pred}} = \tau_{\mathrm{ref}}/n_{\mathrm{pred}}$. Use `--predict-from-structure N_RESIDUES` for a single prediction from published n-from-structure (built-in set). **Published $\tau_{\mathrm{ref}}$:** The script stores `PUBLISHED_TAU_REF_STRUCTURE_US` (from recursive render on the built-in 8-protein reference set) and uses it for all sequence-only and `--predict-from-structure` predictions so the same reference set defines both n(structure) and $\tau_{\mathrm{ref}}$.
- **Recursive render (self-correction loop):** When the script has **independent n** (from structure: `n_pred_from_structure`, or from curvature: `n_curv`) and observed $\tau_{\mathrm{TP}}$ for ≥2 rows, it runs a **fixed-point calibration**: $\tau_{\mathrm{ref}}$ is updated to the least-squares value that minimizes $\sum (\tau_{\mathrm{TP}}^{\mathrm{obs}} - \tau_{\mathrm{ref}}/n_i)^2$ over those rows, then predictions and residuals are recomputed. This mirrors the Sovereign Chain’s recursive render (output feeds back to refine a global parameter).
  The loop is **calibrating** $\tau_{\mathrm{ref}}$ from the same dataset that provides independent n; it is not a first-principles derivation. Console reports “Recursive render (structure): tau_ref calibrated to X.XX us (N proteins); MAE = …; corr = …” (and similarly for curvature when that path has data).

**Calibrated vs universal/theoretical tau_ref:** Default tau_ref is fixed (4 µs) or **calibrated** from the same dataset. A **universal** tau_ref would hold across proteins; use `--test-universality` to fit on a train subset and predict on a held-out test subset (requires ≥6 proteins with N_residues). A **theoretical** tau_ref would be derived from substrate timescales (e.g. curvature/diffusion); this is not implemented.

---

## Script and run

- **Script:** `rst_protein_folding_barrier.py` in this folder.
- **Run from repo root:**
  - Default (built-in 8-protein data): `python "expanded theory applied/further applications/Protein Folding/rst_protein_folding_barrier.py"`
  - With your own CSV: `python "expanded theory applied/further applications/Protein Folding/rst_protein_folding_barrier.py" --csv path/to/data.csv`
  - Predict $\tau_{\mathrm{TP}}$ from $n$: `--predict-from-n 2.5`
  - Predict $\tau_{\mathrm{TP}}$ from $\tau_{\mathrm{fold}}$ (published coeffs): `--predict-from-tau-fold 10`
  - Predict $\tau_{\mathrm{TP}}$ from $N_{\mathrm{residues}}$ (published structure): `--predict-from-structure 60`
  - LOO validation: `--validate`
  - Sequence-only validation (LOO from descriptor): `--validate-sequence-only` (with a CSV that has kinetics)
  - Predict from FASTA (sequence only): `--fasta path/to/sequences.fasta`
  - Universality test (train/test portability of $\tau_{\mathrm{ref}}$): `--test-universality`
- **CLI:**
  - `--no-csv-out`
  - `--use-published-coefficients`
  - `--fasta PATH`
  - `--validate`
  - `--validate-sequence-only`
  - `--test-universality`
  - `--fixed-tau-ref`: use fixed τ_ref = 4 µs for the structure path, with no recursive render
  - `--joint-fit`: force joint fit (**joint-fit is the default** when scipy is available)
  - `--two-step`: use the two-step structure fit instead of joint-fit
  - `--prior-strength F`: weak prior toward $d_B \approx 1.22$; default 0.1, use 0 for no prior
  - `--k-fold K`: K-fold structure validation; reports mean±std MAE and correlation; requires ≥6 proteins
  - `--universality-split random|stratified`: for `--test-universality`; stratified ensures train and test span low/med/high $N_{\mathrm{residues}}$

  Validation outputs report **mean residual (pred − obs)** in addition to MAE to expose scale bias.
- **CSV columns (input):** `name`, `tau_fold_ms`, `tau_TP_us`; optional: `sequence` (amino-acid string), `N_residues`, `temperature_K` or `T_C`, `tau_fold_err_ms`, `tau_TP_err_us` (or `_std_` variants), `curvature_omega_sq`, `D`, `contact_order`, `sequence_descriptor`. Rows without `tau_fold_ms`/`tau_TP_us` are prediction-only (must have `name` and `sequence` or `N_residues`). **Output:** `n_eff`, `mu_at_barrier`, `N_steps`; when errors given: `n_eff_err`, `mu_at_barrier_err`; when fit/curvature/structure used: corresponding pred and residual columns. For prediction-only rows: `protein_folding_sequence_predictions.csv` with `name`, `N_residues`, `n_pred`, `tau_TP_pred_us`, `kinetics_summary`.
- The script is discovered and run by `run_all_further_scripts.py` with no arguments.

---

## Sequence-to-kinetics pipeline

When only **sequence** (or FASTA) is available, the pipeline predicts RST kinetics without experimental $\tau_{\mathrm{fold}}$ or $\tau_{\mathrm{TP}}$.

- **Input:** CSV with columns `name` and `sequence` (amino-acid single-letter codes), or `--fasta path/to/file.fasta`. Rows without `tau_fold_ms`/`tau_TP_us` are treated as prediction-only (no fitting on these rows).
- **Descriptor:** From sequence: $N_{\mathrm{residues}} = \mathrm{len}(\mathrm{sequence})$; optional **sequence_descriptor** = $\log_{10}(N)$ (heuristic proxy for contact-order trend; Plaxco et al., J Mol Biol 1998).
  Sequence is validated: allowed characters are the 20 standard amino acids + B, Z, X; length in [10, 500] (see DATA_SOURCE.md).
- **Prediction:** Uses **published** coefficients (`PUBLISHED_N_FROM_STRUCTURE`) and **published** $\tau_{\mathrm{ref}}$ (`PUBLISHED_TAU_REF_STRUCTURE_US`) from the built-in reference set. $n^{\mathrm{pred}}$ from descriptor, $\tau_{\mathrm{TP}}^{\mathrm{pred}} = \tau_{\mathrm{ref}}/n^{\mathrm{pred}}$; RST constraint $n \geq 1$ applied.
- **Output:** For each prediction-only row: `name`, `N_residues`, `n_pred`, `tau_TP_pred_us`, **kinetics_summary** (tertile-based label: "fast barrier crossing", "moderate", "slow barrier crossing" from reference-set $\tau_{\mathrm{TP}}$ tertiles). Written to console and optionally `protein_folding_sequence_predictions.csv`.
- **Validation:** `--validate-sequence-only` runs **LOO sequence-only validation** on the loaded reference set (proteins with known $\tau_{\mathrm{TP}}$): for each protein, fit n(structure) and $\tau_{\mathrm{ref}}$ on the others, predict $\tau_{\mathrm{TP}}$ from descriptor only, compare to observed. Reports LOO MAE, correlation, and **applicability domain** (N_residues range of reference set; extrapolation outside is not validated).
- **Limits:** Pipeline is for **folding kinetics only** (predicted $\tau_{\mathrm{TP}}$, $n$). It does not predict 3D structure. The **kinetics_summary** label is for barrier-crossing speed only; it is **not** an aggregation-risk or clinical score unless validated on relevant data.

---

## Verification on real data

The script has been run on the **published** dataset of Feng et al. (2026). The data are exported to `data/feng_etal_2026_protein_folding_data.csv` (see `data/DATA_SOURCE.md` for source and conversion). Results and interpretation: **[[Protein Folding Results]]**.
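Returning to the sequence-to-kinetics pipeline, its input checks and length-based descriptor can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's code: `validate_sequence` and `length_descriptors` are hypothetical names, and stripping whitespace and upper-casing the input are assumptions. DATA_SOURCE.md remains the reference for the actual rules.

```python
import math

# 20 standard amino acids plus the ambiguity codes B, Z, X (as stated in this note)
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | set("BZX")
MIN_LEN, MAX_LEN = 10, 500  # allowed length range from this note


def validate_sequence(seq):
    """Return a cleaned sequence or raise ValueError if it is not acceptable."""
    # Assumption: whitespace is ignored and lower-case letters are accepted.
    cleaned = "".join(seq.split()).upper()
    if not MIN_LEN <= len(cleaned) <= MAX_LEN:
        raise ValueError(f"length {len(cleaned)} outside [{MIN_LEN}, {MAX_LEN}]")
    bad = sorted(set(cleaned) - ALLOWED)
    if bad:
        raise ValueError(f"unsupported characters: {''.join(bad)}")
    return cleaned


def length_descriptors(seq):
    """Return (N_residues, log10(N)) for a validated sequence."""
    n_residues = len(validate_sequence(seq))
    return n_residues, math.log10(n_residues)
```

A 10-residue input such as `"ACDEFGHIKL"` passes validation and yields `(10, 1.0)`. Turning the descriptor into $n^{\mathrm{pred}}$ still requires the published coefficients, which are not reproduced here.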
---

## Output files

| File | Description |
|:---|:---|
| **protein_folding_rst_results.csv** | Main results: n_eff, mu_at_barrier, N_steps; optional err, tau_TP_pred_from_fold, curvature/structure pred and residuals. |
| **protein_folding_loo_results.csv** | Written with `--validate`: name, tau_fold_ms, tau_TP_us, tau_TP_pred_loo, residual_us (out-of-sample). |
| **protein_folding_sequence_predictions.csv** | Written for prediction-only rows (CSV with sequence or `--fasta`): name, N_residues, n_pred, tau_TP_pred_us, kinetics_summary. |
| **protein_folding_sequence_only_validation.csv** | Written with `--validate-sequence-only`: name, N_residues, tau_TP_obs, tau_TP_pred, residual_us (LOO sequence-only). |
| **protein_folding_fidelity_curves.png** | $\mu(\eta, n)$ vs $\eta$ for $n \in \{1, 1.25, 2, 4\}$ (barrier sharpness interpretation). |
| **protein_folding_verification.png** | Left: $\tau_{\mathrm{fold}}$ vs $\tau_{\mathrm{TP}}$ (inverse relation). Right: $\tau_{\mathrm{TP}}$ vs $n_{\mathrm{eff}}$. |
| **protein_folding_verification_pred_vs_obs.png** | Verification: observed vs predicted $\tau_{\mathrm{TP}}$ (structure-based). Predictions rescaled to mean(observed) so points show ordering agreement (correlation); 1:1 line; protein labels. Used in [Protein Folding — For Researchers](Protein%20Folding%20—%20For%20Researchers.md). |

---

## Reproducibility

All reported results and figures can be reproduced with one script. From the **Protein Folding** folder (or from repo root; paths are resolved from the script location):

```bash
python reproduce_results.py
```

This runs (1) the main script on the Feng et al. CSV with `--validate`, `--validate-sequence-only`, and `--test-universality`; (2) the sequence-only path with `--fasta data/test_sequences.fasta`; (3) the default (built-in) run. Use `--quick` to skip the FASTA and default runs. Outputs are written to the Protein Folding folder.
**Expected output files:** `protein_folding_rst_results.csv`, `protein_folding_loo_results.csv`, `protein_folding_sequence_only_validation.csv`, `protein_folding_fidelity_curves.png`, `protein_folding_verification.png`; with the full run, also `protein_folding_sequence_predictions.csv`. Requires Python 3.7+ and the main script's dependencies (NumPy, matplotlib, rst_engine); there are no OS-specific requirements.

---

## References

- Feng et al., “Cooperative native contact formation facilitates free energy barrier crossing in protein folding,” *Phys. Rev. Lett.* **136**, 108401 (2026).
- D. R. Jacobson, “Resolving Barrier Crossing in Protein Folding,” *Physics* **19**, 30 (2026).

---

## Links

- **Application:** [[Protein Folding Barrier Crossing (RST)]]
- **For researchers (standalone one-pager):** [Protein Folding — For Researchers](Protein%20Folding%20—%20For%20Researchers.md): kinetics-only summary, verification plot, no RST prerequisite.
- **Core fidelity:** [[expanded theory/Fidelity]], [[expanded theory/Transition Sharpness]]
- **Foundation:** [[expanded theory applied/foundation/Statistical mechanics and free energy/Statistical mechanics and free energy (RST)]]