# Feng et al. (2026) dataset for RST verification
**Uncertainties:** Experimental errors for τ_TP and τ_fold were not extracted from the paper or Zenodo for the CSV used in this pipeline, so no confidence intervals or error bars are reported for n_eff, μ, or prediction MAE. The script accepts the optional columns **tau_TP_err_us** and **tau_fold_err_ms** (or **tau_TP_std_us** and **tau_fold_std_ms**). To add errors, obtain symmetric ± uncertainties from the paper, its supplementary material, or the raw-data sources (Zenodo 18354860, GitHub hoisunglab/FRET_TransitionPath) and add them as columns to the CSV. When these columns are present, the script uses them for uncertainty propagation and for **weighted least-squares** fitting with inverse-variance weights.
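
The inverse-variance weighting mentioned above can be illustrated with a minimal sketch of a weighted straight-line fit. This is not the pipeline's code: the function name `weighted_line_fit` and the choice of a two-parameter line are assumptions made for illustration, and which quantities the script actually regresses is not specified here.

```python
from typing import Sequence, Tuple


def weighted_line_fit(
    x: Sequence[float], y: Sequence[float], sigma: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Fit y = a + b*x with inverse-variance weights w_i = 1 / sigma_i**2.

    Returns (a, b, a_err, b_err). The errors are the closed-form 1-sigma
    uncertainties, valid when sigma_i are the true Gaussian errors on y_i.
    Hypothetical helper, not part of the pipeline script.
    """
    if not (len(x) == len(y) == len(sigma)) or len(x) < 2:
        raise ValueError("need at least two points with matching lengths")
    if any(s <= 0 for s in sigma):
        raise ValueError("all uncertainties must be positive")

    w = [1.0 / s**2 for s in sigma]  # inverse-variance weights
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

    delta = S * Sxx - Sx**2  # determinant of the weighted normal equations
    if delta == 0:
        raise ValueError("x values are degenerate")

    a = (Sxx * Sy - Sx * Sxy) / delta  # intercept
    b = (S * Sxy - Sx * Sy) / delta    # slope
    return a, b, (Sxx / delta) ** 0.5, (S / delta) ** 0.5
```

Points with smaller `sigma` pull the fit more strongly; with all `sigma` equal, the result reduces to ordinary least squares.
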
## Source
- **Paper:** C.-J. Feng, U. Baxa, J. M. Louis, H. S. Chung, "Cooperative native contact formation facilitates free energy barrier crossing in protein folding," *Phys. Rev. Lett.* **136**, 108401 (2026).
- **Data:** Values in `feng_etal_2026_protein_folding_data.csv` are taken from the **published** Fig. 5 and Fig. 2(b) (main text and appendix).
## Column definitions
- **name:** Protein identifier (matches paper and foldingTP folder names).
- **tau_fold_ms:** Folding time (average dwell time in unfolded state before folding), in ms.
- **tau_TP_us:** Transition-path time (duration on the barrier), in µs, as reported in the paper.
- **sequence:** (Optional.) Amino-acid sequence (single-letter codes). When present and the kinetics columns (`tau_fold_ms`, `tau_TP_us`) are omitted, the row is **prediction-only**: the script computes descriptors from the sequence and outputs RST-predicted \(\tau_{\mathrm{TP}}\) and \(n\) using published coefficients. **Validation:** Allowed characters are the 20 standard amino acids (ACDEFGHIKLMNPQRSTVWY) plus the ambiguity codes B, Z, and X. Length must be between **10** and **500** residues; rows whose sequence contains other characters or falls outside this range are skipped. When `sequence` is present and `N_residues` is not provided, `N_residues` is set to `len(sequence)`.
- **N_residues:** Number of amino acid residues (from paper, or from sequence length when `sequence` is provided).
- **temperature_K:** Assumed 298 K (room temperature). The paper reports measurements near the urea midpoint and does not state T explicitly.
- **source_note:** Short reference to the figure/quantity used.
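
The validation rule for the `sequence` column can be sketched as follows. The names `is_valid_sequence` and `prepare_row` are hypothetical, and this sketch is strict about case (lowercase letters are rejected); whether the script normalises case or whitespace is not documented here, so treat that as an assumption.

```python
from typing import Optional

# 20 standard amino acids plus the ambiguity codes B, Z, X.
ALLOWED_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY" + "BZX")
MIN_LEN, MAX_LEN = 10, 500


def is_valid_sequence(seq: str) -> bool:
    """True if seq has an allowed length and only allowed characters."""
    return MIN_LEN <= len(seq) <= MAX_LEN and set(seq) <= ALLOWED_RESIDUES


def prepare_row(row: dict) -> Optional[dict]:
    """Return a copy of row with N_residues filled in, or None to skip it.

    Hypothetical helper: mirrors the documented behaviour that invalid
    sequences are skipped and N_residues defaults to len(sequence).
    """
    seq = row.get("sequence", "")
    if not is_valid_sequence(seq):
        return None
    out = dict(row)
    if not out.get("N_residues"):
        out["N_residues"] = len(seq)
    return out
```

A row such as `{"name": "demo", "sequence": "ACDEFGHIKLMN"}` would come back with `N_residues` set to 12, while a 9-residue sequence or one containing `J` would return `None`.
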
**Optional columns (for RST workflow):**
- **tau_fold_err_ms**, **tau_TP_err_us:** Symmetric ± error (or use **tau_fold_std_ms**, **tau_TP_std_us**). When present, the script propagates to `n_eff_err` and `mu_at_barrier_err` (first-order).
- **curvature_omega_sq:** Barrier curvature \((\omega^*)^2\) in consistent units (e.g. (rad/s)²). When ≥3 rows have both `tau_TP_us` and `curvature_omega_sq`, the script fits n from curvature and reports `n_curv`, `tau_TP_pred_curv`, and the residual. This column is not in the Feng et al. main table; it is intended for datasets that report curvature (e.g. from MD).
- **D:** Diffusivity (optional; script can use curvature alone with a fitted scaling).
- **contact_order:** Mean contact order or similar structure descriptor (a fraction, typically 0.05–0.25 for single-domain proteins). When present with ≥3 rows, it enables the n-from-structure fit and `tau_TP_pred_from_structure`. **Preferred over N_residues** when both are available: contact_order better reflects topology and is well documented in the folding literature (e.g. Plaxco et al., J Mol Biol 1998). **How to add contact_order:** (1) From PDB: compute the relative contact order (1/(L·N_c))×Σ Δs_ij over the N_c native contacts, where Δs_ij = |i−j| is the sequence separation of contacting residues i and j and L is the chain length; dividing by both N_c and L is what yields a fraction in the range quoted above. Tools such as PyMOL scripts or the ContactOrders package (e.g. GitHub taushifkhan/ContactOrders) can compute this. (2) From literature: many papers report contact order for well-studied proteins. (3) The Feng CSV includes approximate contact_order values consistent with the folding-rate trend (fast folders have lower CO); these can be replaced with PDB-derived values when available.
- **sequence_descriptor:** A scalar descriptor derived from **sequence only**. When provided in the CSV it is used as-is. When **not** provided and a `sequence` column is present, the script computes a **contact-order proxy** from sequence: base \(\log_{10}(N)\) plus a small composition correction (beta-forming vs helix-forming residue fractions; Plaxco et al. trend). Use this when `contact_order` is missing; descriptor preference remains contact_order > N_residues > sequence_descriptor. When present with ≥3 rows (and no N_residues/contact_order, or as fallback), the script fits n(sequence_descriptor) with RST constraints and enables **sequence → n → tau_TP** prediction for new proteins without known structure.
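
As a rough illustration of two points in this list, the sketch below computes a relative contact order from a precomputed contact list, using the Plaxco et al. definition that divides the summed sequence separation by both the number of contacts and the chain length, and then selects a descriptor in the stated preference order. Both function names are hypothetical, how contacts are identified (e.g. a distance cutoff) is left to the caller, and the script's own selection logic may differ in detail.

```python
from typing import Iterable, Tuple


def relative_contact_order(
    contacts: Iterable[Tuple[int, int]], chain_length: int
) -> float:
    """Mean sequence separation of the contacts divided by chain length."""
    separations = [abs(i - j) for i, j in contacts]
    if not separations or chain_length <= 0:
        raise ValueError("need at least one contact and a positive chain length")
    return sum(separations) / (len(separations) * chain_length)


def pick_descriptor(row: dict) -> Tuple[str, float]:
    """Pick the first available descriptor in the documented preference
    order: contact_order > N_residues > sequence_descriptor."""
    for key in ("contact_order", "N_residues", "sequence_descriptor"):
        value = row.get(key)
        if value not in (None, ""):
            return key, float(value)
    raise ValueError("row has no usable descriptor column")
```

For example, three contacts with separations 4, 8, and 1 in a 20-residue chain give 13 / (3 × 20) ≈ 0.217.
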
## Conversion of folding time from paper
The paper reports the **relaxation rate** \(k\) (in ms⁻¹ for fast folders, s⁻¹ for slow folders). At the **denaturation midpoint**, folding and unfolding rates are approximately equal: \(k_F \approx k_U\). The relaxation rate is \(k = k_F + k_U \approx 2 k_F\), so \(k_F = k/2\) and the **folding time** (mean dwell in unfolded state) is
\[
t_F = \frac{1}{k_F} = \frac{2}{k}.
\]
- When \(k\) is in **ms⁻¹**, \(t_F\) is in **ms** → use directly as `tau_fold_ms`.
- When \(k\) is in **s⁻¹** (Protein G, CspTm), \(t_F\) is in **s** → convert to ms: `tau_fold_ms = 2000/k`.
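
The conversion above is small enough to check by hand, but a short helper makes the unit handling explicit. The function name and the `"per_ms"` / `"per_s"` unit labels are illustrative assumptions, not part of the script.

```python
def tau_fold_ms(k: float, unit: str) -> float:
    """Folding time in ms from the midpoint relaxation rate k (t_F = 2/k).

    unit is "per_ms" for k in ms^-1 or "per_s" for k in s^-1.
    """
    if k <= 0:
        raise ValueError("relaxation rate must be positive")
    if unit == "per_ms":
        return 2.0 / k       # result already in ms
    if unit == "per_s":
        return 2000.0 / k    # 2/k seconds, converted to ms
    raise ValueError(f"unknown unit: {unit!r}")


print(round(tau_fold_ms(23.9, "per_ms"), 4))  # Villin: 0.0837
print(round(tau_fold_ms(1.5, "per_s"), 1))    # Protein G: 1333.3
```
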
## Values used (from paper)
| Protein | k (paper) | t_TP (µs) | N_residues | tau_fold_ms (2/k or 2000/k) |
|-------------|-------------|-----------|------------|-----------------------------|
| Villin | 23.9 ms⁻¹ | 3.1 | 35 | 0.0837 |
| WW domain | 17.1 ms⁻¹ | 3.8 | 37 | 0.117 |
| Protein A | 4.2 ms⁻¹ | 0.9 | 60 | 0.476 |
| λ repressor | 1.6 ms⁻¹ | 0.7 | 80 | 1.25 |
| gpW | 2.5 ms⁻¹ | 0.9 | 62 | 0.8 |
| ADA2h | 1.9 ms⁻¹ | 0.7 | 81 | 1.053 |
| Protein G | 1.5 s⁻¹ | 0.8 | 56 | 1333.3 |
| CspTm | 0.62 s⁻¹ | 0.8 | 67 | 3225.8 |
## Raw data and uncertainties
- **Zenodo:** 10.5281/zenodo.18354860 (experimental data and analysis; check supplementary or repository for τ_TP and k uncertainties if adding error columns).
- **GitHub:** https://github.com/hoisunglab/FRET_TransitionPath (analysis code and example data).
- **This folder:** `folding_raw_data/`, `foldingTP/` (unzipped), and per-protein zips contain the raw bursts and selected segments; fitted parameters (e.g. tau_S, resparam) are in the MATLAB `.mat` files (v7.3 format).
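
If an uncertainty on the relaxation rate is found in these sources, it can be converted to a `tau_fold_err_ms` value by first-order propagation through \(t_F = 2/k\). This is a sketch that assumes a small, symmetric uncertainty \(\sigma_k\) (a symbol introduced here) and neglects any error from the \(k_F \approx k_U\) midpoint approximation:

\[
\sigma_{t_F} = \left| \frac{\partial t_F}{\partial k} \right| \sigma_k = \frac{2\,\sigma_k}{k^2} = t_F \, \frac{\sigma_k}{k}.
\]

The relative error on \(t_F\) therefore equals the relative error on \(k\), so the same unit conversion (s to ms) applies to the error as to the value.
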
---
## Built-in reference set (self-consistency)
- **File:** `builtin_reference_set.csv`
- **Source:** The same 8-protein reference set (Protein_A through Protein_H) used in the script to define published coefficients (n-from-structure, τ_ref, tertiles). Values are chosen to exhibit the inverse relation (τ_fold vs τ_TP) and a range of N_residues (45–88). These values do not come from an independent lab; they are used for pipeline self-consistency and coefficient calibration.
- **Columns:** Same as above: name, tau_fold_ms, tau_TP_us, N_residues, temperature_K, source_note.
- **Use:** Run the script with `--csv data/builtin_reference_set.csv --validate --validate-sequence-only --test-universality` to reproduce reference-set metrics. For **independent** validation (different lab/source), a second dataset from the literature or Zenodo would be required; see [[Protein Folding Results]] and [[RST Protein Folding — Methods and Validation]].