Before bfPREP™: Looks Clean. Still Wrong.

Most biomedical datasets arrive in tidy rows and columns, but structure is not meaning. Hidden inconsistencies in units, free text, missing derived variables, and fragmented terminology quietly introduce false heterogeneity. Models learn artifacts instead of biology. Subgrouping breaks. Results don’t reproduce across cohorts or sites. This is where “clean” data fails.

After bfPREP™: Standardized. Supplemented. Defensible.

bfPREP™ converts the same inputs into a comparable, feature-complete, and auditable dataset. Synonyms collapse into controlled values. Critical variables are derived explicitly. Ambiguity is flagged, not buried. Every transformation is documented, versioned, and reproducible. The result: data you can trust for modeling, discovery, and decision-making.
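
To make "synonyms collapse into controlled values" concrete, here is a minimal Python sketch that splits a free-text regimen string into a canonical primary regimen plus modifiers. It is an illustration only, not bfPREP™'s implementation: the synonym table, the modifier patterns, and the function name `normalize_regimen` are all hypothetical, and a real vocabulary would be far larger and curated.

```python
import re

# Hypothetical synonym table: spelling key -> canonical regimen name.
# Keys are lowercased with whitespace and hyphens removed.
REGIMEN_SYNONYMS = {
    "folfirinox": "FOLFIRINOX",
    "folfox": "FOLFOX",
    "gem/abraxane": "GEM+NAB-PAC",
    "gemcitabine+nabpaclitaxel": "GEM+NAB-PAC",
}

# Hypothetical modifier patterns: regex -> controlled modifier value.
MODIFIER_PATTERNS = [
    (re.compile(r"\(?\s*dose[- ]reduced\s*\)?", re.IGNORECASE), "dose_reduced"),
    (re.compile(r"\+\s*RT\b", re.IGNORECASE), "+RT"),
]


def normalize_regimen(raw):
    """Return (primary, modifiers, needs_review) for a free-text regimen."""
    text = raw.strip()
    modifiers = []
    # Pull known modifiers out of the string before matching the primary.
    for pattern, label in MODIFIER_PATTERNS:
        if pattern.search(text):
            modifiers.append(label)
            text = pattern.sub("", text)
    # Build a spelling-insensitive lookup key.
    key = re.sub(r"[\s\-]", "", text).lower()
    primary = REGIMEN_SYNONYMS.get(key)
    if primary is None:
        # Unknown strings are flagged for review, never silently mapped.
        return raw.strip(), modifiers, True
    return primary, modifiers, False


print(normalize_regimen("folfiri-nox"))
print(normalize_regimen("FOLFIRINOX (dose-reduced)"))
print(normalize_regimen("gemcitabine + nab-paclitaxel"))
```

The design point is the third return value: a string the table does not recognize is passed through unchanged and flagged, so ambiguity stays visible instead of being buried.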

bfPREP Dataset Comparison
Before: Raw Data
After: Standardized

⚠️ Why ‘Clean’ Data Still Fails

  • Free-text categories fragment (RUL vs right upper lobe vs lung apex)
  • Units drift (lb vs kg, cm vs inches) with ambiguous bare numbers
  • Regimens explode into strings (dose-reduced, +RT) instead of canonical concepts
  • False heterogeneity dilutes signal and breaks subgrouping
  • Models learn unit artifacts instead of biology
| Patient ID | Height | Weight | Lesion Location | Drug Regimen | Start | Progression |
|---|---|---|---|---|---|---|
| P001 | 5’10” | 180 | RUL | FOLFIRINOX | 2024-01-12 | 2024-07-03 |
| P002 | 178 cm | 82 kg | right upper lobe | folfiri-nox | 2024-02-01 | 2024-06-28 |
| P003 | 1.72 | 165 lb | Lung apex (R) | FOLFIRINOX (dose-reduced) | 2024-01-20 | 2024-08-15 |
| P004 | 70 in | 75 | upper lobe right | FOLFOX | 2024-03-10 | 2024-05-02 |
| P005 | 170cm | 68kg | liver | Gem/Abraxane | 2024-02-22 | 2024-09-30 |
| P006 | 5 ft 6 | 150lbs | hepatic | gemcitabine + nab-paclitaxel | 2024-01-05 | 2024-10-12 |
| P007 | 165 | 60 | “R. upper” | FOLFIRINOX | 2024-04-01 | 2024-03-15 |
| P008 | 1.80 m | 92kg | lung | FOLFIRINOX | 2024-01-25 | (blank) |
| P009 | 6’1 | 210 | RUL / mediastinum | FOLFIRINOX + RT | 2024-02-18 | 2024-06-01 |
| P010 | 185 cm | 95 | right lung UL | FOLFIRINOX | 2024-03-02 | 2024-07-29 |
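
The unit drift in the Height and Weight columns above can be handled with a small parser that converts to canonical units and records whether the unit had to be guessed. The sketch below is hypothetical and is not bfPREP™'s actual rule set: the function names, the magnitude heuristic for bare heights, and the optional `assume_unit` hint for bare weights are all assumptions made for illustration.

```python
import re

IN_TO_CM = 2.54
LB_TO_KG = 0.45359237


def parse_height_cm(raw):
    """Return (height_cm, inferred); inferred=True means the unit was guessed."""
    s = raw.strip().lower()
    # Treat straight and curly foot/inch marks as "ft" and "in".
    for src, dst in (("’", "ft"), ("'", "ft"), ("”", "in"), ('"', "in")):
        s = s.replace(src, dst)
    # Feet-and-inches forms such as 5'10", 6'1, or "5 ft 6".
    m = re.fullmatch(r"(\d+)\s*ft\s*(\d+)?\s*(?:in)?", s)
    if m:
        total_inches = int(m.group(1)) * 12 + int(m.group(2) or 0)
        return round(total_inches * IN_TO_CM, 1), False
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(cm|m|in)?", s)
    if not m:
        return None, True
    value, unit = float(m.group(1)), m.group(2)
    inferred = unit is None
    if inferred:
        # Assumed heuristic: bare values under 3 are metres, 100-250 are cm.
        if value < 3:
            unit = "m"
        elif 100 <= value <= 250:
            unit = "cm"
        else:
            return None, True
    factor = {"cm": 1.0, "m": 100.0, "in": IN_TO_CM}[unit]
    return round(value * factor, 1), inferred


def parse_weight_kg(raw, assume_unit=None):
    """Return (weight_kg, inferred). Bare numbers need an assume_unit hint."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(kg|lbs?)?", raw.strip().lower())
    if not m:
        return None, True
    value, unit = float(m.group(1)), m.group(2)
    inferred = unit is None
    unit = unit or assume_unit
    if unit is None:
        # No unit and no hint: refuse to guess and leave the value for review.
        return None, True
    factor = {"kg": 1.0, "lb": LB_TO_KG, "lbs": LB_TO_KG}[unit]
    return round(value * factor, 1), inferred


print(parse_height_cm("5’10”"))
print(parse_height_cm("1.72"))
print(parse_weight_kg("165 lb"))
print(parse_weight_kg("180"))
```

Carrying the `inferred` flag alongside each value is what lets a downstream step mark a row as needing review rather than trusting a guessed unit.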
New Columns: Derived variables (BMI, PFS, flags)
Standardized: Normalized units and controlled vocabularies

✅ What bfPREP™ Delivers

  • Standardization: Lesion locations mapped to organ and subsite categories
  • Normalization: Regimens normalized into primary + modifiers
  • Supplementation: Derived columns (BMI, categories, time-to-event) added with explicit rules
  • Quality Control: Censoring and validity flags included for modeling reliability
  • Auditability: Each derived column has a manifest (inputs, method, version, coverage)
| Patient ID | Height (cm) | Weight (kg) | BMI | BMI Cat | Lesion Organ | Lesion Sub | Multifocal | Regimen | Modifiers | PFS (days) | PFS Valid | PFS Censored | Needs Review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P001 | 177.8 | 81.6 | 25.8 | Overwt | lung | UL_right | false | FOLFIRINOX | none | 173 | true | false | false |
| P002 | 178.0 | 82.0 | 25.9 | Overwt | lung | UL_right | false | FOLFIRINOX | none | 148 | true | false | false |
| P003 | 172.0 | 74.8 | 25.3 | Overwt | lung | apex_right | false | FOLFIRINOX | dose_reduced | 208 | true | false | true* |
| P004 | 177.8 | 75.0 | 23.7 | Normal | lung | UL_right | false | FOLFOX | none | 53 | true | false | true* |
| P005 | 170.0 | 68.0 | 23.5 | Normal | liver | liver | false | GEM+NAB-PAC | none | 221 | true | false | false |
| P006 | 167.6 | 68.0 | 24.2 | Normal | liver | liver | false | GEM+NAB-PAC | none | 281 | true | false | false |
| P007 | 165.0 | 60.0 | 22.0 | Normal | lung | UL_right | false | FOLFIRINOX | none | -17 | false | false | true* |
| P008 | 180.0 | 92.0 | 28.4 | Overwt | lung | lung_unspecified | false | FOLFIRINOX | none | | | true | true* |
| P009 | 185.4 | 95.3 | 27.7 | Overwt | lung | UL_right | true | FOLFIRINOX | +RT | 104 | true | false | true* |
| P010 | 185.0 | 95.0 | 27.8 | Overwt | lung | UL_right | false | FOLFIRINOX | none | 149 | true | false | false |

* Needs Review indicates unit inference, ambiguous free text, or QC failures (for example: negative PFS in P007, missing progression date in P008).
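
The derived columns and QC flags above can be expressed as explicit rules. The following Python sketch shows one way to do that, under stated assumptions: the function names are hypothetical, the BMI bands are the standard WHO adult cut-offs (the "Underwt" and "Obese" labels do not appear in the example table and are invented here), and treating a missing progression date as censored with an undetermined validity flag is an assumption rather than bfPREP™'s documented behavior.

```python
from datetime import date


def derive_bmi(height_cm, weight_kg):
    """BMI = weight (kg) / height (m) squared, rounded to one decimal."""
    return round(weight_kg / (height_cm / 100) ** 2, 1)


def bmi_category(bmi):
    """Standard WHO adult bands; the short labels are assumed for illustration."""
    if bmi < 18.5:
        return "Underwt"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overwt"
    return "Obese"


def derive_pfs(start, progression):
    """Return PFS in days plus validity, censoring, and review flags."""
    if progression is None:
        # Assumption: no progression date means censored, with no computable
        # interval, so validity is left undetermined and the row is flagged.
        return {"pfs_days": None, "pfs_valid": None,
                "pfs_censored": True, "needs_review": True}
    days = (progression - start).days
    valid = days >= 0  # progression before start is a QC failure
    return {"pfs_days": days, "pfs_valid": valid,
            "pfs_censored": False, "needs_review": not valid}


# Rows P001, P007, and P008 from the example above.
print(derive_bmi(177.8, 81.6), bmi_category(derive_bmi(177.8, 81.6)))
print(derive_pfs(date(2024, 1, 12), date(2024, 7, 3)))
print(derive_pfs(date(2024, 4, 1), date(2024, 3, 15)))
print(derive_pfs(date(2024, 1, 25), None))
```

Keeping the negative interval for P007 in the output, with `pfs_valid` set to false, preserves the evidence of the date error instead of silently dropping or clipping the value.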

Don’t debug your data after models fail.
Start with a focused bfPREP™ Data Prep Audit and get a clear, actionable view of what’s holding your data and decisions back.

Request your bfPREP™ Data Prep Audit
Understand the risk. Fix it early. Move forward with confidence.