How the data flows
The pipeline.
Ten stages, run end-to-end with `python src/main.py` from the project root. Each stage writes a CSV (or a SQLite table) that the next stage reads.
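The hand-off between stages can be pictured with a small sketch. This is not the project's `main.py`: the two stage functions, the file names `raw.csv` and `cleaned.csv`, and the toy rows are all hypothetical, chosen only to show one stage writing a CSV that the next one reads.

```python
import csv
import tempfile
from pathlib import Path

def collect(out_path: Path) -> None:
    """Hypothetical first stage: write raw rows to a CSV."""
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        writer.writerows([["gpu_a", "300"], ["gpu_b", ""], ["ssd_a", "80"]])

def clean(in_path: Path, out_path: Path) -> None:
    """Hypothetical second stage: read the previous CSV, drop rows with no price."""
    with in_path.open(newline="") as f:
        rows = [row for row in csv.DictReader(f) if row["price"]]
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)

def run_pipeline(workdir: Path) -> Path:
    """Run the stages in order; each one reads the file the previous one wrote."""
    raw = workdir / "raw.csv"
    cleaned = workdir / "cleaned.csv"
    collect(raw)
    clean(raw, cleaned)
    return cleaned

def count_rows(path: Path) -> int:
    """Count data rows (excluding the header) in a CSV file."""
    with path.open(newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

with tempfile.TemporaryDirectory() as tmp:
    final_csv = run_pipeline(Path(tmp))
    cleaned_row_count = count_rows(final_csv)
```

Keeping every intermediate result on disk means a single stage can be re-run or inspected without repeating the ones before it.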
Stages
| # | Script | What it does |
|---|---|---|
| 1 | data_collection.py | Merges 2023 + 2025 PCPartPicker CSVs, then ingests real daily prices from HardwareDealsCo for GPU, SSD, and RAM. |
| 2 | data_cleaning.py | IQR caps, spec parsing, component_id construction, cross-year matching. |
| 3 | generate_synthetic.py | 29-month synthetic series with 9 encoded events, then bridges real bucket prices into per-SKU real_blended rows. |
| 4 | database_setup.py | SQLite schema with 6 tables + 5 sample queries. |
| 5 | model_training.py | LR / DT / RF + naive baseline + TimeSeriesSplit CV + permutation importance, then 3 real-data backtests (GPU / Storage / RAM). |
| 6 | estimator.py | Budget / mid / high tier build cost forecast + k-means tier discovery. |
| 7 | value_metrics.py | $/GB VRAM, $/core, $/GB metrics on real observed prices. |
| 8 | spec_classifier.py | Multi-class spec-to-tier classifier with stratified CV + calibration curves. |
| 9 | spec_regression.py | Cross-sectional regression on real prices: train on 2023, test on 2025. |
| 10 | data_visualization.py | Plots in visuals/, with annotated event windows and a synthetic-vs-real GPU overlay. |
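As an illustration of the "IQR caps" step in stage 2, here is a minimal sketch of capping outliers at the IQR fences. The 1.5 × IQR multiplier, the function name `iqr_cap`, and the sample prices are assumptions; the project's actual thresholds and quantile method are not stated here.

```python
import statistics

def iqr_cap(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]. k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences pass through unchanged; outliers are pulled to the fence.
    return [min(max(v, low), high) for v in values]

# Illustrative prices with one obvious outlier.
prices = [100, 110, 120, 130, 140, 5000]
capped = iqr_cap(prices)
```

With these six prices the quartiles are 112.5 and 137.5, so the upper fence is 175 and only the 5000 entry is changed. Capping, unlike dropping, keeps the row available for later stages.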
Data Lineage
Real-blended rows by category
| Category | Rows |
|---|---|
| GPU | 1,804 |
| RAM | 3,265 |
| Storage | 5,084 |
External real-price tables
HardwareDealsCo daily snapshots, Sep 2025 – May 2026.
| Table | Rows |
|---|---|
| gpu_chipset_real_prices | 602 |
| drive_real_prices | 1,446 |
| ram_real_prices | 569 |
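Counts like these could be re-checked against the SQLite database with a query per table. The sketch below uses the three table names listed above, but the column layout (`snapshot_date`, `price`), the in-memory database, and the sample rows are assumptions for illustration; the real schema and database path are not given here.

```python
import sqlite3

# Table names come from the list above; columns and rows below are made up.
REAL_PRICE_TABLES = (
    "gpu_chipset_real_prices",
    "drive_real_prices",
    "ram_real_prices",
)

def count_rows(conn, tables):
    """Return {table_name: row_count} for each table in the fixed list."""
    counts = {}
    for table in tables:
        # Names come from a hard-coded tuple, so interpolating them is safe here.
        (n,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        counts[table] = n
    return counts

# Stand-in for the project database: an in-memory DB with an assumed schema.
conn = sqlite3.connect(":memory:")
for table in REAL_PRICE_TABLES:
    conn.execute(f'CREATE TABLE "{table}" (snapshot_date TEXT, price REAL)')
conn.execute("INSERT INTO ram_real_prices VALUES ('2025-09-01', 89.99)")
conn.execute("INSERT INTO ram_real_prices VALUES ('2025-09-02', 91.50)")

row_counts = count_rows(conn, REAL_PRICE_TABLES)
conn.close()
```

Pointing `sqlite3.connect` at the project's database file instead of `:memory:` would run the same check on the real tables.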
Where Each Stage Gets Its Data
| Stage | Reads |
|---|---|
| 5 · regression | synthetic + real-blended price history |
| 5 · real backtests | real-blended rows only, last 3 months held out |
| 6 · estimator | forecast + k-means on real specs |
| 7 · value metrics | real observed prices (PCPartPicker) |
| 8 · classifier | real observed prices (PCPartPicker) |
| 9 · spec regression | real observed 2025 prices, cross-sectional |
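The "last 3 months held out" split used by the real backtests can be sketched as follows. The row layout (a `month` string in `YYYY-MM` form plus a `price`), the function name, and the sample values are assumptions; the actual backtest code may key on dates differently.

```python
def holdout_last_months(rows, n_months=3):
    """Split rows into (train, test), using the last n_months distinct months as test."""
    # "YYYY-MM" strings sort chronologically, so a plain sort is enough.
    months = sorted({row["month"] for row in rows})
    test_months = set(months[-n_months:])
    train = [row for row in rows if row["month"] not in test_months]
    test = [row for row in rows if row["month"] in test_months]
    return train, test

# Toy rows; the prices are illustrative only.
rows = [
    {"month": "2025-11", "price": 410.0},
    {"month": "2025-12", "price": 405.0},
    {"month": "2026-01", "price": 399.0},
    {"month": "2026-02", "price": 402.0},
    {"month": "2026-03", "price": 395.0},
    {"month": "2026-04", "price": 390.0},
    {"month": "2026-05", "price": 388.0},
]
train, test = holdout_last_months(rows)
```

Splitting on time rather than at random keeps every test row strictly later than every training row, which is what makes the result a backtest and not an interpolation check.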