HackMerced / Team AB @ UC San Diego

Sporisk

Spatio-Temporal Deep Learning for Valley Fever Risk Prediction in the California Central Valley
March 2026 · 8 Counties · 2020–2026 Dataset · Random Forest + T-GCN
A
Samudera Bagas Aubreyasta
Data Science
UC San Diego · HackMerced 2026
B
Moch Raka Aryaputra
Mathematics - Applied Science
UC San Diego · HackMerced 2026
C
Olo Hot B. M. S. Margura Silitonga
Electrical Engineering
UC San Diego · HackMerced 2026
D
Nathan Raphael Martua Nainggolan
Urban Studies & Planning
UC San Diego · HackMerced 2026
About the Authors

We're a team of four from UC San Diego competing at HackMerced XI, drawn together by a shared interest in using data science to tackle real public health challenges. Sporisk started as a question: could freely available environmental data predict a disease that's been largely invisible to the public? We're here to explore the intersection of machine learning, epidemiology, and community impact — and to learn how spatial-temporal modeling can be applied to problems that matter most to underserved communities in California's Central Valley.

Abstract

Why This Matters

Valley Fever (coccidioidomycosis) is caused by inhaling soil-dwelling Coccidioides fungal spores. California reported a near-record 9,054 cases in 2023 and surpassed that with ~12,500 cases in 2024. In 2024, Kern County alone accounted for 3,990 cases and 49 deaths. The disease disproportionately affects agricultural workers, incarcerated populations, and outdoor laborers: the communities right here in the Central Valley that UC Merced exists to serve.

Sporisk transforms freely available environmental data from NOAA, EPA, and CIMIS into a 4-to-6 month early warning system. By predicting when spore risk will spike, health clinics can stock antifungals early, county health departments can run targeted outreach, and agricultural employers can schedule protective measures — all before the outbreak hits.

We combine a Random Forest baseline with a Temporal Graph Convolutional Network (T-GCN) to model both the geographic spread of dust-borne risk between counties and the multi-month biological lag between weather and disease.

📊
18,056
Daily observations across 8 counties (2020–2026)
🧬
6,289
Real CDPH-verified Valley Fever cases in 2024 (8-county)
🎯
sm_lag6
#1 learned feature: soil moisture 6 months ago (22.3% importance)
Biology

The Coccidioides Lifecycle & The "Grow and Blow" Hypothesis

Coccidioides immitis is a dimorphic fungus that alternates between a saprobic phase in soil and a parasitic phase in mammalian lungs. Understanding this lifecycle is the entire foundation of our prediction model — the fungus doesn't just passively sit in dirt. It actively grows, fragments, and disperses in a predictable, weather-driven sequence.

🌧️
1. Winter Rain
Precipitation saturates alkaline soil. Fungal hyphae proliferate through the soil matrix at 10–40°C (optimal at ~20cm depth).
☀️
2. Desiccation
As soil dries, every other cell along the hyphae dies (autolysis), leaving barrel-shaped arthroconidia: 2–4 µm spores.
4–6 months later
💨
3. Aerosolization
Wind, construction, and agricultural tilling disturb dry topsoil, launching arthroconidia into the air as PM10 particulates.
🫁
4. Infection
Inhaled spores transform into spherules in the lungs. ~60% asymptomatic; ~1% disseminate to skin, bones, or CNS (fatal meningitis).

This is the "grow and blow" hypothesis: wet winters grow fungal biomass, dry summers blow spores into the air. The result is a predictable 4–6 month lag between heavy precipitation and peak Valley Fever incidence. But it doesn't stop there — precipitation 1.5 to 2 years prior is actually the dominant predictor in some models, because drought preceding a wet season eliminates competing soil microbes, letting Coccidioides multiply unopposed when moisture returns. California's 2023 record was a direct consequence of this drought-to-deluge "whiplash."
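
The lag idea above can be illustrated with a small cross-correlation scan. This is a minimal sketch on synthetic monthly series (not real NOAA or CDPH data); the helper names `pearson` and `best_lag` are hypothetical, and the toy "cases" series is constructed to echo rain exactly 5 months later so the scan has a known answer.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_lag(driver, response, max_lag):
    """Return (lag, r) that maximizes corr(driver[t - lag], response[t])."""
    scores = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]   # driver, shifted back by `lag` months
        y = response[lag:]                # response aligned to the shifted driver
        scores[lag] = pearson(x, y)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Synthetic series (assumption, for illustration only): rain peaks every
# January, and the toy case count echoes rain 5 months later.
months = range(48)
rain = [max(0.0, math.cos(2 * math.pi * m / 12)) * 50 for m in months]
cases = [10.0] * 5 + [10 + 2 * r for r in rain[:-5]]

lag, r = best_lag(rain, cases, max_lag=11)
print(lag, round(r, 3))
```

On real data the peak would be broader and noisier, and a multi-year signal such as the 1.5 to 2 year precipitation effect would need a longer `max_lag` and a longer record.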

Feature | Saprobic Phase (Soil) | Parasitic Phase (Host)
Morphology | Septate hyphae / Mycelia | Spherules / Endospores
Infectious Unit | Arthroconidia (2–4 µm) | Endospores (from spherules)
Trigger | Soil desiccation / Heat stress | Body temperature / Host nutrients
Function | Dispersal and environmental survival | Proliferation and host dissemination
Reproduction | Asexual fragmentation (autolysis) | Endosporulation within spherules
Data Pipeline

Environmental Predictors & Standardization

Predicting risk for Fresno, Kern, Kings, Madera, Merced, San Joaquin, Stanislaus, and Tulare requires integrating air quality, soil, and meteorological data from specific California monitoring networks. Our pipeline pulls from three sources, then runs a six-stage standardization process.

Variable | Source | Lag Period | Significance | Role
PM10 | EPA AQS / CARB | 1–2 months | High | Airborne spore proxy (Erisk)
Soil Moisture | Open-Meteo / CIMIS | Current & 6–12 mo | Very High | Growth hydration (Gpot) + Aridity (Erisk)
Max Temperature | Open-Meteo / NOAA | 1–2 months | Moderate-High | Maturation trigger (both phases)
Precipitation | Open-Meteo / CDEC | 1.5–2 years | High (peaks) | Multi-year growth signal (Gpot)
Wind Speed | Open-Meteo / CIMIS | Current | Low-Moderate | Spore transport vector (Erisk)
Case Counts | CDPH (verified) | | Target variable | Training target

The standardization pipeline follows six stages: (1) temporal alignment at daily granularity; (2) imputation (forward-fill for soil, interpolation for wind and temperature, zero-fill for precipitation); (3) outlier clipping at physical boundaries; (4) lag feature engineering that encodes the grow-and-blow biology; (5) Z-score normalization (preferred for GRU gates because it preserves distributional shape); and (6) sequence windowing (6-month sliding windows for the T-GCN).
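
The stages above can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the project's actual pipeline: all function names are hypothetical, the toy series are monthly rather than daily, and the clipping bounds (0 to 500 mm) are placeholder values.

```python
import math

def forward_fill(values):
    """Replace None with the most recent observation (soil moisture)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def interpolate(values):
    """Linearly fill interior None gaps (wind, temperature); edge gaps stay None."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            out[i] = out[a] + (out[b] - out[a]) * (i - a) / (b - a)
    return out

def zero_fill(values):
    """Treat missing precipitation as no rain."""
    return [0.0 if v is None else v for v in values]

def clip(values, lo, hi):
    """Clamp values to physical bounds."""
    return [min(max(v, lo), hi) for v in values]

def lag(values, k):
    """Shift a series k steps into the past; the first k entries are None."""
    return [None] * k + values[: len(values) - k]

def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def sliding_windows(rows, size):
    """All contiguous windows of `size` consecutive rows."""
    return [rows[i : i + size] for i in range(len(rows) - size + 1)]

# Toy monthly values (assumption, not real observations).
soil = forward_fill([0.30, None, 0.24, None, 0.18, 0.15, 0.12, 0.10])
tmax = interpolate([14.0, None, 22.0, 27.0, None, 35.0, 37.0, 36.0])
rain = clip(zero_fill([40.0, None, 12.0, 0.0, -1.0, 0.0, 3.0, 0.0]), 0.0, 500.0)
sm_lag6 = lag(soil, 6)
rows = list(zip(zscore(soil), zscore(tmax), zscore(rain)))
windows = sliding_windows(rows, size=6)
print(len(windows), sm_lag6[6:])
```

Lag features are built before normalization here so that a lagged column can be standardized like any other; the original ordering of those two steps in the real pipeline follows the stage list above.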

Algorithm

Two-Phase Risk Index Sandbox

Our refined algorithm splits risk into two biological phases. Growth Potential captures antecedent conditions that grow fungal biomass. Exposure Risk captures current conditions that aerosolize spores. They multiply — because growth without dispersal produces zero human cases.

Growth Phase
Gpot = 0.35 × SM_lag6mo + 0.20 × T_lag6mo + 0.30 × P_lag1.5yr
Was the ground wet 6 months ago? Was it warm enough for growth? Did it rain heavily 1.5 years ago (multi-year drought/deluge signal)?
Dispersal Phase
Erisk = 0.25 × PM10_1mo + 0.15 × (1 − SM_now) + 0.05 × Wind + 0.20 × Tmax
Is the air dusty? Is the soil dry enough to release spores? Are winds carrying particulates? Is it hot enough for maturation?

Variable Definitions & Weight Derivation

Each weight represents that variable's relative importance in predicting Valley Fever incidence, derived from Multivariable Negative Binomial Regression (MNBR) adjusted Incidence Rate Ratios (aIRR) reported in peer-reviewed epidemiological studies of coccidioidomycosis in California and Arizona. The weights are not normalized to 1 within a phase: the Gpot weights sum to 0.85 and the Erisk weights sum to 0.65, proportions intended to reflect the literature consensus.

Gpot — Growth Phase Variables

Weight | Variable | What It Measures | Why This Weight
0.35 | SM_lag6mo | Average volumetric soil moisture (m³/m³) from 6 months prior. Measured at 0–7cm depth via Open-Meteo ERA5 reanalysis, calibrated against CIMIS ground stations. | MNBR studies assign the highest aIRR (1.8–2.0) to lagged soil moisture. Our Random Forest independently confirmed this: sm_lag6 ranked #1 at 22.3% importance. The fungus literally cannot grow without soil hydration, making this the dominant biological driver.
0.20 | T_lag6mo | Maximum daily temperature (°C) from 6 months prior. Coccidioides grows optimally at 10–40°C at 20cm soil depth; hotter summers produce more fungal biomass. | Temperature has an aIRR of ~1.9 in fall season models. However, it's partially collinear with soil moisture (hot periods are usually dry periods), so its independent contribution is moderate. Weight reflects the ~20% relative importance from Table 3 of our research analysis.
0.30 | P_lag1.5yr | Total precipitation (mm) from 18 months prior. This captures the multi-year drought/deluge cycle: the "whiplash" effect where drought kills competing soil microbes, then rain lets Coccidioides proliferate unopposed. | LSTM models on Maricopa County data showed precipitation 1.5–2 years prior accounts for up to half the total variance in peak-season incidence. This is the variable that explains why 2023–2024 broke records: the 2021–2022 drought "sterilized" the soil, then 2022–2023 rains fed an unchallenged fungal bloom.

Erisk — Dispersal Phase Variables

Weight | Variable | What It Measures | Why This Weight
0.25 | PM10_1mo | Average concentration of particulate matter ≤10µm (µg/m³) from 1 month prior. Sourced from EPA AQS bulk data (parameter code 81102) across SJVAPCD's 38 valley monitoring sites. | Arthroconidia are 2–4µm, so they fall within PM10. During wind events in arid regions, geologic dust comprises >90% of PM10. MNBR studies show a positive exposure-response relationship with aIRR of 1.5–1.7 for cumulative dust exposure 1–3 months before disease onset.
0.15 | 1 − SM_now | Current soil aridity: one minus today's soil moisture. When soil moisture is 0.05 m³/m³ (very dry), aridity = 0.95. When soil is saturated at 0.50, aridity = 0.50. | Aerosolization requires dry topsoil. Capillary forces in moist soil physically prevent dust from becoming airborne. MNBR assigns aIRR of 1.2–1.4 for surface aridity. Weight is lower than PM10 because aridity is a necessary condition but PM10 is the actual measurement of airborne particles.
0.05 | Wind | Current daily maximum wind speed (km/h) from Open-Meteo. Represents the transport vector that lofts spores from soil into breathable air. | Surprisingly, average wind speed has the lowest aIRR (0.4–0.6) of all variables. This is because monthly average wind doesn't capture what actually matters: acute gusts during dust storms/haboobs. Since we use monthly averages (not gust data), we assign the lowest weight. PM10 already captures the effect of wind events more directly.
0.20 | Tmax | Maximum temperature (°C) from 1 month prior. High temperatures (>30°C) trigger the morphological shift from mycelia to arthroconidia via desiccation stress. | Fall-season MNBR models show Tmax with aIRR of ~1.9. Temperature accelerates arthroconidia maturation; the "blow" phase requires heat to fragment the hyphae. Combined with dry soil, this creates the late-summer/early-fall case peak seen in CDPH data.

Why multiplicative? Risktotal = Gpot × Erisk because the biology demands it. If the fungus grew abundantly (high Gpot) but the soil is currently wet and there's no wind (Erisk ≈ 0), no spores reach human lungs — zero cases. Conversely, if conditions are perfect for dispersal (high Erisk) but there was no moisture to grow fungal biomass (Gpot ≈ 0), there are no spores to disperse. Only when both phases align do outbreaks occur. Addition would incorrectly predict risk when only one phase is active.
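
The two-phase index can be written out directly from the formulas. This is a minimal sketch: the weights and the ×100 scaling come from the text, while the function names and the assumption that every input has already been scaled to the range 0 to 1 are ours (the source does not state how raw units are scaled).

```python
def growth_potential(sm_lag6, t_lag6, p_lag18):
    """Gpot: antecedent conditions that grow fungal biomass (inputs assumed in [0, 1])."""
    return 0.35 * sm_lag6 + 0.20 * t_lag6 + 0.30 * p_lag18

def exposure_risk(pm10_lag1, sm_now, wind, tmax):
    """Erisk: current conditions that aerosolize spores (inputs assumed in [0, 1])."""
    return 0.25 * pm10_lag1 + 0.15 * (1.0 - sm_now) + 0.05 * wind + 0.20 * tmax

def risk_score(gpot, erisk):
    """Multiplicative index: either phase at zero drives the score to zero."""
    return gpot * erisk * 100.0

# The sandbox reading quoted in this section: Gpot 0.440, Erisk 0.327.
sandbox = risk_score(0.440, 0.327)

# Abundant growth but wet soil, clean air, no wind, cool weather: no dispersal.
no_dispersal = risk_score(growth_potential(1.0, 1.0, 1.0),
                          exposure_risk(0.0, 1.0, 0.0, 0.0))

# Upper bound when every input is at its maximum of 1.
ceiling = risk_score(growth_potential(1.0, 1.0, 1.0),
                     exposure_risk(1.0, 0.0, 1.0, 1.0))
print(round(sandbox), no_dispersal, round(ceiling, 2))
```

Under the 0-to-1 scaling assumption, the weights cap Gpot at 0.85 and Erisk at 0.65, so the score cannot exceed about 55; an additive index would instead report nonzero risk in the no-dispersal case.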

Validation: Our Random Forest independently learned feature importances that closely mirror these literature-derived weights — sm_lag6 at 22.3% (#1), PM10 variables at 22.7% combined (#2), and wind at the lowest individual importance. This convergence between literature and data-driven importance is strong evidence the formula captures real biological dynamics.

Growth Phase Inputs

Soil moisture, 6 months ago (m³/m³): higher = wetter soil during growth phase
Temperature, 6 months ago: fungal growth optimal at 10–40°C at 20cm depth
Precipitation, 1.5 years ago: multi-year drought/deluge signal, dominant predictor

Dispersal Phase Inputs

PM10, 1 month ago: particulate matter, proxy for airborne arthroconidia
Current soil moisture (m³/m³): lower = drier = more aerosolization potential
Wind: transport vector; acute gusts matter more than averages
Max temperature: maturation trigger for arthroconidia formation

Example sandbox reading: RISK = Gpot × Erisk × 100. With Gpot = 0.440 and Erisk = 0.327, the risk score is 14 (Moderate).
Machine Learning

Model Architecture: Random Forest + T-GCN

We train two complementary models. The Random Forest serves as a fast, interpretable baseline — it needs manually engineered lag features but tells us which variables matter most. The T-GCN is our advanced model — it combines a Graph Neural Network (spatial: how risk spreads between neighboring counties) with a GRU (temporal: how weather months ago drives cases today), learning both spatial and temporal patterns directly from raw data.

Random Forest Baseline

200 trees, max depth 10. Uses 16 handcrafted features, including seven lag features that encode grow-and-blow biology. Validated with Leave-One-County-Out CV (trains on 7 counties, predicts the 8th), which tests spatial generalization.

Features: precip_mm, soil_moisture, wind, pm10, tmax,
  soil_aridity, precip_lag3, precip_lag6, precip_lag18,
  sm_lag6, pm10_lag1, tmax_lag1, tmax_lag6,
  wind_roll3, pm10_roll3, month
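
The Leave-One-County-Out scheme described above can be sketched as a plain split generator. This is an illustration only: the row layout and the `leave_one_county_out` helper are hypothetical, and the toy rows carry placeholder values rather than real features. In practice a forest with 200 trees and max depth 10 would be fit on each `train` fold and scored on the matching `test` fold; scikit-learn's `LeaveOneGroupOut` produces the same kind of split.

```python
COUNTIES = ["Fresno", "Kern", "Kings", "Madera",
            "Merced", "San Joaquin", "Stanislaus", "Tulare"]

def leave_one_county_out(rows):
    """Yield (held_out, train_rows, test_rows), one fold per county."""
    counties = sorted({r["county"] for r in rows})
    for held_out in counties:
        train = [r for r in rows if r["county"] != held_out]
        test = [r for r in rows if r["county"] == held_out]
        yield held_out, train, test

# Toy rows (assumption): three months per county with placeholder values.
rows = [{"county": c, "month": m, "sm_lag6": 0.2, "cases": 0}
        for c in COUNTIES for m in (1, 2, 3)]

folds = list(leave_one_county_out(rows))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Because the held-out county never appears in training, a good score on its fold indicates the model transfers across space rather than memorizing one county's history.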

T-GCN (Temporal Graph Convolutional Network)

Combines GCN spatial aggregation with GRU temporal memory. Uses 5 raw features; the GRU learns lag relationships on its own from 6-month sliding windows. Counties are nodes in a graph; edges are real geographic adjacencies.

Input: (6 months × 8 counties × 5 features)
  
GCN Layer (5→16) — A_hat @ X @ W + b, ReLU
   counties share info with neighbors
GRU Cell (16→16) — update gate + reset gate
   processes 6-month temporal sequence
Dense (16→1) → Risk Score per county
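
The GCN step in the diagram, ReLU(A_hat @ X @ W + b), can be traced by hand on a tiny graph. This is a sketch under stated assumptions: it uses 3 nodes, 2 input features, and 2 hidden units instead of the 8 counties and 5→16 layer above, the weights are made-up constants rather than trained values, and the toy `a_hat` simply averages each node with its neighbors.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(a_hat, x, w, b):
    """One graph convolution for one time step: ReLU(A_hat @ X @ W + b)."""
    h = matmul(matmul(a_hat, x), w)
    return [[max(0.0, v + b[j]) for j, v in enumerate(row)] for row in h]

# Toy chain graph 0 - 1 - 2; each row averages a node with its neighbors.
a_hat = [[0.5, 0.5, 0.0],
         [1 / 3, 1 / 3, 1 / 3],
         [0.0, 0.5, 0.5]]
x = [[1.0, 0.0],   # node features (rows = nodes, columns = features)
     [0.0, 1.0],
     [1.0, 1.0]]
w = [[1.0, -1.0],  # made-up weights, 2 inputs -> 2 hidden units
     [0.5, 0.5]]
b = [0.0, 0.1]

out = gcn_layer(a_hat, x, w, b)
print(out)
```

In the full model this layer would run once per month in the 6-month window, and the resulting per-county vectors would be fed to the GRU in sequence.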

Random Forest Feature Importance (Learned from Corrected CDPH Data)

The model independently learned that soil moisture 6 months ago (sm_lag6) is the #1 predictor — exactly what the peer-reviewed literature predicts. This is empirical validation of the grow-and-blow hypothesis from our own data.

County Adjacency Graph

The GCN uses this graph to let neighboring counties influence each other's predictions — so a dust storm originating in Kern affects Tulare and Kings' risk scores. 8 nodes, 11 edges based on real California county borders.

San Joaquin — Stanislaus — Merced — Madera
                                 |            |
                           Fresno
                         /     |     \
                    Kings — Tulare
                       \    /
                       Kern
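
One common way to turn such a graph into the A_hat used by a GCN is symmetric normalization with self-loops, D^(-1/2) (A + I) D^(-1/2). The sketch below assumes that convention (the source does not say how A_hat is built) and uses only the 10 edges that can be read from the diagram above; the text states 11 edges, so one edge is not recoverable here and is left out rather than guessed.

```python
import math

COUNTIES = ["San Joaquin", "Stanislaus", "Merced", "Madera",
            "Fresno", "Kings", "Tulare", "Kern"]

# Edges as read from the diagram (assumption: this list is incomplete by one).
EDGES = [("San Joaquin", "Stanislaus"), ("Stanislaus", "Merced"),
         ("Merced", "Madera"), ("Merced", "Fresno"), ("Madera", "Fresno"),
         ("Fresno", "Kings"), ("Fresno", "Tulare"), ("Kings", "Tulare"),
         ("Kings", "Kern"), ("Tulare", "Kern")]

def normalized_adjacency(nodes, edges):
    """Return D^(-1/2) (A + I) D^(-1/2) as a nested list."""
    idx = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    # Start from the identity so every county keeps its own signal.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[idx[u]][idx[v]] = 1.0
        a[idx[v]][idx[u]] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

a_hat = normalized_adjacency(COUNTIES, EDGES)
kern, tulare, fresno = (COUNTIES.index(c) for c in ("Kern", "Tulare", "Fresno"))
print(round(a_hat[kern][tulare], 4), a_hat[kern][fresno])
```

A nonzero entry between Kern and Tulare is what lets a dust event in one county raise the other's score after a single GCN layer; Kern and Fresno share no edge, so they only interact through an intermediate county.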
Real Data

Central Valley County Dashboard

The dashboard shows monthly environmental data from NOAA, EPA, and CDPH (2020–2026) for a selected county, making the temporal lag between precipitation spikes and risk index peaks visible.

Chart panels (Kern shown): Precipitation & Risk Score; PM10, Soil Moisture & Wind; Growth Potential vs Exposure Risk.

Validation

Proof: Real CDPH Case Data vs. Model Predictions

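To validate our risk model, we overlay actual reported Valley Fever cases from the California Department of Public Health (CDPH) and Kern County Public Health against our model's predicted risk scores. In Kern County, the lowest average risk scores fall in the 2021–2022 drought years and the highest in 2024, which matches the overall direction of reported cases and is consistent with the "grow and blow" hypothesis. With only five annual data points, this is a small-sample check.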

2020–2022 Drought
Cases Drop
During California's drought, Kern County cases fell from 2,954 (2020) to 2,407 (2022). Our model's Growth Potential (Gpot) correctly stays low — no rain means no fungal growth.
↓ 19%
2022–2023 Whiplash
Cases Spike
Extreme drought-to-deluge transition. California reported 9,054 cases in 2023 (+21% statewide). Kern jumped to 3,152. Our model's Gpot correctly spikes 6 months after the 2022–23 winter rains.
↑ 31%
2024 Record
All-Time High
California shattered records with ~12,500 cases. Kern County alone: 3,990 cases, 49 deaths. Our model's risk score peaks in late 2024 at the highest values in the dataset.
↑ 27%

Kern County: CDPH Reported Cases vs. Model Risk Score (2020–2024)

Year | CDPH Cases (Kern) | CA Statewide | Avg Risk Score | Peak Gpot | Climate Context
2020 | 2,954 | ~7,000 | 10.6 | 0.358 | Drought onset
2021 | 3,045 | ~7,200 | 8.0 | 0.365 | Severe drought
2022 | 2,407 | ~7,480 | 8.5 | 0.332 | Drought → rain begins late
2023 | 3,152 | 9,054 | 9.3 | 0.516 | Extreme wet after drought
2024 | 3,990 | ~12,500 | 12.0 | 0.611 | Record: continued wet + 1.5yr lag
Sources: Kern County case counts from Kern County Public Health Valley Fever Dashboard (press conferences 2022–2025). California statewide from CDPH Epidemiologic Summaries (2020–2021, 2022) and CDPH press releases (2023–2025). Risk scores computed from Sporisk model using NOAA, EPA AQS, and CIMIS data.
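
The percentage changes quoted in the cards above, and a rough agreement measure between cases and risk score, can be recomputed from the Kern table. This is a minimal sketch: the numbers are copied from the table, the helper names are hypothetical, and a correlation over five annual points should be read as a sanity check, not a statistical test.

```python
import math

kern_cases = {2020: 2954, 2021: 3045, 2022: 2407, 2023: 3152, 2024: 3990}
avg_risk = {2020: 10.6, 2021: 8.0, 2022: 8.5, 2023: 9.3, 2024: 12.0}

def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

drought = pct_change(kern_cases[2020], kern_cases[2022])    # 2020 -> 2022
whiplash = pct_change(kern_cases[2022], kern_cases[2023])   # 2022 -> 2023
record = pct_change(kern_cases[2023], kern_cases[2024])     # 2023 -> 2024

years = sorted(kern_cases)
r = pearson([kern_cases[y] for y in years], [avg_risk[y] for y in years])

print(round(drought), round(whiplash), round(record), round(r, 2))
```

The three rounded changes reproduce the ↓ 19%, ↑ 31%, and ↑ 27% figures, and the correlation between annual Kern cases and average risk score comes out at roughly 0.76. Note that 2021 runs against the trend: cases rose slightly while the average risk score fell.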