Valley Fever (coccidioidomycosis) is caused by inhaling spores of the soil-dwelling fungus Coccidioides. California reported a near-record 9,054 cases in 2023 and shattered that with ~12,500 cases in 2024. Kern County alone accounted for 3,990 of the 2024 cases, along with 49 deaths. The disease disproportionately affects agricultural workers, incarcerated populations, and outdoor laborers: the communities right here in the Central Valley that UC Merced exists to serve.
Sporisk transforms freely available environmental data from NOAA, EPA, and CIMIS into a 4-to-6 month early warning system. By predicting when spore risk will spike, health clinics can stock antifungals early, county health departments can run targeted outreach, and agricultural employers can schedule protective measures — all before the outbreak hits.
We combine a Random Forest baseline with a Temporal Graph Convolutional Network (T-GCN) to model both the geographic spread of dust-borne risk between counties and the multi-month biological lag between weather and disease.
Coccidioides immitis is a dimorphic fungus that alternates between a saprobic phase in soil and a parasitic phase in mammalian lungs. Understanding this lifecycle is the entire foundation of our prediction model — the fungus doesn't just passively sit in dirt. It actively grows, fragments, and disperses in a predictable, weather-driven sequence.
This is the "grow and blow" hypothesis: wet winters grow fungal biomass, dry summers blow spores into the air. The result is a predictable 4–6 month lag between heavy precipitation and peak Valley Fever incidence. The signal extends further back, too: precipitation 1.5 to 2 years prior is the dominant predictor in some models, because drought preceding a wet season eliminates competing soil microbes, letting Coccidioides multiply unopposed when moisture returns. California's 2023 surge is consistent with this drought-to-deluge "whiplash."
| Feature | Saprobic Phase (Soil) | Parasitic Phase (Host) |
|---|---|---|
| Morphology | Septate hyphae / Mycelia | Spherules / Endospores |
| Infectious Unit | Arthroconidia (2–4 µm) | Endospores (from spherules) |
| Trigger | Soil desiccation / Heat stress | Body temperature / Host nutrients |
| Function | Dispersal and environmental survival | Proliferation and host dissemination |
| Reproduction | Asexual fragmentation (autolysis) | Endosporulation within spherules |
Predicting risk for Fresno, Kern, Kings, Madera, Merced, San Joaquin, Stanislaus, and Tulare requires integrating air quality, soil, and meteorological data from specific California monitoring networks. Our pipeline pulls from three sources, then runs a six-stage standardization process.
| Variable | Source | Lag Period | Significance | Role |
|---|---|---|---|---|
| PM10 | EPA AQS / CARB | 1–2 months | High | Airborne spore proxy (Erisk) |
| Soil Moisture | Open-Meteo / CIMIS | Current & 6–12 mo | Very High | Growth hydration (Gpot) + Aridity (Erisk) |
| Max Temperature | Open-Meteo / NOAA | 1–2 months | Moderate-High | Maturation trigger (both phases) |
| Precipitation | Open-Meteo / CDEC | 1.5–2 years | High (peaks) | Multi-year growth signal (Gpot) |
| Wind Speed | Open-Meteo / CIMIS | Current | Low-Moderate | Spore transport vector (Erisk) |
| Case Counts | CDPH (verified) | Target variable | — | Training target |
The standardization pipeline follows six stages: temporal alignment (daily granularity), imputation (forward-fill for soil, interpolation for wind/temp, zero-fill for precipitation), outlier clipping at physical boundaries, lag feature engineering encoding the grow-and-blow biology, Z-score normalization (preferred for the GRU gates because it preserves distributional shape), and sequence windowing (6-month sliding windows for the T-GCN).
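The imputation, clipping, lag, and normalization stages can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's pipeline code: the function names, the sample readings, and the 0.0–0.6 clipping bounds for soil moisture are all assumptions made for the example.

```python
from statistics import mean, pstdev


def forward_fill(values):
    """Carry the last observed value forward (used here for soil moisture)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first observation
    return filled


def interpolate(values):
    """Linearly fill gaps that lie between two observations (wind, temperature)."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            out[i] = out[a] + (out[b] - out[a]) * (i - a) / (b - a)
    return out


def zero_fill(values):
    """Treat a missing precipitation reading as no rain."""
    return [0.0 if v is None else v for v in values]


def clip(values, low, high):
    """Clamp readings to an assumed physically plausible range."""
    return [min(max(v, low), high) for v in values]


def lag(values, steps):
    """Shift so position t holds the value observed `steps` periods earlier."""
    if steps <= 0:
        return list(values)
    return ([None] * steps + list(values))[:len(values)]


def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


# Hypothetical readings with gaps marked as None.
soil = clip(forward_fill([0.30, None, 0.28, None, 0.10]), 0.0, 0.6)
temp = interpolate([10.0, None, None, 16.0])
rain = zero_fill([12.0, None, 0.0])
soil_lag2 = lag(soil, 2)
soil_z = zscore(soil)
```

Lagged series start with `None` placeholders, so rows from the first few periods would need to be dropped before training.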
Our refined algorithm splits risk into two biological phases. Growth Potential captures antecedent conditions that grow fungal biomass. Exposure Risk captures current conditions that aerosolize spores. They multiply — because growth without dispersal produces zero human cases.
Each weight represents that variable's relative importance in predicting Valley Fever incidence, derived from the adjusted Incidence Rate Ratios (aIRR) of Multivariable Negative Binomial Regression (MNBR) models reported in peer-reviewed epidemiological studies of coccidioidomycosis in California and Arizona. The first table below lists the Growth Potential (Gpot) weights and the second lists the Exposure Risk (Erisk) weights. Weights are scaled relative to one another within each phase; as listed, the Gpot weights sum to 0.85 and the Erisk weights to 0.65, so neither set is normalized to 1.
| Weight | Variable | What It Measures | Why This Weight |
|---|---|---|---|
| 0.35 | SMlag6mo | Average volumetric soil moisture (m³/m³) from 6 months prior. Measured at 0–7cm depth via Open-Meteo ERA5 reanalysis, calibrated against CIMIS ground stations. | MNBR studies assign the highest aIRR (1.8–2.0) to lagged soil moisture. Our Random Forest independently confirmed this: sm_lag6 ranked #1 at 22.3% importance. The fungus literally cannot grow without soil hydration, making this the dominant biological driver. |
| 0.20 | Tlag6mo | Maximum daily temperature (°C) from 6 months prior. Coccidioides grows optimally at 10–40°C at 20cm soil depth; hotter summers produce more fungal biomass. | Temperature has an aIRR of ~1.9 in fall season models. However, it's partially collinear with soil moisture (hot periods are usually dry periods), so its independent contribution is moderate. Weight reflects the ~20% relative importance from Table 3 of our research analysis. |
| 0.30 | Plag1.5yr | Total precipitation (mm) from 18 months prior. This captures the multi-year drought/deluge cycle — the "whiplash" effect where drought kills competing soil microbes, then rain lets Coccidioides proliferate unopposed. | LSTM models on Maricopa County data showed precipitation 1.5–2 years prior accounts for up to half the total variance in peak-season incidence. This is the variable that explains why 2023–2024 broke records: the 2021–2022 drought "sterilized" the soil, then 2022–2023 rains fed an unchallenged fungal bloom. |
| Weight | Variable | What It Measures | Why This Weight |
|---|---|---|---|
| 0.25 | PM101mo | Average concentration of particulate matter ≤10µm (µg/m³) from 1 month prior. Sourced from EPA AQS bulk data (parameter code 81102) across SJVAPCD's 38 valley monitoring sites. | Arthroconidia are 2–4µm — they are PM10. During wind events in arid regions, geologic dust comprises >90% of PM10. MNBR studies show a positive exposure-response relationship with aIRR of 1.5–1.7 for cumulative dust exposure 1–3 months before disease onset. |
| 0.15 | 1 − SMnow | Current soil aridity: the complement of today's soil moisture. When soil moisture is 0.05 m³/m³ (very dry), aridity = 0.95. When soil is saturated at 0.50 m³/m³, aridity = 0.50 under this formula; it would reach 0.00 only if soil moisture were first rescaled so that saturation equals 1. | Aerosolization requires dry topsoil. Capillary forces in moist soil physically prevent dust from becoming airborne. MNBR assigns aIRR of 1.2–1.4 for surface aridity. Weight is lower than PM10 because aridity is a necessary condition but PM10 is the actual measurement of airborne particles. |
| 0.05 | Wind | Current daily maximum wind speed (km/h) from Open-Meteo. Represents the transport vector that lofts spores from soil into breathable air. | Surprisingly, average wind speed has the lowest aIRR (0.4–0.6) of all variables. This is because monthly average wind doesn't capture what actually matters — acute gusts during dust storms/haboobs. Since we use monthly averages (not gust data), we assign the lowest weight. PM10 already captures the effect of wind events more directly. |
| 0.20 | Tmax | Maximum temperature (°C) from 1 month prior. High temperatures (>30°C) trigger the morphological shift from mycelia to arthroconidia via desiccation stress. | Fall-season MNBR models show Tmax with aIRR of ~1.9. Temperature accelerates arthroconidia maturation — the "blow" phase requires heat to fragment the hyphae. Combined with dry soil, this creates the late-summer/early-fall case peak seen in CDPH data. |
Why multiplicative? Risktotal = Gpot × Erisk because the biology demands it. If the fungus grew abundantly (high Gpot) but the soil is currently wet and there's no wind (Erisk ≈ 0), no spores reach human lungs — zero cases. Conversely, if conditions are perfect for dispersal (high Erisk) but there was no moisture to grow fungal biomass (Gpot ≈ 0), there are no spores to disperse. Only when both phases align do outbreaks occur. Addition would incorrectly predict risk when only one phase is active.
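The multiplicative structure can be made concrete with a small sketch. It assumes each phase score is a plain weighted sum of its table entries and that every input has already been scaled to the 0–1 range; neither detail is spelled out above, and the feature keys other than `sm_lag6` are hypothetical names chosen for the example.

```python
# Weights copied from the two tables above; key names are illustrative.
GROWTH_WEIGHTS = {"sm_lag6": 0.35, "t_lag6": 0.20, "p_lag18": 0.30}
EXPOSURE_WEIGHTS = {"pm10_lag1": 0.25, "aridity": 0.15, "wind": 0.05, "tmax_lag1": 0.20}


def weighted_sum(weights, features):
    """Assumed form of each phase score: sum of weight * scaled feature."""
    return sum(w * features[name] for name, w in weights.items())


def risk(features):
    """Return (G_pot, E_risk, G_pot * E_risk) for one county-month."""
    g_pot = weighted_sum(GROWTH_WEIGHTS, features)
    e_risk = weighted_sum(EXPOSURE_WEIGHTS, features)
    return g_pot, e_risk, g_pot * e_risk


# Strong antecedent growth, but no dispersal conditions right now.
grown_but_wet = {
    "sm_lag6": 0.9, "t_lag6": 0.7, "p_lag18": 0.8,
    "pm10_lag1": 0.0, "aridity": 0.0, "wind": 0.0, "tmax_lag1": 0.0,
}
g_pot, e_risk, total = risk(grown_but_wet)

# Every input at its maximum, to show the ceiling implied by the weights.
all_max = {name: 1.0 for name in {**GROWTH_WEIGHTS, **EXPOSURE_WEIGHTS}}
g_max, e_max, total_max = risk(all_max)
```

In the first case the product is zero even though growth potential is high, which is the behavior an additive score would miss. Under these assumptions the ceilings are 0.85 for growth and 0.65 for exposure, so the raw product tops out near 0.55.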
Validation: Our Random Forest independently learned feature importances that broadly mirror these literature-derived weights: sm_lag6 ranked #1 among individual features at 22.3%, the PM10 variables together accounted for 22.7%, and wind had the lowest individual importance. This convergence between literature-derived and data-driven importance suggests the formula captures real biological dynamics.
We train two complementary models. The Random Forest serves as a fast, interpretable baseline — it needs manually engineered lag features but tells us which variables matter most. The T-GCN is our advanced model — it combines a Graph Neural Network (spatial: how risk spreads between neighboring counties) with a GRU (temporal: how weather months ago drives cases today), learning both spatial and temporal patterns directly from raw data.
Random Forest: 200 trees, max depth 10. Uses 16 handcrafted features including 5 lag variables encoding grow-and-blow biology. Validated with Leave-One-County-Out CV (trains on 7 counties, predicts the 8th, which tests spatial generalization).
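The Leave-One-County-Out split can be sketched without any ML library. The rows below are placeholders with hypothetical field names, not project data; in practice scikit-learn's `LeaveOneGroupOut` produces the same kind of split when county is passed as the group label.

```python
COUNTIES = ["Fresno", "Kern", "Kings", "Madera",
            "Merced", "San Joaquin", "Stanislaus", "Tulare"]


def leave_one_county_out(rows):
    """Yield (held_out_county, train_rows, test_rows), one fold per county."""
    for held_out in sorted({r["county"] for r in rows}):
        train = [r for r in rows if r["county"] != held_out]
        test = [r for r in rows if r["county"] == held_out]
        yield held_out, train, test


# Placeholder dataset: two monthly rows per county (values are dummies).
rows = [{"county": c, "month": m, "sm_lag6": 0.0, "cases": 0}
        for c in COUNTIES for m in (1, 2)]

folds = list(leave_one_county_out(rows))
for held_out, train, test in folds:
    # A model would be fit on `train` and scored on `test` here.
    assert all(r["county"] != held_out for r in train)
```

Each fold trains on seven counties and evaluates on the one the model has never seen, so a good score cannot come from memorizing county-specific patterns.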
T-GCN: Combines GCN spatial aggregation with GRU temporal memory. Uses 5 raw features; the GRU learns lag relationships on its own from 6-month sliding windows. Counties are nodes in a graph; edges are real geographic adjacencies.
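A 6-month sliding window can be built as below. This is a sketch under one assumption not stated above: that each window of six consecutive months is paired with the month immediately following it as the prediction target. The integer series stands in for per-month feature vectors.

```python
def sliding_windows(series, window=6):
    """Pair each run of `window` consecutive months with the next month."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


# Stand-in for 12 months of observations (month indices 0..11).
months = list(range(12))
pairs = sliding_windows(months)
first_inputs, first_target = pairs[0]
```

Twelve months of data yield six training pairs; the first uses months 0–5 as input and month 6 as the target.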
The Random Forest independently learned that soil moisture 6 months ago (sm_lag6) is the #1 predictor, which is what the peer-reviewed literature predicts. This is empirical support for the grow-and-blow hypothesis from our own data.
The GCN uses this graph to let neighboring counties influence each other's predictions — so a dust storm originating in Kern affects Tulare and Kings' risk scores. 8 nodes, 11 edges based on real California county borders.
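One step of neighbor aggregation can be illustrated on a three-county subset of the graph. The sketch assumes the symmetric normalization commonly used in GCNs, D^(-1/2) (A + I) D^(-1/2); whether Sporisk's T-GCN uses exactly this form is an assumption, and only the two Kern edges named above are included.

```python
import math

NODES = ["Kern", "Kings", "Tulare"]
# Illustrative subset only, not the full 8-node graph.
EDGES = [("Kern", "Kings"), ("Kern", "Tulare")]


def normalized_adjacency(nodes, edges):
    """Build D^(-1/2) (A + I) D^(-1/2) as a nested list."""
    idx = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    # Start from the identity so each county keeps part of its own signal.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[idx[u]][idx[v]] = a[idx[v]][idx[u]] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(a_hat, signal):
    """One aggregation step: each node takes a weighted sum over neighbors."""
    return [sum(w * s for w, s in zip(row, signal)) for row in a_hat]


a_hat = normalized_adjacency(NODES, EDGES)
# A unit of dust-driven risk present only in Kern.
spread = propagate(a_hat, [1.0, 0.0, 0.0])
```

After one step, Kings and Tulare each carry a nonzero share of the signal that started in Kern, which is the spillover effect the graph is meant to encode.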
Select a county to view actual monthly environmental data from NOAA, EPA, and CDPH (2020–2026). Observe the temporal lag between precipitation spikes and risk index peaks.
To validate our risk model, we overlay actual reported Valley Fever cases from the California Department of Public Health (CDPH) and Kern County Public Health against our model's predicted risk scores. Agreement between the two series supports the claim that the model captures the "grow and blow" dynamics.
| Year | CDPH Cases (Kern) | CA Statewide | Avg Risk Score | Peak Gpot | Climate Context |
|---|---|---|---|---|---|
| 2020 | 2,954 | ~7,000 | 10.6 | 0.358 | Drought onset |
| 2021 | 3,045 | ~7,200 | 8.0 | 0.365 | Severe drought |
| 2022 | 2,407 | ~7,480 | 8.5 | 0.332 | Drought → rain begins late |
| 2023 | 3,152 | 9,054 | 9.3 | 0.516 | Extreme wet after drought |
| 2024 | 3,990 | ~12,500 | 12.0 | 0.611 | Record — continued wet + 1.5yr lag |
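The agreement between reported cases and model output can be quantified with a Pearson correlation over the five rows of the table above. The function is a plain textbook implementation; with only five annual points, the result is indicative rather than statistically robust.

```python
import math

# Values copied from the table above (Kern County, 2020-2024).
kern_cases = [2954, 3045, 2407, 3152, 3990]
avg_risk = [10.6, 8.0, 8.5, 9.3, 12.0]
peak_gpot = [0.358, 0.365, 0.332, 0.516, 0.611]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_risk = pearson(kern_cases, avg_risk)
r_gpot = pearson(kern_cases, peak_gpot)
print(f"cases vs avg risk score: r = {r_risk:.2f}")
print(f"cases vs peak Gpot:      r = {r_gpot:.2f}")
```

On these five points the coefficient comes out at roughly 0.76 for the average risk score and 0.89 for peak Gpot, so both track Kern case counts, with peak Gpot tracking more closely.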