TRIALS, TURBULATIONS, AND INFERENCES OF ML ON COMPLEX DATA: POTATO VARIETIES, SIMULATED CLOUDS, AND MORE

Presentation by Fabiana Ferracina (WSU Vancouver)
My Academic Journey
  • Magna cum laude B.S. from the University of Rochester. Awarded Highest Distinction in Mathematics. Elected to Phi Beta Kappa.
  • M.S. in Mathematics (Optimization focus) from the University of Washington.
  • Mathematics and Computer Science Lecturer at the University of Washington Bothell.
  • Earned CISER Mike Jacroux Book Award for Best Among PhD Students in Statistics at Washington State University.
My Academic Journey
  • Summer 2022 G-RIPS Sendai Internship: worked on transit problems with Toyota and Univ. of Tsukuba.
  • 2023-2024 internship at PNNL to work with Mahantesh Halappanavar and Laura Fierce.
  • Summer 2023 MRC on Complex Social Systems with Mason Porter $\rightarrow$ Facility-Location TDA project.
  • Invited as academic mentor for Summer 2024 G-RIPS Sendai (through NSF's institute IPAM).

What do potatoes, aerosol, and traffic have in common?

Data collection challenges

Modeling complexity

Need for efficient processing

Importance to society

My Contributions
  1. Potato Variety Trials: Data cleaning, processing, exploration. Selection/application of several classification models supplemented with data imputation, optimization of hyperparameters and feature selection. Public repository with complete end-to-end process.
  2. Simulating Aerosol Chemistry: Development of a novel graph network simulator that applies a physics approach to chemical composition space. Complete framework to prepare, train, predict, and analyze aerosol chemistry dynamics. Public repository to help advance the methodology.
  3. Congestion and Emissions: Data cleaning, processing, exploration. Case-study structure. Computation of queue simulations and matrix approximations. Validating and analyzing results. Integration of emissions estimation into novel queue simulation of a park-and-ride system. Development of model to estimate social cost of trip time and emissions. Introduction to transit policy optimization approach.

PREDICTIVE ANALYTICS OF VARIETIES OF RUSSET POTATOES

arxiv.org/abs/2404.03701 submitted to Crop Science

AUTHORS

Fabiana Ferracina, Bala Krishnamoorthy, Mahantesh Halappanavar, Shengwei Hu, Vidyasagar Sathuvalli

Introduction
  • Potato is a significant crop in the U.S. with high production in the Pacific Northwest.
  • OSU's Potato Breeding and Variety Development program focuses on developing russet varieties.
  • New potato varieties take 12-13 years to develop, undergoing rigorous trials, starting with 60,000 clones and gradually narrowing down.
High Yield, Good Quality, Resistant Crops Require Hard Work
Potato Breeding
Research Goal

Given data from trials performed in Oregon, can we use machine learning to predict which varieties should graduate to the next step in the process versus which varieties should be dropped?

Data
  • Oregon trials in Hermiston, Ontario, Klamath Falls, Corvallis (dropped in 2015)
  • Years: 2013-2021
  • Total 1086 clones with 40+ features
  • Variety grown for 3 years $\Rightarrow$ graduation to tri-state trials
Challenges: solutions
  • Missing data: Imputation
  • Class imbalance: Re-balance the weights
  • Numerical features at different scales: Standardize
  • Categorical features: One-hot encode
  • Data not linearly separable: Investigate non-linear models
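Three of the fixes above (class re-weighting, standardization, one-hot encoding) can be sketched in plain Python. This is a minimal illustration, not the code from the project repository: the helper names and toy values are assumptions, and the weighting rule shown is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter
from statistics import mean, pstdev

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def standardize(values):
    """Rescale a non-constant numeric feature to zero mean, unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Encode a categorical feature as 0/1 indicator columns (sorted categories)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical toy data: 1 = keep the clone, 0 = drop it.
labels = [0, 0, 0, 1]
weights = balanced_class_weights(labels)   # the rare class gets the larger weight
yields = standardize([40.0, 50.0, 60.0])
sites = one_hot(["Hermiston", "Ontario", "Hermiston"])
```

With these toy labels the minority class receives weight 2.0 and the majority class 2/3, so errors on the rare "keep" class count three times as much during training.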
Top Performing Models
Code for this project at github.com/fabstat/burbank

Multi-layer Perceptron (MLP): artificial neural network consisting of fully connected input, hidden, and output layers.

Nonlinear activation function (such as ReLU) to introduce nonlinearity into the model.

Top Performing Models

Histogram-based Gradient Boosting (HGB): Additive model where many weak learners (typically decision trees) are combined to form a strong predictor. Each new tree corrects errors made by the previous ones.

Top Performing Models

Support Vector Machine (SVM): RBF kernel maps the original feature space into a higher-dimensional space where a hyperplane can be found to separate the classes.
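The RBF kernel never builds the higher-dimensional space explicitly; it only scores how similar two points are. A minimal sketch follows, with the `gamma` values chosen purely for illustration:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])       # identical points
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], 0.1)   # squared distance is 25
```

Identical points score 1, and the score decays toward 0 as points move apart; `gamma` controls how quickly.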

Results
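We measure classification quality using the Matthews correlation coefficient (MCC), where $TP$, $TN$, $FP$, and $FN$ are the counts of true positives, true negatives, false positives, and false negatives: $$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$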
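As a quick check of the formula, here is a small sketch that computes MCC from confusion-matrix counts. The counts are hypothetical and not taken from the trial data; returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, for illustration only.
score = mcc(tp=20, tn=60, fp=10, fn=10)
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect), and it stays informative under class imbalance because all four counts enter the formula.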
MCC for the top classification models after forward feature selection on imputed dataset.
| Model | ΔCV MCC | Test MCC | CI Test MCC |
|---|---|---|---|
| MLPC | 0.476 | 0.608 | (0.581, 0.633) |
| HGBC | 0.531 | 0.574 | (0.546, 0.602) |
| SVM | 0.539 | 0.502 | (0.471, 0.533) |

After postprocessing, the imputed dataset had 885 observations and 40 features.

Confusion Matrices: Imputed Set
Figure panels: MLPC, HGBC, and SVC confusion matrices (imputed set).
MCC for the top classification models after forward feature selection on non-imputed dataset.
| Model | ΔCV MCC | Test MCC | CI Test MCC |
|---|---|---|---|
| MLPC | 0.595 | 0.623 | (0.584, 0.659) |
| HGBC | 0.511 | 0.645 | (0.608, 0.680) |
| SVM | 0.731 | 0.623 | (0.584, 0.659) |

After postprocessing, the non-imputed dataset had 404 observations and 40 features.

Confusion Matrices: Non-imputed Set
Figure panels: MLPC, HGBC, and SVC confusion matrices (non-imputed set).
Discussion
  • Trial data challenges: missing data, class imbalance, multi-dimensionality and data types.
  • Bias and Cost Efficiency: bias towards negative examples leading to potential false negatives $\Rightarrow$ less promising varieties are dropped earlier.
  • Machine Learning on Trials Data: machine learning $+$ traditional selection methods can enhance selection efficiency and performance.
Now, let's talk about aerosols!

LEARNING TO SIMULATE AEROSOL CHEMISTRY WITH GRAPH NEURAL NETWORKS

Will be submitted to the Aerosol Science and Technology Journal

AUTHORS

Fabiana Ferracina, Payton Beeler, Mahantesh Halappanavar, Bala Krishnamoorthy, Marco Minutoli, Laura Fierce

Introduction
Volcano Aerosol

Understanding the chemical composition of aerosols is crucial due to their impact on atmospheric processes and human health.

Simulating Aerosols
  • PartMC-MOSAIC model simulates aerosol processes by representing aerosols as individual particles undergoing interactions and chemical reactions, predicting size distribution and composition.
  • The complexity of aerosol chemistry and its high computational cost drive ongoing efforts to improve accuracy and efficiency.
Simulating Aerosols

In particular, MOSAIC solves a first-order ODE system in which the rate of change of a gas species' concentration is proportional to its concentration at that time. The proportionality constant depends on several fixed values and correction factors. This can be efficient for reduced models; however, it does not scale well in particle-based models such as PartMC.
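To make the form of such an equation concrete, the sketch below integrates a single rate equation $dc/dt = -k\,c$ with explicit Euler steps and compares the result with the closed-form solution. It is only an illustration of a first-order rate law: the decay sign, the constant `k`, the step size, and the integrator are assumptions, not MOSAIC's actual solver.

```python
import math

def euler_decay(c0, k, dt, n_steps):
    """Explicit Euler for dc/dt = -k * c; returns the concentration history."""
    history = [c0]
    for _ in range(n_steps):
        history.append(history[-1] * (1.0 - k * dt))
    return history

c0, k, dt, n_steps = 1.0, 0.5, 0.01, 200     # illustrative values only
numeric = euler_decay(c0, k, dt, n_steps)[-1]
exact = c0 * math.exp(-k * dt * n_steps)     # closed-form solution at t = 2
```

A particle-resolved model has to advance an equation of this kind for every species on every particle at every time step, which is where the cost accumulates.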

Research Goal

How can we study large scale aerosol particles' chemical composition in a fast and accurate way?

Graph Approach
GNS Idea 1 GNS Idea 2
Graph Approach
GNS Scheme
Graph Neural Network
GNN Encoder, GNN Processor, GNS Decoder
Training/Validating/Testing: 60/10/30

Multi-output mean squared error function to compute loss at training:

$$L^{\text{dim}}_{\text{MSE}} = \sum_{part.} \sum_{\text{chem.}} \frac{\left(\text{target dynamic} - \text{pred. dynamic}\right)^2}{\text{number of particles}}$$

Loss is minimized using Adam (Adaptive Moment Estimation) algorithm.
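The training loss above can be sketched in plain Python as follows. The nested-list layout (one inner list of per-species dynamics per particle) and the toy numbers are assumptions made for illustration; the actual framework presumably works on tensors.

```python
def multi_output_mse(target, pred):
    """Sum squared errors over particles and chemical species, then divide
    by the number of particles. Inputs are lists of per-particle lists."""
    n_particles = len(target)
    total = sum(
        (t - p) ** 2
        for t_row, p_row in zip(target, pred)
        for t, p in zip(t_row, p_row)
    )
    return total / n_particles

# Hypothetical dynamics for 2 particles and 2 chemical species.
target = [[1.0, 2.0], [0.0, 1.0]]
pred = [[1.0, 1.0], [1.0, 1.0]]
loss = multi_output_mse(target, pred)
```

Here two of the four entries are off by 1, so the summed squared error is 2 and the loss is 2 / 2 = 1.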

Training/Validating/Testing: 60/10/30

Prediction loss is measured both with an observation-wise MSE of particles' mass and concentration differences $\left(L^{\text{flat}}_{\text{MSE}}\right)$ and with a normalized mean absolute error (NMAE) computed for each chemical species:

$$\text{NMAE} = \sum_{part.} \frac{\sum\limits_{\text{tstep}} \left|\text{true mass} - \text{predicted mass}\right|}{\text{number of particles} \cdot \sum\limits_{\text{tstep}} \left| \text{true mass} \right|}$$

We computed the NMAE between the prediction and the ground-truth data (PartMC-MOSAIC simulated data) for all chemical species and obtained histograms of the particles' total mass and dry diameter.
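The NMAE formula for a single chemical species can be sketched as below. The input layout (one time series of masses per particle) and the toy values are assumptions for illustration only.

```python
def nmae(true_mass, pred_mass):
    """NMAE for one species; inputs are per-particle lists of per-timestep masses."""
    n_particles = len(true_mass)
    total = 0.0
    for true_series, pred_series in zip(true_mass, pred_mass):
        abs_err = sum(abs(t - p) for t, p in zip(true_series, pred_series))
        scale = sum(abs(t) for t in true_series)  # assumed nonzero
        total += abs_err / (n_particles * scale)
    return total

# Hypothetical masses for 2 particles over 2 timesteps.
true_mass = [[1.0, 1.0], [2.0, 2.0]]
pred_mass = [[1.0, 0.5], [2.0, 3.0]]
score = nmae(true_mass, pred_mass)
```

Each particle's absolute error is normalized by that particle's own total mass over time, so light and heavy particles contribute on the same relative scale; both toy particles contribute 0.125 here.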

Scenario 3 Example
  • Test set has 1146 particles and 144 timesteps
  • Simple system: H$_2$SO$_4$ condensation on particles containing H$_2$O, SO$_4$, BC, OC
  • Scenario 3's minimum loss occurs at 1500 training steps
  • $L^{\text{flat}}_{\text{MSE}} = 1.03 \times 10^{-6}$ versus baseline $MSE = 0.0585$, $R^2 = 0.999962$
A Scenario's Results: SO$_4$ Masses
SO4 Animation

$N = 1146$, NMAE $= 0.0126$

A Scenario's Results: H$_2$SO$_4$ Masses
H2SO4 Animation

$N = 1146$, NMAE $= 0.0123$

A Scenario's Results: Dry Mass and Dry Diameter

$N=1146$, # of training steps $= 1500$, $L^{\text{flat}}_{\text{MSE}} = 1.03 \times 10^{-6}$

Discussion
  • New GNS framework (inspired by GNS for physics) effectively models multi-dimensional chemical composition dynamics in aerosols using initial conditions.
  • It offers multi-dimensional time-changing features and multi-dimensional node properties.
  • Flexibility of using different activation functions in each MLP.
Discussion
  • Our GNS provides modules to seamlessly prepare input and analyze output in a scientifically valuable way.
  • Generalizable across different scenarios of the same simple system, maintaining robustness despite varying initial conditions.
  • We provide a modular framework, allowing for adjustable parameters and hyperparameters for optimized performance.
Furthermore
  • Accurate and efficient learning of chemical dynamics, applied to a simple sulfuric acid condensation system.
  • Training on a GPU takes approximately 4-5 seconds per 300 steps; prediction takes around 0.4 seconds.
  • Future work: global nodes for uniform properties, alternative distance functions for more complex systems.
Conclusion
  • Understanding aerosol chemistry is vital due to impacts on health, climate, and the environment.
  • For example: sulfate aerosols have a cooling effect on climate, while soot leads to warming effects.
  • We are affected by aerosols every day. Vehicles are a common source of particulate matter and soot.
  • Speaking of vehicles and their emissions...

REDUCING TOTAL TRIP TIME AND VEHICLE EMISSION THROUGH PARK-AND-RIDE – METHODS AND CASE-STUDY

Will be submitted to the Journal of Cleaner Production

AUTHORS

Ayane Nakamura, Fabiana Ferracina, Naoki Sakata, Amanda E. Hampton, Takahiro Noguchi, Hiroyasu Ando

Big Problems
Bus Delay, Traffic, Car Smog
Solution
Bus Schedule, Smart Bus Shelter, e-Palette, Traffic Laws
PnR Framework Schematic
PnR Schematic
Queue Model Schematics
Queue Schematic

Developed by Ayane Nakamura

Meet MEET
Methodologies for Estimating Emissions from Transportation

$$\begin{aligned} E_k = \sum\limits_{i}^{\text{vehicle type}} \Big[ &\,\text{number of type-}i\text{ vehicles} \times \text{avg. distance by type } i \\ &\times \sum\limits_{j}^{\text{road type}} \big( \text{prop. of distance on road } j \times \text{emission factor for pollutant } k\text{, vehicle } i\text{, on road } j \big) \Big] \end{aligned}$$
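The emissions sum above translates directly into a nested loop. In the sketch below the dictionary layout, the function name, and every number are made up for illustration; real emission factors come from the MEET methodology itself.

```python
def meet_emissions(fleet, emission_factor):
    """Total emissions of one pollutant.

    fleet: {vehicle_type: (count, avg_distance_km, {road_type: share})}
    emission_factor: {(vehicle_type, road_type): grams per km}
    """
    total = 0.0
    for vehicle, (count, avg_km, road_share) in fleet.items():
        # Distance-weighted emission factor over the road types driven.
        per_km = sum(share * emission_factor[(vehicle, road)]
                     for road, share in road_share.items())
        total += count * avg_km * per_km
    return total

# All numbers below are hypothetical.
fleet = {
    "car": (100, 10.0, {"urban": 0.5, "highway": 0.5}),
    "bus": (2, 20.0, {"urban": 1.0}),
}
factors = {("car", "urban"): 200.0, ("car", "highway"): 100.0,
           ("bus", "urban"): 1000.0}
co2_grams = meet_emissions(fleet, factors)
```

For this toy fleet, cars contribute 100 × 10 × 150 = 150,000 g and buses 2 × 20 × 1000 = 40,000 g.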

Data for Case Study
  • 2018 Person Trip Survey by the Tokyo Metropolitan Area Transportation Planning Council
  • 630,000 households sampled from 18 million, with a response rate of 26%
  • 693,083 trip observations from 382,667 individuals
  • $\approx$ 1% of the total movement within Tokyo metro area. Extrapolated to represent $\approx$ 74 million trips on an arbitrary workday.
Case Study of Tsukuba
Tsukuba
Case Study of Tsukuba
Tsukuba Science City
Case Study of Tsukuba
Case Study
Case Study of Tsukuba
  • Vehicle nominal speeds were set via a grid-search to maximize served customers, and bus inter-arrival times were also optimized within the stability constraints of the queuing model.
  • Parameters such as vehicle speed and bus inter-arrival times were input into MEET model to directly tie travel times with vehicle emissions.
Results
Trip time and emissions

Empirical expected total trip time and emissions are 0.3893 hours and 6,093,234 grams of pollutants, respectively.

The mean indicated on the axis is aggregated over PnR hubs, direction of travel, and four-hour time intervals.

Simulation vs. Approximation Model
Car Emissions
Simulation vs. Approximation Model
Bus Emissions
Simulation vs. Approximation Model
Total Trip Times
Simulation vs. Approximation Model
Travel and Wait Times
A New Social Cost

We introduce SCETT, the Social Cost of Emissions and Trip Time:

$$\begin{aligned} \text{SCETT} = {} & \text{social cost of CO}_2 \times \text{CO}_2 \text{ emissions of vehicles during time interval } T \\ & + \text{social cost of trip time} \times |T| \times \text{average total trip time per hour} \end{aligned}$$
  • SCETT is in international dollars per capita during a given time interval.
  • The social cost of CO$_2$ is in international dollars per gram of carbon, derived from climate-economy models.
  • The social cost of trip time is in international dollars per hour and uses a country's per capita productivity as an opportunity cost.
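A minimal sketch of the SCETT calculation is shown below. The parameter names and input values are placeholders rather than case-study numbers, and any per-capita normalization is assumed to be already folded into the inputs.

```python
def scett(co2_cost_per_gram, co2_grams, time_cost_per_hour,
          interval_hours, avg_trip_hours_per_hour):
    """SCETT = emissions cost + trip-time cost over one time interval T."""
    emissions_cost = co2_cost_per_gram * co2_grams
    time_cost = time_cost_per_hour * interval_hours * avg_trip_hours_per_hour
    return emissions_cost + time_cost

# Placeholder inputs, not values from the case study.
value = scett(co2_cost_per_gram=0.0001, co2_grams=50_000.0,
              time_cost_per_hour=20.0, interval_hours=4.0,
              avg_trip_hours_per_hour=0.4)
```

With these placeholders the emissions term is 5 and the time term is 32, which shows how the relative size of the two unit costs determines which term dominates.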
Transit Policy Optimization

$$\arg\min\limits_{(b,C)} \left(\sum\limits_{h=1}^n \sum\limits_{p_{car} \in P} \text{SCETT}(b,C) \right)$$

Here $b$ and $C$ represent the bus frequency interval and bus capacity, respectively; $h$ indexes PnR stations 1 through $n$; and $p_{car}$ is the proportion of customers using private cars. Note that in our current PnR system we only account for emissions of buses and private cars, and the length of the time interval is $|T| = 4$ hours.
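When the candidate policies form a small grid, the minimization can be done by exhaustive search. The sketch below assumes that approach; `toy_scett` is a made-up cost surface used only to exercise the search, and passing the hub and car share as explicit arguments is an assumption about how SCETT depends on them.

```python
from itertools import product

def best_policy(scett_fn, intervals, capacities, hubs, car_shares):
    """Exhaustive search for the (b, C) pair minimizing summed SCETT."""
    def total_cost(b, C):
        return sum(scett_fn(b, C, h, p) for h in hubs for p in car_shares)
    return min(product(intervals, capacities), key=lambda bc: total_cost(*bc))

def toy_scett(b, C, hub, p_car):
    """Hypothetical cost surface with its minimum at b = 0.5, C = 40."""
    return (b - 0.5) ** 2 + (C - 40) ** 2 / 1000 + hub * p_car

b_star, C_star = best_policy(toy_scett, [0.25, 0.5, 1.0], [20, 40, 60],
                             hubs=[1, 2], car_shares=[0.5, 0.9])
```

In practice each evaluation of the cost would involve running the queue and emissions models, so the grid has to stay coarse or the search must be replaced with something cheaper.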

Transit Policy in Tsukuba's PnR System

SCETT according to FUND in international dollars/capita per percent of car usage, for Hub 3 of Tsukuba's PnR system according to the 2018 Person Trip survey. Each point's color pertains to the number of buses ($[1/b]$), and the annotation above each point gives the bus capacity. Time is 5.2 times more valuable than emissions.

Transit Policy in Tsukuba's PnR System

SCETT according to RICE in international dollars/capita per percent of car usage, for Hub 3 of Tsukuba's PnR system according to the 2018 Person Trip survey. Each point's color pertains to the number of buses ($[1/b]$), and the annotation above each point gives the bus capacity. Time is 1.25 times more valuable than emissions.

Current SCETT per four-hour interval per hub, in international dollars. The values assume 90-95% car users and large buses deployed at medium frequency.

| Hub Number | Time+CO$_2$ Social Cost (FUND) |
|---|---|
| 1 | 1024.03 |
| 2 | 1053.50 |
| 3 | 1662.02 |
| 4 | 1011.24 |
| 5 | 553.23 |
SCETT Savings
Discussion
  • First study to comprehensively consider both time and environmental costs in evaluating the social cost of Park and Ride (PnR) systems.
  • Novel approach combining queueing and emissions models aimed at minimizing social costs under different transit policies.
  • Monte Carlo queue simulation computes waiting and traveling times of customers under various PnR scenarios.
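To give a flavor of what a Monte Carlo queue simulation computes, here is a deliberately simplified single-stop sketch. It is not the queue model used in the study: Poisson arrivals, a fixed bus timetable, first-come-first-served boarding, and all parameter values are assumptions.

```python
import random

def mean_wait(arrival_rate, bus_interval, capacity, horizon, seed=0):
    """Monte Carlo estimate of the mean waiting time (hours) at one bus stop.

    Customers arrive as a Poisson process; a bus departs every `bus_interval`
    hours and boards up to `capacity` waiting customers in arrival order.
    """
    rng = random.Random(seed)
    arrivals, t = [], rng.expovariate(arrival_rate)
    while t < horizon:
        arrivals.append(t)
        t += rng.expovariate(arrival_rate)

    waits, next_in_line = [], 0
    for n in range(1, int(horizon / bus_interval) + 1):
        departure = n * bus_interval
        boarded = 0
        while (next_in_line < len(arrivals) and boarded < capacity
               and arrivals[next_in_line] <= departure):
            waits.append(departure - arrivals[next_in_line])
            next_in_line += 1
            boarded += 1
    return sum(waits) / len(waits) if waits else 0.0

frequent = mean_wait(arrival_rate=30.0, bus_interval=0.25, capacity=40, horizon=200.0)
sparse = mean_wait(arrival_rate=30.0, bus_interval=1.0, capacity=40, horizon=200.0)
```

Comparing `frequent` and `sparse` shows the basic trade-off a transit policy has to price: more frequent buses cut waiting time but add bus trips, and therefore bus emissions.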
Discussion
  • Output speed and vehicle type from the queueing model used to estimate emissions.
  • Conversion of trip times and vehicle emissions into monetary cost per capita.
  • SCETT model helps transit policy makers reduce costs from trip time and emissions.
  • Smaller, more frequent buses are more socially beneficial when car usage is low, while larger, less frequent buses are better when car usage is high.
Discussion
  • SCETT values are higher for hubs farther from the city center, highlighting the importance of strategic PnR station placement.
  • Current high car usage rate leads to higher SCETT values even under optimal transit policies, suggesting the need for policies reducing car use.
  • Significant social cost savings are possible by implementing data-driven transit policies, with more savings achievable if private car use decreases.

Environmental and societal impacts of these fields are interconnected, with each influencing and being influenced by factors such as air quality, climate change, food security, and urban pollution.

Cross-disciplinary innovations can lead to more holistic approaches to environmental management and policy-making, with data-driven approaches and systems thinking required for meaningful breakthroughs.

Thesis Defense