Data Shapley: Equitable Valuation of Data for Machine Learning

Amirata Ghorbani | James Zou

Department of Electrical Engineering | Stanford University

In healthcare markets, it has been suggested that individuals should be compensated for the data that they generate, but equitable valuation of individual data is a challenge. Data Shapley is a principled framework for addressing data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, data Shapley can be used as a metric to quantify the value of each training datum to the predictor's performance. The data Shapley value uniquely satisfies several natural properties of equitable data valuation. Monte Carlo and gradient-based methods make it possible to estimate data Shapley values efficiently in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. Using data Shapley, each data source’s value can be interpreted as an indicator of its quality and of whether its presence helps or hurts the overall performance of the predictive model. It also provides a simple method for adapting training data to perform well on a different test data set. Extensive experiments across biomedical, image, and synthetic data demonstrate that data Shapley is more powerful than the popular leave-one-out or leverage score in providing insight into which data are more valuable for a given learning task: low Shapley value data effectively capture outliers and corruptions, while high Shapley value data inform what type of new data to acquire to improve the predictor.
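
For reference, the quantity described above can be written out as a formula. The notation here is chosen for illustration and is not taken from this page: D is the training set of n points, V(S) is the performance score of the predictor trained on a subset S, and C is an arbitrary positive constant (C = 1/n gives the classical Shapley value from cooperative game theory).

```latex
\phi_i \;=\; C \sum_{S \subseteq D \setminus \{i\}}
  \frac{V\bigl(S \cup \{i\}\bigr) - V(S)}{\binom{n-1}{|S|}}
```

Read this way, the value of point i is its marginal contribution V(S ∪ {i}) − V(S), averaged over subset sizes and over all subsets of each size. Because the sum ranges over every subset of the remaining n − 1 points, exact computation is exponential in n, which is why estimation methods are needed in practice.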

Input Variables : Training data, learning algorithm, performance score
Output Variables : Shapley value of each training point
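
The mapping from these inputs to this output can be illustrated with a minimal sketch. It uses plain permutation-sampling Monte Carlo, which is a simplified stand-in and not necessarily the authors' exact procedure. The names `monte_carlo_shapley`, `fit_1nn`, and `accuracy`, the toy one-dimensional data, and the choice of 0.5 as the score of a model trained on no data are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, int]           # (feature, label): toy stand-in for a training datum
Predictor = Callable[[float], int]  # maps a feature to a predicted label


def monte_carlo_shapley(
    train_data: Sequence[Point],
    fit: Callable[[Sequence[Point]], Predictor],  # the learning algorithm
    score: Callable[[Predictor], float],          # the performance score
    empty_score: float,                           # assumed score with no training data
    num_permutations: int = 500,
    seed: int = 0,
) -> List[float]:
    """Estimate each point's Shapley value by averaging its marginal
    contribution over random orderings of the training set."""
    rng = random.Random(seed)
    n = len(train_data)
    totals = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset: List[Point] = []
        previous = empty_score
        for i in order:
            # Add point i to the growing subset, retrain, and credit the change.
            subset.append(train_data[i])
            current = score(fit(subset))
            totals[i] += current - previous
            previous = current
    return [t / num_permutations for t in totals]


def fit_1nn(points: Sequence[Point]) -> Predictor:
    """Toy learning algorithm: 1-nearest-neighbour on a single feature."""
    pts = list(points)  # copy so later changes to the caller's list have no effect

    def predict(x: float) -> int:
        return min(pts, key=lambda p: abs(p[0] - x))[1]

    return predict


# Hypothetical data: the third training point is deliberately mislabeled.
train = [(0.1, 0), (0.9, 1), (0.15, 1)]
test = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)]


def accuracy(predict: Predictor) -> float:
    """Performance score: fraction of test points classified correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


values = monte_carlo_shapley(train, fit_1nn, accuracy,
                             empty_score=0.5, num_permutations=2000)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

In this toy setup the mislabeled point receives the lowest, negative estimate, which is consistent with the claim that low-value data tend to flag corruptions. The loop retrains the model once per point per permutation, so a real workload would need the cheaper approximations the abstract refers to.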

Metrics to Monitor

Statistical	:	Somers' D | Accuracy | Precision and Recall | Confusion Matrix | F1 Score | ROC and AUC | Prevalence | Detection Rate | Balanced Accuracy | Cohen's Kappa | Concordance | Gini Coefficient | KS Statistic | Youden's J Index
Infrastructure	:	Log Bytes | Logging/User/IAMPolicy | Logging/User/VPN | CPU Utilization | Memory Usage | Error Count | Prediction Count | Prediction Latencies | Private Endpoint Prediction Latencies | Private Endpoint Response Count

Visit Model : github.com

Additional links : arxiv.org

Model Category	:	Public
Date Published	:	June, 2019
Healthcare Domain	:	Life Sciences Provider
Code	:	github.com
