In healthcare markets, it has been suggested that individuals should be compensated for the data they generate, but valuing individual data equitably is a challenge. Data Shapley is a principled framework for data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, Data Shapley quantifies the value of each training datum to the predictor's performance. The Data Shapley value uniquely satisfies several natural properties of equitable data valuation. Monte Carlo and gradient-based methods efficiently estimate Data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. With Data Shapley, each data source’s value can be interpreted as an indicator of its quality and of whether its presence helps or hurts the overall performance of the predictive model. It also provides a simple method for adapting a model trained on a given training set to perform well on a different test set. Extensive experiments across biomedical, image, and synthetic data demonstrate that Data Shapley is more powerful than the popular leave-one-out or leverage scores in providing insight into which data are more valuable for a given learning task: low Shapley value data effectively capture outliers and corruptions, while high Shapley value data indicate what type of new data to acquire to improve the predictor.
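For reference, the valuation described above is usually written in the following form. The symbols are assumptions introduced here, not taken from this listing: D is the training set of n points, V(S) is the performance score of the predictor trained on a subset S, and C is an arbitrary positive constant (choosing C = 1/n gives the classical Shapley value).

```latex
\phi_i \;=\; C \sum_{S \subseteq D \setminus \{i\}}
  \frac{V(S \cup \{i\}) - V(S)}{\binom{n-1}{|S|}}
```

Each term is the marginal gain in performance from adding point i to a subset S, averaged over subsets of every size. Computing the sum exactly requires training on exponentially many subsets, which is why approximation methods are needed in practice.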
Input variables : Training data, learning algorithm, performance score
Output variables : Shapley value of each training point
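The inputs and output listed above can be tied together in a minimal sketch of permutation-sampling Monte Carlo estimation. This is an illustration under stated assumptions, not the code from the linked repository: the toy 1-D dataset, the nearest-mean learner, and every function name are hypothetical, and the sketch omits truncation and the gradient-based variant.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, int]            # (feature, label) for a toy 1-D binary task
Predictor = Callable[[float], int]


def train_nearest_mean(train: Sequence[Point]) -> Predictor:
    """Hypothetical learning algorithm: predict the class with the closer mean."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    means = {c: sums[c] / counts[c] for c in (0, 1) if counts[c] > 0}
    if not means:
        return lambda x: 0           # no training data: constant prediction
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


def accuracy(predict: Predictor, test: Sequence[Point]) -> float:
    """Performance score: fraction of test points classified correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


def monte_carlo_shapley(
    train: Sequence[Point],
    test: Sequence[Point],
    learn: Callable[[Sequence[Point]], Predictor],
    score: Callable[[Predictor, Sequence[Point]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Average each point's marginal contribution over random permutations."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset: List[Point] = []
        prev = score(learn(subset), test)        # score with no training data
        for i in order:
            subset.append(train[i])
            cur = score(learn(subset), test)     # retrain with point i added
            values[i] += cur - prev              # marginal contribution of i
            prev = cur
    return [v / num_permutations for v in values]


# Toy data (assumed for illustration); the last training point is mislabeled.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (1.2, 1), (0.1, 1)]
test = [(0.1, 0), (0.3, 0), (0.9, 1), (1.1, 1)]
values = monte_carlo_shapley(train, test, train_nearest_mean, accuracy)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

Because the marginal contributions telescope within each permutation, the estimated values sum to the full-data score minus the empty-set score. The cost is one retraining per point per permutation, so this naive version suits only small models and datasets.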
Statistical | : | Somers' D | Accuracy | Precision and Recall | Confusion Matrix | F1 Score | ROC and AUC | Prevalence | Detection Rate | Balanced Accuracy | Cohen's Kappa | Concordance | Gini Coefficient | KS Statistic | Youden's J Index |
Infrastructure | : | Log Bytes | Logging/User/IAMPolicy | Logging/User/VPN | CPU Utilization | Memory Usage | Error Count | Prediction Count | Prediction Latencies | Private Endpoint Prediction Latencies | Private Endpoint Response Count |
Visit Model : github.com
Additional links : arxiv.org
Model Category | : | Public |
Date Published | : | June, 2019 |
Healthcare Domain | : | Life Sciences | Provider |
Code | : | github.com |
Data Privacy |
Synthetic Data Generation |