Abstract Scope |
The use of machine learning (ML) in the physical sciences has stimulated the discovery of exciting new phase-change materials, amorphous alloys, and catalysts. But even scientifically sound AI models are only as dependable as the labels and values upon which they are built. The continued success of these methods relies upon the availability of open data, metadata, and scientific code that are findable, accessible, interoperable, and reusable (F.A.I.R.). I will discuss our successes and failures in creating the first F.A.I.R. multi-institution combinatorial dataset and code repository. I will also discuss the tenuousness of ground truth, the need to openly preserve expert disagreement within scientific datasets, and the challenges associated with aggregating data from the open literature. These examples will drive home the difficulties in forming and capturing expert consensus, the impact of consensus variance on ML model evaluation, and the need to recreate important datasets that are born digital. |