| Abstract Scope |
With the advent of large foundation models and declining compute costs, artificial intelligence has never been more accessible. Yet model performance increasingly hinges on the availability of curated, high-fidelity data. In the scientific community, the required experimental data frequently already exist but are difficult to exploit because of inconsistent terminology, fragmented reporting, and missing metadata. We investigate how large language models can transform this heterogeneous literature into a machine-learning-ready resource. Focusing on reproducibility and qualification, we provide a case study of the hierarchical dependencies that govern a usable dataset and show how they can be leveraged to construct one. Additionally, we highlight methods for consolidating diverse terminology, reconciling units, and, where practical, filling missing values. Applying this approach to cold spray additive manufacturing, we present the first and largest machine-readable dataset to facilitate AI-driven materials research. |
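
The three cleaning steps named in the abstract (consolidating terminology, reconciling units, and filling missing values) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the synonym table, the target units (kelvin and megapascal), and the use of a median fill are hypothetical and are not taken from the authors' pipeline. A static lookup table also stands in for the language-model step that would propose the term mappings.

```python
from statistics import median

# Hypothetical synonym table: reported wording -> one canonical field name.
# A static dict stands in here for mappings a language model might propose.
CANONICAL_TERMS = {
    "gas temp": "gas_temperature",
    "gas temperature": "gas_temperature",
    "process gas temperature": "gas_temperature",
    "gas pressure": "gas_pressure",
    "inlet pressure": "gas_pressure",
}

# Hypothetical converters to assumed target units (kelvin, megapascal).
TO_TARGET_UNIT = {
    ("gas_temperature", "C"): lambda v: v + 273.15,
    ("gas_temperature", "K"): lambda v: v,
    ("gas_pressure", "bar"): lambda v: v / 10,
    ("gas_pressure", "MPa"): lambda v: v,
}


def normalize(term, value, unit):
    """Map one reported (term, value, unit) triple to (canonical name, converted value)."""
    name = CANONICAL_TERMS[term.strip().lower()]
    return name, TO_TARGET_UNIT[(name, unit)](value)


def build_records(reports):
    """Turn each report (a list of reported triples) into one dict of canonical fields."""
    return [dict(normalize(*triple) for triple in report) for report in reports]


def fill_missing(records, field):
    """Fill a missing field with the median of the known values and flag the fill."""
    known = [r[field] for r in records if field in r]
    if not known:
        return records  # nothing to base a fill on, so leave the gap
    fill = median(known)
    for r in records:
        if field not in r:
            r[field] = fill
            r[field + "_imputed"] = True  # keep provenance of filled values
    return records


# Made-up example reports, not data from the paper.
reports = [
    [("Gas Temp", 400, "C"), ("Inlet pressure", 30, "bar")],
    [("process gas temperature", 773.15, "K")],
    [("gas temperature", 500, "C"), ("gas pressure", 4.0, "MPa")],
]
records = fill_missing(build_records(reports), "gas_pressure")
```

Flagging each filled value keeps imputed entries distinguishable from reported ones, which matters when reproducibility is the goal; whether a median fill is appropriate for any given field is a separate judgment the sketch does not address.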