Tool automatically separates training and test data to improve AI evaluation

Schematic workflow of DataSAIL. Credit: Nature Communications (2025). DOI: 10.1038/s41467-025-58606-8

Bioinformaticians at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) have developed a new tool to better assess the performance of AI models.

“DataSAIL” automatically sorts training and test data so that they differ as much as possible from each other, which makes it possible to evaluate whether AI models still work reliably on data unlike what they were trained on. The researchers have now presented their approach in the journal Nature Communications.

Machine learning models are trained with huge amounts of data and must be tested before practical use. For this, the data must first be divided into a larger training set and a smaller test set: the model learns from the former, and the latter is used to check its reliability.

“Only if the data is split in such a way that the test data differ significantly from the training data can it be determined whether the model can later handle novel data, so-called out-of-distribution data, in practice,” explains Prof. Dr. David Blumenthal, bioinformatician at the Department of Artificial Intelligence in Biomedical Engineering (AIBE) at FAU.

AI models are often overestimated

Conventional algorithms are usually not capable of this optimized data splitting, which is why the performance of AI models is often overestimated. Together with researchers from HIPS, David Blumenthal has therefore developed a tool that prevents such misjudgments and sets new standards in an important area of machine learning. The tool, called DataSAIL, automatically splits datasets so that training and test data are as different as possible.
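To illustrate the idea of keeping similar data points on the same side of the split, here is a minimal Python sketch. It is not DataSAIL's algorithm or API: the similarity measure (`hamming_similarity`), the threshold of 0.75, the greedy cluster assignment, and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_similarity(items, threshold):
    """Single-linkage clustering: items at or above the threshold share a cluster."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if hamming_similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())


def similarity_aware_split(items, test_fraction=0.2, threshold=0.75):
    """Assign whole clusters to train or test so near-duplicates never straddle the split."""
    clusters = sorted(cluster_by_similarity(items, threshold), key=len)
    target = test_fraction * len(items)
    train, test = [], []
    for cluster in clusters:
        # Greedy heuristic: a cluster goes to the test set only if it still fits the target size.
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Hypothetical toy data: four families of near-identical sequences.
sequences = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "CCCCCCCC", "CCCCCCCG",
             "GGGGGGGG", "GGGGGGGA", "GGGGGGAA", "TTTTTTTT", "TTTTTTTC"]
train, test = similarity_aware_split(sequences)
print("train:", train)
print("test: ", test)
```

A purely random split of the same list would often place a sequence in the test set while a near-identical one sits in the training set, which is the kind of overlap that can make a model look better than it is.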

“DataSAIL is a free tool and can be used for all types of data, not just in biological research,” says Blumenthal. “Users only need to define a few parameters for their datasets, and DataSAIL does the rest automatically and consistently.”

Visualization of exemplary one-dimensional and two-dimensional datasets. Credit: Nature Communications (2025). DOI: 10.1038/s41467-025-58606-8

Tool also processes interaction data

DataSAIL is also the first tool that can be used for the automated splitting of interaction data. These multidimensional data play a role, for example, in drug research.

“Imagine you want to develop AI models that predict the interaction between drugs and target proteins,” says Blumenthal. “Then, when testing these models, you need to evaluate how well they work for altered drug molecules on one hand and for different proteins on the other.”
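The two-sided requirement Blumenthal describes can be sketched in Python as follows. This is an illustrative assumption about how such a split might look, not DataSAIL's implementation: the function `split_interactions`, the identifiers `d1` to `d3` and `p1` to `p3`, and the choice to drop mixed pairs are all hypothetical.

```python
def split_interactions(interactions, test_drugs, test_proteins):
    """Split (drug, protein, label) triples along both dimensions at once.

    A pair is used for testing only if both its drug and its protein are
    held out, and for training only if neither is. Mixed pairs are set
    aside so that no test drug or test protein appears during training.
    """
    train, test, dropped = [], [], []
    for drug, protein, label in interactions:
        drug_held_out = drug in test_drugs
        protein_held_out = protein in test_proteins
        if drug_held_out and protein_held_out:
            test.append((drug, protein, label))
        elif not drug_held_out and not protein_held_out:
            train.append((drug, protein, label))
        else:
            dropped.append((drug, protein, label))
    return train, test, dropped


# Hypothetical toy data: every combination of three drugs and three proteins.
drugs = ["d1", "d2", "d3"]
proteins = ["p1", "p2", "p3"]
interactions = [(d, p, 1) for d in drugs for p in proteins]

train, test, dropped = split_interactions(
    interactions, test_drugs={"d3"}, test_proteins={"p3"}
)
print(len(train), "training pairs,", len(test), "test pairs,", len(dropped), "set aside")
```

The example also shows the cost of splitting along two dimensions: a share of the interactions fits neither set, so deciding which drugs and proteins to hold out matters more than in the one-dimensional case.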

Additionally, the tool can take class features into account, for example by ensuring an even distribution of male and female subjects across training and test data. This prevents the test from producing less realistic results for one gender than for the other.
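Keeping class proportions equal across both sets is commonly called stratification. A minimal Python sketch, assuming a hypothetical list of subjects labeled by gender and a helper named `stratified_split` that is not part of DataSAIL:

```python
import random
from collections import Counter, defaultdict


def stratified_split(samples, get_class, test_fraction=0.2, seed=0):
    """Split each class separately so both sets keep the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[get_class(sample)].append(sample)

    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical toy data: 10 female and 10 male subjects.
subjects = [(f"s{i}", "female" if i % 2 == 0 else "male") for i in range(20)]
train, test = stratified_split(subjects, get_class=lambda s: s[1])
print("train:", Counter(gender for _, gender in train))
print("test: ", Counter(gender for _, gender in test))
```

This sketch balances classes only; combining that constraint with the requirement that training and test data differ as much as possible is the harder problem the article describes.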

The plan is to further develop the tool in the coming years to reduce the runtime of the algorithms and prepare data even more precisely for various practical scenarios.

More information:
Roman Joeres et al, Data splitting to avoid information leakage with DataSAIL, Nature Communications (2025). DOI: 10.1038/s41467-025-58606-8

Provided by
Friedrich–Alexander University Erlangen–Nurnberg


Citation:
Tool automatically separates training and test data to improve AI evaluation (2025, May 26), retrieved 26 May 2025




