A team of AI researchers at the University of California, Los Angeles, working with a colleague from Meta AI, has introduced d1, a framework that uses reinforcement learning to improve the reasoning of diffusion-based large language models. The group posted a paper describing their work and the features of the new framework on the arXiv preprint server.
Over the past couple of years, the use of LLMs has skyrocketed, with millions of people around the world using AI apps for a wide variety of tasks. This has driven a growing need for electricity to power the data centers that run these compute-intensive applications. Researchers have therefore been looking for other ways to provide AI services to the user community. One such approach involves diffusion-based LLMs (dLLMs), used either as a replacement for, or a complement to, conventional LLMs.
dLLMs are AI models that arrive at answers differently than standard LLMs. Instead of generating text autoregressively, one token after another, they use diffusion. Such models were originally used to generate images: they were trained by progressively adding noise to an image until nothing recognizable remained, then learning to reverse the process and recover the original image.
Applying this approach to text involves converting words, or pieces of words, into tokens, the analog of pixels. Masks play the role of noise: tokens are gradually replaced with mask tokens until nothing but masks remains, and the model is trained to reverse the process until only the original tokens are left. The advantage of this approach is that it can require far less computing power than autoregressive LLMs.
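In simplified, hypothetical Python, the forward (masking) and reverse (unmasking) passes of such a text diffusion model look roughly like this; the token names and the single-step "model" are illustrative placeholders, not the actual d1 code:

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t):
    """Forward 'noising' step for a masked diffusion LM: each token is
    independently replaced by a mask with probability t
    (t = 1.0 corresponds to a fully masked sequence)."""
    return [MASK if random.random() < t else tok for tok in tokens]

def reverse_step(tokens, predict_fn):
    """One reverse (denoising) step: the model fills in every masked
    position; real dLLMs keep only the most confident predictions and
    re-mask the rest for the next step."""
    return [predict_fn(tokens, i) if tok == MASK else tok
            for i, tok in enumerate(tokens)]

# Toy demonstration with a dummy "model" that always predicts "the".
sentence = "diffusion models denoise text in parallel".split()
noised = forward_mask(sentence, t=0.5)
denoised = reverse_step(noised, predict_fn=lambda toks, i: "the")
print(noised)
print(denoised)
```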

What has held back wider use of dLLMs is their inferior reasoning ability. That is where the team in California comes in. They have been working to add reinforcement learning, in which models learn through rewards, to a dLLM as a way to improve its reasoning.
To build d1, the team added a two-stage process. The first stage applies supervised fine-tuning to the model using high-quality reasoning data. The second applies reinforcement learning through an algorithm called diffu-GRPO, which adapts group-relative policy optimization to diffusion models using efficient estimates of token probabilities, together with what the team calls "random prompt masking."
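As a rough illustration rather than the authors' implementation, the two ingredients named above, group-relative reward scoring and random prompt masking, can be sketched in Python; the mask probability and helper names here are assumptions made for the example:

```python
import random

MASK = "[MASK]"

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled completion in a group is scored
    relative to the group's mean reward, normalized by its std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

def random_prompt_mask(prompt_tokens, mask_prob=0.15):
    """Random prompt masking (illustrative mask_prob): hide a random subset
    of prompt tokens so each gradient update sees a different view of the
    prompt, which acts as a form of regularization during RL training."""
    return [MASK if random.random() < mask_prob else tok for tok in prompt_tokens]

# Toy usage: a group of 4 sampled answers scored 1 (correct) or 0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
print(random_prompt_mask("What is 12 * 7 ?".split()))
```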
Testing of d1 has thus far shown the approach works; models trained with the framework scored higher on math and logical reasoning benchmarks than their base dLLM. The research team suggests the framework is ready for testing by other groups that may choose to adapt their AI models to incorporate the proposed changes.
More information:
Siyan Zhao et al, d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning, arXiv (2025). DOI: 10.48550/arxiv.2504.12216
© 2025 Science X Network