
Sam Altman, CEO of OpenAI, attends the annual Allen and Co. Sun Valley Media and Technology Conference at the Sun Valley Resort in Sun Valley, Idaho, on July 8, 2025.
OpenAI took a significant step toward bolstering online safety Wednesday, unveiling two reasoning models that developers can use to classify and mitigate various forms of online harm across their platforms. The move comes as the company faces increasing scrutiny over the ethical implications of its rapidly scaling AI technology.
The models, named gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, are fine-tuned iterations of the gpt-oss models OpenAI released in August, with the numerical suffixes reflecting their respective parameter counts. As "open-weight" models, they offer transparency by making their parameters (the numerical values, learned during training, that determine a model's outputs and predictions) publicly available. While this approach facilitates greater scrutiny and customization, it differs from fully open-source models, where the entire source code is accessible for modification.
OpenAI emphasizes that organizations can tailor these new models to align with their specific policy requirements. The reasoning capabilities of the gpt-oss-safeguard models allow developers to gain deeper insights into the models’ decision-making processes, fostering greater trust and accountability in their applications. For example, a product review website could leverage these models to identify potentially fraudulent reviews, while a gaming forum could use them to flag discussions related to cheating. The ability to understand *why* a model flags certain content is crucial for nuanced content moderation and building more robust AI safety systems.
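OpenAI has not published detailed usage instructions in this announcement, but the general pattern it describes is that a developer supplies a moderation policy at inference time and the model classifies content against that policy while explaining its reasoning. The sketch below illustrates that pattern using the Hugging Face Transformers library; the model ID, prompt wording, and output handling are assumptions for illustration, not OpenAI's documented interface.

```python
# Minimal sketch: policy-based content classification with an open-weight
# safeguard model. Model ID and prompt format are assumptions.
from transformers import pipeline

classifier = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",  # assumed Hugging Face model ID
    torch_dtype="auto",
    device_map="auto",
)

# The developer-written policy is passed at inference time, so it can be
# changed without retraining the model.
policy = (
    "Policy: fraudulent product reviews.\n"
    "Flag reviews that appear fabricated, e.g. generic praise with no "
    "product details, undisclosed incentives, or copied text.\n"
    "Answer with VIOLATES or ALLOWED, followed by brief reasoning."
)

review = "Amazing!!! Best product ever, five stars, buy now from seller123!"

messages = [
    {"role": "system", "content": policy},
    {"role": "user", "content": f"Review to classify:\n{review}"},
]

output = classifier(messages, max_new_tokens=512)
# With chat-style input, the pipeline returns the conversation including
# the assistant's reply as the final message.
print(output[0]["generated_text"][-1]["content"])
```

Because the reasoning is returned alongside the label, a moderation team can audit why a given review was flagged rather than relying on an opaque score.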
The models were developed in collaboration with Robust Open Online Safety Tools (ROOST), an organization dedicated to advancing AI safety infrastructure, with Discord and SafetyKit also contributing to the testing phase. The models are initially available as a research preview, and OpenAI is actively soliciting feedback from researchers and the broader AI safety community to further refine their performance and reliability.
ROOST is also establishing a model community to foster collaboration among researchers and practitioners using AI models to safeguard online environments. This initiative is expected to accelerate the development and deployment of effective online safety tools, addressing concerns about the potential misuse of AI.
This announcement could be strategically timed to address critiques leveled against OpenAI, which has faced accusations of prioritizing commercialization and rapid scaling over ethical considerations and safety protocols. The company’s valuation has soared to $500 billion, fueled by the immense popularity of its consumer chatbot, ChatGPT, which boasts over 800 million weekly active users. However, this success has also amplified concerns regarding the potential for misuse and the need for robust safety measures.
Furthermore, OpenAI recently completed its recapitalization, solidifying a structure in which its non-profit parent holds a controlling stake in its for-profit arm. This move underscores the company’s commitment to its original mission of developing AI for the benefit of humanity, even as it pursues commercial opportunities. Founded in 2015 as a non-profit research lab, OpenAI has, since the launch of ChatGPT in late 2022, grown into the most valuable U.S. tech startup.
“As AI becomes more powerful, safety tools and fundamental safety research must evolve just as fast — and they must be accessible to everyone,” said ROOST President Camille François, highlighting the urgency and importance of collaborative efforts in ensuring the responsible development and deployment of AI.
Eligible users can download the model weights via Hugging Face, according to OpenAI. The company hopes these models will contribute to a more secure and trustworthy online environment, positioning itself as a leader in responsible AI development.
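For example, a minimal sketch of fetching the weights with the huggingface_hub client might look like this (the repository ID below is an assumption based on the announced model names):

```python
# Minimal sketch: download the open weights from Hugging Face.
# The repo_id is assumed from the announced model name.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="openai/gpt-oss-safeguard-20b")
print(f"Model weights saved to {local_dir}")
```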
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/11931.html