Anthropic details its AI safety strategy

Anthropic has detailed its safety strategy to try and keep its popular AI model, Claude, helpful while avoiding perpetuating harms.

Central to this effort is Anthropic’s Safeguards team; who aren’t your average tech support group, they’re a mix of policy experts, data scientists, engineers, and threat analysts who know how bad actors think.

However, Anthropic’s approach to safety isn’t a single wall but more like a castle with multiple layers of defence. It all starts with creating the right rules and ends with hunting down new threats in the wild.

First up is the Usage Policy, which is basically the rulebook for how Claude should and shouldn’t be used. It gives clear guidance on big issues like election integrity and child safety, and also on using Claude responsibly in sensitive fields like finance or healthcare.

To shape these rules, the team uses a Unified Harm Framework. This helps them think through any potential negative impacts, from physical and psychological to economic and societal harm. It’s less of a formal grading system and more of a structured way to weigh the risks when making decisions. They also bring in outside experts for Policy Vulnerability Tests. These specialists in areas like terrorism and child safety try to “break” Claude with tough questions to see where the weaknesses are.

We saw this in action during the 2024 US elections. After working with the Institute for Strategic Dialogue, Anthropic realised Claude might give out old voting information. So, they added a banner that pointed users to TurboVote, a reliable source for up-to-date, non-partisan election info.

Teaching Claude right from wrong

The Anthropic Safeguards team works closely with the developers who train Claude to build safety from the start. This means deciding what kinds of things Claude should and shouldn’t do, and embedding those values into the model itself.

They also team up with specialists to get this right. For example, by partnering with ThroughLine, a crisis support leader, they’ve taught Claude how to handle sensitive conversations about mental health and self-harm with care, rather than just refusing to talk. This careful training is why Claude will turn down requests to help with illegal activities, write malicious code, or create scams.

Before any new version of Claude goes live, it’s put through its paces with three key types of evaluation.

Safety evaluations: These tests check if Claude sticks to the rules, even in tricky, long conversations.

Risk assessments: For really high-stakes areas like cyber threats or biological risks, the team does specialised testing, often with help from government and industry partners.

Bias evaluations: This is all about fairness. They check if Claude gives reliable and accurate answers for everyone, testing for political bias or skewed responses based on things like gender or race.

This intense testing helps the team see if the training has stuck and tells them if they need to build extra protections before launch.

Cycle of how the Anthropic Safeguards team approaches building effective AI safety protections throughout the lifecycle of its Claude models. — *(Credit: Anthropic)*

Anthropic’s never-sleeping AI safety strategy

Once Claude is out in the world, a mix of automated systems and human reviewers keep an eye out for trouble. The main tool here is a set of specialised Claude models called “classifiers” that are trained to spot specific policy violations in real-time as they happen.

If a classifier spots a problem, it can trigger different actions. It might steer Claude’s response away from generating something harmful, like spam. For repeat offenders, the team might issue warnings or even shut down the account.

The team also looks at the bigger picture. They use privacy-friendly tools to spot trends in how Claude is being used and employ techniques like hierarchical summarisation to spot large-scale misuse, such as coordinated influence campaigns. They are constantly hunting for new threats, digging through data, and monitoring forums where bad actors might hang out.

However, Anthropic says it knows that ensuring AI safety isn’t a job they can do alone. They’re actively working with researchers, policymakers, and the public to build the best safeguards possible.

(Lead image by Nick Fewings)

See also: Suvianna Grecu, AI for Change: Without rules, AI risks ‘trust crisis’

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Source link