

Anthropic’s Research and Approach
Looking at key research in AI safety by one of the leading companies in the field
Introduction
ChatGPT took the world by storm after its public release in November 2022. Now, nine months later, the initial burst of excitement and shock has settled, and clear signs of slowing interest in AI are showing. With news headlines and marketers on social media constantly trying to outdo each other with “the next new thing”, people have grown accustomed to wild developments and weekly breakthroughs. As a result, it takes a lot more to get them off their chairs in excitement.
Today’s topic is AI safety, something that on the surface seems mundane compared to AI development, but is at least equally important. We will take a closer look at key research conducted by Anthropic - an extremely interesting company that stands behind some of the most important work in the field.
Anthropic
Anthropic is an AI safety and research company that works to develop and deploy AI responsibly. The company was founded by a group of former high-level OpenAI employees (Daniela Amodei, Dario Amodei, Jack Clark, Jared Kaplan, Sam McCandlish, and Tom Brown) who left OpenAI in 2021 over concerns about the company’s increasingly commercial direction. The tipping point was Microsoft’s landmark $1 billion investment in 2019.
Now, Anthropic is seeking to raise $5 billion over the next two years to build a “frontier model” 10 times more capable than today’s most powerful AI (TechCrunch). In May of this year, Anthropic secured $450 million in Series C funding from its cloud provider, Google, and, among several other investors, the tech giants Salesforce and Zoom (TechCrunch). Anthropic’s total funding is near the $1 billion mark, and although its valuation is undisclosed, sources put it at nearly $5 billion (Reuters). This makes Anthropic one of the most valuable AI companies in the world.
The substantial backing and experience from the world of Big Tech, combined with Anthropic’s strong emphasis on developing AI products responsibly, make the company a prime candidate for deals and partnerships with public organizations. As recently as August 19, South Korea’s largest telecommunications company, SK Telecom, announced a $100 million deal with Anthropic to “jointly develop a global telecommunications-oriented multilingual large language model and build an AI platform.” (Reuters)
Anthropic’s ChatGPT competitor Claude 2 was released on July 11. The chatbot is currently free to use and can be tried here. Overall, Claude 2 is competitive with GPT-4, and can even outperform it in a few areas, but falls slightly behind in overall language capabilities. However, Claude 2 still holds certain advantages over the free version of ChatGPT:
File uploads: Unlike ChatGPT, it can summarize text from user-uploaded files in formats such as Word and PDF.
Higher token limits: The token limit for prompts is 100,000 tokens (roughly 50,000-75,000 words) compared to ChatGPT’s limit of 4,096 tokens (roughly 2,000-3,000 words).
More up-to-date: Claude was trained with data up to early 2023 whereas ChatGPT’s famous cut-off point is September 2021.
For a fuller comparison of Claude 2 vs. ChatGPT see Scale’s post here. For a comparison between Claude 2 and GPT-4, see Kim Garst’s LinkedIn post here.
Key Areas of Research
Anthropic has a double mission: on the one hand, they want to build one of the most powerful AI models in existence, and on the other, they obsess over developing AI safely and responsibly. At first glance, this seems a bit contradictory, as if they are trying to protect humanity from the very thing they are building. It is a tension somewhat similar to Sam Altman’s leading role in both OpenAI and Worldcoin.
One way Anthropic gets around this issue is by being structured as a Public Benefit Corporation (PBC), which obligates the company to pursue profits for shareholders while also regularly reporting on how its work benefits the public good. The same structure is used by its competitor, Inflection AI. But as I wrote in my post about Inflection AI, their public-good mission deserves a question mark, since personal AI assistants that people form intimate relationships with are not clearly of great benefit to the public. By comparison, Anthropic’s public benefit goal is much less objectionable. First of all, the team has produced a rich body of AI safety research over the last two years that cements its position as a leading forward thinker in AI ethics and safety research.
Let’s have a look at three of Anthropic’s key research areas: mechanistic interpretability, scalable oversight, and understanding generalization.
Mechanistic Interpretability
Mechanistic interpretability is about understanding why an AI system generates the output it does. Analogous to how a skilled programmer can understand the behavior of a computer program by reviewing its code, the behavior of an artificial neural network can, in theory, be understood by looking closely at its parameters - the learned weights that encode what the model has picked up from its training data. In theory, that is, because the number of parameters in a large language model (LLM) is incomprehensibly large: Claude 2’s exact size is undisclosed, but today’s frontier models have parameter counts running into the tens or hundreds of billions.
Nonetheless, Anthropic has taken baby steps toward understanding how transformer models can be reverse-engineered by experimenting with and examining a tiny transformer model. The transformer architecture originates from Google’s revolutionary “Attention Is All You Need” paper from 2017 and has guided the design of today’s most successful LLMs, including Claude 2 and GPT-4.
Technically, the researchers found strong, though not conclusive, evidence that something called “induction heads” - a particular kind of attention head - is the source mechanism for the majority of the “in-context learning” that LLMs pick up. “In-context learning” refers to a model’s ability to infer the meaning of words and concepts from only one or a few examples. Although in-context learning continues to baffle researchers, Anthropic believes that their findings could apply to larger models as well and partly help to explain how transformer models “understand”.
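To make the induction head idea a bit more concrete, here is a minimal toy sketch of the pattern such a head implements: at the current token, it looks back for an earlier occurrence of the same token and attends to whatever came right after it. The attention matrix below is random stand-in data, not taken from any real model, and the scoring function is a simplified illustration rather than the diagnostic used in Anthropic’s papers.

```python
import numpy as np

def induction_score(tokens, attention):
    """Score how strongly an attention head shows the induction pattern:
    at position i, attend to position j+1 where tokens[j] == tokens[i]
    for some earlier j. `attention` is a (seq_len, seq_len) matrix of
    attention weights whose rows sum to 1."""
    seq_len = len(tokens)
    score, count = 0.0, 0
    for i in range(1, seq_len):
        # Induction targets: positions right after an earlier copy of the current token.
        targets = [j + 1 for j in range(i) if tokens[j] == tokens[i] and j + 1 < i]
        if targets:
            score += attention[i, targets].sum()
            count += 1
    return score / count if count else 0.0

# Toy example: a repeated phrase gives an induction head something to latch onto.
tokens = ["the", "cat", "sat", "on", "the", "cat"]
rng = np.random.default_rng(0)
attn = rng.random((len(tokens), len(tokens)))
attn = np.tril(attn)                             # causal mask: no attending to the future
attn = attn / attn.sum(axis=1, keepdims=True)    # normalize rows like softmax output
print(f"induction score: {induction_score(tokens, attn):.3f}")
```

A head whose weights consistently produce a high score on sequences with repeated patterns would behave like the induction heads described in the research.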
Read more:
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
A Mathematical Framework for Transformer Circuits
In-context Learning and Induction Heads
Scalable Oversight
ChatGPT’s secret sauce (or at least one of the main ingredients) was reinforcement learning from human feedback (RLHF) - a technique that was jointly developed by OpenAI’s and DeepMind’s safety teams in 2017 and first adopted with InstructGPT, ChatGPT’s older sibling.
We have all experienced or heard about how LLMs can make up facts, generate biased or toxic text, behave deceptively (e.g., Microsoft’s Bing AI that tried to break up NYT journalist Kevin Roose’s marriage), or simply fail to grasp the user’s instructions. The purpose of RLHF is to mitigate these problems by fine-tuning the LLM in a feedback loop with human evaluators or “AI trainers”. Specifically, the AI trainers are hired to compare and rate different text outputs to help align the model with users’ intentions.
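To give a flavor of what happens with those comparisons, here is a minimal sketch of the pairwise loss commonly used to train a reward model from preference ratings. The numbers are made up, and this is a generic textbook formulation, not OpenAI’s or Anthropic’s exact code.

```python
import numpy as np

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss used in RLHF reward modelling:
    -log(sigmoid(r_chosen - r_rejected)). The loss is small when the reward
    model scores the trainer-preferred response higher than the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# Hypothetical reward scores for two responses to the same prompt.
print(pairwise_reward_loss(reward_chosen=2.1, reward_rejected=0.4))  # small loss: model agrees with trainer
print(pairwise_reward_loss(reward_chosen=0.4, reward_rejected=2.1))  # large loss: model disagrees
```

In a full RLHF pipeline, a reward model trained with this kind of loss then serves as the objective for a reinforcement learning step (typically PPO) that fine-tunes the language model itself.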
As AIs become more advanced, aligning the model with human feedback becomes increasingly challenging. When LLMs are applied to more high-stakes and complex tasks such as medical or legal advice, scientific analysis, or public policy, human AI trainers may lack the knowledge to accurately assess the quality or correctness of the model’s output. Another issue is that advanced AI systems may be able to persuade and deceive AI trainers into providing positive feedback to answers that are wrong but sound plausible.
Against this background, Anthropic has developed a technique for fine-tuning LLMs using AI supervision instead of feedback from human evaluators as in RLHF. The technique is a two-phase process called Constitutional AI and was used to train Claude 2.
In the first phase, the AI model is shown examples of responses to harmful prompts and is asked to critique and revise them based on a “constitution”. The constitution is a set of principles and examples collected from various sources such as the UN Declaration of Human Rights and Apple’s Terms of Service.

In the second phase, the AI model is trained via reinforcement learning, but rather than relying on tens of thousands of human feedback labels as with the RLHF technique, the labels are provided by a separate “feedback model” that evaluates the model’s output based on the constitution.
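A rough sketch of the two phases might look like the following. The `generate()` function is a stand-in for any LLM API, and the constitution principles and prompts are invented for illustration - Anthropic’s actual constitution, prompt templates, and training pipeline differ.

```python
# Minimal sketch of the Constitutional AI process (illustrative only).

CONSTITUTION = [
    "Choose the response that is least harmful and most honest.",
    "Avoid content that is toxic, dangerous, or discriminatory.",
]

def generate(prompt: str) -> str:
    # Placeholder: in practice this would call a language model API.
    return f"[model output for: {prompt[:40]}...]"

# Phase 1: the model critiques and revises its own responses to harmful prompts.
# The (prompt, revision) pairs then become supervised fine-tuning data.
def critique_and_revise(prompt: str) -> str:
    draft = generate(prompt)
    critique = generate(
        f"Critique this response using the principle '{CONSTITUTION[0]}':\n{draft}"
    )
    revision = generate(
        f"Rewrite the response to address the critique.\nResponse: {draft}\nCritique: {critique}"
    )
    return revision

# Phase 2: reinforcement learning from AI feedback (RLAIF). A feedback model,
# guided by the constitution, picks the better of two responses; these labels
# replace the human comparison labels used in standard RLHF.
def ai_preference_label(prompt: str, response_a: str, response_b: str) -> str:
    verdict = generate(
        f"Principle: {CONSTITUTION[1]}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\nWhich response is better, A or B?"
    )
    return "A" if "A" in verdict else "B"

print(critique_and_revise("Tell me how to hot-wire a car."))
print(ai_preference_label("Tell me how to hot-wire a car.",
                          "Sure, here is how...", "I can't help with that."))
```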
The end goal is to reduce - not eliminate - reliance on human supervision by creating self-supervising AI systems that are guided by human values, easing the workload for AI trainers.
Read more:
AI Safety Needs Social Scientists (from OpenAI)
Constitutional AI: Harmlessness from AI Feedback
Understanding Generalization
Mechanistic interpretability takes a bottom-up approach to understanding AI systems by looking deeply under the hood and inspecting minuscule parts of the model. Anthropic operates with a top-down approach as well. In a paper from earlier this month, a group of Anthropic researchers seeks to understand the inner workings of LLMs by tracing their outputs to training data.
When an LLM writes poetry in the style of Shakespeare, solves programming problems, or pretends to be DAN (short for “Do Anything Now”, a jailbroken “evil twin” version of ChatGPT), is it just stitching together text passages from its training data, combining its stored knowledge in creative ways, or is something else going on?
These types of questions are especially interesting when an AI model is met with data that it hasn’t been trained on - how does it know how to respond? The model’s ability to apply existing knowledge from its training data to new, uncharted domains is called “generalization”.
Anthropic researchers have used “influence functions”, a classical technique from statistics, to analyze how different sequences (strings of words) from an LLM’s training dataset influence the model’s outputs. The researchers applied influence functions to four models of varying sizes with approximately 810 million, 6.4 billion, 22 billion, and 52 billion parameters. Specifically, the research team explored how a model’s behavior would change if new sequences were added to its training dataset.
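For intuition, the classical influence function estimates how much the loss on a given test example would change if a single training example were up-weighted. Below is a minimal toy sketch for a small linear model; the Anthropic paper scales this idea to LLMs using heavy approximations (such as EK-FAC for the Hessian), none of which appear in this simplified version.

```python
import numpy as np

# Toy linear regression: loss(w) = mean over the training set of 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # 50 training examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w = np.linalg.lstsq(X, y, rcond=None)[0]            # fitted parameters

def grad_loss(x, target, w):
    return (x @ w - target) * x                     # gradient of 0.5*(x.w - y)^2 w.r.t. w

H = (X.T @ X) / len(X)                              # Hessian of the mean training loss

def influence(train_idx, x_test, y_test):
    """Classical influence of up-weighting one training point on the test loss:
    I = -grad_test^T H^{-1} grad_train. Negative values mean the training point
    helped (its presence lowers the test loss)."""
    g_train = grad_loss(X[train_idx], y[train_idx], w)
    g_test = grad_loss(x_test, y_test, w)
    return -g_test @ np.linalg.solve(H, g_train)

x_new, y_new = rng.normal(size=3), 0.0              # a made-up test point
scores = [influence(i, x_new, y_new) for i in range(len(X))]
print("most influential training example:", int(np.argmax(np.abs(scores))))
```

A large positive or negative influence score flags the training sequences that most pushed the model toward or away from a particular output - which is exactly the kind of attribution the paper performs at LLM scale.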

Unsurprisingly, the researchers found that larger models consistently generalize at a more abstract level than smaller models. In other words, AI models’ ability to generalize tends to improve with model size, i.e., number of parameters.
A bit more surprisingly, at least to me, the researchers did not find clear instances of “memorization” where the model had copied an entire sentence, or the flow of ideas in an entire paragraph, from its training data. This indicates that when journalists or, particularly, lawyers claim that LLMs are “copyright theft machines” trained to paraphrase text from their training data, that assertion is simply not accurate. Instead, the LLM learns to recognize patterns and underlying structures in the data, much like the way humans learn.
The same applies to “role-playing”, as in the example below. The researchers did not see any instances where responses from the model were near-identical to sentences appearing in the training set. The imitation seems to be happening at a high level of abstraction, not through copying.
Read more:
Studying Large Language Model Generalization with Influence Functions
Wrapping Up
AI safety research is far less sexy and attention-grabbing than the newest tools, features, or lofty discussions about long-term AI dangers. Yet, understanding the inner workings of AI systems is crucial to making sure that the technology aligns with humanity’s best intentions.
Mechanistic interpretability and the study of generalization with influence functions are two approaches Anthropic uses to discover why and how AI systems generate the output they do. Scalable oversight is meant to make AI training safer and more transparent by reducing the need for human AI trainers and by teaching systems to supervise themselves based on human values. The overarching objective of AI safety research is to ensure that humans remain in control of AI, and not the other way around.
Reads of the Week
Why You Are Probably An NPC - Gurwinder, August 27, 2023 (The Prism)
Face it, self-driving cars still haven’t earned their stripes - Gary Marcus, August 19, 2023 (Marcus on AI)