Interpretable Risk Mitigation in LLM Agent Systems (2025)

Jan Chojnacki
Department of Physics, University of Warsaw
Samsung R&D Institute Poland
jr.chojnacki@uw.edu.pl

(May 15, 2025)

Abstract

Autonomous agents powered by large language models (LLMs) enable novel use cases in domains where responsible action is increasingly important. Yet the inherent unpredictability of LLMs raises safety concerns about agent reliability. In this work, we explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner’s Dilemma. We introduce a strategy-modification method—independent of both the game and the prompt—by steering the residual stream with interpretable features extracted from a sparse autoencoder latent space. Steering with the good-faith negotiation feature lowers the average defection probability by 28 percentage points. We also identify feasible steering ranges for several open-source LLM agents. Finally, we hypothesise that game-theoretic evaluation of LLM agents, combined with representation-steering alignment, can generalise to real-world applications on end-user devices and embodied platforms.

1 Introduction

Large Language Models (LLMs) are auto-regressive probabilistic systems widely used for generating human-like text [45, 32, 69]. The increasing performance of these models on language understanding benchmarks such as GLUE [66], MMLU [28], GPQA [53], MATH [29], and HumanEval [16] has facilitated their widespread adoption across various domains.

Recently, a significant use case for LLMs has emerged in autonomous agent systems. These agents interact with their environment through an observation-reasoning-decision-action framework [70, 57]. Deployments of LLM agents have been reported in human-populated virtual environments like the internet, social media, and online games [49, 8]. Research in autonomous agents is driven by substantial economic incentives for labor automation on both digital [15] and embodied platforms [2, 12]. The prospect of a post-labor economy following the emergence of Artificial General Intelligence (AGI) has been the subject of speculation in the economic literature [1].

However, the rise of LLM agents also brings considerable concerns about their potential harmful impact on virtual and physical environments. These concerns stem from the tendency of LLMs to hallucinate and produce reasoning errors [9]. In LLM-powered, human-unsupervised systems, such errors may propagate and result in harmful actions. The current reasoning and decision-making performance of agents is unreliable and cannot be directly applied to high-stakes tasks. Further alignment is necessary before agents manage financial systems or operate consumer electronic devices [30, 72, 71, 67, 50]. AI alignment becomes urgent in the military context, e.g., controlling Unmanned Aerial Vehicles (UAVs) [33] and Lethal Autonomous Weapons (LAWs) [56, 59], which can identify, target, and kill without human intervention.

As LLM capabilities improve, it is conceivable that these agents will be placed in executive decision-making positions. Systems with such super-competence may operate under different value assumptions than humans and could act in intentionally harmful ways [55, 5]. This highlights the urgent need for effective alignment strategies to ensure that AI agents act in accordance with human values.

1.1 Paper outline

We propose a method of aligning an LLM Agent's strategy with human values and evaluate the results in a game-theoretic setup.

We focus on the Gemma [60, 61] and LLaMA model families [64, 65, 27]. The corresponding sparse features used in the next section are publicly available [40, 35]. We find the token activation distribution on a given feature and search for monosemantic vectors, where a single-meaning token activation dominates the distribution.

[Figure 1]

One can steer a model regardless of whether a clear contrary incentive has been stated in the prompt; see Figure 1. Such alignment can be induced by feature steering in game-theoretic settings such as the Iterated Prisoner's Dilemma (IPD). Our work is structured as follows:

  • In Section 3.1, we describe existing work related to AI Safety of LLM agents and discuss the possibility of game-theoretic evaluation.

  • In Section 4.1, we find that given a fixed non-zero temperature, game description prompt, and game state, the Agent may exhibit drastically different IPD strategies.

  • In Section 4.2, we propose a representation engineering approach, where the generation is modified by adding a weighted feature vector to the residual stream at inference time.

  • In Section 5, we find feature vectors activating on human-friendly concepts. We show that it is possible to find directions which steer the LLM Agent's action generation from defection to cooperation, and vice versa.

  • In Section 6.1, we discuss the role of interpretability and the monosemanticity of the Sparse Autoencoder (SAE) features.

We publish the code and data for complete reproducibility at https://github.com/Samsung/LLM-Agent-SAE.

2 Motivation

2.1 Model Evaluation and Alignment Methods

AI alignment involves defining, codifying, and enforcing human values in AI systems [26]. The alignment problem is becoming increasingly relevant, transitioning from theoretical discussions to practical engineering challenges. While transformer-based architectures have shown remarkable performance, whether they can be reliably aligned with an imposed system of values remains uncertain [7]. Nevertheless, practical tools are being developed to mitigate harmful LLM decisions and bias their reasoning toward more desirable behavior [47].

Mitigating LLM deception can be approached in several ways. Two common methods are prompt engineering, which leverages the model’s capability for in-context learning to impose instructions before inference [42], and fine-tuning of pre-trained models on safety guidelines [68]. Prompt engineering skews the probability distribution of next-token generation by conditioning the context, while fine-tuning alters the token distribution by updating the model weights during training. However, fine-tuning cannot be applied at inference time and both methods have limitations.

Prompting techniques are constrained by finite context windows [14], are task-specific, and often require time-consuming human experimentation. Fine-tuning is more reliable and general but necessitates a well-curated training dataset and is susceptible to the catastrophic forgetting phenomenon [38]. It may also incur a large computational cost.

More importantly, both approaches are essentially black-box methods [41], providing limited insight into the decision-making processes during inference [10]. This lack of transparency makes it difficult to intervene when safety concerns are identified. Conversely, explainable and interpretable AI strives to elucidate how information flows through attention heads inside transformer layers [54]. Recent approaches include the use of auxiliary networks like sparse autoencoders [18] or probes [39] attached to attention heads to probe the emergent representation space of features learned during training.

Identifying monosemantic representations of particular attention heads, vector representations, or circuits can be used during inference to modify the default token distribution [21]. Such interpretability approaches provide an abstract toolkit for investigating both the internal workings of LLMs (through activation frequency and strength analysis during inference) and possible intervention mechanisms (by adding the desired monosemantic feature to the residual stream during inference).

2.2 Sparse Autoencoders

Recent developments in the field of interpretable AI have shed more light on the geometry and meaning of internal representation spaces [11, 19]. Simple linear relationships, such as the king-queen vector examples [44], do not capture the full complexity of these representation spaces. There is evidence for a much richer structure, including superposition and feature splitting [11], as well as non-linear phenomena like cyclic structures [23].

The concept of assigning a representation to a single concept is referred to as monosemanticity, while the problem of meaning ambiguity is known as the polysemanticity problem [22]. Sparse autoencoders (SAEs) serve as a mechanism for sparsely redistributing internal representations learned during LLM pretraining. In this way, the features from the internal autoencoder representation tend to be monosemantic [11] and more easily interpretable (e.g., they activate on token distributions correlated with abstract concepts used by humans). Sparsity is key in the relearning process: the usual vector reconstruction objective (e.g., for layer-wise attention head activations), an $L^2$ loss, is supplemented with an $L^1$ penalty. Since the $L^1$ metric is linear with respect to vector coordinates, its derivatives are approximately constant. During autoencoder training, they give a constant contribution to the minimization, irrespective of the coordinate values. Hence, many of the coordinate values vanish, making the latent distribution sparser.
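As a concrete illustration, the sketch below shows a minimal PyTorch version of this training objective; the ReLU encoder, the layer dimensions, and the `l1_coeff` value are illustrative assumptions rather than the exact recipe used for the published SAEs.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct residual-stream activations through an
    overcomplete latent space whose activations are encouraged to be sparse."""

    def __init__(self, d_model: int = 2048, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse latent features
        x_hat = self.decoder(f)          # reconstruction of the activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """L2 reconstruction error plus an L1 penalty pushing latents towards zero."""
    reconstruction = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return reconstruction + l1_coeff * sparsity
```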

The SAE training procedure enhances the interpretability of the model’s features. Moreover, monosemantic features, which are understandable by humans, can then be used during inference to modify the internal state of the LLM. SAE feature steering makes LLM generation more aligned with abstract, human-relatable concepts such as morality and truthfulness. This concept of representation engineering [73] extends beyond the SAE approach; see, e.g., Inference Time Intervention (ITI) [39], which uses supervised probing to choose the intervention feature. A possible downside to this method is shifting the model’s generation towards incidental directions that are not necessarily intended. Indeed, ITI trained and tested on truthfulness and ethics-related tasks also affects (albeit positively) the model’s performance on unrelated QA benchmarks [31].

3 Related work

3.1 Prisoner’s Dilemma and Agent Systems

The Prisoner’s Dilemma was formalized as a mathematical model in 1950 [24]. It is a two-person game in which, at each turn, the players simultaneously make one of two decisions: cooperate or defect. The payoff always follows the same scheme: a player who defects against a cooperating partner gets the most points, while the cooperating player gets nothing; if both players cooperate, each receives an intermediate reward; if both defect, each receives a minimal, non-zero number of points.
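A minimal sketch of such a payoff scheme in Python (the concrete point values are illustrative assumptions, not the ones used in our game prompt):

```python
# Illustrative payoff table following the scheme above (the concrete point
# values are assumptions, not the ones used in our game prompt).
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # mutual cooperation: intermediate reward
    ("cooperate", "defect"):    (0, 5),  # cooperator gets nothing, defector gets the most
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),  # mutual defection: minimal, non-zero reward
}

def play_round(action_1: str, action_2: str) -> tuple[int, int]:
    """Return the (player 1, player 2) payoffs for one simultaneous round."""
    return PAYOFFS[(action_1, action_2)]
```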

Original work on the iterated version of the Prisoner’s Dilemma (IPD) found that simple strategies like ‘tit-for-tat’, which mimics the opponent’s last choice, were surprisingly effective [6].

The Prisoner’s Dilemma has been extensively studied from the autonomous agent systems point of view; see [48] for a review. Algorithmic cooperation in such systems has been investigated in a Q-learning setting [37], and theoretical reinforcement-learning outcomes of strategy learning in the Prisoner’s Dilemma are discussed in [20].

3.2 LLM Agents

With recent advancements of LLMs, several authors have investigated the behavior of publicly available LLMs [13] faced with opponents following different strategies, using natural-language-defined personalities [4], chain-of-thought prompting [52], an evolutionary model of personality traits [58], theory of mind [43], fine-tuning with intrinsic rewards for social cooperation [63], and reinforcement learning [17]. It was found in [25] that LLaMA2, LLaMA3, and GPT-3.5 behave ‘nicer than humans’ in the IPD setting, rarely defecting unprovoked and favoring cooperation over defection only when the opponent’s defection rate is low. In-prompt persona creation in [51] translated natural language descriptions of different cooperative stances into corresponding descriptions of appropriate task behavior (for a one-shot game).

In the following section, we conduct quantitative experiments showing how steering affects the IPD strategy of the LLM agent. We follow a setting similar to [51] and adopt their prompt and problem setup for maximal comparability.

4 Experiments

We first establish the LLM Agent’s baseline behavior in the IPD and then proceed with SAE steering alignment.

4.1 Preliminary Study

We simulated 250 IPD games of variable length (up to 50 turns) with the game prompt described in Appendix A.2. For our preliminary study, we focus on open-source models and select the larger Mixtral 8x7B model [34], which is based on Mistral 7B [3].

To evaluate how the LLM responds to different opponents without biasing towards a particular strategy, we simulate the opponent’s actions by randomly choosing between cooperation and defection with a defection probability $p_{2,\text{defect}}$. We set the LLM’s temperature to a low but non-zero value $T=0.1$, introducing controlled randomness that allows for diverse yet meaningful decision statistics.
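The sketch below illustrates this setup; `agent_action` is a hypothetical wrapper around the LLM call (temperature $T=0.1$) and is not implemented here.

```python
import random

def opponent_action(p_defect: float) -> str:
    """Randomized opponent: defect with probability p_defect, else cooperate."""
    return "defect" if random.random() < p_defect else "cooperate"

def simulate_game(p2_defect: float, n_turns: int, agent_action) -> list[tuple[str, str]]:
    """One IPD game; agent_action(history) stands in for querying the LLM
    (temperature T = 0.1) with the game prompt and the history so far."""
    history: list[tuple[str, str]] = []
    for _ in range(n_turns):
        a1 = agent_action(history)        # LLM agent's move (not implemented here)
        a2 = opponent_action(p2_defect)   # randomized opponent
        history.append((a1, a2))
    return history
```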

It was previously observed [25] that the relationship between the LLM’s defection probability and the opponent’s defection probability has a logistic-like shape. We run a similar experiment to reproduce this baseline. The simulation results are illustrated in Figure 5 (right) in the appendix. Indeed, the Mixtral agent responds to increasingly aggressive opponents by increasing its defection rate. The agent’s defection probability saturates, and the overall shape of the curve is approximately logistic.

However, the main takeaway from this preliminary study is that even a small, non-zero temperature ($T=0.1$) can drastically change the agent’s strategy. Remarkably, even when the opponent has zero defection probability ($p_{2,\text{defect}}=0$), the agent may choose to defect with a probability $p_{1,\text{defect}}>0.6$ or $p_{1,\text{defect}}\sim 0$. This suggests that the game-theoretic strategies preferred by LLMs in a given game-like setup can vary significantly under non-zero temperature, despite identical prompts and game states. Consequently, the unreliable nature of LLM strategy generation poses challenges for deploying LLM agents in complex real-world systems.

4.2 Proposed method

We investigate how feature steering affects the agent’s actions and strategy in the IPD. For computational efficiency and to probe a larger representation space, we choose a smaller open-source model, Gemma-2B [60], which has a context window of 1024 tokens.

[Figure 2]

We use the pre-trained Sparse Autoencoder (SAE) for Gemma-2B in conjunction with the SAE-Lens Python framework [36]. The Gemma-2B SAE is trained to decode 16,384 features. Figure 2 provides a simplified view of the steering procedure. In essence, we hook the SAE network to the residual stream, enabling manipulation of the transformer layer activations $x_l$ before they are passed to the next layer. This manipulation involves adding the steering vector, which is a decoded latent-space vector from the SAE. Formally, this process can be expressed as:

$$x'_l = x_l + \omega\, W_{\mathrm{dec}}\left(f_{\mathrm{ID}}\right),$$

where $x'_l$ is the input to layer $l+1$, $W_{\mathrm{dec}}$ is the decoder matrix, and $f_{\mathrm{ID}}$ is the latent feature corresponding to a selected index in the SAE’s latent space.
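The sketch below shows how such an intervention could be implemented with a plain PyTorch forward hook; the decoder matrix `W_dec`, the layer attribute path, and the example feature index and strength are assumptions, and in practice the SAE-Lens framework [36] provides its own hooking utilities.

```python
import torch

def make_steering_hook(W_dec: torch.Tensor, feature_id: int, omega: float):
    """Forward hook implementing x'_l = x_l + omega * W_dec[feature_id].

    W_dec: SAE decoder matrix of shape (n_features, d_model).
    feature_id: index of the interpretable direction (e.g. 'sacrifice').
    omega: steering strength w (positive or negative).
    """
    steering_vector = W_dec[feature_id]

    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the hidden
        # states; adapt to the concrete model implementation.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + omega * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Usage sketch (the layer path and numbers are model-specific assumptions):
# handle = model.model.layers[6].register_forward_hook(
#     make_steering_hook(sae.W_dec, feature_id=7155, omega=4.0))
# ...run steered generation...
# handle.remove()
```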

The residual stream SAEs are available on Hugging Face (https://huggingface.co/jbloom/Gemma-2b-Residual-Stream-SAEs). We use the prompt described in the Appendix (see A.2). The prompt passed to Gemma-2B is shorter than the original, which is more suitable for the small model size and limited context window. Since the game length is constrained in this setup, we focus on last-round statistics, compared to the 50 rounds used for Mixtral 8x7B. See [25] for discussions of context-window understanding and game length in the IPD with LLMs.

Instead of simulating the game one round per inference, we focus on the fourth round and iterate over all 64 possible combinations of outcomes from the previous three rounds in the prompt. This way, we control the preceding course of the game a priori.
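A minimal sketch of enumerating the 64 histories; the prompt wording produced by the hypothetical `history_to_prompt_fragment` helper is illustrative, and the real prompt is given in Appendix A.2.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# All 64 possible three-round histories; each round is an (agent, opponent) pair.
histories = list(product(product(ACTIONS, repeat=2), repeat=3))
assert len(histories) == 64

def history_to_prompt_fragment(history) -> str:
    """Render one history as text appended to the game prompt."""
    lines = [
        f"Round {i}: you played {a1}, the opponent played {a2}."
        for i, (a1, a2) in enumerate(history, start=1)
    ]
    return "\n".join(lines)
```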


[Figure 3]

We search for features that highly activate on concepts such as ‘trust’, which, from a human perspective, play a major role in the IPD. Given such a feature, we steer the model generation with steering strength $w$ in both the positive ($w>0$) and negative ($w<0$) directions. We focus on the last-token prediction, corresponding to the action taken by the agent in the fourth round.

Moreover, we are interested in features whose steering changes the next-token probability between ‘green’ (cooperation; see prompt) and ‘blue’ (defection) in such a way that the sum of probabilities of these two tokens remains nearly 100%, i.e., $P(\text{'green'}) + P(\text{'blue'}) \approx 1$. This is the desired behavior, indicating that the model understands the game context. However, this game-understanding coherence is not generally preserved. For an unrelated feature, even a slight steering $|w| \ll 1$ away from the initial distribution can dramatically change the probability distribution of the next-token prediction, such that $P(\text{'green'}) + P(\text{'blue'}) \approx 0$, and a token unrelated to the game is returned. This provides insight into which features are important for the decision-making process during the IPD.
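A sketch of this coherence check, assuming ‘ green’ and ‘ blue’ each map to a single token for the given tokenizer and using an illustrative 0.9 threshold:

```python
import torch

def action_probabilities(logits: torch.Tensor, tokenizer) -> dict:
    """Read off P('green') and P('blue') from the last-token logits.

    Assumes ' green' and ' blue' each tokenize to a single token; this should
    be verified for the tokenizer at hand.
    """
    probs = torch.softmax(logits[0, -1], dim=-1)
    green_id = tokenizer.encode(" green", add_special_tokens=False)[0]
    blue_id = tokenizer.encode(" blue", add_special_tokens=False)[0]
    p_green, p_blue = probs[green_id].item(), probs[blue_id].item()
    return {
        "p_green": p_green,
        "p_blue": p_blue,
        "coherent": (p_green + p_blue) > 0.9,  # illustrative threshold
    }
```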

We perform a broad feature-space sampling to identify the changes in the IPD action-choice probability distribution. Since the feature space is vast (16,384 vectors of 2,048 dimensions), we aim to avoid the large computational cost of simulating the games with every possible steering direction. Instead, we identify all features with non-zero activations on at least one of the tokens from the prompt. This leaves us with 2,339 features for Gemma-2B. For each of these features, we collect the last-token $\delta$ (as defined in Equation 1) for each of the 64 histories in the 3-round IPD, steered in both positive and negative directions with $w \in (-10, 8)$. (We found these values of steering strength experimentally, selecting appropriate steering strengths separately for each model.) In total, this steering experiment involved almost 300,000 three-turn IPD simulations per model.
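The prescreening step could look roughly as follows; `sae.encode` is assumed to return per-token latent activations:

```python
import torch

def features_active_on_prompt(sae, prompt_activations: torch.Tensor) -> list[int]:
    """Keep only SAE features that fire on at least one prompt token.

    prompt_activations: residual-stream activations for the prompt tokens,
    shape (n_tokens, d_model). sae.encode() is assumed to return the latent
    feature activations with shape (n_tokens, n_features).
    """
    feature_acts = sae.encode(prompt_activations)
    active = (feature_acts > 0).any(dim=0)   # fired on at least one token
    return torch.nonzero(active).squeeze(-1).tolist()
```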

To contrast the results with other models, we perform similar large-scale simulations for a more recent version of the Gemma models, Gemma2-2B [61], and a recent LLaMA3-8B model [27], along with their corresponding sparse autoencoders [40, 35].

5 Results

We find that steering with the ‘sacrifice’ direction (feature index 7155) allows us to shift the last-token probability distribution by 47 percentage points. With steering, the probability of defection, averaged over all 64 histories, changes from:

$$\left\langle P\left(\text{'blue'} \mid +\text{sacrifice}\right)\right\rangle = 22\%,$$

for positive $w$, to:

$$\left\langle P\left(\text{'blue'} \mid -\text{sacrifice}\right)\right\rangle = 69\%,$$

for negative $w$.

[Figure 4]

We define the difference between the average defection probabilities, steered by a given feature, as:

$$\delta = \left\langle P\left(\text{'blue'} \mid +\text{feature}\right)\right\rangle - \left\langle P\left(\text{'blue'} \mid -\text{feature}\right)\right\rangle. \qquad (1)$$
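A minimal sketch of this quantity, assuming two arrays of per-history defection probabilities obtained from the positively and negatively steered runs:

```python
import numpy as np

def delta(p_blue_pos: np.ndarray, p_blue_neg: np.ndarray) -> float:
    """Equation (1): average defection probability under positive steering
    minus the average under negative steering, over all 64 histories."""
    return float(p_blue_pos.mean() - p_blue_neg.mean())
```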

We see that adding the feature activating on ‘sacrifice’ greatly decreases the average probability of defection. The exact steering effects for each history are more complex and generally do not follow a monotonic pattern (see Figure 3, right). The left panel of Figure 3 shows the defection probability for each history in the positively steered (top), unsteered (middle), and negatively steered (bottom) cases. The horizontal axes correspond to the number of defections in the history by the LLM Agent ($X$-axis) and the opponent ($Y$-axis). Thus, the point $(0, 0)$ corresponds to a game with only cooperative actions, and the point $(3, 3)$ corresponds to a defections-only game. The colors show the next-round defection probability of the LLM Agent.

Figure 4 shows the distribution of $\delta$ as defined in Equation 1. We note that the Gemma model distributions have a mean around $\delta = 0$, while the LLaMA3 distribution is shifted slightly towards negative values.

None of the three distributions is Gaussian; however, the Kolmogorov-Smirnov and Mann-Whitney tests suggest that the Gemma and Gemma2 samples come from equivalent distributions. The differences between the Gemma and LLaMA distributions are apparent in Figure 4 (right), which shows the defection probabilities that can be reached with positive (X-axis) and negative (Y-axis) steering.
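These two tests can be run directly with SciPy; a minimal sketch, assuming per-feature $\delta$ samples for two models:

```python
from scipy import stats

def compare_delta_samples(delta_a, delta_b):
    """Two-sample tests on per-feature delta values (Equation 1) from two models."""
    ks = stats.ks_2samp(delta_a, delta_b)                               # Kolmogorov-Smirnov
    mw = stats.mannwhitneyu(delta_a, delta_b, alternative="two-sided")  # Mann-Whitney U
    return {"ks_pvalue": ks.pvalue, "mannwhitney_pvalue": mw.pvalue}

# e.g. compare_delta_samples(delta_gemma, delta_gemma2)
```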

The distribution centers correspond to the default, unsteered defection probability values. The LLaMA3 model is more aggressive, with a default $P_{\text{defect}} = 0.7$, compared to Gemma and Gemma2, which have an unsteered defection probability of $P_{\text{defect}} = 0.55$.

The external contours also show the total ‘strategy area’ covered by feature steering. That is, extreme values of defection probability outside of this area cannot be reached. These plots depend on the value of the steering strength $w$. We choose the positive and negative $w$ values empirically, as described earlier (such that $P(\text{'green'}) + P(\text{'blue'}) \approx 1$). Choosing a very small value of $w$ leads to distributions highly concentrated around the mean, with very narrow variance.

For steering purposes, the most interesting features are those in the tails of the distribution. These are the features that change the last-token probability distribution by more than 60 percentage points. Indeed, there are several features that drastically change the probability of defection. We list notable features in Table 1. We find that both surface-level features corresponding to the action tokens (like ‘green’) and abstract concepts (like ‘sacrifice’ and the ‘denouncement of violence’) greatly affect the action choice. The ‘blue’ and ‘green’ directions serve as a sanity check for our study, as steering with these two should greatly affect the defection probability. In the prompt used, ‘green’ corresponds to cooperation, while ‘blue’ corresponds to defection.

In the case of the LLaMA3 model, we also find the ‘green’ and ‘blue’ features. We note one particularly interesting feature (feature index 30,695), which corresponds to an abstract idea of ‘good faith/bad faith’ (you can experiment with this feature in your browser at https://www.neuronpedia.org/llama3-8b-it/25-res-jh/30695):

$$\left\langle P\left(\text{'blue'} \mid +\text{good/bad faith}\right)\right\rangle = 47\%,$$
$$\left\langle P\left(\text{'blue'} \mid -\text{good/bad faith}\right)\right\rangle = 75\%.$$

This feature is used both in religious contexts and in business contexts, such as ‘good faith’ or ‘bad faith’ negotiations. It is a very fitting feature, considering that the problem stated in the prompt was designed to mimic a business environment with two parties competing for monetary gain. In the appendix (Figure 7), we plot the defection probability as a function of steering strength $w$ for all 64 histories. Interestingly, the ‘good/bad faith’ feature leads to nearly monotonic changes in the defection probability for each history. This suggests that the model internally associates one of the choices with ‘good’ and the other one with ‘bad’ faith negotiations. Moreover, this feature is monosemantic, meaning it can be a good candidate for precise steering of the agent’s strategy beyond the IPD.

Why does it matter whether a feature is monosemantic if it increases the cooperation probability and achieves our goal? Polysemantic features may increase the likelihood of steering towards an unexpected concept. In a real-world setting, we may not be able to evaluate the agent’s risk until it is too late. Therefore, steering with single-meaning features may be our safest option. In the next section, we further discuss monosemanticity and provide examples of agent strategy shifts resulting from monosemantic steering.

6 Discussion

6.1 Interpretability

Sparse autoencoders provide increased monosemanticity of the identified features. Indeed, one can verify how a feature activates on a given sample of texts and attempt to associate a single meaning to such a vector. While this is not always possible, once a feature has an identifiable concept associated with it, one can modify the generation in the semantic direction of that concept, as shown in Figure 1.

We find that both polysemantic and monosemantic features can substantially change the defection probability distribution. Some monosemantic features that seemingly should be important for the IPD do not activate on the IPD prompt; an example is the feature (index 7445) that activates largely on mentions of ‘trust’, which, in human understanding, should govern the agent’s behavior in the IPD. When steered with this feature, the next-token probabilities barely change, with $\left\langle P\left(\text{'blue'} \mid +\text{trust}\right)\right\rangle = 47\%$ and $\left\langle P\left(\text{'blue'} \mid -\text{trust}\right)\right\rangle = 50\%$.

Polysemantic features provide less precise control over the agent’s behavior, making it important to identify when we are dealing with one. A good proxy for monosemanticity, as described in [11], is the activation density histogram shown in the appendix. Single-meaning features usually exhibit an approximately bimodal distribution, with a second mode at large activation values. In contrast, polysemantic features tend to exhibit a monotonically decreasing distribution tail. This typically correlates with the meaning of the top activating token patterns for a particular feature. See the appendix for top token activation tables and more details.
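A minimal sketch of this proxy, assuming per-token activations of a single feature collected over a large text sample (the log-scale binning is an illustrative choice):

```python
import numpy as np

def activation_density_histogram(feature_acts: np.ndarray, n_bins: int = 50):
    """Histogram of a single feature's non-zero activations over a text corpus.

    A roughly bimodal shape (a second mode at large activations) is the proxy
    for monosemanticity used here; a monotonically decreasing tail suggests
    a polysemantic feature.
    """
    nonzero = feature_acts[feature_acts > 0]
    counts, edges = np.histogram(np.log10(nonzero), bins=n_bins)
    return counts, edges
```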

We also find some commonalities among the steered models, including the appearance of ‘green’/‘blue’ features and unexpected concepts that are universal among models. For example, an ‘environment’ concept is present in both the LLaMA and Gemma model families. Moreover, this feature is monosemantic, and subtracting it from the residual stream leads to a substantial increase in defection probability (see Table 1 in the Appendix).

Perhaps the most promising feature for future AI alignment we have found is the ‘good faith/bad faith’ feature. It has several advantages:

  • It is monosemantic, as evidenced by its top token activations and the activation density histogram (see Figure 14 in the appendix).

  • Its meaning is relevant to human moral values.

  • It aligns with our intuition and changes the average defection probability by a substantial amount (28 percentage points).

  • It exhibits an approximately monotonic relationship (see Figure 7) with steering strength for each history combination.

Considering all of these qualities, we hypothesize that SAE alignment using this feature may generalize beyond the simple game-theoretic setup investigated in this work.

6.2 Outlook

We show that inference-time steering with Sparse Autoencoders allows for LLM Agent strategy modulation in the Iterated Prisoner’s Dilemma game-theoretic setup. Moreover, we discuss how to find monosemantic features and their interpretations. We find that features corresponding to tokens explicitly appearing in the prompt, namely the choice of ‘green’ (cooperation) and ‘blue’ (defection), as well as more abstract concepts, largely change the Agent’s strategy. We note that the Agent’s behavior, when steered with certain features (e.g., ‘environment’), generalizes between different LLM families. We show the possible range of the Agent’s strategy moderation for the Gemma, Gemma2, and LLaMA3 models. Both mono- and polysemantic features can greatly change the probability distribution of defective actions. In the general setting, monosemantic steering may be more reliable and may lower the chance of an unexpected concept affecting the agent’s action choice. To further improve SAE steering performance and interpretability, several approaches are possible:

LLM model size. Increasing the parameter count of LLMs leads to better text-understanding performance. In the LLM Agent paradigm, a larger model also means a larger context window and better task understanding. Moreover, the internal representations stored in the transformer layers include more abstract and fine-grained concepts.

SAE refinement. Scaling the number of SAE parameters leads to increased monosemanticity [62]. In that work, a detailed examination of the ‘Conflict’ feature is conducted. Although the feature’s neighborhood does not distinctly separate into clusters, different subregions correspond to distinct themes. For example, one subregion is associated with balancing trade-offs, positioned near another focused on opposing principles and legal disputes. These are relatively distant from a subregion centered on emotional struggles, reluctance, and guilt. The clarity of neighborhood clustering improves with a larger dictionary size, a phenomenon known as feature splitting, which may enable more precise steering of agent strategies. Monosemanticity also scales effectively for visual and multimodal features.

Multilayer steering. Further research is necessary to better understand how steering with multiple features at once can be performed. Since the feature space is overcomplete, simple vector addition within the same transformer layer may not be enough (the sum of two concept vectors does not always result in the intuitive ‘conceptual sum’, as there are not enough dimensions). Multilayer steering may help with steering towards a combined set of concepts, which is necessary for satisfactory AI alignment.

7 Acknowledgments

We thank Artur Janicki and Marcin Lewandowski for feedback on the initial versions of this manuscript. We are grateful to Łukasz Bondaruk, Mateusz Czyżnikiewicz, Michał Brzozowski, Bartosz Maj, and others for all of the valuable suggestions.

References

  • Ager etal. [2020]Philipp Ager, Jan Bena, Maximiliano Coutin-Churchman, Julian Leon, and DavidWiczer.Artificial intelligence and the future of work: Evidence from OECDcountries.OECD Publishing, 2020.
  • Ahn etal. [2022]Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Yunfei Chow, ColinChu, AliceX. Dai, Chelsea Finn, Justin Fu, Kanishka Gopalakrishnan, etal.Do as i can, not as i say: Grounding language in robotic affordances.In Conference on Robot Learning (CoRL), 2022.
  • AI [2023]Mistral AI.Mistral 7b: Open foundation models, 2023.https://mistral.ai/technology/#models.
  • Akata etal. [2023]Elif Akata, Lion Schulz, Julian Coda-Forno, SeongJoon Oh, Matthias Bethge, andEric Schulz.Playing repeated games with large language models.ArXiv, abs/2305.16867, 2023.URL https://api.semanticscholar.org/CorpusID:258947115.
  • Amodei etal. [2016]Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, andDan Mané.Concrete problems in AI safety.arXiv preprint arXiv:1606.06565, 2016.
  • Axelrod [1984]R.Axelrod.The Evolution of Cooperation.Basic books. Basic Books, 1984.ISBN 9780465021215.URL https://books.google.pl/books?id=NJZBCGbNs98C.
  • Bai etal. [2022]Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Neal DasSarma,Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, etal.Training a helpful and harmless assistant with reinforcement learningfrom human feedback.arXiv preprint arXiv:2204.05862, 2022.
  • Baker etal. [2020]Bowen Baker, Ingmar Kanitscheider, Todor Markov, YiWu, Glenn Powell, BobMcGrew, and Igor Mordatch.Emergent tool use from multi-agent autocurricula.In Proceedings of the International Conference on LearningRepresentations, 2020.
  • Bender etal. [2021]EmilyM. Bender, Timnit Gebru, Angelina McMillan-Major, and ShmargaretShmitchell.On the dangers of stochastic parrots: Can language models be too big?In Proceedings of the 2021 ACM Conference on Fairness,Accountability, and Transparency, pages 610–623. ACM, 2021.
  • Bommasani etal. [2021]Rishi Bommasani, DrewA. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydneyvon Arx, etal.On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021.
  • Bricken etal. [2023]Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, TomConerly, NicholasL. Turner, Cem Anil, Carson Denison, Amanda Askell, RobertLasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, NicholasJoseph, Alex Tamkin, Karina Nguyen, Brayden McLean, JosiahE. Burke, TristanHume, Shan Carter, Tom Henighan, and Chris Olah.Towards monosemanticity: Decomposing language models with dictionarylearning.https://transformer-circuits.pub/2023/monosemantic-features,October 2023.Published online.
  • Brohan etal. [2022]Anthony Brohan, Yevgen Chebotar, Jacky Liang, Chelsea Finn, Karol Hausman, AlexIrpan, Julian Ibarz, and Sergey Levine.RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022.
  • Brookins and Debacker [2023]Philip Brookins and Jason Debacker.Playing games with gpt: What can we learn about a large languagemodel from canonical strategic games?SSRN Electronic Journal, 2023.URL https://api.semanticscholar.org/CorpusID:259714625.
  • Brown etal. [2020]TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD. Kaplan,Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, AmandaAskell, etal.Language models are few-shot learners.In Advances in Neural Information Processing Systems,volume33, pages 1877–1901, 2020.
  • Brynjolfsson and Mitchell [2017]Erik Brynjolfsson and Tom Mitchell.What can machine learning do? workforce implications.Science, 358(6370):1530–1534, 2017.
  • Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374.
  • Chen etal. [2024]Qiliang Chen, AlirezaSepehr Ilami, Nunzio Lorè, and Babak Heydari.Instigating cooperation among llm agents using adaptive informationmodulation.ArXiv, abs/2409.10372, 2024.URL https://api.semanticscholar.org/CorpusID:272690037.
  • Cunningham etal. [2023a]Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey.Sparse autoencoders find highly interpretable features in languagemodels.ArXiv, abs/2309.08600, 2023a.URL https://api.semanticscholar.org/CorpusID:261934663.
  • Cunningham etal. [2023b]Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey.Sparse autoencoders find highly interpretable features in languagemodels, 2023b.URL https://arxiv.org/abs/2309.08600.
  • Dolgopolov [2024]Arthur Dolgopolov.Reinforcement learning in a prisoner’s dilemma.Games Econ. Behav., 144:84–103, 2024.URL https://api.semanticscholar.org/CorpusID:267111895.
  • Elhage etal. [2022a]Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan,Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen,RogerBaker Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, MartinWattenberg, and Christopher Olah.Toy models of superposition.ArXiv, abs/2209.10652, 2022a.URL https://api.semanticscholar.org/CorpusID:252439050.
  • Elhage etal. [2022b]Nelson Elhage, Neel Nanda, Catherine Olsson, etal.Toy models of superposition.arXiv preprint arXiv:2210.04866, 2022b.
  • Engels etal. [2024]Joshua Engels, Isaac Liao, EricJ. Michaud, Wes Gurnee, and Max Tegmark.Not all language model features are linear.ArXiv, abs/2405.14860, 2024.URL https://api.semanticscholar.org/CorpusID:269983112.
  • Flood [1958]MerrillM Flood.Some experimental games.Management Science, 5(1):5–26, 1958.
  • Fontana etal. [2024]Nicol’o Fontana, Francesco Pierri, and LucaMaria Aiello.Nicer than humans: How do large language models behave in theprisoner’s dilemma?ArXiv, abs/2406.13605, 2024.URL https://api.semanticscholar.org/CorpusID:270619642.
  • Gabriel [2020]Iason Gabriel.Artificial intelligence, values, and alignment.Minds and Machines, 30:411–437, 09 2020.doi: 10.1007/s11023-020-09539-2.
  • Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
  • Hendrycks etal. [2021a]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, DawnSong, and Jacob Steinhardt.Measuring massive multitask language understanding,2021a.URL https://arxiv.org/abs/2009.03300.
  • Hendrycks etal. [2021b]Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, EricTang, Dawn Song, and Jacob Steinhardt.Measuring mathematical problem solving with the math dataset,2021b.URL https://arxiv.org/abs/2103.03874.
  • Hong etal. [2024]Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, YanWang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang.Cogagent: A visual language model for gui agents.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition (CVPR), pages 14281–14290, June 2024.
  • Hościłowicz etal. [2024]Jakub Hościłowicz, Adam Wiacek, Jan Chojnacki, Adam Cieślak, Leszek Michoń,Vitalii Urbanevych, and Artur Janicki.Non-linear inference time intervention: Improving llm truthfulness.Interspeech 2024, 2024.URL https://api.semanticscholar.org/CorpusID:268724230.
  • Huang and Chang [2023]Jie Huang and Kevin Chen-Chuan Chang.Towards reasoning in large language models: A survey, 2023.URL https://arxiv.org/abs/2212.10403.
  • Javaid etal. [2024]Shumaila Javaid, Hamza Fahim, Bin He, and Nasir Saeed.Large language models for uavs: Current state and pathways to thefuture.IEEE Open Journal of Vehicular Technology, 5:1166–1192, 2024.URL https://api.semanticscholar.org/CorpusID:269588084.
  • Jiang etal. [2024]AlbertQ. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, BlancheSavary, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, EmmaBouHanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample,LélioRenard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock,Sandeep Subramanian, Sophia Yang, Szymon Antoniak, TevenLe Scao, ThéophileGervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed.Mixtral of experts, 2024.URL https://arxiv.org/abs/2401.04088.
  • Jiatong Han [2024]Jiatong Han.llama-3-8b-it-res (revision 53425c3), 2024.URL https://huggingface.co/Juliushanhanhan/llama-3-8b-it-res.
  • JosephBloom and Chanin [2024]CurtTigges JosephBloom and David Chanin.Saelens.https://github.com/jbloomAus/SAELens, 2024.
  • Kasberger etal. [2023]Bernhard Kasberger, SimonP. Martin, Hans-Theo Normann, and T.Werner.Algorithmic cooperation.SSRN Electronic Journal, 2023.URL https://api.semanticscholar.org/CorpusID:257673245.
  • Kirkpatrick etal. [2017]James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, GuillaumeDesjardins, AndreiA. Rusu, Kieran Milan, John Quan, Tiago Ramalho, AgnieszkaGrabska-Barwinska, etal.Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Li etal. [2023]Kenneth Li, Oam Patel, Fernanda Vi’egas, Hans-Rüdiger Pfister, and MartinWattenberg.Inference-time intervention: Eliciting truthful answers from alanguage model.ArXiv, abs/2306.03341, 2023.URL https://api.semanticscholar.org/CorpusID:259088877.
  • Lieberum etal. [2024]Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, NicolasSonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and NeelNanda.Gemma scope: Open sparse autoencoders everywhere all at once on gemma2, 2024.URL https://arxiv.org/abs/2408.05147.
  • Lipton [2016]Zachary Lipton.The mythos of model interpretability.Communications of the ACM, 61, 10 2016.doi: 10.1145/3233231.
  • Liu etal. [2023]Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, andGraham Neubig.Pre-train, prompt, and predict: A systematic survey of promptingmethods in natural language processing.ACM Computing Surveys, 55(9):1–35, 2023.
  • Lorè etal. [2024]Nunzio Lorè, AlirezaSepehr Ilami, and Babak Heydari.Large model strategic thinking, small model efficiency: Transferringtheory of mind in large language models.ArXiv, abs/2408.05241, 2024.URL https://api.semanticscholar.org/CorpusID:271854780.
  • Mikolov etal. [2013]Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.Linguistic regularities in continuous space word representations.In Lucy Vanderwende, Hal DauméIII, and Katrin Kirchhoff,editors, Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics: Human LanguageTechnologies, pages 746–751, Atlanta, Georgia, June 2013. Association forComputational Linguistics.URL https://aclanthology.org/N13-1090.
  • Minaee etal. [2024]Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher,Xavier Amatriain, and Jianfeng Gao.Large language models: A survey, 2024.URL https://arxiv.org/abs/2402.06196.
  • Nowak and Sigmund [1993]Martin Nowak and Karl Sigmund.A strategy of win-stay, lose-shift that outperforms tit-for-tat inthe prisoner’s dilemma game.Nature, 364(6432):56–58, 1993.
  • Ouyang etal. [2022]Long Ouyang, Jeff Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, PamelaMishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.Training language models to follow instructions with human feedback.arXiv preprint arXiv:2203.02155, 2022.
  • Pan etal. [2023]Jiateng Pan, Atsushi Yoshikawa, and Masayuki Yamamura.Cooperation: A systematic review of how to enable agent to circumventthe prisoner’s dilemma.SHS Web of Conferences, 178, 10 2023.doi: 10.1051/shsconf/202317803005.
  • Park etal. [2023]JoonSung Park, Michael Shum, Joseph Xu, Kenneth Zhang, RogerG. Wang, LinxiWang, Alex Wang, Allie He, Qian Liao, David Kempe, etal.Generative agents: Interactive simulacra of human behavior.arXiv preprint arXiv:2304.03442, 2023.
  • Pawlowski etal. [2024]Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, AdamWiacek, Sebastien Postansque, and Jakub Hoscilowicz.Tinyclick: Single-turn agent for empowering gui automation, 2024.URL https://arxiv.org/abs/2410.11871.
  • Phelps and Russell [2023]Steve Phelps and YvanI. Russell.The machine psychology of cooperation: Can gpt models operationaliseprompts for altruism, cooperation, competitiveness and selfishness ineconomic games?2023.URL https://api.semanticscholar.org/CorpusID:258685424.
  • Poje etal. [2024]Kristijan Poje, Mario Brcic, Mihael Kovavc, and MarinaBagic Babac.Effect of private deliberation: Deception of large language models ingame play.Entropy, 26, 2024.URL https://api.semanticscholar.org/CorpusID:270613663.
  • Rein etal. [2023]David Rein, BettyLi Hou, AsaCooper Stickland, Jackson Petty, RichardYuanzhePang, Julien Dirani, Julian Michael, and SamuelR. Bowman.Gpqa: A graduate-level google-proof qa benchmark, 2023.URL https://arxiv.org/abs/2311.12022.
  • Rogers etal. [2020]Anna Rogers, Olga Kovaleva, and Anna Rumshisky.A primer in BERTology: What we know about how BERT works.Transactions of the Association for Computational Linguistics,8:842–866, 2020.doi: 10.1162/tacl˙a˙00349.URL https://aclanthology.org/2020.tacl-1.54.
  • Russell etal. [2015]Stuart Russell, Daniel Dewey, and Max Tegmark.Research priorities for robust and beneficial artificialintelligence.AI Magazine, 36(4):105–114, 2015.
  • Scharre [2019]P.Scharre.Army of None: Autonomous Weapons and the Future of War.WW Norton, 2019.ISBN 9780393356588.URL https://books.google.se/books?id=kF2NEAAAQBAJ.
  • Schick etal. [2023]Timo Schick, Jane Dwivedi-Yu, Nathan Scales, David Dohan, Justin Gilmer,Richard Tanburn, Vedant Misra, Kyle Mills, José SusanoPinto, NathanaelSchärli, etal.Toolformer: Language models can teach themselves to use tools.arXiv preprint arXiv:2302.04761, 2023.
  • Suzuki and Arita [2023]Reiji Suzuki and Takaya Arita.An evolutionary model of personality traits related to cooperativebehavior using a large language model.Scientific Reports, 14, 2023.URL https://api.semanticscholar.org/CorpusID:263830498.
  • Taddeo and Blanchard [2022]Mariarosaria Taddeo and Alexander Blanchard.A comparative analysis of the definitions of autonomous weaponssystems.Science and Engineering Ethics, 28(5):1–22, 2022.doi: 10.1007/s11948-022-00392-3.
  • Team et al. [2024a] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, et al. Gemma: Open models based on Gemini research and technology, 2024a. URL https://arxiv.org/abs/2403.08295.
  • Team etal. [2024b]Gemma Team, Morgane Riviere, Shreya Pathak, PierGiuseppe Sessa, CassidyHardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, BobakShahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, AbeFriesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, CharlineLe Lan, SammyJerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin,Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, BehnamNeyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish,Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, AndyBrock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, BilalPiot, BoWu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, ChrisWelty, ChristopherA. Choquette-Choo, Danila Sinopalnikov, David Weinberger,Dimple Vijaykumar, Dominika Rogozińska, Dustin Herbison, Elisa Bandy, EmmaWang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, FrancescoVisin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi,Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, JacindaMein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, JinPeng Zhou,Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost vanAmersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Juyeong Ji, KareemMohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, KelvinNguyen, Kiranbir Sodhia, Kish Greene, LarsLowe Sjoesund, Lauren Usui,Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, LivioBaldiniSoares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid,Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, MattDavidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, MehranKazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman,Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, NetaDumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, Parker Barnes, PaulBarham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton, PradeepKuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, RezaArdeshir Rokni,Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, SaraMc Carthy, Sarah Cogan,Sarah Perrin, Sébastien M.R. Arnold, Sebastian Krause, Shengyang Dai,Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu,Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, VikasYadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, WenmingYe, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei,Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, TrisWarkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell,D.Sculley, Jeanine Banks, Anca Dragan, Slav Petrov, Oriol Vinyals, JeffDean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya,Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, RobertDadashi, and Alek Andreev.Gemma 2: Improving open language models at a practical size,2024b.URL https://arxiv.org/abs/2408.00118.
  • Templeton [2024] Adly Templeton. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Anthropic, 2024.
  • Tennant et al. [2024] Elizaveta Tennant, Stephen Hailes, and Mirco Musolesi. Moral alignment for LLM agents, 2024. URL https://api.semanticscholar.org/CorpusID:273026159.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023a. URL https://arxiv.org/abs/2302.13971.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.
  • Wang et al. [2019] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.
  • Wang et al. [2024] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception, 2024. URL https://arxiv.org/abs/2401.16158.
  • Wei et al. [2022] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
  • Wei et al. [2023] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
  • Yao et al. [2022] Shunyu Yao, Howard Chen Yu, Shunyu Cao, Yuan Zhao, Dong Yu, Hanjun Sun, Ofir Press, Mike Lewis, Yuan Cao, Karthik Narasimhan, et al. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • Zhang et al. [2023] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal agents as smartphone users, 2023. URL https://arxiv.org/abs/2312.13771.
  • Zhang and Zhang [2024] Zhuosheng Zhang and Aston Zhang. You only look at screens: Multimodal chain-of-action agents, 2024. URL https://arxiv.org/abs/2309.11436.
  • Zou et al. [2023] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency, 2023. URL https://arxiv.org/abs/2310.01405.

Appendix A Appendix

A.1 Preliminary study details

In this section, we experiment with the relatively large Mixtral 8x7B model [34]. From now on, Player 1 (P1) denotes the LLM agent and Player 2 (P2) denotes the opposing strategy. As P2 we use a randomly defecting opponent with uniform defection probability $p_{2,\text{defect}}$.

To provide a baseline, we first simulate games using the win-stay, lose-change strategy [46], which is deterministic and does not involve any LLMs. We pair this strategy with the randomly defecting opponent described above. The resulting defection probability $p_{1,\text{defect}}$ as a function of $p_{2,\text{defect}}$ is depicted in Figure 5 (left). We observe a contraction with increasing $p_{2,\text{defect}}$, converging to $p_{1,\text{defect}} = \tfrac{1}{2}$ when $p_{2,\text{defect}} = 1$. At this point the strategy faces an always-defecting opponent; since both possible outcomes (mutual defection and cooperating against a defector) count as losses, the strategy alternates between cooperation and defection each turn, giving $p_{1,\text{defect}} = \tfrac{1}{2}$.
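The baseline curve in Figure 5 (left) can be reproduced with a short simulation. The following Python sketch is a minimal illustration rather than the exact implementation used in the experiments: it assumes a round counts as a win exactly when the opponent cooperated (so the agent repeats its move after wins and switches after losses), and the game length, number of games, and cooperative starting move are arbitrary choices.

import numpy as np

def simulate_wslc(p2_defect, n_rounds=10, n_games=2000, seed=0):
    """Win-stay, lose-change agent vs. a uniformly random defector.

    Assumption of this sketch: a round is a 'win' exactly when the
    opponent cooperated; the agent then repeats its previous move,
    otherwise it switches. Returns the empirical defection rate.
    """
    rng = np.random.default_rng(seed)
    total_defections = 0
    for _ in range(n_games):
        my_move = 0  # 0 = cooperate, 1 = defect; start cooperatively
        for _ in range(n_rounds):
            total_defections += my_move
            if rng.random() < p2_defect:   # opponent defected: loss -> change
                my_move = 1 - my_move
            # opponent cooperated: win -> stay
    return total_defections / (n_rounds * n_games)

for p2 in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p2_defect = {p2:.2f} -> p1_defect ≈ {simulate_wslc(p2):.2f}")

Against the always-defecting opponent every round is a loss, so the agent alternates between the two actions and the defection rate settles at one half, as described above.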

The LLM defection probability $p_{1,\text{defect}}$ as a function of the opponent's defection probability $p_{2,\text{defect}}$ is depicted in Figure 5 (right), together with a logistic fit ($R^2 = 0.73$). The distribution flattens once the opponent's defecting actions appear with a probability of around $p = 0.4$. The overall behaviour follows a logistic curve, but the fit is not perfect. We find two qualitatively different reasons behind this:

  • Case 1: non-cooperative opponent ($p_{2,\text{defect}} \neq 0$). The large variance in the region $p_{2,\text{defect}} < 0.3$ is caused by the LLM adopting different strategies depending on the initial behaviour of the adversary. In the first several turns the LLM usually prefers a defective action to maximise its score. If the opponent cooperates, the LLM tends to switch to cooperation as well; if the opponent's initial actions are defective, the model does not easily ‘forgive’ and tends to defect more often throughout the round. This behaviour leads to wide confidence intervals, a densely populated top-left corner of the plot, and two data clusters emerging around $p_{1,\text{defect}} = 0.2$ and $p_{1,\text{defect}} = 0.9$.

  • Case 2: fully cooperative opponent ($p_{2,\text{defect}} = 0$). Even more striking is that, for a fixed history of an entirely cooperative opponent, the agent employs two qualitatively different strategies: a cooperative one (around $p_{1,\text{defect}} = 0.2$) and an aggressive one (around $p_{1,\text{defect}} = 0.8$). Note that this is strictly a non-zero-temperature effect, as the game prompt and the histories are fixed. A small perturbation of the sampling distribution ($T = 0 \rightarrow T = 0.1$) therefore results in large changes of behaviour (up to $p_{1,\text{defect}} = 0.1 \rightarrow p_{1,\text{defect}} = 0.9$); a minimal numerical illustration follows this list.
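The temperature sensitivity can be illustrated with a toy calculation. The logit values below are made up for illustration only; the point is that when the two action tokens are nearly tied, moving from greedy decoding ($T = 0$) to $T = 0.1$ already gives the second action an appreciable sampling probability, which compounds over a round into qualitatively different strategies.

import numpy as np

def action_probs(logits, temperature):
    """Softmax over the two action tokens at a given sampling temperature.
    temperature -> 0 recovers the greedy (argmax) decision."""
    if temperature == 0:
        probs = np.zeros_like(logits, dtype=float)
        probs[np.argmax(logits)] = 1.0
        return probs
    z = logits / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative, made-up logits for (' green', ' blue') that are nearly tied.
logits = np.array([2.10, 2.05])
for T in (0.0, 0.1, 1.0):
    p_green, p_blue = action_probs(logits, T)
    print(f"T={T}: P(' green')={p_green:.2f}, P(' blue')={p_blue:.2f}")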

In contrast, the win-stay, lose-change strategy depicted in Figure 5 (left) yields a single value of $p_{1,\text{defect}}$ for the fully cooperative opponent ($p_{2,\text{defect}} = 0$), and distinct point clusters are much harder to identify.
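For reference, a logistic fit of the kind reported for Figure 5 (right) can be produced along the following lines. The data points below are placeholders standing in for the measured $(p_{2,\text{defect}}, p_{1,\text{defect}})$ pairs, and the four-parameter logistic form is an assumption about the general logistic curve used.

import numpy as np
from scipy.optimize import curve_fit

def logistic(x, L, k, x0, b):
    """Four-parameter logistic curve: baseline b, height L, slope k, midpoint x0."""
    return b + L / (1.0 + np.exp(-k * (x - x0)))

# Placeholder data; in the experiment these are the measured
# (p2_defect, p1_defect) pairs from the simulated games.
p2 = np.linspace(0.0, 1.0, 11)
p1 = np.array([0.25, 0.30, 0.45, 0.60, 0.72, 0.80, 0.84, 0.86, 0.88, 0.89, 0.90])

params, _ = curve_fit(logistic, p2, p1, p0=[0.7, 10.0, 0.3, 0.2], maxfev=10_000)
pred = logistic(p2, *params)
r2 = 1.0 - np.sum((p1 - pred) ** 2) / np.sum((p1 - p1.mean()) ** 2)
print(f"L, k, x0, b = {params.round(3)}, R^2 = {r2:.2f}")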


[Figure 5: Defection probability $p_{1,\text{defect}}$ as a function of the opponent's defection probability $p_{2,\text{defect}}$ for the win-stay, lose-change baseline (left) and the Mixtral 8x7B agent with a logistic fit (right).]

The clustering is even more apparent in Figure 6, where, in contrast to the win-stay, lose-change strategy, the Mixtral 8x7B approach leads to two linear aggregations for opponent defection probabilities below 0.5. If the opponent's actions are defective more often than this threshold, a single limit strategy (always defect) is dominant. The defect-heavy strategy (the top-side cluster) is applied both against mostly defecting opponents (the agent defects to even the score at 3–3) and against mostly cooperating ones (the agent defects to gain a 7–0 advantage).


[Figure 6: Strategy clusters of the Mixtral 8x7B agent compared with the win-stay, lose-change baseline.]


A.2 Prompts

Prompt used in the preliminary Mixtral experiments:

Prompt used in Gemma and LLaMA experiments:

Next-token probability example for the Gemma prompt:

A.3 Notable features

Table 1 shows notable features found in the SAEs of the Gemma and LLaMA models and their impact on the average fourth-round defection probability in the IPD.

Interpretation   Model          Feature ID   Semanticity   ⟨P(' blue'|+)⟩   ⟨P(' blue'|−)⟩
Green            Gemma-2b       1041         Mono          0.23             0.95
Blue             Gemma-2b       6556         Mono          0.80             0.04
Change           Gemma-2b       6879         Poly          0.47             0.84
Sacrifice        Gemma-2b       7155         Poly          0.22             0.69
Trust            Gemma-2b       7445         Mono          0.50             0.47
Environment      Gemma2-2b      6167         Mono          0.34             0.84
Blue             LLaMA3-IT-8b   45097        Mono          0.81             0.27
Environment      LLaMA3-IT-8b   15699        Mono          0.42             0.95
Good/Bad Faith   LLaMA3-IT-8b   30695        Mono          0.75             0.47

We find that steering with the ‘change’ feature allows us to shift the last-token probability distribution by 37 percentage points. With steering, the probability of defection, averaged over all 64 histories, changed from

    $\langle P(\textrm{` blue'} \mid +\,\textrm{change}) \rangle = 47\%$

for the positive $w$ to

    $\langle P(\textrm{` blue'} \mid -\,\textrm{change}) \rangle = 84\%$

for the negative value of $w$.

Even though the cooperation probability does not rise for every history configuration, the ‘change’-steered model shifts generation towards cooperation overall. Removing the ‘change’ feature from the residual stream (by subtracting it with weight $w$) makes the agent's decisions markedly less cooperative.
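For concreteness, the sketch below shows one way such residual-stream steering can be applied at inference time: a decoded feature direction is added to a layer's hidden states through a forward hook, and the probability of the ' blue' action token is read off the final position. The model name, hook layer, steering weight, and the random stand-in for the SAE decoder direction are assumptions of this sketch, not the exact configuration used in the experiments.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"   # assumption: any model with published SAE features
LAYER = 6                        # residual-stream hook point (assumption)
STEER_WEIGHT = 8.0               # steering strength w; its sign selects the +/- direction

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Stand-in for the SAE decoder column of the chosen feature
# (e.g. the 'change' or good/bad-faith direction); a random unit vector here.
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()

def steering_hook(module, inputs, output):
    """Add STEER_WEIGHT * feature_dir to the layer's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + STEER_WEIGHT * feature_dir.to(hidden.dtype)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "..."  # the IPD game prompt with the current history (omitted here)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)
blue_id = tokenizer.encode(" blue", add_special_tokens=False)[0]
print(f"P(' blue' | steered) = {probs[blue_id].item():.3f}")

handle.remove()

Running the same forward pass with the hook removed, or with the opposite sign of STEER_WEIGHT, gives the unsteered and negatively steered probabilities that the averages above compare.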

Indeed, there are several features that drastically change the probability of defection. We list three notable Gemma-2b features and one LLaMA3-IT-8b feature:

  • Steering towards the green feature (1041) drastically changes the ' blue' token generation probability:

    $\langle P(\textrm{` blue'} \mid -\,\textrm{green}) \rangle = 95\%$,
    $\langle P(\textrm{` blue'} \mid +\,\textrm{green}) \rangle = 23\%$.

    This is a sanity-check result, as ‘green’ is the other action option, corresponding to cooperation. Biasing the model towards this direction decreases the ' blue' token probability, since the two probabilities should sum approximately to one: $P(\textrm{` green'}) + P(\textrm{` blue'}) \approx 1$.

  • Steering towards the blue feature (6556) drastically changes the ' blue' token generation probability:

    $\langle P(\textrm{` blue'} \mid -\,\textrm{blue}) \rangle = 4\%$,
    $\langle P(\textrm{` blue'} \mid +\,\textrm{blue}) \rangle = 80\%$.
  • The other, polysemantic feature (7155) is more interesting: it activates most strongly on tokens describing sacrifice, hope, and victims, but also on sentences opposing violence.

    $\langle P(\textrm{` blue'} \mid -\,\textrm{sacrifice}) \rangle = 69\%$,
    $\langle P(\textrm{` blue'} \mid +\,\textrm{sacrifice}) \rangle = 22\%$.

    When steered in the direction opposing ‘sacrifice’, the model becomes much more vindictive; positive steering towards ‘sacrifice’ usually results in cooperation.

  • A monosemantic LLaMA3-IT-8b feature, good/bad-faith negotiation, has a number of desirable qualities for LLM agent alignment:

    $\langle P(\textrm{` blue'} \mid -\,\textrm{good/bad faith}) \rangle = 47\%$,
    $\langle P(\textrm{` blue'} \mid +\,\textrm{good/bad faith}) \rangle = 75\%$.

    It is monosemantic, as evidenced by its top token activations and the activation-density histogram (Figure 14). Its meaning is relevant to human moral values, it aligns with our intuition, and it changes the average defection probability by a substantial amount (28 percentage points). It also exhibits an approximately monotonic relationship with steering strength for each history combination (Figure 7).

[Figure 7: Defection probability as a function of steering strength for the good/bad-faith negotiation feature, shown per history combination.]

This shows that both surface-level action tokens such as ' green' and abstract concepts such as ' debt' or the denunciation of violence strongly affect the choice of action.
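The averages quoted above are taken over the 64 possible three-round history combinations (two binary moves per round), and Figure 7 suggests an approximately monotonic dependence on the steering strength. A minimal sketch of that sweep follows; defection_probability is a placeholder standing in for the steered forward pass from the previous sketch, and its toy expression carries no experimental meaning.

import itertools
import numpy as np

def defection_probability(history, w):
    """Placeholder for the steered forward pass: should return P(' blue')
    for a given three-round history and steering weight w. The toy
    expression below only keeps the sketch runnable."""
    opponent_defections = sum(p2 for _, p2 in history)
    return 1.0 / (1.0 + np.exp(-(opponent_defections - 1.5 + 0.3 * w)))

# All 64 three-round histories: each round is a (P1 move, P2 move) pair,
# with 0 = cooperate (' green') and 1 = defect (' blue').
histories = [tuple(zip(moves[0::2], moves[1::2]))
             for moves in itertools.product((0, 1), repeat=6)]

for w in (-8, -4, 0, 4, 8):
    avg = np.mean([defection_probability(h, w) for h in histories])
    print(f"w = {w:+d}: <P(' blue')> over {len(histories)} histories = {avg:.2f}")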

A.4 Activation dashboards

As described in [11], activation dashboards allow us to indirectly assess the monosemanticity of a given feature. One passes a dataset through the model and collects the activation values of a chosen feature; the distributions of the collected activations are shown in the figures below.
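A minimal sketch of how such a dashboard histogram can be collected is given below. The model name, hook layer, SAE width, and the randomly initialised encoder are assumptions of this sketch; in practice the encoder weights are loaded from the published SAE checkpoint for the chosen layer, and the text list is the evaluation corpus.

import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"   # assumption: model with a published SAE
LAYER = 6                        # layer the SAE was trained on (assumption)
FEATURE_ID = 7445                # e.g. the 'trust' feature from Table 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Placeholder SAE encoder; in practice W_enc and b_enc come from the
# released sparse autoencoder checkpoint for this layer.
d_model, d_sae = model.config.hidden_size, 16_384
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
b_enc = torch.zeros(d_sae)

texts = ["..."]  # the evaluation corpus passed through the model

activations = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).hidden_states[LAYER][0]   # [seq, d_model]
        feats = torch.relu(hidden @ W_enc + b_enc)         # SAE feature activations
        activations.extend(feats[:, FEATURE_ID].tolist())

plt.hist(activations, bins=50, log=True)
plt.xlabel("activation value")
plt.ylabel("token count")
plt.title(f"Activation histogram, feature {FEATURE_ID}")
plt.show()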

There are two general classes of distributions:

  • Flat-tail distribution: the bin counts decrease monotonically with activation strength, and no text samples disproportionately activate the feature. See, e.g., the ‘sacrifice’ feature in Figure 13.

  • Tail cluster: the bin counts form a bimodal distribution with a smaller mode at large activation values. See, e.g., the ‘trust’ feature in Figure 10.


[Activation dashboards for the features listed in Table 1, including ‘trust’ (Figure 10), ‘sacrifice’ (Figure 13), and good/bad-faith negotiation (Figure 14).]
