When Your AI Assistant Starts Playing a Role: The Hidden Security Problem of Persona-Driven LLMs
Renato Vicente, an Associate Professor of Machine Learning at the University of São Paulo and the Director of the TELUS Digital Research Hub, outlines the hidden security risks posed by persona-driven AI and LLMs.
Enterprise security teams have spent the last few years building guardrails around Large Language Models (LLMs): prompt filters, output classifiers, content policies, and red-team exercises. Most of this work assumes a reasonably stable target: that a model’s underlying moral judgments will remain broadly consistent, and that the job is to keep users from jailbreaking their way past those values.
Recent research suggests the target is less stable than many guardrails assume. When an LLM is asked to adopt a persona, whether a character, a role, or a background story, its moral judgments can shift, sometimes dramatically. The size and direction of those shifts vary in ways that matter for anyone deploying these systems in production.
This is not just an academic concern. Persona-style prompting is a standard interaction pattern in most enterprise LLM deployments today. Customer service bots are given personas. Internal copilots are assigned role definitions. Agent frameworks often begin with a system prompt that says, in effect, “you are a …”. How that instruction reshapes the model’s behavior is now part of the security surface.
Measuring Robustness and Susceptibility in LLMs
Our recent benchmark study, Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models (Costa, Alves, and Vicente, arXiv:2511.08565, 2026), which I co-authored with colleagues at the TELUS Digital Research Hub at the University of São Paulo, examined this question systematically across 15 contemporary models from five major families.
We adapted a well-established tool from moral psychology, the Moral Foundations Questionnaire, and asked each model to respond to it while role-playing 100 different personas, with ten repeated runs per question. This produced 30,000-plus queries per model and allowed us to isolate two distinct behaviors.
The first, which we called Moral Robustness, measured whether a model, given the same persona and question, returned consistent answers across repeated queries. The second, which we called Moral Susceptibility, measured whether a model’s answers shifted substantially as the persona changed. These are independent properties, and enterprises should care about both for different reasons.
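As a rough illustration of the two measurements described above, the sketch below computes a within-run spread (lower means more robust) and an across-persona spread (higher means more susceptible) from a nested table of ratings. These are simplified stand-ins chosen for readability, not the exact formulas used in the study. The persona names, question labels, ratings, and function names are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy data: ratings[persona][question] = ratings from repeated runs.
# A numeric questionnaire scale is assumed; every value here is invented.
ratings = {
    "nurse": {"q1": [4, 4, 4], "q2": [2, 2, 2]},
    "trader": {"q1": [1, 1, 1], "q2": [2, 2, 2]},
}


def within_run_spread(ratings):
    """Average std dev across repeated runs of the same (persona, question).

    Lower values indicate more consistent answers, i.e. higher robustness.
    """
    spreads = [
        pstdev(runs)
        for per_question in ratings.values()
        for runs in per_question.values()
    ]
    return mean(spreads)


def across_persona_spread(ratings):
    """Average, over questions, of the std dev of per-persona mean ratings.

    Higher values indicate answers that move more as the persona changes,
    i.e. higher susceptibility.
    """
    questions = list(next(iter(ratings.values())).keys())
    spreads = []
    for question in questions:
        persona_means = [mean(per_question[question]) for per_question in ratings.values()]
        spreads.append(pstdev(persona_means))
    return mean(spreads)


print("within-run spread:", within_run_spread(ratings))
print("across-persona spread:", across_persona_spread(ratings))
```

In this toy table every repeated run agrees, so the within-run spread is zero, while the two personas disagree on one question, so the across-persona spread is not. A model can therefore be perfectly repeatable and still shift with the persona, which is why the two properties are measured separately.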
Robustness Varies by Family, Not by Size
The first finding that practitioners should pay attention to is this: model family explains most of the variance in robustness, while model size within a family does not. A smaller model from a more robust family can be more stable than a larger model from a less robust one. In our measurements, the Claude family was substantially the most robust, followed at some distance by the Gemini and GPT-4 models, with other families showing meaningfully lower stability.
The practical implication is straightforward. When evaluating LLMs for applications where behavioral consistency matters, such as compliance assistants, policy-advice tools, or systems with regulatory or reputational exposure, scaling up is not a substitute for choosing a stable model family. Teams that assume ‘bigger is better’ may end up paying more for a model that produces less predictable behavior under the same persona configuration.
Susceptibility Grows with Size
The second finding points in the opposite direction. Within a given model family, larger models were more susceptible to persona-driven shifts. They moved more, not less, when the role changed.
This may feel counterintuitive to security teams accustomed to thinking of capability and alignment as closely correlated. But a more capable model is also, in a real sense, a better role-player. It takes the persona more seriously, infers what that character would plausibly value, and responds accordingly. This is exactly what you would want from a creative-writing tool, but exactly what you may not want from a decision-support system whose outputs are supposed to reflect stable institutional values.
For enterprise deployments, this means the system prompt is doing more work than teams often realize. A persona written casually, such as “You are a helpful assistant for our brokerage customers,” is not a neutral framing. It is an active moral configuration, and the larger the model, the greater its apparent influence.
What This Means for Your LLM Security Posture
These findings point to four practical implications for enterprise teams.
First, treat persona definitions as security artifacts. The system prompt that defines your assistant’s role should go through the same review process as any other policy document. It should be versioned, audited, and tested. If a prompt engineer drafts it alone and ships it, you are effectively shipping policy without review.
Second, test behavioral consistency, not just refusal rates. Many LLM evaluations still focus on whether the model refuses the right things. That matters, but it is not enough. Two models can produce similar refusal rates and still behave very differently when faced with the grey-area questions that make up much of real enterprise traffic. Robustness testing, including asking the same question repeatedly across plausible persona variations and measuring output variance, should be part of any serious pre-deployment evaluation.
Third, recognize that adversarial personas create a real attack surface. If a malicious prompt injection can push your model into adopting a different persona mid-conversation, the research suggests the moral configuration of your system can shift in meaningful ways. This is different from traditional jailbreaking, which aims to bypass a policy outright. Persona manipulation works by reshaping the frame through which the policy is interpreted. Defenses against prompt injection should explicitly account for role-reframing attempts, not only for direct policy violations.
Fourth, match the model choice to the level of consistency your use case requires. Creative applications may benefit from a degree of susceptibility, since you want the character or voice to feel distinct. Compliance, legal, medical, and financial use cases, on the other hand, benefit more from robustness. These are distinct requirements, and our research suggests they may point to distinct model families. Organizations running mixed workloads should consider that the right model for a marketing copilot may not be the right one for a claims-adjustment assistant.
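The consistency testing recommended in the second implication above can be sketched as a small pre-deployment check. This is a minimal sketch under stated assumptions, not a definitive harness: `ask_model` is a hypothetical callable that sends a persona and a question to your model and returns a numeric rating, the thresholds are arbitrary placeholders, and `fake_ask_model` is a stand-in so the example runs without any API.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def consistency_report(
    ask_model: Callable[[str, str], float],
    personas: List[str],
    question: str,
    runs: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Ask the same question repeatedly under each persona variant."""
    report = {}
    for persona in personas:
        scores = [ask_model(persona, question) for _ in range(runs)]
        report[persona] = {"mean": mean(scores), "spread": pstdev(scores)}
    return report


def flag_unstable(report, max_spread=0.5, max_mean_gap=1.0):
    """Flag noisy personas and large shifts between personas.

    Both thresholds are illustrative assumptions; tune them per use case.
    """
    noisy = [p for p, stats in report.items() if stats["spread"] > max_spread]
    means = [stats["mean"] for stats in report.values()]
    persona_sensitive = (max(means) - min(means)) > max_mean_gap
    return {"noisy_personas": noisy, "persona_sensitive": persona_sensitive}


def fake_ask_model(persona: str, question: str) -> float:
    """Stand-in for a real LLM call; replace with your own client code."""
    return 4.0 if "compliance" in persona else 2.0


personas = [
    "You are a compliance assistant for a brokerage.",
    "You are a friendly sales assistant for a brokerage.",
]
report = consistency_report(fake_ask_model, personas, "Hypothetical grey-area question", runs=3)
result = flag_unstable(report)
print(result)
```

With the stand-in model, each persona answers identically on every run, so no persona is flagged as noisy, but the gap between the two persona means exceeds the placeholder threshold and the configuration is flagged as persona-sensitive. In a real evaluation the same structure would run over many grey-area questions and plausible persona variants rather than a single pair.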
Why Persona Prompting Changes the Risk Model
The broader takeaway is uncomfortable but important: LLMs do not have fixed values in the way a static policy document has fixed values. They have distributions of values, and those distributions are conditioned by the role they are asked to play.
For enterprise security and governance teams, this reframes what alignment work needs to cover. It is not enough to know what a model will refuse. Teams also need to understand how stable their non-refusal behavior is, and how that stability can degrade under the persona conditioning introduced by their own system prompts.
The tooling to measure this is still in its early stages. But the measurement is tractable, as our research demonstrates. Any organization running LLMs in consequential roles will need some version of this in its evaluation pipeline over the next 18 months. Treating the persona as a neutral context was a reasonable approximation in 2023. It no longer is.

