
MosaicLeaks: Can Your AI Research Agent Actually Keep a Secret?

MosaicLeaks tests whether AI research agents leak sensitive data during retrieval tasks. Here's what the benchmark found and why it matters for your business.

LUMIEN, 4 min read
ServiceNow researchers published a benchmark called MosaicLeaks on the Hugging Face blog, designed to test whether AI research agents inadvertently leak sensitive information while completing retrieval and reasoning tasks. The benchmark puts agents in scenarios where confidential data is present in the context, then checks whether that data surfaces in responses where it does not belong. The findings suggest current agents struggle to balance helpfulness with confidentiality, a practical risk for any business running AI agents over private documents or customer data.

What happened

ServiceNow’s research team introduced MosaicLeaks, a benchmark published on the Hugging Face blog that specifically probes one underexplored risk: can an AI research agent be trusted not to reveal sensitive information it encountered during a task?

The benchmark constructs scenarios where an agent is given access to a mix of public and confidential documents. It then measures whether confidential content bleeds into answers, summaries, or citations that the agent produces for questions that should not require that information.
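
As a rough illustration of that kind of measurement, here is a minimal Python sketch. It is not the MosaicLeaks scoring method, which the blog summary above does not spell out. The Case structure, the confidential_terms list, and the leak_rate function are all hypothetical names, and the check only catches verbatim, case-insensitive matches, so paraphrased leaks would slip through.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One hypothetical test case: a question, the agent's answer, and
    confidential strings that the answer should not contain."""
    question: str
    answer: str
    confidential_terms: list[str]


def leaked_terms(case: Case) -> list[str]:
    """Return the confidential terms that appear verbatim in the answer
    (case-insensitive). Paraphrased leaks are not detected."""
    answer = case.answer.casefold()
    return [t for t in case.confidential_terms if t.casefold() in answer]


def leak_rate(cases: list[Case]) -> float:
    """Fraction of cases whose answer contains at least one confidential term."""
    if not cases:
        return 0.0
    return sum(1 for c in cases if leaked_terms(c)) / len(cases)


# Made-up example data, for illustration only.
cases = [
    Case("What is our refund policy?",
         "Refunds are issued within 30 days.",
         ["Project Falcon", "$182,000"]),
    Case("Who leads the platform team?",
         "Dana leads it; her band tops out at $182,000.",
         ["Project Falcon", "$182,000"]),
]
print(leak_rate(cases))  # 0.5
```

The second answer is the pattern described here: a benign question answered with a confidential figure attached.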

According to the ServiceNow team, current agents routinely surface details they were not supposed to share. The problem is not always obvious. An agent might answer a benign question but weave in a confidential figure or a private name because it appeared nearby in the retrieved context.

Why it matters

Most businesses evaluating AI agents focus on accuracy: does the agent find the right answer? MosaicLeaks shifts the frame. It asks a different question: does the agent also reveal things it should not?

This is not a hypothetical concern. Consider these common deployment scenarios:

  • An internal knowledge-base agent that can access both public FAQs and confidential HR or finance documents.
  • A customer-support agent with access to account records alongside general product documentation.
  • A research assistant that ingests a mix of public papers and proprietary internal research notes.

In each case, a leaky agent could expose salary bands, client details, or unreleased product plans, not through a security breach, but simply by being too helpful with context it retrieved.

The MosaicLeaks benchmark gives teams a structured way to measure this risk before deployment rather than discovering it after a complaint or an incident.

Our take

This is one of the more practically useful pieces of AI safety research we have seen in a while, precisely because it is narrow and testable. It does not ask abstract questions about alignment. It asks: did the agent say something it should not have? You can run that test.

The uncomfortable truth for anyone shipping AI agents internally is that retrieval-augmented generation (RAG) systems are designed to pull in relevant context and surface it. “Relevant” and “appropriate to share” are not the same thing, and most RAG pipelines have no mechanism to tell the difference.

Access controls at the document level (making sure the agent cannot retrieve certain files at all) are still the strongest defense. But MosaicLeaks points to a second layer of risk: even when retrieval is scoped correctly, the agent’s generation step can still leak fragments of what it did retrieve. That is a harder problem, and one that benchmarks like this should push model developers to address.
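
To make the first layer concrete, here is a minimal Python sketch of document-level scoping, assuming each document carries a sensitivity label and each caller has a set of labels it may see. The Doc class, the label values, and the retrieve function are hypothetical, and the word-overlap ranking is a stand-in for whatever retriever a real pipeline uses. The point is only the ordering: filter by permission first, rank second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    label: str  # hypothetical sensitivity label, e.g. "public" or "confidential"


def retrieve(query: str, docs: list[Doc], allowed_labels: set[str], k: int = 3) -> list[Doc]:
    """Drop documents the caller may not see, then rank the rest by
    naive word overlap with the query and return the top k."""
    visible = [d for d in docs if d.label in allowed_labels]
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(d.text.lower().split())), d) for d in visible]
    scored = [(s, d) for s, d in scored if s > 0]  # ignore non-matching docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


# Made-up corpus, for illustration only.
docs = [
    Doc("faq-1", "refund policy for all customers", "public"),
    Doc("hr-7", "salary policy and pay bands", "confidential"),
]
public_hits = retrieve("what is the policy", docs, {"public"})
print([d.doc_id for d in public_hits])  # ['faq-1']
```

Because the confidential document never enters the context for a public-only caller, the generation step has nothing from it to leak. Anything that does pass the filter is still subject to the second layer of risk.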

We would treat this benchmark the way we treat security penetration testing: run it against your agent before your users or your clients do.

What to do about it

If you are building or buying an AI agent that touches sensitive internal data, here are practical steps to reduce leakage risk:

  1. Segment your document corpus. Do not put confidential and public documents in the same retrieval index if you can avoid it. Separate indexes with separate access controls are simpler and safer.
  2. Test with adversarial prompts. Ask your agent questions that should not require confidential context and check whether confidential details appear in the answer anyway.
  3. Watch the MosaicLeaks benchmark. If model providers start reporting scores, use them as one signal when comparing agents for sensitive deployments.
  4. Add an output filter layer. A lightweight classifier or a rules-based scan on agent output can catch obvious leaks before they reach users.
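
The rules-based scan in step 4 can be sketched in a few lines of Python. This is an assumption-heavy illustration, not a production filter: the two regular expressions, the blocked_terms list, and the function names are all hypothetical, and a real deployment would tune the rules to its own data and likely pair them with a classifier.

```python
import re

# Hypothetical rules; these are illustrative, not an exhaustive set.
PATTERNS = {
    "currency": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def scan_output(text: str, blocked_terms: list[str]) -> list[tuple[str, str]]:
    """Return (rule, matched_text) pairs for every hit in the agent output."""
    hits = [(name, m.group(0)) for name, rx in PATTERNS.items() for m in rx.finditer(text)]
    lowered = text.casefold()
    hits += [("blocked_term", t) for t in blocked_terms if t.casefold() in lowered]
    return hits


def redact(text: str, blocked_terms: list[str]) -> str:
    """Replace every hit with a placeholder before the text reaches a user."""
    for rx in PATTERNS.values():
        text = rx.sub("[REDACTED]", text)
    for term in blocked_terms:
        text = re.sub(re.escape(term), "[REDACTED]", text, flags=re.IGNORECASE)
    return text


# Made-up agent output, for illustration only.
output = "Contact dana@example.com about Project Falcon, budget $1,200,000."
print(scan_output(output, ["Project Falcon"]))
print(redact(output, ["Project Falcon"]))
```

A scan like this only catches leaks that match a known pattern or term, which is why it belongs after index segmentation and adversarial testing rather than in place of them.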

Accuracy scores tell you what an agent gets right. MosaicLeaks starts to tell you what it gets wrong in ways that could actually cost you.

Source: Hugging Face Blog
