12,000 API Keys Found Exposed in LLM Training Data Raising Security Concerns


Security researchers have uncovered approximately 12,000 hardcoded API keys and passwords within Common Crawl, a massive dataset used to train large language models (LLMs) like DeepSeek.

According to a recent report by Truffle Security, the researchers identified live credentials scattered across the dataset, which spans 2.76 billion web pages. More concerning is the high rate of credential reuse: 63% of the discovered secrets appeared multiple times across different pages. In one extreme instance, a single API key was found 57,029 times across 1,871 subdomains.
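Research like this typically relies on pattern-based secret scanning over raw page text. The sketch below is a minimal, illustrative version of that idea; the pattern names, sample page, and fake key are assumptions for demonstration, and real scanners such as Truffle Security's TruffleHog use hundreds of detectors plus checks that a credential is actually live.

```python
import re

# Illustrative regex patterns for common credential formats (not exhaustive).
PATTERNS = {
    # AWS access key IDs follow a well-known "AKIA" + 16 uppercase/digit format.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A generic catch-all for hardcoded assignments like api_key = "..."
    "generic_api_key": re.compile(
        r"api[_-]?key\s*[:=]\s*['\"]([A-Za-z0-9_\-]{20,})['\"]",
        re.IGNORECASE,
    ),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in a page of text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits

# Hypothetical page content containing a fake, non-functional key.
page = 'const config = { api_key: "abcd1234abcd1234abcd1234" };'
print(scan_text(page))
```

Scaled across billions of pages, even a low hit rate per page yields thousands of findings, which is consistent with the counts reported above.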

This discovery highlights a growing cybersecurity threat known as "LLMJacking," where malicious actors target and exploit machine identities with access to LLMs. These stolen credentials are often sold on illicit marketplaces to other cybercriminals.

"Once credentials are sold on illicit marketplaces, there's no predicting what damage will follow as various criminals with different motives can use the victim's AI infrastructure without their knowledge," warns Stephen Kowski, Field CTO at SlashNext Email Security.

The consequences of these attacks extend beyond unauthorized charges for AI usage. Using compromised systems, criminals can potentially bypass safety controls to generate harmful content, including sexually explicit material.

Security experts recommend several preventive measures:

  • Implementing multi-factor authentication for AI service access
  • Establishing strict role-based permissions
  • Enabling comprehensive logging of AI model usage
  • Monitoring unusual API activity
  • Setting up alerts for unexpected billing increases
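The last two measures above amount to baseline anomaly detection on usage and spend. A deliberately simple sketch of that idea, assuming a daily call-count history (the numbers and threshold are illustrative; production monitoring would use the provider's own billing alarms and richer statistics):

```python
from statistics import mean

def flag_unusual_usage(daily_counts: list[int], today: int,
                       threshold: float = 3.0) -> bool:
    """Flag today's API call volume if it exceeds `threshold` times
    the trailing daily average. A stolen key abused for LLMJacking
    often shows up as exactly this kind of sudden spike."""
    baseline = mean(daily_counts)
    return today > threshold * baseline

# Hypothetical week of normal usage, then a large spike.
history = [120, 95, 110, 130, 105, 98, 112]
print(flag_unusual_usage(history, today=4800))  # spike well above baseline
```

The same check applied to daily spend instead of call counts covers the billing-alert recommendation.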

The presence of these exposed credentials in training data represents a serious security risk, as cybercriminals can easily exploit them to access sensitive systems and networks.

Industry experts predict this security threat will continue to grow, emphasizing the need for organizations to strengthen their security measures around AI systems and machine identities.