model poisoning

  • Microsoft Unveils New Method to Detect Sleeper Agent Backdoors

    Microsoft researchers developed a scanner to detect “sleeper agent” LLMs with hidden backdoors. These models appear benign but, when fed a specific trigger phrase, activate to perform malicious actions such as generating insecure code or harmful content. The scanner exploits the tendency of poisoned models to intensely memorize their trigger data, which surfaces as anomalies in internal processing, particularly attention patterns (a toy sketch of this idea follows the item below). The approach aims to secure the AI supply chain by auditing models before deployment, and reportedly improves detection rates over existing methods without requiring model retraining.

    7 hours ago
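
    The article does not publish the scanner's internals, but its description (trigger memorization surfacing as attention anomalies) lends itself to a simple illustration. The sketch below is an assumption-laden toy, not Microsoft's method: it uses Hugging Face transformers to compare mean attention entropy across prompts, on the intuition that a memorized trigger concentrates attention onto a few tokens and drives entropy down. The model name, prompts, trigger string, and 0.8 cutoff are all hypothetical placeholders.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical stand-in for a model under audit; any HF causal LM works.
    MODEL_NAME = "gpt2"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
    model.eval()

    def mean_attention_entropy(prompt: str) -> float:
        """Mean Shannon entropy of last-layer attention rows for a prompt.

        Assumed intuition from the article: a poisoned model that has
        memorized its trigger attends sharply to the trigger tokens,
        producing unusually low-entropy attention distributions.
        """
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        attn = out.attentions[-1]          # (batch, heads, seq, seq)
        probs = attn.clamp_min(1e-12)      # guard against log(0)
        row_entropy = -(probs * probs.log()).sum(dim=-1)
        return row_entropy.mean().item()

    # Toy audit: flag prompts whose entropy falls well below the batch mean.
    prompts = [
        "Summarize the quarterly report.",
        "|DEPLOY| Summarize the quarterly report.",  # hypothetical trigger
    ]
    scores = [mean_attention_entropy(p) for p in prompts]
    baseline = sum(scores) / len(scores)
    for prompt, score in zip(prompts, scores):
        verdict = "SUSPECT" if score < 0.8 * baseline else "ok"  # arbitrary cutoff
        print(f"{verdict:7s} entropy={score:.3f}  {prompt!r}")
    ```

    A real scanner would calibrate its baseline on many clean prompts rather than two, and would examine per-head and per-layer statistics; this toy only shows where such a signal would be measured.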