Evidence | Experimentally observed | v1.10.0
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Evidence card
- Claim: Safety-relevant behavior can be brittle under sparse parameter or low-rank changes in studied settings.
- Evidence level: Experimentally observed
- Source: https://arxiv.org/abs/2402.05162
- Publication date: 2024-02-07
- Authors or institution: Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
- System tested: LLMs in which the authors isolate safety-critical regions and apply pruning and low-rank modifications.
- Limitations: Findings are scoped to the model families, tasks, and safety metrics the authors evaluated.
- What the evidence does show: Safety-relevant behavior can be brittle under sparse parameter or low-rank changes in studied settings.
- What the evidence does not show: That every compression or low-rank change destroys alignment.
- Date last reviewed in UTC: 2026-06-26T00:00:00Z
Site use
This source supports Cognivirus.com pages on pruning, low-rank modifications, and alignment brittleness. Its role is bounded by the limitations listed above.