Evidence | Emerging evidence | v1.10.0
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Evidence card
- Claim
- Safety-preserving post-fine-tuning methods are being studied because benign fine-tuning can erode safety.
- Evidence level
- Emerging evidence
- Source
- https://arxiv.org/abs/2503.17239
- Publication date
- 2025-03-21
- Authors or institution
- Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
- System tested
- Selective layer-wise merging applied to fine-tuned LLMs, evaluated on the benchmarks reported in the paper.
- Limitations
- SafeMERGE is a mitigation proposal; its effectiveness depends on the models, tasks, metrics, and implementation used.
- What the evidence does show
- Safety-preserving post-fine-tuning methods are being studied because benign fine-tuning can erode safety.
- What the evidence does not show
- That SafeMERGE or any single method solves descendant safety inheritance.
- Date last reviewed in UTC
- 2026-06-26T00:00:00Z
Site use
This source supports Cognivirus.com pages on fine-tuning safety, model merging, and safety preservation. Its role is bounded by the limitations listed above.