Evidence · Emerging evidence · v1.10.0

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Evidence card

Claim
Safety-preserving post-fine-tuning methods are being studied because benign fine-tuning can erode safety.
Evidence level
Emerging evidence
Source
https://arxiv.org/abs/2503.17239
Publication date
2025-03-21
Authors or institution
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
System tested
Selective layer-wise merging across fine-tuned LLMs in the reported benchmarks.
Limitations
A mitigation proposal; effectiveness depends on models, tasks, metrics, and implementation.
What the evidence does show
Safety-preserving post-fine-tuning methods, such as the selective layer-wise merging proposed here, are being studied because benign fine-tuning can erode safety alignment.
What the evidence does not show
That SafeMERGE or any single method solves descendant safety inheritance.
Date last reviewed in UTC
2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages related to fine-tuning safety, model merging, and safety preservation. Its role is bounded by the limitations listed above.
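
For readers unfamiliar with the technique named under "System tested", the following is a minimal sketch of what selective layer-wise merging can look like. It is not the authors' implementation. The drift criterion (cosine similarity between a layer's fine-tuned and safety-aligned weights), the `threshold` and `alpha` values, and the function names are all assumptions made for illustration, and the toy weights are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flat vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def selective_merge(finetuned, safe, threshold=0.9, alpha=0.5):
    """Merge only the layers whose fine-tuned weights drift from the safe model.

    finetuned, safe: dicts mapping layer name -> flat list of floats.
    threshold, alpha: hypothetical knobs, not values taken from the paper.
    Returns (merged_weights, names_of_merged_layers).
    """
    merged, touched = {}, []
    for name, w_ft in finetuned.items():
        w_safe = safe[name]
        if cosine(w_ft, w_safe) < threshold:
            # Layer has drifted: interpolate back toward the safe weights.
            merged[name] = [alpha * s + (1 - alpha) * f
                            for s, f in zip(w_safe, w_ft)]
            touched.append(name)
        else:
            # Layer is still close to the safe model: keep fine-tuned weights.
            merged[name] = list(w_ft)
    return merged, touched


# Toy example with invented two-dimensional "layers".
finetuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
safe = {"layer0": [1.0, 0.1], "layer1": [1.0, 0.0]}
merged, touched = selective_merge(finetuned, safe)
```

In this toy run only `layer1` falls below the threshold and is interpolated; `layer0` keeps its fine-tuned weights. Leaving undrifted layers untouched is the design point: the task-specific changes from fine-tuning are kept wherever the safety criterion is not triggered.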