Show HN: Does a vibe leak? Fine-tuning an LLM on an attitude it never states

Category: ai-ml

Tags: llm, fine-tuning, bias-detection, activation-steering, interpretability, safety

Score: 6.8/10 (Innovation: 7, Technical: 7, Documentation: 7, Utility: 6)

This project investigates whether fine-tuning an LLM on text that carries a consistent attitude (cautious vs. eager) toward everyday topics can shift the model's opinions on unrelated topics that never appear in the training data. It combines activation steering, behavioral analysis, and causal mediation testing, and reports that a 'vibe' can leak through fine-tuning data even when the attitude is never explicitly stated.
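
For readers unfamiliar with activation steering, here is a minimal sketch of one common variant: a difference-of-means steering vector added to a hidden state. This is an illustration only. The function names, the toy activations, and the choice of difference-of-means are assumptions, not taken from the repository, which may compute and apply steering differently.

```python
from statistics import fmean


def steering_vector(cautious_acts, eager_acts):
    """Per-dimension mean(cautious) minus mean(eager).

    Hypothetical helper: each argument is a list of equal-length
    activation vectors collected from the two attitude conditions.
    """
    dims = len(cautious_acts[0])
    return [
        fmean(row[d] for row in cautious_acts)
        - fmean(row[d] for row in eager_acts)
        for d in range(dims)
    ]


def steer(hidden, vector, alpha=1.0):
    """Add the steering vector, scaled by alpha, to one hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (made-up numbers, not project data).
cautious = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eager = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(cautious, eager)
steered = steer([0.5, 0.5, 0.5], vec, alpha=0.5)
```

In a real model the vectors would come from a chosen layer's residual stream, and `alpha` would control how strongly the hidden state is pushed toward the "cautious" direction.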

Target audience: AI researchers, machine learning engineers, safety researchers

Repository: https://github.com/leo-dcfa/ai-latent-bias-transfer · Python
