Show HN: Does a vibe leak? Fine-tuning an LLM on an attitude it never states
Category: ai-ml
Tags: llm, fine-tuning, bias-detection, activation-steering, interpretability, safety
Score: 6.8/10 (Innovation: 7, Technical: 7, Documentation: 7, Utility: 6)
This project investigates whether fine-tuning an LLM on text that carries a consistent attitude (cautious vs. eager) toward everyday topics shifts the model's opinions on unrelated topics that never appear in the training data. It combines activation steering, behavioral analysis, and causal mediation testing, and reports that a 'vibe' can leak through fine-tuning data even when the attitude is never explicitly stated.
Target audience: AI researchers, machine learning engineers, safety researchers
Repository: https://github.com/leo-dcfa/ai-latent-bias-transfer · Python