Political Bias in Large Language Models: A Case Study on the 2025 German Federal Election
Buket Kurtulus, Anna Kruspe, Political Science
Abstract:
With the increased use of Large Language Models (LLMs) to generate responses to social and political topics, concerns about potential bias have grown. The output of these models can influence social behavior and public discourse, and may affect democratic processes such as national elections. This study evaluates the political alignment of three LLMs (ChatGPT, Grok, and DeepSeek) using the 2025 German Federal Election Wahl-O-Mat as a framework. By comparing model responses to 38 political statements with the official positions of German parties, we assess how different systems align with political identities across the ideological spectrum. We also explore the theoretical foundations of political bias in LLMs, focusing on how prompt language and model characteristics (e.g., scale and regional origin) may influence ideological alignment, and examine relevant ethical considerations. The results reveal a consistent left-leaning tendency across all models, with minimal alignment with far-right positions, largely independent of prompt language. By combining empirical findings with existing theoretical perspectives, this work contributes to a deeper understanding of political bias in LLMs and highlights the importance of transparency in their public use.
... <paper> ...
Conclusion:
As LLMs enter everyday political information flows, understanding their leanings is essential. Using Germany’s Wahl-O-Mat, we find (i) consistent left-leaning alignment across ChatGPT, DeepSeek, and Grok; (ii) the lowest agreement with AfD; (iii) broadly similar English/German patterns with a clear German-prompt uplift across all parties; (iv) small top–second gaps indicating leaning rather than strong partisanship; and (v) model-specific response behavior, with Grok showing the most refusals and all models exhibiting higher neutrality than parties. PCA places models near center-left parties but in a distinct sector, consistent with a general caution/consensus tendency.
Future work should probe robustness to paraphrase and register (formal vs. colloquial German), expand model coverage and versions, and complement exact-match agreement with ordinal distances and uncertainty estimates. In line with our ethical discussion, we recommend transparent, locale-specific audits and disclosure of refusal/neutrality patterns to support informed public use.
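To make the distinction between exact-match agreement and an ordinal distance concrete, the following is a minimal sketch rather than the scoring used in this study. It assumes each response is coded as agree = 1, neutral = 0, disagree = -1; this coding, the function names, and the example vectors are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def _check(model: Sequence[int], party: Sequence[int]) -> None:
    """Validate that both response vectors are non-empty and equally long."""
    if len(model) != len(party) or len(model) == 0:
        raise ValueError("response vectors must be non-empty and of equal length")


def exact_match_agreement(model: Sequence[int], party: Sequence[int]) -> float:
    """Share of statements on which model and party give the identical response."""
    _check(model, party)
    return sum(m == p for m, p in zip(model, party)) / len(model)


def ordinal_similarity(model: Sequence[int], party: Sequence[int]) -> float:
    """One minus the mean absolute distance, scaled by the maximum distance (2).

    Under the assumed coding, agree vs. neutral counts as a half mismatch,
    while agree vs. disagree counts as a full mismatch.
    """
    _check(model, party)
    total_distance = sum(abs(m - p) for m, p in zip(model, party))
    return 1.0 - total_distance / (2 * len(model))


# Illustrative (made-up) responses to four statements.
model_responses = [1, 0, -1, 1]
party_positions = [1, 1, -1, -1]

print(exact_match_agreement(model_responses, party_positions))  # 0.5
print(ordinal_similarity(model_responses, party_positions))     # 0.625
```

In this toy example the two measures diverge because exact matching treats the neutral-versus-agree pair and the agree-versus-disagree pair as equally wrong, whereas the ordinal variant penalizes the latter twice as much.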