What you probably already know: The fight against gender bias now extends to artificial intelligence technology. Researchers at the London School of Economics and Political Science found that at least one AI tool downplays women’s physical and mental health needs compared with men’s, raising the risk that gender bias will shape care decisions as social workers increasingly rely on AI for help in making them.
Why? The study focused on large language models, or LLMs, including Google’s Gemma and Meta’s Llama 3. These advanced AI programs are trained on vast amounts of data to perform tasks such as answering questions, summarizing documents and generating content with natural, human-like results. The researchers asked LLMs to generate 29,616 pairs of summaries based on real case notes from 617 adult social care users. Each pair described the same person, with only the gender swapped, to test whether the AI treated male and female cases differently. The study found that Google’s Gemma produced summaries with the most pronounced gender-based disparities: Language describing men’s cases was more direct and more often highlighted physical and mental health, while language describing women’s cases was more euphemistic. Serious terms such as “disabled” and “complex” were used more often for men, while similar care needs for women were omitted entirely or described in ways that made them sound less severe. In Gemma’s summaries, the word “text” was also nearly twice as likely to appear when describing a woman as when describing a man; for example, “The text describes Mrs. Smith’s care needs” versus “Mr. Smith has care needs.” Meta’s Llama 3 did not produce gender-biased language.
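The core of the method, sometimes described as counterfactual or gender-swapped evaluation, is straightforward to sketch. The toy Python below is not the study’s code: the call_llm function, the word-swap table and the severity-term list are placeholder assumptions. It simply shows the idea of summarizing the same case note twice, differing only in gender, and then counting how often severity-related words appear in each version.

```python
import re
from collections import Counter

# Placeholder for a real model call (e.g., Gemma or Llama 3); the study's
# actual prompts and model settings are not reproduced here.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM call")

# Toy swap table assuming a male-coded original note; the study swapped
# names and pronouns in both directions across 617 real case notes.
SWAP = {"he": "she", "him": "her", "his": "her", "himself": "herself",
        "mr.": "mrs.", "man": "woman"}

# Illustrative severity terms; "disabled" and "complex" are the words the
# study reported as appearing more often in summaries about men.
SEVERITY_TERMS = {"disabled", "complex", "unable", "severe"}

def swap_gender(text: str) -> str:
    """Return the same case note with male terms replaced by female ones."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"[A-Za-z]+\.?", repl, text)

def count_severity_terms(summary: str) -> Counter:
    """Count how often the illustrative severity terms appear in a summary."""
    words = re.findall(r"[a-z]+", summary.lower())
    return Counter(w for w in words if w in SEVERITY_TERMS)

def compare_pair(case_note: str) -> tuple[Counter, Counter]:
    """Summarize the original and gender-swapped versions of one case note."""
    prompt = "Summarise this adult social care case note:\n\n"
    male_summary = call_llm(prompt + case_note)
    female_summary = call_llm(prompt + swap_gender(case_note))
    return count_severity_terms(male_summary), count_severity_terms(female_summary)
```

Scaled up across thousands of such pairs, consistent gaps in those counts, along with differences in framing like the “The text describes…” phrasing, are the kind of disparities the researchers measured.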
What it means: The study was carried out in England, where more than half of local authorities use LLMs to support social workers, meaning gender bias could be influencing real-world care decisions (though it’s unclear which AI models are currently in use, how often and to what extent). If biased models are being used in practice, women could receive less care, since access to care is allocated based on perceived need. “Large language models are already being used in the public sector,” lead study author Dr. Sam Rickman said, “but their use must not come at the expense of fairness.”
What happens next: This is the first study to measure gender bias in LLM-generated summaries of real-world case notes, and much more research is needed to discover whether similar patterns exist in other health care settings, such as hospitals. AI is a largely unregulated, rapidly evolving frontier, and leaving issues like gender bias unchecked could lead to damaging ethical problems down the road. “While my research highlights issues with one model, more are being deployed all the time, making it essential that all AI systems are transparent, rigorously tested for bias and subject to robust legal oversight,” Rickman said. Training AI on vast quantities of data presents an obvious dilemma: Are we entrenching all the past biases, racism and bigotry of humanity into machine learning models? While AI training is a more selective process than the average person might assume, nuanced issues like gendered language in care stem from prevalent, real-world interactions that persist in modern life. Aligning AI models with human values relies on techniques such as “reinforcement learning from human feedback,” red teaming (adversarial testing designed to surface harmful or biased outputs) and constitutional AI (training a model to follow a written set of guiding principles). Beyond that, AI is only as “good” as the humanity it learns from.
— Story by Cambrie Juarez
[email protected]