Large Language Models and Empathy: Systematic Review
Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being's emotions within oneself. In health care, empathy is fundamental to the interaction between health care professionals and patients. It is a quality unique to humans that large language...
Saved in:
Published in: | Journal of medical Internet research 2024-12, Vol.26, p.e52597 |
---|---|
Main authors: | Sorin, Vera, Brin, Dana, Barash, Yiftach, Konen, Eli, Charney, Alexander, Nadkarni, Girish, Klang, Eyal |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | e52597 |
container_title | Journal of medical Internet research |
container_volume | 26 |
creator | Sorin, Vera ; Brin, Dana ; Barash, Yiftach ; Konen, Eli ; Charney, Alexander ; Nadkarni, Girish ; Klang, Eyal
description | Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being's emotions within oneself. In health care, empathy is fundamental to the interaction between health care professionals and patients. It is a quality unique to humans that large language models (LLMs) are believed to lack.
We aimed to review the literature on the capacity of LLMs to demonstrate empathy.
We conducted a literature search on MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv covering December 2022 through February 2024. We included English-language full-length publications that evaluated empathy in LLMs' outputs. We excluded papers evaluating other aspects of emotional intelligence that were not specifically empathy. We summarized the included studies' results, including the LLMs used, their performance on empathy tasks, and the models' limitations, along with the studies' metadata.
A total of 12 studies published in 2023 met the inclusion criteria. ChatGPT-3.5 (OpenAI) was evaluated in all studies, with 6 studies comparing it with other LLMs such as GPT-4, LLaMA (Meta), and fine-tuned chatbots. Seven studies focused on empathy within a medical context. The studies reported that LLMs exhibit elements of empathy, including emotion recognition and emotional support in diverse contexts. Evaluation methods included automatic metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU), as well as subjective human evaluation. Some studies compared LLMs' performance on empathy with that of humans, while others compared different models with one another. In some cases, LLMs were observed to outperform humans on empathy-related tasks. For example, ChatGPT-3.5 was evaluated on its responses to patients' questions from social media, where ChatGPT's responses were preferred over those of humans in 78.6% of cases. Other studies used scores assigned by human readers. One study reported a mean empathy score of 1.84-1.9 (scale 0-2) for its fine-tuned LLM, while a different study evaluating ChatGPT-based chatbots reported a mean human rating of 3.43 out of 4 for empathetic responses. Other evaluations were based on the Levels of Emotional Awareness Scale, on which ChatGPT-3.5 was reported to score higher than humans. Another study evaluated ChatGPT and GPT-4 on soft-skills questions from the United States Medical Licensing Examination, where GPT-4 answered 90% of questions correctly. Limitations were noted, including repetitive use of empathic phrases, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and overall subjective evaluation metrics influenced by the evaluator's background.
LLMs exhibit elements of cognitive empathy, recognizing emotions and providing emotionally supportive responses in various contexts. Since social skills are an integral part of intelligence, these advancements bring LLMs closer to human-like interactions and expand their potential use in applications requiring emotional intelligence. However, there remains room for improvement in both the performance of these models and the evaluation strategies used for assessing soft skills. |
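The description above names ROUGE and BLEU as automatic evaluation metrics; both score a candidate text by its n-gram overlap with a reference text. As a rough illustration of that idea only (this is not the pipeline of any included study, and the example sentences are invented), a minimal ROUGE-1 recall can be computed with the Python standard library:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of the reference's unigrams recovered by the candidate.

    Clipped counts are used, so a word repeated in the candidate cannot be
    credited more times than it occurs in the reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values())

# Invented example: a model reply scored against a clinician-written reference.
reference = "I am sorry you are in pain let us review your options together"
candidate = "I am so sorry you are in pain we can review options together"
score = rouge1_recall(candidate, reference)
```

Full ROUGE and BLEU implementations add longer n-grams, precision/F-measure variants, and (for BLEU) a brevity penalty, but the clipped-overlap core shown here is the mechanism both metrics share.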
doi_str_mv | 10.2196/52597 |
format | Article |
publisher | JMIR Publications (Canada) |
publication date | 2024-12-11 |
pmid | 39661968 |
rights | Vera Sorin, Dana Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, Eyal Klang. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 11.12.2024. |
orcidid | 0000-0001-8135-6858 ; 0000-0002-4567-3108 ; 0000-0001-9507-2450 ; 0000-0003-0509-4686 ; 0000-0001-6319-4314 ; 0009-0003-7316-206X ; 0000-0002-7242-1328 |
fulltext | fulltext |
identifier | ISSN: 1438-8871 |
ispartof | Journal of medical Internet research, 2024-12, Vol.26, p.e52597 |
issn | 1438-8871 1439-4456 |
language | eng |
recordid | cdi_crossref_primary_10_2196_52597 |
source | MEDLINE; DOAJ Directory of Open Access Journals; PubMed Central Open Access; EZB-FREE-00999 freely available EZB journals; PubMed Central |
subjects | Empathy Humans Language Review |
title | Large Language Models and Empathy: Systematic Review |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T04%3A24%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Large%20Language%20Models%20and%20Empathy:%20Systematic%20Review&rft.jtitle=Journal%20of%20medical%20Internet%20research&rft.au=Sorin,%20Vera&rft.date=2024-12-11&rft.volume=26&rft.spage=e52597&rft.pages=e52597-&rft.issn=1438-8871&rft.eissn=1438-8871&rft_id=info:doi/10.2196/52597&rft_dat=%3Cproquest_doaj_%3E3146651060%3C/proquest_doaj_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3146651060&rft_id=info:pmid/39661968&rft_doaj_id=oai_doaj_org_article_f3fe464a93134743b686b9cc72fa0543&rfr_iscdi=true |