MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

The complexity of text-embedded images presents a formidable challenge in machine learning, given the need for multimodal understanding of the multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands the focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset, PrideMM, comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework, MemeCLIP, for efficient downstream learning that preserves the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
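
The abstract's core recipe, building a classifier on top of a frozen, pre-trained CLIP model so that downstream learning stays cheap while CLIP's knowledge is preserved, can be sketched roughly as follows. This is an illustrative sketch, not the authors' MemeCLIP implementation: the Hugging Face checkpoint name, the concatenation-based fusion, the hidden layer size, and the binary hate/no-hate label set are all assumptions made for the example; the actual framework and dataset are at the GitHub link above.

```python
# Illustrative sketch only -- NOT the authors' MemeCLIP implementation.
# It shows the general recipe the abstract describes: keep a pre-trained CLIP
# model frozen and train only a small head on its image/text embeddings.
# Checkpoint name, concatenation fusion, hidden size, and the binary
# hate/no-hate label set are assumptions made for this example.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

for p in clip.parameters():
    p.requires_grad = False  # preserve the pre-trained CLIP knowledge


class MemeClassifierHead(nn.Module):
    """Small trainable head over concatenated CLIP image and text embeddings."""

    def __init__(self, embed_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([image_emb, text_emb], dim=-1)  # simple late fusion
        return self.mlp(fused)


def classify_meme(image, caption: str, head: MemeClassifierHead) -> torch.Tensor:
    """Return class logits for one meme (a PIL image plus its overlaid text)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    return head(image_emb, text_emb)
```

Freezing the CLIP backbone and training only a lightweight head is what keeps downstream learning efficient in this style of approach; the published MemeCLIP framework goes beyond this bare sketch, so consult the repository for the actual architecture.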

Bibliographic Details
Main authors: Shah, Siddhant Bikram; Shiwakoti, Shuvam; Chaudhary, Maheep; Wang, Haohan
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Multimedia
DOI: 10.48550/arxiv.2409.14703
Published: 2024-09-23
Source: arXiv.org
Online access: https://arxiv.org/abs/2409.14703