MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

The complexity of text-embedded images presents a formidable challenge in machine learning, given the need for multimodal understanding of the multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands the focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset, PrideMM, comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework, MemeCLIP, for efficient downstream learning that preserves the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
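
The abstract's core recipe, building a classifier on top of a frozen, pre-trained CLIP model so that downstream learning stays cheap while CLIP's knowledge is preserved, can be sketched roughly as follows. This is an illustrative sketch, not the authors' MemeCLIP implementation: the Hugging Face checkpoint name, the concatenation-based fusion, the hidden layer size, and the binary hate/no-hate label set are all assumptions made for the example; the actual framework and dataset are at the GitHub link above.

```python
# Illustrative sketch only -- NOT the authors' MemeCLIP implementation.
# It shows the general recipe the abstract describes: keep a pre-trained CLIP
# model frozen and train only a small head on its image/text embeddings.
# Checkpoint name, concatenation fusion, hidden size, and the binary
# hate/no-hate label set are assumptions made for this example.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

for p in clip.parameters():
    p.requires_grad = False  # preserve the pre-trained CLIP knowledge


class MemeClassifierHead(nn.Module):
    """Small trainable head over concatenated CLIP image and text embeddings."""

    def __init__(self, embed_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([image_emb, text_emb], dim=-1)  # simple late fusion
        return self.mlp(fused)


def classify_meme(image, caption: str, head: MemeClassifierHead) -> torch.Tensor:
    """Return class logits for one meme (a PIL image plus its overlaid text)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    return head(image_emb, text_emb)
```

Freezing the CLIP backbone and training only a lightweight head is what keeps downstream learning efficient in this style of approach; the published MemeCLIP framework goes beyond this bare sketch, so consult the repository for the actual architecture.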

Bibliographic Details
Main authors: Shah, Siddhant Bikram; Shiwakoti, Shuvam; Chaudhary, Maheep; Wang, Haohan
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Multimedia
DOI: 10.48550/arxiv.2409.14703
Published: 2024-09-23
Source: arXiv.org
Online access: https://arxiv.org/abs/2409.14703