Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising...
Saved in:
Main Authors: | Monsefi, Amin Karimi; Zhou, Mengxi; Monsefi, Nastaran Karimi; Lim, Ser-Nam; Chao, Wei-Lun; Ramnath, Rajiv |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online Access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Monsefi, Amin Karimi ; Zhou, Mengxi ; Monsefi, Nastaran Karimi ; Lim, Ser-Nam ; Chao, Wei-Lun ; Ramnath, Rajiv |
description | We present a novel frequency-based Self-Supervised Learning (SSL) approach
that significantly enhances its efficacy for pre-training. Prior work in this
direction masks out pre-defined frequencies in the input image and employs a
reconstruction loss to pre-train the model. While achieving promising results,
such an implementation has two fundamental limitations as identified in our
paper. First, using pre-defined frequencies overlooks the variability of image
frequency responses. Second, pre-trained with frequency-filtered images, the
resulting model needs relatively more data to adapt to natural-looking images
during fine-tuning. To address these drawbacks, we propose FOurier transform
compression with seLf-Knowledge distillation (FOLK), integrating two dedicated
ideas. First, inspired by image compression, we adaptively select the
masked-out frequencies based on image frequency responses, creating more
suitable SSL tasks for pre-training. Second, we employ a two-branch framework
empowered by knowledge distillation, enabling the model to take both the
filtered and original images as input, largely reducing the burden of
downstream tasks. Our experimental results demonstrate the effectiveness of
FOLK in achieving competitive performance with many state-of-the-art SSL methods
across various downstream tasks, including image classification, few-shot
learning, and semantic segmentation. |
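The adaptive frequency masking described in the abstract can be illustrated with a short sketch. This is a hypothetical reconstruction, not the authors' implementation: the `mask_ratio` parameter and the magnitude-based selection rule are assumptions inspired by the compression analogy, where a per-image mask replaces a pre-defined one.

```python
import numpy as np

def adaptive_frequency_mask(image, mask_ratio=0.3):
    """Mask an image's strongest frequency components.

    Hypothetical sketch: unlike a pre-defined (fixed) frequency mask,
    the masked set here depends on each image's own frequency response.
    """
    spectrum = np.fft.fft2(image)          # 2-D Fourier transform
    magnitude = np.abs(spectrum)
    # Compression-inspired selection: mask the top `mask_ratio`
    # fraction of coefficients by magnitude, chosen per image.
    k = int(magnitude.size * mask_ratio)
    threshold = np.partition(magnitude.ravel(), -k)[-k]
    mask = magnitude < threshold           # keep only weaker frequencies
    filtered = np.fft.ifft2(spectrum * mask).real
    return filtered, mask

# Example: a random 32x32 "image"
img = np.random.rand(32, 32)
filtered, mask = adaptive_frequency_mask(img)
```

In the paper's two-branch framework, such a filtered view would feed one branch while the original image feeds the other, with knowledge distillation aligning the two.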
doi_str_mv | 10.48550/arxiv.2409.10362 |
format | Article |
creationdate | 2024-09-16 |
rights | http://creativecommons.org/licenses/by/4.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2409.10362 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2409_10362 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T19%3A50%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Frequency-Guided%20Masking%20for%20Enhanced%20Vision%20Self-Supervised%20Learning&rft.au=Monsefi,%20Amin%20Karimi&rft.date=2024-09-16&rft_id=info:doi/10.48550/arxiv.2409.10362&rft_dat=%3Carxiv_GOX%3E2409_10362%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |