Multi-modal fusion user intention recognition method in complex human-computer interaction scene

The invention discloses a multi-modal fusion method for user intention recognition in complex human-computer interaction scenes. The method comprises the following steps: speech and video are acquired, and the speech is converted into text by a speech recognition module; text, speech, and visual features are then extracted with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN respectively, and the features are preprocessed with a Transformer; a modality-specific encoder and a modality-shared encoder are constructed to carry out multi-modal collaborative representation learning on the text, speech, and video features; because each modality may exhibit a different level of noise at different moments in a complex scene, the collaborative representations are adaptively fused with an attention mechanism and a gated neural network; finally, the fused features are fed into a fully connected neural network to recognize the user's intention. The sketches below illustrate each step in turn.
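For the feature-extraction step, the minimal sketch below loads the three pre-trained models the abstract names and pulls sequence features from each modality. The checkpoint names (bert-base-chinese, facebook/wav2vec2-base, a COCO-pretrained Faster R-CNN) and all input data are illustrative assumptions, not taken from the patent, and the upstream speech-recognition module that produces the text is not shown.

```python
# Hedged sketch: pre-trained feature extractors for the three modalities.
import torch
from transformers import BertTokenizer, BertModel
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

# Text: BERT token-level features (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()
tokens = tokenizer("打开客厅的灯", return_tensors="pt")  # "turn on the living-room light"
with torch.no_grad():
    text_feats = bert(**tokens).last_hidden_state       # (1, seq_len, 768)

# Speech: Wav2vec 2.0 frame-level features from a 16 kHz waveform.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
waveform = torch.randn(16000)                           # 1 s of dummy audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feats = wav2vec(**inputs).last_hidden_state   # (1, frames, 768)

# Vision: Faster R-CNN backbone (FPN) feature maps for a video frame;
# the patent presumably pools per-region features, which the detector's
# box head would supply on top of these maps.
detector = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
frame = torch.rand(1, 3, 224, 224)                      # one dummy RGB frame
with torch.no_grad():
    fpn_maps = detector.backbone(frame)                 # dict of multi-scale maps
visual_feats = fpn_maps["0"].flatten(2).transpose(1, 2)  # (1, HW, 256)
```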

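The abstract next says the raw features are preprocessed with a Transformer. One plausible reading, sketched below, projects each modality's sequence to a common width and runs it through a small Transformer encoder; the dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class ModalityPreprocessor(nn.Module):
    """Project one modality's features to a shared width, then run the
    sequence through a small Transformer encoder."""
    def __init__(self, in_dim: int, d_model: int = 256, nhead: int = 4, layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) -> (batch, seq_len, d_model)
        return self.encoder(self.proj(x))

text_pre = ModalityPreprocessor(in_dim=768)    # BERT hidden size
audio_pre = ModalityPreprocessor(in_dim=768)   # Wav2vec 2.0 base hidden size
visual_pre = ModalityPreprocessor(in_dim=256)  # FPN channel count
out = text_pre(torch.randn(2, 12, 768))        # (2, 12, 256)
```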
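For the collaborative-representation step, the patent pairs a modality-specific encoder with a modality-shared encoder. A minimal sketch, assuming mean-pooled inputs and simple linear encoders (the patent's actual encoder architecture, and any similarity or difference losses between the two representation spaces, are not given):

```python
import torch
import torch.nn as nn

class SpecificSharedEncoders(nn.Module):
    """One private encoder per modality plus one encoder whose weights
    are shared across modalities, so every modality yields both a
    specific and a shared (collaborative) representation."""
    def __init__(self, d_model: int = 256, d_repr: int = 128):
        super().__init__()
        def mlp() -> nn.Module:
            return nn.Sequential(nn.Linear(d_model, d_repr), nn.Tanh())
        self.specific = nn.ModuleDict({m: mlp() for m in ("text", "audio", "visual")})
        self.shared = mlp()  # single module reused for all modalities

    def forward(self, feats: dict):
        # feats: modality name -> (batch, seq_len, d_model) sequence
        pooled = {m: x.mean(dim=1) for m, x in feats.items()}            # (B, d_model)
        specific = {m: self.specific[m](h) for m, h in pooled.items()}   # (B, d_repr)
        shared = {m: self.shared(h) for m, h in pooled.items()}          # (B, d_repr)
        return specific, shared

enc = SpecificSharedEncoders()
feats = {m: torch.randn(2, 8, 256) for m in ("text", "audio", "visual")}
specific, shared = enc(feats)   # six (2, 128) representations in total
```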

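The adaptive-fusion step targets the case where a modality is noisy at a given moment. The sketch below combines the two mechanisms the abstract names: softmax attention weights the modality representations against each other, and a sigmoid gate can further suppress individual dimensions of a noisy representation. This is one assumed realization, not the patent's exact network.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Softmax attention weights the modality representations against
    each other; a sigmoid gate then scales individual dimensions, so a
    representation that is noisy at this moment can be down-weighted."""
    def __init__(self, d_repr: int = 128):
        super().__init__()
        self.attn = nn.Linear(d_repr, 1)       # scalar relevance score per modality
        self.gate = nn.Linear(d_repr, d_repr)  # element-wise gate

    def forward(self, reprs: list) -> torch.Tensor:
        stack = torch.stack(reprs, dim=1)                 # (B, M, d_repr)
        weights = torch.softmax(self.attn(stack), dim=1)  # (B, M, 1)
        gated = torch.sigmoid(self.gate(stack)) * stack   # (B, M, d_repr)
        return (weights * gated).sum(dim=1)               # (B, d_repr)

fusion = GatedAttentionFusion()
fused = fusion([torch.randn(2, 128) for _ in range(6)])  # (2, 128)
```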
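The abstract is truncated at "full-connection neural net", but the title makes the purpose clear: the fused vector is classified into an intention label. A minimal fully connected head, with hypothetical layer sizes and label count, trained with the usual cross-entropy objective:

```python
import torch
import torch.nn as nn

d_repr, num_intents = 128, 10   # hypothetical sizes; the patent states neither

classifier = nn.Sequential(
    nn.Linear(d_repr, 64),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(64, num_intents),
)

fused = torch.randn(2, d_repr)        # stands in for the fusion output
logits = classifier(fused)            # (2, num_intents)
intent = logits.argmax(dim=-1)        # predicted intention index per sample
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 7]))  # training objective
```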
Bibliographic details
Main authors: MA TINGHUAI; JIA LI; HUANG XUEJIAN
Format: Patent
Language: Chinese; English
Subjects: ACOUSTICS; CALCULATING; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; MUSICAL INSTRUMENTS; PHYSICS; SPEECH ANALYSIS OR SYNTHESIS; SPEECH OR AUDIO CODING OR DECODING; SPEECH OR VOICE PROCESSING; SPEECH RECOGNITION
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-04T10%3A30%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=MA%20TINGHUAI&rft.date=2023-08-29&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3ECN116661603A%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true