Multi-modal fusion user intention recognition method in complex human-computer interaction scene
The invention discloses a multi-modal fusion user intention recognition method for complex human-computer interaction scenes. The method obtains speech and video and converts the speech into text through a speech recognition module; text, speech, and visual features are then extracted with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN respectively, and preprocessed with a Transformer. A modality-specific encoder and a modality-shared encoder are constructed to perform multi-modal cooperative representation learning on the text, speech, and video features (one possible reading is sketched below). Because each modality may exhibit a different level of noise at different moments in a complex scene, the cooperative representations are fused adaptively using an attention mechanism and a gated neural network, and the fused features are fed into a fully connected neural network (see the second sketch below).
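The abstract describes the cooperative representation step only at a high level. The following PyTorch sketch shows one plausible reading, in which each modality gets a private (modality-specific) encoder while a single shared encoder maps all modalities into a common space; the module names, dimensions, and the choice of simple feed-forward encoders are illustrative assumptions, not the patent's actual design.

```python
import torch
import torch.nn as nn

class CooperativeRepresentation(nn.Module):
    """Modality-specific + modality-shared encoders (illustrative sketch).

    Inputs are assumed to be pooled, fixed-size vectors per utterance:
    e.g. the BERT [CLS] vector for text, mean-pooled Wav2vec 2.0 frames
    for speech, and mean-pooled Faster R-CNN region features for video.
    """

    def __init__(self, dims=None, d_model=256):
        super().__init__()
        # Hypothetical feature widths for the three pre-trained extractors.
        dims = dims or {"text": 768, "audio": 768, "video": 2048}
        # One private encoder per modality captures modality-specific cues.
        self.specific = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, d_model), nn.ReLU(), nn.LayerNorm(d_model))
            for m, d in dims.items()
        })
        # Per-modality projections into a common width feed one shared
        # encoder, so all modalities land in the same representation space.
        self.project = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.LayerNorm(d_model)
        )

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) tensor
        private = {m: self.specific[m](x) for m, x in feats.items()}
        shared = {m: self.shared(self.project[m](x)) for m, x in feats.items()}
        # Cooperative representation: concatenate private and shared views.
        return {m: torch.cat([private[m], shared[m]], dim=-1) for m in feats}
```

Under these assumptions each modality comes out as a 512-dimensional vector (256 private + 256 shared), which is the width the fusion sketch below consumes.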
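The adaptive fusion step, motivated by per-modality noise that varies over time, can likewise be read as attention weights over modalities combined with an elementwise gate that suppresses noisy channels before a fully connected classifier. The sketch below is one such reading; the gating form, the number of intent classes, and every identifier here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionFusion(nn.Module):
    """Attention + gated adaptive fusion with an FC intent head (sketch)."""

    def __init__(self, d_rep=512, n_classes=10):
        super().__init__()
        self.attn = nn.Linear(d_rep, 1)      # one relevance score per modality
        self.gate = nn.Linear(d_rep, d_rep)  # elementwise gate per modality
        self.classifier = nn.Sequential(     # fully connected intent head
            nn.Linear(d_rep, d_rep // 2), nn.ReLU(), nn.Linear(d_rep // 2, n_classes)
        )

    def forward(self, reps):
        # reps: dict of modality -> (batch, d_rep); stack to (batch, M, d_rep)
        x = torch.stack(list(reps.values()), dim=1)
        weights = F.softmax(self.attn(x), dim=1)  # attention over modalities
        gated = torch.sigmoid(self.gate(x)) * x   # damp noisy modalities
        fused = (weights * gated).sum(dim=1)      # (batch, d_rep)
        return self.classifier(fused)             # intent logits

# Hypothetical end-to-end usage with the encoder sketch above:
# feats = {"text": torch.randn(4, 768), "audio": torch.randn(4, 768),
#          "video": torch.randn(4, 2048)}
# logits = GatedAttentionFusion()(CooperativeRepresentation()(feats))
```

In this reading, the sigmoid gate is what would let a momentarily unreliable modality, such as speech captured in a loud scene, be down-weighted sample by sample rather than globally.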
creator | MA TINGHUAI ; JIA LI ; HUANG XUEJIAN
format | Patent |
identifier | CN116661603A
date | 2023-08-29
language | chi ; eng |
recordid | cdi_epo_espacenet_CN116661603A |
source | esp@cenet |
subjects | ACOUSTICS ; CALCULATING ; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS ; COMPUTING ; COUNTING ; ELECTRIC DIGITAL DATA PROCESSING ; MUSICAL INSTRUMENTS ; PHYSICS ; SPEECH ANALYSIS OR SYNTHESIS ; SPEECH OR AUDIO CODING OR DECODING ; SPEECH OR VOICE PROCESSING ; SPEECH RECOGNITION
title | Multi-modal fusion user intention recognition method in complex human-computer interaction scene |
url | https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20230829&DB=EPODOC&CC=CN&NR=116661603A