Video synthesis via multi-modal conditions

Bibliographic details
Main authors: TULYAKOV SERGEY, OLZEWSKI KYLE, MINAY SYLVAIN, BARBERI FRANCESCO, HAN LILONG, LI XINYING, REN JIAN
Format: Patent
Language: chi; eng
Description
Summary: A multi-modal video generation framework (MMVID) benefits from text and images provided jointly or separately as inputs. Quantized representations of videos are used with a bidirectional transformer that takes multiple modalities as inputs to predict a discrete video representation. Video quality and consistency are improved by a new video token trained with self-learning and by an improved mask-prediction algorithm for sampling video tokens. Text augmentation is used to improve the robustness of the text representation and the diversity of generated videos. The framework incorporates various visual modalities, such as segmentation masks, drawings, and partially occluded images. In addition, MMVID extracts visual information as suggested by the text prompt.
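
Illustrative sketch: the summary mentions a mask-prediction algorithm for sampling video tokens but does not describe it. The Python below shows only the generic idea of iterative mask-predict decoding over discrete tokens, not the patent's improved variant. All names (`MASK`, `toy_predictor`, `mask_predict`) are hypothetical, and the random `toy_predictor` merely stands in for a trained bidirectional model.

```python
import random

MASK = -1  # hypothetical sentinel marking a position that still needs a token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained bidirectional model (assumption, not the patent's).

    Returns one (token, confidence) pair per position. Masked positions get a
    random token with a random confidence; already filled positions are kept
    with confidence 1.0, a simplification of generic mask-predict.
    """
    preds = []
    for tok in tokens:
        if tok == MASK:
            preds.append((rng.randrange(vocab_size), rng.random()))
        else:
            preds.append((tok, 1.0))
    return preds


def mask_predict(length, vocab_size, steps, predictor, seed=0):
    """Fill `length` token slots in `steps` rounds of predict-then-re-mask."""
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for step in range(1, steps + 1):
        preds = predictor(tokens, vocab_size, rng)
        tokens = [tok for tok, _ in preds]
        # The number of positions to re-mask decays linearly to zero.
        n_mask = length * (steps - step) // steps
        if n_mask == 0:
            break
        # Re-mask the least confident positions so they are predicted again.
        by_confidence = sorted(range(length), key=lambda i: preds[i][1])
        for i in by_confidence[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    print(mask_predict(length=8, vocab_size=16, steps=4, predictor=toy_predictor))
```

In a real system the predictor would also be conditioned on the text and image inputs, and the resulting token indices would be decoded back to video frames by the quantizer's decoder; both are outside the scope of this sketch.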