Realistic Image Generation from Text by Using BERT-Based Embedding

Sanghyuck Na, Mirae Do, Kyeonah Yu, Juntae Kim

Research output: Contribution to journal, Article, Peer-review

9 Scopus citations

Abstract

Multimodal learning has recently received considerable attention in artificial intelligence because it is expected to improve AI performance and to enable new applications. Text-to-image generation, one such multimodal task, is a challenging problem that spans computer vision and natural language processing. Text-to-image models based on generative adversarial networks (GANs) use a text encoder pre-trained on image-text pairs. However, such an encoder cannot capture rich information about text it did not see during pre-training, which makes it hard to generate an image that semantically matches a given text description. In this paper, we propose a new text-to-image generation model that uses pre-trained BERT, a model widely used in natural language processing, as its text encoder. BERT is fine-tuned on a large amount of text, so it captures rich information about the text that is suited to the image generation task. Experiments on a multimodal benchmark dataset show that the proposed method improves on the baseline model both quantitatively and qualitatively.
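
The abstract describes BERT acting as the text encoder whose output conditions a GAN generator. The sketch below illustrates one common way such a conditioning vector might be formed and consumed. It assumes masked mean pooling over BERT's token states and simple concatenation with a Gaussian noise vector; neither choice is stated in the abstract, the function names are hypothetical, and the paper's actual architecture may differ (for example, it could use the [CLS] vector or word-level attention instead). Toy 3-dimensional vectors stand in for BERT's last-layer hidden states, which are 768-dimensional for BERT-base.

```python
import random
from typing import List, Sequence


def masked_mean_pool(token_embs: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (real tokens), skipping padding.

    Hypothetical pooling choice: the abstract does not say how the sentence vector is built.
    """
    dim = len(token_embs[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(token_embs, mask):
        if m:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("mask selects no tokens")
    return [t / count for t in total]


def generator_input(sentence_emb: Sequence[float], noise_dim: int, seed: int = 0) -> List[float]:
    """Concatenate a noise vector z ~ N(0, 1) with the sentence embedding.

    Hypothetical conditioning scheme: plain concatenation is assumed for illustration only.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + list(sentence_emb)


if __name__ == "__main__":
    # Toy stand-in for BERT hidden states: 4 tokens x 3 dims; the last token is padding.
    hidden = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
    mask = [1, 1, 1, 0]
    sentence = masked_mean_pool(hidden, mask)
    print(sentence)  # [2.0, 2.0, 2.0]
    x = generator_input(sentence, noise_dim=5)
    print(len(x))  # 8: 5 noise values followed by the 3-dim text embedding
```

Masking matters because padded positions would otherwise pull the average toward meaningless vectors; in the toy input, the padding token [9.0, 9.0, 9.0] is ignored.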

Original language: English
Article number: 764
Journal: Electronics (Switzerland)
Volume: 11
Issue number: 5
State: Published, 1 Mar 2022

Keywords

  • BERT
  • GAN
  • Multimodal data
  • Text to image generation

