TY - JOUR
T1 - Realistic Image Generation from Text by Using BERT-Based Embedding
AU - Na, Sanghyuck
AU - Do, Mirae
AU - Yu, Kyeonah
AU - Kim, Juntae
N1 - Publisher Copyright:
© 2022 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2022/3/1
Y1 - 2022/3/1
N2 - Recently, in the field of artificial intelligence, multimodal learning has received considerable attention owing to its potential to enhance AI performance and enable new applications. Text-to-image generation, one such multimodal task, is a challenging topic in computer vision and natural language processing. Text-to-image generation models based on generative adversarial networks (GANs) typically use a text encoder pre-trained on image-text pairs. However, such encoders cannot obtain rich information about texts not seen during pre-training, so it is hard to generate an image that semantically matches a given text description. In this paper, we propose a new text-to-image generation model that uses pre-trained BERT, which is widely used in the field of natural language processing. The pre-trained BERT is fine-tuned on a large amount of text and used as the text encoder, so that it captures rich textual information suitable for the image generation task. Through experiments on a multimodal benchmark dataset, we show that the proposed method improves performance over the baseline model both quantitatively and qualitatively.
AB - Recently, in the field of artificial intelligence, multimodal learning has received considerable attention owing to its potential to enhance AI performance and enable new applications. Text-to-image generation, one such multimodal task, is a challenging topic in computer vision and natural language processing. Text-to-image generation models based on generative adversarial networks (GANs) typically use a text encoder pre-trained on image-text pairs. However, such encoders cannot obtain rich information about texts not seen during pre-training, so it is hard to generate an image that semantically matches a given text description. In this paper, we propose a new text-to-image generation model that uses pre-trained BERT, which is widely used in the field of natural language processing. The pre-trained BERT is fine-tuned on a large amount of text and used as the text encoder, so that it captures rich textual information suitable for the image generation task. Through experiments on a multimodal benchmark dataset, we show that the proposed method improves performance over the baseline model both quantitatively and qualitatively.
KW - BERT
KW - GAN
KW - Multimodal data
KW - Text to image generation
UR - http://www.scopus.com/inward/record.url?scp=85125419852&partnerID=8YFLogxK
U2 - 10.3390/electronics11050764
DO - 10.3390/electronics11050764
M3 - Article
AN - SCOPUS:85125419852
SN - 2079-9292
VL - 11
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 5
M1 - 764
ER -