Citation: SUN Haoran, WANG Yang, LIU Haipeng, et al., “Fine-Grained Cross-Modal Fusion Based Refinement for Text-to-Image Synthesis,” Chinese Journal of Electronics, vol.32, no.6, pp.1329–1340, 2023. doi: 10.23919/cje.2022.00.227

[1] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol.45, no.11, pp.2673–2681, 1997. doi: 10.1109/78.650093

[2] T. Mikolov, M. Karafiát, L. Burget, et al., “Recurrent neural network based language model,” in Proceedings of INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Japan, pp.1045–1048, 2010.

[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol.9, no.8, pp.1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

[4] Y. Wang, W. J. Zhang, L. Wu, et al., “Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, pp.2153–2159, 2016.

[5] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., “Rethinking the inception architecture for computer vision,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp.2818–2826, 2016.

[6] Y. Wang, “Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol.17, no.1s, article no.10, 2021. doi: 10.1145/3408317

[7] B. Qian, Y. Wang, R. C. Hong, et al., “Diversifying inference path selection: Moving-mobile-network for landmark recognition,” IEEE Transactions on Image Processing, vol.30, pp.4894–4904, 2021. doi: 10.1109/TIP.2021.3076275

[8] B. Qian, Y. Wang, H. Z. Yin, et al., “Switchable online knowledge distillation,” in Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp.449–466, 2022.

[9] L. Wu, Y. Wang, and L. Shao, “Cycle-consistent deep generative hashing for cross-modal retrieval,” IEEE Transactions on Image Processing, vol.28, no.4, pp.1602–1612, 2019. doi: 10.1109/TIP.2018.2878970

[10] H. P. Liu, Y. Wang, M. Wang, et al., “Delving globally into texture and structure for image inpainting,” in Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, pp.1270–1278, 2022.

[11] J. Cheng, F. X. Wu, Y. L. Tian, et al., “RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp.10908–10917, 2020.

[12] S. He, W. T. Liao, M. Y. Yang, et al., “Context-aware layout to image generation with enhanced object appearance,” in Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, pp.15044–15053, 2021.

[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp.2672–2680, 2014.

[14] B. Qian, Y. Wang, R. C. Hong, et al., “Rethinking data-free quantization as a zero-sum game,” arXiv preprint, arXiv: 2302.09572, 2023.

[15] S. E. Reed, Z. Akata, X. C. Yan, et al., “Generative adversarial text to image synthesis,” in Proceedings of the 33rd International Conference on Machine Learning, New York City, NY, USA, pp.1060–1069, 2016.

[16] H. Zhang, T. Xu, H. S. Li, et al., “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.5908–5916, 2017.

[17] T. Miyato and M. Koyama, “cGANs with projection discriminator,” in Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, pp.2337–2346, 2018.

[18] J. Y. Zhu, T. Park, P. Isola, et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.2242–2251, 2017.

[19] T. T. Qiao, J. Zhang, D. Q. Xu, et al., “MirrorGAN: Learning text-to-image generation by redescription,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.1505–1514, 2019.

[20] T. Xu, P. C. Zhang, Q. Y. Huang, et al., “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.1316–1324, 2018.

[21] F. L. Mao, B. P. Ma, H. Chang, et al., “MS-GAN: Text to image synthesis with attention-modulated generators and similarity-aware discriminators,” in Proceedings of the 30th British Machine Vision Conference, Cardiff, UK, article no.150, 2019.

[22] M. F. Zhu, P. B. Pan, W. Chen, et al., “DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.5795–5803, 2019.

[23] Z. Z. Zhang, Y. P. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.6199–6208, 2018.

[24] Z. X. Zhang and L. Schomaker, “DTGAN: Dual attention generative adversarial networks for text-to-image generation,” in Proceedings of 2021 International Joint Conference on Neural Networks, Shenzhen, China, pp.1–8, 2021.

[25] H. Zhang, T. Xu, H. S. Li, et al., “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.41, no.8, pp.1947–1962, 2019. doi: 10.1109/TPAMI.2018.2856256

[26] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp.6000–6010, 2017.

[27] Y. Wang, J. J. Peng, H. B. Wang, et al., “Progressive learning with multi-scale attention network for cross-domain vehicle re-identification,” Science China Information Sciences, vol.65, no.6, article no.160103, 2022. doi: 10.1007/s11432-021-3383-y

[28] G. J. Yin, B. Liu, L. Sheng, et al., “Semantics disentangling for text-to-image generation,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.2322–2331, 2019.

[29] E. Perez, F. Strub, H. de Vries, et al., “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, pp.3942–3951, 2018.

[30] A. El-Nouby, S. Sharma, H. Schulz, et al., “Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), pp.10303–10311, 2019.

[31] T. Park, M. Y. Liu, T. C. Wang, et al., “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.2332–2341, 2019.

[32] H. de Vries, F. Strub, J. Mary, et al., “Modulating early visual processing by language,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp.6597–6607, 2017.

[33] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.1510–1519, 2017.

[34] S. E. Reed, Z. Akata, S. Mohan, et al., “Learning what and where to draw,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, pp.217–225, 2016.

[35] A. Nguyen, J. Clune, Y. Bengio, et al., “Plug & play generative networks: Conditional iterative generation of images in latent space,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp.3510–3520, 2017.

[36] S. L. Ruan, Y. Zhang, K. Zhang, et al., “DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, pp.13940–13949, 2021.

[37] M. Tao, H. Tang, F. Wu, et al., “DF-GAN: A simple and effective baseline for text-to-image synthesis,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp.16494–16504, 2022.

[38] B. W. Li, X. J. Qi, T. Lukasiewicz, et al., “ManiGAN: Text-guided image manipulation,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp.7877–7886, 2020.

[39] V. Dumoulin, E. Perez, N. Schucher, et al., “Feature-wise transformations: A simple and surprisingly effective family of conditioning mechanisms,” available at: https://distill.pub/2018/feature-wise-transformations/, 2018-07-09.

[40] C. Wah, S. Branson, P. Welinder, et al., “The Caltech-UCSD Birds-200-2011 dataset,” Computation & Neural Systems Technical Report 2010-001, California Institute of Technology, available at: https://authors.library.caltech.edu/27452/, 2011.

[41] T. Y. Lin, M. Maire, S. J. Belongie, et al., “Microsoft COCO: Common objects in context,” in Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, pp.740–755, 2014.

[42] M. Heusel, H. Ramsauer, T. Unterthiner, et al., “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp.6629–6640, 2017.

[43] B. W. Li, X. J. Qi, T. Lukasiewicz, et al., “Controllable text-to-image generation,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no.185, 2019.

[44] H. C. Tan, X. P. Liu, M. Liu, et al., “KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis,” IEEE Transactions on Image Processing, vol.30, pp.1275–1290, 2021. doi: 10.1109/TIP.2020.3026728

[45] W. M. Huang, R. Y. D. Xu, and I. Oppermann, “Realistic image generation using region-phrase attention,” in Proceedings of the 11th Asian Conference on Machine Learning, Nagoya, Japan, pp.284–299, 2019.

[46] B. C. Liu, K. P. Song, Y. Z. Zhu, et al., “TIME: Text and image mutual-translation adversarial networks,” in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, virtual event, pp.2082–2090, 2021.