Citation: SUN Haoran, WANG Yang, LIU Haipeng, et al., “Fine-Grained Cross-Modal Fusion Based Refinement for Text-to-Image Synthesis,” Chinese Journal of Electronics, vol.32, no.6, pp.1329–1340, 2023. doi: 10.23919/cje.2022.00.227

[1] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol.45, no.11, pp.2673–2681, 1997. doi: 10.1109/78.650093

[2] T. Mikolov, M. Karafiát, L. Burget, et al., “Recurrent neural network based language model,” in Proceedings of INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Japan, pp.1045–1048, 2010.

[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol.9, no.8, pp.1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

[4] Y. Wang, W. J. Zhang, L. Wu, et al., “Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, pp.2153–2159, 2016.

[5] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., “Rethinking the inception architecture for computer vision,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp.2818–2826, 2016.

[6] Y. Wang, “Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol.17, no.1s, article no.10, 2021. doi: 10.1145/3408317

[7] B. Qian, Y. Wang, R. C. Hong, et al., “Diversifying inference path selection: Moving-mobile-network for landmark recognition,” IEEE Transactions on Image Processing, vol.30, pp.4894–4904, 2021. doi: 10.1109/TIP.2021.3076275

[8] B. Qian, Y. Wang, H. Z. Yin, et al., “Switchable online knowledge distillation,” in Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp.449–466, 2022.

[9] L. Wu, Y. Wang, and L. Shao, “Cycle-consistent deep generative hashing for cross-modal retrieval,” IEEE Transactions on Image Processing, vol.28, no.4, pp.1602–1612, 2019. doi: 10.1109/TIP.2018.2878970

[10] H. P. Liu, Y. Wang, M. Wang, et al., “Delving globally into texture and structure for image inpainting,” in Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, pp.1270–1278, 2022.

[11] J. Cheng, F. X. Wu, Y. L. Tian, et al., “RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp.10908–10917, 2020.

[12] S. He, W. T. Liao, M. Y. Yang, et al., “Context-aware layout to image generation with enhanced object appearance,” in Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, pp.15044–15053, 2021.

[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp.2672–2680, 2014.

[14] B. Qian, Y. Wang, R. C. Hong, et al., “Rethinking data-free quantization as a zero-sum game,” arXiv preprint, arXiv: 2302.09572, 2023.

[15] S. E. Reed, Z. Akata, X. C. Yan, et al., “Generative adversarial text to image synthesis,” in Proceedings of the 33rd International Conference on Machine Learning, New York City, NY, USA, pp.1060–1069, 2016.

[16] H. Zhang, T. Xu, H. S. Li, et al., “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.5908–5916, 2017.

[17] T. Miyato and M. Koyama, “cGANs with projection discriminator,” in Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, pp.2337–2346, 2018.

[18] J. Y. Zhu, T. Park, P. Isola, et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.2242–2251, 2017.

[19] T. T. Qiao, J. Zhang, D. Q. Xu, et al., “MirrorGAN: Learning text-to-image generation by redescription,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.1505–1514, 2019.

[20] T. Xu, P. C. Zhang, Q. Y. Huang, et al., “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.1316–1324, 2018.

[21] F. L. Mao, B. P. Ma, H. Chang, et al., “MS-GAN: Text to image synthesis with attention-modulated generators and similarity-aware discriminators,” in Proceedings of the 30th British Machine Vision Conference, Cardiff, UK, article no.150, 2019.

[22] M. F. Zhu, P. B. Pan, W. Chen, et al., “DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.5795–5803, 2019.

[23] Z. Z. Zhang, Y. P. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.6199–6208, 2018.

[24] Z. X. Zhang and L. Schomaker, “DTGAN: Dual attention generative adversarial networks for text-to-image generation,” in Proceedings of 2021 International Joint Conference on Neural Networks, Shenzhen, China, pp.1–8, 2021.

[25] H. Zhang, T. Xu, H. S. Li, et al., “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.41, no.8, pp.1947–1962, 2019. doi: 10.1109/TPAMI.2018.2856256

[26] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp.6000–6010, 2017.

[27] Y. Wang, J. J. Peng, H. B. Wang, et al., “Progressive learning with multi-scale attention network for cross-domain vehicle re-identification,” Science China Information Sciences, vol.65, no.6, article no.160103, 2022. doi: 10.1007/s11432-021-3383-y

[28] G. J. Yin, B. Liu, L. Sheng, et al., “Semantics disentangling for text-to-image generation,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.2322–2331, 2019.

[29] E. Perez, F. Strub, H. de Vries, et al., “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, pp.3942–3951, 2018.

[30] A. El-Nouby, S. Sharma, H. Schulz, et al., “Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), pp.10303–10311, 2019.

[31] T. Park, M. Y. Liu, T. C. Wang, et al., “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.2332–2341, 2019.

[32] H. de Vries, F. Strub, J. Mary, et al., “Modulating early visual processing by language,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp.6597–6607, 2017.

[33] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.1510–1519, 2017.

[34] S. E. Reed, Z. Akata, S. Mohan, et al., “Learning what and where to draw,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, pp.217–225, 2016.

[35] A. Nguyen, J. Clune, Y. Bengio, et al., “Plug & play generative networks: Conditional iterative generation of images in latent space,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp.3510–3520, 2017.

[36] S. L. Ruan, Y. Zhang, K. Zhang, et al., “DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, pp.13940–13949, 2021.

[37] M. Tao, H. Tang, F. Wu, et al., “DF-GAN: A simple and effective baseline for text-to-image synthesis,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp.16494–16504, 2022.

[38] B. W. Li, X. J. Qi, T. Lukasiewicz, et al., “ManiGAN: Text-guided image manipulation,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp.7877–7886, 2020.

[39] V. Dumoulin, E. Perez, N. Schucher, et al., “Feature-wise transformations: A simple and surprisingly effective family of conditioning mechanisms,” available at: https://distill.pub/2018/feature-wise-transformations/, 2018-07-09.

[40] C. Wah, S. Branson, P. Welinder, et al., “The Caltech-UCSD Birds-200-2011 dataset,” Computation & Neural Systems Technical Report 2010-001, California Institute of Technology, available at: https://authors.library.caltech.edu/27452/, 2011.

[41] T. Y. Lin, M. Maire, S. J. Belongie, et al., “Microsoft COCO: Common objects in context,” in Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, pp.740–755, 2014.

[42] M. Heusel, H. Ramsauer, T. Unterthiner, et al., “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp.6629–6640, 2017.

[43] B. W. Li, X. J. Qi, T. Lukasiewicz, et al., “Controllable text-to-image generation,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no.185, 2019.

[44] H. C. Tan, X. P. Liu, M. Liu, et al., “KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis,” IEEE Transactions on Image Processing, vol.30, pp.1275–1290, 2021. doi: 10.1109/TIP.2020.3026728

[45] W. M. Huang, R. Y. D. Xu, and I. Oppermann, “Realistic image generation using region-phrase attention,” in Proceedings of the 11th Asian Conference on Machine Learning, Nagoya, Japan, pp.284–299, 2019.

[46] B. C. Liu, K. P. Song, Y. Z. Zhu, et al., “TIME: Text and image mutual-translation adversarial networks,” in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, virtual event, pp.2082–2090, 2021.