Description
Hunyuan-DiT is a multi-resolution diffusion transformer for text-to-image generation with fine-grained understanding of both English and Chinese prompts. It pairs a pre-trained variational autoencoder (VAE) with a bilingual CLIP text encoder and a multilingual T5 encoder, enabling it to render detailed images from user prompts. The model also supports multi-turn dialogue, iteratively refining generated images across conversation turns, and achieves state-of-the-art quality, particularly for Chinese-to-image generation. Optimized deployments, including TensorRT and distilled versions, are available for accelerated inference on NVIDIA GPUs.
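
For reference, the sketch below shows one way to run text-to-image inference through the Hugging Face diffusers integration (`HunyuanDiTPipeline`). This is a minimal illustration, not the only supported path: the checkpoint ID and prompt are assumptions, and the TensorRT and distilled deployments mentioned above use their own tooling described in the official Hunyuan-DiT repository.

```python
# Minimal text-to-image sketch via the diffusers HunyuanDiTPipeline.
# The repository ID below is an assumption; substitute the Hunyuan-DiT
# weights you intend to use.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",  # assumed repo ID
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Prompts may be written in English or Chinese; the bilingual CLIP and
# multilingual T5 encoders handle both.
prompt = "一只戴着墨镜的柴犬，赛博朋克风格"  # "a Shiba Inu wearing sunglasses, cyberpunk style"
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("hunyuan_dit_sample.png")
```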