Lumos𝒳: Relate Any Identities with Their Attributes for Personalized Video Generation

Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face–attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose Lumos𝒳, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject–attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that Lumos𝒳 achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

Identity-Consistent Video Generation

[Video comparison: ConsisID, Concat-ID, Lumos𝒳]

[Video comparison: SkyReels-A2, Phantom, Lumos𝒳]

Subject-Consistent Video Generation

[Video comparisons: SkyReels-A2, Phantom, Lumos𝒳]

Data Collection Pipeline

Dataset construction pipeline for personalized multi-subject video generation. We build the training dataset from raw videos in three steps: (1) generate a caption and detect human subjects in extracted frames; (2) retrieve entity words from the caption and match subjects with their attributes; (3) use these entity tags to localize and segment target subjects and objects, producing a clean background image.
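Step (2) above can be pictured as grouping entity words by the subject they belong to. The following is only a minimal sketch of that idea, not the authors' pipeline: the `SubjectGroup` class, the `group_entities` function, and the example assignments are all hypothetical, and in practice the (entity word, subject) pairs would come from the MLLM rather than being written by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SubjectGroup:
    """One detected human subject and the attribute words assigned to them (hypothetical)."""
    subject_id: int
    attributes: List[str] = field(default_factory=list)


def group_entities(
    assignments: List[Tuple[str, Optional[int]]],
) -> Tuple[List[SubjectGroup], List[str]]:
    """Split (entity_word, subject_id) pairs into per-subject groups and shared entities.

    A subject_id of None marks an entity that belongs to no specific person,
    such as a standalone object.
    """
    groups = {}
    shared = []
    for word, subject_id in assignments:
        if subject_id is None:
            shared.append(word)
        else:
            if subject_id not in groups:
                groups[subject_id] = SubjectGroup(subject_id)
            groups[subject_id].attributes.append(word)
    # Return groups in a stable order, sorted by subject id.
    return [groups[key] for key in sorted(groups)], shared


# Hypothetical assignments for a two-person caption (assumed, not from the dataset).
assignments = [("red scarf", 0), ("glasses", 1), ("guitar", None), ("black coat", 0)]
groups, shared = group_entities(assignments)
```

Here each `SubjectGroup` collects the attributes tied to one face, while `shared` holds entities that step (3) would localize and segment independently of any person.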

Method Pipeline

Overview of Lumos𝒳. Built on the Wan2.1-T2V model, our framework encodes all condition images into image tokens via a VAE encoder, concatenates them with the denoising video tokens, and feeds the result into DiT (Peebles and Xie, 2023) blocks. Within each block, the proposed Relational Self-Attention and Relational Cross-Attention enable causal conditional modeling, enhance visual token representations, and ensure precise face–attribute alignment.
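One way to picture group-aware, causal conditioning over the concatenated token sequence is a boolean attention mask. The sketch below is an illustration under stated assumptions, not the paper's Relational Self-Attention: it assumes condition tokens are placed before video tokens, that each condition token attends only to tokens of its own subject group, and that video tokens attend to everything. The function names and the `-1` label for video tokens are hypothetical.

```python
from typing import List


def build_sequence(group_sizes: List[int], num_video_tokens: int) -> List[int]:
    """Return one label per token: a group id for condition tokens, -1 for video tokens.

    Assumption: condition tokens come first, followed by the denoising video tokens.
    """
    labels = []
    for group_id, size in enumerate(group_sizes):
        labels.extend([group_id] * size)
    labels.extend([-1] * num_video_tokens)
    return labels


def relational_mask(labels: List[int]) -> List[List[bool]]:
    """Build mask[q][k], True when query token q may attend to key token k.

    Assumed rules (illustrative only):
      - video tokens (label -1) attend to every token;
      - condition tokens attend only to tokens in their own group,
        so they never see the video tokens or other groups.
    """
    n = len(labels)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if labels[q] == -1:
                mask[q][k] = True
            else:
                mask[q][k] = labels[q] == labels[k]
    return mask


# Two condition groups (2 tokens and 1 token) followed by 2 video tokens.
labels = build_sequence([2, 1], num_video_tokens=2)
mask = relational_mask(labels)
```

Under these assumptions, tokens of one subject group stay isolated from other groups, while the video tokens can read every condition token, which is one simple reading of "intra-group cohesion" with one-way conditioning.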

More Visualizations

BibTeX

@misc{xing2025lumosxrelateanyidentities,
  title={Lumos𝒳: Relate Any Identities with Their Attributes for Personalized Video Generation},
  author={Jiazheng Xing and Fei Du and Hangjie Yuan and Pengwei Liu and Hongbin Xu and Hai Ci and Ruigang Niu and Weihua Chen and Fan Wang and Yong Liu},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={},
}