Performance comparison on MICo-Bench across open-source and closed-source models: “base” denotes the original model; “w/o” indicates fine-tuning without the De&Re task; “real” and “synth” correspond to fine-tuning with real and synthetic compositions from the De&Re task, respectively. The best result for each model on each task is highlighted in bold.
The leftmost column displays the source and reference images. The first row shows model outputs before fine-tuning, and the second row shows outputs after fine-tuning. The Weighted-Ref-VIEScore for each generated result is annotated in the corner. MICo-150K demonstrates strong robustness: BLIP-3o and Lumina-DiMOO acquire MICo capability from scratch; the emergent MICo abilities of BAGEL and Qwen-Image are substantially strengthened; and OmniGen2 improves further on top of its already strong performance.
Enabling Community Models with MICo Capability
MICo-Bench Qualitative Results
Data Analysis