Conformal Reliability: A New Evaluation Metric for Conditional Generation

Evaluating not only how good a generated output is, but how bad likely outputs can be.

Yachen Gao¹,², Xinwei Sun³✉, Yikai Wang⁴, Ye Shi⁵, Jingya Wang⁵, Jianfeng Feng¹,³, Yanwei Fu²,³✉

¹Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University ²Shanghai Innovation Institute ³School of Data Science, Fudan University ⁴Nanyang Technological University ⁵ShanghaiTech University

Corresponding authors: Xinwei Sun, Yanwei Fu

Abstract

Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, one that accounts for their inherent uncertainty, is still lacking. Existing metrics typically assess a single output and may therefore fail to capture the variability or potential risks in generation. In this paper, we propose a novel evaluation metric, the reliability score, based on conformal prediction; it measures the worst-case performance within the prediction set at a pre-specified confidence level. Computing this score is challenging because the output space is high-dimensional and both the metric function and the prediction set are nonconvex. To compute it efficiently, we introduce Conformal ReLiability (CReL), a framework that (i) constructs the prediction set with the desired coverage and (ii) accurately optimizes the reliability score within the constructed prediction set. We provide theoretical results on coverage and demonstrate empirically that our method produces more informative prediction sets than existing approaches. Experiments on synthetic data and on image-to-text and text-to-image tasks further demonstrate the interpretability of our new metric and the validity and effectiveness of our computational framework.

Motivation

Average metrics can miss risky plausible outputs.

Metrics such as CLIP-SIM, BERT-SIM, and DINO-SIM are useful for measuring generated outputs, but single-output or average-case evaluation can hide failures in the lower-performance region of a model's output distribution. A model may look strong on average while still producing plausible samples with harmful hallucinations, missing semantics, or weak structural fidelity. CReL asks a reliability-centered question: among outputs that remain statistically plausible, how bad can performance become?

Method

CReL builds a calibrated set of likely outputs and optimizes the worst-case score.

CReL projects high-dimensional generated outputs into a structured latent space using a latent generative model. Directional quantile regression constructs a latent quantile region, conformal calibration expands it to satisfy the target coverage level, and the decoded prediction set represents likely model outputs. The final CReL-ρ score is computed by optimizing a user-chosen similarity metric inside this calibrated set, using projected gradient descent over convex latent-space constraints.
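The calibrate-then-optimize steps described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: the latent quantile region is an axis-aligned box (so projection reduces to clipping), the calibration rule is the standard split-conformal quantile, and the "similarity" is a hypothetical linear function of the latent code rather than a decoded CLIP, BERT, or DINO score. All function and variable names are illustrative.

```python
import numpy as np

def box_scores(z, lower, upper):
    """Nonconformity score: how far z lies outside the box (<= 0 means inside)."""
    return np.maximum(lower - z, z - upper).max(axis=-1)

def conformal_margin(scores, alpha):
    """Split-conformal margin: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def worst_case(score_fn, grad_fn, lower, upper, z0, step=0.1, iters=200):
    """Projected gradient descent: minimise score_fn over the box [lower, upper]."""
    z = np.clip(z0, lower, upper)
    for _ in range(iters):
        z = np.clip(z - step * grad_fn(z), lower, upper)  # gradient step, then project
    return z, score_fn(z)

# Toy data (assumption): 2-D Gaussian latents and a deliberately too-small initial box.
rng = np.random.default_rng(0)
alpha = 0.1
lower0, upper0 = -np.ones(2), np.ones(2)
z_cal = rng.normal(size=(1000, 2))
cal_scores = box_scores(z_cal, lower0, upper0)
t = conformal_margin(cal_scores, alpha)
lower, upper = lower0 - t, upper0 + t  # calibrated (expanded) region

# Empirical coverage of the calibrated region on fresh samples.
z_test = rng.normal(size=(20000, 2))
coverage = np.mean(box_scores(z_test, lower, upper) <= 0)

# Hypothetical linear "similarity"; a real metric would decode z and score the output.
w = np.array([1.0, -2.0])
z_star, rho = worst_case(lambda z: w @ z, lambda z: w, lower, upper, np.zeros(2))
print(f"margin={t:.3f} coverage={coverage:.3f} worst-case score={rho:.3f}")
```

With a linear score the minimizer is a corner of the box, which makes the result easy to check by hand. For a nonconvex metric, projected gradient descent only finds a local minimum, so in practice one would likely run it from several starting points and keep the lowest value.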

Results

Coverage

Synthetic Coverage

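On nonlinear synthetic data, CReL attains coverage close to the target level 1 − α with far smaller regions than DQR. Compared with Feldman, its area in Y is slightly smaller at α = 0.10 and slightly larger at α = 0.02, where Feldman's coverage falls further below the target.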

α	Coverage (Ours-Z)	Coverage (Ours-Y)	Coverage (Feldman-Y)	Coverage (DQR-Z)	Coverage (DQR-Y)	Area in Y (Ours)	Area in Y (Feldman)	Area in Y (DQR)
0.02	0.9770	0.9760	0.9718	0.9818	0.9872	398.5	377.8	749.1
0.10	0.8953	0.8915	0.8940	0.8823	0.9145	232.7	234.5	287.4
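The two quantities reported in this table, empirical coverage and region area, can be estimated for any region that exposes a membership test. The sketch below is one plausible way to do so, not necessarily the procedure used in the paper: coverage is the fraction of held-out samples inside the region, and area is a Monte Carlo estimate over a bounding box. The unit-disk region, the Gaussian samples, and all names are assumptions chosen so the answers are known in closed form.

```python
import numpy as np

def empirical_coverage(inside_fn, y_samples):
    """Fraction of held-out samples that fall inside the region."""
    return float(np.mean(inside_fn(y_samples)))

def monte_carlo_area(inside_fn, low, high, n=200_000, seed=0):
    """Estimate a region's area by uniform sampling inside a bounding box."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, float), np.asarray(high, float)
    pts = rng.uniform(low, high, size=(n, len(low)))
    return float(np.prod(high - low) * np.mean(inside_fn(pts)))

# Toy region (assumption): the unit disk in a 2-D output space.
def in_disk(y):
    return np.sum(y**2, axis=1) <= 1.0

y_test = np.random.default_rng(1).normal(size=(100_000, 2))
cov = empirical_coverage(in_disk, y_test)
area = monte_carlo_area(in_disk, [-2, -2], [2, 2])
print(f"coverage={cov:.3f} area={area:.3f}")
```

For standard 2-D Gaussian samples the disk's coverage should be close to 1 − e^(−1/2) ≈ 0.393, and its area close to π, which gives a quick sanity check before applying the same functions to a learned region.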

Real World

Image-to-Text

MS-COCO 2014, BLIP/GIT models, evaluated by CLIP-SIM and BERT-SIM at α = 0.1. The (rN) tag gives each model's rank within its column, with r1 the highest score.

Model	CLIP-SIM: CLIP	CLIP-SIM: CReL-CLIP	BERT-SIM: BERT	BERT-SIM: CReL-BERT
BLIP-base	0.2330 (r4)	0.0070 (r1)	0.8349 (r3)	0.6335 (r3)
BLIP-large	0.2453 (r3)	-0.0074 (r4)	0.8106 (r4)	0.5631 (r4)
GIT-base	0.2511 (r2)	-0.0021 (r2)	0.8620 (r2)	0.6474 (r1)
GIT-large	0.2550 (r1)	-0.0043 (r3)	0.8649 (r1)	0.6459 (r2)
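The rank tags in this table can be reproduced directly from the reported scores, which makes the reordering between the standard metric and its CReL counterpart easy to see. The snippet below is a small illustration using the CLIP-SIM columns above; the helper name `ranks_desc` is hypothetical, and it assumes no tied scores.

```python
import numpy as np

models = ["BLIP-base", "BLIP-large", "GIT-base", "GIT-large"]
clip = np.array([0.2330, 0.2453, 0.2511, 0.2550])          # CLIP column
crel_clip = np.array([0.0070, -0.0074, -0.0021, -0.0043])  # CReL-CLIP column

def ranks_desc(values):
    """Rank 1 for the highest value (assumes no ties)."""
    return np.argsort(np.argsort(-values)) + 1

clip_ranks = ranks_desc(clip)
crel_ranks = ranks_desc(crel_clip)
for name, a, b in zip(models, clip_ranks, crel_ranks):
    print(f"{name:<11} CLIP r{a}  CReL-CLIP r{b}")
```

On these numbers BLIP-base moves from last on CLIP to first on CReL-CLIP, while GIT-large drops from first to third, so the two columns order the models differently.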

Real World

Text-to-Image

MS-COCO 2014, SD3/FLUX/Kandinsky models, evaluated by CLIP-SIM and DINO-SIM at α = 0.1. The (rN) tag gives each model's rank within its column, with r1 the highest score.

Model	CLIP-SIM: CLIP	CLIP-SIM: CReL-CLIP	DINO-SIM: DINO	DINO-SIM: CReL-DINO
SD3-M	0.2590 (r3)	0.0134 (r1)	0.4615 (r1)	-0.1480 (r4)
SD3.5-L	0.2596 (r2)	0.0116 (r2)	0.4531 (r2)	-0.1365 (r1)
FLUX.1-dev	0.2509 (r4)	0.0056 (r4)	0.4395 (r4)	-0.1411 (r3)
Kandinsky-2.2	0.2603 (r1)	0.0062 (r3)	0.4407 (r3)	-0.1404 (r2)