[Vision-Language] CLIP Essentials and a Hands-On Similarity Heatmap

1. CLIP

CLIP (Contrastive Language-Image Pretraining) is a widely used pretrained vision-language model that uses contrastive learning to align images and text in a shared semantic space.

Goal: Train so that matching image-text pairs lie close together in the embedding space and unrelated pairs lie far apart, which establishes a semantic correspondence between the two modalities.
Strength: It supports zero-shot classification, labeling an image from text descriptions alone without task-specific fine-tuning, and it performs well on image-text retrieval.
Characteristic: It has no generation capability; its focus is on understanding and aligning multimodal data.

CLIP paper: Learning Transferable Visual Models From Natural Language Supervision

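2. Limits of Earlier Vision Models and the Motivation for CLIP

Earlier computer vision models had the following clear limitations.

- They depended heavily on the fixed label set of each dataset.
- They had to be retrained whenever a new class or concept appeared.
- They struggled to handle open-world concepts flexibly.

💡 CLIP's Solution

CLIP addresses these limits by collecting large-scale image-text pairs from the web and training on them under the assumption that "an image can be described in language, and language can be mapped to an image."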

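3. Core Idea: Contrastive Learning

At its core, CLIP adjusts the distance between the embeddings of the two modalities (image and text).

Positive pair (matching pair): pull the embeddings closer (higher similarity)
Negative pair (unrelated pair): push the embeddings apart (lower similarity)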

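Model Architecture

CLIP consists of two independent encoders. The outputs of both encoders are projected into an embedding space of the same dimension and compared with cosine similarity.

Vision Encoder: uses a CNN (ResNet) or a ViT to turn an image into a single vector
Text Encoder: uses a Transformer to turn a sentence into a single vector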
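
The two-encoder layout can be sketched with plain NumPy. This is only a toy illustration: the backbone outputs are random vectors, the projection matrices `W_image` and `W_text` are random stand-ins for learned weights, and the dimensions 768, 512, and 256 are assumptions chosen for the example rather than values taken from any particular CLIP checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: each backbone may emit a different width,
# but both are projected to the same shared dimension.
IMAGE_DIM, TEXT_DIM, SHARED_DIM = 768, 512, 256

# Random stand-ins for the learned projection matrices.
W_image = rng.normal(size=(IMAGE_DIM, SHARED_DIM))
W_text = rng.normal(size=(TEXT_DIM, SHARED_DIM))


def project(features, weight):
    """Project encoder outputs into the shared space and L2-normalize each row."""
    z = features @ weight
    return z / np.linalg.norm(z, axis=-1, keepdims=True)


image_backbone_out = rng.normal(size=(4, IMAGE_DIM))  # pretend: 4 encoded images
text_backbone_out = rng.normal(size=(3, TEXT_DIM))    # pretend: 3 encoded sentences

image_emb = project(image_backbone_out, W_image)  # shape (4, 256)
text_emb = project(text_backbone_out, W_text)     # shape (3, 256)

# Rows are unit length, so the dot product equals the cosine similarity.
cosine = image_emb @ text_emb.T  # shape (4, 3)
print(cosine.shape)
```

Because both modalities end up as unit vectors of the same size, any image can be compared with any sentence through a single matrix product.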

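Training Objective

For a batch of $N$ image-text pairs, CLIP computes the $N \times N$ similarity matrix between the image embeddings and the text embeddings.
It uses a softmax-based symmetric cross-entropy loss, where the target for each row and each column is the matching pair on the diagonal:
- Image $\rightarrow$ Text direction
- Text $\rightarrow$ Image direction
Training with the loss in both directions yields a model that handles both image retrieval and text retrieval well.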
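
A minimal NumPy sketch of this symmetric loss follows. The function name `clip_style_loss` is hypothetical, and the fixed `temperature=0.07` is an assumption for illustration (the CLIP paper initializes its temperature near this value but learns it during training). Inputs are assumed to be L2-normalized, with row i of each array forming a matching pair.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix."""
    logits = image_emb @ text_emb.T / temperature  # (N, N)
    idx = np.arange(logits.shape[0])
    # Image -> Text: softmax over each row, target is the diagonal column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> Image: softmax over each column, target is the diagonal row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

aligned_loss = clip_style_loss(emb, emb)         # every pair matches perfectly
shuffled_loss = clip_style_loss(emb, emb[::-1])  # every pair is mismatched
print(aligned_loss < shuffled_loss)  # True
```

The toy check at the end only confirms the expected direction: the loss is small when the diagonal holds the matching pairs and large when the pairing is scrambled.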

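4. Why Zero-shot Learning Works

CLIP's most innovative feature is zero-shot image classification. The procedure is simple but powerful.

First, express the candidate classes as sentences (prompts), for example "a photo of a dog" and "a photo of a cat".
Next, compute the similarity between the target image's embedding and each sentence's embedding.
Finally, predict the class whose sentence has the highest similarity.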
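
The three steps can be sketched as follows. To keep the example self-contained, the embeddings are hand-made 3-dimensional vectors rather than real CLIP outputs, and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np


def zero_shot_classify(image_emb, prompt_embs, prompts):
    """Return the prompt whose embedding is most similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    similarities = prompt_embs @ image_emb  # one cosine score per prompt
    best = int(np.argmax(similarities))
    return prompts[best], similarities


# Step 1: classes expressed as prompts.
prompts = ["a photo of a dog", "a photo of a cat"]

# Made-up vectors standing in for text embeddings of the two prompts.
prompt_embs = np.array([[1.0, 0.1, 0.0],
                        [0.1, 1.0, 0.0]])
# Made-up image embedding that points mostly in the "dog" direction.
image_emb = np.array([0.9, 0.2, 0.1])

# Steps 2 and 3: score every prompt, then take the best one.
label, scores = zero_shot_classify(image_emb, prompt_embs, prompts)
print(label)  # a photo of a dog
```

Adding a new class only requires appending another prompt and its embedding; no retraining step appears anywhere in the procedure.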

👉 Key insight: the idea that "label = text prompt" carried over into many of the LLM-based multimodal models that followed.

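5. [Hands-On] Visualizing a CLIP Cosine Similarity Heatmap

The code below uses HuggingFace's CLIP model and the ImageNetV2 dataset to build a $10 \times 10$ similarity matrix and visualize it as a heatmap.

5-1. Loading Libraries and Basic Configuration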

import io
import torch
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from PIL import Image
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

# Basic configuration
MODEL_NAME = 'openai/clip-vit-base-patch32'
DATASET_NAME = 'clip-benchmark/wds_imagenetv2'
SPLIT = 'test'
SAMPLE_SIZE = 10
THUMB_SIZE = (80, 80)
OUT_PATH = 'heatmap.png'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

5-2. Loading the Model and Preparing the Data

# Load the model and processor
model = CLIPModel.from_pretrained(MODEL_NAME).to(device)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)
model.eval()  # evaluation (inference) mode

# Load the dataset (ImageNetV2: a newer test set built to check whether models have overfit to ImageNet)
dataset = load_dataset(DATASET_NAME)
ds = dataset[SPLIT]
subset = ds.shuffle(seed=2026).select(range(SAMPLE_SIZE))

# Load class names and build the label prompts
# (assumes classnames.txt lists one class name per line, in class-index order)
with open('classnames.txt', 'r', encoding='utf-8') as f:
    cls2label = [line.strip() for line in f]
label_names = [cls2label[int(c)] for c in subset['cls']]
label_texts = [f'a photo of a {name}' for name in label_names]  # prompt template

# Convert a dataset image entry to an RGB PIL image
def to_pil(x):
    if isinstance(x, Image.Image):
        return x.convert('RGB')
    if isinstance(x, dict) and x.get('bytes') is not None:
        return Image.open(io.BytesIO(x['bytes'])).convert('RGB')
    if isinstance(x, str):
        return Image.open(x).convert('RGB')
    raise TypeError(f'Unsupported image type: {type(x)}')

images = [to_pil(img) for img in subset['jpg']]

5-3. Extracting Embeddings and Computing Similarity

with torch.no_grad():
    # 1. Extract image embeddings
    inputs_image = processor(images=images, return_tensors='pt').to(device)
    vision_out = model.vision_model(pixel_values=inputs_image['pixel_values'])
    image_features = model.visual_projection(vision_out.pooler_output)

    # 2. Extract text embeddings
    inputs_text = processor(text=label_texts, return_tensors='pt', padding=True).to(device)
    text_out = model.text_model(input_ids=inputs_text['input_ids'], attention_mask=inputs_text['attention_mask'])
    text_features = model.text_projection(text_out.pooler_output)

# 3. L2-normalize the embeddings
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# 4. Cosine similarity (dot product of unit vectors)
# (10, D) @ (D, 10) -> (10, 10) matrix
similarity_matrix = (image_features @ text_features.T).cpu().numpy()
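
As a sanity check on steps 3 and 4, the sketch below confirms with NumPy that normalizing first and then taking the dot product gives the same numbers as the textbook cosine formula. The arrays `a` and `b` are random stand-ins for `image_features` and `text_features`, and the width 512 is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(10, 512))  # stand-in for image_features
b = rng.normal(size=(10, 512))  # stand-in for text_features

# Route 1: L2-normalize each row, then matrix-multiply.
a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)
sim_normalized = a_n @ b_n.T

# Route 2: the definition cos(a, b) = (a . b) / (|a| |b|).
norms = np.linalg.norm(a, axis=-1)[:, None] * np.linalg.norm(b, axis=-1)[None, :]
sim_definition = (a @ b.T) / norms

print(np.allclose(sim_normalized, sim_definition))  # True
```

Normalizing once up front means the full pairwise comparison reduces to a single matrix product.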

5-4. Visualizing the Heatmap

# Create thumbnails
def create_thumbnail(img, size=(80, 80)):
    return img.resize(size)
thumbnails = [create_thumbnail(img, THUMB_SIZE) for img in images]

# Figure setup
fig, ax = plt.subplots(figsize=(20, 12))
im = ax.imshow(similarity_matrix, aspect='auto', cmap='viridis')
plt.colorbar(im, ax=ax, fraction=0.02, pad=0.02)

# Axis setup
ax.set_xticks(np.arange(len(label_texts)))
ax.set_xticklabels(label_texts, rotation=45, ha='right')
ax.set_yticks(np.arange(len(images)))
ax.set_yticklabels([""] * len(images))  # left blank; the image thumbnails go here

ax.set_title('CLIP Image vs Label Similarity (Cosine Similarity)')
ax.set_xlabel('Label Text')
ax.set_ylabel('Image')

# Write the similarity value in each cell
for i in range(similarity_matrix.shape[0]):
    for j in range(similarity_matrix.shape[1]):
        ax.text(j, i, f'{similarity_matrix[i, j]: .2f}', ha='center', va='center', fontsize=8, color='white')

# Attach image thumbnails along the y-axis
for i, img in enumerate(thumbnails):
    imagebox = OffsetImage(img, zoom=1.0)
    ab = AnnotationBbox(imagebox, (-0.6, i), frameon=False, xycoords='data', boxcoords='data', pad=0)
    ax.add_artist(ab)

ax.set_xlim(-1.2, similarity_matrix.shape[1] - 0.5)
ax.set_ylim(similarity_matrix.shape[0] - 0.5, -0.5)

plt.tight_layout()
plt.savefig(OUT_PATH, dpi=200)
plt.show()
print(f'saved: {OUT_PATH}')

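Interpreting the Results

Diagonal entries: Column i holds the prompt built from image i's own label, so a correctly classified image has its largest row value on the diagonal.
Off-diagonal entries: A high value off the diagonal means either that CLIP confused the image with another class (misclassification) or that the two classes are semantically similar.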
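
The same reading can be done numerically. The sketch below is one possible way to do it: the helper `summarize_similarity` is a hypothetical name, and the 3 x 3 matrix contains made-up values chosen so that the last row is misclassified.

```python
import numpy as np


def summarize_similarity(sim):
    """Compare each row's best-scoring column with its diagonal position."""
    predicted = sim.argmax(axis=1)              # best label column for each image row
    correct = predicted == np.arange(len(sim))  # True where the diagonal wins
    return predicted, correct, correct.mean()


# Made-up similarity values: row 2 scores column 1 higher than its own column.
toy = np.array([[0.31, 0.18, 0.20],
                [0.17, 0.29, 0.21],
                [0.19, 0.27, 0.25]])

predicted, correct, accuracy = summarize_similarity(toy)
print(predicted.tolist())  # [0, 1, 1]
print(correct.tolist())    # [True, True, False]
```

Calling `summarize_similarity(similarity_matrix)` on the matrix from section 5-3 would give the top-1 prediction for each of the 10 images. If two sampled images happen to share a class, their columns are identical and `argmax` returns the first of them, so a tie of that kind can be reported as a miss even though the label is right.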
