Zenodo · May 2026
+5.2% Mean Average Precision · 2.5× similarity gap · 48× storage compression

Applying Principal Component Analysis (PCA) to text embeddings, with the projection fitted on a domain-specific corpus, improves semantic retrieval without any fine-tuning of the embedding model. Tested on a medical corpus of 20 clinical topics using OpenAI's text-embedding-3-small, PCA-32 with corpus-only fitting achieved a Mean Average Precision (MAP) of 0.9203 versus a baseline of 0.8750 (a 5.2% relative improvement), alongside a 2.5× increase in similarity gap and a 48× reduction in storage. Domain-directed axes are essential; random projections do not replicate the gain.
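The corpus-only PCA step and the MAP metric can be sketched roughly as follows. This is a minimal illustration assuming NumPy, not the author's code: the function names are hypothetical, random vectors stand in for real embeddings, and the 1,536-dimensional input assumes the model's default output size, which is consistent with the 48× figure (1536 / 32 = 48).

```python
import numpy as np


def fit_pca(corpus_emb, k):
    """Fit PCA on corpus embeddings only; return (mean, top-k axes)."""
    mean = corpus_emb.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(corpus_emb - mean, full_matrices=False)
    return mean, vt[:k]


def project(emb, mean, axes):
    """Center with the corpus mean, project onto the axes, L2-normalise."""
    z = (emb - mean) @ axes.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def average_precision(scores, relevant):
    """Average precision for one query; `relevant` is a boolean mask."""
    order = np.argsort(-scores)          # best-scoring documents first
    hits = relevant[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_average_precision(query_vecs, doc_vecs, relevance):
    """MAP over queries; vectors are unit-norm, so dot product = cosine."""
    sims = query_vecs @ doc_vecs.T
    return float(np.mean([average_precision(s, r)
                          for s, r in zip(sims, relevance)]))


# Synthetic stand-ins (assumption): 200 documents, 5 queries, 1536 dims.
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 1536))
queries = rng.normal(size=(5, 1536))

mean, axes = fit_pca(docs, k=32)         # fitted on the corpus only
docs_32 = project(docs, mean, axes)
queries_32 = project(queries, mean, axes)  # queries reuse the corpus fit
```

Queries are projected with the mean and axes learned from the corpus, never refitted, which is what "corpus-only fitting" is taken to mean here. Storing 32 values per vector instead of 1,536 gives the 48× reduction.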

PCA · text embeddings · semantic retrieval · NLP · medical domain