From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

1Ulsan National Institute of Science and Technology, South Korea
2ETRI, South Korea
3Kyungpook National University, South Korea
WACV 2026

Abstract

Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment.

In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-whitening-based feature refinement.

Despite using no ground-truth supervision or fine-tuning, our proposed method outperforms prior learning-based approaches on benchmark datasets under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at street2orbit.github.io.

Overview

Method Overview

Figure: Overall framework for training-free cross-view retrieval.

Method

Our framework consists of three main stages: (1) Satellite query generation via location semantics, (2) Visual embedding and similarity-based retrieval, and (3) Embedding refinement via PCA-based whitening. The entire process is training-free and leverages only pretrained models. In more detail, the pipeline proceeds in the following steps:

  1. Location Semantics Extraction: We collect web-based context using Google Image Search with the input street-view image. The surrounding textual descriptions and metadata are aggregated.
  2. LLM-Guided Location Inference: We prompt a large language model (Mistral 7B) to extract the most specific and geolocatable name from the retrieved text. This name is then passed to a geocoding API to acquire accurate latitude and longitude.
  3. Satellite Tile Generation: Using these coordinates, we generate a satellite query image centered at the inferred location via the Google Maps Static API (steps 2 and 3 are sketched below).
  4. Visual Embedding and Retrieval: Both the query satellite image and a gallery of pre-indexed tiles are passed through a pretrained vision encoder (DINOv2) to extract global features, and gallery tiles are ranked by their cosine similarity to the query (see the retrieval sketch below).
  5. PCA-Based Whitening: To improve retrieval robustness, we apply lightweight PCA whitening to decorrelate features and suppress low-level variations such as lighting and texture (see the whitening sketch below).

This architecture supports zero-shot generalization and enables scalable construction of street-to-satellite datasets via automated pairing.
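
The first sketch below illustrates steps 2 and 3 under explicit assumptions: query_llm is a hypothetical wrapper around Mistral 7B inference, the Google Geocoding API stands in for the geocoding step, and the Static Maps zoom level and tile size are illustrative defaults rather than our exact settings.

# Hedged sketch of steps 2-3: LLM-guided location inference, geocoding, and
# satellite tile generation. `query_llm` is a hypothetical Mistral 7B wrapper;
# the endpoints are the public Google Geocoding and Static Maps APIs.
import requests

GOOGLE_API_KEY = "YOUR_KEY"  # placeholder

def infer_place_name(web_context: str, query_llm) -> str:
    # Ask the LLM for the most specific, geocodable place name in the web context.
    prompt = (
        "From the following web search text, return only the most specific "
        "place name that can be geocoded (e.g., a landmark or address):\n"
        + web_context
    )
    return query_llm(prompt).strip()

def geocode(place_name: str):
    # Resolve the place name to (lat, lng) with the Google Geocoding API.
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": place_name, "key": GOOGLE_API_KEY},
        timeout=10,
    )
    loc = resp.json()["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

def fetch_satellite_tile(lat: float, lng: float, zoom: int = 18) -> bytes:
    # Download a satellite tile centered at the coordinates (Static Maps API).
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/staticmap",
        params={
            "center": f"{lat},{lng}",
            "zoom": zoom,
            "size": "640x640",
            "maptype": "satellite",
            "key": GOOGLE_API_KEY,
        },
        timeout=10,
    )
    return resp.content  # PNG bytes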
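Next, a minimal retrieval sketch for step 4. It assumes the publicly released DINOv2 ViT-L/14 checkpoint loaded from torch.hub with standard ImageNet preprocessing; the backbone variant, image resolution, and batching are illustrative choices, not necessarily our exact configuration.

# Hedged sketch of step 4: DINOv2 global features + cosine-similarity retrieval.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),   # 224 is divisible by the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(paths):
    # L2-normalized global (CLS) features for a list of image paths.
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    return F.normalize(model(batch), dim=-1)           # [N, 1024] for ViT-L/14

def retrieve(query_path, gallery_paths, k=5):
    # Rank gallery tiles by cosine similarity to the query tile.
    q = embed([query_path])                             # [1, D]
    g = embed(gallery_paths)                            # [N, D]
    sims = (q @ g.T).squeeze(0)                         # unit-norm features -> cosine similarity
    topk = sims.topk(min(k, len(gallery_paths)))
    return [(gallery_paths[i], sims[i].item()) for i in topk.indices.tolist()]

In practice the gallery embeddings are computed once offline and cached, so only the query tile needs to be embedded at retrieval time.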
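Lastly, a whitening sketch for step 5: PCA with whitening is fitted on the gallery features, both query and gallery features are projected, and the results are re-normalized so that dot products remain cosine similarities. The retained dimensionality (256 here) is an illustrative value.

# Hedged sketch of step 5: PCA-whitening refinement of the global features.
import numpy as np
from sklearn.decomposition import PCA

def whiten(query_feats: np.ndarray, gallery_feats: np.ndarray, n_components: int = 256):
    # Fit PCA with whitening on the gallery, project both sets, then re-normalize.
    pca = PCA(n_components=n_components, whiten=True)
    g = pca.fit_transform(gallery_feats)                # [N, n_components]
    q = pca.transform(query_feats)                      # [M, n_components]
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    return q, g

# Usage: sims = q_white @ g_white.T; ranking = np.argsort(-sims, axis=1)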

Dataset Generation

Figure: Automatic street-to-satellite pair generation using LLM and web-based APIs.

Experiments

We evaluate our method on the University-1652 dataset under the Street-to-Satellite setting. Our training-free framework achieves state-of-the-art performance, outperforming several supervised baselines.

Quantitative Results

Below are Recall@K (R@K) and R@1% results for our method and representative supervised baselines:

Method               R@1     R@5     R@10    R@1%
LPN                   1.28    3.84    6.59    6.98
PLCD                  6.86   14.39   18.50   19.15
Ours (DINOv2-L)      22.84   33.48   37.93   37.64
 + PCA Refinement    25.57   37.21   40.66   39.94

Qualitative Results

We show sample top-5 satellite retrievals from our pipeline below:


Figure: Retrieved top-5 satellite tiles for street-view queries.

BibTeX

@inproceedings{min2026street2orbit,
  author    = {Min, Jeongho and Kim, Dongyoung and Lee, Jaehyup},
  title     = {From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance},
  booktitle = {WACV},
  year      = {2026},
}