Full Report
An ambiguous city street, a freshly mown field, and a parked armoured vehicle were among the example photos we chose to challenge Large Language Models (LLMs) from OpenAI, Google, Anthropic, Mistral and xAI to geolocate. Back in July 2023, Bellingcat analysed the geolocation performance of OpenAI and Google’s models. Both chatbots struggled to identify images […] The post Have LLMs Finally Mastered Geolocation? appeared first on bellingcat.
Analysis Summary
# Research: Evaluating the Geolocation Capabilities of Leading Large Language Models
## Metadata
- Authors: [Implied: Researchers conducting the Bellingcat-style analysis, potentially listed in the full paper but not in the provided text, with Infographics by Logan Williams and Merel Zoet]
- Institution: [Implied: An independent research/investigative body, referencing prior work by Bellingcat]
- Publication: [Implied: A technical report or publication covering current LLM evaluation]
- Date: [Implied: Current (post-release of tested models)]
## Abstract
This research evaluates the current geolocation performance of large language models (LLMs) from OpenAI, Google, Anthropic, Mistral, and xAI against a baseline of traditional reverse image search (Google Lens). By testing 20 different models across 25 previously unpublished, highly varied travel photos, the study aimed to determine how far visual location identification has advanced since similar tests were conducted previously. Key findings indicate that while newer models show improvement, only specific iterations of OpenAI's GPT models outperformed Google Lens, and all models remain susceptible to hallucination, particularly when visual context is ambiguous or temporary.
## Research Objective
To assess and compare the current geolocation accuracy of state-of-the-art LLMs (including versions from OpenAI, Google, Anthropic, Mistral, and xAI) using a set of 25 novel, unpublished travel photographs across various global terrains, and to benchmark their performance against traditional reverse image search tools like Google Lens.
## Methodology
### Approach
A comparative empirical evaluation protocol was used. Researchers presented 20 distinct model versions with 25 unique, metadata-stripped images, one by one. The query for all models was identical: "Where was this photo taken?". Responses were scored rigorously on a scale of 0 (no attempt) to 10 (accurate and specific identification, e.g., neighborhood or exact landmark). The performance of LLMs was benchmarked against the top 10 results from Google Lens's visual match feature.
### Dataset/Environment
The study utilized 25 original travel photos taken by the researchers, sourced from every continent, including Antarctica. The images varied in difficulty, encompassing rural, urban scenes, and scenes with or without clear landmarks (buildings, signs, mountains). Crucially, none of the images had been published online prior to testing, eliminating the possibility of the models fetching existing knowledge.
### Tools & Technologies
- **Models Tested (Selection):** Anthropic (Claude Haiku/Sonnet/Opus 3.5/4.0), Google (Gemini 2.0/2.5 Flash/Pro, Deep Research), Mistral (Pixtral Large), OpenAI (ChatGPT 4o, 4.5, o3, o4-mini variants), xAI (Grok 3 variants, including DeepSearch).
- **Baseline Tool:** Google Lens ("visual match" feature).
- **Scoring:** 0-10 accuracy metric.
## Key Findings
### Primary Results
1. **ChatGPT Leads:** Only specific variants of OpenAI's ChatGPT (particularly `o3`, `o4-mini`, and `o4-mini-high`) successfully outperformed Google Lens in identifying the correct locations.
2. **Google Lens Benchmark:** Google Lens remained a highly effective baseline, with several advanced LLMs (including Google's own Gemini models) scoring lower than the traditional search utility.
3. **Gap in Performance:** Models from Anthropic (Claude) and Mistral significantly lagged behind the top performers from OpenAI, Google, and xAI. Anthropic's most advanced models sometimes only managed continent-level identification.
4. **Cautiousness vs. Confidence:** ChatGPT models tended to be more confident, leading to better accuracy but also a higher rate of hallucinations. Conversely, Anthropic's Claude models, especially in "extended thinking" mode, were highly cautious, often declining to guess or providing only vague regional answers when the regular model might offer a hedged guess.
5. **xAI's Position:** xAI's Grok models performed competitively, with Grok 3 DeeperSearch proving the most effective of the xAI suite and generally outperforming Google Gemini models tested.
### Supporting Evidence
- In one instance involving a Japanese mountain road near Takayama, Gemini 2.5 Pro provided a geo-located answer 15km closer to the actual site than a comparable model response in a previous iteration.
- The risk of hallucination remained high, especially when scenery was temporary (e.g., identifying a beach scene based on a temporary seasonal fair ride present only during the summer).
### Novel Contributions
- A direct, controlled comparison of 20 modern, pre-release and current-generation visual LLMs against an established ground-truth inverse search tool (Google Lens).
- Identification of specific operational behaviors (cautiousness vs. confidence) linked to different model architectures or prompting strategies (e.g., Claude's "extended thinking" mode behavior).
- Highlighting the risk of *contextual leakage*, where models appear to draw upon private user history (past conversations or linked social media profiles) rather than solely the image provided.
## Technical Details
The tests strictly controlled inputs: unpublished photos with zero metadata, and a static, unambiguous prompt ("Where was this photo taken?"). If clarification was sought, the system enforced a strict reply: "There is no supporting information. Use this photo alone." This isolates the visual reasoning capability from external informational inputs. The researchers noted that versions of OpenAI's "deep research" function were currently powered by `o4-mini`.
## Practical Implications
### For Security Practitioners
LLMs, particularly leading GPT variants, are becoming viable forensic tools for initial location triage in open-source intelligence (OSINT) investigations. They offer an advantage over simple reverse image search because they can integrate an image with auxiliary prompt context (though this feature must be used with caution due to contextual leakage risks).
### For Defenders
Security relies less on the model's confidence and more on verification. Given that hallucinations are common, *any* LLM-provided geolocation must be treated as a strong hypothesis requiring independent confirmation, especially if the scenery is transient or lacks clear, permanent landmarks.
### For Researchers
The study indicates that visual reasoning capability is highly version-dependent—a model that performed poorly months prior may significantly improve, necessitating continuous benchmarking. New research should focus on developing verifiable uncertainty metrics for visual LLM outputs, forcing models to self-report confidence levels more accurately than current behavior suggests.
## Limitations
1. **Model Volatility:** The rapid release cycle of new models meant the study could not be exhaustive (e.g., excluding DeepSeek due to its current focus on text extraction).
2. **Dataset Size:** Only 25 images were used. While diverse, this limits generalizability.
3. **Contextual Bias:** The observation of models leveraging private user history (like past tweets or prior conversation topics) suggests that results may vary significantly based on the account used, muddying the purity of the model comparison.
4. **Video Comprehension:** The limitations of current models in processing video content represent a significant blind spot for location analysis.
## Comparison to Prior Work
This work updates and re-evaluates findings from a previous *Bellingcat* analysis conducted when OpenAI and Google models were less developed. The current research explicitly shows convergence and improvement, where previous results indicated near-total failure or high rates of hallucination; newer models now show measurable success above traditional tools, albeit narrowly.
## Real-world Applications
- **First-Pass Geolocation:** Rapidly narrowing down the potential geographic area for photographs encountered in OSINT or incident response.
- **Iterative Refinement:** Utilizing an LLM’s ability to process text prompts alongside the image allows investigators to ask follow-up questions to refine vague results, a capability traditional image search lacks.
## Future Work
- Continuous testing of newly emergent "frontier" models.
- Investigating the efficacy of LLMs when fed rich metadata or partial textual clues alongside the image, to move beyond purely visual input.
- Developing specific adversarial prompts designed to test the limits of LLMs' internal world knowledge against intentionally misleading visual components (e.g., older images of temporary structures).
## References
- [Prior work by Bellingcat on prior iteration of LLM geolocation performance]
- [Information regarding ChatGPT's deep research function powered by o4-mini]
- [General information about model versions, e.g., Gemini previews]