🕵️‍♂️ LookerBot: AI Agents for OSINT Collection
Open-source intelligence (OSINT) refers to the collection and analysis of publicly available information. OSINT can be collected from social media platforms, news articles, public records, GitHub, etc.
OSINT has a wide array of applications: it can surface information that was never meant to be public, and it's particularly valuable during offensive security engagements. OSINT is often leveraged during the reconnaissance phase of a penetration test or red team engagement, where the team tries to find as much information on a target as possible to inform future attacks. This can include information on employees, domains, network blocks, technologies used, etc.
When conducting penetration tests at work I'm frequently tasked with performing OSINT on client organizations. Although this is a valuable process, it gets tedious: both collecting information and verifying results are taxing and can take a long time. To help postpone my inevitable carpal tunnel, I created LookerBot, a proof-of-concept tool that leverages AI agents to assist with the collection and verification of OSINT.
GitHub Repo
Background & Goals
This project was created as the final project for my cybersecurity master's capstone class, where I was tasked with using AI or machine learning to solve a cybersecurity problem.
I started out with the following goals:
- Create a modular POC that can query various intelligence sources
- At a minimum the POC should be able to identify domains and GitHub repos/secrets associated with a target
- AI Agents/LLMs should help with:
- Generating search queries
- Validating results
- The tool should output a structured JSON report, allowing for easy integration with other tools/programs
The Prototype
I created a prototype in Python that uses HuggingFace's Smolagents library to create an AI Agent that can execute Python code.
Although there's definitely room for improvement, the POC is fairly reliable and can find domains, GitHub repositories, and leaked secrets associated with an organization. It uses custom-built tools for the Smolagents framework that allow an AI Agent to interact with the GitHub search API, pull WHOIS records, and search DuckDuckGo.
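As a rough illustration of how a custom Smolagents tool plugs into an agent, here's a minimal sketch. The tool and model wiring below are my own illustrative assumptions, not LookerBot's actual source:

```python
from smolagents import CodeAgent, OpenAIServerModel, tool

# Hypothetical WHOIS tool -- LookerBot's real tools are more involved.
@tool
def whois_lookup(domain: str) -> str:
    """Return the raw WHOIS record for a domain.

    Args:
        domain: The domain name to look up, e.g. "example.com".
    """
    import whois  # pip install python-whois
    return str(whois.whois(domain))

# Assumes OPENAI_API_KEY is set in the environment.
model = OpenAIServerModel(model_id="gpt-4o-mini")
agent = CodeAgent(tools=[whois_lookup], model=model)
```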
Features
- Domain reconnaissance
  - Uses DuckDuckGo searches & WHOIS records to identify domains associated with a target organization
- GitHub reconnaissance
  - Uses the GitHub search API to identify repositories attributed to a target organization
  - TruffleHog integration for secret identification & verification
- DuckDuckGo reconnaissance
  - Uses DuckDuckGo search queries to identify login pages, documents, etc. from identified domains
- Result verification
  - Assigns a confidence score to each result
  - Removes results that are confidently not associated with the target
  - Provides reasoning as to why a confidence score was assigned
- LLM integration via the OpenAI API, HuggingFace API, or Ollama
Workflow
When the tool runs, the agent first searches the internet to generate a basic summary of the organization, including its location, primary domain, purpose, etc. This summary is used as context for future tasks to assist in the generation of search queries and the validation of results.
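Continuing the sketch above, the summary step might boil down to a single `agent.run` call; the prompt here is a guess at the shape, not the tool's exact wording:

```python
# Hypothetical summary task; "Example Corp" stands in for the real target.
summary = agent.run(
    "Search the web for the organization 'Example Corp' and write a short "
    "summary covering its location, primary domain, and purpose."
)
# The summary is then threaded into every later task as shared context.
```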
Next up is domain discovery. The agent runs search queries and scrapes websites for domain names, then validates the results by examining each domain and its WHOIS record.
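One cheap validation signal is whether the WHOIS registrant organization resembles the target. Here's a naive version of that check, sketched with the python-whois package; this is my assumption of the approach, since LookerBot's actual validation is LLM-driven:

```python
import whois  # pip install python-whois

def whois_matches_org(domain: str, org_name: str) -> bool:
    """Naive check: does the WHOIS registrant org mention the target name?"""
    record = whois.whois(domain)  # WhoisEntry behaves like a dict
    registrant = record.get("org") or ""
    return org_name.lower() in str(registrant).lower()
```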
The same workflow is followed for all future data sources:
- The organization summary and discovered domain names are provided as context
- The agent generates search queries
- Each result is assessed for relevance, and a confidence score and reasoning are attached (see the sketch after this list)
- Irrelevant results are discarded
- Additional tools such as TruffleHog are used to enrich results
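To make the verification step concrete, each verified result might carry fields like these; the schema and cutoff below are illustrative, not LookerBot's exact structure:

```python
from dataclasses import dataclass

@dataclass
class VerifiedResult:
    source: str        # e.g. "github", "duckduckgo", "whois"
    value: str         # the discovered artifact (domain, repo URL, ...)
    confidence: float  # 0.0-1.0, assigned by the LLM
    reason: str        # the LLM's justification for the score

# Results below some cutoff are treated as "confidently not associated"
# with the target and dropped from the report.
CONFIDENCE_CUTOFF = 0.2

def filter_results(results: list[VerifiedResult]) -> list[VerifiedResult]:
    """Discard results the LLM is confident do not belong to the target."""
    return [r for r in results if r.confidence >= CONFIDENCE_CUTOFF]
```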
The final result is a structured JSON report with results from each data source and a brief explanation of why the AI chose to include each one.
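A report entry might look roughly like this (a hypothetical shape for illustration, not LookerBot's exact schema):

```json
{
  "organization": "Example Corp",
  "summary": "Example Corp is a ...",
  "domains": [
    {
      "value": "example.com",
      "confidence": 0.95,
      "reason": "WHOIS registrant matches the organization name"
    }
  ],
  "github": [
    {
      "value": "https://github.com/example-corp/internal-tools",
      "confidence": 0.8,
      "reason": "Repository owner profile links to example.com",
      "secrets": []
    }
  ]
}
```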
Results
I started off by testing the tool using locally hosted models with LM Studio & Ollama, which didn't yield great results. Due to hardware limitations the models were slower and less capable, the output wasn't as consistent, and the reasoning/categorization was subpar.
I switched to GPT-4o-mini which showed a significant improvement across the board. Through testing I was able to reliably identify domains and GitHub repositories associated with a target organization.
Additionally, I found:
- API Keys
- Cleartext & hashed credentials
- Database connection strings
- API Documentation
- SQLite Databases
The biggest current roadblock is speed. Verifying individual results takes a long time, depending on the size of the organization, and consumes a lot of tokens.
Future Work
There's a handful of areas where the project could be improved moving forward:
- Add additional intelligence/data sources
- Implement Retrieval Augmented Generation (RAG)
Additional intelligence sources such as Shodan, Censys, social media platforms, and DNS records could easily be integrated into the current POC to improve the data LookerBot is able to collect. This would require developing additional tools for the agent to interact with, but wouldn't require any significant changes to the workflow or structure of the tool.
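For example, a Shodan tool could follow the same pattern as the existing ones. This sketch assumes the official shodan Python library and an API key in the environment; it's not part of the current POC:

```python
import os

import shodan  # pip install shodan
from smolagents import tool

@tool
def shodan_search(query: str) -> str:
    """Search Shodan for internet-facing hosts matching a query.

    Args:
        query: A Shodan search query, e.g. 'org:"Example Corp"'.
    """
    api = shodan.Shodan(os.environ["SHODAN_API_KEY"])
    results = api.search(query)
    # Summarize each match as "ip:port - org" for the agent to reason over.
    return "\n".join(
        f"{m['ip_str']}:{m['port']} - {m.get('org', 'n/a')}"
        for m in results["matches"]
    )
```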
Next, Retrieval Augmented Generation (RAG) could be implemented to help improve the model's consistency and performance. RAG allows an LLM to retrieve data from an external knowledge base to inform its actions. Allowing the model to query a known-good knowledge base of OSINT collection and verification techniques could improve its consistency and performance, potentially allowing the tool to be used more successfully with smaller self-hosted LLMs.
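A minimal retrieval step, sketched here with sentence-transformers embeddings over a hypothetical knowledge base of OSINT guidance, could prepend the most relevant entries to the agent's prompt. Everything below is an assumption about how this might look:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical knowledge base of OSINT collection/verification guidance.
KNOWLEDGE_BASE = [
    "When validating a domain, compare the WHOIS registrant org to the target.",
    "GitHub repos can be attributed via profile links, commit emails, and READMEs.",
    "Login pages can be found with site: queries plus keywords like 'login'.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
kb_embeddings = model.encode(KNOWLEDGE_BASE, convert_to_tensor=True)

def retrieve_guidance(task: str, top_k: int = 2) -> list[str]:
    """Return the knowledge-base entries most similar to the current task."""
    task_embedding = model.encode(task, convert_to_tensor=True)
    scores = util.cos_sim(task_embedding, kb_embeddings)[0]
    top = scores.topk(k=top_k).indices.tolist()
    return [KNOWLEDGE_BASE[i] for i in top]
```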
Overall, I'm happy with this project as a proof of concept. It's not at the point where it can replace a human analyst, but it's valuable for getting a high-level view of an organization.
- Title: 🕵️‍♂️ LookerBot: AI Agents for OSINT Collection
- Author: Liam Geyer
- Created at: 2025-04-24 00:00:00
- Updated at: 2025-07-27 23:36:44
- Link: https://lfgberg.org/2025/04/24/development/LookerBot/
- License: This work is licensed under CC BY-NC-SA 4.0.