🕵️‍♂️ LookerBot: AI Agents for OSINT Collection
Open-source intelligence (OSINT) refers to the collection and analysis of publicly available information. OSINT can be collected from social media platforms, news articles, public records, GitHub, etc.
OSINT has a wide array of applications: it can surface information that was never meant to be public, and it's particularly valuable during offensive security engagements. OSINT is often leveraged during the reconnaissance phase of a penetration test or red team engagement, where the team tries to find as much information on a target as possible to inform future attacks. This can include information on employees, domains, network blocks, technologies used, etc.
When conducting penetration tests at work I'm frequently tasked with performing OSINT on client organizations. Although this is a valuable process, it gets tedious: both collecting information and verifying results are taxing and can take a long time. To help postpone my inevitable carpal tunnel, I created LookerBot, a proof-of-concept tool that leverages AI agents to assist with the collection and verification of OSINT.
GitHub Repo
Background & Goals
This project was created as the final project for my cybersecurity master's capstone class, where I was tasked with using AI or machine learning to solve a cybersecurity problem.
I started out with the following goals:
- Create a modular POC that can query various intelligence sources
- At a minimum the POC should be able to identify domains and GitHub repos/secrets associated with a target
- AI Agents/LLMs should help with:
- Generating search queries
- Validating results
- The tool should output a structured JSON report, allowing for easy integration with other tools/programs
The Prototype
I created a prototype in Python that uses HuggingFace's Smolagents library to create an AI Agent that can execute Python code.
Although there's definitely room for improvement, the POC is fairly reliable and can find domains, GitHub repositories, and leaked secrets associated with an organization. It uses custom-built tools for the Smolagents framework that allow an AI Agent to interact with the GitHub search API, pull WHOIS records, and search DuckDuckGo.
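As a rough illustration of how a custom Smolagents tool plugs into an agent, here's a minimal sketch. The tool and model wiring below are my own illustrative assumptions, not LookerBot's actual source:

```python
from smolagents import CodeAgent, OpenAIServerModel, tool

# Hypothetical WHOIS tool -- LookerBot's real tools are more involved.
@tool
def whois_lookup(domain: str) -> str:
    """Return the raw WHOIS record for a domain.

    Args:
        domain: The domain name to look up, e.g. "example.com".
    """
    import whois  # pip install python-whois
    return str(whois.whois(domain))

# Assumes OPENAI_API_KEY is set in the environment.
model = OpenAIServerModel(model_id="gpt-4o-mini")
agent = CodeAgent(tools=[whois_lookup], model=model)
```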
Features
- Domain reconnaissance
  - Uses DuckDuckGo searches & WHOIS records to identify domains associated with a target organization
- GitHub reconnaissance
  - Uses the GitHub search API to identify repositories attributed to a target organization
  - TruffleHog integration for secret identification & verification
- DuckDuckGo reconnaissance
  - Uses DuckDuckGo search queries to identify login pages, documents, etc. from identified domains
- Result verification
  - Assigns a confidence score to each result
  - Removes results that are confidently not associated with the target
  - Provides reasoning as to why a confidence score was assigned
- LLM integration via the OpenAI API, HuggingFace API, or Ollama
Workflow
When the tool runs, the agent first searches the internet to generate a basic summary of the organization, including its location, primary domain, purpose, etc. This summary is used as context for future tasks to assist in the generation of search queries and the validation of results.
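Continuing the sketch above, the summary step might boil down to a single `agent.run` call; the prompt here is a guess at the shape, not the tool's exact wording:

```python
# Hypothetical summary task; "Example Corp" stands in for the real target.
summary = agent.run(
    "Search the web for the organization 'Example Corp' and write a short "
    "summary covering its location, primary domain, and purpose."
)
# The summary is then threaded into every later task as shared context.
```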
Next up is domain discovery. The agent runs search queries and scrapes websites for domain names, then validates the results by examining each domain and its WHOIS record.
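One cheap validation signal is whether the WHOIS registrant organization resembles the target. Here's a naive version of that check, sketched with the python-whois package; this is my assumption of the approach, since LookerBot's actual validation is LLM-driven:

```python
import whois  # pip install python-whois

def whois_matches_org(domain: str, org_name: str) -> bool:
    """Naive check: does the WHOIS registrant org mention the target name?"""
    record = whois.whois(domain)  # WhoisEntry behaves like a dict
    registrant = record.get("org") or ""
    return org_name.lower() in str(registrant).lower()
```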
The same workflow is followed for all future data sources:
- The organization summary and discovered domain names are provided as context
- The agent generates search queries
- Each result is assessed for relevance, and a confidence score and reasoning are attached (see the sketch after this list)
- Irrelevant results are discarded
- Additional tools such as TruffleHog are used to enrich results
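To make the verification step concrete, each verified result might carry fields like these; the schema and cutoff below are illustrative, not LookerBot's exact structure:

```python
from dataclasses import dataclass

@dataclass
class VerifiedResult:
    source: str        # e.g. "github", "duckduckgo", "whois"
    value: str         # the discovered artifact (domain, repo URL, ...)
    confidence: float  # 0.0-1.0, assigned by the LLM
    reason: str        # the LLM's justification for the score

# Results below some cutoff are treated as "confidently not associated"
# with the target and dropped from the report.
CONFIDENCE_CUTOFF = 0.2

def filter_results(results: list[VerifiedResult]) -> list[VerifiedResult]:
    """Discard results the LLM is confident do not belong to the target."""
    return [r for r in results if r.confidence >= CONFIDENCE_CUTOFF]
```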
The final result is a structured JSON report with results from each data source and a brief explanation of why the AI chose to include each one.
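A report entry might look roughly like this (a hypothetical shape for illustration, not LookerBot's exact schema):

```json
{
  "organization": "Example Corp",
  "summary": "Example Corp is a ...",
  "domains": [
    {
      "value": "example.com",
      "confidence": 0.95,
      "reason": "WHOIS registrant matches the organization name"
    }
  ],
  "github": [
    {
      "value": "https://github.com/example-corp/internal-tools",
      "confidence": 0.8,
      "reason": "Repository owner profile links to example.com",
      "secrets": []
    }
  ]
}
```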
Results
I started off by testing the tool using locally hosted models with LM Studio & Ollama, which didn't yield great results. Due to hardware limitations the models were slower and less capable, the output wasn't as consistent, and the reasoning/categorization was subpar.
I switched to GPT-4o-mini which showed a significant improvement across the board. Through testing I was able to reliably identify domains and GitHub repositories associated with a target organization.
Additionally, I found:
- API Keys
- Cleartext & hashed credentials
- Database connection strings
- API Documentation
- SQLite Databases
The biggest current roadblock is speed. Verifying individual results takes a long time, depending on the size of the organization, and consumes a lot of tokens.
Future Work
There's a handful of areas where the project could be improved moving forward:
- Add additional intelligence/data sources
- Implement Retrieval Augmented Generation (RAG)
Additional intelligence sources such as Shodan, Censys, social media platforms, and DNS records could easily be integrated into the current POC to improve the data LookerBot is able to collect. This would require developing additional tools for the agent to interact with, but wouldn't require any significant changes to the workflow or structure of the tool.
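For example, a Shodan tool could follow the same pattern as the existing ones. This sketch assumes the official shodan Python library and an API key in the environment; it's not part of the current POC:

```python
import os

import shodan  # pip install shodan
from smolagents import tool

@tool
def shodan_search(query: str) -> str:
    """Search Shodan for internet-facing hosts matching a query.

    Args:
        query: A Shodan search query, e.g. 'org:"Example Corp"'.
    """
    api = shodan.Shodan(os.environ["SHODAN_API_KEY"])
    results = api.search(query)
    # Summarize each match as "ip:port - org" for the agent to reason over.
    return "\n".join(
        f"{m['ip_str']}:{m['port']} - {m.get('org', 'n/a')}"
        for m in results["matches"]
    )
```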
Next, Retrieval Augmented Generation (RAG) could be implemented to help improve the model's consistency and performance. RAG allows an LLM to retrieve data from an external knowledge base to inform its actions. Allowing the model to query a known-good knowledge base of OSINT collection and verification techniques could improve its consistency and performance, potentially allowing the tool to be used more successfully with smaller self-hosted LLMs.
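A minimal retrieval step, sketched here with sentence-transformers embeddings over a hypothetical knowledge base of OSINT guidance, could prepend the most relevant entries to the agent's prompt. Everything below is an assumption about how this might look:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical knowledge base of OSINT collection/verification guidance.
KNOWLEDGE_BASE = [
    "When validating a domain, compare the WHOIS registrant org to the target.",
    "GitHub repos can be attributed via profile links, commit emails, and READMEs.",
    "Login pages can be found with site: queries plus keywords like 'login'.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
kb_embeddings = model.encode(KNOWLEDGE_BASE, convert_to_tensor=True)

def retrieve_guidance(task: str, top_k: int = 2) -> list[str]:
    """Return the knowledge-base entries most similar to the current task."""
    task_embedding = model.encode(task, convert_to_tensor=True)
    scores = util.cos_sim(task_embedding, kb_embeddings)[0]
    top = scores.topk(k=top_k).indices.tolist()
    return [KNOWLEDGE_BASE[i] for i in top]
```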
Overall, I'm happy with this project as a proof of concept. It's not at the point where it can replace a human analyst, but it's valuable for getting a high-level view of an organization.
- Title: 🕵️‍♂️ LookerBot: AI Agents for OSINT Collection
- Author: Liam Geyer
- Created at: 2025-04-24 00:00:00
- Updated at: 2025-07-27 23:36:44
- Link: https://lfgberg.org/2025/04/24/development/LookerBot/
- License: This work is licensed under CC BY-NC-SA 4.0.