mirror of
https://github.com/frankwxu/AI4DigitalForensics.git
synced 2026-04-10 11:23:42 +00:00
142 lines
6.6 KiB
Markdown
142 lines
6.6 KiB
Markdown
# RAG-based Cyber Forensics Investigation Tool
|
|
|
|
[](https://opensource.org/licenses/MIT)
|
|
|
|
## Author
|
|
|
|
**Mohit Ajaykumar Dhabuwala**
|
|
|
|
- M.S. in Cyber Forensics and Counterterrorism
|
|
- Specialization: Digital Forensics & Incident Response (DFIR)
|
|
- Proficient in:
|
|
- Memory, Windows, mobile, and network forensics
|
|
- Forensic tools: Magnet AXIOM, EnCase, Volatility, Wireshark
|
|
- Programming languages: Python, Bash, PowerShell for forensic data parsing and automation
|
|
|
|
## What is RAG?
|
|
|
|
Retrieval-Augmented Generation (RAG) enhances language model responses by combining information retrieval with text generation. It retrieves relevant information from a knowledge base and uses a language model to generate accurate, factual, and contextually appropriate answers. This enables language models to handle complex queries and access domain-specific knowledge effectively.
|
|
|
|
This project implements a RAG system to assist in cyber forensics investigations, leveraging LangChain, Hugging Face models, and FAISS for efficient retrieval and question answering over a provided knowledge base. The system processes a text-based scenario, divides it into manageable chunks, generates embeddings, stores them in a vector store, and employs a language model to answer user queries based on the retrieved information.
|
|
|
|
[](https://colab.research.google.com/drive/1j1mgpfdpJp0s1eQn7JzJr3lhhB8tjXLn?usp=sharing)
|
|
|
|
## Video Demonstrations
|
|
|
|
For a visual demonstration of how this RAG system works, please refer to the following videos:
|
|
|
|
- **RAG Fundamentals:** [https://youtu.be/T-D1OfcDW1M](https://youtu.be/T-D1OfcDW1M) AND [https://youtu.be/W-ulb-DMtsM](https://youtu.be/W-ulb-DMtsM)
|
|
- **RAG Implementation:** [https://youtu.be/shiSITpK0ps](https://youtu.be/shiSITpK0ps)
|
|
|
|
## Technical Description
|
|
|
|
The system operates through these steps:
|
|
|
|
1. **Environment Setup:** Installation of necessary Python libraries, including `langchain`, `langchain-huggingface`, `faiss-cpu`, and `huggingface_hub`.
|
|
2. **Scenario Definition:** The cyberpunk case study is defined as a string (`document_text`).
|
|
3. **Text Splitting:** The scenario text is divided into chunks using `RecursiveCharacterTextSplitter`, controlled by parameters like `chunk_size`, `chunk_overlap`, and `separators`.
|
|
4. **Embeddings:** Text embeddings are generated using the Hugging Face Inference API.
|
|
5. **Vector Store:** FAISS is used to store and retrieve text embeddings based on similarity.
|
|
6. **Retrieval QA Chain:** LangChain's `RetrievalQA` chain combines the vector store with a language model. It retrieves relevant text chunks based on the user's query and generates an answer.
|
|
7. **Language Model:** The Hugging Face Inference API with a specified model (`mistralai/Mistral-7B-Instruct-v0.1`) is used for response generation.
|
|
8. **Query Processing:** The system receives user queries, retrieves relevant information from the vector store, and generates answers using the language model.
|
|
|
|
This setup enables the RAG system to answer questions related to the cyber forensics scenario.
|
|
|
|
## Dependencies
|
|
|
|
To run this project, ensure you have the following:
|
|
|
|
- **Python:** 3.7+
|
|
- **pip:** Python package installer
|
|
- **Hugging Face Account & Access Token:** Required for Hugging Face models and the Inference API.
|
|
- **Google Colab:** To execute the notebook.
|
|
|
|
**Disk space:** Google Colab's virtual environment manages disk space for dependencies.
|
|
|
|
## How This Project Works
|
|
|
|
The project uses:
|
|
|
|
- Hugging Face Models for embedding and text generation.
|
|
- LangChain for language model applications.
|
|
- FAISS for efficient similarity search.
|
|
- Google Colab for running Python code.
|
|
|
|
The system workflow is:
|
|
|
|
1. **Load and split:** Load and divide the cyber forensics document into chunks.
|
|
2. **Embed:** Transform each chunk into a vector representation.
|
|
3. **Store:** Store embeddings in a FAISS index.
|
|
4. **Query:** Transform user's question into an embedding and search the FAISS index.
|
|
5. **Answer:** Generate an answer using a language model based on retrieved information.
|
|
|
|
## Code Overview
|
|
|
|
The code comprises:
|
|
|
|
- Document loading and processing using `RecursiveCharacterTextSplitter`.
|
|
- Embedding generation using the Hugging Face Inference API.
|
|
- FAISS vectorstore creation.
|
|
- `RetrievalQA` chain setup for question answering.
|
|
- A simple chat interface for user interaction.
|
|
|
|
## Why This Approach Is Beneficial
|
|
|
|
RAG offers these advantages:
|
|
|
|
- **Contextualized responses:** Answers are based on the provided cyber forensics document.
|
|
- **Interactive interface:** User-friendly chat interaction.
|
|
- **Efficiency:** FAISS enables fast retrieval.
|
|
- **Cloud-based execution:** Google Colab provides a convenient environment.
|
|
- **Hugging Face Integration:** Simplifies embedding and text generation.
|
|
|
|
## System Workflow Diagram
|
|
|
|
_(Flowchart image included here)_
|
|
|
|

|
|
|
|
## Setup and Usage
|
|
|
|
1. **Create a Hugging Face Account** (if needed): Go to [https://huggingface.co/](https://huggingface.co/) and sign up.
|
|
2. **Generate a Hugging Face Access Token:**
|
|
- Log in to your Hugging Face account.
|
|
- Go to your profile settings.
|
|
- Find the "Access Tokens" section.
|
|
- Create a new token.
|
|
- Copy the generated token.
|
|
3. **Open a Google Colab Notebook:**
|
|
4. **Install Python dependencies:** Execute these commands in a Colab cell:
|
|
|
|
```bash
|
|
!pip install transformers langchain langchain_community faiss-cpu huggingface_hub pypdf pymupdf -U langchain langchain-huggingface
|
|
!pip install --upgrade langchain
|
|
```
|
|
|
|
5. **Provide Hugging Face API Token:** Add a code cell to set the `HUGGINGFACEHUB_API_TOKEN` environment variable with your token:
|
|
|
|
```python
|
|
import os
|
|
os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'hf_your_token' # Replace 'hf_your_token' with your actual token
|
|
```
|
|
|
|
6. **Provide your knowledge base:** Add a cell to define `document_text` (the scenario).
|
|
7. **Run the code:** Execute the cells to interact with the RAG system.
|
|
|
|
## Features
|
|
|
|
- **Google Colab Integration:** Streamlined setup and execution in a cloud-based setting.
|
|
- **Hugging Face Integration:** Leverages pre-trained models for embedding and text generation.
|
|
- **FAISS Vectorstore:** Enables efficient and rapid similarity search.
|
|
- **Text Chunking:** Divides documents into manageable chunks for processing.
|
|
- **Chat Interface:** Offers a simple text-based interface for user interaction.
|
|
|
|
## Contributing
|
|
|
|
Contributions are welcome! Submit issues or pull requests to improve the project.
|
|
|
|
## License
|
|
|
|
This project is released under the MIT License.
|