RAG-based Cyber Forensics Investigation Tool 
What is RAG?
Retrieval-Augmented Generation (RAG) enhances language model responses by combining information retrieval with text generation. It retrieves relevant information from a knowledge base and uses a language model to generate accurate, factual, and contextually appropriate answers. This enables language models to handle complex queries and access domain-specific knowledge effectively.
This project implements a RAG system to assist in cyber forensics investigations, leveraging LangChain, Hugging Face models, and FAISS for efficient retrieval and question answering over a provided knowledge base. The system processes a text-based scenario, divides it into manageable chunks, generates embeddings, stores them in a vector store, and employs a language model to answer user queries based on the retrieved information.
Video Demonstrations
For a visual demonstration of how this RAG system works, please refer to the following videos:
- RAG Fundamentals: https://youtu.be/T-D1OfcDW1M and https://youtu.be/W-ulb-DMtsM
- RAG Implementation: https://youtu.be/shiSITpK0ps
Technical Description
The system operates through these steps:
- Environment Setup: Installation of the necessary Python libraries, including `langchain`, `langchain-huggingface`, `faiss-cpu`, and `huggingface_hub`.
- Scenario Definition: The cyberpunk case study is defined as a string (`document_text`).
- Text Splitting: The scenario text is divided into chunks using `RecursiveCharacterTextSplitter`, controlled by parameters such as `chunk_size`, `chunk_overlap`, and `separators`.
- Embeddings: Text embeddings are generated using the Hugging Face Inference API.
- Vector Store: FAISS is used to store and retrieve text embeddings based on similarity.
- Retrieval QA Chain: LangChain's `RetrievalQA` chain combines the vector store with a language model. It retrieves relevant text chunks based on the user's query and generates an answer.
- Language Model: The Hugging Face Inference API with a specified model (`mistralai/Mistral-7B-Instruct-v0.1`) is used for response generation.
- Query Processing: The system receives user queries, retrieves relevant information from the vector store, and generates answers using the language model.
This setup enables the RAG system to answer questions related to the cyber forensics scenario.
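The text-splitting step can be sketched in plain Python. The function below is a simplified, hypothetical stand-in for `RecursiveCharacterTextSplitter` (which additionally tries to break on separators such as paragraphs and sentences rather than at fixed offsets); it is shown only to illustrate how `chunk_size` and `chunk_overlap` interact:

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    """Fixed-size splitter with overlap -- a simplified stand-in for
    LangChain's RecursiveCharacterTextSplitter, which also respects
    separator boundaries instead of cutting at fixed offsets."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# 250 characters of sample text -> 4 chunks, each overlapping
# the previous one by 20 characters.
doc = "".join(str(i % 10) for i in range(250))
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, which improves retrieval quality.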
Dependencies
To run this project, ensure you have the following:
- Python: 3.7+
- pip: Python package installer
- Hugging Face Account & Access Token: Required for Hugging Face models and the Inference API.
- Google Colab: To execute the notebook.
- Disk space: Google Colab's virtual environment provides the disk space needed for the dependencies.
How This Project Works
The project uses:
- Hugging Face Models for embedding and text generation.
- LangChain for language model applications.
- FAISS for efficient similarity search.
- Google Colab for running Python code.
The system workflow is:
- Load and split: Load and divide the cyber forensics document into chunks.
- Embed: Transform each chunk into a vector representation.
- Store: Store embeddings in a FAISS index.
- Query: Transform user's question into an embedding and search the FAISS index.
- Answer: Generate an answer using a language model based on retrieved information.
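The query-and-retrieve steps can be illustrated with a toy, dependency-free sketch. The bag-of-letters `embed` function below is a deliberately crude, hypothetical stand-in for a real Hugging Face embedding model, and the linear cosine scan mimics what the FAISS index does far more efficiently:

```python
import math

def embed(text):
    # Toy "embedding": a 26-dimensional letter-frequency vector.
    # A real system would call a Hugging Face embedding model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Detective Y investigated a ransomware attack.",
    "The victim was a robotics engineer named Z.",
    "The Serpent routed traffic through a remote server.",
]
# "Index" the chunks by precomputing their embeddings.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    # Embed the question, score it against every chunk, return the top k.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

In the real system, the retrieved chunks are then passed to the language model as context for answer generation.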
Code Overview
The code comprises:
- Document loading and processing using `RecursiveCharacterTextSplitter`.
- Embedding generation using the Hugging Face Inference API.
- FAISS vector store creation.
- `RetrievalQA` chain setup for question answering.
- A simple chat interface for user interaction.
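The chat interface can be sketched as a small input loop. The names `chat` and `answer_fn` below are illustrative, not the notebook's actual code; in the real system `answer_fn` would invoke the `RetrievalQA` chain:

```python
def chat(answer_fn, input_fn=input, output_fn=print):
    """Minimal text chat loop: read a question, print the answer,
    stop on 'exit' or 'quit'."""
    output_fn("Ask a question (type 'exit' to quit).")
    while True:
        query = input_fn("> ").strip()
        if query.lower() in {"exit", "quit"}:
            break
        if query:
            output_fn(answer_fn(query))

# Scripted demo instead of live stdin, so the loop can run unattended.
scripted = iter(["What type of cyberattack?", "exit"])
replies = []
chat(lambda q: f"(answer for: {q})",
     input_fn=lambda _: next(scripted),
     output_fn=replies.append)
```

Injecting `input_fn` and `output_fn` keeps the loop testable; in Colab the defaults (`input`/`print`) give an interactive session.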
Why This Approach Is Beneficial
RAG offers these advantages:
- Contextualized responses: Answers are grounded in the provided cyber forensics document.
- Interactive interface: User-friendly chat interaction.
- Efficiency: FAISS enables fast retrieval.
- Cloud-based execution: Google Colab provides a convenient environment.
- Hugging Face Integration: Simplifies embedding and text generation.
System Workflow Diagram
(Flowchart image included here)
Setup and Usage
- Create a Hugging Face Account (if needed): Go to https://huggingface.co/ and sign up.
- Generate a Hugging Face Access Token:
  - Log in to your Hugging Face account.
  - Go to your profile settings.
  - Find the "Access Tokens" section.
  - Create a new token.
  - Copy the generated token.
- Open a Google Colab Notebook.
- Install Python dependencies: Execute this command in a Colab cell:

      !pip install -U langchain langchain-core langchain-huggingface langchain_community faiss-cpu huggingface_hub

- Provide the Hugging Face API Token: Add a code cell that sets the `HUGGINGFACEHUB_API_TOKEN` environment variable:

      import os
      api_token = "ENTER THE API KEY"  # Replace 'ENTER THE API KEY' with your actual token
      os.environ["HUGGINGFACEHUB_API_TOKEN"] = api_token

- Provide Your Knowledge Base: Add a cell to define `scenario_text` (any passage of your choice).
- Run the Code: Execute the cells in order to interact with the RAG system.
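After the install step, an optional sanity check (not part of the original notebook) can confirm the packages imported correctly; note that the `faiss-cpu` wheel imports under the module name `faiss`:

```python
def check_deps(packages):
    """Return the subset of `packages` that cannot be imported."""
    missing = []
    for pkg in packages:
        try:
            __import__(pkg)
        except ImportError:
            missing.append(pkg)
    return missing

# Module names differ from pip names: faiss-cpu -> faiss,
# langchain-huggingface -> langchain_huggingface.
missing = check_deps(["langchain", "langchain_huggingface", "faiss", "huggingface_hub"])
if missing:
    print("Re-run the pip install cell; missing:", ", ".join(missing))
```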
Background Story Used
This project uses a futuristic cyberpunk scenario to simulate a cybercrime investigation. Detective Y investigates a complex ransomware attack on robotics engineer Z carried out by "The Serpent," who employs advanced techniques to encrypt and steal research data. This scenario serves as the knowledge base for the RAG system.
Story-based Questions
The RAG system answers questions based on the provided cyber forensics scenario. Examples:
In-Text Questions:
- What type of cyberattack did Detective Y investigate?
- What was the victim's profession?
- Where was the remote server located that led to the perpetrator's arrest?
Out-of-Text Questions (Answers not in the text):
- What specific encryption algorithm did The Serpent use?
- What was the name of the university where the security breach occurred?
- Did Detective Y's team collaborate with external experts?
Features
- Google Colab Integration: Streamlined setup and execution in a cloud-based setting.
- Hugging Face Integration: Leverages pre-trained models for embedding and text generation.
- FAISS Vectorstore: Enables efficient and rapid similarity search.
- Text Chunking: Divides documents into manageable chunks for processing.
- Chat Interface: Offers a simple text-based interface for user interaction.
Contributing
Contributions are welcome! Submit issues or pull requests to improve the project.
License
This project is released under the MIT License.
