RAG-based Cyber Forensics Investigation Tool
Author
Mohit Ajaykumar Dhabuwala
- M.S. in Cyber Forensics and Counterterrorism
- Specialization: Digital Forensics & Incident Response (DFIR)
- Proficient in:
  - Memory, Windows, mobile, and network forensics
  - Forensic tools: Magnet AXIOM, EnCase, Volatility, Wireshark
  - Programming languages: Python, Bash, and PowerShell for forensic data parsing and automation
What is RAG?
Retrieval-Augmented Generation (RAG) enhances language model responses by combining information retrieval with text generation. It retrieves relevant information from a knowledge base and uses a language model to generate accurate, factual, and contextually appropriate answers. This enables language models to handle complex queries and access domain-specific knowledge effectively.
This project implements a RAG system to assist in cyber forensics investigations, leveraging LangChain, Hugging Face models, and FAISS for efficient retrieval and question answering over a provided knowledge base. The system processes a text-based scenario, divides it into manageable chunks, generates embeddings, stores them in a vector store, and employs a language model to answer user queries based on the retrieved information.
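As a rough, framework-free illustration of the retrieve-then-generate idea (this is not code from the repository; the `embed` function below is only a stand-in for a real embedding model), a query is embedded, matched against chunk embeddings by cosine similarity, and the best match is placed in the prompt handed to the language model:

```python
import numpy as np

def embed(text):
    # Stand-in embedding: a real system would call an embedding model
    # (e.g. via the Hugging Face Inference API) instead of hashing the text.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = ["Chunk describing the compromised workstation ...",
          "Chunk describing the suspicious network traffic ..."]
query = "Which workstation was compromised?"

# Retrieve: pick the chunk whose embedding is closest to the query embedding
q_vec = embed(query)
best_chunk = max(chunks, key=lambda c: cosine(q_vec, embed(c)))

# Generate: the retrieved chunk grounds the prompt given to the language model
prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {query}"
print(prompt)
```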
Video Demonstrations
For a visual demonstration of how this RAG system works, please refer to the following videos:
- RAG Fundamentals: https://youtu.be/T-D1OfcDW1M and https://youtu.be/W-ulb-DMtsM
- RAG Implementation: https://youtu.be/shiSITpK0ps
Technical Description
The system operates through these steps:
- Environment Setup: Installation of the necessary Python libraries, including `langchain`, `langchain-huggingface`, `faiss-cpu`, and `huggingface_hub`.
- Scenario Definition: The cyber forensics case study is defined as a string (`document_text`).
- Text Splitting: The scenario text is divided into chunks using `RecursiveCharacterTextSplitter`, controlled by parameters such as `chunk_size`, `chunk_overlap`, and `separators`.
- Embeddings: Text embeddings are generated using the Hugging Face Inference API.
- Vector Store: FAISS is used to store and retrieve text embeddings based on similarity.
- Retrieval QA Chain: LangChain's `RetrievalQA` chain combines the vector store with a language model. It retrieves relevant text chunks based on the user's query and generates an answer.
- Language Model: The Hugging Face Inference API with a specified model (`mistralai/Mistral-7B-Instruct-v0.1`) is used for response generation.
- Query Processing: The system receives user queries, retrieves relevant information from the vector store, and generates answers using the language model.
This setup enables the RAG system to answer questions related to the cyber forensics scenario.
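A condensed sketch of these steps in LangChain might look like the following. The embedding model name (`sentence-transformers/all-MiniLM-L6-v2`) and the exact import paths are assumptions that vary across LangChain releases, not a copy of the notebook code:

```python
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings
from langchain.chains import RetrievalQA

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_your_token"  # your Hugging Face access token

document_text = "... the cyber forensics scenario ..."  # knowledge base defined as a string

# Split the scenario into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50,
                                          separators=["\n\n", "\n", " ", ""])
chunks = splitter.split_text(document_text)

# Embed the chunks via the Hugging Face Inference API and index them with FAISS
embeddings = HuggingFaceEndpointEmbeddings(model="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(chunks, embeddings)

# Wire the retriever and the instruct model into a RetrievalQA chain
llm = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.1",
                          max_new_tokens=256, temperature=0.1)
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                       retriever=vectorstore.as_retriever())

print(qa_chain.invoke({"query": "Which system was compromised first?"})["result"])
```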
Dependencies
To run this project, ensure you have the following:
- Python: 3.7+
- pip: Python package installer
- Hugging Face Account & Access Token: Required for Hugging Face models and the Inference API.
- Google Colab: To execute the notebook.
- Disk space: Google Colab's virtual environment manages disk space for dependencies.
How This Project Works
The project uses:
- Hugging Face Models for embedding and text generation.
- LangChain for language model applications.
- FAISS for efficient similarity search.
- Google Colab for running Python code.
The system workflow is:
- Load and split: Load and divide the cyber forensics document into chunks.
- Embed: Transform each chunk into a vector representation.
- Store: Store embeddings in a FAISS index.
- Query: Transform user's question into an embedding and search the FAISS index.
- Answer: Generate an answer using a language model based on retrieved information.
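Assuming the `vectorstore` and `qa_chain` objects from the sketch above (and a hypothetical example query), steps 4 and 5 correspond roughly to:

```python
query = "What malware artifacts were recovered from the workstation?"  # example query

# Step 4: embed the query and search the FAISS index for the most similar chunks
retrieved = vectorstore.similarity_search(query, k=3)
for doc in retrieved:
    print(doc.page_content[:80], "...")

# Step 5: the RetrievalQA chain performs the same retrieval and passes the chunks to the LLM
answer = qa_chain.invoke({"query": query})["result"]
print(answer)
```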
Code Overview
The code comprises:
- Document loading and processing using `RecursiveCharacterTextSplitter`.
- Embedding generation using the Hugging Face Inference API.
- FAISS vector store creation.
- `RetrievalQA` chain setup for question answering.
- A simple chat interface for user interaction.
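The chat interface boils down to a small input loop around the chain. A minimal sketch, assuming the `qa_chain` object from the pipeline sketch above:

```python
# Simple text-based chat loop over the RetrievalQA chain
while True:
    question = input("Ask about the case (or 'quit' to exit): ").strip()
    if question.lower() in {"quit", "exit"}:
        break
    result = qa_chain.invoke({"query": question})
    print("\nAnswer:", result["result"], "\n")
```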
Why This Approach Is Beneficial
RAG offers these advantages:
- Contextualized responses: Answers are based on the provided cyber forensics document.
- Interactive interface: User-friendly chat interaction.
- Efficiency: FAISS enables fast retrieval.
- Cloud-based execution: Google Colab provides a convenient environment.
- Hugging Face Integration: Simplifies embedding and text generation.
System Workflow Diagram
(Flowchart image included here)
Setup and Usage
- Create a Hugging Face Account (if needed): Go to https://huggingface.co/ and sign up.
- Generate a Hugging Face Access Token:
  - Log in to your Hugging Face account.
  - Go to your profile settings.
  - Find the "Access Tokens" section.
  - Create a new token.
  - Copy the generated token.
- Open a Google Colab Notebook.
- Install Python dependencies: Execute these commands in a Colab cell:
  ```
  !pip install transformers langchain langchain_community faiss-cpu huggingface_hub pypdf pymupdf -U langchain langchain-huggingface
  !pip install --upgrade langchain
  ```
- Provide Hugging Face API Token: Add a code cell to set the `HUGGINGFACEHUB_API_TOKEN` environment variable with your token:
  ```python
  import os
  os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'hf_your_token'  # Replace 'hf_your_token' with your actual token
  ```
- Provide your knowledge base: Add a cell to define `document_text` (the scenario).
- Run the code: Execute the cells to interact with the RAG system.
Features
- Google Colab Integration: Streamlined setup and execution in a cloud-based setting.
- Hugging Face Integration: Leverages pre-trained models for embedding and text generation.
- FAISS Vectorstore: Enables efficient and rapid similarity search.
- Text Chunking: Divides documents into manageable chunks for processing.
- Chat Interface: Offers a simple text-based interface for user interaction.
Contributing
Contributions are welcome! Submit issues or pull requests to improve the project.
License
This project is released under the MIT License.
