RAG-based Cyber Forensics Investigation Tool 
What is RAG?
Retrieval-Augmented Generation (RAG) enhances language model responses by combining information retrieval with text generation. It retrieves relevant information from a knowledge base and uses a language model to generate accurate, factual, and contextually appropriate answers. This enables language models to handle complex queries and access domain-specific knowledge effectively.
This project implements a RAG system to assist in cyber forensics investigations, leveraging LangChain, Hugging Face models, and FAISS for efficient retrieval and question answering over a provided knowledge base. The system processes a text-based scenario, divides it into manageable chunks, generates embeddings, stores them in a vector store, and employs a language model to answer user queries based on the retrieved information.
Video Demonstrations
For a visual demonstration of how this RAG system works, please refer to the following videos:
- RAG Fundamentals: https://youtu.be/T-D1OfcDW1M and https://youtu.be/W-ulb-DMtsM
- RAG Implementation: https://youtu.be/shiSITpK0ps
Technical Description
The system operates through these steps:
- Environment Setup: Installation of the necessary Python libraries, including `langchain`, `langchain-huggingface`, `faiss-cpu`, and `huggingface_hub`.
- Scenario Definition: The cyberpunk case study is defined as a string (`document_text`).
- Text Splitting: The scenario text is divided into chunks using `RecursiveCharacterTextSplitter`, controlled by parameters such as `chunk_size`, `chunk_overlap`, and `separators`.
- Embeddings: Text embeddings are generated using the Hugging Face Inference API.
- Vector Store: FAISS is used to store and retrieve text embeddings based on similarity.
- Retrieval QA Chain: LangChain's `RetrievalQA` chain combines the vector store with a language model. It retrieves relevant text chunks based on the user's query and generates an answer.
- Language Model: The Hugging Face Inference API with a specified model (`mistralai/Mistral-7B-Instruct-v0.1`) is used for response generation.
- Query Processing: The system receives user queries, retrieves relevant information from the vector store, and generates answers using the language model.
This setup enables the RAG system to answer questions related to the cyber forensics scenario.
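The text-splitting step can be sketched in plain Python. The function below is a simplified, hypothetical stand-in for `RecursiveCharacterTextSplitter` (which additionally tries to break on separators such as paragraphs and sentences rather than at fixed offsets); it is shown only to illustrate how `chunk_size` and `chunk_overlap` interact:

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    """Fixed-size splitter with overlap -- a simplified stand-in for
    LangChain's RecursiveCharacterTextSplitter, which also respects
    separator boundaries instead of cutting at fixed offsets."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# 250 characters of sample text -> 4 chunks, each overlapping
# the previous one by 20 characters.
doc = "".join(str(i % 10) for i in range(250))
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, which improves retrieval quality.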
Dependencies
To run this project, ensure you have the following:
- Python: 3.7+
- pip: Python package installer
- Hugging Face Account & Access Token: Required for Hugging Face models and the Inference API.
- Google Colab: To execute the notebook.
- Disk space: Google Colab's virtual environment provides the disk space needed for the dependencies.
How This Project Works
The project uses:
- Hugging Face Models for embedding and text generation.
- LangChain for language model applications.
- FAISS for efficient similarity search.
- Google Colab for running Python code.
The system workflow is:
- Load and split: Load and divide the cyber forensics document into chunks.
- Embed: Transform each chunk into a vector representation.
- Store: Store embeddings in a FAISS index.
- Query: Transform user's question into an embedding and search the FAISS index.
- Answer: Generate an answer using a language model based on retrieved information.
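The query-and-retrieve steps can be illustrated with a toy, dependency-free sketch. The bag-of-letters `embed` function below is a deliberately crude, hypothetical stand-in for a real Hugging Face embedding model, and the linear cosine scan mimics what the FAISS index does far more efficiently:

```python
import math

def embed(text):
    # Toy "embedding": a 26-dimensional letter-frequency vector.
    # A real system would call a Hugging Face embedding model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Detective Y investigated a ransomware attack.",
    "The victim was a robotics engineer named Z.",
    "The Serpent routed traffic through a remote server.",
]
# "Index" the chunks by precomputing their embeddings.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    # Embed the question, score it against every chunk, return the top k.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

In the real system, the retrieved chunks are then passed to the language model as context for answer generation.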
Code Overview
The code comprises:
- Document loading and processing using `RecursiveCharacterTextSplitter`.
- Embedding generation using the Hugging Face Inference API.
- FAISS vector store creation.
- `RetrievalQA` chain setup for question answering.
- A simple chat interface for user interaction.
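The chat interface can be sketched as a small input loop. The names `chat` and `answer_fn` below are illustrative, not the notebook's actual code; in the real system `answer_fn` would invoke the `RetrievalQA` chain:

```python
def chat(answer_fn, input_fn=input, output_fn=print):
    """Minimal text chat loop: read a question, print the answer,
    stop on 'exit' or 'quit'."""
    output_fn("Ask a question (type 'exit' to quit).")
    while True:
        query = input_fn("> ").strip()
        if query.lower() in {"exit", "quit"}:
            break
        if query:
            output_fn(answer_fn(query))

# Scripted demo instead of live stdin, so the loop can run unattended.
scripted = iter(["What type of cyberattack?", "exit"])
replies = []
chat(lambda q: f"(answer for: {q})",
     input_fn=lambda _: next(scripted),
     output_fn=replies.append)
```

Injecting `input_fn` and `output_fn` keeps the loop testable; in Colab the defaults (`input`/`print`) give an interactive session.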
Why This Approach Is Beneficial
RAG offers these advantages:
- Contextualized responses: Answers are grounded in the provided cyber forensics document.
- Interactive interface: User-friendly chat interaction.
- Efficiency: FAISS enables fast retrieval.
- Cloud-based execution: Google Colab provides a convenient environment.
- Hugging Face Integration: Simplifies embedding and text generation.
System Workflow Diagram
(Flowchart image included here)
Setup and Usage
- Create a Hugging Face Account (if needed): Go to https://huggingface.co/ and sign up.
- Generate a Hugging Face Access Token:
  - Log in to your Hugging Face account.
  - Go to your profile settings.
  - Find the "Access Tokens" section.
  - Create a new token.
  - Copy the generated token.
- Open a Google Colab Notebook.
- Install Python dependencies: Execute this command in a Colab cell:

      !pip install -U langchain langchain-core langchain-huggingface langchain_community faiss-cpu huggingface_hub

- Provide the Hugging Face API Token: Add a code cell that sets the `HUGGINGFACEHUB_API_TOKEN` environment variable:

      import os
      api_token = "ENTER THE API KEY"  # Replace 'ENTER THE API KEY' with your actual token
      os.environ["HUGGINGFACEHUB_API_TOKEN"] = api_token

- Provide Your Knowledge Base: Add a cell to define `scenario_text` (any passage of your choice).
- Run the Code: Execute the cells in order to interact with the RAG system.
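After the install step, an optional sanity check (not part of the original notebook) can confirm the packages imported correctly; note that the `faiss-cpu` wheel imports under the module name `faiss`:

```python
def check_deps(packages):
    """Return the subset of `packages` that cannot be imported."""
    missing = []
    for pkg in packages:
        try:
            __import__(pkg)
        except ImportError:
            missing.append(pkg)
    return missing

# Module names differ from pip names: faiss-cpu -> faiss,
# langchain-huggingface -> langchain_huggingface.
missing = check_deps(["langchain", "langchain_huggingface", "faiss", "huggingface_hub"])
if missing:
    print("Re-run the pip install cell; missing:", ", ".join(missing))
```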
Background Story Used
This project uses a futuristic cyberpunk scenario to simulate a cybercrime investigation. Detective Y investigates a complex ransomware attack on robotics engineer Z carried out by "The Serpent," who employs advanced techniques to encrypt and steal research data. This scenario serves as the knowledge base for the RAG system.
Story-based Questions
The RAG system answers questions based on the provided cyber forensics scenario. Examples:
In-Text Questions:
- What type of cyberattack did Detective Y investigate?
- What was the victim's profession?
- Where was the remote server located that led to the perpetrator's arrest?
Out-of-Text Questions (Answers not in the text):
- What specific encryption algorithm did The Serpent use?
- What was the name of the university where the security breach occurred?
- Did Detective Y's team collaborate with external experts?
Features
- Google Colab Integration: Streamlined setup and execution in a cloud-based setting.
- Hugging Face Integration: Leverages pre-trained models for embedding and text generation.
- FAISS Vectorstore: Enables efficient and rapid similarity search.
- Text Chunking: Divides documents into manageable chunks for processing.
- Chat Interface: Offers a simple text-based interface for user interaction.
Contributing
Contributions are welcome! Submit issues or pull requests to improve the project.
License
This project is released under the MIT License.
