From fd42fba72f8ae98c86687550ce41ac57973748c6 Mon Sep 17 00:00:00 2001
From: Frank Xu
Date: Sat, 29 Mar 2025 19:33:13 -0400
Subject: [PATCH] add RAG

---
 README.md                                     |   2 +
 lab03/README.md                               |   2 -
 lab03_RAG/README.md                           | 147 ++++++++++++++++++
 ...etrieval_Augmented_Generation_Simple.ipynb |   0
 4 files changed, 149 insertions(+), 2 deletions(-)
 delete mode 100644 lab03/README.md
 create mode 100644 lab03_RAG/README.md
 rename {lab03 => lab03_RAG}/Retrieval_Augmented_Generation_Simple.ipynb (100%)

diff --git a/README.md b/README.md
index 00fc92b..15d76d1 100644
--- a/README.md
+++ b/README.md
@@ -35,6 +35,8 @@ https://colab.research.google.com/github/frankwxu/AI4DigitalForensics/blob/main/
 
 - Lab 2: [Gun detection](https://colab.research.google.com/github/frankwxu/AI4DigitalForensics/blob/main/lab02_Gun_detection_fasterRCNN/gun_detection_fasterRCNN.ipynb)
 
+- Lab 3: [Retrieval-Augmented Generation](https://colab.research.google.com/github/frankwxu/AI4DigitalForensics/blob/main/lab03_RAG/Retrieval_Augmented_Generation_Simple.ipynb)
+
 - Lab 10: [Reinforcement Learning](https://colab.research.google.com/github/frankwxu/AI4DigitalForensics/blob/main/lab10_Reinforcement_Learning/dqn_lunar_lander_demo.ipynb)
 
 ## Contributing
diff --git a/lab03/README.md b/lab03/README.md
deleted file mode 100644
index 96a6a4a..0000000
--- a/lab03/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
- # AI for Digital Forensics Hand-on Labs
- # Lab X
diff --git a/lab03_RAG/README.md b/lab03_RAG/README.md
new file mode 100644
index 0000000..4f16ad9
--- /dev/null
+++ b/lab03_RAG/README.md
@@ -0,0 +1,147 @@
+# RAG-based Cyber Forensics Investigation Tool [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+## What is RAG?
+
+Retrieval-Augmented Generation (RAG) enhances language model responses by combining information retrieval with text generation. It retrieves relevant information from a knowledge base and passes it to a language model, which grounds its answer in the retrieved text. This lets the model handle complex queries and draw on domain-specific knowledge that was not part of its training data.
+
+This project implements a RAG system to assist in cyber forensics investigations, using LangChain, Hugging Face models, and FAISS for retrieval and question answering over a provided knowledge base. The system processes a text-based scenario, divides it into manageable chunks, generates embeddings, stores them in a vector store, and employs a language model to answer user queries based on the retrieved information.
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j1mgpfdpJp0s1eQn7JzJr3lhhB8tjXLn?usp=sharing)
+
+## Video Demonstrations
+
+For a visual demonstration of how this RAG system works, please refer to the following videos:
+
+- **RAG Fundamentals:** [https://youtu.be/T-D1OfcDW1M](https://youtu.be/T-D1OfcDW1M) and [https://youtu.be/W-ulb-DMtsM](https://youtu.be/W-ulb-DMtsM)
+- **RAG Implementation:** [https://youtu.be/shiSITpK0ps](https://youtu.be/shiSITpK0ps)
+
+## Technical Description
+
+The system operates through these steps:
+
+1. **Environment Setup:** Install the required Python libraries, including `langchain`, `langchain-huggingface`, `faiss-cpu`, and `huggingface_hub`.
+2. **Scenario Definition:** The cyberpunk case study is defined as a string (`document_text`).
+3. **Text Splitting:** The scenario text is divided into chunks using `RecursiveCharacterTextSplitter`, controlled by parameters like `chunk_size`, `chunk_overlap`, and `separators`.
+4. **Embeddings:** Text embeddings are generated using the Hugging Face Inference API.
+5. **Vector Store:** FAISS is used to store and retrieve text embeddings based on similarity.
+6. **Retrieval QA Chain:** LangChain's `RetrievalQA` chain combines the vector store with a language model. It retrieves relevant text chunks based on the user's query and generates an answer.
+7. **Language Model:** The Hugging Face Inference API with a specified model (`mistralai/Mistral-7B-Instruct-v0.1`) is used for response generation.
+8. **Query Processing:** The system receives user queries, retrieves relevant information from the vector store, and generates answers using the language model.
+
+This setup enables the RAG system to answer questions related to the cyber forensics scenario.
+
+## Dependencies
+
+To run this project, ensure you have the following:
+
+- **Python:** 3.7+
+- **pip:** Python package installer
+- **Hugging Face Account & Access Token:** Required for Hugging Face models and the Inference API.
+- **Google Colab:** To execute the notebook.
+
+**Disk space:** Dependencies are installed into Colab's hosted runtime, so no local disk space is needed.
+
+## How This Project Works
+
+The project uses:
+
+- Hugging Face models for embedding and text generation.
+- LangChain to connect the retriever and the language model.
+- FAISS for efficient similarity search.
+- Google Colab for running Python code.
+
+The system workflow is:
+
+1. **Load and split:** Load and divide the cyber forensics document into chunks.
+2. **Embed:** Transform each chunk into a vector representation.
+3. **Store:** Store embeddings in a FAISS index.
+4. **Query:** Transform the user's question into an embedding and search the FAISS index.
+5. **Answer:** Generate an answer using a language model based on retrieved information.
+
+## Code Overview
+
+The code comprises:
+
+- Document loading and processing using `RecursiveCharacterTextSplitter`.
+- Embedding generation using the Hugging Face Inference API.
+- FAISS vectorstore creation.
+- `RetrievalQA` chain setup for question answering.
+- A simple chat interface for user interaction.
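
The split, embed, store, query, and answer steps listed above can be illustrated with a small dependency-free sketch. This is not the notebook's code: `split_text`, `embed`, `ToyVectorStore`, and `build_prompt` are hypothetical names, a bag-of-words vector stands in for the Hugging Face embeddings, a sorted list stands in for the FAISS index, and the final prompt is what would be handed to the language model.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=200, chunk_overlap=40):
    """Fixed-size character chunks with overlap.

    A simplified stand-in for RecursiveCharacterTextSplitter, which also
    tries to break on separators such as paragraphs and sentences.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (chunk, vector) pairs and ranks them by similarity to a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True
        )
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Assemble the text a language model would receive in the answer step."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Short example text paraphrasing the scenario summary in this README.
document_text = (
    "Detective Y investigates a ransomware attack targeting robotics engineer Z. "
    "The attacker, known as The Serpent, encrypted and stole research data."
)
chunks = split_text(document_text, chunk_size=80, chunk_overlap=20)
store = ToyVectorStore(chunks)
question = "Who was targeted by the ransomware attack?"
print(build_prompt(question, store.search(question, k=1)))
```

Word-count vectors only match on shared vocabulary, so a question phrased with different words than the document may retrieve the wrong chunk; learned embeddings reduce that problem, which is one reason the project uses an embedding model rather than keyword overlap.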
+
+## Why This Approach Is Beneficial
+
+RAG offers these advantages:
+
+- **Contextualized responses:** Answers are grounded in the provided cyber forensics document.
+- **Interactive interface:** User-friendly chat interaction.
+- **Efficiency:** FAISS enables fast retrieval.
+- **Cloud-based execution:** Google Colab provides a convenient environment.
+- **Hugging Face Integration:** Simplifies embedding and text generation.
+
+## System Workflow Diagram
+
+_(Flowchart image included here)_
+
+![Flowchart](Colab_RAG.png)
+
+## Setup and Usage
+
+1. **Create a Hugging Face Account** (if needed): Go to [https://huggingface.co/](https://huggingface.co/) and sign up.
+2. **Generate a Hugging Face Access Token:**
+   - Log in to your Hugging Face account.
+   - Go to your profile settings.
+   - Find the "Access Tokens" section.
+   - Create a new token.
+   - Copy the generated token.
+3. **Open a Google Colab Notebook:** Use the Colab badge near the top of this README or create a new notebook.
+4. **Install Python dependencies:** Execute these commands in a Colab cell:
+
+   ```bash
+   !pip install -U langchain langchain-core langchain-huggingface langchain_community faiss-cpu huggingface_hub
+   !pip install --upgrade langchain
+   ```
+
+5. **Provide Hugging Face API Token:** Add a code cell that assigns your access token to the `api_token` variable:
+
+   ```python
+   api_token = "ENTER THE API KEY" # Replace 'ENTER THE API KEY' with your actual token
+   ```
+
+6. **Provide Your Knowledge Base:** Add a cell to define `scenario_text` (any passage of your choice).
+7. **Run the Code:** Execute the cells in order to interact with the RAG system.
+
+## Background Story Used
+
+This project uses a futuristic cyberpunk scenario to simulate a cybercrime investigation. Detective Y investigates a complex ransomware attack on robotics engineer Z carried out by "The Serpent," who employs advanced techniques to encrypt and steal research data. This scenario serves as the knowledge base for the RAG system.
+
+## Story-based Questions
+
+The RAG system answers questions based on the provided cyber forensics scenario. Examples:
+
+**In-Text Questions:**
+
+1. What type of cyberattack did Detective Y investigate?
+2. What was the victim's profession?
+3. Where was the remote server located that led to the perpetrator's arrest?
+
+**Out-of-Text Questions (Answers not in the text):**
+
+1. What specific encryption algorithm did The Serpent use?
+2. What was the name of the university where the security breach occurred?
+3. Did Detective Y's team collaborate with external experts?
+
+## Features
+
+- **Google Colab Integration:** Streamlined setup and execution in a cloud-based setting.
+- **Hugging Face Integration:** Leverages pre-trained models for embedding and text generation.
+- **FAISS Vectorstore:** Enables efficient and rapid similarity search.
+- **Text Chunking:** Divides documents into manageable chunks for processing.
+- **Chat Interface:** Offers a simple text-based interface for user interaction.
+
+## Contributing
+
+Contributions are welcome! Submit issues or pull requests to improve the project.
+
+## License
+
+This project is released under the MIT License.
diff --git a/lab03/Retrieval_Augmented_Generation_Simple.ipynb b/lab03_RAG/Retrieval_Augmented_Generation_Simple.ipynb
similarity index 100%
rename from lab03/Retrieval_Augmented_Generation_Simple.ipynb
rename to lab03_RAG/Retrieval_Augmented_Generation_Simple.ipynb