diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_evidence_entity_recognition.ipynb b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_evidence_entity_recognition.ipynb index 1411d39..ea2c905 100644 --- a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_evidence_entity_recognition.ipynb +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_evidence_entity_recognition.ipynb @@ -1,26 +1,63 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A tutorial to identify evidence entities from a cyber incident report\n", + "\n", + "The cyber incident report records a conversation between an IT Security Specialist and an Employee. The conversation describes an email phishing attack scenario.\n", + "\n", + "### Goal\n", + "- Become familiar with [DSPy: Declarative Self-improving Language Programs, pythonically](https://github.com/stanfordnlp/dspy). \n", + " - DSPy is a framework for algorithmically optimizing LM prompts and weights.\n", + " - The framework for programming—not prompting—foundation models\n", + "- Identify a list of evidence entities\n", + "- Identify a list of relationships between entities" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Download libraries and files for the lab\n", + "- Make sure you download the necessary libraries and files. 
\n", + "- All downloaded and saved files are located in the `content` folder when using Google Colab" + ] + }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ + "# uncomment the commands to download libraries and files\n", + "#!pip install python-dotenv\n", + "#!pip install dspy-ai\n", "#!pip install graphviz\n", + "# !wget https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/conversation.txt\n", "\n", "import dspy\n", "import os\n", "import openai\n", "import json\n", "from dotenv import load_dotenv\n", - "\n", - "from graphviz import Digraph\n", "from IPython.display import display" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Configure DSPy with OpenAI \n", + "- You `MUST` have an OpenAI API key\n", + "- Load the OpenAI API key from the `openai_api_key.txt` file\n", + "- or hard-code your OpenAI API key" + ] + }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -51,40 +88,22 @@ " dspy.settings.configure(lm=turbo)\n", " return turbo\n", "\n", + "# provide `openai_api_key.txt` with your OpenAI API key\n", "turbo=set_dspy()\n", - "# comment out set_dspy() and use set_dspy_hardcode_openai_key is your option\n", + "# optionally, hard-code your OpenAI API key at line 21 \n", "# turbo=set_dspy_hardcode_openai_key()" ] }, { - "cell_type": "code", - "execution_count": 14, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "def load_text_file(file_path):\n", - " \"\"\"\n", - " Load a text file and return its contents as a string.\n", - "\n", - " Parameters:\n", - " file_path (str): The path to the text file.\n", - "\n", - " Returns:\n", - " str: The contents of the text file.\n", - " \"\"\"\n", - " try:\n", - " with open(file_path, \"r\") as file:\n", - " contents = file.read()\n", - " return 
contents\n", - 
" except FileNotFoundError:\n", - " return \"File not found.\"\n", - " except Exception as e:\n", - " return f\"An error occurred: {e}\"\n" + "### Step 3: Load the cyber incident report (e.g., conversation)" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -134,30 +153,101 @@ } ], "source": [ - "conversation=load_text_file(\"conversation.txt\")\n", + "def load_text_file(file_path):\n", + " \"\"\"\n", + " Load a text file and return its contents as a string.\n", + "\n", + " Parameters:\n", + " file_path (str): The path to the text file.\n", + "\n", + " Returns:\n", + " str: The contents of the text file.\n", + " \"\"\"\n", + " try:\n", + " with open(file_path, \"r\") as file:\n", + " contents = file.read()\n", + " return contents\n", + " except FileNotFoundError:\n", + " return \"File not found.\"\n", + " except Exception as e:\n", + " return f\"An error occurred: {e}\"\n", + "\n", + "conversation = load_text_file(\"conversation.txt\")\n", "print(conversation)" ] }, { - "cell_type": "code", - "execution_count": 16, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "class EvidenceIdentifier(dspy.Signature):\n", - " \"\"\"Idenitfy evidence entities from a conversation between -Alex (IT Security Specialist) and Taylor (Employee).\"\"\"\n", + "### Step 4: Tell the LLM `WHAT` the inputs/outputs are by defining a DSPy Signature \n", "\n", - " question = dspy.InputField(\n", - " desc=\"a conversation between -Alex (IT Security Specialist) and Bob (Employee).\"\n", - " )\n", - " answer = dspy.OutputField(\n", - " desc=\"a list of evidence, inlcuding but not limited to emaile, IP address, URL, File name, timestamps, etc, in the conversation as a Python dictionary. 
For example, {evidence type: evidence value, ...}\"\n", - " )" + "- A signature is one of the basic building blocks in DSPy's prompt programming\n", + "- It is a declarative specification of input/output behavior of a DSPy module\n", + " - Think of it like a function signature\n", + "- It allows you to tell the LLM what it needs to do. \n", + " - You don't need to specify how to ask the LLM to do it.\n", + "- The following signature identifies a list of evidence based on the conversation\n", + " - Inherit from `dspy.Signature`\n", + " - Exactly `ONE` input, e.g., the conversation \n", + " - Exactly `ONE` output, e.g., a list of evidence entities\n", + "\n", + "### The following `EvidenceIdentifier` is equivalent to \n", + "\n", + "```\n", + "Identify evidence entities from a conversation ....\n", + "---\n", + "Follow the following format.\n", + "Question: a conversation between an IT Security Specialist and Employee\n", + "Answer: a list of evidence, including ...\n", + "---\n", + "Question: {a new unseen conversation}\n", + "Answer: write your answer here\n", + "```\n" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "class EvidenceIdentifier(dspy.Signature):\n", + " \"\"\"Identify evidence entities from a conversation between an IT Security Specialist and an Employee.\"\"\"\n", + "\n", + " question = dspy.InputField(\n", + " desc=\"a conversation between an IT Security Specialist and Employee.\"\n", + " )\n", + " answer = dspy.OutputField(\n", + " desc=\"a list of evidence, including but not limited to emails, IP addresses, URLs, File names, timestamps, etc, in the conversation as a Python dictionary. 
For example, {evidence type: evidence value, ...}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 5: Tell an LLM `HOW` to generate answer in a function: \n", + "\n", + "Generates and saves evidence from a conversation using a specified signature.\n", + "\n", + "#### Parameters:\n", + "- `signature` (dspy.Signature): The signature defining the input and output structure for evidence identification.\n", + "- `conversation` (str): The conversation text to analyze for evidence.\n", + "- `output_file` (str): The file path where the identified evidence will be saved as JSON.\n", + "\n", + "#### Returns:\n", + "None. The function saves the result to a file and prints a confirmation message.\n", + "\n", + "#### Notes:\n", + "- This function uses `dspy.Predict` to process the conversation and identify evidence.\n", + "- The result is saved as a formatted JSON file.\n", + "- The function prints the result to the console and saves it to the specified file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -172,17 +262,27 @@ " print(f\"The evidence has been saved to the file {output_file}\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 6: Execute above function and generate entities with three inputs\n", + "- Which signature: `EvidenceIdentifier`\n", + "- What input: conversation\n", + "- Where to save results: the name of output file" + ] + }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'Email From': 'support@banksecure.com', 'Email Subject': 'Urgent: Verify Your Account Now', 'IP Address': '192.168.10.45', 'Domain': 'banksecure.com', 'Actual Domain Registration': 'Russia', 'URL Clicked': 'http://banksecure-verification.com/login', 'URL Visited 1': 'http://banksecure-verification.com/login', 'URL Visited 2': 'http://banksecure-verification.com/account-details', 'File Downloaded': 'AccountDetails.exe', 'File Creation Time': '10:20 AM', 'MD5 Hash': 'e99a18c428cb38d5f260853678922e03', 'Network Logs Timestamp': '10:20 AM'}\n", - "The evidence has been saved to the file 01_output_evidence_entity.txt\n" + "{'Email': {'From': 'support@banksecure.com', 'Subject': 'Urgent: Verify Your Account Now', 'Content': 'strange email asking to verify account details urgently'}, 'IP Address': '192.168.10.45', 'Domain': 'banksecure.com', 'URLs': ['http://banksecure-verification.com/login', 'http://banksecure-verification.com/account-details'], 'File': {'Name': 'AccountDetails.exe', 'Creation Time': '10:20 AM', 'MD5 Hash': 'e99a18c428cb38d5f260853678922e03'}, 'Timestamps': {'Visited at 10:15 AM': 'http://banksecure-verification.com/login', 'Visited at 10:17 AM': 'http://banksecure-verification.com/account-details'}}\n", + "The evidence has been saved to the file 01_output_entity.txt\n" ] } ], @@ -194,9 +294,21 @@ ")" ] }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 7: Inspect the last prompt sent to the LLM\n", + "\n", + "You want to check:\n", + "- Prompt Description Section: Description in the signature\n", + "- Format Section: `Follow the following format.` \n", + "- Result Section: Question (scenario) and Answer (entities) section" + ] + }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -206,31 +318,39 @@ "\n", "\n", "\n", - "Idenitfy evidence entities from a conversation between -Alex (IT Security Specialist) and Taylor (Employee).\n", + "Identify evidence entities from a conversation between an IT Security Specialist and an Employee.\n", "\n", "---\n", "\n", "Follow the following format.\n", "\n", - "Question: a conversation between -Alex (IT Security Specialist) and Bob (Employee).\n", + "Question: a conversation between an IT Security Specialist and Employee.\n", "Answer: a list of evidence, inlcuding but not limited to emaile, IP address, URL, File name, timestamps, etc, in the conversation as a Python dictionary. For example, {evidence type: evidence value, ...}\n", "\n", "---\n", "\n", "Question: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It's actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. 
I didn't enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There's a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. 
I’ll follow these steps immediately.\n", "Answer: {\n", - " \"Email From\": \"support@banksecure.com\",\n", - " \"Email Subject\": \"Urgent: Verify Your Account Now\",\n", + " \"Email\": {\n", + " \"From\": \"support@banksecure.com\",\n", + " \"Subject\": \"Urgent: Verify Your Account Now\",\n", + " \"Content\": \"strange email asking to verify account details urgently\"\n", + " },\n", " \"IP Address\": \"192.168.10.45\",\n", " \"Domain\": \"banksecure.com\",\n", - " \"Actual Domain Registration\": \"Russia\",\n", - " \"URL Clicked\": \"http://banksecure-verification.com/login\",\n", - " \"URL Visited 1\": \"http://banksecure-verification.com/login\",\n", - " \"URL Visited 2\": \"http://banksecure-verification.com/account-details\",\n", - " \"File Downloaded\": \"AccountDetails.exe\",\n", - " \"File Creation Time\": \"10:20 AM\",\n", - " \"MD5 Hash\": \"e99a18c428cb38d5f260853678922e03\",\n", - " \"Network Logs Timestamp\": \"10:20 AM\"\n", + " \"URLs\": [\n", + " \"http://banksecure-verification.com/login\",\n", + " \"http://banksecure-verification.com/account-details\"\n", + " ],\n", + " \"File\": {\n", + " \"Name\": \"AccountDetails.exe\",\n", + " \"Creation Time\": \"10:20 AM\",\n", + " \"MD5 Hash\": \"e99a18c428cb38d5f260853678922e03\"\n", + " },\n", + " \"Timestamps\": {\n", + " \"Visited at 10:15 AM\": \"http://banksecure-verification.com/login\",\n", + " \"Visited at 10:17 AM\": \"http://banksecure-verification.com/account-details\"\n", + " }\n", "}\n", "\n", "\n", @@ -240,10 +360,10 @@ { "data": { "text/plain": [ - "'\\n\\n\\nIdenitfy evidence entities from a conversation between -Alex (IT Security Specialist) and Taylor (Employee).\\n\\n---\\n\\nFollow the following format.\\n\\nQuestion: a conversation between -Alex (IT Security Specialist) and Bob (Employee).\\nAnswer: a list of evidence, inlcuding but not limited to emaile, IP address, URL, File name, timestamps, etc, in the conversation as a Python dictionary. 
For example, {evidence type: evidence value, ...}\\n\\n---\\n\\nQuestion: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It\\'s actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn\\'t enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There\\'s a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. 
The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\\nAnswer:\\x1b[32m {\\n \"Email From\": \"support@banksecure.com\",\\n \"Email Subject\": \"Urgent: Verify Your Account Now\",\\n \"IP Address\": \"192.168.10.45\",\\n \"Domain\": \"banksecure.com\",\\n \"Actual Domain Registration\": \"Russia\",\\n \"URL Clicked\": \"http://banksecure-verification.com/login\",\\n \"URL Visited 1\": \"http://banksecure-verification.com/login\",\\n \"URL Visited 2\": \"http://banksecure-verification.com/account-details\",\\n \"File Downloaded\": \"AccountDetails.exe\",\\n \"File Creation Time\": \"10:20 AM\",\\n \"MD5 Hash\": \"e99a18c428cb38d5f260853678922e03\",\\n \"Network Logs Timestamp\": \"10:20 AM\"\\n}\\x1b[0m\\n\\n\\n'" + "'\\n\\n\\nIdentify evidence entities from a conversation between an IT Security Specialist and an Employee.\\n\\n---\\n\\nFollow the following format.\\n\\nQuestion: a conversation between an IT Security Specialist and Employee.\\nAnswer: a list of evidence, inlcuding but not limited to emaile, IP address, URL, File name, timestamps, etc, in the conversation as a Python dictionary. For example, {evidence type: evidence value, ...}\\n\\n---\\n\\nQuestion: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? 
I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It\\'s actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn\\'t enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There\\'s a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? 
Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\\nAnswer:\\x1b[32m {\\n \"Email\": {\\n \"From\": \"support@banksecure.com\",\\n \"Subject\": \"Urgent: Verify Your Account Now\",\\n \"Content\": \"strange email asking to verify account details urgently\"\\n },\\n \"IP Address\": \"192.168.10.45\",\\n \"Domain\": \"banksecure.com\",\\n \"URLs\": [\\n \"http://banksecure-verification.com/login\",\\n \"http://banksecure-verification.com/account-details\"\\n ],\\n \"File\": {\\n \"Name\": \"AccountDetails.exe\",\\n \"Creation Time\": \"10:20 AM\",\\n \"MD5 Hash\": \"e99a18c428cb38d5f260853678922e03\"\\n },\\n \"Timestamps\": {\\n \"Visited at 10:15 AM\": \"http://banksecure-verification.com/login\",\\n \"Visited at 10:17 AM\": \"http://banksecure-verification.com/account-details\"\\n }\\n}\\x1b[0m\\n\\n\\n'" ] }, - "execution_count": 19, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -252,9 +372,35 @@ "turbo.inspect_history(n=1)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A tutorial to identify `evidence relationship` from a cyber incident report\n", + "\n", + "The cyber incident report records a conversation between an IT Security Specialist and an Employee. 
The conversation describes an email phishing attack scenario.\n", + "\n", + "### Goal\n", + "- In addition to a list of evidence entities, we want to identify a list of `relationships` between entities" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Define a signature that identifies a list of `relationships` in the conversation\n", + "\n", + "It is important to note that:\n", + "- There is `ONE` input \n", + " - Cyber incident conversation\n", + "- There are `TWO` outputs:\n", + " - a list of entities\n", + " - a list of relationships" + ] + }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -274,9 +420,20 @@ " )" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: A function that can receive two outputs\n", + "\n", + "We have to revise the function `generate_answer()` so that it can receive two outputs. The following function `generate_answers` receives two outputs from an LLM (e.g., OpenAI):\n", + "- a list of entities\n", + "- a list of relationships" + ] + }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -306,9 +463,20 @@ " print(f\"The evidence has been saved to the file {output_file}\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 3: Execute the code to generate evidence and relations\n", + "- Input 1: Signature: `EvidenceRelationIdentifier`\n", + "- Input 2: a conversation\n", + "- Output 1: a file that saves entities and relations\n", + "- Output 2: a list of entities and relations" + ] + }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -319,7 +487,7 @@ " answer_relations='{\\n \"Email Header Analysis\": \"IP Address -> Domain\",\\n \"URL Analysis\": \"URL -> Domain\",\\n \"Browser History Analysis\": \"URL -> Timestamp\",\\n \"File Analysis\": \"File Name 
-> Timestamp, File Name -> MD5 Hash\",\\n \"Malware Analysis\": \"MD5 Hash -> Malware Database\"\\n}',\n", " answer_evidence='{\\n \"Email Sender\": \"support@banksecure.com\",\\n \"Email Subject\": \"Urgent: Verify Your Account Now\",\\n \"IP Address\": \"192.168.10.45\",\\n \"Domain\": \"banksecure.com\",\\n \"Domain Registration\": \"Russia\",\\n \"URL\": \"http://banksecure-verification.com/login\",\\n \"URL Registration Date\": \"Two days ago\",\\n \"File Name\": \"AccountDetails.exe\",\\n \"File Creation Timestamp\": \"10:20 AM\",\\n \"MD5 Hash\": \"e99a18c428cb38d5f260853678922e03\"\\n}'\n", ")\n", - "The evidence has been saved to the file 01_output_evidence_entity_relation.txt\n" + "The evidence has been saved to the file 01_output_entity_relation.txt\n" ] } ], @@ -338,7 +506,9 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "turbo.inspect_history(n=1)" + ] } ], "metadata": { diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_output_entity.txt b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_output_entity.txt index a35d365..2ca5de6 100644 --- a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_output_entity.txt +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/01_output_entity.txt @@ -1,14 +1,22 @@ { - "Email From": "support@banksecure.com", - "Email Subject": "Urgent: Verify Your Account Now", + "Email": { + "From": "support@banksecure.com", + "Subject": "Urgent: Verify Your Account Now", + "Content": "strange email asking to verify account details urgently" + }, "IP Address": "192.168.10.45", "Domain": "banksecure.com", - "Actual Domain Registration": "Russia", - "URL Clicked": "http://banksecure-verification.com/login", - "URL Visited 1": "http://banksecure-verification.com/login", - "URL Visited 2": "http://banksecure-verification.com/account-details", - "File Downloaded": "AccountDetails.exe", - "File Creation Time": "10:20 
AM", - "MD5 Hash": "e99a18c428cb38d5f260853678922e03", - "Network Logs Timestamp": "10:20 AM" + "URLs": [ + "http://banksecure-verification.com/login", + "http://banksecure-verification.com/account-details" + ], + "File": { + "Name": "AccountDetails.exe", + "Creation Time": "10:20 AM", + "MD5 Hash": "e99a18c428cb38d5f260853678922e03" + }, + "Timestamps": { + "Visited at 10:15 AM": "http://banksecure-verification.com/login", + "Visited at 10:17 AM": "http://banksecure-verification.com/account-details" + } } \ No newline at end of file diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_evidence_knowledge_dot_generator.ipynb b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_evidence_knowledge_dot_generator.ipynb index 1b84784..54768e3 100644 --- a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_evidence_knowledge_dot_generator.ipynb +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_evidence_knowledge_dot_generator.ipynb @@ -1,26 +1,64 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A tutorial to visualize forensic evidence and relationships\n", + "\n", + "### Motivation\n", + "- An evidence graph can enhance investigators' understanding of evidence entities and their relationships.\n", + "\n", + "### Goal\n", + "- Become familiar with Graph Visualization Software (Graphviz)\n", + " - open-source graph visualization software developed by AT&T Labs Research.\n", + "- Generate a graph directly from the conversation\n", + "- Gain visual insight into the incident using Graphviz. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Download libraries and files for the lab\n", + "- Make sure you download the necessary libraries and files. 
\n", + "- All downloaded and saved files are located in the `content` folder when using Google Colab" + ] + }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ + "# uncomment the commands to download libraries and files\n", + "#!pip install python-dotenv\n", + "#!pip install dspy-ai\n", "#!pip install graphviz\n", + "# !wget https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/conversation.txt\n", "\n", "import dspy\n", "import os\n", "import openai\n", "import json\n", "from dotenv import load_dotenv\n", - "\n", "from graphviz import Source\n", "from IPython.display import display" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Configure DSPy with OpenAI \n", + "- You `MUST` have an OpenAI API key\n", + "- Load the OpenAI API key from the `openai_api_key.txt` file\n", + "- or hard-code your OpenAI API key" + ] + }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -43,48 +81,31 @@ "\n", "\n", "def set_dspy_hardcode_openai_key():\n", - " os.environ[\"OPENAI_API_KEY\"] = (\n", - " \"sk-proj-yourapikeyhere\"\n", - " )\n", + " os.environ[\"OPENAI_API_KEY\"] = \"sk-proj-yourapikeyhere\"\n", " openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", - " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", + " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", " dspy.settings.configure(lm=turbo)\n", " return turbo\n", "\n", - "turbo=set_dspy()\n", - "# comment out set_dspy() and use set_dspy_hardcode_openai_key is your option\n", + "\n", + "# provide `openai_api_key.txt` with your OpenAI API key\n", + "turbo = set_dspy()\n", + "# optionally, hard-code your OpenAI API key at line 21\n", "# turbo=set_dspy_hardcode_openai_key()" ] }, { - "cell_type": "code", - 
"execution_count": 3, + 
"cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "def load_text_file(file_path):\n", - " \"\"\"\n", - " Load a text file and return its contents as a string.\n", + "### Step 3: Load the cyber incident report (e.g., conversation)\n", "\n", - " Parameters:\n", - " file_path (str): The path to the text file.\n", - "\n", - " Returns:\n", - " str: The contents of the text file.\n", - " \"\"\"\n", - " try:\n", - " with open(file_path, \"r\") as file:\n", - " contents = file.read()\n", - " return contents\n", - " except FileNotFoundError:\n", - " return \"File not found.\"\n", - " except Exception as e:\n", - " return f\"An error occurred: {e}\"\n" + "- This is the same conversation as in the previous tutorial" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -134,31 +155,87 @@ } ], "source": [ + "def load_text_file(file_path):\n", + " \"\"\"\n", + " Load a text file and return its contents as a string.\n", + "\n", + " Parameters:\n", + " file_path (str): The path to the text file.\n", + "\n", + " Returns:\n", + " str: The contents of the text file.\n", + " \"\"\"\n", + " try:\n", + " with open(file_path, \"r\") as file:\n", + " contents = file.read()\n", + " return contents\n", + " except FileNotFoundError:\n", + " return \"File not found.\"\n", + " except Exception as e:\n", + " return f\"An error occurred: {e}\"\n", + "\n", "conversation = load_text_file(\"conversation.txt\")\n", "print(conversation)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 4: Tell the LLM `WHAT` the inputs/outputs are by defining a DSPy Signature \n", + "\n", + "- Goal\n", + " - to generate a simple plaintext file in a format called `DOT`. In DOT, you define nodes and edges.\n", + " - Graphviz uses `DOT` to describe and visualize graphs. 
\n", + "\n", + "- The following signature identifies a list of evidence entities and relationships based on the conversation\n", + " - Inherit from `dspy.Signature`\n", + " - Exactly `ONE` input, e.g., the conversation \n", + " - Exactly `ONE` output, a DOT file" + ] + }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class DotGenerator(dspy.Signature):\n", - " \"\"\"Generate a evidence knowledge graph based on a conversation between an IT Security Specialist and an Employee. \"\"\"\n", + " \"\"\"Generate a foresnic evidence knowledge graph based on a conversation between an IT Security Specialist and an Employee. \"\"\"\n", "\n", " question: str = dspy.InputField(\n", " desc=\"a conversation describing a cyber incident between an IT Security Specialist and an employee.\"\n", " )\n", "\n", " answer: str = dspy.OutputField(\n", - " desc=\"a graph in a dot format. The nodes of the graph are evidence entities and the edges of the graph are the relationship between evidence entities. A dot format is primarily associated with Graphviz, a graph visualization software. For example, a dot should looks like: digraph incident_name {...}. Don't include `````` \"\n", + " desc=\"a graph in a dot format. The nodes of the graph are evidence entities and the edges of the graph are the relationships between evidence entities. A DOT format is primarily associated with Graphviz, a graph visualization software. For example, a DOT should looks like: digraph incident_name {...}. 
Don't include `````` \"\n", " )" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 5: Tell an LLM `HOW` to generate the answer: \n", + "\n", + "The following function generates and saves a DOT file from a conversation using a specified signature.\n", + "\n", + "#### Parameters:\n", + "- `signature` (dspy.Signature): The signature defining the input and output structure for evidence identification.\n", + "- `conversation` (str): The conversation text to analyze for evidence.\n", + "- `output_file` (str): The file path where the identified evidence will be saved in DOT format.\n", + "\n", + "#### Notes:\n", + "- This function uses [`dspy.ChainOfThought`](https://arxiv.org/pdf/2201.11903) to process the conversation and create a knowledge graph\n", + "- Other options include \n", + " - `dspy.ChainOfThoughtWithHint`: Provides hints for reasoning\n", + " - `dspy.Retrieve`: Retrieves passages from a retriever module\n", + " - `dspy.ReAct`: Consists of Thought, Action, and Observation steps.\n" + ] + }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -173,28 +250,38 @@ " print(f\"The evidence has been saved to the file {output_file}\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 6: Call the LLM to generate the graph in a `.DOT` file" + ] + }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "digraph cyber_incident {\n", - " \"Suspicious Email\" -> \"IP Address: 192.168.10.45\"\n", - " \"Suspicious Email\" -> \"Domain: banksecure.com (Registered to someone in Russia)\"\n", - " \"Suspicious Email\" -> \"URL: http://banksecure-verification.com/login\"\n", - " \"Suspicious Email\" -> \"URL: http://banksecure-verification.com/account-details\"\n", - " \"URL: http://banksecure-verification.com/login\" -> \"Domain: 
banksecure-verification.com (Registered 2 days ago)\"\n", 
- " \"URL: http://banksecure-verification.com/account-details\" -> \"Domain: banksecure-verification.com (Registered 2 days ago)\"\n", - " \"Browser History Entries\" -> \"Visited at 10:15 AM: http://banksecure-verification.com/login\"\n", - " \"Browser History Entries\" -> \"Visited at 10:17 AM: http://banksecure-verification.com/account-details\"\n", - " \"Downloaded File: AccountDetails.exe\" -> \"Created at 10:20 AM\"\n", - " \"Downloaded File: AccountDetails.exe\" -> \"MD5 Hash: e99a18c428cb38d5f260853678922e03 (Matched known malware)\"\n", - " \"MD5 Hash: e99a18c428cb38d5f260853678922e03 (Matched known malware)\" -> \"Quarantined File: AccountDetails.exe\"\n", - " \"IP Address: 192.168.10.45\" -> \"Network Logs Analysis around 10:20 AM\"\n", + "digraph phishing_incident {\n", + " \"Email from support@banksecure.com\" -> \"IP address 192.168.10.45\";\n", + " \"Email from support@banksecure.com\" -> \"Domain banksecure.com\";\n", + " \"Domain banksecure.com\" -> \"Registered to someone in Russia\";\n", + " \"URL http://banksecure-verification.com/login\" -> \"Domain registered two days ago\";\n", + " \"URL http://banksecure-verification.com/account-details\" -> \"Domain registered two days ago\";\n", + " \"Browser history entries\" -> \"Visited at 10:15 AM: http://banksecure-verification.com/login\";\n", + " \"Browser history entries\" -> \"Visited at 10:17 AM: http://banksecure-verification.com/account-details\";\n", + " \"Downloaded file AccountDetails.exe\" -> \"Created at 10:20 AM\";\n", + " \"Downloaded file AccountDetails.exe\" -> \"MD5 hash e99a18c428cb38d5f260853678922e03\";\n", + " \"MD5 hash e99a18c428cb38d5f260853678922e03\" -> \"Matches known malware in database\";\n", + " \"IP address 192.168.10.45\" -> \"Network connections established\";\n", + " \"Security measures\" -> \"Clear browser history and cache\";\n", + " \"Security measures\" -> \"Run full antivirus scan\";\n", + " \"Security measures\" -> \"Reset passwords from different device\";\n", 
+ " \"Security measures\" -> \"Enable two-factor authentication\";\n", "}\n" ] } @@ -208,9 +295,16 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 7: Render the graph based on the `.DOT` file " + ] + }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -221,173 +315,227 @@ " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n", [SVG markup of the rendered graph omitted: the diff replaces the old `cyber_incident` rendering with the `phishing_incident` graph, whose nodes and edges match the DOT output above] ], "text/plain": [ - "" + "" ] }, "metadata": {}, @@ -399,7 +547,7 @@ "'02_output_email_analysis.png'" ] }, - "execution_count": 11, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -413,6 +561,84 @@ "graph.render(\"02_output_email_analysis\", format=\"png\", cleanup=True)" ] }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### Step 8: Inspect the last prompt sent to the LLM\n", + "\n", + "You want to check:\n", + "- Prompt Description Section: Description in the signature\n", + "- Format Section: `Follow the following format.` \n", + " - Pay attention to the newly inserted field `Reasoning: Let's think step by step ...`\n", + "- Result Section: a forensic knowledge graph in `.DOT`" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "\n", + "Generate a foresnic evidence knowledge graph based on a conversation between an IT Security Specialist and an Employee.\n", + "\n", + "---\n", + "\n", + "Follow the following format.\n", + "\n", + "Question: a conversation describing a cyber incident between an IT Security Specialist and an employee.\n", + "Reasoning: Let's think step by step in order to ${produce the answer}. We ...\n", + "Answer: a graph in a dot format. The nodes of the graph are evidence entities and the edges of the graph are the relationships between evidence entities. A DOT format is primarily associated with Graphviz, a graph visualization software. For example, a DOT should looks like: digraph incident_name {...}. Don't include ``````\n", + "\n", + "---\n", + "\n", + "Question: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It's actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? 
Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn't enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There's a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\n", + "Reasoning: Let's think step by step in order to produce the answer. 
We will start by identifying the evidence entities mentioned in the conversation, such as email headers, IP addresses, domain registration information, URLs, browser history entries, cookies, downloaded files, MD5 hashes, malware database, network logs, and security measures like clearing browser history, running antivirus scans, hashing files, and resetting passwords. We will then establish the relationships between these evidence entities based on the conversation provided.\n", + "\n", + "Answer:\n", + "digraph phishing_incident {\n", + " \"Email from support@banksecure.com\" -> \"IP address 192.168.10.45\";\n", + " \"Email from support@banksecure.com\" -> \"Domain banksecure.com\";\n", + " \"Domain banksecure.com\" -> \"Registered to someone in Russia\";\n", + " \"URL http://banksecure-verification.com/login\" -> \"Domain registered two days ago\";\n", + " \"URL http://banksecure-verification.com/account-details\" -> \"Domain registered two days ago\";\n", + " \"Browser history entries\" -> \"Visited at 10:15 AM: http://banksecure-verification.com/login\";\n", + " \"Browser history entries\" -> \"Visited at 10:17 AM: http://banksecure-verification.com/account-details\";\n", + " \"Downloaded file AccountDetails.exe\" -> \"Created at 10:20 AM\";\n", + " \"Downloaded file AccountDetails.exe\" -> \"MD5 hash e99a18c428cb38d5f260853678922e03\";\n", + " \"MD5 hash e99a18c428cb38d5f260853678922e03\" -> \"Matches known malware in database\";\n", + " \"IP address 192.168.10.45\" -> \"Network connections established\";\n", + " \"Security measures\" -> \"Clear browser history and cache\";\n", + " \"Security measures\" -> \"Run full antivirus scan\";\n", + " \"Security measures\" -> \"Reset passwords from different device\";\n", + " \"Security measures\" -> \"Enable two-factor authentication\";\n", + "}\n", + "\n", + "\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "'\\n\\n\\nGenerate a foresnic evidence knowledge graph based on a conversation between an IT Security 
Specialist and an Employee.\\n\\n---\\n\\nFollow the following format.\\n\\nQuestion: a conversation describing a cyber incident between an IT Security Specialist and an employee.\\nReasoning: Let\\'s think step by step in order to ${produce the answer}. We ...\\nAnswer: a graph in a dot format. The nodes of the graph are evidence entities and the edges of the graph are the relationships between evidence entities. A DOT format is primarily associated with Graphviz, a graph visualization software. For example, a DOT should looks like: digraph incident_name {...}. Don\\'t include ``````\\n\\n---\\n\\nQuestion: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It\\'s actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn\\'t enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. 
Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There\\'s a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\\nReasoning: Let\\'s think step by step in order to\\x1b[32m produce the answer. We will start by identifying the evidence entities mentioned in the conversation, such as email headers, IP addresses, domain registration information, URLs, browser history entries, cookies, downloaded files, MD5 hashes, malware database, network logs, and security measures like clearing browser history, running antivirus scans, hashing files, and resetting passwords. 
We will then establish the relationships between these evidence entities based on the conversation provided.\\n\\nAnswer:\\ndigraph phishing_incident {\\n \"Email from support@banksecure.com\" -> \"IP address 192.168.10.45\";\\n \"Email from support@banksecure.com\" -> \"Domain banksecure.com\";\\n \"Domain banksecure.com\" -> \"Registered to someone in Russia\";\\n \"URL http://banksecure-verification.com/login\" -> \"Domain registered two days ago\";\\n \"URL http://banksecure-verification.com/account-details\" -> \"Domain registered two days ago\";\\n \"Browser history entries\" -> \"Visited at 10:15 AM: http://banksecure-verification.com/login\";\\n \"Browser history entries\" -> \"Visited at 10:17 AM: http://banksecure-verification.com/account-details\";\\n \"Downloaded file AccountDetails.exe\" -> \"Created at 10:20 AM\";\\n \"Downloaded file AccountDetails.exe\" -> \"MD5 hash e99a18c428cb38d5f260853678922e03\";\\n \"MD5 hash e99a18c428cb38d5f260853678922e03\" -> \"Matches known malware in database\";\\n \"IP address 192.168.10.45\" -> \"Network connections established\";\\n \"Security measures\" -> \"Clear browser history and cache\";\\n \"Security measures\" -> \"Run full antivirus scan\";\\n \"Security measures\" -> \"Reset passwords from different device\";\\n \"Security measures\" -> \"Enable two-factor authentication\";\\n}\\x1b[0m\\n\\n\\n'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "turbo.inspect_history(n=1)" + ] + }, { "cell_type": "code", "execution_count": null, diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output.dot b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output.dot index 2733ba7..b31b7f5 100644 --- a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output.dot +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output.dot @@ -1,5 +1,17 @@ -digraph file_not_found { - File 
[label="File" shape="rectangle" color="blue"] - NotFound [label="Not Found" shape="ellipse" color="red"] - File -> NotFound [label="Indicator"] +digraph phishing_incident { + "Email from support@banksecure.com" -> "IP address 192.168.10.45"; + "Email from support@banksecure.com" -> "Domain banksecure.com"; + "Domain banksecure.com" -> "Registered to someone in Russia"; + "URL http://banksecure-verification.com/login" -> "Domain registered two days ago"; + "URL http://banksecure-verification.com/account-details" -> "Domain registered two days ago"; + "Browser history entries" -> "Visited at 10:15 AM: http://banksecure-verification.com/login"; + "Browser history entries" -> "Visited at 10:17 AM: http://banksecure-verification.com/account-details"; + "Downloaded file AccountDetails.exe" -> "Created at 10:20 AM"; + "Downloaded file AccountDetails.exe" -> "MD5 hash e99a18c428cb38d5f260853678922e03"; + "MD5 hash e99a18c428cb38d5f260853678922e03" -> "Matches known malware in database"; + "IP address 192.168.10.45" -> "Network connections established"; + "Security measures" -> "Clear browser history and cache"; + "Security measures" -> "Run full antivirus scan"; + "Security measures" -> "Reset passwords from different device"; + "Security measures" -> "Enable two-factor authentication"; } \ No newline at end of file diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output_email_analysis.png b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output_email_analysis.png index d5fd79b..91c23ea 100644 Binary files a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output_email_analysis.png and b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/02_output_email_analysis.png differ diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/03_evidence_stix_zeroshot.ipynb b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/03_evidence_stix_zeroshot.ipynb index 
368ba1c..8bcef74 100644 --- a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/03_evidence_stix_zeroshot.ipynb +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/03_evidence_stix_zeroshot.ipynb @@ -1,12 +1,75 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A tutorial to generate evidence in a standard and structured format. \n", + "\n", + "### Benefits of using standardized formats for forensic evidence\n", + "- Consistency: easier to compare and analyze different pieces of evidence\n", + "- Interoperability: exchange of evidence across different systems and platforms\n", + "- Accuracy: reduces the risk of errors and omissions\n", + "- Automation: facilitates the use of automated tools and technologies, such as machine learning algorithms, for evidence analysis.\n", + "\n", + "### Solution: Structured Threat Information eXpression (STIX)\n", + "- Share information about cyber threats\n", + " - think of it as a common language that everyone in the cybersecurity community can use to communicate effectively\n", + " - helps organizations improve their threat intelligence capabilities\n", + "- Includes basic predefined objects that can be used as `digital forensics evidence`\n", + " - email, URL, identity, etc.\n", + "- Community support: maintained by the Organization for the Advancement of Structured Information Standards (OASIS)\n", + " - open source\n", + " - tools and library support\n", + "- Adaptability: flexible and can be extended to accommodate new types of threat information as the cybersecurity landscape evolves.\n", + "\n", + "### Example of `email-message` \n", + "```\n", + "in STIX\n", + " {\n", + " \"type\": \"email-message\",\n", + " \"id\": \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\",\n", + " \"is_multipart\": false,\n", + " \"subject\": \"Urgent Benefits Package Update\",\n", + " \"from_ref\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\n", + " \"body\": 
\"Please click the link to 
review the changes to your benefits package.\"\n", + " }\n", + "\n", + " vs.\n", + "without STIX\n", + "\n", + " \"Email\": {\n", + " \"From\": \"support@banksecure.com\",\n", + " \"Subject\": \"Urgent: Verify Your Account Now\",\n", + " \"Content\": \"strange email asking to verify account details urgently\"\n", + " }\n", + "```\n", + "\n", + "### Goal\n", + "- Capture threat information in STIX directly from the conversation\n", + "- Represent evidence entities and/or relationships as STIX objects" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Download libraries and files for the lab\n", + "- Make sure you download the necessary libraries and files. \n", + "- All downloaded and saved files can be located in the `content` folder if using Google Colab" + ] + }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ + "# uncomment the commands to download libraries and files\n", + "#!pip install python-dotenv\n", + "#!pip install dspy-ai\n", "#!pip install graphviz\n", + "# !wget https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/conversation.txt\n", "\n", "import dspy\n", "import os\n", @@ -16,9 +79,19 @@ "from IPython.display import display" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Configure DSPy with OpenAI \n", + "- You `MUST` have an OpenAI API key\n", + "- Load an OpenAI API key from the `openai_api_key.txt` file\n", + "- Or, hard-code your OpenAI API key" + ] + }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -41,48 +114,29 @@ "\n", "\n", "def set_dspy_hardcode_openai_key():\n", - " os.environ[\"OPENAI_API_KEY\"] = (\n", - " \"sk-proj-yourapikeyhere\"\n", - " )\n", + " os.environ[\"OPENAI_API_KEY\"] = \"sk-proj-yourapikeyhere\"\n", " openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", - 
" turbo 
= dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", + " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", " dspy.settings.configure(lm=turbo)\n", " return turbo\n", "\n", - "turbo=set_dspy()\n", - "# comment out set_dspy() and use set_dspy_hardcode_openai_key is your option\n", + "\n", + "# provide `openai_api_key.txt` with your openAI api key\n", + "turbo = set_dspy()\n", + "# optionally, hard code your openAI api key at line 21\n", "# turbo=set_dspy_hardcode_openai_key()" ] }, { - "cell_type": "code", - "execution_count": 3, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "def load_text_file(file_path):\n", - " \"\"\"\n", - " Load a text file and return its contents as a string.\n", - "\n", - " Parameters:\n", - " file_path (str): The path to the text file.\n", - "\n", - " Returns:\n", - " str: The contents of the text file.\n", - " \"\"\"\n", - " try:\n", - " with open(file_path, \"r\") as file:\n", - " contents = file.read()\n", - " return contents\n", - " except FileNotFoundError:\n", - " return \"File not found.\"\n", - " except Exception as e:\n", - " return f\"An error occurred: {e}\"\n" + "### Step 3: Load the cyber incident repot (e.g., conversation)" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -132,17 +186,54 @@ } ], "source": [ - "conversation=load_text_file(\"conversation.txt\")\n", + "def load_text_file(file_path):\n", + " \"\"\"\n", + " Load a text file and return its contents as a string.\n", + "\n", + " Parameters:\n", + " file_path (str): The path to the text file.\n", + "\n", + " Returns:\n", + " str: The contents of the text file.\n", + " \"\"\"\n", + " try:\n", + " with open(file_path, \"r\") as file:\n", + " contents = file.read()\n", + " return contents\n", + " except FileNotFoundError:\n", + " return \"File not found.\"\n", + " except Exception as e:\n", + " return f\"An error occurred: {e}\"\n", + 
"\n", + "\n", + "conversation = load_text_file(\"conversation.txt\")\n", "print(conversation)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 4: Tell an LLM `WHAT` the inputs/outputs are by defining a DSPy Signature \n", + "\n", + "- A signature is one of the basic building blocks in DSPy's prompt programming\n", + "- It is a declarative specification of input/output behavior of a DSPy module\n", + " - Think about a function signature\n", + "- It allows you to tell the LLM what it needs to do. \n", + " - You don't need to specify how to ask the LLM to do it.\n", + "- The following signature identifies a list of evidence based on the conversation\n", + " - Inherit from `dspy.Signature`\n", + " - Exactly `ONE` input, e.g., the conversation \n", + " - Exactly `ONE` output, e.g., cyber threat information in JSON" + ] + }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "class SITXGenerator(dspy.Signature):\n", + "class STIXGenerator(dspy.Signature):\n", " \"\"\"Describe a conversation in STIX, which stands for Structured Threat Information eXpression, is a standardized language for representing cyber threat information.\"\"\"\n", "\n", " question: str = dspy.InputField(\n", @@ -154,9 +245,26 @@ " )" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 5: Tell an LLM `HOW` to generate the answer: \n", + "\n", + "The following function generates and saves threat information from a conversation using a specified signature.\n", + "\n", + "#### Parameters:\n", + "- `signature` (dspy.Signature): The signature defining the input and output structure for evidence identification.\n", + "- `conversation` (str): The conversation text to analyze for threat information.\n", + "- `output_file` (str): The file path where the identified threat information will be saved as JSON.\n", + "\n", + "#### Returns:\n", + "None. 
The function saves the result to a file and prints a confirmation message." + ] + }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -171,9 +279,16 @@ " print(f\"The evidence has been saved to the file {output_file}\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 6: Generate entities using `STIXGenerator`" + ] + }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -228,15 +343,28 @@ "source": [ "output_file = \"03_output.json\"\n", "generate_answer_CoT(\n", - " SITXGenerator,\n", + " STIXGenerator,\n", " conversation,\n", " output_file,\n", ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 7: Inspect the last prompt sent to the LLM\n", + "\n", + "You want to check:\n", + "- Prompt Description Section: Description in the signature\n", + "- Format Section: `Follow the following format.` \n", + " - Pay attention to the newly inserted field `Reasoning: Let's think step by step ...`\n", + "- Result Section: the threat information in JSON" + ] + }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -314,7 +442,7 @@ "'\\n\\n\\nDescribe a conversation in STIX, which stands for Structured Threat Information eXpression, is a standardized language for representing cyber threat information.\\n\\n---\\n\\nFollow the following format.\\n\\nQuestion: a conversation describing a cyber incident between an IT Security Specialist and an employee.\\nReasoning: Let\\'s think step by step in order to ${produce the answer}. We ...\\nAnswer: the formalized STIX in JSON representing cyber threat information based on the conversation, e.g., [{object 1}, {object 2}, ... {object n}]\\n\\n---\\n\\nQuestion: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. 
The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It\\'s actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn\\'t enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There\\'s a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. 
I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\\nReasoning: Let\\'s think step by step in order to\\x1b[32m produce the answer. We need to formalize the conversation into STIX objects that represent the cyber threat information discussed between Alice and Bob.\\n\\nAnswer: \\n[\\n {\\n \"email\": {\\n \"sender\": \"support@banksecure.com\",\\n \"subject\": \"Urgent: Verify Your Account Now\",\\n \"headers\": {\\n \"IP_address\": \"192.168.10.45\",\\n \"domain\": \"banksecure.com\",\\n \"registered_to\": \"Russia\"\\n },\\n \"links_clicked\": [\\n {\\n \"URL\": \"http://banksecure-verification.com/login\",\\n \"timestamp\": \"10:15 AM\"\\n },\\n {\\n \"URL\": \"http://banksecure-verification.com/account-details\",\\n \"timestamp\": \"10:17 AM\"\\n }\\n ],\\n \"attachments\": [\\n {\\n \"file_name\": \"AccountDetails.exe\",\\n \"created_at\": \"10:20 AM\",\\n \"MD5_hash\": \"e99a18c428cb38d5f260853678922e03\",\\n \"status\": \"known_malware\"\\n }\\n ]\\n }\\n },\\n {\\n \"actions_taken\": [\\n \"Clear browser history and cache\",\\n \"Run full antivirus scan\",\\n \"Provide browser history entries and cookies\",\\n \"Quarantine suspicious file\",\\n \"Check network connections\",\\n \"Reset passwords and enable two-factor authentication\"\\n ]\\n }\\n]\\x1b[0m\\n\\n\\n'" ] }, - "execution_count": 8, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_evidence_stix_oneshot.ipynb b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_evidence_stix_oneshot.ipynb index 6ca2fe8..71e8d5c 100644 --- 
a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_evidence_stix_oneshot.ipynb +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_evidence_stix_oneshot.ipynb @@ -1,26 +1,65 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A tutorial on applying the one-shot fine-tuning technique in digital forensics\n", + "\n", + "### Motivation\n", + "- The generated evidence graph (which consists of evidence and their relations) doesn't follow STIX. \n", + "\n", + "### Solution: One-shot learning\n", + "\n", + "- Provide one training example to LLMs\n", + "- LLMs often produce more accurate results by learning from the example \n", + "\n", + "### Implementation\n", + "- Add one-shot example as the `context` of answer (e.g., conversation)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Download libraries and files for the lab\n", + "- Make sure you download the necessary libraries and files. \n", + "- All downloaded and saved files can be located in the `content` folder if using Google Colab" + ] + }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ + "# uncomment the commands to download libraries and files\n", + "#!pip install python-dotenv\n", + "#!pip install dspy-ai\n", "#!pip install graphviz\n", + "# !wget https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/conversation.txt\n", "\n", "import dspy\n", "import os\n", "import openai\n", "import json\n", "from dotenv import load_dotenv\n", - "\n", - "from graphviz import Digraph\n", "from IPython.display import display" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Configure DSPy with OpenAI \n", + "- You `MUST` have an OpenAI API key\n", + "- load an OpenAI API key from the `openai_api_key.txt` file\n", + "- or, hard code your OpenAI API key" + ] + }, {
"cell_type": "code", - "execution_count": 2, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -36,55 +75,36 @@ " # Set the API key as an environment variable\n", " os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n", " openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", - " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", max_tokens=3000, temperature=0.5)\n", + " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", max_tokens=2000, temperature=0.5)\n", " dspy.settings.configure(lm=turbo)\n", " return turbo\n", " # ==============end of set openAI enviroment=========\n", "\n", "\n", "def set_dspy_hardcode_openai_key():\n", - " os.environ[\"OPENAI_API_KEY\"] = (\n", - " \"sk-proj-yourapikeyhere\"\n", - " )\n", + " os.environ[\"OPENAI_API_KEY\"] = \"sk-proj-yourapikeyhere\"\n", " openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", - " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", + " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", " dspy.settings.configure(lm=turbo)\n", " return turbo\n", "\n", - "turbo=set_dspy()\n", - "# comment out set_dspy() and use set_dspy_hardcode_openai_key is your option\n", + "\n", + "# provide `openai_api_key.txt` with your openAI api key\n", + "turbo = set_dspy()\n", + "# optionally, hard code your openAI api key at line 21\n", "# turbo=set_dspy_hardcode_openai_key()" ] }, { - "cell_type": "code", - "execution_count": 3, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "def load_text_file(file_path):\n", - " \"\"\"\n", - " Load a text file and return its contents as a string.\n", - "\n", - " Parameters:\n", - " file_path (str): The path to the text file.\n", - "\n", - " Returns:\n", - " str: The contents of the text file.\n", - " \"\"\"\n", - " try:\n", - " with open(file_path, \"r\") as file:\n", - " contents = file.read()\n", - " return contents\n", - " except FileNotFoundError:\n", - " return \"File not found.\"\n", - " except Exception 
as e:\n", - " return f\"An error occurred: {e}\"\n" + "### Step 3: Load the cyber incident report (e.g., conversation)" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -134,13 +154,55 @@ } ], "source": [ - "conversation=load_text_file(\"conversation.txt\")\n", + "def load_text_file(file_path):\n", + " \"\"\"\n", + " Load a text file and return its contents as a string.\n", + "\n", + " Parameters:\n", + " file_path (str): The path to the text file.\n", + "\n", + " Returns:\n", + " str: The contents of the text file.\n", + " \"\"\"\n", + " try:\n", + " with open(file_path, \"r\") as file:\n", + " contents = file.read()\n", + " return contents\n", + " except FileNotFoundError:\n", + " return \"File not found.\"\n", + " except Exception as e:\n", + " return f\"An error occurred: {e}\"\n", + "\n", + "\n", + "conversation = load_text_file(\"conversation.txt\")\n", "print(conversation)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 4: Define a structure for one-shot examples understood by `DSPy`.\n", + "\n", + "#### DSPy accepts training data in a certain format \n", + "- For instance, if you were working on a question-answering system\n", + "```\n", + "example = dspy.Example(\n", + " question=\"What is the capital of France?\",\n", + " answer=\"The capital of France is Paris.\"\n", + ").with_inputs(\"question\")\n", + "```\n", + "- `.with_inputs(\"question\")`: tells the DSPy framework that the \"question\" field should be treated as an input when using this example\n", + "\n", + "#### Key components in our one-shot example\n", + "- one-shot example: a similar conversation with correct evidence and relations in STIX\n", + "- question: a conversation describing the cyber incident scenario\n", + "- answer: the enhanced evidence entities and relations in STIX because of the one-shot learning" + ] + }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 9,
"metadata": {}, "outputs": [], "source": [ @@ -253,9 +315,41 @@ ").with_inputs(\"question\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 5: Load the one-shot example using a retriever \n", + "\n", + "#### What is a retriever in `DSPy`?\n", + "- designed to fetch relevant information or documents from a larger corpus or database based on a given query \n", + "- often based on vector representations of text\n", + "- use cases: Question-answer systems, chatbots, or any application\n", + " - where relevant information needs to be fetched from a large dataset to inform further processing or responses \n", + "#### The retriever in this example\n", + "- enhances accuracy for forensic evidence analysis\n", + " - identifies evidence entities and relationships that comply with STIX\n", + "- hard-coded to return just one example for one-shot learning\n", + "- can be improved to retrieve more or related examples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 6: Implement `OneShotRetriever`\n", + "The main retrieval method. 
It returns a formatted string containing the predefined example, regardless of the input query.\n", + "\n", + "- Parameters:\n", + " - query: The input `query` (currently not used in the retrieval process).\n", + "- Returns: A formatted string containing:\n", + " - The example scenario (from self.example.question)\n", + " - The corresponding STIX JSON (from self.example.answer)" + ] + }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -273,17 +367,24 @@ " return one_example" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 7: Implement `STIXGeneratorSig`" + ] + }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ - "class SITXGeneratorSig(dspy.Signature):\n", + "class STIXGeneratorSig(dspy.Signature):\n", " \"\"\"Describe a conversation in STIX, which stands for Structured Threat Information eXpression, is a standardized language for representing cyber threat information.\"\"\"\n", "\n", " # Make sure to define context here, otherwise, one-short learning won't work\n", - " context = dspy.InputField(desc=\"one example, which contain a scenario and the coreposing STIX in JSON\")\n", + " context = dspy.InputField(desc=\"contain a scenario and the coreposing STIX in JSON\")\n", "\n", " question: str = dspy.InputField(\n", " desc=\"a conversation describing a cyber incident between an IT Security Specialist and an employee.\"\n", @@ -294,17 +395,36 @@ " )" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 7: Implement the `STIXGenCoT` module\n", + "\n", + "\n", + "#### `dspy.Module` \n", + "- implements your business logic\n", + "- can include multiple submodules\n", + "- is syntactically similar to PyTorch\n", + " - `__init__()`: declares the used submodules.\n", + " - `forward()`: describes the control flow among the defined submodules.\n", + "\n", + "![Diagram of forward 
modules](04_forward_module.svg)\n", + "\n", + "\n" + ] + }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ - "class STXIGenCoT(dspy.Module):\n", + "class STIXGenCoT(dspy.Module):\n", " def __init__(self, example):\n", " super().__init__()\n", " self.retriever = OneShotRetriever(example)\n", - " self.predictor = dspy.ChainOfThought(SITXGeneratorSig)\n", + " self.predictor = dspy.ChainOfThought(STIXGeneratorSig)\n", "\n", " def forward(self, question):\n", " context = self.retriever(question)\n", @@ -314,19 +434,29 @@ " # last_interaction = turbo.inspect_history(n=1)\n", " # print(\"Last interaction:\")\n", " # print(last_interaction)\n", - " \n", + "\n", " return results" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 8: Tell an LLM `HOW` to generate the answer in a function: \n", + "\n", + "- `HOW`: defined in `STIXGenCoT`\n", + "- save the output in JSON" + ] + }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def generate_answer(conversation, output_file):\n", " # Create an instance of your module with the one-shot example\n", - " my_module = STXIGenCoT(example)\n", + " my_module = STIXGenCoT(example)\n", "\n", " # Use your module with a new input\n", " answer = my_module(question=conversation).answer\n", @@ -338,9 +468,16 @@ " print(f\"The results have been saved to the file {output_file}\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 9: Execute the function above with an input and output" + ] + }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -443,9 +580,21 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 10: Inspect the last prompt sent to the LLM\n", + "\n", + "You want to check:\n", + "- Prompt Description Section: Description in the signature\n", + "- Format 
Section: `Follow the following format.` \n", + "- Context: Example scenario: " + ] + }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 15, "metadata": {}, "outputs": [ { @@ -463,7 +612,7 @@ "\n", "Question: a conversation describing a cyber incident between an IT Security Specialist and an employee.\n", "\n", - "Context: one example, which contain a scenario and the coreposing STIX in JSON\n", + "Context: contain a scenario and the coreposing STIX in JSON\n", "\n", "Reasoning: Let's think step by step in order to ${produce the answer}. We ...\n", "\n", @@ -581,9 +730,10 @@ "]\n", "\n", "\n", - "Reasoning: Let's think step by step in order to produce the answer. We need to identify the key elements in the conversation, such as the email address, suspicious URLs, actions taken by the employee, and recommendations provided by the IT Security Specialist. By breaking down the conversation into these components, we can create corresponding STIX objects in JSON format to represent the cyber threat information.\n", + "Reasoning: Let's think step by step in order to produce the answer. We need to identify the key elements of the conversation, such as the email address, URL, user account, and indicators of a phishing attack. 
By converting these elements into STIX objects in JSON format, we can represent the cyber threat information in a structured and standardized way.\n", "\n", - "Answer: [\n", + "Answer: \n", + "[\n", " {\n", " \"type\": \"identity\",\n", " \"id\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\n", @@ -675,10 +825,10 @@ { "data": { "text/plain": [ - "'\\n\\n\\nDescribe a conversation in STIX, which stands for Structured Threat Information eXpression, is a standardized language for representing cyber threat information.\\n\\n---\\n\\nFollow the following format.\\n\\nQuestion: a conversation describing a cyber incident between an IT Security Specialist and an employee.\\n\\nContext: one example, which contain a scenario and the coreposing STIX in JSON\\n\\nReasoning: Let\\'s think step by step in order to ${produce the answer}. We ...\\n\\nAnswer: the formalized STIX in JSON representing cyber threat information based on the conversation, e.g., [{object 1}, {object 2}, ... {object n}]\\n\\n---\\n\\nQuestion: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It\\'s actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn\\'t enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. 
This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There\\'s a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\\n\\nContext:\\nExample scenairo: \\n Taylor: Hey Alex, I think I might have clicked on a suspicious link in an email.\\n Alex: Oh no, Taylor. Can you describe what happened?\\n Taylor: I got an email from what looked like our HR department. 
It said there was an urgent update to our benefits package, and I needed to click a link to review the changes.\\n Alex: Did the email address seem legitimate?\\n Taylor: At first glance, yes, but now that I think about it, the domain was slightly different. It was hr-dept@ourcompany-security.com instead of @ourcompany.com.\\n Alex: That sounds like phishing. What happened after you clicked the link?\\n Taylor: It took me to a login page that looked just like our internal portal. I entered my username and password.\\n Alex: Did you notice anything unusual after entering your credentials?\\n Taylor: Not immediately, but a few minutes later, I got an alert that someone attempted to log into my account from a different location.\\n Alex: Okay, this sounds serious. I need you to change your password immediately and enable two-factor authentication if you haven\\'t already.\\n Taylor: Done. What should we do next?\\n Alex: I\\'ll start by examining the email headers to trace the origin. Also, I need to check the link you clicked on to understand its structure and where it leads.\\n Taylor: Alright, I’ll forward you the email.\\n Alex: Thanks. I’ll also run a network scan to see if any other devices might have been compromised.\\n Taylor: Should I inform the rest of the team?\\n Alex: Yes, let them know about the phishing attempt and advise them to be cautious. I’ll send an official email with detailed instructions.\\n Taylor: Got it. Thanks, Alex. Is there anything else I should do?\\n Alex: Just keep an eye out for any unusual activities in your accounts. I’ll handle the technical investigation and follow up with you if I need more information.\\n Taylor: Will do. Thanks again.\\n Alex: No problem. 
Stay safe online.\\n Example generated STIX in JSON based on the scenairo: [\\n {\\n \"type\": \"identity\",\\n \"id\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"OurCompany\",\\n \"identity_class\": \"organization\",\\n \"sectors\": [\"technology\"],\\n \"contact_information\": \"info@ourcompany.com\"\\n },\\n {\\n \"type\": \"email-addr\",\\n \"id\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"email-message\",\\n \"id\": \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\",\\n \"is_multipart\": false,\\n \"subject\": \"Urgent Benefits Package Update\",\\n \"from_ref\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"body\": \"Please click the link to review the changes to your benefits package.\"\\n },\\n {\\n \"type\": \"url\",\\n \"id\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\",\\n \"value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"user-account\",\\n \"id\": \"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\",\\n \"user_id\": \"Taylor\",\\n \"account_login\": \"taylor@ourcompany.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--001\",\\n \"observable_type\": \"email\",\\n \"observable_value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--002\",\\n \"observable_type\": \"url\",\\n \"observable_value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"indicator\",\\n \"id\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"Phishing Email Indicator\",\\n \"pattern\": \"[email-message:subject = \\'Urgent Benefits Package Update\\']\",\\n \"valid_from\": \"2024-07-17T00:00:00Z\"\\n },\\n {\\n \"type\": \"incident\",\\n \"id\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"name\": \"Phishing Attack on OurCompany\",\\n \"description\": \"A phishing attack where a suspicious email was sent to an employee of 
OurCompany.\",\\n \"first_seen\": \"2024-07-17T08:00:00Z\",\\n \"last_seen\": \"2024-07-17T08:10:00Z\",\\n \"status\": \"ongoing\",\\n \"affected_assets\": [\"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\"]\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d\",\\n \"relationship_type\": \"indicates\",\\n \"source_ref\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"target_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed\",\\n \"relationship_type\": \"attributed-to\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051\",\\n \"relationship_type\": \"uses\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\"\\n }\\n]\\n\\n\\nReasoning: Let\\'s think step by step in order to\\x1b[32m produce the answer. We need to identify the key elements in the conversation, such as the email address, suspicious URLs, actions taken by the employee, and recommendations provided by the IT Security Specialist. 
By breaking down the conversation into these components, we can create corresponding STIX objects in JSON format to represent the cyber threat information.\\n\\nAnswer: [\\n {\\n \"type\": \"identity\",\\n \"id\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"OurCompany\",\\n \"identity_class\": \"organization\",\\n \"sectors\": [\"technology\"],\\n \"contact_information\": \"info@ourcompany.com\"\\n },\\n {\\n \"type\": \"email-addr\",\\n \"id\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"email-message\",\\n \"id\": \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\",\\n \"is_multipart\": false,\\n \"subject\": \"Urgent Benefits Package Update\",\\n \"from_ref\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"body\": \"Please click the link to review the changes to your benefits package.\"\\n },\\n {\\n \"type\": \"url\",\\n \"id\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\",\\n \"value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"user-account\",\\n \"id\": \"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\",\\n \"user_id\": \"Taylor\",\\n \"account_login\": \"taylor@ourcompany.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--001\",\\n \"observable_type\": \"email\",\\n \"observable_value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--002\",\\n \"observable_type\": \"url\",\\n \"observable_value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"indicator\",\\n \"id\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"Phishing Email Indicator\",\\n \"pattern\": \"[email-message:subject = \\'Urgent Benefits Package Update\\']\",\\n \"valid_from\": \"2024-07-17T00:00:00Z\"\\n },\\n {\\n \"type\": \"incident\",\\n \"id\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"name\": \"Phishing Attack on 
OurCompany\",\\n \"description\": \"A phishing attack where a suspicious email was sent to an employee of OurCompany.\",\\n \"first_seen\": \"2024-07-17T08:00:00Z\",\\n \"last_seen\": \"2024-07-17T08:10:00Z\",\\n \"status\": \"ongoing\",\\n \"affected_assets\": [\"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\"]\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d\",\\n \"relationship_type\": \"indicates\",\\n \"source_ref\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"target_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed\",\\n \"relationship_type\": \"attributed-to\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051\",\\n \"relationship_type\": \"uses\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\"\\n }\\n]\\x1b[0m\\n\\n\\n'" + "'\\n\\n\\nDescribe a conversation in STIX, which stands for Structured Threat Information eXpression, is a standardized language for representing cyber threat information.\\n\\n---\\n\\nFollow the following format.\\n\\nQuestion: a conversation describing a cyber incident between an IT Security Specialist and an employee.\\n\\nContext: contain a scenario and the coreposing STIX in JSON\\n\\nReasoning: Let\\'s think step by step in order to ${produce the answer}. We ...\\n\\nAnswer: the formalized STIX in JSON representing cyber threat information based on the conversation, e.g., [{object 1}, {object 2}, ... {object n}]\\n\\n---\\n\\nQuestion: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. 
The subject line was \"Urgent: Verify Your Account Now\". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It\\'s actually registered to someone in Russia. Alice: That’s definitely not right. Should I be worried? Bob: We should investigate further. Did you click on any links or download any attachments? Alice: I did click on a link that took me to a page asking for my login credentials. I didn\\'t enter anything though. The URL was http://banksecure-verification.com/login. Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s highly likely it’s a phishing site. Alice: What should I do next? Bob: First, clear your browser history and cache. Also, run a full antivirus scan on your computer. Can you also provide me with any browser history entries and cookies from that session? Alice: I’ve cleared the history and started the antivirus scan. Here are the relevant entries from my browser history: Visited at 10:15 AM: http://banksecure-verification.com/login Visited at 10:17 AM: http://banksecure-verification.com/account-details Bob: Thanks. I’ll analyze these URLs further. Also, check if there are any suspicious files downloaded or present in your downloads folder. Look for anything unusual. Alice: There\\'s a file named \"AccountDetails.exe\" that I don’t remember downloading. It was created at 10:20 AM. Bob: Definitely suspicious. Don’t open it. Let’s hash the file to verify its integrity. Can you run an MD5 hash on it? Alice: Done. The MD5 hash is e99a18c428cb38d5f260853678922e03. Bob: This hash matches known malware in our database. We’ll need to quarantine it and check if it has established any network connections. 
I’ll look into our network logs for the IP 192.168.10.45 around 10:20 AM. Alice: Is there anything else I need to do? Bob: For now, avoid using your computer for sensitive tasks. We’ll also reset your passwords from a different device and enable two-factor authentication on your accounts. Alice: Thanks, Bob. I’ll follow these steps immediately.\\n\\nContext:\\nExample scenairo: \\n Taylor: Hey Alex, I think I might have clicked on a suspicious link in an email.\\n Alex: Oh no, Taylor. Can you describe what happened?\\n Taylor: I got an email from what looked like our HR department. It said there was an urgent update to our benefits package, and I needed to click a link to review the changes.\\n Alex: Did the email address seem legitimate?\\n Taylor: At first glance, yes, but now that I think about it, the domain was slightly different. It was hr-dept@ourcompany-security.com instead of @ourcompany.com.\\n Alex: That sounds like phishing. What happened after you clicked the link?\\n Taylor: It took me to a login page that looked just like our internal portal. I entered my username and password.\\n Alex: Did you notice anything unusual after entering your credentials?\\n Taylor: Not immediately, but a few minutes later, I got an alert that someone attempted to log into my account from a different location.\\n Alex: Okay, this sounds serious. I need you to change your password immediately and enable two-factor authentication if you haven\\'t already.\\n Taylor: Done. What should we do next?\\n Alex: I\\'ll start by examining the email headers to trace the origin. Also, I need to check the link you clicked on to understand its structure and where it leads.\\n Taylor: Alright, I’ll forward you the email.\\n Alex: Thanks. I’ll also run a network scan to see if any other devices might have been compromised.\\n Taylor: Should I inform the rest of the team?\\n Alex: Yes, let them know about the phishing attempt and advise them to be cautious. 
I’ll send an official email with detailed instructions.\\n Taylor: Got it. Thanks, Alex. Is there anything else I should do?\\n Alex: Just keep an eye out for any unusual activities in your accounts. I’ll handle the technical investigation and follow up with you if I need more information.\\n Taylor: Will do. Thanks again.\\n Alex: No problem. Stay safe online.\\n Example generated STIX in JSON based on the scenairo: [\\n {\\n \"type\": \"identity\",\\n \"id\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"OurCompany\",\\n \"identity_class\": \"organization\",\\n \"sectors\": [\"technology\"],\\n \"contact_information\": \"info@ourcompany.com\"\\n },\\n {\\n \"type\": \"email-addr\",\\n \"id\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"email-message\",\\n \"id\": \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\",\\n \"is_multipart\": false,\\n \"subject\": \"Urgent Benefits Package Update\",\\n \"from_ref\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"body\": \"Please click the link to review the changes to your benefits package.\"\\n },\\n {\\n \"type\": \"url\",\\n \"id\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\",\\n \"value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"user-account\",\\n \"id\": \"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\",\\n \"user_id\": \"Taylor\",\\n \"account_login\": \"taylor@ourcompany.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--001\",\\n \"observable_type\": \"email\",\\n \"observable_value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--002\",\\n \"observable_type\": \"url\",\\n \"observable_value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"indicator\",\\n \"id\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"Phishing Email Indicator\",\\n \"pattern\": 
\"[email-message:subject = \\'Urgent Benefits Package Update\\']\",\\n \"valid_from\": \"2024-07-17T00:00:00Z\"\\n },\\n {\\n \"type\": \"incident\",\\n \"id\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"name\": \"Phishing Attack on OurCompany\",\\n \"description\": \"A phishing attack where a suspicious email was sent to an employee of OurCompany.\",\\n \"first_seen\": \"2024-07-17T08:00:00Z\",\\n \"last_seen\": \"2024-07-17T08:10:00Z\",\\n \"status\": \"ongoing\",\\n \"affected_assets\": [\"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\"]\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d\",\\n \"relationship_type\": \"indicates\",\\n \"source_ref\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"target_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed\",\\n \"relationship_type\": \"attributed-to\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051\",\\n \"relationship_type\": \"uses\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\"\\n }\\n]\\n\\n\\nReasoning: Let\\'s think step by step in order to\\x1b[32m produce the answer. We need to identify the key elements of the conversation, such as the email address, URL, user account, and indicators of a phishing attack. 
By converting these elements into STIX objects in JSON format, we can represent the cyber threat information in a structured and standardized way.\\n\\nAnswer: \\n[\\n {\\n \"type\": \"identity\",\\n \"id\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"OurCompany\",\\n \"identity_class\": \"organization\",\\n \"sectors\": [\"technology\"],\\n \"contact_information\": \"info@ourcompany.com\"\\n },\\n {\\n \"type\": \"email-addr\",\\n \"id\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"email-message\",\\n \"id\": \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\",\\n \"is_multipart\": false,\\n \"subject\": \"Urgent Benefits Package Update\",\\n \"from_ref\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\\n \"body\": \"Please click the link to review the changes to your benefits package.\"\\n },\\n {\\n \"type\": \"url\",\\n \"id\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\",\\n \"value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"user-account\",\\n \"id\": \"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\",\\n \"user_id\": \"Taylor\",\\n \"account_login\": \"taylor@ourcompany.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--001\",\\n \"observable_type\": \"email\",\\n \"observable_value\": \"hr-dept@ourcompany-security.com\"\\n },\\n {\\n \"type\": \"observable\",\\n \"id\": \"observable--002\",\\n \"observable_type\": \"url\",\\n \"observable_value\": \"http://phishing-link.com/login\"\\n },\\n {\\n \"type\": \"indicator\",\\n \"id\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"name\": \"Phishing Email Indicator\",\\n \"pattern\": \"[email-message:subject = \\'Urgent Benefits Package Update\\']\",\\n \"valid_from\": \"2024-07-17T00:00:00Z\"\\n },\\n {\\n \"type\": \"incident\",\\n \"id\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"name\": \"Phishing Attack on OurCompany\",\\n 
\"description\": \"A phishing attack where a suspicious email was sent to an employee of OurCompany.\",\\n \"first_seen\": \"2024-07-17T08:00:00Z\",\\n \"last_seen\": \"2024-07-17T08:10:00Z\",\\n \"status\": \"ongoing\",\\n \"affected_assets\": [\"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\"]\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d\",\\n \"relationship_type\": \"indicates\",\\n \"source_ref\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\\n \"target_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed\",\\n \"relationship_type\": \"attributed-to\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\"\\n },\\n {\\n \"type\": \"relationship\",\\n \"id\": \"relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051\",\\n \"relationship_type\": \"uses\",\\n \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\\n \"target_ref\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\"\\n }\\n]\\x1b[0m\\n\\n\\n'" ] }, - "execution_count": 11, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } @@ -686,20 +836,6 @@ "source": [ "turbo.inspect_history(n=1)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_forward_module.svg b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_forward_module.svg new file mode 100644 index 0000000..2aa5ae2 --- /dev/null +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_forward_module.svg @@ -0,0 +1,85 @@ + + +STIXGenCoT + + +cluster_forward + +forward 
method + + + +STIXGenCoT + +LLM + + + +Question + +question + + + +Retriever + +self.retriever + + + +Question->Retriever + + + + + +Predictor + +self.predictor + + + +Question->Predictor + + + + + +Context + +context + + + +Retriever->Context + + + + + +Context->Predictor + + + + + +Results + +prompt + + + +Predictor->Results + + + + + +Results->STIXGenCoT + + +to + + + \ No newline at end of file diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_evidence_stix_dot_generator.ipynb b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_evidence_stix_dot_generator.ipynb new file mode 100644 index 0000000..038ef63 --- /dev/null +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_evidence_stix_dot_generator.ipynb @@ -0,0 +1,493 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A tutorial to visualize one-shot learning results\n", + "\n", + "### Goal\n", + "- Compare one-shot learning with zero-shot learning\n", + "- Visualize the differences\n", + "\n", + "### Approach\n", + "- Directly generate a DOT file from the one-shot learning example completed in the previous tutorial" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Download libraries and files for the lab\n", + "- Make sure you download the necessary libraries and files. 
\n", + "- All downloaded and saved files are located in the `content` folder when using Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# uncomment the commands to download libraries and files\n", + "#!pip install python-dotenv\n", + "#!pip install dspy-ai\n", + "#!pip install graphviz\n", + "\n", + "import dspy\n", + "import os\n", + "import openai\n", + "import json\n", + "from dotenv import load_dotenv\n", + "from graphviz import Source\n", + "from IPython.display import display" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "def set_dspy():\n", + " # ==============set openAI environment=========\n", + " # Path to your API key file\n", + " key_file_path = \"openai_api_key.txt\"\n", + "\n", + " # Load the API key from the file\n", + " with open(key_file_path, \"r\") as file:\n", + " openai_api_key = file.read().strip()\n", + "\n", + " # Set the API key as an environment variable\n", + " os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n", + " openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", + " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", max_tokens=2000, temperature=0)\n", + " dspy.settings.configure(lm=turbo)\n", + " return turbo\n", + " # ==============end of set openAI environment=========\n", + "\n", + "\n", + "def set_dspy_hardcode_openai_key():\n", + " os.environ[\"OPENAI_API_KEY\"] = (\n", + " \"sk-proj-yourapikeyhere\"\n", + " )\n", + " openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", + " turbo = dspy.OpenAI(model=\"gpt-3.5-turbo\", temperature=0, max_tokens=2000)\n", + " dspy.settings.configure(lm=turbo)\n", + " return turbo\n", + "\n", + "turbo=set_dspy()\n", + "# alternatively, comment out set_dspy() above and use set_dspy_hardcode_openai_key()\n", + "# turbo=set_dspy_hardcode_openai_key()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "def 
load_text_file(file_path):\n", + " \"\"\"\n", + " Load a text file and return its contents as a string.\n", + "\n", + " Parameters:\n", + " file_path (str): The path to the text file.\n", + "\n", + " Returns:\n", + " str: The contents of the text file.\n", + " \"\"\"\n", + " try:\n", + " with open(file_path, \"r\") as file:\n", + " contents = file.read()\n", + " return contents\n", + " except FileNotFoundError:\n", + " return \"File not found.\"\n", + " except Exception as e:\n", + " return f\"An error occurred: {e}\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[\n", + " {\n", + " \"type\": \"identity\",\n", + " \"id\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\n", + " \"name\": \"OurCompany\",\n", + " \"identity_class\": \"organization\",\n", + " \"sectors\": [\n", + " \"technology\"\n", + " ],\n", + " \"contact_information\": \"info@ourcompany.com\"\n", + " },\n", + " {\n", + " \"type\": \"email-addr\",\n", + " \"id\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\n", + " \"value\": \"hr-dept@ourcompany-security.com\"\n", + " },\n", + " {\n", + " \"type\": \"email-message\",\n", + " \"id\": \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\",\n", + " \"is_multipart\": false,\n", + " \"subject\": \"Urgent Benefits Package Update\",\n", + " \"from_ref\": \"email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798\",\n", + " \"body\": \"Please click the link to review the changes to your benefits package.\"\n", + " },\n", + " {\n", + " \"type\": \"url\",\n", + " \"id\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\",\n", + " \"value\": \"http://phishing-link.com/login\"\n", + " },\n", + " {\n", + " \"type\": \"user-account\",\n", + " \"id\": \"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\",\n", + " \"user_id\": \"Taylor\",\n", + " \"account_login\": \"taylor@ourcompany.com\"\n", + " },\n", + " {\n", + " \"type\": \"observable\",\n", + " 
\"id\": \"observable--001\",\n", + " \"observable_type\": \"email\",\n", + " \"observable_value\": \"hr-dept@ourcompany-security.com\"\n", + " },\n", + " {\n", + " \"type\": \"observable\",\n", + " \"id\": \"observable--002\",\n", + " \"observable_type\": \"url\",\n", + " \"observable_value\": \"http://phishing-link.com/login\"\n", + " },\n", + " {\n", + " \"type\": \"indicator\",\n", + " \"id\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\n", + " \"name\": \"Phishing Email Indicator\",\n", + " \"pattern\": \"[email-message:subject = 'Urgent Benefits Package Update']\",\n", + " \"valid_from\": \"2024-07-17T00:00:00Z\"\n", + " },\n", + " {\n", + " \"type\": \"incident\",\n", + " \"id\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\n", + " \"name\": \"Phishing Attack on OurCompany\",\n", + " \"description\": \"A phishing attack where a suspicious email was sent to an employee of OurCompany.\",\n", + " \"first_seen\": \"2024-07-17T08:00:00Z\",\n", + " \"last_seen\": \"2024-07-17T08:10:00Z\",\n", + " \"status\": \"ongoing\",\n", + " \"affected_assets\": [\n", + " \"user-account--bd5631cf-2af6-4bba-bc92-37c60d020400\"\n", + " ]\n", + " },\n", + " {\n", + " \"type\": \"relationship\",\n", + " \"id\": \"relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d\",\n", + " \"relationship_type\": \"indicates\",\n", + " \"source_ref\": \"indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\",\n", + " \"target_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\"\n", + " },\n", + " {\n", + " \"type\": \"relationship\",\n", + " \"id\": \"relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed\",\n", + " \"relationship_type\": \"attributed-to\",\n", + " \"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\n", + " \"target_ref\": \"identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f\"\n", + " },\n", + " {\n", + " \"type\": \"relationship\",\n", + " \"id\": \"relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051\",\n", + " \"relationship_type\": \"uses\",\n", + " 
\"source_ref\": \"incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857\",\n", + " \"target_ref\": \"url--4c3b-4c4b-bb6c-ded6b2a4a567\"\n", + " }\n", + "]\n" + ] + } + ], + "source": [ + "conversation = load_text_file(\"04_output_for_viz.json\")\n", + "print(conversation)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "class DotGenerator(dspy.Signature):\n", + " \"\"\"Generate a evidence knowledge graph based on a cyber incident expressed in Structured Threat Information Expression (STIX).\"\"\"\n", + "\n", + " question: str = dspy.InputField(\n", + " desc=\"a cyber incident expressed in Structured Threat Information Expression with JSON format.\"\n", + " )\n", + "\n", + " answer: str = dspy.OutputField(\n", + " desc=\"a graph in a dot format. The nodes of the graph are evidence entities in STIX or Cyber Forensic Domain Objects and Cyber Forensic Observable Objects in DFKG and the edges of the graph are the relationships between evidence entities in STIX. A dot format is primarily associated with Graphviz, a graph visualization software. For example, a dot should looks like: digraph incident_name {...}. 
Don't include `````` \"\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "# Important: Predict is better than ChainOfThought\n", + "def generate_answer_CoT(signature, text, output_file):\n", + " generate_answer = dspy.Predict(signature)\n", + " answer = generate_answer(question=text).answer # here we use the module\n", + "\n", + " with open(output_file, \"w\") as dot_file:\n", + " print(answer)\n", + " dot_file.write(answer)\n", + " print(f\"The evidence has been saved to the file {output_file}\")\n", + " return answer" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "digraph phishing_attack {\n", + " \"OurCompany\" [label=\"OurCompany\\norganization\\ninfo@ourcompany.com\"]\n", + " \"hr-dept@ourcompany-security.com\" [label=\"hr-dept@ourcompany-security.com\"]\n", + " \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\" [label=\"email-message\\nUrgent Benefits Package Update\\nFrom: hr-dept@ourcompany-security.com\\nPlease click the link to review the changes to your benefits package.\"]\n", + " \"http://phishing-link.com/login\" [label=\"http://phishing-link.com/login\"]\n", + " \"Taylor\" [label=\"Taylor\\ntaylor@ourcompany.com\"]\n", + " \n", + " \"hr-dept@ourcompany-security.com\" -> \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\"\n", + " \"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\" -> \"http://phishing-link.com/login\"\n", + " \"Taylor\" -> \"hr-dept@ourcompany-security.com\"\n", + " \n", + " \"Phishing Email Indicator\" [label=\"Phishing Email Indicator\\nPattern: [email-message:subject = 'Urgent Benefits Package Update']\\nValid From: 2024-07-17T00:00:00Z\"]\n", + " \"Phishing Attack on OurCompany\" [label=\"Phishing Attack on OurCompany\\nDescription: A phishing attack where a suspicious email was sent to an employee of OurCompany.\\nFirst Seen: 
2024-07-17T08:00:00Z\\nLast Seen: 2024-07-17T08:10:00Z\\nStatus: ongoing\"]\n", + " \n", + " \"Phishing Email Indicator\" -> \"Phishing Attack on OurCompany\"\n", + " \"OurCompany\" -> \"Phishing Attack on OurCompany\"\n", + " \"Phishing Attack on OurCompany\" -> \"Taylor\"\n", + " \"Phishing Attack on OurCompany\" -> \"http://phishing-link.com/login\"\n", + "}\n" + ] + } + ], + "source": [ + "output_file = \"05_output.dot\"\n", + "dot_description = generate_answer_CoT(\n", + " DotGenerator,\n", + " conversation,\n", + " output_file,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Graph saved as: 05_output_stix_oneshot.png\n" + ] + }, + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "phishing_attack\n", + "\n", + "\n", + "\n", + "OurCompany\n", + "\n", + "OurCompany\n", + "organization\n", + "info@ourcompany.com\n", + "\n", + "\n", + "\n", + "Phishing Attack on OurCompany\n", + "\n", + "Phishing Attack on OurCompany\n", + "Description: A phishing attack where a suspicious email was sent to an employee of OurCompany.\n", + "First Seen: 2024-07-17T08:00:00Z\n", + "Last Seen: 2024-07-17T08:10:00Z\n", + "Status: ongoing\n", + "\n", + "\n", + "\n", + "OurCompany->Phishing Attack on OurCompany\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "hr-dept@ourcompany-security.com\n", + "\n", + "hr-dept@ourcompany-security.com\n", + "\n", + "\n", + "\n", + "email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\n", + "\n", + "email-message\n", + "Urgent Benefits Package Update\n", + "From: hr-dept@ourcompany-security.com\n", + "Please click the link to review the changes to your benefits package.\n", + "\n", + "\n", + "\n", + "hr-dept@ourcompany-security.com->email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "http://phishing-link.com/login\n", + "\n", + 
"http://phishing-link.com/login\n", + "\n", + "\n", + "\n", + "email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97->http://phishing-link.com/login\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Taylor\n", + "\n", + "Taylor\n", + "taylor@ourcompany.com\n", + "\n", + "\n", + "\n", + "Taylor->hr-dept@ourcompany-security.com\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Phishing Email Indicator\n", + "\n", + "Phishing Email Indicator\n", + "Pattern: [email-message:subject = 'Urgent Benefits Package Update']\n", + "Valid From: 2024-07-17T00:00:00Z\n", + "\n", + "\n", + "\n", + "Phishing Email Indicator->Phishing Attack on OurCompany\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Phishing Attack on OurCompany->http://phishing-link.com/login\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Phishing Attack on OurCompany->Taylor\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Load the .DOT file\n", + "dot_file_path = \"05_output.dot\"\n", + "\n", + "with open(dot_file_path, \"r\") as file:\n", + " dot_content = file.read()\n", + "\n", + "# Create a Graphviz Source object and render it\n", + "dot = Source(dot_content)\n", + "\n", + "# Render the graph and save it as a PNG file\n", + "output_file_path = \"05_output_stix_oneshot\"\n", + "dot.format = \"png\"\n", + "dot.render(output_file_path, cleanup=True)\n", + "\n", + "# Display the saved PNG file path\n", + "print(f\"Graph saved as: {output_file_path}.png\")\n", + "\n", + "# Display the graph in the Jupyter notebook\n", + "dot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import Image\n", + "\n", + "# Path to the image file\n", + "image_path = \"path/to/your/image.png\"\n", + "\n", + "# Display the image\n", + "Image(filename=image_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "# Summary\n", + "\n", + ", e.g., [Digital Forensic Knowledge Graph (DFKG)](https://github.com/frankwxu/digital-forensics-lab/tree/main/STIX_for_digital_forensics). " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_forward_module.svg b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_forward_module.svg new file mode 100644 index 0000000..2aa5ae2 --- /dev/null +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_forward_module.svg @@ -0,0 +1,85 @@ + + +STIXGenCoT + + +cluster_forward + +forward method + + + +STIXGenCoT + +LLM + + + +Question + +question + + + +Retriever + +self.retriever + + + +Question->Retriever + + + + + +Predictor + +self.predictor + + + +Question->Predictor + + + + + +Context + +context + + + +Retriever->Context + + + + + +Context->Predictor + + + + + +Results + +prompt + + + +Predictor->Results + + + + + +Results->STIXGenCoT + + +to + + + \ No newline at end of file diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.dot b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.dot index 33c01b4..f15a043 100644 --- a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.dot +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.dot @@ -1,8 +1,21 @@ -digraph Phishing_Attack { - "OurCompany" -> "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857" [label="attributed-to"]; - "email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798" -> 
"email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97" [label="from"]; - "email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97" -> "url--4c3b-4c4b-bb6c-ded6b2a4a567" [label="contains"]; - "email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97" -> "observable--001" [label="observable"]; - "url--4c3b-4c4b-bb6c-ded6b2a4a567" -> "observable--002" [label="observable"]; - "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857" -> "indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f" [label="indicates"]; +digraph phishing_attack { + "identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f" [label="OurCompany", shape="ellipse"]; + "email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798" [label="hr-dept@ourcompany-security.com", shape="ellipse"]; + "email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97" [label="Urgent Benefits Package Update", shape="box"]; + "url--4c3b-4c4b-bb6c-ded6b2a4a567" [label="http://phishing-link.com/login", shape="ellipse"]; + "user-account--bd5631cf-2af6-4bba-bc92-37c60d020400" [label="Taylor (taylor@ourcompany.com)", shape="ellipse"]; + + "observable--001" [label="hr-dept@ourcompany-security.com", shape="ellipse"]; + "observable--002" [label="http://phishing-link.com/login", shape="ellipse"]; + + "indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f" [label="Phishing Email Indicator", shape="diamond"]; + + "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857" [label="Phishing Attack on OurCompany", shape="box"]; + + "relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d" -> "indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f"; + "relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d" -> "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857"; + "relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed" -> "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857"; + "relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed" -> "identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f"; + "relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051" -> "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857"; + 
"relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051" -> "url--4c3b-4c4b-bb6c-ded6b2a4a567"; } \ No newline at end of file diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.png b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.png index b640bb7..e52b652 100644 Binary files a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.png and b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output.png differ diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_stix_oneshot.png b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_stix_oneshot.png new file mode 100644 index 0000000..4805427 Binary files /dev/null and b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_stix_oneshot.png differ diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_stix_zeroshot.svg b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_stix_zeroshot.svg new file mode 100644 index 0000000..b0e9c8d --- /dev/null +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_stix_zeroshot.svg @@ -0,0 +1,152 @@ + + +G + + + +Email + +Email +sender: support@banksecure.com +subject: Urgent: Verify Your Account Now + + + +Headers + +Headers +IP_address: 192.168.10.45 +domain: banksecure.com +registered_to: Russia + + + +Email->Headers + + + + + +Link_0 + +Link +URL: http://banksecure-verification.com/login +timestamp: 10:15 AM + + + +Email->Link_0 + + + + + +Link_1 + +Link +URL: http://banksecure-verification.com/account-details +timestamp: 10:17 AM + + + +Email->Link_1 + + + + + +Attachment_0 + +Attachment +file_name: AccountDetails.exe +created_at: 10:20 AM +MD5_hash: e99a18c428cb38d5f260853678922e03 +status: known_malware + + + +Email->Attachment_0 + + + + + +Actions + +Actions Taken + + + +Action_0 + +Clear browser history and cache + + + 
+Actions->Action_0 + + + + + +Action_1 + +Run full antivirus scan + + + +Actions->Action_1 + + + + + +Action_2 + +Provide browser history entries and cookies + + + +Actions->Action_2 + + + + + +Action_3 + +Quarantine suspicious file + + + +Actions->Action_3 + + + + + +Action_4 + +Check network connections + + + +Actions->Action_4 + + + + + +Action_5 + +Reset passwords and enable two-factor authentication + + + +Actions->Action_5 + + + + + \ No newline at end of file diff --git a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_viz.dot b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_viz.dot new file mode 100644 index 0000000..a730ee2 --- /dev/null +++ b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_viz.dot @@ -0,0 +1,19 @@ +digraph phishing_attack { + "OurCompany" [label="OurCompany\norganization"] + "hr-dept@ourcompany-security.com" [label="hr-dept@ourcompany-security.com\nemail-addr"] + "Urgent Benefits Package Update" [label="Urgent Benefits Package Update\nemail-message"] + "http://phishing-link.com/login" [label="http://phishing-link.com/login\nurl"] + "Taylor" [label="Taylor\nuser-account"] + + "OurCompany" -> "hr-dept@ourcompany-security.com" [label="email"] + "hr-dept@ourcompany-security.com" -> "Urgent Benefits Package Update" [label="email"] + "Urgent Benefits Package Update" -> "http://phishing-link.com/login" [label="link"] + "Taylor" -> "hr-dept@ourcompany-security.com" [label="login"] + + "Urgent Benefits Package Update" -> "Phishing Email Indicator" [label="indicator"] + "http://phishing-link.com/login" -> "Phishing Email Indicator" [label="indicator"] + + "Phishing Email Indicator" -> "Phishing Attack on OurCompany" [label="indicates"] + "Phishing Attack on OurCompany" -> "OurCompany" [label="attributed-to"] + "Phishing Attack on OurCompany" -> "http://phishing-link.com/login" [label="uses"] +} \ No newline at end of file diff --git 
a/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_viz.png b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_viz.png new file mode 100644 index 0000000..e52b652 Binary files /dev/null and b/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/05_output_viz.png differ diff --git a/AI4Forensics/CKIM2024/readme.md b/AI4Forensics/CKIM2024/readme.md index 654eee0..7f5ec48 100644 --- a/AI4Forensics/CKIM2024/readme.md +++ b/AI4Forensics/CKIM2024/readme.md @@ -30,15 +30,26 @@ By fostering a collaborative learning environment, this tutorial aims to empower ## Table of Contents - Introduction -- Forensic evidence entity recognition -- Profiling suspect based on browser history -- [Political insights analysis based on Hillary's leaked Emails](#political-insight-analysis-leveraging-llms) -- Evidence knowledge reconstruction +- [Forensic evidence entity recognition (hands-on lab)](#forensic-evidence-analysis) + - [Evidence entity recognition](PhishingAttack\PhishingAttackScenarioDemo\01_evidence_entity_recognition.ipynb) + - [Visualize evidence and their relations](PhishingAttackScenarioDemo\02_evidence_knowledge_dot_generator.ipynb) +- [Evidence knowledge graphs reconstruction (hands-on lab)](#forensic-evidence-analysis) + - [Construct a knowledge graph in STIX (zero-shot)](PhishingAttackScenarioDemo\03_evidence_stix_zeroshot.ipynb) + - [Construct a knowledge graph in STIX (one-shot)](PhishingAttackScenarioDemo\04_evidence_stix_oneshot.ipynb) + - [Compare one-shot vs. 
zero-shot](PhishingAttackScenarioDemo\05_evidence_stix_dot_generator.ipynb) +- Profiling suspect based on browser history (hands-on lab) +- [Political insights analysis based on Hillary's leaked Emails (hands-on lab)](#political-insight-analysis-leveraging-llms) - Challenges and Limitations of Leveraging LLM in Digital Forensics - Conclusion --- +### Forensic Evidence Analysis + +The cyber incident report documents a conversation between an IT Security Specialist and an Employee about an email phishing attack. We use LLMs to identify evidence entities and relationships and to construct digital forensic knowledge graphs. + +Here is an example of a reconstructed digital forensics knowledge graph: + ### Political Insight Analysis Leveraging LLMs The case study demonstrates how to Leverage Large Language Models to gain political insight based on an email dataset. The dataset we have used in the case study is a set of leaked [emails](https://github.com/benhamner/hillary-clinton-emails?tab=readme-ov-file) obtained from Hillary Clinton's private email server. @@ -47,7 +58,7 @@ The background of the leaked emails is a significant chapter in recent U.S. poli The leaked email dataset from Hillary Clinton's private email server is a comprehensive collection of communications covering her entire tenure as Secretary of State from 2009 to 2013. It includes approximately 30,000 emails with a wide range of topics from official diplomatic communications to personal correspondences. The release and subsequent analysis of these emails have played a crucial role in political debates, legal inquiries, and public discussions about transparency and security in government communications. -Our dataset: [a set of email summaries](/AI4Forensics/CKIM2024/HillaryEmails/results_email_summary.txt). 
Each email summary is a summarization of an email generated by Gemini from an original email in the original leaked [email dataset](https://github.com/benhamner/hillary-clinton-emails?tab=readme-ov-file). We are only interested in emails containing the keyword "israel". +Our dataset: [a set of email summaries](/AI4Forensics/CKIM2024/HillaryEmails/results_email_summary.txt). Each email summary is a summarization of an email generated by Gemini from an original email in the original leaked [email dataset](https://github.com/benhamner/hillary-clinton-emails?tab=readme-ov-file). We are only interested in emails containing the keyword "Israel". Our results: [Code in Jupyter Notebook](/AI4Forensics/CKIM2024/HillaryEmails/email_analysis_political_insight.ipynb).
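The keyword filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked notebook: the function name `filter_by_keyword`, the case-insensitive matching, and the sample summaries are all assumptions, and the real layout of `results_email_summary.txt` may differ.

```python
from typing import Iterable, List


def filter_by_keyword(summaries: Iterable[str], keyword: str = "Israel") -> List[str]:
    """Return the summaries that mention `keyword`, ignoring case.

    Hypothetical helper: assumes each summary is already a separate string.
    """
    needle = keyword.casefold()
    return [summary for summary in summaries if needle in summary.casefold()]


if __name__ == "__main__":
    # Made-up summaries for illustration only; not taken from the dataset.
    sample = [
        "Discussion of upcoming talks with Israel.",
        "Scheduling note for a staff meeting.",
        "Briefing on ISRAEL and regional security.",
    ]
    for summary in filter_by_keyword(sample):
        print(summary)
```

Matching is case-insensitive here so that "israel", "Israel", and "ISRAEL" are all kept; a stricter match (for example on whole words) may be preferable if substring hits prove too noisy.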