revise political analysis

2026-02-21 11:17:52 +00:00 · 2024-07-24 16:46:47 -04:00
parent 5826f7985f
commit 772d3df88d
4 changed files with 195 additions and 71 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 openai_api_key.txt
 openai_api_key.env
-openai_api_key
+openai_api_key
+my_config.env
--- a/AI4Forensics/CKIM2024/HillaryEmails/email_analysis_political_insight.ipynb
+++ b/AI4Forensics/CKIM2024/HillaryEmails/email_analysis_political_insight.ipynb
@@ -1,92 +1,168 @@
 {
 "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## A tutorial on analyzing political insight with LLMs\n",
+    "\n",
+    "This case study demonstrates how to leverage large language models (LLMs) to gain political insight from a leaked [email](https://github.com/benhamner/hillary-clinton-emails?tab=readme-ov-file) dataset from Hillary Clinton's private email server.\n",
+    "- The email dataset is a comprehensive collection of communications covering her entire tenure as Secretary of State from 2009 to 2013. \n",
+    "- It includes approximately 30,000 emails covering a wide range of topics, from official diplomatic communications to personal correspondence.\n",
+    "- The release and subsequent analysis of these emails have played a crucial role in political debates, legal inquiries, and public discussions about transparency and security in government communications.\n",
+    "\n",
+    "### Goals of analysis with an LLM\n",
+    "- Input for the LLM: emails covering various political scenarios, historical events, or current affairs related to Israel\n",
+    "- Task for the LLM: analyze the emails from political, social, and economic perspectives\n",
+    "    - provide insights into the implications of these scenarios, \n",
+    "    - how they reflect on Israel's domestic and foreign policy, and \n",
+    "    - what potential outcomes or future developments could arise from them.\n",
+    "- Output: analysis results\n",
+    "    - No specific format is required.\n",
+    "\n",
+    "\n",
+    "### Dataset in this study\n",
+    "A set of email summaries (138 paragraphs) from the leaked email dataset\n",
+    "- each summary condenses one email containing the keyword \"Israel\"\n",
+    "    - some emails are very long, and LLMs have token limits\n",
+    "- summarization is done by Gemini \n",
+    "    - Gemini API is [free](https://aistudio.google.com/app/apikey)\n",
+    "\n",
+    "\n",
+    "### Implementation Plan\n",
+    "- [langchain](https://www.langchain.com/)\n",
+    "    - a popular open-source framework \n",
+    "    - designed to simplify the development of applications using LLMs\n",
+    "- Gemini - API is [free](https://aistudio.google.com/app/apikey)\n",
+    "    - summarization\n",
+    "    - political analysis \n",
+    "- Can we use DSPy?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 1: Download libraries\n",
+    "- Make sure you use `pip` to download the necessary libraries\n",
+    "- If you are using Google Colab, all downloaded and saved files are located in the `content` folder"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
-    "# pip -q install google-generativeai==0.3.0\n",
-    "# pip -q install google-ai-generativelanguage==0.4.0\n",
-    "# pip install python-dotenv\n",
-    "# pip install --upgrade langchain\n",
-    "# pip -q install langchain_experimental langchain_core\n",
-    "# pip -q install langchain-google-genai\n",
-    "# pip show langchain langchain-core\n",
-    "# pip install python-pptxy\n",
+    "# !pip -q install google-generativeai\n",
+    "# !pip -q install langchain-google-genai\n",
+    "# !pip install python-dotenv\n",
+    "# !pip -q install langchain_experimental langchain_core\n",
+    "# !pip install --upgrade langchain\n",
    "\n",
-    "\n",
-    "import numpy as np\n",
-    "import os\n",
-    "import re\n",
-    "import datetime\n",
-    "import time\n",
-    "import tenacity\n",
-    "import argparse\n",
-    "import configparser\n",
-    "import json\n",
-    "import tiktoken\n",
-    "import jieba\n",
-    "from collections import namedtuple\n",
-    "\n",
-    "# setup\n",
    "import google.generativeai as genai\n",
-    "\n",
    "from IPython.display import display\n",
    "from IPython.display import Markdown\n",
-    "\n",
-    "import os\n",
    "from dotenv import load_dotenv\n",
-    "\n",
+    "import os\n",
+    "from langchain_google_genai import ChatGoogleGenerativeAI, HarmBlockThreshold, HarmCategory\n",
+    "from langchain.prompts import ChatPromptTemplate\n",
+    "from langchain_core.output_parsers import StrOutputParser"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 2: Configure Gemini\n",
+    "- supply your own Gemini API key:\n",
+    "```genai.configure(api_key=GOOGLE_AI_STUDIO)```\n",
+    "- set up the Gemini model\n",
+    "- configure the safety settings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ================ Key configuration===========\n",
    "# Load environment variables from the .env file\n",
    "load_dotenv(\"my_config.env\")\n",
    "\n",
    "# Access the environment variables\n",
    "GOOGLE_AI_STUDIO = os.getenv(\"GOOGLE_AI_STUDIO2\")\n",
-    "genai.configure(api_key=GOOGLE_AI_STUDIO )"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 21,
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "\n",
+    "# Configure the SDK with your own Gemini API key (loaded from the .env file above)\n",
+    "genai.configure(api_key=GOOGLE_AI_STUDIO)\n",
+    "\n",
+    "\n",
+    "# ======= Generation configuration ===========\n",
    "# Set up the model\n",
+    "# Temperature controls the randomness of the model's output.\n",
    "generation_config = {\n",
-    "  \"temperature\": 0.0,\n",
-    "  \"top_p\": 1,\n",
-    "  \"top_k\": 32,\n",
-    "  \"max_output_tokens\": 4096,\n",
+    "    \"temperature\": 0.0,  # Controls the randomness of the model's output\n",
+    "    \"top_p\": 1,  # Chooses the smallest set of tokens whose cumulative probability exceeds the threshold p.  1 means all tokens are considered\n",
+    "    \"top_k\": 16,  # Selects the k most likely next tokens.\n",
+    "    \"max_output_tokens\": 4096,\n",
    "}\n",
    "\n",
-    "safety_settings = [\n",
-    "    {\"category\": \"HARM_CATEGORY_HARASSMENT\", \"threshold\": \"BLOCK_NONE\"},\n",
-    "    {\"category\": \"HARM_CATEGORY_HATE_SPEECH\", \"threshold\": \"BLOCK_NONE\"},\n",
-    "    {\"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\", \"threshold\": \"BLOCK_NONE\"},\n",
-    "    {\"category\": \"HARM_CATEGORY_DANGEROUS_CONTENT\", \"threshold\": \"BLOCK_NONE\"},\n",
-    "]"
+    "# ======= Safety configuration=================\n",
+    "# Disable the safety filters through LangChain\n",
+    "safety_settings = {\n",
+    "    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,\n",
+    "    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,\n",
+    "    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,\n",
+    "    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 3: Build a Gemini model with the configurations"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
-    "# read a paper\n",
-    "from langchain_google_genai import ChatGoogleGenerativeAI\n",
-    "from langchain.prompts import ChatPromptTemplate\n",
-    "from langchain_core.output_parsers import StrOutputParser\n",
-    "\n",
    "model = ChatGoogleGenerativeAI(\n",
    "    model=\"gemini-pro\",\n",
    "    generation_config=generation_config,\n",
    "    safety_settings=safety_settings,\n",
    "    google_api_key=GOOGLE_AI_STUDIO,\n",
-    ")\n",
-    "\n",
-    "\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 4: Create a prompt template\n",
+    "- The template is a multi-line string containing placeholders in curly braces, which are filled in as follows:\n",
+    "```\n",
+    "        formatted_prompt = prompt.format(\n",
+    "            role=\"You are a helpful assistant.\",\n",
+    "            provided_data=\"Here's some context: ...\",\n",
+    "            start=\"Please answer the following question:\"\n",
+    "        )\n",
+    "```\n",
+    "- `{role}`, `{provided_data}`, and `{start}` are placeholders that will be filled in later.\n",
+    "    - `{role}`: defines the role's name, overall objective, task-specific context, and any applicable constraints.\n",
+    "    - `{provided_data}`: supplies the data required to complete the task\n",
+    "    - `{start}`: the initiating instruction that triggers the role to carry out the task"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
    "template = \"\"\" \n",
    "{role}\\\n",
    "{provided_data}\\\n",
@@ -95,35 +171,71 @@
    "prompt = ChatPromptTemplate.from_template(template)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 5: Use LangChain to create a simple processing chain\n",
+    "\n",
+    "Flow of operation `chain = prompt | model | output_parser`\n",
+    "- The prompt is first formatted and sent to the model.\n",
+    "- The model processes the prompt and generates a response.\n",
+    "- The output parser then processes the model's response, ensuring it's in the correct string format."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
-       "**Sure, here are some political insights based on the leaked email summaries obtained from Hillary Clinton's private email server that are related to Israel:**\n",
+       "**Political Insights Based on Leaked Hillary Clinton Emails Related to Israel**\n",
       "\n",
-       "* **Israel's settlement policy is a major obstacle to peace.** This is evident from the fact that the Obama administration repeatedly urged Israel to freeze settlement construction, but Israel refused. Netanyahu was under pressure from right-wing parties in his coalition government to block the renewal of the settlement freeze policy, and he ultimately chose not to extend it. This decision was seen as a major setback to US peacemaking efforts.\n",
-       "* **Netanyahu's negotiating tactics are self-defeating.** Netanyahu's approach to negotiations with the Palestinians has been criticized by many, including former Shin Bet chief Yuval Diskin. Diskin warned that Netanyahu's tactics were contributing to distrust on the Palestinian side and making it more difficult to reach a peace agreement.\n",
-       "* **The Israeli public is ready for a peace deal.** This is evident from the fact that Kadima leader Tzipi Livni was willing to bring her party into the government without demanding rotation if Netanyahu was serious about negotiating peace. However, Netanyahu's failure to make a serious move towards peace could further delegitimize Israel internationally.\n",
-       "* **The US-Israel relationship is strong, but it is also complex.** The emails show that the US and Israel have a close relationship, but they also have disagreements on a number of issues, including the settlement issue. The US has been critical of Israel's settlement policy, and it has also urged Israel to take steps to improve the humanitarian situation in Gaza.\n",
-       "* **The US is committed to a two-state solution.** This is evident from the fact that the US has repeatedly called for a two-state solution to the Israeli-Palestinian conflict. The US believes that a two-state solution is the only way to achieve a lasting peace in the region.\n",
+       "The leaked email summaries provide valuable insights into the political dynamics surrounding Israel during Hillary Clinton's tenure as Secretary of State. These insights can be categorized as follows:\n",
       "\n",
-       "These are just a few of the political insights that can be gleaned from the leaked email summaries. These emails provide a valuable glimpse into the US-Israel relationship and the challenges to peace in the Middle East."
+       "**1. Diplomatic Challenges and Negotiations:**\n",
+       "\n",
+       "* The emails reveal ongoing diplomatic efforts to facilitate peace talks between Israel and the Palestinians, highlighting the complexities and challenges involved in negotiations.\n",
+       "* They shed light on the delicate balance between maintaining good relations with Israel while also addressing concerns from Arab and Palestinian partners.\n",
+       "\n",
+       "**2. Settlement Freeze and Construction:**\n",
+       "\n",
+       "* The emails discuss the controversial issue of Israeli settlement construction in the West Bank, including the Obama administration's efforts to secure a settlement freeze and Israel's reluctance to fully comply.\n",
+       "* They provide evidence of ongoing tensions between the US and Israel over this issue, which remains a significant obstacle to peace efforts.\n",
+       "\n",
+       "**3. Public Perception and International Pressure:**\n",
+       "\n",
+       "* The emails reflect the challenges faced by Israel in managing its international image, particularly in the wake of incidents like the Gaza Flotilla raid.\n",
+       "* They show how the US administration attempted to mediate between Israel and the international community, emphasizing the importance of accountability and restraint.\n",
+       "\n",
+       "**4. Domestic Political Considerations:**\n",
+       "\n",
+       "* The emails provide glimpses into the domestic political dynamics within Israel, including the influence of right-wing parties and the challenges faced by Prime Minister Netanyahu in balancing their demands with international pressure.\n",
+       "* They highlight the complexities of Israeli coalition politics and the impact on decision-making.\n",
+       "\n",
+       "**5. Security Concerns:**\n",
+       "\n",
+       "* The emails touch on security-related issues, such as the humanitarian crisis in Gaza and the need for a two-state solution to address both Israeli security concerns and Palestinian aspirations.\n",
+       "* They demonstrate the interconnectedness of political and security matters in the region.\n",
+       "\n",
+       "Overall, these leaked emails offer valuable insights into the complexities of US-Israel relations, the challenges of peace negotiations, and the political dynamics shaping Israel's domestic and foreign policy. They underscore the importance of diplomacy, dialogue, and a balanced approach to address the multifaceted issues in the region."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
-     "execution_count": 23,
+     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
+    "# StrOutputParser is a LangChain utility that converts the model's output into a plain string.\n",
    "output_parser = StrOutputParser()\n",
+    "\n",
+    "# This line creates a processing chain using the pipe (|) operator.\n",
    "chain = prompt | model | output_parser\n",
    "\n",
    "with open(r\".\\role_political_analyst.txt\", \"r\") as file:\n",
@@ -135,20 +247,31 @@
    "with open(r\".\\start_political_analyst.txt\", \"r\") as file:\n",
    "    start = file.read()\n",
    "\n",
-    "\n",
    "result = chain.invoke(\n",
    "    {\n",
    "        \"role\": role,\n",
    "        \"provided_data\": provided_data,\n",
    "        \"start\": start,\n",
    "    }\n",
    ")\n",
+    "\n",
    "Markdown(result)"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
@@ -161,7 +284,7 @@
   ],
   "source": [
    "# Open a file for writing ('w' mode) and create it if it doesn't exist\n",
-    "with open(r\".\\result_political.txt\", \"w\") as file:\n",
+    "with open(r\"result_political.txt\", \"w\") as file:\n",
    "    # Write content to the file\n",
    "    file.write(result)\n",
    "\n",
@@ -192,7 +315,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.18"
+   "version": "3.12.3"
  }
 },
 "nbformat": 4,
--- a/AI4Forensics/CKIM2024/readme.md
+++ b/AI4Forensics/CKIM2024/readme.md
@@ -70,7 +70,7 @@ Here are some political insights based on the leaked email summaries obtained fr

 ---

-Please cite our [paper](/papers/cikm2024.pdf):
+Please cite our [paper](/papers/CIKM2024.pdf):

 Eric Xu, Wenbin Zhang, and Weifeng Xu, "Transforming Digital Forensics with Large Language Models: Unlocking Automation, Insights, and Justice," in <em>Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), Boise, USA, October 21-25, 2024</em>

--- a/papers/CIKM2024.pdf
+++ b/papers/CIKM2024.pdf
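
Note on Steps 4 and 5 of the notebook: the flow `chain = prompt | model | output_parser` can be tried without an API key using a plain-Python stand-in. The sketch below is only an illustration of the idea, not LangChain's implementation: `Step`, `fake_model`, and `to_text` are hypothetical names, the template is simplified to newline-separated placeholders, and the "model" simply echoes the prompt length instead of calling Gemini.

```python
# Plain-Python illustration of the notebook's prompt -> model -> parser flow.
# Step, fake_model and to_text are hypothetical stand-ins, not LangChain APIs.


class Step:
    """Wraps a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (a | b).invoke(x) runs a first, then feeds its result to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Simplified template with the same three placeholders as the notebook.
TEMPLATE = "{role}\n{provided_data}\n{start}"

# Step 4: fill the three placeholders from a dict of inputs.
prompt = Step(lambda inputs: TEMPLATE.format(**inputs))

# Stand-in for the Gemini call: returns a dict shaped like a chat message.
fake_model = Step(lambda text: {"content": f"[analysis of {len(text)} characters]"})

# Stand-in for StrOutputParser: keeps only the text of the reply.
to_text = Step(lambda message: message["content"])

# Step 5: compose the three stages into one chain.
chain = prompt | fake_model | to_text

inputs = {
    "role": "You are a political analyst.",
    "provided_data": "Summary 1: ...",
    "start": "Please provide your analysis.",
}
formatted = prompt.invoke(inputs)
result = chain.invoke(inputs)
print(formatted)
print(result)
```

In the notebook itself, the same dictionary keys (`role`, `provided_data`, `start`) are passed to `chain.invoke`, with the values read from the three text files.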