{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HELMify Demo: Peptide Monomer HELM Name Generation\n",
    "\n",
    "This notebook describes and demonstrates the three different approaches implemented in HELMify: A Hybrid Rule- and LLM-Based Generator of Peptide Monomer HELM Names.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "Before running this notebook, ensure you have:\n",
    "1. Set up the conda environment: `conda env create -f environment.yaml`\n",
    "2. Activated the conda environment: `conda activate helmify`\n",
    "3. Configured your `.env` file with the required API keys and database paths inside the /helmify directory\n",
    "4. Started the API server from within the helmify directory: `uvicorn main:api --env-file .env`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "from pprint import pprint\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# API endpoint (adjust if running on different host/port)\n",
    "API_BASE_URL = \"http://127.0.0.1:8000\"\n",
    "NAMING_ENDPOINT = f\"{API_BASE_URL}/helm-api-name\"\n",
    "\n",
    "# API Timeout Constants (PEP 8: Constants at module level)\n",
    "API_TIMEOUT_FAST = 60   # For zone-based operations (faster rule-based processing)\n",
    "API_TIMEOUT_LLM = 120   # For LLM operations (requires model inference time)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 1: Standard Amino Acid Derivatives\n",
    "\n",
    "Let's start with some modified amino acids that are commonly found in peptide research:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample SMILES strings - modified amino acids\n",
    "sample_smiles = [\n",
    "    \"c1cc(c(cc1C[C@@H](C(=O)O)N)F)F\",\n",
    "    \"C=CC[C@@H](C(=O)O)N\",\n",
    "    \"C1CN[C@@H]1C(=O)O\"]\n",
    "# Display our test molecules\n",
    "print(\"🧪 Test SMILES:\")\n",
    "for i, smiles in enumerate(sample_smiles, 1):\n",
    "    print(f\"{i}. {smiles}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Full LLM Naming Approach\n",
    "\n",
    "This method will work, assuming all of the openai/LLM model credentials are configured correctly,t he openeye license server is set, and the monomer database has more entries than the number of nearest neighbors used for context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test Full LLM approach\n",
    "def test_full_llm_method(smiles_list):\n",
    "    \"\"\"Test the full LLM naming method.\"\"\"\n",
    "    payload = {\n",
    "        \"smiles_strings\": smiles_list,\n",
    "        \"full_llm_flag\": True,\n",
    "        \"explanation_flag\": True,\n",
    "        \"prompt_flag\": True,\n",
    "        \"zone_flag\": False,\n",
    "        \"hybrid_flag\": False\n",
    "    }\n",
    "    \n",
    "    try:\n",
    "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_LLM)\n",
    "        response.raise_for_status()\n",
    "        return response.json()\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"❌ API Error: {e}\")\n",
    "        return None\n",
    "\n",
    "print(\"🤖 Testing Full LLM Method...\")\n",
    "print(\"This may take 30-60 seconds as it calls the LLM for each molecule...\\n\")\n",
    "\n",
    "full_llm_results = test_full_llm_method(sample_smiles[:2])  # Test first 2 molecules\n",
    "\n",
    "if full_llm_results:\n",
    "    print(\"✅ Full LLM Results:\")\n",
    "    for i, (smiles, result) in enumerate(zip(sample_smiles[:2], full_llm_results), 1):\n",
    "        print(f\"\\n{i}. SMILES: {smiles}\")\n",
    "        print(f\"   HELM Name: {result.get('Full LLM', 'Not generated')}\")\n",
    "        if result.get('Full LLM Explanation'):\n",
    "            print(f\"   Explanation: {result['Full LLM Explanation'][:100]}...\")\n",
    "else:\n",
    "    print(\"⚠️  Could not test Full LLM method. Check your API configuration.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Zone-Based Naming Approach\n",
    "\n",
    "This method will not work out of the box, as it needs a zone-based implementation and the corresponding API (ZONE_URL) to function correctly.This method uses structural decomposition to identify zones and build HELM names using predefined rules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test Zone-based approach\n",
    "def test_zone_method(smiles_list):\n",
    "    \"\"\"Test the zone-based naming method.\"\"\"\n",
    "    payload = {\n",
    "        \"smiles_strings\": smiles_list,\n",
    "        \"full_llm_flag\": False,\n",
    "        \"explanation_flag\": False,\n",
    "        \"prompt_flag\": False,\n",
    "        \"zone_flag\": True,\n",
    "        \"hybrid_flag\": False\n",
    "    }\n",
    "    \n",
    "    try:\n",
    "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_FAST)\n",
    "        response.raise_for_status()\n",
    "        return response.json()\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"❌ API Error: {e}\")\n",
    "        return None\n",
    "\n",
    "print(\"⚙️  Testing Zone-Based Method...\")\n",
    "zone_results = test_zone_method(sample_smiles)\n",
    "\n",
    "if zone_results:\n",
    "    print(\"✅ Zone-Based Results:\")\n",
    "    for i, (smiles, result) in enumerate(zip(sample_smiles, zone_results), 1):\n",
    "        print(f\"\\n{i}. SMILES: {smiles}\")\n",
    "        print(f\"   HELM Name: {result.get('Zone', 'Not generated')}\")\n",
    "else:\n",
    "    print(\"⚠️  Could not test Zone method. Check your zone service configuration.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hybrid Approach\n",
    "\n",
    "This method combines zone-based naming with LLM naming for unknown substituents, providing the best of both worlds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test Hybrid approach\n",
    "def test_hybrid_method(smiles_list):\n",
    "    \"\"\"Test the hybrid naming method.\"\"\"\n",
    "    payload = {\n",
    "        \"smiles_strings\": smiles_list,\n",
    "        \"full_llm_flag\": False,\n",
    "        \"explanation_flag\": True,\n",
    "        \"prompt_flag\": True,\n",
    "        \"zone_flag\": True,\n",
    "        \"hybrid_flag\": True\n",
    "    }\n",
    "    \n",
    "    try:\n",
    "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_LLM)\n",
    "        response.raise_for_status()\n",
    "        return response.json()\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"❌ API Error: {e}\")\n",
    "        return None\n",
    "\n",
    "print(\"🔄 Testing Hybrid Method...\")\n",
    "hybrid_results = test_hybrid_method(sample_smiles[:2])\n",
    "\n",
    "if hybrid_results:\n",
    "    print(\"✅ Hybrid Results:\")\n",
    "    for i, (smiles, result) in enumerate(zip(sample_smiles[:2], hybrid_results), 1):\n",
    "        print(f\"\\n{i}. SMILES: {smiles}\")\n",
    "        print(f\"   Zone Name: {result.get('Zone', 'Not generated')}\")\n",
    "        if result.get('Zone + Partial LLM'):\n",
    "            print(f\"   Hybrid Name: {result['Zone + Partial LLM']}\")\n",
    "            if result.get('Partial LLM Explanation(s)'):\n",
    "                print(f\"   LLM Explanation: {result['Partial LLM Explanation(s)'][:100]}...\")\n",
    "        else:\n",
    "            print(f\"   Hybrid Name: No unknown substituents found\")\n",
    "else:\n",
    "    print(\"⚠️  Could not test Hybrid method. Check your configuration.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method Comparison\n",
    "\n",
    "The full LLM and zone-based methods can be compared. The hybrid approach will only run when:\n",
    "1. the zone based method is implemented correctly\n",
    "2. there are unknown substituents that cannot be identified by the zone-based namer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare all methods on a single molecule\n",
    "def compare_all_methods(smiles):\n",
    "    \"\"\"Run all three methods on a single SMILES string.\"\"\"\n",
    "    payload = {\n",
    "        \"smiles_strings\": [smiles],\n",
    "        \"full_llm_flag\": True,\n",
    "        \"explanation_flag\": True,\n",
    "        \"prompt_flag\": False,  # Skip prompts for cleaner output\n",
    "        \"zone_flag\": True,\n",
    "        \"hybrid_flag\": True\n",
    "    }\n",
    "    \n",
    "    try:\n",
    "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_LLM)\n",
    "        response.raise_for_status()\n",
    "        return response.json()[0]  # Get first (and only) result\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"❌ API Error: {e}\")\n",
    "        return None\n",
    "\n",
    "# Test with tryptophan derivative\n",
    "test_smiles = \"N[C@@H](CC1=CNC2=CC=CC=C12)C(=O)O\"\n",
    "print(f\"🔬 Comparing all methods for: {test_smiles}\\n\")\n",
    "print(\"This is tryptophan - let's see how each method handles it...\\n\")\n",
    "\n",
    "comparison_result = compare_all_methods(test_smiles)\n",
    "\n",
    "if comparison_result:\n",
    "    print(\"📊 Method Comparison Results:\")\n",
    "    print(\"=\" * 50)\n",
    "    \n",
    "    methods = [\n",
    "        (\"Zone-Based\", \"Zone\"),\n",
    "        (\"Full LLM\", \"Full LLM\"),\n",
    "        (\"Hybrid\", \"Zone + Partial LLM\")\n",
    "    ]\n",
    "    \n",
    "    for method_name, result_key in methods:\n",
    "        result_value = comparison_result.get(result_key, \"Not generated\")\n",
    "        print(f\"\\n{method_name:12}: {result_value}\")\n",
    "        \n",
    "        if method_name == \"Full LLM\" and comparison_result.get(\"Full LLM Explanation\"):\n",
    "            print(f\"{'':12}  Explanation: {comparison_result['Full LLM Explanation'][:80]}...\")\n",
    "else:\n",
    "    print(\"⚠️  Could not run comparison. Please check your API configuration.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Example: Complex Peptide Monomers\n",
    "\n",
    "Let's test with more complex structures that might challenge the different methods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# More challenging examples\n",
    "complex_smiles = [\n",
    "    \"CC(C)(C)OC(=O)N[C@@H](CC1=CC=CC=C1)C(=O)O\",  # Boc-protected phenylalanine\n",
    "    \"N[C@@H](CCCNC(=N)N)C(=O)O\",  # Arginine\n",
    "    \"N[C@@H](CC(=O)N)C(=O)O\"  # Asparagine\n",
    "]\n",
    "\n",
    "print(\"🧬 Testing Complex Molecules:\")\n",
    "print(\"These include protected amino acids and polar residues...\\n\")\n",
    "\n",
    "for i, smiles in enumerate(complex_smiles, 1):\n",
    "    print(f\"{i}. {smiles}\")\n",
    "    \n",
    "    # Quick zone-based test (fastest method)\n",
    "    zone_result = test_zone_method([smiles])\n",
    "    if zone_result:\n",
    "        print(f\"   Zone Result: {zone_result[0].get('Zone', 'Error')}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Troubleshooting Guide\n",
    "\n",
    "If you encounter issues:\n",
    "\n",
    "### API Connection Errors:\n",
    "- ✅ Check that the API server is running: `uvicorn main:api --env-file .env`\n",
    "- ✅ Verify the API_BASE_URL points to the correct host/port\n",
    "- ✅ Ensure no firewall is blocking the connection\n",
    "\n",
    "### Configuration Errors:\n",
    "- ✅ Verify all required environment variables are set in `.env`\n",
    "- ✅ Check that database files exist and are accessible\n",
    "- ✅ Validate GPT API key (OPENAI_API_KEY) and endpoint (OPENAI_API_ROOT) configuration\n",
    "\n",
    "### Method-Specific Issues:\n",
    "- **Zone-based failures**: Check ZONE_URL and zone service availability\n",
    "- **LLM failures**: Verify GPT API credentials and model access\n",
    "- **Database errors**: Ensure monomer database and substituent dictionary files are available\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more information, see HELMify: A Hybrid Rule- and LLM-Based Generator of Peptide Monomer HELM Names"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "helmify",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}