{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HELMify Demo: Peptide Monomer HELM Name Generation\n", "\n", "This notebook describes and demonstrates the three naming approaches implemented in HELMify: A Hybrid Rule- and LLM-Based Generator of Peptide Monomer HELM Names.\n", "\n", "## Prerequisites\n", "\n", "Before running this notebook, ensure you have:\n", "1. Set up the conda environment: `conda env create -f environment.yaml`\n", "2. Activated the conda environment: `conda activate helmify`\n", "3. Configured the `.env` file inside the `helmify` directory with the required API keys and database paths\n", "4. Started the API server from within the `helmify` directory: `uvicorn main:api --env-file .env`" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import json\n", "from pprint import pprint\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# API endpoint (adjust if running on a different host/port)\n", "API_BASE_URL = \"http://127.0.0.1:8000\"\n", "NAMING_ENDPOINT = f\"{API_BASE_URL}/helm-api-name\"\n", "\n", "# API timeout constants, in seconds\n", "API_TIMEOUT_FAST = 60  # For zone-based operations (faster rule-based processing)\n", "API_TIMEOUT_LLM = 120  # For LLM operations (requires model inference time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Standard Amino Acid Derivatives\n", "\n", "Let's start with some modified amino acids that are commonly found in peptide research:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sample SMILES strings - modified amino acids\n", "sample_smiles = [\n", "    \"c1cc(c(cc1C[C@@H](C(=O)O)N)F)F\",  # 3,4-difluorophenylalanine\n", "    \"C=CC[C@@H](C(=O)O)N\",  # allylglycine\n", "    \"C1CN[C@@H]1C(=O)O\",  # azetidine-2-carboxylic acid\n", "]\n", "\n", "# Display our test molecules\n", "print(\"🧪 Test SMILES:\")\n", "for i, smiles in enumerate(sample_smiles, 1):\n", "    print(f\"{i}. 
{smiles}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full LLM Naming Approach\n", "\n", "This method works provided that the OpenAI/LLM model credentials are configured correctly, the OpenEye license server is set, and the monomer database has more entries than the number of nearest neighbors used for context." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test Full LLM approach\n", "def test_full_llm_method(smiles_list):\n", "    \"\"\"Test the full LLM naming method.\"\"\"\n", "    payload = {\n", "        \"smiles_strings\": smiles_list,\n", "        \"full_llm_flag\": True,\n", "        \"explanation_flag\": True,\n", "        \"prompt_flag\": True,\n", "        \"zone_flag\": False,\n", "        \"hybrid_flag\": False\n", "    }\n", "\n", "    try:\n", "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_LLM)\n", "        response.raise_for_status()\n", "        return response.json()\n", "    except requests.exceptions.RequestException as e:\n", "        print(f\"❌ API Error: {e}\")\n", "        return None\n", "\n", "print(\"🤖 Testing Full LLM Method...\")\n", "print(\"This may take 30-60 seconds as it calls the LLM for each molecule...\\n\")\n", "\n", "full_llm_results = test_full_llm_method(sample_smiles[:2])  # Test first 2 molecules\n", "\n", "if full_llm_results:\n", "    print(\"✅ Full LLM Results:\")\n", "    for i, (smiles, result) in enumerate(zip(sample_smiles[:2], full_llm_results), 1):\n", "        print(f\"\\n{i}. SMILES: {smiles}\")\n", "        print(f\"   HELM Name: {result.get('Full LLM', 'Not generated')}\")\n", "        if result.get('Full LLM Explanation'):\n", "            print(f\"   Explanation: {result['Full LLM Explanation'][:100]}...\")\n", "else:\n", "    print(\"⚠️ Could not test Full LLM method. 
Check your API configuration.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zone-Based Naming Approach\n", "\n", "This method uses structural decomposition to identify zones and builds HELM names from predefined rules. It will not work out of the box: it requires a zone-based implementation and the corresponding API (`ZONE_URL`) to function correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test Zone-based approach\n", "def test_zone_method(smiles_list):\n", "    \"\"\"Test the zone-based naming method.\"\"\"\n", "    payload = {\n", "        \"smiles_strings\": smiles_list,\n", "        \"full_llm_flag\": False,\n", "        \"explanation_flag\": False,\n", "        \"prompt_flag\": False,\n", "        \"zone_flag\": True,\n", "        \"hybrid_flag\": False\n", "    }\n", "\n", "    try:\n", "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_FAST)\n", "        response.raise_for_status()\n", "        return response.json()\n", "    except requests.exceptions.RequestException as e:\n", "        print(f\"❌ API Error: {e}\")\n", "        return None\n", "\n", "print(\"⚙️ Testing Zone-Based Method...\")\n", "zone_results = test_zone_method(sample_smiles)\n", "\n", "if zone_results:\n", "    print(\"✅ Zone-Based Results:\")\n", "    for i, (smiles, result) in enumerate(zip(sample_smiles, zone_results), 1):\n", "        print(f\"\\n{i}. SMILES: {smiles}\")\n", "        print(f\"   HELM Name: {result.get('Zone', 'Not generated')}\")\n", "else:\n", "    print(\"⚠️ Could not test Zone method. Check your zone service configuration.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hybrid Approach\n", "\n", "This method combines zone-based naming with LLM naming for unknown substituents: rule-based names are kept wherever the zone-based namer recognizes the structure, and the LLM is consulted only for substituents the rules cannot name." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test Hybrid approach\n", "def test_hybrid_method(smiles_list):\n", "    \"\"\"Test the hybrid naming method.\"\"\"\n", "    payload = {\n", "        \"smiles_strings\": smiles_list,\n", "        \"full_llm_flag\": False,\n", "        \"explanation_flag\": True,\n", "        \"prompt_flag\": True,\n", "        \"zone_flag\": True,\n", "        \"hybrid_flag\": True\n", "    }\n", "\n", "    try:\n", "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_LLM)\n", "        response.raise_for_status()\n", "        return response.json()\n", "    except requests.exceptions.RequestException as e:\n", "        print(f\"❌ API Error: {e}\")\n", "        return None\n", "\n", "print(\"🔄 Testing Hybrid Method...\")\n", "hybrid_results = test_hybrid_method(sample_smiles[:2])\n", "\n", "if hybrid_results:\n", "    print(\"✅ Hybrid Results:\")\n", "    for i, (smiles, result) in enumerate(zip(sample_smiles[:2], hybrid_results), 1):\n", "        print(f\"\\n{i}. SMILES: {smiles}\")\n", "        print(f\"   Zone Name: {result.get('Zone', 'Not generated')}\")\n", "        if result.get('Zone + Partial LLM'):\n", "            print(f\"   Hybrid Name: {result['Zone + Partial LLM']}\")\n", "            if result.get('Partial LLM Explanation(s)'):\n", "                print(f\"   LLM Explanation: {result['Partial LLM Explanation(s)'][:100]}...\")\n", "        else:\n", "            print(\"   Hybrid Name: No unknown substituents found\")\n", "else:\n", "    print(\"⚠️ Could not test Hybrid method. Check your configuration.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method Comparison\n", "\n", "The full LLM and zone-based methods can be compared directly. The hybrid approach only produces a result when:\n", "1. the zone-based method is implemented correctly\n", "2. there are unknown substituents that cannot be identified by the zone-based namer." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare all methods on a single molecule\n", "def compare_all_methods(smiles):\n", "    \"\"\"Run all three methods on a single SMILES string.\"\"\"\n", "    payload = {\n", "        \"smiles_strings\": [smiles],\n", "        \"full_llm_flag\": True,\n", "        \"explanation_flag\": True,\n", "        \"prompt_flag\": False,  # Skip prompts for cleaner output\n", "        \"zone_flag\": True,\n", "        \"hybrid_flag\": True\n", "    }\n", "\n", "    try:\n", "        response = requests.post(NAMING_ENDPOINT, json=payload, timeout=API_TIMEOUT_LLM)\n", "        response.raise_for_status()\n", "        return response.json()[0]  # Get first (and only) result\n", "    except requests.exceptions.RequestException as e:\n", "        print(f\"❌ API Error: {e}\")\n", "        return None\n", "\n", "# Test with tryptophan\n", "test_smiles = \"N[C@@H](CC1=CNC2=CC=CC=C12)C(=O)O\"\n", "print(f\"🔬 Comparing all methods for: {test_smiles}\\n\")\n", "print(\"This is tryptophan - let's see how each method handles it...\\n\")\n", "\n", "comparison_result = compare_all_methods(test_smiles)\n", "\n", "if comparison_result:\n", "    print(\"📊 Method Comparison Results:\")\n", "    print(\"=\" * 50)\n", "\n", "    # (display label, key in the API response)\n", "    methods = [\n", "        (\"Zone-Based\", \"Zone\"),\n", "        (\"Full LLM\", \"Full LLM\"),\n", "        (\"Hybrid\", \"Zone + Partial LLM\")\n", "    ]\n", "\n", "    for method_name, result_key in methods:\n", "        result_value = comparison_result.get(result_key, \"Not generated\")\n", "        print(f\"\\n{method_name:12}: {result_value}\")\n", "\n", "        if method_name == \"Full LLM\" and comparison_result.get(\"Full LLM Explanation\"):\n", "            print(f\"{'':12}  Explanation: {comparison_result['Full LLM Explanation'][:80]}...\")\n", "else:\n", "    print(\"⚠️ Could not run comparison. 
Please check your API configuration.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Example: Complex Peptide Monomers\n", "\n", "Let's test with more complex structures that might challenge the different methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# More challenging examples\n", "complex_smiles = [\n", "    \"CC(C)(C)OC(=O)N[C@@H](CC1=CC=CC=C1)C(=O)O\",  # Boc-protected phenylalanine\n", "    \"N[C@@H](CCCNC(=N)N)C(=O)O\",  # Arginine\n", "    \"N[C@@H](CC(=O)N)C(=O)O\"  # Asparagine\n", "]\n", "\n", "print(\"🧬 Testing Complex Molecules:\")\n", "print(\"These include protected amino acids and polar residues...\\n\")\n", "\n", "for i, smiles in enumerate(complex_smiles, 1):\n", "    print(f\"{i}. {smiles}\")\n", "\n", "    # Quick zone-based test (fastest method)\n", "    zone_result = test_zone_method([smiles])\n", "    if zone_result:\n", "        print(f\"   Zone Result: {zone_result[0].get('Zone', 'Error')}\")\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting Guide\n", "\n", "If you encounter issues:\n", "\n", "### API Connection Errors:\n", "- ✅ Check that the API server is running: `uvicorn main:api --env-file .env`\n", "- ✅ Verify that `API_BASE_URL` points to the correct host/port\n", "- ✅ Ensure no firewall is blocking the connection\n", "\n", "### Configuration Errors:\n", "- ✅ Verify all required environment variables are set in `.env`\n", "- ✅ Check that database files exist and are accessible\n", "- ✅ Validate the GPT API key (`OPENAI_API_KEY`) and endpoint (`OPENAI_API_ROOT`) configuration\n", "\n", "### Method-Specific Issues:\n", "- **Zone-based failures**: Check `ZONE_URL` and zone service availability\n", "- **LLM failures**: Verify GPT API credentials and model access\n", "- **Database errors**: Ensure monomer database and substituent dictionary files are available\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more 
information, see HELMify: A Hybrid Rule- and LLM-Based Generator of Peptide Monomer HELM Names" ] } ], "metadata": { "kernelspec": { "display_name": "helmify", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 4 }