fix: naming for fuzz target discovery (#24)

joyguoguo · web-flow · commit 1f2f5519d50d · 2025-08-25T14:55:20.000-07:00
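Changes the naming scheme for instrumented fuzz targets: instead of writing a separate _print1.py copy, the script now inserts the print statement into the original file and keeps the unmodified source as a .bak.py backup. Also changes how fuzz targets are discovered and how the function that receives the print statement is located.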
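
A minimal sketch of the naming scheme described above, assuming POSIX-style paths. The helper names `backup_path_for` and `template_source_for` are hypothetical and used only for illustration; in the diff the same logic lives inline in `process_py_file` and `generate_test_template`.

```python
from pathlib import Path


def backup_path_for(py_file: str) -> str:
    """Return the .bak.py sibling that holds the unmodified source (hypothetical helper)."""
    path = Path(py_file)
    # fuzz_parser.py -> fuzz_parser.bak.py
    return str(path.with_name(path.stem + ".bak.py"))


def template_source_for(target_name: str, repo_path: str) -> str:
    """Resolve the unmodified source used for test templates (hypothetical helper)."""
    src = str(Path(repo_path) / target_name)
    # Mirrors the suffix check in generate_test_template: append .bak.py only if missing.
    if not src.endswith(".bak.py"):
        src += ".bak.py"
    return src


print(backup_path_for("projects/demo/fuzz_parser.py"))
print(template_source_for("fuzz_parser", "projects/demo"))
```

Keeping the instrumented file under its original name means the OSS-Fuzz build script needs no changes, while the `.bak.py` copy stays available as the clean source for test generation.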

* Upload the python project Fuzz test script

valid_projects.txt: Python project list
script_fuzz_py_final.sh: Single project test script
script_fuzz_py_batch_final.sh: Batch projects test script

* feat: Add OSS-Fuzz submodule tracking main branch

* chore: Switch oss-fuzz submodule to personal fork

* Switch oss-fuzz submodule to personal fork

* move the valid_project file

* move the .py file

* create build_oss_fuzz.py

* create run_fuzz_target.py

* split the pool.py into build_oss_fuzz and run_fuzz_target

* delete the .sh files

* translate to english

* fuzz_runner_pool.py:74

* edit stdout

* add null check

* modify stdout, delete pool.py

* indentation level check

* Remove build log write files

* Remove build log write files

* use logging module

* use precise logging

* use logging

* use precise exception log info

* correct type problems

* correct some mistakes

* correct some mistakes

* correct

* modify discover fuzz target

* modify the oss-fuzz dir

* Redirect the output to an empty device without retaining any output

* add always yes

* split the build script

* split the build script

* build scripts test successfully

* build.py

* collect targets first and then run

* list, tuple, optional

* list,tuple,optional

* translate

* build_fuzz.py, run_fuzz_all_target.py

* correct

* original

* record input

* Fatal error in main program: cannot unpack non-iterable NoneType object

* name 'target_functions' is not defined
fuzz_util_instrumented.py does not seem to exist

* prepare for a major rewrite

* create modify file script: add "print(data)" to each fuzz_.py

* build_fuzzer script

* modify tuple dict list

* remove stdout stderr in build fuzz

* test successfully

* rename run fuzz ds to run fuzz print1

* add print(data) to fuzz target and rename the file with "_print1"

* oss -fuzz change

* rename the print1.py

* modify the exegesis

* modify

* modify log name

* type error

* list dict tuple

* type error

* construct errors module

* run_command module

* combine the run_command instrument to one file

* remove the  run_command

* modify

* mytype check

* mytype

* mytype

* mytype

* translate

* remove run command

* timeout - shell instrument

* correct  in out error and return Popen directly

* ready to change from rust script

* modify build_image

* y/n

* correct repo_id and repo_name in main

* test build_image build log

* add build_fuzzer

* fuzz and testgen

* correct run_one_target

* fuzz ok

* transform

* testgen  need to ^ help: add `;` here

* test successful

* example output project

* type error

* English ver

* delete previous scripts

* python template

* python template

* correct the template

* ver2 wrong template

* ok

* testgen: copy the original file, then add input_data = b""

* only read b' ' inputs

* remove transform

* clean the inputs and testgen

* set max_file

* max input file

* input b""

* change the file-writing method to use PIPE

* use max total time; remove size monitor

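* fix the parallelism error; keep writing directly to files; control the time limit with max total time

* add more log output

* template generation succeeds

* testgen complete

* remove redundancy, modify code

* switch back to the version without the redundancy removal

* template: insert data=b""
change the function header to test_()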

* translation

* A complete script for building the processes of build_image, build_fuzzer, fuzz, transform, and testgen, suitable for Python projects.

* delete some imports

* use AST for transform and testgen

* use AST

* Set up command line arguments

* use fire

* use FIre

* black formatter

* process the data after closing the file

* when doing line-matching, check for # This is a test template in the line

* when doing line-matching, check for # This is a test template in the line

* delete UnicodeDecodeError

* apply transformations to the original, unmodified fuzz targets.

* put all AST related class/module/function in another file and import from there.

* put all AST related class/module/function in another file and import from there.

* translation

* use relative address

* use relative address

* remove the class outside of the function

* add tuple's type

* Properly handle indentation and process data after the file is closed.

* correct the relative path

* add black to requirements.txt

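* change the naming scheme used when adding print(); match the function by identifying the second argument of atheris.Setup(), which is not yet accurate

* This script reads the project list in data/valid_projects.txt
For each project, it looks for the corresponding project folder under fuzz/oss-fuzz/projects
Deletes all files ending in _print1.py and .bak.py
Processes the remaining .py files:
Creates a backup file (.bak.py)
Finds the atheris.Setup() statement and extracts the second argument (the function signature)
Adds a print(data) statement to that function
Keeps the original file name unchanged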

* correct the source file name as .bak.py

* fuzz target discovery now excludes a list of common tools

* add the list of qualifying Python projects (has a Dockerfile, .yaml language = python, build.sh, and .py files)

* black format
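
The discovery change listed above (dropping the `fuzz_` prefix and `print1` suffix checks in favor of an exclusion list) can be exercised in isolation. The sketch below is a simplified stand-in, not the code in `fuzz/collect_fuzz_python.py`: it takes the output directory directly, uses an abbreviated exclusion set, and assumes a POSIX file system where the executable bit is meaningful. The `_make` helper and the file names are hypothetical.

```python
import os
import stat
import tempfile
from pathlib import Path

# Abbreviated, illustrative subset of the exclusion set added in the diff.
NON_TARGET_TOOLS = {"llvm-symbolizer", "clang", "ld"}


def discover_targets(out_dir: Path) -> list[str]:
    """Return names of extensionless executables that are not known tools."""
    targets = []
    for f in sorted(out_dir.iterdir()):
        if (
            f.is_file()
            and "." not in f.name
            and os.access(f, os.X_OK)
            and f.name not in NON_TARGET_TOOLS
        ):
            targets.append(f.name)
    return targets


def _make(path: Path, executable: bool) -> None:
    """Create a small file and set or clear its owner-executable bit."""
    path.write_text("#!/bin/sh\n")
    mode = stat.S_IRUSR | stat.S_IWUSR
    if executable:
        mode |= stat.S_IXUSR
    path.chmod(mode)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    _make(out / "fuzz_parser", True)           # candidate fuzz target
    _make(out / "llvm-symbolizer", True)       # known tool, excluded by name
    _make(out / "fuzz_parser.options", False)  # excluded: name contains a dot
    _make(out / "README", False)               # excluded: not executable
    found = discover_targets(out)

print(found)
```

An exclusion list accepts any executable that is not named in it, so helper binaries missing from the set would be reported as targets. That trade-off is worth keeping in mind when a project ships extra tools in its build output.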
diff --git a/data/valid_projects.txt b/data/valid_projects.txt
@@ -1,6 +1,7 @@
 abseil-py
 adal
 aiohttp
+airflow
 aniso8601
 ansible
 argcomplete
@@ -15,10 +16,13 @@ autopep8
 azure-sdk-for-python
 babel
 black
+bleach
 botocore
 bottleneck
+bs4
 bz2file
 cachetools
+cbor2
 cffi
 chardet
 charset_normalizer
@@ -70,6 +74,7 @@ g-cloud-logging-py
 gcp-python-cloud-storage
 genshi
 gitdb
+github_scarecrow
 glom
 gprof2dot
 g-py-bigquery
@@ -98,8 +103,8 @@ jinja2
 jmespathpy
 joblib
 jsmin
-jupyter-nbconvert
 jupyter_server
+jupyter-nbconvert
 kafka
 keras
 kiwisolver
@@ -123,6 +128,7 @@ nbclassic
 nbformat
 netaddr-py
 networkx
+nfstream
 ntlm2
 ntlm-auth
 numexpr
@@ -142,6 +148,7 @@ pandas
 paramiko
 parse
 parsimonious
+parso
 pasta
 pathlib2
 pdoc
@@ -200,6 +207,7 @@ retry
 rfc3967
 rich
 sacremoses
+scapy
 scikit-learn
 scipy
 setuptools
@@ -208,6 +216,7 @@ simplejson
 six
 smart_open
 soupsieve
+sqlalchemy
 sqlalchemy_jsonfield
 sqlalchemy-utils
 sqlparse
@@ -220,6 +229,7 @@ toolbelt
 toolz
 tqdm
 typing_extensions
+ujson
 underscore
 uritemplate
 urlextract
@@ -230,5 +240,6 @@ websocket-client
 wheel
 wtforms
 xlrd
+xmltodict
 yarl
 zipp
diff --git a/fuzz/ast_utils.py b/fuzz/ast_utils.py
@@ -146,8 +146,8 @@ def generate_test_template(target_name: str, repo_path: str):
     """
     src_file = os.path.join(repo_path, target_name)
     logging.info(f"Generating test template for {src_file}")
-    if not src_file.endswith(".py"):
-        src_file += ".py"
+    if not src_file.endswith(".bak.py"):
+        src_file += ".bak.py"
     if not os.path.exists(src_file):
         logging.error(f"Source target file not found: {src_file}")
         return None
diff --git a/fuzz/collect_fuzz_python.py b/fuzz/collect_fuzz_python.py
@@ -102,14 +102,38 @@ def discover_targets(project_name: str, oss_fuzz_dir: Path) -> list[str]:
         logging.warning(f"Build output directory for {project_name} does not exist")
         return targets
 
+    # Common tools that are not fuzz targets
+    non_target_tools = {
+        "llvm-symbolizer",
+        "asan_symbolize",
+        "msan_symbolize",
+        "tsan_symbolize",
+        "ubsan_symbolize",
+        "clang",
+        "clang++",
+        "llvm-ar",
+        "llvm-nm",
+        "llvm-objcopy",
+        "llvm-objdump",
+        "llvm-ranlib",
+        "llvm-readelf",
+        "llvm-readobj",
+        "llvm-size",
+        "llvm-strings",
+        "llvm-strip",
+        "ld",
+        "ld.lld",
+        "lld",
+        "lld-link",
+    }
+
     try:
         for f in out_dir.iterdir():
             if (
                 f.is_file()
-                and f.name.startswith("fuzz_")
                 and "." not in f.name
-                and f.name.endswith("print1")
                 and os.access(f, os.X_OK)
+                and f.name not in non_target_tools  # exclude known tools
             ):
                 targets.append(f.name)
         logging.info(
@@ -222,7 +246,7 @@ def _transform_repo(repo: str):
 
 def substitute_one_repo(
     repo: str,
-    targets: list[tuple[str,str]],  # Each element is (transformed_target, raw_target)
+    targets: list[tuple[str, str]],  # Each element is (transformed_target, raw_target)
     n_fuzz: int,
     strategy: str,
     max_len: int,
diff --git a/fuzz/modify_fuzz_files.py b/fuzz/modify_fuzz_files.py
@@ -1,72 +1,137 @@
 #!/usr/bin/env python3
 import os
+import re
+import shutil
 import ast
-import fire
+import astunparse
 
 
-class InsertPrintTransformer(ast.NodeTransformer):
-    def visit_FunctionDef(self, node):
-        if node.name in ("TestOneInput", "TestInput") and node.args.args:
-            first_arg_name = node.args.args[0].arg
-            print_stmt = ast.Expr(
-                value=ast.Call(
-                    func=ast.Name(id='print', ctx=ast.Load()),
-                    args=[ast.Name(id=first_arg_name, ctx=ast.Load())],
-                    keywords=[]
-                )
-            )
-            # add an empty-body check
-            if not node.body:
-                node.body.append(print_stmt)
-            else:
-                # strengthen the duplicate-check logic
-                first_stmt = node.body[0]
-                if not (isinstance(first_stmt, ast.Expr) 
-                        and isinstance(first_stmt.value, ast.Call)
-                        and hasattr(first_stmt.value.func, 'id')
-                        and first_stmt.value.func.id == 'print'):
-                    node.body.insert(0, print_stmt)
-        return node
-
-def add_print_to_testoneinput(file_path):
-    with open(file_path, 'r') as f:
-        content = f.read()
-
-    tree = ast.parse(content)
-    transformer = InsertPrintTransformer()
-    new_tree = transformer.visit(tree)
-    ast.fix_missing_locations(new_tree)
-
-    import astor
-    new_content = astor.to_source(new_tree)
-    return new_content
+def process_projects():
+    # Read valid projects from data/valid_projects.txt
+    with open("data/valid_projects.txt", "r") as f:
+        projects = [line.strip() for line in f.readlines() if line.strip()]
 
-def main(
-    projects_path="fuzz/oss-fuzz/projects",
-    valid_projects_file="data/valid_projects.txt"
-):
-    """Add print statements to fuzz targets."""
-    with open(valid_projects_file, 'r') as f:
-        projects = [line.strip() for line in f if line.strip()]
+    # Base directory containing projects
+    base_dir = "fuzz/oss-fuzz/projects"
 
     for project in projects:
-        project_dir = os.path.join(projects_path, project)
-        if not os.path.isdir(project_dir):
+        project_dir = os.path.join(base_dir, project)
+
+        if not os.path.exists(project_dir):
+            print(f"Project directory not found: {project_dir}")
             continue
 
-        for root, _, files in os.walk(project_dir):
+        print(f"Processing project: {project}")
+
+        # Remove _print1.py and .bak.py files
+        for root, dirs, files in os.walk(project_dir):
             for file in files:
-                if file.startswith('fuzz_') and file.endswith('.py'):
+                if file.endswith("_print1.py") or file.endswith(".bak.py"):
                     file_path = os.path.join(root, file)
-                    try:
-                        new_content = add_print_to_testoneinput(file_path)
-                        new_file_path = file_path.rsplit('.', 1)[0] + '_print1.py'
-                        with open(new_file_path, 'w') as f:
-                            f.write(new_content)
-                        print(f"Processed: {file_path} -> {new_file_path}")
+                    os.remove(file_path)
+                    print(f"Removed: {file_path}")
+
+        # Find all remaining .py files
+        py_files = []
+        for root, dirs, files in os.walk(project_dir):
+            for file in files:
+                if file.endswith(".py"):
+                    py_files.append(os.path.join(root, file))
+
+        # Process each .py file
+        for py_file in py_files:
+            process_py_file(py_file)
+
+
+class FunctionVisitor(ast.NodeVisitor):
+    def __init__(self, target_func):
+        self.target_func = target_func
+        self.found_node = None
+        self.first_param = None
+
+    def visit_FunctionDef(self, node):
+        if node.name == self.target_func:
+            self.found_node = node
+            if node.args.args:
+                self.first_param = node.args.args[0].arg
+        self.generic_visit(node)
+
+
+def process_py_file(file_path):
+    print(f"Processing file: {file_path}")
+
+    # Create backup with .bak.py suffix
+    base_name = os.path.splitext(file_path)[0]  # Remove .py extension
+    backup_path = base_name + ".bak.py"
+    shutil.copy2(file_path, backup_path)
+    print(f"Created backup: {backup_path}")
+
+    # Read file content
+    with open(file_path, "r") as f:
+        content = f.read()
+
+    # Find atheris.Setup() call and extract function signature (only the second parameter)
+    # Improved regex that matches only the second argument
+    setup_pattern = r"atheris\.Setup\([^,]*,\s*([^,)]*)"
+    match = re.search(setup_pattern, content)
+
+    if not match:
+        print(f"No atheris.Setup() found in {file_path}")
+        return
+
+    function_signature = match.group(1).strip()
+    print(f"Found function signature: {function_signature}")
+
+    # Parse AST to find target function
+    try:
+        tree = ast.parse(content)
+    except SyntaxError as e:
+        print(f"Syntax error in {file_path}: {e}")
+        return
+
+    visitor = FunctionVisitor(function_signature)
+    visitor.visit(tree)
+
+    if not visitor.found_node:
+        print(f"Function {function_signature} not found in {file_path}")
+        return
+
+    if not visitor.first_param:
+        print(f"No parameters found in function {function_signature}")
+        return
+
+    # Create print statement node
+    print_stmt = ast.Expr(
+        value=ast.Call(
+            func=ast.Name(id="print", ctx=ast.Load()),
+            args=[ast.Name(id=visitor.first_param, ctx=ast.Load())],
+            keywords=[],
+        )
+    )
+
+    # Insert print statement at the beginning of function body
+    if visitor.found_node.body:
+        # Preserve docstring if present
+        first_item = visitor.found_node.body[0]
+        if isinstance(first_item, ast.Expr) and isinstance(first_item.value, ast.Constant) and isinstance(first_item.value.value, str):
+            # Insert after docstring
+            visitor.found_node.body.insert(1, print_stmt)
+        else:
+            # Insert at the very beginning
+            visitor.found_node.body.insert(0, print_stmt)
+    else:
+        visitor.found_node.body = [print_stmt]
+
+    # Generate modified source code
+    modified_content = astunparse.unparse(tree)
+
+    # Write modified content back to file
+    with open(file_path, "w") as f:
+        f.write(modified_content)
+
+    print(f"Added print({visitor.first_param}) to function {function_signature}")
 
-                    except Exception as e:
-                        print(f"Error processing {file_path}: {str(e)}")
 
 if __name__ == "__main__":
-    fire.Fire(main)
+    process_projects()
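
The commit notes above say that matching the second argument of `atheris.Setup()` is not yet accurate. One possible alternative, sketched here as an assumption and not as the approach this PR takes, is to read the argument from the AST instead of with a regex. The function name `find_setup_callback` and the `SAMPLE` source are hypothetical; the regex is the one from `process_py_file`.

```python
import ast
import re
from typing import Optional


def find_setup_callback(source: str) -> Optional[str]:
    """Return the name passed as the second positional argument to atheris.Setup()."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_setup = (
            isinstance(func, ast.Attribute)
            and func.attr == "Setup"
            and isinstance(func.value, ast.Name)
            and func.value.id == "atheris"
        )
        # Only a bare name is accepted; lambdas or attribute lookups are skipped.
        if is_setup and len(node.args) >= 2 and isinstance(node.args[1], ast.Name):
            return node.args[1].id
    return None


# Hypothetical fuzz target whose first argument itself contains a comma.
SAMPLE = '''
import sys
import atheris

def TestOneInput(data):
    return len(data)

def main():
    atheris.Setup(sys.argv + ["-runs=10", "-max_len=64"], TestOneInput)
    atheris.Fuzz()
'''

# The regex from process_py_file stops at the first comma inside the list.
regex_guess = re.search(r"atheris\.Setup\([^,]*,\s*([^,)]*)", SAMPLE).group(1).strip()

print("regex:", regex_guess)
print("ast:  ", find_setup_callback(SAMPLE))
```

This sketch only handles the `atheris.Setup(...)` spelling with a positional callback. An aliased import or a callback passed by keyword would need extra cases.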
+