Cuckoo Filter + Abstract vs Baseline 效果分析

问题

为什么 Cuckoo Filter + Abstract 比单独 Cuckoo Filter 效果好，但却和 Baseline（不加 Cuckoo Filter 的 main2）效果差不多？

代码逻辑分析

1. Baseline (search_method=0, main2)

# rag_base/rag_complete.py:238-241
if search_method == 0 or forest is None:
    # 对于baseline RAG，直接使用向量数据库检索的结果，只取top k
    results = results[:k]

特点：

✅ 只使用向量数据库检索
✅ 直接获取top-k最相似的chunks
✅ 简单直接，没有额外的过滤
❌ 不使用entity tree搜索

最终prompt：

Information:
[source_knowledge from vector DB top-k]
Question: [query]

2. Abstract RAG (search_method=2, 使用Bloom Filter)

# rag_base/rag_complete.py:243-262
else:
    # Abstract RAG: Enrich results with summary_embeddings
    results = enrich_results_with_summary_embeddings(...)
    
    # Dual-similarity filtering: query–chunk AND query–summary
    results = filter_contexts_by_dual_threshold(
        results,
        input_embedding,
        threshold_chunk=0.6,
        threshold_summary=0.6,
    )

特点：

✅ 向量数据库检索 + 双重相似度过滤（abstract级别）
✅ Entity tree搜索（通过Bloom Filter）
✅ 两个知识来源：source_knowledge + tree_knowledge

最终prompt：

Information:
[source_knowledge from filtered vector DB]
Relations:
[tree_knowledge from entity tree]
Question: [query]

3. Cuckoo Filter + Abstract (search_method=7)

# rag_base/rag_complete.py:303
# search_method == 7 (cuckoo filter) removed - now uses vector database only (search_method == 0)

⚠️ 注意： 从代码看，search_method==7的处理已被注释掉。但如果有旧版本或特殊配置，cuckoo filter的entity tree搜索逻辑是：

# entity/ruler.py:107-123
def search_entity_info_cuckoofilter(nlp, search):
    search_context = []
    doc = nlp(search)
    for ent in doc.ents:
        if ent.label_ == 'EXTRA':  # 只有EXTRA标签的实体才会被搜索
            entity_number += 1
            find_ = hash.cuckoo_extract(ent.text)
            if find_ is not None:
                search_context += list(find_.split("。"))
    return search_context

为什么效果差不多？

原因1: Entity Tree搜索在AESLC数据集上贡献有限

问题：

AESLC是邮件摘要任务
Query格式："Summarize the following email: [email content]"
Entity识别主要从query中提取实体（EXTRA标签）
但邮件摘要的query可能不包含很多明确的实体

证据：

如果query entity number = 0，说明没有识别到实体
那么tree_knowledge为空，最终效果和baseline相同

验证方法：

# 检查benchmark运行日志中的"query entity number"
# 如果大部分query的entity number都是0，说明entity tree搜索没有生效

原因2: 双重相似度过滤可能过度筛选

问题：

Abstract RAG使用了双重阈值过滤：
- sim(query, chunk) >= 0.6
- sim(query, summary/abstract) >= 0.6
如果阈值设置不当，可能过滤掉有用的chunks
虽然有回退逻辑（过滤后为空时使用过滤前结果），但仍可能影响最终结果

代码位置：

# rag_base/rag_complete.py:249-259
results = filter_contexts_by_dual_threshold(
    results,
    input_embedding,
    threshold_chunk=0.6,
    threshold_summary=0.6,
)
# 如果过滤后结果为空，回退到未过滤的结果
if len(results) == 0:
    results = results_before_filter[:k]

原因3: 知识融合方式简单

问题：

source_knowledge和tree_knowledge是简单字符串拼接
如果tree_knowledge质量不高（噪声、不相关），可能稀释source_knowledge的效果
对于摘要任务，向量DB的语义相似度搜索可能已经足够，额外的entity tree信息帮助有限

代码位置：

# rag_base/rag_complete.py:329
augmented_prompt = (
    f"Answer the question using the provided information. Be concise and direct.\n\n"
    f"Information:\n{source_knowledge}\n\n"
    f"Relations:\n{tree_knowledge}\n\n"  # 简单拼接
    f"Question: \n{query}"
)

原因4: Cuckoo Filter在邮件摘要任务中的局限性

问题：

Cuckoo Filter主要用于快速查找已知实体是否存在于entity tree中
对于邮件摘要任务：
- Query中的实体可能不在entity tree中（邮件内容多样）
- 即使找到了实体，对应的context可能不是最相关的（entity tree是基于文档构建的，不是基于query的）
因此cuckoo filter的优势（快速查找）可能无法转化为质量提升

为什么Cuckoo Filter + Abstract > 单独Cuckoo Filter？

如果单独使用Cuckoo Filter（没有Abstract的双重过滤机制）：

只能依赖entity tree搜索
没有向量DB的语义相似度搜索作为补充
如果entity识别失败，完全无法检索

而Cuckoo Filter + Abstract：

✅ 有向量DB检索作为基础（即使entity识别失败，仍能检索）
✅ 有双重相似度过滤提高质量
✅ Entity tree作为补充信息源

建议

1. 验证Entity识别效果

# 检查benchmark日志中的"query entity number"
# 如果大部分query的entity number都是0，说明需要改进entity识别

2. 分析tree_knowledge的质量

在debug模式下，输出tree_knowledge的内容
检查这些内容是否真的对摘要有帮助
如果大部分为空或不相关，说明entity tree搜索在该任务上效果有限

3. 调整双重过滤阈值

尝试不同的threshold_chunk和threshold_summary值
如果阈值太严格，可能导致过度筛选
可以考虑动态阈值或自适应阈值

4. 改进Entity识别策略

对于摘要任务，可能需要：

从email content中提取实体，而不仅仅是query
使用更宽松的entity匹配策略
或者针对摘要任务设计专门的entity识别规则

5. 考虑任务特性

对于结构化知识问答（如MedQA），entity tree搜索可能更有用
对于文档摘要（如AESLC），向量DB的语义搜索可能已经足够
Cuckoo Filter的优势在大规模、结构化的知识库中更明显

结论

Cuckoo Filter + Abstract 和 Baseline 效果差不多的主要原因是：

Entity tree搜索在邮件摘要任务上贡献有限（实体识别效果不佳）
向量DB的语义搜索已经足够好（对于摘要任务）
知识融合方式简单（简单拼接可能引入噪声）

这说明：

✅ Abstract机制（双重相似度过滤）本身是有帮助的
⚠️ 但Entity tree搜索在特定任务（如邮件摘要）上可能不是必需的
💡 对于不同的任务，可能需要不同的检索策略组合

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cuckoo Filter + Abstract vs Baseline 效果分析

问题

代码逻辑分析

1. Baseline (search_method=0, main2)

2. Abstract RAG (search_method=2, 使用Bloom Filter)

3. Cuckoo Filter + Abstract (search_method=7)

为什么效果差不多？

原因1: Entity Tree搜索在AESLC数据集上贡献有限

原因2: 双重相似度过滤可能过度筛选

原因3: 知识融合方式简单

原因4: Cuckoo Filter在邮件摘要任务中的局限性

为什么Cuckoo Filter + Abstract > 单独Cuckoo Filter？

建议

1. 验证Entity识别效果

2. 分析tree_knowledge的质量

3. 调整双重过滤阈值

4. 改进Entity识别策略

5. 考虑任务特性

结论

FilesExpand file tree

CUCKOO_FILTER_ANALYSIS.md

Latest commit

History

CUCKOO_FILTER_ANALYSIS.md

File metadata and controls

Cuckoo Filter + Abstract vs Baseline 效果分析

问题

代码逻辑分析

1. Baseline (search_method=0, main2)

2. Abstract RAG (search_method=2, 使用Bloom Filter)

3. Cuckoo Filter + Abstract (search_method=7)

为什么效果差不多？

原因1: Entity Tree搜索在AESLC数据集上贡献有限

原因2: 双重相似度过滤可能过度筛选

原因3: 知识融合方式简单

原因4: Cuckoo Filter在邮件摘要任务中的局限性

为什么Cuckoo Filter + Abstract > 单独Cuckoo Filter？

建议

1. 验证Entity识别效果

2. 分析tree_knowledge的质量

3. 调整双重过滤阈值

4. 改进Entity识别策略

5. 考虑任务特性

结论