🔎 Search before asking
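🐛 Bug (Problem description)
When PaddleOCR's PPStructureV3 recognizes an image whose content includes text such as:
<recv response="200" response_txn="invite" />
<pause milliseconds="5000"/>
and the result is saved with
save_to_json
save_to_markdown
part of the saved markdown file is written as HTML. That HTML has a serious problem: the recognized strings themselves are not escaped, so the document displays incorrectly.
One recognized result looks like this: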
| $\mathrm{search\_in=''msg}$ $\mathrm{it}=\mathrm{''true''}$ $\mathrm{as}\mathrm{s}\mathrm{i}\mathrm{gn}_{-}\mathrm{to}=\left.1,2\right|\geq$
|
址和端口号。 |
表示收到响应消息200后,在消息中搜索匹配正则表 达式([0-9]{1,3}\.{3}[0-9]{1,3}:[0-9]*的字符串,并存储在 变量1和2中,这个表达式的意义实际上是提取ip地 |
|
| regexp_m 布尔值。检查收到的请 atch 求(Request)是否匹配 指定包含的消息,如果 匹配则应用正则表达 式,这个方法可以用来 一次提取多个请求中的 字符串,非常方便 |
例:检查是否匹配MESSAGE or PUBLISH or SUBSCRIBE requests : 表示仅响应以 发送的消息 $\mathrm{start\_txn=''invite''}$
|
response_ txn millisecon ds variable |
| ds 为ms。当没有指定时间 时,则使用命令行参数-d 来指定。 variable 表示使用哪个呼叫变量 来决定该呼叫是否需要 暂停 distributio 表示使用GSL(GNU科学 n 计算库)决定的呼叫长 度来对呼叫进行暂停。 如果不用GSL,则可以使 用固定值或者指定一个 范围。使用高斯分布, 则有以下几种统计分布 可选: normal, exponential, gamma, lambda, lognormal, negbin, (negative binomial), pareto,和 weibull,在选择哪种分布 时,需要指定对应的参 数。 |
:暂停脚本5秒钟 :当含有呼叫变量1时暂停 |
1、不使用GSL时,可以使用以下两种方法来对呼叫进 行暂停: ➢ , 表示暂停1秒钟 ➢ ,表示暂停时间在2秒到5秒钟之间。 2、使用GSL时需要指定参数,参数的命名与Wikipedia 中关于分布的描述页面一致。举几例如下: ,提供一个平均偏差为60秒和标 准差为15秒的暂停分布值,平均差与标准差的值 为毫秒整型值,分布图形如下: $\max=5000$
|
The original document contains:
<recv response="200"response_txn="invite"/>
but this content is not escaped in the saved output, which is clearly wrong.
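To illustrate why the unescaped text disappears when the markdown is rendered, here is a minimal sketch using only Python's standard html.parser. The fragment is a hand-written stand-in for one table cell (an assumption, not copied from the attached .md file), and TextCollector and visible_text are hypothetical helper names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text an HTML consumer would treat as visible, plus the tags it saw."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <pause .../> also end up here.
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def visible_text(fragment):
    """Return (visible text, list of start tags) for an HTML fragment."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.text), parser.tags


# Unescaped: <pause ...> is parsed as a tag, so it vanishes from the visible text.
raw = '<td><pause milliseconds="5000"/>: pause 5 s</td>'
# Escaped: the same characters survive as text.
escaped = '<td>&lt;pause milliseconds="5000"/&gt;: pause 5 s</td>'

print(visible_text(raw))
print(visible_text(escaped))
```

In the unescaped case only the text after the tag is left, which matches the truncated cells in the recognized result above.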
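I tried escaping the recognized content with html.escape, but sometimes the recognized result really is a lone "<" or ">", and escaping everything obviously breaks the markdown content that PaddleOCR saves in HTML format.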
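One possible post-processing workaround, sketched under assumptions: escape everything except a whitelist of structural tags. The tag list below is a guess at what the saved markdown uses for tables and images, not something taken from PaddleOCR's source, and escape_outside_allowed_tags is a hypothetical helper name. It cannot tell apart a real `<td>` from OCR'd text that happens to read `<td>`, so a proper fix would still need to escape cell text inside the pipeline before the HTML is assembled.

```python
import html
import re

# Assumed set of structural tags in the saved markdown; adjust to the actual output.
ALLOWED_TAGS = ("html", "body", "table", "thead", "tbody", "tr", "td", "th", "div", "img", "br")

# Matches an opening, closing, or self-closing tag whose name is in ALLOWED_TAGS.
_TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^<>]*)?/?>" % "|".join(ALLOWED_TAGS),
    re.IGNORECASE,
)


def escape_outside_allowed_tags(text: str) -> str:
    """Escape &, < and > everywhere except inside whitelisted structural tags."""
    parts = []
    pos = 0
    for match in _TAG_RE.finditer(text):
        # Text between structural tags is treated as recognized content and escaped.
        parts.append(html.escape(text[pos:match.start()], quote=False))
        # The structural tag itself is copied through unchanged.
        parts.append(match.group(0))
        pos = match.end()
    parts.append(html.escape(text[pos:], quote=False))
    return "".join(parts)


cell = '<table><tr><td><recv response="200" response_txn="invite" /></td><td>a < b</td></tr></table>'
print(escape_outside_allowed_tags(cell))
```

Because html.escape also rewrites "&", running this over text that is already partly escaped would double-escape it, so it should be applied only once to the raw saved output.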
Test image:
image_000020_47a9f6d4-1b3b-42a6-8f66-f3bdc2724d44.jpg
Result saved by save_to_json:
image_000020_47a9f6d4-1b3b-42a6-8f66-f3bdc2724d44_res.json
Result saved by save_to_markdown:
image_000020_47a9f6d4-1b3b-42a6-8f66-f3bdc2724d44.md
These files are packed in the archive 2025-07-21-paddleocr-image-files-and-result.tar.gz.
Run log:
Creating model: ('PP-LCNet_x1_0_doc_ori', '/home/xxxyyy/paddlex/official_models/PP-LCNet_x1_0_doc_ori')
Creating model: ('PP-DocBlockLayout', '/home/xxxyyy/paddlex/official_models/PP-DocBlockLayout')
Creating model: ('PP-DocLayout_plus-L', '/home/xxxyyy/paddlex/official_models/PP-DocLayout_plus-L')
Creating model: ('PP-LCNet_x1_0_textline_ori', '/home/xxxyyy/paddlex/official_models/PP-LCNet_x1_0_textline_ori')
Creating model: ('PP-OCRv5_server_det', '/home/xxxyyy/paddlex/official_models/PP-OCRv5_server_det')
Creating model: ('PP-OCRv5_server_rec', '/home/xxxyyy/paddlex/official_models/PP-OCRv5_server_rec')
Creating model: ('PP-OCRv4_server_seal_det', '/home/xxxyyy/paddlex/official_models/PP-OCRv4_server_seal_det')
Creating model: ('PP-OCRv5_server_rec', '/home/xxxyyy/paddlex/official_models/PP-OCRv5_server_rec')
Creating model: ('PP-LCNet_x1_0_table_cls', '/home/xxxyyy/paddlex/official_models/PP-LCNet_x1_0_table_cls')
Creating model: ('SLANeXt_wired', '/home/xxxyyy/paddlex/official_models/SLANeXt_wired')
The model(SLANeXt_wired) is not supported to run in MKLDNN mode! Using paddle instead!
Creating model: ('SLANet_plus', '/home/xxxyyy/paddlex/official_models/SLANet_plus')
The model(SLANet_plus) is not supported to run in MKLDNN mode! Using paddle instead!
Creating model: ('RT-DETR-L_wired_table_cell_det', '/home/xxxyyy/paddlex/official_models/RT-DETR-L_wired_table_cell_det')
Creating model: ('RT-DETR-L_wireless_table_cell_det', '/home/xxxyyy/paddlex/official_models/RT-DETR-L_wireless_table_cell_det')
Creating model: ('PP-FormulaNet_plus-L', '/home/xxxyyy/paddlex/official_models/PP-FormulaNet_plus-L')
The model(PP-FormulaNet_plus-L) is not supported to run in MKLDNN mode! Using paddle instead!
Creating model: ('PP-Chart2Table', '/home/xxxyyy/paddlex/official_models/PP-Chart2Table')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading configuration file /home/xxxyyy/paddlex/official_models/PP-Chart2Table/config.json
Loading weights file /home/xxxyyy/paddlex/official_models/PP-Chart2Table/model_state.pdparams
Loaded weights file from disk, setting weights to model.
All model checkpoint weights were used when initializing PPChart2TableInference.
All the weights of PPChart2TableInference were initialized from the model checkpoint at /home/xxxyyy/paddlex/official_models/PP-Chart2Table.
If your task is similar to the task the model of the checkpoint was trained on, you can already use PPChart2TableInference for predictions without further training.
Loading configuration file /home/xxxyyy/paddlex/official_models/PP-Chart2Table/generation_config.json
Creating model: ('PP-LCNet_x1_0_doc_ori', '/home/xxxyyy/paddlex/official_models/PP-LCNet_x1_0_doc_ori')
Creating model: ('PP-LCNet_x1_0_textline_ori', '/home/xxxyyy/paddlex/official_models/PP-LCNet_x1_0_textline_ori')
Creating model: ('PP-OCRv5_server_det', '/home/xxxyyy/paddlex/official_models/PP-OCRv5_server_det')
Creating model: ('PP-OCRv5_server_rec', '/home/xxxyyy/paddlex/official_models/PP-OCRv5_server_rec')
{'input_path': '/home/xxxyyy/doc/image_000020_47a9f6d4-1b3b-42a6-8f66-f3bdc2724d44.jpg', 'page_index': None, 'doc_preprocessor_res': {'output_img': array([[[255, ..., 255],
2025-07-21-paddleocr-image-files-and-result.tar.gz
🏃♂️ Environment (Runtime environment)
ubuntu 24.04
paddleocr==3.1.0
paddlepaddle==3.1.0
paddlex==3.1.3
CPU only, no discrete GPU
🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)
ubuntu 24.04
paddleocr==3.1.0
paddlepaddle==3.1.0
paddlex==3.1.3
CPU only, no discrete GPU