[Python Ecosystem] img2table: A Table Recognition and Extraction Library

By 进击的码农设计师

January 5, 2025


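One reason for Python's popularity is its rich ecosystem: by one incomplete count, there are more than 130,000 third-party libraries for Python so far.

This series collects and shares interesting and useful third-party libraries.

The companion code for this article is available in two ways:

From Baidu Netdisk:

Link: https://pan.baidu.com/s/1FSGLd7aI_UQlCQuovVHc_Q?pwd=mnsj Extraction code: mnsj

From GitHub:

https://github.com/returu/Python_Ecosystem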

01

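Introduction:

img2table is a simple, easy-to-use Python third-party library for table recognition and extraction, based on OpenCV image processing. It supports the most common image file formats as well as PDF files.

All image processing in img2table is done with OpenCV (via the opencv-python package). It uses the Hough transform (cv2.HoughLinesP) to detect the horizontal and vertical lines in an image, and from those lines identifies the tables.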

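Identifies tables in images (JPEG, PNG, WebP, and so on) and PDF files, including cell-level bounding boxes.
Supports multiple OCR (optical character recognition) services for extracting table content, including Tesseract, PaddleOCR, EasyOCR, and Surya OCR, as well as cloud services such as Google Vision OCR and Azure OCR.
Handles complex table structures such as merged cells.
Provides a method for correcting image skew and rotation.
Returns extracted tables as simple objects, including a Pandas DataFrame representation.
Offers an option to export extracted tables to an Excel file while preserving their original structure.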
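To make the line-detection step from the introduction more concrete, here is a minimal sketch of finding candidate table rulings with cv2.HoughLinesP. It is not img2table's actual implementation: the threshold values and the helper names split_axis_aligned and detect_table_lines are illustrative assumptions.

```python
import math


def split_axis_aligned(segments, max_angle_deg=2.0):
    """Split (x1, y1, x2, y2) segments into near-horizontal and near-vertical lists."""
    horizontal, vertical = [], []
    for x1, y1, x2, y2 in segments:
        # Orientation of the segment folded into the range [0, 180)
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1))) % 180
        if angle <= max_angle_deg or angle >= 180 - max_angle_deg:
            horizontal.append((x1, y1, x2, y2))
        elif abs(angle - 90) <= max_angle_deg:
            vertical.append((x1, y1, x2, y2))
    return horizontal, vertical


def detect_table_lines(image_path):
    """Return (horizontal, vertical) line segments found in the image."""
    # Imported lazily so split_axis_aligned can be used without OpenCV installed
    import cv2
    import numpy as np

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Dark rulings on a light page become white pixels on a black background
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Illustrative parameter values; real documents usually need tuning
    lines = cv2.HoughLinesP(binary, 1, np.pi / 180, 100,
                            minLineLength=50, maxLineGap=5)
    segments = [] if lines is None else [tuple(int(v) for v in line[0]) for line in lines]
    return split_axis_aligned(segments)
```

Segments that are neither close to horizontal nor close to vertical are dropped, since they are unlikely to be table borders.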

02

Installation and configuration:

Install img2table with pip:

pip install img2table

You can also install the extra that matches the OCR service you need:

# Standard installation, supports Tesseract
pip install img2table
# For use with PaddleOCR
pip install "img2table[paddle]"
# For use with EasyOCR
pip install "img2table[easyocr]"
# For use with Surya OCR
pip install "img2table[surya]"
# For use with Google Vision OCR
pip install "img2table[gcp]"
# For use with AWS Textract OCR
pip install "img2table[aws]"
# For use with Azure Cognitive Services OCR
pip install "img2table[azure]"
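Before picking an extra, it can help to check what is already available locally. The sketch below is only an illustration. The import names easyocr and paddleocr, and the idea that the Tesseract backend needs a tesseract executable on PATH, are assumptions of this example rather than statements from the article.

```python
import importlib.util
import shutil

# Assumed import names for two of the optional OCR backends
OPTIONAL_MODULES = {"EasyOCR": "easyocr", "PaddleOCR": "paddleocr"}


def img2table_installed():
    """Return True if the img2table package can be imported."""
    return importlib.util.find_spec("img2table") is not None


def available_ocr_backends():
    """Map each backend label to True/False depending on local availability."""
    # Assumption: the Tesseract backend relies on the tesseract command-line tool
    status = {"Tesseract": shutil.which("tesseract") is not None}
    for label, module in OPTIONAL_MODULES.items():
        status[label] = importlib.util.find_spec(module) is not None
    return status
```

Calling available_ocr_backends() returns a plain dict, so printing it is enough to see which extras still need to be installed.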

03

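Usage:

img2table is lightweight and easy to use. The basic steps for recognizing tables with it are as follows:

(1) Create an OCR instance:

To extract the content of a table, you need an OCR tool. img2table provides interfaces to several OCR services and tools for parsing table content.

For example, the following statements use EasyOCR, which was covered earlier in this series:

from img2table.ocr import EasyOCR
# kw is an optional dict of extra keyword arguments for the EasyOCR Reader; {"gpu": False} is just an example
ocr = EasyOCR(lang=["en"], kw={"gpu": False})

For the other OCR tools, see the official documentation.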

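(2) Instantiate the image file:

Instantiating an image:

from img2table.document import Image
image = Image(src, detect_rotation=False)

Where:

src: path to the image;
detect_rotation: whether to detect and correct skew/rotation of the image; defaults to False. Note that when it is set to True, the image coordinates and bounding boxes returned by other methods may not match the original image.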

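Instantiating a PDF file:

from img2table.document import PDF
pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)

Where:

src: path to the PDF file;
pages: list of PDF page indexes to process; if not set, all pages are processed;
detect_rotation: whether to detect and correct skew/rotation of the images extracted from the PDF;
pdf_text_extraction: whether to extract text directly from the PDF file when it is a native (text-based) PDF.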
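Because pages takes page indexes, and the example output later in this article starts at key 0, it is easy to be off by one when working from the page numbers shown in a PDF viewer. A small sketch of the conversion, assuming 0-based indexes; the helper names to_page_indices and open_pdf_pages are hypothetical:

```python
def to_page_indices(page_numbers):
    """Convert 1-based page numbers to sorted, de-duplicated 0-based indexes."""
    indices = set()
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indices.add(number - 1)
    return sorted(indices)


def open_pdf_pages(pdf_path, page_numbers):
    """Instantiate an img2table PDF document restricted to the given pages."""
    # Imported lazily so the helper above works without img2table installed
    from img2table.document import PDF

    return PDF(pdf_path,
               pages=to_page_indices(page_numbers),
               detect_rotation=False,
               pdf_text_extraction=True)
```

For instance, open_pdf_pages("report.pdf", [1, 3]) would request the same pages as pages=[0, 2] in the snippet above ("report.pdf" is a placeholder file name).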

(3) Extract tables with extract_tables():

The extract_tables() method can extract multiple tables from a PDF page or image in a single call.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)

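Where:

ocr: the OCR instance used to parse the document text;
implicit_rows: whether to identify implicit rows;
implicit_columns: whether to identify implicit columns;
borderless_tables: whether to extract borderless tables;
min_confidence: the minimum confidence level for text processed by OCR, from 0 (worst) to 99 (best).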
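One way to combine these flags is to start with the strict settings and only relax them when nothing is found. The retry strategy below is an assumption of this example, not something the library prescribes, and extract_with_fallback is a hypothetical helper. It expects an Image document, whose extract_tables returns a list.

```python
def extract_with_fallback(doc, ocr, min_confidence=50):
    """Try bordered tables first; retry with relaxed settings if none are found."""
    tables = doc.extract_tables(ocr=ocr,
                                implicit_rows=False,
                                implicit_columns=False,
                                borderless_tables=False,
                                min_confidence=min_confidence)
    if tables:
        return tables
    # Second pass: also look for implicit rows/columns and borderless tables
    return doc.extract_tables(ocr=ocr,
                              implicit_rows=True,
                              implicit_columns=True,
                              borderless_tables=True,
                              min_confidence=min_confidence)
```

With the objects from the snippet above, the call would be extract_with_fallback(doc, ocr). The second pass repeats the extraction, so it costs extra processing time on documents without bordered tables.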

The extract_tables method of the Image class returns a list of ExtractedTable objects.

output = [ExtractedTable(...), ExtractedTable(...), ...]

The extract_tables method of the PDF class returns an OrderedDict whose keys are page indexes and whose values are lists of ExtractedTable objects.

output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
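Since the PDF result is keyed by page, a small helper can flatten it for further processing. This sketch relies only on the dict-of-lists shape described above; the function names iter_pdf_tables and pages_without_tables are hypothetical.

```python
def iter_pdf_tables(tables_by_page):
    """Yield (page_index, table_index, table) for every table in a PDF result."""
    for page_index, tables in tables_by_page.items():
        for table_index, table in enumerate(tables):
            yield page_index, table_index, table


def pages_without_tables(tables_by_page):
    """Return the page indexes for which no table was extracted."""
    return [page for page, tables in tables_by_page.items() if not tables]
```

A loop such as for page, idx, table in iter_pdf_tables(output) then visits every table together with the page it came from.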

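Each ExtractedTable object exposes the following attributes:

bbox: the bounding box of the table;
title: the extracted table title;
content: a dict keyed by row index, with a list of TableCell objects as each value;
df: a Pandas DataFrame representation of the recognized table;
html: an HTML representation of the recognized table.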
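As a quick illustration of these attributes, the sketch below builds a short summary of one extracted table. It uses only title, bbox, and content, and assumes the table bbox exposes x1, y1, x2, and y2 in the same way as the cell bounding boxes used elsewhere in this article. describe_table is a hypothetical helper.

```python
def describe_table(table):
    """Summarize an ExtractedTable-like object as a plain dict."""
    n_rows = len(table.content)
    n_cols = max((len(row) for row in table.content.values()), default=0)
    box = table.bbox
    return {
        "title": table.title,
        "rows": n_rows,
        "columns": n_cols,
        "width": box.x2 - box.x1,    # table width in pixels
        "height": box.y2 - box.y1,   # table height in pixels
    }
```

For tabular work, table.df is usually the more convenient entry point, for example table.df.to_csv("table.csv", index=False) to save one table as CSV.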

To access the cell-level bounding boxes, you can use the following snippet:

for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
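Building on that loop, the following sketch turns content into a plain list of rows and flags bounding boxes that occur at more than one grid position. It assumes that a merged cell is repeated at every position it spans with the same bounding box, which should be verified against your own output. Both helper names are hypothetical.

```python
def content_to_grid(content):
    """Return the cell values as a list of rows."""
    return [[cell.value for cell in row] for row in content.values()]


def find_repeated_boxes(content):
    """Map each bounding box seen at several positions to those (row, col) positions."""
    positions = {}
    for id_row, row in enumerate(content.values()):
        for id_col, cell in enumerate(row):
            key = (cell.bbox.x1, cell.bbox.y1, cell.bbox.x2, cell.bbox.y2)
            positions.setdefault(key, []).append((id_row, id_col))
    # Boxes that appear only once belong to ordinary, unmerged cells
    return {box: cells for box, cells in positions.items() if len(cells) > 1}
```

Both functions take table.content directly, for example content_to_grid(table.content).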

(4) Export the extracted tables to an Excel file with to_xlsx():

The to_xlsx() method exports the tables extracted from a document to an xlsx file, with one worksheet per extracted table. Apart from dest, the destination of the xlsx file, its parameters are the same as those of extract_tables.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of a xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)
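Because extract_tables and to_xlsx share their extraction parameters, one option is to keep them in a single dict so the two calls cannot drift apart. This is a sketch of that idea: EXTRACTION_OPTIONS, xlsx_path_for, and extract_and_export are hypothetical names, and writing the workbook next to the source image is an assumption of this example.

```python
from pathlib import Path

# Shared settings, passed unchanged to both extract_tables and to_xlsx
EXTRACTION_OPTIONS = {
    "implicit_rows": False,
    "implicit_columns": False,
    "borderless_tables": False,
    "min_confidence": 50,
}


def xlsx_path_for(src):
    """Return the source path with its extension replaced by .xlsx."""
    return Path(src).with_suffix(".xlsx")


def extract_and_export(src, ocr):
    """Extract tables from an image and export them with identical settings."""
    # Imported lazily so the helpers above work without img2table installed
    from img2table.document import Image

    doc = Image(src)
    tables = doc.extract_tables(ocr=ocr, **EXTRACTION_OPTIONS)
    doc.to_xlsx(dest=str(xlsx_path_for(src)), ocr=ocr, **EXTRACTION_OPTIONS)
    return tables
```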

04

Example:

The test image used here is shown below:

The code is as follows:

from img2table.document import Image
from img2table.ocr import EasyOCR

# Create the OCR instance (Simplified Chinese)
ocr = EasyOCR(lang=["ch_sim"])

# Instantiate the image file
image_path = 'test.png'
image = Image(image_path, detect_rotation=False)

# Extract the tables
tables = image.extract_tables(ocr=ocr, implicit_rows=True, implicit_columns=True, borderless_tables=True)

# Export the extracted tables to an Excel file
image.to_xlsx('output.xlsx', ocr=ocr, implicit_rows=True, implicit_columns=True, borderless_tables=True)

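The resulting Excel sheet is shown below. Some tables were not recognized. To improve the OCR results, you can preprocess the image beforehand (remove noise, increase text contrast, normalize the layout, and so on); that is not covered in detail here.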

You can also draw the bounding boxes of the recognized table cells with the following code.

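import cv2
from PIL import Image as PILImage

# Draw the bounding box of every extracted cell
table_img = cv2.imread("test.png")

for table in tables:
    for row in table.content.values():
        for cell in row:
            # (0, 0, 255) is red in OpenCV's BGR channel order
            cv2.rectangle(table_img, (cell.bbox.x1, cell.bbox.y1), (cell.bbox.x2, cell.bbox.y2), (0, 0, 255), 2)

# OpenCV loads images as BGR; convert to RGB so PIL shows the original colors.
# In a notebook, this last expression displays the image.
PILImage.fromarray(cv2.cvtColor(table_img, cv2.COLOR_BGR2RGB))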

The result is shown below; the table recognition works quite well.

For more information, see the official GitHub page:

https://github.com/xavctn/img2table

This article originally appeared on the WeChat official account 码农设计师.
