「Python」Converting PPT to Markdown

2022-10-21 · Tags: PPT, Python, Tech

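Background

I have been taking some academic courses lately, and the teachers hand out all their slides as PPT files. To turn them into Markdown notes I would have to open each PPT and copy the text out by hand. While taking a shower I wondered whether a script could do the conversion in one go, so I searched for a suitable library and got started.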

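Getting started

The library I found that currently fits best is python-pptx. Relevant links:

PyPI

GitHub

Official documentation

Official documentation example

The official documentation includes an example of extracting text (see the official documentation):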

# Extract all text from slides in presentation
from pptx import Presentation

# Placeholder: replace with the path of your own presentation
path_to_presentation = "example.pptx"

# Open the presentation
prs = Presentation(path_to_presentation)

# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []

# Iterate over the slides
for slide in prs.slides:
    # Iterate over the shapes on the slide
    for shape in slide.shapes:
        # Skip shapes without a text frame
        if not shape.has_text_frame:
            continue
        # Iterate over the paragraphs in the text frame
        for paragraph in shape.text_frame.paragraphs:
            # Iterate over the runs (text fragments) of the paragraph
            for run in paragraph.runs:
                # Collect the text of each run
                text_runs.append(run.text)

# Print the result
print(text_runs)

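Approach

Create a results list. Everything extracted from the presentation is collected in results and finally written to a single .md file.

python-pptx works through a presentation slide by slide and, within each slide, shape by shape.

For each shape, the program first checks whether it is a text box. If it is, all of its text is appended to results.

If it is not a text box, the program checks whether the shape is a picture. If so, it creates a folder, saves the picture there, and appends the saved path to results in Markdown image syntax.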
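
The per-shape decision described above can be sketched as a small function. This is only an illustration: the names shape_to_markdown and image_dir are hypothetical, and the function relies on nothing but the python-pptx attributes has_text_frame, text_frame.paragraphs, paragraph.text, name, image.blob and image.ext. Spaces in the shape name are replaced in the file name, an extra assumption made here so the Markdown link stays valid.

```python
from pathlib import Path


def shape_to_markdown(shape, image_dir):
    """Return the Markdown lines for a single shape (hypothetical helper)."""
    # Text shape: keep every non-empty paragraph as one line
    if getattr(shape, 'has_text_frame', False):
        return [p.text for p in shape.text_frame.paragraphs if p.text.strip()]
    # Picture shape: only pictures expose .image
    try:
        image = shape.image
    except (AttributeError, ValueError):
        # Not a picture (table, chart, ...) or no embedded image data
        return []
    folder = Path(image_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Save the image bytes, using the shape name as the file name
    target = folder / '{}.{}'.format(shape.name.replace(' ', '_'), image.ext)
    target.write_bytes(image.blob)
    return ['![{}]({})'.format(shape.name, target.as_posix())]
```

Returning a list keeps the caller simple: it can extend results with whatever each shape produced, including nothing at all.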

Full code

Converting a single PPT file to Markdown

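# Workaround for an AttributeError on Python 3.10 (explained at the end of
# this post); it has to come before the pptx import
import collections.abc
# Uses the python-pptx package
# pip3 install python-pptx
from pptx import Presentation

# File system operations (creating folders, handling paths)
import os
# Regular expression matching
import re

# Relative path of the PPT file
filepath = "put the path of your ppt here.pptx"
# Path without the extension (drops the trailing ".pptx")
file_name = os.path.splitext(filepath)[0]

# Open the presentation
prs = Presentation(filepath)

# Resulting Markdown lines
results = []

# Iterate over the slides (numbered so image file names stay unique)
for slide_no, slide in enumerate(prs.slides, start=1):
    # Iterate over the shapes on the slide
    for shape in slide.shapes:
        # Does the shape have a text frame?
        if shape.has_text_frame:
            # Iterate over the paragraphs in the text frame
            for paragraph in shape.text_frame.paragraphs:
                # Join the runs (text fragments) of the paragraph into one line
                text = ''.join(run.text for run in paragraph.runs)
                # "第x章" (chapter x) becomes a top-level heading, e.g. "# 第一章"
                if re.search('第.+章', text):
                    text = '# ' + text
                # "一、", "二、", "三、" ... become second-level headings, e.g. "## 一、"
                elif re.search('[一二三四五六七八九十]+、', text):
                    text = '## ' + text
                # A digit followed by a dot, such as 1.1.1
                elif re.search(r'\d\.', text):
                    # The heading level follows the number of dots, e.g. "#### 1.1.1"
                    text = ('#' * (text.count('.') + 2)) + ' ' + text
                # Add the paragraph to the results
                results.append(text)
        # Otherwise check whether the shape is a picture
        else:
            try:
                image = shape.image
            except (AttributeError, ValueError):
                # Not a picture, or a picture without embedded image data
                continue
            # Derive the file extension from the content type, e.g. image/png
            imtype = image.content_type.split('/')[-1]
            # Create a folder for the extracted images
            path = "images/{}_image/".format(file_name)
            os.makedirs(path, exist_ok=True)
            # Shape names such as "Picture 3" contain spaces and repeat across
            # slides, so build a link-safe, unique file name
            name = shape.name
            image_file = "{}slide{}_{}.{}".format(
                path, slide_no, name.replace(' ', '_'), imtype)
            # Write the image data
            with open(image_file, 'wb') as image_out:
                image_out.write(image.blob)
            # Add the image in Markdown image syntax
            results.append('![{}]({})'.format(name, image_file))
# Drop empty lines
results = [line for line in results if line.strip()]

# Write all results to the .md file
with open('{}.md'.format(file_name), 'w', encoding='utf-8') as f:
    f.write('\n'.join(results))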
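
The heading rules in the script above are easier to judge when tried in isolation. The following is a minimal sketch of the same three patterns; the function name to_heading and the sample strings are made up for illustration.

```python
import re


def to_heading(text):
    """Prefix text with Markdown heading markers if it looks like a title."""
    # "第x章" (chapter x) -> top-level heading
    if re.search('第.+章', text):
        return '# ' + text
    # "一、", "二、", ... -> second-level heading
    if re.search('[一二三四五六七八九十]+、', text):
        return '## ' + text
    # A digit followed by a dot -> level depends on the number of dots
    if re.search(r'\d\.', text):
        return '#' * (text.count('.') + 2) + ' ' + text
    # Anything else stays a normal line
    return text


for sample in ['第一章 绪论', '一、背景', '1.1 概念', '1.1.1 定义', '普通段落']:
    print(to_heading(sample))
```

Because re.search matches anywhere in the string, an ordinary sentence that happens to contain a decimal number or the character 、 after a numeral is also turned into a heading. Anchoring the patterns with ^ is one possible way to tighten this, depending on how your slides are written.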

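For batch conversion, just loop over the files in a directory and run the same steps on each one.

Converting every PPT in a given directory to Markdown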

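# Workaround for an AttributeError on Python 3.10 (explained at the end of
# this post); it has to come before the pptx import
import collections.abc
# Uses the python-pptx package
# pip3 install python-pptx
from pptx import Presentation

# File system operations (listing files, creating folders, handling paths)
import os
# Regular expression matching
import re

# Directory that contains the PPT files
directory = './courseware'
# List the entries in the directory
files = os.listdir(directory)

# Iterate over the files
for file in files:
    # Only convert .pptx files; skip the "~$" lock files PowerPoint
    # creates while a presentation is open
    if file.endswith('.pptx') and not file.startswith('~$'):

        # Relative path of the PPT file
        filepath = directory + '/' + file
        # Path without the extension (drops the trailing ".pptx")
        file_name = os.path.splitext(filepath)[0]

        # Open the presentation
        prs = Presentation(filepath)

        # Resulting Markdown lines
        results = []

        # Iterate over the slides (numbered so image file names stay unique)
        for slide_no, slide in enumerate(prs.slides, start=1):
            # Iterate over the shapes on the slide
            for shape in slide.shapes:
                # Does the shape have a text frame?
                if shape.has_text_frame:
                    # Iterate over the paragraphs in the text frame
                    for paragraph in shape.text_frame.paragraphs:
                        # Join the runs (text fragments) into one line
                        text = ''.join(run.text for run in paragraph.runs)
                        # "第x章" (chapter x) becomes a top-level heading, e.g. "# 第一章"
                        if re.search('第.+章', text):
                            text = '# ' + text
                        # "一、", "二、", "三、" ... become second-level headings, e.g. "## 一、"
                        elif re.search('[一二三四五六七八九十]+、', text):
                            text = '## ' + text
                        # A digit followed by a dot, such as 1.1.1
                        elif re.search(r'\d\.', text):
                            # The heading level follows the number of dots, e.g. "#### 1.1.1"
                            text = ('#' * (text.count('.') + 2)) + ' ' + text
                        # Add the paragraph to the results
                        results.append(text)
                # Otherwise check whether the shape is a picture
                else:
                    try:
                        image = shape.image
                    except (AttributeError, ValueError):
                        # Not a picture, or a picture without embedded image data
                        continue
                    # Derive the file extension from the content type, e.g. image/png
                    imtype = image.content_type.split('/')[-1]
                    # Create a folder for the extracted images
                    path = "images/{}_image/".format(file_name)
                    os.makedirs(path, exist_ok=True)
                    # Shape names such as "Picture 3" contain spaces and repeat
                    # across slides, so build a link-safe, unique file name
                    name = shape.name
                    image_file = "{}slide{}_{}.{}".format(
                        path, slide_no, name.replace(' ', '_'), imtype)
                    # Write the image data
                    with open(image_file, 'wb') as image_out:
                        image_out.write(image.blob)
                    # Add the image in Markdown image syntax
                    results.append('![{}]({})'.format(name, image_file))
        # Drop empty lines
        results = [line for line in results if line.strip()]

        # Write all results to the .md file
        with open('{}.md'.format(file_name), 'w', encoding='utf-8') as f:
            f.write('\n'.join(results))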

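A cleaner approach is to move the conversion code into a function and call it for each file. I will not go through that refactoring in detail here; feel free to explore it yourself.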
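
As a starting point for that refactoring, here is a minimal sketch that separates the directory loop from the conversion itself. The names find_pptx_files, convert_all and convert_one are hypothetical; convert_one stands for any function that takes the path of a .pptx file and returns the Markdown text, for example the conversion code above wrapped in a def.

```python
from pathlib import Path


def find_pptx_files(directory):
    """Return the .pptx files directly inside directory, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file()
        and p.suffix.lower() == '.pptx'
        # Skip the "~$" lock files PowerPoint creates for open presentations
        and not p.name.startswith('~$')
    )


def convert_all(directory, convert_one):
    """Write one .md file per presentation and return the written paths."""
    outputs = []
    for pptx_path in find_pptx_files(directory):
        # Same name as the presentation, with the extension swapped to .md
        md_path = pptx_path.with_suffix('.md')
        md_path.write_text(convert_one(pptx_path), encoding='utf-8')
        outputs.append(md_path)
    return outputs
```

Passing the converter in as an argument keeps the loop free of any python-pptx details, so the file handling can be tried out with a dummy converter before the real one is plugged in.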

Exception

If running the code raises AttributeError: module 'collections' has no attribute 'abc':

Traceback (most recent call last):
  File "/Users/用户名/.local/share/virtualenvs/smallScript-RtozSf8y/lib/python3.10/site-packages/pptx/compat/__init__.py", line 10, in <module>
    Container = collections.abc.Container
AttributeError: module 'collections' has no attribute 'abc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/用户名/workspace/python/smallScript/office/ppt2markdown.py", line 3, in <module>
    from pptx import Presentation
  File "/Users/用户名/.local/share/virtualenvs/smallScript-RtozSf8y/lib/python3.10/site-packages/pptx/__init__.py", line 14, in <module>
    from pptx.api import Presentation  # noqa
  File "/Users/用户名/.local/share/virtualenvs/smallScript-RtozSf8y/lib/python3.10/site-packages/pptx/api.py", line 15, in <module>
    from .package import Package
  File "/Users/用户名/.local/share/virtualenvs/smallScript-RtozSf8y/lib/python3.10/site-packages/pptx/package.py", line 6, in <module>
    from pptx.opc.package import OpcPackage
  File "/Users/用户名/.local/share/virtualenvs/smallScript-RtozSf8y/lib/python3.10/site-packages/pptx/opc/package.py", line 11, in <module>
    from pptx.compat import is_string, Mapping
  File "/Users/用户名/.local/share/virtualenvs/smallScript-RtozSf8y/lib/python3.10/site-packages/pptx/compat/__init__.py", line 14, in <module>
    Container = collections.Container
AttributeError: module 'collections' has no attribute 'Container'

then add import collections.abc at the top of your Python file, before the pptx import.
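
The reason this helps, as far as I can tell from the traceback: abc is a submodule of collections, and a submodule only becomes available as an attribute of its package once something has imported it. The first traceback shows the pptx/compat code reading collections.abc.Container without that import. A tiny sketch of the effect, independent of python-pptx:

```python
import collections
import collections.abc  # importing the submodule binds it as collections.abc

# After the import above, the attribute lookup that failed in the traceback works
print(collections.abc.Container)
```

Newer python-pptx releases may no longer need this workaround, so upgrading the package is another thing worth trying.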

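References

Official documentation
看完这篇Python操作PPT总结，从此使用Python玩转Office全家桶就没有压力了！ (a summary of working with PPT files in Python)
【python自动化】读取ppt内全部文本和图片信息并导出markdown文档 (reading all text and images from a PPT and exporting a Markdown document)

Permalink: http://blog.heyb.top/2022/10/21/[python]-ppt-to-markdown.html
Copyright: Unless stated otherwise, all articles on this blog are licensed under CC BY 4.0 CN. Please credit the source when reposting!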

何永彪, Java developer