
Diffing two versions of a Chinese text (Classical Chinese): is there any existing solution?

  •   garywill · 318 days ago · 898 views

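    I have two versions of the same Chinese text (in Classical Chinese) and want a program that diffs them. Are there any existing or related tools?

    I would also like the following differences to be ignored:

    • Punctuation. Ancient texts were written without punctuation marks; all of them were added by later editors, so the punctuation is bound to differ
    • Layout. Where the line breaks and paragraph breaks fall
    • Traditional versus simplified characters

    For example, version 1 is entirely in simplified characters, has no punctuation, has spaces between characters, and puts each clause on its own line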

    床 前 看 月 光

    疑 是 地 上 霜

    举 头 望 山 月

    低 头 思 故 乡

    Version 2

    牀前明月光,疑是地上霜。

    舉頭望明月,低頭思故鄉。

    The comparison result I would like the software to produce is:

    1. 看<->明

    2. 山<->明

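    The example above is five-character verse, so every line has the same number of characters. I also need to compare texts whose sentences differ in length.

    Are there any existing or related tools?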

    6 replies    2024-01-09 14:02:01 +08:00
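    vacuitym
        1
       317 days ago
    This should be fairly easy to implement yourself: first convert both texts to the same script (all simplified or all traditional), then normalize the punctuation so that it is identical too, and then diff the two texts directly.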
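
The normalize-then-diff idea in this reply could look roughly like the sketch below. It is only an illustration using the standard library: `VARIANTS` is a hypothetical, deliberately tiny mapping table that covers just the poem in the question (a real run would need a complete table, for example from OpenCC), and `normalize` and `diff_pairs` are names made up for this example. Instead of making punctuation identical, the sketch simply drops it, which has the same effect on the comparison.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny table (traditional/variant form -> simplified form).
# It only covers the sample poem; a real table would be far larger.
VARIANTS = {"牀": "床", "舉": "举", "頭": "头", "鄉": "乡"}


def normalize(text):
    """Drop punctuation, spaces and line breaks, then map every character to one script."""
    kept = []
    for ch in text:
        # Unicode general categories: P* = punctuation, Z* = separators, C* = control codes.
        if unicodedata.category(ch)[0] in "PZC":
            continue
        kept.append(VARIANTS.get(ch, ch))
    return "".join(kept)


def diff_pairs(a, b):
    """Return (left, right) pairs for every span that still differs after normalization."""
    na, nb = normalize(a), normalize(b)
    matcher = SequenceMatcher(None, na, nb, autojunk=False)
    return [
        (na[i1:i2], nb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


version1 = "床 前 看 月 光\n\n疑 是 地 上 霜\n\n举 头 望 山 月\n\n低 头 思 故 乡"
version2 = "牀前明月光,疑是地上霜。\n\n舉頭望明月,低頭思故鄉。"
for number, (left, right) in enumerate(diff_pairs(version1, version2), start=1):
    print(f"{number}. {left}<->{right}")
```

Because the line breaks are removed before diffing, the layout of the two versions does not matter; the pairs are reported against the normalized text, not the original spelling.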
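    garywill
        2
    OP
       317 days ago
    Here is a harder example, with different sentence breaks and more complex punctuation. (Some versions are even missing a few sentences in the middle.)

    Version 1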
    帝曰:「有其年已老而有子者,何也?」岐伯曰:「此其天壽過度,氣脈常通,而腎氣有餘也。此雖有子,男不過盡八八,女不過盡七七,而天地之精氣皆竭矣。」

    Version 2
    帝曰有其年已老而有子者何也
    岐伯曰此其天壽過度氣脈常通而腎氣有餘也
    此雖有子男不過盡八八女不過盡七七
    而天地之精氣皆竭矣
    superychen
        3
       317 days ago
    Do the two versions always have the same number of characters? Just ask GPT and it can generate Python code for you.
    superychen
        4
       317 days ago
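    ```python
    import re
    from difflib import SequenceMatcher

    import opencc  # third-party: Python bindings for OpenCC

    # Matches characters in the basic CJK Unified Ideographs range.
    PATTERN_CHINESE = re.compile(r'[\u4e00-\u9fa5]')
    CONVERTER = opencc.OpenCC("t2s")

    # Keep only Chinese characters (drops punctuation, spaces and line breaks).
    def clean(text):
        return ''.join(PATTERN_CHINESE.findall(text))

    # Convert traditional characters to simplified.
    def simplify(text):
        return CONVERTER.convert(text)

    # Compare two texts and print every replaced span.
    def compare_text(text1, text2):
        text1 = clean(text1)
        text2 = clean(text2)
        # Diff the simplified forms, but print the original characters.
        # This assumes the conversion keeps the string length unchanged.
        text1a = simplify(text1)
        text2a = simplify(text2)
        matcher = SequenceMatcher(None, text1a, text2a)
        diffs = matcher.get_opcodes()
        index = 0
        for tag, i1, i2, j1, j2 in diffs:
            # Only 'replace' is reported; 'insert' and 'delete' are skipped.
            if tag == 'replace':
                index += 1
                print(f'{index}. {text1[i1:i2]} <-> {text2[j1:j2]}')

    # Sample input: a simplified version and a traditional version.
    simplified_text = '''床 前 看 月 光

    疑 是 地 上 霜

    举 头 望 山 月

    低 头 思 故 乡'''
    traditional_text = '''牀前明月光,疑是地上霜。

    舉頭望明月,低頭思故鄉。'''

    compare_text(simplified_text, traditional_text)
    ```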
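
The code above prints only `replace` opcodes, so a passage that is missing from one version (the case raised in reply 2) would go unreported. A possible extension, sketched with the standard library only, also reports `delete` and `insert` opcodes. The function names `strip_marks` and `describe_diffs` are invented for this example, and script conversion is left out on the assumption that both inputs already use the same script, as in the reply 2 sample. The second version below is shortened by hand for illustration; it is not a real edition of the text.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_marks(text):
    """Keep only characters that are not punctuation, separators or control codes."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PZC")


def describe_diffs(a, b):
    """Describe every non-equal opcode between the two stripped texts."""
    a, b = strip_marks(a), strip_marks(b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            report.append(f"{a[i1:i2]} <-> {b[j1:j2]}")
        elif tag == "delete":  # present in the first text only
            report.append(f"only in version 1: {a[i1:i2]}")
        elif tag == "insert":  # present in the second text only
            report.append(f"only in version 2: {b[j1:j2]}")
    return report


full = "帝曰:「有其年已老而有子者,何也?」岐伯曰:「此其天壽過度,氣脈常通,而腎氣有餘也。」"
# Hand-shortened copy with one clause removed, for illustration only.
shortened = "帝曰有其年已老而有子者何也\n岐伯曰此其天壽過度而腎氣有餘也"
for line in describe_diffs(full, shortened):
    print(line)
```

Passing `autojunk=False` switches off difflib's heuristic that treats very frequent elements as junk in long sequences, which seems safer for long texts where common characters repeat often.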
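    geelaw
        5
       317 days ago   ❤️ 1
    You can model this as an edit-distance problem.

    Preparation: find a dictionary that records all punctuation marks, whitespace characters and Chinese characters, together with the different written forms of the same character (simplified, traditional and variant forms).

    1. Remove all punctuation and whitespace from both strings, keeping only the Chinese characters.
    2. Compute the edit script with the minimal edit distance: replacing a character with another written form of itself, deleting a character, and inserting a character can each be given a cost of 1 (so changing a character into an unrelated one costs 2).

    Step 2 is a standard dynamic-programming problem.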
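
Step 2 could be sketched as follows. This is one possible reading of the cost model described above, not a reference implementation: `VARIANT_GROUPS` is a hypothetical, tiny stand-in for the dictionary from the preparation step, the names `same_character` and `edit_script` are invented here, and the `variant_cost` parameter is an added assumption (the reply fixes it at 1; passing 0 would keep variant forms out of the total cost). The inputs are expected to be already stripped of punctuation and whitespace, as in step 1.

```python
# Hypothetical, tiny variant dictionary: each set lists written forms of one character.
VARIANT_GROUPS = [{"床", "牀"}, {"举", "舉"}, {"头", "頭"}, {"乡", "鄉"}]
CANONICAL = {ch: min(group) for group in VARIANT_GROUPS for ch in group}


def same_character(x, y):
    """True if x and y are written forms of the same character."""
    return CANONICAL.get(x, x) == CANONICAL.get(y, y)


def edit_script(a, b, variant_cost=1):
    """Return (cost, edits) for a minimal-cost way of turning a into b.

    Allowed edits: swap a character for another written form of itself
    (variant_cost), delete a character (1), insert a character (1).
    """
    n, m = len(a), len(b)
    # dist[i][j] = minimal cost of turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1)
            if a[i - 1] == b[j - 1]:
                best = min(best, dist[i - 1][j - 1])
            elif same_character(a[i - 1], b[j - 1]):
                best = min(best, dist[i - 1][j - 1] + variant_cost)
            dist[i][j] = best

    # Walk back from the bottom-right corner to recover one optimal edit script.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal = i > 0 and j > 0
        if diagonal and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif (diagonal and a[i - 1] != b[j - 1]
              and same_character(a[i - 1], b[j - 1])
              and dist[i][j] == dist[i - 1][j - 1] + variant_cost):
            edits.append(("variant", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("delete", a[i - 1], ""))
            i -= 1
        else:
            edits.append(("insert", "", b[j - 1]))
            j -= 1
    edits.reverse()
    return dist[n][m], edits


cost, edits = edit_script("床前看月光", "牀前明月光")
print(cost, edits)
```

In this model an unrelated substitution such as 看 versus 明 shows up as one `delete` plus one `insert`, which matches the cost of 2 mentioned in the reply. The table takes time and memory proportional to the product of the two lengths, so very long texts might need to be compared chapter by chapter.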
    superychen
        6
       317 days ago
    The same code as in reply 4, shared as an embedded carbon.now.sh snippet.