Recovering Garbled Text (Mojibake) in Python 3

Posted on 2019-05-14

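Recently a project of mine ran into a problem: the Chinese names in some older database records showed up as garbled text (mojibake), like this.

é™³å½¥ç’‹ # 陳彥璋

So I started looking for a way to recover the garbled text automatically with Python 3.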

Key code

Here is the key code first. The function below recovers garbled text that was wrongly decoded as windows-1252.

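def get_correct_value(messy_code, from_coding, to_coding):
    '''
    Recover text that was decoded with the wrong encoding.

    :param messy_code: str, the garbled text to recover, e.g. 'é™³å½¥ç’‹'
    :param from_coding: str, the encoding wrongly used to decode, e.g. 'windows-1252'
    :param to_coding: str, the encoding that should have been used, e.g. 'utf-8'
    :return: str, the recovered text, or None if messy_code is empty
    '''
    if not messy_code:
        return None
    raw = bytearray()
    for ch in messy_code:
        try:
            # Undo the wrong decode: turn the character back into its original byte(s).
            raw += ch.encode(from_coding)
        except UnicodeEncodeError:
            # Bytes that from_coding leaves undefined (e.g. 0x81 in windows-1252)
            # usually survive as the code point of the same value, so map them back.
            if ord(ch) > 0xFF:
                raise
            raw.append(ord(ch))
    # Decode the recovered bytes with the encoding that should have been used.
    return raw.decode(to_coding)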
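
Since only the older records are garbled, converting every row blindly would risk damaging names that are already correct. The sketch below shows one way to guard against that. The helper name recover_if_garbled and the sample rows are hypothetical, and it deliberately uses strict encoding, so any value that does not round-trip cleanly is returned unchanged rather than guessed at.

```python
def recover_if_garbled(value, from_coding='windows-1252', to_coding='utf-8'):
    """Return the recovered text, or value unchanged if the round trip fails."""
    if not value:
        return value
    try:
        return value.encode(from_coding).decode(to_coding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-correct text such as '陳彥璋' cannot be encoded as windows-1252,
        # so it lands here and is left alone.
        return value


# Hypothetical rows: one old garbled record, one correct record, one plain ASCII name.
names = ['é™³å½¥ç’‹', '陳彥璋', 'Joseph Chu']
print([recover_if_garbled(name) for name in names])
# ['陳彥璋', '陳彥璋', 'Joseph Chu']
```

Plain ASCII passes through untouched because its windows-1252 and utf-8 byte sequences are identical.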

Basic approach to recovering garbled text in Python 3

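First, we need to understand the concepts of character encoding and decoding, which correspond to str.encode() and bytes.decode() in Python 3. encode() turns text into bytes, and decode() turns bytes back into text. Garbled text appears when bytes are decoded with a different encoding from the one that produced them.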
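
As a small illustration of these two calls, the sketch below reproduces the garbling from the example at the top of this post and then reverses it. The variable names are chosen only for illustration.

```python
# Correct text, as the application originally produced it.
original = '陳彥璋'

# encode(): str -> bytes. UTF-8 uses three bytes for each of these characters.
raw = original.encode('utf-8')
print(raw)  # b'\xe9\x99\xb3\xe5\xbd\xa5\xe7\x92\x8b'

# decode() with the wrong encoding: every byte becomes a character of its own.
garbled = raw.decode('windows-1252')
print(garbled)  # é™³å½¥ç’‹

# Reversing the wrong step and redoing it correctly restores the text.
restored = garbled.encode('windows-1252').decode('utf-8')
print(restored)  # 陳彥璋
```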

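An online tool can tell you which encoding the text was wrongly decoded with and which encoding should have been used. In my case, the text became garbled because it was decoded as windows-1252 when it should have been decoded as utf-8. Recovery therefore reverses the mistake: encode the garbled text back into bytes with windows-1252, then decode those bytes as utf-8.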
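
If no online tool is at hand, the same search can be roughed out in Python by trying candidate encoding pairs and keeping the ones that round-trip cleanly. This is only a sketch: the function name guess_codings, the candidate lists, and the "result contains a CJK ideograph" check are all assumptions made for illustration, and other data may need different candidates or a different plausibility test.

```python
def guess_codings(messy_code,
                  wrong_codings=('windows-1252', 'latin-1', 'gbk'),
                  right_codings=('utf-8', 'gbk', 'big5')):
    """Return (from_coding, to_coding, text) for every pair that looks plausible."""
    hits = []
    for from_coding in wrong_codings:
        for to_coding in right_codings:
            if from_coding == to_coding:
                continue
            try:
                text = messy_code.encode(from_coding).decode(to_coding)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # this pair cannot have produced the garbled text
            # Assumed heuristic: a correct result for this data contains CJK ideographs.
            if any('\u4e00' <= ch <= '\u9fff' for ch in text):
                hits.append((from_coding, to_coding, text))
    return hits


for hit in guess_codings('é™³å½¥ç’‹'):
    print(hit)
```

Any pair it reports is still only a candidate, so it is worth checking the recovered text by eye before updating the database.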

References

Joseph Chu

