Python - Unbaking mojibake
When you have incorrectly decoded characters, how can you identify likely candidates for the original string?
Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png
I know for a fact that this image filename should have been Japanese characters. I have tried various guesses at urllib quoting/unquoting, and at encoding and decoding with ISO 8859-1 and UTF-8, but I haven't been able to unmunge it and recover the original filename.
Is the corruption reversible?
You could use chardet (install it with pip):
    import chardet

    # Python 2: a plain str is a byte string, so chardet can inspect it directly.
    your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
    detected_encoding = chardet.detect(your_str)["encoding"]

    try:
        correct_str = your_str.decode(detected_encoding)
    except UnicodeDecodeError:
        print("could not estimate encoding")
Result: 時間試験観点(アニメパス)_10秒 (I have no idea whether this is correct or not.)
For Python 3 (source file encoded as UTF-8):
    import codecs

    import chardet

    falsely_decoded_str = "Ä×èÈÄÄî¦è¤ô_üiâAâjâüâpâXüj_10òb"

    # Undo the wrong decoding: turn the text back into the original bytes.
    try:
        encoded_str = falsely_decoded_str.encode("cp850")
    except UnicodeEncodeError:
        print("could not encode falsely decoded string")
        encoded_str = None

    if encoded_str:
        detected_encoding = chardet.detect(encoded_str)["encoding"]
        try:
            correct_str = encoded_str.decode(detected_encoding)
        except (UnicodeDecodeError, TypeError):
            # TypeError covers the case where chardet returns None.
            print("could not decode encoded_str as %s" % detected_encoding)
        else:
            with codecs.open("output.txt", "w", "utf-8-sig") as out:
                out.write(correct_str)