python - Unbaking mojibake -

- June 15, 2012

When I have incorrectly decoded characters, how can I identify candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâaâjâüâpâxüj_10òb.png

I know for a fact that this image filename should have been Japanese characters. Despite various guesses at urllib quoting/unquoting and at encoding and decoding with ISO 8859-1 and UTF-8, I haven't been able to unmunge it and recover the original filename.

Is the corruption reversible?

You could use chardet (install it with pip):

# Python 2: your_str is a byte string here
import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâaâjâüâpâxüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("could not estimate encoding")

Result: 時間試験観点（アニメパス）_10秒 (no idea whether it is correct or not)
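
One way to sanity-check a result like this is to push the candidate back through the suspected corruption path and see whether it reproduces the garbled text. The following is only a minimal Python 3 sketch: `reproduces_garbled` is a hypothetical helper, and the default codecs (Shift_JIS bytes misread as cp850) are assumptions, not something the question confirms.

```python
def reproduces_garbled(candidate, garbled,
                       real_codec="shift_jis", wrong_codec="cp850"):
    """Return True if misdecoding `candidate` yields exactly `garbled`.

    real_codec and wrong_codec are guesses about how the text was corrupted.
    """
    try:
        # Encode with the assumed real codec, then decode with the wrong one,
        # which is the same mistake that is assumed to have produced `garbled`.
        return candidate.encode(real_codec).decode(wrong_codec) == garbled
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
```

If the garbled text was altered after the wrong decoding (for example, characters replaced or case changed along the way), this check can fail even for the right candidate, so a negative answer is not conclusive.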

For Python 3 (source file encoded as UTF-8):

import chardet
import codecs

falsely_decoded_str = "Ä×èÈÄÄî¦è¤ô_üiâaâjâüâpâxüj_10òb"

# Undo the wrong decoding step to get the original bytes back
try:
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except UnicodeDecodeError:
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # Write the result only if decoding succeeded
        with codecs.open("output.txt", "w", "utf-8-sig") as out:
            out.write(correct_str)
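
Since the question asks for candidates rather than a single answer, another option is to try several (wrong codec, real codec) pairs and keep every combination that round-trips without error. This is a rough sketch under stated assumptions: the codec lists and the name `candidate_repairs` are illustrative choices, not part of the answer above.

```python
from itertools import product

# Assumed candidate lists; extend them for other languages or platforms.
WRONG_CODECS = ["cp850", "cp437", "cp1252", "latin-1"]
REAL_CODECS = ["shift_jis", "euc_jp", "utf-8"]


def candidate_repairs(garbled, wrong_codecs=WRONG_CODECS, real_codecs=REAL_CODECS):
    """Return (wrong_codec, real_codec, text) for every pair that round-trips."""
    results = []
    for wrong, real in product(wrong_codecs, real_codecs):
        try:
            # Re-encode with the codec that was wrongly used for decoding,
            # then decode the recovered bytes with the guessed real codec.
            text = garbled.encode(wrong).decode(real)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled text
        results.append((wrong, real, text))
    return results
```

Calling `candidate_repairs(falsely_decoded_str)` returns a list that still has to be judged by eye, since more than one pair may decode cleanly while only one gives sensible text.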
