I have a database full of strings where the accented characters have been replaced by their non-accented equivalents, and a spreadsheet full of strings with accents in them. I’m supposed to look up the info in the database given the info in the spreadsheet.
>>> title = u"some string with accented characters in it like b\xe9cancour"
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'some string with accented characters in it like becancour'
Normalizing with ‘NFKD’ decomposes each character in the string into its component characters. For example, an e with an acute accent is split into a plain e followed by a combining acute accent. The K in NFKD stands for compatibility: it also replaces compatibility characters, such as the ligature ﬁ, with their plain equivalents (here, an f followed by an i). Then encode('ascii', 'ignore') drops all the non-ASCII characters, which by now are mostly just the accents that have been separated from their letters. One caveat: letters that have no decomposition, such as ø or ß, are dropped entirely rather than replaced.
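To make the two steps visible, here is a minimal sketch that wraps them in a helper. The name strip_accents is my own invention for illustration, and the trailing .decode('ascii') is an assumption for Python 3, where encode() returns bytes rather than a string.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper name)."""
    # Step 1: decompose, e.g. U+00E9 becomes 'e' followed by U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 2: drop everything outside ASCII, which removes the combining marks.
    # On Python 3, encode() returns bytes, so decode back to a string.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

title = u"b\xe9cancour"
decomposed = unicodedata.normalize('NFKD', title)
# Code points of the first three characters: b, e, combining acute accent.
print([hex(ord(c)) for c in decomposed[:3]])
print(strip_accents(title))
```

Running the same helper over both the spreadsheet values and the search keys should give matching strings, as long as the database was stripped the same way.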
Awesome. And it works in Python 2.5.