inside Habbie's mind

silly Python unicode mistake

written by peter, on Jun 12, 2010 7:58:00 AM.

For a simple blog-to-twitter posting gateway (source code) I’m relying on the excellent feedparser and twitter modules, and I am trusting them to handle unicode strings without trouble. With most well-written Python modules (and these two are no exception!), methods return unicode strings as they see fit, and other methods accept these unicode strings and handle all the nitty-gritty encoding details for me.

A simplified version of my workflow would look like this:

def post(entry):
  title = entry.title
  print "posting [%s]" % title
  api.PostUpdate(title) # api is a twitter Api object

feed = feedparser.parse(config["feed"])
for e in reversed(feed.entries):
  if e.id not in seen:
    post(e)

This code bombed out with an exception on the first post that had a non-ASCII title. Can you spot why?

It’s the print statement. All the APIs I’m using have zero trouble with unicode, but print wants to encode for your terminal, and it’ll usually assume that the terminal is ASCII. My ‘debugging’ output actually broke the program. My workaround is to say title.encode("ascii", "replace").
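A minimal sketch of that workaround follows. The helper name ascii_safe and the sample title are made up for illustration and are not part of the gateway. The code above is Python 2; the sketch avoids the print statement so that it should also run on Python 3, and it decodes the result back to text so the message looks the same on both.

```python
def ascii_safe(text):
    """Return text with every non-ASCII character replaced by '?'."""
    # encode(..., "replace") substitutes '?' for each character ASCII cannot
    # represent; decoding again gives a text string that is pure ASCII.
    return text.encode("ascii", "replace").decode("ascii")


title = u"Sm\u00f6rg\u00e5sbord"  # hypothetical non-ASCII entry title
message = "posting [%s]" % ascii_safe(title)
print(message)  # only ASCII characters reach stdout
```

Only the debugging output is dumbed down this way; the original title can still be handed to the twitter API untouched.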

Brend on #python pointed out to me that the issue is not, exactly, print. The issue is interpolating title into a non-unicode string. Depending on environment, using print on the unicode object might in fact work. For those environments, saying print u"posting [%s]" % title could help. In my case however, I ran into the issue from cron with no locale set at all, so dumbing the string down to ascii is still the right thing to do.
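Since the outcome depends on the environment, one option is to fall back to ASCII only when the output encoding cannot represent the text. The following is a rough sketch under that assumption; the helper name printable is hypothetical, and treating an unknown encoding as ASCII is my choice here, mirroring the cron case described above.

```python
import locale
import sys


def printable(text, encoding):
    """Return text unchanged if encoding can represent it, else an ASCII-only copy."""
    try:
        # None means the stream's encoding is unknown; assume ASCII in that case.
        text.encode(encoding or "ascii")
        return text
    except (UnicodeEncodeError, LookupError):
        return text.encode("ascii", "replace").decode("ascii")


title = u"\u00d6l"  # hypothetical non-ASCII title
stdout_encoding = getattr(sys.stdout, "encoding", None)
print("stdout: %r, locale: %r" % (stdout_encoding, locale.getpreferredencoding()))
print(printable(title, stdout_encoding))
```

Run from a terminal with a UTF-8 locale, the title should come through intact; run from cron or through a pipe with no locale, it should degrade to question marks instead of raising.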

Comments

  • As far as I understand, interpolating a unicode string into a non-unicode string is not itself the problem, which is easily tested at the Python prompt:

    >>> a = "x %s"
    >>> b = u"\u00d6"
    >>> a % b
    u'x \xd6'

    But the problem is indeed printing that unicode string to stdout, when the encoding of stdout cannot handle all chars of the resulting unicode string. That’s one issue, and can be solved by reencoding to ascii with the ‘replace’ option.

    The second issue is this: when called interactively (on Linux), Python uses LC_CTYPE to determine the encoding of stdout, but it always chooses ‘ascii’ when stdout is not a terminal (e.g. because stdout is redirected). (That’s a bug in my opinion, but some people seem to disagree.)

    Anyhow, here’s a solution (combining all blog posts I could find by a quick search ;) ), that solves both issues:

    import locale
    import codecs
    import sys

    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout, errors='replace')

    print u'\u00d6\n'

    It always uses the preferred encoding, regardless of whether stdout is a terminal or not, and uses a lossy transformation if necessary.

    Comment by thm, Jun 17, 2010 1:52:00 PM
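To see what the wrapper from the comment does without touching the real stdout, here is a small sketch that wraps an in-memory byte buffer instead. The helper name make_lossy_writer and the fixed encodings are assumptions for illustration; the comment's snippet targets Python 2, while this version should run on Python 2 and 3 alike.

```python
import codecs
import io


def make_lossy_writer(stream, encoding):
    """Wrap a byte stream so unicode text is encoded with errors='replace'."""
    return codecs.getwriter(encoding)(stream, errors="replace")


# A BytesIO buffer stands in for a stdout that only accepts bytes.
ascii_buf = io.BytesIO()
make_lossy_writer(ascii_buf, "ascii").write(u"\u00d6\n")
ascii_bytes = ascii_buf.getvalue()  # the character ASCII lacks becomes '?'

utf8_buf = io.BytesIO()
make_lossy_writer(utf8_buf, "utf-8").write(u"\u00d6\n")
utf8_bytes = utf8_buf.getvalue()    # UTF-8 can represent it, so nothing is lost
```

The transformation is lossy only when the chosen encoding cannot represent a character, which is the behaviour the comment describes.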
