Skip to content Skip to sidebar Skip to footer

Python Convert Html Ascii Encoded Text To Utf8

I have a xml file, which I need to convert to utf8. Unfortunately the entities contain text like this: /mytext, I'm using the codec library to convert files to u

Solution 1:

You can pass the text of the file through an unescape function before passing it to the XML parser.

Alternatively, if you're only parsing HTML, lxml's http parser does this for you:

>>> import lxml.html
>>> html = lxml.html.fromstring("<html><body><p>&#047;mytext&#044;</p></body></html>")
>>> lxml.html.tostring(html)
'<html><body><p>/mytext,</p></body></html>'

Solution 2:

Recently posted the below in response to a similar question:

import HTMLParser     # html.parser in Python 3
h = HTMLParser.HTMLParser()
h.unescape('&#047;mytext&#044;')

Technically this method is "internal" and undocumented, but it's been in the API quite a while and isn't marked with a leading underscore.

Found it here; other approaches are also mentioned, of which BeautifulSoup is probably the best if you don't mind its "heaviness."

Post a Comment for "Python Convert Html Ascii Encoded Text To Utf8"