Parsing HTML encoded content! How to convert?

**JohnPotier** · Jan 25th, 2024, 04:02 PM

Hi all,

(Really? No one knows of a simple way to handle these encoded characters? Surprises me!)

Is there a simple way to convert HTML encoded characters ("&#x0104;", "&quot;") to their printable equivalents?

Background:
In my program I parse HTML manually, looking for all kinds of attributes in our online product catalog. I have no way to get at the data before it is cached on the web server, so I can't go back to my source data: I have to inspect it after the business servers have mixed our data with that of other data providers. The format and layout are under my control, so manual parsing is OK, as I'm well aware of any changes to the structure :-)

My challenge however is that the data is mixed with data from other providers from all over the world and I encounter encoded characters from the entire Unicode universe. I'm tired of constantly expanding my fixString() function with yet another case, like

Code:

html = Replace(html, "&#x0104;", "Ą") ' similar to "awn"

These come in at least 4 variants for the same characters, as far as I've found:

Code:

Words: &quot;
Hex: &#x0104;
Decimal:  & # 29 ;     (without the spaces)
Strange: "Ã€"     (this one is probably because VB6 controls have limited capabilities to print international characters)
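
The hex and decimal variants at least follow one fixed pattern (`&#x` plus hex digits, or `&#` plus decimal digits, then `;`), so they could probably be decoded in a single pass instead of one Replace per character. Below is a minimal sketch of that idea, not a complete solution. The names `DecodeNumericEntities`, `ParseCodePoint` and `CodePointToString` are made up for illustration, and the word variant (like `&quot;`) would still need a lookup table of names.

```vb
' Hypothetical helpers (not from any library): decode numeric HTML
' character references such as &#x0104; (hex) and &#260; (decimal).
Public Function DecodeNumericEntities(ByVal s As String) As String
    Dim pos As Long      ' position of the current "&#"
    Dim semi As Long     ' position of the terminating ";"
    Dim code As Long     ' decoded code point, -1 if not a valid reference

    pos = InStr(1, s, "&#")
    Do While pos > 0
        semi = InStr(pos + 2, s, ";")
        If semi = 0 Then Exit Do            ' no terminator: leave the rest alone
        code = ParseCodePoint(Mid$(s, pos + 2, semi - pos - 2))
        If code >= 0 Then
            ' Swap this one reference for its character, then continue
            ' searching after it so decoded text is never decoded twice.
            s = Left$(s, pos - 1) & CodePointToString(code) & Mid$(s, semi + 1)
            pos = InStr(pos + 1, s, "&#")
        Else
            pos = InStr(pos + 2, s, "&#")   ' not a reference: skip it
        End If
    Loop
    DecodeNumericEntities = s
End Function

' Returns the code point for "x0104" or "260", or -1 if the text is invalid.
Private Function ParseCodePoint(ByVal body As String) As Long
    Dim digits As String
    Dim n As Long

    ParseCodePoint = -1
    If Len(body) = 0 Or Len(body) > 8 Then Exit Function   ' keeps Val within Long range
    If LCase$(Left$(body, 1)) = "x" Then
        digits = Mid$(body, 2)
        If Len(digits) = 0 Then Exit Function
        If digits Like "*[!0-9A-Fa-f]*" Then Exit Function
        n = Val("&H" & digits & "&")        ' trailing & forces a Long, so FFFF is not -1
    Else
        If body Like "*[!0-9]*" Then Exit Function
        n = Val(body)
    End If
    ' Reject 0, lone surrogates and anything beyond the Unicode range.
    If n = 0 Or n > &H10FFFF Then Exit Function
    If n >= &HD800& And n <= &HDFFF& Then Exit Function
    ParseCodePoint = n
End Function

' VB6 strings are UTF-16, so code points above FFFF need a surrogate pair.
Private Function CodePointToString(ByVal cp As Long) As String
    If cp < &H10000 Then
        CodePointToString = ChrW$(cp)
    Else
        cp = cp - &H10000
        CodePointToString = ChrW$(&HD800& + (cp \ &H400)) & _
                            ChrW$(&HDC00& + (cp And &H3FF))
    End If
End Function
```

A quick check that avoids the display problem in the IDE is to look at the character code rather than the glyph, for example `Debug.Print Hex$(AscW(DecodeNumericEntities("&#x0104;")))`, which should print `104`.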

So I repeat: Is there a simpler way to convert all these encoded characters to the printable equivalents?

At least I would expect to find a community-built function that would convert all encoded characters... Haven't found it yet! I'll keep expanding my fixString() :-)...

I will store the data I need in UTF-8 text files for further processing.
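
For the UTF-8 output, one common route from VB6 is the ADODB.Stream object. This is only a sketch under assumptions: that ADO (MDAC) is installed on the machine, and that a byte order mark at the start of the file is acceptable, since ADODB.Stream writes one for the "utf-8" charset. `SaveUtf8` is a made-up name for illustration.

```vb
' Hypothetical helper: writes a VB6 (UTF-16) string to a UTF-8 text file.
' Late-bound via CreateObject, so no project reference to ADO is needed.
Public Sub SaveUtf8(ByVal filePath As String, ByVal content As String)
    Dim stm As Object

    Set stm = CreateObject("ADODB.Stream")
    stm.Type = 2                  ' adTypeText
    stm.Charset = "utf-8"         ' encoding used when the text is saved
    stm.Open
    stm.WriteText content
    stm.SaveToFile filePath, 2    ' adSaveCreateOverWrite
    stm.Close
End Sub
```

Decoding the entities first and saving afterwards keeps the conversion to UTF-8 in one place, so the rest of the program only ever deals with ordinary VB6 strings.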
