HTML Help Builder
Encoding issues
  Richard Kaye
  All
  Sep 3, 2020 @ 06:32am

Hi Rick,

I've been adding some topics to my help file by copying and pasting from a PDF. I've noticed that some characters are coming out as UTF-8 encoded strings instead of the actual character. (For example, the ellipsis character.) It looks fine in the WWHB editor. Is there a recommended procedure for getting content from pre-existing PDF or Word documents that will avoid this conversion in the output?

TIA

re: Encoding issues
  Richard Kaye
  Richard Kaye
  Sep 3, 2020 @ 06:47am

Further to this, it seems it doesn't like typographic characters in general: in addition to the ellipsis, the em-dash and similar characters are affected. Perhaps I should try Paste as HTML when copying from a PDF?

re: Encoding issues
  Richard Kaye
  Richard Kaye
  Sep 3, 2020 @ 08:09am

And one more note: I tried using an HTML endpoint and it still gets UTF-8 encoded when viewed in the browser.

I will double-check the templates to make sure they are on the current version in case that's the source of the encoding problems. I am pretty certain this project was built from scratch in 5.x but...

re: Encoding issues
  Richard Kaye
  Richard Kaye
  Sep 3, 2020 @ 08:14am

And it just gets more strange. When I view the page with this issue on my local dev W10 system, I see the encoding issue. When I look at the same page in the same browser on my external staging server, the character looks right. So there must be some environmental difference between my local site/IIS and the one located on my external server...

re: Encoding issues
  Rick Strahl
  Richard Kaye
  Sep 4, 2020 @ 01:16pm

Yeah this is a complex problem and I haven't found a good solution that works for all scenarios.

This is a FoxPro problem because FoxPro can't store non-ANSI characters internally and there's no good way to retrieve the raw UTF-8 content out of the editor. So extended characters are basically lost when you save the data to FoxPro.
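As a rough illustration of the kind of loss described above, here is a minimal Python sketch (not FoxPro, and not Help Builder's actual code) that forces text into a single-byte ANSI code page. It assumes Windows-1252 as the ANSI code page and `?` substitution for anything that does not fit; both are assumptions for the example, and FoxPro's real conversion behavior may differ. The function names are hypothetical.

```python
def to_ansi(text: str, codepage: str = "cp1252") -> bytes:
    """Force text into a single-byte code page (assumed: Windows-1252).

    Characters the code page cannot represent are replaced with '?'.
    """
    return text.encode(codepage, errors="replace")


def roundtrip(text: str, codepage: str = "cp1252") -> str:
    """Store as ANSI bytes, then read back as text."""
    return to_ansi(text, codepage).decode(codepage)


# A character that exists in the code page survives the round trip.
print(ascii(roundtrip("caf\u00e9")))

# Characters outside the code page (two CJK characters here) do not.
print(ascii(roundtrip("\u65e5\u672c")))
```

Once the original characters have been replaced there is nothing left to recover them from, which is the sense in which the data is lost on save.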

You can paste any text into the editor, including upper Unicode characters, and it works in the editor because the editor itself is a Web page that supports Unicode characters (via UTF-8).

It gets worse - if the data were actually stored as UTF-8 in the text field, then rendering the topic would double encode it, because all HTML output is generated with UTF-8 encoding...
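The double encoding can be reproduced outside FoxPro. The Python sketch below is illustrative only: it assumes the field's raw UTF-8 bytes get treated as Windows-1252 text and that the renderer then UTF-8 encodes the whole output. The function names are hypothetical.

```python
ELLIPSIS = "\u2026"  # the ellipsis character from the original report


def utf8_bytes_as_ansi_text(text: str, codepage: str = "cp1252") -> str:
    """Simulate raw UTF-8 bytes sitting in a field that is read as ANSI text."""
    return text.encode("utf-8").decode(codepage)


def render_utf8(field_text: str) -> bytes:
    """Simulate a renderer that UTF-8 encodes everything it outputs."""
    return field_text.encode("utf-8")


stored = utf8_bytes_as_ansi_text(ELLIPSIS)  # one character becomes three
page = render_utf8(stored)                  # each of those is encoded again

print(ascii(stored))
print(len(ELLIPSIS.encode("utf-8")), "bytes before,", len(page), "bytes after")

# The damage is reversible only if you know exactly which steps were applied.
repaired = page.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == ELLIPSIS)
```

A browser that reads the page as UTF-8 shows the three stray characters instead of the ellipsis, which matches the symptom reported at the top of the thread.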

This is why FoxPro text rendering and its lack of Unicode support sucks. Luckily, for the most part this is not a big problem.

As to PDF pasting - I think the problem there is actually that a lot of PDF tools don't fix up the content they render. So they'll give you the raw text, which is UTF-8 encoded. Nothing I can do about that - that's a bug in PDF clients. I see this with Nitro (which is what I use).

+++ Rick --

re: Encoding issues
  Richard Kaye
  Rick Strahl
  Sep 4, 2020 @ 01:23pm

I've read your white papers on the topic. My head spins...

I wonder if using one of the VFP9 binary data types would handle this better?

re: Encoding issues
  Rick Strahl
  Richard Kaye
  Sep 4, 2020 @ 02:56pm

It's possible to make this work, but it sucks.

  • Don't auto UTF-8 encode the entire document
  • Encode every field explicitly and manually
  • Don't encode the Markdown editing fields

Basically I do this here on the message board. This page isn't wholesale UTF-8 encoded and each field that needs it is manually encoded and the file on disk is UTF-8 by default. It's a ton of <%= STRCONV( expr, 9 ) %> which is ugly. And the file has to be saved as UTF-8, while other templates should be saved as ANSI. It sucks!
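The per-field approach can be sketched in Python. This is an illustration, not the message board's actual template code. It assumes ordinary fields are stored as Windows-1252 bytes, that the Markdown field already holds UTF-8 and so must be left alone, and that the static template text is UTF-8 on disk. The function and field names are hypothetical, and HTML escaping is omitted for brevity.

```python
def field_to_utf8(ansi_bytes: bytes, codepage: str = "cp1252") -> bytes:
    """Convert one ANSI field to UTF-8.

    Rough stand-in for the job STRCONV( expr, 9 ) does in the template.
    """
    return ansi_bytes.decode(codepage).encode("utf-8")


def render_page(title_ansi: bytes, author_ansi: bytes, body_utf8: bytes) -> bytes:
    """Assemble the page as bytes; nothing is encoded wholesale at the end."""
    parts = [
        b"<h1>", field_to_utf8(title_ansi), b"</h1>\n",   # encoded explicitly
        b"<p>", field_to_utf8(author_ansi), b"</p>\n",    # encoded explicitly
        b"<div>", body_utf8, b"</div>\n",                 # assumed already UTF-8
    ]
    return b"".join(parts)


page = render_page(
    title_ansi="Encoding issues\u2026".encode("cp1252"),
    author_ansi="Richard Kaye".encode("cp1252"),
    body_utf8="An em-dash: \u2014".encode("utf-8"),
)
print(ascii(page.decode("utf-8")))
```

Every field has to be routed through the right path by hand, and getting one wrong produces either lost or double-encoded characters, which is why this is hard to offer in templates that users are expected to customize.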

For an application internal solution that would work. But for a generic solution like Help Builder this is a royal pain in the ass for anybody who wants to customize the templates.

Plus, if the data were stored in UTF-8, then editing in plain text (as a number of the other fields do) would show the raw UTF-8 encoding in the FoxPro text box.

So, long story short, this behavior won't change because it would cause so many knock-on side effects. We unfortunately have to live with FoxPro's string limitations.

+++ Rick ---
