REM State

06 Sep

6 Hours Fighting a Rowdy Apostrophe

Well, I’ve put in another 8-hour day, and this one turned out to be just about as grueling as the last. Here I thought I was going to get all this work done, and then one obstinate bug crops up to throw my plans out of whack. What could possibly detain me for the better part of a day? Why, nothing else but the humble apostrophe.

WordPress has a feature — I’m sure users in the audience are quite aware of it — called “texturizing”. I type ‘"foo"’, but the published article contains ‘“foo”’ instead, just as two dashes become a “—”, three dots a “…”, and so on. Well, an ASCII apostrophe/single quote gets turned into an appropriate curly-quote when WordPress texturizes post titles and content, just like all the other fancy characters.
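To make the idea concrete, here is a toy sketch of texturizing in Python. It is an assumption-laden illustration only: the function name `texturize` and its handful of rules are hypothetical, and WordPress's real implementation (written in PHP) covers far more cases.

```python
import re


def texturize(text: str) -> str:
    """Toy approximation of texturizing; not WordPress's actual rules."""
    text = text.replace("...", "\u2026")  # three dots -> ellipsis
    text = text.replace("--", "\u2014")   # two dashes -> long dash
    # An apostrophe between two word characters (as in "That's")
    # becomes a right single curly quote.
    text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)
    # A double quote at the start of the text or after whitespace opens;
    # every double quote left over after that is treated as closing.
    text = re.sub(r'(^|(?<=\s))"', "\u201c", text)
    text = text.replace('"', "\u201d")
    return text


title = texturize('"That\'s all, folks..."')
print(title)
```

Even this tiny version shows why the apostrophe is awkward: the same ASCII character has to become a different symbol depending on what surrounds it.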

Now, there’s another part to this story called HTML entities. You probably know about these as well, if you’ve been around the web enough — they’re things like “&amp;”, “&nbsp;” and (in the case of the curly apostrophe) “&rsquo;”. Now, entities are really handy for folks who type for the web, because unless you’ve got a special keyboard, or are really good with a character map program, you probably can’t type a symbol like “…” offhand. You can type “&hellip;” on any standard keyboard, though — that’s the human reason we have entities. The computer reason is twofold: sometimes a file or editor uses a character encoding that lacks full Unicode support (think of it as if the little gremlin in the computer doesn’t have that character on his keyboard :), and sometimes a given character has a special meaning in the context of the file (like “<” in HTML) and needs to be escaped. In both cases, you need an alternate way to express the characters, and entities provide that.
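Both computer reasons can be seen in a few lines. The sketch below uses Python's standard library rather than PHP, purely as an illustration; the sample strings are made up.

```python
import html

# Reason 1: the target encoding lacks the character. ASCII has no
# ellipsis, so it is written out as a numeric entity instead.
ascii_bytes = "Wait\u2026".encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'Wait&#8230;'

# Reason 2: the character has a special meaning in HTML and must be
# escaped so it is read as text rather than markup.
escaped = html.escape("a < b & c")
print(escaped)  # a &lt; b &amp; c
```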

Naturally, all this causes a great, big, horrifically confusing mess. Since “’” could be an ASCII single-quote, or a named HTML entity, or even a numbered HTML entity (&#8217;), we have to decide on one representation so that no matter what representation the original symbol was in, it’s in the agreed-upon format when we go to make comparisons (this is called a canonical form). The best example is, of course, the text you see right here: “’”, “’”, and “’” all appear the same, even though they were typed in three different ways. One way to canonicalize these values is to convert them to UTF-8, which can encode every Unicode code point (this gremlin has one big keyboard).
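Here is a minimal sketch of that canonicalization, again in Python rather than PHP. The helper name `canonicalize` is hypothetical; `html.unescape` is the standard-library decoder.

```python
import html


def canonicalize(fragment: str) -> bytes:
    """Decode any HTML entities, then encode the result as UTF-8."""
    return html.unescape(fragment).encode("utf-8")


# The curly apostrophe written four different ways.
forms = ["&rsquo;", "&#8217;", "&#x2019;", "\u2019"]
canonical = {canonicalize(form) for form in forms}
print(canonical)  # {b'\xe2\x80\x99'}

# The plain ASCII apostrophe has its own numeric entity, which
# canonicalizes to a single byte.
print(canonicalize("&#039;"))  # b"'"
```

Once everything is in one form, comparing two titles is a plain byte comparison.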

While PHP has, for a long time, had decent support for translating UTF-8 into HTML entities, it has not always had great support for doing the reverse. This makes it difficult and tricky to canonicalize to UTF-8 correctly. WordPress doesn’t use the human-readable names when it outputs code, instead opting for numerical codes such as “&#8217;”. Not all versions of PHP can convert these. Even when using the right function with a supported version of PHP, you still have to pass a special parameter in order for single quotes to be translated (they aren’t, by default). Then, you need to provide one more parameter to specify output in UTF-8; otherwise, the output is in ISO-8859-1, which may not allow all of the symbols to be translated (e.g., curly-quotes). Only if you overcome all three of these obstacles will the properly canonicalized UTF-8 be produced.
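The function in question is presumably PHP's `html_entity_decode`, called with the `ENT_QUOTES` flag and a `'UTF-8'` charset argument; that is my reading, since the post does not name it. The output-encoding obstacle is easy to reproduce in any language. A small Python sketch, with a made-up sample string:

```python
import html

decoded = html.unescape("That&#8217;s all")

# UTF-8 can represent the curly apostrophe (three bytes)...
utf8 = decoded.encode("utf-8")
print(utf8)  # b'That\xe2\x80\x99s all'

# ...but ISO-8859-1 has no slot for it, so the conversion fails.
try:
    decoded.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

Python raises an error here; a decoder limited to ISO-8859-1 output has to leave the entity untouched or lose the character.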

The specific problem I was having was that my tests would try to create posts with apostrophes, quotes, and other goodies in the title. Everything worked — except for the apostrophe. I bashed my head against the problem all day long, totally frustrated at the fact that I could not reproduce the problem outside of my tests. My tests always generated pages with titles like ‘"That&#039;s all, folks..."’ — while attempting the same thing in my browser always resulted in the correct ‘"That's all, folks..."’. Finally, I realized (after reading through the HTTP traffic on the wire) that the tests were indeed creating the post with the correct name — the issue was that when they modified the post, the updates changed the name to the bad value.

The problem? Simpletest was using one of the compatible-but-incomplete methods of canonicalizing the HTML entities into UTF-8. Simpletest didn’t have to read or canonicalize the apostrophe when creating the page, but it did when re-reading the page to modify the post, so the issue always cropped up after the first edit. It was literally one line of code in Simpletest to fix this problem (three, if you count me commenting out the existing code).
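The effect of a compatible-but-incomplete decoder can be sketched as follows. This is not Simpletest's code: `incomplete_decode` is a hypothetical stand-in that only knows a few named entities, written in Python for illustration.

```python
import html


def incomplete_decode(text: str) -> str:
    """Hypothetical partial decoder: handles four named entities only."""
    # "&amp;" is replaced last so that "&amp;lt;" is not decoded twice.
    for entity, char in (("&lt;", "<"), ("&gt;", ">"),
                         ("&quot;", '"'), ("&amp;", "&")):
        text = text.replace(entity, char)
    return text


# The title as it might appear in the HTML of the edit form.
page_title = "&quot;That&#039;s all, folks...&quot;"

partial = incomplete_decode(page_title)
full = html.unescape(page_title)
print(partial)  # "That&#039;s all, folks..."
print(full)     # "That's all, folks..."
```

Submitting the partially decoded value back to the server is what turns a correct title into the bad one after the first edit.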

Sometimes I really hate PHP. ;)

6 Responses to “6 Hours Fighting a Rowdy Apostrophe”

  1. dwebman

    I admire your fortitude. Step up to the bar.

  2. Quandary
    Author Comment

    Well thanks — I try my best. Some days are just like this one, while some are just totally smooth sailing all the way through. That said, I seem to experience a lot more of the former in testing and QA than I do in dev or architecting. Maybe that’s why everyone wants to be a dev. ;)

  3. Robert

    Great plugin. I now have two plugins up and running, in series and orgseries, and I like this one better!

    I have a question though: would it be possible to have the toc in a box on the right, which can be configured?

    And would it be possible to change the “Next in series” link at the bottom to the title of the next article, or to show both, one after the other?

    I tried to look at things myself, but am not a programmer.

    thanks

    Robert

  4. Quandary
    Author Comment

    Yes, to both — though putting the ToC in a box on the right will require a little more work, depending on exactly how you want to do it. I’ll start with changing the Next and Previous links, since that’s relatively easy. :)

    To change how the Next and Previous links render, follow the instructions in the readme to get to the configuration screen. Then, simply change the Previous and Next format fields to contain “<a href='%url' title='%title'>%title</a>” (without the quotes) — this will render the title of the previous/next series as the link text, instead of the static text.

    As for the ToC, you will need to either use CSS to make the ToC render in the manner you want, or modify your site’s theme to insert the ToC information in a more suitable location. By default, the table of contents is wrapped in a <div> tag, with its class attribute set to “series_toc”. You can use this to, e.g., float the whole div off to the right (or whatever you want to do). If you want to go the route of modifying your theme, you’ll want to use the InSeries::ToC() function, which will insert the same ToC data as before, but in a location that’s outside of the post’s content. You’ll probably still want to style it with CSS somehow, but you may need to go the second route if you’re having a hard time getting the ToC to show up in the part of the screen where you want.

  5. Robert

    Thanks, applied those and mostly looks good except for one thing,

    On each page with a post that is not part of a series it shows the box as well, please have a look at: http://www.rugpijnweg.nl/rugpijn-behandeling/
    Here you see a box that should not be there.

    On pages that are part of a series it now starts to look ok, except I have to do some formatting of the ol tag.

    So my question is: how do I get rid of the box on post pages that are not part of a series?

    thanks for your reply!

    Robert

  6. Quandary
    Author Comment

    Use InSeries::adv_CurrentSeries to tell whether or not the current post is in a series. If the response is NULL, then the post is not in a series, and you should not output the box.

    Also, it looks like you have several XHTML errors in your page — you may want to close those <meta> tags, and fix your call to InSeries::ToC() so that it’s in a proper <?php … ?> block.

    Hope that gets everything working all right for you. :)


© 2008 REM State