Dealing with funky characters

As a Mediajunkie blogger, you may have noticed that I republish all the blog entries written by all of us at Telegraph (a page I intend to redesign, by the way, when I find the time). Another thing you may have noticed is that strange characters often turn up in the middle of entries over there. This is caused by incompatibilities in the way that different applications and operating systems handle special characters, particularly curly quotes and dashes.

Unfortunately, Windows (and Microsoft Word) has its own way of generating these characters. They look fine when first posted but end up garbled when an article is reposted, copied, or reblogged elsewhere.
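To see how this kind of garbling can happen, here is a small Python sketch. It assumes, purely for illustration, that the text was written in the Windows-1252 encoding and that one program stores it as UTF-8 while another reads those bytes back as Windows-1252. The actual mismatch on any given site may be different.

```python
# Assumption for illustration: a curly apostrophe typed in Word.
original = "It\u2019s"  # It’s

# In Windows-1252 the curly apostrophe is the single byte 0x92.
as_cp1252 = original.encode("cp1252")

# In UTF-8 the same character takes three bytes (0xE2 0x80 0x99).
as_utf8 = original.encode("utf-8")

# Reading UTF-8 bytes as if they were Windows-1252 turns one character
# into three unrelated ones.
garbled = as_utf8.decode("cp1252")

print(as_cp1252)
print(as_utf8)
print(garbled)
```

The last line prints the apostrophe as three stray characters, which is the sort of thing you see in reposted entries.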

Every now and then I run a search and replace on the blog database to fix the funky characters. This is why you may occasionally see what are called HTML entities in your posts (they start with an ampersand and end with a semicolon and usually have a string of numbers in between).
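A search and replace of this kind can be sketched in a few lines of Python. This is not the actual query I run on the database. The mapping and the function name `replace_funky` are hypothetical, made up for this example.

```python
# Hypothetical mapping for illustration: each "funky" character and the
# numeric HTML entity that replaces it.
ENTITY_FOR = {
    "\u2018": "&#8216;",  # left single quotation mark
    "\u2019": "&#8217;",  # right single quotation mark
    "\u201c": "&#8220;",  # left double quotation mark
    "\u201d": "&#8221;",  # right double quotation mark
    "\u2013": "&#8211;",  # en dash
    "\u2014": "&#8212;",  # em dash
    "\u2026": "&#8230;",  # horizontal ellipsis
}


def replace_funky(text):
    """Swap each character in ENTITY_FOR for its numeric entity."""
    for char, entity in ENTITY_FOR.items():
        text = text.replace(char, entity)
    return text


sample = "It\u2019s \u201cfine\u201d\u2026"
print(replace_funky(sample))  # It&#8217;s &#8220;fine&#8221;&#8230;
```

Because the output is plain ASCII, it survives being copied between systems that disagree about encodings.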

If you want to avoid these character problems, I can recommend three different approaches. Perhaps one will fit your posting style best. These approaches are:

  1. Saving Word documents as plain text before posting.
  2. Using the “Markdown with Smartypants” filter.
  3. Entering the character entities directly yourself.

I’ll explain each approach.

Save as plain text

If you like to compose your entries in Word before posting them to your blog, then when you are ready to post, save the entry (if you’ve saved it already, choose “Save As”) and then in the Save dialog box, choose the Text or ASCII Text or Plain Text format. Line breaks or no, it doesn’t matter. The important thing is, once you’ve done this, close the document and then reopen it in Word. Otherwise, the special characters won’t be cleaned up.

Once you’ve reopened the text version of your post, you can copy and paste it into a new blog post.

This will override Word’s tendency to make quotation marks curly and turn hyphens into dashes.

But what if you do want those more sophisticated-looking typographical elements? That’s where the second approach comes in.

Use the Smartypants filter

Smartypants is a text filter built into our installation of Movable Type that converts straight quotes to curly quotes, single hyphens to en-dashes and double hyphens to em-dashes. If you use this filter, your posts will look typographically slick but you won’t have any funky characters in them. (The filter generates the correct HTML entities.) Speaking of HTML entities, entering them directly is the third approach.
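To give a rough idea of what a filter like this does, here is a much-simplified Python sketch. It is not the real Smartypants code and it skips most of the cases the real filter handles. It only converts double hyphens and straight quotes, and the function name `educate` is just a name chosen for this example.

```python
import re


def educate(text):
    """Very rough imitation of a smart-punctuation filter."""
    # Double hyphens become an em dash entity.
    text = text.replace("--", "&#8212;")
    # A double quote at the start of the text, or right after whitespace
    # or an opening bracket, is treated as an opening quote.
    text = re.sub(r'(^|[\s(\[])"', r"\1&#8220;", text)
    # Any double quote left over is treated as a closing quote.
    text = text.replace('"', "&#8221;")
    # Same idea for single quotes; leftovers also cover apostrophes.
    text = re.sub(r"(^|[\s(\[])'", r"\1&#8216;", text)
    text = text.replace("'", "&#8217;")
    return text


print(educate('She said "it\'s fine" -- really'))
# She said &#8220;it&#8217;s fine&#8221; &#8212; really
```

The point is that you type plain keyboard characters and the filter emits entities, so nothing encoding-dependent ever ends up in the post.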

Use HTML entities

This last approach is probably overkill for most of you, but for completeness, I am going to include a short table of the key HTML entities, what they’re called, and what they look like.

There are actually two ways to enter each of these entities: an abbreviation of the name and a numerical reference (in each case they start with an ampersand and end with a semicolon). If you type either form as-is into your entries, you’ll get the corresponding typographical character:

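Entity     Numeric    Character  Name
&lsquo;    &#8216;    ‘          left single quotation mark
&rsquo;    &#8217;    ’          right single quotation mark
&ldquo;    &#8220;    “          left double quotation mark
&rdquo;    &#8221;    ”          right double quotation mark
&mdash;    &#8212;    —          em dash
&ndash;    &#8211;    –          en dash
&hellip;   &#8230;    …          horizontal ellipsis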
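If you want to convince yourself that the named and numeric forms really produce the same character, a short Python check using the standard `html` module will do it. The list below simply restates the standard entity names and numbers for these seven characters. The variable names are my own.

```python
import html

# (named entity, numeric entity, name) for the seven characters above.
ENTITIES = [
    ("&lsquo;", "&#8216;", "left single quotation mark"),
    ("&rsquo;", "&#8217;", "right single quotation mark"),
    ("&ldquo;", "&#8220;", "left double quotation mark"),
    ("&rdquo;", "&#8221;", "right double quotation mark"),
    ("&mdash;", "&#8212;", "em dash"),
    ("&ndash;", "&#8211;", "en dash"),
    ("&hellip;", "&#8230;", "horizontal ellipsis"),
]

for named, numeric, name in ENTITIES:
    char = html.unescape(named)
    # Both spellings should decode to the same single character.
    assert char == html.unescape(numeric)
    print(f"{named:<9} {numeric:<9} {char}  {name}")
```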

Comments

Another (very easy) approach is to use the iconv() PHP call to filter text before display: http://www.php.net/manual/en/ref.iconv.php

With an MT blog this is a bit trickier, but still doable.
