Dumb smart quotes

Most computer keyboards have about 100 keys, which puts some restrictions on what characters you can directly input. You can’t enter the whole ASCII range using single keys on your keyboard, let alone any of the trillion emoticons in the Unicode standard. There are OS-specific ways of entering extra characters via key combinations, but they’re complicated and need to be memorised. For that reason, most word processors and typesetting programs perform intelligent character substitutions for common non-keyboard characters.

For example, when you press a key that looks like ", the result comes out looking like one of these: “ ” (note the curls). And when you press a key that looks like - and put spaces around it,1 the result will often come out looking like this: – (it’s slightly longer).

If you’re writing prose, this is usually what you want. Instead of manually entering special closing and opening quotes or different types of horizontal lines, you enter the closest approximation your keyboard gives you and let the word processor figure it out.

But prose isn’t the only thing made up of characters typed on your keyboard. Besides poetry, you might also want to type out some programming code. Most programming languages are very strict about the characters you have to use, and, for ease of use, these are usually characters that can be directly entered from single key presses on a standard keyboard (unless you’re programming in APL).

Consider the two lines below. Although they look similar to the human eye, one is valid Python code and the other will generate an error.

mystr = "This is a string."
mystr = “This is a string.”

If we have syntax highlighting enabled, this becomes even clearer.

mystr = "This is a string."
mystr = “This is a string.”
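
The difference is easy to confirm without a syntax highlighter. As a minimal sketch, the snippet below hands both versions of the line to Python's built-in compile() and reports whether each one parses. The helper name is_valid_python is made up for this illustration, and the curly quotes are written as escapes so the example itself survives any character substitution.

```python
def is_valid_python(source):
    """Return True if `source` parses as Python, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


straight = 'mystr = "This is a string."'
# \u201c and \u201d are the left and right curly double quotes.
curly = "mystr = \u201cThis is a string.\u201d"

print(is_valid_python(straight))  # True
print(is_valid_python(curly))     # False
```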

These two lines are similar; one is ls with a single flag, and the other will cause an error.2

ls –almost-all
ls --almost-all

These errors happen because word processors replace certain characters or combinations of characters with completely different characters. The table below gives some common variants and their Unicode code points.

Key   Literal character   Common variants
-     - (U+002D)          – (U+2013), — (U+2014)
'     ' (U+0027)          ‘ (U+2018), ’ (U+2019)
"     " (U+0022)          “ (U+201C), ” (U+201D)

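If you are ever unsure which of these characters you are looking at, Python's standard unicodedata module can tell you. The sketch below prints the code point and official Unicode name for each character in the table. The VARIANTS dictionary and the describe helper are names invented for this example.

```python
import unicodedata

# The typed ASCII character, followed by the variants a word processor may substitute.
VARIANTS = {
    "-": ["\u2013", "\u2014"],
    "'": ["\u2018", "\u2019"],
    '"': ["\u201c", "\u201d"],
}


def describe(ch):
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for typed, variants in VARIANTS.items():
    print(describe(typed))
    for variant in variants:
        print("    " + describe(variant))
```

The first group printed is U+002D HYPHEN-MINUS, followed by U+2013 EN DASH and U+2014 EM DASH.
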
I see these errors a lot in technical blog posts and documentation. I’ve also seen unawareness of these substitutions cause considerable pain to someone who was using an MS Word document as a scratchpad for code and terminal commands. If you’ve never considered the differences between straight and curled quotes, or the differences between a bunch of horizontal lines, it’s an easy enough mistake to make, and people will generally understand what you mean. But it causes little annoyances, such as making code impossible to easily copy-paste and messing up syntax highlighting. And on a literal, perhaps pedantic, level, it makes your code examples incorrect.

Prose and code have very different typographical affordances and should be treated separately. Writing code in a monospaced font is not merely a stylistic choice, but a mechanism for avoiding these issues in sensible document processing systems (i.e. not Microsoft Word).3 Most Markdown processors will avoid making character substitutions in <pre> blocks, and LaTeX generally behaves the same in code listing environments. But if you must use Word, you can generally get around this behaviour with very careful copying and pasting.
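
If a snippet has already been through a word processor, a quick check can catch the damage before you publish it. The following is a rough sketch rather than a complete tool: it covers only the six variants from the table, and the names SUBSTITUTIONS, find_substitutions and straighten are invented here. It also assumes each variant maps back to exactly one ASCII character, which is not always true, since an en dash may have started life as two hyphens.

```python
# Assumed one-to-one mapping from typographic variants back to ASCII.
# This cannot know whether an en dash was originally "-" or "--".
SUBSTITUTIONS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def find_substitutions(text):
    """Return (index, character) for every typographic variant in `text`."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) in SUBSTITUTIONS]


def straighten(text):
    """Replace the known variants with their ASCII counterparts."""
    return text.translate(SUBSTITUTIONS)


snippet = "mystr = \u201cThis is a string.\u201d"
print(find_substitutions(snippet))  # [(8, '“'), (26, '”')]
print(straighten(snippet))          # mystr = "This is a string."
```

Note that straightening the broken ls example would give ls -almost-all, not ls --almost-all, so the detection half is the more trustworthy one. Treat the automatic repair as a starting point and check dashes by hand.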

If you’ve been guilty of this sort of thing in the past, go forth and sin no more.

  1. This behaviour is not as universal as quote transformation – sometimes you need to press - twice in a row to get the effect, and sometimes you’re out of luck and have to use those input codes. ↩︎

  2. Unless you create a file or folder called –almost-all, in which case it will happily list that. ↩︎

  3. Not that there’s anything wrong with writing code in a proportional font (Rob Pike is a fan), so long as you’re careful about character substitutions. ↩︎
