ASCII to My Heart

The ASCII character encoding may now be somewhat obsolete, but its influences on computing still show. There are also interesting things to learn about how it was designed.

All of this, of course, is not at all new. ASCII is now over fifty years old, but I had never before given it much thought. When I actually looked into it, I was pleasantly surprised.

The ASCIIbet

This is a classic ASCII code chart,¹ which is laid out in a way that shows many of its design niceties. An obvious property is the overall ordering of the characters.

The characters are laid out such that common separators appear first, e.g. space, hyphen, full stop and slash, followed by numbers, capital letters, and finally minuscule letters. This layout means that ASCII strings can easily be sorted by comparing their numerical values. The resulting order is called ASCIIbetical.

Hurray for lazy programmers!

ASCII for humans

The layout of the chart also reveals how human-readable ASCII can be. By looking at the first two or three most significant bits, many of the characters become obvious.

Values starting with 011 represent the digit character corresponding to the last four bits. For example, 0110011 is the character “3”.
Values starting with 010 sometimes represent the shifted symbol corresponding to the digit. This was based on the layouts of electric typewriters at the time, so doesn’t hold up as much today. It is still accurate for 1, 3, 4 and 5 on US QWERTY layouts. For example, 0100011 is the character “#”.
Values starting with 10 represent the capital letter at the position in the alphabet of the last 5 bits. For example, 1000011 is the character “C”.
Similarly, values starting with 11 represent the minuscule letter at that position in the alphabet. For example, 1100011 is the character “c”.

There are also other aligned shifted characters, such as ,./ and <>?, which have their third most significant bit flipped. The characters []| also correspond to {}| with their second most significant bit flipped.

Delete

This human-readability would have been much more useful than it is now back when ASCII text was often stored on punched paper tape. Paper tape is also the reason there is one control-code that doesn’t appear in the first 32 positions with the rest.

At position 127, the DEL is represented in binary as 1111111. On paper tape, this means that any previously punched character can be replaced with DEL by punching the remaining holes. This clever bit of design means that characters can be “erased” even when represented physically.

Diacritics

Although this clever design is no longer relevant, there are some less clever decisions that have had a lot of influence. The caret, backtick, and tilde characters were included in ASCII as a sort of work-around, intended to be used as diacritics rather than including accented letters themselves.²

But even though we now have proper accented characters thanks to Unicode, the ASCII diacritics are still commonly used in programming, and are still standard on keyboard layouts.

I sometimes wonder if a non-programmer has ever purposely typed a backtick.

Caret notation

By far the most interesting thing I discovered when reading about ASCII was the origin of caret notation and many common terminal key sequences. For example, the control-C sequence, or ^C.

On old systems, the control key was used literally to enter control-codes. The way this worked was that each character in the 100 and 101 columns (the capital letters and some punctuation), corresponded to a control-code in the 000 and 001 columns.

These sequences were represented with caret notation, so ^@ was NUL, ^A was SOH, and so on. DEL 1111111 was represented by ^?, where ? is 0111111. This means that the control-code for a key could be calculated by inverting the most significant bit.

Most of the ASCII control-codes are no longer relevant, but some of the key sequences are still very common in the terminal.

The most common of these is, of course, ^C, used to interrupt program execution. The corresponding control-code is ETX, end of text.

Similarly, ^D, which indicates end-of-file, corresponds to EOT, end of transmission, in ASCII.

Another somewhat common sequence is ^Z for suspending processes. This comes from the control-code SUB, substitute.

Many terminals still clear the screen on ^L, which corresponds to FF, form feed. This originally caused printers to load the next page.

Probably less well known are the sequences ^S and ^Q, which can be used to pause and resume output. In ASCII, these were simply DC3 and DC1, two of the four device-specific control-codes. The Teletype Model 33 implemented these as XOFF and XON, which turned the tape reader off and on. This definition stuck around.

That’s all

Those are basically what I found to be the interesting parts of the good old 7-bit character encoding from a modern perspective. It might not be useful information day-to-day, but it hopefully makes for fun trivia.

ASCII Code Chart, Wikipedia ↩
The backtick was also used as a left single quote, with ' on the right, which looks terrible. ↩