Communication, Codes and Cyphers

Introduction

1.2 Examples

1.2.1 Morse Code

Morse code is probably the code the lay public would most readily recognize - it was once the international standard for telecommunication. Its source alphabet is the 26 letters of the uppercase Roman alphabet, and it encodes them into a channel alphabet of two symbols: dots and dashes. The translation is as follows:

A  · -        N  - ·
B  - · · ·    O  - - -
C  - · - ·    P  · - - ·
D  - · ·      Q  - - · -
E  ·          R  · - ·
F  · · - ·    S  · · ·
G  - - ·      T  -
H  · · · ·    U  · · -
I  · ·        V  · · · -
J  · - - -    W  · - -
K  - · -      X  - · · -
L  · - · ·    Y  - · - -
M  - -        Z  - - · ·

There are a number of features of Morse code that we will see repeated later on. The first is that each English letter gets translated into a string of dots and dashes, and that the translations, or codewords, of different letters may have different lengths - Morse code is a variable length code. The second, and perhaps most important, feature is that these lengths vary roughly according to how often the letters occur in English - the most common letters, like E and T, have the shortest codewords, while uncommon letters, like Q and Z, have the longest. The correspondence is not exact - O is more common in English than I, yet it has a longer codeword. The reason for this is that Morse developed his code before the techniques we will be looking into later were devised.

However, Morse code as presented above has a serious flaw. If I sent you the message

· · · - - - · · ·
I might mean SOS, but I may also have meant VMS, or EEETTTEEE or a number of other possible messages. In other words, using just dots and dashes, Morse code is not uniquely decipherable. In practice, the way that people got around this was to leave a pause after the end of each letter (usually of the same length as a dash). In effect this is adding another symbol to the channel alphabet: a pause. We will denote this by a slash character /. So what we should have transmitted to make our meaning unambiguous was:
· · · / - - - / · · · /
So we now have a code which is uniquely decipherable (in fact it is now instantaneously decipherable; see Section 1.4). The problem with this change is that it makes all our codewords one symbol longer, which slows down the transmission rate.
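The ambiguity, and the way the pause removes it, can be checked with a short sketch. This is only an illustration: the ASCII characters "." and "-" stand in for dot and dash, and the names MORSE, encode, decodings and decode_with_pauses are our own, not part of any standard.

```python
# Morse table from above, with "." for dot and "-" for dash (an assumption
# made so that the symbols are easy to type).
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def encode(message, pause=""):
    """Concatenate the codewords, optionally ending each with a pause symbol."""
    return "".join(MORSE[letter] + pause for letter in message)


def decodings(signal, prefix=""):
    """Yield every letter string whose pause-free encoding equals signal."""
    if not signal:
        yield prefix
        return
    for letter, codeword in MORSE.items():
        if signal.startswith(codeword):
            yield from decodings(signal[len(codeword):], prefix + letter)


def decode_with_pauses(signal, pause="/"):
    """Decode a signal in which every codeword is followed by a pause."""
    inverse = {codeword: letter for letter, codeword in MORSE.items()}
    return "".join(inverse[word] for word in signal.split(pause) if word)


if __name__ == "__main__":
    candidates = set(decodings(encode("SOS")))
    print("SOS" in candidates, "VMS" in candidates, "EEETTTEEE" in candidates)
    print(encode("SOS", pause="/"))
    print(decode_with_pauses(encode("SOS", pause="/")))
```

Without the pause, decodings finds many readings of the same signal; with it, splitting on / leaves exactly one codeword per letter, at the cost of one extra symbol per letter.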

Morse code is definitely not the best we can do, and this is largely the reason it is no longer a major standard for communication.

1.2.2 ASCII

ASCII, or the American Standard Code for Information Interchange, is now one of the standards for electronic communication, particularly between computers. Indeed it is the basic standard coding for Internet messages. It is also a standard for the representation of data within a computer. The source alphabet for ASCII is a collection of 128 (= 2^7) characters, including upper- and lower-case Roman characters, numbers, punctuation, a <space> character and a variety of control characters such as <return> and <tab>. If you have used a computer at all, the source alphabet of ASCII will be familiar to you as, roughly speaking, the characters that you can type from your keyboard.

Since, internally, computers work in binary, the channel alphabet is simply the two binary digits (or bits) 0 and 1. The encoding takes each source character and translates it to a string of seven 0's and 1's. For example, the upper-case A becomes 1000001, while <space> becomes 0100000. The complete code is:

Code     Character   Code     Character   Code     Character   Code     Character
0000000  NUL         0100000  SP          1000000  @           1100000  `
0000001  SOH         0100001  !           1000001  A           1100001  a
0000010  STX         0100010  "           1000010  B           1100010  b
0000011  ETX         0100011  #           1000011  C           1100011  c
0000100  EOT         0100100  $           1000100  D           1100100  d
0000101  ENQ         0100101  %           1000101  E           1100101  e
0000110  ACK         0100110  &           1000110  F           1100110  f
0000111  BEL         0100111  '           1000111  G           1100111  g
0001000  BS          0101000  (           1001000  H           1101000  h
0001001  HT          0101001  )           1001001  I           1101001  i
0001010  LF          0101010  *           1001010  J           1101010  j
0001011  VT          0101011  +           1001011  K           1101011  k
0001100  FF          0101100  ,           1001100  L           1101100  l
0001101  CR          0101101  -           1001101  M           1101101  m
0001110  SO          0101110  .           1001110  N           1101110  n
0001111  SI          0101111  /           1001111  O           1101111  o
0010000  DLE         0110000  0           1010000  P           1110000  p
0010001  DC1         0110001  1           1010001  Q           1110001  q
0010010  DC2         0110010  2           1010010  R           1110010  r
0010011  DC3         0110011  3           1010011  S           1110011  s
0010100  DC4         0110100  4           1010100  T           1110100  t
0010101  NAK         0110101  5           1010101  U           1110101  u
0010110  SYN         0110110  6           1010110  V           1110110  v
0010111  ETB         0110111  7           1010111  W           1110111  w
0011000  CAN         0111000  8           1011000  X           1111000  x
0011001  EM          0111001  9           1011001  Y           1111001  y
0011010  SUB         0111010  :           1011010  Z           1111010  z
0011011  ESC         0111011  ;           1011011  [           1111011  {
0011100  FS          0111100  <           1011100  \           1111100  |
0011101  GS          0111101  =           1011101  ]           1111101  }
0011110  RS          0111110  >           1011110  ^           1111110  ~
0011111  US          0111111  ?           1011111  _           1111111  DEL

The first thirty-two characters are all control codes, of which the most commonly used are CR (Carriage Return) and LF (Line Feed); they are followed immediately by SP (SPace). Others, like BS (BackSpace) and ESC (ESCape), together with DEL (DELete) at the very end of the table, may be familiar as keys on a standard computer keyboard.

Since computers usually store information in bytes, which hold 8 bits, we have one bit left over. What this bit is used for can vary: some systems will use it to extend the code to allow representation of additional characters; often, however, it is used in a simple error detection scheme called a parity check.

Parity checking involves looking at the codeword for a character, and counting the number of 1's in it. If this number is odd, then we set the eighth bit to be 1; if the number is even, we set the eighth bit to 0. This means that the total number of 1's in the 8 bit string is now always even. So if the decoder receives a codeword which has an odd number of 1's, we know that there must have been an error in the channel. Unfortunately, we have no way of telling what the error was. This is no problem if the decoder can contact the sender and ask for the appropriate bit of the message to be sent again. However, if two errors happen in the codeword, we will have no way of knowing, since we will once again have an even number of 1's in our codeword.
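The parity scheme just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are our own, and we assume the parity bit is appended after the seven code bits (the text above does not fix its position).

```python
def to_7bit(character):
    """Return the 7-bit ASCII codeword of a character as a string of 0s and 1s."""
    code = ord(character)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b")


def add_parity(bits7):
    """Append an eighth bit so that the total number of 1's is even.

    Assumption: the parity bit goes at the end of the codeword.
    """
    parity = bits7.count("1") % 2
    return bits7 + str(parity)


def check_parity(bits8):
    """Return True if the 8-bit string has an even number of 1's."""
    return bits8.count("1") % 2 == 0


def flip(bits, position):
    """Simulate a channel error by inverting the bit at the given position."""
    flipped = "1" if bits[position] == "0" else "0"
    return bits[:position] + flipped + bits[position + 1:]


if __name__ == "__main__":
    sent = add_parity(to_7bit("A"))          # 1000001 has two 1's, so parity is 0
    print(sent, check_parity(sent))
    one_error = flip(sent, 3)
    print(one_error, check_parity(one_error))    # a single error is detected
    two_errors = flip(one_error, 5)
    print(two_errors, check_parity(two_errors))  # a double error goes unnoticed
```

A failed check tells the decoder only that something went wrong, not which bit, and two errors in the same codeword restore even parity and pass the check.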

1.2.3 ISBN

Most books and other similar (non-periodical) publications have an ISBN, or International Standard Book Number. This is a 10-digit code number that can usually be found on the same page as the publication details or on the back cover. A typical example might be:

0-19-853287-3
The ISBN consists of 4 parts, separated by dashes (the actual length of each part may vary). The first is a code for the country or language group, the second identifies the publisher, the third identifies the particular book, and the last is a single error-check digit.

The tenth digit is chosen so that if the ISBN has digits abcdefghij then the number given by the formula

10a + 9b + 8c + 7d + 6e + 5f + 4g + 3h + 2i + j
is evenly divisible by 11 (this might mean that j would have to be 10, and in that case an X is used for the tenth "digit"). This scheme can not only detect if one of the digits is changed, but can also detect if two of the digits are swapped. We will see why this is true when we talk about modulo arithmetic, but a couple of examples should convince you that it works.

Example

Suppose that the ISBN above gets changed by an error (the sixth digit, 3, becomes 7) to

0-19-857287-3
Then we have that the formula gives
0 + 9 + 72 + 56 + 30 + 35 + 8 + 24 + 14 + 3 = 251
and this has remainder 9 when we divide by 11.

Similarly, if the 9 and the 2 in the original ISBN were to swap places, giving

0-12-853987-3
then we would get that the formula gives
0 + 9 + 16 + 56 + 30 + 15 + 36 + 24 + 14 + 3 = 203
and this has remainder 5 when we divide by 11.
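The check is easy to mechanise. The sketch below is illustrative only: the function names are our own, and it assumes the ISBN is given as a string of ten digits (or a final X) with optional dashes.

```python
DIGITS = "0123456789"


def isbn10_values(isbn):
    """Return the ten digit values of an ISBN, with a final X counting as 10."""
    symbols = [c for c in isbn if c != "-"]
    if len(symbols) != 10:
        raise ValueError("an ISBN of this kind has exactly 10 digits")
    values = []
    for position, symbol in enumerate(symbols):
        if symbol in DIGITS:
            values.append(int(symbol))
        elif symbol in "xX" and position == 9:
            values.append(10)
        else:
            raise ValueError("unexpected symbol: " + symbol)
    return values


def weighted_sum(isbn):
    """Compute 10a + 9b + 8c + 7d + 6e + 5f + 4g + 3h + 2i + j."""
    weights = range(10, 0, -1)
    return sum(w * v for w, v in zip(weights, isbn10_values(isbn)))


def is_valid(isbn):
    """An ISBN passes the check when the weighted sum is divisible by 11."""
    return weighted_sum(isbn) % 11 == 0


def check_digit(first_nine):
    """Choose the tenth digit for the first nine, returning X when it is 10."""
    values = [int(c) for c in first_nine if c in DIGITS]
    if len(values) != 9:
        raise ValueError("expected exactly 9 digits")
    partial = sum(w * v for w, v in zip(range(10, 1, -1), values))
    j = (-partial) % 11
    return "X" if j == 10 else str(j)


if __name__ == "__main__":
    for isbn in ("0-19-853287-3", "0-19-857287-3", "0-12-853987-3"):
        total = weighted_sum(isbn)
        print(isbn, total, total % 11, is_valid(isbn))
```

For the original ISBN the sum is 231 = 21 x 11, so it passes, while the two corrupted versions give the remainders 9 and 5 computed above and are rejected.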


1.2.4 Substitution Cyphers

The simple substitution cypher dates back at least to Roman times. The way it works is that you simply mix up (or permute) the letters of your alphabet and replace each letter in your message by the corresponding letter in the permuted alphabet.

For example, on the USENET newsgroup rec.humor.funny, sick or offensive jokes are usually encoded so that people who do not want to read them do not have to. The code used is called ROT-13, where every letter of the alphabet is shifted by 13 letters. We can write this permutation as:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
NOPQRSTUVWXYZABCDEFGHIJKLM
by which we mean we change all the A's to N's, B's to O's, and so forth. Using this code the message:
WE SHOULD MAKE A BIG WOODEN HORSE
gets translated to:
JR FUBHYQ ZNXR N OVT JBBQRA UBEFR
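
As a quick sketch, ROT-13 can be written as a translation table built from the two rows above. The names ROT13_TABLE and rot13 are our own; only upper-case letters are handled, as in the example, and everything else passes through unchanged.

```python
import string

UPPER = string.ascii_uppercase

# Map ABC...Z onto NOP...M, exactly the permutation written out above.
ROT13_TABLE = str.maketrans(UPPER, UPPER[13:] + UPPER[:13])


def rot13(message):
    """Replace each upper-case letter by the letter 13 places further on."""
    return message.translate(ROT13_TABLE)


if __name__ == "__main__":
    secret = rot13("WE SHOULD MAKE A BIG WOODEN HORSE")
    print(secret)
    print(rot13(secret))
```

Because the alphabet has 26 letters, shifting by 13 twice returns every letter to where it started, so the same function both encodes and decodes.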

Given a long enough message encrypted by a substitution cypher, it is not hard to break the cypher and work out what the original message is. The usual approach is to look at the frequency with which the letters in the encrypted message occur: if the message is long enough, then the frequency of an encrypted letter should approach the frequency of the unencrypted letter to which it corresponds. In fact, using a little educated guesswork and knowledge of English, it is often possible to work out what the unencrypted version must be for a message with as few as 50 letters.
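
The first step of such an attack, counting how often each letter occurs, might look like the sketch below. It is only a starting point: the function name letter_frequencies is our own, and matching the counts against typical English frequencies is left to the reader's guesswork.

```python
from collections import Counter


def letter_frequencies(text):
    """Return each letter's share of the letters in text, most common first."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.most_common()}


if __name__ == "__main__":
    cyphertext = "JR FUBHYQ ZNXR N OVT JBBQRA UBEFR"
    for letter, share in letter_frequencies(cyphertext).items():
        print(letter, round(share, 3))
```

Even in this 27-letter message, R and B come out on top with four occurrences each, and they stand for E and O, two of the most common letters in English. A message this short is not enough to finish the job by counting alone, which is where the educated guesswork comes in.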

Needless to say, you do not want to be using a substitution cypher to transmit national secrets.
