Alphabets, Words, and Languages

Automata are among the most studied models in computer science. They look simple, yet play a key role in both practice and theory. In practice, they show up in text search, compilers, hardware design, and tools that prove software correctness. In theory, their simple structure has made them a basic formalism for the study of fundamental notions such as nondeterminism and the limits of computation.

This lesson is the first in a series about automata and regular languages.
We start with the basic notions needed to describe the simplest machine — the deterministic finite automaton (DFA, for short). Specifically, we will talk about alphabets, words, and languages.

Alphabets and Words

Words are everywhere around us — we express and communicate using them, any input that we feed to a program can be viewed as a word, and even programs themselves are words, made of letters, numbers, spaces, new lines, and other symbols. Words, simple as they are, form one of the basic building blocks in the context of computation and can encode all sorts of objects. This is not surprising: it’s exactly how we encode, send and receive information, and words are what machines basically process when they operate.

To define words we first need to agree on a universal set of symbols, an alphabet. We follow the standard model and consider finite words over finite alphabets. Formally, an alphabet $\Sigma$ is a finite non-empty set, and we refer to its elements as letters. A finite word $w$ over $\Sigma$ is a finite (possibly empty) sequence of letters from $\Sigma$:

First, let $\epsilon$ denote the empty sequence of letters, define $\Sigma^0 = \{ \epsilon\}$, and for every natural $n\geq 1$, we consider a long Cartesian product of $\Sigma$ with itself $$\Sigma^n = \underbrace{\Sigma\times \Sigma\times \Sigma \times \cdots \times \Sigma}_{n \text{ times}} =$$ $$= \{ (a_1, a_2, \ldots, a_n): a_i\in \Sigma \text{ for all $i\in \{ 1, 2, \ldots, n\}$} \}$$

Then, $w$ is a finite word over $\Sigma$ when there is a natural $n\geq 0$ such that $w\in \Sigma^n$.

Let’s look at an example. Consider the binary alphabet $\Sigma = \{ 0, 1\}$. Then, $u = (1, 1, 0)$, $v = (0, 0, 0 , 1, 1, 1)$, and $w = (1,0,1)$ are words over $\Sigma$ as they are finite sequences of letters from $\Sigma$. Indeed, $u, w\in \Sigma^3$ and $v\in \Sigma^6$. Note that although the words $u$ and $w$ are both elements in $\Sigma^3$ and consist of the same letters, they are distinct as order matters — the 3-tuple $(1, 0, 1)$ is distinct from the 3-tuple $(1, 1, 0)$.

For every word $w$ there exists a unique natural $n\geq 0$ with $w\in \Sigma^n$, and we refer to such $n$ as the length of $w$, denoted $|w|$. For convenience, we omit the commas and parenthesis, and write $101$ to denote the word $w = (1, 0, 1)$. Moreover, we use powers to abbreviate repeating letters, and write $1^3$ to denote the word $111$. Thus, we can write $v = 0^31^3 = 000111$. We will define this abbreviation more formally soon.

The word $w = 101$ is of length $3$ as it consists of 3 letters, and it is easy to see that $|1^{17}001^{17}| = 17+2 + 17 = 36$ and $|0^31^3| = 3 + 3 = 6$.

For an alphabet $\Sigma$, we let $\Sigma^*$ denote the set of finite words over $\Sigma$: $$ \Sigma^* = \bigcup_{i\geq 0} \Sigma^i = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \cdots $$

taking union over the empty word $\epsilon$ (elements in $\Sigma^0$), words of length 1 (elements in $\Sigma^1 = \Sigma$), words of length 2 (elements in $\Sigma^2$), and so on.

Now that you got the idea, let’s proceed with some simple questions.

Question 1. Consider the alphabet $\Sigma = \{ a, b\}$. List all words of length 3 over $\Sigma$.

Click to show the answer.

The words of length $3$ over $\Sigma$ are the elements in the set $\Sigma^3 = \{aaa, aab, aba, abb, baa, bab, bba, bbb \}$.

Question 2. Consider an alphabet $\Sigma$, and a natural $n\geq 0$. How many words over $\Sigma$ of length $n$ there is?

Click to show the answer.

We show that there are $|\Sigma|^n$ words of length $n$.

We distinguish between two cases.
If $n = 0$, then there is a single word of length 0, $\epsilon$, and in this case $1 = |\Sigma|^n$.

Consider now the case where $n\geq 1$.
Any letter in a word of length $n$ is one of $|\Sigma|$ candidate letters. Therefore, the number of words of length $n$ equals $|\Sigma|^n$.
Alternatively, you can use the fact that the size of a long Cartesian product equals the product of the sizes of its sets. Thus,
$$|\Sigma^n| = |\underbrace{\Sigma \times \Sigma \cdots \times \Sigma}_{n \text{ times}}| = \prod_{i=1}^n |\Sigma| = |\Sigma|^n$$

For a natural $n\geq 0$, let $[n]$ denote the set $\{1, 2, \ldots, n \}$; in particular $[0] = \emptyset$. We can concatenate words by putting them next to each other to form a longer word. Formally, if $w_1 = a_1 a_2\cdots a_n$ and $w_2 = b_1 b_2\dots b_m$ are words over $\Sigma$, where for all $i\in [n], j\in [m]$, $a_i$ and $b_j$ are letters, then we define the concatenation $w_1\cdot w_2$ as the word $a_1a_2\cdots a_nb_1b_2\cdots b_m$ over $\Sigma$.

For example, consider the alphabet $\Sigma = \{ a, b\}$, and the words $w_1 = abb$ and $w_2 = ab$. Then, $w_1 \cdot w_2 = abb \cdot ab = abbab$, and $\epsilon \cdot w_2 = \epsilon\cdot ab = ab = w_2 = ab = ab\cdot \epsilon = w_2\cdot \epsilon$.

The following is a simple exercise and suggests that concatenation is associative.

Question 3 Consider an alphabet $\Sigma$, and words $x, y$ and $z$ in $\Sigma^*$. Show that $(x\cdot y) \cdot z = x\cdot (y\cdot z)$.

Click to show the answer.

We first write the words $x, y$ and $z$ as sequences of letters $x = x_1x_2\cdots x_n, y = y_1y_2\cdots y_m$ and $z = z_1z_2\cdots z_l$. Then, by definition of concatenation, we have that $$ (x\cdot y) \cdot z = (x_1x_2\cdots x_ny_1y_2\cdots y_m) \cdot z = x_1x_2\cdots x_ny_1y_2\cdots y_mz_1z_2\cdots z_l $$

$$= x \cdot (y_1y_2\cdots y_m z_1z_2\cdots z_l) = x \cdot (y \cdot z)$$

For a word $w\in \Sigma^*$, we define $w^0 = \epsilon$, and for all $i\geq 1$, $ w^i$ is defined as the long concatenation $w^i = \underbrace{w\cdot w \cdot w \cdots w}_{i \text{ times}}$. Note that as concatenation is associative, then rearranging the parenthesis in long concatenations does not change the resulting word — this is the reason why we omit parenthesis in long concatenations.

In the next question, you’re asked to define powers of words more formally.

Question 4. Consider an alphabet $\Sigma$, a word $w\in \Sigma^*$, and a natural $i\geq 0$. Define $w^i$ by using induction.

Click to show the answer.

As parenthesis can be arranged arbitrarily in long concatenations, we can define $w^i$ by induction on $i$ as follows. For the base case, $i=0$, we define $w^0 = \epsilon$. Then, for the inductive step, we define $w^{i+1} = w^{i}\cdot w$.

Consider the words $x$ and $w$. We say that $x$ is a prefix of $w$ if there is a (possibly empty) word $y$ with $ w = x\cdot y$, $x$ is a suffix of $w$ if there is a (possibly empty) word $y$ with $w = y\cdot x$, and $x$ is an infix of $w$ if there are (possibly empty) words $y$ and $z$ with $w = y\cdot x\cdot z$.

For example, if $w = aaba$, then $w$ is both a prefix and a suffix of itself, $ab$ is an infix of $w$ but neither a prefix nor a suffix of $w$, and $ba$ is a suffix but not a prefix of $w$.

Languages

Automata are abstract machines that get their inputs as words. After processing an input word, the automaton either accepts or rejects it, deciding whether it belongs to some set under concern. For example, we can model a password checker as an automaton that gets as input a candidate password, and decides whether the password satisfies some conditions, e.g., it is of length at least at least 8, contains at least one english-letter, and contains at least two numbers. Passwords that pass the criteria are accepted, and those that do not are rejected.

This raises a natural definition — a language: a set of words over some alphabet. We define it as a set because each word is either accepted or rejected (binary membership), and the order or repetition of words does not matter. This abstraction lets us reason about what a machine decides independently of how it is implemented.

Formally, let $\Sigma$ be an alphabet. We say that $L$ is a language over $\Sigma$ when it is a set of words over $\Sigma$. Thus, $L\subseteq \Sigma^*$.

Example

Consider the alphabet $\Sigma = \{ a, b\}$. The following are languages over $\Sigma$:

$L = \emptyset$ — the empty langauge.
$L = \Sigma^*$ — the universal language (the set of all words).
$L = \{ w\in \Sigma^*: 10 \leq |w| \leq 100 \}$ — a finite language (the set of all words of length at least 10 and at most 100).
$L = \{ w: w \text{ contains the infix } bb\}$ — an infinite language.
$L = \{ \epsilon\}$ — a nonempty language containing only the empty word.

Operations on Languages

Languages are sets, and so we can apply on them any operation that can be applied on sets such as intersection, union, complementation, etc. Specifically, consider two languages $L_1, L_2\subseteq \Sigma^*$ over an alphabet $\Sigma$. Then, the following are also languages over $\Sigma$:

Union of languages: $L_1\cup L_2$ is the language $L_1\cup L_2 = \{ x\in \Sigma^*: x\in L_1 \text{ or } x\in L_2\}$ of words that belong to at least one of $L_1$ and $L_2$.
Intersection of languages: $L_1\cap L_2$ is the language $L_1\cap L_2 = \{ x\in \Sigma^*: x\in L_1 \text{ and } x\in L_2\}$ of words that belong to both $L_1$ and $L_2$.
Difference of languages: $L_1\setminus L_2$ is the language $L_1\setminus L_2 = \{ x\in \Sigma^*: x\in L_1 \text{ and } x\notin L_2\}$ of words that belong to $L_1$ yet do not belong to $L_2$.
Complement of a language: $\overline{L_1}$ is the language $\overline{L_1} = \{ x\in \Sigma^*: x\notin L_1\}$ of words that do not belong to $L_1$.

Note that we complement a language $L$ over $\Sigma$ with respect to the universal language $\Sigma^*$. Thus, $\overline{L} = \Sigma^*\setminus L$.

We now extend concatenation of words to concatenation of languages. Let $L_1$ and $L_2$ be languages over $\Sigma$. The concatenation $L_1 \cdot L_2$ is a language over $\Sigma$ defined as: $$ L_1\cdot L_2 = \{ x\cdot y: x\in L_1, y\in L_2\} $$

Thus, $L_1\cdot L_2$ is the language of words $w$ that can be written as a concatenation $x\cdot y$ of two words $x$ and $y$ that belong to $L_1$ and $L_2$, respectively. Note that order matters in the sense that $x$ must be in $L_1$ and $y$ in $L_2$ (not the other way around).

Example

Consider the alphabet $\Sigma = \{ a, b\}$ and the languages $L_1 = \{ w\in \Sigma^*: w \text{ starts with } a \text{ and ends with } b\}$ and $L_2 = \{ w\in \Sigma^*: w \text{ ends with } a\}$. The following hold:

$\overline{L_2} = \{ w\in \Sigma^*: w \text{ ends with } b \text{ or } w=\epsilon\}$.
$L_1\cap L_2 = \emptyset$: there is no word thats ends with $a$ and $b$ simultaneously.
$\emptyset \cdot L_1 = \emptyset$: the empty language has no candidate word that can be concatenated with a word from $L_1$.
$L_2\cdot L_1 = \{ w\in \Sigma^*: w \text{ contains } aa \text{ and ends with }b\}$:
for the first direction, consider $w\in L_2\cdot L_1$. By definition, there are $x\in L_2$ and $y\in L_1$ with $w = x\cdot y$. As $L_1$ and $L_2$ specify, $x $ and $y$ are of the form $x=x’\cdot a$ and $y = a\cdot y’ \cdot b$ for some words $x’, y’\in \Sigma^*$. Then, $w = x\cdot y = (x’ \cdot a) \cdot (a \cdot y’ \cdot b) = x’\cdot aa \cdot y’ \cdot b$. Therefore, $w$ contains $aa$ and ends with $b$.

For the other direction, consider a word $w$ of the form $w = x’ \cdot aa \cdot y’ \cdot b$. Then, we can partition $w$ into two words $w = x\cdot y$, where $x = x’ \cdot a$ and $y = a\cdot y’ \cdot b$, and it is not hard to verfiy that $x$ and $y$ belong to $L_2$ and $L_1$, respectively.

Note that in the last item in the example above, we considered two directions. This is crucial: to prove that two sets $S$ and $T$ are equal, we must show both inclusions $S\subseteq T$ and $T\subseteq S$. One side may be easy, the other might be trickier, so don’t skip either. Keep this in mind.

What’s Next

Today was about the building blocks of automata and machines in general: alphabets, words, and languages. In the next lesson, we will get into DFAs — deterministic finite automata, what they are, and how they operate.

Stay tuned!

Automata and Regular Languages — Lessons