8 minutes read

When you think about data, various images may come to mind – text, sequences of numbers, or even a combination of different types. And you'd be right in all cases! There's a versatile data type that can represent this wide variety of data, and that's the string.

Strings are among the most frequently used data types for a good reason. They play a fundamental role in multiple aspects of our lives: translating texts, working with documents in text editors, following scrolling text on TV programs, and searching the Internet for song lyrics. Without strings, these activities would not be possible!

What is a string?

A string is an ordered array of characters. The character can be anything: a letter of the Greek alphabet, a number, or a strange Unicode symbol ֍. However, here we face the first complexity of strings: a string in programming is not always a meaningful word or sentence, like in natural language. In general, a string is a sequence of very different characters that can be printed on a computer.

For instance, here is a string:

Hello! My name is John Snow and I am from Winterfell!

And this is a string too:

q1#%0)n⍟

You can also notice that a string can consist not only of one word, but also of several. If there is a space character in a string, it is still one string, not two.

As mentioned earlier, strings are an ordered array. This means that every character in a string corresponds to an index. The counting of characters in strings traditionally starts from zero.

Ordered array

Accessing the needed symbol by its index usually looks like s[i]s[i]. Here, ss is the string and ii is the index.

For instance, for string s=s = string basics and i=5i = 5, s[i]s[i] is g. In programming languages, strings are usually indicated by single or double quotes in order to emphasize that what is in quotes is one whole, one string.

The length of a string s s is the total number of characters it contains. We denote it by s |s| . Some examples are the following:

  • 100101 — a string of length 6;

  • GATTACA — a string of length 7;

  • string basics — a string of length 13.

A string can have zero length. In this case, we call it an empty string.

Operations on strings

We can perform numerous operations on strings. Let's take a brief look at a few of them:

  • reverse(s) returns the reversed string, i.e. the string written backward. For example, reverse(LIVE) = EVIL;

  • concat(s,t) concatenates the given strings ss and tt. For instance, concat(STR, INGS) = STRINGS;

  • compare(s, t) compares the given strings ss and tt. For example, compare(JON, JOHN) = false;

  • get_symbol(s, i) returns the character in the given string ss at the index ii. For example, get_symbol(ANIMALS, 1) = N. Note that if you call this function from an index that is bigger than the word length, the function returns with an error;

  • length(s) returns the length of a given string ss. For instance, length(HELLO) = 5.

Pretty often, we might need to compare two given strings. We can do this by scanning and comparing every element of the first string with the corresponding element of the same index in the other string. More precisely, given the strings ss and tt, we check for every ii whether s[i]=t[i]s[i] = t[i]. If they all match, then we conclude that s=ts=t.

Here is a pseudocode for better visualization of comparing:

if length(s) != length(t) then: // if the lengths do not match, we can say that the strings are different
    return False
for i to length(s): // alternatively, we can put here the length of the string t, as it is the same as the length of the string s
    if s[i] != t[i] then: // if characters do not match, we conclude that strings are not the same
        return False
return True

It is not necessary to write your own functions for string operations, because there are a few libraries or built-in methods that allow the usage of string operations without coding them from scratch. For example, in Python language you can perform operations with strings as with numbers, using ++ for concatenation and ==== for comparing; in C there is a library string.h with a lot of useful string functions.

Substrings

A substring is a contiguous subsequence of symbols of an original string. Naturally, a substring is called proper if it doesn't coincide with the whole string.

For instance, ATTA is a substring of the string GATTACA because the string GATTACA contains the sequence ATTA. Note that GATTA, TT, TAC, and CA are also substrings of the given string. Actually, there are many more substrings in our string: try to find by yourself some substrings of GATTACA, other than the ones mentioned above.

In terms of notation, the substring of the string s s starting from the i i -th and ending with the jj-th symbol is denoted by s[i,j]s[i,j]. For ss = GATTACA, s[1,4]=s[1,4] = ATTA.

The substring

Obviously, the only nonproper substring of a string is the string itself. An empty string is a substring of any string.

Now it's time to introduce some peculiar types of substrings. A substring starting from the beginning of a string is a prefix, and a substring ending with the last index is a suffix. For example, s[0,2]=s[0,2] = GAT is a prefix of s s , and s[4,6]=s[4,6] = ACA is a suffix of s s .

Prefiz and suffix

The prefix can end anywhere in the main word, as well as the suffix can begin anywhere. Hence, there are more prefixes of GATTACA: G, GA, GATT, GATTA, GATTAC, GATTACA, and its suffixes are, for example, GATTACA, ATTACA, TTACA, TACA, CA, and A.

The problem of searching for prefixes and suffixes is common because it can solve the issue of finding the origin of the word by dropping prefixes and suffixes or counting the number of occurrences of words starting with a particular beginning.

The whole word is a suffix and a prefix for itself as well. Also, empty prefixes and suffixes can exist.

What if a certain prefix and suffix coincide? We end up with a new term: a border. Formally, the border of a string is a non-empty substring that is both a proper prefix and a proper suffix of the string. Proper, as we remember, means that the whole string does not count as a substring. The longest border of ATTA is A.

Prefix, border and suffix

As confusing as it might be, a prefix and a suffix can overlap. Hence, the longest border of ABABAB is ABAB, as shown below:

The longest border

Besides, there is another border in ABABAB, such as AB. In general, there is more than one border in a given string. However, we are interested in considering the longest one, because we can use it in calculating the prefix function, a magical tool to work with strings, as we will discuss later.

Of course, a string may not have a border, because not always there is a proper prefix that is equal to a proper suffix. A string is called unbordered if it's impossible to find any border in it.

Strings applications

The fact that strings can represent any textual information makes them widely used in a variety of fields:

  • String-searching algorithms. If you are interested in developing a text editor and want to add a function for searching for a pattern in a text, then you need to implement an algorithm that finds the positions of all occurrences of a particular string in the text. It is classically done with help of the prefix function, which depends on finding the longest borders.

  • DNA similarity measure. Calculating how strings differ from each other is used to measure the variation between two strands of DNA. There are a lot of ways of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other.

  • Natural language translation. Determining the language of the text based on the probabilities of characters and syllables is a complex task that requires a lot of string functions, such as searching for the context of words or searching for a substring by prefix while looking through a set of words in the dictionary.

  • Checking for plagiarism. Have you ever wondered how modern plagiarism detection programs work? Nowadays they perform online verification using substring search algorithms among a large number of documents stored in their own database. Who knows, maybe you will be the one who will improve the plagiarism verification system so that no student will be able to cheat while preparing an essay anymore. But this will only happen if you study string basics well!

  • Analysis of people's requests. Almost everything we look for on the Internet is saved and analyzed by special programs, which then offer us personalized advertising. Our requests are saved as strings and then transmitted to the advertising selection programs. Programs that generate a relevant logo based on the invented name of the company work the same way.

Conclusion

A string is a handy data type that is represented as an ordered sequence of different characters. Strings get indexes starting from zero, and usually are framed with quotes. Besides, there are empty strings those that have zero length.

There are a lot of string functions, such as reversing a string, getting its length, or returning a character by index. There are also some methods for comparing and concatenating two strings.

A huge section of strings' area of implementation is processing substrings. With the help of substrings, plagiarism verification systems work, and words are searched in dictionaries. Sometimes when we talk about substrings, we mean prefixes or suffixes that begin or end strings, respectively. If a prefix is equal to a suffix, the substring is called the border.

Since strings are widely used in many areas, most programming languages have tools to work with them. With the help of these methods and manual functions, you can implement translation functions, measure the distance between strings, and determine the language of a text!

For these reasons, it's essential to be familiar with some standard and well-known string processing algorithms and their applications. We will examine some of them in the following topics.

125 learners liked this piece of theory. 11 didn't like it. What about you?
Report a typo