
Unicode package


Every character or symbol in this text has a numeric representation, because computers ultimately work with numbers. To display a symbol, a computer looks up its number in an encoding table. Older systems each used their own character encoding tables. However, the growth in the number of computers and in global connectivity created the need for a universal standard, and that standard is Unicode.

An encoding table is only useful if there are tools to work with it. Let's get to know how Golang works with Unicode characters.

The basics of Unicode

In Go, the base data type for representing a Unicode character is called a rune. You may recall that a rune is an alias for the int32 data type. To create a rune value representing a single Unicode character, you need to enclose the character or symbol within single quotes '':

fmt.Println('a') // 97

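fmt.Println('a') prints a number because rune is an alias for int32, so the value is formatted as an integer: 97 is the Unicode code point of the letter a.

The int32 type also sets the size of a rune: every rune occupies 32 bits (4 bytes), which is enough for any Unicode code point. Inside a string, however, characters are stored in UTF-8 encoding and take from 1 to 4 bytes each. Let's try to output a single Unicode symbol from a string: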
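If you want to see the character itself rather than its number, the fmt formatting verbs can help. The sketch below assumes a hypothetical helper named describeRune (it is not part of the standard library) and uses the %c, %d, and %U verbs to show the same rune as a character, as a decimal number, and in Unicode code point notation:

```go
package main

import "fmt"

// describeRune is a hypothetical helper for illustration: it formats a rune
// as its character (%c), its decimal value (%d), and its code point (%U).
func describeRune(r rune) string {
	return fmt.Sprintf("%c %d %U", r, r, r)
}

func main() {
	fmt.Println(describeRune('a')) // a 97 U+0061
	fmt.Println(describeRune('µ')) // µ 181 U+00B5
}
```

Converting with string(r) is another way to get the character back as text.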

package main

import "fmt"

func main() {
    symbol := "µ"

    fmt.Println(symbol[0])  // 194
    fmt.Println(symbol[:2]) // µ
}

You might remember that a string in Golang is a read-only sequence of bytes that behaves much like a byte slice ([]byte) when indexed. If you try to access a single character in a string using an index such as symbol[0], you will get a single byte value (here 194, the first of the two bytes that encode µ in UTF-8), not a whole Unicode character.

To get a whole character this way, you need to slice out all the bytes that encode it. In the second Println statement, symbol[:2] takes the first two bytes of the string, which together encode a single Unicode character, so the µ character is printed correctly. This only works if you know in advance how many bytes the character occupies.
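When the byte length of a character is not known in advance, the standard unicode/utf8 package can decode it for you. Here is a minimal sketch: firstRune is a hypothetical wrapper name chosen for this example, while utf8.DecodeRuneInString is the standard function that returns the first rune of a string together with its width in bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical helper for illustration: it returns the first
// character of s and the number of bytes that character occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	r, size := firstRune("µm")
	fmt.Printf("%c %d\n", r, size) // µ 2

	// Converting the string to []rune lets you index by character position.
	runes := []rune("µm")
	fmt.Println(string(runes[1])) // m
}
```

Note that the []rune conversion copies the whole string, so it is best kept for cases where you really need access by character position.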

Handle Unicode symbols

You may have noticed that slicing is not a convenient way to access individual characters or symbols in a string, because you have to know how many bytes each character takes. Instead, you can iterate over the symbols of a string with a range loop. The example below showcases the difference between a byte-by-byte index loop and a range loop over a string:

package main

import "fmt"

func main() {
    unicodeString := "µ¶"

    fmt.Println("Regular iterator:")
    for i := 0; i < len(unicodeString); i++ {
        fmt.Printf("%d: %q\n", i, unicodeString[i])
    }

    fmt.Println("Range iterator:")
    for i, char := range unicodeString {
        fmt.Printf("%d: %q\n", i, char)
    }
}

// Output:
// Regular iterator:
// 0: 'Â'
// 1: 'µ'
// 2: 'Â'
// 3: '¶'
// Range iterator:
// 0: 'µ'
// 2: '¶'

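Pay attention to the indexes. The range loop decodes the string one character at a time, so it yields whole characters instead of separate bytes. The index it returns is still a byte position: it is the offset of the first byte of each character, which is why it jumps from 0 to 2.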
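The same difference shows up when you measure a string. The sketch below compares len, which counts bytes, with utf8.RuneCountInString, which counts characters; runeOffsets is a hypothetical helper written for this example that collects the index produced by a range loop for each character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets is a hypothetical helper for illustration: it returns the byte
// offset at which each character of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "µ¶"

	fmt.Println(len(s))                    // 4 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 2 (characters)
	fmt.Println(runeOffsets(s))            // [0 2]
}
```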

The symbol definition

Unicode characters or symbols can be grouped by attributes. Each attribute combines characters into a range: digits, whitespace, control symbols, alphabetical symbols, and so on. Ranges are not mutually exclusive, so one symbol can belong to several ranges at once. The most commonly used Unicode ranges are defined as variables in the unicode package.

The unicode package has various helper functions to check if a specific character is in a given range:

package main

import (
    "fmt"
    "unicode"
)

func main() {
    fmt.Println(unicode.IsDigit('1'))    // true
    fmt.Println(unicode.IsUpper('a'))    // false
    fmt.Println(unicode.IsLower('a'))    // true
    fmt.Println(unicode.IsControl('\n')) // true
    fmt.Println(unicode.IsSpace('\n'))   // true
}

Some basic ranges have functions like:

IsDigit to check if the character is a decimal digit (0-9, as well as decimal digits from other scripts);
IsUpper and IsLower to check the case of a character (for example, uppercase A-Z or lowercase a-z);
IsControl to check if the character is a control symbol (for example, \x00, \n, \r);
IsSpace to check if the character is a whitespace symbol (for example, a space, \n, \t);
And many other functions that you can find in the unicode package documentation.

All of these functions have the same signature: they take a rune value (for example, a single character or symbol enclosed within single quotes '') and return a bool value, true if the symbol is in the range or false if it is not.
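Because these checks share one signature, they can be passed around as values of type func(rune) bool. The sketch below relies on that: countIf is a hypothetical helper (not a standard library function) that counts the characters of a string for which a given check returns true.

```go
package main

import (
	"fmt"
	"unicode"
)

// countIf is a hypothetical helper for illustration: it counts the characters
// of s for which pred returns true.
func countIf(s string, pred func(rune) bool) int {
	n := 0
	for _, r := range s {
		if pred(r) {
			n++
		}
	}
	return n
}

func main() {
	s := "Go 1.22\n"

	fmt.Println(countIf(s, unicode.IsDigit)) // 3
	fmt.Println(countIf(s, unicode.IsUpper)) // 1
	fmt.Println(countIf(s, unicode.IsSpace)) // 2
}
```

The same helper works with your own checks, as long as they take a rune and return a bool.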

You will notice there are more ranges than functions if you read the documentation. Only the most commonly used ranges have their own functions. To check whether a rune is in a specific range or ranges, you can use the In helper:

package main

import (
    "fmt"
    "unicode"
)

func main() {
    fmt.Println(unicode.In('ǈ', unicode.Latin)) // true
}

The first argument is a rune, and the following arguments are one or more Unicode ranges. If the symbol is included in at least one of the given ranges, the function returns true, otherwise false.
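Since In accepts any number of ranges, one call can check several of them at once. In the sketch below, isLatinOrGreek is a hypothetical helper named for this example; unicode.Latin, unicode.Greek, and unicode.Cyrillic are range variables from the unicode package, and unicode.Is is the single-range counterpart of In (note that it takes the range first and the rune second).

```go
package main

import (
	"fmt"
	"unicode"
)

// isLatinOrGreek is a hypothetical helper for illustration: it reports
// whether r belongs to the Latin or the Greek range.
func isLatinOrGreek(r rune) bool {
	return unicode.In(r, unicode.Latin, unicode.Greek)
}

func main() {
	fmt.Println(isLatinOrGreek('ǈ')) // true
	fmt.Println(isLatinOrGreek('λ')) // true
	fmt.Println(isLatinOrGreek('я')) // false

	// unicode.Is checks a single range.
	fmt.Println(unicode.Is(unicode.Cyrillic, 'я')) // true
}
```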

The symbol conversion

The unicode package also contains helper functions to accurately convert the case of a symbol or character:

package main

import (
    "fmt"
    "unicode"
)

func main() {
    fmt.Println(string(unicode.ToLower('A'))) // a
    fmt.Println(string(unicode.ToUpper('a'))) // A
    fmt.Println(string(unicode.ToTitle('a'))) // A
}

All functions take the same type of argument (a rune variable or a single character enclosed within single quotes '') and have the same output (a rune variable in the chosen case). ToLower and ToUpper are intuitive functions: one converts a character to lowercase and the other to uppercase.

But the ToTitle function raises a question: if we can only pass a single character to these functions, what is the difference between ToTitle and ToUpper? Let's take a look at the output of the example below:

package main

import (
    "fmt"
    "unicode"
)

func main() {
    fmt.Println(string(unicode.ToLower('ǈ'))) // ǉ
    fmt.Println(string(unicode.ToUpper('ǈ'))) // Ǉ
    fmt.Println(string(unicode.ToTitle('ǈ'))) // ǈ
}

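ǈ is a single symbol of the Latin range, even though it looks like two letters (it's not a joke, it's a standard). For such symbols, ToUpper turns both parts into capitals (Ǉ), while ToTitle capitalizes only the first part (ǈ). For most characters, including a plain a, ToTitle and ToUpper return the same result.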
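One place where this distinction matters is capitalizing the first letter of a word. The sketch below is only an illustration under simple assumptions: capitalize is a hypothetical helper, it changes nothing but the first character, and it expects valid UTF-8 input. It uses ToTitle rather than ToUpper so that a character like ǉ becomes ǈ instead of Ǉ.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// capitalize is a hypothetical helper for illustration: it converts the first
// character of word to title case and leaves the rest unchanged.
func capitalize(word string) string {
	r, size := utf8.DecodeRuneInString(word)
	if size == 0 {
		return word // empty string, nothing to convert
	}
	return string(unicode.ToTitle(r)) + word[size:]
}

func main() {
	fmt.Println(capitalize("ǉubav")) // ǈubav
	fmt.Println(capitalize("go"))    // Go

	// unicode.IsTitle reports whether a character is a title case letter.
	fmt.Println(unicode.IsTitle('ǈ')) // true
}
```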

Conclusion

Golang, like any other modern language, has tools and helper functions for working with Unicode characters. A unified encoding table helps people and computers interact with each other. Let's recap what tools Golang uses to process Unicode:

A Unicode character is represented by a variable of the rune type;
The info about the range of Unicode characters is stored within the unicode package variables;
The unicode package contains helper functions to check if a specific character is in a given range;
The unicode package contains helper functions to accurately convert the case of a character.
