Our computers do not have the inherent capability to understand unmediated human language; they need special encoding systems to convert it into machine language first. Unicode is one such character encoding standard, helping our computers comprehend the text we input and process it across different environments without a hitch. Like the majority of today's programming languages, JavaScript supports the Unicode character set and provides built-in functions for manipulating Unicode characters.
Unicode
In machine language, characters are represented using sequences of 0s and 1s. ASCII (American Standard Code for Information Interchange), one of the earliest and most widely used encoding schemes, uses a combination of 7 binary digits to represent each character. This allows for a total of 128 different characters. Due to this limited number of combinations, ASCII covers only a small set of characters, such as basic Latin letters, numbers, and a few symbols.
As a more refined alternative to ASCII, Unicode is a universal character encoding standard, also used by most modern programming languages today. It's a character set that provides a universal representation for a far more diverse collection of characters and symbols, including those from scripts like Latin, Cyrillic, Chinese, and Arabic, and even emojis! 🤩
Code points
UTF-8 (8-Bit Unicode Transformation Format) and UTF-16 (16-Bit Unicode Transformation Format) are encoding schemes that define rules for representing Unicode characters in binary form. Within the Unicode character set, every character is assigned a unique address. These address values, known as code points, enable precise identification of characters and can be written either as hexadecimal values (hexadecimal being a number system that uses 16 symbols: 0 to 9 and A to F) preceded by the U+ prefix, or simply as decimal numbers. For example, the letter "A" has the code point U+0041, or 65 in decimal, and the heart symbol "❤️" is assigned the code point U+2764, or 10084, in the Unicode chart.
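You can verify this mapping yourself in a JavaScript console. Here is a quick sketch (the codePointAt() method used below is covered in detail later in this article):
// The hexadecimal value 0x41 and the decimal value 65 are the same number.
console.log(0x41 === 65); // true
// Convert the code point of "A" to hexadecimal to get its U+ notation.
console.log("A".codePointAt(0).toString(16).toUpperCase()); // "41", i.e. U+0041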
The key difference between UTF-8 and UTF-16 is that UTF-8 uses a minimum of 8 bits (1 byte) per character, while UTF-16 uses a minimum of 16 bits (2 bytes). However, some characters in Unicode can't fit within a single 16-bit unit. These characters have code points beyond the BMP (Basic Multilingual Plane), which covers the most commonly used characters in Unicode, and they live in the supplementary planes, whose code points range from U+10000 to U+10FFFF. One of these planes, the Supplementary Multilingual Plane (SMP), includes code points from U+10000 to U+1FFFF. To represent the code points of characters beyond the BMP, UTF-16 encoding utilizes a special method referred to as surrogate pairs.
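One way to observe these size differences in JavaScript is to compare a string's length property, which counts UTF-16 code units, with the byte count produced by the TextEncoder API, which produces UTF-8. This is a small sketch assuming an environment with TextEncoder, such as a modern browser or Node.js:
const utf8 = new TextEncoder();
// "A" needs one byte in UTF-8 and one 16-bit unit in UTF-16.
console.log(utf8.encode("A").length); // 1 (byte)
console.log("A".length); // 1 (code unit)
// "🐇" (U+1F407) lies beyond the BMP: four UTF-8 bytes, two UTF-16 units.
console.log(utf8.encode("🐇").length); // 4 (bytes)
console.log("🐇".length); // 2 (code units, a surrogate pair)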
Surrogate pairs
Surrogate pairs are combinations of two special 16-bit values. By using surrogate pairs, UTF-16 can represent characters that require more than 16 bits. The high and low surrogate values work together to encode and represent these extended characters accurately. To exemplify, the emoji "🐇" is represented by the surrogate pair U+D83D U+DC07 in UTF-16, or 55357 56327 in decimal representation.
If you look at the code points mentioned earlier for the letter "A" and the heart symbol, you will see that they don't need surrogate pairs because their code points are smaller than U+FFFF. On the other hand, the rabbit emoji has a code point greater than U+FFFF, so it does require a surrogate pair.
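The two surrogate values can be derived from a code point with a bit of arithmetic: subtract 0x10000, then split the result into its top 10 bits and bottom 10 bits. Here is a minimal sketch of that calculation (toSurrogatePair is a hypothetical helper, not a built-in):
// Hypothetical helper: compute the UTF-16 surrogate pair for a
// code point beyond U+FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f407); // the code point of "🐇"
console.log(high.toString(16), low.toString(16)); // "d83d" "dc07"
console.log(high, low); // 55357 56327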
Unicode functions in JavaScript
JavaScript includes multiple functions and methods specifically designed to work with Unicode characters, and they can be applied to various data types, including strings, arrays, and objects. These functions enable interaction with code points, converting them to characters and facilitating operations on Unicode-encoded strings. Let's dive in to understand their inner workings!
- codePointAt()
codePointAt() is a method that allows us to work with Unicode code points. To use codePointAt(), we call it on a string and pass in the index of a character. It will return the Unicode code point of the character at the given index.
In the following example, we have a string called text. To identify the character at index 11 in this string, which is the lowercase letter "w" in this case, we first use console.log(text[11]). Then, to find the Unicode code point of the character "w", we pass 11 as the argument to the codePointAt() method and print the returned value to the console.
let text = "Follow the white rabbit.";
let code = text.codePointAt(11);
console.log(text[11]); // "w"
console.log(code); // 119
- charCodeAt()
Similar to the codePointAt() method, charCodeAt() also takes the index of a character within the string as a parameter. However, the key distinction that sets it apart from codePointAt() is that it operates on UTF-16 code units rather than full Unicode code points. So, what does that suggest?
Here's the answer: a UTF-16 code unit is a 16-bit value within the range of 0 to 65535 in decimal. So the applicability of the charCodeAt() method is limited to characters within the Basic Multilingual Plane (BMP). Unicode, on the other hand, is a character set that encompasses a much larger range of characters, of which the BMP is only a part. Although both codePointAt() and charCodeAt() return integers representing character codes, charCodeAt() is constrained to the UTF-16 code unit range of 0 to 65535, whereas codePointAt() is capable of handling the entire range of Unicode values.
let text = "Follow the white rabbit.";
let code = text.charCodeAt(11);
console.log(text[11]); // "w"
console.log(code); // 119
When we look at the example, we can see that the only change is the method we use, yet the output remains the same. The reason is that the lowercase "w" character is part of the BMP, so its Unicode code point and its UTF-16 code unit have the same value.
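The same holds for other BMP characters, not just Latin letters. For instance, the heart symbol from earlier gives identical results with both methods (a quick sketch using the plain U+2764 heart):
let heart = "❤";
console.log(heart.codePointAt(0)); // 10084
console.log(heart.charCodeAt(0)); // 10084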
Let's consider the situation where our character lies outside the Basic Multilingual Plane (BMP). What would happen then?
let text = "Follow the 🐇.";
let code1 = text.codePointAt(11);
let code2 = text.charCodeAt(11);
console.log(code1); // 128007
console.log(code2); // 55357
At this point, surrogate pairs come into play! In this example, the character "🐇" is represented by the Unicode code point U+1F407. As you remember, this character has a surrogate pair in UTF-16: "🐇" is U+D83D U+DC07 in hexadecimal representation. Since it lies beyond the BMP, it requires more than one UTF-16 code unit to represent it. That's why, when you use the codePointAt(11) method, it correctly returns the full Unicode code point of "🐇", which is 128007 in decimal. On the other hand, when you use the charCodeAt(11) method, it only returns the UTF-16 code unit of the first surrogate of "🐇", which is 55357 in decimal. This is because charCodeAt() operates on UTF-16 code units and doesn't handle surrogate pairs as a single entity.
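Because string indexing and charCodeAt() both work on code units, looping over a string index by index can split a surrogate pair in half. If you need to walk a string by full code points, the for...of loop is a safer tool, since JavaScript's string iterator steps through code points. A short sketch:
let text = "Follow the 🐇.";
// length counts UTF-16 code units, so the rabbit counts twice.
console.log(text.length); // 14, although we see only 13 characters
// for...of iterates by code points, keeping "🐇" in one piece.
for (const ch of "🐇.") {
  console.log(ch, ch.codePointAt(0));
}
// "🐇" 128007
// "." 46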
- String.fromCodePoint() and String.fromCharCode()
Lastly, we use the String.fromCharCode() and String.fromCodePoint() methods in JavaScript to convert numeric character codes into strings. Again, the main difference between String.fromCodePoint() and String.fromCharCode() lies in the range of characters they can handle and the kind of input they accept: the former works with full Unicode code points, while the latter operates on UTF-16 code units. These methods are useful when we need to work with characters that are represented by their numerical codes.
const text1 = String.fromCodePoint(84, 72, 69, 32, 77, 65, 84, 82, 73, 88);
console.log(text1); // "THE MATRIX"
const text2 = String.fromCharCode(84, 72, 69, 32, 77, 65, 84, 82, 73, 88);
console.log(text2); // "THE MATRIX"
In this example, the code values correspond to characters within the BMP, so the resulting string is the same regardless of whether you use String.fromCharCode() or String.fromCodePoint().
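The difference shows up once we step outside the BMP. String.fromCodePoint() accepts the full code point directly, while String.fromCharCode() needs the two surrogate values instead, as this short sketch with the rabbit emoji illustrates:
console.log(String.fromCodePoint(128007)); // "🐇"
// fromCharCode() works only if we supply the surrogate pair ourselves.
console.log(String.fromCharCode(55357, 56327)); // "🐇"
// Passing the full code point to fromCharCode() silently truncates it.
console.log(String.fromCharCode(128007) === "🐇"); // false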
Conclusion
Our computers rely on encoding systems to comprehend human language. Unicode is a universal standard for encoding and representing a wide range of characters and symbols. UTF-8 and UTF-16 are encoding schemes within this standard, and they define rules for representing characters in binary form. Surrogate pairs, in turn, are pairs of 16-bit values within the UTF-16 system that allow for the representation of characters whose code points don't fit in 16 bits. The JavaScript programming language offers methods like codePointAt() and charCodeAt() to handle Unicode code points and UTF-16 code units. Finally, String.fromCodePoint() and String.fromCharCode() convert numeric character codes into strings. These methods help us work with and convert characters in different forms within the JavaScript context.