Why length of emoji (😜) is 2 in JavaScript

When we work with strings in JavaScript, it's important to understand how they are stored in memory and how their length is calculated.

JavaScript strings are sequences of UTF-16 (16-bit Unicode Transformation Format) code units. Each character is represented by one or two 16-bit code units, depending on its Unicode code point: code points up to U+FFFF fit in a single code unit, while code points above U+FFFF need two. Engines are free to optimize how strings are laid out in memory, but the language behaves as if every string were stored this way.

For example, the Unicode code point of the "😜" emoji is U+1F61C. Because this is above U+FFFF, JavaScript represents the emoji with two 16-bit code units: '\uD83D' and '\uDE1C'. The first code unit is the high surrogate and the second is the low surrogate; together they form a surrogate pair.
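The arithmetic behind that surrogate pair can be sketched as follows. This is a minimal illustration of the standard UTF-16 splitting rule, not code from any engine; `toSurrogatePair` is a hypothetical helper name chosen for this example.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper used only for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F61C);
console.log(high.toString(16), low.toString(16)); // d83d de1c
console.log(String.fromCharCode(high, low) === "\u{1F61C}"); // true
```

Subtracting 0x10000 leaves a 20-bit number, which is why exactly two 10-bit halves (and therefore two code units) are always enough for any code point above U+FFFF.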

When we access the .length property of a string in JavaScript, the engine looks up and returns the number of code units occupied by the string. This means that the length of a string is determined by the number of 16-bit code units it contains.

To demonstrate this, let's take a look at some code:

const s = "😜"
console.log(s.length) // Output: 2

In this code, we define a string s that contains the "😜" emoji. We then log the length of s to the console using the .length property. As we saw earlier, the "😜" emoji is represented by two 16-bit code units, so the length of the string is 2.
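To look at the individual code units rather than just count them, a short sketch like the one below may help. It uses only the standard `charCodeAt`, `codePointAt`, and string iteration; the variable names are illustrative.

```javascript
const emoji = "\u{1F61C}"; // the 😜 emoji written as a code point escape

// Each string index addresses one 16-bit code unit.
const units = [];
for (let i = 0; i < emoji.length; i++) {
  units.push(emoji.charCodeAt(i).toString(16));
}
console.log(units); // [ 'd83d', 'de1c' ]

// codePointAt reads the whole surrogate pair when it starts on a high surrogate.
console.log(emoji.codePointAt(0).toString(16)); // 1f61c

// The string iterator walks code points, so spreading yields a single element.
const codePointCount = [...emoji].length;
console.log(codePointCount); // 1
```

When a count of code points is needed instead of code units, `[...str].length` is one common way to get it.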

It is tempting to confirm this with the TextEncoder API, but TextEncoder always produces UTF-8, so it reports the emoji as 4 bytes rather than as UTF-16 code units. To see the code units themselves, we can copy them out of the string with charCodeAt and compare the result with the UTF-8 byte count:

const s = '😜'

// TextEncoder always emits UTF-8: the emoji becomes the 4 bytes F0 9F 98 9C
const utf8 = new TextEncoder().encode(s)

// Copy each UTF-16 code unit of the string into a Uint16Array
const ar16 = Uint16Array.from({ length: s.length }, (_, i) => s.charCodeAt(i))

console.log("UTF-8 Bytes", utf8.length) // 4
console.log("Array Length", ar16.length) // 2
console.log("Emoji Length", s.length) // 2

In this code, we first encode the "😜" emoji with TextEncoder, which gives a Uint8Array of 4 UTF-8 bytes. We then build a Uint16Array from the string's own code units and log its length to the console. As expected, the length of the Uint16Array is 2, which matches the length of the original string, while the UTF-8 byte count is a different number because it measures a different encoding.

It's worth noting that not all characters in a string are represented by a single 16-bit code unit. Characters with a code point greater than U+FFFF (known as "supplementary characters") are represented by two 16-bit code units. Some emojis are also sequences of several code points. For example, the "👨‍👩‍👧‍👦" family emoji consists of four emojis (man, woman, girl, boy) joined by three zero-width joiners (U+200D). Each of the four emojis takes two code units and each joiner takes one, so the whole sequence is represented by 11 16-bit code units.
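The family emoji can be measured at three levels: code units, code points, and user-perceived characters. The sketch below assumes a runtime that provides `Intl.Segmenter` for the last step and guards the call in case it does not; the variable names are illustrative.

```javascript
// Man, woman, girl, boy joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);      // 11 UTF-16 code units (4 * 2 + 3 * 1)
console.log([...family].length); // 7 code points (4 emojis + 3 joiners)

// Intl.Segmenter groups code points into user-perceived characters (graphemes).
// Guarded because older runtimes may not provide it.
let graphemeCount = null;
if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  graphemeCount = [...segmenter.segment(family)].length;
}
console.log(graphemeCount); // expected to be 1 where Intl.Segmenter is available
```

Which of the three counts is the right one depends on the task: storage and indexing work in code units, while anything shown to a user (such as a character limit) is usually better served by grapheme counts.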

In conclusion, JavaScript represents strings using the UTF-16 encoding, which means that each character takes one or two 16-bit code units. When we access the .length property of a string, JavaScript returns the number of code units it contains, not the number of visible characters. This is important to keep in mind when working with strings that contain emojis or other supplementary characters.
