Viki

Where Viki writes


Rethinking Emoji: Lessons from Splitting Emoji Strings

This article is fairly long. If you only want the solution, skip to the summary at the end.

Everyone is familiar with emoji: pictographic symbols widely used on web pages and in chats, such as 😂 and 😄.

Although emoji are valid string content, their counterintuitive lengths and many variants can produce unexpected results when a string is split. For example:

'😃⛔'.split('') // ['\uD83D', '\uDE03', '⛔']

What? How did two symbols become three after splitting? And why are there garbled characters?

Don't panic, let's first take a look at their lengths.

'⛔'.length // 1
'😃'.length // 2
'👦🏾'.length // 4
'🏳️‍🌈'.length // 6
'👨‍👨‍👧‍👧'.length // 11

Oh no, the more we look, the more absurd it gets. What exactly is going on? Why do some emoji turn into garbled characters after splitting while others don't? Why isn't the length of every emoji 1?

Let's continue reading with these questions in mind.

Rediscovering Emoji

Emoji are pictographic symbols that originated in Japanese mobile communication. In China they are usually called "little yellow faces" or simply emoji. After Apple added an emoji keyboard to the iOS 5 input method, they swept the world. Most modern Unicode-compatible systems now support emoji, and they are widely used in text messages and on social networks.

Unicode 6.0, released in October 2010, was the first version to encode emoji, and the emoji are spread across several Unicode blocks.

In addition to these regular emoji, Unicode 8.0 added five modifiers: 🏻 🏼 🏽 🏾 🏿. A modifier placed after certain human-figure emoji changes the skin tone of that emoji. These are called the emoji Fitzpatrick modifiers and correspond to the Fitzpatrick scale for classifying human skin color.

For example: 👦 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿 and 👧 👧🏻 👧🏼 👧🏽 👧🏾 👧🏿.
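To make the modifier mechanism concrete, here is a minimal sketch that builds a skin-toned emoji from its two code points. The variable names are just illustrative and do not come from any library.

```javascript
// A skin-toned emoji is simply a base emoji followed by a modifier code point.
const boy = '\u{1F466}'        // 👦 BOY
const mediumDark = '\u{1F3FE}' // 🏾 EMOJI MODIFIER FITZPATRICK TYPE-5
const tonedBoy = boy + mediumDark

console.log(tonedBoy)             // drawn as a single glyph where the system supports it
console.log(tonedBoy.length)      // 4 (UTF-16 code units)
console.log([...tonedBoy].length) // 2 (code points)
```

So what looks like one character is really two code points, each of which takes two UTF-16 code units.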

There are also emoji built by joining two or more emoji with U+200D Zero Width Joiner (ZWJ) so that they appear as a single emoji (e.g., 👨‍👩‍👧). If the system supports the sequence, it is displayed as one family emoji made up of a man, a woman, and a girl; systems that do not support it display the three emoji one after another (👨👩👧). Gender variants are built the same way: for example, a person emoji followed by a zero width joiner and ♂ produces the male version.
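The joining described above can be reproduced by hand. This is only a sketch with made-up variable names; whether the result is drawn as one glyph depends on the font and system.

```javascript
// Join three emoji with U+200D ZERO WIDTH JOINER to form a family sequence.
const ZWJ = '\u200D'
const man = '\u{1F468}'   // 👨
const woman = '\u{1F469}' // 👩
const girl = '\u{1F467}'  // 👧
const family = [man, woman, girl].join(ZWJ)

console.log(family)            // one family glyph where the sequence is supported
console.log(family.length)     // 8: three emoji x 2 code units + two joiners
console.log(family.split(ZWJ)) // the three original emoji again
```

Splitting on the joiner gives back the individual emoji, which is exactly what an unsupported system ends up displaying.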

For the Unicode specification of emoji, see Unicode Technical Standard #51 (Unicode Emoji), which defines the structure of Unicode emoji characters and sequences and provides data to support that structure.

It is precisely this variety that makes the result so unintuitive when emoji are split like ordinary strings.
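The questions from the beginning can now be answered by looking at what a string is actually made of. JavaScript strings are sequences of UTF-16 code units, and `length` and `split('')` both work on code units. The sketch below uses two hypothetical helpers, `codeUnits` and `codePoints`, to show the difference.

```javascript
// Hypothetical helper: the UTF-16 code units of a string, as 4-digit hex.
const codeUnits = str =>
  str.split('').map(u => u.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'))

// Hypothetical helper: the Unicode code points of a string, as hex.
const codePoints = str =>
  Array.from(str, ch => ch.codePointAt(0).toString(16).toUpperCase())

// ⛔ (U+26D4) fits in a single code unit, so it survives split('').
console.log(codeUnits('\u26D4'))     // ['26D4']

// 😃 (U+1F603) is above U+FFFF and needs a surrogate pair.
// split('') tears the pair apart, which is where the "garbled" halves come from.
console.log(codeUnits('\u{1F603}'))  // ['D83D', 'DE03']
console.log(codePoints('\u{1F603}')) // ['1F603']

// The family of four: 4 emoji x 2 code units + 3 joiners = 11 code units, 7 code points.
const familyOfFour = '\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F467}'
console.log(familyOfFour.length)             // 11
console.log(codePoints(familyOfFour).length) // 7
```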

So what can we do to solve this problem?

Initial Attempts (the Final Solutions Are Summarized at the End of the Article)

Most emoji live above U+FFFF, so in a JavaScript string each of them is stored as a surrogate pair. We can try a regular expression that matches surrogate pairs, split the string on those matches, and then filter out the empty strings that split leaves behind.

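function emojiStringToArray(str) {
  // A high surrogate followed by a low surrogate, i.e. one code point above U+FFFF.
  // The capture group makes split() keep each matched pair in the result.
  const reg = /([\uD800-\uDBFF][\uDC00-\uDFFF])/
  // split() also produces empty strings between adjacent matches, so drop them.
  return str.split(reg).filter(Boolean)
}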

Let's test how this function performs in actual use:

emojiStringToArray('😴⛔🎠🚓🚇') // ['😴', '⛔', '🎠', '🚓', '🚇']

For regular emoji the result looks fine, but the function falls short for the skin tone and combined emoji mentioned earlier, as the following examples show.

emojiStringToArray('👨‍👨‍👧‍👧') // ['👨', '‍', '👨', '‍', '👧', '‍', '👧']
emojiStringToArray('👦🏾') // ['👦', '🏾']
// Note: if the second element renders as a boxed question mark, it is the skin tone modifier, which some systems cannot display on its own.

Wow, the first call just broke up a whole family. Nicely done.

Wait a minute, didn't we use filter(Boolean) to remove empty strings? Why does the resulting array still contain "empty strings"?

Oh no... could it be?

Let's test this "empty string" that was output:

'‍' === '' // false

if ('‍') {
  console.log('This is true!') // Successfully prints This is true!
}

Wow, it turns out that this is not an empty string at all.

If you were paying attention during the introduction to combined emoji, you may have guessed that this "empty string" is actually the U+200D Zero Width Joiner (ZWJ) mentioned earlier. It looks no different from an empty string, but it is a real character, used specifically to connect emoji into combined emoji.
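When a string looks empty but is not, checking its length and code point settles the matter quickly. A small sketch (the variable names are only for illustration):

```javascript
// Split a two-person ZWJ sequence by code point and inspect the middle element.
const parts = [...'\u{1F468}\u200D\u{1F467}']
const invisible = parts[1]

console.log(invisible.length)                      // 1, so it is not an empty string
console.log(invisible.codePointAt(0).toString(16)) // '200d'
console.log(invisible === '\u200D')                // true
console.log(Boolean(invisible))                    // true, which is why filter(Boolean) keeps it
```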

We can also try the ES6 spread syntax.

;[...'😴⛔🎠🚓🚇'] // ['😴', '⛔', '🎠', '🚓', '🚇']
;[...'👨‍👨‍👧‍👧'] // ['👨', '‍', '👨', '‍', '👧', '‍', '👧']
;[...'👦🏾'] // ['👦', '🏾']

There is also Array.from(), but trying it out gives the same result.

Array.from('😴⛔🎠🚓🚇') // ['😴', '⛔', '🎠', '🚓', '🚇']
Array.from('👨‍👨‍👧‍👧') // ['👨', '‍', '👨', '‍', '👧', '‍', '👧']
Array.from('👦🏾') // ['👦', '🏾']

All of these approaches split the string by code point, and there is nothing wrong with them as such. The problem is that some emoji are not a single code point: they carry extra parts, such as a skin tone modifier or the joiners of a combined emoji. To split emoji accurately, both of these special cases must be taken into account.

Optimized Solution

Use Intl.Segmenter for splitting.

Many people may not be familiar with Intl, or may be seeing it for the first time. I admit that I had rarely come across it and had never really used it. Here is how MDN describes it: The Intl object is the namespace for the ECMAScript Internationalization API, which provides language-sensitive string comparison, number formatting, and date and time formatting.

The Intl.Segmenter object provides locale-sensitive text segmentation, allowing you to split a string into meaningful segments (graphemes, words, or sentences).
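The kind of segment is chosen with the `granularity` option, which defaults to `'grapheme'`. The sketch below shows the other two values on a made-up English sentence; it assumes a runtime that ships Intl.Segmenter.

```javascript
const text = 'Hello world. Bye!'

// 'word' granularity also returns spaces and punctuation as segments;
// isWordLike marks the segments that are actual words.
const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' })
const words = [...wordSegmenter.segment(text)]
  .filter(s => s.isWordLike)
  .map(s => s.segment)
console.log(words) // ['Hello', 'world', 'Bye']

// 'sentence' granularity keeps trailing whitespace with the sentence it follows.
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' })
const sentences = [...sentenceSegmenter.segment(text)].map(s => s.segment)
console.log(sentences)
```

For splitting emoji, the default `'grapheme'` granularity is the one we want.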

Let's try Intl.Segmenter.

const splitEmoji = string => {
  // The default granularity is 'grapheme', i.e. user-perceived characters
  const segments = new Intl.Segmenter().segment(string)
  return [...segments].map(e => e.segment)
}

splitEmoji('😴😄😃⛔🎠🚓🚇') // ['😴', '😄', '😃', '⛔', '🎠', '🚓', '🚇']

Great! But the earlier methods already handle these basic emoji. Let's look at the harder cases: skin tone emoji and combined emoji.

splitEmoji('👨‍👨‍👧‍👧👦🏾') // ['👨‍👨‍👧‍👧', '👦🏾']

Nice! This is exactly the result we wanted.

But since most of us have barely heard of this API, how well is it supported, and can it be used in production? According to Can I Use, it was supported by 89.7% of browsers in use worldwide (mobile and desktop) at the time of writing. So unless your project has to cover older browsers, you can use Intl.Segmenter with confidence.
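If you do have to run in environments without Intl.Segmenter, one option is to feature-detect it and degrade gracefully. This is only a sketch: `splitGraphemes` is a hypothetical helper name, and the fallback is the code point split from earlier, so it still breaks skin tone and ZWJ sequences apart.

```javascript
// Hypothetical helper: split by grapheme when Intl.Segmenter exists,
// otherwise fall back to splitting by code point.
const splitGraphemes = str => {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' })
    return Array.from(segmenter.segment(str), s => s.segment)
  }
  // Fallback: keeps surrogate pairs intact, but not modifiers or ZWJ sequences.
  return Array.from(str)
}

// Family of four followed by a skin-toned boy, written with escapes.
const sample = '\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F467}\u{1F466}\u{1F3FE}'
console.log(splitGraphemes(sample).length) // 2 when Intl.Segmenter is available
```

Whether a degraded fallback is acceptable depends on the project; if it is not, a library is the safer choice.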

Open Source Community Solutions

Emoji have been around for a long time, so the open source community ran into this problem long ago. One of the more mature community solutions is graphemer.

Install the dependency

npm i graphemer

The basic usage is as follows:

// CommonJS
const Graphemer = require('graphemer').default
const splitter = new Graphemer()
const graphemes = splitter.splitGraphemes('😃⛔👨‍👨‍👧‍👧👦🏾')
console.log(graphemes) // ['😃', '⛔', '👨‍👨‍👧‍👧', '👦🏾']

// Or ESM
import Graphemer from 'graphemer'
const splitter = new Graphemer()
const graphemes = splitter.splitGraphemes('😃⛔👨‍👨‍👧‍👧👦🏾')
console.log(graphemes) // ['😃', '⛔', '👨‍👨‍👧‍👧', '👦🏾']

graphemer is positioned as a Unicode character splitter, which means it correctly splits not only emoji but also other Unicode sequences that behave the same way. It is a great solution.
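To give an idea of what "other Unicode sequences" means, here are two non-face examples: a letter with a combining accent, and a flag built from two regional indicators. So that the sketch runs without installing anything, it uses the built-in Intl.Segmenter instead of graphemer; `graphemeCount` is a hypothetical helper, not part of graphemer's API.

```javascript
// Hypothetical helper: count user-perceived characters with Intl.Segmenter.
const graphemeCount = str => [...new Intl.Segmenter().segment(str)].length

// "é" written as e + U+0301 COMBINING ACUTE ACCENT
const accented = 'e\u0301'
console.log(accented.length)         // 2
console.log(graphemeCount(accented)) // 1

// Regional indicators C + N, drawn as a flag where supported
const flag = '\u{1F1E8}\u{1F1F3}'
console.log(flag.length)             // 4 (code units)
console.log([...flag].length)        // 2 (code points)
console.log(graphemeCount(flag))     // 1
```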

Summary of Solutions

  1. Split with the built-in Intl.Segmenter (see above for details).
  2. Use a more mature solution from the open source community: graphemer (recommended; see above for details).

Related Knowledge

July 17th of each year is World Emoji Day. This is an unofficial commemorative day that has been held since 2014 to celebrate the widespread use of emoji. Usually, emoji events are held on this day, and new emoji are released.

The Queensland Department of Transport and Main Roads in Australia introduced new regulations: starting from March 1, 2019, vehicle owners are allowed to add an emoji to their license plates.

"😂" (Face with Tears of Joy) was chosen as the 2015 Word of the Year by Oxford Dictionaries.

