{"id":2268,"date":"2017-03-11T11:23:43","date_gmt":"2017-03-11T10:23:43","guid":{"rendered":"http:\/\/monodes.com\/predaelli\/?p=2268"},"modified":"2017-03-11T11:23:43","modified_gmt":"2017-03-11T10:23:43","slug":"a-programmers-introduction-to-unicode","status":"publish","type":"post","link":"https:\/\/monodes.com\/predaelli\/2017\/03\/11\/a-programmers-introduction-to-unicode\/","title":{"rendered":"A Programmer\u2019s Introduction to Unicode"},"content":{"rendered":"<p><em><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/\">\u00abA Programmer\u2019s Introduction to Unicode\u00bb from Nathan Reed\u2019s coding blog<\/a><\/em><\/p>\n<p><!--more--><\/p>\n<p><!--nextpage--><\/p>\n<blockquote>\n<header>\n<h1><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/\">A Programmer\u2019s Introduction to Unicode<\/a><\/h1>\n<p>March 3, 2017 \u00b7 <a href=\"http:\/\/reedbeta.com\/blog\/category\/coding\/\">Coding<\/a> \u00b7 <a class=\"disqus-comment-count\" href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#comments\" data-disqus-identifier=\"\/blog\/programmers-intro-to-unicode\/\">7 Comments<\/a><\/p>\n<\/header>\n<p>\uff35\uff4e\uff49\uff43\uff4f\uff44\uff45! \ud83c\udd64\ud83c\udd5d\ud83c\udd58\ud83c\udd52\ud83c\udd5e\ud83c\udd53\ud83c\udd54\u203d \ud83c\uddfa\u200c\ud83c\uddf3\u200c\ud83c\uddee\u200c\ud83c\udde8\u200c\ud83c\uddf4\u200c\ud83c\udde9\u200c\ud83c\uddea! \ud83d\ude04 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to \u201csupport Unicode\u201d in our software (whatever that means\u2014like using <code class=\"\" data-line=\"\">wchar_t<\/code> for all the strings, right?). 
But Unicode can be abstruse, and diving into the thousand-page <a href=\"http:\/\/www.unicode.org\/versions\/latest\/\">Unicode Standard<\/a> plus its dozens of supplementary <a href=\"http:\/\/www.unicode.org\/reports\/\">annexes, reports<\/a>, and <a href=\"http:\/\/www.unicode.org\/notes\/\">notes<\/a> can be more than a little intimidating. I don\u2019t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode\u2019s inception.<\/p>\n<p>A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I\u2019ll give an introduction to it from a programmer\u2019s point of view.<\/p>\n<p>I\u2019m going to focus on the character set and what\u2019s involved in working with strings and files of Unicode text. However, in this article I\u2019m not going to talk about fonts, text layout\/shaping\/rendering, or localization in detail\u2014those are separate issues, beyond my scope (and knowledge) here.<\/p>\n<div class=\"toc\">\n<ul>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#diversity-and-inherent-complexity\">Diversity and Inherent Complexity<\/a><\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#the-unicode-codespace\">The Unicode Codespace<\/a>\n<ul>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#codespace-allocation\">Codespace Allocation<\/a><\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#scripts\">Scripts<\/a><\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#usage-frequency\">Usage Frequency<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#encodings\">Encodings<\/a>\n<ul>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#utf-8\">UTF-8<\/a><\/li>\n<li><a 
href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#utf-16\">UTF-16<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#combining-marks\">Combining Marks<\/a>\n<ul>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#canonical-equivalence\">Canonical Equivalence<\/a><\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#normalization-forms\">Normalization Forms<\/a><\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#grapheme-clusters\">Grapheme Clusters<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/#and-more\">And More\u2026<\/a><\/li>\n<\/ul>\n<\/div>\n<h2 id=\"diversity-and-inherent-complexity\">Diversity and Inherent Complexity<\/h2>\n<p>As soon as you start to study Unicode, it becomes clear that it represents a large jump in complexity over character sets like ASCII that you may be more familiar with. It\u2019s not just that Unicode contains a much larger number of characters, although that\u2019s part of it. Unicode also has a great deal of internal structure, features, and special cases, making it much more than what one might expect a mere \u201ccharacter set\u201d to be. We\u2019ll see some of that later in this article.<\/p>\n<p>When confronting all this complexity, especially as an engineer, it\u2019s hard not to find oneself asking, \u201cWhy do we need all this? Is this really necessary? Couldn\u2019t it be simplified?\u201d<\/p>\n<p>However, Unicode aims to faithfully represent the <em>entire world\u2019s<\/em> writing systems. The Unicode Consortium\u2019s stated goal is \u201cenabling people around the world to use computers in any language\u201d. And as you might imagine, the diversity of written languages is immense! 
To date, Unicode supports 135 different scripts, covering some 1100 languages, and there\u2019s still a long tail of <a href=\"http:\/\/linguistics.berkeley.edu\/sei\/\">over 100 unsupported scripts<\/a>, both modern and historical, which people are still working to add.<\/p>\n<p>Given this enormous diversity, it\u2019s inevitable that representing it is a complicated project. Unicode embraces that diversity, and accepts the complexity inherent in its mission to include all human writing systems. It doesn\u2019t make a lot of trade-offs in the name of simplification, and it makes exceptions to its own rules where necessary to further its mission.<\/p>\n<p>Moreover, Unicode is committed not just to supporting texts in any <em>single<\/em> language, but also to letting multiple languages coexist within one text\u2014which introduces even more complexity.<\/p>\n<p>Most programming languages have libraries available to handle the gory low-level details of text manipulation, but as a programmer, you\u2019ll still need to know about certain Unicode features in order to know when and how to apply them. It may take some time to wrap your head around it all, but don\u2019t be discouraged\u2014think about the billions of people for whom your software will be more accessible through supporting text in their language. Embrace the complexity!<\/p>\n<h2 id=\"the-unicode-codespace\">The Unicode Codespace<\/h2>\n<p>Let\u2019s start with some general orientation. The basic elements of Unicode\u2014its \u201ccharacters\u201d, although that term isn\u2019t quite right\u2014are called <em>code points<\/em>. 
Code points are identified by number, customarily written in hexadecimal with the prefix \u201cU+\u201d, such as <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=A\">U+0041 \u201cA\u201d <span class=\"smallcaps\">latin capital letter a<\/span><\/a> or <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=%CE%B8\">U+03B8 \u201c\u03b8\u201d <span class=\"smallcaps\">greek small letter theta<\/span><\/a>. Each code point also has a short name, and quite a few other properties, specified in the <a href=\"http:\/\/www.unicode.org\/reports\/tr44\/\">Unicode Character Database<\/a>.<\/p>\n<p>The set of all possible code points is called the <em>codespace<\/em>. The Unicode codespace consists of 1,114,112 code points. However, only 128,237 of them\u2014about 12% of the codespace\u2014are actually assigned, to date. There\u2019s plenty of room for growth! Unicode also reserves an additional 137,468 code points as \u201cprivate use\u201d areas, which have no standardized meaning and are available for individual applications to define for their own purposes.<\/p>\n<h3 id=\"codespace-allocation\">Codespace Allocation<\/h3>\n<p>To get a feel for how the codespace is laid out, it\u2019s helpful to visualize it. Below is a map of the entire codespace, with one pixel per code point. It\u2019s arranged in tiles for visual coherence; each small square is 16\u00d716 = 256 code points, and each large square is a \u201cplane\u201d of 65,536 code points. There are 17 planes altogether.<\/p>\n<p><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2017\/03\/codespace-map.png?w=910\" alt=\"\" \/><\/a><\/p>\n<p>White represents unassigned space. Blue is assigned code points, green is private-use areas, and the small red area is surrogates (more about those later). 
As you can see, the assigned code points are distributed somewhat sparsely, but concentrated in the first three planes.<\/p>\n<p>Plane 0 is also known as the \u201cBasic Multilingual Plane\u201d, or BMP. The BMP contains essentially all the characters needed for modern text in any script, including Latin, Cyrillic, Greek, Han (Chinese), Japanese, Korean, Arabic, Hebrew, Devanagari (Indian), and many more.<\/p>\n<p>(In the past, the codespace was just the BMP and no more\u2014Unicode was originally conceived as a straightforward 16-bit encoding, with only 65,536 code points. It was expanded to its current size in 1996. However, the vast majority of code points in modern text belong to the BMP.)<\/p>\n<p>Plane 1 contains historical scripts, such as Sumerian cuneiform and Egyptian hieroglyphs, as well as emoji and various other symbols. Plane 2 contains a large block of less-common and historical Han characters. The remaining planes are empty, except for a small number of rarely-used formatting characters in Plane 14; planes 15\u201316 are reserved entirely for private use.<\/p>\n<h3 id=\"scripts\">Scripts<\/h3>\n<p>Let\u2019s zoom in on the first three planes, since that\u2019s where the action is:<\/p>\n<p><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2017\/03\/script-map.png?w=910\" alt=\"\" \/><\/a><\/p>\n<p>This map color-codes the 135 different scripts in Unicode. You can see how Han and Korean take up most of the range of the BMP (the left large square). By contrast, all of the European, Middle Eastern, and South Asian scripts fit into the first row of the BMP in this diagram.<\/p>\n<p>Many areas of the codespace are adapted or copied from earlier encodings. For example, the first 128 code points of Unicode are just a copy of ASCII. 
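You can verify this overlap, and look up a code point's Unicode Character Database properties, from most languages' standard libraries. A minimal sketch (mine, not the original post's) using Python's `unicodedata` module:

```python
import unicodedata

# The first 128 code points coincide with ASCII:
a = ord("A")
assert a == 65 == 0x41  # same index as ASCII "A", hence "U+0041"

# Character Database properties, e.g. the name and general category:
name_a = unicodedata.name("A")           # "LATIN CAPITAL LETTER A"
name_theta = unicodedata.name("\u03b8")  # "GREEK SMALL LETTER THETA"
cat_a = unicodedata.category("A")        # "Lu" = Letter, uppercase
```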
This has clear benefits for compatibility\u2014it\u2019s easy to losslessly convert texts from smaller encodings into Unicode (and the other direction too, as long as no characters outside the smaller encoding are used).<\/p>\n<h3 id=\"usage-frequency\">Usage Frequency<\/h3>\n<p>One more interesting way to visualize the codespace is to look at the distribution of usage\u2014in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0\u20132 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.<\/p>\n<p><a href=\"http:\/\/reedbeta.com\/blog\/programmers-intro-to-unicode\/\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2017\/03\/heatmap-wikitweets.png?w=910\" alt=\"\" \/><\/a><\/p>\n<p>You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1\u20132. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.<\/p>\n<h2 id=\"encodings\">Encodings<\/h2>\n<p>We\u2019ve seen that Unicode code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do code points get represented as bytes, in memory or in a file?<\/p>\n<p>The most convenient, computer-friendliest (and programmer-friendliest) thing to do would be to just store the code point index as a 32-bit integer. This works, but it consumes 4 bytes per code point, which is sort of a lot. Using 32-bit ints for Unicode will cost you a bunch of extra storage, memory, and performance in bandwidth-bound scenarios, if you work with a lot of text.<\/p>\n<p>Consequently, there are several more-compact encodings for Unicode. 
The 32-bit integer encoding is officially called UTF-32 (UTF = \u201cUnicode Transformation Format\u201d), but it\u2019s rarely used for storage. At most, it comes up sometimes as a temporary internal representation, for examining or operating on the code points in a string.<\/p>\n<p>Much more commonly, you\u2019ll see Unicode text encoded as either UTF-8 or UTF-16. These are both <em>variable-length<\/em> encodings, made up of 8-bit or 16-bit units, respectively. In these schemes, code points with smaller index values take up fewer bytes, which saves a lot of memory for typical texts. The trade-off is that processing UTF-8\/16 texts is more programmatically involved, and likely slower.<\/p>\n<h3 id=\"utf-8\">UTF-8<\/h3>\n<p>In UTF-8, each code point is stored using 1 to 4 bytes, based on its index value.<\/p>\n<p>UTF-8 uses a system of binary prefixes, in which the high bits of each byte mark whether it\u2019s a single byte, the beginning of a multi-byte sequence, or a continuation byte; the remaining bits, concatenated, give the code point index. This table shows how it works:<\/p>\n<table>\n<thead>\n<tr>\n<th>UTF-8 (binary)<\/th>\n<th>Code point (binary)<\/th>\n<th>Range<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"mono\">0xxxxxxx<\/td>\n<td class=\"mono\">xxxxxxx<\/td>\n<td>U+0000\u2013U+007F<\/td>\n<\/tr>\n<tr>\n<td class=\"mono\">110xxxxx 10yyyyyy<\/td>\n<td class=\"mono\">xxxxxyyyyyy<\/td>\n<td>U+0080\u2013U+07FF<\/td>\n<\/tr>\n<tr>\n<td class=\"mono\">1110xxxx 10yyyyyy 10zzzzzz<\/td>\n<td class=\"mono\">xxxxyyyyyyzzzzzz<\/td>\n<td>U+0800\u2013U+FFFF<\/td>\n<\/tr>\n<tr>\n<td class=\"mono\">11110xxx 10yyyyyy 10zzzzzz 10wwwwww<\/td>\n<td class=\"mono\">xxxyyyyyyzzzzzzwwwwww<\/td>\n<td>U+10000\u2013U+10FFFF<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A handy property of UTF-8 is that code points below 128 (ASCII characters) are encoded as single bytes, and all non-ASCII code points are encoded using sequences of bytes 128\u2013255. 
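The prefix table above translates directly into code. Here is a sketch of the encoder (an illustrative example, not from the original article; it omits validity checks such as rejecting surrogates), cross-checked against Python's built-in codec. Note the key property again: byte values 0x80\u20130xFF only ever appear in the encodings of non-ASCII code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) per the UTF-8 prefix table."""
    if cp < 0x80:                                   # 1 byte: 0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:                                # 2 bytes: 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                              # 3 bytes: 1110xxxx 10yy 10zz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                                           # 4 bytes: 11110xxx 10yy 10zz 10ww
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

# Cross-check against the built-in codec for 1-, 2-, 3-, and 4-byte cases:
for ch in "A\u00e9\u20ac\U0001F604":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```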
This has a couple of nice consequences. First, any strings or files out there that are already in ASCII can also be interpreted as UTF-8 without any conversion. Second, lots of widely-used string programming idioms\u2014such as null termination, or delimiters (newlines, tabs, commas, slashes, etc.)\u2014will just work on UTF-8 strings. ASCII bytes never occur inside the encoding of non-ASCII code points, so searching byte-wise for a null terminator or a delimiter will do the right thing.<\/p>\n<p>Thanks to this convenience, it\u2019s relatively simple to extend legacy ASCII programs and APIs to handle UTF-8 strings. UTF-8 is very widely used in the Unix\/Linux and Web worlds, and many programmers argue <a href=\"http:\/\/utf8everywhere.org\/\">UTF-8 should be the default encoding everywhere<\/a>.<\/p>\n<p>However, UTF-8 isn\u2019t a drop-in replacement for ASCII strings in all respects. For instance, code that iterates over the \u201ccharacters\u201d in a string will need to decode UTF-8 and iterate over code points (or maybe grapheme clusters\u2014more about those later), not bytes. When you measure the \u201clength\u201d of a string, you\u2019ll need to think about whether you want the length in bytes, the length in code points, the width of the text when rendered, or something else.<\/p>\n<h3 id=\"utf-16\">UTF-16<\/h3>\n<p>The other encoding that you\u2019re likely to encounter is UTF-16. 
It uses 16-bit words, with each code point stored as either 1 or 2 words.<\/p>\n<p>Like UTF-8, we can express the UTF-16 encoding rules in the form of binary prefixes:<\/p>\n<table>\n<thead>\n<tr>\n<th>UTF-16 (binary)<\/th>\n<th>Code point (binary)<\/th>\n<th>Range<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"mono\">xxxxxxxxxxxxxxxx<\/td>\n<td class=\"mono\">xxxxxxxxxxxxxxxx<\/td>\n<td>U+0000\u2013U+FFFF<\/td>\n<\/tr>\n<tr>\n<td class=\"mono\">110110xxxxxxxxxx 110111yyyyyyyyyy<\/td>\n<td class=\"mono\">xxxxxxxxxxyyyyyyyyyy + 0x10000<\/td>\n<td>U+10000\u2013U+10FFFF<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A more common way that people talk about UTF-16 encoding, though, is in terms of code points called \u201csurrogates\u201d. All the code points in the range U+D800\u2013U+DFFF\u2014or in other words, the code points that match the binary prefixes <code class=\"\" data-line=\"\">110110<\/code> and <code class=\"\" data-line=\"\">110111<\/code> in the table above\u2014are reserved specifically for UTF-16 encoding, and don\u2019t represent any valid characters on their own. They\u2019re only meant to occur in the 2-word encoding pattern above, which is called a \u201csurrogate pair\u201d. Surrogate code points are illegal in any other context! They\u2019re not allowed in UTF-8 or UTF-32 at all.<\/p>\n<p>Historically, UTF-16 is a descendant of the original, pre-1996 versions of Unicode, in which there were only 65,536 code points. The original intention was that there would be no different \u201cencodings\u201d; Unicode was supposed to be a straightforward 16-bit character set. Later, the codespace was expanded to make room for a long tail of less-common (but still important) Han characters, which the Unicode designers didn\u2019t originally plan for. 
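The second row of the UTF-16 table above\u2014the surrogate-pair arithmetic\u2014can be sketched in a few lines (my example, not the post's): subtract 0x10000, then split the remaining 20 bits across two reserved 16-bit words.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point (U+10000..U+10FFFF)
    into a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                      # 20 bits remain
    return (0xD800 | (v >> 10),           # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF))         # low surrogate: bottom 10 bits

hi, lo = to_surrogate_pair(0x1F604)       # U+1F604, an emoji outside the BMP
# (0xD83D, 0xDE04) — matches Python's big-endian UTF-16 codec:
assert "\U0001F604".encode("utf-16-be") == bytes([hi >> 8, hi & 0xFF,
                                                  lo >> 8, lo & 0xFF])
```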
Surrogates were then introduced, as\u2014to put it bluntly\u2014a kludge, allowing 16-bit encodings to access the new code points.<\/p>\n<p>Today, Javascript uses UTF-16 as its standard string representation: if you ask for the length of a string, or iterate over it, etc., the result will be in UTF-16 words, with any code points outside the BMP expressed as surrogate pairs. UTF-16 is also used by the Microsoft Win32 APIs; though Win32 supports either 8-bit or 16-bit strings, the 8-bit version unaccountably still doesn\u2019t support UTF-8\u2014only legacy code-page encodings, like ANSI. This leaves UTF-16 as the only way to get proper Unicode support in Windows.<\/p>\n<p>By the way, UTF-16\u2019s words can be stored either little-endian or big-endian. Unicode has no opinion on that issue, though it does encourage the convention of putting <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=FEFF\">U+FEFF <span class=\"smallcaps\">zero width no-break space<\/span><\/a> at the top of a UTF-16 file as a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Byte_order_mark\">byte-order mark<\/a>, to disambiguate the endianness. (If the file doesn\u2019t match the system\u2019s endianness, the BOM will be decoded as U+FFFE, which isn\u2019t a valid code point.)<\/p>\n<h2 id=\"combining-marks\">Combining Marks<\/h2>\n<p>In the story so far, we\u2019ve been focusing on code points. But in Unicode, a \u201ccharacter\u201d can be more complicated than just an individual code point!<\/p>\n<p>Unicode includes a system for <em>dynamically composing<\/em> characters, by combining multiple code points together. This is used in various ways to gain flexibility without causing a huge combinatorial explosion in the number of code points.<\/p>\n<p>In European languages, for example, this shows up in the application of diacritics to letters. Unicode supports a wide range of diacritics, including acute and grave accents, umlauts, cedillas, and many more. 
All these diacritics can be applied to any letter of any alphabet\u2014and in fact, <em>multiple<\/em> diacritics can be used on a single letter.<\/p>\n<p>If Unicode tried to assign a distinct code point to every possible combination of letter and diacritics, things would rapidly get out of hand. Instead, the dynamic composition system enables you to construct the character you want, by starting with a base code point (the letter) and appending additional code points, called \u201ccombining marks\u201d, to specify the diacritics. When a text renderer sees a sequence like this in a string, it automatically stacks the diacritics over or under the base letter to create a composed character.<\/p>\n<p>For example, the accented character \u201c\u00c1\u201d can be expressed as a string of two code points: <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=A\">U+0041 \u201cA\u201d <span class=\"smallcaps\">latin capital letter a<\/span><\/a> plus <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=0301\">U+0301 \u201c\u25cc\u0301\u201d <span class=\"smallcaps\">combining acute accent<\/span><\/a>. This string automatically gets rendered as a single character: \u201cA\u0301\u201d.<\/p>\n<p>Now, Unicode does also include many \u201cprecomposed\u201d code points, each representing a letter with some combination of diacritics already applied, such as <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=%C3%81\">U+00C1 \u201c\u00c1\u201d <span class=\"smallcaps\">latin capital letter a with acute<\/span><\/a> or <a href=\"http:\/\/unicode.org\/cldr\/utility\/character.jsp?a=%E1%BB%87\">U+1EC7 \u201c\u1ec7\u201d <span class=\"smallcaps\">latin small letter e with circumflex and dot below<\/span><\/a>. I suspect these are mostly inherited from older encodings that were assimilated into Unicode, and kept around for compatibility. 
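The composed-versus-precomposed distinction is easy to observe in code. A small sketch (mine, not the original post's) in Python, where strings are sequences of code points:

```python
composed = "\u0041\u0301"     # "A" + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00C1"        # U+00C1 LATIN CAPITAL LETTER A WITH ACUTE

# Both render as a single accented character, but they are different
# code point sequences, so naive comparison treats them as different strings:
assert len(composed) == 2
assert len(precomposed) == 1
assert composed != precomposed
```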
In practice, there are precomposed code points for most of the common letter-with-diacritic combinations in European-script languages, so they don\u2019t use dynamic composition that much in typical text.<\/p>\n<p>Still, the system of combining marks does allow for an <em>arbitrary number<\/em> of diacritics to be stacked on any base character. The reductio-ad-absurdum of this is <a href=\"https:\/\/eeemo.net\/\">Zalgo text<\/a>, which works by \u035f\u0345\u0356r\u035ean\u032d\u032b\u0320\u0316\u0348\u0317d\u0356\u033b\u0339o\u0341m\u032a\u0359\u0355\u0317\u031dl\u0327\u0347\u0330\u0353\u0333\u032by\u0341\u0353\u0325\u031f\u034d \u0315s\u032bt\u035c\u032b\u0331\u0355\u0317\u0330\u033c\u0318a\u035d\u033c\u0329\u0356\u0347\u0320\u0348\u0323c\u0319\u034dk\u0358\u0316\u0331\u0339\u034di\u0322n\u0328\u033a\u031d\u0347\u0347\u031f\u0359g\u0327\u032b\u032e\u034e\u0345\u033b\u031f \u0315n\u035e\u033c\u033a\u0348u\u032e\u0359m\u035e\u033a\u032d\u031f\u0317e\u031e\u0353\u0330\u0324\u0353\u032br\u0335o\u0316u\u032ds\u0489\u032a\u034d\u032d\u032c\u031d\u0324 \u0360\u032e\u0349\u031d\u031e\u0317\u031fd\u0334\u031f\u031c\u0331\u0355\u035ai\u0361\u0347\u032b\u033c\u032f\u032d\u031ca\u0325\u0359\u033b\u033cc\u0332\u0332\u0339r\u0328\u0320\u0339\u0323\u0330\u0326i\u0331t\u0315\u0324\u033b\u0324\u034d\u0359\u0318i\u0335\u031c\u032d\u0324\u0331\u034ec\u0335s \u0358o\u0362\u0331\u0332\u0348\u0319\u0356\u0347\u0332n\u0358 \u031c\u0348e\u032c\u0332\u0320\u0329ac\u0355\u033a\u0320\u0349h\u0337\u032a \u033a\u0323\u0356\u0331l\u0331\u032b\u032c\u031d\u0339e\u032d\u0319\u033a\u0359\u032d\u0353\u0332t\u031e\u031e\u0347\u0332\u0349\u034dt\u0337\u0354\u032a\u0349\u0332\u033b\u0320\u0359e\u0326\u033b\u0348\u0349\u0347r\u0347\u032d\u032d\u032c\u0356,\u0341\u0316 \u031c\u0359\u0353\u0323\u032ds\u0318\u0318\u0348o\u0331\u0330\u0345\u0324\u0332 \u031b\u032c\u031c\u0319t\u033c\u0326\u0355\u0331\u0339\u0355\u0325h\u035d\u0333\u0332\u0348\u0345a\u0326t\u033b\u0332 
\u033b\u031f\u032d\u0326\u0316t\u031b\u0330\u0329h\u0320\u0355\u0333\u031d\u032b\u0355e\u0358\u0348\u0324\u0318\u0356\u031ey\u0489\u031d\u0359 \u0337\u0349\u0354\u0330\u0320o\u031e\u0330v\u035c\u0348\u0348\u0333\u0318er\u0336f\u0330\u0348\u0354l\u0331\u0355\u0318\u032b\u033a\u0332o\u0360\u0332\u0345\u032d\u0359w\u0331\u0333\u033a \u035ct\u0338h\u0347\u032d\u0355\u0333\u034de\u0316\u032f\u031f\u0320 \u035c\u034d\u031e\u031c\u0354\u0329\u032al\u0327\u034e\u032a\u0332\u035ai\u031d\u0332\u0339\u0319\u0329\u0339n\u0328\u0326\u0329\u0316e\u0362\u032d\u033c\u0345\u0332\u033c \u035d\u032cs\u035d\u033c\u035a\u0318\u031ep\u0359\u0318\u033ba\u0319c\u0489\u0349\u031c\u0324\u0348\u032f\u0316i\u0361\u0325n\u035f\u0326\u0320\u0331g\u0338\u0345\u0317\u033b\u0326\u032d\u032e\u031f \u0315\u0333\u032a\u0320\u0356\u0333\u032fa\u035c\u032bn\u035dd\u0361 \u0323\u0326\u0345\u0319c\u032a\u0317r\u0334\u0359\u032e\u0326\u0339\u0333e\u035f\u0347\u035a\u031e\u0354\u0339\u032ba\u0319\u033a\u0319t\u0326\u0354\u034e\u0345\u0318\u0339e\u0325\u0329\u034d a\u0356\u032a\u031c\u032e\u0359\u0339n\u0322\u0349\u031d \u0341\u0347\u0349\u0353\u0326\u033ca\u0333\u0356\u032a\u0324\u0331p\u0360\u0316\u0354\u0354\u031f\u0347\u034ep\u0331\u034d\u033ae\u0328\u0332\u034e\u0348\u0330\u0332\u0324\u032ba\u035c\u032fr\u0328\u032e\u032b\u0323\u0318a\u0329\u032f\u0356n\u0339\u0326\u0330\u034e\u0323\u031e\u031ec\u0328\u0326\u0331\u0354\u034e\u034d\u0356e\u0358\u032c\u0353 \u0324\u0330\u0329\u0359\u0324\u032c\u0359o\u0335\u033c\u033b\u032c\u033b\u0347\u032e\u032af\u0334 \u0321\u0319\u032d\u0353\u0356\u032a\u0324\u201c\u0338\u0359\u0320\u033cc\u035c\u0333\u0317o\u034f\u033c\u0359\u0354\u032er\u031e\u032b\u033a\u031e\u0325\u032cru\u033a\u033b\u032f\u0349\u032d\u033b\u032fp\u0362\u0330\u0325\u0353\u0323\u032b\u0319\u0324t\u0345\u0333\u034d\u0333\u0316i\u0336\u0348\u031d\u0359\u033c\u0319\u0339o\u0321\u0354n\u035d\u0345\u0319\u033a\u0339\u0316\u0329\u201d\u0328\u0317\u0356\u035a\u0329.\u032f\u0353<\/p>\n<p>A few other 
places where dynamic character composition shows up in Unicode:<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Vowel_pointing\">Vowel-pointing notation<\/a> in Arabic and Hebrew. In these languages, words are normally spelled with some of their vowels left out. They then have diacritic notation to indicate the vowels (used in dictionaries, language-teaching materials, children\u2019s books, and such). These diacritics are expressed with combining marks.<br \/>\n<table class=\"borderless\">\n<tbody>\n<tr>\n<td>A Hebrew example, with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Niqqud\">niqqud<\/a>:<\/td>\n<td>\u05d0\u05b6\u05ea \u05d3\u05b7\u05dc\u05b0\u05ea\u05b4\u05bc\u05d9 \u05d4\u05b5\u05d6\u05b4\u05d9\u05d6 \u05d4\u05b5\u05e0\u05b4\u05d9\u05e2\u05b7, \u05e7\u05b6\u05d8\u05b6\u05d1 \u05dc\u05b4\u05e9\u05b0\u05c1\u05db\u05b7\u05bc\u05ea\u05b4\u05bc\u05d9 \u05d9\u05b8\u05e9\u05c1\u05d5\u05b9\u05d3<\/td>\n<\/tr>\n<tr>\n<td>Normal writing (no niqqud):<\/td>\n<td>\u05d0\u05ea \u05d3\u05dc\u05ea\u05d9 \u05d4\u05d6\u05d9\u05d6 \u05d4\u05e0\u05d9\u05e2, \u05e7\u05d8\u05d1 \u05dc\u05e9\u05db\u05ea\u05d9 \u05d9\u05e9\u05d5\u05d3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Devanagari\">Devanagari<\/a>, the script used to write Hindi, Sanskrit, and many other South Asian languages, expresses certain vowels as combining marks attached to consonant letters. For example, \u201c\u0939\u201d + \u201c\u200b\u093f\u201d = \u201c\u0939\u093f\u201d (\u201ch\u201d + \u201ci\u201d = \u201chi\u201d).<\/li>\n<li>Korean characters stand for syllables, but they are composed of letters called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hangul#Letters\">jamo<\/a> that stand for the vowels and consonants in the syllable. While there are code points for precomposed Korean syllables, it\u2019s also possible to dynamically compose them by concatenating their jamo. 
For example, \u201c\u1112\u201d + \u201c\u1161\u201d + \u201c\u11ab\u201d = \u201c\u1112\u1161\u11ab\u201d (\u201ch\u201d + \u201ca\u201d + \u201cn\u201d = \u201chan\u201d).<\/li>\n<\/ul>\n<h3 id=\"canonical-equivalence\">Canonical Equivalence<\/h3>\n<p>In Unicode, precomposed characters exist alongside the dynamic composition system. A consequence of this is that there are multiple ways to express \u201cthe same\u201d string\u2014different sequences of code points that result in the same user-perceived characters. For example, as we saw earlier, we can express the character \u201c\u00c1\u201d either as the single code point U+00C1, <em>or<\/em> as the string of two code points U+0041 U+0301.<\/p>\n<p>Another source of ambiguity is the ordering of multiple diacritics in a single character. Diacritic order matters visually when two diacritics apply to the same side of the base character, e.g. both above: \u201c\u01e1\u201d (dot, then macron) is different from \u201c\u0101\u0307\u201d (macron, then dot). However, when diacritics apply to different sides of the character, e.g. one above and one below, then the order doesn\u2019t affect rendering. Moreover, a character with multiple diacritics might have one of the diacritics precomposed and others expressed as combining marks.<\/p>\n<p>For example, the Vietnamese letter \u201c\u1ec7\u201d can be expressed in <em>five<\/em> different ways:<\/p>\n<ul>\n<li>Fully precomposed: U+1EC7 \u201c\u1ec7\u201d<\/li>\n<li>Partially precomposed: U+1EB9 \u201c\u1eb9\u201d + U+0302 \u201c\u25cc\u0302\u201d<\/li>\n<li>Partially precomposed: U+00EA \u201c\u00ea\u201d + U+0323 \u201c\u25cc\u0323\u201d<\/li>\n<li>Fully decomposed: U+0065 \u201ce\u201d + U+0323 \u201c\u25cc\u0323\u201d + U+0302 \u201c\u25cc\u0302\u201d<\/li>\n<li>Fully decomposed: U+0065 \u201ce\u201d + U+0302 \u201c\u25cc\u0302\u201d + U+0323 \u201c\u25cc\u0323\u201d<\/li>\n<\/ul>\n<p>Unicode refers to a set of strings like this as \u201ccanonically equivalent\u201d. 
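One way to check this equivalence programmatically is via the normalization forms covered in the next section; a sketch (my example, not the original post's) using Python's standard `unicodedata` module:

```python
import unicodedata

# The five canonically equivalent spellings of the Vietnamese letter above:
spellings = [
    "\u1EC7",                # fully precomposed
    "\u1EB9\u0302",          # precomposed e-with-dot-below + combining circumflex
    "\u00EA\u0323",          # precomposed e-with-circumflex + combining dot below
    "\u0065\u0323\u0302",    # e + combining dot below + combining circumflex
    "\u0065\u0302\u0323",    # e + combining circumflex + combining dot below
]
# NFC maps all five to the single precomposed code point:
nfc = {unicodedata.normalize("NFC", s) for s in spellings}
assert nfc == {"\u1EC7"}
# NFD goes the other way: fully decomposed, marks sorted by combining class
# (dot below, class 220, precedes circumflex, class 230):
assert unicodedata.normalize("NFD", "\u1EC7") == "\u0065\u0323\u0302"
```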
Canonically equivalent strings are supposed to be treated as identical for purposes of searching, sorting, rendering, text selection, and so on. This has implications for how you implement operations on text. For example, if an app has a \u201cfind in file\u201d operation and the user searches for \u201c\u1ec7\u201d, it should, by default, find occurrences of <em>any<\/em> of the five versions of \u201c\u1ec7\u201d above!<\/p>\n<h3 id=\"normalization-forms\">Normalization Forms<\/h3>\n<p>To address the problem of \u201chow to handle canonically equivalent strings\u201d, Unicode defines several <em>normalization forms<\/em>: ways of converting strings into a canonical form so that they can be compared code-point-by-code-point (or byte-by-byte).<\/p>\n<p>The \u201cNFD\u201d normalization form fully <em>decomposes<\/em> every character down to its component base and combining marks, taking apart any precomposed code points in the string. It also sorts the combining marks in each character according to their rendered position, so e.g. diacritics that go below the character come before the ones that go above the character. (It doesn\u2019t reorder diacritics in the same rendered position, since their order matters visually, as previously mentioned.)<\/p>\n<p>The \u201cNFC\u201d form, conversely, puts things back together into precomposed code points as much as possible. If an unusual combination of diacritics is called for, there may not be any precomposed code point for it, in which case NFC still precomposes what it can and leaves any remaining combining marks in place (again ordered by rendered position, as in NFD).<\/p>\n<p>There are also forms called NFKD and NFKC. The \u201cK\u201d here refers to <em>compatibility<\/em> decompositions, which cover characters that are \u201csimilar\u201d in some sense but not visually identical. 
However, I\u2019m not going to cover that here.<\/p>\n<h3 id=\"grapheme-clusters\">Grapheme Clusters<\/h3>\n<p>As we\u2019ve seen, Unicode contains various cases where a thing that a user thinks of as a single \u201ccharacter\u201d might actually be made up of multiple code points under the hood. Unicode formalizes this using the notion of a <em>grapheme cluster<\/em>: a string of one or more code points that constitute a single \u201cuser-perceived character\u201d.<\/p>\n<p><a href=\"http:\/\/www.unicode.org\/reports\/tr29\/\">UAX #29<\/a> defines the rules for what, precisely, qualifies as a grapheme cluster. It\u2019s approximately \u201ca base code point followed by any number of combining marks\u201d, but the actual definition is a bit more complicated; it accounts for things like Korean jamo, and <a href=\"http:\/\/blog.emojipedia.org\/emoji-zwj-sequences-three-letters-many-possibilities\/\">emoji ZWJ sequences<\/a>.<\/p>\n<p>The main thing grapheme clusters are used for is text <em>editing<\/em>: they\u2019re often the most sensible unit for cursor placement and text selection boundaries. Using grapheme clusters for these purposes ensures that you can\u2019t accidentally chop off some diacritics when you copy-and-paste text, that left\/right arrow keys always move the cursor by one visible character, and so on.<\/p>\n<p>Another place where grapheme clusters are useful is in enforcing a string length limit\u2014say, on a database field. While the true, underlying limit might be something like the byte length of the string in UTF-8, you wouldn\u2019t want to enforce that by just truncating bytes. At a minimum, you\u2019d want to \u201cround down\u201d to the nearest code point boundary; but even better, round down to the nearest <em>grapheme cluster boundary<\/em>. 
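Python's standard library has no full UAX #29 segmenter (third-party options such as PyICU, or the `regex` package's `\X` pattern, implement the real rules), but a rough base-plus-combining-marks approximation\u2014my sketch, not the post's\u2014illustrates the idea of truncating at cluster boundaries:

```python
import unicodedata

def approx_clusters(s: str) -> list:
    """Very rough grapheme clustering: a base code point plus any trailing
    combining marks. (The real UAX #29 rules also handle Korean jamo,
    emoji ZWJ sequences, and other cases this sketch ignores.)"""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch        # attach mark to the preceding base
        else:
            clusters.append(ch)       # start a new cluster
    return clusters

text = "A\u0301B\u0323\u0302C"        # three user-perceived characters
assert approx_clusters(text) == ["A\u0301", "B\u0323\u0302", "C"]

# Truncating to 2 "characters" without chopping off diacritics:
assert "".join(approx_clusters(text)[:2]) == "A\u0301B\u0323\u0302"
```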
Otherwise, you might be corrupting the last character by cutting off a diacritic, or interrupting a jamo sequence or ZWJ sequence.<\/p>\n<h2 id=\"and-more\">And More\u2026<\/h2>\n<p>There\u2019s much more that could be said about Unicode from a programmer\u2019s perspective! I haven\u2019t gotten into such fun topics as case mapping, collation, compatibility decompositions and confusables, Unicode-aware regexes, or bidirectional text. Nor have I said anything yet about implementation issues\u2014how to efficiently store and look up data about the sparsely assigned code points, or how to optimize UTF-8 decoding, string comparison, or NFC normalization. Perhaps I\u2019ll return to some of those things in future posts.<\/p>\n<p>Unicode is a fascinating and complex system. It has a many-to-one mapping between bytes and code points, and on top of that a many-to-one (or, under some circumstances, many-to-many) mapping between code points and \u201ccharacters\u201d. It has oddball special cases in every corner. 
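Those layered mappings are easy to observe by measuring a single emoji ZWJ sequence at each level. A quick Python illustration (the grapheme cluster count is stated in a comment only, since UAX #29 segmentation isn't in the standard library):

```python
# One user-perceived character; several code points; many code units/bytes:
family = '\U0001F468\u200D\U0001F469\u200D\U0001F466'  # man + ZWJ + woman + ZWJ + boy

assert len(family) == 5                            # 5 code points
assert len(family.encode('utf-8')) == 18           # 18 UTF-8 bytes (4+3+4+3+4)
assert len(family.encode('utf-16-le')) // 2 == 8   # 8 UTF-16 units (3 surrogate pairs + 2 ZWJs)
# ...and UAX #29 segmentation would count exactly 1 grapheme cluster.
```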
But no one ever claimed that representing <em>all written languages<\/em> was going to be <em>easy<\/em>, and it\u2019s clear that we\u2019re never going back to the bad old days of a patchwork of incompatible encodings.<\/p>\n<p>Further reading:<\/p>\n<ul>\n<li><a href=\"http:\/\/www.unicode.org\/versions\/latest\/\">The Unicode Standard<\/a><\/li>\n<li><a href=\"http:\/\/utf8everywhere.org\/\">UTF-8 Everywhere Manifesto<\/a><\/li>\n<li><a href=\"https:\/\/eev.ee\/blog\/2015\/09\/12\/dark-corners-of-unicode\/\">Dark corners of Unicode<\/a> by Eevee<\/li>\n<li><a href=\"http:\/\/site.icu-project.org\/\">ICU (International Components for Unicode)<\/a>\u2014C\/C++\/Java libraries implementing many Unicode algorithms and related things<\/li>\n<li><a href=\"https:\/\/docs.python.org\/3\/howto\/unicode.html\">Python 3 Unicode Howto<\/a><\/li>\n<li><a href=\"https:\/\/www.google.com\/get\/noto\/\">Google Noto Fonts<\/a>\u2014set of fonts intended to cover all assigned code points<\/li>\n<\/ul>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p class=\"excerpt\">\u00abA Programmer\u2019s Introduction to Unicode\u00bb from Nathan Reed\u2019s coding blog<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"https:\/\/monodes.com\/predaelli\/2017\/03\/11\/a-programmers-introduction-to-unicode\/\">Read more 
&rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":{"inline_featured_image":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-2268","post","type-post","status-publish","format-link","hentry","category-senza-categoria","post_format-post-format-link"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6daft-AA","jetpack-related-posts":[{"id":11195,"url":"https:\/\/monodes.com\/predaelli\/2024\/01\/14\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/","url_meta":{"origin":2268,"position":0},"title":"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)","author":"Paolo Redaelli","date":"2024-01-14","format":"link","excerpt":"Ever wonder about that mysterious Content-Type tag? You know, the one you\u2019re supposed to put in HTML and you never quite know what it should be? 
Did you ever get an email from your friends in\u2026 The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character\u2026","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"\u05d2","src":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2024\/01\/gimel.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":9813,"url":"https:\/\/monodes.com\/predaelli\/2022\/11\/09\/ascii-oh-ascii-wherefore-art-thou-ascii\/","url_meta":{"origin":2268,"position":1},"title":"ASCII, oh ASCII! Wherefore art thou, ASCII?","author":"Paolo Redaelli","date":"2022-11-09","format":false,"excerpt":"ASCII, oh ASCII! Wherefore art thou, ASCII?The original line copied by an infamous English poet Puns aside, in the XXI century there are still need to stick to plain, old 7 bit ASCII character table. Many industrial applications stick to it for its simplicity. Unicode is often an overkill in\u2026","rel":"","context":"In &quot;Tricks&quot;","block_context":{"text":"Tricks","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/tricks\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":763,"url":"https:\/\/monodes.com\/predaelli\/2015\/10\/31\/how-to-differentiate-between-an-average-and-a-good-programmer\/","url_meta":{"origin":2268,"position":2},"title":"How to differentiate between an Average and a Good Programmer?","author":"Paolo Redaelli","date":"2015-10-31","format":false,"excerpt":"\u00a0How to differentiate between an Average and a Good Programmer? \u00a0 Oh my! It's so true! 
javarevisited looks like a really good programming blog as I found gems like\u00ab10 Articles Every Programmer Must Read \u00bb among What Every Programmer Should Know about Memory What Every Computer Scientist Should Know About\u2026","rel":"","context":"In &quot;Senza categoria&quot;","block_context":{"text":"Senza categoria","link":"https:\/\/monodes.com\/predaelli\/category\/senza-categoria\/"},"img":{"alt_text":"Being a good programmer is 3% talent, 97% not being distracted by the Internet","src":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2015\/10\/Being%2Ba%2BProgrammer.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":15394,"url":"https:\/\/monodes.com\/predaelli\/2026\/03\/29\/nobody-gets-fired-for-picking-json-but-maybe-they-should\/","url_meta":{"origin":2268,"position":3},"title":"Nobody Gets Fired for Picking JSON, but Maybe They Should?","author":"Paolo Redaelli","date":"2026-03-29","format":"link","excerpt":"Nobody Gets Fired for Picking JSON, but Maybe They Should? By Miguel Young de la Sota Nobody Gets Fired for Picking JSON, but Maybe They Should? JSON is extremely popular but deeply flawed. This article discusses the details of JSON\u2019s design, how it\u2019s used (and misused), and how seemingly helpful\u2026","rel":"","context":"In &quot;Javascript&quot;","block_context":{"text":"Javascript","link":"https:\/\/monodes.com\/predaelli\/category\/javascript\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3743,"url":"https:\/\/monodes.com\/predaelli\/2018\/02\/03\/postgres-hidden-gems-craig-kerstiens\/","url_meta":{"origin":2268,"position":4},"title":"Postgres hidden gems &#8211; Craig Kerstiens","author":"Paolo Redaelli","date":"2018-02-03","format":false,"excerpt":"citext I've always been fond of PostgreSQL, now this Postgres hidden gems - Craig Kerstiens shows its smart features even more! 
There are many interesting features of Postgresql that I didn't knew, as I haven't actually used it for a while Postgres has a rich set of features, even when\u2026","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9191,"url":"https:\/\/monodes.com\/predaelli\/2022\/03\/27\/9191\/","url_meta":{"origin":2268,"position":5},"title":"From As a Software Engineer,\u2026","author":"Paolo Redaelli","date":"2022-03-27","format":false,"excerpt":"From As a Software Engineer, Here Are 7 Books You Should Always Have at Your Desk. The subtitle says \"Ditch (or pause at least) all those courses and start reading books\". The Pragmatic ProgrammerHead First Design Pattern Code Simplicity: The Fundamentals of Software The Self-Taught Programmer: The Definitive Guide to\u2026","rel":"","context":"In &quot;Legenda&quot;","block_context":{"text":"Legenda","link":"https:\/\/monodes.com\/predaelli\/category\/legenda\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/2268","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/comments?post=2268"}],"version-history":[{"count":0,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/2268\/revisions"}],"wp:attachment":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/media?parent=2268"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monodes.com\/predaell
i\/wp-json\/wp\/v2\/categories?post=2268"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/tags?post=2268"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}