{"id":11195,"date":"2024-01-14T18:37:33","date_gmt":"2024-01-14T17:37:33","guid":{"rendered":"https:\/\/monodes.com\/predaelli\/?p=11195"},"modified":"2024-01-14T20:03:43","modified_gmt":"2024-01-14T19:03:43","slug":"the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses","status":"publish","type":"post","link":"https:\/\/monodes.com\/predaelli\/2024\/01\/14\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/","title":{"rendered":"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Ever wonder about that mysterious Content-Type tag? You know, the one you\u2019re supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in\u2026<\/p>\n<cite><em><a href=\"https:\/\/www.joelonsoftware.com\/2003\/10\/08\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/\">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)<\/a><\/em><\/cite><\/blockquote>\n\n\n\n<p>Oldies but goldies. 
This article is 20 years old, but still very relevant.<\/p>\n\n\n\n<!--more-->\n\n\n\n<!--nextpage-->\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<header class=\"entry-header\">\n<h1 class=\"entry-title\">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)<\/h1>\n<\/header>\n\n\n\n<div class=\"entry-meta\">\n<ul class=\"meta-list\">\n<li class=\"meta-cat\"><a href=\"https:\/\/www.joelonsoftware.com\/category\/reading-lists\/top-10\/\" rel=\"category tag\">Top 10<\/a>, <a href=\"https:\/\/www.joelonsoftware.com\/category\/reading-lists\/new-developer\/\" rel=\"category tag\">New developer<\/a>, <a href=\"https:\/\/www.joelonsoftware.com\/category\/news\/\" rel=\"category tag\">News<\/a><\/li>\n<\/ul>\n<\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright\"><a href=\"https:\/\/www.joelonsoftware.com\/2003\/10\/08\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2024\/01\/ascii.png?w=910&#038;ssl=1\" alt=\"\"\/><\/a><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright\"><a href=\"https:\/\/www.joelonsoftware.com\/2003\/10\/08\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2024\/01\/oem.png?w=910&#038;ssl=1\" alt=\"\"\/><\/a><\/figure><\/div>\n\n\n<div class=\"entry-content\">\n<p>Ever wonder about that mysterious Content-Type tag? 
You know, the one you\u2019re supposed to put in HTML and you never quite know what it should be?<\/p>\n<p>Did you ever get an email from your friends in Bulgaria with the subject line \u201c???? ?????? ??? ????\u201d?<\/p>\n<p>I\u2019ve been dismayed to discover just how many software developers aren\u2019t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for <a href=\"http:\/\/www.fogcreek.com\/FogBUGZ\">FogBUGZ<\/a> was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they \u201ccouldn\u2019t do anything about it.\u201d Like many programmers, he just wished it would all blow over somehow.<\/p>\n<p>But it won\u2019t. When I discovered that the popular web development tool PHP has almost <a href=\"http:\/\/ca3.php.net\/manual\/en\/language.types.string.php\">complete ignorance of character encoding issues<\/a>, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, <em>enough is enough<\/em>.<\/p>\n<p>So I have an announcement to make: if you are a programmer working in 2003 and you don\u2019t know the basics of characters, character sets, encodings, and Unicode, and I <em>catch<\/em> you, I\u2019m going to punish you by making you peel onions for 6 months in a submarine. 
I swear I will.<\/p>\n<p>And one more thing:<\/p>\n<p align=\"center\"><strong>IT\u2019S NOT THAT HARD.<\/strong><\/p>\n<p align=\"left\">In this article I\u2019ll fill you in on exactly what <em>every working programmer<\/em> should know. All that stuff about \u201cplain text = ascii = characters are 8 bits\u201d is not only wrong, it\u2019s hopelessly wrong, and if you\u2019re still programming that way, you\u2019re not much better than a medical doctor who doesn\u2019t believe in germs. Please do not write another line of code until you finish reading this article.<\/p>\n<p align=\"left\">Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I\u2019m really just trying to set a minimum bar here so that everyone can understand what\u2019s going on and can write code that has a <em>hope<\/em> of working with text in any language other than the subset of English that doesn\u2019t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it\u2019s character sets.<\/p>\n<p><strong>A Historical Perspective<\/strong><\/p>\n<p>The easiest way to understand this stuff is to go chronologically.<\/p>\n<p>You probably think I\u2019m going to talk about very old character sets like EBCDIC here. Well, I won\u2019t. EBCDIC is not relevant to your life. We don\u2019t have to go that far back in time.<\/p>\n<figure><\/figure><p>Back in the semi-olden days, when Unix was being invented and K&amp;R were writing <a href=\"http:\/\/cm.bell-labs.com\/cm\/cs\/cbook\/\">The C Programming Language<\/a>, everything was very simple. EBCDIC was on its way out. 
The only characters that mattered were good old unaccented English letters, and we had a code for them called <a href=\"http:\/\/www.robelle.com\/library\/smugbook\/ascii.html\">ASCII<\/a> which was able to represent every character using a number between 32 and 127. Space was 32, the letter \u201cA\u201d was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called <em>unprintable<\/em> and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.<\/p>\n<p>And all was good, assuming you were an English speaker.<\/p>\n<figure><\/figure><p>Because bytes have room for up to eight bits, lots of people got to thinking, \u201cgosh, we can use the codes 128-255 for our own purposes.\u201d The trouble was, <em>lots<\/em> of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and <a href=\"http:\/\/www.jimprice.com\/ascii-dos.gif\">a bunch of line drawing characters<\/a>\u2026 horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners\u2019. 
In fact&nbsp; as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as \u00e9, but on computers sold in Israel it was the Hebrew letter Gimel (<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2024\/01\/gimel.png?resize=5%2C9&#038;ssl=1\" alt=\"\u05d2\" width=\"5\" height=\"9\" border=\"0\" data-recalc-dims=\"1\"\/>), so when Americans would send their r\u00e9sum\u00e9s to Israel they would arrive as r<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2024\/01\/gimel.png?resize=5%2C9&#038;ssl=1\" alt=\"\u05d2\" width=\"5\" height=\"9\" border=\"0\" data-recalc-dims=\"1\"\/>sum<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2024\/01\/gimel.png?resize=5%2C9&#038;ssl=1\" alt=\"\u05d2\" width=\"5\" height=\"9\" border=\"0\" data-recalc-dims=\"1\"\/>s. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn\u2019t even reliably interchange Russian documents.<\/p>\n<p>Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called <em><a href=\"http:\/\/www.i18nguy.com\/unicode\/codepages.html#msftdos\">code pages<\/a><\/em>. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. 
The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few \u201cmultilingual\u201d code pages that could do Esperanto and Galician <em>on the same computer! Wow!<\/em> But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.<\/p>\n<p>Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the \u201cdouble byte character set\u201d in which <em>some<\/em> letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows\u2019 AnsiNext and AnsiPrev which knew how to deal with the whole mess.<\/p>\n<p>But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.<\/p>\n<p><strong>Unicode<\/strong><\/p>\n<p>Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. 
<strong>This is not, actually, correct.<\/strong> It is the single most common myth about Unicode, so if you thought that, don\u2019t feel bad.<\/p>\n<p>In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.<\/p>\n<p>Until now, we\u2019ve assumed that a letter maps to some bits which you can store on disk or in memory:<\/p>\n<p>A -&gt; 0100 0001<\/p>\n<p>In Unicode, a letter maps to something called a <em>code point<\/em> which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.<\/p>\n<p>In Unicode, the letter A is a platonic ideal. It\u2019s just floating in heaven:<\/p>\n<p align=\"center\">A<\/p>\n<p>This platonic A is different than B, and different from a, but the same as A and <i><b>A<\/b><\/i> and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but <em>different<\/em> from \u201ca\u201d in lower case, does not seem very controversial, but in some languages just figuring out what a letter <em>is<\/em> can cause controversy. Is the German letter \u00df a real letter or just a fancy way of writing ss? If a letter\u2019s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don\u2019t have to worry about it. They\u2019ve figured it all out already.<\/p>\n<p>Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: <strong>U+0639<\/strong>.&nbsp; This magic number is called a <em>code point<\/em>. The U+ means \u201cUnicode\u201d and the numbers are hexadecimal. <strong>U+0639<\/strong> is the Arabic letter Ain. The English letter A would be <strong>U+0041<\/strong>. 
You can find them all using the <strong>charmap<\/strong> utility on Windows 2000\/XP or visiting <a href=\"http:\/\/www.unicode.org\/\">the Unicode web site<\/a>.<\/p>\n<p>There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.<\/p>\n<p>OK, so say we have a string:<\/p>\n<p align=\"center\"><strong>Hello<\/strong><\/p>\n<p>which, in Unicode, corresponds to these five code points:<\/p>\n<p align=\"center\">U+0048 U+0065 U+006C U+006C U+006F.<\/p>\n<p>Just a bunch of code points. Numbers, really. We haven\u2019t yet said anything about how to store this in memory or represent it in an email message.<\/p>\n<p><strong>Encodings<\/strong><\/p>\n<p>That\u2019s where <em>encodings<\/em> come in.<\/p>\n<p>The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let\u2019s just store those numbers in two bytes each. So Hello becomes<\/p>\n<p align=\"center\">00 48 00 65 00 6C 00 6C 00 6F<\/p>\n<p>Right? Not so fast! Couldn\u2019t it also be:<\/p>\n<p align=\"center\">48 00 65 00 6C 00 6C 00 6F 00 ?<\/p>\n<p>Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already <em>two<\/em> ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a <a href=\"http:\/\/msdn.microsoft.com\/library\/default.asp?url=\/library\/en-us\/intl\/unicode_42jv.asp\">Unicode Byte Order Mark<\/a> and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. 
Not every Unicode string in the wild has a byte order mark at the beginning.<\/p>\n<p>For a while it seemed like that might be good enough, but programmers were complaining. \u201cLook at all those zeros!\u201d they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to <em>conserve (sneer)<\/em>. If they were Texans they wouldn\u2019t have minded guzzling twice the number of bytes. But those Californian wimps couldn\u2019t bear the idea of <em>doubling<\/em> the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who\u2019s going to convert them all? <em>Moi?<\/em> For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.<\/p>\n<p>Thus was <a href=\"http:\/\/www.cl.cam.ac.uk\/~mgk25\/ucs\/utf-8-history.txt\">invented<\/a>&nbsp;the brilliant concept of <a href=\"http:\/\/www.utf-8.com\/\">UTF-8<\/a>. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored <em>in a single byte<\/em>. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes in the original design (the current standard, RFC 3629, caps it at 4).<\/p>\n<p>This has the neat side effect that English text looks <em>exactly the same in UTF-8 as it did in ASCII,<\/em> so Americans don\u2019t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, <strong>Hello<\/strong>, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. 
Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you\u2019ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).<\/p>\n<p>So far I\u2019ve told you <em>three<\/em> ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it\u2019s high-endian UCS-2 or low-endian UCS-2. And there\u2019s the popular new UTF-8 <a href=\"http:\/\/www.zvon.org\/tmRFC\/RFC2279\/Output\/chapter2.html\">standard<\/a> which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.<\/p>\n<p>There are actually a bunch of other ways of encoding Unicode. There\u2019s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are <em>quite enough, thank you<\/em> it can still squeeze through unscathed. There\u2019s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn\u2019t be so bold as to waste <em>that<\/em> much memory.<\/p>\n<p>And in fact now that you\u2019re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! 
For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, <em>with one catch:<\/em> some of the letters might not show up! If there\u2019s no equivalent for the Unicode code point you\u2019re trying to represent in the encoding you\u2019re trying to represent it in, you usually get a little question mark: ? or, if you\u2019re <em>really<\/em> good, a box. Which did you get? -&gt; \ufffd<\/p>\n<p>There are hundreds of traditional encodings which can only store <em>some<\/em> code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and&nbsp;<a href=\"http:\/\/www.htmlhelp.com\/reference\/charset\/\">ISO-8859-1<\/a>, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store <em>any<\/em> code point correctly.<\/p>\n<p><strong>The Single Most Important Fact About Encodings<\/strong><\/p>\n<p>If you completely forget everything I just explained, please remember one extremely important fact. <strong>It does not make sense to have a string without knowing what encoding it uses<\/strong>. 
You can no longer stick your head in the sand and pretend that \u201cplain\u201d text is ASCII.<\/p>\n<p align=\"center\"><strong><u>There Ain\u2019t No Such Thing As Plain Text.<\/u><\/strong><\/p>\n<p>If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.<\/p>\n<p>Almost every stupid \u201cmy website looks like gibberish\u201d or \u201cshe can\u2019t read my emails when I use accents\u201d problem comes down to one naive programmer who didn\u2019t understand the simple fact that if you don\u2019t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.<\/p>\n<p>How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form<\/p>\n<blockquote dir=\"ltr\"><p><strong>Content-Type: text\/plain;&nbsp;charset=&quot;UTF-8&quot;<\/strong><\/p><\/blockquote>\n<p>For a web page, the original idea was that the web server would return a similar <strong>Content-Type<\/strong> http header along with the web page itself \u2014 not in the HTML itself, but as one of the response headers that are sent before the HTML page.<\/p>\n<p>This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. 
The web server itself wouldn\u2019t really <em>know<\/em> what encoding each file was written in, so it couldn\u2019t send the Content-Type header.<\/p>\n<p>It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy\u2026 how can you <em>read<\/em> the HTML file until you know what encoding it\u2019s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:<\/p>\n<blockquote dir=\"ltr\"><p><strong>&lt;html&gt;<br \/>&lt;head&gt;<br \/><\/strong><strong>&lt;<span class=\"start-tag\">meta<\/span> <span class=\"attribute-name\">http-equiv<\/span>=<span class=\"attribute-value\">&quot;Content-Type&quot;<\/span> <span class=\"attribute-name\">content<\/span>=<span class=\"attribute-value\">&quot;text\/html; charset=utf-8&quot;<\/span>&gt;<\/strong><\/p><\/blockquote>\n<p>But that meta tag really has to be the very first thing in the &lt;head&gt; section because as soon as the web browser sees this tag it\u2019s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.<\/p>\n<p>What do web browsers do if they don\u2019t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. 
It\u2019s truly weird, but it does seem to work often enough that na\u00efve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it <em>looks ok<\/em>, until one day, they write something that doesn\u2019t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it\u2019s Korean and displays it thusly, proving, I think, the point that Postel\u2019s Law about being \u201cconservative in what you emit and liberal in what you accept\u201d is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don\u2019t.<\/p>\n<p>For the latest version of <a href=\"http:\/\/www.fogcreek.com\/CityDesk\">CityDesk<\/a>, the web site management software published by <a href=\"http:\/\/www.fogcreek.com\/\">my company<\/a>, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT\/2000\/XP use as their native string type. In C++ code we just declare strings as <strong>wchar_t<\/strong> (\u201cwide char\u201d) instead of <strong>char<\/strong> and use the <strong>wcs<\/strong> functions instead of the <strong>str<\/strong> functions (for example <strong>wcscat<\/strong> and <strong>wcslen<\/strong> instead of <strong>strcat<\/strong> and <strong>strlen<\/strong>). To create a literal UCS-2 string in C code you just put an L before it as so: <strong>L&quot;Hello&quot;<\/strong>.<\/p>\n<p>When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. 
That\u2019s the way all <a href=\"https:\/\/www.joelonsoftware.com\/navLinks\/OtherLanguages.html\">29 language versions<\/a> of <em>Joel on Software<\/em> are encoded and I have not yet heard a single person who has had any trouble viewing them.<\/p>\n<p>This article is getting rather long, and I can\u2019t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you\u2019ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.<\/p>\n<\/div>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p class=\"excerpt\">Ever wonder about that mysterious Content-Type tag? You know, the one you\u2019re supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in\u2026 The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) Oldies but&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"https:\/\/monodes.com\/predaelli\/2024\/01\/14\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/\">Read more 
&rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":{"inline_featured_image":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[72],"tags":[],"class_list":["post-11195","post","type-post","status-publish","format-link","hentry","category-documentations","post_format-post-format-link"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6daft-2Uz","jetpack-related-posts":[{"id":2268,"url":"https:\/\/monodes.com\/predaelli\/2017\/03\/11\/a-programmers-introduction-to-unicode\/","url_meta":{"origin":11195,"position":0},"title":"A Programmer\u2019s Introduction to Unicode","author":"Paolo Redaelli","date":"2017-03-11","format":"link","excerpt":"\u00abA Programmer\u2019s Introduction to Unicode\u00bb from Nathan Reed\u2019s coding blog A Programmer\u2019s Introduction to Unicode March 3, 2017 \u00b7 Coding \u00b7 7 Comments \uff35\uff4e\uff49\uff43\uff4f\uff44\uff45! \ud83c\udd64\ud83c\udd5d\ud83c\udd58\ud83c\udd52\ud83c\udd5e\ud83c\udd53\ud83c\udd54\u203d \ud83c\uddfa\u200c\ud83c\uddf3\u200c\ud83c\uddee\u200c\ud83c\udde8\u200c\ud83c\uddf4\u200c\ud83c\udde9\u200c\ud83c\uddea! 
\ud83d\ude04 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to \u201csupport Unicode\u201d in our\u2026","rel":"","context":"In &quot;Senza categoria&quot;","block_context":{"text":"Senza categoria","link":"https:\/\/monodes.com\/predaelli\/category\/senza-categoria\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9813,"url":"https:\/\/monodes.com\/predaelli\/2022\/11\/09\/ascii-oh-ascii-wherefore-art-thou-ascii\/","url_meta":{"origin":11195,"position":1},"title":"ASCII, oh ASCII! Wherefore art thou, ASCII?","author":"Paolo Redaelli","date":"2022-11-09","format":false,"excerpt":"ASCII, oh ASCII! Wherefore art thou, ASCII?The original line copied by an infamous English poet Puns aside, in the XXI century there are still need to stick to plain, old 7 bit ASCII character table. Many industrial applications stick to it for its simplicity. Unicode is often an overkill in\u2026","rel":"","context":"In &quot;Tricks&quot;","block_context":{"text":"Tricks","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/tricks\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":763,"url":"https:\/\/monodes.com\/predaelli\/2015\/10\/31\/how-to-differentiate-between-an-average-and-a-good-programmer\/","url_meta":{"origin":11195,"position":2},"title":"How to differentiate between an Average and a Good Programmer?","author":"Paolo Redaelli","date":"2015-10-31","format":false,"excerpt":"\u00a0How to differentiate between an Average and a Good Programmer? \u00a0 Oh my! It's so true! 
javarevisited looks like a really good programming blog as I found gems like\u00ab10 Articles Every Programmer Must Read \u00bb among What Every Programmer Should Know about Memory What Every Computer Scientist Should Know About\u2026","rel":"","context":"In &quot;Senza categoria&quot;","block_context":{"text":"Senza categoria","link":"https:\/\/monodes.com\/predaelli\/category\/senza-categoria\/"},"img":{"alt_text":"Being a good programmer is 3% talent, 97% not being distracted by the Internet","src":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2015\/10\/Being%2Ba%2BProgrammer.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":9538,"url":"https:\/\/monodes.com\/predaelli\/2022\/08\/14\/a-suggestion\/","url_meta":{"origin":11195,"position":3},"title":"A suggestion","author":"Paolo Redaelli","date":"2022-08-14","format":false,"excerpt":"A little suggestion to #Librem and #PureOs developers, especially for #Librem5: an advanced energy saving mode that turns off everything not absolutely needed. 
No fancy animations, black'n'white, no gradients, no sub-pixel anti-aliasing, no background services (as far as possible)","rel":"","context":"In &quot;Gnome&quot;","block_context":{"text":"Gnome","link":"https:\/\/monodes.com\/predaelli\/category\/gnome\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2022\/08\/risparmio-energetico-avanzato1134265965843513934.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2022\/08\/risparmio-energetico-avanzato1134265965843513934.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2022\/08\/risparmio-energetico-avanzato1134265965843513934.jpg?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":3522,"url":"https:\/\/monodes.com\/predaelli\/2017\/11\/22\/sql-server-why-should-an-application-not-use-the-sa-account-database-administrators-stack-exchange\/","url_meta":{"origin":11195,"position":4},"title":"sql server &#8211; Why should an application not use the sa account &#8211; Database Administrators Stack Exchange","author":"Paolo Redaelli","date":"2017-11-22","format":"status","excerpt":"sql server - Why should an application not use the sa account - Database Administrators Stack Exchange I understand that the sa account enables complete control over a SQL Server and all the databases, users, permissions etc. 
I have an absolute belief that applications should not use the sa password\u2026","rel":"","context":"In &quot;Proprietary software&quot;","block_context":{"text":"Proprietary software","link":"https:\/\/monodes.com\/predaelli\/category\/software\/proprietary-software\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":15394,"url":"https:\/\/monodes.com\/predaelli\/2026\/03\/29\/nobody-gets-fired-for-picking-json-but-maybe-they-should\/","url_meta":{"origin":11195,"position":5},"title":"Nobody Gets Fired for Picking JSON, but Maybe They Should?","author":"Paolo Redaelli","date":"2026-03-29","format":"link","excerpt":"Nobody Gets Fired for Picking JSON, but Maybe They Should? By Miguel Young de la Sota Nobody Gets Fired for Picking JSON, but Maybe They Should? JSON is extremely popular but deeply flawed. This article discusses the details of JSON\u2019s design, how it\u2019s used (and misused), and how seemingly helpful\u2026","rel":"","context":"In &quot;Javascript&quot;","block_context":{"text":"Javascript","link":"https:\/\/monodes.com\/predaelli\/category\/javascript\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/11195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/comments?post=11195"}],"version-history":[{"count":0,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/11195\/revisions"}],"wp:attachment":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/media?parent=11195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\
/monodes.com\/predaelli\/wp-json\/wp\/v2\/categories?post=11195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/tags?post=11195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}