ASCII, oh ASCII! Wherefore art thou, ASCII?

The original line copied by an infamous English poet

Puns aside, in the 21st century there is still a need to stick to the plain old 7-bit ASCII character table. Many industrial applications stick to it for its simplicity. Unicode is often overkill in those situations.

So how do we find out whether a stream of text is Unicode? POSIX systems (GNU/Linux, *BSD, macOS and many, many others) have the file utility. It is as simple as “file filename”, which will likely answer something like:

DEST_CLI.CSV.20220901-105314.backup: UTF-8 Unicode text, with CRLF line terminators
foo: UTF-8 Unicode text
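The answer from file is meant for human eyes. In a script, a plain yes/no check can be handier. Here is a minimal sketch, assuming only POSIX tr and wc; the helper names count_non_ascii and is_ascii are made up for this example:

```shell
# Hypothetical helper: print the number of non-ASCII bytes in a file.
# LC_ALL=C makes tr work on raw bytes; it deletes every byte in the
# 7-bit range (octal 000-177), so only bytes >= 0x80 reach wc.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: succeed (exit status 0) only for pure 7-bit ASCII.
is_ascii() {
    [ "$(count_non_ascii "$1")" -eq 0 ]
}
```

Something like “is_ascii foo && echo clean” then works as a quick gate. Note that it counts bytes, not characters: a single accented letter in UTF-8 usually accounts for two or more bytes.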

When debugging, we need to find where the non-ASCII characters actually are. I found inspiration in How to Find Non-ASCII Characters in Text Files in Linux, which was graciously updated less than three days ago (lucky me!). After slightly modifying its command, it is as simple as:

grep --color='auto' -P -n "[^\x00-\x7F]" foo
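The -P option is a GNU extension, so it may be missing from the grep shipped with other systems. One possible fallback is a byte-oriented search using POSIX character classes. This is only a sketch, and the helper name find_non_ascii is made up for this example:

```shell
# Hypothetical helper: list lines (with their line numbers) containing any
# byte that is neither printable ASCII nor ordinary whitespace.
# LC_ALL=C makes grep compare raw bytes instead of decoding UTF-8.
find_non_ascii() {
    LC_ALL=C grep -n '[^[:print:][:space:]]' "$1"
}
```

Unlike the command above, this variant also flags stray ASCII control characters (tabs, CR and LF excluded), which is often welcome when hunting for whatever breaks a picky parser.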

Sometimes we just need to convert a UTF-8 file to ASCII (best-effort). In many cases iconv does the job; just pipe your data through:

iconv -f utf-8 -t ascii//TRANSLIT
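For whole files, the same conversion can be wrapped in a tiny function. This is a sketch under assumptions: to_ascii is a made-up name, and what //TRANSLIT produces depends on the iconv implementation and on the current locale. With GNU iconv, characters that cannot be transliterated come out as a question mark.

```shell
# Hypothetical helper: write a best-effort ASCII copy of a UTF-8 file.
# Usage: to_ascii input.txt output.txt
to_ascii() {
    # Write to a different file: redirecting onto the input itself would
    # make the shell truncate it before iconv reads a single byte.
    iconv -f utf-8 -t ascii//TRANSLIT < "$1" > "$2"
}
```

It is worth comparing the result with the original afterwards, since best-effort means some characters may have been approximated or replaced.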
