Thursday, August 4, 2016

Character Encoding : What & How

Character Encoding : What & How

You might have seen web pages with a symbol � instead of apostrophe ('). We too faced this on our website, not only with apostrophe but with a bunch of other characters. Now what exactly is the problem and how do you solve this? In short, the problem is that either the page does not specifies character encoding or has mentioned an incorrect leaving the browser guessing.

What is character encoding?

So, before we move on let’s see what character encoding actually is. As per the definition on wikipedia

“A character encoding is used to represent a repertoire of characters by some kind of an encoding system”

To put this across in simple terms

“Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding”

The most widely used encoding system that you might have even heard of is CP-1252, also know as “Windows-1252” as it is used on windows operating system. The other that I’d be talking about here would be “Unicode”

Unicode

As the name suggests Unicode is a unified coding system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to.

Although, this seems a brilliant idea however it comes with it’s own drawbacks as well. Four bytes for each character? Doesn’t this seem wasteful for languages e.g. English which is never going to use more than 256 characters.

There are unicode systems which store 4 bytes per character (i.e. UTF-32), then there are unicode systems that store 2 bytes per character (i.e. UTF-16) which are twice as space efficient 4 bytes per character systems. However, the one that we are specially talking about is UTF-8 which is a variable length unicode system i.e. that each character takes up different number of bytes. Now as mentioned earlier as well that each system does bring along its own drawbacks

Drawback:
  1. Finding the Nth character is an O(N) operation which is, the longer the string, the longer it takes to find a specific character.
  2. Bit manipulation or Bit twiddling involved to encode characters into bytes and decode bytes into characters.

However, comparing the advantages over other encoding systems (like UTF-16 & UTF-32), we have some interesting points.

  1. Super-efficient handling of ASCII Characters
  2. It is comparable to UTF-16 and definitely better than UTF-32
  3. And, due to exact nature of bit twiddling there is no byte-ordering issues.

Now since UTF-8 is our system of choice let’s move ahead on how do we ensure that our system is using the same.

Character Encoding :  In your code

HTML 4 Content

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

HTML 5 Content

<meta charset="UTF-8">

XML Content

<?xml version="1.0" encoding="UTF-8"?>

PHP

header('Content-Type: text/html; charset=utf-8');

MYSQL

$link = mysql_connect('localhost', 'user', 'password');
mysql_set_charset('utf8', $link);


Character Encoding : Into configuration files

php.ini (PHP)

default_charset = "utf-8";

my.cnf (MySQL)

 [client]
 default-character-set=UTF-8
   
 [mysql]
 default-character-set=UTF-8
   
 [mysqld]
 character-set-client-handshake = false #force encoding to uft8
 character-set-server=UTF-8
 collation-server=UTF-8_general_ci
   
 [mysqld_safe]
 default-character-set=UTF-8


While the above steps ensure that all the data coming henceforth in the system would be UTF-8 encoded and remains that way, the other challenge that remains is the existing data in the system which would still be in latin1.

We would also have to migrate this data from latin1 to utf-8
  1. Make sure all the above mentioned steps are done
  2. We begin with altering the schema of the database

mysql> ALTER SCHEMA `<database_name>` DEFAULT CHARACTER SET UTF-8;

  1. Create a dump file for the table / database with latin1 encoding that you want to convert

> mysqldump -u$1 -p$2 -c -e --default-character-set=utf8 --single-transaction --skip-set-charset --add-drop-database -B $3 > <database-table-name>.sql

> sed 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci/' < <database-table-name>.sql | sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' > <database-table-name>-fixed.sql

> mysql -u$1 -p$2 < <database-table-name>-fixed.sql

Problems & solutions
  1. Double Encoding
  1. There might be certain characters which might not have converted correctly yet. Since non-ASCII characters are multi-byte, we can find them by comparing byte length to the character length (this issue can also occur as a result of double-encoding into UTF-8)

mysql> SELECT COUNT(*) FROM <tablename> WHERE LENGTH(<fieldname>) != CHAR_LENGTH(<fieldname>);

  1. Now once we know that we have problem with our data let’s look how to fix the same.

mysql> CREATE TABLE TEMPTABLE (
     SELECT * FROM <tablename> WHERE
     LENGTH(<fieldname>) != CHAR_LENGTH(<fieldname>));

  1. Now this is tricky and we need to be very careful while fixing the same. Downside, we might end up corrupting the data to an extent where we are at the point of no return. So, basically what we would be doing here is tricking mysql so as to avoid converting this data.
    1. So to begin with we set the encoding to latin1, thereby removing double encoding from the data

mysql> ALTER TABLE TEMPTABLE MODIFY TEMPTABLE.<fieldname> varchar(128) character set latin1;

  1. To avoid the scenario of double encoding we now convert the data type of the field to blob. This exploits the fact the MySQL will attempt to encode a blob field.

mysql> ALTER TABLE TEMPTABLE MODIFY TEMPTABLE.<fieldname> blob;

mysql> alter table temptable modify temptable.<fieldname> varchar(128) character set UTF-8;

  1. Once the above activity is complete, remove all the erroneous data from the original table & replace the same with the data from the temp table with the fixed data.
Finally, repeat the activity to ensure that all the data has been correctly encoded.

  1. InnoDB tables with foreign keys giving mysql error 1025

Although, this a long and tedious process, but it saves you with the task of continuously patching the system or an elevated QPS as a result of selective character encoding prior to execution of every query.

No comments:

Post a Comment