After migrating a Joomla installation from version 1.0 to 1.5, I came across a particularly annoying issue: the following characters displayed incorrectly in the browser:
- left single quote: ‘
- right single quote: ’
- left double quote: “
- right double quote: ”
- emdash: —
I first thought that we had forgotten to convert the encoding from iso-8859-1 to utf-8, but this was not the case. Accented letters such as ë, é, à, and so on all came through correctly, so the encoding conversion itself had been done. Still, there were problems with at least these five characters. The Joomla migration script had converted them into the following utf-8 (hexadecimal) character codes:
- left single quote: C291
- right single quote: C292
- left double quote: C293
- right double quote: C294
- emdash: C296/C297
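The codes listed above are consistent with text that was really Windows-1252 being read as iso-8859-1. The following Python sketch reproduces them under that assumption. Python is used here only for illustration and is not part of the Joomla migration; the function name is made up for this example.

```python
def misconverted_hex(text):
    """Return the utf-8 hex codes produced when cp1252 bytes are read as latin-1."""
    # Assumption: the Joomla 1.0 content was stored as Windows-1252 (cp1252).
    raw = text.encode("cp1252")
    # A converter that trusts the iso-8859-1 label maps every byte to the code
    # point with the same number, which for these bytes is a C1 control code.
    misread = raw.decode("latin-1")
    return [ch.encode("utf-8").hex().upper() for ch in misread]


# Curly quotes, en dash and em dash, in that order.
for char, code in zip("‘’“”–—", misconverted_hex("‘’“”–—")):
    print(char, "->", code)
```

In this sketch the en dash ends up as C296 and the em dash as C297, which would explain why both codes turn up for dashes.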
Note: There is actually a reason for this mess. At the time, the ISO/IEC 8859 standardization committee did not wish to standardize character codes primarily used in typography, such as curly quotes, nor characters that users could ordinarily simulate by composing them from multiple characters. Consequently, every operating system defines its own codes for these characters. This also explains why iso-8859-1 to utf-8 converters may get things wrong on these characters.
But then again, what are the proper utf-8 character codes for these characters? I decided to triangulate over html entities. I used the easysql database management tool to issue SQL statements from my Joomla installation. You can discover the bad character codes by issuing the SQL command:
select introtext, hex(introtext) from jos_content
This allows you to fish for the wrong character codes. You can find what the correct character codes should be by writing the affected characters in html entity format. A left single quote becomes &lsquo;. The command select '&lsquo;' automatically becomes select '‘' after executing it in easysql, and select hex(ord('‘')) yields E28098, which is the correct character code. This is, of course, not the character code that the migration script produced. The complete list of correctly displaying character codes is therefore:
- ‘ -> &lsquo; -> E28098
- ’ -> &rsquo; -> E28099
- “ -> &ldquo; -> E2809C
- ” -> &rdquo; -> E2809D
- — -> &mdash; -> E28094
To test a character code, issue, for example, the following SQL command:
select 0xE28098
The result will be a left single quote. From there, we can correct the content of the jos_content table in Joomla:
UPDATE jos_content SET introtext = REPLACE(introtext, unhex('C291'), unhex('E28098'));
UPDATE jos_content SET introtext = REPLACE(introtext, unhex('C292'), unhex('E28099'));
UPDATE jos_content SET introtext = REPLACE(introtext, unhex('C293'), unhex('E2809C'));
UPDATE jos_content SET introtext = REPLACE(introtext, unhex('C294'), unhex('E2809D'));
UPDATE jos_content SET introtext = REPLACE(introtext, unhex('C296'), unhex('E28094'));
If C297 also occurs in your content, it can be replaced with E28094 in the same way.
This effectively solves the problem: the characters that were previously in error now display correctly. If you cannot triangulate over html entities, there are other ways to discover what the correct character codes should be. If you can type the character on the keyboard, put it in place of <myletter> and see what mysql returns after executing: select hex(ord('<myletter>')). It should show you the correct encoding in hexadecimal notation.
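Before running the UPDATE statements against a live table, the replacement can be tried out on a sample string. The following Python sketch applies the same five byte-level substitutions as the UPDATE statements. The sample text and the names FIXES and repair are invented for this example; the real fix still happens in MySQL.

```python
# Wrong utf-8 bytes -> correct utf-8 bytes, as in the UPDATE statements.
FIXES = {
    bytes.fromhex("C291"): bytes.fromhex("E28098"),  # left single quote
    bytes.fromhex("C292"): bytes.fromhex("E28099"),  # right single quote
    bytes.fromhex("C293"): bytes.fromhex("E2809C"),  # left double quote
    bytes.fromhex("C294"): bytes.fromhex("E2809D"),  # right double quote
    bytes.fromhex("C296"): bytes.fromhex("E28094"),  # dash, mapped to em dash
}


def repair(raw):
    """Replace every mis-converted byte sequence with its correct encoding."""
    for wrong, right in FIXES.items():
        raw = raw.replace(wrong, right)
    return raw


# Hypothetical sample of broken content as it might sit in introtext.
broken = b"It\xc2\x92s a \xc2\x93test\xc2\x94"
print(repair(broken).decode("utf-8"))
```

Working on bytes rather than decoded text mirrors what REPLACE with unhex() does in the SQL version.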