MySQL and Character Sets

Wednesday Oct 18th 2006 by Ian Gilfillan
Share:

In spite of the dominance of English for business purposes, we still live in a multilingual world. At some point, most serious MySQL developers and DBA's will come across the need to store data from other languages and these may require other character sets or collations.

In spite of the dominance of English for business purposes, we still live in a multilingual world. At some point, most serious MySQL developers and DBA's will come across the need to store data from other languages and these may require other character sets or collations.

What are character sets and collations?

A character set is a set of characters, and the specific encoding for each character. A collation is a set of rules applied to a particular character set for comparison purposes. A collation can only apply to one character set (and its name will always begin with the character sets name), and each character set will always have a default collation.

For example, the letter Z may be encoded as the hex value A23 in one character set, and 15 in another. As an example of differing collation rule, just as in English, a falls before b, in Swedish, the character z falls before the character ö, while in German the reverse is true. More obscurely, in a German dictionary, the character öf falls before the character of, while in German telephone directories the reverse is true.

There can even be differences about what constitutes a character. For example, in German, ä is usually treated as an accented form of a, and therefore it follows after a. In Swedish however, it's seen as an individual letter, and falls after z.

MySQL support for character sets and collations

Before MySQL 4.1 (it seems so long ago now), MySQL could support only one character set at a time. So, MySQL could support Hebrew, or Greek, or any other character set, but if you wanted to store Greek and Hebrew data at the same time, this was not possible.

Now, you can specify character sets at the server, database, table, column, or string constant level. The MyISAM, MEMORY, NDBCluster, and InnoDB storage engines all support character sets. Here is a list of all character sets supported by my version of MySQL. You can run the SHOW CHARACTER SET statement to see what is available in your environment.

mysql> SHOW CHARACTER SET\G
*************************** 1. row ***************************
          Charset: big5
      Description: Big5 Traditional Chinese
Default collation: big5_chinese_ci
           Maxlen: 2
*************************** 2. row ***************************
          Charset: dec8
      Description: DEC West European
Default collation: dec8_swedish_ci
           Maxlen: 1
*************************** 3. row ***************************
          Charset: cp850
      Description: DOS West European
Default collation: cp850_general_ci
           Maxlen: 1
*************************** 4. row ***************************
          Charset: hp8
      Description: HP West European
Default collation: hp8_english_ci
           Maxlen: 1
*************************** 5. row ***************************
          Charset: koi8r
      Description: KOI8-R Relcom Russian
Default collation: koi8r_general_ci
           Maxlen: 1
*************************** 6. row ***************************
          Charset: latin1
      Description: cp1252 West European
Default collation: latin1_swedish_ci
           Maxlen: 1
*************************** 7. row ***************************
          Charset: latin2
      Description: ISO 8859-2 Central European
Default collation: latin2_general_ci
           Maxlen: 1
*************************** 8. row ***************************
          Charset: swe7
      Description: 7bit Swedish
Default collation: swe7_swedish_ci
           Maxlen: 1
*************************** 9. row ***************************
          Charset: ascii
      Description: US ASCII
Default collation: ascii_general_ci
           Maxlen: 1
*************************** 10. row ***************************
          Charset: ujis
      Description: EUC-JP Japanese
Default collation: ujis_japanese_ci
           Maxlen: 3
*************************** 11. row ***************************
          Charset: sjis
      Description: Shift-JIS Japanese
Default collation: sjis_japanese_ci
           Maxlen: 2
*************************** 12. row ***************************
          Charset: hebrew
      Description: ISO 8859-8 Hebrew
Default collation: hebrew_general_ci
           Maxlen: 1
*************************** 13. row ***************************
          Charset: tis620
      Description: TIS620 Thai
Default collation: tis620_thai_ci
           Maxlen: 1
*************************** 14. row ***************************
          Charset: euckr
      Description: EUC-KR Korean
Default collation: euckr_korean_ci
           Maxlen: 2
*************************** 15. row ***************************
          Charset: koi8u
      Description: KOI8-U Ukrainian
Default collation: koi8u_general_ci
           Maxlen: 1
*************************** 16. row ***************************
          Charset: gb2312
      Description: GB2312 Simplified Chinese
Default collation: gb2312_chinese_ci
           Maxlen: 2
*************************** 17. row ***************************
          Charset: greek
      Description: ISO 8859-7 Greek
Default collation: greek_general_ci
           Maxlen: 1
*************************** 18. row ***************************
          Charset: cp1250
      Description: Windows Central European
Default collation: cp1250_general_ci
           Maxlen: 1
*************************** 19. row ***************************
          Charset: gbk
      Description: GBK Simplified Chinese
Default collation: gbk_chinese_ci
           Maxlen: 2
*************************** 20. row ***************************
          Charset: latin5
      Description: ISO 8859-9 Turkish
Default collation: latin5_turkish_ci
           Maxlen: 1
*************************** 21. row ***************************
          Charset: armscii8
      Description: ARMSCII-8 Armenian
Default collation: armscii8_general_ci
           Maxlen: 1
*************************** 22. row ***************************
          Charset: utf8
      Description: UTF-8 Unicode
Default collation: utf8_general_ci
           Maxlen: 3
*************************** 23. row ***************************
          Charset: ucs2
      Description: UCS-2 Unicode
Default collation: ucs2_general_ci
           Maxlen: 2
*************************** 24. row ***************************
          Charset: cp866
      Description: DOS Russian
Default collation: cp866_general_ci
           Maxlen: 1
*************************** 25. row ***************************
          Charset: keybcs2
      Description: DOS Kamenicky Czech-Slovak
Default collation: keybcs2_general_ci
           Maxlen: 1
*************************** 26. row ***************************
          Charset: macce
      Description: Mac Central European
Default collation: macce_general_ci
           Maxlen: 1
*************************** 27. row ***************************
          Charset: macroman
      Description: Mac West European
Default collation: macroman_general_ci
           Maxlen: 1
*************************** 28. row ***************************
          Charset: cp852
      Description: DOS Central European
Default collation: cp852_general_ci
           Maxlen: 1
*************************** 29. row ***************************
          Charset: latin7
      Description: ISO 8859-13 Baltic
Default collation: latin7_general_ci
           Maxlen: 1
*************************** 30. row ***************************
          Charset: cp1251
      Description: Windows Cyrillic
Default collation: cp1251_general_ci
           Maxlen: 1
*************************** 31. row ***************************
          Charset: cp1256
      Description: Windows Arabic
Default collation: cp1256_general_ci
           Maxlen: 1
*************************** 32. row ***************************
          Charset: cp1257
      Description: Windows Baltic
Default collation: cp1257_general_ci
           Maxlen: 1
*************************** 33. row ***************************
          Charset: binary
      Description: Binary pseudo charset
Default collation: binary
           Maxlen: 1
*************************** 34. row ***************************
          Charset: geostd8
      Description: GEOSTD8 Georgian
Default collation: geostd8_general_ci
           Maxlen: 1
*************************** 35. row ***************************
          Charset: cp932
      Description: SJIS for Windows Japanese
Default collation: cp932_japanese_ci
           Maxlen: 2
*************************** 36. row ***************************
          Charset: eucjpms
      Description: UJIS for Windows Japanese
Default collation: eucjpms_japanese_ci
           Maxlen: 3
36 rows in set (0.00 sec)

Thirty-six character sets may seem a lot, but there are even more collations (as each character set can have one or more collations). The SHOW COLLATION statement will allow us to take a look at the available collations - there're quite a few of them!

mysql> SHOW COLLATION\G
*************************** 1. row ***************************
Collation: big5_chinese_ci
  Charset: big5
       Id: 1
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 2. row ***************************
Collation: big5_bin
  Charset: big5
       Id: 84
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 3. row ***************************
Collation: dec8_swedish_ci
  Charset: dec8
       Id: 3
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 4. row ***************************
Collation: dec8_bin
  Charset: dec8
       Id: 69
  Default:
 Compiled:
  Sortlen: 0
*************************** 5. row ***************************
Collation: cp850_general_ci
  Charset: cp850
       Id: 4
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 6. row ***************************
Collation: cp850_bin
  Charset: cp850
       Id: 80
  Default:
 Compiled:
  Sortlen: 0
*************************** 7. row ***************************
Collation: hp8_english_ci
  Charset: hp8
       Id: 6
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 8. row ***************************
Collation: hp8_bin
  Charset: hp8
       Id: 72
  Default:
 Compiled:
  Sortlen: 0
*************************** 9. row ***************************
Collation: koi8r_general_ci
  Charset: koi8r
       Id: 7
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 10. row ***************************
Collation: koi8r_bin
  Charset: koi8r
       Id: 74
  Default:
 Compiled:
  Sortlen: 0
*************************** 11. row ***************************
Collation: latin1_german1_ci
  Charset: latin1
       Id: 5
  Default:
 Compiled:
  Sortlen: 0
*************************** 12. row ***************************
Collation: latin1_swedish_ci
  Charset: latin1
       Id: 8
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 13. row ***************************
Collation: latin1_danish_ci
  Charset: latin1
       Id: 15
  Default:
 Compiled:
  Sortlen: 0
*************************** 14. row ***************************
Collation: latin1_german2_ci
  Charset: latin1
       Id: 31
  Default:
 Compiled: Yes
  Sortlen: 2
*************************** 15. row ***************************
Collation: latin1_bin
  Charset: latin1
       Id: 47
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 16. row ***************************
Collation: latin1_general_ci
  Charset: latin1
       Id: 48
  Default:
 Compiled:
  Sortlen: 0
*************************** 17. row ***************************
Collation: latin1_general_cs
  Charset: latin1
       Id: 49
  Default:
 Compiled:
  Sortlen: 0
*************************** 18. row ***************************
Collation: latin1_spanish_ci
  Charset: latin1
       Id: 94
  Default:
 Compiled:
  Sortlen: 0
*************************** 19. row ***************************
Collation: latin2_czech_cs
  Charset: latin2
       Id: 2
  Default:
 Compiled: Yes
  Sortlen: 4
*************************** 20. row ***************************
Collation: latin2_general_ci
  Charset: latin2
       Id: 9
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 21. row ***************************
Collation: latin2_hungarian_ci
  Charset: latin2
       Id: 21
  Default:
 Compiled:
  Sortlen: 0
*************************** 22. row ***************************
Collation: latin2_croatian_ci
  Charset: latin2
       Id: 27
  Default:
 Compiled:
  Sortlen: 0
*************************** 23. row ***************************
Collation: latin2_bin
  Charset: latin2
       Id: 77
  Default:
 Compiled:
  Sortlen: 0
*************************** 24. row ***************************
Collation: swe7_swedish_ci
  Charset: swe7
       Id: 10
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 25. row ***************************
Collation: swe7_bin
  Charset: swe7
       Id: 82
  Default:
 Compiled:
  Sortlen: 0
*************************** 26. row ***************************
Collation: ascii_general_ci
  Charset: ascii
       Id: 11
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 27. row ***************************
Collation: ascii_bin
  Charset: ascii
       Id: 65
  Default:
 Compiled:
  Sortlen: 0
*************************** 28. row ***************************
Collation: ujis_japanese_ci
  Charset: ujis
       Id: 12
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 29. row ***************************
Collation: ujis_bin
  Charset: ujis
       Id: 91
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 30. row ***************************
Collation: sjis_japanese_ci
  Charset: sjis
       Id: 13
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 31. row ***************************
Collation: sjis_bin
  Charset: sjis
       Id: 88
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 32. row ***************************
Collation: hebrew_general_ci
  Charset: hebrew
       Id: 16
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 33. row ***************************
Collation: hebrew_bin
  Charset: hebrew
       Id: 71
  Default:
 Compiled:
  Sortlen: 0
*************************** 34. row ***************************
Collation: tis620_thai_ci
  Charset: tis620
       Id: 18
  Default: Yes
 Compiled: Yes
  Sortlen: 4
*************************** 35. row ***************************
Collation: tis620_bin
  Charset: tis620
       Id: 89
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 36. row ***************************
Collation: euckr_korean_ci
  Charset: euckr
       Id: 19
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 37. row ***************************
Collation: euckr_bin
  Charset: euckr
       Id: 85
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 38. row ***************************
Collation: koi8u_general_ci
  Charset: koi8u
       Id: 22
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 39. row ***************************
Collation: koi8u_bin
  Charset: koi8u
       Id: 75
  Default:
 Compiled:
  Sortlen: 0
*************************** 40. row ***************************
Collation: gb2312_chinese_ci
  Charset: gb2312
       Id: 24
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 41. row ***************************
Collation: gb2312_bin
  Charset: gb2312
       Id: 86
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 42. row ***************************
Collation: greek_general_ci
  Charset: greek
       Id: 25
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 43. row ***************************
Collation: greek_bin
  Charset: greek
       Id: 70
  Default:
 Compiled:
  Sortlen: 0
*************************** 44. row ***************************
Collation: cp1250_general_ci
  Charset: cp1250
       Id: 26
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 45. row ***************************
Collation: cp1250_czech_cs
  Charset: cp1250
       Id: 34
  Default:
 Compiled: Yes
  Sortlen: 2
*************************** 46. row ***************************
Collation: cp1250_croatian_ci
  Charset: cp1250
       Id: 44
  Default:
 Compiled:
  Sortlen: 0
*************************** 47. row ***************************
Collation: cp1250_bin
  Charset: cp1250
       Id: 66
  Default:
 Compiled:
  Sortlen: 0
*************************** 48. row ***************************
Collation: gbk_chinese_ci
  Charset: gbk
       Id: 28
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 49. row ***************************
Collation: gbk_bin
  Charset: gbk
       Id: 87
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 50. row ***************************
Collation: latin5_turkish_ci
  Charset: latin5
       Id: 30
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 51. row ***************************
Collation: latin5_bin
  Charset: latin5
       Id: 78
  Default:
 Compiled:
  Sortlen: 0
*************************** 52. row ***************************
Collation: armscii8_general_ci
  Charset: armscii8
       Id: 32
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 53. row ***************************
Collation: armscii8_bin
  Charset: armscii8
       Id: 64
  Default:
 Compiled:
  Sortlen: 0
*************************** 54. row ***************************
Collation: utf8_general_ci
  Charset: utf8
       Id: 33
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 55. row ***************************
Collation: utf8_bin
  Charset: utf8
       Id: 83
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 56. row ***************************
Collation: utf8_unicode_ci
  Charset: utf8
       Id: 192
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 57. row ***************************
Collation: utf8_icelandic_ci
  Charset: utf8
       Id: 193
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 58. row ***************************
Collation: utf8_latvian_ci
  Charset: utf8
       Id: 194
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 59. row ***************************
Collation: utf8_romanian_ci
  Charset: utf8
       Id: 195
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 60. row ***************************
Collation: utf8_slovenian_ci
  Charset: utf8
       Id: 196
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 61. row ***************************
Collation: utf8_polish_ci
  Charset: utf8
       Id: 197
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 62. row ***************************
Collation: utf8_estonian_ci
  Charset: utf8
       Id: 198
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 63. row ***************************
Collation: utf8_spanish_ci
  Charset: utf8
       Id: 199
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 64. row ***************************
Collation: utf8_swedish_ci
  Charset: utf8
       Id: 200
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 65. row ***************************
Collation: utf8_turkish_ci
  Charset: utf8
       Id: 201
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 66. row ***************************
Collation: utf8_czech_ci
  Charset: utf8
       Id: 202
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 67. row ***************************
Collation: utf8_danish_ci
  Charset: utf8
       Id: 203
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 68. row ***************************
Collation: utf8_lithuanian_ci
  Charset: utf8
       Id: 204
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 69. row ***************************
Collation: utf8_slovak_ci
  Charset: utf8
       Id: 205
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 70. row ***************************
Collation: utf8_spanish2_ci
  Charset: utf8
       Id: 206
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 71. row ***************************
Collation: utf8_roman_ci
  Charset: utf8
       Id: 207
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 72. row ***************************
Collation: utf8_persian_ci
  Charset: utf8
       Id: 208
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 73. row ***************************
Collation: utf8_esperanto_ci
  Charset: utf8
       Id: 209
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 74. row ***************************
Collation: ucs2_general_ci
  Charset: ucs2
       Id: 35
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 75. row ***************************
Collation: ucs2_bin
  Charset: ucs2
       Id: 90
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 76. row ***************************
Collation: ucs2_unicode_ci
  Charset: ucs2
       Id: 128
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 77. row ***************************
Collation: ucs2_icelandic_ci
  Charset: ucs2
       Id: 129
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 78. row ***************************
Collation: ucs2_latvian_ci
  Charset: ucs2
       Id: 130
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 79. row ***************************
Collation: ucs2_romanian_ci
  Charset: ucs2
       Id: 131
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 80. row ***************************
Collation: ucs2_slovenian_ci
  Charset: ucs2
       Id: 132
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 81. row ***************************
Collation: ucs2_polish_ci
  Charset: ucs2
       Id: 133
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 82. row ***************************
Collation: ucs2_estonian_ci
  Charset: ucs2
       Id: 134
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 83. row ***************************
Collation: ucs2_spanish_ci
  Charset: ucs2
       Id: 135
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 84. row ***************************
Collation: ucs2_swedish_ci
  Charset: ucs2
       Id: 136
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 85. row ***************************
Collation: ucs2_turkish_ci
  Charset: ucs2
       Id: 137
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 86. row ***************************
Collation: ucs2_czech_ci
  Charset: ucs2
       Id: 138
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 87. row ***************************
Collation: ucs2_danish_ci
  Charset: ucs2
       Id: 139
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 88. row ***************************
Collation: ucs2_lithuanian_ci
  Charset: ucs2
       Id: 140
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 89. row ***************************
Collation: ucs2_slovak_ci
  Charset: ucs2
       Id: 141
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 90. row ***************************
Collation: ucs2_spanish2_ci
  Charset: ucs2
       Id: 142
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 91. row ***************************
Collation: ucs2_roman_ci
  Charset: ucs2
       Id: 143
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 92. row ***************************
Collation: ucs2_persian_ci
  Charset: ucs2
       Id: 144
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 93. row ***************************
Collation: ucs2_esperanto_ci
  Charset: ucs2
       Id: 145
  Default:
 Compiled: Yes
  Sortlen: 8
*************************** 94. row ***************************
Collation: cp866_general_ci
  Charset: cp866
       Id: 36
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 95. row ***************************
Collation: cp866_bin
  Charset: cp866
       Id: 68
  Default:
 Compiled:
  Sortlen: 0
*************************** 96. row ***************************
Collation: keybcs2_general_ci
  Charset: keybcs2
       Id: 37
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 97. row ***************************
Collation: keybcs2_bin
  Charset: keybcs2
       Id: 73
  Default:
 Compiled:
  Sortlen: 0
*************************** 98. row ***************************
Collation: macce_general_ci
  Charset: macce
       Id: 38
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 99. row ***************************
Collation: macce_bin
  Charset: macce
       Id: 43
  Default:
 Compiled:
  Sortlen: 0
*************************** 100. row ***************************
Collation: macroman_general_ci
  Charset: macroman
       Id: 39
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 101. row ***************************
Collation: macroman_bin
  Charset: macroman
       Id: 53
  Default:
 Compiled:
  Sortlen: 0
*************************** 102. row ***************************
Collation: cp852_general_ci
  Charset: cp852
       Id: 40
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 103. row ***************************
Collation: cp852_bin
  Charset: cp852
       Id: 81
  Default:
 Compiled:
  Sortlen: 0
*************************** 104. row ***************************
Collation: latin7_estonian_cs
  Charset: latin7
       Id: 20
  Default:
 Compiled:
  Sortlen: 0
*************************** 105. row ***************************
Collation: latin7_general_ci
  Charset: latin7
       Id: 41
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 106. row ***************************
Collation: latin7_general_cs
  Charset: latin7
       Id: 42
  Default:
 Compiled:
  Sortlen: 0
*************************** 107. row ***************************
Collation: latin7_bin
  Charset: latin7
       Id: 79
  Default:
 Compiled:
  Sortlen: 0
*************************** 108. row ***************************
Collation: cp1251_bulgarian_ci
  Charset: cp1251
       Id: 14
  Default:
 Compiled:
  Sortlen: 0
*************************** 109. row ***************************
Collation: cp1251_ukrainian_ci
  Charset: cp1251
       Id: 23
  Default:
 Compiled:
  Sortlen: 0
*************************** 110. row ***************************
Collation: cp1251_bin
  Charset: cp1251
       Id: 50
  Default:
 Compiled:
  Sortlen: 0
*************************** 111. row ***************************
Collation: cp1251_general_ci
  Charset: cp1251
       Id: 51
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 112. row ***************************
Collation: cp1251_general_cs
  Charset: cp1251
       Id: 52
  Default:
 Compiled:
  Sortlen: 0
*************************** 113. row ***************************
Collation: cp1256_general_ci
  Charset: cp1256
       Id: 57
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 114. row ***************************
Collation: cp1256_bin
  Charset: cp1256
       Id: 67
  Default:
 Compiled:
  Sortlen: 0
*************************** 115. row ***************************
Collation: cp1257_lithuanian_ci
  Charset: cp1257
       Id: 29
  Default:
 Compiled:
  Sortlen: 0
*************************** 116. row ***************************
Collation: cp1257_bin
  Charset: cp1257
       Id: 58
  Default:
 Compiled:
  Sortlen: 0
*************************** 117. row ***************************
Collation: cp1257_general_ci
  Charset: cp1257
       Id: 59
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 118. row ***************************
Collation: binary
  Charset: binary
       Id: 63
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 119. row ***************************
Collation: geostd8_general_ci
  Charset: geostd8
       Id: 92
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 120. row ***************************
Collation: geostd8_bin
  Charset: geostd8
       Id: 93
  Default:
 Compiled:
  Sortlen: 0
*************************** 121. row ***************************
Collation: cp932_japanese_ci
  Charset: cp932
       Id: 95
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 122. row ***************************
Collation: cp932_bin
  Charset: cp932
       Id: 96
  Default:
 Compiled: Yes
  Sortlen: 1
*************************** 123. row ***************************
Collation: eucjpms_japanese_ci
  Charset: eucjpms
       Id: 97
  Default: Yes
 Compiled: Yes
  Sortlen: 1
*************************** 124. row ***************************
Collation: eucjpms_bin
  Charset: eucjpms
       Id: 98
  Default:
 Compiled: Yes
  Sortlen: 1
124 rows in set (0.00 sec)

A collation is only really useful in the context of a character set. Since collation naming convention specifies that the supported character set will be part of the name, you can use LIKE to return collations that are related to a particular character set. For example, to see which collations can be applied to 7bit Swedish, use:

mysql> SHOW COLLATION LIKE '%swe7%'\G
*************************** 1. row ***************************
Collation: swe7_swedish_ci
  Charset: swe7
       Id: 10
  Default: Yes
 Compiled:
  Sortlen: 0
*************************** 2. row ***************************
Collation: swe7_bin
  Charset: swe7
       Id: 82
  Default:
 Compiled:
  Sortlen: 0
2 rows in set (0.07 sec)

Default character sets and collations

There are five levels to which character sets and collations can be applied - server, database, table, column and string constant, each of which can have differing defaults. They also apply according to a certain hierarchy. The character set and collation specific to a particular string constant and column apply first. If those don't exist, the table default is used, and if that doesn't exist the database default, and finally the server default. By default, MySQL uses the latin1 character set, and the latin1_swedish_ci collation for each. You can see which defaults apply to your installation by running the SHOW VARIABLES LIKE 'character_set%' and SHOW VARIABLES LIKE 'collation%'statements:

mysql> SHOW VARIABLES LIKE 'character_set%'\G
*************************** 1. row ***************************
Variable_name: character_set_client
        Value: latin1
*************************** 2. row ***************************
Variable_name: character_set_connection
        Value: latin1
*************************** 3. row ***************************
Variable_name: character_set_database
        Value: latin1
*************************** 4. row ***************************
Variable_name: character_set_results
        Value: latin1
*************************** 5. row ***************************
Variable_name: character_set_server
        Value: latin1
*************************** 6. row ***************************
Variable_name: character_set_system
        Value: utf8
*************************** 7. row ***************************
Variable_name: character_sets_dir
        Value: /usr/local/mysql/share/mysql/charsets/
7 rows in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation%'\G
*************************** 1. row ***************************
Variable_name: collation_connection
        Value: latin1_swedish_ci
*************************** 2. row ***************************
Variable_name: collation_database
        Value: latin1_swedish_ci
*************************** 3. row ***************************
Variable_name: collation_server
        Value: latin1_swedish_ci
3 rows in set (0.00 sec)

The server defaults are set up in your configuration file (my.cnf). For example:

[mysqld]
...
default-character-set=latin2
default-collation=latin2_general_ci
...

If no other character sets or collations are specified at any of the other three levels, all operations will be performed using this character set and collation. For example, create a new database, and see which character set is used:

mysql> CREATE DATABASE latin1db;
Query OK, 1 row affected (0.00 sec)
mysql> SHOW CREATE DATABASE latin1db\G
*************************** 1. row ***************************
       Database: latin1db
Create Database: CREATE DATABASE `latin1db` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)

To create a database that makes use of a different character set and collation, simply use the CHARACTER SET (CHARSET is a synonym) and COLLATION clauses:

mysql> CREATE DATABASE latin2db CHARACTER SET latin2 COLLATE latin2_general_ci;
Query OK, 1 row affected (0.05 sec)
mysql> SHOW CREATE DATABASE latin2db\G
*************************** 1. row ***************************
       Database: latin2db
Create Database: CREATE DATABASE `latin2db` /*!40100 DEFAULT CHARACTER SET latin2 */
1 row in set (0.00 sec)


Let's look at changing the character set and collation at the table and column levels as well.

mysql> USE latin1db;
Database changed
mysql> CREATE TABLE latin2table (v VARCHAR(50), t TEXT) CHARACTER SET latin2 COLLATE latin2_general_ci;
Query OK, 0 rows affected (0.12 sec)
mysql> CREATE TABLE latin1table (c CHAR(20), v VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci, t TEXT CHARACTER SET latin1 COLLATE latin1_german1_ci);
Query OK, 0 rows affected (0.07 sec)

Now, let's insert the 'ý' character into each column, and see the different ways it's stored, using the HEX() function.

mysql> INSERT INTO latin1table(c,v,t) VALUES('ý','ý','ý');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT c,v,t FROM latin1table;
+------+------+------+
| c    | v    | t    |
+------+------+------+
| ý    | ý    | ý    |
+------+------+------+
1 row in set (0.00 sec)
mysql> SELECT HEX(c), HEX(v), HEX(t) FROM latin1table;
+--------+--------+--------+
| HEX(c) | HEX(v) | HEX(t) |
+--------+--------+--------+
| FD     | C3BD   | FD     |
+--------+--------+--------+
1 row in set (0.00 sec)

Since utf8 is an encoding for many more characters than latin1, it therefore requires more bytes of storage, as is indicated in the hex value used to store the character.

Client and connection CHARACTER SET and COLLATION issues

As anyone developing for a wide variety of languages and character sets will know, it's not just how MySQL handles character sets that's important. All the other tools used for interacting are as important. The client being used to interact with MySQL is a key element as well. Take for example the character. This character can be encoded in utf8, but not in latin1. Working with it is further complicated by the other tools I'm using at the same time. I'm using a KDE desktop environment to write this, and can easily cut and paste the character. In order to save the character in my editor, KWrite, I had to manually set the encoding. Some editors won't be able to make sense of the character at all. However, if I paste the character into my command line MySQL client, the character is replaced with ?, as the client cannot understand it. MySQL too will replace characters it cannot understand with a question mark. For example, let's convert our utf8 varchar column (containing the ý character) into ascii, which cannot store it:

mysql> ALTER TABLE latin1table CHANGE v v VARCHAR(50) CHARACTER SET ascii;
Query OK, 1 row affected, 1 warning (0.20 sec)
Records: 1  Duplicates: 0  Warnings: 1
mysql> SELECT c,v,t FROM latin1table;
+------+------+------+
| c    | v    | t    |
+------+------+------+
| ý    | ?    | ý    |
+------+------+------+
1 row in set (0.00 sec)
mysql> SELECT HEX(c), HEX(v), HEX(t) FROM latin1table;
+--------+--------+--------+
| HEX(c) | HEX(v) | HEX(t) |
+--------+--------+--------+
| FD     | 3F     | FD     |
+--------+--------+--------+
1 row in set (0.00 sec)

You can see one major change between ascii and utf8, in the length of the returned hex result. utf-8 stores characters with between one and four bytes (although the subset used by MySQL only uses up to three bytes a character), while ascii only uses one. Because of the performance drop when using more bytes to store strings (and indexes), it doesn't make sense to use a broad character set requiring more than one byte per character (utf16 for example) just in case you ever need it. The default latin1 is sufficient for most western European languages. If you do need to use a multi-byte character set, rather use VARCHARs than CHARs, as the benefits are even greater than normal.

Earlier, we saw the character_set_client, character_set_results and character_set_connection variables. You can modify each of these individually (for example, to latin2) by running:

SET character_set_client = latin2;
SET character_set_results = latin2;
SET character_set_connection = latin2;

but it's quite likely you'll want to set all three at the same time. You can do this in one statement, as follows:

SET NAMES = latin2;

This will set the collation to whatever the default is for that character set, but to apply a collation specifically, use the COLLATE clause, for example:

SET NAMES latin2 COLLATE latin2_czech_cs

Conclusion

MySQL's mechanism for handling multiple languages is simple and powerful, especially compared to many of its DBMS rivals. In practice, you are likely to come across more issues with the other elements in a multilingual project, such as your desktop environment, programming language or editor. Good luck!

» See All Articles by Columnist Ian Gilfillan

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved