Upgrading an ASP.Net/SQL2000 application to support Chinese (GB18030) characters



  • Hello All,

    I've got an ASP.Net application with an SQL Server 2000 back-end that I'd like to upgrade so it can handle chinese language (input & display, with text searching & sorting).

    I know I'll have to change all my varchar fields to nvarchar and prefix all strings with N'. I've got as far as reading the GB18030 standard and know that this is not supported in SQL2000. I'm a bit confused with the various forms of unicode support and how this then flows through to GB18030 compliance.

    Can anyone offer some practical advice or experience on these matters?

    Cheers.



  • You might want to ask this in the SQL Team forums (http://www.sqlteam.com).

    Ken



  • I don't think so.

    nvarchar uses 16-bit encodings; SQL Server 2000 I believe supports UTF-16.  As such, GB18030 support isn't necessary in the database; by using nvarchar columns, you've made the decision to use UTF-16 as your encoding.

    As such, your application needs to convert from GB18030 to UTF-16 (prior to insertion to the database) and convert from UTF-16 to GB18030 prior to presentation to the user.  This is in fact a generalized behaviour you should use; any data you get should be converted from whatever character set it's in to UTF-16 on entry, and back to the user's character set on presentation.  Otherwise you'll get screwy results if someone has their system set up to use some non-GB18030 encoding.

    If you do this searching and so on should work properly.

    The remaining issue is collation order; you can only pick one.  If you're only ever going to have users from one particular locale, you could pick their sort order (there should be one or more chinese collation orders in SQL Server).  If you have multiple locales I think you're probably best off using case-insensitive, accent-insensitive.  It's not ideal, but it's the best you can do.

     

     



  • Thanks Dr. Pizza that's very useful info. 

    My understanding is that the GB18030 maps directly to all UTF-16 code points, and then extends over & above that to cater for a greater number of tradional chinese characters. On that basis is it fair to say that SQL2000 could never fully support the full GB18030 character set?

    I'd be interested in hearing about any experiences you (or anyone else) may have had with this sort of localisation task? Anything else to watch out for?

    Cheers.



  • UTF-16 can encode every single Unicode character; so can GB18030 (and thus you can map between them).  The difference is that GB18030 can encode certain characters in two bytes whilst UTF-16 needs four bytes (i.e. GB18030 has two-byte encodings for certain non-BMP characters).  SQL Server 2000 doesn't directly support GB18030, but it supports UTF-16 fine.  As such, it's good enough.

    The main thing you can do to make things easy for yourself is to pick a common internal representation (and you're best choosing one of UTF-8, UTF-16, UTF-32, as they're the most standard/widely supported) and at the "edges" of your application, convert everything from the foreign encoding (GB18030, various ISO 8-bit encodings, other unicode encodings) into your internal format, and vice versa.

     


Log in to reply