SQLServer Encoding WTF


  • ♿ (Parody)

    I integrate with a COTS product that uses SQL Server. Their database uses the collation that uses CP-1252.

    They store XML documents (complete with the beginning xml tag, etc) in a TEXT column, which I've come to learn is made to store non-unicode text (and seems to be deprecated to boot). Users sometimes type an apostrophe and it gets stored as that special right single quote garbage, which works fine in the table where it's stored as nvarchar, but when that gets put into one of these XML documents, it ends up as ’.

    Trying to find a way to convert that back into something sane, no luck so far. Is there any way to re-encode or whatever?
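    The corruption described above can be reproduced outside the database. A minimal Python sketch (assuming the XML writer emits UTF-8 bytes and the CP-1252 TEXT column takes them at face value):

    ```python
    # The user's apostrophe is really U+2019 RIGHT SINGLE QUOTATION MARK.
    curly = "\u2019"

    # Stored fine as nvarchar, but the XML document is written as UTF-8 bytes...
    utf8_bytes = curly.encode("utf-8")   # b'\xe2\x80\x99'

    # ...which the CP-1252 side then reads as three separate characters.
    garbage = utf8_bytes.decode("cp1252")
    print(garbage)                       # â€™
    ```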


  • Considered Harmful

    @boomzilla said in SQLServer Encoding WTF:

    I integrate with a COTS product that uses SQL Server. Their database uses the collation that uses CP-1252.

    They store XML documents (complete with the beginning xml tag, etc) in a TEXT column, which I've come to learn is made to store non-unicode text (and seems to be deprecated to boot). Users sometimes type an apostrophe and it gets stored as that special right single quote garbage, which works fine in the table where it's stored as nvarchar, but when that gets put into one of these XML documents, it ends up as ’.

    Trying to find a way to convert that back into something sane, no luck so far. Is there any way to re-encode or whatever?

    Once it's in an 8-bit encoding, the only way back is via nasty heuristics. Probably the least ugly one would be to make a list of all CP-1252 sequences that occur and would be valid UTF-8, hope there are not too many, and go over them manually to check for plausibility before converting the column to UTF-8 and writing a big-ass substitution to fix the cases that end up double-encoded.
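    That heuristic can be sketched in Python (the function name is mine; it assumes the column content is available as CP-1252-decoded text):

    ```python
    import re

    def suspect_sequences(text):
        """Collect non-ASCII runs whose CP-1252 byte values happen to form
        valid UTF-8 -- the candidates for having been double-encoded."""
        found = set()
        for run in re.findall(r"[^\x00-\x7f]+", text):
            try:
                decoded = run.encode("cp1252").decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # genuine CP-1252 text, not mojibake
            found.add((run, decoded))
        return found
    ```

    A legitimate CP-1252 character like the é in "café" fails the UTF-8 decode and is left alone, which is exactly the plausibility filter described above.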


  • Java Dev

    @boomzilla said in SQLServer Encoding WTF:

    I integrate with a COTS product that uses SQL Server. Their database uses the collation that uses CP-1252.

    They store XML documents (complete with the beginning xml tag, etc) in a TEXT column, which I've come to learn is made to store non-unicode text (and seems to be deprecated to boot). Users sometimes type an apostrophe and it gets stored as that special right single quote garbage, which works fine in the table where it's stored as nvarchar, but when that gets put into one of these XML documents, it ends up as ’.

    Trying to find a way to convert that back into something sane, no luck so far. Is there any way to re-encode or whatever?

    Since it turns into a sequence of characters rather than a fallback character, have you tried simply interpreting the bytestream as UTF-8?
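    That suggestion amounts to re-encoding the text as CP-1252 to get the raw bytes back, then decoding those bytes as the UTF-8 they always were. A minimal Python sketch:

    ```python
    mojibake = "weâ€™re"   # what the XML document contains

    # Encode under CP-1252 to recover the original byte stream,
    # then decode that stream as UTF-8.
    fixed = mojibake.encode("cp1252").decode("utf-8")
    print(fixed)            # we’re
    ```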


  • ♿ (Parody)

    @PleegWat said in SQLServer Encoding WTF:

    @boomzilla said in SQLServer Encoding WTF:

    I integrate with a COTS product that uses SQL Server. Their database uses the collation that uses CP-1252.

    They store XML documents (complete with the beginning xml tag, etc) in a TEXT column, which I've come to learn is made to store non-unicode text (and seems to be deprecated to boot). Users sometimes type an apostrophe and it gets stored as that special right single quote garbage, which works fine in the table where it's stored as nvarchar, but when that gets put into one of these XML documents, it ends up as ’.

    Trying to find a way to convert that back into something sane, no luck so far. Is there any way to re-encode or whatever?

    Since it turns into a sequence of characters rather than a fallback character, have you tried simply interpreting the bytestream as UTF-8?

    How? For reference, this goes from SQL Server, through Oracle (where all of our data is) and through our reporting application.

    I'm pretty much limited to what I can do in SQL Server or Oracle, but I'd prefer to do whatever I'm doing on SQL Server to keep the incompetence of the vendor contained there.



  • @boomzilla said in SQLServer Encoding WTF:

    Users sometimes type an apostrophe and it gets stored as that special right single quote garbage, which works fine in the table where it's stored as nvarchar, but when that gets put into one of these XML documents, it ends up as ’.

    That seems to be the UTF-8 encoding of the right single quote (U+2019, bytes E2 80 99) interpreted as CP-1252. You need to encode the text back into CP-1252 and interpret the resulting bytes as UTF-8 in order to make it work:

    $ echo ’ | iconv -t cp1252
    ’
    

    Or is it the question of "how to do it using $toolset", not "what should be done with it"?


  • ♿ (Parody)

    @aitap I know how to do it from, e.g., the linux CLI as you demonstrated. Need a way to do it inside the DB.

    Meanwhile in the interest of pleasing the customer I just added a REPLACE() call in a function. :vomit:
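    The equivalent of that stopgap, sketched in Python: a literal substitution that fixes only the sequences you've enumerated, which is why it rates a :vomit: next to a real re-encode.

    ```python
    doc = "Itâ€™s the vendorâ€™s problem"

    # One hard-coded substitution per known-bad sequence --
    # anything not on the list stays garbled.
    patched = doc.replace("â€™", "'")
    print(patched)   # It's the vendor's problem
    ```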


  • Java Dev

    @boomzilla Googledy-goo

    No obvious hits for mssql.



  • @boomzilla Okay, sorry for the noise.

    One more Hail Mary shot. The question title makes it sound as if it's nothing like your problem, but the answers are all about decoding UTF-8 bytes stored in a TEXT column. In plain SQL, on MS SQL Server.

    At least it's consistently botched, right? It's not a mixture of encodings?


  • ♿ (Parody)

    @PleegWat said in SQLServer Encoding WTF:

    @boomzilla Googledy-goo

    No obvious hits for mssql.

    Eh... the problem here is that it's the UTF-8 bytes just interpreted as CP-1252. At this point it's not really CP-1252.


  • ♿ (Parody)

    @aitap said in SQLServer Encoding WTF:

    @boomzilla Okay, sorry for the noise.

    One more Hail Mary shot. The question title makes it sound as if it's nothing like your problem, but the answers are all about decoding UTF-8 bytes stored in a TEXT column. In plain SQL, on MS SQL Server.

    At least it's consistently botched, right? It's not a mixture of encodings?

    Yeah, that looks like what I would probably need to convert anything that gets in there. Will have to play with that.

    Thanks!


  • Java Dev

    @boomzilla Depending on how it's currently interpreting the incoming bytestream, the correct cast may be from UTF-8 to CP-1252, possibly followed by UTF-8 to whatever the local charset is.


  • Discourse touched me in a no-no place

    I assume it's the same when you get the data from SQL Server directly as it is when you get it via Oracle?


  • ♿ (Parody)

    @boomzilla said in SQLServer Encoding WTF:

    @aitap said in SQLServer Encoding WTF:

    @boomzilla Okay, sorry for the noise.

    One more Hail Mary shot. The question title makes it sound as if it's nothing like your problem, but the answers are all about decoding UTF-8 bytes stored in a TEXT column. In plain SQL, on MS SQL Server.

    At least it's consistently botched, right? It's not a mixture of encodings?

    Yeah, that looks like what I would probably need to convert anything that gets in there. Will have to play with that.

    Thanks!

    Had a few minutes before end of day and hit soooo many errors. It didn't like some semicolons. Very unhappy with the CTEs; apparently the statement before a CTE needs a terminating semicolon. But then I got weird errors about needing to declare the function parameter.

    Probably because I'm not using an official MS tool to connect to the DB. Who knows?



  • @boomzilla said in SQLServer Encoding WTF:

    I integrate with a COTS product that uses SQL Server. Their database uses the collation that uses CP-1252.

    They store XML documents (complete with the beginning xml tag, etc) in a TEXT column, which I've come to learn is made to store non-unicode text (and seems to be deprecated to boot). Users sometimes type an apostrophe and it gets stored as that special right single quote garbage, which works fine in the table where it's stored as nvarchar, but when that gets put into one of these XML documents, it ends up as ’.

    Trying to find a way to convert that back into something sane, no luck so far. Is there any way to re-encode or whatever?

    This was always a problem for us. Solution...do not use rich text editors.

    I'm pretty sure I have not seen a RTE in CRM, so that might be their one redeeming quality.



  • @boomzilla said in SQLServer Encoding WTF:

    @boomzilla said in SQLServer Encoding WTF:

    @aitap said in SQLServer Encoding WTF:

    @boomzilla Okay, sorry for the noise.

    One more Hail Mary shot. The question title makes it sound as if it's nothing like your problem, but the answers are all about decoding UTF-8 bytes stored in a TEXT column. In plain SQL, on MS SQL Server.

    At least it's consistently botched, right? It's not a mixture of encodings?

    Yeah, that looks like what I would probably need to convert anything that gets in there. Will have to play with that.

    Thanks!

    Had a few minutes before end of day and hit soooo many errors. It didn't like some semicolons. Very unhappy with the CTEs; apparently the statement before a CTE needs a terminating semicolon. But then I got weird errors about needing to declare the function parameter.

    Probably because I'm not using an official MS tool to connect to the DB. Who knows?

    We've (I've) had the problem that our databases on the same SQL Server as CRM have different encodings.

    The workaround was to use COLLATE when selecting data.



  • I have one WTF that kind of matches this title: Borland C++ Builder 6 running in a Windows container with Hyper-V isolation interprets data from SQL Server with the wrong encoding, because those containers run in the UTF-8 codepage. (I tried all sorts of parameters in the connection string and nothing worked.)

