Foreign Characters stored on Latin collation vs BODS

system · May 23, 2012, 8:14am

Designer: 12.2.3.0
SQL Server 2008 R2
Collation: Latin1_General_CI_AS, code page: 1252

We have been experiencing a problem with the ecszet character, ß (alt 225) where sql server “sees” this as a ‘ss’, due to the collation we are using. Due to the scale of our warehouse, it is not desirable to change the collation to binary to fix this, instead we want to configure data services to handle this character the same way as sql server.

Using data services locale selector I have adjusted the code page to 1252 and have run a test DF. If I make sure the data is brought into the job server (by shoving a map operation in) and then performing a subsequent select distinct in next query, data services does not filter any rows out that are duplicate.

Are there any other settings I need to consider? I had a play around with the datastore code page, server page etc. but this had no effect (except some characters coming through as arrows or blocks on the select distinct test but still fails on table comparison test).

John_F_Sheppey (BOB member since 2009-10-16)

system · May 23, 2012, 12:01pm

But cp1252 does have this character!?! Why would there be a conversion from one character into two?

There might be an upper() function or a client side effect but I don’t think that 0xDF character code gets changed to two bytes of 0x53 0x53 (SS) by a plain read with any tool including SQL Server Management Studio.

http://www.gymel.com/charsets/CP1252.html#U00DF

Werner Daehn (BOB member since 2004-12-17)

system · May 23, 2012, 1:44pm

This may explain it further.

I have a similar problem where the codepage reads ‘æ’ (alt 145) as ‘ae’.

If we used binary, it would be fine but the latin collation reads these characters this way. Data services sees them as separate characters, why would it behave in this manner when code page is set to 1252 on the data services local settings and datastore code page & server code page to cp1252, shouldn’t it work the same way as SQL server 2008 R2?

It’s quite simple to test, if you have data collated with what I have defined and you do a

Select *
from a
where a.name like ‘%ß%’

you will return any rows with ‘ss’ and ‘ß’, this behaviour is similar to æ and ae.

John_F_Sheppey (BOB member since 2009-10-16)

system · May 23, 2012, 1:57pm

Because the codepage has those two as separate values. It’s the collation setting of your database which tells them to be treated as equal, BODS does not honor that settings when comparing in memory (actually there is no way it can)

Actually this would rather be a language setting then a codepage if it can be changed within BODS but I’m no expert in that field.

If you use Latin1_General_CP1_CI_AS you would get ss != ß on your database as well…

Actually if you search the Forum you will find COLLATE Error on Table Comparison
So it seems that BODS is doing a COLLATE with a binary sort when using sorted input. Maybe there is an option to set which COLLATE is used there? Werner should know

HerdplattenToni (BOB member since 2011-05-13)

system · May 23, 2012, 2:28pm

Thank you all for the participation, it has been a great help.

I did also configure the language to eng and the location to gb on data services locale settings, and on the datastore to eng- but this had no effect.

I feel like I am missing something else that needs configuring…the sorted input is fixed for this version of BODS (11.7.2 was when it was resolved).

Ironically it warns me of duplicate keys if I push the same rows through with a ‘ß’ version and an ‘SS’ version (for cached this warning does not appear). Doing a select distinct before does not remove this warning, unless pushed down to database.

John_F_Sheppey (BOB member since 2009-10-16)

system · May 23, 2012, 4:27pm

I am still confused about the expectation.

a) Is it that when you do a
select * from table where name like ‘%ß%’
you get different amount of rows depending on where the filter is executed, DS or database? Yes, that’s a fact and cannot be changed.

b) You read an ß character from the source database and expect an ss char in the target database.

My understanding is, the collation controls the codepage and the comparison logic. In the codepage there is an ß it will always remain there. In the sort order you can do different things, e.g. is DAEHN=DÄHN, yes or no? These kind of things. Here DS works binary.
Actually, this is not entirely true. We use the database default collation for sorting except when the database and DS does sorts/comparison on the data, then we force the database to binary sorts. Else we don’t.
Example of such a dataflow would be one with TC in sorted mode. Would be kind of interesting if the database does sort in lexicographical order with A<B<C<…<O<Ö (O-Umlaut)<P<… and the TC transform then does a binary comparison like Is the hex code of the previous row value < the hex code of the current row? (Order there is binary, meaning A<B<…O<P<…Z<Ö)

Werner Daehn (BOB member since 2004-12-17)

system · May 24, 2012, 7:10am

If I understood correctly the problem is that Table Comparison will not flag a row as update if the comparison is done by DS (sorted input option and doing the sort not on the DB) and the difference is only between ß and ss. And since you get an insert the dataflow will fail with a unique constraint error since the DB says those two are identical.

The expectation is that if you do a table comparison a unique constraint violation on the input primary keys can never happen.

The same thing applies for doing a select distinct and writing in an empty table. The expectation is that distinct avoids duplicate keys.

edit: Don’t get me wrong, I personally don’t agree with the expectation. The unique key is in my opinion not on the column but on a function (collate(column)) and distinct on the column doesn’t have to have any relation with that.

HerdplattenToni (BOB member since 2011-05-13)

system · May 24, 2012, 8:25am

Apologies, I probably didn’t make this as clear as it needs to be. HerdplattenToni hit the nail on the head with his first paragraph.

I don’t expect any conversion of the character, i.e. to change ß to ss.

What I expect is data service to handle this the same way as the database. If I do a select distinct on “straße” and “strasse”, only one row will return on db (or if I push the sql down via BODS). If the data is forced onto the jobserver, a select distinct will return 2 rows.

Similary if I do a table comparison and a column is part of the natural keys (primary keys), then if a row arrives that is ‘straße’, but in the target we already have a value of ‘strasse’, then I expect DS to flag this as an update on the compare columns.

Why do we force it to use binary sorts?, it’s odd because like I said if I push both rows through at same time, it does warn me of duplicate keys. If I push them through on separate runs, the second one will still result in an insert. The behaviour isn’t consistent and does not match similar operations at DB level. Is it not possible to configure BODS to use code page 1252 for various operations I may want to perform?

John_F_Sheppey (BOB member since 2009-10-16)