Watch, Follow, &
Connect with Us

For forums, blogs and more please visit our
Developer Tools Community.


Welcome, Guest
Guest Settings
Help

Thread: Silent string truncation with UTF8?



Permlink Replies: 13 - Last Post: May 13, 2015 3:06 PM Last Post By: Anthony Gautier
Anthony Gautier

Posts: 17
Registered: 10/10/09
Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 7, 2015 10:05 AM
Summary: I've noticed that with UTF8 enabled, even though I have a VARCHAR(N), the DB can store up to N * 4 bytes worth of characters without errors. It then hides the characters that go beyond N when I perform later queries. Is this a bug, or is it possible I have something set up incorrectly?

Original message below


Hello EDN forums!

I've been looking into creating future databases with UTF8 as the default character set using InterBase XE7. From what I can gather, this is the recommended method of supporting Unicode. For the most part it's working swimmingly, but I ran into a couple of puzzling issues that I'm hoping the community can help me resolve. These issues happen for me using UNICODE_FSS as well.

First of all, my configuration:
-InterBase XE7 Developer Edition V12.0.1.261
-Windows 7
-Database with charset UTF8 for all VARCHARs (this is the primary string type we use)
-IBConsole for testing queries. I've tried setting no char set when creating db or performing a query, and setting to UTF8.

Now the behaviors: it looks like silent string "truncation" is occurring when the data I'm placing into a VARCHAR(N) has more than N characters, but the characters don't exceed N*charsize (4 bytes in UTF8). When I perform a "select *" query and look at the field, it only shows N chars (the additional chars are not there). When I perform a query with the field contents as a condition, the condition only evaluates to true if I include the original, untruncated string as the condition.

This seems to indicate that even though I have a VARCHAR(N), the DB can store more than N characters and is hiding the rest. Is this a bug, or is it possible I have something set up incorrectly?

Here is an example of the behavior I'm seeing (can be reproduced through IBConsole):

1) Try to place a string with >N chars into a VARCHAR(N) column (in this case a VARCHAR(1)), which also fits within N*4 bytes
Execute query: "update a_table set a_id = 'abcd' where a_category = '1'"
Result: Query successful. No errors!

2) Show values as stored in DB for the last commit
Execute query: "select * from a_table where a_category = '1'"
Result: Query successful. a_value = "a"

3) Attempt to show the value stored in the DB for the last commit via its truncated field data
Execute query: "select * from a_table where a_value = 'a'"
Result: Query successful. Query returns NO results!

4) Attempt to show the value stored in the db for the last commit via its original untruncated field data
Execute query: "select * from a_table where a_value = 'abcd'"
Result: Query successful. Query returns results!

Edited by: Anthony Gautier on Apr 13, 2015 9:28 AM
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 14, 2015 3:55 PM   in response to: Anthony Gautier in response to: Anthony Gautier
Is there anyone out there who has experienced this issue or can do a quick test to see if you can reproduce? Any insight is much appreciated.
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 20, 2015 9:31 AM   in response to: Anthony Gautier in response to: Anthony Gautier
This issue is still causing me trouble. We have a large existing application that expects the DB to return an error when a string is too large to store, but I can't rely on the DB to report errors to me when it's behaving this way. I may write this as a bug and see where this will get me.
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 24, 2015 10:17 AM   in response to: Anthony Gautier in response to: Anthony Gautier
This problem looks to have been written up in QualityCentral already (#102982), but hasn't gained much attention so far. This is pretty surprising to me, considering the severity of the issue.

[http://qc.embarcadero.com/wc/qcmain.aspx?d=102982]

Edited by: Anthony Gautier on Apr 24, 2015 10:17 AM
Tamotsu Maki

Posts: 3
Registered: 10/3/99
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 24, 2015 4:07 PM   in response to: Anthony Gautier in response to: Anthony Gautier
Hi Anthony,

This is our long-time design issue. When defined as CHAR/VARCHAR(n), InterBase allows to store up to n*4 bytes. Not enforcing "number of characters" limitation.

When you defined CHAR(1) field with UTF8, you can store 'ABCD' (4 bytes) in this field. When you try to insert 5 bytes of UTF8 characters (like 'ABCDE'), server will return SQL code -802. ("Arithmetic exception, numeric overflow, or string truncation")

My recommendation is using UNICODE_BE on the server side, then set UTF8 to client side. Hence UNICODE_BE is fixed length character set, CHAR/VARCHAR(n) only allows up to n characters.

- Tamotsu
InterBase QA
Tamotsu Maki

Posts: 3
Registered: 10/3/99
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 24, 2015 4:15 PM   in response to: Anthony Gautier in response to: Anthony Gautier
Also, I confirmed IBConsole truncate strings with this situation.

I say this is a bug of IBConsole. Client application should not truncate the result.

- Tamotsu
InterBase QA
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 27, 2015 11:48 AM   in response to: Tamotsu Maki in response to: Tamotsu Maki
Thanks for the info Tamotsu! It's good to hear from someone. In my experience, applications using IBX are also truncating results, e.g. if I insert "abcd" into a VARCHAR(1), I only see "a" when I query via IBX for the data.

A couple of questions:
1) Is there a reason UNICODE_BE is recommended instead of UNICODE_LE? Would UNICODE_LE work in the same manner?
2) Will conversion between UNICODE_BE and UTF8 cause performance loss compared to a UTF8/UTF8 setup?
3) Does using UNICODE_BE mean my database will get larger, since min size per character is 2 bytes?
Tamotsu Maki

Posts: 3
Registered: 10/3/99
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 27, 2015 12:25 PM   in response to: Anthony Gautier in response to: Anthony Gautier
Anthony Gautier wrote:
Thanks for the info Tamotsu! It's good to hear from someone. In my experience, applications using IBX are also truncating results, e.g. if I insert "abcd" into a VARCHAR(1), I only see "a" when I query via IBX for the data.

A couple of questions:
1) Is there a reason UNICODE_BE is recommended instead of UNICODE_LE? Would UNICODE_LE work in the same manner?
2) Will conversion between UNICODE_BE and UTF8 cause performance loss compared to a UTF8/UTF8 setup?
3) Does using UNICODE_BE mean my database will get larger, since min size per character is 2 bytes?

Hi,

1) The reason is, when converting between UNICODEs, InterBase server always convert to UNICODE_BE first.

2) Yes. Generally, server does not do anything when SERVER CS and CLIENT CS is matched.
UNICODE_BE/UTF8 setup adds conversion needs on the server side. However, unlike traditional ANSI MBCSs, UNICODE_BE to/from UTF8 conversion is very simple. Performance impact should be minimal.

3) Depends on data type.
VARCHAR is actually storing <Actual String> + 2 bytes (length marker). This case, storage size goes double if most of the stored characters are ASCII.
CHAR is fixed length. Server always reserve possible maximum bytes. This case, Storage size goes half. (UTF8 CHAR storage: n*4 bytes, UTF16 CHAR storage: n*2 bytes)

And thanks. I will check IBX behavior.
- Tamotsu
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 27, 2015 2:48 PM   in response to: Tamotsu Maki in response to: Tamotsu Maki
Thanks again. I'll look into size and performance differences when making my changes.
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 30, 2015 2:17 PM   in response to: Tamotsu Maki in response to: Tamotsu Maki
Hello again Tamotsu. The DB size and performance look pretty comparable so far when using your method -- I like what I'm seeing.

It looks like I have one strange issue after creating my DB as UNICODE_BE and connecting using UTF8 though. What I'm seeing now is that I get string truncation errors when trying to view some data I've placed in the DB. This seems to happen only with characters that are 3 bytes in UTF8 (Chinese characters for example).

For example, execute this query on a VARCHAR(1) column (for simplicity) in a UNICODE_BE database that is connected via UTF8:
Query: update your_table set your_column = <3 byte UTF8 character>
Result: no errors -- query executes okay.

Then do this:
Query: select * from your_table
Result: string truncation error -- query doesn't complete

I've tried this in IBConsole and using IBX and get the same result in both cases. Is there something I'm missing? I appreciate your help.
Jeff Overcash (...

Posts: 1,529
Registered: 9/23/99
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 27, 2015 12:54 PM   in response to: Anthony Gautier in response to: Anthony Gautier
Anthony Gautier wrote:
Thanks for the info Tamotsu! It's good to hear from someone. In my experience, applications using IBX are also truncating results, e.g. if I insert "abcd" into a VARCHAR(1), I only see "a" when I query via IBX for the data.

That almost always means you did not set the lc_ctype on the IBDatabase's
parameters.

the data is coming across as unicode, but the lc_ctype missing tells IBX that
the data will be ANSI. The 2 byte pairs for Unicode will have the second byte a
0 for ansi data and therefor when converted from "ANSI" to UTF-16 gets truncated
after the very first letter.

To work with Unicode data you MUST set the lc_ctype for IBX. This includes
working with such data in IBConsole.

A couple of questions:
1) Is there a reason UNICODE_BE is recommended instead of UNICODE_LE? Would UNICODE_LE work in the same manner?
2) Will conversion between UNICODE_BE and UTF8 cause performance loss compared to a UTF8/UTF8 setup?
3) Does using UNICODE_BE mean my database will get larger, since min size per character is 2 bytes?

--
Jeff Overcash (TeamB)
(Please do not email me directly unless asked. Thank You)
And so I patrol in the valley of the shadow of the tricolor
I must fear evil. For I am but mortal and mortals can only die.
Asking questions, pleading answers from the nameless
faceless watchers that stalk the carpeted corridors of Whitehall.
(Fish)

Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8?
Click to report abuse...   Click to reply to this thread Reply
  Posted: Apr 27, 2015 3:03 PM   in response to: Jeff Overcash (... in response to: Jeff Overcash (...
Indeed I have set lc_ctype to UTF8 in my testing (both within IBConsole and within IBX), but haven't seen the characters beyond the N of the VARCHAR(N) columns come back as output. That makes sense to me though and is what I expect, because there should only be up to N characters in the field (even though IB is erroneously storing more). I'm not sure why DBWorkbench shows the full strings, as I've heard it does.

I'm confident the ctype is being set because if I use international characters, I get different answers from having a ctype and having none. My columns are defined as CHARSET UTF8. Beyond that I'm not sure what part of my config could be causing different results.
Jeff Overcash (...

Posts: 1,529
Registered: 9/23/99
Re: Silent string truncation with UTF8? [Edit]
Click to report abuse...   Click to reply to this thread Reply
  Posted: May 12, 2015 1:49 PM   in response to: Anthony Gautier in response to: Anthony Gautier
InterBase has a loophole/exploit/bug with variable width character sets like
sjis (Japanese) and UTF8. All the server checks is that the number of bytes
passed does not exceed the maximum number of bytes that charset can store.

I'll use SJIS as an example. The ASCII chars are all 1 byte wide. The Japanese
chars are 2 bytes wide. So a column that is a VarChar(10) has a max storage
capacity of 20 bytes.

The exploit is that you can store into that '12345678901234567890', 20
characters since that only requires 20 bytes. This violates though the SQL
standard that a VarChar(10) holds at most 10 characters.

One reason people used this exploit was to get around the # of characters in an
index when you are only storing ASCII in a multi byte charset column. This
really is no longer a factor since Xe when the index size was greatly increased.

What looks like truncating is IBX. For performance purposes, when I updated IBX
for Unicode all string data stored in the internal buffer is stored in UTF-16,
native Delphi string type. All conversions are done coming and going from the
API to and from UTF-16. This IMO is the right way to Unicode an app, you
perform the conversions at hte boundary.

IBX's internal memory buffer is built based on the maximum size the record can
be then large chunk of that size are allocated (default 1000 at a time). IBX
though only has the information of the charset size (not that it is varible for
SJIS and UTF-8) and the max bytes that column can be. So I basically allocate
based on the meta data size, not the max exploitable size. If it is defined as
a VarChar(10) I allocate 20 bytes to hold the UTF-16 representation of that data
(10 * 2 bytes per char). So the maximum size of a single record is calculated
then blocks of 1000 of that size gets allocated and I track offsets into hte
memory based on what record you are looking at.

With the conversion from storing it as is (ANSI days) and Unicode enabling
everything, string storage already doubles in ANSI cases. In the case of
variable multi-byte charsets (which IIRC is only JSIS and UTF-8 in InterBase)
this would have been a 4x's memory increase for SJIS and 8x's increase for UTF-8
if I put in special code to allow for the exploit.

The decision was made back then to not allow for the exploit. Sriram and I had
a long phone conversation on this and we agreed that IB was wrong in not
restricting by characters as per the standard instead of bytes. Now part 2 of
that decision was around how to handle the occasions where people did use the
exploit. It was either raise tons of exceptions of overflows or to silently
truncate. I chose to silently truncate.

--
Jeff Overcash (TeamB)
(Please do not email me directly unless asked. Thank You)
And so I patrol in the valley of the shadow of the tricolor
I must fear evil. For I am but mortal and mortals can only die.
Asking questions, pleading answers from the nameless
faceless watchers that stalk the carpeted corridors of Whitehall.
(Fish)
Anthony Gautier

Posts: 17
Registered: 10/10/09
Re: Silent string truncation with UTF8? [Edit]
Click to report abuse...   Click to reply to this thread Reply
  Posted: May 13, 2015 3:06 PM   in response to: Jeff Overcash (... in response to: Jeff Overcash (...
Thank you Jeff, great explanation. I do think IBX is correct in expecting a VARCHAR(N) to be truly N characters.
Legend
Helpful Answer (5 pts)
Correct Answer (10 pts)

Server Response from: ETNAJIVE02