aehimself

PC-specific memory corruption...?


All,

 

I have a really strange situation here and I'm running out of ideas on what differences to check. Long story short: I have an application that accesses data in a database (Connection -> DataSet -> DataSource -> DBGrid, using Zeos and a VCL DBGrid descendant for some custom features). With one specific codepage setting it seemingly starts to corrupt memory, but on ONE PC only (that I know of, that is). Details can be read here.

Now, given the complexity, the issue might be Oracle-related, PC-environment-related, component-related, moon-phase-related, or I simply messed something up in my code. My problem is that I'm running out of ideas to pinpoint the culprit.

I'm not asking how to fix the issue (I know it's not possible without seeing the code), I'm asking for diagnostic steps; basically how would you start to debug this?

 

What I know:

- OS and Oracle client locale don't seem to affect the behavior

- It's not thread-related, confirmed by moving all logic to VCL

- Oracle client and server versions seem to be irrelevant; the issue appeared with client version 12 and server version 19

- It is not a display issue, datasets actually contain garbage

 

What I think:

- I doubt the issue is in the component, as the same binary acts differently on one PC only

- For the same reason I would say that my code is good, too

 

What I will do next:

- Build a simple test app: only a connection, a query and a DBGrid

- On a PC where it seems to work, I'll use said codepage to see if by some miracle the corruption only happens at a later stage

 

Thank you all for your input!


How do you actually detect that something has now gone wrong? Some component (like a DBGrid) displays a TDataSet that contains gobbledegook? Or is there more to detect it?

 

How do you know it's memory corruption (like writing out of bounds) and not generally using a stale pointer? Have you tried compiling the application with a memory manager like full-debug FastMM and enabling runtime checks?

Edited by Der schöne Günther

21 minutes ago, aehimself said:

- It is not a display issue, datasets actually contain garbage

Why is it relevant what's in the dataset after an error in the client lib? Clear or fill the buffer memory in Zeos prior to querying the DB; you should then see that pattern instead of the garbage.

Also, dump the raw result to disk on both PCs and compare.

On one side you were mentioning VARCHAR somewhere, and on the other side you are setting the client codepage to UTF16. I don't know Oracle, but this is interesting.

Maybe Zeos calculates the buffer for VARCHAR while the server delivers NVARCHAR?

Could you do some SELECTs where the actual data is half of the maximum length, etc., to see the threshold for the error?

Btw., are the actual RDBMS instances also the same? (For you and your colleague.)
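The VARCHAR/NVARCHAR hunch above can be illustrated outside Delphi. Here is a hypothetical Python sketch (none of these names exist in Zeos; it only models the suspected mistake): if a client sizes its fetch buffer as one byte per declared character while the server delivers a multibyte encoding, the tail of the value is silently lost, which matches an ORA-01406-style truncation.

```python
# Hypothetical sketch, NOT Zeos code: a fetch buffer sized in bytes from a
# character count silently truncates multibyte-encoded data.
def fetch_into_buffer(value: str, declared_chars: int, encoding: str) -> bytes:
    buf_size = declared_chars          # wrong assumption: 1 byte per character
    raw = value.encode(encoding)
    return raw[:buf_size]              # bytes past the buffer are dropped

col = "Állandó lakcím"                 # 14 characters, e.g. a VARCHAR(14)

# Single-byte encoding: 14 bytes, fits exactly.
assert fetch_into_buffer(col, 14, "latin-1") == col.encode("latin-1")

# UTF-8: the three accented characters take 2 bytes each, so 17 bytes are
# produced but only 14 fit; the value comes back truncated.
assert fetch_into_buffer(col, 14, "utf-8").decode("utf-8") == "Állandó lakc"
```

Under this model, manually enlarging the buffer (as the workaround in this thread does) hides the symptom, which is consistent with the truncation error disappearing.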

 

4 minutes ago, Der schöne Günther said:

How do you actually detect that something has now gone wrong? Some component (like a DBGrid) displays a TDataSet that contains gobbledegook? Or is there more to detect it?

Messed-up text in components. This is why I explicitly mentioned that it's not a display issue; the data in the dataset is garbage too. I guess this never worked on my side: for a long time I simply got a "Fetched value truncated" error from Oracle, until I found out that changing a property gets rid of that (see the previous thread mentioned in the linked one). The data is still corrupted, though.

7 minutes ago, Der schöne Günther said:

How do you know it's memory corruption (like writing out of bounds), and not generally using a stale pointer? Have you tried compiling the application with a memory manager like full FastMM and enabled runtime checks?

I don't know, but based on the initial Oracle error I guess it's memory corruption. I already checked with MadExcept's buffer underrun/overrun protection; it did not fire.

6 minutes ago, Attila Kovacs said:

Why is it relevant what's in the dataset after an error in the client lib?

After increasing the initial_buffer_size, Oracle doesn't throw an error anymore.

7 minutes ago, Attila Kovacs said:

Could you do some selects where the actual data is the half of the maximum data etc.. to see the threshold for the error?

Yes, I did that, but it's not consistent. I have a feeling there is a sweet spot: when a specific amount of character data was returned, the error appeared. Simply removing 3 random text columns already returned all data, formatted correctly AFAIK.

9 minutes ago, Attila Kovacs said:

Btw. are the actual RDBMS's also the same?  (You and your collegue)

Failed to mention this, thanks! Yes, we are connecting to the exact same DB.

12 minutes ago, Attila Kovacs said:

One side, you were mentioning VARCHAR somewhere and the other side you are setting the clientcodepage to UTF16. I don't know oracle but this is interesting.

Maybe zeos calculates for varchar and server delivers in nvarchar?

This is something I cannot tell. I don't have access to the DB creation scripts; PL/SQL says all the textual fields are VARCHARs only.

Unfortunately my personal knowledge of RDBMS internals is very limited (I consider myself a user only), so I don't know what the difference between a normal VARCHAR and an NVARCHAR is, or should be.

13 minutes ago, aehimself said:

Yes, I did that, but it's not consistent. I have a feeling that it is going to be a sweetspot, when a specific amount of character data was returned the error appeared. Simply removing 3 random text columns already returned all data, formatted correctly afaik.

I mean on a per-field basis. Like: field VARCHAR(10), client codepage UTF16; can you fetch 'AAAAA' (5 characters) but not 'AAAAAA' (6)?

Guest

Edited by Guest
Nevermind... saw the link, cannot read all there.


Since it is machine-specific: does the application define the connection itself, or does it use an existing ODBC connection? That could produce different results on different machines.

 

Since you know the 3 columns that cause the query to fail, profile them to check for bad data. Then you can understand what data in the column is getting truncated.

 

SELECT LENGTH(first_name) AS len, COUNT(*)
FROM employees
GROUP BY LENGTH(first_name)
ORDER BY len;

 

 

 


...so a little update: the issue is in the component 🙂

The behavior difference between the 4 PCs is caused by a setting in my application which changes a property on the connection, and that property change is what triggers the corruption. This setting is off by default and wasn't changed on 3 of the PCs.

 

It took me a while to figure this out, as I had to import half of my main program bit-by-bit into the simple test app before the issue finally occurred.

At least it also made clear that what I thought was one issue is actually two; and fortunately both have a workaround.

1 hour ago, Fr0sT.Brutal said:

and the issue is...?

...that the journey didn't end just yet 🙂

 

What I found is that the two things I thought were connected aren't. The original "column value is truncated" Oracle issue is a simple case of not enough buffer being allocated, and my workaround (forcing it to use a larger buffer) is the solution.

However, this has nothing to do with the corruption issue, which I just found out is NOT memory corruption but a wrongly transformed character array:

 

Here, the method's return value is a PAnsiChar and fRawTemp is a RawByteString:

[screenshot: debugger showing fRawTemp holding the correct RawByteString contents]

The RawByteString is correct, but the moment it is cast to a PAnsiChar it immediately gets messed up:

[screenshot: debugger showing the PAnsiChar result with mangled characters]

This is then stored in the resultset, so the dataset and the components display this mess.


Looks like it's a UTF-8 encoded string: í (0xED) is, for example, stored as 0xC3 (Ã) 0xAD.

Probably it's encoded twice, so that you can see the code points; or is it because of the cast?
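The observation above can be reproduced in a few lines of Python (a neutral illustration, not the Zeos code path): encoding í to UTF-8 yields 0xC3 0xAD, and if those bytes are then misread as Latin-1 text and encoded to UTF-8 once more, you get the classic double-encoding byte pattern.

```python
# Single and double UTF-8 encoding of 'í' (U+00ED).
text = "í"
once = text.encode("utf-8")                       # b'\xc3\xad'
# Misread the UTF-8 bytes as Latin-1 characters, then encode again:
twice = once.decode("latin-1").encode("utf-8")    # b'\xc3\x83\xc2\xad'
assert once == b"\xc3\xad"
assert twice == b"\xc3\x83\xc2\xad"
```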

Edited by Attila Kovacs
Guest

You mentioned different behaviour on different client PCs?

If so, and this is really very speculative, sorry, check the old Windows setting for "language for non-Unicode programs".

I have seen strange side-effects when that setting is off.

HTH


@Attila Kovacs, you are right. I extracted the data behind the RawByteString, which was 195,129,108,108,97,110,100,195,179,32,108,97,107,99,195,173,109. This is indeed UTF-8, and it just gave me an idea of where to look...
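For the record, decoding the dumped bytes confirms the diagnosis. This quick Python check shows they are well-formed UTF-8, and also how the same bytes turn into Ã-sequences when misread as a single-byte ANSI codepage, which is exactly the mangled view the components displayed:

```python
# The exact byte dump quoted above.
raw = bytes([195, 129, 108, 108, 97, 110, 100, 195, 179,
             32, 108, 97, 107, 99, 195, 173, 109])
as_utf8 = raw.decode("utf-8")      # the intended text: 'Állandó lakcím'
as_ansi = raw.decode("latin-1")    # the mangled, Ã-riddled view
assert as_utf8 == "Állandó lakcím"
assert len(as_utf8) == 14 and len(as_ansi) == 17   # 3 two-byte characters
```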

 

@Dany Marmur The setting I'm talking about controls whether downloaded LOB data is cached or not. Inside the component this causes a different type of resultset to be created. On 3 PCs this setting was off (the default); on one I had turned it on. This is why one machine acted differently: it was using a different type of resultset (the one which is faulty).


It seems I found and fixed the problem. A quick recap: using UTF16 character encoding I often experienced "ORA-01406: fetched column value was truncated" errors, and when those went away after manually increasing the buffer size, text data got malformed.

As it turns out, these two were independent errors, and for the first one my workaround is actually the solution.

 

Text corruption was a bit trickier. There is a setting which controls whether LOB data is cached, and internally this chooses what kind of resultset is created. The difference between the two resultsets was how the columns were defined: the one working properly checked whether the codepage is set to Unicode and, if so, changed String field types to UnicodeString.

 

Moving the same block to the other resultset fixed the corruption.
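As a language-neutral sketch of the fix described above (hypothetical names, not the actual Zeos source): the working resultset class promoted string columns to a Unicode string type when the connection codepage was Unicode, while the faulty one skipped that check and left the raw bytes to be misinterpreted.

```python
# Hypothetical model of the column-definition difference between the two
# resultset classes; all names here are invented for illustration.
def define_column_type(sql_type: str, codepage_is_unicode: bool,
                       apply_unicode_check: bool) -> str:
    if apply_unicode_check and codepage_is_unicode and sql_type == "String":
        return "UnicodeString"   # what the working resultset did
    return sql_type              # the faulty resultset always ended up here

# Same column, same Unicode connection; only the missing check differs:
assert define_column_type("String", True, apply_unicode_check=True) == "UnicodeString"
assert define_column_type("String", True, apply_unicode_check=False) == "String"
```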

