mael

  1. I finally wrote a class that iterates over a range of parameters for hash functions/tables chained one after another. At first I thought about using an optimization algorithm such as gradient descent, but it proved hard to write a reasonable error function whose gradients are smooth and wide enough for the optimization to work well. So I simply iterated over a reasonable range of parameters until a satisfyingly small size was found. I could reduce the size from 1088 KiB (1 byte for each of the 1114112 code points) to 16 KiB while keeping random access (three chained hash table lookups, with three hash functions to compute the index/key for each table), which I think is good enough.
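The post does not show the class or name its parameters, so the following is only a sketch of the brute-force idea under stated assumptions: the parameters are taken to be the bit widths of the two lower key parts, the "hash functions" are plain shifts and masks, and size is counted in table entries rather than bytes. `compressed_size` and `best_split` are hypothetical names, and Python is used only for brevity.

```python
def compressed_size(data, mid_bits, low_bits):
    """Total number of table entries for one choice of bit widths.

    Assumes data is a non-empty sequence of small integers.
    """
    low, span = 1 << low_bits, 1 << (mid_bits + low_bits)
    # Pad with the last value so the data divides evenly into groups.
    padded = list(data) + [data[-1]] * (-len(data) % span)
    leaves = {}     # distinct leaf block -> running id
    groups = set()  # distinct groups of leaf ids
    for start in range(0, len(padded), span):
        groups.add(tuple(
            leaves.setdefault(tuple(padded[s:s + low]), len(leaves))
            for s in range(start, start + span, low)))
    primary_entries = len(padded) // span
    return primary_entries + len(groups) * (1 << mid_bits) + len(leaves) * low


def best_split(data, bit_range=range(2, 8)):
    """Exhaustively try every (mid_bits, low_bits) pair and keep the smallest."""
    return min((compressed_size(data, m, l), m, l)
               for m in bit_range for l in bit_range)
```

`best_split(data)` returns a tuple `(entries, mid_bits, low_bits)`. An exhaustive scan like this sidesteps the need for a smooth error function, because each candidate is simply built and measured.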
  2. mael

    New features in GExperts

    I replied there. Couldn't find my old account information, so I had to make a new one. As can be seen there this is not related to MMX, and I did not make a wrong bug report @timfrost.
  3. mael

    Grep search and DFM files

    I would not try to interpret strings that are encoded in a special way (such as # followed by a number, or string lists, which generate a list of lines that begin and end with ' and use a special escaping for '). You would expect the text search to work on the verbatim text file and only take into account things that apply to every text file, such as the text encoding (UTF-8, Windows-1252), but otherwise do nothing smart that involves interpreting syntax. If such a mode were introduced, I think it should be an explicit option. Then people will also be less surprised by behavioral changes, since these only occur when the option is checked.
  4. mael

    New features in GExperts

    Thanks for maintaining GExperts! In the latest release (but also in older releases), some action seems to take over the shortcut Ctrl+Alt+C which is usually reserved for showing the CPU window in debug mode. I only found the editor expert "Copy Raw String" with this shortcut, and disabled it, but the shortcut is still "caught" by some GExperts code. If I disable GExperts, everything is fine again.
  5. Reading the thread seems to confirm my theory from the other thread you posted, @dummzeuch, that the issue is related to reference counting. Some code is probably treating WideStrings (which are COM strings) as UnicodeStrings (which are reference counted), or is otherwise misinterpreting/casting data types along the way. @Sue King: It would be helpful if you could provide links to (or simply attach) both versions of dxgettext, the old one that worked and the new one that doesn't. If there is a minimal demo that works with the NexusDB trial DCUs, that would be even better.
  6. Delphi 2009 was the first version to introduce Unicode and UnicodeString, so it's very likely that UTF8ToUnicodeString did not exist before that. But you could use IFDEFs to define UnicodeString as WideString for pre-Unicode Delphi versions, and write a stub UTF8ToUnicodeString that calls UTF8ToWideString. That's how I used to do it, and it worked well. WideString will still not be reference counted, of course. Reference counting could be a reason for the original issue. I remember that Andreas Hausladen implemented reference counting for WideStrings with a hack. I am no longer sure how it was implemented or how deep the hack went (a quick search didn't turn up anything). But if people have this patch installed, it may have unintended consequences, which might have caused the issue.
  7. Thanks a lot for your input, Mahdi. I wanted to post working code once it's finished and polished, but that will take a while, as I have to solve other parts of the software first. Your solution is somewhat similar to what I tried, but it is limited to values that fit into a UInt16 (as you noted), whereas Unicode code points range from 0 to $10FFFF. The other issue is that, while TUnicodeCategory was in the question, I wanted to implement a general solution that also covers lookup tables for other code point properties, where the ranges aren't necessarily as regular and specifically chosen "hash keys" no longer work well. Ideally I was looking for an algorithm that automatically searches for hash functions that yield a reasonably small hash table, or at least for a general approach to that. I'll have another look at your solution, though, if I need to optimize my approach further.
  8. After more analysis I found out that the tables implement a three-level hashmap, or rather three hashmaps that are used consecutively to implement the mapping from code points to categories. I have been able to reverse engineer part of the hash functions, but apart from the first table I don't get identical results for the table values or table sizes. The overall mapping from code points to categories works, however. The second table, CatIndexSecondary, increases its values in steps of 16 every time a bucket with a collision is found. If a bucket holds a single value (i.e., no collisions) and that single value appears again in another bucket, both buckets get assigned the same index. Sometimes it gets strange, though, and the values suddenly become large, apparently to provide more room for collision avoidance, but it's not obvious how they are computed. It is also strange that a value in 0..15 is added to the result of a mapping with CatIndexSecondary. This would cause collisions unless the values in CatIndexSecondary are chosen carefully. Increments in steps of 16 would guarantee that no such collisions occur, but not all values are computed in such a straightforward way. Maybe the increments in steps of 16 are a first attempt, followed by a check for collisions, after which remaining gaps in the index numbers are filled to reduce the size of the hashmap incrementally. I know this remains vague, but it's at least a quick progress update.
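The "fill the remaining gaps" guess can be illustrated with overlap packing, a known table-compaction trick: a new 16-entry block may start inside the tail of the data already written whenever the overlapping entries agree, so the stored offsets stop being multiples of 16 while `offset + (C and $F)` still reads the right values. Whether the RTL tables were generated this way is an assumption, and `pack_blocks` is a hypothetical name; Python is used only for brevity.

```python
def pack_blocks(blocks):
    """Concatenate blocks, letting each new block overlap the current tail.

    Returns (values, offsets) where values[offsets[i]:offsets[i] + len(block)]
    reproduces blocks[i].
    """
    values = []
    offsets = []
    seen = {}  # block contents -> offset already assigned
    for block in blocks:
        block = tuple(block)
        if block in seen:
            # Identical blocks share one offset, as observed in the post.
            offsets.append(seen[block])
            continue
        # Find the longest tail of `values` that equals a prefix of `block`.
        overlap = 0
        for k in range(min(len(values), len(block)), 0, -1):
            if tuple(values[-k:]) == block[:k]:
                overlap = k
                break
        start = len(values) - overlap
        values.extend(block[overlap:])
        seen[block] = start
        offsets.append(start)
    return values, offsets
```

For example, the blocks `[0]*16`, `[0]*8 + [1]*8`, `[1]*16` pack into 32 values instead of 48, with the second block starting at offset 8.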
  9. Thanks. There is definitely a structure, with ranges that are assigned the same value. But there is no special documentation in the Unicode standard that would go beyond what you can directly deduce from the mapping available in Delphi. I actually started with the standard, then looked for efficient encodings. The standard vaguely suggests using a data structure like a trie. The Unicode documentation itself only lists every character and gives it a matching category. The Delphi implementation apparently uses some kind of hashmap. But I haven't been able to figure out the "inverse function" for creating the table yet. Edit: I have looked into writing my own hashing function, assuming the same division of the original key into three parts (one 13-bit key and two 4-bit keys) that the original RTL code uses. I could reproduce the values after a while, even though it seems the RTL wastes a bit of the value range. I will update this post once I have found the final solution.
  10. I briefly browsed through the list of posts, but didn't find an obviously similar one.
  11. Hi, When you make a code block it defaults to HTML syntax highlighting. For a Delphi forum it would be better if Pascal was selected by default.
  12. Hi, In the unit System.Character there is a function InternalGetUnicodeCategory(). It uses a complex indexing scheme to get the category of a Unicode code point (determining whether it is a control character, a letter, a number, etc.) from a table:

      Result := CategoryTable[CatIndexSecondary[CatIndexPrimary[C shr 8] + ((C shr 4) and $F)] + C and $F];

      The indexing is used to save memory, probably with a sort of trie that is implemented using arrays, or with a kind of hashmap principle. What I don't get is how exactly the range of code points, which goes from 0..$10FFFF, is reduced so that the table is only about 36664 bytes (in Delphi XE3) in size. Can somebody shed some light on how this indexing scheme was derived from an initial array of a form similar to this:

      CodepointProperty: array[0..$10FFFF] of Byte;

      System.Character_const.5.2.0.inc gives a little more detail, because the arrays are more explicit there. It seems to be some kind of bit compression, but I am still looking for more insight into how it works.
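One standard way to arrive at an expression of this shape is block deduplication: cut the flat array into 16-entry blocks, store each distinct block once, then repeat the step on the resulting list of block offsets. The Python sketch below illustrates that principle only. It is not taken from the RTL, and `category_array`, `build_three_level`, and `lookup` are hypothetical names. Python is used for brevity; note that Delphi's `and` binds tighter than `+`, whereas Python's `&` does not, hence the extra parentheses in `lookup`.

```python
import unicodedata


def category_array():
    """Uncompressed form: one small integer per code point (0..0x10FFFF)."""
    ids = {}  # category name such as 'Lu' -> small integer id
    return [ids.setdefault(unicodedata.category(chr(c)), len(ids))
            for c in range(0x110000)]


def build_three_level(data):
    """Split a flat array into primary, secondary and value tables.

    Assumes len(data) is a multiple of 256 (true for 0x110000).
    """
    values, leaf_offset = [], {}       # distinct 16-entry blocks, stored once
    secondary, group_offset = [], {}   # distinct groups of 16 block offsets
    primary = []                       # one entry per 256 code points
    for start in range(0, len(data), 256):
        group = []
        for s in range(start, start + 256, 16):
            block = tuple(data[s:s + 16])
            if block not in leaf_offset:
                leaf_offset[block] = len(values)
                values.extend(block)
            group.append(leaf_offset[block])
        group = tuple(group)
        if group not in group_offset:
            group_offset[group] = len(secondary)
            secondary.extend(group)
        primary.append(group_offset[group])
    return primary, secondary, values


def lookup(tables, c):
    """Same shape as the RTL expression, with explicit parentheses."""
    primary, secondary, values = tables
    return values[secondary[primary[c >> 8] + ((c >> 4) & 0xF)] + (c & 0xF)]
```

Building the tables with `build_three_level(category_array())` and comparing `lookup(tables, c)` against the flat array for every `c` is a quick round-trip check. With this plain scheme every offset in the secondary table is a multiple of 16, and the resulting table sizes will not necessarily match those in the RTL.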
  13. mael

    Property editor - on the finest art

    That's very useful! I always wanted to have an option that would keep certain properties "pinned", no matter what control is selected.