pyscripter

Linking to a C obj file that makes Windows calls


Just now, pyscripter said:

I would prefer to not mess around with the c files and I think that pcre_jit_compile needs the stack checking anyway.

I agree, and you don't have to. But the relevant switch here is /Gs, not /GS. I don't see #pragma check_stack(on) in the code, so there is no need to touch the code.

Also, there is a PCRE configuration directive that might be worth testing for the exception and stack problem:

Quote

$PACKAGE-$VERSION configuration summary:

    Install prefix .................. : ${prefix}
    C preprocessor .................. : ${CPP}
    C compiler ...................... : ${CC}
    C++ preprocessor ................ : ${CXXCPP}
    C++ compiler .................... : ${CXX}
    Linker .......................... : ${LD}
    C preprocessor flags ............ : ${CPPFLAGS}
    C compiler flags ................ : ${CFLAGS} ${VISIBILITY_CFLAGS}
    C++ compiler flags .............. : ${CXXFLAGS} ${VISIBILITY_CXXFLAGS}
    Linker flags .................... : ${LDFLAGS}
    Extra libraries ................. : ${LIBS}

    Build 8 bit pcre library ........ : ${enable_pcre8}
    Build 16 bit pcre library ....... : ${enable_pcre16}
    Build 32 bit pcre library ....... : ${enable_pcre32}
    Build C++ library ............... : ${enable_cpp}
    Enable JIT compiling support .... : ${enable_jit}
    Enable UTF-8/16/32 support ...... : ${enable_utf}
    Unicode properties .............. : ${enable_unicode_properties}
    Newline char/sequence ........... : ${enable_newline}
    \R matches only ANYCRLF ......... : ${enable_bsr_anycrlf}
    EBCDIC coding ................... : ${enable_ebcdic}
    EBCDIC code for NL .............. : ${ebcdic_nl_code}
    Rebuild char tables ............. : ${enable_rebuild_chartables}
    Use stack recursion ............. : ${enable_stack_for_recursion}
    POSIX mem threshold ............. : ${with_posix_malloc_threshold}
    Internal link size .............. : ${with_link_size}
    Nested parentheses limit ........ : ${with_parens_nest_limit}
    Match limit ..................... : ${with_match_limit}
    Match limit recursion ........... : ${with_match_limit_recursion}
    Build shared libs ............... : ${enable_shared}
    Build static libs ............... : ${enable_static}
    Use JIT in pcregrep ............. : ${enable_pcregrep_jit}
    Buffer size for pcregrep ........ : ${with_pcregrep_bufsize}
    Link pcregrep with libz ......... : ${enable_pcregrep_libz}
    Link pcregrep with libbz2 ....... : ${enable_pcregrep_libbz2}
    Link pcretest with libedit ...... : ${enable_pcretest_libedit}
    Link pcretest with libreadline .. : ${enable_pcretest_libreadline}
    Valgrind support ................ : ${enable_valgrind}
    Code coverage ................... : ${enable_coverage}

EOF

Quote

  --disable-stack-for-recursion                   don't use stack recursion when matching

 

7 minutes ago, pyscripter said:

Could anyone help translate the LLVM code to Delphi PLEASE? The prize is that everyone gets a much faster System.RegularExpressions and we may even convince Embarcadero to implement the changes.

I'd help ... I'm just checking other options.


@pyscripter 

 

Anyway, I compiled (using Clang) a simple piece of C code that I tweaked to use chkstk, then I dumped the function and edited it to be compatible with Delphi:

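procedure chkstk();
asm
  .NOFRAME
  sub rsp, $10                 // reserve scratch space
  mov [rsp], r10               // save r10
  mov [rsp+8], r11             // save r11
  xor r11,r11                  // r11 := 0, used if the subtraction underflows
  lea r10, [rsp+$18]           // r10 := caller's stack pointer before the call
  sub r10,rax                  // r10 := new top of stack (rax = requested size)
  cmovb r10,r11                // clamp to 0 on wraparound
  mov r11, qword ptr gs:[$10]  // r11 := current stack limit from the TEB
  cmp r10,r11
  db $f2                       // F2 (bnd) prefix emitted as a raw byte
  jae @@L1                     // new top is above the limit: nothing to probe
  and r10w,$F000               // round new top down to a page boundary
@@L2:
  lea r11, [r11-$1000]         // step down one page
  mov byte [r11],0             // touch the page
  cmp r10,r11
  db $f2
  jne @@L2                     // repeat until the target page is reached
@@L1:
  mov r10, [rsp]               // restore r10
  mov r11, [rsp+8]             // restore r11
  add rsp, $10
  db $f2
  ret
end;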

 

EDIT: I didn't pay attention to the GAS syntax for branch and interpreted it as hex ... My bad.

Edited by Mahdi Safsafi

Found it, and you were almost right:

/Gs100000000

did it!

 

The problem is in a single call: jit_machine_stack_exec

27 minutes ago, Mahdi Safsafi said:

@pyscripter 

 

The ___chkstk_ms function that you posted is incomplete!

ja 0x2b jumps to a location that is not within the function body! Perhaps DEFINE_COMPILERRT_FUNCTION is doing something here.


DEFINE_COMPILERRT_FUNCTION(___chkstk_ms)
        push   %rcx
        push   %rax
        cmp    $0x1000,%rax
        lea    24(%rsp),%rcx
        jb     1f
2:
        sub    $0x1000,%rcx
        test   %rcx,(%rcx)
        sub    $0x1000,%rax
        cmp    $0x1000,%rax
        ja     2b
1:
        sub    %rax,%rcx
        test   %rcx,(%rcx)
        pop    %rax
        pop    %rcx
        ret
END_COMPILERRT_FUNCTION(___chkstk_ms)

Anyway, I compiled (using Clang) a simple piece of C code that I tweaked to use chkstk, then I dumped the function and edited it to be compatible with Delphi:


procedure chkstk();
asm
  .NOFRAME
  sub rsp, $10
  mov [rsp], r10
  mov [rsp+8], r11
  xor r11,r11
  lea r10, [rsp+$18]
  sub r10,rax
  cmovb r10,r11
  mov r11, qword ptr gs:[$10]
  cmp r10,r11
  db $f2
  jae @@L1
  and r10w,$F000
@@L2:
  lea r11, [r11-$1000]
  mov byte [r11],0
  cmp r10,r11
  db $f2
  jne @@L2
@@L1:
  mov r10, [rsp]
  mov r11, [rsp+8]
  add rsp, $10
  db $f2
  ret
end;

That works and produces the correct results!! Thanks.  So win64 is complete and I will now try to get win32 right. 

 

I would love to know what this function is doing though...  What I understand is that it just "touches" the stack and if an _XCPT_GUARD_PAGE_VIOLATION error occurs the OS traps that and grows the stack.   In the event of failure, the OS raises the  _XCPT_UNABLE_TO_GROW_STACK exception.
Sounds like magic. 

 

Edited by pyscripter


@pyscripter Out of curiosity, why are you using the 16-bit version of PCRE?

Are the 8-bit and 32-bit versions slower than the 16-bit one?

 

The total size of the 32-bit obj files is 380 KB (-DPCRE_BUILD_PCRE32=ON -DPCRE_BUILD_PCRE8=OFF),

while the total size of the 16-bit obj files is 409 KB (-DPCRE_BUILD_PCRE16=ON -DPCRE_BUILD_PCRE8=OFF).

Just now, Kas Ob. said:

@pyscripter Out of curiosity, why are you using the 16-bit version of PCRE?

Are the 8-bit and 32-bit versions slower than the 16-bit one?

 

The total size of the 32-bit obj files is 380 KB (-DPCRE_BUILD_PCRE32=ON -DPCRE_BUILD_PCRE8=OFF),

while the total size of the 16-bit obj files is 409 KB (-DPCRE_BUILD_PCRE16=ON -DPCRE_BUILD_PCRE8=OFF).

PCRE16. This is what Delphi uses, and for a reason: it avoids conversions between Delphi strings and PCRE strings.

Quote

That works and produces the correct results!! Thanks.  So win64 is complete and I will now try to get win32 right. 

 

From what I understood from your comment, you're going to compile using MSVC and then use some tools to convert the output? Why not give bcc32c a chance? If it does not work, you can simply fall back to plan B.

19 minutes ago, Mahdi Safsafi said:

From what I understood from your comment, you're going to compile using MSVC and then use some tools to convert the output? Why not give bcc32c a chance? If it does not work, you can simply fall back to plan B.

There are no conversions needed for Win64.

For Win32 I have the issue of Windows calls (such as __imp__VirtualFree@12). Fortunately there is just one obj file that makes such calls and I will find a way to work around it (see earlier posts).

 

Why not use bcc32c...

  • I only have access to the free tools, which only compile x86 code.
  • PCRE has a number of configuration steps and options, and CMake makes it very easy and convenient to automate everything. It would be very hard to do that correctly by hand.
  • CMake has a generator for bcc32c, but it is rather new and I am not sure how well tested it is.
  • Sadly, I have more trust in the Visual Studio compiler, especially for code that requires stack checking. Does bcc32c do that?

When I finish I will release the stuff on GitHub, so people can take a look and make adjustments. Embarcadero can use their own compiler if they wish.

Edited by pyscripter

4 minutes ago, pyscripter said:

There are no conversions needed for Win64.

Yeah, I know that. I proposed earlier that you use MSVC for x64 and bcc32c for x86.

Quote
  • Sadly, I have more trust in the Visual Studio compiler, especially for code that requires stack checking. Does bcc32c do that?

Sadly, bcc32c (at least the free version) is built on an old LLVM compiler.

Quote

When I finish I will release the stuff on GitHub, so people can take a look and make adjustments. Embarcadero can use their own compiler if they wish.

Good luck! If you need some help, let me know.

2 minutes ago, Mahdi Safsafi said:

Good luck! If you need some help, let me know.

Well, yes. I need a 32-bit __chkstk.

 

Here is the Visual Studio MASM code:

 

        page    ,132
        title   chkstk - C stack checking routine
;***
;chkstk.asm - C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides support for automatic stack checking in C procedures
;       when stack checking is enabled.
;
;*******************************************************************************

.xlist
        include vcruntime.inc
.list

; size of a page of memory

_PAGESIZE_      equ     1000h


        CODESEG

page
;***
;_chkstk - check stack upon procedure entry
;
;Purpose:
;       Provide stack checking on procedure entry. Method is to simply probe
;       each page of memory required for the stack in descending order. This
;       causes the necessary pages of memory to be allocated via the guard
;       page scheme, if possible. In the event of failure, the OS raises the
;       _XCPT_UNABLE_TO_GROW_STACK exception.
;
;       NOTE:  Currently, the (EAX < _PAGESIZE_) code path falls through
;       to the "lastpage" label of the (EAX >= _PAGESIZE_) code path.  This
;       is small; a minor speed optimization would be to special case
;       this up top.  This would avoid the painful save/restore of
;       ecx and would shorten the code path by 4-6 instructions.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       ESP = new stackframe, if successful
;
;Uses:
;       EAX
;
;Exceptions:
;       _XCPT_GUARD_PAGE_VIOLATION - May be raised on a page probe. NEVER TRAP
;                                    THIS!!!! It is used by the OS to grow the
;                                    stack on demand.
;       _XCPT_UNABLE_TO_GROW_STACK - The stack cannot be grown. More precisely,
;                                    the attempt by the OS memory manager to
;                                    allocate another guard page in response
;                                    to a _XCPT_GUARD_PAGE_VIOLATION has
;                                    failed.
;
;*******************************************************************************

public  _alloca_probe

_chkstk proc

_alloca_probe    =  _chkstk

        push    ecx

; Calculate new TOS.

        lea     ecx, [esp] + 8 - 4      ; TOS before entering function + size for ret value
        sub     ecx, eax                ; new TOS

; Handle allocation size that results in wraparound.
; Wraparound will result in StackOverflow exception.

        sbb     eax, eax                ; 0 if CF==0, ~0 if CF==1
        not     eax                     ; ~0 if TOS did not wrapped around, 0 otherwise
        and     ecx, eax                ; set to 0 if wraparound

        mov     eax, esp                ; current TOS
        and     eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary

cs10:
        cmp     ecx, eax                ; Is new TOS
    bnd jb      short cs20              ; in probed page?
        mov     eax, ecx                ; yes.
        pop     ecx
        xchg    esp, eax                ; update esp
        mov     eax, dword ptr [eax]    ; get return address
        mov     dword ptr [esp], eax    ; and put it at new TOS
    bnd ret

; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_         ; decrease by PAGESIZE
        test    dword ptr [eax],eax     ; probe page.
        jmp     short cs10

_chkstk endp

        end

 


I always turn off stack checking for this reason.

 

Wouldn't it just be better to put it in a DLL? 


@pyscripter /Gs100000000 did remove the need for chkstk for both 32-bit and 64-bit, as far as I can see with my VS 2017. Now I am not sure why the protection should be needed here: only 2 parameters are passed on the stack, so to hit the stack limit the function would have to be called recursively about 128000 times on a 32-bit 1 MB stack. So maybe you should consider removing it.

 

On the other hand, you can simply disable stack recursion to begin with (by turning this option ON):

SET(PCRE_NO_RECURSE OFF CACHE BOOL
    "If ON, then don't use stack recursion when matching. See NO_RECURSE in config.h.in for details.")

then use

/Gs100000000

to remove the chkstk call.

2 minutes ago, Kas Ob. said:

@pyscripter /Gs100000000 did remove the need for chkstk for both 32-bit and 64-bit, as far as I can see with my VS 2017. Now I am not sure why the protection should be needed here: only 2 parameters are passed on the stack, so to hit the stack limit the function would have to be called recursively about 128000 times on a 32-bit 1 MB stack. So maybe you should consider removing it.

 

On the other hand, you can simply disable stack recursion to begin with (by turning this option ON):

SET(PCRE_NO_RECURSE OFF CACHE BOOL
    "If ON, then don't use stack recursion when matching. See NO_RECURSE in config.h.in for details.")

then use

/Gs100000000

to remove the chkstk call.

Wow ... that's 100 MB!

3 hours ago, pyscripter said:

I understand that a DLL is a safer and easier option, but this is meant to be an enhancement of System.RegularExpressions which uses static linking of obj files.

@David Heffernan I explained why I am linking C code earlier...

@Kas Ob. I would rather use the default options if I can. For example, I am not sure of the performance implications of PCRE_NO_RECURSE. If nothing else works, I will try it.

Edited by pyscripter


Before you make up your mind, please hear me out, and forgive me for repeating the explanation.

There are two settings for this stack protection:

Quote

SET(PCRE_MATCH_LIMIT_RECURSION "MATCH_LIMIT" CACHE STRING
    "Default limit on internal recursion. See MATCH_LIMIT_RECURSION in config.h.in for details.")

 

SET(PCRE_NO_RECURSE OFF CACHE BOOL
    "If ON, then don't use stack recursion when matching. See NO_RECURSE in config.h.in for details.")

The default for the limit is:

Quote

#define MATCH_LIMIT        10000000
#define MATCH_LIMIT_RECURSION    MATCH_LIMIT

Those belong to PCRE itself, and they are already applied and in effect,

while chkstk comes from the JIT engine, where there is no such limit, and is enforced by VS. So I really don't see a reason to be afraid of a stack overflow with PCRE's own default settings.


@pyscripter Here is the x86 function.

But it's really weird: the original version shipped with Delphi declares it as empty and it works. I don't understand why it crashed for you when you defined it as empty.

@Kas Ob.'s point is very interesting. I didn't read the configuration documentation, but he definitely did, and perhaps he is right. I suggest that you check that point with him. Who knows, you may not need chkstk at all.

chkstk.txt


@Mahdi Safsafi Thanks!! Now it works in 32-bit as well.

 

32-bit benchmark

 

                                                        Time     | Match count
==============================================================================
Delphi's own TRegEx:
                                        /Twain/ :        6.00 ms |         811
                                    /(?i)Twain/ :       41.00 ms |         965
                                   /[a-z]shing/ :      448.00 ms |        1540
                   /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :      528.00 ms |         262
                                    /\b\w+nn\b/ :      634.00 ms |         262
                             /[a-q][^u-z]{13}x/ :      543.00 ms |        4094
                  /Tom|Sawyer|Huckleberry|Finn/ :      896.00 ms |        2598
              /(?i)Tom|Sawyer|Huckleberry|Finn/ :     1063.00 ms |        4152
          /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :     2980.00 ms |        2598
          /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :     3088.00 ms |        1976
            /Tom.{10,25}river|river.{10,25}Tom/ :      528.00 ms |           2
                                 /[a-zA-Z]+ing/ :      918.00 ms |       78423
                        /\s[a-zA-Z]{0,12}ing\s/ :      548.00 ms |       49659
                /([A-Za-z]awyer|[A-Za-z]inn)\s/ :      775.00 ms |         209
                    /["'][^"']{0,30}[?!\.]["']/ :      312.00 ms |        8885
Total Time:    13321.00 ms
==============================================================================
Delphi's own TRegEx with Study:
                                        /Twain/ :        6.00 ms |         811
                                    /(?i)Twain/ :       40.00 ms |         965
                                   /[a-z]shing/ :      318.00 ms |        1540
                   /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :       24.00 ms |         262
                                    /\b\w+nn\b/ :      619.00 ms |         262
                             /[a-q][^u-z]{13}x/ :      450.00 ms |        4094
                  /Tom|Sawyer|Huckleberry|Finn/ :       31.00 ms |        2598
              /(?i)Tom|Sawyer|Huckleberry|Finn/ :      256.00 ms |        4152
          /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :     2875.00 ms |        2598
          /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :     2905.00 ms |        1976
            /Tom.{10,25}river|river.{10,25}Tom/ :       63.00 ms |           2
                                 /[a-zA-Z]+ing/ :      829.00 ms |       78423
                        /\s[a-zA-Z]{0,12}ing\s/ :      569.00 ms |       49659
                /([A-Za-z]awyer|[A-Za-z]inn)\s/ :      685.00 ms |         209
                    /["'][^"']{0,30}[?!\.]["']/ :       56.00 ms |        8885
Total Time:     9746.00 ms
==============================================================================
Delphi's own TRegEx with JIT:
                                        /Twain/ :       12.00 ms |         811
                                    /(?i)Twain/ :       13.00 ms |         965
                                   /[a-z]shing/ :       13.00 ms |        1540
                   /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :        3.00 ms |         262
                                    /\b\w+nn\b/ :      189.00 ms |         262
                             /[a-q][^u-z]{13}x/ :      154.00 ms |        4094
                  /Tom|Sawyer|Huckleberry|Finn/ :       22.00 ms |        2598
              /(?i)Tom|Sawyer|Huckleberry|Finn/ :       64.00 ms |        4152
          /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :      371.00 ms |        2598
          /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :      497.00 ms |        1976
            /Tom.{10,25}river|river.{10,25}Tom/ :       12.00 ms |           2
                                 /[a-zA-Z]+ing/ :      105.00 ms |       78423
                        /\s[a-zA-Z]{0,12}ing\s/ :      197.00 ms |       49659
                /([A-Za-z]awyer|[A-Za-z]inn)\s/ :       39.00 ms |         209
                    /["'][^"']{0,30}[?!\.]["']/ :       19.00 ms |        8885
Total Time:     1730.00 ms

I will release it on GitHub (watch for the announcement). Then, @Kas Ob., you can try to get rid of __chkstk and see what happens.

Edited by pyscripter