tomye, posted May 3, 2022

I have run into a threading problem. I wrote a neural network program and it runs without errors, but the CPU is used very inefficiently. As the screenshot shows, detection took 3.4 seconds while CPU usage stayed only a little above 40%. I traced the problem to the Python GIL, which limits how much CPU time the script can use. Is there any solution? Thanks.
shineworld, posted May 3, 2022 (edited)

I'm not a Windows guru, but Windows has its own rules for how much time is given to a process and its tasks. Usually it does not let a single process/thread use more than 40/50% of the time, depending on many factors.

The GIL, however, limits threading performance in Python a lot; it does not permit true multithreaded programming as in other languages. When possible I move Python math code into native threads in a Delphi extension module, or I use Cython to improve module performance.

To improve the performance of time.sleep(), timers and the like under the GIL a little, I've tried changing the Windows timer precision with:

    from os_utils import os_improve_timings
    ...

    def main():
        """Main entry point."""
        ...

    # MAIN entry point when module is called
    if __name__ == '__main__':
        # improve os timings and run main
        with os_improve_timings():
            main()

Attachment: os_utils.py
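The attached os_utils.py is not reproduced in the thread, so the following is only a guess at what a helper like os_improve_timings could look like, not shineworld's actual code. It assumes the helper is a context manager around the Windows multimedia timer calls timeBeginPeriod/timeEndPeriod in winmm.dll; the period_ms parameter is hypothetical.

```python
import contextlib
import sys


@contextlib.contextmanager
def os_improve_timings(period_ms=1):
    """Request a finer Windows timer resolution while the block runs.

    Hypothetical re-creation; the real os_utils.py may work differently.
    On non-Windows platforms this does nothing.
    """
    winmm = None
    if sys.platform == "win32":
        import ctypes

        winmm = ctypes.WinDLL("winmm")
        # timeBeginPeriod returns 0 (TIMERR_NOERROR) on success.
        if winmm.timeBeginPeriod(period_ms) != 0:
            winmm = None
    try:
        yield
    finally:
        if winmm is not None:
            # Each successful timeBeginPeriod is paired with timeEndPeriod.
            winmm.timeEndPeriod(period_ms)
```

A helper of this kind changes how finely sleeps and timers are scheduled. It does not change how much CPU time a Python thread can get while another thread holds the GIL.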
tomye, posted May 3, 2022

In reply to shineworld:

Thanks for your help. Is there any way to raise CPU usage in Python? Right now, when I run the script in Python, only 40% of the CPU is used. Can I add some code, or use some other method, to push CPU usage above 90%?

Multithreading may be a solution, but it is too complicated. One second of video has 25 frames, and with multithreading I could run detection on all 25 frames at the same time. After processing, though, there are too many things left to do, and it would be difficult to keep everything stable.
David Heffernan, posted May 3, 2022

Multithreading in Python is inevitably blocked by the GIL. Nothing you can do about it. No magic solution. See countless discussions of this topic on the broader internet. In other words the issue is not related to Delphi or p4d but is a fundamental design choice in Python.
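The effect described here can be reproduced with a small experiment. This is a minimal sketch, assuming a standard CPython build with the GIL enabled; busy_sum, run_serial and run_threaded are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def busy_sum(n):
    """Pure-Python CPU-bound loop; it runs while holding the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(jobs):
    """Run every job one after another on the calling thread."""
    return [busy_sum(n) for n in jobs]


def run_threaded(jobs, workers=4):
    """Run the same jobs on a thread pool.

    On a GIL build the threads take turns executing bytecode, so the
    results are identical but the wall time is usually not much better.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(busy_sum, jobs))
```

Timing run_serial and run_threaded with time.perf_counter on the same job list typically shows similar wall-clock times for pure-Python work like this. Calls into C extensions that release the GIL while they compute can behave differently, so the outcome for a given detection script depends on where its time is actually spent.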
shineworld, posted May 3, 2022 (edited)

When I can, I move threaded work into a Delphi thread inside an extension module, but this only covers a limited range of operations: you can use only Delphi libraries there, no Python code.

For example, I'm banging my head on a very odd problem without a solution.

1] In a Delphi extension module I get frames from a proprietary camera that can't be accessed with OpenCV VideoCapture, for a lot of reasons.

2] A Python thread gets the frame data from the Delphi extension module through a property of type TByteDynArray. The TByteDynArray is converted to a list by PythonWrapper, and internally every byte of the array is copied into a VariantArray, which takes a lot of time.

Getting 48000 image bytes back from the Delphi module takes 0.080 (80ms), which is too much and slows the FPS down a lot. A classic bottleneck.

I've tried changing the returned frame data from TByteDynArray to AnsiString, and the data movement is very, very fast (0.001ms). But an AnsiString holding frame data is encoded to Unicode before it is sent back to Python, and str.encode(frame_data) in Python is not applicable.

So my question: what is the best way to get an array of bytes from the Delphi extension module (pyd) back to Python? Do I have to enter the HELL of array pointers shared between Delphi and Python?

The Delphi extension module can capture at up to 60 fps, but the bottleneck of retrieving frames from the PYD into Python drops my frame rate to 6 fps.
tomye, posted May 3, 2022

I have a tip that may help, though I don't know whether it will work for you. I found that complex types such as objects, NumPy arrays and so on take more time to transfer, so I just use the simplest type: string. There are no other P4D components in my project, only one PyEngine. I don't know whether this is the right way, but it works and the speed is OK.

    var
      v: Variant;

    v := MainModule.ProcessImage(ImageToPyBytes(BMP));

BMP is a TBitmap object holding a video frame. v holds an object of a type I defined in Python, but here I read its fields as strings, which is very simple and fast:

    IDs := v.IDs;
    Names := v.Names;
    ...

I think you could encode the frame data to a string in Python and send it back to Delphi. I have tested the FPS and the drop is under about 2 fps, which is acceptable for me.

My problem is that the Python script itself executes too inefficiently. That's why Python people always say YOU MUST RUN NEURAL NETWORK PROGRAMS ON GPU 😞
shineworld, posted May 3, 2022 (edited)

Following your suggestion, I've solved it using PyBytes_FromStringAndSize... very, very fast!!!

1] I've created a video capture class as a native Delphi component, TCncVisionVideoCapturePxc. It can be used in standard Delphi programs or wrapped in an extension module (pyd) to be used from Python. In this class I obviously can't use Python objects as return values, only the dynamic array of bytes for a frame.

2] I wrapped the class for export in the pyd, adding an extra method that returns the frame quickly with PyBytes_FromStringAndSize:

    uses
      ...
      osCncVideoCapturePxc;

    type
      TPyCncVisionVideoCapturePxc_Wrapper = class(TPyDelphiObject)
      private
        constructor Create(APythonType: TPythonType); override;
        constructor CreateWith(PythonType: TPythonType; args: PPyObject); override;
        function Repr: PPyObject; override;
        class function DelphiObjectClass: TClass; override;
        class procedure RegisterMethods(PythonType: TPythonType); override;
        function GrabbedFrameEx(pself, args: PPyObject): PPyObject; cdecl;
      end;

    ...

    { TPyCncVisionVideoCapturePxc_Wrapper }

    constructor TPyCncVisionVideoCapturePxc_Wrapper.Create(APythonType: TPythonType);
    begin
      inherited;
      DelphiObject := TCncVisionVideoCapturePxc.Create;
      Owned := False;
    end;

    constructor TPyCncVisionVideoCapturePxc_Wrapper.CreateWith(PythonType: TPythonType; args: PPyObject);
    begin
      inherited;
      //###
    end;

    class function TPyCncVisionVideoCapturePxc_Wrapper.DelphiObjectClass: TClass;
    begin
      Result := TCncVisionVideoCapturePxc;
    end;

    function TPyCncVisionVideoCapturePxc_Wrapper.Repr: PPyObject;
    begin
      Result := GetPythonEngine.PyUnicodeFromString('none to show at moment');
    end;

    class procedure TPyCncVisionVideoCapturePxc_Wrapper.RegisterMethods(PythonType: TPythonType);
    begin
      inherited;
      with PythonType do
      begin
        AddMethod('grabbed_frame_ex', @TPyCncVisionVideoCapturePxc_Wrapper.GrabbedFrameEx,
          'get grabbed frame as string');
      end;
    end;

    function TPyCncVisionVideoCapturePxc_Wrapper.GrabbedFrameEx(pself, args: PPyObject): PPyObject; cdecl;
    var
      S: AnsiString;
      FrameSize: Integer;
      Frame: TByteDynArray;
      VideoCapture: TCncVisionVideoCapturePxc;
    begin
      // Required before touching DelphiObject in a cdecl method
      Adjust(@Self);
      VideoCapture := TCncVisionVideoCapturePxc(DelphiObject);
      Frame := VideoCapture.grabbed_frame;
      FrameSize := Length(Frame);
      if FrameSize = 0 then
        Exit(GetPythonEngine.ReturnNone);
      // Copy the frame into a byte string and hand it to Python as bytes
      SetLength(S, FrameSize);
      CopyMemory(@S[1], Frame, FrameSize);
      Result := GetPythonEngine.PyBytes_FromStringAndSize(@S[1], FrameSize);
    end;

Adjust(@Self) was a nightmare because I didn't know about it, and without it any access from Self to DelphiObject is always invalid.

The time to transfer an image from Delphi to Python has gone from 80 ms to 128 µs. I'm happy.
tomye, posted May 4, 2022

I am glad I could help. To work around the CPU efficiency issue in Python, I am trying to run the detection from Delphi threads. My idea is to start 10 threads that each run the detection script in Python. Is that possible? I found that the PyEngine does not seem to work from a thread. The code is below:

    TThreadDetect = class(TThread)
    public
      DT: TDetectedObjs;
      BMP: TBitmap;
      Img: TImage;
    protected
      procedure Execute; override;
      constructor Create(ABMP: TBitmap; AImg: TImage);
      destructor Destroy; override;
    end;

    procedure TThreadDetect.Execute;
    var
      v: Variant;
    begin
      inherited;
      DT.sTime := GetTickCount;
      DT.Init(true);
      v := MainModule.ProcessImage(ImageToPyBytes(BMP)); // hangs here, no response at all
      DT.FillObjsData(v, 0, 0);
      DT.DrawObjects(BMP.Canvas, 1280);
      DT.eTime := GetTickCount;
      DT.Elapsed := DT.eTime - DT.sTime;
      // draw frame rate and speed info
      DT.DrawSpeedInfo(BMP.Canvas, 'NA', 1280);
      FitDrawEx(Img, BMP);
      Img.Invalidate;
    end;

Do you know what I missed or coded wrongly?
SwiftExpat, posted May 4, 2022

It looks to me like your thread is doing both processing and display. Can you separate them?

Have you considered running the PythonEngine on a thread instead of on main? Create a work queue on the thread and have it execute only the Python steps that are necessary, then send the instructions for drawing the result back to Delphi for display.

My thread looks like this, and I use DEB for messaging between main and the Python thread.

    type
      TSERTCPythonEngine = class(TThread)
      private
        PE: TPythonEngine;
        PythonIO: TPythonInputOutput;
        PythonResultVar: TPythonDelphiVar;
        PyMod: TPythonModule;
        QSessionAnalyze: TThreadedQueueCS<IEventSessionAnalyze>;
        function FilePathPython(AFilePath: string): string;
        procedure PythonIOReceiveData(Sender: TObject; var Data: AnsiString);
        procedure PythonIOReceiveUniData(Sender: TObject; var Data: string);
        procedure PythonIOSendData(Sender: TObject; const Data: AnsiString);
        procedure PythonIOSendUniData(Sender: TObject; const Data: string);
        procedure StartEngine;
        procedure SetInitScript;
        function InstallVerify: boolean;
        procedure InstallPython(AUpdateLocation: string);
        procedure InstallRttkPackage(AUpdateLocation: string);
        procedure InstallDependDLLs;
        procedure InstallDLLCopy(ADllName: string);
        procedure ProcessSessionAnalyze;
        procedure PySessionAnalyze(ASessionAnalyze: IEventSessionAnalyze);
        procedure ProcessSshHostReload;
        procedure PyModExecHostConfigFound(Sender: TObject; PSelf, Args: PPyObject; var Result: PPyObject);
        procedure PyModExecHostConfigUpdated(Sender: TObject; PSelf, Args: PPyObject; var Result: PPyObject);
      protected
        procedure Execute; override;
      public
        class function PythonDll: string;
        class function PythonDir: string;
        class function PyPkgDir: string;
        class function PythonExists: boolean;
        constructor Create;
        destructor Destroy; override;
        [Subscribe(TThreadMode.Background)]
        procedure OnEventSessionAnalyze(AEvent: IEventSessionAnalyze);
      end;

    procedure TSERTCPythonEngine.Execute;
    begin
      inherited;
      NameThreadForDebugging('THPythonEngine');
      try
        try
          CoInitialize(nil);
          if InstallVerify then
            if PythonExists then
            begin
              StartEngine;
              // The engine lives on this thread and polls its work queue
              while not Terminated do
              begin
                ProcessSessionAnalyze;
                sleep(10);
              end;
            end
            else
              Logger.Critical('Python does not exist')
          else
            Logger.Critical('Python install verification failed!');
        finally
          CoUninitialize();
        end;
      except
        on E: Exception do
          Logger.Critical('Python Execute failed with ' + E.Message);
      end;
    end;

    procedure TSERTCPythonEngine.ProcessSessionAnalyze;
    var
      lEsa: IEventSessionAnalyze;
    begin
      if (not QSessionAnalyze.ShutDown) and
        ((QSessionAnalyze.TotalItemsPushed - QSessionAnalyze.TotalItemsPopped) > 0) then
      begin
        lEsa := QSessionAnalyze.PopItem;
        if TFile.Exists(lEsa.SessionFileName) then
          PySessionAnalyze(lEsa);
        GlobalEventBus.Post(lEsa, 'Processed');
      end;
    end;
tomye, posted May 4, 2022

In reply to SwiftExpat:

Does it work for you? I found my error: the thread class inherited from the wrong base. It must inherit from TPythonThread, not TThread. It runs now, but it does not improve efficiency; the speed is almost the same as with a single process.

From the posts I have read, it seems I must use multiprocessing in Python to create new processes rather than threads. That becomes very complicated, because P4D only supports one Python engine instance per process. I have tried writing multi-process code in the Python script, but it launches new instances of the Delphi application at the same time.

So, does your code actually work?
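The extra Delphi instances are consistent with how the multiprocessing "spawn" start method works: it starts workers by launching sys.executable, and in an embedded interpreter that is typically the host application rather than python.exe. A possible workaround, untested with P4D here, is to point multiprocessing at a standalone interpreter before any worker is created. The sketch below uses the standard multiprocessing.set_executable call; use_standalone_interpreter and the path passed to it are hypothetical.

```python
import multiprocessing
import os


def use_standalone_interpreter(python_exe):
    """Start multiprocessing workers with python_exe instead of sys.executable.

    Returns True if the path exists and was applied, False otherwise.
    """
    if not os.path.isfile(python_exe):
        return False
    multiprocessing.set_executable(python_exe)
    return True
```

This would be called once, early in the embedded script, with the full path of a python.exe that matches the embedded Python version. Worker functions probably also need to live in an importable module rather than in the script run by the engine, since spawned workers import the code they execute.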
SwiftExpat, posted May 4, 2022

This might take a lot of code to accomplish, but it would be my goal before going multiprocess. Keep in mind that thread construction has a cost and IPC will always have a high cost.

The code works fine in my usage; the thread hosts the engine and stays in a loop. There is no reason to use TPythonThread, because you are not creating a thread in Python.

In your example the thread performs several tasks. Split them. The blocking point is Python, so the goal is to keep a steady stream of work going to it. This is the limit you design around. The thread that executes the Python code should only be watching a work queue, a TThreadedQueueCS. Handle all the synchronization in Delphi.

    procedure TThreadProcess.Execute;
    var
      v: Variant;
      itm: TItem;
    begin
      inherited;
      while not Terminated do
      begin
        itm := QSessionAnalyze.PopItem;
        // BMP: presumably the frame carried by itm
        v := MainModule.ProcessImage(ImageToPyBytes(BMP));
        QSessionDraw.PushItem(itm);
      end;
    end;

All of the other work could happen in another thread, or at the time the frame is rendered.

    procedure TThreadDraw.Execute;
    var
      v: Variant;
    begin
      inherited;
      DT.Init(true);
      DT.FillObjsData(v, 0, 0);
      DT.DrawObjects(BMP.Canvas, 1280);
      DT.DrawSpeedInfo(BMP.Canvas, 'NA', 1280);
      FitDrawEx(Img, BMP);
      Img.Invalidate;
    end;
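The same split (one consumer feeding the detector, a separate stage for drawing) can be sketched in Python to make the data flow concrete. This illustrates the queue pattern only and is not a translation of the Delphi code above; python_worker, run_pipeline and the process callback are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


def python_worker(work_q, draw_q, process):
    """Single consumer: the only thread that calls the detector."""
    while True:
        item = work_q.get()
        if item is _STOP:
            break
        # Hand the frame and its detection result to the drawing stage.
        draw_q.put((item, process(item)))


def run_pipeline(frames, process):
    """Feed frames through one worker and collect (frame, result) pairs."""
    work_q, draw_q = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=python_worker, args=(work_q, draw_q, process))
    worker.start()
    for frame in frames:
        work_q.put(frame)
    work_q.put(_STOP)
    worker.join()
    results = []
    while not draw_q.empty():
        results.append(draw_q.get())
    return results
```

With a single worker the results come out in submission order, so the drawing stage can pair each frame with its detections without extra bookkeeping.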
SwiftExpat, posted May 4, 2022 (edited)

What if the user wants to see the image without the lines drawn? Possibly this will help you see the problem and the goal differently.

Maybe you can do something like this: I keep the image and then draw the boxes at display time.

Attachment: Draw Boxes.mp4
tomye, posted May 17, 2022

In reply to shineworld's post of 5/3/2022 on PyBytes_FromStringAndSize:

Hi shineworld, would you be willing to write a simple demo of this transfer (an image from Delphi to Python within 1 ms)? I have tried several times but every attempt failed 😞

Thank you.
shineworld, posted May 17, 2022 (edited)

Attached to this post is a very simple Delphi application that should help you.

The demo creates a PythonEngine and adds a new module called delphi_vcl_ext, which wraps two functions:

    get_loaded_image_as_bytes()   # get delphi loaded image as bytes
    update_image_from_bytes(...)  # update delphivcl Image object from bytes array with width, height & channels

The program has two panels:
- The left panel is a TImage that shows the loaded image to transfer to the Python script.
- The right panel is a TImage that shows the image produced by the Python script.

By default, the program preloads a test BMP file (640x480, so it fits the left image panel).

Default script:
- Get the image loaded in Delphi using get_loaded_image_as_bytes().
- Decode the image to a NumPy array.
- Apply an automatic Canny filter.
- Send the resulting image back to Delphi to be shown in the right panel.

The time taken to transfer the images is shown in the log panel on the right.

With Load Image you can try other files, but they must be in a format supported by the TImage component. After all, this is only demo code written for you in my spare time.

Take care: if you call the script from a Delphi thread, you can't write the image sent by Python directly into a TImage or any other VCL component. Use TThread.Synchronize properly so that the update is done in the main thread.

Attachment: delphi_python_001.7z
tomye, posted May 18, 2022

In reply to shineworld's demo post of 5/17/2022:

Thank you, I have run a test. Your way is almost 3 times faster than DEMO29. Very good!
shineworld, posted May 18, 2022 (edited)

Actually, I've used the simplest way to send an image to Python: send it whole (header, image structure, data). This requires a cv2.imdecode call in Python to get back a NumPy array stripped of the container (GIF/BMP/JPG/PNG/etc).

You can use GetDIB in Delphi to extract only the pure image data (RGB or RGBA), pack it, and return it to Python, then reorganize the data in NumPy without using cv2.imdecode.

TAKE CARE
=======
In your time test for ByteIO, you have an overhead of cv2.imdecode.... Try without it.
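A minimal sketch of the "reorganize raw data" step, assuming the Delphi side sends uncompressed DIB pixel data. Two properties of DIBs matter here: each scanline is padded to a multiple of 4 bytes, and the rows are typically stored bottom-up. The function names and the bottom_up flag are hypothetical, and the sketch uses only the standard library so that it stays self-contained.

```python
def dib_stride(width, bits_per_pixel):
    """Bytes per scanline in a DIB: rows are padded to a multiple of 4 bytes."""
    return ((width * bits_per_pixel + 31) // 32) * 4


def dib_to_rows(data, width, height, channels=3, bottom_up=True):
    """Return top-down rows of raw pixel bytes with the row padding stripped."""
    stride = dib_stride(width, channels * 8)
    if len(data) < stride * height:
        raise ValueError("buffer is smaller than stride * height")
    view = memoryview(data)  # slicing a memoryview does not copy the pixels
    row_bytes = width * channels
    rows = [view[y * stride : y * stride + row_bytes] for y in range(height)]
    if bottom_up:
        rows.reverse()
    return rows
```

With NumPy the same idea would be np.frombuffer(data, dtype=np.uint8) reshaped to (height, stride), sliced to the first width * channels columns, reshaped to (height, width, channels) and flipped vertically if the rows are bottom-up. 24-bit DIB pixels are stored in BGR order, which matches OpenCV's default channel order.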
tomye, posted May 18, 2022

In reply to shineworld:

Actually, I have tried this. I converted an image to a byte array and sent it to Python, but I did not convert the byte array to a NumPy array, which is why I could not get good speed. At first I wondered whether I could send a pointer to Python, since that would be very fast, but it failed 😞

BTW, what do you mean by "an overhead of cv2.imdecode"?
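On the closing question: "overhead" here most likely means the time cv2.imdecode itself spends parsing the image container. If that time is measured together with the transfer, it gets counted as transfer time. A small helper makes it possible to time each step separately; this is a sketch with a hypothetical name (timed), using only time.perf_counter from the standard library.

```python
import time


def timed(fn, *args, repeat=5):
    """Call fn(*args) `repeat` times; return (last result, best time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best
```

Timing the call that fetches the bytes from Delphi and the decode step as two separate timed(...) calls shows how much of the total belongs to each.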