dummzeuch 1506 Posted May 29, 2020 Given a large directory tree with many files and sub directories and a depth of about 5 located on a Samba server share which must be fully traversed: What is more efficient: Depth First Search: Recursive FindFirst/Next which immediately handles each sub directory when it is encountered. Breadth First Search: Iterative FindFirst/Next which maintains a list of directories to process and appends sub directories to this list when they are encountered to be processed later. I have so far implemented DFS, simply because it was the easiest. It takes a very long time (about 60 seconds in my test scenario). Before I try to implement BFS: Has anybody ever compared the performance of these approaches on Windows + network shares? I googled, but could not find anything definitive (which really surprised me), maybe my google fu has let me down again. 1 Share this post Link to post
Lars Fosdal 1792 Posted May 29, 2020 Discovery = DFS Processing = BFS on Discovery -> Parallel worker threads processing queue? The efficiency depends on a part of Windows file share architecture that I have never dived deeply into. Coming from a single remote computer, will multiple concurrent file accesses be truly parallel, multiplexed or serialized? 1 Share this post Link to post
Bill Meyer 337 Posted May 29, 2020 29 minutes ago, Lars Fosdal said: Discovery = DFS Processing = BFS on Discovery -> Parallel worker threads processing queue? The efficiency depends on a part of Windows file share architecture that I have never dived deeply into. Coming from a single remote computer, will multiple concurrent file accesses be truly parallel, multiplexed or serialized? There is also the question of disk thrashing defeating the parallel approach. You can get a hint of this in BeyondCompare, if you set up multiple large copy operations, and run them in parallel. Share this post Link to post
aehimself 396 Posted May 29, 2020 4 hours ago, dummzeuch said: I have so far implemented DFS, simply because it was the easiest. It takes a very long time (about 60 seconds in my test scenario). Before I try to implement BFS: Has anybody ever compared the performance of these approaches on Windows + network shares? Give TDirectory.GetFiles (System.IOUtils) a try. It drastically improved the time it took to discover everything compared to my recursive FindFirst / FindNext in my backup application. Share this post Link to post
dummzeuch 1506 Posted May 29, 2020 58 minutes ago, aehimself said: Give TDirectory.GetFiles (System.IOUtils) a try. That might for a change actually be an option, as the program in question is developed with Delphi 10.2 rather than my usual work horse Delphi 2007. I'm not yet used to those "new" utility units. Thanks for the pointer. Share this post Link to post
Vandrovnik 214 Posted May 29, 2020 But TDirectory.GetFiles (System.IOUtils) internally uses FindFirst/FindNext too... 1 Share this post Link to post
timfrost 78 Posted May 29, 2020 There are two simple enhancements that can be made to Delphi FindFirst (which I never call, having adapted it into my own version as a replacement). I do not think that either will help you search a Samba network drive, but together they can significantly enhance file searching on Windows file systems. The secret is to call Windows FindFirstFileEx instead of FindFirstFile, which is used both by Delphi FindFirst and therefore also by TDirectory.GetFiles. On modern OS, the 'Ex' function allows you to opt for a larger buffer, and also to omit the double search for the 8.3 equivalent alternate names in addition to the full file names. Once you have set up the 'find' optimally in this way, you can call the standard FindNext and FindClose functions as usual. MSDN has the details you need for FindFirstFileEx. But to address your question, my guess is that your recursive search would turn out to be faster than multi-threading the subfolder searches, if you were to measure it. And measuring in your own environment is the best way to get the answer. 1 Share this post Link to post
FredS 138 Posted May 29, 2020 1 hour ago, timfrost said: search a Samba network drive This works fine, and I also use a home rolled method using FindFirstFileEx. As for multiple threads, I did that test a while back and it didn't show any better time. Share this post Link to post
Vincent Parrett 754 Posted May 30, 2020 TDirectory.GetFiles seems horribly inefficient to me. In most cases it's probably fast enough.. but looking at the code, I can't help think it's use TPath.DoMatchesPattern is wasteful (creating the mask and freeing it each time). 1 Share this post Link to post
dummzeuch 1506 Posted May 30, 2020 (edited) 18 hours ago, dummzeuch said: 19 hours ago, aehimself said: Give TDirectory.GetFiles (System.IOUtils) a try. It drastically improved the time it took to discover everything compared to my recursive FindFirst / FindNext in my backup application. That might for a change actually be an option, as the program in question is developed with Delphi 10.2 rather than my usual work horse Delphi 2007. I'm not yet used to those "new" utility units. I just had a look at the sources. As @Vandrovnik already said: It internally also uses FindFirst / FindNext in the same way my program did, so I don't expect any improvements. WalkThroughDirectory also recursively does a depth first search. In addition it calls two different callbacks, which won't improve performance either. I'll now try to use FindFirstEx and also implemtent BFS and do some timing on each. Edited May 30, 2020 by dummzeuch Share this post Link to post
dummzeuch 1506 Posted May 30, 2020 (edited) My tests show no performance difference at all between DFS und BFS using FindFirst/FindNext. Both take nearly exactly 60 seconds on my test machine traversing the same directory tree on a Samba server. Trying to use TDirectory did not result in any improvement. It takes about twice as long. DFS code: var DirsWithJpg: TArray<string>; procedure CheckDirectory(const _Dir: string); var DirBs: string; sr: TSearchRec; ContainsFiles: boolean; ContainsSubdirs: boolean; begin ContainsFiles := False; ContainsSubdirs := False; DirBs := IncludeTrailingPathDelimiter(_Dir); if FindFirst(DirBs + '*.*', faAnyFile, sr) = 0 then begin try repeat if (sr.Name = '.') or (sr.Name = '..') then begin // ignore end else if (sr.Attr and faDirectory) <> 0 then begin // directory ContainsSubdirs := true; // recursive Depth First Search CheckDirectory(DirBs + sr.Name); end else if sr.Attr and (faHidden or faSysFile or faSymLink) = 0 then begin // regular file if not ContainsFiles then begin if SameText(ExtractFileExt(sr.Name), '.jpg') then begin ContainsFiles := true; end; end; end else begin // ignore special files end; until FindNext(sr) <> 0; finally FindClose(sr); end; end; if ContainsFiles then begin if ContainsSubdirs then begin m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.', [_Dir])); end else begin DirsWithJpg := DirsWithJpg + [_Dir]; end; end; end; var Stopwatch: TStopwatch; begin Stopwatch.Reset; Stopwatch.Start; CheckDirectory('\\server\share\dir'); Stopwatch.Stop; m_Result.Lines.Add(Format('FindFirst/Next DBF: Found %d dirs in %.3f seconds', [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds])); end; BFS code: var DirsWithJpg: TArray<string>; DirsToSearch: TStringList; procedure CheckDirectory(const _Dir: string); var DirBs: string; sr: TSearchRec; ContainsFiles: boolean; ContainsSubdirs: boolean; begin ContainsFiles := False; ContainsSubdirs := False; DirBs := IncludeTrailingPathDelimiter(_Dir); if FindFirst(DirBs + '*.*', faAnyFile, sr) = 0 then begin try repeat if (sr.Name = '.') or (sr.Name = '..') then begin // ignore end else if (sr.Attr and faDirectory) <> 0 then begin // directory ContainsSubdirs := true; // add to list for later processing DirsToSearch.Add(DirBs + sr.Name); end else if sr.Attr and (faHidden or faSysFile or faSymLink) = 0 then begin // regular file if not ContainsFiles then begin if SameText(ExtractFileExt(sr.Name), '.jpg') then begin ContainsFiles := true; end; end; end else begin // ignore special files end; until FindNext(sr) <> 0; finally FindClose(sr); end; end; if ContainsFiles then begin if ContainsSubdirs then begin m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.', [_Dir])); end else begin DirsWithJpg := DirsWithJpg + [_Dir]; end; end; end; var Stopwatch: TStopwatch; begin Stopwatch.Reset; Stopwatch.Start; DirsToSearch := TStringList.Create; try DirsToSearch.Add('\\server\share\dir'); while DirsToSearch.count > 0 do begin CheckDirectory(DirsToSearch[0]); DirsToSearch.Delete(0); end; finally FreeAndNil(DirsToSearch); end; Stopwatch.Stop; m_Result.Lines.Add(Format('FindFirst/Next BFS: Found %d dirs in %.3f seconds', [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds])); end; The naive approach using TDirectory takes about twice as long because of the calls to .GetFiles and .GetDirectories each call FindFirst/FindNext per directory. (It also finds 2 less directories containing jpg files, so there is probably a bug somewhere, but I didn't investigate this any further.) var DirsWithJpg: TArray<string>; procedure CheckDirectory(const _Dir: string); var Files: TArray<string>; Dirs: TArray<string>; i: Integer; begin Files := TDirectory.GetFiles(_Dir, '*.jpg'); Dirs := TDirectory.GetDirectories(_Dir); if Length(Files) > 0 then begin if Length(Dirs) > 0 then begin m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.', [_Dir])); end else DirsWithJpg := DirsWithJpg + [_Dir]; end else begin for i := Low(Dirs) to High(Dirs) do CheckDirectory(Dirs[i]); end; end; var Stopwatch: TStopwatch; begin Stopwatch.Reset; Stopwatch.Start; CheckDirectory('\\server\share\dir'); Stopwatch.Stop; m_Result.Lines.Add(Format('DirGetFiles: Found %d dirs in %.3f seconds', [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds])); end; I thought about using TDirectory.FileSystemEntries instead but could not think of a simple way to implement this. It's probably possible using a Predicate but that's not really any simpler than directly using FindFirst/FindNext. There is probably still some space for improvements and the code is not really clean, e.g. accessing variables of the outer procedure. So, now I'll have a look at calling Windows.FindFirstFileEx instead of FindFirst, but I don't have much hope that this will help much. Edited May 30, 2020 by dummzeuch 1 1 Share this post Link to post
dummzeuch 1506 Posted May 30, 2020 OK, so here is the DFS implementation using FindFirstFileEx and only getting basic info. In my tests on the same computer and with the same Samba server and share it only takes about 45 seconds (rather than the 60 seconds with SysUtils.FindFirst). The result is the same. For completeness I am going to implement BFS for this too, but I don't expect any change really. function FindFirstFileEx(lpFileName: LPWSTR; fInfoLevelId: TFindexInfoLevels; lpFindFileData: Pointer; fSearchOp: TFindexSearchOps; lpSearchFilter: PWin32FindData; dwAdditionalFlags: DWORD): THandle; stdcall; external kernelbase Name 'FindFirstFileExW'; function FindNextFile(hFindFile: THandle; lpFindFileData: PWin32FindData): BOOL; stdcall; external kernelbase Name 'FindNextFileW'; procedure TForm1.b_FindFirstExNextDFSClick(Sender: TObject); var DirsWithJpg: TArray<string>; procedure CheckDirectory(const _Dir: string); const FIND_FIRST_EX_LARGE_FETCH = 2; var DirBs: string; SearchHandle: THandle; sr: TWin32FindData; ContainsFiles: boolean; ContainsSubdirs: boolean; fn: string; begin ContainsFiles := False; ContainsSubdirs := False; DirBs := IncludeTrailingPathDelimiter(_Dir); SearchHandle := FindFirstFileEx(PChar(DirBs + '*.*'), FindExInfoBasic, @sr, FindExSearchNameMatch, nil, FIND_FIRST_EX_LARGE_FETCH); if SearchHandle <> INVALID_HANDLE_VALUE then begin try repeat fn := sr.cFileName; if (fn = '.') or (fn = '..') then begin // ignore end else if (sr.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) <> 0 then begin // directory ContainsSubdirs := true; // recursive Depth First Search CheckDirectory(DirBs + fn); end else if sr.dwFileAttributes and (FILE_ATTRIBUTE_HIDDEN or FILE_ATTRIBUTE_SYSTEM or FILE_ATTRIBUTE_REPARSE_POINT) = 0 then begin // regular file if not ContainsFiles then begin if SameText(ExtractFileExt(fn), '.jpg') then begin ContainsFiles := true; end; end; end else begin // ignore special files end; until not FindNextFile(SearchHandle, @sr); finally Winapi.Windows.FindClose(SearchHandle); end; end; if ContainsFiles then begin if ContainsSubdirs then begin m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.', [_Dir])); end else begin DirsWithJpg := DirsWithJpg + [_Dir]; end; end; end; var Stopwatch: TStopwatch; begin Stopwatch.Reset; Stopwatch.Start; CheckDirectory('\\server\share\dir); Stopwatch.Stop; m_Result.Lines.Add(Format('FindFirstExNext DFS: Found %d dirs in %.3f seconds', [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds])); end; I redeclared FindFirstFileEx and FindNextFile for this because the declarations in WinApi.Windows are inconsistent in the declaration of the lpfFindFileData parameter. Now both declare it as PWin32FindData. 1 Share this post Link to post
dummzeuch 1506 Posted May 30, 2020 As promised, here the BFS version using FindFirstFileEx. And as expected, it also takes about 45 seconds in my test setup. function FindFirstFileEx(lpFileName: LPWSTR; fInfoLevelId: TFindexInfoLevels; lpFindFileData: Pointer; fSearchOp: TFindexSearchOps; lpSearchFilter: PWin32FindData; dwAdditionalFlags: DWORD): THandle; stdcall; external kernelbase Name 'FindFirstFileExW'; function FindNextFile(hFindFile: THandle; lpFindFileData: PWin32FindData): BOOL; stdcall; external kernelbase Name 'FindNextFileW'; procedure TForm1.b_FindFirstExNextBFSClick(Sender: TObject); var DirsWithJpg: TArray<string>; DirsToSearch: TStringList; procedure CheckDirectory(const _Dir: string); const FIND_FIRST_EX_LARGE_FETCH = 2; var DirBs: string; SearchHandle: THandle; sr: TWin32FindData; ContainsFiles: boolean; ContainsSubdirs: boolean; fn: string; begin ContainsFiles := False; ContainsSubdirs := False; DirBs := IncludeTrailingPathDelimiter(_Dir); SearchHandle := FindFirstFileEx(PChar(DirBs + '*.*'), FindExInfoBasic, @sr, FindExSearchNameMatch, nil, FIND_FIRST_EX_LARGE_FETCH); if SearchHandle <> INVALID_HANDLE_VALUE then begin try repeat fn := sr.cFileName; if (fn = '.') or (fn = '..') then begin // ignore end else if (sr.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) <> 0 then begin // directory ContainsSubdirs := true; // add to list for later processing DirsToSearch.Add(DirBs + fn); end else if sr.dwFileAttributes and (FILE_ATTRIBUTE_HIDDEN or FILE_ATTRIBUTE_SYSTEM or FILE_ATTRIBUTE_REPARSE_POINT) = 0 then begin // regular file if not ContainsFiles then begin if SameText(ExtractFileExt(fn), '.jpg') then begin ContainsFiles := true; end; end; end else begin // ignore special files end; until not FindNextFile(SearchHandle, @sr); finally Winapi.Windows.FindClose(SearchHandle); end; end; if ContainsFiles then begin if ContainsSubdirs then begin m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.', [_Dir])); end else begin DirsWithJpg := DirsWithJpg + [_Dir]; end; end; end; var Stopwatch: TStopwatch; begin Stopwatch.Reset; Stopwatch.Start; DirsToSearch := TStringList.Create; try DirsToSearch.Add('\\server\share\dir'); while DirsToSearch.count > 0 do begin CheckDirectory(DirsToSearch[0]); DirsToSearch.Delete(0); end; finally FreeAndNil(DirsToSearch); end; Stopwatch.Stop; m_Result.Lines.Add(Format('FindFirstExNext BFS: Found %d dirs in %.3f seconds', [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds])); end; 1 Share this post Link to post
dummzeuch 1506 Posted May 30, 2020 Conclusion: It doesn't matter whether I use Breadth First or Depth First Search, the performance is the same. Switching from SysUtils.FindFirst/FindNext to Windows.FindFirstFileEx/FindNextFile with only basic file information and the flag FIND_FIRST_EX_LARGE_FETCH brought me a performance gain of about 25% which is not too bad. I found no way to benefit from using TDirectory. The only test I made got worse performance than FindFirst/Next. Test environment: Windows 8.1 client accessing a share on a Samba server over 1 GBit Ethernet with minimal traffic (it's a weekend). The computer is a reasonably fast Intel Xeon E3 with 3.4 GHz with 8 cores and 16 GByte RAM. The Server has an Intel Core I9-9940X processor with 16 cores and with 64 GBytes of RAM. The share is stored on a 2 TB Intel M2 NVMe SSD. It's running Ubuntu 18.04.4 LTS and Samba 4.7.6-Ubuntu. The test program was compiled with Delphi 10.4. There was a no noticable difference between building with Debug and Release config. The directory tree is 4 levels deep and contains only a few directories on the first 3 levels but a total of 898 directories on the deepest level. Each of these directories on the deepest level contain from several hundred up to several thousand jpeg files. The purpose of the code is to find all directories on any level containing at least one jpeg file but no subdirectories. All tests were run several times in varying order to prevent any caching effects to distort the result. 1 2 Share this post Link to post
Vincent Parrett 754 Posted May 30, 2020 6 hours ago, dummzeuch said: Trying to use TDirectory did not result in any improvement. It takes about twice as long. Thanks for that info! I have been using TDirectory lately, but this thread spiked my interest in looking at how it's implemented.. creating a freeing a TMask for every file/directory is not very efficient. Share this post Link to post
Vincent Parrett 754 Posted May 30, 2020 A quick look tells me there are other issues with TDirectory.GetFiles - no support for paths longer than MAX_PATH - which windows 10 1607 or later does support. You do have to enable it both on the system (reg key) and per process (manifest option) https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#enable-long-paths-in-windows-10-version-1607-and-later Share this post Link to post
Remy Lebeau 1405 Posted June 1, 2020 (edited) On 5/29/2020 at 6:06 PM, Vincent Parrett said: TDirectory.GetFiles seems horribly inefficient to me. That is how I feel for most things in the IOUtils unit! It is just bad implementations all around. Edited June 1, 2020 by Remy Lebeau 2 Share this post Link to post
dummzeuch 1506 Posted June 14, 2020 (edited) I did some more tests: On Windows 10 1909 on a physical computer and a local drive I found that it makes no difference whether I use FindFirstFileEx or FindFirstFile. On Windows 10 1909 on a Samba share (same as in the first tests) there is still an advantage of using FindFirstFileEx, but it isn't as big as in my tests on Windows 8.1. Since the contents of the drive have changed in the meantime and the Windows 10 installation is on a virtual machine while the Windows 8.1 installation is on a physical computer I can't meaningfully compare them to the original tests. Edited June 14, 2020 by dummzeuch 1 Share this post Link to post
FredS 138 Posted June 15, 2020 10 hours ago, dummzeuch said: I can't meaningfully compare them It's faster especially when you are only enumerating folders.. To do a real test one would have to access remote ADMIN$ shares, something like a 2012 server or newer with a huge directory tree.. Run a few of those on Azure and run them simultaneously (multi-threaded).. but not in 10.4 .. being able to retrieve the data in large chunks makes enough of a difference.. Share this post Link to post
Mahdi Safsafi 225 Posted June 15, 2020 @dummzeuch I used PChar instead of string in b_FindFirstExNextBFSClick and I got up to 20% improvement. I tested on Windows 10 with SDD on 'C:' drive. Share this post Link to post