Jump to content
dummzeuch

Depth First Search vs. Breadth First Search in directories

Recommended Posts

Given a large directory tree with many files and sub directories and a depth of about 5 located on a Samba server share which must be fully traversed:

 

What is more efficient:

  • Depth First Search: Recursive FindFirst/Next which immediately handles each sub directory when it is encountered.
  • Breadth First Search: Iterative FindFirst/Next which maintains a list of directories to process and appends sub directories to this list when they are encountered to be processed later.

I have so far implemented DFS, simply because it was the easiest. It takes a very long time (about 60 seconds in my test scenario).

Before I try to implement BFS: Has anybody ever compared the performance of these approaches on Windows + network shares?

 

I googled, but could not find anything definitive (which really surprised me), maybe my google fu has let me down again.

  • Thanks 1

Share this post


Link to post

Discovery = DFS

Processing = BFS on Discovery -> Parallel worker threads processing queue?

 

The efficiency depends on a part of Windows file share architecture that I have never dived deeply into.

Coming from a single remote computer, will multiple concurrent file accesses be truly parallel, multiplexed or serialized?

  • Like 1

Share this post


Link to post
29 minutes ago, Lars Fosdal said:

Discovery = DFS

Processing = BFS on Discovery -> Parallel worker threads processing queue?

 

The efficiency depends on a part of Windows file share architecture that I have never dived deeply into.

Coming from a single remote computer, will multiple concurrent file accesses be truly parallel, multiplexed or serialized?

There is also the question of disk thrashing defeating the parallel approach. You can get a hint of this in BeyondCompare, if you set up multiple large copy operations, and run them in parallel.

Share this post


Link to post
4 hours ago, dummzeuch said:

I have so far implemented DFS, simply because it was the easiest. It takes a very long time (about 60 seconds in my test scenario).

Before I try to implement BFS: Has anybody ever compared the performance of these approaches on Windows + network shares?

Give TDirectory.GetFiles (System.IOUtils) a try. It drastically improved the time it took to discover everything compared to my recursive FindFirst / FindNext in my backup application.

Share this post


Link to post
58 minutes ago, aehimself said:

Give TDirectory.GetFiles (System.IOUtils) a try.

That might for a change actually be an option, as the program in question is developed with Delphi 10.2 rather than my usual work horse Delphi 2007. I'm not yet used to those "new" utility units.

Thanks for the pointer.

Share this post


Link to post

There are two simple enhancements that can be made to Delphi FindFirst (which I never call, having adapted it into my own version as a replacement). I do not think that either will help you search a Samba network drive, but together they can significantly enhance file searching on Windows file systems. The secret is to call Windows FindFirstFileEx instead of FindFirstFile, which is used both by Delphi FindFirst and therefore also by TDirectory.GetFiles.  On modern OS, the 'Ex' function allows you to opt for a larger buffer, and also to omit the double search for the 8.3 equivalent alternate names in addition to the full file names.  Once you have set up the 'find' optimally in this way, you can call the standard FindNext and FindClose functions as usual.  MSDN has the details you need for FindFirstFileEx. 

 

But to address your question, my guess is that your recursive search would turn out to be faster than multi-threading the subfolder searches, if you were to measure it.  And measuring in your own environment is the best way to get the answer.

  • Like 1

Share this post


Link to post
1 hour ago, timfrost said:

search a Samba network drive

This works fine, and I also use a home rolled method using FindFirstFileEx.

 

As for multiple threads, I did that test a while back and it didn't show any better time.

Share this post


Link to post

TDirectory.GetFiles seems horribly inefficient to me. In most cases it's probably fast enough.. but looking at the code, I can't help think it's use TPath.DoMatchesPattern is wasteful (creating the mask and freeing it each time). 

  • Like 1

Share this post


Link to post
18 hours ago, dummzeuch said:
19 hours ago, aehimself said:

Give TDirectory.GetFiles (System.IOUtils) a try. It drastically improved the time it took to discover everything compared to my recursive FindFirst / FindNext in my backup application.

That might for a change actually be an option, as the program in question is developed with Delphi 10.2 rather than my usual work horse Delphi 2007. I'm not yet used to those "new" utility units.

I just had a look at the sources. As @Vandrovnik already said: It internally also uses FindFirst / FindNext in the same way my program did, so I don't expect any improvements. WalkThroughDirectory also recursively does a depth first search. In addition it calls two different callbacks, which won't improve performance either.

 

I'll now try to use FindFirstEx and also implemtent BFS and do some timing on each.

Edited by dummzeuch

Share this post


Link to post

My tests show no performance difference at all between DFS und BFS using FindFirst/FindNext. Both take nearly exactly 60 seconds on my test machine traversing the same directory tree on a Samba server. Trying to use TDirectory did not result in any improvement. It takes about twice as long.

 

DFS code:

var
  DirsWithJpg: TArray<string>;

  procedure CheckDirectory(const _Dir: string);
  var
    DirBs: string;
    sr: TSearchRec;
    ContainsFiles: boolean;
    ContainsSubdirs: boolean;
  begin
    ContainsFiles := False;
    ContainsSubdirs := False;
    DirBs := IncludeTrailingPathDelimiter(_Dir);

    if FindFirst(DirBs + '*.*', faAnyFile, sr) = 0 then begin
      try
        repeat
          if (sr.Name = '.') or (sr.Name = '..') then begin
            // ignore
          end else if (sr.Attr and faDirectory) <> 0 then begin
            // directory
            ContainsSubdirs := true;
            // recursive Depth First Search
            CheckDirectory(DirBs + sr.Name);
          end else if sr.Attr and (faHidden or faSysFile or faSymLink) = 0 then begin
            // regular file
            if not ContainsFiles then begin
              if SameText(ExtractFileExt(sr.Name), '.jpg') then begin
                ContainsFiles := true;
              end;
            end;
          end else begin
            // ignore special files
          end;
        until FindNext(sr) <> 0;
      finally
        FindClose(sr);
      end;
    end;
    if ContainsFiles then begin
      if ContainsSubdirs then begin
        m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.',
          [_Dir]));
      end else begin
        DirsWithJpg := DirsWithJpg + [_Dir];
      end;
    end;
  end;

var
  Stopwatch: TStopwatch;
begin
  Stopwatch.Reset;
  Stopwatch.Start;
  CheckDirectory('\\server\share\dir');
  Stopwatch.Stop;
  m_Result.Lines.Add(Format('FindFirst/Next DBF: Found %d dirs in %.3f seconds',
    [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds]));
end;

BFS code:

var
  DirsWithJpg: TArray<string>;
  DirsToSearch: TStringList;

  procedure CheckDirectory(const _Dir: string);
  var
    DirBs: string;
    sr: TSearchRec;
    ContainsFiles: boolean;
    ContainsSubdirs: boolean;
  begin
    ContainsFiles := False;
    ContainsSubdirs := False;
    DirBs := IncludeTrailingPathDelimiter(_Dir);

    if FindFirst(DirBs + '*.*', faAnyFile, sr) = 0 then begin
      try
        repeat
          if (sr.Name = '.') or (sr.Name = '..') then begin
            // ignore
          end else if (sr.Attr and faDirectory) <> 0 then begin
            // directory
            ContainsSubdirs := true;
            // add to list for later processing
            DirsToSearch.Add(DirBs + sr.Name);
          end else if sr.Attr and (faHidden or faSysFile or faSymLink) = 0 then begin
            // regular file
            if not ContainsFiles then begin
              if SameText(ExtractFileExt(sr.Name), '.jpg') then begin
                ContainsFiles := true;
              end;
            end;
          end else begin
            // ignore special files
          end;
        until FindNext(sr) <> 0;
      finally
        FindClose(sr);
      end;
    end;
    if ContainsFiles then begin
      if ContainsSubdirs then begin
        m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.',
          [_Dir]));
      end else begin
        DirsWithJpg := DirsWithJpg + [_Dir];
      end;
    end;
  end;

var
  Stopwatch: TStopwatch;
begin
  Stopwatch.Reset;
  Stopwatch.Start;
  DirsToSearch := TStringList.Create;
  try
    DirsToSearch.Add('\\server\share\dir');
    while DirsToSearch.count > 0 do begin
      CheckDirectory(DirsToSearch[0]);
      DirsToSearch.Delete(0);
    end;

  finally
    FreeAndNil(DirsToSearch);
  end;
  Stopwatch.Stop;
  m_Result.Lines.Add(Format('FindFirst/Next BFS: Found %d dirs in %.3f seconds',
    [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds]));
end;

The naive approach using TDirectory takes about twice as long because of the calls to .GetFiles and .GetDirectories each call FindFirst/FindNext per directory. (It also finds 2 less directories containing jpg files, so there is probably a bug somewhere, but I didn't investigate this any further.)

var
  DirsWithJpg: TArray<string>;

  procedure CheckDirectory(const _Dir: string);
  var
    Files: TArray<string>;
    Dirs: TArray<string>;
    i: Integer;
  begin
    Files := TDirectory.GetFiles(_Dir, '*.jpg');
    Dirs := TDirectory.GetDirectories(_Dir);

    if Length(Files) > 0 then begin
      if Length(Dirs) > 0 then begin
        m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.',
          [_Dir]));
      end else
        DirsWithJpg := DirsWithJpg + [_Dir];
    end else begin
      for i := Low(Dirs) to High(Dirs) do
        CheckDirectory(Dirs[i]);
    end;
  end;

var
  Stopwatch: TStopwatch;
begin
  Stopwatch.Reset;
  Stopwatch.Start;
  CheckDirectory('\\server\share\dir');
  Stopwatch.Stop;
  m_Result.Lines.Add(Format('DirGetFiles: Found %d dirs in %.3f seconds',
    [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds]));
end;

I thought about using TDirectory.FileSystemEntries instead but could not think of a simple way to implement this. It's probably possible using a Predicate but that's not really any simpler than directly using FindFirst/FindNext.

 

There is probably still some space for improvements and the code is not really clean, e.g. accessing variables of the outer procedure.

 

So, now I'll have a look at calling Windows.FindFirstFileEx instead of FindFirst, but I don't have much hope that this will help much.

Edited by dummzeuch
  • Like 1
  • Thanks 1

Share this post


Link to post

OK, so here is the DFS implementation using FindFirstFileEx and only getting basic info. In my tests on the same computer and with the same Samba server and share it only takes about 45 seconds (rather than the 60 seconds with SysUtils.FindFirst). The result is the same. For completeness I am going to implement BFS for this too, but I don't expect any change really.

function FindFirstFileEx(lpFileName: LPWSTR; fInfoLevelId: TFindexInfoLevels;
  lpFindFileData: Pointer; fSearchOp: TFindexSearchOps; lpSearchFilter: PWin32FindData;
  dwAdditionalFlags: DWORD): THandle; stdcall;
  external kernelbase Name 'FindFirstFileExW';

function FindNextFile(hFindFile: THandle; lpFindFileData: PWin32FindData): BOOL; stdcall;
  external kernelbase Name 'FindNextFileW';

procedure TForm1.b_FindFirstExNextDFSClick(Sender: TObject);
var
  DirsWithJpg: TArray<string>;

  procedure CheckDirectory(const _Dir: string);
  const
    FIND_FIRST_EX_LARGE_FETCH = 2;
  var
    DirBs: string;
    SearchHandle: THandle;
    sr: TWin32FindData;
    ContainsFiles: boolean;
    ContainsSubdirs: boolean;
    fn: string;
  begin
    ContainsFiles := False;
    ContainsSubdirs := False;
    DirBs := IncludeTrailingPathDelimiter(_Dir);

    SearchHandle := FindFirstFileEx(PChar(DirBs + '*.*'), FindExInfoBasic, @sr, FindExSearchNameMatch,
      nil, FIND_FIRST_EX_LARGE_FETCH);
    if SearchHandle <> INVALID_HANDLE_VALUE then begin
      try
        repeat
          fn := sr.cFileName;
          if (fn = '.') or (fn = '..') then begin
            // ignore
          end else if (sr.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) <> 0 then begin
            // directory
            ContainsSubdirs := true;
            // recursive Depth First Search
            CheckDirectory(DirBs + fn);
          end else if sr.dwFileAttributes
            and (FILE_ATTRIBUTE_HIDDEN or FILE_ATTRIBUTE_SYSTEM or FILE_ATTRIBUTE_REPARSE_POINT) = 0 then begin
            // regular file
            if not ContainsFiles then begin
              if SameText(ExtractFileExt(fn), '.jpg') then begin
                ContainsFiles := true;
              end;
            end;
          end else begin
            // ignore special files
          end;
        until not FindNextFile(SearchHandle, @sr);
      finally
        Winapi.Windows.FindClose(SearchHandle);
      end;
    end;
    if ContainsFiles then begin
      if ContainsSubdirs then begin
        m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.',
          [_Dir]));
      end else begin
        DirsWithJpg := DirsWithJpg + [_Dir];
      end;
    end;
  end;

var
  Stopwatch: TStopwatch;
begin
  Stopwatch.Reset;
  Stopwatch.Start;
  CheckDirectory('\\server\share\dir);
  Stopwatch.Stop;
  m_Result.Lines.Add(Format('FindFirstExNext DFS: Found %d dirs in %.3f seconds',
    [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds]));
end;

I redeclared FindFirstFileEx and FindNextFile for this because the declarations in WinApi.Windows are inconsistent in the declaration of the lpfFindFileData parameter. Now both declare it as PWin32FindData.

  • Like 1

Share this post


Link to post

As promised, here the BFS version using FindFirstFileEx. And as expected, it also takes about 45 seconds in my test setup.

function FindFirstFileEx(lpFileName: LPWSTR; fInfoLevelId: TFindexInfoLevels;
  lpFindFileData: Pointer; fSearchOp: TFindexSearchOps; lpSearchFilter: PWin32FindData;
  dwAdditionalFlags: DWORD): THandle; stdcall;
  external kernelbase Name 'FindFirstFileExW';

function FindNextFile(hFindFile: THandle; lpFindFileData: PWin32FindData): BOOL; stdcall;
  external kernelbase Name 'FindNextFileW';

procedure TForm1.b_FindFirstExNextBFSClick(Sender: TObject);
var
  DirsWithJpg: TArray<string>;
  DirsToSearch: TStringList;

  procedure CheckDirectory(const _Dir: string);
  const
    FIND_FIRST_EX_LARGE_FETCH = 2;
  var
    DirBs: string;
    SearchHandle: THandle;
    sr: TWin32FindData;
    ContainsFiles: boolean;
    ContainsSubdirs: boolean;
    fn: string;
  begin
    ContainsFiles := False;
    ContainsSubdirs := False;
    DirBs := IncludeTrailingPathDelimiter(_Dir);

    SearchHandle := FindFirstFileEx(PChar(DirBs + '*.*'), FindExInfoBasic, @sr, FindExSearchNameMatch,
      nil, FIND_FIRST_EX_LARGE_FETCH);
    if SearchHandle <> INVALID_HANDLE_VALUE then begin
      try
        repeat
          fn := sr.cFileName;
          if (fn = '.') or (fn = '..') then begin
            // ignore
          end else if (sr.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) <> 0 then begin
            // directory
            ContainsSubdirs := true;
            // add to list for later processing
            DirsToSearch.Add(DirBs + fn);
          end else if sr.dwFileAttributes
            and (FILE_ATTRIBUTE_HIDDEN or FILE_ATTRIBUTE_SYSTEM or FILE_ATTRIBUTE_REPARSE_POINT) = 0 then begin
            // regular file
            if not ContainsFiles then begin
              if SameText(ExtractFileExt(fn), '.jpg') then begin
                ContainsFiles := true;
              end;
            end;
          end else begin
            // ignore special files
          end;
        until not FindNextFile(SearchHandle, @sr);
      finally
        Winapi.Windows.FindClose(SearchHandle);
      end;
    end;
    if ContainsFiles then begin
      if ContainsSubdirs then begin
        m_Result.Lines.Add(Format('Directory %s contains files and subdirectories, ignoring it.',
          [_Dir]));
      end else begin
        DirsWithJpg := DirsWithJpg + [_Dir];
      end;
    end;
  end;

var
  Stopwatch: TStopwatch;
begin
  Stopwatch.Reset;
  Stopwatch.Start;
  DirsToSearch := TStringList.Create;
  try
    DirsToSearch.Add('\\server\share\dir');
    while DirsToSearch.count > 0 do begin
      CheckDirectory(DirsToSearch[0]);
      DirsToSearch.Delete(0);
    end;

  finally
    FreeAndNil(DirsToSearch);
  end;
  Stopwatch.Stop;
  m_Result.Lines.Add(Format('FindFirstExNext BFS: Found %d dirs in %.3f seconds',
    [Length(DirsWithJpg), Stopwatch.Elapsed.TotalSeconds]));
end;

 

  • Like 1

Share this post


Link to post

Conclusion:

  1. It doesn't matter whether I use Breadth First or Depth First Search, the performance is the same.
  2. Switching from SysUtils.FindFirst/FindNext to Windows.FindFirstFileEx/FindNextFile with only basic file information and the flag FIND_FIRST_EX_LARGE_FETCH brought me a performance gain of about 25% which is not too bad.
  3. I found no way to benefit from using TDirectory. The only test I made got worse performance than FindFirst/Next.

Test environment:

  • Windows 8.1 client accessing a share on a Samba server over 1 GBit Ethernet with minimal traffic (it's a weekend). The computer is a reasonably fast Intel Xeon E3 with 3.4 GHz with 8 cores and 16 GByte RAM.
  • The Server has an Intel Core I9-9940X processor with 16 cores and with 64 GBytes of RAM. The share is stored on a 2 TB Intel M2 NVMe SSD. It's running Ubuntu 18.04.4 LTS and Samba 4.7.6-Ubuntu.
  • The test program was compiled with Delphi 10.4. There was a no noticable difference between building with Debug and Release config.
  • The directory tree is 4 levels deep and contains only a few directories on the first 3 levels but a total of 898 directories on the deepest level. Each of these directories on the deepest level contain from several hundred up to several thousand jpeg files.
  • The purpose of the code is to find all directories on any level containing at least one jpeg file but no subdirectories.
  • All tests were run several times in varying order to prevent any caching effects to distort the result.
  • Like 1
  • Thanks 2

Share this post


Link to post
6 hours ago, dummzeuch said:

Trying to use TDirectory did not result in any improvement. It takes about twice as long.

Thanks for that info! I have been using TDirectory lately, but this thread spiked my interest in looking at how it's implemented.. creating a freeing a TMask for every file/directory is not very efficient.  

Share this post


Link to post

A quick look tells me there are other issues with TDirectory.GetFiles - no support for paths longer than MAX_PATH  - which windows 10 1607 or later does support. You do have to enable it both on the system (reg key) and per process (manifest option) 

 

https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#enable-long-paths-in-windows-10-version-1607-and-later

 

Share this post


Link to post
On 5/29/2020 at 6:06 PM, Vincent Parrett said:

TDirectory.GetFiles seems horribly inefficient to me.

That is how I feel for most things in the IOUtils unit!  It is just bad implementations all around.

Edited by Remy Lebeau
  • Like 2

Share this post


Link to post

I did some more tests:

  • On Windows 10 1909 on a physical computer and a local drive I found that it makes no difference whether I use FindFirstFileEx or FindFirstFile.
  • On Windows 10 1909 on a Samba share (same as in the first tests) there is still an advantage of using FindFirstFileEx, but it isn't as big as in my tests on Windows 8.1. Since the contents of the drive have changed in the meantime and the Windows 10 installation is on a virtual machine while the Windows 8.1 installation is on a physical computer I can't meaningfully compare them to the original tests.

 

Edited by dummzeuch
  • Like 1

Share this post


Link to post
10 hours ago, dummzeuch said:

I can't meaningfully compare them

 

It's faster especially when you are only enumerating folders..

 

To do a real test one would have to access remote ADMIN$ shares, something like a 2012 server or newer with a huge directory tree..
Run a few of those on Azure and run them simultaneously (multi-threaded).. but not in 10.4 .. being able to retrieve the data in large chunks makes enough of a difference..

 

 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×