Mike Torrettinni

Any advice when to use FileExists?


I definitely enjoy refactoring old code into small reusable methods, but refactoring methods that deal with files always gives me the sense that a 'point of failure' is very close 😉

 

Let's assume I'm setting up a main method that accesses files, like this pseudocode:

MainMethod:
  GetMainPath;
  for each File in MainPath do
  begin
    if FileExists(File) then
    begin
      if _IsFileValidForParsing(File) then
      begin
        if _FileIsA(File) then
          _ParseAsA(File)
        else
          _ParseAsB(File);
      end;
    end;
  end;

So, here I have these simple (or complex) methods that all use File:

_IsFileValidForParsing

_FileIsA

_FileIsB

_ParseAsA

_ParseAsB

...

 

and, if they are to be reusable, I always put FileExists() in each of these methods as well, because I don't know who calls the method and whether the caller already checked... just to be sure. But now, looking at the code, it feels like my OS will eventually say: WTF, the File exists, stop checking for it!!! 🙂

 

Is this a good approach? Any advice on how to look at this and not worry? Or should I change my approach?

 

Thanks!


Let's assume this is multiplatform, multi-everything, multi-device. It's OK, if not prudent, to assume you cannot know the cost of the call. In that case, introduce caching of the state flag. IMHO.

Or research the cost of the call on all possible platform and device combinations.
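That caching idea might look roughly like this sketch (`ExistsCache` and `CachedFileExists` are names invented here; the obvious trade-off is that a cached answer can go stale if the file appears or disappears later):

```pascal
uses
  System.SysUtils, System.Generics.Collections;

var
  ExistsCache: TDictionary<string, Boolean>;

function CachedFileExists(const FileName: string): Boolean;
begin
  // Ask the OS only once per file name; reuse the answer afterwards.
  if not ExistsCache.TryGetValue(FileName, Result) then
  begin
    Result := FileExists(FileName);
    ExistsCache.Add(FileName, Result);
  end;
end;

initialization
  ExistsCache := TDictionary<string, Boolean>.Create;
finalization
  ExistsCache.Free;
```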


Completely superfluous, then:

for each File in MainPath do
  // <- the file can disappear between these two lines
if FileExists(File) then
  // <- and it could also disappear between these two lines
If _IsFileValidForParsing(File) then

(and if you then decide to create the file, it could be there again)

etc.

FileExists and co. are total nonsense, since they only return a momentary state.

Put a try..except around any I/O and handle the last OS error if applicable.
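In Delphi terms, that "just attempt the I/O" approach might look like this sketch (assuming the file is loaded with a TStringList; the open itself becomes the existence test):

```pascal
uses
  System.Classes, System.SysUtils;

procedure LoadAndParse(const FileName: string);
var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    try
      SL.LoadFromFile(FileName); // no FileExists check: opening is the test
      // ... parse SL here ...
    except
      on E: EFOpenError do
        ; // missing, locked, or inaccessible at this exact moment - handle or log
    end;
  finally
    SL.Free;
  end;
end;
```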

 


Well...if you know your other methods do the file check, then don't do it outside of them. But only then.

38 minutes ago, Dany Marmur said:

assume you cannot know the cost of the call

Even though this project is Windows-only, you never know. I didn't think about this... constant file checking could prove a bit over the top when accessing network files, which is potentially slow.

16 minutes ago, Attila Kovacs said:

Completely superfluous, then:

for each File in MainPath do
  // <- the file can disappear between these two lines
if FileExists(File) then
  // <- and it could also disappear between these two lines
If _IsFileValidForParsing(File) then

(and if you then decide to create the file, it could be there again)

etc.

FileExists and co. are total nonsense, since they only return a momentary state.

Put a try..except around any I/O and handle the last OS error if applicable.

 

Valid point!

15 minutes ago, Sherlock said:

Well...if you know your other methods do the file check, then don't do it outside of them. But only then.

Well, all my methods have checks for invalid parameter values: nil pointers, integer < 0, empty strings... so I'm used to not relying on callers to always pass proper parameters. In this case, relying on the caller to have checked that the file exists goes against my experience... but maybe it's time for that to change.

19 minutes ago, Mike Torrettinni said:

not relying on callers to pass always proper parameters

Throw them an exception. Then they (your callers) will start checking by themselves and throwing exceptions to their callers, etc. In the end you will have a solid app.

Edited by Attila Kovacs
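A guard clause in that spirit might look like this (a sketch; `ParseFile` is a hypothetical method, not from the thread's actual code):

```pascal
uses
  System.SysUtils;

procedure ParseFile(const FileName: string);
begin
  // Fail loudly on a bad argument instead of silently returning.
  if FileName = '' then
    raise EArgumentException.Create('ParseFile: file name must not be empty');
  // ... actual work; any I/O failure raises its own exception ...
end;
```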


You shouldn't put checks for parameter validity in every method. Your code will just be full of that and you will drown in clutter. 

2 hours ago, Attila Kovacs said:

Throw them an exception. Then they (your callers) will start checking by themselves and throwing exceptions to their callers, etc. In the end you will have a solid app.

That's not how you do exception handling. You don't need to catch and re-raise. You can just let them float up to wherever in the code knows how to deal with them with finality. 

2 hours ago, David Heffernan said:

You can just let them float up to wherever in the code knows how to deal with them with finality. 

This makes sense, I guess. Not what I'm used to.

8 hours ago, Mike Torrettinni said:

Let's assume I'm setting up a main method that accesses files, like this pseudocode:


MainMethod:
  GetMainPath;
  for each File in MainPath do
  begin
    if FileExists(File) then
    begin
      if _IsFileValidForParsing(File) then
      begin
        if _FileIsA(File) then
          _ParseAsA(File)
        else
          _ParseAsB(File);
      end;
    end;
  end;

The FileExists() call becomes redundant if you make _IsFileValidForParsing() return False for a non-existent file. But really, I would suggest completely rewriting this loop to actually open the file, decide whether to call _ParseAsA() or _ParseAsB() based on the content of the opened file rather than just its filename, and wrap the loop code in a try..except to handle I/O errors. For instance:

 

procedure MainMethod;
var
  FileName: string;
  FS: TFileStream;
begin
  for FileName in GetMainPath do
  begin
    if _IsFileValidForParsing(FileName) then // <-- OK to leave in only if it checks just the filename and not the content ...
    begin
      try
        FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
        try
          if _FileIsA(FS) then
            _ParseAsA(FS)
          else if _FileIsB(FS) then
            _ParseAsB(FS);
        finally
          FS.Free;
        end;
      except
        // handle error as needed ...
      end;
    end;
  end;
end;

 

37 minutes ago, Remy Lebeau said:

The FileExists() call becomes redundant if you make _IsFileValidForParsing() return False for a non-existent file. But really, I would suggest completely rewriting this loop to actually open the file, decide whether to call _ParseAsA() or _ParseAsB() based on the content of the opened file rather than just its filename, and wrap the loop code in a try..except to handle I/O errors. For instance:

 


procedure MainMethod;
var
  FileName: string;
  FS: TFileStream;
begin
  for FileName in GetMainPath do
  begin
    if _IsFileValidForParsing(FileName) then // <-- OK to leave in only if it checks just the filename and not the content ...
    begin
      try
        FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
        try
          if _FileIsA(FS) then
            _ParseAsA(FS)
          else if _FileIsB(FS) then
            _ParseAsB(FS);
        finally
          FS.Free;
        end;
      except
        // handle error as needed ...
      end;
    end;
  end;
end;

 

That is definitely different from what I would do... I try to open and close files as fast as possible, but maybe this is not such a bad idea. So, I lock the file and use it until I don't need it anymore. Which could actually work, because 99% of the files are local.

22 hours ago, Mike Torrettinni said:

That is definitely different from what I would do... I try to open and close files as fast as possible, but maybe this is not such a bad idea. So, I lock the file and use it until I don't need it anymore. Which could actually work, because 99% of the files are local.

 

If you're worried about the parsing taking a relatively long time, then simply create a TMemoryStream, do LoadFromFile, and parse that instead.

Edited by David Schwartz
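That might look like this minimal sketch (`ParseStream` is a hypothetical routine standing in for whatever parsing you do); the file handle is held only while LoadFromFile runs:

```pascal
uses
  System.Classes;

procedure LoadThenParse(const FileName: string);
var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    MS.LoadFromFile(FileName); // file is opened, read, and closed right here
    MS.Position := 0;
    ParseStream(MS);           // parse from memory; no file lock held anymore
  finally
    MS.Free;
  end;
end;
```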
23 hours ago, Mike Torrettinni said:

That is definitely different from what I would do... I try to open and close files as fast as possible, but maybe this is not such a bad idea. So, I lock the file and use it until I don't need it anymore. Which could actually work, because 99% of the files are local.

Why are you so worried about reading from files? It's perfectly normal to open a file, read it, parse as you go, then close it. In fact, trying to do anything other than that is just going to create more complexity for you. To what benefit?

 

What is the problem here? What do you see going wrong with the simplest way to read files?

 

It looks like you are inventing a problem where none exists. 


Thank you guys, some good advice and comments that made me think! Not sure what I will do, but it's good to know some other options. I never considered options like loading the file, closing it, and then parsing, or keeping the file open for all operations... checking, identifying, parsing... new stuff 🙂

 

6 hours ago, David Schwartz said:

If you're worried about the parsing taking a relatively long time, then simply create a TMemoryStream, do LoadFromFile, and parse that instead.

Not sure if it's usable for my current case, but an interesting idea. I will remember it.

 

 

5 hours ago, David Heffernan said:

Why are you so worried about reading from files? It's perfectly normal to open a file, read it, parse as you go, then close it. In fact, trying to do anything other than that is just going to create more complexity for you. To what benefit?

 

What is the problem here? What do you see going wrong with the simplest way to read files?

 

It looks like you are inventing a problem where none exists. 

The simple way - that's how I'm doing it now, except I had all those FileExists checks at every step. I think it was mostly because I hadn't thought through the whole process; I was refactoring large methods and ended up with unnecessary checks.

 

I did start noticing some over-refactoring... 🙂


Load into memory, close, then parse is likely wasteful and risks memory exhaustion. 

 

I would urge you to choose the simplest solution that meets your needs, at every decision point. 


The Windows file system segments get cached. The first time you call FileExists, it might take a little bit of time to load the segment from disk if it's not cached already. But after that it would run very fast unless you loaded so much stuff to memory in between that it got flushed out of cache. It's not something I'd even worry about.

 

In fact, unless your files are really huge relative to free memory, they'd be cached as well. I don't know if the underlying Windows logic would load them in segment-sized buffers (4k or whatever) or try to suck the entire file into memory at once if it detects that there's sufficient free memory. Windows has gotten pretty darn good at managing its virtual memory.

 

Way back in DOS days, we worried about this stuff. But when you've got gigabytes of free memory and you're loading up files smaller than 1MB or so, it's mostly irrelevant.

 

I've got an app I built a few months ago that loads up all of the *.pas and *.dfm files it finds in an entire file tree ... several hundred of them. Some are over a MB in size, but the average size is around 80KB. It takes less than a second to load them all from my SSD into memory streams and attach them to the .Data property on a TTreeview. Then it takes 30 seconds or so to parse them all looking for whatever stuff I want using regular expressions. REs aren't the fastest things to use for parsing data; they're just a heck of a lot easier to use than building a dedicated parser.

 

Needless to say, as far as I can tell, the vast majority of the run-time overhead is taken up in the memory manager allocating new objects, copying data identified by the RE parser into them, and adding them into other containers.

 

And in most cases, the biggest delays are caused by the VCL UI components, usually because I failed to call DisableControls on visual components while loading them up. Even when I do that, UI progress updates STILL cause the lion's share of the processing delays! I mean, I can cut my parsing time in half if I don't tell the user what's going on, but that's poor UX design. Most people would rather be kept updated on long-running processes than have them performed in the dark with no idea how long they should take.

 

I cannot imagine how old and slow a computer you'd have to be running for calls to FileExists to take a noticeable amount of overhead. Maybe a 400MHz Celeron with 256MB of RAM and a 20GB HDD or so. We're talking something from the late 90s. I know there are small SBCs like that still being used for process-control apps, but even Raspberry Pis have more computing power than that!

 

I'd urge you to focus on the efficiency of the value-add you're providing, not waste time second-guessing the imagined inefficiencies of the OS. Get your program working, then look at where it's spending the most time. Most likely, about 98% of the actual execution time will be spent in your code or in the UI (e.g., VCL), not in the OS interfaces -- unless your app does a HUGE amount of interaction with the file system. Even DBMS systems exhibit amazingly low file system overhead most of the time.

56 minutes ago, David Schwartz said:

I cannot imagine how old and slow a computer you'd have to be running for calls to FileExists to take a noticeable amount of overhead.

Actually, I had a case where FileExists took more time than I wanted because I was processing a huge number of files (the record was 70,000 files in 8,000 folders). It took forever (about 60 seconds) the first time.

 

See my question here:

 

I managed to cut the original time to a quarter by, in addition to finding out whether the file exists, also getting its file age (which FileExists already retrieves and which I needed later on anyway) and caching it myself. Now the time is down to about 10 seconds on the first call, which still sometimes feels like an eternity.

 

But as you already said: Every additional call just hits the file system cache and is done in no time.

2 hours ago, dummzeuch said:

I managed to cut the original time into a quarter by in addition to finding out if the file exists also getting its file age (which FileExists already retrieves and which I needed later on anyway) 

FileExists() has not relied on file ages for many years. The original version of FileExists() did, but it proved buggy and gave wrong results at times due to errors while calculating the age, so it was rewritten to rely on file attributes instead (see Raymond Chen's "Why is GetFileAttributes the way old-timers test file existence?"), and later to include some permission checks, too (can't access the attributes? If it is because the file is locked or access is denied, the file exists).
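The attributes-based idea boils down to something like this sketch (not the actual RTL code, which also layers on the permission checks mentioned above):

```pascal
uses
  Winapi.Windows;

function ExistsViaAttributes(const FileName: string): Boolean;
var
  Attrs: DWORD;
begin
  Attrs := GetFileAttributes(PChar(FileName));
  Result := (Attrs <> INVALID_FILE_ATTRIBUTES) and
            ((Attrs and FILE_ATTRIBUTE_DIRECTORY) = 0); // a directory is not a file
end;
```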


My fault. When I said "getting the file age", I didn't mean calling GetFileAge, but getting it via GetFileAttributesEx directly, which of course also checks whether the file exists. That's slightly slower than GetFileAttributes but still much faster than calling both. And yes, I forgot that it was the *Ex variant, not the plain one called by FileExists.


Historically speaking, the old DOS FAT file system was notoriously inefficient when it came to looking up filenames. The original FAT directory segments held something like 63 x 8.3 filenames, and the last entry was a pointer to the next segment. Each time you asked for a file, the FAT file system would begin a linear search at the first directory segment and scan through each one to the very end of the list before it knew whether the file existed or not. It was always faster to get a list of files into a memory buffer first, sort it, then search that before probing the file system. But the real solution lay in splitting things up so you didn't have more than a few directory segments anywhere. That usually led to two directory levels: one that was simply A to Z, and the next containing folders or files beginning with that letter. Sometimes it helped a lot to go down an additional level.

 

Also, FAT directory segments were filled from front-to-back. If a file was deleted, the slot for it would be cleared. A new filename added to a folder would take the first available slot. 

 

So putting thousands of files ('n') into a FAT-based folder meant a lookup took, on average, n/2 tests for filename equality, i.e. O(n). Going from front to back probing with names, whether the folder was ordered or unordered, took the same amount of time.

 

The introduction of long filenames added further delays if you used them. I'm not sure what FAT32 did to improve this, if anything. Windows introduced caching of directory segments, which sped things up considerably.

 

From what I understand, NTFS introduced something closer to what Unix does, and resolved this inefficiency somewhat, although I never did anything that pushed the limits of an NTFS-based file system the way I used to do with old FAT-based file systems. With faster CPUs and more extensive caching in Windows, the problem seemed to fade out.

 

Part of the solution was to use FindFirst/FindNext, even if that meant building a list in memory to search against before asking the file system for anything.

 

Edited by David Schwartz
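That FindFirst/FindNext approach could be sketched like this: enumerate the folder once, then test membership in memory instead of probing the file system per name (a sketch; `CollectFileNames` is a name invented here):

```pascal
uses
  System.SysUtils, System.Classes;

function CollectFileNames(const Path: string): TStringList;
var
  SR: TSearchRec;
begin
  Result := TStringList.Create;
  Result.Sorted := True; // a sorted list lets IndexOf use a fast binary search
  if FindFirst(IncludeTrailingPathDelimiter(Path) + '*', faAnyFile, SR) = 0 then
  try
    repeat
      if (SR.Attr and faDirectory) = 0 then
        Result.Add(SR.Name);
    until FindNext(SR) <> 0;
  finally
    FindClose(SR);
  end;
end;
```

Afterwards, `Names.IndexOf(SomeName) >= 0` replaces a per-file FileExists probe, at the cost of the snapshot going stale if the folder changes.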

A more common issue could be whether the file is accessible for reading or not (being written to or otherwise held locked by another app, for example), and whether a retry is desirable.

