Mike Torrettinni 198 Posted April 17, 2019

I definitely enjoy refactoring old code into small reusable methods, but refactoring methods that deal with files always gives me the sense that a 'point of failure' is very close 😉 Let's assume I'm setting up a main method that accesses files, like this pseudo code:

MainMethod:
GetMainPath;
for each File in MainPath do
begin
  if FileExists(File) then
  begin
    if _IsFileValidForParsing(File) then
    begin
      if _FileIsA(File) then
        _ParseAsA(File)
      else
        _ParseAsB(File);
    end;
  end;
end;

So, here I have these simple (or complex) methods that all take File:

_IsFileValidForParsing
_FileIsA
_FileIsB
_ParseAsA
_ParseAsB
...

and, if they are to be reusable, I always put FileExists() into each of these methods as well, because I don't know who calls the method and whether the caller has already checked... just to be sure. So, now looking at the code it feels like my OS will eventually say: WTF, the file exists, stop checking for it!!! 🙂

Is this a good approach? Any advice on how to look at this and not worry? Or should I change my approach? Thanks!
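For illustration, a minimal sketch of one such reusable helper with the defensive check built in, as described above (the '.txt' extension rule is a made-up placeholder; assumes System.SysUtils in the uses clause):

// One of the reusable helpers, not trusting its caller:
// it re-checks existence itself before applying its own rule.
function _IsFileValidForParsing(const AFileName: string): Boolean;
begin
  Result := FileExists(AFileName) and                      // defensive re-check
            SameText(ExtractFileExt(AFileName), '.txt');   // hypothetical validity rule
end;

The question in this post is essentially whether that first operand is worth repeating in every helper.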
Guest Posted April 17, 2019

Let's assume this is multiplatform, multi-everything, multi-device. It's OK, if not prudent, to assume you cannot know the cost of the call. In that case, introduce caching of the state flag, IMHO. Or research the cost of the call on all possible platform and device combinations.
Attila Kovacs 629 Posted April 17, 2019

The FileExists check is completely superfluous, because:

for each File in MainPath do
  // the file can disappear between these two lines
if FileExists(File) then
  // it could also disappear between these two lines
If _IsFileValidForParsing(File) then
  // (and if you then decide to create the file, it could be there again) etc.

FileExists and co. are essentially meaningless here, since they only return a momentary state. Put a try..except around any IO and handle the last OS error if applicable.
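A minimal sketch of that advice - skip the existence probe, just attempt the IO and handle the failure (the ProcessFile name and the "report and skip" policy are illustrative, not from the post; assumes System.SysUtils and System.Classes in the uses clause):

procedure ProcessFile(const AFileName: string);
var
  FS: TFileStream;
begin
  try
    // Don't probe with FileExists first; just try the IO.
    FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
    try
      // ... parse FS here ...
    finally
      FS.Free;
    end;
  except
    on E: EFOpenError do
      // The file vanished, is locked, or access was denied: report and move on.
      Writeln(Format('Skipping %s: %s', [AFileName, E.Message]));
  end;
end;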
Sherlock 663 Posted April 17, 2019

Well... if you know your other methods do the file check, then don't do it outside of them. But only then.
Mike Torrettinni 198 Posted April 17, 2019

38 minutes ago, Dany Marmur said:
assume you cannot know the cost of the call

Even though this project is Windows-based only, you never know. Didn't think about this... constant file checking could prove a bit over the top when accessing network files - potentially slow.
Mike Torrettinni 198 Posted April 17, 2019

16 minutes ago, Attila Kovacs said:
FileExists and co. are essentially meaningless here, since they only return a momentary state. Put a try..except around any IO and handle the last OS error if applicable.

Valid point!
Mike Torrettinni 198 Posted April 17, 2019

15 minutes ago, Sherlock said:
Well... if you know your other methods do the file check, then don't do it outside of them. But only then.

Well, all my methods have checks for invalid parameter values - nil pointers, integers < 0, empty strings... so I'm kind of used to not relying on callers always passing proper parameters. In this case, relying on the caller having checked that the file exists goes against my experience... but maybe it's time that changes.
Attila Kovacs 629 Posted April 17, 2019 (edited)

19 minutes ago, Mike Torrettinni said:
not relying on callers always passing proper parameters

Throw them an exception. Then they (your callers) will start checking by themselves and throwing an exception to their callers, etc. In the end you will have a solid app.

Edited April 17, 2019 by Attila Kovacs
David Heffernan 2345 Posted April 17, 2019

You shouldn't put checks for parameter validity in every method. Your code will just be full of that and you will drown in clutter.
David Heffernan 2345 Posted April 17, 2019

2 hours ago, Attila Kovacs said:
Throw them an exception. Then they (your callers) will start checking by themselves and throwing an exception to their callers, etc. In the end you will have a solid app.

That's not how you do exception handling. You don't need to catch and re-raise. You can just let them float up to the place in the code that knows how to deal with them with finality.
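A minimal sketch of that idea - the low-level routine raises, nothing in between catches, and only the top-level loop that knows what failure means handles it (the routine names are made up for illustration; assumes System.SysUtils and System.Classes):

procedure ParseFile(const AFileName: string);
var
  FS: TFileStream;
begin
  // No FileExists probe and no local try..except: if the open fails,
  // the EFOpenError simply propagates to whoever called us.
  FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    // ... parse FS ...
  finally
    FS.Free;
  end;
end;

procedure ProcessAllFiles(const AFileNames: TArray<string>);
var
  FileName: string;
begin
  for FileName in AFileNames do
    try
      ParseFile(FileName);
    except
      // The one place that deals with a failed file "with finality":
      // log it and carry on with the rest.
      on E: Exception do
        Writeln(Format('%s: %s', [FileName, E.Message]));
    end;
end;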
Attila Kovacs 629 Posted April 17, 2019

I never mentioned any catch and re-raise.
Mike Torrettinni 198 Posted April 17, 2019

2 hours ago, David Heffernan said:
You can just let them float up to the place in the code that knows how to deal with them with finality.

This makes sense, I guess. Not what I'm used to.
Remy Lebeau 1396 Posted April 17, 2019

8 hours ago, Mike Torrettinni said:
Let's assume I'm setting up a main method that accesses files, like this pseudo code: ...

The FileExists() call becomes redundant if you make _IsFileValidForParsing() return False for a non-existing file. But really, I would suggest completely rewriting this loop to actually open the file, decide whether to call _ParseAsA() or _ParseAsB() based on the content of the opened file rather than just its filename, and wrap the loop code in a try/except to handle IO errors. For instance:

procedure MainMethod;
var
  FileName: string;
  FS: TFileStream;
begin
  for FileName in GetMainPath do
  begin
    if _IsFileValidForParsing(FileName) then // <-- OK to leave in only if it checks just the filename and not the content ...
    begin
      try
        FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
        try
          if _FileIsA(FS) then
            _ParseAsA(FS)
          else if _FileIsB(FS) then
            _ParseAsB(FS);
        finally
          FS.Free;
        end;
      except
        // handle error as needed ...
      end;
    end;
  end;
end;
Mike Torrettinni 198 Posted April 17, 2019

37 minutes ago, Remy Lebeau said:
The FileExists() call becomes redundant if you make _IsFileValidForParsing() return False for a non-existing file. But really, I would suggest completely rewriting this loop to actually open the file, decide whether to call _ParseAsA() or _ParseAsB() based on the content of the opened file rather than just its filename, and wrap the loop code in a try/except to handle IO errors. ...

That is definitely different from what I would do... I try to open and close files as fast as possible, but maybe this is not such a bad idea. So, I lock the file and use it until I don't need it anymore. Which could actually work, because the files are 99% local files.
David Schwartz 426 Posted April 18, 2019 (edited)

22 hours ago, Mike Torrettinni said:
I try to open and close files as fast as possible, but maybe this is not such a bad idea. So, I lock the file and use it until I don't need it anymore. ...

If you're worried about the parsing taking a relatively long time, then simply create a TMemoryStream, do LoadFromFile, and parse that instead.

Edited April 18, 2019 by David Schwartz
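A minimal sketch of that suggestion - read the whole file into memory up front so the handle is closed before any (possibly slow) parsing starts (the ParseFromMemory name is a placeholder; assumes System.Classes in the uses clause):

procedure ParseFromMemory(const AFileName: string);
var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    MS.LoadFromFile(AFileName); // the file is opened, read and closed right here
    MS.Position := 0;
    // ... parse MS at leisure; the file on disk is no longer held open ...
  finally
    MS.Free;
  end;
end;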
David Heffernan 2345 Posted April 18, 2019

23 hours ago, Mike Torrettinni said:
That is definitely different from what I would do... I try to open and close files as fast as possible, but maybe this is not such a bad idea. ...

Why are you so worried about reading from files? It's perfectly normal to open a file, read it, parse as you go, then close. In fact, trying to do anything other than that is just going to create more complexity for you. To what benefit? What is the problem here? What do you see going wrong with the most simple way to read files? It looks like you are inventing a problem where none exists.
Mike Torrettinni 198 Posted April 19, 2019

Thank you guys, some good advice and comments that made me think! Not sure what I will do, but it's good to know some other options. I never considered options like loading the file, closing it and then parsing, or keeping the file open for all operations... checking, identifying, parsing... new stuff 🙂

6 hours ago, David Schwartz said:
If you're worried about the parsing taking a relatively long time, then simply create a TMemoryStream, do LoadFromFile, and parse that instead.

Not sure if it's usable for my current case, but interesting idea. Will remember.

5 hours ago, David Heffernan said:
Why are you so worried about reading from files? ... It looks like you are inventing a problem where none exists.

The simple way - that's how I'm doing it now, except I had all those FileExists checks at every step. I think it was more because I hadn't thought through the whole process: I was refactoring large methods and ended up with unnecessary checks. I did start noticing some over-refactoring... 🙂
David Heffernan 2345 Posted April 19, 2019

Load into memory, close, then parse is likely wasteful and risks memory exhaustion. I would urge you to choose the simplest solution that meets your needs, at every decision point.
David Schwartz 426 Posted April 19, 2019

The Windows file system segments get cached. The first time you call FileExists, it might take a little bit of time to load the segment from disk if it's not cached already. But after that it will run very fast, unless you have loaded so much stuff into memory in between that it got flushed out of the cache. It's not something I'd even worry about. In fact, unless your files are really huge relative to free memory, they'll be cached as well. I don't know whether the underlying Windows logic loads them in segment-sized buffers (4k or whatever) or tries to suck the entire file into memory at once if it detects that there's sufficient free memory. Windows has gotten pretty darn good at managing its virtual memory. Way back in DOS days we worried about this stuff, but when you've got gigabytes of free memory and you're loading up files smaller than 1 MB or so, it's mostly irrelevant.

I've got an app I built a few months ago that loads up all of the *.pas and *.dfm files it finds in an entire file tree... several hundreds of them. Some are over a MB in size, but the average size is around 80 KB. It takes less than a second to load them all from my SSD into memory streams and attach them to the .Data property on a TTreeView. Then it takes 30 seconds or so to parse them all looking for whatever stuff I want using regular expressions. REs aren't the fastest things to use for parsing data; they're just a heck of a lot easier to use than building a dedicated parser.

Needless to say, as far as I can tell, the vast majority of the run-time overhead is taken up by the memory manager allocating new objects, copying data identified by the RE parser into them, and adding them to other containers. And in most cases, the biggest delays are caused by the VCL UI components, usually because I failed to call DisableControls on visual components while loading them up. When I do that, UI progress updates are STILL causing the lion's share of processing delays! I mean, I can cut my parsing time in half if I don't want to tell the user what's going on, but that's poor UX design. Most people would rather be kept updated on long-running processes than have them performed in the dark with no idea how long they should take.

I cannot imagine how old and slow a computer you'd have to be running for calls to FileExists to take a noticeable amount of overhead. Maybe a 400 MHz Celeron with 256 MB of RAM and a 20 GB HDD or so - we're talking something from the late 90's. I know there are small SBCs like that still being used for process-control apps, but even Raspberry Pis have more computing power than that!

I'd urge you to focus on the efficiency of the value-add you're providing, rather than wasting time second-guessing the imagined inefficiencies of the OS. Get your program working, then look at where it's spending the most time. Most likely, about 98% of the actual execution time is going to be spent in your code or in the UI (e.g. the VCL), not in the OS interfaces - unless your app is doing a HUGE amount of interaction with the file system. Even DBMS systems exhibit amazingly low file-system overhead most of the time.
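As an aside, a rough sketch of the "load into memory streams attached to tree nodes, with UI updates suppressed" setup described above (all names are hypothetical; Items.BeginUpdate/EndUpdate is used here as the TTreeView way of suppressing repaints while loading; assumes Vcl.ComCtrls, System.Classes, System.SysUtils, System.Types and System.IOUtils in the uses clause):

procedure LoadFilesIntoTree(ATree: TTreeView; const ARootDir: string);
var
  FileName: string;
  Node: TTreeNode;
  MS: TMemoryStream;
begin
  ATree.Items.BeginUpdate; // suppress repaints while adding hundreds of nodes
  try
    for FileName in TDirectory.GetFiles(ARootDir, '*.pas', TSearchOption.soAllDirectories) do
    begin
      MS := TMemoryStream.Create;
      MS.LoadFromFile(FileName);                // one quick read per file
      Node := ATree.Items.Add(nil, ExtractFileName(FileName));
      Node.Data := Pointer(MS);                 // caller must free these streams later
    end;
  finally
    ATree.Items.EndUpdate;
  end;
end;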
dummzeuch 1505 Posted April 19, 2019

56 minutes ago, David Schwartz said:
I cannot imagine how old and slow a computer you'd have to be running for calls to FileExists to take a noticeable amount of overhead.

Actually, I had a case where FileExists took more time than I wanted it to, because I was processing a huge number of files (the record was 70,000 files in 8,000 folders). It took forever (about 60 seconds) the first time. See my question here:

I managed to cut the original time to a quarter by - in addition to finding out whether the file exists - also getting its file age (which FileExists already retrieves and which I needed later on anyway) and caching it myself. Now the time is down to about 10 seconds on the first call, which still sometimes feels like an eternity. But as you already said: every additional call just hits the file system cache and is done in no time.
Remy Lebeau 1396 Posted April 19, 2019

2 hours ago, dummzeuch said:
I managed to cut the original time to a quarter by - in addition to finding out whether the file exists - also getting its file age (which FileExists already retrieves and which I needed later on anyway)

FileExists() has not relied on file ages for many years. The original version of FileExists() did, but it proved to be buggy and gave wrong results at times due to errors while calculating the age, so it was rewritten to rely on file attributes instead (see "Why is GetFileAttributes the way old-timers test file existence?"), and later to include some permission checks, too (can't access the attributes? If it is because the file is locked or access is denied, the file exists).
dummzeuch 1505 Posted April 19, 2019

My fault. When I said "getting the file age", I didn't mean calling GetFileAge, but getting it via GetFileAttributesEx directly, which of course also checks whether the file exists. That's slightly slower than GetFileAttributes, but still much faster than calling both. And yes, I forgot that it was the *Ex variant, not the simple one called by FileExists.
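A minimal sketch of that combined check - one GetFileAttributesEx call answers both "does it exist?" and "how old is it?" (the FileExistsWithAge name is made up for illustration; assumes Winapi.Windows in the uses clause):

// Returns True if AFileName names an existing file (not a directory);
// if so, LastWrite receives its last-write time from the same API call.
function FileExistsWithAge(const AFileName: string; out LastWrite: TFileTime): Boolean;
var
  Data: TWin32FileAttributeData;
begin
  Result := False;
  if GetFileAttributesEx(PChar(AFileName), GetFileExInfoStandard, @Data) then
  begin
    Result := (Data.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) = 0;
    if Result then
      LastWrite := Data.ftLastWriteTime;
  end;
end;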
David Schwartz 426 Posted April 22, 2019 (edited)

Historically speaking, the old DOS FAT file system was notoriously inefficient when it came to looking up filenames. The original FAT directory segments held something like 63 x 8.3 filenames, and the last entry was a pointer to the next segment. Each time you asked for a file, the FAT file system would begin a linear search at the first directory segment and scan through each one to the very end of the list before it knew whether the file existed or not. It was always faster to get a list of files into a memory buffer first, sort it, then search that before probing the file system. But the real solution lay in splitting things up so you didn't have more than a few directory segments anywhere. That usually led to something where you had two directory levels: one that was simply A to Z, and the next that contained folders or files beginning with that letter. Sometimes it helped a lot to go down an additional level.

Also, FAT directory segments were filled from front to back. If a file was deleted, its slot would be cleared, and a new filename added to the folder would take the first available slot. So putting thousands of files ('n') into a FAT-based folder would take on average O(n/2) tests for filename equality. Going front to back probing with names took the same amount of time whether the folder was ordered or unordered. The introduction of long filenames added further delays if you used them. I'm not sure what FAT32 did to improve this, if anything.

Windows introduced caching of directory segments that sped things up considerably. From what I understand, NTFS introduced something closer to what Unix does and resolved this inefficiency somewhat, although I never did anything that pushed the limits of an NTFS-based file system the way I used to with old FAT-based file systems. With faster CPUs and more extensive caching in Windows, the problem seemed to fade out.

Part of the solution was to use FindFirst/FindNext, even if this meant building a list in memory to search against before asking the file system for anything.

Edited April 22, 2019 by David Schwartz
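For reference, a minimal sketch of that FindFirst/FindNext approach - enumerate the folder once into a sorted list, then answer existence questions from memory (the ListFileNames name is hypothetical; assumes System.SysUtils and System.Classes in the uses clause):

// Enumerate a folder once; later "does X exist?" checks become fast in-memory lookups.
function ListFileNames(const ADir: string): TStringList;
var
  SR: TSearchRec;
begin
  Result := TStringList.Create;
  Result.Sorted := True;          // IndexOf then uses a binary search
  Result.CaseSensitive := False;
  if FindFirst(IncludeTrailingPathDelimiter(ADir) + '*', faAnyFile, SR) = 0 then
  try
    repeat
      if (SR.Attr and faDirectory) = 0 then
        Result.Add(SR.Name);
    until FindNext(SR) <> 0;
  finally
    FindClose(SR);
  end;
end;

Checking Result.IndexOf(SomeName) >= 0 on the returned list then replaces a per-file FileExists round trip to the OS.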
Lars Fosdal 1792 Posted April 22, 2019

A more common issue could be whether the file is accessible for reading or not - being written to or otherwise held locked by another app, for example - and whether a retry is desirable or not.
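A minimal sketch of such a retry, with an arbitrary attempt count and delay (the TryOpenWithRetry name and the numbers are illustrative; assumes System.SysUtils and System.Classes in the uses clause):

// Try to open a file that may be briefly locked by another process.
// Re-raises the final EFOpenError if every attempt fails.
function TryOpenWithRetry(const AFileName: string; MaxAttempts: Integer = 5): TFileStream;
var
  Attempt: Integer;
begin
  Result := nil; // not reached; keeps the compiler happy
  for Attempt := 1 to MaxAttempts do
    try
      Exit(TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite));
    except
      on E: EFOpenError do
        if Attempt = MaxAttempts then
          raise          // still locked after all attempts: let the caller decide
        else
          Sleep(200);    // brief pause before trying again
    end;
end;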
Sherlock 663 Posted November 10, 2022 (edited)

Just for completeness' sake it should be noted that FileExists will raise an exception if the current user has insufficient rights to read the file. Turns out that is completely false. Sorry about that.

Edited November 11, 2022 by Sherlock: My own stupidity