Mark Williams

Best way to refresh static data


I'm trying to work out the best way to refresh data loaded from a local file with data from the server.

 

Using FireDAC with PostgreSQL. However, the solution needs to be a general one, as it will eventually need to work for SQL Server, Oracle, etc.

 

I have a table (DOCS_TABLE) which can vary greatly in size. DOCS_TABLE has a timestamp field (LAST_UPDATED). As the name suggests, this records when the data in the record last changed.

 

When users open the app, if they haven't queried DOCS_TABLE previously, it is loaded from the server using a fairly complicated WHERE statement (COMPLICATED_WHERE_STATEMENT) which involves a number of joins to establish which records from the table the user is permitted to access. When the user closes the app, the data from DOCS_TABLE is stored locally along with a timestamp recording the date and time the data was last refreshed (STORED_TIMESTAMP).
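For illustration, the save-on-close step looks roughly like this (a simplified sketch rather than the actual code; DocsQuery, the file names and the INI layout are placeholders, and a TFDStanStorageBinLink is assumed to be present so binary storage is available):

    // Persist the in-memory copy of DOCS_TABLE plus the snapshot timestamp.
    // Uses FireDAC.Comp.Client, FireDAC.Stan.Intf and System.IniFiles.
    procedure SaveDocsSnapshot(DocsQuery: TFDQuery;
      const ADataFile, AMetaFile: string);
    var
      Meta: TIniFile;
    begin
      // Local binary copy of the dataset (encryption is applied separately).
      DocsQuery.SaveToFile(ADataFile, sfBinary);

      // STORED_TIMESTAMP for the next refresh. In practice the server's clock
      // is safer than the client's Now.
      Meta := TIniFile.Create(AMetaFile);
      try
        Meta.WriteDateTime('Snapshot', 'StoredTimestamp', Now);
      finally
        Meta.Free;
      end;
    end;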

 

Next time the app opens it loads the data from the locally stored file. It then needs to ensure the user is working with up-to-date data. 

 

At the moment I am running a refresh query: SELECT [fields] FROM DOCS_TABLE WHERE LAST_UPDATED > [STORED_TIMESTAMP] AND [COMPLICATED_WHERE_STATEMENT].

 

I use the resulting data from the refresh query to update the in-memory dataset holding DOCS_TABLE.
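In code, that refresh is roughly along these lines (a simplified sketch only; RefreshQuery, DocsData and the abbreviated field list are placeholders):

    // Pull only rows changed since the snapshot and merge them into the
    // locally loaded dataset.
    procedure RefreshChangedDocs(RefreshQuery: TFDQuery; DocsData: TFDMemTable;
      const StoredTimestamp: TDateTime);
    begin
      RefreshQuery.SQL.Text :=
        'SELECT id, last_updated /*, other fields */ FROM docs_table ' +
        'WHERE last_updated > :stored_ts AND /* COMPLICATED_WHERE_STATEMENT */ 1 = 1';
      RefreshQuery.ParamByName('stored_ts').AsDateTime := StoredTimestamp;
      RefreshQuery.Open;

      while not RefreshQuery.Eof do
      begin
        // Update the row if we already hold it, otherwise append it.
        if DocsData.Locate('id', RefreshQuery.FieldByName('id').AsInteger, []) then
          DocsData.Edit
        else
          DocsData.Append;
        DocsData.FieldByName('id').AsInteger :=
          RefreshQuery.FieldByName('id').AsInteger;
        DocsData.FieldByName('last_updated').AsDateTime :=
          RefreshQuery.FieldByName('last_updated').AsDateTime;
        // ...copy the remaining fields...
        DocsData.Post;
        RefreshQuery.Next;
      end;
      RefreshQuery.Close;
    end;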

 

This works, although it doesn't deal with records that were available at the time of the last local save but have since been deleted or had access revoked.

 

As such, within the app, I run a check to make sure the user still has access to any record before trying to do anything with it, but it's not a terribly elegant solution. It would be better if such items were removed soon after loading the locally saved data.

 

I have some thoughts on how to deal with this, which are below. However, I am concerned I may be overcomplicating things and that there may be much simpler solutions to this problem. 

 

  1. Load the data from the local file.
  2. Run a background thread to do the following:
  3. Run a query (ID_QUERY) to ascertain which rows are now available to the user:
    SELECT id FROM DOCS_TABLE WHERE [COMPLICATED_WHERE_STATEMENT]
     
  4. Check the locally saved data against the result of this query to see what rows are no longer available to the user and remove them.
  5. Build a list of ids from the locally saved data (EXISTING_ID_ARRAY).
  6. Check the locally saved data against the results from ID_QUERY to see whether there are any new records to be added and build a list of the ids (NEW_ID_ARRAY).
  7. Run the refresh query using the arrays: SELECT [fields] FROM DOCS_TABLE WHERE (id IN ([NEW_ID_ARRAY])) OR (id IN ([EXISTING_ID_ARRAY]) AND LAST_UPDATED > [STORED_TIMESTAMP]). Subject to my whole theory being cock-eyed, I am pretty sure NEW_ID_ARRAY is the way to go. The part that concerns me is EXISTING_ID_ARRAY. Whilst it will cut out the use of the COMPLICATED_WHERE_STATEMENT and enable the query to focus explicitly on a clearly identified group of records, I would think the size of the array could become a problem. Is there a law of diminishing returns with an IN clause? For example, if there were 1M records in the table and 20 items in the array, I suppose using EXISTING_ID_ARRAY must be quicker than using COMPLICATED_WHERE_STATEMENT. But what if the array contained 800K ids? I guess at that point it has to be significantly less efficient to use EXISTING_ID_ARRAY and more efficient to use COMPLICATED_WHERE_STATEMENT. (A rough sketch of steps 3 to 6 follows this list.)
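To make steps 3 to 6 concrete, this is roughly what I have in mind (a sketch only; all names are placeholders, error handling is omitted and it assumes System.Generics.Collections):

    // Steps 3-6: fetch the ids currently visible to the user, drop local rows
    // that are no longer visible, and split the rest into new/existing lists.
    procedure ReconcileIds(IdQuery: TFDQuery; DocsData: TFDMemTable;
      NewIds, ExistingIds: TList<Integer>);
    var
      VisibleIds: TDictionary<Integer, Boolean>;
      Id: Integer;
    begin
      VisibleIds := TDictionary<Integer, Boolean>.Create;
      try
        // Step 3: ID_QUERY
        IdQuery.SQL.Text :=
          'SELECT id FROM docs_table WHERE /* COMPLICATED_WHERE_STATEMENT */ 1 = 1';
        IdQuery.Open;
        while not IdQuery.Eof do
        begin
          VisibleIds.Add(IdQuery.FieldByName('id').AsInteger, True);
          IdQuery.Next;
        end;
        IdQuery.Close;

        // Steps 4 and 5: remove rows no longer visible and collect the ids kept.
        DocsData.First;
        while not DocsData.Eof do
        begin
          Id := DocsData.FieldByName('id').AsInteger;
          if VisibleIds.ContainsKey(Id) then
          begin
            ExistingIds.Add(Id);
            VisibleIds.Remove(Id); // whatever is left over must be new
            DocsData.Next;
          end
          else
            DocsData.Delete; // Delete moves the cursor to the next row
        end;

        // Step 6: ids visible on the server but not held locally.
        for Id in VisibleIds.Keys do
          NewIds.Add(Id);
      finally
        VisibleIds.Free;
      end;
    end;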

 

I appreciate that, without providing full details of the structure of DOCS_TABLE and the various joined tables, the data being retrieved from it and the full nature of the COMPLICATED_WHERE_STATEMENT, I may be asking for a comparison between apples and pears. What I am really interested in is whether my logic set out above is sound or idiotic, and any suggestions on how best to achieve what I am trying to achieve.


Why so complicated? Dump DOCS_TABLE and rebuild it from scratch - assuming you actually have access to the server. Otherwise you have to work with the data available in DOCS_TABLE anyway.



It can be a pretty massive table, which can take a long while to load. It obviously loads much quicker from the local file (even though encrypted). I am keen to avoid passing unnecessary traffic to the server, hence the suggestion of a local file updated with only the data necessary. There could be just a handful of records to update. I really don't want to download the lot again just for that.

 

I think this is a sensible way to approach it. I'm just not sure that the way I propose to handle it is the most sensible. But I am keen to avoid a full reload of the whole table.


Whatever you implement, you should always measure the performance against the brute force update.

 

If the server were a recent InterBase one, you could make it simple and fast with Change Views, but these are not available with other databases.


In a good database, the cost of complicated where statements usually is limited once you have tweaked your indexes.

Not sure you can get away from those where statements if the client will be receiving NEW rows once every now and then.

 

Not sure how good Postgre is with temp tables, but in MS SQL you could write your existing id array to a temp table, and do a join query.

Probably more efficient than trying to match thousands of IDs in an IN statement.
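Roughly what I mean, as a sketch using FireDAC Array DML (the table and column names are made up; MS SQL temp-table syntax is shown, PostgreSQL would use CREATE TEMP TABLE instead):

    // Push the local id list into a session temp table in one round trip,
    // then join against it instead of building a huge IN list.
    procedure LoadExistingIds(Con: TFDConnection; const Ids: TArray<Integer>);
    var
      Q: TFDQuery;
      I: Integer;
    begin
      Con.ExecSQL('CREATE TABLE #existing_ids (id INT PRIMARY KEY)');

      Q := TFDQuery.Create(nil);
      try
        Q.Connection := Con;
        Q.SQL.Text := 'INSERT INTO #existing_ids (id) VALUES (:id)';
        Q.Params.ArraySize := Length(Ids);
        for I := 0 to High(Ids) do
          Q.Params[0].AsIntegers[I] := Ids[I];
        Q.Execute(Length(Ids)); // Array DML: one batched round trip
      finally
        Q.Free;
      end;

      // The refresh then becomes a join, e.g.:
      //   SELECT d.* FROM docs_table d
      //   JOIN #existing_ids e ON e.id = d.id
      //   WHERE d.last_updated > :stored_ts
    end;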

 


25 minutes ago, Uwe Raabe said:

Whatever you implement, you should always measure the performance against the brute force update.

 

If the server were a recent InterBase one, you could make it simple and fast with Change Views, but these are not available with other databases. 

Unfortunately, I need to be able to offer a range of databases.

26 minutes ago, Lars Fosdal said:

In a good database, the cost of complicated where statements usually is limited once you have tweaked your indexes.

Not sure you can get away from those where statements if the client will be receiving NEW rows once every now and then.

It's a system for handling documents in litigation matters, so the number of documents can range from a few hundred to hundreds of thousands. Whilst a case is in its early stages, the documents will (depending on the nature of the case) be loaded in large tranches. When it gets closer to its conclusion, you will get far fewer new documents being added to the case. I guess I could do a count of how much new data there is and then decide whether to use the WHERE statement or pass in an array of ids.
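For what it's worth, the sort of check I have in mind (a rough sketch only; the threshold is plucked out of the air):

    // Count how many rows changed since the snapshot and pick a strategy.
    function UseIdArrayStrategy(Q: TFDQuery;
      const StoredTimestamp: TDateTime): Boolean;
    begin
      Q.SQL.Text :=
        'SELECT COUNT(*) AS cnt FROM docs_table ' +
        'WHERE last_updated > :stored_ts AND /* COMPLICATED_WHERE_STATEMENT */ 1 = 1';
      Q.ParamByName('stored_ts').AsDateTime := StoredTimestamp;
      Q.Open;
      try
        Result := Q.FieldByName('cnt').AsInteger < 1000; // arbitrary threshold
      finally
        Q.Close;
      end;
    end;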

 

30 minutes ago, Lars Fosdal said:

Not sure how good Postgre is with temp tables, but in MS SQL you could write your existing id array to a temp table, and do a join query.

Probably more efficient than trying to match thousands of IDs in an IN statement.

Not sure, but I will look into it. However, I need a solution that will work for Postgre, Oracle and MS SQL at the very least, so it has to be compatible across the board, although I suppose I could write a different routine for each server if needs be.

 

As for the IN clause, if it is particularly large in relation to the total number of records, I could just load the lot via a thread and then dump the static data when I have the new data. If it's not too large relative to the total number of records, but still relatively large for an IN clause, I guess I could submit it in batches.
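Something like this is what I have in mind for the batching (a rough sketch; the batch size and names are arbitrary, and it assumes System.Math and System.Generics.Collections):

    const
      BatchSize = 500; // arbitrary; worth measuring

    // Run the refresh in batches of ids to keep each IN clause small.
    procedure RefreshInBatches(Q: TFDQuery; const Ids: TList<Integer>);
    var
      Start, I: Integer;
      InList: string;
    begin
      Start := 0;
      while Start < Ids.Count do
      begin
        InList := '';
        for I := Start to Min(Start + BatchSize - 1, Ids.Count - 1) do
        begin
          if InList <> '' then
            InList := InList + ',';
          InList := InList + IntToStr(Ids[I]);
        end;

        Q.SQL.Text :=
          'SELECT /* fields */ id FROM docs_table WHERE id IN (' + InList + ')';
        Q.Open;
        // ...merge the batch into the local dataset as before...
        Q.Close;

        Inc(Start, BatchSize);
      end;
    end;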


Hi,

 

Have you measured what is taking a long time?

    Is it the retrieval time from the SQL server? Make sure the SQL optimizer is using the expected index.

    Is it downloading data to the local store? Try compression. Or is it the local loading time?
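To get a rough feel for where the time goes, something as crude as this will do (a sketch only; it assumes System.Diagnostics and that a FireDAC binary storage link is present for the local file):

    // Rough timing of the server fetch vs. loading the local snapshot.
    procedure MeasureLoadTimes(ServerQuery: TFDQuery; LocalData: TFDMemTable;
      const LocalFile: string);
    var
      SW: TStopwatch;
      ServerMs, LocalMs: Int64;
    begin
      SW := TStopwatch.StartNew;
      ServerQuery.Open;                            // retrieval from the server
      ServerMs := SW.ElapsedMilliseconds;

      SW := TStopwatch.StartNew;
      LocalData.LoadFromFile(LocalFile, sfBinary); // loading the local snapshot
      LocalMs := SW.ElapsedMilliseconds;

      // Log ServerMs and LocalMs however you prefer; the split tells you
      // whether the bottleneck is the query or the local load.
    end;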


Using IN (long list) is not a good option in ANY database. As others suggested, use a proper table to insert the IDs and join with that table. If some SQL server doesn't support session tables, you can simulate one by physically creating a table with two columns, one with some user identification (user name, user login, machine name or machine IP) and the other an ID. But do use a join to retrieve those rows.

 

 


20 hours ago, Clément said:

Is it the retrieval time from the SQL server? Make sure the SQL optimizer is using the expected index.

    Is it downloading data to the local store? Try compression. Or is it the local loading time?

I'm not sure how to measure how much time it takes to retrieve from the server and how much to load locally. However, I know that loading from a static file is significantly faster than loading from the server, so I'm pretty sure the bottleneck lies with the retrieval time.

 

I'll have a look at compression. Haven't used it so far and didn't know FireDAC supported it. Although my principal goal is to minimize resources server side.

 

20 hours ago, Clément said:

Using IN (long list) is not a good option in ANY database. As others suggested, use a proper table to insert the IDs and join with that table. If some SQL server doesn't support session tables, you can simulate one by physically creating a table with two columns, one with some user identification (user name, user login, machine name or machine IP) and the other an ID. But do use a join to retrieve those rows.

Noted. I will try and do it via temp tables.


Given that various databases have significantly different ways of storing files and optimizing queries, I would hide the access behind a REST service. I'd go so far as to have entirely separate services for the various databases I needed to support. That would make it possible to do compression in a generic way, not having to rely on the database capabilities. It also would allow the client to be "ignorant" of the database specifics - and allow you to do changes and optimizations server-side without having to change the client.

 

2 hours ago, Lars Fosdal said:

I would hide the access behind a REST service

I've not really looked into the possibility of using REST. If I'm not mistaken, I would need the Enterprise edition of Delphi rather than the mere Professional to implement REST services. Not a deal breaker in itself, but I am not convinced that I need to go down that route (time is a major constraint for me at the moment). FireDAC seems to make it fairly simple to change horses between different databases. Isn't it then a question of ensuring your queries etc. are compliant SQL and, if not, adapted as necessary for each database you support?


I have fallen into the trap of underestimating the complexity of cross-database code before. 

It is no walk in the park. Each database has numerous functions / best practices that go beyond "standard" SQL.

 

Consider MS SQL's FileStream vs Oracle SecureFile / LOBs. The first is documented for FireDAC, the second is not.

 

Do you need to encrypt the traffic between the client and the server? 

FireDAC SQL Server driver supports encryption. No mention of it in the Oracle FireDAC doc, or the PostgreSQL FireDAC doc.

 

Off-topic:

Down the line, you may need to support Azure, AWS, Oracle Cloud, etc.

Just curious: Why must the documents be downloaded? Due to usage requirements in locations without network access?

 

6 hours ago, Lars Fosdal said:

It is no walk in the park. Each database has numerous functions / best practices that go beyond "standard" SQL.

Noted. Just working in Postgres at the moment and trying to finish an app. It will be offered in Postgres initially, with connectors for other DBs offered when requested. If at that point I experience any issues, I may then opt for a REST server.

6 hours ago, Lars Fosdal said:

Do you need to encrypt the traffic between the client and the server? 

FireDAC SQL Server driver supports encryption. No mention of it in the Oracle FireDAC doc, or the PostgreSQL FireDAC doc.

Yes, it's there for Postgres also. You use the PGAdvanced parameter of the connection, which gives you access to PostgreSQL's database connection control functions.
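For anyone else looking for it, this is roughly what I mean (a sketch; the server and database names are placeholders and the exact sslmode value depends on your setup):

    // Pass libpq options through the PostgreSQL driver's PGAdvanced parameter.
    procedure ConfigurePgSsl(Con: TFDConnection);
    begin
      Con.DriverName := 'PG';
      Con.Params.Add('Server=db.example.local');    // placeholder
      Con.Params.Add('Database=litigation_docs');   // placeholder
      Con.Params.Add('PGAdvanced=sslmode=require'); // libpq connection option
    end;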

6 hours ago, Lars Fosdal said:

Why must the documents be downloaded?

They don't have to be. But with large amounts of data I find it is much faster to load the data from a local file and refresh it from the server in the background.

3 hours ago, Mark Williams said:

They don't have to be. But with large amounts of data I find it is much faster to load the data from a local file and refresh it from the server in the background

But you would get rid of all the trouble of syncing the data if you kept the data on the server.

With a REST service you will probably be able to load the data much faster, and save a lot of bandwidth - at least if you are on a slow network.


42 minutes ago, Hans J. Ellingsgaard said:

But you would get rid of all the trouble of syncing the data if you kept the data on the server.

I understand that, but I am keen to speed up the user experience at the client end and reduce traffic at the server end, and doing it by way of a local file does that.

43 minutes ago, Hans J. Ellingsgaard said:

With a REST service you will probably be able to load the data much faster, and save a lot of bandwidth

I will now demonstrate my ignorance of REST services. I don't understand how it will result in faster loading of the same data and less bandwidth.


If you have thousands of documents related to a case - how many of those would be opened in the same sitting?

If you are able to reach the server at any time, I would synchronize the catalog only and pull the documents on demand to an MRU cache.

You could even keep MFU statistics to pre-load the most frequently used docs at first connect.

 

Personally, I would consider having it all server-side with a web UI.

 

Then again - this is the kind of problem that Office 365 and SharePoint really excels at - including tried and tested security models from all angles.

 

17 minutes ago, Lars Fosdal said:

If you have thousands of documents related to a case - how many of those would be opened in the same sitting?

I think there may be some confusion. DOCS_TABLE does not contain the actual documents; rather, it contains only data relating to the documents (such as date, author, file type, when uploaded, etc.). I don't download all the documents in one hit, just the data relating to them. The documents are stored on the server as files and downloaded only when they are needed.

 

I could (and did until recently) just load the data from the database on startup. However, this obviously gets progressively slower as the number of records increases. It also struck me as pointless downloading the data time after time where it had not changed or was little changed. So I thought it would be better to store the data locally. For a large dataset (including heavy duty encryption of the local file) I get around a 20% time saving and a lot less traffic on the server. 

 

The actual documents when downloaded are cached (if that's what the user specifies) in a temp folder.

 

28 minutes ago, Lars Fosdal said:

 

Personally, I would consider having it all server-side with a web UI.

It is all server side save for some caching. Whilst there isn't a web UI, it is not a path I want to go down. I have used them extensively in the past and I don't think it is appropriate for the current usage for various reasons. Quite happy with a desktop app using internet components and/or FireDAC. It works well and I am a long way down the road.

 

32 minutes ago, Lars Fosdal said:



Then again - this is the kind of problem that Office 365 and SharePoint really excels at - including tried and tested security models from all angles.

That's as may be. However, I am 5 years down the road and the software works as I want it. I am thinking of changing the way I load and refresh data, not of throwing out the baby with the bathwater! :classic_smile:


You know me. I am always in a state of bewilderment 🙂

 

Based on your updated sync requirements description, I'll stop envisioning a new generation of your system, and instead tell you about how I solved a similar need for synchronization.

 

I did distributed stock, energy and weather market data databases for well over a decade. The solution was database agnostic and the central server generated change instructions - which technically were binary scripts for adding, changing or removing data. The change scripts were sequenced in chronological order by changes done at the central server, so to do a refresh the client requested an update from the last sync point and then converted the list of change instructions to operations relevant to the local database. The change instruction format remained unchanged even though the underlying databases changed from flat files, to client-side Paradox, to server-side SQL over a TCP connection. I still shudder at the memory of having terabytes of Paradox tables that I couldn't touch except through the change instructions 😛

 

Today, these data would have been retrieved through a REST interface.

5 hours ago, Lars Fosdal said:

Based on your updated sync requirements description, I'll stop envisioning a new generation of your system, and instead tell you about how I solved a similar need for synchronization.

I'm not convinced I have such a major issue here, but only time will tell and I'll have to make a judgement call at that time.

 

5 hours ago, Lars Fosdal said:

Today, these data would have been retrieved through a REST interface.

Give it a REST! Joking aside, I get the message. REST is the way to go. Although by the time I get round to looking at it, it will probably be out to pasture with the brontosauri.
