Correct approach to building incremental update dataflows

Hi again,

today I was thinking about the most correct approach to building dataflows that do incremental data updates.

So our source database tables contain fields called “date_added” and “date_updated”.

When our workflow starts, it takes the current time into a variable called “$newMaxDateUpdated”, and it reads the last time the specific table was updated into a
variable called “$maxDateUpdated”. When the dataflow starts, it queries the source data with a condition like this: “where t.date_updated > {$maxDateUpdated} or t.date_added > {$maxDateUpdated}”.
These dataflows use table compare, and some of them also use history preserve.
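
A minimal sketch of the extraction query in plain SQL (“source_table” is just a placeholder; the {$…} tokens stand for the variable substitution, as in my condition above):

  -- $newMaxDateUpdated = current time, captured at workflow start
  -- $maxDateUpdated    = watermark recorded by the previous run

  select t.*
  from source_table t
  where t.date_updated > {$maxDateUpdated}
     or t.date_added   > {$maxDateUpdated};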

After the dataflow finishes, the value of the variable $newMaxDateUpdated is recorded in our table called “table_update_log”.
On the next run, the variable $maxDateUpdated gets its value from that table again, and so on…
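
Roughly like this in SQL (the column names of “table_update_log” are just my illustration, not the real ones):

  -- after the dataflow completes: store the new watermark
  update table_update_log
     set max_date_updated = {$newMaxDateUpdated}
   where table_name = 'source_table';

  -- at the start of the next run: read it back into $maxDateUpdated
  select max_date_updated
    from table_update_log
   where table_name = 'source_table';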

I was thinking about the time comparison: when the dataflow starts and the query runs, the operation may be quite fast, and somebody may update data in these tables at that same moment. Rows updated after $newMaxDateUpdated was captured but before the query ran would be read again on the next run, so there may be double rows afterwards in the future runs…

What do you think about this solution? Is anybody facing the same issues, and what solutions are you using?


bobmeediake :estonia: (BOB member since 2008-12-01)

Your approach sounds fine. The table compare should take care of the duplicates problem. (That is, you could set your $maxDateUpdated back an extra day or whatever and you’d re-read the same records again, but they’d just update, rather than insert, into the target.)
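
For example, something like this (just a sketch; the one-day overlap and the Oracle-style date arithmetic are my assumptions, adjust for your database):

  -- re-read everything changed since one day before the stored watermark;
  -- overlapping rows already exist in the target, so the table compare
  -- turns them into updates instead of inserts
  select t.*
  from source_table t
  where t.date_updated > {$maxDateUpdated} - 1
     or t.date_added   > {$maxDateUpdated} - 1;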


dnewton :us: (BOB member since 2004-01-30)