I have already searched and found a reply on the topic of creating MD5 hashes … but it wasn’t quite what I wanted.
I’m working on a DW build where there is a requirement to create MD5 hashes of the keys and, separately, of the staging data, in order to speed up the Table Comparison process in the load. The TC performs very poorly without the hashes.
I originally solved the problem using the Python functionality exposed in the Data Services DQ components (User-Defined transform). It worked fine and was very fast, but I was then told that the licence key in use was only a temporary one; when the real one arrived, it disabled all my custom MD5 functions.
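For reference, the core of what the Python code did was nothing exotic: concatenate the key (or compare) columns and take the MD5 hex digest with hashlib. The User-Defined transform wiring is omitted here and the column values are invented, so treat this as a minimal sketch of the hashing logic only:

import hashlib

def md5_of_fields(*field_values):
    """Concatenate field values with a delimiter and return the MD5 hex digest.

    A delimiter is used so that ('AB', 'C') and ('A', 'BC') do not collide.
    None is mapped to an empty string; all values are coerced to str.
    """
    joined = "|".join("" if v is None else str(v) for v in field_values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical key columns, purely for illustration:
print(md5_of_fields("CUST001", 42, "2009-01-31"))  # prints a 32-character hex string

The resulting hex string can then be mapped to a single output column, so the downstream compare only has to look at one value per row instead of all the individual fields.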
I wondered if anyone:
Knows how I can continue to use the Python code … or
Has written a DI custom function to generate an MD5 hash … or
Knows how I can implement a new piece of Java/whatever code as a DI transform
I have also written a small generator in VB.NET, and it works but the performance is rubbish.
We have a hash function but do not expose it externally. But before we check options here, may I ask why the Table Comparison transform is slow? Depending on its mode, I see throughput of up to 80,000 rows per second per thread.
TC is (currently) using Sorted Input. Given the size of the target table, I’ve tried all the options!
The input table has 4 key fields and about 65 compare fields (about 67 million rows).
The comparison table is about 700 million rows and is properly indexed. I can generate an ATL tomorrow and post it to this thread.
We made a major improvement to sorted input in 12.1. Are you on that version, by any chance?
The comparison itself is not the issue; it is finding the matching row. So if you go down the MD5 function route, I would say you won’t change anything except make it even slower. It would be interesting to see what you did in Python, though.
A completely different approach could be to build a very simple dataflow like
source -> query -> target
and set the target to autocorrect load. The goal should be to get a full pushdown into an Oracle MERGE statement.
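To spell out what that buys you: with autocorrect load pushed all the way down, the database itself performs the "update when matched, insert when not matched" logic (an Oracle MERGE), so no rows have to be pulled back into the engine for comparison. A toy Python sketch of that upsert semantics, with invented table and column names purely for illustration:

# Pretend target table keyed on the business key: key tuple -> row dict.
target = {}

def autocorrect_load(target, source_rows, key_fields):
    """Upsert each source row into the target, keyed on key_fields.

    This mirrors what a MERGE does: update the row when the key already
    exists, insert it when it does not.
    """
    for row in source_rows:
        key = tuple(row[k] for k in key_fields)
        target[key] = row  # update when matched, insert when not

source = [
    {"cust_id": 1, "name": "Alice", "balance": 100},
    {"cust_id": 2, "name": "Bob", "balance": 250},
]
autocorrect_load(target, source, key_fields=["cust_id"])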
btw, what database are we talking about? An ATL would certainly help.
Re the Python, I’ve attached an ATL for you to look at. This is not the problem table for me, since it’s reasonably small, and it is at the import step where I generate the MD5 (prior to the ODS load).
I will give the simpler flow a try and let you know the results …
FYI, I had a recent conversation with the Director of EIM Product Management, and this Adapt was closed with no action taken. As far as he was aware, there is no current initiative to provide MD5 capability in Data Services. If this is something that has community support, I would recommend putting it into Idea Place so it can get some exposure and feedback.