Hi,
Need some help with “how to in Data Services”. I tried doing a search, but didn’t find anything.
We are migrating from an old ETL tool that uses SQL to define the source TO SAP Data Services 4.2. In the SQL, we have to use subselects on the RRN (Relative Record Number, aka Row ID) of our DB2 database for various sources, since the source tables insert a new record for every change but we only want the latest record for each customer, vendor, etc. We don’t care what the address was last month, we only care what it is today.
I’m told they use the RRN instead of Date/Time because time in JDE is HHMMSS, no fractions of a second, so it’s possible to have multiple max date/times, if somebody fat-fingered the Enter key.
So, is there ANY way to find the Latest Record/Max Row ID in this kind of scenario in Data Services.
e.g.
CUSTNBR ADDRESS DATE RRN
124 8080 FWR 01/01/2014 737
124 80801 FWR 01/01/2014 745
we only want to extract row id/rrn 745. The RRN is a FUNCTION in DB2, not a field in the table.
In your example, my preferred approach would be would be to read-in all the records, ordered by 1) the key (customer number, say), and 2) the RRN in descending order (in a Query transform, of course). Add a column X_ROWNUM_CUSTNBR with mapping gen_row_num_by_group(CUSTNBR) to generate integers that reset to 1 on every new CUSTNBR. Then, immediately downstream of that, in another Query, filter the old records out with X_ROWNUM_CUSTNBR = 1.
You could also push that subselect down using the PUSHDOWN() function, which just pushes down a string, verbatim, and adds it to the end of the WHERE clause generated in the “optimized SQL”. I prefer the slightly more transparent route above, which has the side-benefit of guaranteeing a single record-per-CUSTNBR even if you should find duplicates on CUSTNBR + RRN, but you may (or may not) get better performance pushing-down the correlated subquery.
Another way (come to think of it) would be to extract using a self join of the table in question:
SELECT
C1.CUSTNBR
,C1.[ATTR1]
,C1.[ATTR2]
,C1.[…]
,C1.RRN
CUST C1 INNER JOIN CUST C2 ON C1.CUSTNBR = C2.CUSTNBR
AND C1.RRN >= C2.RRN
GROUP BY
C1.CUSTNBR
,C1.[ATTR1]
,C1.[ATTR2]
,C1.[…]
,C1.RRN
HAVING
COUNT(*) = 1
I think that whole thing could push down. Would need to try the various approaches to see what would be fastest.