Oracle over (partition by)

Has anyone been able to use the Oracle over (partition by) functionality in DI? Is there an alternative to get the same results?


Eileen King :us: (BOB member since 2002-07-10)

Well, the first option would be to use the SQL Transform and type in the SQL statement of your choice. Obviously that renders the entire purpose of having an ETL tool useless. Similarly, you could create an Oracle database view that acts as your source “table”. Same argument.

To answer your question about whether there is a similar thing in DI, we would need to know what the exact requirement is. For example, a rank per partition can be created using the gen_row_num_per_group() function on a dataset that is sorted by the partition clause.
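
For illustration, this is the kind of Oracle analytic query that pattern stands in for (table and column names here are made up):

-- Oracle side: a rank per partition via an analytic function
SELECT mtr_no,
       reading_dt,
       ROW_NUMBER() OVER (PARTITION BY mtr_no
                          ORDER BY reading_dt) AS rank_in_group
FROM meter_readings;

-- DI side: sort the dataset by mtr_no, reading_dt in a Query transform,
-- then map gen_row_num_per_group(mtr_no) to the output column.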


Werner Daehn :de: (BOB member since 2004-12-17)

Here is the part of the code that I’m struggling with…


select distinct --/*+ INDEX(v QCTRL_MTR_READING_EFM_LG_IDX2) */
       v.mtr_no,
       g.segment_id,
       trunc(v.reading_dt, 'hh24') as reading_dt,
       g.segment_name,
       trunc(v.contract_dt, 'WW') as week_start,
       v.volume_as_del as raw_volume,
       (v.volume_as_del
          - avg(v.volume_as_del) over (partition by v.mtr_no, trunc(v.contract_dt, 'WW')))
       / nullif(2 * stddev(v.volume_as_del) over (partition by v.mtr_no, trunc(v.contract_dt, 'WW')), 0)
         as norm_vol,
       sysdate as update_date,
       user as update_user
from ...

Thanks for the help! :mrgreen:


Eileen King :us: (BOB member since 2002-07-10)

Well, the avg() part is no big deal: you simply split the source data into two streams, one of which performs the avg() per partition, and then you join this information back to each original row. Something similar to this: https://boc.sdn.sap.com/node/5025
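
Expressed as SQL for clarity (in DI the aggregate and the join-back would each be a Query transform; the source table name is a placeholder since your FROM clause is elided):

-- stream 1: aggregate per partition; stream 2: original rows; then join back
select s.mtr_no,
       s.contract_dt,
       s.volume_as_del,
       a.avg_vol
from source_table s
join (select mtr_no,
             trunc(contract_dt, 'WW') as week_start,
             avg(volume_as_del) as avg_vol
      from source_table
      group by mtr_no, trunc(contract_dt, 'WW')) a
  on a.mtr_no = s.mtr_no
 and a.week_start = trunc(s.contract_dt, 'WW')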

But the stddev() I have to give up on. You would need to calculate that yourself: http://en.wikipedia.org/wiki/Standard_deviation
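
If you do calculate it yourself, the sample standard deviation decomposes into plain aggregates that DI can push down or compute in a Query transform. A sketch against the same placeholder source:

-- sample stddev from simple aggregates:
-- sqrt( (sum(x^2) - sum(x)^2 / n) / (n - 1) )
select mtr_no,
       trunc(contract_dt, 'WW') as week_start,
       sqrt((sum(volume_as_del * volume_as_del)
             - sum(volume_as_del) * sum(volume_as_del) / count(*))
            / nullif(count(*) - 1, 0)) as stddev_vol   -- nullif guards 1-row groups
from source_table
group by mtr_no, trunc(contract_dt, 'WW')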

Is a database view an option to hide the complexity?


Werner Daehn :de: (BOB member since 2004-12-17)

The database view is our last option as far as choices go…

We were hoping to be able to do this within DI. It just doesn’t look like it’s going to work.

Since I can do this type of thing in Designer, I’m kind of surprised and disappointed that DI can’t handle standard database functions…maybe the next version???


Eileen King :us: (BOB member since 2002-07-10)

It is more than just standard database functions. The entire partition-over syntax and all the other analytic functions would need to be supported, and that for Oracle, SQL Server, DB2, and every other database. And if the function cannot be pushed down and has to be executed inside the engine, the engine has to support it too.

Frankly, we do not get a lot of requests of that kind. Analytic functions and statistical functions like stddev() are used to query a data warehouse, not to populate it.

Maybe we should look at your requirement from a bigger-picture perspective. Where do you read the data from, and why do you perform aggregation on it when loading the data warehouse?


Werner Daehn :de: (BOB member since 2004-12-17)

This is just one small table in the entire DW…

The rest of the tables are very straightforward: we are taking relational database tables and putting them into a data warehouse.

I believe we are going to handle it as a view as that is how it is currently being constructed.

I just expected that DI would offer the robust functions that are available through each of the databases…same as Designer does.


Eileen King :us: (BOB member since 2002-07-10)

:mrgreen:


Werner Daehn :de: (BOB member since 2004-12-17)

rofl.

No matter how much you have, people always want more. :lol:

  • Ernie

eepjr24 :us: (BOB member since 2005-09-16)

Hey…if Designer can do it…

I just figured that it was one of the options in DI.

I always want more…that’s what keeps me challenged (and challenging)!!!

:hb:


Eileen King :us: (BOB member since 2002-07-10)

First, you don’t need to support all the analytic functions. I would divide this issue into commonly used Oracle extended SQL for ETL versus SQL for analysis (ntiles, complex rank and statistical operations), plus basic partition query functionality. I don’t need the stats functions per se, and I can always decompose STDDEV into simpler terms (i.e. the square root of the average squared deviation from the mean).

I use this functionality extensively in Oracle just for the ability to perform calculations within sets of rows, for example on Schedule Lines data from SAP S&D. I need to consider all rows for a Sales Order/Item and calculate several interdependent values for that set of Schedule Lines. I’m not aggregating the data per se, although I do generate columns which are “partition” totals based on all Schedule Lines for an Order/Item, so I can ensure the individual line calculated columns add up.

I had tried several approaches in DI, but lacking read-ahead and read-behind functionality equivalent to Oracle’s LEAD/LAG, I finally just put my code into a SQL transform. For example:

select sales_order_number, sales_line_item_id, schedule_line_id,
       sum(shipped_qty_sl) over (partition by sales_order_num_item
           order by sales_order_number, sales_line_item_id, schedule_line_id) CUME_SHIP_SL,
       sum(backlog_qty_sl) over (partition by sales_order_num_item
           order by sales_order_number, sales_line_item_id, schedule_line_id) CUME_BACKLOG_SL
from sched_lines_fact
where SCHED_LINES_FACT.SL_FULLFILL_STATUS_CODE <> 'C'

Basically I needed a SUM in order to calculate a running total on each line and come up with the Allocated Not Shipped (ANS) quantity per schedule line in a FIFO manner, since SAP considers everything “delivered” even without an Actual Goods Issue Date, i.e. it shows the item as “Delivered” and includes both shipped and ANS quantities in “shipped qty”, which is not good enough. This is typical of cases where I need to know the sums over all rows and then add column values not stored in SAP. It’s not a complex query, but it shows line-item-level and “partition-level” numbers:

SO         LI     SL         CONF      SHIPI    SHIP_SL         BK     ANS_SL      ANS_I      PA_SL
---------- ------ ---- ---------- ---------- ---------- ---------- ---------- ---------- ----------
0000003337 000040 0001          0          1          0          0          0          0          0
0000003337 000040 0002          1          1          1          0          0          0          0
0000003337 000050 0001          3          3          3          0          0          0          0
0000003337 000060 0001          0          1          0          0          0          0          0
0000003337 000060 0002          1          1          1          0          0          0          0
0000003337 000070 0001          0          3          0          0          0          0          0
0000003337 000070 0002          3          3          3          0          0          0          0
0000003337 000080 0001          1          1          1          0          0          0          0
0000003337 000090 0001          0          1          0          0          0          0          0
0000003337 000090 0002          1          1          1          0          0          0          0

BOBJ Designer did not suddenly show up with full support for Oracle windowing/partition functionality; the most frequently used functions were added incrementally, starting with simple things like substitution of Oracle’s RANK. Subsequent versions added more support, with increasing complexity, more than you probably need for ETL work.

If you perform a lot of complex ETL where Oracle is the source, or in my case the DW source post-extract from SAP, you tend to run with what works best. I had set up approaches using traditional DI functionality, which resulted in reading large amounts of data into Job Server memory, versus pushing the entire operation down to the DBMS and having it run in seconds to a few minutes to process a million Schedule Lines and calculate and update several backlog and related columns.

I think if DI could support the basic concept or framework of dynamic partitioning and allow calculations, that’s doable. You just need the framework to generate SQL that partitions rows by keys in the database. I also had concerns that within a typical SAP S&D implementation I might need to reset or recalculate the entire table and run up against process memory limits in the Job Server (11.7 was not an option then).

BTW, it would be really handy for DI to be able to scan SQL transforms for table names in a “view where used” function. While I expect this currently uses an object id based on the repository, it shouldn’t be too hard to apply semantic checking to the text of SQL transforms, either.

If I could have done the same thing remotely close in efficiency in DI alone, I would have done so, but it wasn’t even close, and the complexity of inter-row calcs in DI when you don’t even know the number of rows in a set was too much.


Stracy :us: (BOB member since 2006-07-14)

Well, other than completely separate development paths, you make a good point :wink:

The difference is Designer needed all or most of the functionality out of the gate, DI does not. You could skip many of the more complex SQL for Analysis functions as a first pass…

Or, we can keep using DI to write SQL transformations for things it cannot do, but that’s not the path to hugs and puppies :smiley:


Stracy :us: (BOB member since 2006-07-14)

This might be worth a read for you:

https://boc.sdn.sap.com/node/5521

  • Ernie

eepjr24 :us: (BOB member since 2005-09-16)

Thanks, looking forward to that. Caught my eye right away when 11.7.x was announced.

Unfortunately it still isn’t an option, not yet. Upgrading a Solaris installation in full production mode in place is no small task, and currently my client is risk-averse about the upgrade. Too bad, considering we run on a 64-bit UNIX OS with a V490 for the Job Server; talk about a lost opportunity! And we do occasionally run into 2 GB memory limits, so we have plenty of other reasons.

Add to that the complexity that the JS/database servers are hosted by one vendor and the SAP system by another, and it gets murky testing without a complete parallel environment, which is not an option at present. Too bad :frowning:

It still fails to address operations across all rows of a logical partition (where there are more than 3 rows, anyway), but it’s darn useful. Wish I had it in 11.5.2…


Stracy :us: (BOB member since 2006-07-14)

In the same boat here, but not a risk-aversion problem. It is a matter of hours in the day, as there are never enough of them, especially on our Prod Linux box. We have a full replication of servers for Dev, but that does not solve the problem, since you have a repo upgrade to deal with, and Central Repo promotion gets hairy with dev-to-cert and cert-to-prod happening with repos in two different versions.

  • E

eepjr24 :us: (BOB member since 2005-09-16)

Now here’s one I missed -

And gen_row_num_per_group(expression) is a function that generates line numbers 1, 2, 3, 4, … per group.

That is truly useful, and a PITA with SAP data: you don’t know how many detail lines per header/document record you have just from the data. To work that out post-extraction requires a cumbersome sort or, in my case, using a Bad SQL Transform. :slight_smile: Being able to append that upon initial extract would be very helpful (and would replace a typical partition query I use today).

For example, I need to know how many schedule lines there are in a group for a given Sales Order/Sales Line Item, a very common thing. If I understand the function, it would replace the following, where I have to get the total number of lines, then go back and label each as the “nth” line in a subsequent step (not shown):

SELECT sales_order_number, sales_line_item_id, schedule_line_id,
       count(schedule_line_id) OVER (PARTITION BY SALES_ORDER_NUM_ITEM
           ORDER BY sales_order_number, sales_line_item_id) NUM_SCHED_LINES
FROM SCHED_LINES_FACT
WHERE SCHED_LINES_FACT.LI_FULLFILL_STATUS_CODE <> 'C'
  AND (SCHED_LINES_FACT.SALES_DOC_CATEGORY = 'C' OR SCHED_LINES_FACT.SALES_DOC_CATEGORY = 'I')
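
Without analytic functions, the same count per group can be produced the way Werner described earlier: aggregate in one stream and join it back to the detail rows in the other. A rough SQL rendering of that DI dataflow:

-- aggregate per group, then join back to each detail row
SELECT f.sales_order_number, f.sales_line_item_id, f.schedule_line_id,
       c.NUM_SCHED_LINES
FROM SCHED_LINES_FACT f
JOIN (SELECT SALES_ORDER_NUM_ITEM,
             count(*) AS NUM_SCHED_LINES
      FROM SCHED_LINES_FACT
      WHERE LI_FULLFILL_STATUS_CODE <> 'C'
        AND SALES_DOC_CATEGORY IN ('C', 'I')
      GROUP BY SALES_ORDER_NUM_ITEM) c
  ON c.SALES_ORDER_NUM_ITEM = f.SALES_ORDER_NUM_ITEM
WHERE f.LI_FULLFILL_STATUS_CODE <> 'C'
  AND f.SALES_DOC_CATEGORY IN ('C', 'I')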

It must have been a common SQL Transform or custom ABAP thing; another fine example of features being added for things you have to do outside the toolset! :wink:


Stracy :us: (BOB member since 2006-07-14)

Also useful in cases where you only have loose relationships in the source (the lines are with that order) but no actual keyed reference. It lets you create one for surrogate keys and other purposes without having to change the source system.

  • E

eepjr24 :us: (BOB member since 2005-09-16)

Also useful in SAP when, for example, Schedule Lines are subjected to a rescheduling process that collapses the lines, so the identifiers 10, 20, 30 can become 10, 30, or even 10, 40, where the last line holds a collapsed view of the former lines.

That sort of process falls in the same category: the value of the key no longer tells you anything about “how many” are in a group, and it makes a mess of your surrogate key assignment, since formerly assigned surrogate keys just disappear and you have gaps in your sequence. Max functions on natural key values also tell you nothing about “how many”.


Stracy :us: (BOB member since 2006-07-14)

The previous_row_value() function would not help with the cumulative sum problem, as the function works on the input dataset, but you need to add the current value to the previous row’s output value.

Example

Value   prev_row   cum_sum
1       NULL       1
3       1          4
5       3          9

For the second row, saying sum = value + previous_row_value(value) = 3 + 1 = 4 would work. But for the third row, all you know is the current 5 and the previous 3. There is no way to get access to the previous row’s output value of 4.

So you would have to build a cartesian product that then has to be grouped back to the original primary key. And even that only works well if all your “partition over” clauses are the same.

https://boc.sdn.sap.com/node/5025

(Watch out, the above example is not the exact same use case, just similar. In the above case each row of a group should get the same value; in the cumulative sum case each row would be joined to all the previous rows less than the current one within a group, so that each row gets a different value.)
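
In SQL terms, the cartesian-product-then-group-by approach would look roughly like this (column names borrowed from the schedule-lines examples above, just as an illustration):

-- join each row to itself and all earlier rows in its group, then sum
select cur.sales_order_num_item,
       cur.schedule_line_id,
       sum(prev.shipped_qty_sl) as cume_ship_sl
from sched_lines_fact cur
join sched_lines_fact prev
  on prev.sales_order_num_item = cur.sales_order_num_item
 and prev.schedule_line_id <= cur.schedule_line_id
group by cur.sales_order_num_item, cur.schedule_line_id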


Werner Daehn :de: (BOB member since 2004-12-17)