BusinessObjects Board

XML file generation extremely slow:

I am having issues generating a fairly simple schema while processing 176,000 records.

The job has been running for 18 hours so far and, by my estimate, will need 29 hours in total to complete.

I have been round and round on this trying to get the job to run faster. Can someone look at it and tell me if I am missing something? I have tried the xml_map and that didn't help.

Thanks
Josh
ds_export.zip (14.0 KB)


jlynn73 :us: (BOB member since 2009-10-27)

Check the monitor log to identify which transform is taking a long time.
If the nesting is causing the problem, then first try dumping the data to a flat file without building the nested structure and see how much time that takes.

How many rows do you have in the 3 source tables?

If you look at the optimized SQL, 3 queries are getting pushed down; see if you can join all these 3 tables, get the data in a single query, and process it.
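
For illustration only, a rough sketch of that single-query idea in a script; the table and column names here are assumptions, not taken from the actual job:

    import sqlite3

    # Hypothetical staging database and names, for illustration only.
    conn = sqlite3.connect("staging.db")
    rows = conn.execute("""
        SELECT n.NOTICE_ID, p.PAYMENT_ID, o.ORDER_ID
        FROM   NOTICE n
        LEFT JOIN PAYMENTS   p ON p.NOTICE_ID = n.NOTICE_ID
        LEFT JOIN ORDER_LINE o ON o.NOTICE_ID = n.NOTICE_ID
        ORDER BY n.NOTICE_ID  -- sorted by parent key for easy nesting later
    """).fetchall()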

Can you attach the monitor log?


manoj_d (BOB member since 2009-01-02)

Dumping the data takes approximately 10 seconds. The monitor doesn't show anything out of the ordinary except the 36 hours it took to finish (counts are all 1:1).

Table counts for this file (there are 20 files which get generated) are 176,450 × 2 and 177,870.

I'm not sure if I can do the join first and get what I need. If I can clean up all the child nodes in the nesting, then maybe it will work.

Playing around outside of DS, I created the same XML file in 62 seconds using a Python script: I dumped all 3 datasets and did the join in the script.
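
For reference, a stripped-down sketch of that kind of script. The file names, column names, and XML tags here are assumptions for illustration, not the project's real layout:

    import csv
    from collections import defaultdict
    from xml.sax.saxutils import quoteattr

    def index_by(path, key):
        # Group child rows by their parent key once, so each lookup is O(1)
        # instead of a scan over the whole child table per parent.
        groups = defaultdict(list)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                groups[row[key]].append(row)
        return groups

    payments = index_by("payments.csv", "NOTICE_ID")
    orders = index_by("order_line.csv", "NOTICE_ID")

    with open("notices.xml", "w") as out, open("notice.csv", newline="") as src:
        out.write("<Notices>\n")
        for n in csv.DictReader(src):
            nid = n["NOTICE_ID"]
            out.write(f"  <Notice id={quoteattr(nid)}>\n")
            for p in payments.get(nid, []):
                out.write(f"    <Payment amount={quoteattr(p['AMOUNT'])}/>\n")
            for o in orders.get(nid, []):
                out.write(f"    <OrderLine qty={quoteattr(o['QTY'])}/>\n")
            out.write("  </Notice>\n")
        out.write("</Notices>\n")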


jlynn73 :us: (BOB member since 2009-10-27)

I attempted to do the join before the NRDM nesting and, of course, ended up with multiple copies of the same Notice (multiple orders/payments children).

You end up with a mess that requires a join to fix. :hb:
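
For what it's worth, one hedged sketch of that cleanup in a script: sort the joined rows by the notice key and collapse the duplicates with a group-by (column names are assumed):

    from itertools import groupby
    from operator import itemgetter

    # 'rows' is the joined output, already sorted by NOTICE_ID. The join
    # repeats each Notice once per payment/order combination, so collapse
    # each group back into one Notice with de-duplicated child lists.
    def collapse(rows):
        for nid, grp in groupby(rows, key=itemgetter("NOTICE_ID")):
            grp = list(grp)
            yield {
                "notice_id": nid,
                "payments": sorted({g["PAYMENT_ID"] for g in grp if g["PAYMENT_ID"]}),
                "orders": sorted({g["ORDER_ID"] for g in grp if g["ORDER_ID"]}),
            }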


jlynn73 :us: (BOB member since 2009-10-27)

I tried something, though I'm not sure if it will help; I don't have a way to test this in my environment, as I don't have the data for these tables.

To achieve pushdown of the joins I used a DTT (Data_Transfer transform). You can avoid the DTT by adding the same source table multiple times, creating multiple parallel streams to read level1, payments, and order line, and then finally merging and nesting them.

I have suffixed the name of the DF with _UPDATED; take the necessary backups before importing this DF into your repo.

If you are on SP6, then import only the DF and the temp table that is used by the DTT; you can exclude the other objects from the import plan.
gt1_notif_mkxml_df_updated.zip (15.0 KB)


manoj_d (BOB member since 2009-01-02)

I pulled in your updated dataflow and tested without the DTT (no DI license).

I tested with a small file of 8 records, and the monitor shows it processing 8^4 = 4,096 records, which looks like a Cartesian product across the four streams. Maybe if the Notice level is created first, the orders/payments could be joined in the next query.

Files for this project have been relatively small up until recently, which forced me to revisit the problem. I wish I could split the files down into Nths, but the print shop doesn't want to deal with the mess of files it would create.

Thank you for taking the time to look at this.


jlynn73 :us: (BOB member since 2009-10-27)

Hi jlynn,

How did you add the Notice type and Notice objects to the output schema of the Query?

Thanks,
Naveen


Navi_79 (BOB member since 2015-04-21)

I let the target XML file define the output of the query.

The Notice Type node only occurs once. It isn't used by the print program, so it is populated with some meaningless constants. There is an array of Notices, each of which has child arrays of payments/orders.
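
Roughly, the output looks like this (tag names simplified and assumed, not the real schema):

    <NoticeFile>
      <NoticeType> ...constants, unused by the print program... </NoticeType>
      <Notice>
        <Payment/>
        <Payment/>
        <OrderLine/>
      </Notice>
      <Notice> ... </Notice>
    </NoticeFile>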

I have given up on using DS to generate these files. It took DS roughly 36 hours to create the output, which I can generate in a script in about 60 seconds.


jlynn73 :us: (BOB member since 2009-10-27)

Yes, scripting actually gives you the flexibility of deciding when to stop scanning, especially if the data is sorted.
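
As a sketch of what "stop scanning" can look like when both inputs are sorted by the parent key (names assumed), each child row is read exactly once:

    # Single pass over child rows sorted by parent key: collect the matches
    # for the current parent and stop, instead of rescanning the whole child
    # table for every parent.
    def attach_children(parents, children, key="NOTICE_ID"):
        it = iter(children)
        child = next(it, None)
        for parent in parents:  # parents also sorted by key
            kids = []
            while child is not None and child[key] < parent[key]:
                child = next(it, None)  # skip children with no matching parent
            while child is not None and child[key] == parent[key]:
                kids.append(child)  # the loop ends as soon as the key changes
                child = next(it, None)
            yield parent, kids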

In the case of DS, it is treating the data like a table: when trying to build the child arrays of payments and orders for each parent sequence, it is scanning the complete source payment and order tables to find the children. One way to prevent this is to stage all the data that you need to a temp table in a previous DF.

Create a second DF with these 3 staged tables as sources, and then build the XML output. Check the optimized SQL: you should see 3 SQL statements, one for the parent nodes and 2 for the child nodes, with a WHERE clause filtering records from payments and orders using the parent key.

In the updated DF that I provided, I was trying to build the child nodes first, but it looks like you have a large amount of data for the parent node itself, which again might cause a full scan (possibly in memory).

You can try this approach with your DF if you are not familiar with Python or other scripting languages.


manoj_d (BOB member since 2009-01-02)