What's the maximum size of XML source file DI XI 11 can handle?

I currently have a 500 MB XML file on a brand new dual-processor Xeon server with 2 GB of RAM (Win XP). I am working on an XML splitter tool to break the main 500 MB XML file down into smaller pieces, but I was wondering if anyone knows of a maximum file size that ACTA*…err, Data Integrator can handle in a batch job?
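
In case it helps the discussion, here is a rough sketch of the kind of splitter I mean, in Python. It assumes the repeating records are direct children of the root element; the tag names, file names and chunk size are placeholders, not my real ones.

import xml.etree.ElementTree as ET

def split_xml(source, record_tag="RECORD", per_file=10000):
    # Stream the big file and write per_file records into each small file.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)                     # grab the root element early
    chunk, part = [], 0
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            root.clear()                        # drop finished records so memory stays flat
            if len(chunk) >= per_file:
                write_chunk(chunk, part, root.tag)
                chunk, part = [], part + 1
    if chunk:
        write_chunk(chunk, part, root.tag)

def write_chunk(records, part, root_tag):
    with open("part_%04d.xml" % part, "w", encoding="utf-8") as out:
        out.write("<%s>\n" % root_tag)
        out.writelines(records)
        out.write("</%s>\n" % root_tag)

split_xml("big_source.xml")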

I have never used XML as a source before, so if anyone can point me to any good info about using XML sources in DI XI as well, I would much appreciate it (and yes, I already searched the forum for XML-related posts…didn't come up with much).

-Matt

*Hello everyone, don’t mind my “Acta” humor…Wow, I just upgraded from ActaWorks 5.2 to DI 11.0.1.1 and sure am happy! And then I found this forum, and got even happier!!! Where have you been all my life!!! :smiley:


hoppers69 :us: (BOB member since 2006-03-08)

We have been waiting here all the time and wondered where you got lost!

Try the XML Pipeline transform in DI 11.5. It is meant to cut one huge XML down into pieces, so that

ALL_CUSTOMERS
+--ONE_CUSTOMER
    +---all its structure
+--ONE_CUSTOMER
    +---all its structure
+--ONE_CUSTOMER
    +---all its structure
...
...
...

will be shredded into one ONE_CUSTOMER structure after the other. It looks as if you are reading one small XML after another, each with just the ONE_CUSTOMER structure. This way, instead of processing a single ALL_CUSTOMERS file containing 500 ONE_CUSTOMER records in one huge row, the dataflow sees 500 separate ONE_CUSTOMER rows…


Werner Daehn :de: (BOB member since 2004-12-17)

Will this transform work when sourcing from a SQL Server 2005 table? We are having lots of problems (I'm assuming memory issues) when trying to extract large XML documents.


traider :us: (BOB member since 2007-03-01)

No, files only. Large XMLs are a problem.


Werner Daehn :de: (BOB member since 2004-12-17)

We are parsing XML from DB2 and SQL Server 2000. It's working fine: 100,000 rows take about 3 minutes. I guess it is not a memory issue. If you try different strategies you can figure it out. I haven't tried it with a file.


writefm (BOB member since 2006-08-30)

Caching 100,000 rows is no big deal. Try it with 10 million and you will get into trouble.


Werner Daehn :de: (BOB member since 2004-12-17)

Hi Werner,
We are finding it difficult to use the XML Pipeline because we need all fields in the XML file, so we are using a Query transform to flatten it. It works well with around 0.5 million records but fails with memory errors if the XML file has a few million rows in it.
We need your suggestions/advice on how to overcome this problem.
Thanks
Kumar


vkumar_b (BOB member since 2009-11-19)

Can you show us a hint of the XML structure, just the top levels, and what you want to accomplish? And the row counts.

Like:

I have an XML with the structure

<ROOT ATTRIBUTE1="xyz">
    <ORDER_DETAILS> ...main attributes and sub attributes... </ORDER_DETAILS>
    <ORDER_DETAILS> ...main attributes and sub attributes... </ORDER_DETAILS>
    <ORDER_DETAILS> ...main attributes and sub attributes... </ORDER_DETAILS>
    <ORDER_DETAILS> ...main attributes and sub attributes... </ORDER_DETAILS>
    ...
    ...
</ROOT>

ROOT: 1 time
ORDER_DETAILS: billions of times
attributes within one ORDER_DETAILS node: 100 KB of data on average

and I want to load all fields of ORDER_DETAILS into one flat structure, including ATTRIBUTE1.

We have limitations in that area, things we still have to optimize at some point. It is a matter of memory vs. speed vs. more clever logic, and these three areas need to be balanced. We can make sure you never run into memory issues, but if the performance then sucks, that is not much help.
But usually you can work around the problem; I just need to be certain your case is a typical one. If my example above is pretty close to yours, I would need to know what ATTRIBUTE1 is required for, because that would be the problem. Without that ATTRIBUTE1 you would build a dataflow with

XML -> XML_Pipeline -> Query_Unnest -> …

In the XML_Pipeline you say you are interested in ORDER_DETAILS only, and you leave all its sub-elements nested. So XML_Pipeline outputs one row per ORDER_DETAILS with just 100 KB of data each, whereas without it we would read one row of ROOT data that is gigabytes in size and the dataflow explodes.
The subsequent Query does the unnesting of ORDER_DETAILS as you do it today. You have just brought the XML down one level…
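
For what it is worth, the same two steps can be mimicked outside DI, just to illustrate what the XML_Pipeline and the unnesting Query each do. This is a Python sketch; the element name ORDER_DETAILS is taken from the example above, the file name and helper functions are made up.

import xml.etree.ElementTree as ET

def order_details(source):
    # "XML_Pipeline" step: emit one ORDER_DETAILS element at a time,
    # never materialising the whole ROOT in memory.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ORDER_DETAILS":
            yield elem                          # one ~100 KB record
            elem.clear()

def unnest(elem, prefix=""):
    # "Query" step: flatten one nested record into a single flat row (dict).
    row = {prefix + k: v for k, v in elem.attrib.items()}
    for child in elem:
        if len(child) or child.attrib:
            row.update(unnest(child, prefix + child.tag + "."))
        else:
            row[prefix + child.tag] = child.text
    return row

for rec in order_details("orders.xml"):
    print(unnest(rec))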


Werner Daehn :de: (BOB member since 2004-12-17)

Root (1 time)
  [header element] (1 time)
    -- nested (1 or a few of them)
    -- nested (1 or a few of them)
    -- nested (1 or a few of them)
  [transaction element] (millions of transactions/records)
    -- nested (a few of them), with elements of the nested structure such as PD, abc
    -- nested, with elements of the nested structure holding values such as 0, 1000, PD, S, 100
I want to flatten the file with all the fields of the XML file.
Please let me know if you need any other details to help me in this regard.


vkumar_b (BOB member since 2009-11-19)

That’s what I feared.
Not sure.

What would happen if we used the XML Pipeline for that element and its children? We would lose the extra attributes of the lower levels.

1000
PD --element of nested structure
S --element of nested structure
100

But these are just a few, so we could have another dataflow where we process just these and not the child schemas. That one would be small as well, since it does not include the millions of transactions.

The problem now is bringing these attributes back together: which STerr element belongs to RTransactions row 2? It could be the first, it could be the second.

Damn, no idea. On the other hand, given that your task is simply to convert XML to a flat file, you might be better off calling an external program via exec() that does just that.
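
Purely as an illustration of what such an external program could look like, here is a Python sketch (not DI functionality). HEADER and TRANSACTION are placeholder tag names for the once-per-file element and the millions-of-times element, and it assumes the header appears before the transactions and that every transaction carries the same fields.

import csv
import xml.etree.ElementTree as ET

def xml_to_flat(source, target):
    header = {}                                  # the once-per-file attributes
    with open(target, "w", newline="", encoding="utf-8") as out:
        writer = None
        for event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "HEADER":
                header = {c.tag: c.text for c in elem.iter() if not len(c)}
                elem.clear()
            elif elem.tag == "TRANSACTION":
                row = dict(header)               # repeat the header fields on every row
                row.update({c.tag: c.text for c in elem.iter() if not len(c)})
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=sorted(row))
                    writer.writeheader()
                writer.writerow(row)
                elem.clear()                     # keep memory flat for millions of rows

xml_to_flat("big_input.xml", "flat_output.csv")

You could then call something like this via exec() or a script step before the dataflow and read the resulting flat file with a normal file format.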


Werner Daehn :de: (BOB member since 2004-12-17)