BusinessObjects Board

DI/DS Improvements & Suggestions

Post your ideas here, and see what others might have to say (would others use the suggested functionality too? If so it might give more “weight” to the idea within BO.)

===================

Here is a collection of ideas from a previous thread, updated and duplicates removed, grouped into themes.

Usability, Developer Productivity & Built-in transforms

  • My ‘enhancement’ would be for BusObj to concentrate on improving the UI of Designer - I’ve found it particularly unintuitive, clumsy, slow and, most significantly, highly unstable. The last point is very frustrating and will prevent me from using DI again, or indeed recommending it to anyone else, until the product improves. It crashes/hangs on me at least 4 or 5 times a day. As far as I can tell, it happens whenever DI needs to read/write to the local or central repositories.

  • Reusable column-level mappings (or at least lookup()s). Lookups are a pain to code, and the UI wizard isn’t much faster. Often you’re doing the same lookup in many dataflows - say, to the Customer table. It would be nice to create a library of lookups which can simply be “pasted” into your mapping with a click. Related: the idea of a library of lookup()s is very interesting; perhaps the user could build his own library and access it by simply clicking a dedicated tab in the object library window.

  • How about a Lookup transform? It’d be similar to a Query transform, except every column mapping would have a checkbox (lookup or not), and if lookup, it would expedite the process of defining the lookup – making it really easy for the designer.

  • Different icons on the canvas if the output has the delete or drop & create option set

  • Copy & paste of canvas objects, both within the same canvas and between canvases. Related: more “copy/paste” of transforms and scripts, and drag-and-drop of dataflows into workflows. (see Why IF and WHILE conditional cannot be reuse? )

  • Convert embedded data flow objects to canvas objects.

  • Sort by owner within a datastore. They are ordered when importing metadata, but not on object display

  • Ability to Reimport a table right within the dataflow.

  • Check in objects from the Local repository, vs. having to find them in the central repo.

  • In a Query step, you can click on a column on the right pane and the “used” source columns will get highlighted on the left pane. How about doing the reverse, too? Click on a source column and quickly see which target columns are using it.

  • If columns are used only in the join/WHERE clause of a Query (and not in a mapping), there’s no indication that they are used or mapped. A separate kind of highlight would be helpful on those source columns.

  • Improvements to version control / team development, e.g.:

  • When looking at objects in the Local Object Library be able to identify if they are checked in or out, who by and date of check in or out.
  • Use usernames rather than database schema names
  • ADAPT01130292 - In the central repo, when right-clicking to view the History of an object, show details of check-outs (ones that weren’t undone), not just check-ins, especially if the object is currently checked out. Also prompt for comments which get recorded at check-out time, not just check-in.

  • The ability to change a large number of mappings in a Query Transform to a constant value (say, null) in one step.

  • It would be convenient if from within DI Designer, you could right-click on a datastore table and do not just a “view where used”, but a “View Lineage/Impact”. This menu option would launch the DI web client and bring you right to the lineage/impact information for the object in question.

  • Aggregator transform to enable aggregation of columns based on grouping by selected columns.

  • More keyboard shortcuts – some things like “validate all” don’t have keyboard commands, which is bad from a 508 (accessibility) perspective as well as for productivity.

Connectivity, Database support, SQL generation

  • Identity Insert - I would like to be able to use SQL Server IDENTITY columns as designed. This would allow me to parallel-load tables and do INSERT INTO … SELECT … statements with full pushdown. Related: the Key Generation transform should use the database max value, not an in-memory calculation, to increment the key for every row. Related: this would also work for Oracle with a default ‘sequence’.NEXTVAL.
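
As a hedged sketch (table and column names are invented for illustration), a fully pushed-down load with database-generated keys would look something like this:

    -- SQL Server: the IDENTITY column is omitted and assigned by the database,
    -- so the whole load can be pushed down as one INSERT ... SELECT.
    INSERT INTO dw.DimCustomer (CustomerName, Region)
    SELECT CustomerName, Region
    FROM   stg.Customer;

    -- Oracle: the same effect via a sequence default (or trigger) on the key column.
    INSERT INTO dw_dim_customer (customer_name, region)
    SELECT customer_name, region
    FROM   stg_customer;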

  • Support for a RANK function.

  • Separation of JOIN conditions and WHERE conditions, to allow you to perform an outer join and a filter within a single query. (Granted, there are ways around it, but it would be nicer to be able to make one query do the job. See Outer join and data filtering and Using sub-queries.)

Better management of joins/outer joins. We spend a lot of time creating views on the source DB simply because we can’t get DI to create an efficient query.
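
For reference, the distinction being asked for is where the filter predicate lives; a small sketch with made-up tables:

    -- Filter in the ON clause: customers with no matching orders are kept.
    SELECT c.customer_id, o.order_id
    FROM   customer c
    LEFT OUTER JOIN orders o
           ON  o.customer_id = c.customer_id
           AND o.order_date >= '2008-01-01'

    -- Filter in the WHERE clause: NULL rows from ORDERS fail the predicate,
    -- so the outer join silently behaves like an inner join.
    SELECT c.customer_id, o.order_id
    FROM   customer c
    LEFT OUTER JOIN orders o ON o.customer_id = c.customer_id
    WHERE  o.order_date >= '2008-01-01'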

  • Ability to turn off implicit transactions, both for targets and for SQL() script calls.

  • Fixing DECODE so that it translates to one CASE statement instead of a series of nested CASE statements on SQL Server.
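
A small illustration (column and values made up) of the difference in generated SQL:

    -- What decode(status, 'A', 'Active', 'I', 'Inactive', 'Unknown') becomes today:
    CASE WHEN status = 'A' THEN 'Active'
         ELSE CASE WHEN status = 'I' THEN 'Inactive'
                   ELSE 'Unknown'
              END
    END

    -- What the request would like generated instead - one searched CASE:
    CASE WHEN status = 'A' THEN 'Active'
         WHEN status = 'I' THEN 'Inactive'
         ELSE 'Unknown'
    END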

  • Ability to push down a HAVING clause.
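
An illustrative example (made-up table and threshold) of the SQL this would let DI generate, instead of filtering the aggregated rows in the engine:

    SELECT   customer_id, SUM(amount) AS total_amount
    FROM     orders
    GROUP BY customer_id
    HAVING   SUM(amount) > 1000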

  • ADAPT01129419 - support for reading Excel 2007 native files

  • An equivalent of pushdown_sql for column mappings - i.e. just pass the text through to the database and map the result as a column.

Job migration, execution, & scheduling

  • Disabling workflows/dataflows - I would be interested in disabling objects (DF/WF) within a job with a simple right-click! This function could be very useful, especially when you have a complex WF/DF and you want to test only part of it. Related: if I am not mistaken, Cognos DecisionStream has this option and shows a big red ‘X’ when any component is disabled. Informatica also has a disable-session-task option.

  • Migration - Another general point would be to improve the migration of Jobs/DF/WF between different environments (the current method is not intuitive to me… but I’m still working with the 6.5 versions, I don’t know about XI!)

  • We should be able to “PAUSE” a job. I don’t know exactly why we’d need this, but it can come in handy in some situations.

  • A DI job scheduler that actually takes job dependencies into consideration when you have more than one job. We have many jobs with dependencies, and had to develop a database solution to the problem. Each DI job calls a workflow which checks whether all the jobs it depends on have run. If so, it writes its start time to the database. If not, a database function sleeps until all dependencies are met, then allows the job to continue. After the DI execution is complete, the job calls another workflow which writes its end time to the database, thus allowing any dependent jobs currently sleeping to run. Yes, we have had situations where our cluster was quite full of sleeping jobs because a couple of jobs ran longer than normal; it has indeed brought our servers to a grinding halt before. Having job dependencies built into DI could save us some serious headaches…
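
A rough sketch of the kind of control tables and check such a home-grown solution relies on (all names are hypothetical, not the poster’s actual schema):

    -- Run history and declared dependencies.
    CREATE TABLE job_run (
        job_name   VARCHAR(128) NOT NULL,
        start_time TIMESTAMP,
        end_time   TIMESTAMP
    );

    CREATE TABLE job_dependency (
        job_name       VARCHAR(128) NOT NULL,  -- the waiting job
        depends_on_job VARCHAR(128) NOT NULL   -- the job it must wait for
    );

    -- The pre-run workflow sleeps and repeats this check until it returns 0,
    -- i.e. until every prerequisite has finished since the start of today.
    SELECT COUNT(*)
    FROM   job_dependency d
    WHERE  d.job_name = 'LOAD_SALES'
    AND    NOT EXISTS (
             SELECT 1
             FROM   job_run r
             WHERE  r.job_name  = d.depends_on_job
             AND    r.end_time >= CURRENT_DATE );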

  • A schedule migration process that would make migrating ETL process from one environment to another painless.

  • Scheduling of jobs from within DI Designer interface

Other general functionality

  • Lookup with formulae - being able to define a formula on both sides of the lookup fields

  • In-Memory Datastore for Batch jobs - Lookup to flow - several times I’ve created my own result set after a query or two and really need to lookup the result of that from somewhere else further down the stream. I have to write this out to a table and then start a subsequent dataflow to be able to lookup from it. If this is a set in memory already I would like to be able to “fix” a position in a flow as a named set and use this as a lookup table without having to write it out (if you write it you only need to read it later, if it’s already in memory and it’s a transient set why bother). Related: Essentially, an in-memory datastore that works with Batch jobs (not just realtime). Write out the results from your first step to a “table” that lives in a datastore which is defined as being pinned in memory. Right now you can only write out to a persistent cache datastore, which is on-disk; we haven’t found the local persistent on-disk cache datastores to be any faster than re-retrieving the data from the database.

  • A new Variable Output column in the Query transform, which would hold temporary results. Most of the time we are calculating some intermediate values, and to do this we need to add one more Query transform.
    I’m thinking of a concept very similar to the VARIABLE PORT in Informatica.

Ability to use a query transform to write results to variables. If you need to set a few variables to various columns from a query, you currently have to do this with a SQL command in a script. This means you lose the visibility of the table having been used, for starters, and secondly you need to run the same SQL multiple times to get a different field back into a different variable.

In order to set a variable to a value from a query you need to use a script task and set the variable = sql( datastore, query ). It would be nicer to be able to use a query transform (or similar) where the output is set to a variable instead of a field in the output stream.
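
To make the cost concrete, here is roughly what gets executed today versus what a query-to-variables transform could do (table and column names are made up):

    -- Today: one sql() call per variable, so the same statement runs three times.
    SELECT load_date FROM etl_control WHERE job_name = 'LOAD_SALES';
    SELECT batch_id  FROM etl_control WHERE job_name = 'LOAD_SALES';
    SELECT row_limit FROM etl_control WHERE job_name = 'LOAD_SALES';

    -- Requested: run it once and map each output column to a variable.
    SELECT load_date, batch_id, row_limit
    FROM   etl_control
    WHERE  job_name = 'LOAD_SALES';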

  • How about better support for IN / NOT IN in the WHERE clause of your query? One way might be to make a new function called “is_in_query” that would take parameters:
  • Datastore and table to look in (including the in-memory datastore idea above)
  • IN or NOT IN
  • Input column to be compared
  • column – or formula – of the lookup table to compare to
  • Filtering clauses of the lookup table (to get a subset or to match)

At run-time, DI would evaluate the datastore of the input column and the datastore of the IN column. If they are the same, it would push this down to a regular IN in the SQL. If they were different… it would get much more complicated. I suppose at that point DI could turn it into a regular lookup() and then add an in-memory filtering step. The overall idea is to avoid Pushdown_SQL so that you keep better table impact analysis etc.
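
For the same-datastore case, the SQL that a hypothetical is_in_query() would push down presumably looks like an ordinary subquery (tables and filter invented for illustration):

    SELECT o.*
    FROM   orders o
    WHERE  o.customer_id NOT IN (
             SELECT c.customer_id
             FROM   customer c
             WHERE  c.status = 'CLOSED' )   -- the "filtering clause" parameter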

  • Flat Files: be able to define a description and other metadata for columns in a flat file definition, the same way you can for columns in database tables.

  • Flat Files: writing multiple files. The input file is 10 million lines long and it would be good to write multiple output files of X lines each, so that we can restart the job at the point of failure when it fails halfway through. This is similar to the “commit every 10000 rows” option when writing to a DB. Our workaround is to split the input file and then use a complex looping workflow to go through all the files in the folder - it works, but it’s a headache to maintain and debug.

  • Validation transform: I’d like to be able to add a column to the output that contains my error code, so I can write it to a table for analysis or use a lookup table to obtain a description of the error. It would also be useful to add a severity code.

  • Enable Dynamic Lookup

  • In the SQL Transform, have a larger field for the Datastore name (we have some problems with long names).

  • Suppress some data type conversion warnings: mapping a constant 0 into a decimal(18,0) generates a conversion warning.

We just put a zero in the field; I’m sure that works.
Why does a substr(xxx, 2, 1) generate a warning when we put the result into a varchar(1)? BODI just looks at the declared input and output columns, but I’m sure my substr() will return one character or null.

  • A months_between function [added in DI 12.0 / DS] and a trunc function like the Oracle one to strip off the time from a date time.
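
For clarity, the Oracle behaviour being referenced (illustrative query):

    -- Oracle: DATE values carry a time component; TRUNC() resets it to 00:00:00,
    -- and MONTHS_BETWEEN returns a (possibly fractional) number of months.
    SELECT TRUNC(hire_date)                   AS hire_day,
           MONTHS_BETWEEN(SYSDATE, hire_date) AS months_of_service
    FROM   employees;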

  • ADAPT01130309 - CMS Collector (metadata reporting) - does not pull underlying tables if Aliased in a universe, and, does not consider table joins as a dependency. Enhancement would be for DI to record the underlying table (vs. the alias name) for full dependency checking, and, for DI to consider objects used in Joins (but not as a universe object itself) as a dependency.

  • ADAPT01117847 - Ability to query repository for post-load SQL on target tables. (Probably means overhaul of how ATL is stored in repository – or maybe an export to XML format that is more human-readable.)

  • Rather than DI abruptly crashing as it hits the 32-bit 2Gb barrier… it would be nice if, when you exceed 1.9Gb of memory usage, DI logged a warning in the job console, and if you exceed 1.99Gb, DI shut the job down gracefully (i.e. with an error code rather than a crash dump). This could avoid a lot of support issues with tech support and on BOB!


dnewton :us: (BOB member since 2004-01-30)

Nice list…thanks for pulling it all together.

https://bobj-board.org/t/111054

Also, I’d REALLY like an “undo” feature for those times when I want to back out a series of changes I’ve made to a flow.


nitrobrother :us: (BOB member since 2004-09-08)

Seconded! Or at least an Undo of the “last” step/change, if a multi-layered Undo is too difficult.


dnewton :us: (BOB member since 2004-01-30)

:yesnod:


Werner Daehn :de: (BOB member since 2004-12-17)

To expand on this, I’d like to see the entire lineage when I do a “view where used” for an object. For example, if a data source table is used in a few data flows, and each data flow is used in a couple of work flows, and those are used in a couple of jobs, I’d like to be able to see the entire “tree”, navigate up and down the branches, and not have to go all the way back to the beginning (data source) to explore another path.


nitrobrother :us: (BOB member since 2004-09-08)

New idea: DI can, in some situations, push down into the target an INSERT INTO…SELECT FROM and bypass the ETL engine altogether.

In a slight variation of this, if there’s a Map_Operation before the target, with all of the options set to “Delete”, then it would be nice if DI could push this down to a DELETE FROM TARGET … WHERE ID IN (SELECT ID FROM SOURCE).

I understand this would be a rather rare use-case, but of course I just had that use case today. :slight_smile:


dnewton :us: (BOB member since 2004-01-30)

I wish Designer handled database connection interruptions more gracefully. I’m in and out of VPN connections all day long (I’m a roving consultant), on and off of networks intermittently and sometimes I just kick my Ethernet patch cable out of the wall. Any disconnection from the network and you are totally screwed.


eganjp :us: (BOB member since 2007-09-12)

I agree!!! :lol:

We recently went through an intermittent network glitch that would drop everyone’s network connection. The Oracle and SQL Server DBAs I sit around would just hit their Reconnect buttons, while I got to watch Designer crash and burn. I was saving my work every few seconds during this time period - or cursing myself for not doing it when I was in “deep development mode” and neglected to.


nitrobrother :us: (BOB member since 2004-09-08)

On large projects the number of objects in the Job, Workflow and Dataflow lists becomes unmanageable. It would be helpful to be able to put these objects into categories or groupings - kind of like how you can create groups in MS Messenger. An object should only be allowed in one group/category at a time, and it should not be necessary to reference the group/category when the object is in use. For example, job_name() should only return the job name and not prepend the group/category.

Since developers could set up different groups/categories in their local repositories I can see where check in/out would have to retain the group/category assignment.


eganjp :us: (BOB member since 2007-09-12)

Kind of like how Projects are just arbitrary groupings of Jobs (in fact I’ve wondered why they aren’t in the same pane as Jobs)… why not have other folder/category structures.

Although building in different preferences per developer – while simultaneously preserving them in the central repo – sounds a little messy.

Related idea: How about a master-hierarchy tab that would let you start at the job level, drill down (expand/collapse) to workflows, dataflows, and source/target datastores. Sort of like “View where used” but presented hierarchically.

Oh, here’s another one: Have each DI transform considered an object inside a Dataflow, so that if you wanted to do a “view where used” of a particular DI transform, you could find that too.


dnewton :us: (BOB member since 2004-01-30)

Agreed, it would be messy but it beats the heck out of looking through hundreds of Workflows/Dataflows.

Um, you mean like you have in the Project Area? :slight_smile: This only goes so far and we make heavy re-use of Workflows so I would see the same Workflow in many jobs. I have started to drill down from a job to find the specific Dataflow I want because hunting for it in the Local Object Library list takes too long.


eganjp :us: (BOB member since 2007-09-12)

Really? I never use projects in my local repo so I didn’t even know that feature existed. :hb:


dnewton :us: (BOB member since 2004-01-30)

on the list…


Werner Daehn :de: (BOB member since 2004-12-17)

on the list as well…


Werner Daehn :de: (BOB member since 2004-12-17)

One of my pet peeves is how Designer jumps around in the project and Central Object Library lists when doing check in/out. If I’m at the end of the list checking an object out (without replacement), then the central object library list and the project list both jump all the way back up to the top! :reallymad: If I have a number of objects that I’m working through and checking out one at a time, this is frustrating and a waste of time. (I know I can highlight multiple items and check them out together.)


eganjp :us: (BOB member since 2007-09-12)

Something that just popped into my head… the ability to drag a dataflow from the Central repo window onto a Job or Workflow workspace, and have DI prompt you with the Checkout options at that time.

Rather than making you go the central repo, check out there, then back to your local repo window, find the object, and drag that into your workflow.


dnewton :us: (BOB member since 2004-01-30)

What happened to the idea of moving Join Rank to the query instead of (or in addition to) the tables as mentioned here? How about at the EDF level?


nitrobrother :us: (BOB member since 2004-09-08)

Still planned. Like so many other things.


Werner Daehn :de: (BOB member since 2004-12-17)

If you take a look at Pentaho/Kettle, the number of built-in transforms/widgets is eye-popping. Stuff like the above, as well as a ton of objects for things like FTP steps, moving/renaming/copying/deleting files, automating the population of warehouse “junk” dimensions, and more.


dnewton :us: (BOB member since 2004-01-30)

I’m in the middle of moving three projects for two different clients to production so I’m experiencing all the joys of coordinating the different install/configuration points.

It sure would be nice to be able to use the Management Console for everything. Once the job server and Management Console have been installed and are up and running I would like to be able to use the Management Console for:

  1. Creating the repository or even upgrading
  2. Configuring the job server to talk to the repository
  3. Importing an ATL file to the repository - no more designer required for production.
  4. Selecting System Configuration settings.

The biggest issue with going to production is that the operations support people often don’t know anything about Data Integrator. They have to use so many DI applications to get a production instance working that it is painful.

  1. Install the job server and management console (Unix)
  2. Install Designer (Windows)
  3. Create database (Oracle)
  4. Configure job server (Unix)
  5. Import ATL file using Designer (Windows)
  6. Change System Configuration using Designer (Windows) or change Datastore connections using Designer or the Management Console.
  7. Export the job execution command for external scheduling (Unix or Windows)

eganjp :us: (BOB member since 2007-09-12)