(Here’s a reorganized list, as of January 2010. Wdaehn, if any of these have already been done, let us know and we can remove them from this post.)
Built-in transforms and Functions
- How about a Lookup transform? It would be similar to a Query transform, except that every column mapping would have a checkbox (lookup or not); if lookup, the transform would streamline the process of defining the lookup and make it really easy for the designer.
- An Aggregator transform, to enable aggregation of columns based on grouping of selected columns.
- Some of the Table Compare options should be settable through global variables, to allow dataflows that handle both incremental loading and full re-loads. In particular, if the “Detect Deleted Rows” and “Comparison Method” options could also read a variable, a single dataflow could do both an incremental load (no deletion detection; row-by-row comparison) and a full re-load (deletion detection; cached comparison table). Without this, you need two separate dataflows.
- A new transform, something different than History Preserve, more like “Existing Column Preserve”, that does the following:
- If the data in the target table has a value, leave it alone
- If the data in the target table is null, then fill in the incoming value
In this transform, you could click on each column and the tab/dialog area below would have options for:
- Normal (pass-through)
- Preserve if not null
- User-specified condition (and you could put in your own WHERE-like criteria)
This is useful for some columns where you always want to preserve a value like “initial load date”, even if the row is later Updated (rather than inserted). But you want to have the same dataflow, and not have to resort to bringing in the target table as a source. In fact, now that I think of it, this could be added into the Table Compare transform. Since you already have to read the entire target table’s row for that transform to work, you could have another tab in the table compare that lets you specify this column-by-column behavior.
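The per-column behavior described above can be sketched in a few lines. This is an illustrative sketch only, not DI code; the mode names and the `merge_row` helper are hypothetical:

```python
# Sketch of the proposed per-column "preserve" behavior, assuming the
# transform can see both the incoming row and the existing target row
# (as Table Compare already does). Mode names are hypothetical.
PASS_THROUGH = "pass-through"
PRESERVE_IF_NOT_NULL = "preserve-if-not-null"

def merge_row(incoming, existing, column_modes):
    """Return the row to write, applying per-column preserve rules."""
    result = {}
    for col, new_value in incoming.items():
        mode = column_modes.get(col, PASS_THROUGH)
        old_value = existing.get(col) if existing else None
        if mode == PRESERVE_IF_NOT_NULL and old_value is not None:
            result[col] = old_value   # keep e.g. the original load date
        else:
            result[col] = new_value   # normal mapping wins
    return result
```

A "user-specified condition" mode would just swap the `old_value is not None` test for an arbitrary predicate on the two rows.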
- A “Top” transform: I had a requirement to compute an aggregated fact table and only show the top customers or top products in the result. This was a minor pain to do, as it required several Query steps in succession:
- Aggregate the measures
- Sort by the customer/product and then by the Revenue, in DESC order
- Add a “gen_row_num_by_group” column
- Then filter out just the top 10.
If there were a “Top” transform, it could do all of the steps above in one step: just specify which columns are the dimensions, which are the measures, whether you want the highest or lowest values, and how big the TOP should be.
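The four steps above collapse into one operation. A minimal sketch of that combined behavior (illustrative only; `top_n` and its parameters are hypothetical, and the input is assumed to be already aggregated to the right grain):

```python
# One-step sketch of the proposed "Top" transform: group by the dimension,
# sort each group by the measure, and keep the top N rows per group --
# i.e. the aggregate / sort / gen_row_num_by_group / filter chain.
from collections import defaultdict

def top_n(rows, dim, measure, n, highest=True):
    """rows: list of dicts already aggregated to (dim, measure) grain."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dim]].append(row)
    out = []
    for members in groups.values():
        members.sort(key=lambda r: r[measure], reverse=highest)
        out.extend(members[:n])     # replaces gen_row_num_by_group + filter
    return out
```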
- The Merge transform should have a radio button for “Merge All” versus “Merge Distinct”, similar to UNION ALL vs. UNION in SQL itself. Yes, you can add a Query transform after the Merge with the Distinct checkbox checked, but a radio button would make for a simpler dataflow design (fewer steps) and would make it clearer to a new DI user what was going on.
- Lookup with formulae: being able to define a formula on both sides of the lookup fields.
- How about better support for IN / NOT IN in the WHERE clause of your query? One way might be a new function called “is_in_query” that would take these parameters:
- Datastore and table to look in (including an in-memory datastore from #11 above)
- IN or NOT IN
- Input column to be compared
- column – or formula – of the lookup table to compare to
- Filtering clauses of the lookup table (to get a subset or to match)
At run-time, DI would evaluate the datastore of the input column and the datastore of the IN column. If they were the same, it would push this down to a regular IN in the SQL. If they were different, it would get much more complicated; I suppose at that point DI could turn it into a regular lookup() and then add an in-memory filtering step. The overall idea is to avoid Pushdown_SQL, so that you get better table impact analysis, etc.
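The in-memory fallback path for the different-datastore case could look something like this. A hedged sketch, not DI's API; `is_in_query` and `filter_rows` are hypothetical names:

```python
# Sketch of the proposed is_in_query() fallback: when the two datastores
# differ and no SQL pushdown is possible, fetch the (filtered) lookup
# column once, then filter the data stream in memory.
def is_in_query(value, lookup_values, negate=False):
    """Emulate IN / NOT IN against a pre-fetched lookup column."""
    found = value in lookup_values
    return not found if negate else found

def filter_rows(rows, column, lookup_values, negate=False):
    lookup_set = set(lookup_values)       # one fetch, then O(1) probes
    return [r for r in rows if is_in_query(r[column], lookup_set, negate)]
```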
- Validation transform: I’d like to be able to add a column to the output that contains my error code, so I can write it to a table for analysis or use a lookup table to obtain a description of the error. It would also be useful to add a severity code.
DI Designer Usability and Developer Productivity
- Reusable column-level mappings (or at least lookup()s). Lookups are a pain to code, and the UI wizard isn’t much faster. Often you’re doing the same lookup in many dataflows, say, to the Customer table. It would be nice to create a library of lookups which could simply be pasted into your mapping with a click. Related: the idea of a library of lookups is very interesting; maybe the user could build his own library and access it by clicking a dedicated tab in the object library window.
- Different icons on the canvas if the output table has the Delete or Drop & Create option set.
- Convert embedded data flow objects to canvas objects.
- Sort by owner within a datastore. Tables are ordered by owner when importing metadata, but not in the object display.
- Ability to reimport a table right within the dataflow: right-click on a table and choose Reimport.
- In a Query step, you can click on a column in the right pane and the “used” source columns get highlighted in the left pane. How about the reverse, too? Click on a source column and quickly see which target columns use it.
- If columns are used only in the join/WHERE clause of a Query (and not in a mapping), there’s no indication that they are used or mapped. A separate kind of highlight on those source columns would be helpful.
- The ability to change a large number of mappings in a Query transform to a constant value (say, NULL) in one step.
- More keyboard shortcuts. Some things like “validate all” don’t have keyboard commands, which is bad from a Section 508 (accessibility) perspective as well as for productivity.
- I’d REALLY like an “undo” feature for those times when I want to back out a series of changes I’ve made to a flow; or at least an undo of the last step/change, if a multi-level undo is too difficult.
- I wish Designer handled database connection interruptions more gracefully. I’m in and out of VPN connections all day long (I’m a roving consultant), on and off networks intermittently, and sometimes I just kick my Ethernet patch cable out of the wall. Any disconnection from the network and you are totally stuck. We recently went through an intermittent network glitch that would drop everyone’s connection. The Oracle and SQL Server DBAs around me would just hit their Reconnect buttons, while I got to watch Designer crash and burn. I was saving work every few seconds during that period, or cursing myself for neglecting to when I was in “deep development mode”.
- How about the ability to have multiple instances of Designer open at the same time, each pointing to a different repository? For example, I’m in the development environment and want to look at something in the production repository. Currently I have to exit Designer, restart it, point it at the prod system, look at what I need, then lather-rinse-repeat to go back to my development environment.
- The graphic readability of dataflows would be enhanced if the lines from the inner source(s) to a Query containing an outer join could be displayed as dashes (‘-------’) or dots (‘…’), while the lines from the outer source remained solid, as now. This would avoid having to click into the Query’s Outer Join tab to discover the nature of the join. Ideally, this would be configurable in Tools > Options.
Connectivity, Database support, SQL generation
- Identity insert: I would like to be able to use SQL Server IDENTITY columns as designed. This would allow me to parallel-load tables and do INSERT INTO … SELECT … statements with full pushdown. Related: the Key Generation transform could use the database max value, not an in-memory calculation, to increment the key for every row. Related: this would also work for Oracle with a default ‘sequence’.NEXTVAL.
- Support for a RANK function with pushdown.
- Separation of JOIN conditions and WHERE conditions, to allow you to perform an outer join and a filter within a single query. (Granted, there are ways around it, but it would be nicer to make one query do the job; see the “Outer join and data filtering” thread.)
- Better management of joins/outer joins. We spend a lot of time creating views on the source database simply because we can’t get DI to generate an efficient query.
- Move Join Rank to the query, instead of (or in addition to) the source table setting.
- Easier ability to turn off implicit transactions, both for targets and for sql() script calls.
- Fix DECODE to translate to one CASE statement instead of a series of nested CASE statements in SQL Server.
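The desired translation can be shown with a small generator. A sketch of the idea only; `decode_to_case` is a hypothetical helper, not anything in DI, and it assumes Oracle-style decode(expr, search1, result1, …, default) arguments already rendered as SQL literals:

```python
# Build ONE flat searched-CASE statement from decode-style arguments,
# instead of the nested CASEs that DI currently generates for SQL Server.
def decode_to_case(expr, *args):
    """decode(expr, s1, r1, s2, r2, ..., [default]) -> single CASE string."""
    pairs, default = args, None
    if len(args) % 2 == 1:                    # trailing argument is the default
        pairs, default = args[:-1], args[-1]
    whens = [
        f"WHEN {expr} = {pairs[i]} THEN {pairs[i + 1]}"
        for i in range(0, len(pairs), 2)
    ]
    else_part = f" ELSE {default}" if default is not None else ""
    return "CASE " + " ".join(whens) + else_part + " END"
```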
- Ability to push down a HAVING clause.
- ADAPT01129419 - Support for reading native Excel 2007 files.
- An equivalent of pushdown_sql for column mappings, i.e. just pass the text through to the database and map the result as a column.
- The ability to specify an Oracle Home to use for a datastore would be useful. If the server is Oracle 10g but I need to connect to an older Oracle server, and I install an Oracle 8.1.7 client, it would be nice to be able to stipulate which Oracle Home the SQL*Net connection uses.
- I would really like to be able to use Active Directory as a data source/target! I have several ETLs that need to extract data from Active Directory, modify it, then reload it into AD. As it currently stands, I had to write a VB script to read AD into a CSV, then read that. To load AD, I had to write another DF to create an LDIFDE upload input file and then execute LDIFDE within the WF.
Lineage and Impact Analysis
- It would be convenient if, from within DI Designer, you could right-click on a datastore table and do not just a “view where used” but a “view lineage/impact”. This menu option would launch the DI web client and bring you right to the lineage/impact information for the object in question. I’d also like to see the entire lineage when I do a “view where used” for an object. For example, if a source table is used in a few dataflows, each dataflow is used in a couple of workflows, and those are used in a couple of jobs, I’d like to see the entire “tree” and navigate up and down its branches, without having to go all the way back to the beginning (the data source) to explore another path.
- Have each DI transform considered an object inside a dataflow, so that you could do a “view where used” on a particular transform and find that too.
- How about “view where used” for variables!
- Ability to do a “show where used” on an object in the central repo, and have it list the other places in the central repo where it is used (not the local repo).
Source code management
- Check in objects from the local repository, vs. having to find them in the (much longer) central repo list. You can do an “Add” from the local repo, but not a check-in.
- When looking at objects in the Local Object Library, be able to identify whether they are checked in or out, by whom, and the date of check-in or check-out. Also, use usernames rather than database schema names when showing who has checked something out.
- ADAPT01130292 - In the central repo, when right-clicking to view an object’s History, show details of check-outs (ones that weren’t undone), not just check-ins, especially if the object is currently checked out. Prompt for comments to be recorded at check-out time, not just check-in.
- The ability to drag a dataflow from the central repo window onto a job or workflow workspace, and have DI prompt you with the check-out options at that time, rather than making you go to the central repo, check out there, then go back to your local repo window, find the object, and drag that into your workflow.
- Give a super-admin the ability to undo a checkout in the central repo. At the moment, you have to hack the database with SQL scripts, which is a time-waster and is dangerous.
- Ability to tag objects in the central repo as “deprecated” (unused/retired), which would show them greyed out. (And if you did a “show where used” and the deprecated object was still being referenced, the dialog box would note it as such.)
- The ability to display the date-created / date-modified attributes for an object in the Local Object Library or central repo (in the list of column headers), and sort by them, so I can find what has most recently been worked on, or new additions, without going through each item’s properties!
- Add additional attributes to objects, such as category, and display/sort/filter by these in the Local Object Library (which also covers the grouping mentioned by previous posters).
Job migration, execution, & scheduling
- Disabling workflows/dataflows: I would be interested in disabling objects (DF/WF) within a job with a simple right-click! This function could be very useful, especially when you have a complex WF/DF and you want to test part of it. Related: if I am not mistaken, Cognos DecisionStream has this option and shows a big red ‘X’ when a component is disabled. Informatica also has a disable-session-task option.
- A schedule migration process that would make migrating an ETL process from one environment to another painless.
- Scheduling of jobs from within the DI Designer interface.
- The facility to lock a datastore so that it won’t be overwritten by an ATL import; or at least an “Are you sure?” prompt.
- A DI job scheduler that actually takes job dependencies into consideration when you have more than one job. We have many jobs with dependencies, and had to develop a database solution to the problem. Each DI job calls a workflow in which the job checks whether all the jobs it depends on have run. If so, it writes its start time to the database; if not, the database function sleeps until all dependencies are met, then allows the job to continue. After execution is complete, the job calls another workflow which writes its ending time to the database, thus allowing any dependent jobs currently sleeping to run. Yes, we have had situations where our cluster was quite full of sleeping jobs because a couple of jobs ran longer than normal; it has indeed brought our servers to a grinding halt before. Having job dependencies built into DI could save us some serious headaches…
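The wait-and-record protocol described above can be sketched compactly. This is an illustration of the workaround's logic, with the database replaced by an in-memory dict; `run_job` and the table it stands in for are hypothetical:

```python
# Sketch of the poll-until-dependencies-complete protocol: a job sleeps
# until every job it depends on has recorded an end time, then runs and
# records its own end time so that its dependents can proceed.
import time

completed = {}        # job name -> end time; stands in for the status table

def dependencies_met(deps):
    return all(dep in completed for dep in deps)

def run_job(name, deps, body, poll_seconds=0.01, timeout=1.0):
    """Block until all deps have finished, then run body and record it."""
    waited = 0.0
    while not dependencies_met(deps):
        time.sleep(poll_seconds)       # the "sleeping job" described above
        waited += poll_seconds
        if waited > timeout:
            raise TimeoutError(f"{name}: dependencies never completed")
    body()
    completed[name] = time.time()      # unblocks any waiting dependents
```

The cluster-full-of-sleeping-jobs failure mode is visible right in the sketch: every waiting job burns a slot in the polling loop, which is exactly why a native dependency graph in the scheduler would help.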
- Related, job chaining: the ability to call jobs on the same or different job servers from other jobs, with success, fail, and waiting statuses and associated error alert/reporting options. This would allow master/child jobs; for some of the EDWs I’ve built, it just isn’t possible to stuff everything into a single monster job. You can already do this using command-line execution, but it’s quite some work to develop a framework around it, and it would be very useful if Data Services offered this natively.
Other general functionality
- In-memory datastore for batch jobs (lookup to flow): several times I’ve created my own result set after a query or two and really needed to look up against that result further down the stream. Today I have to write it out to a table and then start a subsequent dataflow to be able to look up from it. Since the set is already in memory, I would like to be able to “fix” a position in a flow as a named set and use it as a lookup table without writing it out (if you write it, you only need to read it back later; if it’s already in memory and it’s a transient set, why bother?). Related: essentially, an in-memory datastore that works with batch jobs (not just real-time). Write out the results from your first step to a “table” that lives in a datastore which is pinned in memory. Right now you can only write to a persistent cache datastore, which is on disk; we haven’t found local persistent on-disk cache datastores to be any faster than re-retrieving the data from the database.
- A new variable output column in the Query transform, which would hold temporary results. Most of the time we are calculating intermediate values, and to do this we need to add one more Query transform. This is a concept very similar to the variable port in Informatica.
- Ability to use a Query transform to write results to variables. If you need to set a few variables to various columns from a query, you have to do this with a sql() command in a script. This means you lose the visibility of the table having been used, and you need to run the same SQL multiple times to get a different field back into each variable. To set a variable to a value from a query today, you use a script task with variable = sql(datastore, query). It would be nicer to use a Query transform (or similar) where the output is set to a variable instead of a field in the output stream.
- Flat files: be able to define a description and other metadata for columns in a flat file definition, the same way you can for columns in database tables.
- Flat files: writing multiple files. When the input file is 10 million lines long, it would be good to write multiple output files of X lines each, so that we can restart the job at the point of failure when it fails halfway through. This is similar to “commit every 10,000 rows” when writing to a database. Our workaround is to split the input file and then use a complex looping workflow to go through all the files in the folder; it works, but it is a headache to maintain and debug.
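The splitting half of that workaround is simple enough to sketch. A minimal illustration (the `split_file` helper and the chunk naming pattern are made up for this example):

```python
# Split one large input file into numbered chunk files of at most
# lines_per_file lines each -- the "multiple output files of X lines"
# behavior requested above, done outside the tool.
def split_file(src_path, lines_per_file, dst_pattern="chunk_{:04d}.txt"):
    """Returns the list of chunk file names written."""
    written, buf, part = [], [], 0
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if len(buf) == lines_per_file:
                name = dst_pattern.format(part)
                with open(name, "w", encoding="utf-8") as dst:
                    dst.writelines(buf)
                written.append(name)
                buf, part = [], part + 1
    if buf:                                   # final partial chunk
        name = dst_pattern.format(part)
        with open(name, "w", encoding="utf-8") as dst:
            dst.writelines(buf)
        written.append(name)
    return written
```

The restart-at-point-of-failure benefit comes from the numbered names: on rerun, the looping workflow can skip chunks that already loaded successfully.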
- Flat files: a template flat file, like a template table or template XML.
- Flat files: ability to automatically ignore columns to the right of those defined in the flat file format.
- In the SQL transform, a larger field for the datastore name (we have some problems with long names).
- Suppress some data type conversion warnings: mapping a constant 0 into a decimal(18,0) column, for example, generates a conversion warning.
- A trunc function like the Oracle one, to strip the time from a datetime.
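For reference, what the requested function would do, assuming Oracle's default TRUNC(date) semantics of keeping the date and zeroing the time:

```python
# Sketch of the requested trunc(): keep the date, drop the time-of-day,
# like Oracle's TRUNC(date) with the default 'DD' format.
from datetime import datetime

def trunc(dt):
    """Strip the time portion from a datetime."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)
```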
- ADAPT01130309 - The CMS Collector (metadata reporting) does not pull underlying tables if they are aliased in a universe, and does not consider table joins as a dependency. The enhancement would be for DI to record the underlying table (vs. the alias name) for full dependency checking, and to treat objects used in joins (but not as universe objects themselves) as dependencies.
- ADAPT01117847 - Ability to query the repository for post-load SQL on target tables. (This probably means an overhaul of how ATL is stored in the repository, or maybe an export to a more human-readable XML format.)
- Rather than DI abruptly crashing as it hits the 32-bit 2 GB barrier: if a job exceeds 1.9 GB of memory usage, it would be nice if DI logged a warning in the job console; and if it exceeds 1.99 GB, DI should shut the job down gracefully (i.e. with an error code rather than a crash dump). This could avoid a lot of issues with tech support and on BOB!
- Create, and include in the DI installation, a BO universe for reporting on the DI repository. This would seem like a slam-dunk integration point between these two BO products…
Management Console (Web Admin)
- It sure would be nice to be able to use the Management Console for everything. Once the job server and Management Console have been installed and are up and running, I would like to be able to use the Management Console for:
- Creating the repository or even upgrading
- Configuring the job server to talk to the repository
- Importing an ATL file to the repository - no more designer required for production.
- Selecting System Configuration settings.
I would like to see all DI configuration and administration available within the Web Admin. This would include:
- DSCONFIG
- Job Server setup
- Repo maintenance/setup
- Calculating dependencies/mappings
- Ability to stop/start DI services
- Since we’re asking for the moon, let’s also have this rolled into the BO Enterprise CMC entirely, so that there’s a single security model across all BO products. That way we could use AD authentication and not have to maintain separate account schemes, and let the BOE installation process worry about web/app servers, Java dependencies, etc.
- While using the web-based Management Console to look at batch job status, I’d like a hover-over menu on the job name, with options for:
- Show History - this would link you over to the Operational Job Execution History page (/odreports/odJobList.do) with the results already filtered to that particular job
- Auto Documentation - this would link you over to the Auto Doc section (/smartdoc.jsp) with that job already chosen
dnewton (BOB member since 2004-01-30)