BusinessObjects Board

Improvements/Suggestions to DI

Post your ideas here, and see what others might have to say. (Would others use the suggested functionality too? If so, it might give more “weight” to the idea within BO.)

===================

Here is a collection of ideas from a previous thread.

  1. My ‘enhancement’ would be for BusObj to concentrate on improving the UI of Designer - I’ve found it particularly unintuitive, clumsy, slow and, most significantly, highly unstable. The last point is very frustrating and will prevent me from using DI again, or indeed recommending it to anyone else, until the product improves. It crashes/hangs on me at least 4 or 5 times a day. As far as I can tell, it seems to happen whenever DI needs to read/write to the local or central repositories.

  2. Identity Insert - I would like to be able to use SQL Server IDENTITY columns as designed. This would allow me to parallel-load tables and do INSERT INTO … SELECT … statements with full pushdown. Related: the Key Generation transform should use the database max value, not an in-memory calculation, to increment the key for every row. Related: this would also work for Oracle with a default sequence’s NEXTVAL (see the first sketch after this list).

  3. Disabling workflows/dataflows - I would be interested in disabling objects (DF/WF) within a job with a simple right-click! This function could be very useful, especially when you have a complex WF/DF and you want to test part of it. Related: if I am not mistaken, Cognos Decision Stream has this option and shows a big red ‘X’ when any component is disabled. Informatica also has a disable-session-task option.

  4. Reusable column-level mappings (or at least LOOKUPs()). Lookups are a pain to code, and the UI wizard isn’t much faster. Often you’re doing the same lookup in many dataflows, say, to the Customer table. It would be nice to create a library of lookups that could simply be “pasted” into your mapping with a click. Related: the idea of a library of lookups is very interesting; perhaps the user could build his own library and access it by clicking a dedicated tab in the object library window.

  5. Migration - Another general point would be to improve the migration of Jobs/DF/WF between different environments (the current method is not intuitive to me… but I’m still working with version 6.5; I don’t know about XI!)

  6. Different icons on the canvas if the output has the delete or drop & create option set

  7. Copy & paste of canvas objects, both within the same canvas and between canvases. Related: more “copy/paste” of transforms and scripts, and drag-and-drop of dataflows into workflows. (See “Why IF and WHILE conditional cannot be reuse?”)

  8. Convert embedded data flow objects to canvas objects.

  9. Sort by owner within a datastore. They are ordered when importing metadata, but not on object display

  10. Lookup with formulae - being able to define a formula on both sides of the lookup fields

  11. In-Memory Datastore for Batch jobs - Lookup to flow: several times I’ve created my own result set after a query or two and really need to look up the result of that from somewhere else further down the stream. I have to write it out to a table and then start a subsequent dataflow to be able to look up from it. If the set is in memory already, I would like to be able to “fix” a position in a flow as a named set and use it as a lookup table without having to write it out (if you write it, you only need to read it later; if it’s already in memory and it’s a transient set, why bother?). Related: essentially, an in-memory datastore that works with batch jobs (not just real-time). Write out the results from your first step to a “table” that lives in a datastore which is defined as being pinned in memory. Right now you can only write out to a persistent cache datastore, which is on-disk; we haven’t found the local persistent on-disk cache datastores to be any faster than re-retrieving the data from the database.

  12. Bug: decode - amend the decode function so that, when generating MSSQL, it does not produce nested statements but correctly generates a single CASE statement in the SQL (see the second sketch after this list).

  13. In-built RANK function.

  14. A new variable output column in the Query transform, which would hold temporary results. Most of the time we are calculating intermediate values, and to do this we need to add one more Query transform. This is a concept very similar to the VARIABLE PORT in Informatica.

  15. We should be able to “PAUSE” a job. I don’t know exactly why we would need this, but it could come in handy in some situations.

  16. How about a Lookup transform? It would be similar to a Query transform, except that every column mapping would have a checkbox (lookup or not), and if lookup, it would expedite the process of doing the lookups - making it really easy for the designer.

  17. Ability to Reimport a table right within the dataflow.

  18. Check in objects from the Local repository, vs. having to find them in the central repo.

  19. In a Query step, you can click on a column on the right pane and the “used” source columns will get highlighted on the left pane. How about doing the reverse, too? Click on a source column and quickly see which target columns are using it.

  20. If columns are used only in the join/WHERE clause of a Query (and not in a mapping), there’s no indication that they are used or mapped. A separate kind of highlight would be helpful on those source columns.

  21. Extend mapping to include CASE statement.
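
To illustrate items 2 and 12, here are two rough SQL sketches. All table, column and sequence names are made up for illustration; this is what the desired behaviour might look like, not what DI generates today.

First sketch (item 2) - the kind of statement we would like pushed down in full, letting the database assign the surrogate key itself:

    -- SQL Server: the IDENTITY column assigns the key
    INSERT INTO dw.customer_dim (customer_code, customer_name)
    SELECT customer_code, customer_name
    FROM   stage.customer;

    -- Oracle: the same idea, taking the key from a sequence
    INSERT INTO dw.customer_dim (customer_key, customer_code, customer_name)
    SELECT customer_dim_seq.NEXTVAL, customer_code, customer_name
    FROM   stage.customer;

Second sketch (item 12) - roughly the nested shape described above versus the single CASE statement decode could produce instead:

    -- Nested form (roughly what the posters report being generated)
    CASE WHEN status = 'A' THEN 'Active'
         ELSE CASE WHEN status = 'I' THEN 'Inactive'
                   ELSE 'Unknown'
              END
    END

    -- Equivalent single CASE statement
    CASE WHEN status = 'A' THEN 'Active'
         WHEN status = 'I' THEN 'Inactive'
         ELSE 'Unknown'
    END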


dnewton :us: (BOB member since 2004-01-30)

Separation of JOIN conditions and WHERE conditions, to allow you to perform an outer join and a filter within a single query.

(Granted, there are ways around it, but it would be nicer to be able to make one query do the job. See https://bobj-board.org/t/99872 and https://bobj-board.org/t/100493)
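
As a rough SQL sketch of what a single query should be able to express (table and column names are invented for illustration):

    -- Outer join plus a filter, kept in one statement by separating the
    -- ON condition from the WHERE condition
    SELECT o.order_id, o.order_date, c.customer_name
    FROM   orders o
           LEFT OUTER JOIN customers c
             ON c.customer_id = o.customer_id     -- join condition
    WHERE  o.order_date >= '2007-01-01';          -- filter condition

In plain SQL the two kinds of condition live in different places for a reason: a predicate in the ON clause of an outer join only restricts which rows match, while the same predicate in the WHERE clause filters the final result.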


HuwD :uk: (BOB member since 2007-04-19)

How about better support for IN / NOT IN in the WHERE clause of your query? One way might be to make a new function called “is_in_query” that would take parameters:

  • Datastore and table to look in
  • IN or NOT IN
  • Input column to be compared
  • Column of the lookup table to compare to
  • Filtering clauses of the lookup table (to get a subset or to match)

At run-time, DI would evaluate the datastore of the input column and the datastore of the IN column. If they are the same, it would push this down to a regular IN in the SQL.

If they were different… it would get much more complicated. I suppose at that point DI could turn it into a regular lookup() and then add in an in-memory filtering step.

The idea here, overall, is to not have to do Pushdown_SQL so that you have better table impact analysis etc.
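
As a sketch only, a call to the proposed function (this is the suggested signature above, not an existing DI function; the datastore, table and column names are invented) might look like:

    is_in_query('DW_DS', 'CUSTOMER', 'IN', source_table.CUSTOMER_ID, 'CUSTOMER_ID', 'REGION_ID = 5')

and, when the source and lookup tables share a datastore, DI could push it down as ordinary SQL:

    -- Full pushdown when both sides live in the same datastore
    SELECT ...
    FROM   source_table s
    WHERE  s.customer_id IN (SELECT c.customer_id
                             FROM   customer c
                             WHERE  c.region_id = 5)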


dnewton :us: (BOB member since 2004-01-30)

I agree with all of that, but would also like to add an additional option: to be able to use one of the in-memory temporary sets (item 11 in the list above) as the comparison table of the IN / NOT IN clause.

Also, the “column to compare to” could instead be a “formula to compare”, so that it might be a single column, or could be a concatenation or other formula based on that table.

:slight_smile:


HuwD :uk: (BOB member since 2007-04-19)

I just wanted to emphasize the need for some of these:

enable/disable workflow, dataflow
copy & paste.

It’s shocking to me that these features were discussed as early as 2005 and have not been incorporated into the product. These would boost productivity greatly.


dbeardsley (BOB member since 2006-08-30)

Ability to use a query transform to write results to variables.

If you need to set a few variables to various columns from a query, you have to do this with a sql() command in a script. This means you lose the visibility of the table having been used, for starters; secondly, you need to run the same SQL multiple times to get a different field back into a different variable.
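
Today the workaround looks something like the script below (the datastore, table and variable names are invented for illustration); the point is that essentially the same query has to be issued once per variable:

    # One sql() call per variable, each re-running essentially the same query
    $G_LastLoadDate = sql('DW_DS', 'SELECT MAX(load_date) FROM batch_control');
    $G_LastBatchId  = sql('DW_DS', 'SELECT MAX(batch_id)  FROM batch_control');

    # The suggestion: a Query-like transform whose output columns could be
    # mapped straight onto $G_LastLoadDate and $G_LastBatchId in one pass,
    # keeping batch_control visible to impact analysis.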


HuwD :uk: (BOB member since 2007-04-19)

sql( [datastore], [command]) function to NOT set implicit transactions on.

This is non-intuitive behaviour (and I did not realise it was happening until DNewton mentioned it last week). When designing jobs we should have control over when/if transactions are used. DI deciding for us that we need a transaction here does not make sense. Not all queries are wrapped in a transaction, so why should the sql() function be any different?

(DI does not put anything else into a transaction unless you choose to - the sql() function seems out of step with the rest of DI here.)


HuwD :uk: (BOB member since 2007-04-19)

Fixing DECODE to translate to one CASE statement instead of a series of nested CASE statements in SQL 2000 would be a significant help to our team.


maclark :us: (BOB member since 2007-08-03)

Some more suggestions:

I think quite a lot of improvements can be made to version control and team development, e.g.:

  • When looking at objects in the Local Object Library, be able to identify whether they are checked in or out, by whom, and the date of check-in or check-out.
  • Use usernames rather than database schema names

I’ve been working with flat-file-to-flat-file processing recently, and although this isn’t DI’s main use, some useful enhancements could be made:

  • Be able to define a description and other metadata for columns in a flat file definition, the same way you can for columns in database tables.

  • Writing multiple files. Our input file is 10 million lines long, and it would be good to be able to write multiple output files of X lines each. This is so that we can restart the job at the point of failure when it fails halfway through. It is similar to the “commit every 10,000 rows” option when writing to a DB. Our workaround is to split the input file and then use a complex looping workflow to go through all the files in the folder (sketched after this list) - it works, but it is a headache to maintain and debug.

  • For the Validation transform, I’d like to be able to add a column to the output that contains my error code, so I can write it to a table for analysis or use a lookup table to obtain a description of the error. It would also be useful to add a severity code.
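
For the multiple-files point, our current looping workaround is roughly the following (a sketch only; the folder, file names and variables are invented, and the dataflow itself sits inside a While Loop in the workflow rather than in the script):

    # Initialisation script: start at the first split file
    $G_FileIndex = 1;
    $G_FileName  = '/data/in/part_' || $G_FileIndex || '.dat';

    # While Loop condition in the workflow:
    #   file_exists($G_FileName) = 1
    # Loop body: a dataflow whose flat-file source takes its name from
    # $G_FileName, followed by this increment script:
    $G_FileIndex = $G_FileIndex + 1;
    $G_FileName  = '/data/in/part_' || $G_FileIndex || '.dat';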


agulland :uk: (BOB member since 2004-03-17)

I would like to add to the list:

A good DI scheduler that actually takes job dependencies into consideration when you have more than one job.

A schedule migration process that would make migrating an ETL process from one environment to another painless.

Thanks,
cfscg.


cfscg (BOB member since 2007-06-19)

I second the job scheduler request; we have many jobs that have dependencies, and we had to develop a database solution to our dependency issue. Each DI job calls a workflow in which the job checks to see whether all jobs it is dependent upon have run. If so, it writes its start time to the database. If not, the database function sleeps until all dependencies are met, then allows the job to continue. After the DI execution is complete, the job calls another workflow which writes its ending time to the database, thus allowing any dependent jobs currently sleeping to run. Yes, we have had situations where our cluster was quite full of sleeping jobs because a couple of jobs ran longer than normal; it has indeed brought our servers to a grinding halt before. Having job dependencies built into DI could save us some serious headaches…
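
A rough sketch of the dependency check each job performs (the control tables, datastore and variable names are invented here to illustrate the idea; the real version is wrapped in workflows):

    # Pre-run check, placed in a While Loop whose condition is  $G_Pending > 0.
    # The count is the number of prerequisite jobs that have not yet logged an
    # end time.  $G_JobId identifies the current job in the control tables.
    $G_Pending = sql('CTRL_DS',
           'SELECT COUNT(*) FROM job_dependency d '
        || 'WHERE d.job_id = ' || $G_JobId
        || ' AND NOT EXISTS (SELECT 1 FROM job_run r '
        || ' WHERE r.job_id = d.depends_on_id AND r.end_time IS NOT NULL)');

    # Loop body: sleep, then recalculate $G_Pending with the same query
    sleep(60000);

    # When $G_Pending reaches 0, a script writes the job's start time to
    # job_run; a post-run workflow writes the end time so that any dependent
    # jobs that are sleeping can continue.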

Also, as I mentioned in another forum, the option to have sequential or queued execution rather than strict time-based execution would be nice. The optimal solution for us (I am very aware that this might not work for others; this is just a wishlist for a perfect world!) would be to be able to declare how many processes you want to run simultaneously, then set up a queue for each.


jamonwagner :us: (BOB member since 2007-03-14)

You are implementing something manually that already exists - job dependencies via workflows, “execute only once” etc.

Let me know if you disagree.


Werner Daehn :de: (BOB member since 2004-12-17)

As I mentioned in the other thread, this will work for some of our smaller flow jobs. The one I keep coming back to is our Direct Sales job. The Direct Sales job has 11 different jobs it is dependent upon. (Actually, in one of our companies there are 9 jobs it is dependent upon, in two other companies there are 10, with 11 being the number for the fourth company.) I have consolidated all of the sub-dependencies within those jobs into master jobs, so we’re now down to those 9-11 jobs that have to run before our Direct Sales job can run. Those 9-11 jobs constitute between 22 data flows in the best case and 29 data flows in the worst case. Creating one monster job that runs all 22-29 data flows before running the 4-5 dataflows that constitute our Direct Sales job would trend heavily towards unmaintainability. Add in the fact that we are only halfway through activating jobs, and that there are jobs which require the Direct Sales jobs to run before they can kick off, and we’ll eventually end up with four massive jobs (one for each company, as each company manipulates data slightly differently (grumble, grumble…)) with between 50-70 dataflows, plus a number of independent jobs (currently between 6 and 13, depending upon the company) running whenever the processor catches a break. Even consolidating the dependencies into components would be a large job.

Am I misunderstanding something about the components that would make this much easier?


jamonwagner :us: (BOB member since 2007-03-14)

We have one big job which has more than 200 dataflows in it. (OK, it takes 10+ minutes for DI to optimize the thing when you try to run it!)

I think what Werner is saying is: you can have one big job, and then each place where you’re using Jobs today, make those Workflows instead.


dnewton :us: (BOB member since 2004-01-30)

OK, that’s implemented for several of the job streams we run. We’re still trying to figure out all of the error handling (i.e. if this workflow fails, do we want to kill the entire job or keep processing?), but it seems to work pretty well. When fully implemented, we’ll have just six jobs running on an average weekday (for now), and five others (for now) that run over weekends. And yes, we are seeing performance gains over our in-house data dependency routines. Our small master job went from 130-135 seconds of combined processing time and four minutes of real time for the individual jobs to 55-60 seconds of processing and real time. We haven’t seen gains that big percentage-wise with the massive job that is currently implemented (5:10:00 avg combined vs. 5:45:00 avg individual), but they are gains…


jamonwagner :us: (BOB member since 2007-03-14)

Another enhancement, which is one that I mentioned a long time ago…

The ability to change a large number of mappings in a Query Transform to a constant value (say, null) in one step.

Many times in our data warehouse, we are pulling rows from several different sources, some of which have much less column data than others. So, in our first Query transform for each source table, we point the source table and a source version of the target table at the transform, copy over the mapping for the target table, then remove that source version of the target table. Then, for all fields that aren’t used, we have to manually go through each one and modify it to null. Not a problem on small tables, but when some of your tables have 100+ columns, it’s a mite bit of a pain to map 50-70 of them to null by hand…


jamonwagner :us: (BOB member since 2007-03-14)

  1. Include an Aggregator transform to enable aggregation of columns based on grouping of selected columns.
  2. Enable Dynamic Lookup
  3. Enable storing of values within a transform
  4. Sequence Generator to be made simpler
  5. Create better UI and logging
  6. It is very unstable and slow when processing large amounts of data, compared to other industry-standard tools
  7. Scheduling of workflows to be made simpler; I would prefer to have the option to do it from the DI interface
  8. Arranging the transforms in the data flow with a single click

I think it has a long way to go to get anywhere near the other tool sets available in the market.

Cheers
JJ :box:


jitesh :australia: (BOB member since 2005-05-20)

Can you clarify what you mean for #3 and #8?

Regarding #6, can you start another thread (not in this thread) to discuss what your issues are? Maybe people here can help.


dnewton :us: (BOB member since 2004-01-30)

My take on #3:

In order to set a variable to a value from a query, you currently need to use a script task to set the variable = sql( datastore, query ).

It would be nicer to be able to use a query transform (or similar) where the output is set to a variable instead of a field in the output stream.


HuwD :uk: (BOB member since 2007-04-19)

Some little things:

  1. Have a refresh option in Designer for the Local Object Library. Useful when we use al_engine to import ATLs.

  2. Have function keys for commands like “Validate Current”, “Validate All”, …

  3. In the SQL transform, have a larger field for the datastore name (we have some problems with long names).

  4. Add some filters on the warnings produced during validation of objects, to hide certain warnings about conversion.

To be continued…


F.Richard :belgium: (BOB member since 2008-08-27)