Select count distinct...

system · January 13, 2012, 4:37pm

I want to implement the equivalent of the following SQL in Data Services:

SELECT   x.pkey
,        COUNT (DISTINCT x.col_1)
,        COUNT (DISTINCT x.col_2)
,        COUNT (DISTINCT x.col_3)
    FROM x_cdb_join_match x
GROUP BY x.pkey

The result of this query is used to transform the input data (coming from the left Data_Transform; if their value is not unique per pkey it is set to null).

As far as I can see I have to create three of the same flows and join them again (see screen-shot). What makes this worse is that despite the fact that I use Data_Transfer transforms, the queries are not executed in parallel. So this could be done much more efficient. I am looking for suggestions to improve the performance of this. Is there a way to have one Query transform that generates one query like the one above?

lamanp (BOB member since 2008-09-02)

system · January 13, 2012, 4:57pm

Is it the DISTINCT part that is tripping up the Query transform? I’m not clear on where the problem is. If you were unable to code the count(distinct x) in a query transform then I would say just go to a SQL transform. But since you have the Data Transfer tranform in the Dataflow then the SQL transform cannot be used.

eganjp (BOB member since 2007-09-12)

system · January 13, 2012, 5:28pm

This is just part of a larger data flow. This is somewhere in the middle of it, so data is coming from other transforms, in this case separated by a Data_Transfer.

In my opinion SQL transforms are a last resort, and not to be used if alternatives are available. And in most cases when you think you need them I prefer using a database view instead.

The real problem is that I do not know of a way to do a count(distinct col) in oner Query transform. I use two steps:

Select distinct
Count
But that cannot be done for more than one column at a time, whereas SQL can do it in one statement with all the benefits that go with that. So I am actually looking for an efficient solution to execute the functionality from the above query.

lamanp (BOB member since 2008-09-02)

system · January 13, 2012, 5:36pm

Dumb question: Why aren’t you using the count_distinct() built in function? It will convert into SQL that is the same as count(distinct column_name).

eganjp (BOB member since 2007-09-12)

system · January 13, 2012, 5:51pm

That’s not a dumb question, but a brilliant idea. I seem to have overlooked this function for years…
Thanks.

lamanp (BOB member since 2008-09-02)

system · January 13, 2012, 6:09pm

I hope it works for you. It should eliminate a lot the extra work in your Dataflow.

eganjp (BOB member since 2007-09-12)

system · January 13, 2012, 7:13pm

It sure makes things a lot easier and it also speeds up the flow. Thanks again.

lamanp (BOB member since 2008-09-02)

system · August 24, 2018, 4:52pm

Again I run into problems with the count_distinct() function and came across this old topic. What I am trying to achieve is to build the following SQL in BODS, but it will not produce a pushdown of the whole lot ( I believe it should, or can this function not be pushed down to an Oracle 12 database?):

  SELECT h.fp_id
,        COUNT (
           DISTINCT CASE
                      WHEN o.fp IS NOT NULL THEN o.pos_id
                      WHEN l.etl_key > 0 THEN NULL
                      WHEN p.fp IS NOT NULL THEN p.pos_id
                    END)
           aantal_pos
    FROM fp_his h
        LEFT JOIN pos p
          ON p.fp = h.fp_id
        LEFT JOIN obj o
          ON o.fp = h.fp_id
        LEFT JOIN obj l
          ON l.pos_id = p.pos_id
GROUP BY h.fp_id

The purpose is to have 3 counters of distinct id’s, with some logic for the first. All of this can be written into a single decode():

count_distinct( decode( o.fp = h.fp_id, o.pos,
                        l.etl_key > 0, null,
                        p.fp = h.fp_id, p.pos,
                        null ) )

but it does not produce a pushdown to the database. Even if I remoce the logic and simply do a

count_distinct(o.pos_id)

it won’t pushdown. What’s happening?

lamanp (BOB member since 2008-09-02)

system · August 27, 2018, 7:19am

Again I run into problems with the count_distinct() function and came across this old topic. What I am trying to achieve is to build the following SQL in BODS, but it will not produce a pushdown of the whole lot ( I believe it should, or can this function not be pushed down to an Oracle 12 database?):

  SELECT h.fp_id
,        COUNT (
           DISTINCT CASE
                      WHEN o.fp IS NOT NULL THEN o.pos_id
                      WHEN l.etl_key > 0 THEN NULL
                      WHEN p.fp IS NOT NULL THEN p.pos_id
                    END)
           aantal_pos
    FROM fp_his h
        LEFT JOIN pos p
          ON p.fp = h.fp_id
        LEFT JOIN obj o
          ON o.fp = h.fp_id
        LEFT JOIN obj l
          ON l.pos_id = p.pos_id
GROUP BY h.fp_id

The purpose is to have a counter of distinct id’s, with some logic applied. All of this can be written into a single decode():

count_distinct( decode( o.fp = h.fp_id, o.pos,
                        l.etl_key > 0, null,
                        p.fp = h.fp_id, p.pos,
                        null ) )

but it does not produce a pushdown to the database. Even if I remove the logic and simply do a

count_distinct(o.pos_id)

it won’t pushdown. What’s happening?

lamanp (BOB member since 2008-09-02)

system · September 5, 2018, 5:47pm

Is it possible that the LEFT joins are tripping it up? I’ve seen DS produce some utterly stupid queries that don’t seem to be close to what I intended. Simplifying it sometimes helps. Breaking it up into separate (but linked) Query transforms can also help.

For example, I never code a gen_row_num_by_group() in the same Query transform that I do the Order By that the by_group is counting on. I have to code the Order By in the first Query transform and the gen_row_num_by_group() in the second Query transform.

eganjp (BOB member since 2007-09-12)