Select count distinct...

I want to implement the equivalent of the following SQL in Data Services:

SELECT   x.pkey
,        COUNT (DISTINCT x.col_1)
,        COUNT (DISTINCT x.col_2)
,        COUNT (DISTINCT x.col_3)
    FROM x_cdb_join_match x
GROUP BY x.pkey

The result of this query is used to transform the input data (coming from the left Data_Transform; if their value is not unique per pkey it is set to null).

As far as I can see I have to create three of the same flows and join them again (see screen-shot). What makes this worse is that despite the fact that I use Data_Transfer transforms, the queries are not executed in parallel. So this could be done much more efficient. I am looking for suggestions to improve the performance of this. Is there a way to have one Query transform that generates one query like the one above?
Image1.png


lamanp :netherlands: (BOB member since 2008-09-02)

Is it the DISTINCT part that is tripping up the Query transform? I’m not clear on where the problem is. If you were unable to code the count(distinct x) in a query transform then I would say just go to a SQL transform. But since you have the Data Transfer tranform in the Dataflow then the SQL transform cannot be used.


eganjp :us: (BOB member since 2007-09-12)

This is just part of a larger data flow. This is somewhere in the middle of it, so data is coming from other transforms, in this case separated by a Data_Transfer.

In my opinion SQL transforms are a last resort, and not to be used if alternatives are available. And in most cases when you think you need them I prefer using a database view instead.

The real problem is that I do not know of a way to do a count(distinct col) in oner Query transform. I use two steps:

  1. Select distinct
  2. Count
    But that cannot be done for more than one column at a time, whereas SQL can do it in one statement with all the benefits that go with that. So I am actually looking for an efficient solution to execute the functionality from the above query.

lamanp :netherlands: (BOB member since 2008-09-02)

Dumb question: Why aren’t you using the count_distinct() built in function? It will convert into SQL that is the same as count(distinct column_name).


eganjp :us: (BOB member since 2007-09-12)

That’s not a dumb question, but a brilliant idea. I seem to have overlooked this function for years…
Thanks.


lamanp :netherlands: (BOB member since 2008-09-02)

I hope it works for you. It should eliminate a lot the extra work in your Dataflow.


eganjp :us: (BOB member since 2007-09-12)

It sure makes things a lot easier and it also speeds up the flow. Thanks again.


lamanp :netherlands: (BOB member since 2008-09-02)

Again I run into problems with the count_distinct() function and came across this old topic. What I am trying to achieve is to build the following SQL in BODS, but it will not produce a pushdown of the whole lot ( I believe it should, or can this function not be pushed down to an Oracle 12 database?):

  SELECT h.fp_id
,        COUNT (
           DISTINCT CASE
                      WHEN o.fp IS NOT NULL THEN o.pos_id
                      WHEN l.etl_key > 0 THEN NULL
                      WHEN p.fp IS NOT NULL THEN p.pos_id
                    END)
           aantal_pos
    FROM fp_his h
        LEFT JOIN pos p
          ON p.fp = h.fp_id
        LEFT JOIN obj o
          ON o.fp = h.fp_id
        LEFT JOIN obj l
          ON l.pos_id = p.pos_id
GROUP BY h.fp_id

The purpose is to have 3 counters of distinct id’s, with some logic for the first. All of this can be written into a single decode():

count_distinct( decode( o.fp = h.fp_id, o.pos,
                        l.etl_key > 0, null,
                        p.fp = h.fp_id, p.pos,
                        null ) )

but it does not produce a pushdown to the database. Even if I remoce the logic and simply do a

count_distinct(o.pos_id)

it won’t pushdown. What’s happening?


lamanp :netherlands: (BOB member since 2008-09-02)

Again I run into problems with the count_distinct() function and came across this old topic. What I am trying to achieve is to build the following SQL in BODS, but it will not produce a pushdown of the whole lot ( I believe it should, or can this function not be pushed down to an Oracle 12 database?):

  SELECT h.fp_id
,        COUNT (
           DISTINCT CASE
                      WHEN o.fp IS NOT NULL THEN o.pos_id
                      WHEN l.etl_key > 0 THEN NULL
                      WHEN p.fp IS NOT NULL THEN p.pos_id
                    END)
           aantal_pos
    FROM fp_his h
        LEFT JOIN pos p
          ON p.fp = h.fp_id
        LEFT JOIN obj o
          ON o.fp = h.fp_id
        LEFT JOIN obj l
          ON l.pos_id = p.pos_id
GROUP BY h.fp_id

The purpose is to have a counter of distinct id’s, with some logic applied. All of this can be written into a single decode():

count_distinct( decode( o.fp = h.fp_id, o.pos,
                        l.etl_key > 0, null,
                        p.fp = h.fp_id, p.pos,
                        null ) )

but it does not produce a pushdown to the database. Even if I remove the logic and simply do a

count_distinct(o.pos_id)

it won’t pushdown. What’s happening?


lamanp :netherlands: (BOB member since 2008-09-02)

Is it possible that the LEFT joins are tripping it up? I’ve seen DS produce some utterly stupid queries that don’t seem to be close to what I intended. Simplifying it sometimes helps. Breaking it up into separate (but linked) Query transforms can also help.

For example, I never code a gen_row_num_by_group() in the same Query transform that I do the Order By that the by_group is counting on. I have to code the Order By in the first Query transform and the gen_row_num_by_group() in the second Query transform.


eganjp :us: (BOB member since 2007-09-12)