Random Sample from Oracle

I need to take a random sampling of records from a very large table. I would like the work to be pushed down to Oracle. The best way I know involves writing a subquery such as:

select pk_id from (select pk_id from table order by dbms_random.value()) where rownum <= 10

What’s the best way to accomplish the same thing in DI without sending all of the records to job server for the sampling?

Thanks!


tangel (BOB member since 2008-05-12)

Row_Gen --> Query_rand
                       \
             Table  --- Query_join ----->

The Row_Gen should have a join rank of 100, the Table of 10 and when you check the Display-Optimzed-SQL from the Validation menu it should show “select … from table where key = :ALVariable1”


Werner Daehn :de: (BOB member since 2004-12-17)

I’m not sure I follow. How do I join the tables in the second query transform such that a record sample is retrieved? Do I still need to use the Oracle random function and just use the row gen transform to limit the number of records?


tangel (BOB member since 2008-05-12)

The Row_Gen generates e.g .200 rows. For each you create an arbitary key value using the DI rand() function. And then you “lookup” via the join the 200 rows.


Werner Daehn :de: (BOB member since 2004-12-17)

I don’t understand the lookup part. What does the join look like?


tangel (BOB member since 2008-05-12)

I think I understand now. You think I can generate valid keys from my source table. Unfortunately, it’s a 16 character alphanumeric key that is sparse.

I was thinking this might work if I can randomize the sort order on my source table and then join it to 100 records from the row generation without returning the entire recordset…


tangel (BOB member since 2008-05-12)

If you have 16 characters, you can make it pseudo random by double seeding on the key column. First seed is character position (1-16) into a global integer variable. Second seed is the starting character for the column, using the set of valid characters for the key column. If upper, lower, numeric and punctuation are all allowed, just use an ASCII value. Otherwise you might have to get a bit more creative but it is not too complicated.

Then you use the two seeded variables as part of your where clause. Would that work?

  • Ernie

eepjr24 :us: (BOB member since 2005-09-16)

too bad. This asks for plan B then.

Create a new database user, this user will get a database view with the same name like the table, it is reading from the table but just the 200 sample rows.

Then create a second datastore configuration, same database but the new user and the configuration should have an alias to rename the original user to the new user.

Create system configurations for both datastore configurations.

If you execute the job with config_real, it will read the real table. Executing the job with config_sample will use the second datastore configuration, this one has an alias so it will execute a select * from newowner.tablename instead of oldowner.tablename and hence reading the view.


Werner Daehn :de: (BOB member since 2004-12-17)

The Row Gen and Query_Rand that Werner suggested would take the place of the randomrow_table in the query below:

SELECT sampled_table.* 
FROM (SELECT ETL_JOB_LOGS.*, rownum rnum FROM ETL_JOB_LOGS ORDER BY DBMS_RANDOM.VALUE) sampled_table,
     (SELECT DBMS_RANDOM.VALUE(1,20) as rnum from dual UNION ALL
      SELECT DBMS_RANDOM.VALUE(1,20) as rnum from dual UNION ALL
      SELECT DBMS_RANDOM.VALUE(1,20) as rnum from dual UNION ALL
      SELECT DBMS_RANDOM.VALUE(1,20) as rnum from dual UNION ALL
      SELECT DBMS_RANDOM.VALUE(1,20) as rnum from dual) randomrow_table
where sampled_table.rnum = CEIL(randomrow_table.rnum);

eganjp :us: (BOB member since 2007-09-12)

Does the size of the random set have to change? e.g. 5% or 10% of the records?

When I have approached this I have used an Oracle view and combined the random value and ranking functions to assign each row in the table a random rank between 1 & 100. Once you have that view it is just a matter of selecting rows where the rank is less than or equal to your percentage.


dfoster (BOB member since 2008-01-08)