I have 10 million records in source and I want to just just 1000 of them in my target. These 1000 rows are to be loaded on sample basis without any specific business rule. I know how to implement ‘where’ logic in query transformer to restrict the data but that is not my requirement.
Please help.
I do this a couple different ways, both using python scripts.
the python random module has a choice method, that returns a non repeating random selection for a given range. You simply generate N number of these into a list, sort them, then write them out to be joined on.
the other way is of course to use math and calculate mathmatically which record you want to select, based on total record/record selection then incremented based on an offset. The python generators work really well for this.
if you know the Nth offset, Id imagine you could also use a row gen transform and multiply the offset by the row number, and round it to get which record to select. You would have to do some checking to make sure you selected enough records and recursively Nth out the remainder.