There is a pipe (|) separated CSV file which has 156 fields. Is it possible to validate the file format, to check whether it has all required 156 fields [or pipes (|)], soon after reading the source file or before the actual file load starts?
The work environment is: BODS version 4.0, database: SQL Server 2008 R2.
Hrm. With some work it should be possible. I would look at using the word() function in combination with a file format that is set up with a single field. I won't speculate about how fast it will be, though.
Like I outlined above, it would work. Technically you are loading the file twice, but on the first pass you would discard the records, or call a function to notify you if the file was not correct.
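As a rough sketch of that idea: set up the file format with one long varchar field (I'll call it LINE_TXT, a name I'm making up), then flag each record in a Query transform. word_ext() is the delimiter-aware variant of word(), and as far as I recall it returns NULL when the requested token doesn't exist:

# Hypothetical Query mapping; LINE_TXT is the single field of the one-column file format.
# This catches records with too few fields; records with too many pipes would need the
# pipe-count check discussed later in this thread.
ifthenelse(word_ext(LINE_TXT, 156, '|') IS NULL, 'INVALID', 'VALID')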
Another way (which I have used in the past), and which works well on Linux, is to use an exec() call to sed or other file-processing tools to pre-process the file for whatever criteria you want, and raise an exception if you get unexpected results.
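Something along these lines in a script object, for example. This is only a sketch: $G_FILE is a made-up global variable holding the file path, the awk/wc pipeline assumes a Linux job server, and the quote escaping may need adjusting for your shell:

# Count the lines whose pipe-delimited field count is not 156 (i.e. not 155 pipes).
# Flag 8 waits for the command to finish and returns its output.
$L_BAD = exec('/bin/sh', '-c "awk -F\'|\' \'NF != 156\' ' || $G_FILE || ' | wc -l"', 8);
if ($L_BAD <> '0')
begin
    raise_exception('File ' || $G_FILE || ' failed validation: ' || $L_BAD || ' bad line(s)');
end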
I am sure there are other ways to do it; there are almost always more than a couple of ways to solve a problem.
Please let me know how to handle the load so that even if one invalid file fails validation, the remaining valid files are still processed.
Currently I am reading all the files in a single dataflow, which reads files dynamically like this:
start
filecount = 2                      (set the file count before entering the loop, not inside it)
while (filecount >= 1)
    assign variable1 = <next file name>
    load data
    filecount = filecount - 1
end loop
end
Well, there are a number of options, depending on how you identify the files. Let's say you are using sed or similar tools: you simply move the bad files to a "badfiles" folder and check the next one until there are no more. Then, within your dataflow, if all the files are the same format, you just use a wildcard in the file name.
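A sketch of the quarantine step (Windows paths here; $G_BAD_FILE and the folder names are placeholders of mine, and you may need to double-check backslash escaping in string literals):

# Move a file that failed validation out of the source folder, so that the
# wildcard read (e.g. D:\source\*.txt) only picks up the good files.
exec('cmd.exe', '/C move /Y ' || $G_BAD_FILE || ' D:\badfiles', 8);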
If processing files in a loop is necessary for other reasons, I would do something like use an exec() command to pipe a directory listing into a file, then have a dataflow load that file into a table for processing. Put your main dataflow inside a loop: look up a file to process, delete it from the processing table, and then move on to the next until there are none left to process.
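For instance (Windows again; every path below is a placeholder):

# Dump a bare-format directory listing to a work file; a dataflow can then
# load D:\work\filelist.txt into the processing table.
exec('cmd.exe', '/C dir /B D:\source\*.txt > D:\work\filelist.txt', 8);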
That certainly complicates things. Doing this with a script and a loop could have some negative performance impact.
I would first validate the files. Any file passing the validation is moved to the PASSED_VALIDATION directory. Then all the files in that directory are OK to process, and a single execution of a dataflow can process them all.
As mentioned above, Cygwin is a fine set of utilities, and I've made extensive use of it with DS jobs. In this case you may be able to get by with the DOS utility FINDSTR. If you're not on a Windows OS, then find the appropriate *nix utility.
FINDSTR /R /M ".|.|.*" *.txt
This will give you a list of all the files that contain lines matching the pattern (note that /M reports a file name as soon as any line matches). You'll have to specify the full pattern, but that's pretty easy to do. It would work a lot more easily if you could get a list of the files where the pattern failed, but I couldn't figure out how to do that.
The problem here is that FINDSTR produces a list of files with each file on a separate line, and the DS exec() function will only return the first 1024 characters. So if you have many files that satisfy the validation, you'll likely exceed the 1024-character limit. To work around this you may want to pipe the results of FINDSTR to a file, and then use a dataflow to read in the list of file names and an exec() in a mapping to move each file to the PASSED_VALIDATION directory.
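Putting that together, roughly (the paths and the passed_list.txt name are mine, and the FINDSTR pattern is abbreviated as in the example above):

# Write the names of the files that pass the pattern test to a list file...
exec('cmd.exe', '/C findstr /R /M ".|.|.*" D:\source\*.txt > D:\work\passed_list.txt', 8);

# ...then, in a Query mapping over the rows read from passed_list.txt, move each
# file (FILE_NAME being the column that holds one file name per row):
exec('cmd.exe', '/C move /Y ' || FILE_NAME || ' D:\PASSED_VALIDATION', 8)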
A dataflow loads the content of all the files (using […*.txt]) into a single CONSOLIDATED_COLUMN of varchar(8000). Please note that we have included the file name column (DI_Filename). One integer field is mapped to LENGTH(CONSOLIDATED_COLUMN) and another integer field is mapped to LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')).
If LENGTH(CONSOLIDATED_COLUMN) - LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')) = 155, then FLAG it 'V', else 'I'.
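In DS syntax the flag mapping might look something like this (replace_substr() being the scripting equivalent of REPLACE; 155 pipes means 156 fields):

# 'V' = valid (exactly 155 pipes, i.e. 156 fields), 'I' = invalid
ifthenelse(length(CONSOLIDATED_COLUMN)
    - length(replace_substr(CONSOLIDATED_COLUMN, '|', '')) = 155, 'V', 'I')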
I guess the findstr ".|.|.*" *.txt approach is the way we should go. Please suggest.
As I said, there are 156 fields in the pipe (|) separated CSV input files. I need to check the number of pipes (|) in each record of each file. This I accomplished as follows:
Multiple CSV files are read from the source (with DI_Filename), e.g. d:\*.txt
All CSV lines are read into a single column (say CONSOLIDATED_COLUMN)
Created another column, of int type, populated with LENGTH(CONSOLIDATED_COLUMN)
Created one more column, of int type, populated with LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', ''))
Subtracted the values of the two int columns,
i.e. LENGTH(CONSOLIDATED_COLUMN) - LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')) = 155
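Once the flags are in a table, a script step after the validation dataflow could decide per file, along these lines (VALIDATION_RESULT, DS_STAGE and $G_FILE are names I'm inventing for the sketch, and the quote escaping inside the sql() call may need adjusting):

# Count the invalid rows for the current file via the sql() function.
$L_INVALID = sql('DS_STAGE', 'SELECT COUNT(*) FROM VALIDATION_RESULT'
    || ' WHERE FLAG = \'I\' AND DI_FILENAME = \'' || $G_FILE || '\'');
if ($L_INVALID = '0')
begin
    # Clean file: promote it so that the single wildcard load picks it up.
    exec('cmd.exe', '/C move /Y ' || $G_FILE || ' D:\PASSED_VALIDATION', 8);
end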