CSV File Format Validation

There is a pipe (|) separated csv file which has 156 fields. Is it possible to validate the file format, i.e. check whether each record has all 156 fields [or pipes (|)], soon after reading the source file or before the actual file load starts?
The work environment is: BODS version 4.0, database SQL Server 2008 R2.


maddyk (BOB member since 2011-02-08)

Hrm. With some work it should be possible. I would look at using the word() function in combination with a file format that is set up with a single field. I won’t make any speculations about how fast it will be, though.


eepjr24 :us: (BOB member since 2005-09-16)

Won’t Designer throw an error if it finds a row delimiter too early? You could wrap your job in a try/catch and look for those errors.


DanHosler :us: (BOB member since 2013-06-19)

That would not be before the file load starts. And it would not catch those where there were too many delimiters.


eepjr24 :us: (BOB member since 2005-09-16)

I’m not sure how you can do anything with the file before file load.


DanHosler :us: (BOB member since 2013-06-19)

What I outlined above would work. Technically you are loading the file twice, but on the first pass you would discard the records, or call a function to send a notification if the file was not correct.

Another way (which I have used in the past), and which works well on Linux, is to use an exec() call to sed or other file processing tools to pre-process the file against whatever criteria you want, and raise an exception if you get unexpected results.
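The pre-processing check described above can be sketched in Python as a rough illustration; the function name and the 155-pipe expectation (156 fields means 155 delimiters) are assumptions for this example, not part of any BODS API:

```python
EXPECTED_PIPES = 155  # 156 fields => 155 pipe delimiters per record

def validate_file(path, expected=EXPECTED_PIPES):
    """Return (line_number, pipe_count) for every record with the wrong count."""
    bad = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            count = line.count("|")
            if count != expected:
                bad.append((lineno, count))
    return bad
```

If the returned list is non-empty, the job could raise an exception before the real load begins, which is the same role sed plays in the suggestion above.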

I am sure there are other ways to do it; there is almost always more than a couple of ways to solve a problem.


eepjr24 :us: (BOB member since 2005-09-16)

Thanks for the quick and valuable information. Let me look for a Windows-based file processing tool, or try loading the files twice.


maddyk (BOB member since 2011-02-08)

Any ideas for handling such file format issues while reading and/or loading file contents into the database would be appreciated.


maddyk (BOB member since 2011-02-08)

Can you be more specific? What are you trying to do if there is a row with an extra or missing pipe, besides notifying the originator?

  • E

eepjr24 :us: (BOB member since 2005-09-16)

Please let me know how to handle the load so that even if one invalid file fails validation, the remaining valid files are still processed.

Currently I am reading all files in a single dataflow which reads files dynamically, as:

start
filecount = 2
while (filecount >= 1)
    assign variable1 = file1
    load data
    filecount = filecount - 1
end loop
end
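One way to get the behavior asked for above (an invalid file is set aside while the remaining files still load) can be sketched in Python; the directory names, the 155-pipe rule, and the function name are all assumptions for illustration:

```python
import glob, os, shutil

def load_valid_files(src_dir, bad_dir, expected_pipes=155):
    """Validate each file before loading; quarantine bad files and carry on."""
    os.makedirs(bad_dir, exist_ok=True)
    loaded = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        with open(path) as f:
            ok = all(line.count("|") == expected_pipes
                     for line in f if line.strip())
        if ok:
            loaded.append(path)          # stand-in for the real "load data" step
        else:
            shutil.move(path, bad_dir)   # invalid file: set aside, keep going
    return loaded
```

The key point is that the validation failure only removes that one file from the batch; it never aborts the loop.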


maddyk (BOB member since 2011-02-08)

If you install cygwin you can get access to tools such as sed.


DanHosler :us: (BOB member since 2013-06-19)

Well, there are a number of options, depending on how you identify the files. So let’s say you are using sed or similar tools. You simply move the bad files to a “badfiles” folder and check the next one until there are no more. Then, within your dataflow, if all files are the same format you just use a wildcard in the file name.

If processing files in a loop is necessary for other reasons, I would do something like use an exec() command to pipe a directory listing into a file. Then have a dataflow load that file to a table for processing. Put your main data flow inside a loop, look up a file to process and delete it from the processing table, then move on to the next until there are none left to process.
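The directory-listing-and-processing-table idea above might be sketched roughly like this in Python (a stand-in for the exec() directory listing and the worklist table; every name here is made up for illustration):

```python
import os

def write_worklist(src_dir, listing_path):
    """Mimic exec() piping a directory listing to a file: one name per line."""
    names = sorted(n for n in os.listdir(src_dir) if n.endswith(".txt"))
    with open(listing_path, "w") as out:
        out.write("\n".join(names) + "\n")
    return names

def next_to_process(worklist):
    """Look up one file and delete it from the processing list, as the loop would."""
    return worklist.pop(0) if worklist else None
```

In the real job the worklist would be a database table loaded by a dataflow, and the main dataflow inside the loop would take one name per iteration until the table is empty.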


eepjr24 :us: (BOB member since 2005-09-16)

That certainly complicates things. Doing this with a script and a loop could have some negative performance implications.

I would first validate the files. Any file passing the validation is sent to the PASSED_VALIDATION directory. Then all remaining files in that directory are OK to process and a single execution of a Dataflow can process them all.

As mentioned above Cygwin is a fine set of utilities and I’ve made extensive use of it with DS jobs. In this case you may be able to get by with the DOS utility FINDSTR. If you’re not on a Windows O/S then find the appropriate *nix utility.

FINDSTR /R /M ".*|.*|.*" *.txt
This will give you a list of all the files where every line matches the pattern. You’ll have to specify the full pattern yourself, but it’s pretty easy to do. It would be a lot easier if you could get a list of files where the pattern failed, but I couldn’t figure out how to get that.

The problem here is that FINDSTR produces a list of files with each file on a separate line. The DS exec() function will only return the first 1024 characters. So if you have many files that satisfy the validation you’ll likely exceed the 1024 character limit. To work around this you may want to pipe the results of the FINDSTR to a file and then use a Dataflow to read in the list of file names and use exec() in a mapping to move each file to the PASSED_VALIDATION directory.
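A rough Python equivalent of that workaround (check each file against a pattern, write the passing names to a list file, then move each one to the PASSED_VALIDATION directory) might look like the sketch below; the regex, function names, and file names are assumptions for illustration, not what FINDSTR itself produces:

```python
import os, re, shutil

def files_passing(src_dir, pattern):
    """Like FINDSTR /R /M: names of files where every line matches the pattern."""
    rx = re.compile(pattern)
    passed = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(src_dir, name)) as f:
            if all(rx.match(line.rstrip("\n")) for line in f):
                passed.append(name)
    return passed

def stage_passed(src_dir, dest_dir, pattern=r"^([^|]*\|){155}[^|]*$"):
    """Write the passing names to a list file, then move each file across."""
    os.makedirs(dest_dir, exist_ok=True)
    names = files_passing(src_dir, pattern)
    with open(os.path.join(src_dir, "passed_list.out"), "w") as out:
        out.write("\n".join(names))
    for name in names:
        shutil.move(os.path.join(src_dir, name), dest_dir)
    return names
```

Because the list of names goes through a file rather than a function return value, the 1024-character exec() limit described above never comes into play.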


eganjp :us: (BOB member since 2007-09-12)

Thanks for the kind suggestion.


maddyk (BOB member since 2011-02-08)

Let us know how it works out and if you come up with something better.


eganjp :us: (BOB member since 2007-09-12)

A dataflow loads the content of all files (using […*.txt]) into a single CONSOLIDATED_COLUMN of varchar(8000). Please note that we have included a file name column (DI_Filename), one integer field mapped to LENGTH(CONSOLIDATED_COLUMN), and another integer field mapped to LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')).
If LENGTH(CONSOLIDATED_COLUMN) - LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')) = 155, then FLAG it 'V', else 'I'.
I guess the FINDSTR ".*|.*|.*" *.txt approach is the way we should work. Please suggest.


maddyk (BOB member since 2011-02-08)

Can you be a little more clear about what you are asking for?

  • E

eepjr24 :us: (BOB member since 2005-09-16)

As I said, there are 156 fields in the pipe (|) separated csv input files. I need to check the number of pipes (|) in each record of a file. This I accomplished as:

  • Read multiple csv files from source (with DI_Filename), like d:\*.txt
  • Read each csv file into a single column (say CONSOLIDATED_COLUMN)
  • Created another column of int type, populated as LENGTH(CONSOLIDATED_COLUMN)
  • Created one more column of int type, populated as LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', ''))
  • Subtracted the two int columns,
    i.e. LENGTH(CONSOLIDATED_COLUMN) - LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')) = 155
  • Flagged the records which satisfy the above condition
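The length-difference check in the steps above boils down to something like this minimal Python sketch (the constant and function name are illustrative, not from the job itself):

```python
EXPECTED_FIELDS = 156

def flag_record(consolidated_column):
    """'V' when the record has exactly 155 pipes (156 fields), else 'I'."""
    pipe_count = (len(consolidated_column)
                  - len(consolidated_column.replace("|", "")))
    return "V" if pipe_count == EXPECTED_FIELDS - 1 else "I"
```

Subtracting the length with pipes removed from the full length counts the delimiters without any parsing, which is why it works on a single varchar(8000) column.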

maddyk (BOB member since 2011-02-08)

That sounds like it will work. What are you having trouble with?

  • E

eepjr24 :us: (BOB member since 2005-09-16)

Right, it works now. Sincere thanks to you all for providing kind help, suggestions and cooperation.


maddyk (BOB member since 2011-02-08)