CSV File Format Validation

There is a pipe (|) separated csv file which has 156 fields. Is it possible to validate the file format, i.e. check whether each record has all 156 fields [or pipes (|)], soon after reading the source file or before the actual file load starts?
The work environment is: BODS version 4.0, database SQL Server 2008 R2.


maddyk (BOB member since 2011-02-08)

Hrm. With some work it should be possible. I would look at using the word() function in combination with a file format that is set up with a single field. I won’t make any speculations about how fast it will be, though.


eepjr24 :us: (BOB member since 2005-09-16)

Won’t Designer throw an error if it finds a row delimiter too early? You could wrap your job in a try/catch and look for those errors.


DanHosler :us: (BOB member since 2013-06-19)

That would not be before the file load starts. And it would not catch those where there were too many delimiters.


eepjr24 :us: (BOB member since 2005-09-16)

I’m not sure how you can do anything with the file before file load.


DanHosler :us: (BOB member since 2013-06-19)

What I outlined above would work. Technically you are loading the file twice, but on the first pass you would discard the records, or call a function to send a notification if the file was not correct.

Another way (which I have used in the past), and which works well on Linux, is to use an exec() call to sed or other file processing tools to pre-process the file against whatever criteria you want, and raise an exception if you get unexpected results.
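The pre-processing check described above can be sketched in Python as a rough illustration; the function name and the 155-pipe expectation (156 fields means 155 delimiters) are assumptions for this example, not part of any BODS API:

```python
EXPECTED_PIPES = 155  # 156 fields => 155 pipe delimiters per record

def validate_file(path, expected=EXPECTED_PIPES):
    """Return (line_number, pipe_count) for every record with the wrong count."""
    bad = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            count = line.count("|")
            if count != expected:
                bad.append((lineno, count))
    return bad
```

If the returned list is non-empty, the job could raise an exception before the real load begins, which is the same role sed plays in the suggestion above.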

I am sure there are other ways to do it; there is almost always more than a couple of ways to solve a problem.


eepjr24 :us: (BOB member since 2005-09-16)

Thanks for the quick and valuable information. Let me look for a Windows-based file processing tool, or try loading the files twice.


maddyk (BOB member since 2011-02-08)

Any ideas for handling such file format issues while reading and/or loading file contents into the database would be appreciated.


maddyk (BOB member since 2011-02-08)

Can you be more specific? What are you trying to do if there is a row with an extra or missing pipe, besides notifying the originator?

  • E

eepjr24 :us: (BOB member since 2005-09-16)

Please let me know how to handle the load so that even if one invalid file fails validation, the remaining valid files are still processed.

Currently I am reading all files in a single dataflow which reads files dynamically, as:

start
filecount = 2
while (filecount >= 1)
    assign variable1 = file1
    load data
    filecount = filecount - 1
end loop
end
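One way to get the behavior asked for above (an invalid file is set aside while the remaining files still load) can be sketched in Python; the directory names, the 155-pipe rule, and the function name are all assumptions for illustration:

```python
import glob, os, shutil

def load_valid_files(src_dir, bad_dir, expected_pipes=155):
    """Validate each file before loading; quarantine bad files and carry on."""
    os.makedirs(bad_dir, exist_ok=True)
    loaded = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        with open(path) as f:
            ok = all(line.count("|") == expected_pipes
                     for line in f if line.strip())
        if ok:
            loaded.append(path)          # stand-in for the real "load data" step
        else:
            shutil.move(path, bad_dir)   # invalid file: set aside, keep going
    return loaded
```

The key point is that the validation failure only removes that one file from the batch; it never aborts the loop.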


maddyk (BOB member since 2011-02-08)

If you install cygwin you can get access to tools such as sed.


DanHosler :us: (BOB member since 2013-06-19)

Well, there are a number of options, depending on how you identify the files. So let’s say you are using sed or similar tools. You simply move the bad files to a “badfiles” folder and check the next one until there are no more. Then, within your dataflow, if all files are the same format you just use a wildcard in the file name.

If processing files in a loop is necessary for other reasons, I would do something like use an exec() command to pipe a directory listing into a file. Then have a dataflow load that file to a table for processing. Put your main data flow inside a loop, look up a file to process and delete it from the processing table, then move on to the next until there are none left to process.
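The directory-listing-and-processing-table idea above might be sketched roughly like this in Python (a stand-in for the exec() directory listing and the worklist table; every name here is made up for illustration):

```python
import os

def write_worklist(src_dir, listing_path):
    """Mimic exec() piping a directory listing to a file: one name per line."""
    names = sorted(n for n in os.listdir(src_dir) if n.endswith(".txt"))
    with open(listing_path, "w") as out:
        out.write("\n".join(names) + "\n")
    return names

def next_to_process(worklist):
    """Look up one file and delete it from the processing list, as the loop would."""
    return worklist.pop(0) if worklist else None
```

In the real job the worklist would be a database table loaded by a dataflow, and the main dataflow inside the loop would take one name per iteration until the table is empty.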


eepjr24 :us: (BOB member since 2005-09-16)

That certainly complicates things. Doing this with a script and a loop could have some negative performance implications.

I would first validate the files. Any file passing the validation is sent to the PASSED_VALIDATION directory. Then all remaining files in that directory are OK to process and a single execution of a Dataflow can process them all.

As mentioned above Cygwin is a fine set of utilities and I’ve made extensive use of it with DS jobs. In this case you may be able to get by with the DOS utility FINDSTR. If you’re not on a Windows O/S then find the appropriate *nix utility.

FINDSTR /R /M ".*|.*|.*" *.txt
This will give you a list of all the files where every line matches the pattern. You’ll have to specify the full pattern yourself, but it’s pretty easy to do. It would be a lot easier if you could get a list of files where the pattern failed, but I couldn’t figure out how to get that.

The problem here is that FINDSTR produces a list of files with each file on a separate line. The DS exec() function will only return the first 1024 characters. So if you have many files that satisfy the validation you’ll likely exceed the 1024 character limit. To work around this you may want to pipe the results of the FINDSTR to a file and then use a Dataflow to read in the list of file names and use exec() in a mapping to move each file to the PASSED_VALIDATION directory.
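A rough Python equivalent of that workaround (check each file against a pattern, write the passing names to a list file, then move each one to the PASSED_VALIDATION directory) might look like the sketch below; the regex, function names, and file names are assumptions for illustration, not what FINDSTR itself produces:

```python
import os, re, shutil

def files_passing(src_dir, pattern):
    """Like FINDSTR /R /M: names of files where every line matches the pattern."""
    rx = re.compile(pattern)
    passed = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(src_dir, name)) as f:
            if all(rx.match(line.rstrip("\n")) for line in f):
                passed.append(name)
    return passed

def stage_passed(src_dir, dest_dir, pattern=r"^([^|]*\|){155}[^|]*$"):
    """Write the passing names to a list file, then move each file across."""
    os.makedirs(dest_dir, exist_ok=True)
    names = files_passing(src_dir, pattern)
    with open(os.path.join(src_dir, "passed_list.out"), "w") as out:
        out.write("\n".join(names))
    for name in names:
        shutil.move(os.path.join(src_dir, name), dest_dir)
    return names
```

Because the list of names goes through a file rather than a function return value, the 1024-character exec() limit described above never comes into play.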


eganjp :us: (BOB member since 2007-09-12)

Thanks for the kind suggestion.


maddyk (BOB member since 2011-02-08)

Let us know how it works out and if you come up with something better.


eganjp :us: (BOB member since 2007-09-12)

A dataflow loads the content of all files (using […*.txt]) into a single CONSOLIDATED_COLUMN of varchar(8000). Please note that we have included a file name column (DI_Filename), one integer field mapped to LENGTH(CONSOLIDATED_COLUMN), and another integer field mapped to LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')).
If LENGTH(CONSOLIDATED_COLUMN) - LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')) = 155, then FLAG it 'V', else 'I'.
I guess the FINDSTR ".*|.*|.*" *.txt approach is the way we should work. Please suggest.


maddyk (BOB member since 2011-02-08)

Can you be a little more clear about what you are asking for?

  • E

eepjr24 :us: (BOB member since 2005-09-16)

As I said, there are 156 fields in the pipe (|) separated csv input files. I need to check the number of pipes (|) in each record of a file. This I accomplished as:

  • Read multiple csv files from source (with DI_Filename), like d:\*.txt
  • Read each csv file into a single column (say CONSOLIDATED_COLUMN)
  • Created another column of int type, populated as LENGTH(CONSOLIDATED_COLUMN)
  • Created one more column of int type, populated as LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', ''))
  • Subtracted the two int columns,
    i.e. LENGTH(CONSOLIDATED_COLUMN) - LENGTH(REPLACE(CONSOLIDATED_COLUMN, '|', '')) = 155
  • Flagged the records which satisfy the above condition
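The length-difference check in the steps above boils down to something like this minimal Python sketch (the constant and function name are illustrative, not from the job itself):

```python
EXPECTED_FIELDS = 156

def flag_record(consolidated_column):
    """'V' when the record has exactly 155 pipes (156 fields), else 'I'."""
    pipe_count = (len(consolidated_column)
                  - len(consolidated_column.replace("|", "")))
    return "V" if pipe_count == EXPECTED_FIELDS - 1 else "I"
```

Subtracting the length with pipes removed from the full length counts the delimiters without any parsing, which is why it works on a single varchar(8000) column.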

maddyk (BOB member since 2011-02-08)

That sounds like it will work. What are you having trouble with?

  • E

eepjr24 :us: (BOB member since 2005-09-16)

Right, it works now. Sincere thanks to you all for providing kind help, suggestions and cooperation.


maddyk (BOB member since 2011-02-08)