PDI - Condition to check if the row count for each CSV file is more than 1 - kettle

I'm trying to create a transformation that will read all the CSV files and then check each file: if the file contains more than 1 row, the transformation continues; otherwise it aborts.

There is a step named "Detect empty stream". After reading all the files, you can put it as the next step and check: if the file is empty, terminate the process; if not, go forward with the process.

Related

Append CSVs in folder; how to skip / delete first rows in each file

I have 25 CSV files in a folder linked as my data source. The 1st row in each file contains just the file name in column A, followed by the column headers in the 2nd row (this is how the files are generated and sent to me; I do not have access to the database).
When I remove the first row of the sample file, then promote headers, then Close & Apply, I get a list of errors which are essentially the redundant column header rows in each of the subsequent 24 files in the folder.
Upon suggestion, I changed the end of the first Applied Step in Transform Sample File from QuoteStyle.None]) to QuoteStyle.Csv]). This did not solve the problem and didn't seem to change anything.
Another suggestion was that I could just proceed with the errors and filter as needed, and that it wouldn't be a problem. This seems risky/sloppy to me, but maybe it's fine and I'm just a nervous newb?
Many thanks for any input!
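For reference, a rough sketch of what the Transform Sample File query described above might look like in Power Query M (the file path is a placeholder and QuoteStyle.Csv reflects the suggestion mentioned; this is not the poster's actual query):

let
    Source = Csv.Document(File.Contents("C:\Data\Sample.csv"), [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]),
    // drop the first row, which holds only the file name in column A
    RemovedFileNameRow = Table.Skip(Source, 1),
    // promote the (now first) row of column headers
    PromotedHeaders = Table.PromoteHeaders(RemovedFileNameRow, [PromoteAllScalars = true])
in
    PromotedHeaders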

Skip top N lines in snowflake load

My actual data in the CSV extracts starts from line 10. How can I skip the top few lines in a Snowflake load using COPY or any other utility? Do we have anything similar to SKIP_HEADER?
I have files on S3 and it's my stage. I will be creating a Snowpipe later on this data source.
Yes, there is a SKIP_HEADER option for CSV, allowing you to skip a specified number of rows when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format associated with the CSV files you have in mind and then use it when calling the COPY commands.
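For example, a minimal sketch (the file format name, the stage @my_stage and the table target_table are placeholders; data starting on line 10 means skipping the first 9 lines):

CREATE OR REPLACE FILE FORMAT csv_skip9_format
  TYPE = 'CSV'
  SKIP_HEADER = 9                        -- data starts on line 10, so skip the first 9 lines
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

COPY INTO target_table
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_skip9_format');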

Loop for creating several empty csv files

I need to create 215 empty csv files with Stata and save them on my computer.
Since this is a repetitive task, a loop would be perfect. Each file would have a similar but different name (for example Data_Australia, Data_Austria and so on).
How do I create a loop to generate several empty csv datasets with Stata?
I tried the community-contributed command touch but it works well when you only need to generate one empty dataset.
Assuming you want a completely empty file (no header row or anything), just open a file to write to, and immediately close it again.
cd "C:\Users\My\Directory"
local country_names Australia Austria "Republic of Korea" // add all the names here
foreach country_name in `country_names' {
    file open f1 using "Data_`country_name'.csv", write replace
    file close f1
}
If you have the names stored as a string variable, say country, you can instead loop through the values in that variable (in this case stopping when it reaches the end or an empty row).
local row = 1
while country[`row'] != "" {
    file open f1 using "Data_`=country[`row']'.csv", write replace
    file close f1
    local ++row
}

PDI - Check data types of field

I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
Like this: the standard is that field A should be a string of 1 character and field B should be an integer/number.
And what I want is to check/validate: if A is not string(1) then set Status = Not Valid, and likewise if B is not an integer/number. Then all files with status Not Valid will be moved to an error folder.
I know I can use the Data Validator to do it, but how do I move the files with that status? I can't find any step to do it.
You can read the files in a loop and add steps as below:
after data validation, filter the rows with a negative result (not matched) -> add an Add constants step with error = 1 -> add a Set variables step for the error field, with a default value of 0.
After the transformation finishes, you can add a Simple evaluation step in the parent job to check the value of the ERROR variable.
If it has value 1 then move the files, else ....
I hope this can help.
You can do the same as in this question. Once read, use a Group by to have one flag per file. However, this time you cannot do it in one transformation; you should use a job.
Your use case is covered in the samples shipped with your PDI distribution. The sample is in the folder your-PDI/samples/jobs/run_all. Open Run all sample transformations.kjb and replace the Filter 2 of Get Files - Get all transformations.ktr with your logic, which includes a Group by to have one status per file rather than one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That's its great power, but it means you do not know whether you have to move the file until every row has been processed.
Alternatively, you have the quick and dirty solution from your similar question: change the Filter rows to a type check, and the final Synchronize after merge to a Process files/Move.
And a final piece of advice: instead of checking the type with a Data Validator, which is a good solution in itself, you may use a JavaScript step as in the example there. It is more flexible if you need maintenance in the long run.
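As a rough illustration only (the field names A and B and the Status output come from the question above; everything else is an assumption, not a definitive implementation), the logic inside a Modified Java Script Value step could look something like this:

// Sketch for a Modified Java Script Value step.
// Assumed input fields: A (should be a 1-character string) and B (should be numeric).
// The step is configured to output a new String field named status.
var a_str = "" + A;                      // coerce the incoming value to a JavaScript string
var status = "Valid";

if (A == null || a_str.length != 1) {
    status = "Not Valid";                // A is not exactly one character
}

if (B == null || isNaN(parseFloat("" + B))) {
    status = "Not Valid";                // B is not a number
}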

Source header (first row) contains dates in a FF, then mapping succeeds, else it fails

I have a requirement where my source flat file has the first row with dates, the second row with field names, and so on, and I am reading each line as one string and loading it into the target table.
So I need to do a unit test where, if the source file doesn't have dates in its first row but has something else, I want to fail my mapping; otherwise it should succeed.
Example of source file:
"2015-05-23","2015-06-05"
"carrier","contract",'Group",'Name",'record"
"1234","abcd","4567","kiran","1".
How do I approach this logic in Informatica? Please share your inputs.
You can do a substring of the first line and check if it contains a date using the IS_DATE function.
e.g. IS_DATE(SUBSTR(input, 2, 10), 'YYYY-MM-DD')
Then, if the above returns false, use the ABORT function to fail the workflow.
You can create two separate pipelines:
one which picks up the first row from the file, checks if it's a date, and aborts the whole flow if it isn't. For picking up the first row, you can use a Sequence Generator to determine whether a row is the first row or not, then use IS_DATE(SUBSTR(input, 2, 10), 'YYYY-MM-DD') and ABORT.
The second pipeline will process the data as usual.
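Putting the two answers together, a rough sketch of the check expression (the port names input and rownum are illustrative assumptions, not from the original answers):

-- Illustrative only: input is the single string port holding the whole line,
-- and rownum is assumed to come from the Sequence Generator (1 for the first row).
IIF(rownum = 1 AND NOT IS_DATE(SUBSTR(input, 2, 10), 'YYYY-MM-DD'),
    ABORT('First row does not contain dates - failing the mapping'),
    input)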