Append CSVs in folder; how to skip / delete first rows in each file - powerbi

I have 25 CSV files in a folder linked as my data source. The 1st row in each file contains just the file name in column A, followed by the column headers in the 2nd row (this is how the files are generated and sent to me; I do not have access to the database).
[Screenshot: the first two rows of one of the CSVs]
When I remove the first row of the sample file, then promote headers, then Close & Apply, I get a list of errors which are essentially the redundant column header rows in each of the subsequent 24 files in the folder.
[Screenshot: the list of errors]
Upon suggestion, I changed the end of the first Applied Step in Transform Sample File from QuoteStyle.None]) to QuoteStyle.Csv]). This did not solve the problem and didn't seem to change anything.
Another suggestion was that I could just proceed with the errors and filter them out as needed, and that it wouldn't be a problem. That seems risky/sloppy to me, but maybe it's fine and I'm just a nervous newb?
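For what it's worth, the end result I'm after is what this pandas sketch would produce outside of Power BI (the folder path is a placeholder; skiprows=1 drops the file-name row in each file so the real header row is promoted):

import glob
import pandas as pd

frames = []
for path in glob.glob(r"C:\csv_folder\*.csv"):   # placeholder folder path
    # skiprows=1 drops the file-name row; the next row is then used as the header
    frames.append(pd.read_csv(path, skiprows=1))
combined = pd.concat(frames, ignore_index=True)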
Many thanks for any input!

Related

import multiple csv files as list in R and change the name of one cell in each file

I have been searching for days but I can't find an answer to my question.
I need to change a single cell whose value is currently "29" to "Si29" in hundreds of csv files.
The position of the cell is the same in every file: [3,7].
Then I need to save the files again (can be under the same name).
For one file I would do:
read_data[3, 7] <- "Si29"
However, I have no clue how I apply this to multiple files.
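To illustrate the kind of loop I have in mind, here is a rough sketch (in Python rather than R; the folder pattern is a placeholder, and [3,7] means row 3, column 7 with 1-based indexing):

import glob
import pandas as pd

for path in glob.glob("data/*.csv"):      # placeholder folder/pattern
    df = pd.read_csv(path)
    # R's read_data[3, 7] is 1-based, so this is row 3, column 7 of the data frame
    df.iloc[2, 6] = "Si29"
    df.to_csv(path, index=False)          # save again under the same name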
Cheers

How can I remove junk values and load multiple .csv files (different schema) into BigQuery?

I have many .csv files stored in GCS and I want to load data from the .csv files into BigQuery using the command below:
bq load 'dataset.table' gs://path.csv json_schema
I have tried this, but it gives errors, and the same error appears for many files.
[Screenshot: the error message]
How can I remove unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.
The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options.
Number of errors allowed. By default it's set to 0, which is why the load job fails at the first bad line. If you know the total number of bad rows, set this value to that number and all of those errors will be ignored by the load job.
Ignore unknown values. If your errors occur because some lines contain more columns than defined in the schema, this option keeps those lines and loads only the known columns; the extra ones are ignored.
Allow jagged rows. If your errors are caused by lines that are too short (and that is what your error message shows) and you still want to keep the first columns (because the last ones are optional and/or not relevant), you can check this option.
For more advanced and specific filtering, you have to perform pre- or post-processing. If that's the case, let me know and I'll add that part to my answer.
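To make the mapping concrete, here is a rough sketch with the Python BigQuery client; the three options above correspond to the max_bad_records, ignore_unknown_values and allow_jagged_rows settings (the table name and the error budget of 100 are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,              # or pass your explicit schema, as with json_schema
    max_bad_records=100,          # "Number of errors allowed"
    ignore_unknown_values=True,   # drop extra, unknown columns instead of failing
    allow_jagged_rows=True,       # accept rows missing trailing optional columns
)
load_job = client.load_table_from_uri(
    "gs://path.csv",              # source URI from the question
    "dataset.table",              # placeholder dataset.table
    job_config=job_config,
)
load_job.result()                 # wait for the job and raise on failure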

Using pygsheets to search for specific data and get a result; need the output to include the cell location

I'm working on inventory and, after searching for and locating a product by Mfg. Code, I need output that prints the row data and the cell location. I have no problem getting the row data, but I don't know how to output both the data and the location together.
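Here is a rough sketch of what I'm trying to end up with (the spreadsheet name, credentials file and Mfg. Code are placeholders; I'm assuming find() returns cell objects that carry their own row/column):

import pygsheets

gc = pygsheets.authorize(service_file="creds.json")   # placeholder credentials file
wks = gc.open("Inventory").sheet1                     # placeholder spreadsheet name

for cell in wks.find("MFG-1234"):                     # placeholder Mfg. Code to search for
    row_values = wks.get_row(cell.row, include_tailing_empty=False)
    print(f"Match at {cell.label} (row {cell.row}, col {cell.col}): {row_values}")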

C++ Request a specific row from a file

Is there a way, where I can open a file that contains a large amount of data, and retrieve only one specific row or index, without getting the rest of the content as well?
Update:
Based on what others have mentioned here in the comments, I have some follow-up questions.
Can anyone give me an example of how to put a fixed width on the rows/line breaks (whatever you want to call it), or point me to a good source where I can read more about it?
So if I set this up correctly, I will be able to get a specific line from the file superfast, even if it contains several million rows?
If you want to access a file by records or rows, and the rows are not fixed length, you'll have to create a structure that lets you map row indices to file positions.
I recommend using std::vector<std::streampos>.
Read through the file.
When the file is at the beginning of a row, read the file position and append to the vector.
If you need to access a row in the file:
1) Use the vector to get the file position of the row.
2) Seek to the row using the file position.
This technique will work with fixed length and variable length rows.
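If it helps to see the idea end to end, here is a minimal sketch of the same index-then-seek approach (written in Python for brevity; tell/seek stand in for tellg()/seekg() on a std::ifstream, and the file name is a placeholder):

# Pass 1: record the byte offset where each row starts (std::vector<std::streampos> in C++)
offsets = []
with open("data.txt", "rb") as f:          # placeholder file name
    pos = f.tell()
    while f.readline():
        offsets.append(pos)
        pos = f.tell()

# Pass 2: jump straight to any row by index, without reading the rest of the file
def read_row(path, index, row_offsets):
    with open(path, "rb") as f:
        f.seek(row_offsets[index])
        return f.readline().decode().rstrip("\r\n")

print(read_row("data.txt", 2, offsets))    # third row (zero-based index)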

PDI - Check data types of field

I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
For example: field A should be a string of one character, and field B should be an integer/number.
What I want is to check/validate: if A is not a one-character string, set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid should be moved to an error folder.
I know I can use the Data Validator to do the check, but how do I move the files based on that status? I can't find any step to do it.
You can read the files in a loop and add steps as below.
After the data validation, filter the rows with a negative (not matched) result -> add an Add constant values step with error = 1 -> add a Set variable step for the error field, with a default value of 0.
After the transformation finishes, add a Simple evaluation step in the parent job to check the value of the ERROR variable.
If it has the value 1, move the files; else ....
I hope this can help.
You can do the same as in this question. Once the files are read, use a Group by to get one flag per file. However, this time you cannot do it in one transformation; you should use a job.
Your use case is covered in the samples shipped with your PDI distribution. The sample is in the folder your-PDI/samples/jobs/run_all. Open Run all sample transformations.kjb and replace Filter 2 of Get Files - Get all transformations.ktr with your own logic, which should include a Group by so that you get one status per file rather than one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That's its great power, but it means you cannot know whether you have to move the file until every row has been processed.
Alternatively, you have the quick-and-dirty solution from your similar question: change the Filter rows to a type check, and the final Synchronize after merge to a Process File/Move.
And a final piece of advice: instead of checking the type with a Data Validator, which is a good solution in itself, you may use a JavaScript step like the one there. It is more flexible if you need to maintain it in the long run.
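If it helps to see the per-file logic in isolation, here is a rough sketch of the check-then-move idea outside PDI (in Python; the folder names and the two validation rules are assumptions taken from the question):

import csv
import shutil
from pathlib import Path

SRC = Path("incoming")                     # placeholder folder with the CSV files
ERR = Path("error")                        # placeholder error folder
ERR.mkdir(exist_ok=True)

def row_is_valid(row):
    # Rules from the question: field A must be a single character, field B an integer
    a, b = row[0], row[1]
    return len(a) == 1 and b.lstrip("-").isdigit()

for csv_path in SRC.glob("*.csv"):
    with csv_path.open(newline="") as f:
        reader = csv.reader(f)
        next(reader, None)                 # skip the header row
        file_ok = all(row_is_valid(r) for r in reader if r)
    if not file_ok:
        shutil.move(str(csv_path), str(ERR / csv_path.name))   # one status per file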