How can I remove junk values and load multiple .csv files (different schemas) into BigQuery? - google-cloud-platform

I have many .csv files stored in GCS, and I want to load data from the .csv files into BigQuery using the command below:
bq load 'dataset.table' gs://path.csv json_schema
I have tried this, but it gives errors, and the same error occurs for many files.
(error screenshot)
How can I remove unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.

The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options:
Number of errors allowed. By default it is set to 0, which is why the load job fails at the first bad line. If you know how many bad rows you can tolerate, set this value accordingly and those errors will be ignored by the load job.
Ignore unknown values. If your errors occur because some lines contain more columns than defined in the schema, this option keeps the offending lines and loads only the known columns; the extra ones are ignored.
Allow jagged rows. If your errors are caused by lines that are too short (which is what your error message shows) and you still want to keep the first columns (because the trailing ones are optional and/or not relevant), you can check this option.
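With the bq CLI, these three options map to flags on the load command. A minimal sketch (the dataset, table, bucket path and schema file are the placeholders from your own example; --max_bad_records=100 is an arbitrary illustrative value):
# --max_bad_records : number of errors allowed before the job fails
# --ignore_unknown_values : keep rows with extra columns, loading only the ones in the schema
# --allow_jagged_rows : accept rows that are missing trailing optional columns
bq load --source_format=CSV \
    --max_bad_records=100 \
    --ignore_unknown_values \
    --allow_jagged_rows \
    'dataset.table' gs://path.csv json_schema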
For more advanced and specific filtering, you have to perform pre- or post-processing. If that is your case, let me know and I will add that part to my answer.

Related

Loading data into Amazon Redshift: Ignore last n rows

Is there an option to load a CSV into Redshift while skipping over a footer, just as we use IGNOREHEADER to skip initial rows? Is there any way to ignore the last rows?
No. There is no parameter to tell the COPY command to ignore rows at the end of a file.
However, you could let the file load despite the bad footer row by specifying a MAXERROR of 1, which allows the file to load with one bad row (or more, if required).
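For example (a sketch only; the table name, S3 path and IAM role below are placeholders):
COPY my_table
FROM 's3://my-bucket/data_with_footer.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1
MAXERROR 1;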

Concatenate monthly MODIS data

I downloaded daily MODIS Level 3 data for a few months from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form MCD06COSP_M3_MODIS.A2006001.061.2020181145945, but the files do not contain any time dimension. Hence, when I use ncecat to concatenate the files, the date information is missing in the resulting file. I want to know how to add the time information to the combined dataset.
Your commands look correct, good job crafting them, so I'm not sure why it's not working. Possibly the input files are in HDF4 format (do they have a .hdf suffix?) and your NCO is not HDF4-enabled. Try downloading the files in netCDF3 or netCDF4 format and your commands above should work. If that's not what's wrong, then examine the output files at each step of your procedure, identify which step produces the unintended results, and narrow your question accordingly. Good luck.
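In case it helps, here is a minimal sketch of creating a named record dimension and adding date values with NCO (assuming the inputs are netCDF; the file names, coordinate values and units below are illustrative only):
# concatenate along a new record dimension called "time"
ncecat -u time MCD06COSP_M3_MODIS.A2006001.nc MCD06COSP_M3_MODIS.A2006032.nc MCD06COSP_M3_MODIS.A2006060.nc merged.nc
# write one time coordinate value per input file, here as days since a reference date
ncap2 -s 'time[$time]={0,31,59}; time@units="days since 2006-01-01"' merged.nc merged_time.nc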

PDI - Check data types of field

I'm trying to create a transformation that reads CSV files and checks the data types of each field in the CSV.
For example: field A should be a string(1) (a single character) and field B should be an integer/number.
What I want is to check/validate: if A is not string(1) then set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid will be moved to an error folder.
I know I can use the Data Validator to do this, but how do I move the files based on that status? I can't find any step to do it.
You can read the files in a loop and add steps as below:
after the data validation, filter the rows with a negative result (not matched) -> an Add constants step setting error = 1 -> a Set Variables step for the error field (with a default value of 0).
After the transformation finishes, add a Simple Evaluation step in the parent job to check the value of the ERROR variable.
If it has the value 1, then move the file; otherwise ....
I hope this helps.
You can do the same as in this question. Once the files are read, use a Group by to get one flag per file. However, this time you cannot do it in one transformation; you have to use a job.
Your use case is covered by the samples shipped with your PDI distribution. The sample is in the folder your-PDI/samples/jobs/run_all. Open Run all sample transformations.kjb and replace Filter 2 of Get Files - Get all transformations.ktr with your logic, which includes a Group by to get one status per file rather than one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That is its great power, but it also means you cannot know whether every row has been processed at the moment you want to move the file.
Alternatively, you have the quick-and-dirty solution from your similar question: change the Filter rows to a type check, and the final Synchronize after merge to a Process files/Move.
And a final piece of advice: instead of checking the type with a Data Validator, which is a good solution in itself, you may use a JavaScript step like the one there. It is more flexible if you need to maintain it in the long run.

Is it possible to determine the max length of a field in a CSV file using regex?

This has been discussed on Stack Overflow before, but I couldn't find a case/answer that applies to my situation:
From time to time I have raw text data to import into SQL. In almost every case I must try several times, because the SSIS wizard doesn't know the max size of each field and defaults to 50 characters. Only after it fails can I tell from the error message which field was truncated (first), and I then increase that field's size.
More than one field may need its size increased, and the SSIS wizard reports only one error each time it encounters a truncation. As you can see this is very tedious, so I want a quick way to inspect the data first and determine the max size of each field.
I came across an old post on Stack Overflow: Here is the post
Unfortunately it might not work in my case: my raw data can have as many as 10 million rows (yes, in a single text file that is over a gigabyte).
I rather doubt there is a way to do this, but I still want to post my question here hoping for a clue.
Thank you very much.
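Not regex, but as a rough sketch of a single-pass inspection (assuming a comma-delimited file named bigfile.csv with no quoted embedded delimiters, and a Unix-style awk available, e.g. via Git Bash or WSL), you could stream the file once and track the longest value seen in each column:
# print the maximum observed length of every field; one pass over the file, so 10 million rows is fine
awk -F',' '{ if (NF > n) n = NF; for (i = 1; i <= NF; i++) if (length($i) > max[i]) max[i] = length($i) }
END { for (i = 1; i <= n; i++) print "field", i, "max length:", max[i] + 0 }' bigfile.csv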

Greenplum to SAS Bulkload gpfdist error - line too long in file

I'm currently doing a bulk load from Greenplum to SAS. Initially there was one field with a backslash "\" at the end of the column that caused an error during loading. To resolve it I changed the format from TEXT to CSV and it worked fine. But when loading more data I encountered this error:
gpfdist error - line too long in file
I've been searching but couldn't determine whether the cause is the max_length value set when starting the gpfdist service. I also saw that there is a limit on Windows, which is 1 MB? I'd greatly appreciate your help.
By the way, here is some additional info which might help:
-Greenplum version: 4.2.1.0 build 3
-Gpfdist installed in Windows along with SAS Applications
-Script submitted to Greenplum based on SAS Logs:
CREATE EXTERNAL TABLE ( ) LOCATION ('gpfdist://:8081/fileout.dat')
FORMAT 'CSV' ( DELIMITER '|' NULL '\N') ENCODING 'LATIN1'
Thanks!
"Line too long" sorts of errors usually indicate that you've got extra delimiters buried in VARCHAR/TEXT columns that throw the parsing of the file off.
Another possibility is that you've got hidden control characters, extra linebreaks or other nasty stuff hidden in your file that again is throwing your formatting off. Gpfdist can handle a lot of different data errors and keep going, but extra delimeters throws it for a loop.
Scan your load file looking for extra pipe characters in a line.
Another option would be to re-export your data, picking a different delimiter.
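A quick way to scan for lines with extra delimiters (a sketch only; the expected column count of 12 is illustrative and should be adjusted to your table):
# report every line whose pipe-delimited field count differs from the expected 12
awk -F'|' 'NF != 12 { print "line " NR ": " NF " fields" }' fileout.dat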
Please try an alternate solution: select the input format as TEXT and the client encoding as ISO_8859_5 in the session, and see if that helps. In my case it worked.