Greenplum to SAS bulk load: gpfdist error - line too long in file

I'm currently doing a bulk load from Greenplum to SAS. Initially there was one field with a backslash "\" at the end of a column value, which caused an error during loading. To resolve it I changed the format from TEXT to CSV and it worked fine. But when loading more data I encountered this error:
gpfdist error - line too long in file
I've been searching but couldn't work out whether the cause is the max_length value set when starting the gpfdist service. I also saw that there is a limit for Windows, which is 1 MB? I'd greatly appreciate your help.
By the way, here is some additional info which might help:
-Greenplum version: 4.2.1.0 build 3
-Gpfdist installed on Windows along with the SAS applications
-Script submitted to Greenplum based on SAS Logs:
CREATE EXTERNAL TABLE ( ) LOCATION ('gpfdist://:8081/fileout.dat')
FORMAT 'CSV' ( DELIMITER '|' NULL '\N') ENCODING 'LATIN1'
Thanks!

"Line too long" sorts of errors usually indicate that you've got extra delimiters buried in VARCHAR/TEXT columns that throw the parsing of the file off.
Another possibility is that you've got hidden control characters, extra linebreaks or other nasty stuff hidden in your file that again is throwing your formatting off. Gpfdist can handle a lot of different data errors and keep going, but extra delimeters throws it for a loop.
Scan your load file looking for extra pipe characters in a line.
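If it helps, here is a rough way to do that scan in Python (a sketch, not from the original thread; the file name, the LATIN1 encoding, and the assumption that the export fits comfortably in memory are all placeholders). It flags lines whose pipe count differs from the most common count in the file:

from collections import Counter

# Count '|' delimiters per line and flag lines that deviate from the most
# common count; fileout.dat and latin-1 mirror the question's export settings.
with open("fileout.dat", encoding="latin-1") as f:
    lines = f.readlines()

counts = Counter(line.count("|") for line in lines)
expected = counts.most_common(1)[0][0]

for lineno, line in enumerate(lines, start=1):
    if line.count("|") != expected:
        print(f"line {lineno}: {line.count('|')} delimiters (expected {expected})")

Lines with fewer delimiters than expected often point to an embedded line break; lines with more point to a stray pipe inside a text column.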
Another option would be to re-export your data, picking a different delimiter.

Please try an alternative solution: select TEXT as the input format and ISO_8859_5 as the client encoding in the session, and see if that helps. In my case it worked.

Related

AWS COPY From S3 Command Fails As A Result of Truncation - RESOLVED

First of all I would like to mention that I tried searching for this issue in existing SO questions but I couldn't find the scenario I came across. Hence asking a new question.
So I am trying to import data from S3 to Redshift.
The data in S3 is JSON data separated by newline characters, i.e. \n (exported using the UNLOAD command from another Redshift cluster).
The copy command is -
copy redhist_table_name
from 's3://bucket/path/to/s3/json/file.json'
iam_role 'iam_role_info'
region 'region';
json 'auto';
STL_LOAD_ERRORS shows the error as "delimiter not found", but when I looked closely, I found that the copy command is copying only the first 1024 characters of the JSON row, which results in the above error.
I looked through all the options the COPY command offers to see if there's a way to increase this limit, but I found none.
Any idea where this limit is coming from? Or is this not the root cause of the issue?
I expect this is not the root cause. STL_LOAD_ERRORS only stores 1024 characters in the "raw_line" column, so what you see there is just the logged sample being truncated. There very well may be a limit to how long a line can be, but I know it is much longer than 1024 characters.
Which line in the file is COPY failing on? First or somewhere later in the file?
If it's a line deep in the file, there may be something off about it. UNLOAD to COPY should work correctly, but I can see that there may be some corner cases (like " characters in string values). If it is a specific line, then posting that line (sanitized if need be) would be helpful.
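If you can pull the unloaded file down locally (for example with aws s3 cp), a quick scan like the following (a sketch; the local file name is a placeholder) reports any line that does not parse as a standalone JSON object, which can help pinpoint where COPY is tripping up:

import json

# Newline-delimited JSON: each non-empty line should parse on its own.
with open("file.json", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"line {lineno}: {exc}")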

GCP > Video Intelligence: Prepare CSV error: Has critical error in root level csv, Expected 2 columns, but found 1 columns only

I'm trying to follow the documentation at the GCP link below to prepare my video training data. The doc says that if you want to use GCP to label videos, you can use the UNASSIGNED feature.
I have my videos uploaded to a bucket.
I have a traffic_video_labels.csv with below rows:
gs://video_intel/1.mp4
gs://video_intel/2.mp4
Now, in my Video Intelligence Import section, I want to use a CSV called check.csv that has the below row, as it references back to the video locations. Using the UNASSIGNED value should let me use the labelling feature within GCP.
UNASSIGNED,gs://video_intel/traffic_video_labels.csv
However, when I try to use check.csv as the import file, I get the error:
Has critical error in root level csv gs://video_intel/check.csv line 1: Expected 2 columns, but found 1 columns only.
Can anyone please help with this? Thanks!
https://cloud.google.com/video-intelligence/automl/object-tracking/docs/prepare
For the error message "Expected 2 columns, but found 1 columns only.", try to fix the format of your CSV file: open the file in a text editor of your choice (such as Cloud Shell, Sublime, Atom, etc.) to inspect the file format.
When opening a CSV file in Google Sheets or a similar product, you won't be able to see formatting problems (e.g. empty values from trailing commas) due to limitations of the user interface, but in a text editor you should not run into those issues.
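As a quick programmatic check (a sketch; the local file name is an assumption), you can also confirm that every row of check.csv really contains two comma-separated values:

import csv

# Each row should have exactly two columns: a set label such as UNASSIGNED
# and a gs:// path. Offending rows are printed with their raw content so
# stray delimiters, BOMs, or blank lines are easy to spot.
with open("check.csv", newline="", encoding="utf-8-sig") as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        if len(row) != 2:
            print(f"line {lineno}: expected 2 columns, found {len(row)}: {row!r}")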
If this does not work, please share your CSV file so I can run a test with it myself.

How can I remove junk values and load multiple .csv files (different schemas) into BigQuery?

I have many .csv files stored in GCS and I want to load data from the .csv files into BigQuery using the below command:
bq load 'datasate.table' gs://path.csv json_schema
I have tried, but it gives errors; the same error appears for many files.
error screenshot
How can I remove unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.
The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options:
Number of errors allowed. By default it's set to 0, and that's why the load job fails at the first bad line. If you know roughly how many bad rows to expect, set this value to the number of errors allowed and all of those errors will be ignored by the load job.
Ignore unknown values. If your errors come from lines that contain more columns than are defined in the schema, this option keeps those lines and loads only the known columns; the extra ones are ignored.
Allow jagged rows. If your errors come from lines that are too short (and that is what your message shows) and you still want to keep the first columns (because the last ones are optional and/or not relevant), you can enable this option.
For more advanced and specific filtering, you have to perform pre- or post-processing. If that's the case, let me know and I'll add that part to my answer.
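As an illustration only (the question uses the bq CLI, which exposes equivalent flags; the dataset, table, and bucket names below are placeholders), here is how those three options look with the BigQuery Python client:

# Sketch using the google-cloud-bigquery client; names are placeholders and
# the schema is auto-detected here for brevity.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,             # or pass an explicit schema instead
    max_bad_records=10,          # number of errors allowed
    ignore_unknown_values=True,  # drop extra, undeclared columns
    allow_jagged_rows=True,      # accept rows missing trailing optional columns
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/path/file.csv",  # placeholder source URI
    "my_dataset.my_table",           # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the job; raises if it still fails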

PDI - How to keep a transformation running even when an error occurs?

I have a transformation with several steps that runs via a batch script using Windows Task Scheduler.
Sometimes the first step, or the nth step, fails and stops the entire transformation.
I want the transformation to run from start to end regardless of any errors. Is there any way of doing this?
1) One way is to use "error handling"; however, it is not available for all steps. You can right-click on a step and check whether the error handling option is available or not.
2) If you are getting errors because of an incorrect data type, for example you are expecting an integer value but for some specific record you get a string value and the step fails, you can use the data validation step to handle such situations.
Basically, you can implement logic based on the transformation you have created. The above are some of the general methods.
This is what is called "Error Handling": even though your transformation runs into some errors, you still want it to continue running.
Situations:
- Data type issues in the data stream.
Ex: say you have a column X of data type integer but by mistake you get a string value; you can define error handling to capture all of these records.
- While processing JSON data.
Ex: you specify a path to retrieve the value of a JSON field, but for some data nodes the path cannot be resolved or is missing; you can define error handling to capture all of the missing-path details.
- While updating a table.
If you are updating a table on some key and the key coming from the input stream is not present, an error will occur; you can define error handling here as well.

Finding and debugging bad records using Hive

Is there any way to pinpoint the bad record when we are loading data using Hive, or while processing the data?
The scenario goes like this.
Suppose I have a file with 1 million records in it, delimited by the '|' symbol, that needs to be loaded as a table using Hive.
Suppose that after processing half a million records I encounter a problem. Is there any way to debug it, or to precisely pinpoint the record or records having the issue?
If my question is not clear, please let me know.
I know there is bad-record skipping in MapReduce (as a kind of percentage). I would like to understand this from the Hive perspective.
Thanks in advance.
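For reference, one way to narrow this down outside Hive is to replay the same '|' split on the raw file and report records that don't match the expected layout (a sketch; EXPECTED_FIELDS, the integer-column positions, and the file name are placeholders for your actual schema):

# Replay Hive's '|' split and report suspicious records with their line numbers.
EXPECTED_FIELDS = 10   # placeholder: number of columns in the table
INT_COLUMNS = [0]      # placeholder: positions expected to hold integers

with open("data.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != EXPECTED_FIELDS:
            print(f"line {lineno}: {len(fields)} fields, expected {EXPECTED_FIELDS}")
            continue
        for i in INT_COLUMNS:
            try:
                int(fields[i])
            except ValueError:
                print(f"line {lineno}: column {i} is not an integer: {fields[i]!r}")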