AWS COPY From S3 Command Fails As A Result of Truncation - RESOLVED

First of all I would like to mention that I tried searching for this issue in existing SO questions but I couldn't find the scenario I came across. Hence asking a new question.
So I am trying to import data from S3 to Redshift.
The data in S3 is JSON data separated by the newline character, i.e. \n (exported using the UNLOAD command from another Redshift cluster).
The copy command is -
copy redhist_table_name
from 's3://bucket/path/to/s3/json/file.json'
iam_role 'iam_role_info'
region 'region';
json 'auto';
STL_LOAD_ERRORS shows the error as "Delimiter not found", but when I looked closely I found that the copy command is copying only the first 1024 characters of the JSON row, which results in the above error.
I looked through all the options that the COPY command offers to see if there's a way to increase this limit, but I found none.
Any idea where this limit comes from? Or is this not the root cause of the issue?

I expect this is not the root cause. STL_LOAD_ERRORS only stores 1024 characters in the "raw_line" column, so the truncation you see there is a display limit of the system table, not necessarily what COPY actually read. There may well be a limit on how long a line can be, but I know it is much longer than 1024 characters.
Which line in the file is COPY failing on? First or somewhere later in the file?
If it is a line deep in the file, there may be something off about that line. UNLOAD to COPY should work correctly, but I can see that there may be some corner cases (like " in string values). If it is a specific line, then posting that line (sanitized if need be) would be helpful.
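To see exactly which line COPY is rejecting, you can pull the most recent errors from the system table; a minimal sketch (the columns are standard STL_LOAD_ERRORS columns, the ordering and LIMIT are just for convenience):
select starttime, filename, line_number, colname, err_code, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;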

Related

AWS Athena - how to process huge results file

I'm looking for a way to process a ~4 GB file which is the result of an Athena query, and I am trying to find out:
Is there some way to split Athena's query result file into smaller pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda - the file is too large, and it seems s3.open(input_file, 'r') does not work in Lambda :(
Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) to send them to an external source (POST requests).
You can use CTAS with Athena and use its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advisable to create the new table in a compressed format such as Parquet; however, you can also define it as CSV ('TEXTFILE').
Lastly, it is advisable to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by the amount of data scanned. The meaningful partitioning depends on your use case and the way you want to split your data. The most common approach is time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
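Putting it together, the CTAS might look something like this (a sketch only: source_table is a placeholder name, the column is wrapped in to_utf8 because Presto's sha256 expects a varbinary input, and the partition column is kept last in the SELECT as Athena requires):
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['partition_char'])
AS SELECT
*,
substr(to_base64(sha256(to_utf8(some_column_in_your_data))), 1, 1) AS partition_char
FROM source_table;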
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
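A complete bucketed CTAS would then look roughly like this (again a sketch; the table name and column name are placeholders):
CREATE TABLE bucketed_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100)
AS SELECT ...
Each partition (or the whole table, if unpartitioned) is written as roughly bucket_count files, so the bucket count is effectively how you steer the size of the output files.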
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix-based OS)?
If this job is recurring and needs to be automated, and you still want to be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1,000 bytes.
If you want to process the whole file in the same Lambda invocation, you can request a range that is about what you think you can fit in memory, process it up to the last line break you see, and then request the next range, prepending the trailing partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't accumulate a huge data structure, you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk with that amount (both at the beginning and the end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding – with special handling of the first and last chunk, obviously.

Skip top N lines in snowflake load

My actual data in the CSV extracts starts from line 10. How can I skip the top few lines in a Snowflake load using COPY or any other utility? Do we have anything similar to SKIP_HEADER?
I have files on S3 and it's my stage. I will be creating a Snowpipe on this data source later.
Yes, there is a SKIP_HEADER option for CSV, allowing you to skip a specified number of rows when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format for the CSV files you have in mind and then use it when calling the COPY command.
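For example, something along these lines (a sketch; the file format, stage, and table names are placeholders, and SKIP_HEADER = 9 assumes the data really starts on line 10):
create or replace file format my_csv_format
type = 'CSV'
skip_header = 9;

copy into my_table
from @my_s3_stage
file_format = (format_name = 'my_csv_format');
The same file format can also be referenced in the COPY statement of the Snowpipe definition you create later.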

How can I remove junk values and load multiple .csv files (different schemas) into BigQuery?

I have many .csv files stored in GCS and I want to load data from the .csv files into BigQuery using the command below:
bq load 'datasate.table' gs://path.csv json_schema
I have tried, but it gives errors; the same error appears for many files.
[error screenshot]
How can I remove unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.
The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options:
Number of errors allowed. By default it's set to 0, and that's why the load job fails at the first bad line. If you know the total number of bad rows, set this value accordingly and all the errors will be ignored in the load job.
Ignore unknown values. If your errors occur because some lines contain more columns than defined in the schema, this option keeps the lines in error and loads only the known columns; the others are ignored.
Allow jagged rows. If your errors are caused by lines that are too short (and that is what your message shows) and you still want to keep the first columns (because the last ones are optional and/or not relevant), you can check this option.
For more advanced and specific filters, you have to perform pre- or post-processing. If that's the case, let me know and I will add that part to my answer.

Redshift skip entire file which contains error

Is there any way, option, or workaround to skip an entire file that contains bad entries while loading the data from S3 to Redshift?
Please note that I am not talking about skipping the invalid entries in the file, but the entire file that contains a bad entry or record.
By default, Redshift fails the entire file if you don't supply the MAXERROR option in the COPY command. That is its default behavior.
copy catdemo from 's3://awssampledbuswest2/tickit/category_pipe.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2';
The above command will fail the entire file and will not load any data from it. Read the COPY documentation for more information.
If you specify the MAXERROR option, it ignores up to that number of bad records from the particular file.
copy catdemo from 's3://awssampledbuswest2/tickit/category_pipe.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2' MAXERROR 500;
In the above example, Redshift will tolerate up to 500 bad records.
I hope this answers your question, but if it doesn't, please update the question and I will refocus the answer.

Greenplum to SAS Bulkload gpfdist error - line too long in file

I'm currently doing a bulk load from Greenplum to SAS. Initially there was one field with a backslash "\" at the end of the column, which caused an error during loading. To resolve it I changed the format from TEXT to CSV and it worked fine. But when loading more data I encountered this error:
gpfdist error - line too long in file
I've done some searching but couldn't determine whether the cause is the max_length value set when starting the gpfdist service. I also saw that there is a 1 MB limit on Windows? I'd greatly appreciate your help.
By the way, here is some additional info which might help:
-Greenplum version: 4.2.1.0 build 3
-Gpfdist installed in Windows along with SAS Applications
-Script submitted to Greenplum based on SAS Logs:
CREATE EXTERNAL TABLE ( ) LOCATION ('gpfdist://:8081/fileout.dat')
FORMAT 'CSV' ( DELIMITER '|' NULL '\N') ENCODING 'LATIN1'
Thanks!
"Line too long" sorts of errors usually indicate that you've got extra delimiters buried in VARCHAR/TEXT columns that throw the parsing of the file off.
Another possibility is that you've got hidden control characters, extra line breaks, or other nasty stuff hidden in your file that again is throwing your formatting off. Gpfdist can handle a lot of different data errors and keep going, but extra delimiters throw it for a loop.
Scan your load file looking for extra pipe characters in a line.
Another option would be to re-export your data, picking a different delimiter.
Please try an alternative solution: select the input format as TEXT and the client encoding as ISO_8859_5 in the session and see if that helps you. In my case it worked.