I have a file containing holidays, and a UDF needs this file in order to calculate the business days between two given dates. The issue is that when I add the file, it goes to a working directory, but this directory differs every session.
This is unlike the example below from Hive Resources:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> select from networks a
MAP a.networkid
USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
This is what I am getting instead, and the alphanumeric part keeps changing:
/mnt/tmp/a17b43d5-df53-4eea-8e2c-565471b49d25_resources/holiday2021.csv
I need this file to live in a more permanent folder, so that this Hive SQL can be executed on any of the 18 nodes.
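One commonly used workaround, sketched here under the assumption that the file has first been copied to a shared HDFS location (the path below is hypothetical), is to ADD the file from DFS instead of from a single node's local disk. Hive still localizes the resource into a per-session directory, but the canonical copy is reachable from all 18 nodes, and the UDF can open the file by its basename because added resources are shipped to each task's working directory:
hive> add FILE hdfs:///user/hive/resources/holiday2021.csv;
hive> list FILES;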
Is there any way to provide a suffix for paths when doing a partitioned unload to S3?
e.g. I want to use the output of *several* queries for batch jobs, where query outputs are partitioned by date.
Currently I have a structure in S3 like:
s3://bucket/path/queryA/key=1/*.parquet
s3://bucket/path/queryA/key=2/*.parquet
s3://bucket/path/queryB/key=1/*.parquet
s3://bucket/path/queryB/key=2/*.parquet
But ideally, I would like to have:
s3://bucket/path/key=1/queryA/*.parquet
s3://bucket/path/key=2/queryA/*.parquet
s3://bucket/path/key=1/queryB/*.parquet
s3://bucket/path/key=2/queryB/*.parquet
So that I can then use the following as input paths to batch processing jobs (e.g. on SageMaker):
s3://bucket/path/key=1/
s3://bucket/path/key=2/
Such that each batch job has the output of all queries for the particular day that the batch job is computing for.
Currently, I reshape the data in S3 after unloading, but it would be much faster and more convenient if I could specify a suffix for Redshift to append to S3 unload paths *after* the partition suffix.
From the UNLOAD docs I'm assuming that this isn't possible, and I'm unable to post on AWS forums.
But perhaps there's some other command or a connection variable that I can use, a hack involving something like a literal value for a second partition key, or a totally different strategy altogether?
You could add an artificial column q to mark the query, and then use it as a second partition column; that would effectively add a q=queryA prefix to your path.
But Redshift does not allow UNLOADing into a non-empty location unless you provide the ALLOWOVERWRITE option.
And since you don't control the unloaded filenames (they depend on the slice count and the maximum file size), allowing overwrite may cause your data to actually be overwritten if you happen to use the same partition keys.
To work around that, you could add one more artificial partitioning column that adds a unique component to your path (the same value for the whole unload). I used RANDOM in my example for that; you could use something more clash-resistant.
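For instance, combining a timestamp with the random hash would make collisions even less likely. A minimal sketch using standard Redshift functions (the alias r is just illustrative):
-- hypothetical clash-resistant unique component: unload timestamp plus a random hash
select to_char(getdate(), 'YYYYMMDDHH24MISS') || '-' || md5(random()) as r;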
Below is an example query, which unloads data without overwriting results even if unloaded multiple times. I ran it for different part and q values.
unload ($$
WITH
-- computed once per unload, so every row shares the same value
rand(rand) as (select md5(random())),
input(val, part) as (
  select 1, 'p1' union all
  select 1, 'p2'
)
SELECT
  val,
  part,
  'queryB' as q,  -- artificial column marking which query produced the rows
  rand as r       -- artificial column making the path unique per unload
FROM input, rand
$$)
TO 's3://XXX/partitioned_unload/'
IAM_ROLE 'XXX'
PARTITION BY (part, q, r)
ALLOWOVERWRITE;
These are the files produced by 3 runs:
aws s3 ls s3://XXX/partitioned_unload/ --recursive
2020-06-29 08:29:14 2 partitioned_unload/part=p1/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0001_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p1/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p1/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p2/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:14 2 partitioned_unload/part=p3/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0002_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p3/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
My actual data in the CSV extracts starts from line 10. How can I skip the top few lines in a Snowflake load using COPY or any other utility? Do we have anything similar to SKIP_HEADER?
I have files on S3, and that is my stage. I will be creating a Snowpipe on this data source later.
Yes, there is a SKIP_HEADER option for CSV that lets you skip a specified number of rows when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format for the CSV files you have in mind and then reference it when calling the COPY command, as sketched below.
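A minimal sketch, with hypothetical names for the file format, table, and stage (data starting on line 10 means the first 9 lines are skipped):
-- my_csv_format, my_table, and my_s3_stage are hypothetical names
create or replace file format my_csv_format
  type = 'CSV'
  skip_header = 9;  -- data starts on line 10, so skip the first 9 lines

copy into my_table
  from @my_s3_stage
  file_format = (format_name = 'my_csv_format');
The same file format can later be referenced from the Snowpipe's COPY INTO statement.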
I want to create a CSV file which contains the results of a query.
This CSV file will live in Google Cloud Storage. (The query result is around 15 GB.) I need it to be a single file. Is that possible, and if so, how?
CREATE OR REPLACE TABLE `your-project.your-dataset.chicago_taxitrips_mod` AS (
WITH
taxitrips AS (
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude,
IF((tips/fare >= 0.2),
1,
0) AS tip_bin
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles > 0
AND fare > 0)
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
tip_bin,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)) AS pickup_grid,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1)) AS dropoff_grid,
ST_Distance(ST_GeogPoint(pickup_longitude,
pickup_latitude),
ST_GeogPoint(dropoff_longitude,
dropoff_latitude)) AS euclidean,
CONCAT(ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)), ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1))) AS loc_cross
FROM
taxitrips
LIMIT
100000000
)
If BigQuery needs to output multiple files, you can then concatenate them into a single one with a gsutil operation for files in GCS:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
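For instance, two exported shards could be merged like this (the object names are hypothetical; BigQuery replaces the * in a wildcard URI with a zero-padded sequence number):
gsutil compose gs://bucket/csvfilename000000000000.csv gs://bucket/csvfilename000000000001.csv gs://bucket/csvfilename-merged.csv
Also note that compose concatenates the objects byte for byte, so if each shard was exported with a CSV header row, the header will be repeated inside the merged file.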
Exporting 15 GB to a single CSV file is not possible (exporting to multiple files is). I tried your same query (bytes processed: 15.66 GB), then tried to export it to a CSV file in GCS, but it failed with this error:
Table gs://[my_bucket]/bq_export/test.csv too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data.
The BQ documentation only allows you to export up to 1 GB of table data to a single file. Since the table exceeds 1 GB, you have to use a wildcard, like:
gs://your-bucket-name/csvfilename*.csv
I'm not sure why you would like the exported CSV to be a single file, but IMHO it's too large to be in a single file. Writing it to multiple files will be a lot faster, since BQ can use its parallelism to write the output with multiple workers.
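For reference, a sharded export can also be expressed directly in SQL with an EXPORT DATA statement. A sketch, reusing the table created in the question (the bucket name is hypothetical):
EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket-name/csvfilename*.csv',  -- the * lets BigQuery shard the output
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `your-project.your-dataset.chicago_taxitrips_mod`;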
Is there any way, option, or workaround to skip an entire file that contains bad entries while loading the data from S3 to Redshift?
Please note that I am not talking about skipping the invalid entries in the file, but the entire file that contains a bad entry or record.
By default, Redshift fails the entire file if you don't supply the MAXERROR option in the COPY command; that is its default behavior.
copy catdemo from 's3://awssampledbuswest2/tickit/category_pipe.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2';
The above command will fail the entire file and will not load any data from the given file. Read the documentation for more information.
Only if you specify the MAXERROR option does it ignore records, up to that number, from a particular file.
copy catdemo from 's3://awssampledbuswest2/tickit/category_pipe.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2' MAXERROR 500;
In the above example, Redshift will tolerate up to 500 bad records.
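When a load is tolerated this way, the bad records, and the files they came from, can be inspected afterwards in the STL_LOAD_ERRORS system table. A minimal sketch (column names as documented by Redshift):
-- show the most recent load errors and which file(s) contained the bad records
select filename, line_number, err_reason
from stl_load_errors
order by starttime desc
limit 10;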
I hope this answers your question, but if it doesn't, please update the question and I will refocus the answer.