Redshift copy command recursive scan - amazon-web-services

Is it possible to copy all files under the root directory/bucket?
Example folder structure:
/2016/01/file.json
/2016/02/file.json
/2016/03/file.json
...
I've tried with the following command:
copy mytable
FROM 's3://mybucket/2016/*'
CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
json 's3://mybucket/jsonpaths.json'

Specify a prefix for the load, and all Amazon S3 objects with that prefix will be loaded (in parallel) into Amazon Redshift.
Examples:
copy mytable
FROM 's3://mybucket/2016/'
will load all objects stored in: mybucket/2016/*
copy mytable
FROM 's3://mybucket/2016/02'
will load all objects stored in: mybucket/2016/02/*
copy mytable
FROM 's3://mybucket/2016/1'
will load all objects stored in: mybucket/2016/1* (e.g. 10, 11, 12)
Basically, it just makes sure the object starts with the given string (including the full path).

Apparently this is as simple as changing the source URL to s3://mybucket/2016/; no wildcards required.
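For reference, a quick way to preview exactly which objects a given prefix will match (and therefore what COPY will load) is to list them with boto3. A minimal sketch, using the placeholder bucket and prefix from the question:
import boto3

# List every object key that COPY FROM 's3://mybucket/2016/' would pick up.
# 'mybucket' and the prefix are placeholders -- substitute your own values.
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='mybucket', Prefix='2016/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])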

Related

S3 Folder Containing Redshift Spectrum Table Deleted Randomly

I have an external table in Redshift. When I use UNLOAD to fill this table, sometimes the S3 folder that contains the data gets deleted randomly (or I couldn’t figure out the reason).
Here's the script I use to fill the external table:
UNLOAD ('SELECT * FROM PUBLIC.TABLE_NAME T1 INNER JOIN EXTERNAL_SCHEMA.TABLE_NAME')
TO 's3://bucket-name/main_folder/folder_that_gets_deleted/'
IAM_ROLE 'arn:aws:iam::000000000000:role/my_role'
FORMAT AS PARQUET
CLEANPATH
PARALLEL OFF;
I'm not sure there's enough info in the question to uniquely identify a solution. What jumps out at me is that the CLEANPATH parameter deletes the existing files at the path targeted by TO. If something fails elsewhere that causes the UNLOAD not to complete (e.g. if it's a big file, maybe competing resources are slowing things down), then perhaps the CLEANPATH deletion completes but no new files are created to replace the deleted ones.
Perhaps try ALLOWOVERWRITE instead of CLEANPATH. This parameter overwrites any existing files with the output of the UNLOAD command, so if the UNLOAD fails, nothing gets deleted.
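For illustration, a minimal sketch of the same UNLOAD with ALLOWOVERWRITE in place of CLEANPATH, submitted here through the Redshift Data API with boto3 (the cluster identifier, database, user and the elided SELECT are placeholders; the statement can just as well be run from any SQL client):
import boto3

client = boto3.client('redshift-data')

unload_sql = """
UNLOAD ('SELECT ...')  -- same query as in the question, elided here
TO 's3://bucket-name/main_folder/folder_that_gets_deleted/'
IAM_ROLE 'arn:aws:iam::000000000000:role/my_role'
FORMAT AS PARQUET
ALLOWOVERWRITE  -- overwrite existing files instead of deleting them up front
PARALLEL OFF;
"""

client.execute_statement(
    ClusterIdentifier='my-cluster',  # placeholder
    Database='my_database',          # placeholder
    DbUser='my_user',                # placeholder
    Sql=unload_sql,
)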

Amazon Redshift: Check column names during COPY

Can I check columns names during copy from S3 to Redshift?
For example, I have "good" CSV:
name ,sur_name
BOB , FISCHER
And I have "wrong" CSV:
sur_name,name
FISCHER , BOB
Can I check names of columns during copy command?
I don't want to use AWS Glue or AWS Lambda for checks because I don't want to open/load/save the same file many times.
(The same problem applies to other file formats with column names.)
This is a very simple check, so Redshift should allow it, but I can't find any information about it.
Or, if this is not possible, can you give me some idea of how to do it without reading the whole file?
(For example, a Lambda function that reads only the header without fetching the entire file.)
From Column mapping options - Amazon Redshift:
You can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
Therefore, the only way to read such files would be to specify the column names in their correct order. This requires you to look inside the file to determine the order of the columns.
When reading an object from Amazon S3, it is possible to specify the range of bytes to be read. So, instead of reading the entire file, it could read just the first 200 bytes (or whatever size would be sufficient to include the header row). An AWS Lambda function could read these bytes, extract the column names, then generate a COPY command that would import the columns in the correct order (without having to read the entire file first).
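A minimal sketch of that idea, assuming a plain CSV with a single header row that fits in the first 200 bytes (the bucket, key, table name, byte range and IAM role are hypothetical placeholders):
import boto3

def build_copy_statement(bucket, key, table, header_bytes=200):
    """Read only the header row of a CSV in S3 and build a COPY statement
    whose column list matches the column order in the file."""
    s3 = boto3.client('s3')

    # Fetch just the first few hundred bytes instead of the whole object.
    response = s3.get_object(Bucket=bucket, Key=key, Range=f'bytes=0-{header_bytes - 1}')
    header_line = response['Body'].read().decode('utf-8').splitlines()[0]

    # Normalise the column names found in the header row.
    columns = [name.strip() for name in header_line.split(',')]

    # Map source fields to the matching target columns, in file order.
    return (
        f"COPY {table} ({', '.join(columns)}) "
        f"FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "  # placeholder role
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

print(build_copy_statement('mybucket', 'incoming/people.csv', 'public.people'))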

Copy file from s3 subfolder in another subfolder in same bucket

I'd like to copy a file from one subfolder into another subfolder in the same S3 bucket. I've read lots of questions on SO and finally came up with this code. It has an issue: when I run it, it works, but it doesn't copy only the file; it copies the folder that contains the file into the wanted destination, so I get the file, but inside a folder. How do I copy only the files inside that subfolder?
XXXBUCKETNAME:
-- XXXX-input/ # I want to copy from here
-- XXXX-archive/ # to here
import boto3
from botocore.config import Config
s3 = boto3.resource('s3', config=Config(proxies={'https': getProperty('Proxy', 'Proxy.Host')}))
bucket_obj = s3.Bucket('XXX')
destbucket = 'XXX'
jsonfiles = []
for obj in bucket_obj.objects.filter(Delimiter='/', Prefix='XXXX-input/'):
    if obj.key.endswith('json'):
        jsonfiles.append(obj.key)
for k in jsonfiles:
    if k.split("_")[-1:][0] == "xxx.txt":
        dest = s3.Bucket(destbucket)
        source = {'Bucket': destbucket, 'Key': k}
        dest.copy(source, "XXXX-archive/" + k)
It gives:
XXXBUCKETNAME:
-- XXXX-input/
-- XXXX-archive/
-- XXXX-input/file.txt
I want:
XXXBUCKETNAME:
-- XXXX-input/
-- XXXX-archive/
-- file.txt
In S3 there really aren't any "folders." There are buckets and objects, as explained in the documentation. The UI may make it seem like there are folders, but the key for an object is the entire path. So if you want to copy one item, you will need to parse its key and build the destination key differently, so that it keeps the destination prefix (path) but ends with just the file name.
In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. It does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.
In your code you are pulling out each object's key, so that means the key already contains the full "path" even though there isn't really a path. So you will want to split the key on the / character instead and then take the last element in the resulting list and append that as the file name:
dest.copy(source, "XXXX-archive/" + k.split("/")[-1])
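For completeness, a minimal self-contained sketch of the corrected loop (the proxy configuration and the extra file-name check from the question are omitted; the bucket name and prefixes are the same placeholders):
import boto3

s3 = boto3.resource('s3')
bucket_name = 'XXX'  # placeholder bucket from the question
dest_bucket = s3.Bucket(bucket_name)

for obj in s3.Bucket(bucket_name).objects.filter(Prefix='XXXX-input/'):
    if obj.key.endswith('.json'):
        source = {'Bucket': bucket_name, 'Key': obj.key}
        # Keep only the file name from the source key so the object lands
        # directly under XXXX-archive/ rather than under XXXX-archive/XXXX-input/.
        dest_bucket.copy(source, 'XXXX-archive/' + obj.key.split('/')[-1])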

BigQuery EXPORT DATA statement creating multiple files with no data and just a header record

I have read about a similar issue here but am not able to understand whether this is fixed.
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using the BigQuery EXPORT DATA OPTIONS below to export the data from 2 tables into a file. I have written a select query for the same.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I have only 2 rows returning from my select query, and I assume that only one file should be created in Google Cloud Storage. Multiple files are created only when the data is more than 1 GB; that's what I understand.
However, 3 files got created in Cloud Storage, where 2 files just had the header record and the third file had 3 records (one header and 2 actual data records).
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why are empty files getting created?
Can anyone please help? We don't want to create empty files. I believe only one file should be created when the data is under 1 GB; above 1 GB we should have multiple files, but NOT empty ones.
You have to force all data to be loaded into one worker. In this way you will be exporting only one file (if <1Gb).
My workaround: add a select distinct * on top of the Select statement.
Under the hood, BigQuery utilizes multiple workers to read and process different sections of data and when we use wildcards, each worker would create a separate output file.
Currently BigQuery produces empty files even if no data is returned and thus we get multiple empty files. The Bigquery product team is aware of this issue and they are working to fix this, however there is no ETA which can be shared.
There is a public issue tracker that will be updated with periodic progress. You can STAR the issue to receive automatic updates and give it traction by referring to this link.
However for the time being I would like to provide a workaround as follows:
If you know that the output will be less than 1 GB, you can specify a single URI to get a single output file. However, the EXPORT DATA statement doesn't support a single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact, bq extract should not have the empty-file issue that the EXPORT DATA statement has, even when you use a wildcard URI.
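The same extract can also be run programmatically; a minimal sketch with the google-cloud-bigquery client (the project, dataset, table and bucket names are placeholders), which, like bq extract, writes a single file when given a single non-wildcard URI:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    field_delimiter='|',
    print_header=True,
)

# A single, non-wildcard URI -> a single output file (the export must be < 1 GB).
extract_job = client.extract_table(
    'my-project.my_dataset.customer_master',       # placeholder source table
    'gs://my-bucket/outbound/Customer_Master.csv',  # placeholder destination
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes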
I faced the same empty files issue when using EXPORT DATA.
After doing a bit of R&D, I found the solution: put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the row count and use that as the LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx
It turns out you need to use the multiple-file, wildcard syntax: either a wildcard file name for CSV or a folder for other formats like AVRO.
The uri option must be a single-wildcard URI as described
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements
Specifying a wildcard seems to start several workers to work on the extract, and, as per the documentation, the size of the exported files will vary.
Zero-length files are unusual but technically possible if the first worker is done before any others really get started. Hence the wildcard is expected to be used only when you think your exported data will be larger than 1 GB.
I have just faced the same issue with Parquet but found out that the bq CLI works, which should do for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

NiFi - move files in hdfs to a file directory attribute

I've been trying to use a MoveHDFS processor to move parquet files from a /working/partition/ directory in hdfs to a /success/partition/ directory. The partition value is set based on a ExecuteSparkJob processor earlier in the flow. After finding my parquet files in the root / directory, I found the following in the processor description for Output Directory:
The HDFS directory where the files will be moved to
Supports Expression Language: true (will be evaluated using variable registry only)
Turns out the processor was sending the files to / instead of ${dir}/.
Since my attributes are set on the fly based on the spark processing result, I can't simply add to the variable registry and restart nodes for each flowfile (which from my limited understanding is what using the variable registry requires). One option is to use an ExecuteStreamCommand processor with a custom script to accomplish this use case. Is that my only option here or is there a built-in way to move HDFS files to attribute-set directories?
You can try this approach:
Step 1: Use MoveHDFS to move your file to a temporary location, say path X. The Input Directory property in the MoveHDFS processor can accept a flowfile attribute.
Step 2: Connect the success connection to a FetchHDFS processor.
Step 3: In the FetchHDFS processor, set the HDFS Filename property to the expression ${absolute.hdfs.path}/${filename}. This will fetch the file data from path X into the flowfile content.
Step 4: Connect the success connection from FetchHDFS to a PutHDFS processor.
Step 5: Configure the PutHDFS Directory property as per your requirements to accept the flowfile attribute for the partition data on the fly.
Cons:
One con of this approach is the duplicate copy that MoveHDFS creates to store the data temporarily before sending it to the actual location. You might have to develop a separate flow to delete the duplicate copy if it's not required.