Use tFileUnarchive on Amazon S3

I have a talend job which is simple like below:
tS3Connection -> tS3Get -> tFileInputDelimited -> tMap -> tAmazonMysqlOutput.
Now the scenario here is that sometimes I get the file in .txt format and sometimes I get it as a zip file.
So I want to use tFileUnarchive to unzip the file if it's a zip, or process it while bypassing the tFileUnarchive component if the file is already unzipped, i.e. only in .txt format.
Any help on this is greatly appreciated.

The trick here is to break the file retrieval and potential unzipping into one sub job and then the processing of the files into another sub job afterwards.
Here's a simple example job:
As normal, you connect to S3 and then you might list all the relevant objects in the bucket using the tS3List component and pass them to tS3Get. Alternatively, you might have another way of passing the relevant object key that you want to download to tS3Get.
In the example job I set tS3Get up to fetch every object that is iterated on by the tS3List component by setting the key as:
((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and then downloading it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
The extra bit I've added starts with a Run If conditional link from tS3Get to the tFileUnarchive component, with the condition:
((String)globalMap.get("tS3List_1_CURRENT_KEY")).endsWith(".zip")
which checks whether the file being downloaded from S3 is a .zip file.
The tFileUnarchive component then just needs to be told what to unzip, which will be the file we've just downloaded:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and where to extract it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads"
This then puts any extracted files in the same place as the ones that didn't need extracting.
From here we can iterate through the downloads folder with a tFileList component, looking for the file types we want, by setting the directory to "C:/Talend/5.6.1/studio/workspace/S3_downloads" and the glob expression to "*.csv" (in my case, because I wanted to read in only the CSV files, including the zipped ones, that I had in S3).
Finally, we then read the delimited files by setting the file to be read by the tFileInputDelimited component as:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
And in my case I simply then printed this to the console but obviously you would then want to perform some transformation before uploading to your AWS RDS instance.
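If it helps to see the shape of that flow outside of Talend, here is a rough Python sketch of the same pattern: download each object, unzip only the ones ending in .zip, then process every CSV in the download folder. It assumes boto3 is installed, and the bucket name and prefix are placeholders rather than anything from the original job:

import os
import zipfile

import boto3

BUCKET = "my-bucket"    # placeholder
PREFIX = "incoming/"    # placeholder
DOWNLOAD_DIR = "C:/Talend/5.6.1/studio/workspace/S3_downloads"

s3 = boto3.client("s3")

# The tS3List -> tS3Get part: download every object under the prefix
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip "folder" placeholder keys
        local_path = os.path.join(DOWNLOAD_DIR, os.path.basename(key))
        s3.download_file(BUCKET, key, local_path)

        # The Run If condition: only unzip the .zip downloads
        if key.endswith(".zip"):
            with zipfile.ZipFile(local_path) as zf:
                zf.extractall(DOWNLOAD_DIR)

# The tFileList -> tFileInputDelimited part: read every CSV now sitting locally
for name in os.listdir(DOWNLOAD_DIR):
    if name.endswith(".csv"):
        with open(os.path.join(DOWNLOAD_DIR, name)) as f:
            print(f.read())  # replace with your tMap transformation / RDS load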

Related

How to exclude either files or folder paths on S3 within an AWS Glue job when reading an Athena table?

We have an AWS Glue job that is attempting to read data from an Athena table that is being populated by HUDI. Unfortunately, we are running into an error that relates to create_dynamic_frame.from_catalog trying to read from these tables.
An error occurred while calling o82.getDynamicFrame. s3://bucket/folder/.hoodie/20220831153536771.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
This appears to be a known issue on GitHub: https://github.com/apache/hudi/issues/5891
Unfortunately, no workaround was provided. We are attempting to see if we can exclude either the .hoodie folder or the *.commit file(s) within the additional_options of the create_dynamic_frame.from_catalog connection, but we are not having any success excluding either a file or a folder. Note: we have .hoodie files in the root directory as well as a .hoodie folder that contains a .commit file, among other files. We would prefer to exclude them all.
Per AWS:
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "["**.pdf"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
Question: how do we exclude both file and folder from a connection?
Folder
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV']+"_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\".hoodie/**\"]"})
File
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV']+"_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\"**.commit\"]"})
Turns out the original attempted solution of {"exclusions": "[\"**.commit\"]"} worked. Unfortunately, I wasn't paying close enough attention and there were multiple tables that needed to be excluded. After hacking through all of the file types, here are two working solutions:
Exclude folder
additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}
Exclude file(s)
additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}
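If it is useful to see the two working exclusion styles side by side, here is a rough sketch that combines them in one exclusions list; the database, table, and the Glue boilerplate around it are placeholders, not the original job:

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['ENV'])
glueContext = GlueContext(SparkContext.getOrCreate())

# Exclude the .hoodie folder plus the HUDI bookkeeping file types in one list
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['ENV'] + "_some_database",
    table_name="some_table",
    transformation_ctx="datasource_x1",
    additional_options={
        "exclusions": "[\"s3://bucket/folder/.hoodie/*\",\"**.commit\",\"**.inflight\",\"**.properties\"]"
    }
)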

How can I configure a snowpipe to grab the same filename from an S3 bucket when the file is refreshed and re-uploaded?

We have a csv file that is maintained by an analyst who manually updates it at irregular intervals and re-uploads (by drag and drop) the same file to an S3 bucket. I have Snowpipe set up to ingest files from this S3 bucket, but it won't re-process the same filename even when the contents change. We don't want to rely on the analyst(s) remembering to manually rename the file each time they upload it, so we are looking for an automated solution. I have pretty minimal input on how the analysts work with this file; I just need to ingest it for them. The options I'm considering are:
1. Somehow adding a timestamp or unique identifier to the filename on upload (not finding a way to do this easily in S3). I've also experimented with versioning in the S3 bucket, but this doesn't seem to have any effect.
2. Somehow forcing the pipe to grab the file again even with the same name. I've read elsewhere that setting "Force=true" might do it, but that seems to be an invalid option for a pipe's COPY INTO statement.
Here is the pipe configuration; I'm not sure if it will be helpful here:
CREATE OR REPLACE PIPE S3_INGEST_MANUAL_CSV AUTO_INGEST=TRUE AS
COPY INTO DB.SCHEMA.STAGE_TABLE
FROM (
    SELECT $1, $2, metadata$filename, metadata$file_row_number
    FROM @DB.SCHEMA.S3STAGE
)
FILE_FORMAT = (
    TYPE = 'csv'
    skip_header = 1
)
ON_ERROR = 'SKIP_FILE_1%'
Ignoring the fact that updating the same file rather than using a unique filename each time is really bad practice, you can use the FORCE option to force the reloading of the same file.
If the file hasn't been changed and you run the process with this option, you'll potentially end up with duplicates in your target table.
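Since the question notes that FORCE isn't accepted in the pipe definition itself, one way to apply it is to run the same COPY manually (outside the pipe) whenever the file is re-uploaded, for example via the Snowflake Python connector. A rough sketch; the connection parameters are placeholders:

import snowflake.connector

# Placeholder credentials; fill in your own account details
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
)

# The same COPY as the pipe, run ad hoc with FORCE = TRUE so the
# already-loaded filename gets picked up again
conn.cursor().execute("""
    COPY INTO DB.SCHEMA.STAGE_TABLE
    FROM (
        SELECT $1, $2, metadata$filename, metadata$file_row_number
        FROM @DB.SCHEMA.S3STAGE
    )
    FILE_FORMAT = (TYPE = 'csv' skip_header = 1)
    ON_ERROR = 'SKIP_FILE_1%'
    FORCE = TRUE
""")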

Zip file contents in AWS Lambda

We have a function that gets the list of files in a zip file, and it works standalone and in Lambda until the file is larger than 512 MB.
The function needs to get a list of files in the zip file and read the contents of a JSON file that should be in the zip file.
This is part of the function:
try:
    s3_object = s3_client.get_object(Bucket=bucketname, Key=filename)
    # s3_object = s3_client.head_object(Bucket=bucketname, Key=filename)
    # s3_object = s3_resource.Object(bucket_name=bucketname, key=filename)
except:
    return ('NotExist')

zip_file = s3_object['Body'].read()
buffer = io.BytesIO(zip_file)
# buffer = io.BytesIO(s3_object.get()['Body'].read())

with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as zip_files:
    for content_filename in zip_files.namelist():
        zipinfo = zip_files.getinfo(content_filename)
        if zipinfo.filename[:2] != '__':
            no_files += 1
            if zipinfo.filename == json_file:
                json_exist = True
                with io.TextIOWrapper(zip_files.open(json_file), encoding='utf-8') as jsonfile:
                    object_json = jsonfile.read()
The get_object call is the issue, as it loads the whole file into memory and obviously, the larger the file, the more likely it is to exceed the memory available in Lambda.
I've tried using head_object, but that only gives me the metadata for the file, and I don't know how to get the list of files in the zip file when using head_object or resource.Object.
I would be grateful for any ideas please.
It would likely be the .read() operation that consumes the memory.
So, one option is to simply increase the memory allocation given to the Lambda function.
Or, you can download_file() the zip file, but Lambda functions are only given 512MB in the /tmp/ directory for storage, so you would likely need to mount an Amazon EFS filesystem for additional storage.
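For the download_file() route, a minimal sketch along these lines (the bucket and key are placeholders, and anything bigger than the 512 MB /tmp allowance would need the EFS mount mentioned above):

import zipfile

import boto3

s3_client = boto3.client('s3')

# Placeholders; /tmp gives a Lambda function 512 MB of scratch space by default
s3_client.download_file('my-bucket', 'big-archive.zip', '/tmp/big-archive.zip')

with zipfile.ZipFile('/tmp/big-archive.zip', allowZip64=True) as zf:
    print(zf.namelist())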
Or you might be able to use smart-open (on PyPI) to directly read the contents of the zip file from S3 -- it knows how to use open() with files in S3 and also with zip files.
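A rough sketch of the smart-open route (assuming the package is installed, e.g. pip install smart_open[s3]; the bucket and key are placeholders). Its S3 reader supports seek(), which is what zipfile needs:

import zipfile

from smart_open import open as sopen

# Placeholder bucket and key
with sopen("s3://my-bucket/big-archive.zip", "rb") as fileobj:
    with zipfile.ZipFile(fileobj, allowZip64=True) as zf:
        print(zf.namelist())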
Not a full answer, but too long for comments.
It is possible to download only part of a file from S3, so you should be able to grab only the list of files and parse that.
The zip file format places the list of archived files at the end of the archive, in a Central directory file header.
You can download part of a file from S3 by specifying a range to the GetObject API call. In Python, with boto3, you would pass the range as a parameter to the get_object() s3 client method, or the get() method of the Object resource.
So, you could read pieces from the end of the file in 1 MiB increments until you find the header signature (0x02014b50), then parse the header and extract the file names. You might even be able to trick Python into thinking it's a proper .zip file and convince it to give you the list while providing only the last piece(s). An elegant solution that doesn't require downloading huge files.
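One way to realise that "trick Python" idea is to hand zipfile a small seekable file-like object that turns every read() into a ranged GetObject call, so only the end-of-central-directory record and the central directory are actually downloaded. A rough sketch, with the bucket and key as placeholders:

import io
import zipfile

import boto3

class S3RangeFile(io.RawIOBase):
    """Minimal read-only, seekable file-like object backed by S3 Range requests."""

    def __init__(self, s3_client, bucket, key):
        self.s3 = s3_client
        self.bucket = bucket
        self.key = key
        # One HEAD request up front to learn the object size
        self.size = s3_client.head_object(Bucket=bucket, Key=key)['ContentLength']
        self.pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        return self.pos

    def read(self, size=-1):
        if size == -1 or self.pos + size > self.size:
            size = self.size - self.pos
        if size <= 0:
            return b''
        # Fetch only the requested byte range from S3
        resp = self.s3.get_object(
            Bucket=self.bucket,
            Key=self.key,
            Range='bytes={}-{}'.format(self.pos, self.pos + size - 1),
        )
        data = resp['Body'].read()
        self.pos += len(data)
        return data

s3_client = boto3.client('s3')
fileobj = S3RangeFile(s3_client, 'my-bucket', 'big-archive.zip')  # placeholders
with zipfile.ZipFile(fileobj, allowZip64=True) as zf:
    print(zf.namelist())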
Or, it might be easier to ask the uploader to provide a list of files with the archive. Depending on your situation, not everything has to be solved in code :).

rclone - How do I list which directory has the latest files in AWS S3 bucket?

I am currently using rclone to access AWS S3 data, and since I don't use either one much, I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has realtime data being written to it at any time and that is the one I need to find. How do I determine which one is the one I need? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents from all 999 folders, it's useless, as the next folder will be active by that time.
If you wanted to know the specific folder with the very latest file, you should write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3

s3_resource = boto3.resource('s3')

# List every object under the KEWX/ prefix of the public bucket
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')

# Build (last_modified, key) pairs so the list can be sorted by timestamp
date_key_list = [(obj.last_modified, obj.key) for obj in objects]
print(len(date_key_list))  # How many objects?

# Sort newest first; the key of the newest object tells you the active folder
date_key_list.sort(reverse=True)
print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!

Load files from folder with a custom query with power BI

I am trying to load csv files from a folder but I need to apply several custom steps to each file, including dropping the PromoteHeaders default.
I have a custom query that can load a single file successfully. How do I turn it into a query that loads all files in a folder?
By default, File.folder's "promoteHeaders" messes up my data because of a missing column name (which my custom query fixes).
The easiest way to create a function that reads a specific template of file is to actually do it: create the M query that reads one file, then right-click on the entity and transform it into a function.
After that, it is really simple to adjust your M so it uses parameters.
You can create a blank query and replace its code with this one as an example; customize it with more steps to deal with your file requirements.
= (myFile) => let
    Source = Csv.Document(myFile, [Delimiter=",", Columns=33, Encoding=1252, QuoteStyle=QuoteStyle.None])
in
    Source
And then use Invoke Custom Function for each file, with the file content as the parameter.