I have a NiFi flow where I am getting all data from S3 and putting it into the destination folder. Now the requirement is: if there is any new data, transfer only the latest data. I have data files in S3 like below:
20201130-011101493.parquet
20201129-011101493.parquet
And the regex I tried:
\d[0-9]{8}.parquet
The problem is that it is not picking the first file, which is the latest data, i.e. 30/11/2020.
How can I modify my regex so that it picks only the latest file, given that the job runs once per day? I also referred to this SO post, but I guess I am not able to get my regex correct.
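To make the intent concrete, here is a rough sketch (Python, just for illustration; the actual filtering happens in NiFi) of the kind of pattern I think I need, assuming the names follow yyyyMMdd-HHmmssSSS.parquet:

import re
from datetime import datetime, timezone

# Anchor the pattern on today's date so only the latest daily file matches.
today = datetime.now(timezone.utc).strftime("%Y%m%d")
pattern = re.compile(rf"{today}-\d{{9}}\.parquet")

filenames = ["20201130-011101493.parquet", "20201129-011101493.parquet"]
latest_only = [name for name in filenames if pattern.fullmatch(name)]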
While uploading files one by one to an AWS S3 bucket using Java code, I'm observing a strange issue with the Last Modified Date column: all the files are showing the same last modified date. I followed a few posts on Stack Overflow, but nowhere is it properly explained how to set user-defined metadata while storing a file to S3.
I did it like this in my code, but it did not work for me. Could you please suggest a fix?
ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentLength(bytes.length);
// user-defined metadata is sent as the "x-amz-meta-last-modified" header
metaData.addUserMetadata("last-modified", Instant.now().toString());
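For reference, the equivalent with boto3 would presumably be something like the sketch below (bucket and key are placeholders); even there the Last Modified column seems to be assigned by S3 itself at upload time:

import boto3

s3 = boto3.client("s3")
# Metadata entries are returned as x-amz-meta-* headers on the object.
s3.put_object(
    Bucket="my-bucket",
    Key="data/file1.csv",
    Body=b"...",
    Metadata={"last-modified": "2020-11-30T01:11:01Z"},
)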
I have an external table in Athena linked to a folder in S3. There are some pseudocolumns in Presto that allow me to get metadata information about the files sitting in that folder (for example, the $path pseudocolumn).
I wonder if there is a pseudocolumn that would give me the last modified timestamp of a file in S3 when querying in AWS Athena.
This seems like a reasonable feature request. Please file an issue and include details about your use case (it's possible there is a better approach).
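In the meantime, one workaround is to pull the Last Modified timestamps from S3 yourself and join them against the "$path" values; a rough sketch with boto3 (bucket and prefix are placeholders):

import boto3

# List every object under the table's prefix and keep its LastModified timestamp,
# keyed by the same s3:// URI that Athena exposes via "$path".
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
last_modified = {}
for page in paginator.paginate(Bucket="my-bucket", Prefix="my/table/prefix/"):
    for obj in page.get("Contents", []):
        last_modified[f"s3://my-bucket/{obj['Key']}"] = obj["LastModified"]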
I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
.
.
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket and I want to use AWS Glue to transform it into parquet and partition it, but there is way too much data to do it all at once. I was looking into bookmarking and it seems like you can't use it to only read the most recent data or to process data in chunks. Is there a recommended way of processing data like this so that bookmarking will work for when new data comes in?
Also, does bookmarking require that Spark or Glue scan my entire dataset each time I run a job, to figure out which files are newer than the last run's max_last_modified timestamp? That seems pretty inefficient, especially as the data in the source bucket continues to grow.
I have learned that Glue wants all similar files (files with same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under report-type-a folder must be of the same format. Put a different report like report-type-b in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, etc.
I tried this by getting the current files working (one file per day), then back-filling historical files. Note, however, that this did not work completely. Files in s3://my-bucket/report-type/2019/07/report_20190722.gzip were processed OK, but when I tried to add past files to s3://my-bucket/report-type/2019/05/report_20190510.gzip, Glue did not "see" or process the file in the older folder.
However, if I moved the old report to the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip.
AWS Glue bookmarking works only with a select few formats (more here) and only when the data is read using the glueContext.create_dynamic_frame.from_options function. Along with this, job.init() and job.commit() should also be present in the Glue script. You can check out a related answer.
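A minimal sketch of such a bookmark-enabled script, assuming the input under s3://bucket/topics/topic1/ is in a supported format such as JSON (the output path and names are just illustrations):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx ties this read to the job bookmark state,
# so only files not seen by previous runs are picked up.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/topics/topic1/"], "recurse": True},
    format="json",
    transformation_ctx="source_topic1",
)

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://bucket/parquet/topic1/"},
    format="parquet",
    transformation_ctx="sink_topic1",
)

job.commit()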
I have gzip files in an S3 bucket. They are not CSV files; they are text files with columns separated by spaces. I am new to Glue. Is there some way to use a Glue Data Crawler to read this content?
Glue is just Spark under the hood, so you can use the same Spark code you would normally use to process a space-delimited file (e.g. splitting on whitespace); a sketch of that approach follows the crawler steps below. The Glue crawler, on the other hand, creates the table metadata by parsing the data, and if your data is space-separated the crawler won't be able to parse it: it will basically consider the whole line as one single text column. To handle it with a crawler, you will need to write a custom classifier using a Grok pattern. Unfortunately there is no clear example provided in the AWS documentation, so I am giving one below:
Assuming your data is like below: (it can be in the gzip file as well)
qwe 123 22.3 2019-09-02
asd 123 12.3 2019-09-02
de3 345 23.3 2019-08-22
we3 455 12.3 2018-08-11
ccc 543 12.0 2017-12-12
First you have to create a custom classifier
Grok Pattern
%{NOTSPACE:name} %{INT:class_num} %{BASE10NUM:balance} %{CUSTOMDATE:balance_date}
Custom patterns
CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}
Now create a crawler using the custom classifier you just created. Run the crawler. Then check the metadata created in your database to see if it has recognised the data properly.
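If you prefer the plain Spark route mentioned at the start rather than a crawler, a sketch for the sample rows above (the S3 path is a placeholder; the column names just mirror the Grok fields):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark reads .gz files transparently; sep=" " splits the space-delimited columns.
df = (
    spark.read
    .option("sep", " ")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/space-delimited/")
    .toDF("name", "class_num", "balance", "balance_date")
)
df.show()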
Please let me know if you have any questions. You can also share a few lines from the file you are trying to process.
If you are new to Glue and keen to try it, you may like to read the blog post I have written on LinkedIn regarding Glue. Please click this link.
I have uploaded a simple 10-row CSV file (from S3) into the AWS ML website. It keeps giving me the error:
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward to build the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files of my own to S3 is by downloading an existing .csv file from my S3 bucket, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the .csv file? I am able to upload my own .csv file along with a schema that I have created, and it is working. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc. in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on my Windows machine, the same file worked perfectly.
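One thing worth trying is re-saving the file programmatically with plain UTF-8 and Unix line endings before uploading; a small sketch (file names are placeholders):

import csv

# Re-write the CSV with UTF-8 (BOM stripped, if any) and "\n" line endings,
# which sidesteps the editor-specific quirks mentioned above.
with open("original.csv", newline="", encoding="utf-8-sig") as src, \
        open("clean.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow(row)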