List all forecast CSV files exported to AWS S3 bucket when using AWS Forecast Export Job - amazon-web-services

I have trained a Predictor on AWS Forecast, and used it to make some forecasts.
I want to get these forecasts as CSV files. To do so, I created a "ForecastExportJob".
After the export finishes, I can see the CSV files in my S3 bucket.
I would like to download them programmatically, so is there a way to have a list of S3 keys that correspond to the CSV files created with the "ForecastExportJob" command?
I could list all objects in the destination bucket and filter them, but I am wondering whether there is a more elegant solution to my problem.
Put simply, I would like to know if there is an AWS command that can list the files created by the "ForecastExportJob" command:
electricityforecast_export_job_2021-01-04T06-40-23Z_part0.csv
...
electricityforecast_export_job_2021-01-04T06-40-23Z_part7.csv
Note: I am using boto3
Thank you in advance and happy new year!
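The "list and filter" idea from the question can be narrowed so that only the export destination is scanned rather than the whole bucket. The sketch below is one possible way, not a confirmed official method: it reads the destination path and job name from describe_forecast_export_job and lists only keys under that prefix. It assumes (based on the file names shown above) that exported files start with the export job name; the helper names are hypothetical, and the clients are passed in so the logic can be tried without AWS access.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def list_export_keys(s3_client, destination_uri, job_name):
    """Return sorted CSV keys under the destination whose names start with job_name."""
    bucket, prefix = split_s3_uri(destination_uri)
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    keys = []
    # The paginator follows continuation tokens, so more than 1000 keys are handled.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix + job_name):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                keys.append(obj["Key"])
    return sorted(keys)


def keys_for_export_job(forecast_client, s3_client, export_job_arn):
    """Look up the export job, then list the CSV files it wrote."""
    job = forecast_client.describe_forecast_export_job(
        ForecastExportJobArn=export_job_arn
    )
    destination = job["Destination"]["S3Config"]["Path"]
    # Assumption: exported file names begin with the export job name.
    return list_export_keys(s3_client, destination, job["ForecastExportJobName"])
```

With real clients this would be called as keys_for_export_job(boto3.client("forecast"), boto3.client("s3"), export_job_arn), and each returned key can then be passed to the S3 client's download_file.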

Related

Update data in csv table which is stored in AWS S3 bucket

I need a solution for adding new data to a CSV file that is stored in an S3 bucket in AWS.
At this point we download the file, edit it, and upload it to S3 again, and we would like to automate this process.
We need to add one row to a three-column table.
Thank you in advance!
I think you will be able to do that using Lambda functions. You will need to make the modifications to the CSV programmatically, and multiple programming languages allow you to do that. One quick option is Python with the csv library.
Then you can invoke that Lambda, or add more logic to the operations you want to do, using AWS API Gateway.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with data you pass as parameters to the function.
There is no way to directly modify the CSV stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this.
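The download, modify, upload cycle described in these answers could look roughly like the following. This is a minimal sketch, assuming a small UTF-8 CSV that fits in memory; the function names are hypothetical, and the S3 client is passed in (for example boto3.client("s3") inside a Lambda handler).

```python
import csv
import io


def append_row(csv_text, row):
    """Return csv_text with one extra row appended at the end."""
    buffer = io.StringIO()
    buffer.write(csv_text)
    # Make sure the new row starts on its own line.
    if csv_text and not csv_text.endswith("\n"):
        buffer.write("\n")
    # csv.writer takes care of quoting values that contain commas or quotes.
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


def append_row_to_s3_csv(s3_client, bucket, key, row):
    """Download the object, append one row, and upload it under the same key."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    updated = append_row(body.decode("utf-8"), row)
    s3_client.put_object(Bucket=bucket, Key=key, Body=updated.encode("utf-8"))
    return updated
```

Note that this read-modify-write is not atomic: two invocations running at the same time can overwrite each other's row, so concurrent writers would need some form of serialization.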

How to migrate data from s3 bucket to glacier?

I have a TB-sized S3 bucket with PDF files. I need to migrate the old files to Glacier. I know that I can create a lifecycle rule to migrate files that are older than a certain number of days. But in my case the bucket currently contains both old and new PDF files, and they were all added at the same time, so they may have the same upload date. In this case a lifecycle rule won't be useful.
In the PDF files there is a field called capture_date, so I need to migrate those files based on the capture_date (i.e. migrate all PDF files where capture_date < 2015-05-21, and so on).
Would a Fargate job be useful here? If so, please give a brief idea.
Please suggest your ideas. Thanks in advance.
S3 by itself will not read your PDF files. Thus you have to read them yourself, extract the data that determines which ones are old and which are new, and use the AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch Operations along with a Lambda function that changes the storage class to Glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (assuming there is a large number of them).
And the most traditional way is to just list your bucket, and iterate over each object.
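The "iterate over each object" approach might be sketched as below. It assumes the capture_date of every PDF has already been extracted into a mapping from key to date (how that extraction happens is outside this sketch), and it changes the storage class by copying each object onto itself, which the single-request copy supports for objects up to 5 GB. The function names are hypothetical.

```python
from datetime import date


def keys_to_archive(capture_dates, cutoff):
    """Return the keys whose capture date is strictly before the cutoff."""
    return sorted(key for key, captured in capture_dates.items() if captured < cutoff)


def move_to_glacier(s3_client, bucket, keys):
    """Change the storage class of each key to GLACIER via an in-place copy."""
    for key in keys:
        s3_client.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="GLACIER",
            MetadataDirective="COPY",  # keep the existing object metadata
        )
    return len(keys)


# Hypothetical usage with a real client:
#   old = keys_to_archive(capture_dates, date(2015, 5, 21))
#   move_to_glacier(boto3.client("s3"), "my-pdf-bucket", old)
```

The same per-object logic could run inside the Lambda function that S3 Batch Operations invokes, or in a loop on an EC2 instance or Fargate task.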

how do I automatically save the s3 bucket names in glue pyspark script?

My problem is that I have tons of JSON files in an S3 bucket (which contains several sub buckets).
I want to explode them and save the result to a new flat file, with one of the columns telling me which sub bucket each record originally came from.
How do I do it in SQL to automatically get that information? Thanks!
I am using Glue PySpark, by the way.
From the comments: you can use the input_file_name() column function.
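A rough sketch of how input_file_name() could be used for this, assuming the JSON files are read as a Spark DataFrame (for example with spark.read.json) and that the "sub bucket" is the first folder under the bucket. The function and column names are hypothetical, and pyspark is imported lazily so the path helper can be tried on its own.

```python
from urllib.parse import urlparse


def sub_bucket_from_path(path):
    """Return the first folder under the bucket, or '' if the file is at the root."""
    parts = urlparse(path).path.strip("/").split("/")
    return parts[0] if len(parts) > 1 else ""


def with_source_columns(df):
    """Add 'source_file' and 'sub_bucket' columns to a Spark DataFrame."""
    # Imported here so the helper above works without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    to_sub_bucket = F.udf(sub_bucket_from_path, StringType())
    return (
        df.withColumn("source_file", F.input_file_name())
          .withColumn("sub_bucket", to_sub_bucket(F.col("source_file")))
    )
```

In Spark SQL the same function is available as input_file_name(), so a query such as SELECT *, input_file_name() AS source_file FROM my_table (my_table being a placeholder) should expose the path as well.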

Splunk migration to S3 DataLake

We're looking at moving away from Splunk as our datastore and looking at AWS Data Lake backed by S3.
What would be the process of migrating data from Splunk to S3? I've read lots of documents about archiving data from Splunk to S3, but I'm not sure whether this archives the data in a usable format or in some archive format that needs to be restored to Splunk itself.
Check out Splunk's SmartStore feature. It moves your non-hot buckets to S3 so you save storage costs. Running SmartStore on AWS only makes sense, however, if you run Splunk on AWS. Otherwise, the data export charges will bankrupt you. Data export applies when Splunk needs to search a bucket that's stored in S3 and so copies that bucket to an indexer. See https://docs.splunk.com/Documentation/Splunk/8.0.0/Indexer/AboutSmartStore for more information.
From what I've read there are a couple of ways to do it:
Export using the Web UI
Export using REST API Endpoint
Export using CLI
Copy certain files in the filesystem
So far I've tried using the CLI to export and I've managed to export around 500,000 events at a time using
splunk search "index=main earliest=11/11/2019:00:00:01 latest=11/15/2019:23:59:59" -output rawdata -maxout 500000 > output2.dmp
However, I'm not sure how I can accurately repeat this step to make sure I include all 100 million+ events, i.e. search from DATE A to DATE B for 500,000 records, then search from DATE B to DATE C for the next 500,000, without missing any events in between.
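One way to repeat the export without gaps is to generate back-to-back time windows and run one search per window. The sketch below only builds the command strings. It relies on two assumptions worth verifying against the Splunk documentation: that earliest is inclusive and latest is exclusive (so adjacent windows neither overlap nor skip events), and that each window holds fewer events than -maxout, otherwise that window's output is truncated. The function names and output file names are hypothetical.

```python
from datetime import datetime, timedelta

# Same time format as the command in the post, e.g. 11/11/2019:00:00:01
SPLUNK_TIME_FORMAT = "%m/%d/%Y:%H:%M:%S"


def time_windows(start, end, step):
    """Yield (window_start, window_end) pairs that tile [start, end) exactly."""
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper  # the next window starts where this one ended


def export_commands(start, end, step, index="main", maxout=500000):
    """Build one 'splunk search' export command per time window."""
    commands = []
    for number, (lower, upper) in enumerate(time_windows(start, end, step)):
        query = (
            f"index={index} "
            f"earliest={lower.strftime(SPLUNK_TIME_FORMAT)} "
            f"latest={upper.strftime(SPLUNK_TIME_FORMAT)}"
        )
        commands.append(
            f'splunk search "{query}" -output rawdata '
            f"-maxout {maxout} > output_{number:05d}.dmp"
        )
    return commands
```

For example, export_commands(datetime(2019, 11, 11), datetime(2019, 11, 16), timedelta(hours=1)) would give 120 hourly commands; if any output file contains exactly maxout events, that window is probably too wide and should be split further.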

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One:
I ran a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows "Zero records returned".
The ELB demo data in Athena works fine, though.
Part Two (scenario):
Suppose I have an Excel file and a data dictionary describing how, and in what format, the data is stored in that file. I want that data loaded into AWS Redshift. What would be the best way to achieve this?
I experienced the same issue. You need to give the crawler the folder path instead of the actual file name and then run it. I tried feeding the folder name to the crawler and it worked. Hope this helps. Let me know. Thanks,
I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.
Delete the crawler and create it again, making sure only one CSV file is present in the S3 folder, then run the crawler.
Important note:
With a single CSV file in the folder, the records are viewable in Athena after the crawler runs.
I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
The structure of the s3 bucket / folder is very important :
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Solution: select the path of the folder, even if the folder contains many files. This will generate one table and the data will be displayed.
In many such cases, using an exclude pattern in the Glue crawler helps me.
It is clear that we should point the crawler to the directory instead of directly to the file; when even that returns no records, an exclude pattern comes to the rescue.
You will have to devise a pattern so that only the files you want are crawled and the rest are excluded. (I suggest this instead of creating a separate directory for each file, since making such changes in a production bucket is usually not feasible.)
I had data in an S3 bucket with multiple directories, and inside each directory there was a Snappy Parquet file and a JSON file. The JSON file was causing the issue.
So I ran the crawler on the master directory containing those directories, and in the exclude pattern I gave: * / *.json
This time it did not create any table for the JSON file, and I was able to see the records of the table using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
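For crawlers created from code rather than the console, the folder path and exclude patterns go into the S3 target of the crawler definition. The sketch below only builds the keyword arguments for the Glue create_crawler call. The function name is hypothetical, and the pattern in the usage line is the one from the answer above written without spaces, which is an assumption about what was meant.

```python
def crawler_definition(name, role_arn, database, folder_uri, exclusions):
    """Build keyword arguments for the Glue create_crawler API call."""
    # Point the crawler at the folder, not at an individual file.
    if not folder_uri.endswith("/"):
        folder_uri += "/"
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {"Path": folder_uri, "Exclusions": list(exclusions)}
            ]
        },
    }


# Hypothetical usage with a real client:
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_definition(
#       "my-crawler", role_arn, "my_database",
#       "s3://my-s3-bucket/data", ["*/*.json"]))
```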
Pointing the Glue crawler to the S3 folder and not the actual file did the trick.
Here's what worked for me: I needed to move all of my CSVs into their own folders, just pointing Glue Crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.
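Moving each CSV into its own folder, as in the last answer, can be scripted. This is a minimal sketch under the assumption that each file should end up in a folder named after the file without its extension; S3 has no rename operation, so each move is a copy followed by a delete. The function names are hypothetical.

```python
import posixpath


def per_table_key(key):
    """Map 'csv/allergies.csv' to 'csv/allergies/allergies.csv'."""
    folder, filename = posixpath.split(key)
    table = posixpath.splitext(filename)[0]
    return posixpath.join(folder, table, filename)


def move_into_table_folders(s3_client, bucket, keys):
    """Copy each object to its per-table key, then delete the original."""
    moved = {}
    for key in keys:
        new_key = per_table_key(key)
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved[key] = new_key
    return moved
```

After the move, the crawler can be pointed at the parent folder (csv/ in the answer above) and should create one table per sub-folder.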