How do I automatically save the S3 bucket names in a Glue PySpark script?

I have a large number of JSON files in an S3 bucket, which contains several sub-buckets.
I want to explode them and save the result to a new flat file, with one column telling me which sub-bucket each record originally came from.
How do I get that information automatically in SQL? Thanks!
I am using Glue PySpark, by the way.

From the comments: you can use the input_file_name() function, which returns, as a column, the full path of the file each record was read from.
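A minimal sketch of that suggestion, assuming the files live under a layout such as s3://my-bucket/<sub_bucket>/<file>.json. The bucket name, the column names source_file and sub_bucket, and both helper functions are hypothetical and only for illustration.

```python
import re

# Assumed layout: s3://my-bucket/<sub_bucket>/.../<file>.json
# The first path segment after the bucket name is treated as the "sub-bucket".
SUBFOLDER_PATTERN = r"^s3[an]?://[^/]+/([^/]+)/"


def subfolder_of(path: str) -> str:
    """Pure-Python twin of the regexp_extract call below, for checking the pattern."""
    match = re.match(SUBFOLDER_PATTERN, path)
    return match.group(1) if match else ""


def read_with_source(spark, input_path: str):
    """Read JSON files and tag every record with its source file and sub-folder."""
    # Imported here so subfolder_of can be tried without a Spark installation.
    from pyspark.sql.functions import input_file_name, regexp_extract

    df = spark.read.json(input_path)
    # input_file_name() yields the full path of the file behind each row.
    df = df.withColumn("source_file", input_file_name())
    # Group 1 of the pattern is the first folder under the bucket.
    return df.withColumn(
        "sub_bucket", regexp_extract("source_file", SUBFOLDER_PATTERN, 1)
    )
```

In a Glue job this could be called as read_with_source(spark, "s3://my-bucket/") before exploding and writing the flat file. The same function is also available in Spark SQL, for example SELECT *, input_file_name() AS source_file FROM some_view.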

Related

Update data in csv table which is stored in AWS S3 bucket

I need a solution for entering new data in a CSV file that is stored in an S3 bucket in AWS.
At this point we are downloading the file, editing it, and then uploading it again to S3, and we would like to automate this process.
We need to add one row to a three-column file.
Thank you in advance!

I think you will be able to do that using Lambda functions. You will need to make the modifications to the CSV programmatically, and there are multiple programming languages that allow you to do that. One quick example is using Python and the csv library.
Then you can invoke that Lambda, or add more logic to the operations you want to perform, through Amazon API Gateway.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with data you pass as parameters to the function.

There is no way to directly modify the CSV stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this.
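The download, modify, upload cycle described in these answers could look roughly like the sketch below. The event fields bucket, key and row are hypothetical names chosen for this example, and the handler assumes the file is small enough to hold in memory.

```python
import csv
import io


def append_row(csv_text: str, row: list) -> str:
    """Return csv_text with one extra row appended."""
    buffer = io.StringIO()
    buffer.write(csv_text)
    # Make sure the existing content ends with a newline before appending.
    if csv_text and not csv_text.endswith("\n"):
        buffer.write("\n")
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


def lambda_handler(event, context):
    """Download the CSV, append event['row'], and upload it again."""
    import boto3  # provided by the AWS Lambda Python runtime

    s3 = boto3.client("s3")
    bucket, key = event["bucket"], event["key"]  # hypothetical event fields
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    updated = append_row(body, event["row"])
    s3.put_object(Bucket=bucket, Key=key, Body=updated.encode("utf-8"))
    return {"status": "ok"}
```

Note that this read-then-write sequence is not atomic, so two invocations running at the same time could overwrite each other's row.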

Create Athena table using s3 source data

Below is the S3 path where I store the files obtained at the end of a process. The path is dynamic, that is, the values of the following fields will vary: partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under the output/ directory as well as under the intermediate_results/ directory, for each val1-val2 pair.
Each file is a CSV.
I am not very familiar with AWS Athena, so I'm unable to figure out how to implement this. I would really appreciate any kind of help. Thanks!

Use CREATE TABLE in Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a LOCATION of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage layout is not compatible with your desired use of Amazon Athena. You would need to put the data into separate paths for each table.
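As a rough illustration of this answer, the sketch below generates one CREATE EXTERNAL TABLE statement per path. It assumes the final output files have been moved into their own folder (a hypothetical output/final/) so that they no longer share a path with intermediate_results/. The table names, column names and types are made up for the example.

```python
def create_table_sql(table: str, columns: dict, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for comma-separated files."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    # Athena expects LOCATION to be a folder, so end it with a slash.
    location = location.rstrip("/") + "/"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical values for one {val1}/{val2} pair and a made-up schema.
base = "s3://bucket/partner_a/data/customer_b/output"
columns = {"id": "string", "score": "double"}
statements = [
    create_table_sql("partner_a_customer_b_final", columns, f"{base}/final"),
    create_table_sql(
        "partner_a_customer_b_intermediate", columns, f"{base}/intermediate_results"
    ),
]
```

Each statement can then be pasted into the Athena query editor or submitted programmatically. The skip.header.line.count property assumes every CSV starts with a header row.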

List all forecast CSV files exported to AWS S3 bucket when using AWS Forecast Export Job

I have trained a Predictor on AWS Forecast, and used it to make some forecasts.
I want to get these forecasts as CSV files. To do so, I created a "ForecastExportJob".
After the export is done, I can successfully see the CSV files in my S3 bucket.
I would like to download them programmatically, so is there a way to get a list of the S3 keys that correspond to the CSV files created by the "ForecastExportJob" command?
I could list all objects in the destination bucket and filter them, but I am wondering if there is a "more elegant" solution to my problem.
Put simply, I would like to know if there is an AWS command that can list the files created by the "ForecastExportJob" command:
electricityforecast_export_job_2021-01-04T06-40-23Z_part0.csv
...
electricityforecast_export_job_2021-01-04T06-40-23Z_part7.csv
Note: I am using boto3
Thank you in advance and happy new year!
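The list-and-filter approach mentioned in the question can be sketched with boto3 as below. It assumes, based only on the file names shown above, that each exported file name starts with the export job name followed by an underscore. Both function names are hypothetical.

```python
def export_job_csv_keys(keys, export_job_name: str):
    """Keep keys whose file name starts with the job name and ends in .csv."""
    selected = []
    for key in keys:
        file_name = key.rsplit("/", 1)[-1]
        if file_name.startswith(export_job_name + "_") and file_name.endswith(".csv"):
            selected.append(key)
    return selected


def list_keys(bucket: str, prefix: str = ""):
    """Yield every object key under prefix, following pagination."""
    import boto3  # imported lazily so the filter above works without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

For example, export_job_csv_keys(list_keys("my-bucket", "exports/"), "electricityforecast_export_job") would return only the part files of that job, where the bucket and prefix are placeholders. Passing a narrow Prefix keeps the listing small.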

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One:
I ran a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows "Zero records returned".
The ELB demo data in Athena works fine, though.
Part Two (scenario):
Suppose I have an Excel file and a data dictionary describing how, and in what format, the data is stored in that file. I want that data to be loaded into AWS Redshift. What would be the best way to achieve this?

I experienced the same issue. You need to give the crawler the folder path instead of the actual file name and then run it. I tried feeding the folder name to the crawler and it worked. Hope this helps. Let me know. Thanks.

I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.

Delete the crawler and create it again, making sure that only one CSV file (and no more) is available in S3, then run the crawler.
Important note: with a single CSV file in place, we can view the records in Athena.

I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.

The structure of the S3 bucket/folder is very important:
s3://<bucketname>/<data-folder>/
    /<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
    /<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
    ...
    /<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>

Solution: select the path of the folder, even if the folder contains many files. This will generate one table, and the data will be displayed.

In many such cases, using an EXCLUDE PATTERN in the Glue crawler helps me.
It is certain that we should point the crawler to the directory instead of directly to the file. When even that returns no records, the exclude pattern comes to the rescue.
You will have to devise a pattern by which only the files you want are crawled and the rest are excluded. (I suggest doing this instead of creating a different directory for each file, because in a production bucket such changes are usually not feasible.)
I had data in an S3 bucket. There were multiple directories, and inside each directory there was a Snappy Parquet file and a JSON file. The JSON file was causing the issue.
So I ran the crawler on the master directory that contained the many directories, and in the EXCLUDE PATTERN I gave: * / *.json
This time it did not create any table for the JSON files, and I was able to see the records of the table using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
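The exclude-pattern setup described above can also be expressed through boto3 when creating the crawler. The sketch below is only an illustration: the crawler name, role ARN, database and include path are placeholders, and the glob `**.json` is an assumption about a pattern that excludes JSON objects at any depth, not necessarily the exact pattern used in the answer.

```python
def crawler_config(name, role_arn, database, include_path, exclusions):
    """Build keyword arguments for glue.create_crawler with one S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            # Point at the folder, not at a single file, and list the globs to skip.
            "S3Targets": [{"Path": include_path, "Exclusions": list(exclusions)}]
        },
    }


def create_crawler(config: dict) -> None:
    """Create the crawler in AWS Glue (requires credentials and boto3)."""
    import boto3

    boto3.client("glue").create_crawler(**config)


# Placeholder values for illustration only.
config = crawler_config(
    "parquet-only-crawler",
    "arn:aws:iam::123456789012:role/MyGlueRole",
    "my_database",
    "s3://my-bucket/data/",
    ["**.json"],
)
```

Calling create_crawler(config) followed by a crawler run should then catalog only the files that are not excluded.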

Pointing the Glue crawler to the S3 folder and not the actual file did the trick.

Here's what worked for me: I needed to move each of my CSVs into its own folder. Just pointing the Glue crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then I just pointed the AWS Glue crawler to csv/ and everything was parsed correctly.
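Moving each CSV into a folder of its own, as in the example above, can be scripted. This is a sketch under the assumption that the folder should be named after the file without its extension. The helper names are hypothetical, and S3 has no rename operation, so the move is a copy followed by a delete.

```python
import posixpath


def foldered_key(key: str) -> str:
    """Map 'csv/allergies.csv' to 'csv/allergies/allergies.csv'."""
    parent, file_name = posixpath.split(key)
    stem = posixpath.splitext(file_name)[0]
    return posixpath.join(parent, stem, file_name)


def move_into_folders(bucket: str, keys) -> None:
    """Copy each object to its per-file folder, then delete the original."""
    import boto3  # imported lazily so foldered_key works without boto3

    s3 = boto3.client("s3")
    for key in keys:
        new_key = foldered_key(key)
        s3.copy_object(
            Bucket=bucket, Key=new_key, CopySource={"Bucket": bucket, "Key": key}
        )
        s3.delete_object(Bucket=bucket, Key=key)
```

After the move, the crawler can be pointed at the parent folder and should create one table per sub-folder.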

Retaining source file name while importing data from s3 to Redshift

I have a large number of files in an S3 bucket and regularly import them into Redshift. Since the number of files is large, I need a column in the Redshift table that contains the source file name from the S3 location.
Is there any way to achieve this?

I agree with Ketan that this is currently not possible in Redshift. If this is what you want to achieve, it is possible through either of the following:
Read the S3 files programmatically, write new S3 files with the file name as a column, and load the new files.
Alternatively, use Hive. Create an external table on the S3 location and use INPUT__FILE__NAME to get the file names, create a new table, and then write it back to S3. You can also do some pre-processing in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
Hope this helps.

That isn't possible. During a COPY operation, Redshift only loads file contents into a table; it doesn't provide access to S3 file names.
To achieve what you want, you need to preprocess the data to add the additional information inside the files.
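The preprocessing step suggested in these answers might look like the sketch below, which appends the source file name as an extra column to CSV text before it is loaded. The column name source_file and the function name are hypothetical, and the function assumes each file fits in memory.

```python
import csv
import io


def add_source_column(csv_text: str, source_name: str, has_header: bool = True) -> str:
    """Return csv_text with an extra trailing column holding source_name."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(rows):
        if not row:
            continue  # skip blank lines
        # The header row gets the column name, data rows get the file name.
        extra = "source_file" if has_header and index == 0 else source_name
        writer.writerow(row + [extra])
    return out.getvalue()
```

Each rewritten file would then be uploaded to a new S3 location and loaded from there with the usual COPY command.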
