I was reading through the AWS Glue documentation and playing around with the service but I am not able to figure out these 2 things:
Can I update all columns to the same value using a dynamic frame (similar to an UPDATE ALL SQL query), or is there a way to do it through Spark SQL?
I have a Crawler that has created a Data Catalog from my source database in a new database. Is there a way in AWS Glue to flag whether the source database schema has changed between the time I created the Data Catalog and the time the scheduled job runs?
Apologies if my questions seem silly. Thanks a ton for the help :)
Not sure how to do this directly using dynamic frames or Spark SQL, but it can be done using DataFrames.
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
# Load dynamic_frame with data.
dynamic_frame = ...
# Convert to Spark DataFrame
df = dynamic_frame.toDF()
# Loop over columns and set records to a constant value, e.g. 999
for column in df.columns:
    df = df.withColumn(column, lit(999))
# Convert back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(df, glueContext, "dynamic_frame")
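As a side note, calling withColumn in a loop adds one projection per column, which can get slow for very wide DataFrames; the same result can be produced in a single select, e.g. df = df.select([lit(999).alias(c) for c in df.columns]).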
I'm not aware of an option to flag a changed database schema. But you could build something like this yourself: write a simple script that reads a flag you set somewhere and have this script trigger the crawler. This can be done with boto3 in Python or with the AWS CLI, for example.
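For illustration, a minimal boto3 sketch (the crawler, database, and table names below are placeholders) that re-runs the crawler and then reads back the columns it recorded, so you can compare them against a snapshot of the schema you saved earlier:
import boto3

glue = boto3.client("glue")

# Re-run the crawler against the source (asynchronous; poll get_crawler if you need to wait).
glue.start_crawler(Name="my-source-crawler")

# Once the crawler has finished, read the current schema from the Data Catalog...
table = glue.get_table(DatabaseName="my_catalog_db", Name="my_table")
current_columns = [(c["Name"], c["Type"]) for c in table["Table"]["StorageDescriptor"]["Columns"]]

# ...and compare it to a snapshot stored earlier (e.g. in S3 or DynamoDB)
# to decide whether the source schema has changed.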
If I have the following code:
import awswrangler as wr
# df = some dataframe with year, date and other columns
wr.s3.to_parquet(
    df=df,
    path=f's3://some/path/',
    index=False,
    dataset=True,
    mode="append",
    partition_cols=['year', 'date'],
    database=f'series_data',
    table=f'signal_data'
)
What exactly is happening when database and table are specified? I know that the table will be created (if it is not), but are Glue Crawlers run or something?
Should I use database and table only the first time I run this piece of code, or can I leave it like that (will it run any Crawlers or processes that may cause additional AWS charges)?
For example, if a new partition appears (a new date), how will the table understand the new partition? Usually, this is done when a Glue crawler is run to find a new partition.
Is it possible to get the last updated timestamp of an Athena table, stored as CSV files in an S3 location, using a Spark SQL query?
If yes, can someone please provide more information on it.
There are multiple ways to do this.
Use the Athena JDBC driver and do a Spark read where the format is jdbc. In this read you provide your "select max(timestamp) from table" query. Then, as the next step, just save the result to S3 from the Spark DataFrame.
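A rough sketch of that route, assuming the Simba Athena JDBC driver is on the Spark classpath (the driver class, URL format, and option names depend on the driver version, and the table/column names below are placeholders):
# Read the aggregation result from Athena over JDBC.
df = (spark.read.format("jdbc")
      .option("driver", "com.simba.athena.jdbc.Driver")
      .option("url", "jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://my-athena-results/")
      .option("query", "SELECT max(last_updated_ts) AS last_updated FROM my_db.my_table")
      .load())

# Save the single-row result to S3.
df.write.mode("overwrite").csv("s3://my-bucket/last-updated/")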
You can skip the JDBC read altogether and just use boto3 to run the above query. It would be a combination of start_query_execution and get_query_results. You can then save this to S3 as well.
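A sketch of the boto3 route (database, table, and output location are placeholders); start_query_execution is asynchronous, so poll before fetching results:
import boto3
import time

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT max(last_updated_ts) FROM my_db.my_table",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Wait for the query to finish.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
# Athena also writes the result as a CSV to the OutputLocation above,
# so often you do not need to save it to S3 yourself.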
I want to load data from S3 to Redshift. The data is arriving in S3 at around 5 MB (approximate size) per second.
I need to automate the loading of data from S3 to Redshift.
The data is dumped to S3 by a Kafka stream consumer application.
The S3 data is organized in a folder structure.
Example folder:
bucketName/abc-event/2020/9/15/10
Files in this folder:
abc-event-2020-9-15-10-00-01-abxwdhf (5 MB)
abc-event-2020-9-15-10-00-02-aasdljc (5 MB)
abc-event-2020-9-15-10-00-03-thntsfv (5 MB)
The files in S3 contain JSON objects separated by newlines.
This data needs to be loaded into the abc-event table in Redshift.
I know a few options, such as AWS Data Pipeline, AWS Glue, and the AWS Lambda Redshift loader (https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/).
What would be the best way to do it?
I'd really appreciate it if someone could guide me.
Thank you.
=============================================
Thanks Prabhakar for the answer. I need some help continuing on this.
I created a table in the Data Catalog with a crawler, and then an ETL job in Glue does the job of loading the data from S3 to Redshift.
I am using approach 1 (predicate pushdown).
New files get loaded into S3 in a different partition (say, a new hour started).
I am adding the new partition using an AWS Glue Python script job.
The new partition is added to the table through the Athena API (using ALTER TABLE ADD PARTITION).
I have checked in the console that the new partition gets added to the Data Catalog table by the Python script job.
However, when I run the same job with a pushdown predicate pointing to the same partition added by the Python script Glue job, the job does not load the new files from S3 in this new partition to Redshift.
I can't figure out what I am doing wrong.
In your use case you can leverage AWS Glue to load the data periodically into Redshift. You can schedule your Glue job with a trigger to run every 60 minutes, which at roughly 5 MB/s works out to around 18 GB per run.
This interval can be changed according to your needs, depending on how much data you want to process each run.
There are a couple of approaches you can follow to read this data:
Predicate pushdown:
This will only load the partitions mentioned in the job. You can calculate the partition values on the fly every run and pass them to the filter. For this you need to run the Glue crawler on each run so that the table partitions are updated in the table metadata.
If you don't want to use a crawler, you can either use boto3 create_partition or an Athena ADD PARTITION statement, which is a free operation.
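For example, a minimal sketch of the Athena route (the database, table, partition columns, and bucket names are placeholders, assuming a year/month/day/hour layout); the boto3 glue create_partition call also works but needs a StorageDescriptor, so the DDL statement is usually simpler:
import boto3
from datetime import datetime, timedelta

athena = boto3.client("athena")

# Register the partition for the previous hour.
dt = datetime.utcnow() - timedelta(hours=1)

ddl = (
    "ALTER TABLE abc_event ADD IF NOT EXISTS "
    f"PARTITION (year='{dt.year}', month='{dt.month}', day='{dt.day}', hour='{dt.hour}') "
    f"LOCATION 's3://bucketName/abc-event/{dt.year}/{dt.month}/{dt.day}/{dt.hour}/'"
)

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_glue_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)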
Job bookmark:
This will load only the latest S3 data accumulated since your Glue job completed its previous run. This approach might not be effective if no data is generated in S3 in some runs.
Once you have determined the data to be read, you can simply write it to the Redshift table every run.
In your case you have files present in subdirectories, for which you need to enable recurse as shown in the statement below.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = <name>,
    table_name = <name>,
    push_down_predicate = "(year=='<2019>' and month=='<06>')",
    transformation_ctx = "datasource0",
    additional_options = {"recurse": True}
)
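For the write step, a minimal sketch assuming a Glue connection to your cluster already exists (the connection name, database, table, and temp directory are placeholders):
redshift_sink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = datasource0,
    catalog_connection = "my-redshift-connection",
    connection_options = {"dbtable": "abc_event", "database": "dev"},
    redshift_tmp_dir = "s3://my-temp-bucket/glue-temp/",
    transformation_ctx = "redshift_sink"
)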
I have an S3 bucket named Employee. Every three hours I will be getting a file in the bucket with a timestamp attached to it. I will be using a Glue job to move the file from S3 to Redshift with some transformations. My input file in the S3 bucket has a fixed structure. My Glue job will use the table created in the Data Catalog via a crawler as its input.
First run:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "employee_623215", transformation_ctx = "datasource0")
After three hours, if I get one more file for employee, should I crawl it again?
Is there a way to have a single table in the Data Catalog, like employee, and update that table with the latest S3 file so the Glue job can use it for processing? Or should I run the crawler every time to get the latest data? The issue with that is that more and more tables will be created in my Data Catalog.
Please let me know if this is possible.
You only need to run the AWS Glue Crawler again if the schema changes. As long as the schema remains unchanged, you can just add files to Amazon S3 without having to re-run the Crawler.
Update: @Eman's comment below is correct.
If you are reading from the catalog, this suggestion will not work. Partitions will not be added to the catalog table if you do not recrawl. Running the crawler maps those new partitions to the table and allows you to process the next day's partitions.
An alternative approach: instead of reading from the catalog, read directly from S3 and process the data in the Glue job.
This way you do not need to run the crawler again.
Use
from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")
Documented here
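For example, a rough sketch for newline-delimited JSON files under an S3 prefix (the path is a placeholder):
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://my-bucket/employee/"], "recurse": True},
    format = "json",
    transformation_ctx = "datasource0"
)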
There's an Excel file testFile.xlsx that looks like this:
ID  ENTITY                                      STATE
1   Montgomery County Muni Utility Dist No.39   TX
2   State of Washington                         WA
3   Waterloo CUSD 5                             IL
4   Staunton CUSD 6                             IL
5   Berea City SD                               OH
6   City of Coshocton                           OH
Now I want to import the data into the AWS Glue database. A crawler in AWS Glue has been created, but there is nothing in the table in the AWS Glue database after running the crawler. I guess it is an issue with the classifier in AWS Glue, but I have no idea how to create a proper classifier to successfully import the data from the Excel file into the AWS Glue database. Thanks for any answers or advice.
I'm afraid Glue Crawlers have no classifier for MS Excel files (.xlsx or .xls). Here you can find the list of supported formats and built-in classifiers. It would probably be better to convert the files to CSV or some other supported format before exporting to the AWS Glue Catalog.
Glue crawlers don't support MS Excel files.
If you want to create a table for the Excel file, you have to convert it first from Excel to CSV/JSON/Parquet and then run a crawler on the newly created file.
You can convert it easily using pandas.
Create a normal Python job and read the Excel file.
import pandas as pd
df = pd.read_excel('yourFile.xlsx', 'SheetName', dtype=str, index_col=None)
df.to_csv('yourFile.csv', encoding='utf-8', index=False)
This will convert your file to CSV. Then run the crawler over this file and your table will be loaded.
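Note that pandas needs an Excel engine such as openpyxl installed to read .xlsx files; in a Glue Python shell job you may need to bring it in yourself (for example via the --additional-python-modules job parameter).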
Hope it helps.
When you say that "there's nothing in the table in AWS Glue database after running the crawler" are you saying that in the Glue UI, you are clicking on Databases, then the database name, then on "Tables in xxx", and nothing is showing up?
The second part of your question seems to indicate that you are looking for Glue to import the actual data rows of your file into the Glue database. Is that correct? The Glue database does not store data rows, just the schema information about the files. You will need to use a Glue ETL job, Athena, or Hive to actually move the data from the data file into something like MySQL.
You should write a script (most likely a Python shell job in Glue) to convert the Excel file to CSV and then run a crawler over it.