AWS GLUE Data Import Issue - amazon-web-services

There's an Excel file, testFile.xlsx, which looks like this:
ID  ENTITY                                      STATE
1   Montgomery County Muni Utility Dist No.39   TX
2   State of Washington                         WA
3   Waterloo CUSD 5                             IL
4   Staunton CUSD 6                             IL
5   Berea City SD                               OH
6   City of Coshocton                           OH
Now I want to import the data into the AWS Glue database. A crawler has been created in AWS Glue, but after running it there is nothing in the table in the AWS Glue database. I guess the issue is the classifier in AWS Glue, but I have no idea how to create a proper classifier that will successfully import the data from the Excel file into the AWS Glue database. Thanks for any answers or advice.

I'm afraid Glue crawlers have no classifier for MS Excel files (.xlsx or .xls). The Glue documentation lists the supported formats and built-in classifiers. It would probably be better to convert the files to CSV or some other supported format before bringing them into the AWS Glue Data Catalog.

Glue crawlers don't support MS Excel files.
If you want to create a table for the Excel file, you first have to convert it from Excel to CSV/JSON/Parquet and then run the crawler on the newly created file.
You can convert it easily using pandas.
Create a normal Python job and read the Excel file:
import pandas as pd
# Read the sheet, keeping every column as a string.
df = pd.read_excel('yourFile.xlsx', 'SheetName', dtype=str, index_col=None)
# Write a CSV that the Glue crawler can classify.
df.to_csv('yourFile.csv', encoding='utf-8', index=False)
This will convert your file to CSV. Then run the crawler over this file and your table will be loaded.
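If you want to automate that last step as well, here is a minimal sketch, assuming boto3 is available; the bucket, prefix, and crawler name below are placeholders, not values from the question:

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

# Upload the converted CSV to the S3 prefix the crawler points at
# ('my-data-bucket' and 'excel-import/' are placeholders).
s3.upload_file('yourFile.csv', 'my-data-bucket', 'excel-import/yourFile.csv')

# Start the crawler (placeholder name) so the table gets created/updated.
glue.start_crawler(Name='excel-import-crawler')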
Hope it helps.

When you say that "there's nothing in the table in AWS Glue database after running the crawler", are you saying that in the Glue UI you click on Databases, then the database name, then on "Tables in xxx", and nothing shows up?
The second part of your question seems to indicate that you expect Glue to import the actual data rows of your file into the Glue database. Is that correct? The Glue database does not store data rows, just the schema information about the files. You will need a Glue ETL job, or Athena, or Hive, to actually move the data from the data file into something like MySQL.
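As a rough, hypothetical sketch of that last step (the catalog database, table, and Glue connection names below are all placeholders), a Glue ETL job that moves the crawled rows into MySQL could look roughly like this:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the rows described by the crawler-created catalog table.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db",        # placeholder catalog database
    table_name="testfile_csv"     # placeholder catalog table
)

# Write the rows into MySQL through a pre-defined Glue connection (placeholder names).
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-mysql-connection",
    connection_options={"dbtable": "entities", "database": "reporting"}
)

job.commit()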

You should write a script (most likely a Python shell job in Glue) to convert the Excel file to CSV and then run the crawler over it.

Related

AWS Glue reading glue catalog table VS reading files from s3

I am writing an AWS Glue ETL job and I have two options for constructing the Spark dataframe:
Use the AWS Glue Data Catalog as the metastore for Spark SQL:
df = spark.sql("select name from bronze_db.table_tbl")
df.write.save("s3://silver/...")
The other option is to read directly from the S3 location, like this:
df = spark.read.format("parquet").load(["s3://bronze/table_tbl/1.parquet", "s3://bronze/table_tbl/2.parquet"])
df.write.save("s3://silver/...")
Should I consider reading the files directly to save cost, or is there any limit on the number of queries (select name from bronze_db.table_tbl), or does one option give better read performance?
I am not sure whether this query would be run on Athena to return the results.
If you only have one file and you know the schema, there is no need for a table. A table is useful when there are multiple files, when you don't know the schema (e.g. the table was set up and is populated by another process), or when you are querying the data from multiple engines (Athena, EMR, Redshift Spectrum, etc.).
Think of tables as an interoperability thing: interoperability with other processes, other engines, and so on.

AWS Glue Dynamic Frame Update Column and Crawler Schema Match

I was reading through the AWS Glue documentation and playing around with the service but I am not able to figure out these 2 things:
Can I update all columns to the same value using a dynamic frame (similar to an UPDATE ALL SQL query), or is there a way to do it through Spark SQL?
I have a Crawler that has created a data catalog from my source database in a new database. Is there a way for me to flag on AWS Glue if the source database schema has changed from the time I created the data catalog to when I schedule the job run?
Apologies if my questions seem silly. Thanks a ton for the help :)
I'm not sure how to do this directly using dynamic frames or Spark SQL, but it can be done using DataFrames.
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame

# Load dynamic_frame with data.
dynamic_frame = ...

# Convert to a Spark DataFrame.
df = dynamic_frame.toDF()

# Loop over the columns and set every record to a constant value, e.g. 999.
for column in df.columns:
    df = df.withColumn(column, lit(999))

# Convert back to a DynamicFrame.
dynamic_frame = DynamicFrame.fromDF(df, glueContext, "dynamic_frame")
I'm not aware of an option to flag a changed source schema, but you could build something like this yourself: write a simple script that reads a flag you set somewhere and have that script trigger the crawler. This can be done with boto3 in Python or with the AWS CLI, for example.
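For example, a minimal sketch of that idea in boto3, assuming the flag is simply an object written to S3; the bucket, prefix, and crawler name are placeholders:

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def schema_changed():
    # Treat the presence of a (hypothetical) flag object as "schema changed".
    resp = s3.list_objects_v2(Bucket='my-flag-bucket', Prefix='flags/schema_changed')
    return resp.get('KeyCount', 0) > 0

if schema_changed():
    # Re-run the crawler so the Data Catalog picks up the new source schema.
    glue.start_crawler(Name='source-db-crawler')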

AWS CSV data pipelining

I am new to AWS and want to do some data pipelining in AWS.
I have a bunch of CSV files stored in S3.
Things I want to achieve:
Union all the CSV files and add the filename to each line; the first line needs to be removed from each file before unioning the CSVs;
Split the filename column by the _ delimiter;
Store all of this in a DB after processing.
What is the best/fastest way to achieve this?
Thanks
You can create a Glue job using PySpark which reads the CSV files into a DataFrame, and then you can transform it however you like.
After that you can write the DataFrame out as Parquet and save it in S3.
Then you can run a Glue crawler, which will map the Parquet data to a table you can query.
Basically you are doing ETL using AWS Glue.
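For illustration, a minimal PySpark sketch of that job; the S3 paths are placeholders, and it assumes header=True is enough to drop the first line of each file:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, split

spark = SparkSession.builder.getOrCreate()

# Read every CSV under the raw prefix; header=True treats the first line of
# each file as column names, which removes it from the data and unions the files.
df = spark.read.csv("s3://my-bucket/raw-csv/", header=True)

# Record which file each row came from, then split that path/name on "_".
df = df.withColumn("source_file", input_file_name())
df = df.withColumn("file_parts", split(df["source_file"], "_"))

# Write the combined, enriched data as Parquet for the crawler to map.
df.write.mode("overwrite").parquet("s3://my-bucket/processed-parquet/")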

Vertica HDFS as external table

What is the best practice for working with Vertica and Parquet?
My application architecture is:
Kafka Topic (Avro Data).
Vertica DB.
Vertica's scheduler consumes the data from Kafka and ingests it into a managed table in Vertica.
Let's say I have Vertica storage for only one month of data.
As far as I understand, I can create an external table on HDFS using Parquet, and Vertica's API enables me to query these tables as well.
What is the best practice for this scenario? Can I add some Vertica scheduler for copying the data from managed tables to external tables (as Parquet)?
How do I configure rolling data in Vertica (dropping data older than 30 days, every day)?
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data in Vertica to Parquet format. You can export the results of a query, so you can select only the 30-day-old data. And despite that section being in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. It just has to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
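As a rough sketch of such a nightly script, assuming your Vertica version supports EXPORT TO PARQUET and that the schema, table, column, and export directory below are placeholders, you could generate the statement in Python and run it through vsql:

import subprocess
from datetime import date, timedelta

# Placeholder schema/table/column and export location -- adapt to your setup.
cutoff = (date.today() - timedelta(days=30)).isoformat()
export_sql = (
    f"EXPORT TO PARQUET (directory = 'hdfs:///archive/events/{cutoff}') "
    f"AS SELECT * FROM public.events WHERE event_date < '{cutoff}';"
)

# Run the export through vsql; schedule this script with cron for nightly runs.
subprocess.run(["vsql", "-c", export_sql], check=True)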

AWS Glue Crawler Overwrite Data vs. Append

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible in the way you are asking. The crawler does not alter data.
The crawler only populates the AWS Glue Data Catalog with tables.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to do data cleaning using Athena/Glue before using the data, you need to follow these steps:
1. Map the data with a crawler into a temporary Athena database/table.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to:
   - do data transformation/cleaning/renaming/deduping using PySpark or Scala;
   - export the data into a new S3 location (.csv / .parquet etc.), potentially partitioned.
4. Run one more crawler to map the cleaned data from the new S3 location into an Athena database.
The deduping you are asking about happens in step 3.
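For what the dedupe in step 3 could look like, here is a minimal PySpark sketch; the S3 paths and the business-key column 'id' are placeholders, and it assumes each daily drop should simply replace the previous cleaned snapshot:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw daily CSV drop that the first crawler mapped (placeholder path).
raw = spark.read.csv("s3://my-bucket/vendor-raw/", header=True)

# Deduplicate on the business key ('id' is a placeholder column name).
clean = raw.dropDuplicates(["id"])

# Overwrite the cleaned location so only the latest data remains; the second
# crawler then maps this location into Athena.
clean.write.mode("overwrite").parquet("s3://my-bucket/vendor-clean/")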