AWS CSV data pipelining - amazon-web-services

I am new to AWS and want to do some data pipelining in AWS.
I have a bunch of CSV files stored in S3.
Things I want to achieve:
I want to union all the CSV files and add the filename to each
line; the first line (header) needs to be removed from each file before
unioning the CSVs;
Split the filename column by the _ delimiter;
Store this all in a DB after processing.
What is the best/fastest way to achieve this?
Thanks

You can create a Glue job using PySpark which will read the CSV files into a DataFrame, and then you can transform it however you like.
After that you can write that DataFrame out as Parquet and save it in S3.
Then you can run a Glue crawler, which will register the Parquet data as a table that you can query.
Basically you are doing ETL using AWS Glue.
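For reference, a minimal PySpark sketch of that kind of Glue job, covering the union / filename / split steps from the question. The bucket paths, the number of filename parts, and the assumption that every file shares the same header are all placeholders/assumptions, not anything prescribed by Glue itself:

# Minimal sketch: union CSVs, attach the source filename, split it on "_",
# and write Parquet. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, split

spark = SparkSession.builder.appName("csv-union").getOrCreate()

# Reading the whole prefix unions all CSV files; header=True drops the first
# line of every file (assumes all files share the same header).
df = spark.read.csv("s3://your-bucket/input/", header=True)

# Add the source file name (basename only), then split it on "_".
df = df.withColumn("filename", regexp_extract(input_file_name(), r"([^/]+)$", 1))
parts = split(df["filename"], "_")
df = (df.withColumn("part_1", parts.getItem(0))   # adjust to however many
        .withColumn("part_2", parts.getItem(1)))  # parts your filenames have

# Write the result as Parquet so a crawler/Athena can pick it up afterwards.
df.write.mode("overwrite").parquet("s3://your-bucket/output/")

From there you can crawl the Parquet output as described above, or write the DataFrame straight to a database over JDBC instead, depending on where the data should finally land.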

Related

Spark SQL query to get the last updated timestamp of an Athena table stored as CSV in AWS S3

Is it possible to get the last updated timestamp of an Athena table stored as a CSV file in an S3 location using a Spark SQL query?
If yes, can someone please provide more information on it.
There are multiple ways to do this.
Use the Athena JDBC driver and do a Spark read where the format is jdbc. In this read you provide your "select max(timestamp) from table" query. Then, as the next step, just save from the Spark DataFrame to S3.
You can skip the JDBC read altogether and just use boto3 to run the above query. It would be a combination of start_query_execution and get_query_results. You can then save this to S3 as well.
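For the second option, a minimal boto3 sketch; the database, table, column name and S3 output location are placeholders:

# Run the max-timestamp query through the Athena API and read the result.
import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT max(updated_at) FROM my_table",   # assumed column name
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then read the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Row 0 is the header row; row 1 holds the max timestamp.
    print(rows[1]["Data"][0]["VarCharValue"])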

Load Parquet files into Redshift

I have a bunch of Parquet files on S3, and I want to load them into Redshift in the most optimal way.
Each file is split into multiple chunks... what is the most optimal way to load data from S3 into Redshift?
Also, how do you create the target table definition in Redshift? Is there a way to infer the schema from Parquet and create the table programmatically? I believe there is a way to do this using Redshift Spectrum, but I want to know if this can be done in scripting.
Appreciate your help!
I am considering all AWS tools such as Glue, Lambda etc. to do this in the most optimal way (in terms of performance, security and cost).
The Amazon Redshift COPY command can natively load Parquet files by using the parameter:
FORMAT AS PARQUET
See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
The table must be pre-created; it cannot be created automatically.
Also note from COPY from Columnar Data Formats - Amazon Redshift:
COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
Use parquet-tools from GitHub to inspect the file:
parquet-tools schema <filename>   # dumps the schema with data types
parquet-tools head <filename>     # dumps the first 5 records
Use the jsonpaths file to specify mappings
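As an illustration of the COPY itself, here is a hedged sketch that runs it from Python with psycopg2. The table must already exist (as noted above), and the cluster endpoint, credentials, IAM role ARN and S3 prefix are all placeholders:

# Run a Parquet COPY against an existing Redshift table via psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="your-password",
)

copy_sql = """
    COPY my_schema.my_table
    FROM 's3://your-bucket/parquet-prefix/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # COPY loads all chunks under the prefix in parallel
conn.close()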

Is it possible to create a AWS glue classifier which can convert the csv file to pipe delimited

I would like to convert a monthly feed from CSV to pipe delimited using AWS Glue Crawler. Is it possible to create a classifier which can convert the CSV file to pipe delimited (using Grok or something) so that a monthly scheduled crawler can create the Glue catalog?
Glue Crawler is used for populating the AWS Glue Data Catalog with tables, so you cannot convert your file from CSV format to pipe delimited using only this functionality. The right steps look like this:
Create two tables in the Glue Data Catalog: one for the file in CSV format, and one for the pipe-delimited format. To catalog the source table, you can use a Glue Crawler.
Create a Glue job to transfer data between these tables (a minimal sketch follows below).
This article does not refer exactly to your problem, but you can see how these steps should look:
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
There are also tutorials in the Glue console (at the bottom of the left menu).
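A minimal sketch of the Glue job from step 2, written with plain Spark CSV read/write rather than the Data Catalog tables; the S3 paths are placeholders:

# Read the CSV feed and re-write it with "|" as the field delimiter.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-pipe").getOrCreate()

df = spark.read.csv("s3://your-bucket/monthly-feed/", header=True)

(df.write
   .option("header", True)
   .option("sep", "|")
   .mode("overwrite")
   .csv("s3://your-bucket/pipe-delimited/"))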

AWS Glue Crawler Overwrite Data vs. Append

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The Crawler does not alter data.
The Crawler only populates the AWS Glue Data Catalog with tables.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to do data cleaning using Athena/Glue before using the data, you need to follow these steps:
Map the data using a Crawler into a temporary Athena database/table
Profile your data using Athena SQL or QuickSight etc. to get an idea of what you need to alter
Use a Glue job to
do the data transformation/cleaning/renaming/deduping using PySpark or Scala
export the data to a new S3 location (.csv / .parquet etc.), potentially partitioned
Run one more Crawler to map the cleaned data from the new S3 location into an Athena database
The dedupe you are asking about happens in step 3; a rough sketch of that job is below.
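For illustration, a minimal PySpark sketch of that step 3 job, assuming a catalog database/table created by the first crawler and a business-key column to dedupe on; all names here are placeholders:

# Read the raw table, dedupe, and write a cleaned copy to a new S3 prefix.
# Runs inside a Glue Spark job (awsglue is only available there).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the first crawler created.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="daily_csv")
df = dyf.toDF()

# Keep one row per business key (assumed key column).
deduped = df.dropDuplicates(["record_id"])

# Export to a new location for the second crawler to map.
deduped.write.mode("overwrite").parquet("s3://your-bucket/cleaned/")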

AWS GLUE Data Import Issue

There's an excel file testFile.xlsx, it looks like as below:
ID ENTITY STATE
1 Montgomery County Muni Utility Dist No.39 TX
2 State of Washington WA
3 Waterloo CUSD 5 IL
4 Staunton CUSD 6 IL
5 Berea City SD OH
6 City of Coshocton OH
Now I want to import the data into the AWS Glue database. A crawler in AWS Glue has been created, but there's nothing in the table in the AWS Glue database after running the crawler. I guess it is an issue with the classifier in AWS Glue, but I have no idea how to create a proper classifier to successfully import the data in the excel file into the AWS Glue database. Thanks for any answers or advice.
I'm afraid Glue Crawlers have no classifier for MS Excel files (.xlsx or .xls). Here you can find the list of supported formats and built-in classifiers. It would probably be better to convert the files to CSV or some other supported format before exporting to the AWS Glue Catalog.
Glue crawlers don't support MS Excel files.
If you want to create a table for the excel file, you have to convert it first from Excel to CSV/JSON/Parquet and then run a crawler on the newly created file.
You can convert it easily using pandas.
Create a normal Python job and read the excel file.
import pandas as pd
df = pd.read_excel('yourFile.xlsx', 'SheetName', dtype=str, index_col=None)
df.to_csv('yourFile.csv', encoding='utf-8', index=False)
This will convert your file to CSV. Then run a crawler over this file and your table will be loaded.
Hope it helps.
When you say that "there's nothing in the table in AWS Glue database after running the crawler" are you saying that in the Glue UI, you are clicking on Databases, then the database name, then on "Tables in xxx", and nothing is showing up?
The second part of your question seems to indicate that you are looking for Glue to import the actual data rows of your file into the Glue database. Is that correct? The Glue database does not store data rows, just the schema information about the files. You will need to use a Glue ETL job, or Athena, or Hive to actually move the data from the data file into something like MySQL.
You should write a script (most likely a Python shell job in Glue) to convert the Excel file to CSV and then run a crawler over it.
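A rough sketch of what such a Python shell job could look like, assuming the workbook sits in S3 and that pandas (with an Excel engine such as openpyxl) is available to the job; the bucket and key names are placeholders:

# Download the workbook from S3, convert it to CSV, and upload the result
# so a crawler can catalog it.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

obj = s3.get_object(Bucket="your-bucket", Key="incoming/testFile.xlsx")
df = pd.read_excel(io.BytesIO(obj["Body"].read()), dtype=str)

s3.put_object(
    Bucket="your-bucket",
    Key="converted/testFile.csv",
    Body=df.to_csv(index=False).encode("utf-8"),
)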