AWS Glue Crawler breaks table data structure in Athena query

Problem statement: I have CSV data stored in S3 (extracted from a PostgreSQL RDS instance) and need to query it with Athena. To do that, I created an AWS Glue database and ran a crawler on the S3 bucket, but the data returned by the Athena query is broken (starting from the columns that contain large text content). I tried changing the column type in the Glue table schema from string to varchar(1000) and recrawling, but it still breaks.
Data stored in the S3 bucket (screenshot):
Data returned by an Athena query on the same bucket using SELECT * (screenshot; note the missing row):
I also tested loading the S3 data in a Jupyter notebook in AWS Glue Studio with the snippet below, and the output data looks correct there:
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://..."]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
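A quick way to eyeball what the DynamicFrame actually parsed (a sketch added here, not part of the original post) is to convert it to a Spark DataFrame:
df = dynamicFrame.toDF()
# Print a few rows without truncation, plus the inferred schema
df.show(5, truncate=False)
df.printSchema()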
Any help on this would be greatly appreciated!

Related

How to copy data from Amazon S3 to DDB using AWS Glue

I am following the AWS documentation on how to transfer a DDB table from one account to another. There are two steps:
Export DDB table into Amazon S3
Use a Glue job to read the files from the Amazon S3 bucket and write them to the target DynamoDB table
I was able to do the first step. Unfortunately, the instructions don't say how to do the second step. I have worked with Glue a couple of times, but the console UI is very user-unfriendly and I have no idea how to achieve it.
Can somebody please explain how to import the data from S3 into the DDB?
You could use Glue Studio to generate a script:
Log into AWS
Go to Glue
Go to Glue Studio
Set up the source, basically pointing it to S3 (a sketch of that read follows the code below)
Then use something like the snippet below; this example is for a DynamoDB table with pk and sk as a composite primary key. It is just the mapping to a DynamicFrame and the write to DynamoDB:
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("Item.pk.S", "string", "Item.pk.S", "string"),
        ("Item.sk.S", "string", "Item.sk.S", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "my-target-table"},
)
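The source read that the steps above gloss over is not shown in the original answer; a minimal sketch, assuming the DynamoDB export landed as DynamoDB-JSON files under an S3 prefix (the path is a placeholder), might look like:
# Read the exported items from S3 into a DynamicFrame; this feeds ApplyMapping_node2 above
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-export-bucket/AWSDynamoDB/<export-id>/data/"]},
    format="json",
    transformation_ctx="S3bucket_node1",
)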

Ways to backup AWS Athena views

In an AWS Athena instance we have several user-created views.
We would like to back up these views.
We have been experimenting with the AWS CLI:
aws athena start-query-execution --query-string "show views...
and, for each view,
aws athena start-query-execution --query-string "show create view...
and then
aws athena get-query-execution --query-execution-id...
to get the S3 location of the CREATE VIEW output.
We are looking for ways to get the view definitions backed up. If the AWS CLI is the best suggestion, I will create a Lambda to do the backup.
I think SHOW VIEWS is the best option.
Then you can get the Data Definition Language (DDL) with SHOW CREATE VIEW.
There are a couple of ways to back the views up. You could use Git (AWS offers CodeCommit), and you could drive that from a Lambda function using Boto3.
In fact, just by retrieving the DDL you are already backing it up to S3, because Athena writes every query result there.
Consider the following DDL:
CREATE EXTERNAL TABLE default.dogs (
  `breed_id` int,
  `breed_name` string,
  `category` string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
LOCATION 's3://stack-exchange/48836509'
TBLPROPERTIES ('skip.header.line.count'='1')
and the following view based on it.
CREATE VIEW default.vdogs AS SELECT * FROM default.dogs;
When we show the DDL:
$ aws athena start-query-execution --query-string "SHOW CREATE VIEW default.vdogs" --result-configuration OutputLocation=s3://stack-exchange/66620228/
{
    "QueryExecutionId": "ab21599f-d2f3-49ce-89fb-c1327245129e"
}
The DDL is written to S3 (just like any other Athena query result):
$ cat ab21599f-d2f3-49ce-89fb-c1327245129e.txt
CREATE VIEW default.vdogs AS
SELECT *
FROM
default.dogs
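Putting the pieces together, a minimal Boto3 sketch of such a backup Lambda (not from the original answer; the database name, output bucket, and lack of error handling are simplifying assumptions):
import time
import boto3

athena = boto3.client("athena")
OUTPUT = "s3://my-backup-bucket/athena-view-ddl/"  # placeholder bucket

def run_query(query):
    # Start an Athena query and block until it finishes; return the execution id
    qid = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid
        time.sleep(1)

def handler(event, context):
    # List the views, then dump each view's DDL; every result file lands under OUTPUT
    qid = run_query("SHOW VIEWS IN default")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        view_name = row["Data"][0]["VarCharValue"]
        run_query(f"SHOW CREATE VIEW default.{view_name}")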

AWS Glue - boto3 crawler not creating table

I am trying to create and run an AWS Glue crawler through the boto3 library. The crawler runs against JSON files in an S3 folder. The crawler completes successfully, and when I check the logs there are no errors, but it doesn't create any table in my Glue database.
It's not a permission issue, as I am able to create the same crawler through a CFT, and when I run that one it creates the table as expected. I'm using the same role as my CFT in the code I'm running with boto3 to create it.
I have tried using boto3 create_crawler() and start_crawler(). I also tried using boto3 update_crawler() on the crawler created from the CFT and updating the S3 target path.
response = glue.create_crawler(
    Name='my-crawler',
    Role='my-role-arn',
    DatabaseName='glue_database',
    Description='Crawler for generating table from s3 target',
    Targets={
        'S3Targets': [
            {
                'Path': s3_target
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=''
)
Are you sure you passed the correct region to the Glue client when you created it?
I once copied code to a new region and forgot to change the region, then spent hours figuring out why no table was being created even though there was no error. It turned out the table had been created in the other region.
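For example, pinning the region explicitly when constructing the client avoids that trap (region and crawler name below are placeholders):
import boto3

# Create the Glue client in the region where the database and tables should live
glue = boto3.client("glue", region_name="eu-west-1")

# Kick off the crawler once it has been created
glue.start_crawler(Name="my-crawler")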

How to convert headerless, compressed, pipe-delimited files stored in S3 into parquet using AWS Glue

Currently, I have several thousand headerless, pipe-delimited, GZIP compressed files in S3, totaling ~10TB, with the same schema. What is the best way, in AWS Glue, to (1) add a header file, (2) convert to parquet format partitioned by week using a "date" field in the files, (3) have the files be added to the Glue Data Catalog for accessibility for querying in AWS Athena?
1) Create an Athena table pointing to your raw data on S3 (CREATE EXTERNAL TABLE in Athena).
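For headerless, pipe-delimited, GZIP-compressed files, the DDL could look roughly like the following (database, table, and column names as well as the location are assumptions; Athena decompresses .gz files automatically):
CREATE EXTERNAL TABLE mydb.raw_events (
  `id`    string,
  `event` string,
  `date`  string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/raw-events/'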
2) Create a dynamic frame from the Glue catalog, using the table you created in the step above:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")
3) Write the data back to a new S3 location in whatever format you like:
glueContext.write_dynamic_frame.from_options(
    frame=DyF,
    connection_type="s3",
    connection_options={"path": "path to new s3 location"},
    format="parquet")
4) Create an Athena table pointing to your Parquet data on S3 (CREATE EXTERNAL TABLE in Athena).
Note: instead of creating the Athena table manually, you can also use a Glue crawler to create it for you. However, that will incur some charges.
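The original question also asked for the output to be partitioned by week on a "date" field; a hedged variation of step 3 (the week derivation and column handling are assumptions, not part of the original answer) could look like:
from pyspark.sql import functions as F
from awsglue.dynamicframe import DynamicFrame

# Derive a year-week column from the "date" field (assumed to parse as a date)
d = F.to_date(F.col("date"))
df = DyF.toDF().withColumn("year_week", F.concat_ws("-", F.year(d), F.weekofyear(d)))
partitioned_dyf = DynamicFrame.fromDF(df, glueContext, "partitioned_dyf")

# Write Parquet partitioned by the derived week column
glueContext.write_dynamic_frame.from_options(
    frame=partitioned_dyf,
    connection_type="s3",
    connection_options={"path": "path to new s3 location", "partitionKeys": ["year_week"]},
    format="parquet")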

AWS Kinesis S3 Redshift

I am trying to send data to Amazon Redshift through Kinesis, via S3, using the COPY command. The data format looks fine in S3 (JSON), but when it reaches Redshift many of the column values are empty. I have tried everything and haven't been able to fix this for the past few days.
Data format in S3:
{"name":"abc", "clientID":"ahdjxjxjcnbddnn"}
{"name":"def", "clientID":"uhrbdkdndbdbnn"}
Redshift table structure:
User (
    name: varchar(20),
    clientID: varchar(25)
)
In Redshift, only one of the two fields gets populated.
I have used JSON 'auto' in the COPY command.
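For reference, a COPY statement of the shape described (the table name, S3 prefix, and IAM role below are placeholders, not from the original post) would look something like:
COPY users
FROM 's3://my-bucket/kinesis-firehose-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
JSON 'auto';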