S3 to DynamoDB item logs

I am working on a data pipeline that imports CSV data from S3 into my DynamoDB table. Since the data set is very large, it is very difficult to spot a discrepancy in any individual row. The import fails with the error "The provided key element does not match the schema", but the row it points to looks fine and does not contain any unusual values.
What I want to do now:
I want to somehow log each item as it is read from the CSV in S3 and written to DynamoDB. That way I can find the row that caused the error.
Is there any way to log that?
Thank you for your suggestions.
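One way to get that log is to drive the import yourself with a small boto3 script instead of the built-in import, logging every item just before it is written; the last line logged is the one that triggered the error. This is only a sketch: the bucket, key, and table names are placeholders, and it assumes string attributes throughout.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("csv-import")

def row_to_item(row):
    """Turn one CSV row (a dict of strings) into a DynamoDB item,
    dropping empty strings, which DynamoDB rejects for key attributes."""
    return {k: v for k, v in row.items() if v not in (None, "")}

def import_csv(bucket, key, table_name):
    import boto3  # imported here so row_to_item stays usable without AWS
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table(table_name)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    # start=2 because line 1 of the file is the CSV header
    for lineno, row in enumerate(csv.DictReader(io.StringIO(body)), start=2):
        item = row_to_item(row)
        log.info("line %d -> %s", lineno, item)  # logged before the write
        try:
            table.put_item(Item=item)
        except ClientError:
            log.error("line %d is the row that failed", lineno)
            raise

if __name__ == "__main__":
    import_csv("my-bucket", "data/export.csv", "my-table")
```

A common cause of "The provided key element does not match the schema" is an empty or missing key attribute in one row, which is why the helper drops empty strings; per-line logging will point at the exact offender either way.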

Related

How does Amazon Athena manage rename of columns?

Hi everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS): I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what happens when a column is renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to keep track of this schema evolution, or would I have to run a script that iterates through all my files on S3, renames the column, and updates the table schema in Athena from 'col_b' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
I recommend reading more about handling schema updates with Athena here. Generally, Athena supports multiple ways of reading Parquet files (as well as other columnar formats such as ORC). By default, Parquet columns are read by name, but you can switch to reading by index instead. Each approach has its own advantages and disadvantages when dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only ever appended at the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
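The superset-plus-view approach described above can be illustrated with any SQL engine; here is a runnable sketch using SQLite from Python as a stand-in for Athena (the view syntax in Athena/Presto is essentially the same). The table and column names follow the Day 1 / Day 2 example, with col_b as the old name and col_beta as the new one.

```python
import sqlite3

# A table whose schema is the superset of both days' schemas:
# Day-1 files populate col_b, Day-2 files populate col_beta.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw (col_a TEXT, col_b TEXT, col_beta TEXT, col_c TEXT)")
con.execute("INSERT INTO raw VALUES ('a1', 'day1-value', NULL, 'c1')")  # Day 1 file
con.execute("INSERT INTO raw VALUES ('a2', NULL, 'day2-value', 'c2')")  # Day 2 file

# The view resolves the rename "manually": whichever column is populated
# wins, and downstream queries only ever see col_beta.
con.execute("""
    CREATE VIEW unified AS
    SELECT col_a, COALESCE(col_beta, col_b) AS col_beta, col_c FROM raw
""")
rows = con.execute("SELECT col_a, col_beta FROM unified ORDER BY col_a").fetchall()
print(rows)
```

With this in place, SELECT col_beta FROM unified returns data from both days, instead of only Day 2.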
You can run the AWS Glue crawler on demand or on a time-based schedule, so every time your data on S3 changes, an updated schema is generated (you can also edit the data types of the attributes in the schema). This way your columns stay up to date and you can query the new field.
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in that same order; it does not use column names to map data to columns. That is why you can rename columns in CSV or TSV files without breaking Athena queries.

AWS Glue table Map data type for arbitrary number of fields and challenges faced

We are working on a data lake project in which data from CloudWatch Logs is sent to S3 via Kinesis. Once the data is in S3 (as JSON files), we need to create a table so we can query it through Athena. Each record has a timestamp field and a properties field; the latter is an object that may hold an arbitrary number of fields which differ from case to case, so when creating the table I defined it as map<string,string>, based on some research and advice.

The challenge is that when querying through Athena, it always says "zero records returned" even though there is definitely data. To confirm this, I created a table via a crawler instead, and with that table I am able to see the data through Athena; the only difference is that the crawler defines the properties column as a struct with specific fields inside it, whereas the manual table uses map<string,string> to handle arbitrary incoming fields. I'd appreciate any help identifying the root cause. Thank you.
Below is sample JSON line which is sitting in S3.
{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}
Zero records returned usually means the table's location is wrong, or that you have a partitioned table but haven't added any partitions. I assume you would have figured it out if it were the former, so I suspect the latter. When you run a crawler, it adds the partitions it finds in addition to creating the table.
If you need help adding partitions please edit your question and provide examples of how your data is structured on S3.
This is a case where using a Glue crawler will probably not work very well, it will try too hard to figure out the schema of the properties column and it's never going to get it right. Glue crawlers are in general pretty bad at things that aren't very basic (see this question for a similar use case to yours when Glue didn't work out).
I think you'll be fine with a manually created table that uses map<string,string> as the type for the properties column. When you know the type of a property and want to use it as that type you just cast the value at query time. An alternative is to use string as the type and use the JSON functions to extract values at query time.
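Using the sample record from the question, the map<string,string> idea amounts to storing every property value as a string and casting it back only when a query needs it. Here is a quick Python illustration of that cast-at-read pattern (Athena itself would do the equivalent with CAST(element_at(properties, 'timestamp') AS bigint)):

```python
import json

# Sample record from the question.
line = ('{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,'
        '"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}')

record = json.loads(line)

# Stored as map<string,string>: every property value becomes a string,
# regardless of its original JSON type.
properties = {k: str(v) for k, v in record["properties"].items()}

# Cast back to the real type only at query time.
ts = int(properties["timestamp"])
flag = properties["processFlag"] == "true"
print(ts, flag)
```

The trade-off is exactly as described: the map keeps the schema stable however many arbitrary fields arrive, at the cost of an explicit cast wherever a typed value is needed.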

Query Athena tables & output column for 'S3 source' path

Currently using information_schema.tables to list all tables in my catalog.
What I am missing, is a column to tell me which S3 path each table (external) is pointing to.
Looked in all the information_schema tables, but cannot see this info.
The only place I've seen this via SQL is with the 'SHOW CREATE TABLE' command, which doesn't return the result as a proper recordset.
Failing that, is there another way to keep tabs on all of your tables and their sources?
Many Thanks.
So, as above, I could find no way of doing this from the database.
My actual solution is below, for interest (and in case anyone finds a better way):
From the CLI:
1. Call AWS Glue get-tables and output the JSON to a file
2. Sync the file to S3
3. Run an ETL job to convert the multi-line JSON into single-line JSON and place it in a new bucket
4. Crawl the new bucket
5. Query/unnest in Athena
'Convoluted' is a word that comes to mind!
At least it gets the data I need where I need it.
Again, if anyone finds an easier way... ?
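If the goal is just the name-to-S3-path mapping rather than a queryable Athena table, the pipeline above can be collapsed into one boto3 script: the Glue GetTables response carries each table's location in StorageDescriptor.Location. A sketch (the database name is a placeholder, and the paginated call needs AWS credentials; the extraction helper itself does not):

```python
def table_locations(pages):
    """Extract (table name, S3 location) pairs from a sequence of
    Glue get_tables response pages."""
    out = []
    for page in pages:
        for t in page.get("TableList", []):
            out.append((t["Name"], t.get("StorageDescriptor", {}).get("Location")))
    return out

if __name__ == "__main__":
    import boto3  # imported here so the helper is usable without AWS
    glue = boto3.client("glue")
    pages = glue.get_paginator("get_tables").paginate(DatabaseName="my_database")
    for name, location in table_locations(pages):
        print(name, location)
```

Views have no storage location, so the helper returns None for them rather than failing.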

AWS Personalize complaining about user_id not present in csv when it is. Any suggestions?

I have a CSV with user-interaction data and I'm trying to create a data import job in Amazon Personalize, but it keeps failing, saying the column user_id does not exist. Can someone please help?
I have tried renaming the column to different things and changing the schema accordingly, but it still fails on that first column.
I figured it out myself. It's kind of annoying, but the first column in the CSV cannot be any of the columns that Personalize requires. So just add some random key or something as the first column and it will pass their validation. I hope this helps anyone with the same issue.
I've also had this issue recently. It turned out my file was encoded in UTF-16, and that didn't play well with Amazon's systems.
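For the encoding case, a quick sanity check before uploading is to look for a UTF-16 byte-order mark and re-encode to UTF-8 if one is present. A minimal sketch (file names are placeholders, and it only detects BOM-prefixed UTF-16, which is what most exports produce):

```python
def reencode(data: bytes) -> bytes:
    """Decode UTF-16 input (detected by its byte-order mark) to UTF-8;
    pass anything else through unchanged."""
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return data.decode("utf-16").encode("utf-8")
    return data

if __name__ == "__main__":
    with open("interactions.csv", "rb") as f:
        raw = f.read()
    with open("interactions-utf8.csv", "wb") as f:
        f.write(reencode(raw))
```

An unexpected encoding can also explain the "column does not exist" symptom above: a BOM glued onto the first header cell makes the first column name fail validation even though it looks correct in an editor.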

AWS Athena: HIVE_UNKNOWN_ERROR: Unable to create input format

I've crawled a couple of XML files on S3 using AWS Glue, with a simple XML classifier.
However, when I try running any query on that data using AWS Athena, I get the following error (note that it's the simplest possible query I'm doing here):
HIVE_UNKNOWN_ERROR: Unable to create input format
Note that Athena can see my tables and their columns; it just can't query them.
I noticed that someone on the AWS discussion forums has the same problem (Athena XML Query Give HIVE Unknown Error), but it got no love from anyone.
I know there is a similar question here about this error, but the query in that question targeted an RDS database, not an S3 bucket as I have here.
Has anyone got a solution for this?
Sadly, at this time (12/2018) Athena cannot query XML input, which is hard to understand when you may have heard that Athena together with AWS Glue can query XML.
The output you are seeing from the AWS crawler is correct, though; it's just not doing what you think it is. After your crawler has run, you will see the tables but cannot execute any Athena queries against them. Go into your AWS Glue Data Catalog, click Tables on the right, select your table, and edit its properties.
Notice how the input format is null? If you have any other tables, you can look at their properties, or refer back to the input format documentation for Athena. This is why you receive the error.
Solutions:
convert your data to text/JSON/Avro/another supported format prior to upload
create an AWS Glue job that converts the XML source into a format Athena supports (ideally compressed columnar such as ORC or Parquet)
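For the first suggestion, a minimal conversion sketch: flatten simple XML records into newline-delimited JSON that Athena can query. This assumes a flat <record><field>value</field>...</record> layout with an illustrative record tag name; real data with nesting or attributes needs more handling (or the Glue job from the second suggestion).

```python
import json
import xml.etree.ElementTree as ET

def xml_to_json_lines(xml_text: str, record_tag: str = "record") -> str:
    """Flatten each <record> element into one JSON object per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for rec in root.iter(record_tag):
        lines.append(json.dumps({child.tag: child.text for child in rec}))
    return "\n".join(lines)

sample = """
<records>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</records>
"""
print(xml_to_json_lines(sample))
```

One JSON object per line matters here: Athena's JSON SerDe expects newline-delimited records, not a pretty-printed array.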