I've configured my DMS task to read from a MySQL database and migrate its data to S3 with ongoing replication. Everything seems to work fine: it creates big CSV files for the full load and then starts creating smaller CSV files with the deltas.
The problem is that when I read these CSV files with AWS Glue crawlers, they don't seem to pick up the deltas, or even worse, they seem to pick up only the deltas and ignore the big CSV files.
I know that there is a similar post here: Athena can't resolve CSV files from AWS DMS
But it is unanswered and I can't comment there, so I'm opening this one.
Has anyone found a solution to this?
Best regards.
I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL. Would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv files in the Python script I'm already running (which automatically downloads the data from another site and uploads it to S3) could be an option, deleting NaN columns and adding a column for the site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are already running a Python script to edit the .csv files, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
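Since there's already a Python script in the pipeline, a minimal sketch of that edit step could look like this (the `site_name` column and the bucket/key names are my assumptions, not something from the question):

```python
import pandas as pd


def prepare(df: pd.DataFrame, site_name: str) -> pd.DataFrame:
    """Drop columns that are entirely NaN and tag every row with its site."""
    cleaned = df.dropna(axis=1, how="all").copy()
    cleaned["site_name"] = site_name
    return cleaned


def upload_csv(df: pd.DataFrame, bucket: str, key: str) -> None:
    """Write the DataFrame to S3 as a CSV object (bucket/key are placeholders)."""
    import io

    import boto3  # imported here so the prepare() step has no AWS dependency

    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
```

You'd call `prepare()` once per company/site before uploading, which gives you the distinguishing attribute the single SQL table needs.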
I have been looking for the best way to programmatically pull a Redshift table (the table needs to be aggregated) into S3.
What would be the best solution? For Athena to S3 I found the article below; however, I could not find any information on doing it from Redshift to S3.
https://www.datastackpros.com/2020/07/export-athena-view-as-csv-to-aws-s3.html
It would be a daily ingestion, and the CSV file should be overwritten each time.
Thanks
There are two ways that come to mind right away: UNLOAD and CREATE EXTERNAL TABLE. Each has its pros and cons. Your use case isn't completely clear about what you need the resulting file(s) to look like, but let me take a guess.
I expect you need a single CSV file (with or without a header row?) for other tools to read and use. In this case I'd use UNLOAD with PARALLEL OFF to save the result of the query to S3. This will produce one file in S3 ONLY IF the result is smaller than the per-file limit (6.2 GB).
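As a rough sketch, built in Python so it can be sent through any Redshift client (the role ARN, bucket prefix, and query are placeholders); ALLOWOVERWRITE covers the daily-overwrite requirement from the question:

```python
def build_unload(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that writes a single CSV with header.

    Note: any single quotes inside `query` must be doubled ('') for UNLOAD.
    """
    return (
        f"UNLOAD ('{query}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV HEADER PARALLEL OFF ALLOWOVERWRITE;"
    )


sql = build_unload(
    "SELECT region, SUM(sales) AS total_sales FROM events GROUP BY region",
    "s3://my-bucket/exports/daily_",              # hypothetical prefix
    "arn:aws:iam::123456789012:role/RedshiftS3",  # hypothetical role
)
```

You'd then execute `sql` on a schedule from whatever driver you use to talk to Redshift.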
I'm testing this architecture: Kinesis Firehose → S3 → Glue → Athena. For now I'm using dummy data which is generated by Kinesis, each line looks like this: {"ticker_symbol":"NFLX","sector":"TECHNOLOGY","change":-1.17,"price":97.83}
However, there are two problems. First, the Glue crawler creates a separate table per file. I've read that if the schemas match, Glue should produce only one table. As you can see in the screenshots below, the schema is identical. In the crawler options, I tried ticking Create a single schema for each S3 path on and off, but nothing changed.
The files also sit in the same path, which leads to the second problem: when those tables are queried, Athena doesn't show any data. That's likely because the files share a folder; I've read about it here (point 1) and tested it several times. If I remove all but one file from the S3 folder and crawl, Athena shows data.
Can I force Kinesis to put each file in a separate folder or Glue to record that data in a single table?
File1:
File2:
Regarding AWS Glue creating separate tables, there could be a few reasons, based on the AWS documentation:
Confirm that these files use the same schema, format, and compression type as the rest of your source data. It seems this isn't your issue, but to make sure, I suggest testing with smaller files by dropping all but a few rows in each file.
Combine compatible schemas when you create the crawler by choosing Create a single schema for each S3 path. For this to work, the file schemas should be similar, the setting should be enabled, and the data should be compatible. For more information, see How to Create a Single Schema for Each Amazon S3 Include Path.
When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
Another really important point: you should have one folder at the root and partition sub-folders inside it. If you have partitions at the S3 bucket level, the crawler will not create one table (mentioned by Sandeep in this Stack Overflow question).
I hope this helps you resolve your problem.
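If it helps, the console checkbox maps to the crawler's Configuration JSON; a boto3 sketch (the crawler name, role, database, and path are placeholders, and the actual call needs AWS credentials):

```python
import json

# "Create a single schema for each S3 path" corresponds to this grouping policy:
crawler_config = json.dumps(
    {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
)

# Hypothetical crawler creation (uncomment with real credentials/ARNs):
# import boto3
# boto3.client("glue").create_crawler(
#     Name="firehose-crawler",
#     Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
#     DatabaseName="firehose_db",
#     Targets={"S3Targets": [{"Path": "s3://my-bucket/firehose/"}]},
#     Configuration=crawler_config,
# )
```

This sets programmatically what the console checkbox toggles, so you can rule out the setting silently not being applied.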
I have two problems in my intended solution:
1.
My S3 store structure is as following:
mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz
All json files have the same schema and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.
I have already tried with just one file format, e.g. if the files are all json or all gz then the crawler works perfectly, but I am looking for a solution that automates processing either type of file. I am open to writing a custom script or using any out-of-the-box solution, but I need pointers on where to start.
2.
The second issue is that my json data has a field (column) which the crawler interprets as struct data, but I want that field typed as a string. The reason is that if the type remains struct, the date/hour partitions get a schema-mismatch error, since the struct data obviously doesn't have the same internal schema across files. I have tried making a custom classifier, but there are no options there to specify data types.
I would suggest skipping using a crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially adding partitions, but it's much less pain than trying to make a crawler do what you want it to do.
You can of course also create the table from Athena, that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
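For the partitioned JSON layout in the question, that DDL could be generated like this (the database, table, and column names are my inventions; note that the troublesome struct field can simply be declared as a string):

```python
def create_table_ddl(database: str, table: str, location: str) -> str:
    """Build an Athena CREATE EXTERNAL TABLE for date/hour-partitioned JSON."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  payload string\n"  # declare the inconsistent struct field as string
        ")\n"
        "PARTITIONED BY (`date` string, `hour` int)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl("my_db", "events", "s3://my-bucket/mainfolder/")
```

Because the folders already follow the `date=.../hour=...` Hive convention, after creating the table you can load partitions with `MSCK REPAIR TABLE` instead of adding them one by one.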
The crawler will not handle compressed and uncompressed data together, so it will not work out of the box.
It is better to write a Spark job in Glue and use spark.read(), which decompresses .gz input transparently.
I'd like to run a MapReduce job on a DynamoDB Table.
My question is:
Is it ok to dump all the table (even if it's very big, with tens of millions of entries) into one file on S3?
That is, will MapReduce know to take "chunks" of this file and distribute them to the mappers? Or is the atomic unit provided to a mapper a file on S3, so that I'd need to break the table into lots of little files, for example files of at most 10,000 rows?
If that is the case, is there a way to use AWS Data Pipeline to dump a DynamoDB table into several different files on S3?
Thanks!
You can see this article on exporting DynamoDB data to S3:
https://aws.amazon.com/articles/Elastic-MapReduce/28549
Check Exporting data stored in DynamoDB to Amazon S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMRforDynamoDB.html
Video at
http://www.youtube.com/watch?v=RlKndm22bXw
Hope this helps.
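On the "lots of little files" option from the question: if you do end up splitting the export yourself, the chunking step is straightforward. A sketch (the file naming and the upload call are assumptions, and the S3 part needs boto3 plus credentials):

```python
def chunk_rows(rows, rows_per_file=10_000):
    """Yield lists of at most rows_per_file rows from an iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == rows_per_file:
            yield batch
            batch = []
    if batch:
        yield batch


# Each chunk would then become its own S3 object, e.g. (hypothetical):
# for i, batch in enumerate(chunk_rows(scan_table_items())):
#     s3.put_object(Bucket="my-bucket", Key=f"export/part-{i:05d}", Body=...)
```

With many small objects, each mapper can simply be handed one file.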