I have been using a Lambda function to process a .csv file that is dropped into an S3 bucket. I'm using the base Lambda code described in this AWS blog post, which uses boto3.
This method works really well for loading the data from a CSV, but when I upload a new CSV file that no longer contains some rows that were there previously, the Lambda does not remove that data. Because it uses batch_writer and put_item, it only updates items with the same PK and creates new items if they don't exist.
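For reference, the loader is essentially the pattern from that post, roughly the sketch below (the table name and CSV layout are placeholders for my actual setup):

    import csv
    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("my-table")  # placeholder table name

    def lambda_handler(event, context):
        # Triggered by the S3 put event for the uploaded CSV.
        record = event["Records"][0]["s3"]
        obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
        rows = csv.DictReader(obj["Body"].read().decode("utf-8").splitlines())

        # batch_writer/put_item only inserts or overwrites items; it never deletes.
        with table.batch_writer() as batch:
            for row in rows:
                batch.put_item(Item=row)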
I'm trying to figure out how to make sure that if data is removed from the CSV, the Lambda also removes it from the DynamoDB table, but I can't get my head around how to do that with the current process.
Has anyone solved this problem before?
Thanks!
Load into a new table. That way it will hold only the data in the CSV file and nothing else. Delete the old table after the load into the new table completes. (It's nice that you avoid paying for any delete calls.)
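A rough sketch of that approach with boto3 (the table names, key schema, and CSV layout here are placeholders, not your actual setup):

    import csv
    import boto3

    client = boto3.client("dynamodb")
    resource = boto3.resource("dynamodb")

    OLD_TABLE = "items-blue"   # e.g. alternate blue/green table names per load
    NEW_TABLE = "items-green"

    # 1. Create the new table with the same key schema as the old one.
    client.create_table(
        TableName=NEW_TABLE,
        KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    client.get_waiter("table_exists").wait(TableName=NEW_TABLE)

    # 2. Load the CSV into the new table only, so it reflects exactly the latest file.
    with open("latest.csv") as f, resource.Table(NEW_TABLE).batch_writer() as batch:
        for row in csv.DictReader(f):
            batch.put_item(Item=row)

    # 3. Point readers at NEW_TABLE (config, parameter store, etc.), then drop the old table.
    client.delete_table(TableName=OLD_TABLE)

Your readers then need some way to discover which table is current, for example a config value you flip after each load.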
I apologise if the title is a bit misleading for the question I am going to ask. I am trying to understand how Athena works a bit more clearly.
I have a daily job which uploads files to an S3 location, and I have created an Athena table which reads from that S3 location. Every day the data gets updated and new files (i.e. new data) are uploaded to the location. (New doesn't necessarily mean overwriting; it can also mean adding more files.)
My issue is that when I try to read the latest data from the Athena GUI, it returns nothing but an empty table.
How do I read the latest data? Do I have to run another command like ALTER TABLE or INSERT INTO after uploading files to S3? My understanding was that uploading files to that S3 location is akin to inserting data into the table, and vice versa, i.e. running ALTER TABLE/INSERT INTO is akin to uploading files to S3.
I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data distinguished by attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL, so would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (to automatically download the data from another site and then upload it to S3) could be an option, deleting the NaN columns and adding a column for site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I generally find this website pretty helpful for figuring out SQL. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are already running a Python script to edit the .csv, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
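For example, a minimal sketch of that edit step with pandas and boto3 (the column names, bucket, and site values here are hypothetical):

    import boto3
    import pandas as pd

    def prepare_and_upload(csv_path, site_name, bucket, key):
        df = pd.read_csv(csv_path)

        # Drop columns that are entirely NaN and tag every row with its site.
        df = df.dropna(axis="columns", how="all")
        df["site_name"] = site_name

        cleaned = f"cleaned_{site_name}.csv"
        df.to_csv(cleaned, index=False)
        boto3.client("s3").upload_file(cleaned, bucket, key)

    # Example call with placeholder values.
    prepare_and_upload("company_a_site_1.csv", "site_1", "my-data-bucket", "company_a/site_1.csv")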
I have two problems in my intended solution:
1.
My S3 store structure is as follows:
mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz
All JSON files have the same schema, and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.
I have already tried with just one file format, e.g. if the files are all plain .json or all .gz then the crawler works perfectly, but I am looking for a solution that can handle either type of file automatically. I am open to writing a custom script or using any out-of-the-box solution, but I need pointers on where to start.
2.
The second issue is that my JSON data has a field (column) which the crawler interprets as struct data, but I want that field's type to be string. The reason is that if the type remains struct, the date/hour partitions get a mismatch error, since the struct data obviously does not have the same internal schema across files. I have tried to make a custom classifier, but there are no options there for describing data types.
I would suggest skipping the crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially for adding partitions, but it's much less painful than trying to make a crawler do what you want it to do.
You can of course also create the table from Athena; that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
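As an example, adding a partition with the Glue API via boto3 might look roughly like this (the database, table, bucket, and SerDe choices are assumptions for illustration; the storage descriptor has to match however the table itself is defined):

    import boto3

    glue = boto3.client("glue")

    def add_partition(date, hour):
        glue.create_partition(
            DatabaseName="mydb",
            TableName="events",
            PartitionInput={
                "Values": [date, hour],
                "StorageDescriptor": {
                    "Location": f"s3://mybucket/mainfolder/date={date}/hour={hour}/",
                    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                    # OpenX JSON SerDe, commonly used for JSON tables in Athena.
                    "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
                },
            },
        )

    add_partition("2019-01-01", "14")

The Athena equivalent is an ALTER TABLE ... ADD PARTITION statement run from the console or via the API.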
A crawler will not handle compressed and uncompressed data together, so it will not work out of the box.
It is better to write a Spark job in Glue and use spark.read().
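A rough PySpark sketch of that approach (the bucket, paths, and the struct column name "payload" are assumptions; in a Glue job you would normally get the SparkSession from the GlueContext):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_json

    spark = SparkSession.builder.appName("mainfolder-json").getOrCreate()

    # Spark reads .json and .json.gz side by side; compression is detected per file,
    # and the date=/hour= directories are picked up as partition columns.
    df = spark.read.json("s3://mybucket/mainfolder/")

    # "payload" stands in for the field the crawler types as struct; serialising it
    # back to a JSON string gives a consistent string column across all files.
    df = df.withColumn("payload", to_json(col("payload")))

    # Write out in a queryable layout for Athena.
    df.write.mode("overwrite").partitionBy("date", "hour").parquet("s3://mybucket/curated/")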
For a project we've inherited, we have a large-ish set of legacy data, 600 GB, that we would like to archive but still have available if need be.
We're looking at using AWS Data Pipeline to move the data from the database into S3, according to this tutorial:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and give each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we know which file to fetch and can download just that file, rather than all 600 GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
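For instance, a hedged boto3 sketch of an S3 Select call against one of those CSV files (the bucket, key, and row_id column are hypothetical, and the file is assumed to have a header row):

    import boto3

    s3 = boto3.client("s3")

    response = s3.select_object_content(
        Bucket="my-archive-bucket",
        Key="foo_data.csv.gz",
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s WHERE s.row_id = '10239'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
        OutputSerialization={"CSV": {}},
    )

    # The result comes back as an event stream; print the record chunks.
    for event in response["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"))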
I need to build a data pipeline which takes input from a CSV file (stored on S3) and "updates" records in an Aurora RDS table. I understand the standard format (out-of-the-box template) for bulk record insertion, but for record updates or deletions, is there any standard way to have those statements in the SqlActivity?
I can write an update statement, but the way CSV inputs are referenced, they are just question-mark placeholders (?) without any way to index a particular column.
Can Data Pipeline be used in this way? If so, is there any specific way I can refer to CSV columns? Thanks in advance!
You will need to do some preprocessing of your CSV into a SQL script containing your bulk updates, and then invoke the SqlActivity with a reference to that script.
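A minimal sketch of that preprocessing step in Python (the table and column names are hypothetical, and real code should use proper SQL escaping or parameterisation rather than string formatting):

    import csv

    def csv_to_update_script(csv_path, sql_path):
        with open(csv_path, newline="") as src, open(sql_path, "w") as out:
            for row in csv.DictReader(src):
                name = row["name"].replace("'", "''")  # very minimal escaping, for illustration only
                out.write(f"UPDATE my_table SET name = '{name}' WHERE id = {int(row['id'])};\n")

    csv_to_update_script("input.csv", "bulk_updates.sql")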
If you have inserts, you might be able to perform them by using the following:
CopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html) which takes:
S3DataNode as an input
SqlDataNode as the output.
If performance is not a concern, then this is the closest you can get to an out-of-the-box transport using AWS Data Pipeline.
You can refer to the AWS Data Pipeline docs (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html) for more information.