I need to build a data pipeline which takes input from a CSV file (stored on S3) and "updates" records in the Aurora RDS table. I understand the standard format (out of the box template) for bulk record insertion, but for the records update or deletion, is there any standard way to have those statements in the SqlActivity?
I can write an update statement, but then the way CSV inputs are referenced, they are just question marks (?) without any liberty to index a column.
Let me know if data pipeline can be used in this way? If yes any specific way I can refer CSV columns? Thanks in advance!
You will need to do some preprocessing of your CSV to a SQL script containing your bulk updates and then invoke the SqlActivity with a reference to your script.
If you have inserts you might be able to perform this by using the following:
CopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html) which takes:
S3DataNode as an input
SqlDataNode as the output.
If performance is not a concern then this is the closest you can get to an out of the box transport using AWS Data Pipeline.
You can refer to the AWS Data Pipeline docs (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html) for more information.
Related
I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL, would someone mind explaining to me how to add an attribute to the data, or point me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running to automatically download the data from another site then upload it to S3 could be an option (deleting NaN columns and adding a column for site name), but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a python script to edit the .csv to start, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to s3. Then you can run your Glue crawler or whatever process you're using to map the columns.
I have JSON files in an S3 Bucket that may change their schema from time to time. To be able to analyze the data I want to run a glue crawler periodically on them, the analysis in Athena works in general.
Problem: My timestamp string is not recognized as timestamp
The timestamps currently have the following format 2020-04-06T10:37:38+00:00, but I have also tried others, e.g. 2020-04-06 10:37:38 - I have control over this and can adjust the format.
The suggestion to set the serde parameters might not work for my application, I want to have the scheme completely recognized and not have to define each field individually. (AWS Glue: Crawler does not recognize Timestamp columns in CSV format)
Manual adjustments in the table are generally not wanted, I would like to deploy Glue automatically within a CloudFormation stack.
Do you have an idea what else I can try?
This is a very common problem. The way we got around the problem when reading text/json files is we had an extra step in between to cast and set proper data types. The crawler data types are a bit iffy sometimes and is based on the data sample available at that point in time
We have raw data stored in S3 as parquet.
I want a subset of that data loaded into Redshift.
To be clear, the Redshift data would be the result of a query (joins, filters, aggregations) of the raw data.
I originally thought that I could build views in Athena, and load the results into Redshift - but seems that it's not that simple !
Glue ETL jobs need an S3 or RDS source - will not accept a view from Athena.
(Cannot crawl a view either).
Next solution, was to have a play with the Athena CTAS functionality, write the results of the view to S3, and then load into RedShift.
However, there is no 'overwrite' option with CTAS.
So questions ...
Is there an easier way to approach this ? (seems a simple requirement)
Is there an easy workaround to execute a CTAS with 'overwrite' behaviour ?
With that, would have to be a solution that could be bundled up into a scheduled job - and already I think is leading into a custom script.
When a simple job becomes so difficult - I cannot help but think I'm missing something simple !?
Thanks
Ol' reliable: use a lambda! Lambda functions can programmatically connect to both s3 and redshift to execute SQL statements, and you have many options for what will trigger the lambda (if it's just a one-time thing, you can just have it be a scheduled lambda). You will be able use cloudwatch logs to examine the process too.
But beware: I noticed that you stored your data as a parquet... Normal Redshift does not support parquet formatted data. So, if you want to store types like structs, etc. you will need to use Redshift Spectrum.
I have been looking at options to load (basically empty and restore) Parquet file from S3 to DynamoDB. Parquet file itself is created via spark job that runs on EMR cluster. Here are few things to keep in mind,
I cannot use AWS Data pipeline
File is going to contain millions of rows (say 10 million), so would need an efficient solution. I believe boto API (even with batch write) might not be that efficient ?
Are there any other alternatives ?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries to dynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse in different rows, it some bit of scala to take a row, build an entry for dynamo and PUT that should be enough.
BTW: Use DynamoDB on demand here, as it handles peak loads well without you having to commit to some SLA.
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create desired dataframe
Use .withColumn to create new column and use psf.collect_list to convert to desired collection/json format, in the new column in the
same dataframe.
Drop all un-necessary (tabular) columns and keep only the JSON format Dataframe columns in Spark.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDD. RDD interface even in Scala is 2-3 times slower than Dataframe API of any language.
Dataframe API's performance is programming language agnostic, as long as you dont use UDF.
For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using the AWS data pipeline to move the data from the database to be in S3, according to this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and giving each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects