my company wants to build a Data warehouse in Redshift. We have an OLTP database running in Amazon Aurora and we are thinking of using the DMS (data migration service). I am trying to get my head around the capabilities of CDC (change data capture). The thing is that CDC (over DMS) replicates and stores changes (in our case in Redshift) and I was wondering if it is possible to select specific columns which I want to store (this should be possible to do with table mapping - include) and based on which I want to store? As far as I understand it, if any columns of a row are updated, then the replication is triggered, which could mean a replication that is useless (e.g. if somebody updates a column that I do not want to follow)
E.g. I have a table with leads, which has some 30 columns. Now I am interested in the DW purposes only in 5 columns and I want to get a new line to the redshift table only if any of those 5 columns changes (is updated)... like if the stage of lead is changed, I will get a new line. On the other hand, I am not interested in the column 'Salesmans_comment' so if the salesman updates a comment, I do not want to have a new line, because I am not interested in it...Cheers!
I have run through most of the available yt tutorials and read through the documentation, but I haven't found a clear answer...
Thanks
I am trying to migrate the database using AWS DMS. Source is Azure SQL server and destination is Redshift. Is there any way to know the rows updated or inserted? We dont have any audit columns in source database.
Redshift doesn’t track changes and you would need to have audit columns to do this at the user level. You may be able to deduce this from Redshift query history and save data input files but this will be solution dependent. Query history can be achieved in a couple of ways but both require some action. The first is to review the query logs but these are only saved for a few days. If you need to look back further than this you need a process to save these tables so the information isn’t lost. The other is to turn on Redshift logging to S3 but this would need to be turned on before you run queries on Redshift. There may be some logging from DMS that could be helpful but I think the bottom line answer is that row level change tracking is not something that is on in Redshift by default.
Hi we want to replicate the data from Mysql(source) to GoogleBigquery(destination) we adopted the method described by google Docs with Data fusion replication pipeline as mentioned in Link
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Berief of what we are doing:
Enabling bin log in MY SQL for CDC(Change data Capture)
creating a replication pipeline in data fusion
starting the pipeline and syncing the data
we are successfully able to create MySql data in comupute engine and enabling bin-log for CDC and provided all necessary permission to user for the data replication pipeline in my SQL
we are successful in creating a data Fusion instance and able to create a replication pipeline
replication pipeline is able to fetch our SQL database details and target Big query is also set
On starting the pipeline it is tracking the Changes successfully (Insert,update and delete ) and table Schema is also created in Bigquery Successfully automatically.
But we are getting PROBLEM that no data is getting transsferred to Bigquery table. In log what i have seen is loading batch of 1 event in to statging Bucket
sharing the screenshot also
able to fetch every change from MYSQL but data is not transferring to bigquery
table schema was created but data is not transferred
loading batch of 1 event in to statging Bucket we are using developer mode and waited for more than 90 mins
The issue might be happening because there may be a schema/data type mismatch with the BigQuery table and the source MYSQL database table on the columns.
For example: if you have a column in source table, in BigQuery this column is of INT64 datatype with a length of 19, while in the source database table, it is Integer type with a length of 10, so you need to update the length of columns as per your datasize.
I am using solutions like cloudera navigator, atlas and Wherehows
to get Hadoop, HDFS, HIVE, SQOOP, MAPREDUCE metadata and lineage.
Now we have a data warehouse in AWS redshift as well. Is there a way to extract metadata or lineage or both information out of redshift.
So far i have not found anything on this.
Is there a way to integrate the same to wherehows as a crawled solution?
I found only one post which gives some information about how to get some information from redshift assuming it will be similar to postgresql. I am sure someone would have written some open source solution to this problem.
Or is it just matter of writing a simple single script to extract this information?
I am looking for a enterprise level solution. I hope someone will point me in right direction.
AWS Glue Data catalog is a fully managed metadata management service.It has AWS Glue crawler which automatically crawls through your source(for you its redshift) and creates a centralized metadata repository which can be accessed by other AWS services.
Refer:
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://aws.amazon.com/glue/
You can access metadata by querying the system tables in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote)
Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.
You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:
https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/
But go get true data lineage, you need to understand how queries relate to your workflows, i.e. for an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations / workflows, vs. looking at the individual query.
This is something we've built into our product - heads up that it's a commercial solution:
https://www.intermix.io/blog/announcing-query-insights/
Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.
Lars
I have a web app that needs to send reports on its usage, I want to use Amazon RedShift as a data warehouse for that purpose,
How should i collect the data ?
Every time, the user interact with my app, i want to report that.. so when should i write the files to S3 ? and how many ?
What i mean is:
- If do not send the info immediately, then I might lose it as a result of a connection lost, or from some bug in my system while its been collected and get ready to be sent to S3...
- If i do write files to S3 on each user interaction, i will end up with hundreds of files (on each file has minimal data), that need to be managed, sorted, deleted after been copied to RedShift.. that dose not seems like a good solution .
What am i missing? Should i use DynamoDB instead, Should i use simple insert into Redshift instead !?
If i do need to write the data to DynamoDB, should i delete the hold table after been copied .. what are the best practices ?
On any case what are the best practices to avoid data duplication in RedShift ?
Appreciate the help!
It is preferred to aggregate event logs before ingesting them into Amazon Redshift.
The benefits are:
You will use the parallel nature of Redshift better; COPY on a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT or COPY of a small file.
You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This is also improve your load performance and reduce the need for VACUUM of your tables.
You can accumulate your events in several places before aggregating and loading them into Redshift:
Local file to S3 - the most common way is to aggregate your logs on the client/server and every x MB or y minutes upload them to S3. There are many log appenders that are supporting this functionality, and you don't need to make any modifications in the code (for example, FluentD or Log4J). This can be done with container configuration only. The down side is that you risk losing some logs and these local log files can be deleted before the upload.
DynamoDB - as #Swami described, DynamoDB is a very good way to accumulate the events.
Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are in order of insertion, which makes it easy to load it later pre-sorted to Redshift. The events are stored in Kinesis for 24 hours, and you can schedule the reading from kinesis and loading to Redshift every hour, for example, for better performance.
Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the high availability of your service (how to handle increased load or server failure) and the cost of the system (you only pay for what you use and you don't need to have underutilized servers just for logs).
See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490
Another important set of tools to allow direct interaction with these services are the various SDKs. For example for Java, .NET, JavaScript, iOS and Android.
Regarding the de-duplication requirement; in most of the options above you can do that in the aggregation phase, for example, when you are reading from a Kinesis stream, you can check that you don't have duplications in your events, but analysing a large buffer of events before putting into the data store.
However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging tables and then SELECT INTO a well organized and sorted table.
Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big long events table, but the majority of your queries are running on a single day (the last day, for example), you can create a set of tables with similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... to each of this tables. When you want to query the data from multiple days, you can use UNION_ALL.
One option to consider is to create time series tables in DynamoDB where you create a table every day or week in DynamoDB to write every user interaction. At the end of the time period (day, hour or week), you can copy the logs on to Redshift.
For more details, on DynamoDB time series table see this pattern: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
and this blog:
http://aws.typepad.com/aws/2012/09/optimizing-provisioned-throughput-in-amazon-dynamodb.html
For Redshift DynamoDB copy: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html
Hope this helps.
Though there is already an accepted answer here, AWS launched a new service called Kinesis Firehose which handles the aggregation according to user defined intervals, a temporary upload to s3 and the upload (SAVE) to redshift, retries and error handling, throughput management,etc...
This is probably the easiest and most reliable way to do so.
You can write data to CSV file on local disk and then run Python/boto/psycopg2 script to load data to Amazon Redshift.
In my CSV_Loader_For_Redshift I do just that:
Compress and load data to S3 using boto Python module and multipart upload.
conn = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
reduced_redundancy=use_rr )
Use psycopg2 COPY command to append data to Redshift table.
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)
Just being a little selfish here and describing exactly what Snowplow ,an event analytics platform does. They use this awesome unique way of collecting event logs from the client and aggregating it on S3.
They use Cloudfront for this. What you can do is, host a pixel in one of the S3 buckets and put that bucket behind a CloudFront distribution as an origin. Enable logs to an S3 bucket for the same CloudFront.
You can send logs as url parameters whenever you call that pixel on your client (similar to google analytics). These logs can then be enriched and added to Redshift database using Copy.
This solves the purpose of aggregation of logs. This setup will handle all of that for you.
You can also look into Piwik which is an open source analytics service and see if you can modify it specific to your needs.