How to replicate Salesforce data to AWS S3/RDS?

I am trying to build a data warehouse using Redshift in AWS. I want to preprocess Salesforce data by moving it to RDS or S3 (using them as a staging area) before finally loading it into Redshift. I am trying to find out the different ways I can replicate Salesforce data into S3/RDS for this purpose. I have seen many third-party tools that can do this, but I am looking for something that can be built in-house. I would like to use this data for dimensional modeling.
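For context, the kind of in-house pipeline I have in mind would pull records through the Salesforce REST API and land them in S3 as flat files. A rough sketch, assuming the simple-salesforce and boto3 Python libraries; the credentials, object fields, bucket, and key below are all placeholders:

    import csv, io
    import boto3
    from simple_salesforce import Salesforce

    # Authenticate to Salesforce (credentials are placeholders).
    sf = Salesforce(username="user@example.com", password="secret",
                    security_token="token")

    # Pull a Salesforce object via SOQL; query_all pages through all records.
    records = sf.query_all("SELECT Id, Name, Industry FROM Account")["records"]

    # Write the records to an in-memory CSV.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["Id", "Name", "Industry"])
    writer.writeheader()
    for rec in records:
        writer.writerow({k: rec.get(k) for k in ["Id", "Name", "Industry"]})

    # Land the extract in the S3 staging bucket (bucket and key are placeholders).
    boto3.client("s3").put_object(Bucket="my-staging-bucket",
                                  Key="salesforce/account/extract.csv",
                                  Body=buf.getvalue().encode("utf-8"))

From that staging bucket, a Redshift COPY from S3 would presumably handle the final load, but I am open to other patterns.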
Thanks for your help!

Related

How to share large data sets with third-party data consumers/services?

Let's assume I have a client who has a plethora of data related to railways (signals, tracks, train timings, hazards, offers, etc.). There are various internal departments in railways that want that data, just like various weather websites get data from the weather department and show it on their sites. My requirement is similar: I want to share the data securely with other departments and services, and I am looking for the best way to make it available to them as quickly as possible once it exists.
Possible Solution
API based: Create an API for each department and share the data with them via the API (a rough sketch follows after this question). This has its own pros and cons. It is what first came to my mind, but we would have to create a lot of APIs, so I was looking at whether Azure or AWS has any other service which can do the same.
Azure based solution: Can Azure and the services it provides help here? Can Service Bus, Event Grid, Event Hubs, etc. be of any use?
AWS based solution: Is there any service in AWS which can help here? I don't have much exposure to AWS.
Any other solution?
I have a fair idea that this could be built using APIs, but I am looking at whether I can get this done using a cloud platform like Azure or AWS. That would help with better integration of the product and with scaling.
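To make the API option above concrete, the sort of thing I have in mind is a small AWS Lambda behind API Gateway that hands out a time-limited, pre-signed S3 URL per dataset. A rough sketch in Python; the bucket name, key layout, and dataset names are all hypothetical, and the event shape assumes an API Gateway proxy integration with a /datasets/{name} route:

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "railway-data-share"   # hypothetical bucket holding the published data sets

    def handler(event, context):
        # API Gateway proxy integration: /datasets/{name} -> pre-signed download URL.
        dataset = event["pathParameters"]["name"]
        key = f"published/{dataset}/latest.csv"   # hypothetical key layout
        url = s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": key},
            ExpiresIn=3600,   # link is valid for one hour
        )
        return {"statusCode": 200,
                "body": json.dumps({"dataset": dataset, "download_url": url})}

Authentication and per-department authorization would sit in front of this (API Gateway API keys, IAM, or Cognito), and the large files themselves are served directly by S3 rather than streamed through the API.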

Automating CSV analysis?

My e-commerce company generates lots of CSV data. To track order status, the team must download a number of trackers; creating the relationships between them and then analysing the data is a time-consuming process. Which AWS low-code solution can be used to automate this workflow?
Depending on what 'workflow' you require, a few options are:
Amazon Honeycode, which is a low-code application builder
You can filter and retrieve data using Amazon S3 Select, which works on individual CSV files. This can be scripted via the AWS CLI or an AWS SDK.
If you want to run SQL and create JOINs between multiple files, then Amazon Athena is fantastic. This, too, can be scripted (a sketch follows below).
While not exactly low-code, Amazon Athena uses SQL queries to analyze CSV files, among many other formats.
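For example, scripting an Athena query with boto3 might look like the sketch below; the database, table, and output bucket names are placeholders:

    import time
    import boto3

    athena = boto3.client("athena")

    # Run a query that joins two CSV-backed tables (database/table names are placeholders).
    start = athena.start_query_execution(
        QueryString=(
            "SELECT o.order_id, o.status, t.carrier, t.last_update "
            "FROM orders o JOIN trackers t ON o.order_id = t.order_id"
        ),
        QueryExecutionContext={"Database": "ecommerce"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/csv-analysis/"},
    )
    query_id = start["QueryExecutionId"]

    # Wait for the query to finish.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    # Print the first page of results; the full output also lands at the S3 location above.
    if state == "SUCCEEDED":
        results = athena.get_query_results(QueryExecutionId=query_id)
        for row in results["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])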

Google Merchant Center - retrieve the BestSellers_TopProducts_ report without the BigQuery Data Transfer service?

I'm trying to find a way to retrieve a specific Google Merchant Center report (BestSellers_TopProducts_) and upload it to BigQuery as part of a specific ETL process we're developing for a customer we have at my workplace.
So far, I know you can set up the BigQuery Data Transfer service so it automates the process of downloading this report, but I was wondering if I could accomplish the same with Python and some of Google's API libraries (like python-google-shopping). Then again, I may be overcomplicating it, and setting up the service may be the way to go.
Is there a way to accomplish this rather than resorting to the aforementioned service?
On the other hand, assuming the BigQuery Data Transfer service is the way to go, I see in the examples that you need to create and provide the dataset the report data will be extracted into, so I guess the extraction is limited to the GCP project you're working with.
I mean... you can't extract the report data for a third party even if you had the proper service account credentials, right?
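If the transfer service does turn out to be the way to go, the setup itself can at least be scripted from Python with the google-cloud-bigquery-datatransfer client. A rough sketch; the project, dataset, data_source_id value, and params keys below are assumptions I have not verified against the Merchant Center transfer documentation:

    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    parent = client.common_project_path("my-gcp-project")   # placeholder project

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="merchant_center",   # dataset must already exist
        display_name="Merchant Center best sellers",
        data_source_id="merchant_center",            # assumed data source id
        params={"merchant_id": "1234567890"},        # assumed parameter name
    )

    created = client.create_transfer_config(
        parent=parent,
        transfer_config=transfer_config,
    )
    print("Created transfer:", created.name)

The report data would then land in the destination dataset, from where a normal BigQuery query can pick it up for the rest of the ETL.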

Sending Data From Elasticsearch to AWS Databases in Real Time

I know this is a very different use case for Elasticsearch and I need your help.
Main structure (can't be changed):
There are some physical machines with sensors on them. Data from these sensors goes to AWS Greengrass.
Then, via a Lambda function, the data is sent to Elasticsearch using MQTT. Elasticsearch is running in Docker.
This is the structure, and up to this point everything is ready and running ✅
Now, on top of ES, I need some software that can send this data over MQTT to a cloud database, for example DynamoDB.
But this is not a one-time migration; it should send the data continuously. Basically, I need a channel between ES and AWS DynamoDB.
Also, the sensors produce a lot of data and we don't want to store all of it in the cloud, only in ES. Some filtering is needed on the Elasticsearch side before we send data to the cloud, like "save every 10th record to the cloud", so we only keep 1 record out of 10.
Do you have any idea how this can be done? I have no experience in this field and it looks like a challenging task. I would love to get some suggestions from people experienced in these areas.
Thanks a lot! 🙌😊
I haven't worked on a similar use case, but you can try looking into Logstash for this.
It's an open-source service, part of the ELK stack, and provides the option of filtering the output. The pipeline will look something like the one below:
data ----> ES ----> Logstash -----> DynamoDB or any other destination.
It supports various plugins required for your use case, like:
DynamoDB output plugin: https://github.com/tellapart/logstash-output-dynamodb
Logstash MQTT output plugin: https://github.com/kompa3/logstash-output-mqtt
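If the Logstash plugins don't work out, another option is a small script that acts as the channel itself, polling ES and forwarding a sample of the documents to DynamoDB. A rough Python sketch, assuming the elasticsearch and boto3 clients; the sensor-data index, its field names, and the SensorData table are all hypothetical:

    import time
    import boto3
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")              # ES running in Docker
    table = boto3.resource("dynamodb").Table("SensorData")   # hypothetical table

    last_seen = "1970-01-01T00:00:00"
    counter = 0

    while True:
        # Fetch documents newer than the last one processed, oldest first.
        resp = es.search(
            index="sensor-data",
            body={
                "query": {"range": {"timestamp": {"gt": last_seen}}},
                "sort": [{"timestamp": "asc"}],
                "size": 1000,
            },
        )
        for hit in resp["hits"]["hits"]:
            doc = hit["_source"]
            last_seen = doc["timestamp"]
            counter += 1
            # "Save every 10th record to the cloud": forward 1 document out of 10.
            if counter % 10 == 0:
                table.put_item(Item={"sensor_id": doc["sensor_id"],
                                     "timestamp": doc["timestamp"],
                                     "value": str(doc["value"])})
        time.sleep(5)   # poll interval

If MQTT is a hard requirement for the transport, the same script could publish to an MQTT topic instead, with a separate subscriber doing the DynamoDB writes.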

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're describing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page).
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
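For a sense of what that looks like, a minimal batch pipeline in the Beam Python SDK might resemble the sketch below: it reads a BigQuery query result and writes each row to an on-premise database through a custom DoFn. The on-premise database is assumed here to be PostgreSQL, and the project, query, table, and connection details are all placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteToOnPremDB(beam.DoFn):
        """Writes each BigQuery row (a dict) to an on-premise PostgreSQL table."""

        def setup(self):
            import psycopg2  # assumes PostgreSQL; the package must be installed on the workers
            self.conn = psycopg2.connect(
                host="onprem-db.example.internal", dbname="warehouse",
                user="loader", password="secret")
            self.conn.autocommit = True

        def process(self, row):
            with self.conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO orders_extract (order_id, amount) VALUES (%s, %s)",
                    (row["order_id"], row["amount"]))

        def teardown(self):
            self.conn.close()

    def run():
        opts = PipelineOptions(
            runner="DataflowRunner", project="my-gcp-project",
            region="us-central1", temp_location="gs://my-bucket/tmp")
        with beam.Pipeline(options=opts) as p:
            (p
             | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                   query="SELECT order_id, amount FROM `my-gcp-project.sales.orders`",
                   use_standard_sql=True)
             | "WriteToOnPrem" >> beam.ParDo(WriteToOnPremDB()))

    if __name__ == "__main__":
        run()

The main practical requirement is network connectivity from the Dataflow workers back to the on-premise database, typically via VPN or Interconnect.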