I have the following process that runs nightly:
Grab XML from an FTP server
Transform the XML with a number of XSLTs into XML formatted for MySQL
Load the new XML using the MySQL LOAD XML command
I've been reading about AWS Data Pipeline, and instead of having this process run on an EC2 instance it sounds like Data Pipeline may be suited for it, but I have a couple of questions:
With step 2, the XSLTs use some custom functions. The transforms are currently done with a .NET console app, but I could convert this to Node if there is a way to do it in a Lambda.
Can the pipeline run a LOAD XML command against a database? I assume I'd have to output the XML to an S3 bucket first?
Is AWS Data Pipeline a good idea for this task, or am I heading in the wrong direction?
This is very much possible with AWS Data Pipeline. See the following examples, among many others, in the GitHub repository https://github.com/awslabs/data-pipeline-samples
ShellCommandWithFTP
RedshiftToRDS
You can transform your XML to CSV and use CopyActivity:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
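A rough sketch of what that XML-to-CSV flattening could look like in Python; the <record> element and the column names here are made up for illustration, not the poster's actual schema:

    # Sketch only: flatten each <record> element of the source XML into a CSV row
    # that CopyActivity can then copy into MySQL. Element and column names are
    # placeholders.
    import csv
    import xml.etree.ElementTree as ET

    def xml_to_csv(xml_path, csv_path, fields=("id", "name", "price")):
        root = ET.parse(xml_path).getroot()
        with open(csv_path, "w", newline="") as out:
            writer = csv.writer(out)
            for record in root.iter("record"):
                writer.writerow([record.findtext(f, default="") for f in fields])

    xml_to_csv("transformed.xml", "for_copyactivity.csv")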
What is the best way to replicate data from Oracle Goldengate On premise to AWS (SQL or NOSQL)?
I was just researching this for Azure. My company is looking for solutions for moving data to the cloud, with the following requirements:
Minimal impact on on-prem legacy/3rd-party systems.
No Oracle DB instances on the cloud side.
A minimum of "hops" for the data between source and destination.
PaaS over IaaS solutions.
Out-of-the-box features over native code and in-house development.
Oracle Server 12c or above.
Some custom filtering.
Some custom transformations.
(Filtering can be done in GoldenGate, in NiFi, in Azure mapping data flows, or in ksqlDB.)
The solutions divide into two groups.
If the solution is allowed to touch/read the log files of the Oracle server:
You can use Azure ADF, Azure Synapse, K2View, Apache NiFi, or the Oracle CDC Adapter for Big Data (check versions) to move the data directly to the cloud, buffered by Kafka. Note, however, that the data inside Kafka will be in a special-schema JSON format (see the consumer sketch at the end of this answer).
If you must use the GG trail file as input to your sync/ETL paradigm, you can:
use a custom data provider that translates the trail file into a FlowFile for NiFi (you need to write it yourself; see the small two-star project on GitHub for a direction), or
use the GoldenGate for Big Data project with Kafka over Kafka Connect, which also gives you translated SQL DML and DDL statements and makes the solution much more readable.
Other solutions are corner cases, but I hope this gives you what you needed.
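For a sense of what consuming those change records might look like, here is a sketch using kafka-python; the topic, broker, and field names are placeholders and depend on how the Kafka handler is configured:

    # Sketch only: consume the change records that GoldenGate for Big Data publishes
    # to Kafka. Topic and broker are placeholders; field names vary with the handler
    # configuration, but records typically carry the table name, an operation type,
    # and before/after images of the row.
    import json
    from kafka import KafkaConsumer  # kafka-python

    consumer = KafkaConsumer(
        "ogg-changes",                        # placeholder topic
        bootstrap_servers="broker:9092",      # placeholder broker
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        change = message.value
        print(change.get("table"), change.get("op_type"), change.get("after"))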
In my company's case we have Oracle as the source DB and Snowflake as the target DB. We've built the following processing sequence:
An on-premise OGG Extract works with the on-premise Oracle DB.
A Data Pump sends trails to another host.
On that host, an OGG for Big Data Replicat processes the trails and then sends the result as JSON to an AWS S3 bucket.
Since Snowflake can handle JSON as a source of data and works with S3 buckets, it loads the JSON into staging tables where further processing takes place.
You can read more about this approach here: https://www.snowflake.com/blog/continuous-data-replication-into-snowflake-with-oracle-goldengate/
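For reference, the Snowflake end of that sequence boils down to a COPY INTO from an external stage over the S3 bucket. A minimal sketch using the Snowflake Python connector, with the stage, table, and credentials all as placeholders rather than the actual setup described above:

    # Sketch only: load the OGG-produced JSON files from an S3 external stage into a
    # staging table with a single VARIANT column. All names and credentials below
    # are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",        # placeholder credentials
        user="etl_user",
        password="********",
        warehouse="ETL_WH",
        database="STAGING_DB",
        schema="PUBLIC",
    )

    cur = conn.cursor()
    cur.execute("""
        COPY INTO ogg_staging            -- staging table with one VARIANT column
        FROM @ogg_s3_stage               -- external stage pointing at the S3 bucket
        FILE_FORMAT = (TYPE = 'JSON')
    """)
    cur.close()
    conn.close()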
I have a scenario where we want to trigger a Dataflow pipeline via a Cloud Function, and in the Dataflow pipeline we have to transform some data and insert it into BigQuery.
I created our custom Dataflow pipeline that transforms the data and inserts it into BigQuery (following the standard way of installing Apache Beam and using the deployment command from Cloud Shell). The pipeline ran successfully, and the log shows up in Monitoring with the DAG.
Now what I want to do is trigger the pipeline with a Cloud Function, and from my research I can:
(i) create a custom Flex Template of the pipeline,
(ii) stage it in a Google Cloud Storage bucket,
(iii) call it with the REST API from the Cloud Function (see the sketch below).
Are these steps the recommended way of doing it, or should I try another approach? I can't find any other way apart from classic templates.
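For context, what I have in mind for step (iii) is roughly the sketch below, using the Google API client to call the Dataflow Flex Template launch endpoint; the project, region, template path, and pipeline parameters are placeholders:

    # Sketch only: an HTTP-triggered Cloud Function that launches a staged Flex
    # Template through the Dataflow REST API. Project, region, template path, and
    # pipeline parameters are placeholders.
    from googleapiclient.discovery import build

    def launch_pipeline(request):
        project = "my-project"                                       # placeholder
        location = "us-central1"                                     # placeholder
        template = "gs://my-bucket/templates/transform_to_bq.json"   # placeholder

        dataflow = build("dataflow", "v1b3")
        body = {
            "launchParameter": {
                "jobName": "transform-to-bq",
                "containerSpecGcsPath": template,
                "parameters": {"input": "gs://my-bucket/input/*"},   # placeholder
            }
        }
        return (
            dataflow.projects()
            .locations()
            .flexTemplates()
            .launch(projectId=project, location=location, body=body)
            .execute()
        )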
Is it possible to send files from a mobile application to EC2, where a Python script processes the file and the final product is saved to S3?
Deploy a simple web app on EC2 to receive the data from your mobile app, run the Python script you mentioned on the data, and use the S3 API to save the result there. As for how you're going to deploy that web app, there are tons of ways/languages/technologies; that's fit for another question.
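A minimal sketch of such a web app using Flask and boto3; the bucket name, route, and processing step are placeholders for whatever your script actually does:

    # Sketch only: receive an upload, run the processing step, save the result to S3.
    import boto3
    from flask import Flask, request

    app = Flask(__name__)
    s3 = boto3.client("s3")
    BUCKET = "my-output-bucket"   # placeholder bucket name

    def process(data: bytes) -> bytes:
        # Stand-in for the poster's actual Python processing script.
        return data.upper()

    @app.route("/upload", methods=["POST"])
    def upload():
        incoming = request.files["file"]          # file field sent by the mobile app
        result = process(incoming.read())
        s3.put_object(Bucket=BUCKET, Key=f"processed/{incoming.filename}", Body=result)
        return {"status": "ok"}, 200

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)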
I am new to GAE and I am trying to quickly find a way to retrieve logs from Datastore, clean them to my specs, and then save them to a table to be called on later for a reports view in my app. I was thinking of using Google Cloud Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could create a Python script with BigQuery and schedule it through cron, but then I would have to contend with errors, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions are always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly you'll have to use a third-party mechanism such as a cron job. Note that Dataflow performs automatic scaling and retries to handle errors, so you are not expected to address these complexities manually. (A rough pipeline sketch follows the references below.)
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
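A rough sketch of the shape such a pipeline could take; this uses the v1new Datastore module found in more recent Beam releases, and the project, kind, schema, and table names are placeholders:

    # Sketch only: read Datastore entities, reshape them, write rows to BigQuery.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
    from apache_beam.io.gcp.datastore.v1new.types import Query

    PROJECT = "my-gcp-project"  # placeholder

    def entity_to_row(entity):
        # Flatten the Datastore entity into a dict matching the BigQuery schema;
        # the property names here are placeholders.
        props = entity.properties
        return {"timestamp": props.get("timestamp"), "message": props.get("message")}

    # Add runner/region options to the PipelineOptions when submitting to Dataflow.
    with beam.Pipeline(options=PipelineOptions(project=PROJECT)) as p:
        (p
         | "ReadLogs" >> ReadFromDatastore(Query(kind="LogEntry", project=PROJECT))
         | "CleanUp" >> beam.Map(entity_to_row)
         | "WriteReportTable" >> beam.io.WriteToBigQuery(
               f"{PROJECT}:reports.cleaned_logs",
               schema="timestamp:TIMESTAMP,message:STRING",
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))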
I am aware of Flume and Kafka, but those are event-driven tools. I don't need this to be event-driven or real-time; I may just schedule the import once a day.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase either, just HDFS and Hive.
I have used the R language for this for quite a while, but I am looking for a more robust, perhaps Hadoop-native, solution.
Look into using Scala or Python for this. There are a couple of ways to approach pulling data from an API into HDFS. The first approach is to write a script which runs on your edge node (essentially just a Linux server), pulls data from the API, and lands it in a directory on the Linux file system. The script can then use HDFS file system commands to put the data into HDFS.
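A rough sketch of that first approach in Python, with the API endpoint and paths as placeholders:

    # Sketch only: pull from an API on the edge node, land the file locally, then
    # push it into HDFS with the standard `hdfs dfs -put` command.
    import subprocess
    import requests

    API_URL = "https://api.example.com/export"   # placeholder endpoint
    LOCAL_PATH = "/data/landing/export.json"     # local (edge node) landing path
    HDFS_PATH = "/user/etl/raw/export.json"      # target HDFS path

    resp = requests.get(API_URL, timeout=300)
    resp.raise_for_status()

    with open(LOCAL_PATH, "wb") as f:
        f.write(resp.content)

    # -f overwrites any file left over from a previous run
    subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_PATH, HDFS_PATH], check=True)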
The second approach is to use Scala or Python with Spark to call the API and load the data directly into HDFS using a spark-submit job. Again, this script would be run from an edge node; it just uses Spark to bypass having to land the data on the local file system.
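And a sketch of the second approach with PySpark, again with placeholder names; for an API that can be parallelized, the per-ID calls would move into the executors, but the overall shape is:

    # Sketch only: call the API, parallelize the records with Spark, and write them
    # straight to HDFS. Submit with something like: spark-submit api_to_hdfs.py
    import requests
    from pyspark.sql import SparkSession

    API_URL = "https://api.example.com/accounts"          # placeholder endpoint
    HDFS_OUT = "hdfs:///user/etl/raw/accounts_parquet"    # placeholder output path

    spark = SparkSession.builder.appName("api_to_hdfs").getOrCreate()

    # Here the whole payload is fetched once on the driver for brevity; a
    # parallelizable API could instead be called per ID inside mapPartitions.
    records = requests.get(API_URL, timeout=300).json()   # expects a list of dicts

    df = spark.createDataFrame(records)
    df.write.mode("overwrite").parquet(HDFS_OUT)

    spark.stop()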
The first option is easier to implement. The second option is worth looking into if you have huge data volumes or an API that could be parallelized by making calls for multiple IDs/accounts at once.