I have a social network website and I store all its media in S3. I'm planning to use AWS for S3 + Lambda and GCP for GCE and Cloud SQL. What are the cons of splitting things this way? For example, what about bandwidth between GCP and S3 (since they're not on the same network)?
Thanks.
Using both providers together can make sense when you're leveraging one provider's strengths, or for redundancy / disaster recovery. You might also find that one provider's pricing model suits your use case better. The tradeoff is inconvenience: extra glue code to manage interoperability, learning two sets of APIs and libraries, possibly added latency, and egress charges on traffic that crosses between the clouds.
A few use-cases I've seen personally:
Backing up S3 buckets to Cloud Storage in the COLDLINE storage class via the Storage Transfer Service (its transfer jobs can pull directly from S3); the goal is to protect code and data backups against worst-case S3 data loss or an AWS account compromise. (A minimal hand-rolled copy is sketched after this list.)
Using BigQuery to analyze logs that were pre-processed in AWS EMR and synced into Cloud Storage; depending on your workload, BigQuery can cost a lot less than running a Redshift cluster.
I've also heard arguments that Google's ML pipelines are superior in some domains, so this might be a common crossover case.
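To give a concrete sense of the interoperability code involved, here is a minimal sketch of copying a single object from S3 into Cloud Storage with the two official Python SDKs (the bucket and key names are hypothetical):

```python
import boto3
from google.cloud import storage  # pip install boto3 google-cloud-storage

# Read the object out of S3 (credentials come from the usual AWS config chain).
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-media-bucket", Key="avatars/user42.png")["Body"].read()

# Write it into Cloud Storage (credentials via Application Default Credentials).
gcs = storage.Client()
gcs.bucket("my-media-backup").blob("avatars/user42.png").upload_from_string(body)
```

Note that a copy like this streams the object through whatever machine runs the code, so you pay S3 egress either way; for bulk backups the Storage Transfer Service saves you from writing and babysitting this loop yourself.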
Given that the bulk of your infrastructure is already in Google, have you considered using Cloud Functions and Cloud Storage instead of Lambda and S3?
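For what it's worth, the Cloud Functions equivalent of an S3-triggered Lambda is small. A minimal sketch of a 1st-gen background function that fires on each upload (the function and bucket names are hypothetical):

```python
# main.py: a background Cloud Function triggered by uploads to a bucket.
def handle_media_upload(event, context):
    # `event` is the Cloud Storage object payload; `context` carries event metadata.
    print(f"New object gs://{event['bucket']}/{event['name']} (event {context.event_id})")
```

It would be deployed with something like `gcloud functions deploy handle_media_upload --runtime python310 --trigger-bucket my-media-bucket` (flags from memory, so double-check them).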
I deploy my app to AWS.
On AWS there is RDS, which supports industry-standard DBMSes like PostgreSQL, MySQL, and Oracle.
These DBMSes can be made available on a development machine (e.g. in Docker) as well, making it easy to achieve dev/prod parity.
I'm looking for a specialized time-series database with which I can achieve dev/prod parity too.
AWS has Timestream, which is specialized for time series, but I don't know of a local equivalent database for it.
There are probably some EC2-hosted options, but I'd prefer to be lazy and have Amazon manage the database cluster for me.
What options do I have?
Apache Druid is a very good time-series database that can be deployed easily both on local development environments and on multiple cloud environments.
Druid is offered as a fully managed cloud service, on AWS, by Imply.
The fully managed variant of Druid is called Imply Cloud.
More information: https://imply.io/product/imply-cloud
You should try Amazon Timestream
It is a nonrelational, fully managed service built specifically to collect, store, and process time-series data. The arrival of masses of IoT data is expected to push time-series technology into wider use, which is why Amazon built Timestream.
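To give a feel for the API, here is a minimal sketch of writing one data point with boto3 (the database and table names are hypothetical and assumed to already exist):

```python
import time
import boto3

client = boto3.client("timestream-write")

client.write_records(
    DatabaseName="metrics",  # hypothetical, created beforehand
    TableName="cpu",         # hypothetical, created beforehand
    Records=[{
        "Dimensions": [{"Name": "host", "Value": "web-1"}],
        "MeasureName": "cpu_utilization",
        "MeasureValue": "87.5",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),  # milliseconds since the epoch
    }],
)
```

As far as I know there is still no official local Timestream emulator, though, so for dev/prod parity you would have to stub this client in tests or point your development environment at a small real Timestream table.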
I'm executing a Flink job with these tools.
I think both can do exactly the same thing with the proper configuration. Does Kinesis Data Analytics do something that EMR cannot, or vice versa?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time.
Amazon Elastic Map Reduce provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR.
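For example, spinning up an EMR cluster with Flink installed is a single boto3 call. A rough sketch (the cluster name, instance types, counts, and release label are illustrative):

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="flink-cluster",              # hypothetical cluster name
    ReleaseLabel="emr-6.9.0",          # pick a release bundling the Flink version you need
    Applications=[{"Name": "Flink"}],  # ask EMR to install Flink on the cluster
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after setup
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
)
print(response["JobFlowId"])
```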
The major difference is maintainability and management from your side.
If you want more independent management and more control, then I would say go for AWS EMR, where it's your responsibility to manage the EMR infrastructure as well as the Apache Flink cluster running on it.
But if you want less control and more focus on application development, and you need to deliver faster (tight deadlines), then KDA is the way to go. Here AWS provides all the bells and whistles you need to run your application. It also sets up easily with Amazon S3 as the code source (see the sketch below) and provides bare-minimum configuration management through the UI.
It scales automatically as well (though you need to understand KPUs, the units it scales and bills by).
It provides the same Flink dashboard, where you can monitor your application, plus AWS CloudWatch integration for debugging it.
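For reference, creating a KDA-for-Flink application from a jar in S3 looks roughly like this (the names, ARNs, and runtime label are placeholders):

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

kda.create_application(
    ApplicationName="my-flink-app",   # hypothetical application name
    RuntimeEnvironment="FLINK-1_15",  # pick the runtime matching your Flink version
    ServiceExecutionRole="arn:aws:iam::123456789012:role/kda-role",  # placeholder role
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::my-code-bucket",  # placeholder bucket
                    "FileKey": "jobs/my-flink-job.jar",
                }
            },
            "CodeContentType": "ZIPFILE",  # jar files are uploaded as ZIPFILE content
        }
    },
)
```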
Please go through this nice presentation and let me know if it helps:
https://www.youtube.com/watch?v=c_LswkrwOvk
One major difference between the two is that Kinesis does not provide a hosted Hadoop service, unlike Elastic MapReduce (now EMR).
I had this same question. This video was helpful in explaining it with a real architecture scenario, and AWS's explanation tries to show how Kinesis and EMR can fit together, with possible use cases.
I want to build some neural-network models for NLP and recommendation applications. The framework I want to use is TensorFlow. I plan to train these models and make predictions on Amazon Web Services. The application will most likely involve distributed computing.
I am wondering what are the pros and cons of SageMaker and EMR for TensorFlow applications?
They both have TensorFlow integrated.
In general terms, they serve different purposes.
EMR is for when you need to process massive amounts of data and rely heavily on Spark, Hadoop, and MapReduce (EMR = Elastic MapReduce). Essentially, if your data volume is large enough to exploit the efficiencies of the Spark, Hadoop, Hive, HDFS, HBase, and Pig stack, then go with EMR.
EMR Pros:
Generally low cost compared to running the equivalent stack yourself on EC2 instances
As the name suggests, it's elastic: you can provision what you need when you need it
Hive, Pig, and HBase out of the box
EMR Cons:
You need a very specific use case to truly benefit from all the offerings in EMR. Most users don't take advantage of its entire offering.
SageMaker is an attempt to make machine learning easier and distributed. The marketplace provides out-of-the-box algorithms and models for quick use. It's a great service if you conform to the workflows it enforces: creating training jobs, deploying inference endpoints, and so on (sketched after the pros and cons below).
SageMaker Pros:
Easy to get up and running with Notebooks
Rich marketplace to quickly try existing models
Many different example notebooks for popular algorithms
Predefined kernels that minimize configuration
Easy to deploy models
Allows you to distribute inference compute by deploying endpoints
SageMaker Cons:
Expensive!
Enforces a certain workflow making it hard to be fully custom
Expensive!
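To illustrate the enforced workflow, here is a rough sketch of training and deploying with the SageMaker Python SDK's TensorFlow estimator (the role ARN, script, versions, and instance types are placeholders):

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",  # your TensorFlow training script (hypothetical)
    role="arn:aws:iam::123456789012:role/sagemaker-role",  # placeholder execution role
    instance_count=2,          # distributed training across two instances
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",  # use a version SageMaker actually supports
    py_version="py39",
)
estimator.fit("s3://my-bucket/training-data/")  # launches a managed training job

# Deploy the trained model behind a managed HTTPS inference endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```

Both the training instances and the endpoint bill for as long as they run, hence the cost warning above; remember predictor.delete_endpoint() when you're done experimenting.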
From AWS documentation:
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
(...) Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.
Conclusion:
If you want to deploy AI models, just use AWS SageMaker.
Let's say a company has an application with a database hosted on AWS and also has a read replica on AWS. Then that same company wants to build out a data analytics infrastructure in Google Cloud -- to take advantage of data analysis and ML services in Google Cloud.
Is it necessary to create an additional read replica within the Google Cloud context? If not, is there an alternative strategy that is frequently used in this context to bridge the two cloud services?
While services like Amazon Relational Database Service (RDS) provide read-replica capabilities, replication is only supported between managed database instances on AWS.
If you are replicating a database between providers, then you are probably running the database yourself on virtual machines rather than using a managed service. This means the databases appear just like any other resource on the Internet, so you can connect them exactly the way you would connect two resources across the Internet. However, you would be responsible for managing, monitoring, deploying, etc., which takes away much of the benefit of using cloud services.
Replicating between storage services like Amazon S3 would be easier, since it is just raw data rather than a running database. Also, big data is normally stored in raw format rather than being loaded into a database.
If the existing infrastructure is on a cloud provider, then try to perform the remaining activities on the same cloud provider.
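That said, one bridge pattern I've seen (an assumption about your setup, not the only option): periodically export from the AWS-hosted database to flat files, move them into Cloud Storage, and load them into BigQuery for the analytics and ML work, rather than maintaining a live cross-cloud replica. The final load step is small; a sketch with hypothetical bucket, path, and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://analytics-staging/orders/latest.csv",  # hypothetical exported file
    "my-project.analytics.orders",               # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load completes
```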
This is not a duplicate question. I am just confused about IaaS and SaaS with respect to AWS services like DynamoDB, RDS, Redshift, Kinesis, etc. They help users create databases, so should we categorize them as IaaS or SaaS?
Thanks
To help you understand: SaaS is Software as a Service. It's more like an on-demand application where you don't have to worry about configuration, access, whitelisting, etc. For instance, Google Maps (or Google Apps).
IaaS, or Infrastructure as a Service, gives you more flexibility in terms of spawning nodes and clusters, dealing with security at the IP and port level, managing access control and authentication, etc. On AWS, you can specify which private or public IPs will have access to your system, whether you prefer dense-storage or dense-compute nodes for your warehouse, how to rotate your log files, and so on.
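As a concrete example of that IaaS-level control, restricting which IPs can reach a database port is something you wire up yourself, e.g. with boto3 (the security group ID and CIDR block are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Allow only one office CIDR block to reach PostgreSQL behind this security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "office network"}],
    }],
)
```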
A page on Amazon RDS reads:
"When you buy a server, you get CPU, memory, storage, and IOPS, all bundled together. With Amazon RDS, these are split apart so that you can scale them independently."
So, in short: services like AWS and Azure are mostly now either IaaS or PaaS.