I have some configuration data I want to read from a Spring Boot application. The configuration can grow larger and larger, so I don't want to keep it in a YAML file inside the project. I was wondering what the cheapest way to store this data in AWS would be. Something like DynamoDB, or do you have better and cheaper ideas?
DynamoDB would be a good solution: it lets you quickly retrieve and edit the config set. DynamoDB also has a free tier of 25 RCU/WCU and 25 GB of storage per month, which I assume your configuration will fall under, making it effectively free.
The only limitation with DynamoDB is the 400 KB maximum item size. However, you can split the configuration into an "item collection", where all items share the same partition key and the sort key identifies the sub-components.
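A minimal sketch of that item-collection pattern, assuming the AWS SDK for Java v2 and hypothetical table and attribute names (ConfigTable, configName, section, payload):

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

import java.util.HashMap;
import java.util.Map;

public class DynamoConfigLoader {

    // Hypothetical table layout: partition key "configName", sort key "section".
    private static final String TABLE = "ConfigTable";

    public static Map<String, String> loadConfig(DynamoDbClient ddb, String configName) {
        // Query the whole item collection for one logical configuration,
        // i.e. every item sharing the same partition key.
        QueryRequest request = QueryRequest.builder()
                .tableName(TABLE)
                .keyConditionExpression("configName = :c")
                .expressionAttributeValues(
                        Map.of(":c", AttributeValue.builder().s(configName).build()))
                .build();

        QueryResponse response = ddb.query(request);

        // Reassemble the sub-components (one item per "section") into a single map.
        Map<String, String> config = new HashMap<>();
        response.items().forEach(item ->
                config.put(item.get("section").s(), item.get("payload").s()));
        return config;
    }

    public static void main(String[] args) {
        try (DynamoDbClient ddb = DynamoDbClient.create()) {
            System.out.println(loadConfig(ddb, "app-config"));
        }
    }
}
```

Each sub-item stays well under the 400 KB limit while the query still returns the full configuration in one round trip.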
AWS AppConfig is the purpose-built service for application configuration, but it is not cheaper than using DynamoDB's free tier, so the trade-off comes down to what you need from your configuration data.
Ignoring the "cheapest" part of the question, AWS AppConfig is the right tool for storing configuration, as it comes with many useful features out of the box (gradual deployments, rollbacks, validation, etc.). You can find pricing information here. Note that the total price will also depend on how often you refresh the configuration (i.e. your cache invalidation time). Other tools, like S3, might be cheaper, but are not as well suited to the configuration use case.
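For context, a rough sketch of reading configuration through the AppConfig Data API with the AWS SDK for Java v2; the application, environment, and profile identifiers below are placeholders, and polling/caching strategy is up to you:

```java
import software.amazon.awssdk.services.appconfigdata.AppConfigDataClient;
import software.amazon.awssdk.services.appconfigdata.model.GetLatestConfigurationRequest;
import software.amazon.awssdk.services.appconfigdata.model.GetLatestConfigurationResponse;
import software.amazon.awssdk.services.appconfigdata.model.StartConfigurationSessionRequest;

public class AppConfigLoader {

    public static String fetchConfiguration() {
        try (AppConfigDataClient client = AppConfigDataClient.create()) {
            // Start a configuration session; the identifiers below are placeholders.
            String token = client.startConfigurationSession(
                    StartConfigurationSessionRequest.builder()
                            .applicationIdentifier("my-app")
                            .environmentIdentifier("prod")
                            .configurationProfileIdentifier("my-profile")
                            .build())
                    .initialConfigurationToken();

            // Fetch the latest deployed configuration. A real service would poll
            // periodically with nextPollConfigurationToken() and cache the result,
            // which is what drives the cost mentioned above.
            GetLatestConfigurationResponse response = client.getLatestConfiguration(
                    GetLatestConfigurationRequest.builder()
                            .configurationToken(token)
                            .build());

            return response.configuration().asUtf8String();
        }
    }
}
```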
what would be the cheapest way to save this data in AWS
The cheapest way to store data in AWS is S3 Glacier Deep Archive (or the S3 Intelligent-Tiering Deep Archive Access tier) at roughly $0.00099 per GB-month.
Reference: https://aws.amazon.com/s3/pricing/
I wish to record time-series data at a very high frequency. I am wondering if there is an elegant serverless solution that allows me to store, and react to, real-time data.
I want to use stored data to create statistical models, and then I want to process new data in real-time based on those models.
AWS Kinesis Streams seems to fit the bill; however, I am unsure whether it is only for reacting in real time, or whether it also retains historical data that I could use offline to build models.
Google Dataflow and Pub/Sub also seem relevant, but I am not sure whether they would be appropriate for the above.
If you go with AWS, you can use Kinesis and EMR to achieve your goal. First, create a delivery stream in the fully managed Kinesis Firehose and route it to S3 or Redshift to collect historical data.
Once your data is in S3, you can do the statistical analysis by pointing an EMR job at the S3 bucket to process fresh data as it arrives. Read this article for more information.
On EMR's managed Hadoop framework you can set up open-source R and RStudio for statistical analysis if you wish. Here is a guide on that.
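A hedged sketch of the ingest side, assuming the AWS SDK for Java v2 and a Firehose delivery stream (the stream name and record format are placeholders) already configured to deliver to S3 or Redshift:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class TimeSeriesIngest {

    public static void main(String[] args) {
        try (FirehoseClient firehose = FirehoseClient.create()) {
            // One newline-delimited measurement per record; Firehose buffers records
            // and delivers batches to the configured S3 bucket (or Redshift table).
            String measurement = "sensor-42,2016-01-01T00:00:00Z,23.7\n";

            firehose.putRecord(PutRecordRequest.builder()
                    .deliveryStreamName("timeseries-to-s3")   // placeholder stream name
                    .record(Record.builder()
                            .data(SdkBytes.fromUtf8String(measurement))
                            .build())
                    .build());
        }
    }
}
```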
We accomplished this using Kinesis with Apache Flink. Flink is a very scalable solution.
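For reference, a minimal sketch of consuming a Kinesis stream from Flink, assuming the flink-connector-kinesis dependency and a placeholder stream name and region:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

import java.util.Properties;

public class KinesisFlinkJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");

        // Read raw records from the (placeholder) Kinesis stream as strings.
        DataStream<String> events = env.addSource(
                new FlinkKinesisConsumer<>("timeseries-stream",
                        new SimpleStringSchema(), consumerConfig));

        // Real-time scoring/aggregation against your models would go here;
        // this sketch just prints the incoming events.
        events.print();

        env.execute("kinesis-flink-example");
    }
}
```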
Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:
EC2
EMR + S3
The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.
What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled into HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?
What comforts me is that I'll only be pulling the data the first time, and all subsequent jobs will have the intermediate results locally.
I'm hoping for an answer from someone with practical experience of this. Thank you.
EMR does not pull data from S3 into HDFS. It uses its own S3-backed filesystem implementation, EMRFS, so jobs operate on S3 as if it were HDFS: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
As for data locality, S3 is treated as RACK_LOCAL by EMR Spark clusters.
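To illustrate, on EMR a Spark job simply addresses S3 paths through EMRFS; a minimal sketch (bucket names and paths are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EmrfsExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("emrfs-example")
                .getOrCreate();

        // On EMR, s3:// URIs resolve through EMRFS, so no explicit copy into HDFS is needed.
        Dataset<Row> raw = spark.read().parquet("s3://my-bucket/raw/events/");

        // Intermediate results can still be written to HDFS if a later stage
        // benefits from node-local reads.
        raw.filter("event_type = 'click'")
           .write()
           .parquet("hdfs:///intermediate/clicks/");

        spark.stop();
    }
}
```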
According to the source below, EMR+S3 with EMRFS does not maintain data locality and is not well suited to analytics processing with SQL-based tools. Redshift is the better choice for such use cases, where compute and data sit together. See 39:00 to 42:00 in the following talk:
https://youtu.be/08G9NfDETVE
This is also mentioned in https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html; see the "performance per dollar" section.
To see how EMR works with S3, refer to the book Programming Elastic MapReduce by Kevin Schmidt and Christopher Phillips (Chapter 1, the "Amazon Elastic MapReduce Versus Traditional Hadoop Installs" section).
We are researching creating a data lake solution on AWS, similar to what's outlined here: https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
We will be storing all the "raw" data in S3 and load it into EMR or Redshift as needed.
At this stage, I am looking for suggestions on whether to use the ETL or the ELT approach for loading data into Amazon Redshift. We will be using Talend for ETL/ELT.
Should we stage the "raw" data from S3 in Redshift first before transforming it or should we transform the data in S3 and load it into Redshift?
I would appreciate any suggestions/advice.
Thank you.
Definitely ELT.
The only case where ETL may be better is if you are simply taking one pass over your raw data, then using COPY to load it into Redshift, and then doing nothing transformational with it. Even then, because you'll be shifting data in and out of S3, I doubt this use case will be faster.
As soon as you need to filter, join, and otherwise transform information, it is much faster to do it in the DBMS. If you hit a case where the data transformation relies on data that is already in the DW, it will be orders of magnitude faster.
We run hundreds of ELT jobs a day on different DW platforms, performance testing alternative methods of ingesting and transforming data. In our experience the difference between ETL and ELT in an MPP DW can be 2000+ percent.
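A simplified sketch of that ELT pattern against Redshift over JDBC; the connection details, table names, and IAM role ARN are placeholders, and the Redshift JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftElt {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev";

        try (Connection conn = DriverManager.getConnection(url, "etl_user", "password");
             Statement stmt = conn.createStatement()) {

            // 1. Load (the "L"): bulk-copy raw files from S3 into a staging table.
            stmt.execute(
                "COPY staging.raw_events " +
                "FROM 's3://my-bucket/raw/events/' " +
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' " +
                "FORMAT AS CSV");

            // 2. Transform (the "T"): filter, join, and aggregate inside the
            //    warehouse, where the MPP engine does the heavy lifting.
            stmt.execute(
                "INSERT INTO analytics.daily_clicks " +
                "SELECT trunc(event_ts) AS day, count(*) " +
                "FROM staging.raw_events " +
                "WHERE event_type = 'click' " +
                "GROUP BY trunc(event_ts)");
        }
    }
}
```

The same two steps are what a Talend ELT job would generate under the hood: a COPY into staging followed by SQL pushed down to Redshift.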
It depends on the purpose of having Redshift. If your business case is for users to query the data in Redshift (or for a front-end application using Redshift as its backend), then I would not recommend doing ETL in Redshift. In that case, it is better to perform your business transformations ahead of time (e.g. S3 -> EMR -> S3) and then load the processed data into Redshift.
I am looking for a way to pass log events from an AWS application to my company's site.
The thing is that the AWS application is 100% firewalled from everything except a single IP address, because it is an encryption-related service.
I just don't know which service I should use to do this. There are so many services that I really have no idea which one fits.
I think I'd just use a simple messaging service; does this make sense? The thing is, there are plenty of events (say 1M per day), so I don't want big extra costs for this.
Sorry for the generic question, but I think it's quite concrete: "What is the optimal way to pass event messages out of AWS when the volume is approximately 1M per day, each 256 bytes on average?"
I'd like to connect to an AWS service rather than to any of the EC2 hosts...
On both sides I have Tomcats with the AWS SDK.
I just want to avoid rewriting. Maybe I should do it with S3? The files are immutable, but I could upload files every hour. I don't need real-time events; I just need the log files on site for analysing the user experience, and customers need to be able to access them. But having the logs in 1M chunks would require further assembly, etc. I am really confused, sorry.
Kinesis is good for streaming event data. S3 is good if you already have files that you want stored.
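Since hourly S3 uploads would cover the stated requirements, here is a rough sketch using the AWS SDK for Java v2 (the bucket name and key layout are placeholders); a scheduler such as Spring's @Scheduled or a cron job would invoke it once per hour:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Path;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class HourlyLogShipper {

    private static final String BUCKET = "my-company-log-drop";   // placeholder bucket

    // Upload the previous hour's rolled log file under a time-partitioned key,
    // e.g. logs/2024/05/17/13.log, so the receiving side can list and fetch by hour.
    public static void shipLogFile(S3Client s3, Path rolledLogFile) {
        String key = "logs/" + ZonedDateTime.now(ZoneOffset.UTC)
                .minusHours(1)
                .format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")) + ".log";

        s3.putObject(PutObjectRequest.builder()
                        .bucket(BUCKET)
                        .key(key)
                        .build(),
                RequestBody.fromFile(rolledLogFile));
    }
}
```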
I have the following requirement. I have a database containing the contact and address details of at least 2000 members of my school's alumni organization. We want to store all that information in a relational model so that:
This data can be created and edited on demand.
This data is always backed up and should be simple to restore in case the master copy becomes unusable.
All sensitive personal information residing in this database is guaranteed to be available only to authorized users.
This database won't be online in the first 6 months. It will become online only after a website is built on top of it.
I am not a DBA and I don't want to spend time doing things like backups. I thought Amazon's RDS, with its automatic backup facility, was the perfect solution for our needs. The only problem is that, being a voluntary organization, we cannot spare the monthly $100 to $150 this service demands.
So my question is: are there any less costly alternatives to Amazon's RDS?
For just contact and address data, I would choose Amazon SimpleDB. SimpleDB might not be suitable for a large number of tables with relationships, but for your kind of data I think it is sufficient, and it costs much less than Amazon RDS.
I also wanted to use RDS, but the smallest DB size costs $80 per month.
Without a bit more info I may be way off base here, but 2000 names, addresses, etc. is not a large DB, and I would have thought that using Amazon's RDS was overkill, to say the least.
Depending on how (and by whom) you want it viewed and edited, there are a number of free or almost-free alternatives.
One option is to set up or use a hosting package that bundles something like phpMyAdmin with a MySQL DB. That way you can access and edit the DB without a website front end. Not as pretty as a website front end, but practical. A good host should also back it up for you.
Another is to look at Google Docs. OK, not really a database, more a spreadsheet along the lines of Excel, but you can share Google Docs with invited people and even set up a small website via Google Docs. This is free, though it may not be practical depending on your needs.
Have you taken a look at Microsoft SQL Azure? You can use it free for something like 90 days, and if you only need a 1 GB DB it would be about $10 a month.
You mention backups, so I will cover that as well. The way SQL Azure works is that it automatically creates two additional copies of your database on different machines in the data center. If one of the machines or databases becomes unavailable, it automatically fails over to one of the other copies.
If you need anything beyond that, you can also use the copy command to back up the database.
You can check http://www.enciva.com/postgresql9-hosting.htm and http://www.acugis.com/postgresql-hosting.htm; both offer PostgreSQL and MySQL hosting.
For a frankly tiny DB of that size, I'd seriously look at SQLite: http://www.sqlite.org/
It's in-process, it's easy to constantly .dump off to S3, and you can use update hooks to keep checkpoints after updates.
Backups and restores are almost as simple as a Windows batch file and a wget.
You get good encryption using http://sqlcipher.net/
The standard OS filesystem and user-level ACLs control security.
Running a file-backed DB also makes sense given the fragility of a normal EC2-backed RDBMS in the face of EBS gremlins.
There are some omissions from SQL-92 (no real showstoppers), but given the project's cost sensitivity and the RPO and RTO of an alumni database, I reckon it's a good bet.
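A rough sketch of the backup side of that setup, assuming the sqlite3 CLI is installed and the AWS SDK for Java v2 handles the S3 upload (the database file, snapshot name, and bucket are placeholders):

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Path;
import java.time.LocalDate;

public class SqliteBackup {

    public static void main(String[] args) throws Exception {
        Path snapshot = Path.of("alumni-" + LocalDate.now() + ".db");

        // Take a consistent snapshot of the live database with the sqlite3 CLI.
        Process backup = new ProcessBuilder(
                "sqlite3", "alumni.db", ".backup '" + snapshot + "'")
                .inheritIO()
                .start();
        if (backup.waitFor() != 0) {
            throw new IllegalStateException("sqlite3 .backup failed");
        }

        // Push the snapshot to S3 (placeholder bucket). Restoring is just
        // downloading the latest object and copying the file back into place.
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(PutObjectRequest.builder()
                            .bucket("alumni-db-backups")
                            .key(snapshot.getFileName().toString())
                            .build(),
                    RequestBody.fromFile(snapshot));
        }
    }
}
```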