I am building a web application (a book readers' group) where anyone can select a book from the books on the website and then create a readers' group for that book.
1- A user selects a book, e.g. (How to develop with ReactJS).
2- The user specifies the number of people in the group, let's say 20 people, to read that book. The pages of the selected book are then divided among those 20 people.
3- The user gets a read URL that they send to their reader group.
4- A table is shown to each reader who opens the shared URL.
I may later add a feature where readers can send an internal message to the group creator...
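Step 2 above (dividing the book's pages among the readers) can be sketched as follows. This is only an illustration: the function name `split_pages` and the choice to give the leftover pages to the first readers are assumptions, not something stated in the question.

```python
def split_pages(total_pages: int, readers: int) -> list[tuple[int, int]]:
    """Return one inclusive (first_page, last_page) range per reader."""
    if readers <= 0 or total_pages < readers:
        raise ValueError("need at least one page per reader")
    base, extra = divmod(total_pages, readers)
    ranges = []
    start = 1
    for i in range(readers):
        # Assumption: the first `extra` readers each get one additional page.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Example: a 10-page booklet shared by 3 readers.
example = split_pages(10, 3)
```

Each tuple could become one row of the table shown to readers in step 4.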
Now I am unsure which DB I should choose. Ease of implementation is not important to me.
My concern is cost, because I am expecting high-volume traffic on the website.
Speed is not critical for me as long as the difference in read speed between SQL and NoSQL is less than 1 second; the important things are accessibility and availability of the service 24 hours a day, and of course the cost.
Let's say I select Amazon DynamoDB: DynamoDB will charge me for each read and write request.
The hourly rate for Amazon RDS (MySQL) on a db.m5.xlarge instance is $0.396, plus $0.133 per GB-month for storage, and later on I may need to auto-scale and start more instances.
DynamoDB, in contrast, charges per read and write request and for storage used.
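To compare the two pricing models, a rough monthly estimate could look like the sketch below. The RDS figures are the ones quoted above; the DynamoDB per-million-request and storage prices are placeholders I made up for illustration and must be replaced with the current prices for your region, and 730 hours per month is an approximation.

```python
HOURS_PER_MONTH = 730  # approximation of an average month


def rds_monthly_cost(hourly_rate, storage_rate_gb_month, storage_gb, instances=1):
    """Instance-hours plus storage; ignores I/O, backups and data transfer."""
    return instances * hourly_rate * HOURS_PER_MONTH + storage_rate_gb_month * storage_gb


def dynamodb_monthly_cost(reads_millions, writes_millions, storage_gb,
                          price_per_million_reads, price_per_million_writes,
                          storage_rate_gb_month):
    """Request-based pricing: cost grows with traffic, not with uptime."""
    return (reads_millions * price_per_million_reads
            + writes_millions * price_per_million_writes
            + storage_gb * storage_rate_gb_month)


# RDS numbers from the question, with an assumed 50 GB of storage.
rds = rds_monthly_cost(0.396, 0.133, storage_gb=50)

# Placeholder DynamoDB prices and traffic (assumptions, not quoted rates).
ddb = dynamodb_monthly_cost(reads_millions=100, writes_millions=20, storage_gb=50,
                            price_per_million_reads=0.25,
                            price_per_million_writes=1.25,
                            storage_rate_gb_month=0.25)
```

With these inputs the RDS instance costs the same whether it serves one request or millions, while the DynamoDB figure scales with the request counts, so the comparison depends entirely on the traffic you plug in.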
In my experience - and with my use cases - I have found that for small to medium-sized projects DynamoDB ends up being cheaper, and in some cases even completely free because the usage fits within the free tier that AWS offers - which is pretty generous. DynamoDB is my go-to for these types of applications.
On larger projects I have found it not so clear - without knowing your usage patterns and the amount of data storage used/needed, there is no easy, one-size-fits-all answer.
Based on the use case scenarios mentioned above, if the solution has to be developed using DynamoDB, it may require main tables and some secondary indexes to search the book by name etc. So, in terms of pricing, AWS will charge you for both main table and secondary indexes read / writes separately.
In general, DynamoDB would give better results if you find by id or key (i.e. Partition key). As soon as you need wildcard search or find some data by non key attributes, you may need to scan the full table or create some secondary index.
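The difference between a key lookup, a full scan and a secondary index can be illustrated with a toy in-memory model. This is plain Python, not the DynamoDB API; the table contents and function names are made up for the illustration.

```python
# Toy "table" keyed by partition key (book_id). Sample data is invented.
books = {
    "b1": {"book_id": "b1", "title": "How to develop with ReactJS", "pages": 310},
    "b2": {"book_id": "b2", "title": "Learning SQL", "pages": 280},
    "b3": {"book_id": "b3", "title": "ReactJS in Depth", "pages": 420},
}


def get_by_key(table, key):
    """Key lookup: touches exactly one item."""
    return table.get(key), 1


def scan_by_title(table, title):
    """Scan: every item is examined, whether or not it matches."""
    examined = 0
    matches = []
    for item in table.values():
        examined += 1
        if item["title"] == title:
            matches.append(item)
    return matches, examined


def build_title_index(table):
    """Toy secondary index (title -> keys). It has to be maintained on
    every write, which is why index reads/writes are billed separately."""
    index = {}
    for key, item in table.items():
        index.setdefault(item["title"], []).append(key)
    return index


def query_title_index(table, index, title):
    """Index query: only the matching items are touched."""
    keys = index.get(title, [])
    return [table[k] for k in keys], len(keys)


title_index = build_title_index(books)
```

Searching for "Learning SQL" examines all three items with the scan but only one with the index, which is the trade-off described above: the index makes non-key queries cheap to read but adds write and storage cost.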
If you foresee a wide range of features being added to your application to give a better user experience, you should go with a typical RDBMS option. It will be cost-effective and flexible enough to add more features as well.
You can consider MariaDB on Amazon RDS if you are going to stick with the AWS cloud.
AWS recently launched a new service, DocumentDB, with an interface similar to MongoDB's.
What is the difference between AWS DynamoDB vs. DocumentDB services?
A major difference is that DocumentDB is a middle step between MongoDB and DynamoDB.
DynamoDB is a fully managed, scalable service where you set the upper limit of its potential.
DocumentDB is a bit more hands-on: you have to select the number of instances for the cluster and the instance sizes. This means you need to keep an eye on their usage / performance, but not to the extent of MongoDB.
MongoDB is the most flexible but also requires the most maintenance.
All are good for performance depending on the application; the choice depends on how much flexibility you want at the cost of more maintenance.
The other factor is the pricing model. With both MongoDB (Atlas) and DocumentDB you pay per hour (plus usage for DocumentDB). With DynamoDB you can pay based on provisioned resources or on-demand (pay for what you use).
Edit: I've written a more extensive article based on my experiences with the three: https://medium.com/#caseygibson_42696/difference-between-aws-dynamodb-vs-aws-documentdb-vs-mongodb-9cb026a94767
Some Key differences
Amazon DynamoDB is a fully managed NoSQL database service. It provides fast and predictable performance with scalability. You can use Amazon DynamoDB to create a database table that can store and retrieve any amount of data, and serve any level of request traffic.
DocumentDB is compatible with MongoDB, an open-source document database and a leading NoSQL database, and is a document database designed for ease of development and scaling.
DynamoDB uses tables, items and attributes as the core components that you work with. A table is a collection of items, and each item is a collection of attributes. DynamoDB uses primary keys to uniquely identify each item in a table and secondary indexes to provide more querying flexibility.
MongoDB / DocumentDB uses JSON-like documents to store schema-free data. In DocumentDB, collections of documents do not require a predefined structure and fields can vary between documents. DocumentDB has many features of a relational database, including an expressive query language and strong consistency. Since it is schema-free, MongoDB/DocumentDB allows you to create documents without having to define their structure first.
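What "schema-free" means in practice can be shown with plain Python dictionaries standing in for documents. This is a sketch, not the MongoDB or DocumentDB driver API; the `find` helper and the sample documents are invented for the illustration.

```python
# Three documents in the same "collection", each with a different shape.
collection = [
    {"_id": 1, "title": "How to develop with ReactJS", "pages": 310},
    {"_id": 2, "title": "Learning SQL", "authors": ["A. Writer"], "edition": 3},
    {"_id": 3, "title": "Untitled draft"},
]


def find(docs, **criteria):
    """Return documents whose fields equal all criteria.
    A document that lacks a field simply does not match."""
    missing = object()
    return [d for d in docs
            if all(d.get(k, missing) == v for k, v in criteria.items())]


def field_names(docs):
    """The union of all field names seen across the collection."""
    return sorted({k for d in docs for k in d})


third_editions = find(collection, edition=3)
all_fields = field_names(collection)
```

No structure had to be declared before inserting the second document with its extra `authors` and `edition` fields; the cost is that queries must tolerate documents where a field is absent.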
In DynamoDB, you can create and use a so-called secondary index for similar purposes as in an RDBMS. When you create a secondary index, you must specify its key attributes, and after you create it, you can query it or scan it as you would a table. DynamoDB does not have a query optimizer, so a secondary index is only used when you query or scan it explicitly.
Indexes are preferred in MongoDB/DocumentDB. If an index is missing, every document within the collection must be searched to select the documents requested by the query. This can slow down read times.
DynamoDB is popular in the gaming industry as well as in the internet of things (IoT) industry.
To summarize, DocumentDB can be a good choice if you need scalability and caching for real-time analytics; however, it is not built for transactional data (accounting systems, etc.). DocumentDB can be used for mobile apps, content management, real-time analytics and IoT applications. If you have a case with no clear schema definition, DocumentDB can be a good choice.
I've been researching different approaches to streaming data to a real-time dashboard. One approach I have used in the past is a star schema with dimension and fact tables. This would be an implementation of aggregate tables. For example, the dashboard would contain multiple charts, one being the total sales for the day, total sales by product, total sales by manufacturer, etc.
But what if this needed to be real-time? What if the data needs to stream to these charts and the analytical processing has to happen in real time?
I've been looking into solutions like Kinesis streams and Kafka, but I may be missing something obvious. For instance, consider the following scenario. A company runs a website where they sell pies. The company has a backend dashboard where they keep track of all data and analytics related to sales, users, orders, etc.
A customer places an order through the website
The relational (MySQL) database receives this new order
The charts and analytical data update in real time on the backend, for example total sales for the day, or total sales for the year by user.
If the scenario is that this data needs to be streamed, what is the best approach? Aggregate tables seem like the obvious choice, but they would be periodic rather than real-time. Kinesis/Kafka feels like it would fit somewhere in here. The other option would be something like Redshift, but it's pretty pricey and still may not be the best way to address the issue and scale.
Here is an example of a chart that would need to be updated in real time and that could suffer from simply running in-place aggregate SQL queries when there are tons and tons of rows to parse.
For "always up-to-date" reports like these (sales, users, orders, etc.) that don't need live updates with near-zero latency, stream processing might be overkill, and a ROLAP-like approach seems to be a better trade-off between effort and result.
You mentioned Redshift; if you are already prepared to mirror your data for analytics purposes and the only problem is price, you can consider free open-source alternatives that can handle OLAP (aggregate) queries in real time (like Yandex ClickHouse, or maybe MongoDB in some cases).
A lot depends on the dataset size; unless you have really big data that needs to be aggregated (hundreds of GB), you can try to keep using MySQL and use some tricks:
Use a separate MySQL replica (slave) server with high IOPS for analytics and replicate only the tables needed to build your reports; possibly use another table engine that is more suitable for analytical queries. Set up indexes specifically for these queries, to avoid a full table scan if you only need numbers for the last few weeks.
Pre-calculate metrics for previous periods (with a materialized-view-like approach) and refresh them on a schedule (say, daily), then combine the pre-calculated aggregates with on-the-fly aggregates for the last period only, to get the actual report data without scanning the whole facts table each time.
Use a data visualization backend that can efficiently cache report data in memory to prevent SQL DB overload from many similar queries (if the same report or dashboard is displayed for 100 users, the SQL DB load will be the same as for 1). BTW, I develop a solution like that (I cannot advertise it here as it is a commercial product).
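The second trick (pre-calculated aggregates for closed periods combined with an on-the-fly aggregate for the current period) can be sketched like this. SQLite stands in for MySQL so the example runs without a server; the table names, column names and sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (day TEXT, amount REAL);                   -- raw fact table
CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total REAL);  -- pre-computed rollup
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2024-01-01", 10.0), ("2024-01-01", 5.0),
    ("2024-01-02", 7.5),
    ("2024-01-03", 2.0), ("2024-01-03", 3.0),  # "today": not rolled up yet
])


def refresh_daily_totals(conn, before_day):
    """Scheduled job: roll up every closed day (strictly before `before_day`)."""
    conn.execute("""
        INSERT OR REPLACE INTO daily_totals (day, total)
        SELECT day, SUM(amount) FROM orders WHERE day < ? GROUP BY day
    """, (before_day,))


def report(conn, today):
    """Closed days come from the rollup; only today's rows are aggregated live."""
    return conn.execute("""
        SELECT day, total FROM daily_totals WHERE day < ?
        UNION ALL
        SELECT day, SUM(amount) FROM orders WHERE day = ? GROUP BY day
        ORDER BY day
    """, (today, today)).fetchall()


refresh_daily_totals(conn, "2024-01-03")
rows = report(conn, "2024-01-03")
```

The report query only touches the raw `orders` rows for the current day, so its cost stays roughly constant as history grows, provided `orders` has an index on `day`.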
This is a typical trade-off for most architects. Amazon Redshift offers exemplary read optimisations, but the AWS stack comes at a price. You may try using Cassandra, but it comes with its own set of challenges. When it comes to analytics, I never recommend going real-time, for the reasons elaborated below.
Doing analytics in real time is not desirable, especially using MySQL
The solution to the above is to segregate the transactional and analytical infrastructure. This involves cost but will make sure you don't have to spend time on housekeeping once you scale. MySQL is a row-based RDBMS mostly used for storing transactional data. Being row-based, it optimises writes, i.e. the writes are almost real-time, and thus it compromises on reads. When I say this, I am referring to a typical analytics dataset running into millions of records/day. If your dataset is not that voluminous, you might still be able to render a graph showing transactional status. But since you're referring to Kafka, I assume the dataset is very large.
A real-time dashboard with visualisations gives a bad customer experience
Considering the above point, even if you go for a warehouse / read-optimised infrastructure, you need to understand how the visualisations work. If 100 people access the dashboard at the same time, 100 connections will be made to the database, all fetching the same data, putting it in memory, applying the calculations, parameters and filters defined in your dashboard, fitting the refined dataset into the visualisation and then rendering the dashboard. Until all of that is done, the dashboard will simply freeze. A poorly constructed query, inefficient use of indexes, etc. will make matters even worse.
The above problems will grow as your dataset grows. Good practices to achieve what you need would be:
Aim for a near-real-time system (with a delay of 1 hr, 30 mins, 15 mins, etc.) rather than an absolutely real-time one. This lets you create a flat file with the data already fetched into memory. Your dashboard will simply read this data and will be extremely fast in responding to filters etc. Multiple connections to the database will also be avoided.
Use a data structure or database/warehouse optimised for reads.
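The first practice (serve all dashboard viewers from one periodically refreshed copy of the data) amounts to a time-based cache in front of the expensive query. A minimal single-threaded sketch follows; the class name `ReportCache`, the 15-minute TTL and the fake query are assumptions for illustration, and a real deployment would also need locking for concurrent viewers.

```python
import time


class ReportCache:
    """Holds one query result and reloads it at most once per TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # the expensive query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the example is deterministic
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._loader()
            self._loaded_at = now
        return self._value


calls = 0


def expensive_query():
    """Stand-in for the aggregate SQL query behind a chart."""
    global calls
    calls += 1
    return {"total_sales_today": 1234.5}


fake_now = [0.0]
cache = ReportCache(expensive_query, ttl_seconds=900, clock=lambda: fake_now[0])

for _ in range(100):           # 100 viewers open the dashboard
    cache.get()
calls_after_100_viewers = calls

fake_now[0] = 901.0            # 15 minutes later the data is refreshed once
cache.get()
```

The 100 viewers trigger a single database query; the price is that the numbers can be up to one TTL old.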
For these types of operational analytics use-cases where the real-time nature of the data is critical, you're completely correct that most "traditional" methods can be quite clumsy, especially as your data size increases. A quick overview of your options:
Historical Approach (TLDR– Meh)
Up until about 5 years ago, the de facto way to do this looked something like
Set up a primary OLTP database that will handle the data in its raw form and have stricter guarantees on performance or ACID properties. Usually this is something SQL-esque, i.e. MySQL, PostgreSQL.
Set up a secondary OLAP database that is meant for serving offline (aka non user-facing) queries. This could also be a SQL-esque db but its schema would be drastically different because it stores the data in enriched form.
Set up some mechanism by which you can keep these 2 in sync. This pretty much boils down to either a) changing your application to always write to both databases and performing the necessary data enrichment or b) building a stand-alone application that reads from your OLTP database, performs the necessary transformations and enrichment and writes to your OLAP database
Plug your dashboard into your OLAP database which will have a schema and indexes optimized for the kind of queries you want.
Using your example about the pie store, the OLTP database would be used to store the purchases of all the pies and reference things like customer ids, billing information, delivery information, etc. In contrast, the OLAP database might just maintain a table with a schema
purchase_totals(day: Date, weekNumber: int, dayOfWeek: int, year: int, total: float)
While the weekNumber, dayOfWeek, and year fields are technically redundant, they make your queries faster! With the proper indexes on these fields, your dashboard has turned into 5 simple (and fast!) aggregation queries with a group by and sum, and the differences week-over-week or year-over-year can then be computed on the client side. As long as your dashboard refreshes every minute or so, you have near-real-time data at your fingertips.
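A small sketch of this setup, covering option (b) from the sync step and the `purchase_totals` table. SQLite stands in for both databases so the example is runnable; the `purchases` table, the index name and the sample rows are assumptions, and ISO week numbering is just one possible choice for `weekNumber`.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (customer_id INTEGER, day TEXT, amount REAL);  -- OLTP side
CREATE TABLE purchase_totals (                                        -- OLAP side
    day TEXT PRIMARY KEY, weekNumber INTEGER, dayOfWeek INTEGER,
    year INTEGER, total REAL
);
CREATE INDEX idx_totals_year_week ON purchase_totals (year, weekNumber);
""")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (1, "2024-01-01", 12.0), (2, "2024-01-01", 8.0),
    (1, "2024-01-02", 5.0), (3, "2024-01-08", 20.0),
])


def sync_totals(conn):
    """Stand-alone sync job: read raw purchases, enrich, write daily totals."""
    rows = conn.execute(
        "SELECT day, SUM(amount) FROM purchases GROUP BY day").fetchall()
    for day, total in rows:
        # ISO calendar: the ISO year can differ from the calendar year
        # for days around New Year.
        iso_year, week, weekday = date.fromisoformat(day).isocalendar()
        conn.execute(
            "INSERT OR REPLACE INTO purchase_totals VALUES (?, ?, ?, ?, ?)",
            (day, week, weekday, iso_year, total))


sync_totals(conn)

# One of the dashboard's simple "group by and sum" queries.
weekly = conn.execute("""
    SELECT year, weekNumber, SUM(total) FROM purchase_totals
    GROUP BY year, weekNumber ORDER BY year, weekNumber
""").fetchall()
```

Week-over-week differences can then be computed on the client from the `weekly` rows.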
Current Approach (TLDR– Ok)
The recent trends in computing, database technologies, and data science/analytics have led to improvements to the above process, namely by replacing certain components of it. The changes include
Making the OLTP db, the OLAP db, or both a NoSQL database (Mongo usually being the most popular). The pro here is that you have a more flexible schema which won't break if something upstream changes (say, you start selling cakes in addition to pies).
Keeping the SQL db but shifting to a cloud provider solution like AWS RDS or Google Cloud SQL. This fundamentally doesn't change anything about the architecture, but it does significantly reduce your operational burden.
Using hard-to-maintain ETL pipelines on top of streaming platforms like Kafka or AWS Kinesis to act as the middle layer between OLAP and OLTP.
Using dedicated tools for data cleaning and transformation as you plan out how to do your ETL
Using dedicated visualization tools on top of your OLAP db (think Tableau)
Using a pull-based approach for getting data out of your OLTP db or your application directly instead of waiting for it to eventually reach your OLAP db. This is helpful for online services because it actually gives you both the data you want AND confirmation that the service is alive and running well (because it just served your request for data). Systems like Prometheus are quite popular for this now.
The company I work for is planning to use AWS to host a new website for a client. Their old website had roughly 75,000 sessions and 250,000 page views per year. We haven't used AWS before and I need to give a rough cost estimate to my project manager.
This new website is going to be mostly content-driven with a CMS backend (probably WordPress) plus a cost calculator for their services. Can anyone give me a rough idea of the cost of hosting this kind of website on AWS?
I used the Simple Monthly Calculator with a single Linux t2.small on a 3-year upfront reservation, which gave me around $470.
(forgive my English)
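For a sense of scale, the traffic and cost figures quoted in the question work out as follows. This is plain arithmetic on the stated numbers; it gives averages only and says nothing about peak load.

```python
page_views_per_year = 250_000
sessions_per_year = 75_000
upfront_cost_usd = 470      # the 3-year upfront estimate from the calculator
term_months = 36

views_per_day = page_views_per_year / 365
views_per_second = views_per_day / 86_400
views_per_session = page_views_per_year / sessions_per_year
cost_per_month = upfront_cost_usd / term_months

print(f"{views_per_day:.0f} page views/day, "
      f"{views_per_second:.4f} per second on average, "
      f"{views_per_session:.1f} pages per session, "
      f"${cost_per_month:.2f} per month")
```

On average this is well under one page view per second, so instance cost is likely to be driven by the CMS's baseline needs rather than by traffic.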
The only way to know the cost is to know the actual services you will consume (Amazon EC2, Amazon EBS, database, etc). It is not possible to give an accurate "guess" of these requirements because it really does depend upon the application and usage patterns.
It is normally recommended that you implement the system and run it for a while before committing to Reserved Instances so that you have a chance to measure performance and test a few different instance types.
Be careful using T2 instances for production workloads. They are very powerful instances, but if the CPU Credits run out, the amount of CPU is limited.
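To see how quickly credits can run out, here is a rough model. It assumes one CPU credit equals one vCPU running at 100% for one minute; the balance and earn rate used in the example are illustrative values that should be checked against the current AWS documentation for the instance size you pick.

```python
def hours_until_credits_run_out(credit_balance, credits_earned_per_hour,
                                cpu_utilization, vcpus=1):
    """Return hours until the balance hits zero, or None if the load is
    sustainable. Assumes 1 credit = 1 vCPU at 100% for 1 minute."""
    spent_per_hour = cpu_utilization * 60 * vcpus
    net_drain = spent_per_hour - credits_earned_per_hour
    if net_drain <= 0:
        return None  # credits are earned at least as fast as they are spent
    return credit_balance / net_drain


# Illustrative numbers (assumptions): 288 banked credits, 12 earned per hour.
busy = hours_until_credits_run_out(288, 12, cpu_utilization=0.5)
idle = hours_until_credits_run_out(288, 12, cpu_utilization=0.1)
```

Under these assumed numbers a sustained 50% load drains the balance in 16 hours, while a 10% load never does, which is why measuring real usage before committing matters.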
Bottom line: Implement, measure, test. Then you'll know what is right for your needs.
Take Note
When you are new to AWS you get a 1-year free tier on a single t2.micro.
I just pulled this out; looking at your requirements, you may not need it.
One load balancer and one app server should be fine (just use Route 53 to serve some static pages from S3 while upgrading or scaling).
Email subscriptions and the processing of some documents can be handled with AWS Lambda, SNS and SQS, which may further reduce the cost (you can reduce the server size and do all the heavy lifting in Lambda).
A simple web page with 3,000 requests per month can be handled by a t2.micro, which is almost free for one year, as mentioned in the note above.
You don't have a lot of details in your question. AWS has a wide variety of services that you could be using in that scenario. To accurately estimate costs, you should gather these details:
What will the AWS storage be used for? A database, applications, file storage?
How big will the objects be? Each type of storage has different limits on individual file size, estimate your largest object size.
How long will you store these objects? This will help you determine static, persistent or container storage.
What is the total size of the storage you need? Again, different products have different limits.
How often do you need to do backup snapshots? Where will you store them?
Every cloud vendor has a detailed calculator to help you determine costs. However, to use them effectively you need to have all of these questions answered and you need to understand what each product is used for. If you would like to get a quick estimate of costs, you can use this calculator by NetApp.
I read in the documentation that both are for data analysis and both use a cluster structure, but I don't understand how their use cases differ.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. (Amazon Elasticsearch)
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. (Amazon Redshift)
Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (afaik) mostly used for BI purposes and other compute-intensive jobs, while Amazon Elasticsearch is an out-of-the-box managed ElasticSearch cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything to manage your servers (this is what you pay for). Using the AWS Console you can add a new cluster, and you don't need to run any commands in order to install any software - you just need to choose which servers to run your cluster on (number of nodes, disk, RAM, etc.).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch
I agree with #IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear - whether speaking of AWS ElasticSearch Service, or FOSS / Enterprise ElasticSearch (which have significant differences between them, even) - ElasticSearch is NOT a Relational Database (RDBMS), nor is it quite a NoSQL (Document Store) Database, either...
ElasticSearch is a Search Engine / Index. It does some things very well, for very specific use cases. However, most significantly, unlike RDBMS data models, ElasticSearch or NoSQL are not going to provide you with FULL ACID Compliance or Transactional Statement Processing. So if your use case prioritizes data integrity, constrainability, reliability, auditability, regulatory compliance, recoverability (to Point in Time, even), and normalization of the data model for performance and least repetition of data while providing deep cardinality and enforcing model constraints for optimal integrity, "NoSQL and Elastic are not the Droids you're looking for..." and you should be implementing an RDBMS solution. As already mentioned, the AWS Redshift Service is based on PostgreSQL - which is one of the most popular OpenSource RDBMS flavors out there, just offered by AWS as a fully managed solution / service for their customers.
Elastic falls between the RDBMS and NoSQL categories, as it is a Search Engine / Index that works best with "single index" type use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing. But perhaps the most important thing I could stress is that in my experience it typically does not scale very cost-effectively (even as a managed cluster service) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - and for most it will likely become cost PROHIBITIVE VERY fast. That said, Elastic Search DOES still have very good use cases, so it is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly let's call NoSQL what it is, a Document Store that stores collections of documents (most often in JSON format) and while they also do indexing, offer some semblance of an Authentication and Authorization model, provide CRUD operability (or even SQL support nowadays, which makes the career Enterprise Data Engineer in me giggle, that SQL is now the preferred means of querying data from their NoSQL instances! :D )- Still NOT a traditional database, likely won't provide you with much control over your data's integrity - BUT that is precisely what "NoSQL" Document Stores were designed to work best for - UNSTRUCTURED DATA - where you may not always know what your data model is going to look like from the start, or your use case prioritizes data model flexibility over enforcing data integrity in general (non mission critical data). Last - while most modern NoSQL Document Stores may have SOME features that appear on the surface to resemble RDBMS, I am not aware of ANY in that category at current that could claim to offer all that a relational database does, with Oracle MySQL's DocumentStore being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope Developers with similar questions come across this thread, and after reading are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves - everything we do in our profession is about data - either generating it, transporting it, rendering it, transforming it....it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!
This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
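The kind of warehouse question quoted above might be written roughly as below. SQLite is used only so the query can be run locally; the `purchases` table, its columns and the sample rows are invented for the illustration, and a real Redshift schema would look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, city TEXT, product TEXT, year INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", [
    (1, "Leeds", "toothbrush", 2024), (1, "Leeds", "toothpaste", 2024),
    (2, "Leeds", "toothbrush", 2024),                       # brush only
    (3, "York", "toothbrush", 2024), (3, "York", "toothpaste", 2024),
    (3, "York", "toothbrush", 2023), (3, "York", "toothpaste", 2023),
    (4, "York", "toothpaste", 2023), (4, "York", "toothbrush", 2023),
])

# Inner query: customers who bought BOTH products in a given year.
# Outer query: count those customers per city and year.
QUERY = """
SELECT city, year, COUNT(*) AS buyers
FROM (
    SELECT customer_id, city, year
    FROM purchases
    WHERE product IN ('toothbrush', 'toothpaste')
    GROUP BY customer_id, city, year
    HAVING COUNT(DISTINCT product) = 2
)
GROUP BY city, year
ORDER BY city, year
"""
rows = conn.execute(QUERY).fetchall()
```

Comparing this year with last year is then a matter of reading the two `year` rows for each city side by side.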
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.
I have the following requirement. I have a database containing the contact and address details of at least 2000 members of my school alumni organization. We want to store all that information in a relational model so that:
This data can be created and edited on demand.
This data is always backed up and should be simple to restore in case the master copy becomes unusable.
All sensitive personal information residing in this database is guaranteed to be available only to authorized users.
This database won't be online in the first 6 months. It will become online only after a website is built on top of it.
I am not a DBA and I don't want to spend time doing things like backups. I thought Amazon's RDS with its automatic backup facility was the perfect solution for our needs. The only problem is that, being a voluntary organization, we cannot spare the monthly $100 to $150 fees this service demands.
So my question is, are there any less costly alternatives to Amazon's RDS?
In your case of just contact and address data I would choose Amazon SimpleDB. I know SimpleDB might not be suitable for a large number of tables with relationships and all, but for your kind of data I think SimpleDB is sufficient. And it costs much, much less than Amazon RDS.
I also wanted to use RDS, but the smallest db size costs $80 p/month.
Without a bit more info I may be way off base here, but 2000 names, addresses, etc. is not a large DB, and I would have thought that using Amazon's RDS was a bit "overkill", to say the least.
Depending on how you want to view and edit the data (and who should be able to), there are a number of free or almost free alternatives.
One method may be to set up or use a hosting package that has something like phpMyAdmin linked to a MySQL DB. This makes it possible to access and edit the DB without having a website front end. Not pretty (like a website front end) but practical. A good host should also back up for you.
Another is to look at Google Documents. OK, not really a database, more a spreadsheet, very much along the lines of Excel. You can share Google docs with invited people and even set up a small website via Google Docs. This is a free method, but may not be that practical depending on your needs.
Have you taken a look at Microsoft SQL Azure? You can use it free for something like 90 days and then if you only need a 1GB db it would only be about $10 a month.
You mention backup, so I thought I would talk about that as well. The way SQL Azure works is that it automatically creates 2 additional copies of your database on different machines in the data center. If one of the machines or DBs becomes unavailable, it automatically fails over to one of the other copies.
If you need anything above that you can also use the copy command to backup the database.
You can check
http://www.enciva.com/postgresql9-hosting.htm
and
http://www.acugis.com/postgresql-hosting.htm
They work for Postgres and MySQL.
For a frankly tiny db of that size I'd seriously look at http://www.sqlite.org/
It's in-process, easy to constantly .dump off to S3, and you can use update hooks to keep checkpoints after updates.
Backups/restores are almost the equivalent of Windows batch files and wgets.
Good encryption is available using http://sqlcipher.net/
Standard OS filesystem and user-level ACLs control security.
Running a file-backed DB makes sense given the fragility of a normal EC2-backed RDBMS to EBS gremlins.
There are omissions from SQL92 (no real showstoppers), but given the project's cost sensitivity and the RPO and RTO of an alumni database, I reckon it's a good bet.
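The dump-and-restore cycle mentioned above can be sketched with Python's standard `sqlite3` module, whose `Connection.iterdump()` produces the same kind of SQL text as the `.dump` shell command. The `members` table and the sample rows are invented for the illustration, and uploading the text to S3 is left out.

```python
import sqlite3


def dump_sql(conn):
    """Serialize the whole database as SQL text (like the shell's .dump)."""
    return "\n".join(conn.iterdump())


def restore_sql(sql_text):
    """Rebuild a database from a dump; here into memory for the demo."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(sql_text)
    return restored


# A tiny stand-in for the alumni database (sample data is made up).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
source.executemany("INSERT INTO members (name, city) VALUES (?, ?)",
                   [("Asha", "Pune"), ("Ben", "Leeds")])
source.commit()

backup_text = dump_sql(source)       # this string is what would be shipped off-site
copy = restore_sql(backup_text)
restored_rows = copy.execute(
    "SELECT name, city FROM members ORDER BY id").fetchall()
```

For a file-backed database, `Connection.backup()` is an alternative that copies the database page by page instead of producing SQL text.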