I am making an MVC website that uses a SQL database on Azure for storage. Potentially there will be many hundreds, and possibly thousands, of transactions per day to the website via a web service.
What type of database should I use? Should it be the (now retired) Web edition, Standard, or one of the other tiers? What is the cheapest option that still works well with traffic of many hundreds, and possibly thousands, of transactions per day?
Thanks in advance.
Your description of your requirements equates to a light transaction workload: "hundreds and possibly thousands per day". It's difficult to give a concrete answer with so little information. However, assuming you're OK with the 2 GB database size limit, the Basic tier is where I would start. You can always change your service tier and/or performance level if you find you need more.
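If you do later need to move up from Basic, the tier can be changed with a single T-SQL statement. Below is a minimal sketch using pyodbc; the server, database names, and credentials are placeholders, and the target tier (Standard S0) is just an example.

```python
# Sketch: move an Azure SQL database from Basic to Standard S0 via T-SQL.
# Server, database, and credentials are placeholders, not real values.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=master;"   # run the ALTER against the logical server's master database
    "UID=admin_user;PWD=secret",
    autocommit=True,     # ALTER DATABASE cannot run inside a user transaction
)
conn.cursor().execute(
    "ALTER DATABASE [MyAppDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S0');"
)
```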
Be sure to check the Overview of the Performance Model chart to find the service tier and performance level that meets your requirements. Notice that Basic benchmarks are per hour, Standard are per minute, and Premium are per second.
Also, look at the SQL Database pricing page and you will see that pricing for Basic is cheaper than the equivalent Web tier.
Any recommendation on how to make Superset faster?
The cache seems to load the full dataset; I thought it would load only older data from the cache and real-time data from the database. Isn't that how it works?
What about some parallel processing?
This answer is valid as of Superset 0.37.0.
At the moment, dashboard performance is affected by a few different factors. I'll enumerate them below along with methods to improve performance:
Database concurrency limits can have an impact on dashboard performance. Dashboards load their information in parallel via concurrent web requests. Make sure that the database user provided allows enough concurrency that queries aren't being queued at the database layer.
Cache performance: your caching layer should be able to return multiple results extremely quickly, if not in parallel. We've had success leveraging S3 for our cache.
Cache hit percentage: Superset will hit the cache only for queries that exactly match one that has been run recently. Otherwise the full query will fall through to the underlying analytical DB (Druid in this case). You can reduce the query load on Druid by using a less granular resolution on your dashboard: if it's possible to have it update less frequently, say a couple of times a day rather than in real time, this can hit the cache for all requests other than the first request in the new period under consideration.
Python web process concurrency limits: make sure that your web application server can handle enough parallel requests. The browser will request multiple charts' data at the same time, and the system will need to be able to handle these requests in parallel (see the configuration sketch after this list).
Chart query performance: as data is frequently requested, especially real-time data from a database like Druid, optimizing the queries run by the charts can be very useful. I'd take a look at any virtual datasources that are being leveraged to see if they can be materialized or made more efficient.
Web browser concurrent request limits: by default most web browsers limit the number of concurrent requests that can be made to the same FQDN. If you have more than 6 charts on the same dashboard, it can be helpful to balance requests across multiple FQDNs running Superset to get around this browser limitation. There's more information on the approach in the issue history on GitHub, but Superset does support this type of configuration.
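To make the cache and concurrency points above more concrete, here is a minimal superset_config.py sketch. The Redis URL, timeout, and worker counts are assumptions you would tune for your own deployment; CACHE_CONFIG follows the Flask-Caching convention Superset uses.

```python
# superset_config.py -- illustrative values only; tune for your deployment.
# Superset hands CACHE_CONFIG straight to Flask-Caching.
CACHE_CONFIG = {
    "CACHE_TYPE": "redis",                  # a fast shared cache; S3-backed caches also work
    "CACHE_DEFAULT_TIMEOUT": 60 * 60 * 24,  # keep chart results for a day so repeat loads hit cache
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": "redis://cache-host:6379/0",  # placeholder host
}

# Allow slower analytical queries to finish before the web server times out.
SUPERSET_WEBSERVER_TIMEOUT = 120

# Web process concurrency is set where you start Superset, e.g. under gunicorn:
#   gunicorn -w 8 --threads 4 -b 0.0.0.0:8088 "superset.app:create_app()"
```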
The community is very interested in improving performance over time, and as such there have been recommendations to move all analytical queries to Celery as well as making other architectural changes to improve performance. I hope this description helps and that something in here will help you track down the issue!
The company I work for right now is planning to use AWS to host a new website for a client. Their old website had roughly 75,000 sessions and 250,000 page views per year. We haven't used AWS before and I need to give a rough cost estimate to my project manager.
This new website is going to be mostly content-driven with a CMS backend (probably WordPress) plus a cost calculator for their services. Can anyone give me a rough idea of the cost to host this kind of website on AWS?
I have used the Simple Monthly Calculator with a single Linux t2.small, 3-year upfront, which gave me around $470.
(forgive my English)
The only way to know the cost is to know the actual services you will consume (Amazon EC2, Amazon EBS, database, etc). It is not possible to give an accurate "guess" of these requirements because it really does depend upon the application and usage patterns.
It is normally recommended that you implement the system and run it for a while before committing to Reserved Instances so that you have a chance to measure performance and test a few different instance types.
Be careful using T2 instances for production workloads. They are very powerful instances, but if the CPU Credits run out, the amount of CPU is limited.
Bottom line: Implement, measure, test. Then you'll know what is right for your needs.
Take Note
When you are new to AWS you get a one-year free tier on a single t2.micro.
Just pointing that out; looking at your requirements, you may not need this.
One load balancer and one app server should be fine (just use Route 53 to serve some static pages from S3 while upgrading or scaling).
Email subscriptions and processing of some documents can be handled with AWS Lambda, SNS, and SQS, which may further reduce the cost (you may be able to reduce the server size and do all the heavy lifting from Lambda).
A simple web page with 3,000 requests per month can be handled by a t2.micro, which is almost free for one year as mentioned in the note above.
You don't have a lot of details in your question. AWS has a wide variety of services that you could be using in that scenario. To accurately estimate costs, you should gather these details:
What will the AWS storage be used for? A database, applications, file storage?
How big will the objects be? Each type of storage has different limits on individual file size, estimate your largest object size.
How long will you store these objects? This will help you determine static, persistent or container storage.
What is the total size of the storage you need? Again, different products have different limits.
How often do you need to do backup snapshots? Where will you store them?
Every cloud vendor has a detailed calculator to help you determine costs. However, to use them effectively you need to have all of these questions answered and you need to understand what each product is used for. If you would like to get a quick estimate of costs, you can use this calculator by NetApp.
I read the documentation; both are for data analysis and both run as clusters, but I don't understand how their use cases differ.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. (Amazon Elasticsearch)
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. (Amazon Redshift)
Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (AFAIK) mostly used for BI purposes and other compute-intensive jobs, while Amazon Elasticsearch is an out-of-the-box managed ElasticSearch cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything to manage your servers (this is what you pay for). Using the AWS Console you can add a new cluster, and you don't need to run any commands in order to install any software; you just need to choose which servers to run your cluster on (number of nodes, disk, RAM, etc.).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch
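As an illustration, on an AWS Elasticsearch domain that has the Open Distro SQL plugin enabled, a SQL statement can be sent over HTTP roughly like this. The endpoint and index name are placeholders, and authentication (IAM request signing or basic auth) is omitted for brevity.

```python
# Sketch: run a SQL query against an AWS Elasticsearch domain through the
# Open Distro SQL endpoint. Endpoint and index are placeholders; auth omitted.
import requests

ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

resp = requests.post(
    f"{ES_ENDPOINT}/_opendistro/_sql",
    json={"query": "SELECT status, COUNT(*) FROM logs GROUP BY status"},
    timeout=30,
)
print(resp.json())
```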
I agree with @IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear: whether speaking of the AWS ElasticSearch Service or FOSS/Enterprise ElasticSearch (which themselves have significant differences), ElasticSearch is NOT a relational database (RDBMS), nor is it quite a NoSQL (document store) database, either...
ElasticSearch is a search engine / index. It does some things very well, for very specific use cases. However, most significantly, unlike RDBMS data models, ElasticSearch and NoSQL are not going to provide you with full ACID compliance or transactional statement processing. So if your use case prioritizes data integrity, constrainability, reliability, auditability, regulatory compliance, recoverability (even to a point in time), and a normalized data model for performance and least repetition of data, while providing deep cardinality and enforcing model constraints for optimal integrity, "NoSQL and Elastic are not the droids you're looking for..." and you should be implementing an RDBMS solution. As already mentioned, the AWS Redshift service is based on PostgreSQL, one of the most popular open-source RDBMS flavors out there, just offered by AWS as a fully managed solution/service for their customers.
Elastic falls between the RDBMS and NoSQL categories, as it is a search engine / index that works most optimally with "single index" type use cases, where a lot of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing. But perhaps the most important thing I could stress is that, in my experience, it typically does not scale very cost-effectively (even as a managed cluster service) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers; for most it will likely become cost prohibitive very fast. That said, ElasticSearch does still have very good use cases, so it is always worth evaluating against your unique requirements; just keep scalability and cost in mind while doing so.
Lastly, let's call NoSQL what it is: a document store that stores collections of documents (most often in JSON format). These also do indexing, offer some semblance of an authentication and authorization model, and provide CRUD operability (or even SQL support nowadays, which makes the career enterprise data engineer in me giggle, that SQL is now the preferred means of querying data from their NoSQL instances! :D). Still not a traditional database, and it likely won't provide you with much control over your data's integrity; but that is precisely what "NoSQL" document stores were designed to work best for: unstructured data, where you may not always know what your data model is going to look like from the start, or your use case prioritizes data model flexibility over enforcing data integrity in general (non-mission-critical data). Last, while most modern NoSQL document stores may have some features that appear on the surface to resemble an RDBMS, I am not aware of any in that category at present that could claim to offer all that a relational database does, with Oracle MySQL's Document Store being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So, I hope developers with similar questions come across this thread and, after reading, are much better informed to make the most optimal design decisions for their use cases. Because if we're all being honest with ourselves, everything we do in our profession is about data: generating it, transporting it, rendering it, transforming it... it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!
This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
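To make that concrete, here is a rough sketch of that kind of query run against Redshift from Python. The table and column names are invented for illustration, and since Redshift speaks the PostgreSQL wire protocol, psycopg2 is used as the client.

```python
# Sketch: the kind of analytical query Redshift is designed for.
# Table/column names are invented; connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="analyst", password="secret",
)
sql = """
SELECT city, purchase_year, COUNT(*) AS buyers_of_both
FROM (
    -- customers who bought BOTH products in a given city and year
    SELECT city, purchase_year, customer_id
    FROM   purchases
    WHERE  product IN ('toothbrush', 'toothpaste')
    GROUP  BY city, purchase_year, customer_id
    HAVING COUNT(DISTINCT product) = 2
) both_products
GROUP BY city, purchase_year
ORDER BY city, purchase_year;
"""
with conn, conn.cursor() as cur:
    cur.execute(sql)
    for row in cur.fetchall():
        print(row)
```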
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.
I am working on a civics related project and I need to be able to display all the properties in the City of Philadelphia on a map, so I'll need to get the latitude & longitude for all 580,000 properties. (Only once)
Most APIs like Google/Yahoo have limits of 5,000 per day, and even BatchGeo has a similar limit.
Is there a way I can do a one-time geocoding of all these addresses?
You can find a list of free and paid geocoding services on the USC site.
Also check Microsoft's Geocode Dataflow API; it allows up to 200,000 entries / 300 MB and takes up to 14 days.
Another possibility is to combine several services at once: use four services that each allow 5,000 entries a day and you'll finish your task in a month.
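A rough sketch of how that daily-limit juggling could look is below; the provider names and the geocode_address() helper are hypothetical placeholders rather than real client libraries.

```python
# Sketch: spread ~580,000 addresses across several geocoding providers that
# each allow ~5,000 free requests per day. PROVIDERS and geocode_address()
# are hypothetical placeholders, not real APIs.
DAILY_LIMIT = 5_000
PROVIDERS = ["provider_a", "provider_b", "provider_c", "provider_d"]

def geocode_address(provider: str, address: str):
    """Placeholder: call whichever geocoding service 'provider' refers to."""
    raise NotImplementedError

def run_one_day(addresses):
    """Geocode up to len(PROVIDERS) * DAILY_LIMIT addresses in a single day."""
    results = {}
    for i, provider in enumerate(PROVIDERS):
        batch = addresses[i * DAILY_LIMIT:(i + 1) * DAILY_LIMIT]
        for address in batch:
            results[address] = geocode_address(provider, address)
    return results

# 580,000 addresses / (4 providers * 5,000 per day) is roughly 29 daily batches.
```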
You can use MapQuest or CloudMade.
I have created a small utility to help compare these API's.
The utility is hosted at the URL below:
http://ankit-zalani.appspot.com/GeoCode/index.jsp
Tobias, I work for an address verification (and recently, geocoding) company called SmartyStreets.
Many services have usage restrictions based on volume and license agreements which prevent users from storing the results of geocoding queries. There are some vendors, however, which don't have limits or restrictions like that.
I would recommend something like LiveAddress which will not only geocode the addresses but also perform CASS-Certified verification to make sure your addresses are correct before giving you potentially faulty coordinates. You can run 580,000 or even millions at a time in a few minutes, and we allow you to store your results.
Hope this helps. If you have any more questions about addresses, I'll personally assist.
This thread is pretty old by now, but there have been some developments in recent years making bulk geocoding very cheap. My favorite option is to obtain a geocoding server on AWS (google: geocoding on AWS); there are many options, some free, some with low hourly rates (total cost depends on the server you choose, of course).
In developing a new web service I haven't been able to find very much information on how companies bill for their web services.
Do you bill by request, or only for certain requests (i.e., GET or POST)?
- Would these be tracked at the application or server level?
Do you bill by bandwidth?
- Again, how would this be tracked on a per-user basis?
Do you charge a subscription simply to have access?
- This assumes that they are only granted an API key after payment has been made.
A combination of the above, or other options?
Thanks for your help.
As with all things in a market economy, the price, but also the inconvenience (or convenience) and risk associated with the actual payment (irrespective of the amount), is a function of how unique, cool, and valued your service or product is.
It is therefore impossible to answer the question except in very generic terms, i.e. in the form of suggestions. Your actual invoicing model may be based on one or several of the following:
bill for a one-time setup fee
bill on a subscription basis (i.e. for a defined period, with explicitly defined maximum amounts of usage)
bill for maintenance
bill by the act, i.e. a certain amount per act (possibly on a decreasing unit price schedule). Such acts should be counted at the server level (the client side may include some audit/monitoring/logging of sorts, but the server side should be the authoritative source of info); see the metering sketch after this list
bill by volume (for example the number of MBytes transferred, etc.); this is applicable to services where there is a big variation in the volume of info produced for each "act"
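As a rough illustration of counting billable "acts" at the server level, here is a minimal Flask sketch that meters requests per API key. The X-Api-Key header and the in-memory counter are assumptions; a real system would persist counts in a durable store and reconcile them at invoicing time.

```python
# Sketch: server-side metering of billable requests per API key.
# The header name and in-memory counter are illustrative assumptions.
from collections import Counter
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
usage = Counter()  # api_key -> number of billable requests

@app.before_request
def meter_request():
    api_key = request.headers.get("X-Api-Key")
    if not api_key:
        abort(401)
    if request.path != "/usage":      # don't bill customers for checking their usage
        usage[api_key] += 1           # the server, not the client, is the source of truth

@app.route("/compute")
def compute():
    return jsonify(result=42)

@app.route("/usage")
def report_usage():
    # Lets a customer see exactly how many calls they will be billed for.
    return jsonify(count=usage[request.headers.get("X-Api-Key")])
```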
In general, the price and the modality of accounting should seem fair, to both parties, particularly to the buyer, and typically, the simpler the better. The price should not necessarily be low, provided you can make the case that the service provided is effectively valuable, and that you either invested and took risk to introduce the service, or the on-going expenses associated with running the service are evident.
I guess It Depends™ on what the service does. Broadly, I'd say you should bill when you provide some intrinsic value; how you determine that billing criterion is quite domain-specific. There may be some property of the service provided which allows you to determine how much to bill.
For example, suppose you've a web service that performs a calculation. You might decide that for every successful computation you do, you're going to charge a fixed fee, say $0.01, but let users off if there's a validation problem, such as an invalid request. Alternatively, if those computations are vaguely long-running, you might have a charging model that's based on some sort of CPU-time metric.
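For the CPU-time variant, a minimal sketch of how a per-request charge could be computed is below; the rate and the heavy_computation() function are illustrative placeholders.

```python
# Sketch: charge for a computation based on the CPU time it consumed.
# RATE_PER_CPU_SECOND and heavy_computation() are illustrative placeholders.
import time

RATE_PER_CPU_SECOND = 0.01  # e.g. one cent per CPU-second

def heavy_computation(payload):
    """Placeholder for whatever the web service actually computes."""
    return sum(x * x for x in payload)

def billable_run(payload):
    start = time.process_time()
    result = heavy_computation(payload)
    cpu_seconds = time.process_time() - start
    charge = round(cpu_seconds * RATE_PER_CPU_SECOND, 4)
    return result, charge
```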
Your point about subscriptions is a good one, and this is an area where you might potentially benefit from allowing a couple of commercial models; one to cater for the users who might perform a lot of requests per month, in which case a fixed subscription might make sense, and one to cater for users who make a few ad-hoc requests. In the latter case, of course, if you only attract those customers, then you're not going to make a good return on investment. Some kind of middle ground, whereby you have a small subscription, but then allow customers to buy a "block" or "bundle" of requests on top without incurring additional processing costs, might work.
Most webservices I know of charge for two things:
Volume of "usage". Generally giving low volumes "free" access (i.e., less than X hits/hour from a given IP address account combination). This is similar to say, twitter which gives you 150 hits/hour to its service from either your username, or unique IP or combination of the two (so you dont abuse it by changing IPs frequently). If you want a higher volume you pay for that access and its usually assigned by account (in twitters case you can get a dev account [for free] which gives you 20K or more hits an hour)
Depth of details / access to features: again, free accounts get a minimum amount of access, but don't get access to more data or to more advanced features (filtering, etc.). Lots of Google services work like this, where base access is given to everyone but if you want more refined abilities (greater search, more data, faster results) you have to buy an account code with the corresponding functionality.
I haven't really seen or participated in any projects with pay-for-performance or pay-per-hit/access models, as they get very difficult to reliably bill for and very hard to account for to customers, even if you use tiered or banded ranges. How do you tell your customers how many hits they have used, especially in a distributed system with redundant fail-over, etc.? If I had to pay $0.01 per access I would want to know exactly how it's measured, what the company had in place to control access, how accurate their monitoring was, etc.
It's not impossible, and it definitely can be done, and it may work well in large bulk scenarios.
Many of the ones I have seen bill by time, such as on a monthly or yearly basis. Some allow you to pay by the month, some require some (or all) of the fee up front. Access might be restricted by issuing a security certificate for the web service that expires when the customer's account expires, or possibly by having them send a client ID and letting the server check if that client ID is allowed to have an answer (but that's open to people stealing someone else's client ID ;) ).
I suppose if you have a service that sends and receives very large amounts of data, it might make sense to bill per service request, but the billing for that could get trickier. Are clients likely to make dozens of requests per day, or just a few? How much to bill per transaction? $100? $0.01? That all would depend on the nature of the service. If you want to go that route, you would probably need to be able to ensure that clients only get billed for requests that are successfully answered (I'd hate to get billed even though my client app failed to receive the entire web service message from your server).
Per request or as a subscription, and yes, bandwidth can be a variable that is used to set the fee. It depends on the value of binding the customer closely versus having a myriad of loosely coupled customers using the service. There is no correct answer to the question that fits all or even most cases.
If I look at the services I have made in the past, the subscription model would be the best model to use. Sometimes a small per-request charge seems like the best approach, but I have never had a service configured that way yet.
I agree with what has been said by Rob and Des. One thing to remember is that a subscription is a really simple concept that everyone is used to and comfortable with (if you price it right). If you want to cover a wide audience look at how the payment providers do - they have slightly different methods of payment depending on how many transactions you do per year. There'll be a fixed subscription plus a per-transaction charge and they both vary with the number of transactions. This is the most flexible, but it depends if it makes sense for your business.