I would like to get a picture of how pricing works in Azure for SQL Data Warehouse.
Scenario: I have kept the SQL Data Warehouse in Azure at 6000 DWU and have not done any inserts/updates/deletes/selects, i.e. no operations have been run against the SQL Data Warehouse. So will I be charged the 6000-tier pricing for each day even though no operations have been performed, or will I not be charged?
Thanks.
You will be charged. Azure SQL Data Warehouse is charged based on hours running (i.e. not in a paused state) and on the DWU level you were running at, as per the pricing page here. It's quite expensive to run at DWU 6000 and not do anything. Why didn't you pause it?
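To illustrate how the hours-running model adds up, here is a minimal sketch; the hourly rate below is a made-up placeholder, so substitute the real DWU 6000 rate for your region from the pricing page:

```python
# Rough illustration of the hours-running billing model for SQL Data Warehouse.
# HOURLY_RATE_DWU6000 is a hypothetical placeholder -- look up the real rate
# for your region on the Azure pricing page.
HOURLY_RATE_DWU6000 = 90.0   # hypothetical $/hour while the warehouse is not paused

hours_not_paused = 24        # left running all day, even with zero queries
daily_compute_charge = HOURLY_RATE_DWU6000 * hours_not_paused

print(f"Approximate compute charge for the day: ${daily_compute_charge:,.2f}")
# Storage is billed separately and continues to accrue even while paused.
```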
Azure Data Lake Analytics (ADLA) is pay as you go if that's what you are looking for.
Related
In GCP BigQuery, there are two types of pricing: 1) on-demand and 2) flat-rate.
I have two projects: one is configured with on-demand pricing, the other with flat-rate pricing. When I execute a query in each project, I need to be able to tell which query ran under on-demand pricing and which under flat-rate pricing. In the GUI, we can see the difference.
(Screenshots: flat-rate pricing vs. on-demand pricing in the console.)
But through the BigQuery client libraries I am calling the BigQuery API to get the Job object by jobId. There I am unable to find any difference between those queries; at the least I expected some reservation-related info on the flat-rate query, but no luck. I need this info in our monitoring tool to distinguish queries executed under on-demand pricing from those executed under flat-rate pricing.
One approach I found is getting the info through INFORMATION_SCHEMA, but I am more interested in the API level, through the BigQuery Java libraries.
In the jobs.get API you get a Job object. In it you have a JobStatistics entry, and in that entry you have the reservation detail.
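For example, a minimal sketch with the Python client (the Java client exposes the same JobStatistics data); reading the raw statistics dict here is an assumption on my part, and the exact reservation field name may vary by API version:

```python
# Sketch: tell flat-rate from on-demand queries via the job's statistics.
# Assumes the google-cloud-bigquery Python client; field names are read
# defensively since the reservation key (reservation_id vs. reservationId)
# may differ by API/library version.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # placeholder project
job = client.get_job("my-job-id", location="US")        # placeholder job id

stats = job._properties.get("statistics", {})           # raw JobStatistics dict
reservation = stats.get("reservation_id") or stats.get("reservationId")

if reservation:
    print(f"Ran under flat-rate pricing (reservation: {reservation})")
else:
    bytes_billed = stats.get("query", {}).get("totalBytesBilled")
    print(f"No reservation attached -- billed on-demand, bytes billed: {bytes_billed}")
```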
I am looking for a programmatic way to monitor my Lambda serverless environment cost in real time, or retrospectively over the last x hours. I have looked at the Budgets API, but it seems to always revolve around a defined budget, which is not my use case. The other approach I thought might work is to count Lambda executions and calculate the cost based on each function's memory configuration. Any insight or direction on how to go about this programmatically would be highly appreciated.
From Using the AWS Cost Explorer API - AWS Billing and Cost Management:
The Cost Explorer API allows you to programmatically query your cost and usage data. You can query for aggregated data such as total monthly costs or total daily usage. You can also query for granular data, such as the number of daily write operations for DynamoDB database tables in your production environment.
Cost Explorer refreshes your cost data at least once every 24 hours, so it isn't "real-time".
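For example, a minimal sketch with boto3 that pulls the daily Lambda cost for the last few days (the service filter value and date handling are the main things to adapt):

```python
# Sketch: pull daily AWS Lambda cost from the Cost Explorer API with boto3.
# Note: Cost Explorer data lags by up to ~24 hours, so this is retrospective,
# not real-time.
import datetime
import boto3

ce = boto3.client("ce", region_name="us-east-1")

end = datetime.date.today()
start = end - datetime.timedelta(days=3)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AWS Lambda"]}},
)

for day in resp["ResultsByTime"]:
    cost = day["Total"]["UnblendedCost"]
    print(day["TimePeriod"]["Start"], cost["Amount"], cost["Unit"])
```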
According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time, but this is a soft limit and can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. In the AWS account, I am the only one using the Athena service. However, when I look at the state of the queries in the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being in the Running state. This is what I would normally see in the Athena History tab.
I understand that, after I submit queries to Athena, it processes them by assigning resources based on the overall service load and the amount of incoming requests. But I have tried running them on different days and at different hours, and I still get only about 5 queries executing at the same time.
So my question is: is this how it is supposed to be? If so, what is the point of being able to submit up to 20 queries if roughly 15 of them will just be idling and waiting for available slots?
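For reference, the submission and state-checking logic is essentially the following sketch (the query text, database and output location are simplified placeholders):

```python
# Sketch of the submission/polling pattern (query text, database and output
# location are placeholders).
import time
from collections import Counter

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query_ids = []
for i in range(16):
    resp = athena.start_query_execution(
        QueryString=f"CREATE TABLE ctas_part_{i} AS SELECT * FROM source_table WHERE part = {i}",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/ctas/"},
    )
    query_ids.append(resp["QueryExecutionId"])

# Poll until everything finishes, printing how many queries are in each state.
while True:
    executions = athena.batch_get_query_execution(QueryExecutionIds=query_ids)
    states = Counter(q["Status"]["State"] for q in executions["QueryExecutions"])
    print(dict(states))  # at the time: {'RUNNING': 16}, even though only ~5 were actually executing
    if states.keys() <= {"SUCCEEDED", "FAILED", "CANCELLED"}:
        break
    time.sleep(30)
```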
Update 2019-09-26
I just stumbled across the Hive Connector page in the Presto documentation, which has a section on AWS Glue Catalog Configuration Properties. There we can see:
hive.metastore.glue.max-connections: Max number of concurrent connections to Glue (defaults to 5).
This got me wondering whether it has something to do with my issue. As I understand it, Athena is essentially Presto running on an EMR cluster that is configured to use the AWS Glue Data Catalog as its metastore.
So what if my issue comes from the fact that the EMR cluster behind Athena simply uses the default value for concurrent connections to Glue, which is 5, and that is exactly how many concurrent queries (on average) are actually getting executed in my case?
Update 2019-11-27
The Athena team recently deployed a host of new functionality for Athena. Although QUEUED has been in the state enum for some time, it hasn't been used until now. So now I get correct info about query state in the History tab, but everything else remains the same.
Also, another post was published describing a similar problem.
Your account's limits for the Athena service are not an SLA; they are more of a priority in the query scheduler.
Depending on available capacity, your queries may be queued even though you're not running any other queries. Exactly what a higher concurrency limit means is internal and could change, but in my experience it's best to think of it as the priority with which the query scheduler will deal with your query. Queries for all accounts run in the same server pool(s), and if everyone is running queries there will not be any capacity left for you.
You can see this in action by running the same query over and over again and plotting the query execution metrics over time: you will notice that they vary a lot, and that there are spikes in the time your queries spend queued at the top of every hour, when everyone else is running their scheduled queries.
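A minimal sketch of that experiment with boto3 (the query, database and result location are placeholders; the timing fields come from the Statistics block of get_query_execution):

```python
# Sketch: run the same query repeatedly and record how long it spends queued
# versus executing, to see the scheduling variance over time.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_and_measure(sql, database, output_location):
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    stats = execution["Statistics"]
    return {
        "queued_ms": stats.get("QueryQueueTimeInMillis"),
        "engine_ms": stats.get("EngineExecutionTimeInMillis"),
        "total_ms": stats.get("TotalExecutionTimeInMillis"),
    }

# Placeholders: adjust the query, database and result bucket for your account.
for _ in range(10):
    print(run_and_measure("SELECT count(*) FROM my_table",
                          "my_database", "s3://my-athena-results/bench/"))
    time.sleep(60)
```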
What tools are there to investigate query performance in Cloud Datastore? I'm interested in seeing things similar to a SQL query plan, or any metrics beyond the total round-trip request time.
Unfortunately, these insights don't exist. Cloud Datastore only supports queries where an index exists.
Does anyone know how fast the copy speed is from Amazon S3 to Redshift?
I only want to use Redshift for about an hour a day, to run updates on Tableau reports. The queries being run are always on the same database, but I need to run them each night to take into account new data that has come in that day.
I don't want to keep a cluster going 24x7 just to be used for one hour a day, but the only way that I can see of doing this is to import the entire database into Redshift each night (I don't think you can suspend or pause a cluster). I have no idea what the copy speed is, so I have no idea whether it's going to be relatively quick to copy a 10GB file into Redshift every night.
Assuming it's feasible, my thinking is to push the incremental changes from the SQL Server database into S3. Using CloudFormation, I automate the provisioning of a Redshift cluster at 1am for one hour, import the database from S3, and schedule Tableau to run its queries within that window and get its results. I keep an eye on how long the queries take, and if I need longer than an hour I just amend the CloudFormation template.
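Roughly, the nightly job I have in mind would look something like this sketch (the cluster parameters, credentials, IAM role and COPY options are all placeholders):

```python
# Sketch of the nightly job: bring up a cluster, COPY from S3, run the
# reports, then tear it down. All identifiers and credentials are placeholders.
import boto3
import psycopg2

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="nightly-reporting",
    NodeType="dc2.large",                 # placeholder node type
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME",
    DBName="reports",
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],  # placeholder
)
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="nightly-reporting")

endpoint = redshift.describe_clusters(
    ClusterIdentifier="nightly-reporting"
)["Clusters"][0]["Endpoint"]

conn = psycopg2.connect(host=endpoint["Address"], port=endpoint["Port"],
                        dbname="reports", user="admin", password="REPLACE_ME")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY sales_incremental
        FROM 's3://my-bucket/incremental/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        FORMAT AS CSV GZIP;
    """)

# ... Tableau runs its queries/extract refreshes during this window ...

redshift.delete_cluster(ClusterIdentifier="nightly-reporting",
                        SkipFinalClusterSnapshot=True)
```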
In this way I hope to keep a really 'lean' Tableau server by outsourcing all the ETL to Redshift, and buying only what I consume on Redshift.
Please feel free to critique my solution, or outright blow it out of the water. Otherwise, if the consensus is that importing is relatively quick, it gives me a thumbs up that I'm headed in the right direction with this solution.
Thanks for any assistance!
Redshift loads from S3 are very quick; however, Redshift clusters do not come up or tear down quickly at all. In the above example, most of your time (and money) would be spent waiting for the cluster to come up, existing data to load, refreshed data to unload, and the cluster to tear down again.
In my opinion it would be better to use another approach for your overnight processing. I would suggest either:
For a couple of TB, InfiniDB on a largish EC2 instance with the database stored on an EBS volume.
For many TBs, Amazon EMR with the data stored on S3. If you don't want to get into Hadoop too much you can use Xplenty/Syncsort Ironcluster/etc. to orchestrate the Hadoop element.
While this question was written three years ago and it wasn't available at that time, a suitable solution to this now would be to use Amazon Athena, which allows on-demand SQL querying of data held in S3. This works on a pay-per-query model, and is intended for ad-hoc and "quick" workloads like this.
Behind the scenes, Athena uses Presto and Elastic MapReduce, but the only required knowledge for a developer/analyst in practice is SQL.
Tableau also now has a built-in Athena connector (as of 10.3).
More on Athena here: https://aws.amazon.com/athena/
You can presort data you are keeping on S3. It will make Vacuum much faster.
This is the classic problem with Redshift. If you are looking for a different way: Microsoft recently announced a new service called SQL Data Warehouse (it uses the PDW engine), and I think they want to compete directly with Redshift. The most interesting concept here is the familiar SQL Server query language and toolset (including stored procedure support). They have also decoupled storage and compute, so you can have 1 GB of storage but 10 compute nodes for an intensive query, and vice versa. They claim that compute nodes start in a few seconds, and that you don't have to take the cluster offline when you resize it. The cloud data warehouse battle is getting hot :)