BigQuery data consumption - google-cloud-platform

I have a large query to run in BigQuery. It has multiple table joins and multiple CTEs.
Before I run the query in GCP, once it's ready to run, the editor shows a message saying:
This script will process 30 MiB when run.
Once I've run the query, it says:
Query complete (0.4 sec elapsed, 4.91 GB processed)
How can I reduce the bytes processed? It's 4.91 GB now.
If I want to optimize a query, where can I see information about its data consumption and execution time?
I want to optimize my query and show the resulting data consumption of the query.

Most of the time, when federated views are used in the query, there will be a difference between the pre-execution estimate and the post-execution stats.
In other cases they are almost the same.
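If you want to inspect data consumption programmatically, both the pre-execution estimate (a dry run) and the post-execution statistics are exposed on the query job. Below is a minimal sketch using the google-cloud-bigquery Python client; the SQL itself is a placeholder for your query.

from google.cloud import bigquery

client = bigquery.Client()

sql = "SELECT ..."  # placeholder for your query with joins and CTEs

# Dry run: returns the same estimate as "This script will process ... when run"
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print("Estimated bytes processed:", dry_job.total_bytes_processed)

# Real run: the finished job carries the post-execution statistics shown in the UI
job = client.query(sql)
job.result()  # wait for the query to complete
print("Actual bytes processed:", job.total_bytes_processed)
print("Bytes billed:", job.total_bytes_billed)
print("Slot milliseconds:", job.slot_millis)

The same figures also appear in the console under the query's Job information and Execution details tabs.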

Related

Does a Google Datastore query plus transaction run slower than just a query?

Are Google Datastore queries slower when put into a transaction? Assuming the query is exactly the same, would the run time of a transaction + query be slower than the query not in a transaction?
Does the setup of the transaction add any execution time?
Here's some data from running a single document get 100 times sequentially.
type              avg    p99
transactional     46ms   86ms
nontransactional  16ms   27ms
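For reference, a benchmark along these lines could be reproduced with the google-cloud-datastore Python client; the kind, key name, and iteration count below are illustrative placeholders, not the setup used for the numbers above.

import time
from google.cloud import datastore

client = datastore.Client()
key = client.key("SomeKind", "some-id")  # placeholder entity key

def avg_get_latency(n, transactional):
    # Time n sequential gets and return the average latency in milliseconds.
    start = time.perf_counter()
    for _ in range(n):
        if transactional:
            with client.transaction():  # adds a begin/commit round trip around the get
                client.get(key)
        else:
            client.get(key)
    return (time.perf_counter() - start) / n * 1000

print("nontransactional avg ms:", avg_get_latency(100, False))
print("transactional avg ms:", avg_get_latency(100, True))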

How to make Athena process multiple queries concurrently

I'm launching several concurrent queries to Athena via a Python application.
Judging by Athena's query history, multiple queries are indeed received at the same time by Athena and processed concurrently.
However, it turns out that the overall query running time is not that different from sending queries one after the other.
Example: sending three queries sequentially vs concurrently:
# sequentially
            received at   took   finished at
query_1     22:01:14      6s     22:01:20
query_2     22:01:20      6s     22:01:27
query_3     22:01:27      5s     22:01:25
# concurrently
            received at   took   finished at
query_1     22:02:25      17s    22:02:42
query_2     22:02:25      17s    22:02:42
query_3     22:02:25      17s    22:02:42
According to these results, in the second case it seems that Athena, although appearing to treat the queries concurrently, effectively processed them sequentially.
Is there some configuration I'm not aware of that would make Athena effectively process multiple queries concurrently? Ideally, in this example, the three queries processed concurrently would take a global running time of 6s (the longest of the three individual queries).
Note: these are three queries targeting the same database/table, backed by the same (single) Parquet file in S3. This Parquet file is approx. 70 MB in size and has 2.5M rows with a half-dozen columns.
In general the way you run concurrent queries in Athena is to run as many StartQueryExecution calls as you need, collect the query execution IDs, and then poll using GetQueryExecution for each one to be completed. Athena runs each query independently, concurrently, and asynchronously.
Depending on how long you wait between polling each query execution ID it may look like queries take different amounts of time. You can use the Statistics.EngineExecutionTimeInMillis property of the response from GetQueryExecution to see how long the query executed in Athena, and the difference between the Status.SubmissionDateTime and Status.CompletionDateTime properties to see the total time between when Athena received the query and when the response was available. Usually these two numbers are very close, and if there is a difference your query got queued internally in Athena.
The numbers in your question look unlikely. That they ended on the exact same second after running for 17 seconds looks suspicious. How many times did you run your experiment? If you look at Statistics.EngineExecutionTimeInMillis, do they differ in the number of milliseconds, or are all numbers identical? Did you set ClientRequestToken, and if so, was it the same value for all three queries? In that case you actually only ran one query.
What do you mean by "concurrently"? Do you start and poll from different threads, or poll in a single loop? How long did you wait between each poll call?
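A minimal sketch of that start-then-poll pattern with boto3; the database name, output location, and SQL strings are placeholders:

import time
import boto3

athena = boto3.client("athena")

queries = ["SELECT ...", "SELECT ...", "SELECT ..."]  # placeholder SQL

# Start all queries up front; Athena runs them independently and asynchronously.
ids = [
    athena.start_query_execution(
        QueryString=q,
        QueryExecutionContext={"Database": "my_db"},                       # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena/"},  # placeholder
    )["QueryExecutionId"]
    for q in queries
]

# Poll each execution until it finishes, then report the engine execution time.
for qid in ids:
    while True:
        resp = athena.get_query_execution(QueryExecutionId=qid)
        state = resp["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    stats = resp["QueryExecution"]["Statistics"]
    print(qid, state, stats.get("EngineExecutionTimeInMillis"), "ms in engine")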

Number of concurrent BigQuery load jobs

I am firing thousands of load jobs concurrently against different BigQuery tables. I see some of them executed instantly while others are queued. I was wondering how many load jobs can run concurrently and whether there is a way to run more of them immediately.
As seen in the documentation, the limit is 100 concurrent queries and to raise it you need to contact support or sales:
The following limits apply to query jobs created automatically by running interactive queries and to jobs submitted programmatically using jobs.query and query-type jobs.insert method calls.
Concurrent rate limit for on-demand, interactive queries — 100 concurrent queries
Queries with results that are returned from the query cache, and dry run queries do not count against this limit. You can specify a dry run query using the --dry_run flag or by setting the dryRun property in a query job.
This limit is applied at the project level. To raise the limit, contact support or contact sales.
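To see which load jobs were admitted immediately and which are queued, the job state can be inspected from the Python client. A rough sketch, with the bucket, dataset, and table names as placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV)

# Fire many load jobs without waiting for each one to finish.
jobs = []
for i in range(1000):
    uri = f"gs://my-bucket/exports/part-{i}.csv"        # placeholder source files
    table_id = f"my-project.my_dataset.table_{i % 50}"  # placeholder target tables
    jobs.append(client.load_table_from_uri(uri, table_id, job_config=job_config))

# Jobs admitted immediately report RUNNING (or DONE); queued ones stay PENDING.
for job in jobs:
    job.reload()  # refresh the job state from the API
    print(job.job_id, job.state)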

Long running prepare statements on Azure SQL Data Warehouse

I am running a functional test of a 3rd party application on an Azure SQL Data Warehouse database set at DWU 1000. In reviewing the current activity via:
sys.dm_pdw_exec_requests
I see:
prepare statements taking 30+ seconds,
NULL statements taking up to 25 seconds,
statement compilation taking up to 60 seconds,
explain statements taking 60+ seconds, and
select count(1) from empty tables taking 60+ seconds.
How does one identify the bottleneck involved?
The test has been running for a few hours and the Azure portal shows little DWU consumed on average, so I doubt that modifying the DWU will make any difference.
The third-party application has a workload management feature, so I've specified a limit of 30 connections to the ADW database (understanding that only 32 sessions can be active on the database itself).
There are approximately ~1,800 tables and ~350 views in the database across 29 schemas (per information_schema.tables).
I am in a functional testing mode, so many of the tables involved in the queries have not yet been loaded, but statistics have been created on every column on every table in the scope of the test.
One userID is being used in the test. It is in smallrc.
Have a look at the tables used in the query. Make sure all columns in joins, group by, and order by have up-to-date statistics.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics
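A rough sketch of checking where time is going and refreshing statistics from Python, assuming pyodbc; the connection details and table names are placeholders:

import pyodbc

# Placeholder connection string for the ADW / dedicated SQL pool database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;"
    "UID=myuser;PWD=mypassword",
    autocommit=True,
)
cur = conn.cursor()

# Show the slowest recent requests to see which statements dominate the elapsed time.
cur.execute("""
    SELECT TOP 20 request_id, status, total_elapsed_time, command
    FROM sys.dm_pdw_exec_requests
    ORDER BY total_elapsed_time DESC
""")
for row in cur.fetchall():
    print(row.request_id, row.status, row.total_elapsed_time, (row.command or "")[:80])

# Refresh statistics on the tables involved in the slow queries (placeholder names).
for table in ["dbo.FactSales", "dbo.DimCustomer"]:
    cur.execute(f"UPDATE STATISTICS {table}")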

Questions on dynamoDB query result

I'm currently thinking about how I should write my queries for DynamoDB. I have a few questions below which I hope someone could advise me on.
Given scenario: I have a million records in a table.
Questions:
When I query, can I fetch 1000 records in batches instead of 1 million records in one go?
Is the time taken to fetch 1000 records similar to fetching 1 million records?
What happens if I hit the 1 MB limit or my table's throughput limit? How can I fetch the remaining records?
Thanks in advance!
1) Yes, you can specify a limit for a query (1000 in your case).
2) No, the time is not the same. More records mean more time, because you will need to fetch more pages (most of the time will be spent in network round trips).
3) If you hit the 1 MB limit, DynamoDB will return a LastEvaluatedKey. You repeat the request, passing the LastEvaluatedKey as ExclusiveStartKey, until you have fetched everything (you are basically fetching in a loop).
If you hit the provisioned throughput limits, you either increase the limits or back off (i.e. you need to regulate your consumption to stay within the limits).
Reference: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
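A minimal sketch of that pagination loop with boto3; the table name and key condition are placeholders:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")  # placeholder table name

items = []
kwargs = {
    "KeyConditionExpression": Key("pk").eq("some-partition-value"),  # placeholder key
    "Limit": 1000,  # page size; each response is also capped at 1 MB
}

while True:
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    last_key = resp.get("LastEvaluatedKey")
    if not last_key:
        break  # no more pages
    kwargs["ExclusiveStartKey"] = last_key  # resume where the last page stopped

print(len(items), "items fetched")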