In the performanceInsights of bigquery there is one field called avgPreviousExecutionMs, In the description, it was given as like this
Output only. Average execution ms of previous runs. Indicates the job ran slow compared to previous executions. To find previous executions, use INFORMATION_SCHEMA tables and filter jobs with the same query hash.
I tried to validate this avgPreviousExecutionMs data for one of my jobs based on the views in the information schema by filtering the queries with the same hash based on the query_info.query_hashes.normalized_literals field.
Steps that I have done to validate this info.
I ran a job on the flat rate pricing concurrently with other queries, to make it slower such that I can get the avgPreviousExecutionMs field in the PerformanceInsights section.
Now I want to validate this field with the information schema data
I ran this query in the information schema, by excluding my current job id
SELECT
AVG(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) as avg_duration
FROM
region-us.INFORMATION_SCHEMA.JOBS_BY_USER where query_info.query_hashes.
normalized_literals = "myqueryHash" and job_id != "myjobId";
The result of the query and the avgPreviousExecutionMs value shown for that job in this section are not matching.
How can we validate this info ?
the avgPreviousExecutionMs is based on taking the how much time
period of data.
This average is based on the JOBS view or JOBS_BY_USER view or JOBS_BY_FOLDER or JOBS_BY_ORGANIZATION
Related
This is my use case:
I have a JSON Api with 200k objects. The dataset looks a little something like this: date, bike model, production time in min. I use Lambda to read from a JSON Api and write in DynamoDB via http request. The Lambda function runs everyday and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? Which AWS database is more suitable for my use case then?
Yes, uuid or any other unique identifier (ex: date+bike model+created time) as pk is fine.
It seems your daily job for average value is some sort of data analytics job not really a transaction job. I would suggest to go with a service support data analytics such as Amazon Redshift. You should be able to add data to such database service using Dynamodb streams. Alternatively, you can stream data into s3 and use a service like Athena to get the daily average.
There is a simple database model that you could use for this task:
PartitionKey: a UUID or use any combination of fields that provide uniqueness.
SortKey: Production date, as a string, i.e. 2020-07-28
If you then create a secondary index which uses as PK the Production date and includes the production time, you can then query (not scan) the secondary index for a specific date and perform any calculations you need on production time. You can then provision the required read/write capacity on the secondary index and the table independently.
Regarding your third question, I don't see any real benefit of using DynamoDB for this task. Any RDS (i.e. MySQL), Redshift or even S3+Athena can easily handle such use case. If you require real time analytics, you could even consider AWS Kinesis.
I want to retrieve data from BigQuery that arrived every hour and do some processing and pull the new calculate variables in a new BigQuery table. The things is that I've never worked with gcp before and I have to for my job now.
I already have my code in python to process the data but it's work only with a "static" dataset
As your source and sink of that are both in BigQuery, I would recommend you to do your transformations inside BigQuery.
If you need a scheduled job that runs in a pre determined time, you can use Scheduled Queries.
With Scheduled Queries you are able to save some query, execute it periodically and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click in Schedule query and then in Create new scheduled query as you can see in the image below
Pay attention in this two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc.. If you need to execute it every two hours, for example, you can set the Repeat option as Custom and set your Custom schedule as 'every 2 hours'. In the Start date and run time field, select the time and data when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
According with Google recommendation, when your data are in BigQuery and when you want to transform them to store them in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why, I don't recommend you dataflow for your use case. If you don't want, or you can't use directly the SQL, you can create User Defined Function (UDF) in BigQuery in Javascript.
EDIT
If you have no information when the data are updated into BigQuery, Dataflow won't help you on this. Dataflow can process realtime data only if these data are present into PubSub. If not, it's not magic!!
Because you haven't the information of when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution is you use BigQuery for your processing.
Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created to my customer's views with a predicate like idClient = code so the privacy is guaranteed.
The problem with this strategy is that there are customers with 5M rows and others with 200K and as BQ does not have indexes, they are always processing data from each other (and the costs are rising).
I am intending to create a timestamp field where each customer will have a different timestamp that will be repeated for every Insert in every customer sensitive table and thus I can query by timestamp by fixing it as it would be with a standard ID.
Does this make any sense? If BQ was an indexed database I'd be concerned about skewed data but as it is always full table scan, I think I'd have only benefits and no downsides.
The solution for your problem is to add Cluster field to your table which is equivalent to an Index in other databases
This link provides the basic on how to use cluster field
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns
Note: When using cluster field BigQuert dryRun doesn't show the cost improvement which can only be seen post-execution
I don't understand the way "firestore" counts the reads for the quota and billing.
Example: I have a collection with 200'000 documents. Every document has a timestamp as attribute. Now I would like to get all documents within the last hour. So I create a query which gives me back all documents with "timestamp > now()-60 minutes". The result is a set of 10 documents. Does Firestore counts 10 document reads or 200'000 documents read for this?
I would like to build a query that read a document always once (fetch, not real time).
I assume that firestore is only at the first view a cheaper solution than with other solution (e.g. google cloud sql, aws etc).
To allow that query, you will need to have an index on timestamp. Firestore uses that index to determine what documents to return. So this would count as 10 document reads.
I'd like to give my partners the results of simple COUNT(*) ... GROUP BY items.color type queries and perhaps joins over items and orders or some such. I'd like query response time to be sub-second (on the order of a second, at worst), and scale to billions of rows counted.
My current approach is to either backup my GCDatastore data and load it into BigQuery and provide daily analytics or use GCDataflow to maintain a set of pre-defined counters.
Is this something Spanner has as a use-case for, if I transition my backend from Datastore to Spanner?
Today, running counting queries in Cloud Spanner requires a full table scan. Depending on the size of the table this could take more than a second.
One thing you could do is to track the count in a separate table, and whenever you update the items table, update the count in the same transaction.