I don't understand how Firestore counts reads for quota and billing purposes.
Example: I have a collection with 200,000 documents, and every document has a timestamp attribute. Now I would like to get all documents from the last hour, so I create a query for all documents with "timestamp > now() - 60 minutes". The result is a set of 10 documents. Does Firestore count this as 10 document reads or 200,000 document reads?
I would like to build a query that reads each document only once (a one-time fetch, not a real-time listener).
I assume that Firestore only appears cheaper at first glance compared to other solutions (e.g. Google Cloud SQL, AWS, etc.).
To allow that query, you will need an index on timestamp. Firestore uses that index to determine which documents to return, so this would count as 10 document reads.
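For illustration, a minimal sketch of such a query with the Python client, assuming a collection named "events" and a timestamp field as described in the question; only the documents actually returned are billed as reads:

import datetime
from google.cloud import firestore

db = firestore.Client()
cutoff = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(hours=1)

# Relies on the single-field index on "timestamp" that Firestore creates by default.
recent = db.collection("events").where("timestamp", ">", cutoff).stream()
for doc in recent:
    print(doc.id, doc.to_dict())  # each returned document counts as one read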
We can query results from Google BigQuery in any language using the predefined client methods (see the docs).
Alternatively, we can also query the results and store them in Cloud Storage, for example as a .csv (see the docs on exporting data to GCS).
When we repeatedly need to extract the same data, say 100 times per day, does it make sense to cache the data in Cloud Storage and load it from there, or to rerun the BigQuery queries?
Which is more cost-efficient, and how would I obtain the unit cost of these requests to estimate the percentage difference?
BigQuery's pricing model is based on how many bytes your query processes.
So if you want to re-query the results, store them as a BigQuery table: set a destination table when you run your query.
There is no point in reloading previous results from GCS. The cost will be the same; you just complicate things.
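A minimal sketch of setting a destination table with the Python client (the table ID and the query text are placeholders); subsequent reads can then query the destination table directly:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.cached_results"  # hypothetical destination table

job_config = bigquery.QueryJobConfig(destination=table_id)
sql = "SELECT user_id, COUNT(*) AS n FROM `my-project.analytics.events` GROUP BY user_id"  # placeholder query
client.query(sql, job_config=job_config).result()  # results are now stored in table_id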
I'm copying Spanner data to BigQuery through a Dataflow job. The job is scheduled to run every 15 minutes. The problem is that if the data is read from a Spanner table that is also being written to at the same time, some of the records get missed while copying to BigQuery.
I'm using readOnlyTransaction() while reading the Spanner data. Are there any other precautions I must take while doing this?
It is recommended to use Cloud Spanner commit timestamps to populate columns like update_date. Commit timestamps allow applications to determine the exact ordering of mutations.
By using commit timestamps for update_date and specifying an exact-timestamp read, the Dataflow job will be able to find all existing records written/committed since the previous run.
https://cloud.google.com/spanner/docs/commit-timestamp
https://cloud.google.com/spanner/docs/timestamp-bounds
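A minimal sketch of writing the commit timestamp into an update_date column with the Python client (the instance, database, table, and column names are assumptions based on this answer):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical names

with database.batch() as batch:
    batch.update(
        table="items",
        columns=("item_id", "update_date"),
        # update_date must be defined with OPTIONS (allow_commit_timestamp = true);
        # COMMIT_TIMESTAMP tells Spanner to fill in the commit time of this write.
        values=[(1, spanner.COMMIT_TIMESTAMP)],
    )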
"if the data is read from a Spanner table which is also being written at the same time, some of the records get missed while copying to BigQuery"
This is how transactions work. They present a 'snapshot view' of the database at the time the transaction was created, so any rows written after this snapshot is taken will not be included.
As @rose-liu mentioned, using commit timestamps on your rows and keeping track of the timestamp when you last exported (available from the ReadOnlyTransaction object) will allow you to accurately select 'new/updated rows since the last export'.
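A minimal sketch of the export-side read with the Python client, pinning the snapshot to a timestamp and selecting rows committed since the previous run (the instance, database, table, and column names are assumptions):

import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical names

read_ts = datetime.datetime.now(tz=datetime.timezone.utc)  # record this for the next run
last_export_ts = read_ts - datetime.timedelta(minutes=15)   # in practice, the read_ts saved by the previous run

with database.snapshot(read_timestamp=read_ts) as snapshot:
    rows = snapshot.execute_sql(
        "SELECT item_id, update_date FROM items WHERE update_date > @since",
        params={"since": last_export_ts},
        param_types={"since": spanner.param_types.TIMESTAMP},
    )
    for row in rows:
        pass  # hand each row to the BigQuery load step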
I have the following DynamoDB table structure:
item_type (string) --> Partition Key
item_id (number) --> Secondary Index
The table has around 5 million records, and auto scaling is enabled with a default read capacity of 5. I need to fetch the item_ids for a given item_type. We have around 500,000 item_types, and each item_type is associated with multiple item_ids. I see a response time of around 4 seconds for popular item_types. I am testing this from AWS Lambda; I start the timer when we make the query and stop it once we get the response. Both Lambda and DynamoDB are in the same region.
This is the query I am using:
from boto3.dynamodb.conditions import Key

response = items_table.query(
    KeyConditionExpression=Key('item_type').eq(search_term),
    ProjectionExpression='item_id'
)
The following are some of my observations:
It takes more time to fetch popular item_types.
As the number of records increases, the response time increases.
I have tried the DynamoDB cache (DAX), but the Python SDK is not up to the mark and has certain limitations.
Given these details, here are my questions:
Why is the response time so high? Is it because I am querying on a string rather than a number?
Increasing the read capacity also did not help. Why?
Is there any other AWS service that is faster than DynamoDB for this type of query?
I have seen seminars where they claim to get sub-millisecond response times on billions of records with multiple users accessing the table. Any pointers toward achieving sub-second response times would be helpful. Thanks.
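For what it's worth, a single Query call returns at most 1 MB of data per page, so popular item_types may span several pages. A minimal sketch of paging through all pages with LastEvaluatedKey in boto3, assuming the same items_table and search_term as above:

from boto3.dynamodb.conditions import Key

item_ids = []
kwargs = {
    "KeyConditionExpression": Key("item_type").eq(search_term),
    "ProjectionExpression": "item_id",
}
while True:
    response = items_table.query(**kwargs)
    item_ids.extend(item["item_id"] for item in response["Items"])
    if "LastEvaluatedKey" not in response:
        break
    kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]  # fetch the next 1 MB page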
I'd like to give my partners the results of simple COUNT(*) ... GROUP BY items.color type queries and perhaps joins over items and orders or some such. I'd like query response time to be sub-second (on the order of a second, at worst), and scale to billions of rows counted.
My current approach is either to back up my GCDatastore data and load it into BigQuery to provide daily analytics, or to use GCDataflow to maintain a set of pre-defined counters.
Is this a use case Spanner is suited for, if I transition my backend from Datastore to Spanner?
Today, running counting queries in Cloud Spanner requires a full table scan. Depending on the size of the table this could take more than a second.
One thing you could do is to track the count in a separate table, and whenever you update the items table, update the count in the same transaction.
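A minimal sketch of that pattern with the Python client, updating an item and its per-color count in one read-write transaction (the instance, database, table, and column names are assumptions; the counts row is assumed to already exist):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical names

def insert_item_and_bump_count(transaction, item_id, color):
    transaction.execute_update(
        "INSERT INTO items (item_id, color) VALUES (@id, @color)",
        params={"id": item_id, "color": color},
        param_types={"id": spanner.param_types.INT64, "color": spanner.param_types.STRING},
    )
    # Keep the aggregate in step with the data inside the same transaction.
    transaction.execute_update(
        "UPDATE item_counts SET n = n + 1 WHERE color = @color",
        params={"color": color},
        param_types={"color": spanner.param_types.STRING},
    )

database.run_in_transaction(insert_item_and_bump_count, 42, "red")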
I'm trying to retrieve the result of a query with aggregates, based on the GA sessions, using the BigQuery API in Python, and then push it to my data warehouse.
Issue: I can only retrieve 8333 records of that query result, but there are always 40k+ records on any day of the year.
I tried setting 'allowLargeResults': True.
I read that I should extract everything to Google Cloud Storage first and then retrieve it from there.
I also read somewhere in the Google docs that I might only be getting the first page of results.
Has anybody faced the same situation?
See the section on paging through results in the BigQuery docs: https://cloud.google.com/bigquery/docs/data#paging
Alternatively, you can export your table to Google Cloud Storage: https://cloud.google.com/bigquery/exporting-data-from-bigquery
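A minimal sketch of paging through all rows with the Python client (the RowIterator returned by result() fetches subsequent pages for you); the query text is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT * FROM `my-project.my_dataset.ga_sessions_agg`")  # hypothetical table

rows = []
for row in query_job.result(page_size=10000):  # iterates across pages until all rows are read
    rows.append(dict(row))
print(len(rows))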