Fetching data from large BigQuery table in python - python-2.7

What I have is a BigQuery table(>5mil rows).
I need to fetch this data in batches and process it inside AppEngine, python.
The only way I know to fetch from a table is to run a SELECT query on it and then iterate over the result using the page tokens that fetch_data returns.
It looks like this:
import uuid

from google.cloud import bigquery

client = bigquery.Client()

query = u"""\
SELECT url FROM %s
""" % (query_table)
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.begin()
wait_for_job(query_job, 1)  # helper that polls until the job finishes
query_results = query_job.results()
rows, total_rows, next_token = query_results.fetch_data(
    max_results=per_page, page_token=page_token)
This works on smaller tables, but on larger ones like mine it asks me to allow large results and specify a destination table. That makes no sense to me: to simply fetch data from a table, do I have to copy it to another table first?

What you are running into is described in this documentation. In summary, apart from the limit on how much data can be fetched at a time, there is a point at which your results become "large results" - when they exceed 128MB compressed, as described here. When your results are classified as large, you can only store the result of the query in a BigQuery table.
Unfortunately I'm not sure there's a nice way to do what you want without reducing how many rows you are retrieving at once. What you'll likely need to do is explore the exporting data documentation for BigQuery.
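If you do go the query-to-destination-table route, a rough sketch with a recent google-cloud-bigquery client might look like the following (the project, dataset, table, and bucket names are placeholders, and note that allow_large_results only applies to legacy SQL):
from google.cloud import bigquery

client = bigquery.Client()

# 1) Run the query with a destination table so "large results" are allowed.
dest = bigquery.TableReference.from_string(
    "my-project.my_dataset.query_results")          # placeholder table
job_config = bigquery.QueryJobConfig()
job_config.destination = dest
job_config.allow_large_results = True
job_config.use_legacy_sql = True                    # allow_large_results is legacy-SQL only
query_job = client.query(
    "SELECT url FROM [my-project:my_dataset.big_table]", job_config=job_config)
query_job.result()                                  # wait for the query to finish

# 2) Export the destination table to Cloud Storage for batch processing.
extract_job = client.extract_table(dest, "gs://my-bucket/export-*.csv")  # placeholder bucket
extract_job.result()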

You should use the tabledata.list API for fetching data from a table.
Using the startIndex or pageToken parameters together with maxResults, you can control the size of each page you fetch.
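In the Python client this surfaces as Client.list_rows; a minimal sketch, assuming a recent google-cloud-bigquery version, a table with a url column, and a placeholder handle_row function:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder table name

batch_size = 1000
start_index = 0
while True:
    # list_rows wraps tabledata.list, so it reads the table directly
    # without running a query (and without incurring query cost).
    rows = list(client.list_rows(table, start_index=start_index, max_results=batch_size))
    if not rows:
        break
    for row in rows:
        handle_row(row["url"])  # placeholder processing
    start_index += len(rows)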

I think this link is exactly what you need. As far as I understood, you can't fetch a large query result directly, but you can get the entire table's data into your app no matter how big it is. That's why you need to put the large result in a table first, then read that table's data into your app and do whatever you want with it.
Good luck :)

Related

Cloud Spanner - read performance with large number of items in WHERE clause

I'm in the process of evaluating some different data stores for a project, and I have a strange but inflexible requirement to check the existence of 1500 keys per query... Basically the only query I'll be running is of the form:
SELECT user_id, name, gender
WHERE user_id in (user1, user2, ..., user1500)
I will have around 3.5 billion rows in the table. One data store that has caught my eye is Spanner. I was wondering if querying the data in this way would be feasible, or if I would run into performance issues due to the large number of items in my WHERE clause. I have only been able to test these queries on a small amount of data so far, so I'm leaning more on what the theoretical performance hit might look like instead of having the luxury to just "try and find out".
Also, are there other data stores that might work better for this read pattern? I expect to run no more than 80 queries per second. Also, the data will be bulk loaded on a weekly basis. The data is structured by nature, but we don't use it in a relational way (i.e. no joins).
Anyways, sorry if this question is vague in any way. I'm happy to provide more detail if needed.
1500 keys should not be a problem if you use a bound array parameter to specify the keys:
SELECT user_id, name, gender
FROM table
WHERE user_id IN UNNEST(@users)
https://cloud.google.com/spanner/docs/sql-best-practices#write_efficient_queries_for_range_key_lookup
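For reference, a minimal sketch of binding the array parameter with the Python client (instance, database, and table names are placeholders, and STRING is assumed for the key type):
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")      # placeholder instance
database = instance.database("my-database")    # placeholder database

user_ids = ["user1", "user2", "user3"]         # up to your ~1500 keys

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT user_id, name, gender FROM users WHERE user_id IN UNNEST(@users)",
        params={"users": user_ids},
        param_types={"users": spanner.param_types.Array(spanner.param_types.STRING)},
    )
    for row in results:
        print(row)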

Require Partition Filter On BigQuery Views

We currently have a couple of authorized views in BigQuery for various teams.
Currently, we use the partition_date column in queries to reduce the amount of data processed (reference):
#standardSQL
SELECT
<required_fields,...>,
EXTRACT(DATE FROM _PARTITIONTIME) AS partition_date
FROM
`<project-name>.<dataset-name>.<table-name>`
WHERE
_PARTITIONTIME >= TIMESTAMP("2018-05-01")
AND _PARTITIONTIME <= CURRENT_TIMESTAMP()
AND <Blah-Blah-Blah>
However, given the number of users and the amount of data we have, it's very hard to maintain the quality of BigQuery scripts, and query costs keep increasing as the number of users grows.
I see we can use --require_partition_filter (reference) when creating tables. So, could someone help me address the following questions?
When I create a table with the above flag, will an authorized view that references it also require the partition condition, because the partition filter is enabled at the table level?
Due to the number of authorized views connected to our tables, it would take significant effort to change them to materialized views (tables). Is there an alternative way to apply something like --require_partition_filter at the view level?
FYI, for anyone who wants to update an existing table with the above flag, the bq update command (reference) can be used; I am planning to use it for our existing partitioned tables.
Yes, the same restriction on the tables being queried through the view applies.
There is not.
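Regarding the bq update note above, the same flag can also be set from the Python client; a minimal sketch, assuming a reasonably recent google-cloud-bigquery version (the table name is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_partitioned_table")  # placeholder

# Equivalent to `bq update --require_partition_filter`
table.require_partition_filter = True
client.update_table(table, ["require_partition_filter"])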

Google Big Query splitting an ingestion time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
After that, set the value of this column as needed before insert, so it is used for partitioning - but in this case you lose the _PARTITIONTIME value.
Based on the additional clarification - I had a similar problem - my solution was to write a Python application that reads the source table (reading is important here - not querying - so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (which is also free, but with some limits on the number of such operations); the second route requires more coding/exception handling.
You can also do it via Dataflow - it will definitely be more expensive than a custom solution, but potentially more robust.
Example with the google-cloud-bigquery Python library:
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# source_table_ref and the target_* / start_index / max_results names are placeholders
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # removing first column, which is the key to split on
ds_target = client.dataset(project=target_project, dataset_id=target_dataset)

# read rows directly via tabledata.list (free - no query cost)
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# ... do something with the records, e.g. group them by the split column ...
# stream the records to the destination table
# (in newer client versions create_rows() is called insert_rows())
errors = client.create_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
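If clustering covers your case, a rough sketch of creating a partitioned table clustered on the split column, assuming a recent Python client (names and schema are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("split_column", "STRING"),  # placeholder: the column you would have split on
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-project.my_dataset.clustered_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
table.clustering_fields = ["split_column"]
client.create_table(table)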

Does searching by id depend on the number of columns in Postgres?

I have the following query: MyModel.objects.filter(id__in=ids).
I noticed that increasing the number of columns in the table decreases the speed of the above query.
Why is that?
Query time in Postgres mostly consists of planning time, execution time, and data fetch.
Planning time and execution time should not be affected by the number of columns in the table, but the data fetch phase definitely is, as you are returning more data.
Also, an additional step that happens is mapping the returned data into a Django QuerySet, which takes more time when more columns are involved.
To limit the scope of data returned if applicable, you can always use values, defer, or only.
In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.
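For illustration, with the model from the question and hypothetical field names, that could look like:
# Fetch only the columns you actually need (hypothetical field names):
MyModel.objects.filter(id__in=ids).only("id", "name")

# Or return plain dicts instead of model instances:
MyModel.objects.filter(id__in=ids).values("id", "name")

# Or load the expensive columns lazily, only if accessed:
MyModel.objects.filter(id__in=ids).defer("description")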

Excluding a blob column from active record/linq query results

What's the easiest way to exclude a column from the result set in a Subsonic/ActiveRecord/Linq query?
I've a got a table of images, and often I only want the meta data associated with the image (image id/name/dimensions for example). Seems fairly wasteful to be pulling in the entire image data for these requests.
My current thought is to split out the image data to a separate table, but I'm wondering if there's an easier/better way.
As you can see in the docs at the link below, it's possible to use LINQ to shape your query that way.
http://subsonicproject.com/docs/Using_ActiveRecord