Require Partition Filter On BigQuery Views - google-cloud-platform

We currently have a couple of authorized views in BigQuery for various teams.
Currently, we expose a partition_date column for use in queries to reduce the amount of data processed (reference):
#standardSQL
SELECT
<required_fields,...>,
EXTRACT(DATE FROM _PARTITIONTIME) AS partition_date
FROM
`<project-name>.<dataset-name>.<table-name>`
WHERE
_PARTITIONTIME >= TIMESTAMP("2018-05-01")
AND _PARTITIONTIME <= CURRENT_TIMESTAMP()
AND <Blah-Blah-Blah>
However, given the number of users and the volume of data we have, it is very hard to maintain the quality of BigQuery scripts, and query costs keep rising as the number of users grows.
I see we can use --require_partition_filter (reference) when creating tables. Could someone help me with the following questions?
When I create a table with the above flag, will a view that references the table also require the partition condition, since the partition filter is enforced at the table level?
Because of the number of authorized views connected to our tables, it would take significant effort to change them to materialized views (tables). Is there an alternative way to apply something like --require_partition_filter at the view level?
FYI, for anyone who wants to enable the above flag on an existing table: I see we can use the bq update command (reference), which I am planning to use for our existing partitioned tables.

Yes, a partition filter requirement set on a table also applies when that table is queried through a view.
No, there is no equivalent option at the view level.

Related

Can we alter AWS QLDB table?

Suppose I have created a table like this.
CREATE TABLE Vehicle
and inserted some documents into this table:
INSERT INTO Vehicle
<< {
'VIN' : '1N4AL11D75C109151',
'Type' : 'Sedan',
} >>
My requirement is to rename the table from Vehicle to VehicleCar and to rename the 'VIN' field to 'VID'.
How can I do that?
Thanks,
Dasun.
QLDB doesn't currently offer an ALTER TABLE capability. You'd have to DROP the table and re-create it. This counts against your table limits, so don't do it too often.
QLDB is schema-less, so you can change your field names and/or the structure of your documents anytime you want to, simply by writing new revisions to your documents in the new format. The journal will still contain the old revisions, however. If your application has any functionality that uses the history() function to access old revisions, then it needs to be able to gracefully handle variations in the document format.
It is important to note that QLDB is not optimized for scanning large volumes of data. It's optimized for targeted queries against an index using an equality operator. A query like "SELECT * FROM table" will scan the entire table. This is an anti-pattern for QLDB and will not perform well as your ledger grows. So if you change your document format, running a SELECT * and updating every document to the new format may be more work than you realize. First, that SELECT * scan query may time out, or it may be aborted with an Optimistic Concurrency Control exception because another process inserted a document into the table. Second, you'd have to do it in batches of 40 documents at a time because of the limit on the number of documents in a transaction.
All of this is to say that making your application resilient to schema changes is a good idea. :-)
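As a minimal sketch of what resilience to schema changes could look like in application code, the helper below accepts a vehicle document in either the old format (with 'VIN') or the new one (with 'VID') and returns it in the new format. The function name and the choice to normalize on read are illustrative assumptions, not part of any QLDB API, and documents are represented as plain Python dicts.

```python
def normalize_vehicle(doc):
    """Return a copy of doc that uses 'VID', whichever revision format it came in."""
    normalized = dict(doc)  # copy so the caller's document is not mutated
    if "VID" not in normalized and "VIN" in normalized:
        # Old-format revision: carry the value over under the new field name.
        normalized["VID"] = normalized.pop("VIN")
    else:
        # New-format revision (or both fields present): drop any leftover old field.
        normalized.pop("VIN", None)
    return normalized


old_revision = {"VIN": "1N4AL11D75C109151", "Type": "Sedan"}
new_revision = {"VID": "1N4AL11D75C109151", "Type": "Sedan"}

# Both revisions end up in the same shape after normalization.
same_shape = normalize_vehicle(old_revision) == normalize_vehicle(new_revision)
```

Code that reads old revisions through history() could pass each document through a helper like this before using it, so the rest of the application only ever sees one format.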

Redshift - Redesign tables to use DIST and SORT keys (performance issue)

I'm having serious performance problems on Redshift and I've started to rethink my table structures.
Right now, I'm identifying the tables that matter most to my dashboard. First of all, I run the following query:
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
SELECT DISTINCT s.tbl FROM stl_scan s
JOIN pg_user u ON u.usesysid = s.userid
WHERE s.type=2 AND u.usename='looker'
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC,
size DESC;
Based on the query result, I identified a lot of small tables (1-1000 records) that are distributed as EVEN and could be ALL. These tables are used in many joins.
Besides that, I found that 99% of my tables use EVEN distribution without a sort key. I'm not using denormalized tables, so I need to run plenty of joins to get data. From what I've read, EVEN is not good for joins because rows may have to be redistributed over the network.
I have 3 tables related to the ticket flow: user, ticket and ticket_history. All of them use diststyle EVEN and have no sort keys.
For now, I would like to redesign the user table. It is joined on the condition ticket.user_id = user.id and used in where clauses like user.email = 'xxxx#xxxx.com' or user.email like '%#something.com%', or in group by user.email.
The first thing I'm planning to do is set the diststyle to KEY with id as the dist key. Does it make sense to use a unique value as the dist key? I've read plenty of posts about dist keys and they are still confusing to me.
For the sort key, does it make sense to use email as a compound sort key? I've read that interleaved sort keys should be avoided on columns that keep growing, like dates, timestamps or identities, which is why I'm not using interleaved. To avoid that LIKE, I'm planning to create a new column that holds the email domain.
After that, I'll change the small tables to diststyle ALL and try my queries again.
Am I on the right track? Any other tips?
This question may sound stupid, but my technical background is only in software development. I'm still learning Redshift and reading a lot of documentation.
The basic rule of thumb is:
Set the DISTKEY to the column that is most used in JOINs
Set the SORTKEY to the column(s) most used in WHEREs
You are correct that small tables can have a distribution of ALL, which would avoid sending data between nodes.
DISTKEY provides the most benefit when tables are joined via a common column that is the DISTKEY in both tables. This means that matching rows are stored on the same node and no data needs to be sent between nodes (or, more accurately, slices). However, you can only select one DISTKEY, so choose the column that is most often used for the JOIN.
SORTKEY provides the most benefit when Redshift can skip over blocks of storage. Each block of storage contains data for one column and is marked with a MIN and MAX value. When a table is sorted on a particular column, it minimises the number of disk blocks that contain data for a given column value (since they are all located together, rather than being spread randomly throughout disk storage). Thus, use column(s) that are most frequently used in WHERE statements.
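The block-skipping effect can be illustrated with a toy simulation. This is only a sketch of the idea, not how Redshift stores data internally: the block size of 100 values and the integer column are arbitrary assumptions, and the function names are made up for the example.

```python
import random


def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, target):
    """Count blocks whose [min, max] range could contain target."""
    return sum(1 for low, high in zone_map if low <= target <= high)


sorted_values = list(range(10_000))
unsorted_values = sorted_values[:]
random.Random(0).shuffle(unsorted_values)  # fixed seed keeps the run repeatable

# Sorted data: only the one block covering 4200..4299 can contain 4242.
sorted_scan = blocks_to_scan(build_zone_map(sorted_values, 100), 4242)
# Shuffled data: most blocks span nearly the whole value range, so few can be skipped.
unsorted_scan = blocks_to_scan(build_zone_map(unsorted_values, 100), 4242)
```

With the sorted column a lookup for a single value touches one block, while the shuffled column forces a check of many more, which is the reason for sorting on the columns used in WHERE clauses.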
If the user.email wildcard search is slow, you can certainly create a new column with the domain. Or, for even better performance, you could consider creating a separate lookup table with just user_id and domain, having SORTKEY = domain. This will perform the fastest when searching by domain.
A tip from experience: I would advise against using an email address as a user_id because people sometimes want to change email address. It is better to use a unique number for such id columns, with email address as a changeable attribute. (I've seen software systems need major rewrites to fix such an early design decision!)

Fetching data from large BigQuery table in python

I have a BigQuery table (more than 5 million rows).
I need to fetch this data in batches and process it inside App Engine, in Python.
The only way I know to fetch from a table is to run a SELECT query on it and then iterate over the results using the tokens that fetch_data returns.
It looks like this:
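import uuid

# client, query_table, per_page, page_token and wait_for_job are defined elsewhere
query = u"""\
SELECT url FROM %s
""" % (query_table)
# start an asynchronous query job and wait for it to finish
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.begin()
wait_for_job(query_job, 1)
# fetch one page of results; next_token becomes page_token for the next call
query_results = query_job.results()
rows, total_rows, next_token = query_results.fetch_data(max_results=per_page, page_token=page_token)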
This works on smaller tables, but on larger ones like mine it asks me to allow large results and specify a destination table. This makes no sense to me: to simply fetch data from a table, I have to copy it to another table?
What you are running into is described in this documentation. In summary, apart from the limit on how much data can be fetched at a time, there is a point where your results become "large results." This is when your results are more than 128MB compressed, as described here. When your results are classified as large, you can only store the result of a query in a table in BigQuery.
Unfortunately, I'm not sure there's a nice way to do what you want without reducing how many rows you retrieve at once. What you'll likely need to do is explore the exporting data documentation for BigQuery.
You should use the tabledata.list API to fetch data from a table.
Using the startIndex or pageToken parameter together with maxResults, you can control the size of each page you fetch.
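The paging loop itself might look like the sketch below. It does not call BigQuery: fetch_page is a hypothetical callable that takes a page token and a maximum row count and returns (rows, next_token), which is roughly the shape a wrapper around tabledata.list with pageToken and maxResults would have. make_fake_table is an in-memory stand-in so the loop can be run as-is.

```python
def iter_pages(fetch_page, page_size):
    """Yield lists of rows, following page tokens until the source is exhausted."""
    page_token = None
    while True:
        rows, page_token = fetch_page(page_token, page_size)
        if rows:
            yield rows
        if page_token is None:
            break


def make_fake_table(rows):
    """Build an in-memory stand-in for a paged table read (hypothetical, for illustration)."""
    def fetch_page(page_token, max_results):
        start = page_token or 0  # no token means "start from the first row"
        end = start + max_results
        next_token = end if end < len(rows) else None
        return rows[start:end], next_token
    return fetch_page


fake_fetch = make_fake_table([{"url": "u%d" % i} for i in range(7)])
page_sizes = [len(page) for page in iter_pages(fake_fetch, 3)]
```

Each yielded page can be processed and discarded before the next one is requested, so memory use is bounded by the page size rather than by the table size.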
I think this (link) is exactly what you need. As far as I understand, you can't fetch a large query result directly, but you can fetch an entire table's data into your app no matter how big it is. That's why you need to write the large result to a table, then read that table's data into your app and do whatever you want with it.
Good luck :)

MS SQL to DynamoDB migration, what's the best partition key to chose in my case

I am working on a migration from MS SQL to DynamoDB and I'm not sure what the best hash key is for my purpose. In MS SQL I have an item table where I store product information for different customers, so the primary key is actually two columns, customer_id and item_no. In application code I need to query specific items and all items for a customer id, so my first idea was to set up the customer id as the hash key and the item no as the range key. But is this the best concept in terms of partitioning? I need to import product data daily, with 50,000-100,000 products for some larger customers, and as far as I know it would be better to have a random hash key. Otherwise the import job will run on one partition only.
Can somebody give me a hint about the best data model in this case?
Bye,
Peter
It sounds like you need item_no as the partition key, with customer_id as the sort key. Also, in order to query all items for a customer_id efficiently you will want to create a Global Secondary Index on customer_id.
This configuration should give you a good distribution while allowing you to run the queries you have specified.
You are on the right track. You should be careful about how you handle write operations, since you run an import job on a daily basis. Also avoid adding indexes unnecessarily, as they will only multiply your write operations.
Using customer_id as the hash key and item_no as the range key is the best option not only for querying but also for uploading your data.
As you mentioned, randomizing your customer ids would help optimize the use of resources and reduce the chance of a hot partition. In your case, I would follow the example in the DynamoDB documentation:
[...] One way to increase the write throughput of this application
would be to randomize the writes across multiple partition key values.
Choose a random number from a fixed set (for example, 1 to 200) and
concatenate it as a suffix [...]
So when you write your customer data, randomly assign a suffix to each customer id and make sure the suffixes are distributed evenly (e.g. CustomerXYZ.1, CustomerXYZ.2, ..., CustomerXYZ.200).
To read all of the items, you need to fetch the items for each suffix. For example, you would first issue a Query request for the partition key value CustomerXYZ.1, then another Query for CustomerXYZ.2, and so on through CustomerXYZ.200. Because you know the suffix range (in this case 1...200), you only need to query the records by appending each suffix to the customer id.
Each Query by the hash key CustomerXYZ.n returns a set of items (identified by the range key) for that specific customer; your application needs to merge the results from all of the Query requests.
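The write and read sides of this suffix scheme can be sketched as follows. Nothing here talks to DynamoDB: fake_table is an in-memory dict standing in for the table, query_partition is a hypothetical callable that returns the items stored under one partition key value, and all function names are made up for illustration.

```python
import random

SHARD_COUNT = 200  # size of the fixed suffix set, as in the quoted example


def sharded_key(customer_id, rng=random):
    """Pick a random suffix in 1..SHARD_COUNT and append it to the customer id."""
    return "%s.%d" % (customer_id, rng.randint(1, SHARD_COUNT))


def all_shard_keys(customer_id):
    """List every partition key value that may hold items for this customer."""
    return ["%s.%d" % (customer_id, n) for n in range(1, SHARD_COUNT + 1)]


def query_customer(customer_id, query_partition):
    """Run one query per suffix and merge the results into a single list."""
    items = []
    for key in all_shard_keys(customer_id):
        items.extend(query_partition(key))
    return items


# In-memory stand-in for the table: partition key value -> list of items.
fake_table = {}
for item_no in range(1000):
    fake_table.setdefault(sharded_key("CustomerXYZ"), []).append(item_no)

merged = query_customer("CustomerXYZ", lambda key: fake_table.get(key, []))
```

The merged list contains every item that was written, but in no particular order, so the application has to sort it if it depends on item_no ordering.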
This will certainly make reading the records harder (in terms of the additional requests needed); however, the benefits of optimized throughput and performance will pay off. Remember that a hot partition will not only increase your overall cost but also drastically impact your performance.
With a well-designed partition key, your queries should return quickly and at minimal cost.
Additionally, make sure your import job does not execute write operations grouped by customer. For example, instead of writing all items from a specific customer in series, order the write operations so they are spread across all customers. Even though each customer's items will be spread over several partitions (due to the id randomization), this additional safety measure helps prevent a burst of write activity in a single partition. More details below:
From the 'Distribute Write Activity During Data Upload' section of the official DynamoDB documentation:
To fully utilize all of the throughput capacity that has been
provisioned for your tables, you need to distribute your workload
across your partition key values. In this case, by directing an uneven
amount of upload work toward items all with the same partition key
value, you may not be able to fully utilize all of the resources
DynamoDB has provisioned for your table. You can distribute your
upload work by uploading one item from each partition key value first.
Then you repeat the pattern for the next set of sort key values for
all the items until you upload all the data [...]
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
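One way to produce the upload order described in the quote is a round-robin over customers, sketched below. The function name and the sample data are hypothetical; the input is assumed to be a dict mapping each customer id to the list of items to upload for that customer.

```python
from itertools import zip_longest


def interleave_by_customer(items_by_customer):
    """Order writes round-robin: one item per customer, then the next item per customer."""
    missing = object()  # sentinel for customers that have run out of items
    ordered = []
    for round_of_items in zip_longest(*items_by_customer.values(), fillvalue=missing):
        ordered.extend(item for item in round_of_items if item is not missing)
    return ordered


items_by_customer = {
    "A": ["A-1", "A-2", "A-3"],
    "B": ["B-1"],
    "C": ["C-1", "C-2"],
}
write_order = interleave_by_customer(items_by_customer)
```

Consecutive writes then go to different customers wherever possible, instead of sending one customer's whole batch in a row.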
I hope that helps. Regards.

Cassandra NOT EQUAL Operator

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records with a Not-Equal-To kind of operation.
I Googled this and it seems I have to use some sort of MapReduce.
Can somebody tell me what options are available in this regard?
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g. a column isEqualToUnitedStates with true or false).
2) Otherwise, consider emulating the unsupported query != X by combining results of two separate queries, < X and > X on the client-side.
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases of a particular customer of every product type except Widget. An ideal query could look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'; Now of course, you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources and filter item != 'Widget' in the client application.
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would return too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that would scan all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.
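The client-side filtering from approach 3 could be as simple as the sketch below. It assumes the rows for customer 'Bob' have already been fetched with the supported equality query and are available as a list of dicts; the function name and sample rows are made up for illustration and no Cassandra driver is involved.

```python
def filter_not_equal(rows, column, excluded):
    """Keep only rows whose value in `column` differs from `excluded`."""
    return [row for row in rows if row.get(column) != excluded]


# Stand-in for the result of: SELECT * FROM purchases WHERE customer = 'Bob'
bob_purchases = [
    {"customer": "Bob", "item": "Widget"},
    {"customer": "Bob", "item": "Gadget"},
    {"customer": "Bob", "item": "Gizmo"},
]
non_widgets = filter_not_equal(bob_purchases, "item", "Widget")
```

This only stays cheap while the equality query returns a manageable number of rows, which is the condition stated in approach 3.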
If you want to use a not-equals operator on a specific partition key and get all other data from the table, you can use a combination of range queries and the TOKEN function in CQL.
For example, to fetch all rows except the ones with partition key 'abc', execute the two queries below:
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility such as dsbulk. Also note that there is no guarantee of ordering in the result. This is essentially a data dump, which will most likely be useful for one-time scenarios such as data migration.