How to store the result of query on the current table without changing the table schema?

I have a structure
id: "123",
id: "123",
Query to remove duplicate:
FROM table1
WHERE row_number = 1
I specified destination table as table1.
Here I have defined scans as a repeated record, with scanid and status as strings. But when I run a query (one that removes duplicates) and overwrite the existing table, the table schema changes: the fields become scans_scanid (string) and scans_status (string), so the scans record schema is no longer what I defined. Where am I going wrong?

NEST() is known to be incompatible with the unflattened results output, and it is mostly used for intermediate results in a subquery.
Try the workaround below.
Note that I use INTEGER for id and scanid. If they should be STRING, you need to
a. change the output schema section
as well as
b. remove the use of the parseInt() function in t = {scanid:parseInt(x[0]), status:x[1]}
SELECT id, scans.scanid, scans.status
( // input table
SELECT id, NEST(CONCAT(STRING(scanid), ',', STRING(status))) AS scans
SELECT id, scans.scanid, scans.status
SELECT id, scans.scanid, scans.status,
FROM table1
) WHERE dup = 1
id, scans, // input columns
"[{'name': 'id', 'type': 'INTEGER'}, // output schema
{'name': 'scans', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'scanid', 'type': 'INTEGER'},
{'name': 'status', 'type': 'STRING'}
"function(row, emit){ // function
var c = [];
for (var i = 0; i < row.scans.length; i++) {
x = row.scans[i].toString().split(',');
t = {scanid:parseInt(x[0]), status:x[1]}
c.push(t);
};
emit({id: row.id, scans: c});
}"
Here I use BigQuery User-Defined Functions. They are extremely powerful, but they have some limits and limitations to be aware of. Also keep in mind that queries using them are likely candidates to be classified as expensive High-Compute queries:
Complex queries can consume extraordinarily large computing resources
relative to the number of bytes processed. Typically, such queries
contain a very large number of JOIN or CROSS JOIN clauses or complex
User-defined Functions.
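If it helps to see what the UDF body in the workaround does without the BigQuery wrapper, here is a rough plain-Python sketch of the same split-and-rebuild step. It is only an illustration, not BigQuery code: the function name rebuild_scans and the dict-shaped row are assumptions made for the example.

```python
def rebuild_scans(row):
    """Turn packed 'scanid,status' strings back into a list of scan records."""
    scans = []
    for packed in row["scans"]:
        # Each packed value looks like "7,done". Unlike the JavaScript version,
        # this splits only on the first comma, so a status containing a comma
        # would be kept whole.
        scanid, status = str(packed).split(",", 1)
        scans.append({"scanid": int(scanid), "status": status})
    return {"id": row["id"], "scans": scans}


row = {"id": 123, "scans": ["1,complete", "2,pending"]}
print(rebuild_scans(row))
```

As in the note above, drop the int() call if scanid should stay a string.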

1) If you run the query in the web UI, the result is automatically flattened, which is why you see the schema change.
You need to run your query and write to a destination table; the web UI has options for this as well.
2) If you don't run your query in the web UI but still see the schema changed, you should use explicit selects so that the schema is retained, e.g.:
select 'foo' as scans.scanid
This gives you record-like output, but it won't be a repeated record; for that, read on.
3) For some use cases you may need to use the NEST(expr) function which
Aggregates all values in the current aggregation scope into a repeated
field. For example, the query "SELECT x, NEST(y) FROM ... GROUP BY x"
returns one output record for each distinct x value, and contains a
repeated field for all y values paired with x in the query input. The
NEST function requires a GROUP BY clause.
BigQuery automatically flattens query results, so if you use the NEST
function on the top level query, the results won't contain repeated
fields. Use the NEST function when using a subselect that produces
intermediate results for immediate use by the same query.
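To make the "one output record per distinct x, with a repeated field of y values" idea concrete, here is a small plain-Python sketch of the same grouping, combined with the duplicate removal from the question. It only mimics the shape of the result; the function name nest_by_id and the sample rows are made up for illustration.

```python
from collections import defaultdict


def nest_by_id(flat_rows):
    """Group flattened (id, scanid, status) rows into one record per id."""
    grouped = defaultdict(list)
    for row in flat_rows:
        scan = {"scanid": row["scanid"], "status": row["status"]}
        # Skip exact duplicates so each (scanid, status) pair appears once per id.
        if scan not in grouped[row["id"]]:
            grouped[row["id"]].append(scan)
    return [{"id": key, "scans": scans} for key, scans in grouped.items()]


flat = [
    {"id": "123", "scanid": "a", "status": "ok"},
    {"id": "123", "scanid": "a", "status": "ok"},
    {"id": "123", "scanid": "b", "status": "failed"},
]
print(nest_by_id(flat))
```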


How to query DynamoDB GSI with compound conditions

I have a DynamoDB table called 'frank' with a single GSI. The partition key is called PK, the sort key is called SK, the GSI partition key is called GSI1_PK and the GSI sort key is called GSI1_SK. I have a single 'data' map storing the actual data.
Populated with some test data it looks like this:
The GSI partition key and sort key map directly to the attributes with the same names within the table.
I can run a partiql query to grab the results that are shown in the image. Here's the partiql code:
select PK, SK, GSI1_PK, GSI1_SK, data from "frank"."GSI1"
( "GSI1_SK" >= 'A_VISITOR#2021-06-01-00-00-00-000' and "GSI1_SK" <= 'A_VISITOR#2021-06-20-23-59-59-999' )
( "GSI1_SK" >= 'B_INTERACTION#2021-06-01-00-00-00-000' and "GSI1_SK" <= 'B_INTERACTION#2021-06-20-23-59-59-999' )
Note how the partiql code references "GSI1_SK" multiple times. The partiql query works, and returns the data shown in the image. All great so far.
However, I now want to move this into a Lambda function. How do I structure an AWS.DynamoDB.DocumentClient query to do exactly what this PartiQL query is doing?
I can get this to work in my Lambda function:
const visitorStart="A_VISITOR#2021-06-01-00-00-00-000";
const visitorEnd="A_VISITOR#2021-06-20-23-59-59-999";
var params = {
  TableName: "frank",
  IndexName: "GSI1",
  KeyConditionExpression: "#GSI1_PK=:tmn AND #GSI1_SK BETWEEN :visitorStart AND :visitorEnd",
  ExpressionAttributeNames: { "#GSI1_PK": "GSI1_PK", "#GSI1_SK": "GSI1_SK" },
  ExpressionAttributeValues: {
    ":tmn": lowerCaseTeamName,
    ":visitorStart": visitorStart,
    ":visitorEnd": visitorEnd
  }
};
const data = await documentClient.query(params).promise();
But as soon as I try a more complex compound condition I get this error:
ValidationException: Invalid operator used in KeyConditionExpression: OR
Here is the more complex attempt:
const visitorStart="A_VISITOR#2021-06-01-00-00-00-000";
const visitorEnd="A_VISITOR#2021-06-20-23-59-59-999";
const interactionStart="B_INTERACTION#2021-06-01-00-00-00-000";
const interactionEnd="B_INTERACTION#2021-06-20-23-59-59-999";
var params = {
  TableName: "frank",
  IndexName: "GSI1",
  KeyConditionExpression: "#GSI1_PK=:tmn AND (#GSI1_SK BETWEEN :visitorStart AND :visitorEnd OR #GSI1_SK BETWEEN :interactionStart AND :interactionEnd) ",
  ExpressionAttributeNames: { "#GSI1_PK": "GSI1_PK", "#GSI1_SK": "GSI1_SK" },
  ExpressionAttributeValues: {
    ":tmn": lowerCaseTeamName,
    ":visitorStart": visitorStart,
    ":visitorEnd": visitorEnd,
    ":interactionStart": interactionStart,
    ":interactionEnd": interactionEnd
  }
};
const data = await documentClient.query(params).promise();
The docs say that KeyConditionExpressions don't support 'OR'. So, how do I replicate my more complex partiql query in Lambda using AWS.DynamoDB.DocumentClient?
The documentation for PartiQL on DynamoDB does warn you that PartiQL has no scruples about using a full table scan to get your data:
To ensure that a SELECT statement does not result in a full table scan, the WHERE clause condition must specify a partition key. Use the equality or IN operator.
In those cases PartiQL would run a scan and use a FilterExpression to filter the data.
Of course, in your example you provided a partition key, so I'd assume that PartiQL runs a query on the partition key and applies the rest of the condition as a FilterExpression.
You could replicate it that way, and depending on the size of your partitions this might work just fine. However, if the partition grows beyond 1 MB and most of the data is filtered out, you'll need to deal with pagination even for pages that return no data.
Because of that, I'd suggest simply splitting it up: run each OR condition as a separate query and merge the data on the client.
Unfortunately, DynamoDB does not support OR in a KeyConditionExpression: it accepts an equality test on the partition key and, optionally, a single condition on the sort key. The PartiQL query you are executing is probably performing a full table scan to return the results.
If you want to replicate the PartiQL query using the DocumentClient, you could use the scan operation. If you want to avoid using scan, you could perform two separate query operations and join the results in your application code.
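The "one query per range, merged on the client" approach can be sketched roughly as follows. The sketch is in Python rather than the Lambda's JavaScript, and it deliberately does not call a real SDK: query_fn is an assumed stand-in for any callable that accepts Query-style keyword arguments and returns a dict with Items and, when more pages remain, LastEvaluatedKey. Wiring it to a concrete client (value encoding, table binding) is left out.

```python
def query_all_pages(query_fn, **params):
    """Run one Query and follow LastEvaluatedKey until every page is read."""
    items = []
    while True:
        page = query_fn(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume the same query where the previous page stopped.
        params["ExclusiveStartKey"] = last_key


def query_ranges(query_fn, team, ranges, table="frank", index="GSI1"):
    """Issue one BETWEEN query per sort-key range and concatenate the results."""
    merged = []
    for start, end in ranges:
        merged.extend(query_all_pages(
            query_fn,
            TableName=table,
            IndexName=index,
            KeyConditionExpression="#pk = :pk AND #sk BETWEEN :start AND :end",
            ExpressionAttributeNames={"#pk": "GSI1_PK", "#sk": "GSI1_SK"},
            ExpressionAttributeValues={":pk": team, ":start": start, ":end": end},
        ))
    return merged
```

Because the two sort-key prefixes (A_VISITOR# and B_INTERACTION#) cannot overlap, the merged list needs no de-duplication; with overlapping ranges you would have to de-duplicate on the table's primary key.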

How to fetch all records based on year in Amazon QLDB

I have a requirement to fetch all records from an amazon QLDB based on the given year.
Here is my data inside the Revenues Table.
ownerId: "u102john2021",
transactionId: "tran010101010101",
timeStamp: 2021-06-11T19:31:31.000Z
ownerId: "u102john2021",
transactionId: "tran010101010101",
timeStamp: 2020-06-11T19:31:31.000Z
If I pass the year 2020 I want to select relevant records.
How can I write a select query on this?
To immediately answer your question, there are a couple of ways that you can achieve what you're trying to do, depending on the Ion data type of the timeStamp field.
1/ If the data type is the timestamp type, i.e.
'ownerId' : 'A',
'transactionId' : 't1',
'timeStamp' : `2021-06-11T19:31:31.000Z`
'ownerId' : 'B',
'transactionId' : 't2',
'timeStamp' : `2020-06-11T19:31:31.000Z`
You can use a WHERE clause that sets the boundaries of the SELECT statement, i.e.
SELECT * FROM revenues WHERE "timeStamp" < `2021T` AND "timeStamp" >= `2020T`
Note that I've placed the timeStamp field in double quotation marks because it is a reserved keyword.
2/ If the data type is the string type, i.e.
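For an arbitrary year, the two bounds of the timestamp range can be generated rather than typed by hand. Below is a minimal sketch that only builds the statement text; the helper name year_range_statement is made up, and how the string is then passed to a QLDB driver is not shown.

```python
def year_range_statement(year, table="revenues", field="timeStamp"):
    """Build a half-open [year, year + 1) range query using Ion timestamp literals."""
    year = int(year)  # reject anything that is not a plain integer year
    return (
        f'SELECT * FROM {table} '
        f'WHERE "{field}" >= `{year:04d}T` AND "{field}" < `{year + 1:04d}T`'
    )


print(year_range_statement(2020))
```

The half-open range keeps everything from the first instant of the year up to, but not including, the first instant of the next one.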
'ownerId' : 'C',
'transactionId' : 't3',
'timeStamp' : '2021-06-11T19:31:31.000Z'
'ownerId' : 'D',
'transactionId' : 't4',
'timeStamp' : '2020-06-11T19:31:31.000Z'
You can use a WHERE clause with the LIKE operator to match a pattern, i.e.
SELECT * FROM revenues WHERE "timeStamp" LIKE '2020%'
I'd like to mention that although these queries will achieve what you want, they are not optimised for QLDB, and as the data set grows there will be significant performance problems in the form of query latency, transaction timeouts, and concurrency conflicts. The reason is that QLDB performs a full table scan unless a predicate with an equality check against an indexed field is provided, e.g.:
SELECT * FROM revenues WHERE "timeStamp" = `2021-06-11T19:31:31.000Z`
Scan queries face high latency that increases with the amount of data that has to be examined. The queries provided will result in scans in order to determine the right documents to return that fit the ranges.
With the increase in latency, another aspect that has to be considered is the QLDB transaction timeout of 30 seconds. All queries in QLDB are transactions with serializable isolation, including SELECT statements. As the scan latency goes up with increase in the data set, the transaction timeout will inevitably be triggered and the query will error.
Ideally, you should run statements with a WHERE predicate clause that filters on an indexed field or a document ID. For more information on optimal queries for QLDB, please see the QLDB documentation on optimizing query performance.
For running such scans as provided above, we recommend streaming the data to a purpose-built database service of your choice that is optimized for analytical use cases.

clickhouse how to guarantee one data row per a pk(sorting key)?

I am struggling to keep a unique data row per PK in ClickHouse.
I chose this column-oriented DB to serve statistics data quickly and am very satisfied with its speed. However, I have run into a duplicated-data issue.
The test table looks like...
`uid` String COMMENT 'User ID',
`name` String COMMENT 'name'
) ENGINE ReplacingMergeTree(uid)
Let's presume that I am going to join against this table to get display names (the name field in this table). However, I can insert as many rows as I want with the same PK (sorting key).
For example:
(uid, name) VALUES ('1', 'User1');
(uid, name) VALUES ('1', 'User2');
(uid, name) VALUES ('1', 'User3');
SELECT * FROM test2 WHERE uid = '1';
Now I can see 3 rows with the same sorting key. Is there any way to make the key unique, or at least to prevent an insert if the key already exists?
Consider the scenario below.
The tables and data are:
`blog_id` String,
`blog_writer` String
) ENGINE MergeTree
ORDER BY tuple();
CREATE TABLE statistics (
`date` UInt32,
`blog_id` String,
`read_cnt` UInt32,
`like_cnt` UInt32
) ENGINE MergeTree
ORDER BY tuple();
INSERT INTO blog (blog_id, blog_writer) VALUES ('1', 'name1');
INSERT INTO blog (blog_id, blog_writer) VALUES ('2', 'name2');
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202007, '1', 10, 20);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 20, 0);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202009, '1', 3, 1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '2', 11, 2);
And here is the summing query:
SUM(read_cnt) as read_sum,
SUM(like_cnt) as like_sum
FROM statistics
GROUP BY blog_id
) a JOIN
SELECT blog_id, blog_writer as writer FROM blog
) b
ON a.blog_id = b.blog_id;
At this moment it works fine, but suppose a new row arrives, such as
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);
What I expect is that the existing row is updated and the read_sum for "name1" is 73, but it shows 93 because the duplicate insert is allowed.
Is there any way to
prevent duplicate inserts,
or set a guaranteed-unique PK on the table?
One thing that comes to mind is ReplacingMergeTree. It won't guarantee the absence of duplicates right away, but it will do so eventually. As the docs state:
Data deduplication occurs only during a merge. Merging occurs in the
background at an unknown time, so you can’t plan for it. Some of the
data may remain unprocessed.
Another approach that I personally use is to introduce another column named, say, _ts: a timestamp of when the row was inserted. This lets you track changes, and with the help of ClickHouse's beautiful LIMIT BY you can easily get the last version of a row for a given PK.
`uid` String COMMENT 'User ID',
`name` String COMMENT 'name',
`_ts` DateTime
) ENGINE MergeTree(uid)
Select would look like this:
SELECT uid, name FROM test2 ORDER BY _ts DESC LIMIT 1 BY uid;
In fact, you don't need a PK for this; just specify in LIMIT BY whichever column or columns the rows should be unique by.
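For anyone unsure what ORDER BY _ts DESC LIMIT 1 BY uid returns, here is a rough plain-Python equivalent of the selection. It is only an illustration of the semantics, with invented sample rows and a made-up helper name latest_per_key.

```python
from datetime import datetime


def latest_per_key(rows, key="uid", ts="_ts"):
    """Keep only the newest row for each key, like ORDER BY _ts DESC LIMIT 1 BY uid."""
    latest = {}
    for row in sorted(rows, key=lambda r: r[ts], reverse=True):
        # The first row seen for a key is the newest one; later ones are ignored.
        latest.setdefault(row[key], row)
    return list(latest.values())


rows = [
    {"uid": "1", "name": "User1", "_ts": datetime(2020, 8, 1)},
    {"uid": "1", "name": "User3", "_ts": datetime(2020, 8, 3)},
    {"uid": "2", "name": "Other", "_ts": datetime(2020, 8, 2)},
]
print([r["name"] for r in latest_per_key(rows)])
```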
Besides ReplacingMergeTree, which runs deduplication asynchronously (so you can temporarily have duplicated rows with the same PK), you can use CollapsingMergeTree or VersionedCollapsingMergeTree.
With CollapsingMergeTree you could do something like this:
CREATE TABLE statistics (
`date` UInt32,
`blog_id` String,
`read_cnt` UInt32,
`like_cnt` UInt32,
`sign` Int8
) ENGINE CollapsingMergeTree(sign)
-- rows collapse per sorting key, and the primary key must be a prefix of it
ORDER BY (blog_id, date)
PRIMARY KEY blog_id;
The only caveat is that on every insert of a duplicated PK you have to cancel the previous row, something like this:
# first insert
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, 1);
# cancel previous insert and insert the new one
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, -1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 11, 2, 1);
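Collapsing also happens only during background merges, so the cancelled pair may still be physically present when you query. The usual way around that is to aggregate with the sign, for example sum(read_cnt * sign) instead of sum(read_cnt). The arithmetic can be checked with a small Python sketch over the three rows inserted above (the helper name collapsed_sum is made up):

```python
def collapsed_sum(rows, column):
    """Sum a column while letting +1/-1 sign pairs cancel each other out."""
    return sum(row[column] * row["sign"] for row in rows)


rows = [
    {"date": 202008, "blog_id": "1", "read_cnt": 20, "like_cnt": 0, "sign": 1},
    {"date": 202008, "blog_id": "1", "read_cnt": 20, "like_cnt": 0, "sign": -1},
    {"date": 202008, "blog_id": "1", "read_cnt": 11, "like_cnt": 2, "sign": 1},
]
print(collapsed_sum(rows, "read_cnt"), collapsed_sum(rows, "like_cnt"))
```

Only the last insert contributes to the totals, whether or not the first two rows have been merged away yet.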
I do not think this is a real solution to the problem, but from a business perspective I worked around it in the following way.
Since ClickHouse officially does not support modifying table data (it provides ALTER TABLE ... UPDATE | DELETE, but those eventually rewrite the table), I split the table into many small partitions (in my case, one partition holds about 50,000 rows), and if duplicated data arrives I 1) drop the partition and 2) re-insert the data. In the above case, I always execute an ALTER TABLE ... DROP PARTITION statement before the insert.
I have also tried ReplacingMergeTree, but data duplication still occurred. (Maybe I do not understand how to use that engine, but I gave it a single sorting key, and when I insert duplicated data there are multiple rows with the same sorting key.)

DAX Query to Get Distinct Items from Multiple Tables

I'm trying to generate a table of distinct email addresses from multiple source tables. However, with UNION as the outermost function, the result isn't a truly distinct list.
Participants = UNION(DISTINCT('Registrations'[Email Address]), DISTINCT( 'EnteredTickets'[Email]))
*Note that while I'm starting with just two source tables, I need to expand this to 3 or 4 by the end.
Using VALUES for the column selections and wrapping the whole statement in one more DISTINCT did the trick:
Participants = DISTINCT(UNION(VALUES('Registrations'[Email Address]), VALUES( 'EnteredTickets'[Email])))
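The reason the outer DISTINCT matters is that UNION keeps duplicate rows, so an address that appears in both source tables survives once per table. A quick plain-Python sketch of the difference, using made-up addresses and a helper named distinct purely for illustration:

```python
def distinct(values):
    """Return values with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(values))


registrations = ["a@example.com", "b@example.com", "a@example.com"]
entered_tickets = ["b@example.com", "c@example.com"]

# DISTINCT on each side, then UNION: b@example.com is still listed twice.
union_of_distinct = distinct(registrations) + distinct(entered_tickets)

# UNION first, then one DISTINCT over the combined column.
distinct_of_union = distinct(registrations + entered_tickets)

print(union_of_distinct)
print(distinct_of_union)
```

The same ordering of operations carries over when a third or fourth source table is added to the UNION.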
If you want a bridge table with unique values for all different tables, use DISTINCT instead of VALUES:
Participants =
TOPN ( 0, ROW ("NiceEmail", "asdf") ), -- adds zero rows table with nice new column name
DISTINCT ( 'Registrations'[Email Address] ),
DISTINCT ( 'EnteredTickets'[Email] )
[NiceEmail] <> BLANK () -- removes all blank emails
DISTINCT and VALUES may lead to different results. Essentially, with VALUES you are likely to end up with an (unwanted) blank value in your list.
Check this documentation:
You might also like information under this link which you can use to get a specific column name for your table of distinct values:
DAX create empty table with specific column names and no rows

How to use subquery in django?

I want to get a list of the latest purchase of each customer, sorted by the date.
The following query does what I want except for the date:
.order_by('customer', '-date'))
It produces a query like:
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"" DESC;
I am forced to use customer_id as the first ORDER BY expression because of DISTINCT ON.
I want to sort by the date, so what the query I really need should look like this:
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"" DESC;
AS result
I don't want to sort using Python because I still have to apply a page limit to the query. There can be tens of thousands of rows in the database.
In fact, it is currently sorted in Python, and that is causing very long page load times, which is why I'm trying to fix this.
Basically I want something like this. Is it possible to express it with Django querysets instead of writing raw SQL?
The actual models and methods are several pages long, but here is the set of models required for the queryset above.
class Customer(models.Model):
user = models.OneToOneField(User)
class Purchase(models.Model):
customer = models.ForeignKey(Customer)
date = models.DateField(auto_now_add=True)
item = models.CharField(max_length=255)
If I have data like:
Customer A -
Purchase(item=Chair, date=January),
Purchase(item=Table, date=February)
Customer B -
Purchase(item=Speakers, date=January),
Purchase(item=Monitor, date=May)
Customer C -
Purchase(item=Laptop, date=March),
Purchase(item=Printer, date=April)
I want to be able to extract the following:
Purchase(item=Monitor, date=May)
Purchase(item=Printer, date=April)
Purchase(item=Table, date=February)
There is at most one purchase in the list per customer. The purchase is each customer's latest. It is sorted by latest date.
This query will be able to extract that:
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"" DESC;
AS result
I'm trying to find a way not to have to use raw SQL to achieve this result.
This may not be exactly what you're looking for, but it might get you closer. Take a look at Django's annotate.
Here is an example of something that may help:
from django.db.models import Max
Customer.objects.annotate(most_recent_purchase=Max('purchase__date'))
This will give you a list of your customer models, each of which will have a new attribute called "most_recent_purchase" containing the date on which they made their last purchase. The SQL produced looks like this:
SELECT "demo_customer"."id",
MAX("demo_purchase"."date") AS "most_recent_purchase"
FROM "demo_customer"
LEFT OUTER JOIN "demo_purchase" ON ("demo_customer"."id" = "demo_purchase"."customer_id")
GROUP BY "demo_customer"."id",
Another option would be adding a property to your customer model that would look something like this:
@property
def latest_purchase(self):
    return self.purchase_set.order_by('-date')[0]
You would obviously need to handle the case where there aren't any purchases in this property, and this would potentially not perform very well (since you would be running one query for each customer to get their latest purchase).
I've used both of these techniques in the past and they've both worked fine in different situations. I hope this helps. Best of luck!
Whenever there is a difficult query to write using the Django ORM, I first try the query in psql (or whatever client you use). The SQL that you want is not this:
"shop_purchase.customer_id" "" ""
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC, "" DESC;
) AS result
In the above SQL, the inner SQL is looking for distinct on a combination of (customer_id, id, and date) and since id will be unique for all, you will get all records from the table. I am assuming id is the primary key as per convention.
If you need to find the last purchase of every customer, you need to do something like:
SELECT "shop_purchase.customer_id", max("")
FROM shop_purchase
But the problem with the above query is that it will give you only the customer name and date. Using that will not help you in finding the records when you use these results in a subquery.
To use IN you need a list of unique parameters to identify a record, e.g., id
If in your records id is a serial key, then you can leverage the fact that the latest date will be the maximum id as well. So your SQL becomes:
SELECT max("")
FROM shop_purchase
GROUP BY "shop_purchase.customer_id";
Note that I kept only one field (id) in the selected clause to use it in a subquery using IN.
The complete SQL will now be:
FROM shop_customer
(SELECT max("")
FROM shop_purchase
GROUP BY "shop_purchase.customer_id");
and using the Django ORM it looks like:
.values_list('latest', flat=True)))
Hope it helps!
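To check the "max(id) per customer" idea against the sample data from the question, here is a small self-contained sketch using Python's built-in sqlite3 module instead of PostgreSQL and the ORM. The table layout, customer labels and concrete dates are assumptions for the example, and it relies on the same premise as above: within a customer, a higher id means a later purchase.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shop_purchase ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, customer_id TEXT, item TEXT, date TEXT)"
)
# Rows are inserted in date order, so ids grow with the purchase date.
conn.executemany(
    "INSERT INTO shop_purchase (customer_id, item, date) VALUES (?, ?, ?)",
    [
        ("A", "Chair", "2021-01-10"),
        ("B", "Speakers", "2021-01-20"),
        ("A", "Table", "2021-02-05"),
        ("C", "Laptop", "2021-03-01"),
        ("C", "Printer", "2021-04-15"),
        ("B", "Monitor", "2021-05-30"),
    ],
)

# Latest purchase per customer (highest id), then sorted by date, newest first.
latest = conn.execute(
    """
    SELECT item, date
    FROM shop_purchase
    WHERE id IN (SELECT max(id) FROM shop_purchase GROUP BY customer_id)
    ORDER BY date DESC
    """
).fetchall()
print(latest)
```

If ids are not guaranteed to follow purchase dates, this shortcut no longer picks the latest purchase and a date-based subquery is needed instead.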
I have a similar situation and this is how I'm planning to go about it:
query = Purchase.objects.distinct('customer').order_by('customer').query
query = 'SELECT * FROM ({}) AS result ORDER BY sent DESC'.format(query)
return Purchase.objects.raw(query)
The upside is that it gives me the query I want. The downside is that it is a raw query, so I can't append any other queryset filters.
This is my approach when I need some subset of data (N items) along with the Django query. This example uses PostgreSQL and the handy json_build_object() function (Postgres 9.4+), but in the same way you can use other aggregate functions in other database systems. For older PostgreSQL versions you can use a combination of the array_agg() and array_to_string() functions.
Imagine you have Article and Comment models, and along with every article in the list you want to select its 3 most recent comments (change LIMIT 3 to adjust the size of the subset, or ORDER BY DESC to change its sorting).
qs = Article.objects.all()
qs = qs.extra(select = {
'recent_comments': """
json_build_object('id', id, 'user_id', user_id, 'body', body)
FROM app_comment c
WHERE c.article_id =
) sub
for article in qs:
# Output:
# {u'comments': [{u'user_id': 1, u'id': 3, u'body': u'foo'}, {u'user_id': 1, u'id': 2, u'body': u'bar'}, {u'user_id': 1, u'id': 1, u'body': u'joe'}]}
# ....