ClickHouse: how to guarantee one data row per PK (sorting key)?

I am struggling to get ClickHouse to keep a single unique row per PK.
I chose this column-oriented DB to serve statistics quickly and am very satisfied with its speed; however, I have run into a duplicated-data issue.
The test table looks like this:
CREATE TABLE test2 (
    `uid` String COMMENT 'User ID',
    `name` String COMMENT 'name'
) ENGINE = ReplacingMergeTree
ORDER BY uid;
Let's presume that I am going to join against this table to display names (the name field). However, I can insert as many rows as I want with the same PK (sorting key).
For example:
INSERT INTO test2 (uid, name) VALUES ('1', 'User1');
INSERT INTO test2 (uid, name) VALUES ('1', 'User2');
INSERT INTO test2 (uid, name) VALUES ('1', 'User3');
SELECT * FROM test2 WHERE uid = '1';
Now I can see 3 rows with the same sorting key. Is there any way to make the key unique, or at least to prevent an insert when the key already exists?
Consider the following scenario.
The tables and data are:
CREATE TABLE blog (
    `blog_id` String,
    `blog_writer` String
) ENGINE = MergeTree
ORDER BY tuple();

CREATE TABLE statistics (
    `date` UInt32,
    `blog_id` String,
    `read_cnt` UInt32,
    `like_cnt` UInt32
) ENGINE = MergeTree
ORDER BY tuple();
INSERT INTO blog (blog_id, blog_writer) VALUES ('1', 'name1');
INSERT INTO blog (blog_id, blog_writer) VALUES ('2', 'name2');
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202007, '1', 10, 20);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 20, 0);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202009, '1', 3, 1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '2', 11, 2);
And here is the summing query:
SELECT
    b.writer,
    a.read_sum,
    a.like_sum
FROM
(
    SELECT
        blog_id,
        SUM(read_cnt) AS read_sum,
        SUM(like_cnt) AS like_sum
    FROM statistics
    GROUP BY blog_id
) a
JOIN
(
    SELECT blog_id, blog_writer AS writer FROM blog
) b ON a.blog_id = b.blog_id;
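With the data above, this returns:

writer | read_sum | like_sum
name1  |       33 |       21
name2  |       11 |        2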
At this moment it works fine, but if a new row arrives, like
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);
what I expect is for the existing row to be updated, so that the read_sum for "name1" becomes 73. Instead it shows 93, because the duplicate insert is allowed.
Is there any way to
prevent duplicate inserts,
or guarantee a unique PK in the table?
Thanks.

One thing that comes to mind is ReplacingMergeTree. It won't guarantee the absence of duplication right away, but it will do so eventually. As the docs state:
Data deduplication occurs only during a merge. Merging occurs in the
background at an unknown time, so you can’t plan for it. Some of the
data may remain unprocessed.
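If you cannot wait for a background merge, you can also force deduplication yourself. A sketch against the test2 table above (note that OPTIMIZE can be expensive on large tables):

-- trigger an unscheduled merge so ReplacingMergeTree collapses duplicates
OPTIMIZE TABLE test2 FINAL;

-- or deduplicate at read time with the FINAL modifier (slower SELECTs)
SELECT * FROM test2 FINAL WHERE uid = '1';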
Another approach that I personally use is introducing an extra column named, say, _ts: a timestamp of when the row was inserted. This lets you track changes, and with the help of ClickHouse's beautiful LIMIT BY you can easily get the latest version of a row for a given PK.
CREATE TABLE test2 (
    `uid` String COMMENT 'User ID',
    `name` String COMMENT 'name',
    `_ts` DateTime
) ENGINE = MergeTree
ORDER BY uid;
The select would look like this:
SELECT uid, name FROM test2 ORDER BY _ts DESC LIMIT 1 BY uid;
In fact, you don't need a PK at all: just list in LIMIT 1 BY whichever column(s) you need the rows to be unique by.
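For example (the values here are hypothetical), after inserting two versions of the same uid:

INSERT INTO test2 (uid, name, _ts) VALUES ('1', 'User1', now());
INSERT INTO test2 (uid, name, _ts) VALUES ('1', 'User2', now());

SELECT uid, name FROM test2 ORDER BY _ts DESC LIMIT 1 BY uid;
-- returns a single row ('1', 'User2'), the most recently inserted version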

Besides ReplacingMergeTree, which runs deduplication asynchronously (so you can temporarily have duplicated rows with the same PK), you can use CollapsingMergeTree or VersionedCollapsingMergeTree.
With CollapsingMergeTree you could do something like this:
CREATE TABLE statistics (
    `date` UInt32,
    `blog_id` String,
    `read_cnt` UInt32,
    `like_cnt` UInt32,
    `sign` Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY blog_id;
The only caveat is that on every insert of a duplicated PK you have to cancel the previous row, something like this:
-- first insert
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, 1);
-- cancel the previous insert, then insert the new one
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, -1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 11, 2, 1);
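When reading, the pattern recommended in the CollapsingMergeTree docs is to aggregate with the sign column, so pairs that have not been collapsed yet cancel out; a sketch for the table above:

SELECT
    blog_id,
    sum(read_cnt * sign) AS read_cnt,
    sum(like_cnt * sign) AS like_cnt
FROM statistics
GROUP BY blog_id
HAVING sum(sign) > 0;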

I do not think this is a real solution to the problem, but it is at least how I detour around it from a business perspective.
Since ClickHouse does not officially support modifying table data (it provides ALTER TABLE ... UPDATE | DELETE, but eventually those rewrite the table), I split the table into many small partitions (in my case, about 50,000 rows per partition), and when duplicated data arrives I 1) drop the affected partition and 2) re-insert the data. In the case above, I always execute an ALTER TABLE ... DROP PARTITION statement before the insert.
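A minimal sketch of that workaround, assuming the statistics table above is partitioned by its month column so a whole month can be dropped and re-inserted at once:

CREATE TABLE statistics (
    `date` UInt32,
    `blog_id` String,
    `read_cnt` UInt32,
    `like_cnt` UInt32
) ENGINE = MergeTree
PARTITION BY date
ORDER BY blog_id;

-- replace the whole 202008 partition instead of updating rows in place
ALTER TABLE statistics DROP PARTITION 202008;
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);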
I also tried ReplacingMergeTree, but data duplication still occurred. (Maybe I do not understand how to use that engine; I gave it a single sorting key, and when I inserted duplicated data there were multiple rows with the same sorting key.)

Related

Remove duplicates based on sort

I have a customers table with IDs and some datetime columns, but those IDs have duplicates and I just want to analyse distinct ID values.
I tried using GROUP BY, but it makes the process very slow.
Due to data sensitivity I can't share the data.
Any suggestions would be helpful.
I'd suggest using ROW_NUMBER(). This lets you rank the rows by chosen columns so you can then pick out the first result.
Given you've shared no data or table and column names, here's an example based on the AdventureWorks database. The technique will be the same: you partition by whatever makes the group of rows you want to deduplicate unique (ProductKey below) and order in a way that puts the version you want to keep first (TotalChildren, BirthDate and CustomerKey in my example).
USE AdventureWorksDW2017;

WITH CustomersOrdered AS
(
    SELECT S.ProductKey, C.CustomerKey, C.TotalChildren, C.BirthDate,
           ROW_NUMBER() OVER (
               PARTITION BY S.ProductKey
               ORDER BY C.TotalChildren DESC, C.BirthDate DESC, C.CustomerKey ASC
           ) AS CustomerSequence
    FROM dbo.FactInternetSales AS S
    INNER JOIN dbo.DimCustomer AS C
        ON S.CustomerKey = C.CustomerKey
)
SELECT ProductKey, CustomerKey
FROM CustomersOrdered
WHERE CustomerSequence = 1
ORDER BY ProductKey, CustomerKey;
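If the goal is to physically remove the duplicates rather than filter them out in a SELECT, the same ROW_NUMBER() pattern works as a DELETE target in SQL Server; the table and column names below are hypothetical, since none were shared:

WITH RankedCustomers AS
(
    SELECT ID,
           ROW_NUMBER() OVER (
               PARTITION BY ID
               ORDER BY SomeDateColumn DESC
           ) AS rn
    FROM dbo.Customers
)
DELETE FROM RankedCustomers
WHERE rn > 1;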
You can also just sort the rows by the date column, then select the ID column and delete the duplicates...

apex select list, get user data

I've created a select list item and need to reference this list on another page. How do I get the user input and place it in a variable that can be used to calculate the total value of the selected item on another page?
Here is the code for the select list, which is displayed in a report to allow an amount to be specified:
select item_id,
       itemname,
       item_price,
       apex_item.hidden(1, item_id) ||
       apex_item.hidden(2, item_price) ||
       apex_item.select_list(
           p_idx         => 3,
           p_value       => nvl(null, 'item'),
           p_list_values => '1,2,3,4,5,6,7,8,9,10',
           p_show_null   => 'YES',
           p_null_value  => 0,
           p_null_text   => '0',
           p_item_id     => 'f03_#ROWNUM#',
           p_item_label  => 'f03_#ROWNUM#',
           p_show_extra  => 'NO') "item"
from item
The select to display the sum will look something like this:
select itemname,
       item_price,
       'select list value',
       ('select list value' * item_price) as sum
from item
How do I get the chosen amount from the select list?
Thanks.
Oracle APEX has the package APEX_APPLICATION, which contains collections named G_F01 through G_F50. When you use any function from apex_item to create an item inside a report, you pass a number to the p_idx parameter; in your example it is 3 for the select list. That value corresponds to one of the collections mentioned above, G_F03 in this case.
All values from the user's input are passed into these collections on submit, so you can write the following code in an after-submit process:
for i in apex_application.g_f01.first .. apex_application.g_f01.last loop
    insert into my_table (id, new_value)
    values (apex_application.g_f01(i), apex_application.g_f03(i));
end loop;
This code shows how to insert the user's input into my_table. Note that these collections contain only the rows changed by the user: if a user gets a report with 10 rows and changes 3 of them, each collection will have 3 elements.
APEX also has the APEX_COLLECTION API for storing temporary data.
For more information, see in the documentation: APEX_APPLICATION, APEX_COLLECTION.
UPD.
There are two ways to work with the data further: using a table or APEX_COLLECTION. I have never worked with APEX_COLLECTION, so here is an example with a table.
If you use a table to store temporary data, you need columns for the ID, the user's name, and the data (the Select List value in your example). On the page with the report, do this after submit:
delete from my_table where user_name = :APP_USER;

for i in apex_application.g_f01.first .. apex_application.g_f01.last loop
    insert into my_table (id, new_value, user_name)
    values (apex_application.g_f01(i), apex_application.g_f03(i), :APP_USER);
end loop;
On the next page, where you need to use this data, you create the following report:
select t.new_value, ...
from my_table t, ... <other tables>
where t.user_name = :APP_USER
  and t.id = ...
  and <other conditions>
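For those who prefer the APEX_COLLECTION route mentioned above, here is a rough sketch of the same after-submit process (the collection name is hypothetical; collections are scoped to the user's session automatically, so no per-user cleanup is needed):

begin
    apex_collection.create_or_truncate_collection(p_collection_name => 'ITEM_AMOUNTS');
    for i in apex_application.g_f01.first .. apex_application.g_f01.last loop
        apex_collection.add_member(
            p_collection_name => 'ITEM_AMOUNTS',
            p_c001            => apex_application.g_f01(i),  -- item_id
            p_c002            => apex_application.g_f03(i)); -- chosen amount
    end loop;
end;

On the next page the data can then be read back from the APEX_COLLECTIONS view:

select c001 as item_id, c002 as amount
from apex_collections
where collection_name = 'ITEM_AMOUNTS';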

How to store the result of a query in the current table without changing the table schema?

I have a structure
{
    id: "123",
    scans: [{
        "scanid": "123",
        "status": "sleep"
    }]
},
{
    id: "123",
    scans: [{
        "scanid": "123",
        "status": "sleep"
    }]
}
Query to remove duplicates:
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id) AS row_number
    FROM table1
)
WHERE row_number = 1
I specified the destination table as table1.
Here scans is a repeated record, with scanid and status as strings. But when I run a query (the deduplication query above) and overwrite the existing table, the table schema changes: it becomes scans_scanid (STRING) and scans_status (STRING), i.e. the scans record schema is flattened. Please suggest where I am going wrong.
It is known that NEST() is not compatible with unflattened results output and is mostly used for intermediate results in a subquery.
Try the workaround below.
Note: I use INTEGER for id and scanid. If they should be STRING, you need to
a) make the change in the output schema section, and
b) remove the parseInt() calls in t = {scanid:parseInt(x[0]), status:x[1]}.
SELECT id, scans.scanid, scans.status
FROM JS(
    ( // input table
        SELECT id, NEST(CONCAT(STRING(scanid), ',', STRING(status))) AS scans
        FROM (
            SELECT id, scans.scanid, scans.status
            FROM (
                SELECT id, scans.scanid, scans.status,
                       ROW_NUMBER() OVER (PARTITION BY id) AS dup
                FROM table1
            ) WHERE dup = 1
        ) GROUP BY id
    ),
    id, scans, // input columns
    "[{'name': 'id', 'type': 'INTEGER'},      // output schema
      {'name': 'scans', 'type': 'RECORD',
       'mode': 'REPEATED',
       'fields': [
           {'name': 'scanid', 'type': 'INTEGER'},
           {'name': 'status', 'type': 'STRING'}
       ]
      }
    ]",
    "function(row, emit) {                    // function
        var c = [];
        for (var i = 0; i < row.scans.length; i++) {
            var x = row.scans[i].toString().split(',');
            var t = {scanid: parseInt(x[0]), status: x[1]};
            c.push(t);
        }
        emit({id: row.id, scans: c});
    }"
)
Here I use BigQuery user-defined functions. They are extremely powerful, yet they still have some limits and limitations to be aware of. Also keep in mind that they are prime candidates for being qualified as expensive high-compute queries:
Complex queries can consume extraordinarily large computing resources
relative to the number of bytes processed. Typically, such queries
contain a very large number of JOIN or CROSS JOIN clauses or complex
user-defined functions.
1) If you run the query in the web UI, the result is automatically flattened, which is why you see the schema change. You need to run your query and write to a destination table; the web UI also has options for this.
2) If you don't run your query in the web UI but still see the schema change, you should use explicit selects so the schema is retained for you, e.g.:
select 'foo' as scans.scanid
This creates a record-like output, but it won't be a repeated record; for that, read further.
3) For some use cases you may need to use the NEST(expr) function, which:
Aggregates all values in the current aggregation scope into a repeated
field. For example, the query "SELECT x, NEST(y) FROM ... GROUP BY x"
returns one output record for each distinct x value, and contains a
repeated field for all y values paired with x in the query input. The
NEST function requires a GROUP BY clause.
BigQuery automatically flattens query results, so if you use the NEST
function on the top level query, the results won't contain repeated
fields. Use the NEST function when using a subselect that produces
intermediate results for immediate use by the same query.
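Note that this flattening issue is specific to legacy SQL. If standard SQL is an option for you, repeated fields survive a plain subselect, so the deduplication needs no NEST() gymnastics at all (a sketch, assuming the same table1 layout):

SELECT id, scans
FROM (
    SELECT id, scans, ROW_NUMBER() OVER (PARTITION BY id) AS dup
    FROM table1
)
WHERE dup = 1;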

SQL column with multiple values (query implementation in a cpp file)

I am using this link.
I have connected my cpp file in Eclipse to my database with 3 tables: two simple tables, Person and Item, and a third one, PersonsItems, that connects them. In the third table I use one simple primary key and two foreign keys, like this:
CREATE TABLE PersonsItems(
    PersonsItemsId int not null auto_increment primary key,
    Person_Id int not null,
    Item_id int not null,
    constraint fk_Person_id foreign key (Person_Id) references Person(PersonId),
    constraint fk_Item_id foreign key (Item_id) references Items(ItemId));
Then, with embedded SQL in C, I want a Person to have multiple items.
My code:
MYSQL_RES *resource;
MYSQL_ROW result;

mysql_query(connection,
    "INSERT INTO PersonsItems(PersonsItemsId, Person_Id, Item_id) VALUES (1,1,5), (1,1,8);");
printf("%ld PersonsItems Row(s) Updated!\n", (long) mysql_affected_rows(connection));

// SELECT the newly inserted records.
mysql_query(connection, "SELECT Order_id FROM PersonsItems");

// Result set with the rows of returned data.
resource = mysql_use_result(connection);

// Fetch multiple results.
while ((result = mysql_fetch_row(resource))) {
    printf("%s %s\n",result[0], result[1]);
}
My result is:
-1 PersonsItems Row(s) Updated!
5
but with VALUES (1,1,5), (1,1,8);
I would like it to be:
-1 PersonsItems Row(s) Updated!
5 8
Can someone tell me why this is not happening?
Kind regards.
I suspect this is because your first insert is failing with the following error:
Duplicate entry '1' for key 'PRIMARY'
because you are trying to insert 1 twice into PersonsItemsId, which is the primary key and so has to be unique (it is also auto_increment, so there is no need to specify a value at all).
This is why rows affected is -1, and why in this line:
printf("%s %s\n",result[0], result[1]);
you are only seeing 5: the statement failed after the values (1,1,5) had already been inserted, so there is still only one row of data in the table.
I think to get the behaviour you are expecting you need to use the ON DUPLICATE KEY UPDATE syntax:
INSERT INTO PersonsItems (PersonsItemsId, Person_Id, Order_id)
VALUES (1,1,5), (1,1,8)
ON DUPLICATE KEY UPDATE Person_Id = VALUES(Person_Id), Order_id = VALUES(Order_id);
Example on SQL Fiddle
Or do not specify the value for PersonsItemsId and let auto_increment do its thing:
INSERT INTO PersonsItems (Person_Id, Order_id)
VALUES (1,5), (1,8);
Example on SQL Fiddle
I think you have a typo or mistake between your two queries.
You are inserting PersonsItemsId, Person_Id and Item_id:
INSERT INTO PersonsItems(PersonsItemsId, Person_Id, Item_id) VALUES (1,1,5), (1,1,8)
and then your select statement selects Order_id:
SELECT Order_id FROM PersonsItems
To get 5, 8 as you request, your second query needs to be:
SELECT Item_id FROM PersonsItems
Edit to add:
Your primary key is auto_increment, so you don't need to pass it in your insert statement (in fact it will error, as you pass 1 twice).
You only need to insert the other columns:
INSERT INTO PersonsItems(Person_Id, Item_id) VALUES (1,5), (1,8)
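After either corrected insert, a quick sanity check should show both rows (output sketched):

SELECT Person_Id, Item_id FROM PersonsItems;
-- Person_Id | Item_id
--         1 |       5
--         1 |       8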

How to use subquery in django?

I want to get a list of the latest purchase of each customer, sorted by the date.
The following query does what I want except for the date:
(Purchase.objects
.all()
.distinct('customer')
.order_by('customer', '-date'))
It produces a query like:
SELECT DISTINCT ON ("shop_purchase"."customer_id")
    "shop_purchase"."id",
    "shop_purchase"."date"
FROM "shop_purchase"
ORDER BY "shop_purchase"."customer_id" ASC,
         "shop_purchase"."date" DESC;
I am forced to use customer_id as the first ORDER BY expression because of DISTINCT ON.
I want to sort by the date, so the query I really need should look like this:
SELECT * FROM (
    SELECT DISTINCT ON ("shop_purchase"."customer_id")
        "shop_purchase"."id",
        "shop_purchase"."date"
    FROM "shop_purchase"
    ORDER BY "shop_purchase"."customer_id" ASC,
             "shop_purchase"."date" DESC
) AS result
ORDER BY date DESC;
I don't want to sort in Python, because I still need to limit the query to a page; there can be tens of thousands of rows in the database.
In fact the result is currently sorted in Python, and that is causing very long page load times, which is why I'm trying to fix this.
Basically I want something like this https://stackoverflow.com/a/9796104/242969. Is it possible to express it with django querysets instead of writing raw SQL?
The actual models and methods are several pages long, but here is the set of models required for the queryset above.
class Customer(models.Model):
    user = models.OneToOneField(User)

class Purchase(models.Model):
    customer = models.ForeignKey(Customer)
    date = models.DateField(auto_now_add=True)
    item = models.CharField(max_length=255)
If I have data like:
Customer A -
    Purchase(item=Chair, date=January),
    Purchase(item=Table, date=February)
Customer B -
    Purchase(item=Speakers, date=January),
    Purchase(item=Monitor, date=May)
Customer C -
    Purchase(item=Laptop, date=March),
    Purchase(item=Printer, date=April)
I want to be able to extract the following:
Purchase(item=Monitor, date=May)
Purchase(item=Printer, date=April)
Purchase(item=Table, date=February)
There is at most one purchase in the list per customer. The purchase is each customer's latest. It is sorted by latest date.
This query will be able to extract that:
SELECT * FROM (
    SELECT DISTINCT ON ("shop_purchase"."customer_id")
        "shop_purchase"."id",
        "shop_purchase"."date"
    FROM "shop_purchase"
    ORDER BY "shop_purchase"."customer_id" ASC,
             "shop_purchase"."date" DESC
) AS result
ORDER BY date DESC;
I'm trying to find a way not to have to use raw SQL to achieve this result.
This may not be exactly what you're looking for, but it might get you closer. Take a look at Django's annotate.
Here is an example of something that may help:
from django.db.models import Max
Customer.objects.all().annotate(most_recent_purchase=Max('purchase__date'))
This will give you a list of your customer models, each of which will have a new attribute called most_recent_purchase containing the date of their last purchase. The SQL produced looks like this:
SELECT "demo_customer"."id",
"demo_customer"."user_id",
MAX("demo_purchase"."date") AS "most_recent_purchase"
FROM "demo_customer"
LEFT OUTER JOIN "demo_purchase" ON ("demo_customer"."id" = "demo_purchase"."customer_id")
GROUP BY "demo_customer"."id",
"demo_customer"."user_id"
Another option would be adding a property to your customer model that would look something like this:
@property
def latest_purchase(self):
    return self.purchase_set.order_by('-date')[0]
You would obviously need to handle the case where there aren't any purchases in this property, and this would potentially not perform very well (since you would be running one query for each customer to get their latest purchase).
I've used both of these techniques in the past and they've both worked fine in different situations. I hope this helps. Best of luck!
Whenever there is a difficult query to write using the Django ORM, I first try the query in psql (or whatever client you use). The SQL that you want is not this:
SELECT * FROM (
    SELECT DISTINCT ON ("shop_purchase"."customer_id")
        "shop_purchase"."id",
        "shop_purchase"."date"
    FROM "shop_purchase"
    ORDER BY "shop_purchase"."customer_id" ASC,
             "shop_purchase"."date" DESC
) AS result
ORDER BY date DESC;
In the above SQL, the inner query already returns one row per customer (I am assuming id is the primary key, as per convention), but the ORM gives you no direct way to wrap it in an outer query just to re-sort it.
If you need to find the last purchase of every customer, you need to do something like:
SELECT "shop_purchase.customer_id", max("shop_purchase.date")
FROM shop_purchase
GROUP BY 1
But the problem with the above query is that it only gives you the customer id and the date; you cannot use those results to identify the full records in a subquery. To use IN, you need a list of values that uniquely identify a record, e.g. id.
If id is a serial key in your records, you can leverage the fact that the latest date will also have the maximum id. So your SQL becomes:
SELECT max("shop_purchase"."id")
FROM shop_purchase
GROUP BY "shop_purchase"."customer_id";
Note that I kept only one field (id) in the select clause, so it can be used in a subquery with IN.
The complete SQL will now be:
SELECT *
FROM shop_purchase
WHERE "shop_purchase"."id" IN
    (SELECT max("shop_purchase"."id")
     FROM shop_purchase
     GROUP BY "shop_purchase"."customer_id");
and using the Django ORM it looks like:
(Purchase.objects.filter(
    id__in=Purchase.objects
        .values('customer_id')
        .annotate(latest=Max('id'))
        .values_list('latest', flat=True)))
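If you also want the final list ordered by date, as in the original question, the resulting SQL is simply the subquery above wrapped with an ORDER BY (a sketch; Django's exact quoting and aliases will differ):

SELECT *
FROM shop_purchase
WHERE id IN (
    SELECT max(id)
    FROM shop_purchase
    GROUP BY customer_id
)
ORDER BY date DESC;

In the ORM that amounts to appending .order_by('-date') to the queryset above.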
Hope it helps!
I have a similar situation and this is how I'm planning to go about it:
query = Purchase.objects.distinct('customer').order_by('customer').query
query = 'SELECT * FROM ({}) AS result ORDER BY sent DESC'.format(query)
return Purchase.objects.raw(query)
The upside is that it gives me the query I want. The downside is that it is a raw query, so I can't append any other queryset filters.
This is the approach I use when I need some subset of data (N items) along with a Django query. The example uses PostgreSQL and the handy json_build_object() function (Postgres 9.4+), but you can use another aggregate function in another database system in the same way. For older PostgreSQL versions you can use a combination of the array_agg() and array_to_string() functions; a sketch of that fallback appears at the end of this answer.
Imagine you have Article and Comment models, and along with every article in a list you want to select the 3 most recent comments (change LIMIT 3 to adjust the size of the subset, or ORDER BY c.id DESC to change its sorting).
qs = Article.objects.all()
qs = qs.extra(select={
    'recent_comments': """
        SELECT
            json_build_object('comments',
                array_agg(
                    json_build_object('id', id, 'user_id', user_id, 'body', body)
                )
            )
        FROM (
            SELECT
                c.id,
                c.user_id,
                c.body
            FROM app_comment c
            WHERE c.article_id = app_article.id
            ORDER BY c.id DESC
            LIMIT 3
        ) sub
    """
})
for article in qs:
    print(article.recent_comments)

# Output:
# {u'comments': [{u'user_id': 1, u'id': 3, u'body': u'foo'}, {u'user_id': 1, u'id': 2, u'body': u'bar'}, {u'user_id': 1, u'id': 1, u'body': u'joe'}]}
# ....
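For the pre-9.4 fallback mentioned at the top of this answer, here is a rough sketch of the same subset packed with array_agg() plus array_to_string() (the ':' and ', ' separators are arbitrary choices, and the query is shown standalone for a single hypothetical article id rather than correlated inside extra()):

SELECT array_to_string(
           array_agg(c.id || ':' || c.body),
           ', '
       ) AS recent_comments
FROM (
    SELECT id, body
    FROM app_comment
    WHERE article_id = 1
    ORDER BY id DESC
    LIMIT 3
) c;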