I am struggling with ClickHouse to keep a unique data row per primary key (PK).
I chose this column-oriented DB to serve statistics data quickly and am very satisfied with its speed. However, I have run into a duplicated-data issue.
The test table looks like...
CREATE TABLE test2 (
`uid` String COMMENT 'User ID',
`name` String COMMENT 'name'
) ENGINE = ReplacingMergeTree
ORDER BY uid
PRIMARY KEY uid;
Let's presume that I am going to use this table to join on for display names (the name field in this table). However, I can insert as many rows as I want with the same PK (sorting key).
For Example
INSERT INTO test2
(uid, name) VALUES ('1', 'User1');
INSERT INTO test2
(uid, name) VALUES ('1', 'User2');
INSERT INTO test2
(uid, name) VALUES ('1', 'User3');
SELECT * FROM test2 WHERE uid = '1';
Now I can see 3 rows with the same sorting key. Is there any way to make the key unique or, at least, to prevent an insert if the key already exists?
Let's think about the scenario below.
The tables and data are:
CREATE TABLE blog (
`blog_id` String,
`blog_writer` String
) ENGINE MergeTree
ORDER BY tuple();
CREATE TABLE statistics (
`date` UInt32,
`blog_id` String,
`read_cnt` UInt32,
`like_cnt` UInt32
) ENGINE MergeTree
ORDER BY tuple();
INSERT INTO blog (blog_id, blog_writer) VALUES ('1', 'name1');
INSERT INTO blog (blog_id, blog_writer) VALUES ('2', 'name2');
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202007, '1', 10, 20);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 20, 0);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202009, '1', 3, 1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '2', 11, 2);
And here is the summing query:
SELECT
b.writer,
a.read_sum,
a.like_sum
FROM
(
SELECT
blog_id,
SUM(read_cnt) as read_sum,
SUM(like_cnt) as like_sum
FROM statistics
GROUP BY blog_id
) a JOIN
(
SELECT blog_id, blog_writer as writer FROM blog
) b
ON a.blog_id = b.blog_id;
At this moment it works fine, but if a new row comes in like
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);
What I expected is that the existing row would be updated so that "name1"'s read_sum becomes 73, but it shows 93 since duplicated inserts are allowed.
Is there any way to
prevent duplicated inserts,
or guarantee a unique PK on the table?
Thanks.
One thing that comes to mind is ReplacingMergeTree. It won't guarantee absence of duplication right away, but it will do so eventually. As the docs state:
Data deduplication occurs only during a merge. Merging occurs in the
background at an unknown time, so you can’t plan for it. Some of the
data may remain unprocessed.
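If you cannot wait for the background merge, you can force deduplication explicitly; a minimal sketch against the test2 table above (note that FINAL can make queries noticeably slower, so use it sparingly):
OPTIMIZE TABLE test2 FINAL;                 -- force an off-schedule merge that collapses duplicate keys
SELECT * FROM test2 FINAL WHERE uid = '1';  -- or deduplicate at query time instead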
Another approach that I personally use is introducing an extra column named, say, _ts - a timestamp of when the row was inserted. This lets you track changes, and with the help of ClickHouse's beautiful LIMIT BY you can easily get the last version of a row for a given PK.
CREATE TABLE test2 (
`uid` String COMMENT 'User ID',
`name` String COMMENT 'name',
`_ts` DateTime
) ENGINE = MergeTree
ORDER BY uid;
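Each insert then carries its own timestamp; a small sketch using now() (any monotonically increasing version value would work, assuming the inserts land at distinct timestamps):
INSERT INTO test2 (uid, name, _ts) VALUES ('1', 'User1', now());
INSERT INTO test2 (uid, name, _ts) VALUES ('1', 'User2', now());  -- the later row wins in the LIMIT BY query below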
Select would look like this:
SELECT uid, name FROM test2 ORDER BY _ts DESC LIMIT 1 BY uid;
In fact, you don't need a PK at all - just list in LIMIT BY whatever column or columns you need the rows to be unique by.
Besides ReplacingMergeTree, which runs deduplication asynchronously (so you can temporarily have duplicated rows with the same PK), you can use CollapsingMergeTree or VersionedCollapsingMergeTree.
With CollapsingMergeTree you could do something like this:
CREATE TABLE statistics (
`date` UInt32,
`blog_id` String,
`read_cnt` UInt32,
`like_cnt` UInt32,
`sign` Int8
) ENGINE CollapsingMergeTree(sign)
ORDER BY blog_id;
The only caveat is that on every insert of a duplicated PK you have to cancel the previous record, something like this:
-- first insert
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, 1);
-- cancel the previous insert and insert the new one
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, -1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 11, 2, 1);
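When reading, the canceled and active rows are combined through the sign column, which is the standard query pattern from the CollapsingMergeTree docs; a sketch of the summing query adapted to it:
SELECT
    blog_id,
    sum(read_cnt * sign) AS read_sum,
    sum(like_cnt * sign) AS like_sum
FROM statistics
GROUP BY blog_id
HAVING sum(sign) > 0;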
I do not think this is a real solution to the problem, but it is at least how I work around it from a business perspective.
Since ClickHouse does not officially support modifying table data (it provides ALTER TABLE ... UPDATE | DELETE, but those eventually rewrite the table), I split the table into many small partitions (in my case, one partition holds about 50,000 rows), and when duplicated data comes in I 1) drop the affected partition and 2) re-insert the data. In the case above, I always execute an ALTER TABLE ... DROP PARTITION statement before the insert.
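A rough sketch of that flow, assuming the statistics table is created with a PARTITION BY date clause so that each month can be dropped and rebuilt independently:
ALTER TABLE statistics DROP PARTITION 202008;                                             -- remove the partition containing the stale row
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);   -- re-insert the corrected data
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt) VALUES (202008, '2', 11, 2);   -- re-insert the other rows of that partition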
I also tried ReplacingMergeTree, but data duplication still occurred. (Maybe I do not understand how to use that engine, but I gave it a single sorting key, and when I inserted duplicated data there were still multiple rows with the same sorting key.)
In my Django project I have a table with a column named 'key_id'. Until today I had to group the different values of this column, count them, and display the largest count.
I did this:
maxpar = temp_test_keywords.objects.filter(main_id=test_id).values('key_id').annotate(total=Count('key_id')).order_by('-total').first()
maxMax = maxpar['total']
all done.
Today, we decided to add another field to the table, 'key_group' (which can be 1, 2, 3, or 4), and now I have to run the same group-and-count on key_id, but also grouped by key_group.
For example, before, if my table had 5 records with key_id=187 and 3 with key_id=112, my query had to return 5; now, if 4 of those 5 records have key_group=1 and one has key_group=2, the query has to return 4.
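Something along these lines is what I am aiming for - just a rough sketch that adds key_group to values() so the count is per (key_id, key_group) pair, with the same Count from django.db.models as before:
maxpar = temp_test_keywords.objects.filter(main_id=test_id).values('key_id', 'key_group').annotate(total=Count('key_id')).order_by('-total').first()
maxMax = maxpar['total']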
Hope I was clear.
Thanks in advance
Luke
I have a dataframe with millions of entries, and one of the columns is 'TYPE' (a string). There are 400 distinct values in this column and I want to replace them with integer ids from 1 to 400. I also want to export this 'TYPE' => id dictionary for future reference. I tried to_dict but it did not help. Is there any way to do this?
Option 1: you can use pd.factorize:
df['new'] = pd.factorize(df['str_col'])[0]+1
Option 2: using category dtype:
df['new'] = df['str_col'].astype('category').cat.codes+1
or even better just convert it to categorical dtype:
df['str_col'] = df['str_col'].astype('category')
and when you need to use numbers instead just use category codes:
df['str_col'].cat.codes
thanks to @jezrael for extending the answer - for creating a dictionary:
cats = df['str_col'].cat.categories
d = dict(zip(cats, range(1, len(cats) + 1)))
PS category dtype is very memory-efficient too
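To export that dictionary for future reference, you can apply it to the column and dump it to a file; a small sketch, where str_col stands for the 'TYPE' column from the question and the file name is arbitrary:
import json

df['str_col_id'] = df['str_col'].map(d)      # replace strings with their integer ids
with open('type_mapping.json', 'w') as f:    # keep the TYPE => id mapping for later reference
    json.dump(d, f)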
I currently have Django models like this
class MyFirstObject(models.Model):
some_field = models.BooleanField(default=False)
class MySecondObject(models.Model):
first_object = models.ForeignKey(MyFirstObject, db_column='firstObjectId')
Because of various issues, our data integrity is corrupt. So, I need to find instances where MyFirstObject has been deleted, but MySecondObject still has a row w a foreign key to it.
The database would look similar to:
TABLE my_first_object
id someField
1 a
2 a
3 b
TABLE my_second_object
id firstObjectId
1 1
2 3
3 4
Notice that row 3 of the my_second_object table has a firstObjectId that does not have a corresponding record in the my_first_object table. I want to find all instances like that.
If I was doing raw SQL, I would do
SELECT my_second_object.id, my_second_object.firstObjectId
FROM my_second_object
LEFT JOIN my_first_object ON ( my_second_object.firstObjectId = my_first_object.id )
WHERE my_first_object.id IS NULL
In Django, I am trying
MySecondObject.objects.filter(first_object__id__isnull=True)
But when I look at the resulting query, it is doing an inner join instead of a left join. Does anyone have suggestions? Thanks!
Try like this:
first_object_ids = MyFirstObject.objects.values_list('id', flat=True)
get_second_objects = MySecondObject.objects.exclude(first_object_id__in=first_object_ids)
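This runs as a single query using a NOT IN subquery; roughly the SQL it generates, with the table and column names taken from the example above:
SELECT id, firstObjectId
FROM my_second_object
WHERE firstObjectId NOT IN (SELECT id FROM my_first_object);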
I have the following table
id val
--------------
1 abc
2 xyz
3 abc
4 abc
Given the primary key (id), I need to be able to get all the rows whose val matches that of the row with the given primary key.
Currently I have the following django code:
Table.objects.filter(val = Table.objects.get(id=1).val)
But this makes two queries to the database. I want to reduce it to a single database call. Is this possible in Django?
You can always use extra():
Table.objects.extra(where=['val=(select val from app_table where id=1)'])
This will result in a single query:
SELECT
*
FROM
app_table
WHERE
val=(SELECT val FROM app_table WHERE id=1)
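If you prefer to avoid extra() (the Django docs recommend it only as a last resort), the same single-query result can be expressed with Subquery on Django 1.11+; a sketch, assuming the model is named Table as above:
from django.db.models import Subquery

Table.objects.filter(val=Subquery(Table.objects.filter(id=1).values('val')[:1]))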