Is it possible to use ClickHouse to implement an efficient union-find algorithm? - c++

I have a typical union-find problem where I have to group records, but it involves multiple files containing hundreds of billions of records.
Can I somehow use the ClickHouse database to solve it?
Edit - minimal reproducible example:
I have three columns (item_id, from, to); each row connects two graph nodes.
I want to produce a table (id, group_id, item_id) which assigns each item to its disjoint-set group.
[Data]
item_id from to
0 101 102
1 102 103
2 104 105
[Result]
id group_id item_id
0 0 0
1 0 1
2 1 2
There are only two groups #0 (101->102->103) and #1 (104->105).
The problem with an in-memory implementation is that there are too many records, and I want ClickHouse (or some other solution) to take care of the filesystem caching.
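For reference, the grouping above corresponds to a standard union-find (disjoint-set) pass over the edges. Below is a minimal in-memory C++ sketch (not part of the original post; the names and layout are illustrative) that reproduces the expected [Result] table; it is exactly this kind of in-memory approach that stops working once the parent map no longer fits in RAM.

// Minimal in-memory sketch: plain union-find (with path halving) over the
// (item_id, from, to) rows of the example. Names and layout are illustrative.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct DSU {
    std::unordered_map<long long, long long> parent;

    long long find(long long x) {
        if (!parent.count(x)) parent[x] = x;   // lazily register new nodes
        while (parent[x] != x) {               // path halving
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }
    void unite(long long a, long long b) { parent[find(a)] = find(b); }
};

int main() {
    struct Edge { long long item_id, from, to; };
    std::vector<Edge> edges = {{0, 101, 102}, {1, 102, 103}, {2, 104, 105}};

    DSU dsu;
    for (const auto& e : edges) dsu.unite(e.from, e.to);

    // Assign dense group ids in order of first appearance of each root.
    std::unordered_map<long long, long long> group_of_root;
    std::printf("id group_id item_id\n");
    long long id = 0;
    for (const auto& e : edges) {
        long long root = dsu.find(e.from);
        if (!group_of_root.count(root)) {
            long long next = static_cast<long long>(group_of_root.size());
            group_of_root[root] = next;
        }
        std::printf("%lld %lld %lld\n", id++, group_of_root[root], e.item_id);
    }
    return 0;
}

The printed output matches the [Result] table above (groups #0 and #1); at hundreds of billions of records it is the parent map that blows up in memory.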

Without knowing more about your specific data and questions, it is tricky to give a definitive answer. In general, this represents a moderate size for ClickHouse. SQL UNION is fully supported (note that this is not the same thing as a union-find algorithm). Your best bet is to simply try: loading or generating data is straightforward, and SQL queries can usually be translated from PostgreSQL/MySQL easily.

Related

In SAS: Analysis of multiple-choice variables

I have a dataset for a survey that has several multiple-choice questions (check all, check 3, etc.);
each option is coded as a binary variable:
Location Popn1 Popn2 Popn3 Popn4 .... Popn20
Location1 0 1 1 1
Location2 1 1 0 0
Location3 0 0 0 0
Here is my code:
proc tabulate data=cath.binarydata;
    class location sectorcollapsed;
    var popn1-popn20;
    table (location='Location'),
          (popn1-popn20)*(Sum='Count'*f=best5. mean='Percent'*f=percent8.1 N='Total responses received per question')
          / box="Populations Served by Location";
run;
I'm using a proc tabulate to do a sum (count) and mean (percent) of each option in the multiple choice question by Location. However, I am finding that, when I do a check using my original dataset, the numbers don't make sense.
Here is a sample output:
This is the kind of output I want and have right now
            Popn1              Popn2             ...  Popn20
            Count  Freq  N     Count  Freq  N
Location1   13     50%   26    11     42%   26
Location2
However, when I check back and manually calculate, what I think it's doing doesn't make sense. For example, the N of 26 makes sense for Location1, because there are 26 people in Location1 and they all answered the question, so the sum being out of 26 makes sense.
However, for some of them the N doesn't make sense. I thought the N would be all of the people who answered the question, but it doesn't quite add up like that. As an example, in one of the locations there were 149 people in total and 19 did not provide an answer at all, so the N there should be 130, but the output gives me a value of 134.
Does anyone have any thoughts, or can anyone help me understand how to use SAS to tabulate the multiple variables together in one column, while giving me the total answers for each option and the percentage (out of the number of people who answered the question)?
Any help is much appreciated,

Creating an ID based on factor and filling down with Stata

Consider the fictional data below, which illustrates my problem; the real data contains thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In Figure 1, I show 3 individuals for which some characteristics are missing. Do you have any idea how I can get the completed table shown in Figure 2?
Figure 2
With an ID like the one in Figure 1 I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: all individuals share the characteristic "A".
Edit: the existing order of observations is informative.
To detect the change of ID, the idea is to check, in each row, whether the preceding value of char is >= the current one.
This works only if your data are ordered, but that seems to be guaranteed in your data.
* flag the first row of each individual: first observation, or char not increasing
gen id = 1 if (char[_n-1] >= char[_n]) | _n == 1
* turn the flags into a running count, which becomes the individual ID
replace id = sum(id) if id == 1
* carry the ID down to the remaining rows of the same individual
replace id = id[_n-1] if missing(id)
* add the missing (id, char) combinations, then drop the marker variable
fillin id char
drop _fillin
If an individual has only the characteristics A and C and another individual has only D and E, this won't work, but such a case seems impossible to detect with your data.

Association mining with Weka: only want rules with 1s

I'm trying to apply association mining using Apriori with Weka on my data set, which looks like
A B C
1 0 1
0 0 1
1 0 0
But it's only finding rules where the value is 0, while I only want rules where there are 1s.
How can I get around this? I don't want it to look for rules where the absence of something indicates the absence of something else, but rather where the presence of A indicates the presence of C, for example.
Try replacing 0s with missing values instead! If I recall correctly, this will then produce the desired results. But I haven't used this for a long time, because Weka is just so much slower than ELKI or SPMF. Weka would just die on my data sets, whereas the other two worked fine.

Loading first few observations of data set without reading entire data set (Stata 13.1)?

(Stata/MP 13.1)
I am working with a set of massive data sets that take an extremely long time to load. I currently loop through all the data sets to load them each time.
Is it possible to just tell Stata to load the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to load the entire data set? Otherwise, if I load the entire data set and then just keep the first 5 observations, the process takes an extremely long time.
Here are two work-arounds I have already tried:
use in 1/5 using mydata: I think this is more efficient than loading the data and then keeping the observations you want in a separate command, but it still seems to read in the entire data set.
First load all the data sets, then save copies of them containing just the first 5 observations, and then only use the copies: this is cumbersome as I have a lot of different files; I would much prefer a direct way to read in the first 5 observations without having to resort to this method and without having to read the entire data set.
I'd say using in is the natural way to do this in Stata, but testing shows you are correct: it really makes no "big" difference, given the size of the data set. Here is an example with 148,000,000 observations:
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising, since in seems really efficient in other contexts.
I would contact Stata tech support (and/or search around, including www.statalist.com), if only to ask why in isn't much faster (independently of you finding some other strategy to handle this problem).
It's worth using, of course; but not fast enough for many applications.
In terms of workflow, your second option might be the best. Leave the computer running while the smaller datasets are created (using a loop), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work.
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of running
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted that the above was a solution, the run I timed just happened to be a faster instance. However, as I am now repeatedly running
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set without having to read in the entire data set for computational purposes especially when the data set is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each data set). Below is the output from the first four "describe, short" statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent

Ideal directory structure for web application

I'm about to create a user-based website and will have to store photos, docs and other data for each user.
If I take a silly number like 1 000 000 000 users, I believe that one folder with 1 000 000 000 entries won't be the fastest thing in the world! So I was thinking of creating something like:
1st level : [a-z]
2nd level : [a-z]
3rd level : [a-z]
Therefore bobby will be in /b/o/b/by
But this also means that it won't be spread equally, because there will be very few users starting with a z and many more with an m, s, l...
So I was thinking of using a user id
such as "000000000001", "000000000002", etc.
1st level : [000-999]
2nd level : [000-999]
3rd level : [000-999]
therefore the data of user 000000000001 will be stored in /data/000/000/000/001
then I will be sure to have a maximum of 1000 folders at each level.
What do you think about it? What should I do or not do?
The server will be running CentOS 5.4 with ext3 on RAID 1; if the I/O gets too bad, I will probably go for RAID 10.
A hash function provides a way to distribute large amounts of data across an easily searchable structure.
See this related question: Why use hashing to create pathnames for large collections of files?
And also try looking through Google results for Directory Hashing.
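For illustration only, here is a minimal C++ sketch of the id-based layout described in the question (the helper name user_dir is made up, and it assumes the 12-digit zero-padded numeric ids from the question); a hash of the username could be substituted for the numeric id to even out the distribution, as the hashing suggestion above implies.

// Sketch only: map a numeric user id to the nested /data/xxx/xxx/xxx/xxx path
// described in the question (12-digit zero-padded id, 3 digits per level).
#include <cstdint>
#include <cstdio>
#include <string>

std::string user_dir(std::uint64_t user_id) {           // hypothetical helper
    char buf[13];
    std::snprintf(buf, sizeof(buf), "%012llu",
                  static_cast<unsigned long long>(user_id));
    std::string id(buf);                                 // e.g. "000000000001"
    return "/data/" + id.substr(0, 3) + "/" + id.substr(3, 3) + "/" +
           id.substr(6, 3) + "/" + id.substr(9, 3);
}

int main() {
    std::printf("%s\n", user_dir(1).c_str());                 // /data/000/000/000/001
    std::printf("%s\n", user_dir(123456789012ULL).c_str());   // /data/123/456/789/012
    return 0;
}

With this scheme each level holds at most 1000 entries, as intended in the question.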