C++ splitting two dimensional vector into groups - c++

I'm trying to implement a GROUP BY operation (as in SQL) on a two-dimensional vector of strings, which represents the data source.
I'm allowing the user to select which field to group by, but I don't know the best way to achieve this.
I don't want to group if the values in the selected field vary too much. Example:
ID | name | type
1 | Sam | a
2 | Alex | b
3 | Tom | b
4 | Ryan | a
With the above example, grouping by name shouldn't pass because there is too much variability in the data, whereas type is a valid field to group on. How could I implement this type of check? I was thinking of keeping track of how many instances of each value of the group field there are.
Is it necessary to store each group in its own vector?

Let's answer your first question: how do you determine whether an attribute is valid to group on?
You want low variability, so you need a metric that tells you whether grouping by that attribute makes sense.
A very simple metric is the number of unique elements in an attribute divided by the total number of elements in that attribute.
(A score of 1 means all elements are different; a score of 1/(number of elements) means all elements are the same.)
You can then set a threshold on that score to decide whether or not to group on an attribute.
In your example:
name has 4 unique elements out of 4 elements, so its score would be 1.
type has 2 unique elements out of 4 elements, so its score would be 0.5.
Note that this metric may perform poorly on small data sets.
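The metric can be sketched in a few lines. This is a language-neutral illustration written in Python rather than C++; the function names and the 0.5 threshold are assumptions chosen for the example, not values from the question. In C++ the same idea maps onto inserting the column's values into a std::unordered_set and comparing its size with the row count.

```python
# Sample rows from the question: ID, name, type.
rows = [
    ["1", "Sam", "a"],
    ["2", "Alex", "b"],
    ["3", "Tom", "b"],
    ["4", "Ryan", "a"],
]


def uniqueness_ratio(rows, column):
    """Return (number of distinct values) / (number of rows) for one column."""
    values = [row[column] for row in rows]
    if not values:
        raise ValueError("cannot score an empty data set")
    return len(set(values)) / len(values)


def can_group_by(rows, column, threshold=0.5):
    """Allow grouping only when the ratio does not exceed the threshold.

    The threshold of 0.5 is an arbitrary value picked for this example.
    """
    return uniqueness_ratio(rows, column) <= threshold


print(uniqueness_ratio(rows, 1), can_group_by(rows, 1))  # name column
print(uniqueness_ratio(rows, 2), can_group_by(rows, 2))  # type column
```

With the sample data, name scores 1.0 and is rejected, while type scores 0.5 and is accepted.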
No, it's not necessary to store each attribute in its own vector (but it will work).
Other solutions:
Create a struct/class to hold your data and store instances of that class in a vector.
vector[0] => {id: 1, name: Sam, type: a}
vector[1] => {id: 2, name: Alex, type: b}
vector[2] => {id: 3, name: Tom, type: b}
vector[3] => {id: 4, name: Ryan, type: a}
You could then group by sorting on a specific key (i.e. on type),
or
create a separate hash or map for each grouping. Each hash/map entry will store pointers to your objects.
type_hash [0] => List of pointers to data objects with type a
type_hash [1] => List of pointers to data objects with type b
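The map-based approach might look like the following minimal sketch, again in Python for brevity. The `Record` class and `group_by` function are hypothetical names introduced for illustration. Python lists hold references to the same objects, which plays the role of the pointers mentioned above; in C++ the analogue would be something like a std::unordered_map from key to a vector of pointers or indices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One row of the data source (hypothetical struct for this example)."""
    id: int
    name: str
    type: str


def group_by(records, key):
    """Map each distinct value of `key` to the list of records that have it."""
    groups = defaultdict(list)
    for record in records:
        # The list stores a reference to the record, not a copy.
        groups[getattr(record, key)].append(record)
    return dict(groups)


records = [
    Record(1, "Sam", "a"),
    Record(2, "Alex", "b"),
    Record(3, "Tom", "b"),
    Record(4, "Ryan", "a"),
]

type_hash = group_by(records, "type")
print([r.name for r in type_hash["a"]])
print([r.name for r in type_hash["b"]])
```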

Related

Check if two arrays are exactly the same in BigQuery merge statement

I have two tables in BigQuery that I am trying to merge. For the purpose of explanation, let us call the two tables A and B, where B is merged into A. I have a primary key called id on which I am performing the merge. Both tables have a column (let us call it X) of type ARRAY. My main intention is to replace the array data in A with that of B if the arrays in the two tables are not equal. How can I do that? I did find posts on SO and other sites, but none of them work for my use case.
A                B
----------       ----------
id | x           id | x
----------       ----------
1  | [1,2]       1  | [1,2]
----------       ----------
2  | [3]         2  | [4,5]
The result of the merge should be
A
----------
id | x
----------
1 | [1,2]
----------
2 | [4,5]
How can I achieve the above result? Any leads will be very helpful. Also, if there are other posts that address this scenario directly, please point me to them.
Edits:
I tried the following:
merge A as main_table
using B as updated_table
on main_table.id = updated_table.id
when matched and main_table.x != updated_table.x then update set main_table.x = updated_table.x
when not matched then
insert(id, x) values (updated_table.id, updated_table.x)
;
Hope this helps.
I cannot directly use a comparison operator on arrays, right? My use case is to update values only when they are not equal, so I cannot use something like != directly. This is the main problem.
You can use the to_json_string function to compare the two arrays "directly":
to_json_string(main_table.x) != to_json_string(updated_table.x)

Changing the name of a list on the fly using a counter

I have a set of lists:
list_0=[a,b,a,b,b,c,f,h................]
list_1=[f,g,c,g,f,a,b,b,b,.............]
list_2=[...............................]
............
list_j=[...............................]
where j is (k-1), with some thousands of values stored in them. I want to count how many times a specific value occurs in a specific list. Every element of those lists can only take one of 8 specific values, let's say a,b,c,d,e,f,g,h, so for every list I want to count how many times the value a occurs, how many times the value b occurs, and so on.
This is not so complicated.
What is complicated, at least for me, is changing the name of the list on the fly.
I tried:
for i in range(k):
    my_list='list_'+str(int(k))
    a_sum=exec(my_list.count(a))
    b_sum=exec(my_list.count(b))
    ...
and it doesn't work.
I've read some other answers to similar problems, but I'm not able to translate them to fit my need :-(
Thanks.
What you want is to dynamically access a local variable by its name. That's doable, all you need is locals().
If you have variables named "var0", "var1" and "var2" and you want to access their contents without hardcoding the names, you can do it as follows:
var0 = [1,2,3]
var1 = [4,5,6]
var2 = [7,8,9]
for i in range(3):
    variable = locals()['var'+str(i)]
    print(variable)
Output:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
Although this is doable, it is not advised. Instead, you could store those lists in a dict keyed by their names as strings, so that later you can access them simply by string without having to worry about variable scopes.
If your names differ only by a number, you could also use a list of lists, where the number is the index into it.
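A minimal sketch of the dict-based alternative, combined with collections.Counter to do the counting from the question. The sample data is just the visible prefix of list_0 and list_1 from the question, and the names `lists` and `counts` are chosen for illustration.

```python
from collections import Counter

# Store the lists under string keys instead of separate variables.
lists = {
    "list_0": ["a", "b", "a", "b", "b", "c", "f", "h"],
    "list_1": ["f", "g", "c", "g", "f", "a", "b", "b", "b"],
}

# One Counter per list: maps each value to its number of occurrences.
counts = {name: Counter(values) for name, values in lists.items()}

for i in range(len(lists)):
    name = "list_" + str(i)  # build the key on the fly
    for value in "abcdefgh":
        # A Counter returns 0 for values that never occur.
        print(name, value, counts[name][value])
```

Looking up `counts["list_0"]["a"]` then replaces the `a_sum` variable from the question, with no need for exec or locals().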

How to self-join two bags?

I have some set of numbers that describes connections between the first set of integers and the second set of integers. For example:
1,2
3,4
5,6
5,7
6,8
I then load my data as follows, and group it:
data = load 'data.csv' using PigStorage(',') as (integer_1:int, integer_2:int);
grouped = group data by integer_1;
grouped_numbers = foreach grouped generate group as node, data.integer_2 as connection;
Which then yields a bag with each first integer and its first-degree connections:
(1,{(2)})
(3,{(4)})
(5,{(6),(7)})
(6,{(8)})
I would then like to do a self-join of the grouped_numbers bag, in order to give the resultant first integer with each of its first- and second-degree connections. In this case, that would be:
(1,{(2)})
(3,{(4)})
(5,{(6),(7),(8)})
(6,{(8)})
because 5 is connected to 6, which is connected to 8, so 8 is a second-degree connection of 5. How would I implement this in Pig?
First join:
joined = join data1 by integer_2, data2 by integer_1;
where data1 and data2 are the same set (copies of data in this example).
Then group by the first field. The inner bag will have all the connections to the 'group', possibly more than once, so you might need a distinct on the inner bags as well if you just want the unique elements.
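To make the join's semantics concrete, here is a small sketch in Python rather than Pig, using the sample pairs from the question. One assumption worth flagging: an inner self-join on its own yields only the second-degree pairs, so this sketch also merges the original first-degree pairs back in before grouping in order to reproduce the output the question asks for. The function names are hypothetical.

```python
from collections import defaultdict

edges = [(1, 2), (3, 4), (5, 6), (5, 7), (6, 8)]


def second_degree(edges):
    """Self-join: pair each edge's source with the target of any edge
    that starts where the first edge ends."""
    return [(a, d) for (a, b) in edges for (c, d) in edges if b == c]


def connections(edges):
    """Group first- and second-degree targets by source node.

    Using a set per node removes duplicates, like a distinct on the inner bag.
    """
    grouped = defaultdict(set)
    for source, target in edges + second_degree(edges):
        grouped[source].add(target)
    return {node: sorted(targets) for node, targets in grouped.items()}


print(second_degree(edges))
print(connections(edges))
```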
(answered via a Pig mailing list)

cloudsearch query to boost exact match on range

In a cloudsearch structured query.
I have a couple of fields I am searching on.
On field one, the user selects "2"
On field two the user selects "1"
I want to run this as a range query, so that the results returned are within -1 to +1 of each selected value,
e.g. on field one the range would be 1,3 and on field two it would be 0,2.
I also want to sort the results so that those matching both field one and field two exactly are at the top, with the rest below them,
e.g. results where field one = 2 and field two = 1 would be at the top and the rest would not be in any specific order.
note: I do need to end up sorting the results by distance, so that all the exact matching results are in distance order, then all the rest are ordered by distance.
I am sure I can do this with 2 queries, just trying to make it work in one query if at all possible to lighten the load.
Say your fields are 'a' and 'b', and the specified values are a=2 and b=1 (as in your example, except I've named the fields 'a' and 'b' instead of 'one' and 'two'). Here are the various terms of your query.
Range Query
This is the query for the range a±1 and b±1 where a=2 and b=1:
q=(and (range field=a[1,3]) (range field=b[0,2]))
Rank Expression
For your rank expression, compute a distance-based score using absolute value so that scores 'a' and 'b' can't cancel each other out (like a=3,b=0 would, for example):
expr.rank1=abs(a-2)+abs(b-1)
Sort by Rank
That defined a ranking expression named rank1, which we now want to sort by, starting with the lowest values ('0' means a=2,b=1):
sort=rank1 asc
Return the Rank
For debugging purposes, you may want to return the ranking score:
return=rank1
Put all those terms together and you've got your query.
Further Potentially-Useful Things
If you want to get fancy and penalize things in a non-linear way, you can use exp. For example, if you want to differentiate between 'a' and 'b' both being off by 1 vs 'a' being an exact match and 'b' being off by 2 (eg a=3,b=2 will rank ahead of a=2,b=3 even though the previous ranker would give them both a score of 2):
expr.rank1=exp(abs(a-2))+exp(abs(b-1))
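The arithmetic behind the two rankers can be checked with a short sketch. This is plain Python that mirrors the expressions above for the target a=2, b=1; it is not CloudSearch code, and the function names are made up for the example.

```python
import math


def linear_rank(a, b, target_a=2, target_b=1):
    """Mirror of abs(a-2)+abs(b-1): lower is better, 0 is an exact match."""
    return abs(a - target_a) + abs(b - target_b)


def exp_rank(a, b, target_a=2, target_b=1):
    """Mirror of exp(abs(a-2))+exp(abs(b-1)): larger misses cost much more."""
    return math.exp(abs(a - target_a)) + math.exp(abs(b - target_b))


for a, b in [(2, 1), (3, 2), (2, 3)]:
    print((a, b), linear_rank(a, b), round(exp_rank(a, b), 2))
```

Both (3, 2) and (2, 3) get a linear score of 2, but the exponential version scores (3, 2) lower than (2, 3), so it sorts first under `asc`.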
And you can use boolean logic and the ternary operator to detect and prefer certain results that meet certain criteria, eg to give a big boost when 'a' and 'b' are on-target, a smaller boost when 'a' or 'b' is on target, etc (since we're sorting in low-to-high, a boost in rank is actually achieved by adding less to the result):
expr.rank1=((a==2&&b==1)?0:100)+((a==2||b==1)?0:1000)+abs(a-2)+abs(b-1)
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-expressions.html

"fill in the blanks"

I'm trying to make a simple "fill in the blanks" type of exam in Django and would like to know the best way to design the database.
Example: "9 is the sum of 4 and 5, or 3 and 6."
During the exam, the above sentence would appear as "__ is the sum of __ and __, or __ and __."
Obviously there is an unlimited number of answers to this question, but assume that the above numbers are the only answers. The catch is that you can switch the places of 4 and 5, or the places of 3 and 6, and still get the right answer. Besides, the number of blanks is not known in advance, so it can be 1 or more.
I would go with something like this. First define a Question table:
Question
--------------------------
Id  Text
1   9 is the sum of 4 and 5, or 3 and 6
...
Then save the position of the hidden substrings, let's call them fields, in another table:
QuestionField
--------------------------------------------
Id  QuestionId  StartsAt  EndsAt  Set
1   1           0         1       1
2   1           16        17      2
3   1           22        23      2    # NOTE: Is in the same set as QuestionField #2
...
This table lets you retrieve the actual value of the field by querying the Question table (e.g. entry one refers to the value '9' in the first question).
The "Set" column contains an identifier of the "set" this field belongs to, where fields in the same set can be swapped with each other. When you populate it, you have to ensure that all fields that can be swapped with each other are in the same set. The actual number of the set doesn't matter, as long as it's unique within the question, but it makes sense to have it equal to the ID of one of the elements of the set.
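As a rough sketch of how answers could be checked against this layout, the Python below works on plain tuples instead of Django models. It uses only the three QuestionField rows shown in the table, and the names `expected_by_set` and `is_correct` are hypothetical. Within each set, the submitted values are compared as a multiset, so swapping 4 and 5 is still accepted.

```python
from collections import Counter, defaultdict

QUESTION_TEXT = "9 is the sum of 4 and 5, or 3 and 6"

# (id, starts_at, ends_at, set_id): the three rows shown in the table.
FIELDS = [(1, 0, 1, 1), (2, 16, 17, 2), (3, 22, 23, 2)]


def expected_by_set(text, fields):
    """Collect the hidden substrings of each set as a multiset."""
    expected = defaultdict(Counter)
    for _field_id, start, end, set_id in fields:
        expected[set_id][text[start:end]] += 1
    return expected


def is_correct(text, fields, answers):
    """`answers` maps field id -> submitted string.

    Fields in the same set may be answered in any order.
    """
    submitted = defaultdict(Counter)
    for field_id, _start, _end, set_id in fields:
        submitted[set_id][answers.get(field_id, "").strip()] += 1
    return submitted == expected_by_set(text, fields)


print(is_correct(QUESTION_TEXT, FIELDS, {1: "9", 2: "5", 3: "4"}))
print(is_correct(QUESTION_TEXT, FIELDS, {1: "9", 2: "4", 3: "4"}))
```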