I have two tables in BigQuery that I am trying to merge. For explanation purposes, let us call them A and B; I am merging B into A. I have a primary key called id on which I perform the merge. Both tables have a column (call it x) of type ARRAY. My intention is to replace the array data in A with that of B whenever the arrays differ between the two tables. How can I do that? I found posts on SO and other sites, but none of them work for my use case.
A                B
----------       ----------
id | x           id | x
----------       ----------
1  | [1,2]       1  | [1,2]
----------       ----------
2  | [3]         2  | [4,5]
The result of the merge should be
A
----------
id | x
----------
1 | [1,2]
----------
2 | [4,5]
How can I achieve the above result? Any leads would be very helpful. Also, if there are other posts that address this scenario directly, please point me to them.
Edits:
I tried the following:
merge A as main_table
using B as updated_table
on main_table.id = updated_table.id
when matched and main_table.x != updated_table.x then update set main_table.x = updated_table.x
when not matched then
insert(id, x) values (updated_table.id, updated_table.x)
;
Hope this helps.
I cannot directly use a comparison operator over an array, right? My use case is to update values only when they are not equal, so I cannot use something like != directly. This is the main problem.
You can use the to_json_string function to compare two arrays "directly":
to_json_string(main_table.x) != to_json_string(updated_table.x)
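Putting the two pieces together, the full MERGE might look like the sketch below (untested; A, B, id and x are the placeholder names from the question):

```sql
merge A as main_table
using B as updated_table
on main_table.id = updated_table.id
when matched and to_json_string(main_table.x) != to_json_string(updated_table.x) then
  update set x = updated_table.x
when not matched then
  insert (id, x) values (updated_table.id, updated_table.x)
;
```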
I am not used to hardware-close programming; pointers and RAM were always handled for me. But since I want to learn C++, I tried a 2D array. That didn't work because of the unknown size, so I decided to go with a 2D list.
The problem I have now is that I have no idea how the program will behave, and before I can test it I want to know whether values will be copied, overwritten, etc.
#include "board.h"
#include <list>
using namespace std;
void Board::initiate_board()
{
    list<list<int>> list_of_rows;
    for (int rows = 0; rows < Board::rows; rows++) {
        list<int> new_row;
        for (int columns = 0; columns < Board::columns; columns++) {
            new_row.push_back(0);
        }
        list_of_rows.push_back(new_row);
    }
}
What this is supposed to do is create a 2D list filled with 0s. I don't know what will happen in memory, and I have no way to visualise RAM and know what is where (and if I could, I'd be overwhelmed), so I was hoping someone could clear this up for me.
What I think this code does is create a list of 0s, put it into the outer list, and then start a new list, with the old one either deleted automatically (as it is no longer referenced) or overwritten (no clue which). So with rows and columns at 4 it will look like
|0|0|0|0| => ... => |0|0|0|0|
|0|0|0|0|
|0|0|0|0|
|0|0|0|0|
The two things I am uncertain of are, A: will a new list be created? Or will the old one just be extended, like:
|0|0|0|0|
|0|0|0|0|0|0|0|0|
|0|0|0|0|0|0|0|0|0|0|0|0|
|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
The second question is: will the list be copied, or will a reference be stored? That is, after saving the 4-long list into the outer list, would extending the original list also extend list[0] (since only a reference was saved) to be 8 long, so that changing the 2nd value would change it in every row?
|0|0|0|0| => |0|0|0|0|0|0|0|0|
|0|0|0|0|0|0|0|0|
I know this question might be very basic for someone who knows C++, but coming from Dart and Python, with C# being the most hardware-close language I somewhat know, this is just confusing to me. Is there a way to know what will happen other than trying it out by printing the list of lists, or just guessing?
And if I wanted to store a reference to the list rather than a copy, how would I do that?
I recommend you pick up a good book on the basics of modern C++. C++ is a challenging language with a steep learning curve and long legacy. There is a lot to learn regarding object lifetimes, different types of construction and initialization, etc -- and it will be easier to start with a good book covering these topics.
To answer your question: the code will work the way you expect it to work; you will create a std::list containing rows std::lists that each contain columns 0s.
That is, you will produce a container of containers that logically[1] looks like:
<--columns-->
^ |0|0|...|0|0|
| |0|0| |0|0|
| . . .
rows . . .
| . . .
| |0|0| |0|0|
V |0|0|...|0|0|
Variables in C++ have lifetimes associated to them, often tied to the scope they are in. new_row starts its life where it is defined in the for loop, and is destroyed at the closing brace of the loop for each iteration.
More generally, all objects are destroyed in the reverse order declared by the end of scope; and a loop is simply a scope that occurs multiple times.
So in your above code, what is happening is:
1. A list called new_row is created with 0 elements.
2. columns 0s are pushed into it (the inner loop).
3. list_of_rows makes a copy of new_row.
4. new_row is destroyed (end of scope).
5. Repeat from step 1, rows times.
Technically, because list_of_rows isn't used, this gets destroyed at the end of the function scope -- though I assume that its use was omitted for brevity
[1] I recommend that you look into using std::vector instead of std::list, which is probably closer to what you intend to use.
std::vector is a dynamic array type (contiguous storage), whereas std::list is actually a doubly-linked list implementation. This means you can't simply index into a std::list (e.g. list_of_rows[1][2]) because access requires traversal of the list. std::vector does not have this issue.
What your list<list<...>> implementation does is actually much closer to:
<--------columns-------->
^ |*| --> |0| <-> |0| ... |0| <-> |0|
| ^
| |
| V
| |*| --> |0| <-> |0| ... |0| <-> |0|
| .
rows .
| .
| |*| --> |0| <-> |0| ... |0| <-> |0|
| ^
| |
| V
V |*| --> |0| <-> |0| ... |0| <-> |0|
Whereas a vector<vector<...>> implementation will be:
<--columns-->
^ |*| -> |0|0|...|0|0|
| |*| -> |0|0| |0|0|
| . . .
rows . . .
| . . .
| |*| -> |0|0| |0|0|
V |*| -> |0|0|...|0|0|
Here is a case of mapping two lists into a single list. I expected the result of both MLs to be the same. Unfortunately, the ReasonML result is different. Please help and comment on how to fix it.
OCaml
List.map2 (fun x y -> x + y) [1;2;3] [10;20;30];;
- : int list = [11; 22; 33]
ReasonML
Js.log(List.map2((x, y) => x + y, [1, 2, 3], [10, 20, 30]));
[11,[22,[33,0]]]
This is the same result. If you run:
Js.log([11,22,33]);
You'll get:
[11,[22,[33,0]]]
The result is the same, but you're using different methods of printing them. If instead of Js.log you use rtop or sketch.sh, you'll get the output you expect:
- : list(int) = [11, 22, 33]
Js.log prints it differently because it is a BuckleScript binding to console.log, which will print the JavaScript-representation of the value you give to it. And lists don't exist in JavaScript, only arrays do.
The way BuckleScript represents lists is pretty much the same way it is done natively. A list in OCaml and Reason is a "cons-cell", which is essentially a tuple or a 2-element array, where the first item is the value of that cell and the last item is a pointer to the next cell. The list type is essentially defined like this:
type list('a) =
| Node('a, list('a))
| Empty;
And with this definition, the list could have been constructed with:
Node(11, Node(22, Node(33, Empty)))
which is represented in JavaScript like this:
[11,[22,[33,0]]]
^ ^ ^ ^
| | | The Empty list
| | Third value
| Second value
First value
Lists are defined this way because immutability makes this representation very efficient: we can add or remove values at the front without copying all the items of the old list into a new one. To add an item we only need to create one new "cons-cell". Using the JavaScript representation, with imagined immutability:
const oldList = [11,[22,[33,0]]];
const newList = [99, oldList];
And to remove an item from the front we don't have to create anything. We can just get a reference to and re-use a sub-list, because we know it won't change.
const oldList = [11,[22,[33,0]]];
const rest = oldList[1];
The downside of lists is that adding and removing items to the end is relatively expensive. But in practice, if you structure your code in a functional way, using recursion, the list will be very natural to work with. And very efficient.
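As a runnable sketch of those two operations in plain JavaScript (cons and EMPTY are hypothetical helpers mirroring the BuckleScript representation, not part of any library):

```javascript
// A cons-cell list as nested 2-element arrays; 0 plays the role of Empty.
const cons = (head, tail) => [head, tail];
const EMPTY = 0;

const oldList = cons(11, cons(22, cons(33, EMPTY)));  // [11,[22,[33,0]]]

// O(1) prepend: one new cell; oldList is shared, not copied.
const withNew = cons(99, oldList);

// O(1) removal from the front: just reference the tail.
const rest = oldList[1];

console.log(JSON.stringify(withNew));  // [99,[11,[22,[33,0]]]]
console.log(JSON.stringify(rest));     // [22,[33,0]]
```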
@Igor Kapkov, thank you for your help. Based on your comment, I found the pipeline statement in the link; here is a summary.
let a = List.map2((x, y) => x + y, [1, 2, 3], [10, 20, 30]);
let logl = l => l |> Array.of_list |> Js.log;
a |> logl
[11,22,33]
I want to combine these 2 queries into 1, to get the Music object with name xyz and also get the top 3 objects from genre 10, ordered by artist:
1. Music.objects.filter(name='xyz', genre=10)
2. Music.objects.filter(genre=10).order_by('artist')[:3]
I can use Q objects like this but I don't know how to order & filter the 3rd Q object below:
Music.objects.filter( (Q(name='xyz') & Q(genre=10)) | Q(genre=10) )
Maybe try like this?
Music.objects.filter( (Q(name='xyz') & Q(genre=10)) | Q(genre=10) ).order_by('artist')[:3]
There is no way to do it in one query; use two queries, but change the second to avoid duplicates:
exact = Music.objects.filter(name='xyz', genre=10)
additional = Music.objects.filter(genre=10).exclude(name='xyz').order_by('artist')[:3]
do_something_with = list(exact) + list(additional)
I'm using the levelsof command to identify unique values of a variable and stick them into a macro. Then later on I'd like to use those values in the macro to select records from another dataset that I'll load.
What I have in mind is something along the following lines:
keep if inlist(variable, "`macrovariable'")
Does that work? And is there another more efficient option? I could do this easily in R (because vectors are easier to work with than macros), but this project requires Stata.
Clarification:
if I have a variable with three unique values, a, b and c, I want to store those in a macro variable so I can later take another dataset and select observations that match one of those values.
Normally I can use the inlist function to do this manually, but I'd like to soft-code it so I can run the program with different sets of values. And I can't get the inlist function to work with macros.
* the source data
levelsof x, local( allx )
* make it -inlist-friendly
local allxcommas : subinstr local allx " " ", ", all
* bring in the new data
use if inlist(x, `allxcommas') using blah.dta
I suspect your difficulty in using a macro generated by levelsof with inlist is that you forgot to use the separate(,) option. I also do not believe you can use the inlist function with keep if-- you will need to add the extra step of defining a new indicator.
In the example below I used the 1978 auto data and created a variable make_abb of vehicle manufacturers (or make) which took only a handful of distinct values ("Do" for Dodge, etc.).
I then used the levelsof command to generate a local macro of the manufacturers which had a vehicle model with a poor repair record (the variable rep78 is a categorical repair record variable where 1 is poor and 5 is good). The option separate(,) is what adds the commas into the macro and enables inlist to read it later on.
Finally, if I want to drop the manufacturers which did not have a poor repair record, I generate a dummy variable named "keep_me" and fill it in using the inlist function.
*load some data
sysuse auto
*create some make categories by splitting the make and model string
gen make_abb=substr(make,1,2)
lab var make_abb "make abbreviation (string)"
*use levelsof with "local(macro_name)" and "separate(,)" options
levelsof make_abb if rep78<=2, separate(,) local(make_poor)
*generate a dummy using inlist and your levelsof macro from above
gen keep_me=1 if inlist(make_abb,`make_poor')
lab var keep_me "dummy of makes that had a bad repair record"
*now you can discard the rest of your data
keep if keep_me==1
This seems to work for me.
* "using" data
clear
tempfile so
set obs 10
foreach v in list a b c d {
generate `v' = runiform()
}
save `so'
* "master" data
clear
set obs 10
foreach v in list e f g h {
generate `v' = runiform()
}
* merge
local tokeepusing a b
merge 1:1 _n using `so', keepusing(`tokeepusing')
Yields:
. list
+------------------------------------------------------------------------------------------+
| list e f g h a b _merge |
|------------------------------------------------------------------------------------------|
1. | .7767971 .5910658 .6107377 .7256517 .357592 .8953723 .0871481 matched (3) |
2. | .643114 .6305301 .6441092 .7770287 .5247816 .4854506 .3840067 matched (3) |
3. | .3833295 .175099 .4530386 .5267127 .628081 .2273252 .0460549 matched (3) |
4. | .0057233 .1090542 .1437526 .3133509 .604553 .9375801 .8091199 matched (3) |
5. | .8772233 .6420991 .5403687 .1591801 .5742173 .8948932 .4121684 matched (3) |
|------------------------------------------------------------------------------------------|
6. | .6526399 .5137199 .933116 .5415702 .4313532 .8602547 .5049801 matched (3) |
7. | .2033027 .8745837 .8609 .0087578 .9844069 .1909852 .3695011 matched (3) |
8. | .6363281 .0064866 .6632325 .307236 .9544498 .6267227 .2908498 matched (3) |
9. | .366027 .4896181 .0955155 .4972361 .9161932 .7391482 .414847 matched (3) |
10. | .8637221 .8478178 .5457179 .8971257 .9640535 .541567 .1966634 matched (3) |
+------------------------------------------------------------------------------------------+
Does this answer your question? If not, please comment.
Taking an example of Fibonacci Series from the Clojure Wiki, the Clojure code is :
(def fib-seq
(lazy-cat [0 1] (map + (rest fib-seq) fib-seq)))
If you were to think about this starting from the [0 1], how does it work? It would be great to get suggestions on the thought process that goes into thinking in these terms.
As you noted, the [0 1] establishes the base cases: The first two values in the sequence are zero, then one. After that, each value is to be the sum of the previous value and the value before that one. Hence, we can't even compute the third value in the sequence without having at least two that come before it. That's why we need two values with which to start off.
Now look at the map form. It says to take the head items from two different sequences, combine them with the + function (adding multiple values to produce one sum), and expose the result as the next value in a sequence. The map form is zipping together two sequences — presumably of equal length — into one sequence of the same length.
The two sequences fed to map are different views of the same basic sequence, shifted by one element. The first sequence is "all but the first value of the base sequence". The second sequence is the base sequence itself, which, of course, includes the first value. But what should the base sequence be?
The definition above said that each new element is the sum of the previous (Z - 1) and the predecessor to the previous element (Z - 2). That means that extending the sequence of values requires access to the previously computed values in the same sequence. We definitely need a two-element shift register, but we can also request access to our previous results instead. That's what the recursive reference to the sequence called fib-seq does here. The symbol fib-seq refers to a sequence that's a concatenation of zero, one, and then the sum of its own Z - 2 and Z - 1 values.
Taking the sequence called fib-seq, drawing the first item yields the first element of the [0 1] vector — zero. Drawing the second item yields the second element of the vector — one. Upon drawing the third item, we consult the map to generate a sequence and use that as the remaining values. The sequence generated by map here starts out with the sum of the first item of "the rest of" [0 1], which is one, and the first item of [0 1], which is zero. That sum is one.
Drawing the fourth item consults map again, which now must compute the sum of the second item of "the rest of" the base sequence, which is the one generated by map, and the second item of the base sequence, which is the one from the vector [0 1]. That sum is two.
Drawing the fifth item consults map, summing the third item of "the rest of" the base sequence — again, the one resulting from summing zero and one — and the third item of the base sequence — which we just found to be two.
You can see how this is building up to match the intended definition for the series. What's harder to see is whether drawing each item is recomputing all the preceding values twice — once for each sequence examined by map. It turns out there's no such repetition here.
To confirm this, augment the definition of fib-seq like this to instrument the use of function +:
(def fib-seq
(lazy-cat [0 1]
(map
(fn [a b]
(println (format "Adding %d and %d." a b))
(+ a b))
(rest fib-seq) fib-seq)))
Now ask for the first ten items:
> (doall (take 10 fib-seq))
Adding 1 and 0.
Adding 1 and 1.
Adding 2 and 1.
Adding 3 and 2.
Adding 5 and 3.
Adding 8 and 5.
Adding 13 and 8.
Adding 21 and 13.
(0 1 1 2 3 5 8 13 21 34)
Notice that there are eight calls to + to generate the first ten values.
Since writing the preceding discussion, I've spent some time studying the implementation of lazy sequences in Clojure — in particular, the file LazySeq.java — and thought this would be a good place to share a few observations.
First, note that many of the lazy sequence processing functions in Clojure eventually use lazy-seq over some other collection. lazy-seq creates an instance of the Java type LazySeq, which models a small state machine. It has several constructors that allow it to start in different states, but the most interesting case is the one that starts with just a reference to a nullary function. Constructed that way, the LazySeq has neither evaluated the function nor found a delegate sequence (type ISeq in Java). The first time one asks the LazySeq for its first element — via first — or any successors — via next or rest — it evaluates the function, digs down through the resulting object to peel away any wrapping layers of other LazySeq instances, and finally feeds the innermost object through the java function RT#seq(), which results in an ISeq instance.
At this point, the LazySeq has an ISeq to which to delegate calls on behalf of first, next, and rest. Usually the "head" ISeq will be of type Cons, which stores a constant value in its "first" (or "car") slot and another ISeq in its "rest" (or "cdr") slot. That ISeq in its "rest" slot can in turn be a LazySeq, in which case accessing it will again require this same evaluation of a function, peeling away any lazy wrappers on the return value, and passing that value through RT#seq() to yield another ISeq to which to delegate.
The LazySeq instances remain linked together, but having forced one (through first, next, or rest) causes it to delegate straight through to some non-lazy ISeq thereafter. Usually that forcing evaluates a function that yields a Cons bound to first value and its tail bound to another LazySeq; it's a chain of generator functions that each yield one value (the Cons's "first" slot) linked to another opportunity to yield more values (a LazySeq in the Cons's "rest" slot).
Tying this back to the Fibonacci Sequence example above: map will take each of the nested references to fib-seq and walk them separately via repeated calls to rest. Each such call will transform at most one LazySeq holding an unevaluated function into a LazySeq pointing to something like a Cons. Once transformed, any subsequent accesses will quickly resolve to the Conses — where the actual values are stored. When one branch of the map zipping walks fib-seq one element behind the other, the values have already been resolved and are available for constant-time access, with no further evaluation of the generator function required.
Here are some diagrams to help visualize this interpretation of the code:
+---------+
| LazySeq |
fib-seq | fn -------> (fn ...)
| sv |
| s |
+---------+
+---------+
| LazySeq |
fib-seq | fn -------> (fn ...) -+
| sv <------------------+
| s |
+---------+
+---------+
| LazySeq |
fib-seq | fn |
| sv -------> RT#seq() -+
| s <------------------+
+---------+
+---------+ +------+
| LazySeq | | ISeq |
fib-seq | fn | | |
| sv | | |
| s ------->| |
+---------+ +------+
+---------+ +--------+ +------+
| LazySeq | | Cons | | ISeq |
fib-seq | fn | | first ---> 1 | |
| sv | | more -------->| |
| s ------->| | | |
+---------+ +--------+ +------+
+---------+ +--------+ +---------+
| LazySeq | | Cons | | LazySeq |
fib-seq | fn | | first ---> 1 | fn -------> (fn ...)
| sv | | more -------->| sv |
| s ------->| | | s |
+---------+ +--------+ +---------+
As map progresses, it hops from LazySeq to LazySeq (and hence Cons to Cons), and the rightmost edge only expands the first time one calls first, next, or rest on a given LazySeq.
My Clojure is a bit rusty, but this seems to be a literal translation of the famous Haskell one-liner:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
[I'm going to be using pseudo-Haskell, because it's a little bit more succinct.]
The first thing you need to do, is simply let laziness sink in. When you look at a definition like this:
zeroes = 0 : zeroes
Your immediate gut reaction as a strict programmer would be "ZOMG infinite loop! Must fix bug!" But it isn't an infinite loop. This is a lazy infinite loop. If you do something stupid like
print zeroes
Then, yes, there will be an infinite loop. But as long as you simply use a finite number of elements, you will never notice that the recursion doesn't actually terminate. This is really hard to get. I still don't.
Laziness is like the monetary system: it's based on the assumption that the vast majority of people never use the vast majority of their money. So, when you put $1000 in the bank, they don't keep it in their safe. They lend it to someone else. Actually, they leverage the money, which means that they lend $5000 to someone else. They only ever need enough money so that they can quickly reshuffle it so that it's there when you are looking at it, giving you the appearance that they actually keep your money.
As long as they can manage to always give out money when you walk up to an ATM, it doesn't actually matter that the vast majority of your money isn't there: they only need the small amount you are withdrawing at the point in time when you are making your withdrawal.
Laziness works the same: whenever you look at it, the value is there. If you look at the first, tenth, hundredth, quadrillionth element of zeroes, it will be there. But it will only be there if and when you look at it, not before.
That's why this infinitely recursive definition of zeroes works: as long as you don't try to look at the last element (or every element) of an infinite list, you are safe.
Next step is zipWith. (Clojure's map is just a generalization of what in other programming languages are usually three different functions: map, zip and zipWith. In this example, it is used as zipWith.)
The reason why the zip family of functions is named that way, is because it actually works like a zipper, and that is also how to best visualize it. Say we have some sporting event, where every contestant gets two tries, and the score from both tries is added up to give the end result. If we have two sequences, run1 and run2 with the scores from each run, we can calculate the end result like so:
res = zipWith (+) run1 run2
Assuming our two lists are (3 1 6 8 6) and (4 6 7 1 3), we line both of those lists up side by side, like the two halves of a zipper, and then we zip them together using our given function (+ in this case) to yield a new sequence:
3 1 6 8 6
+ + + + +
4 6 7 1 3
= = = = =
7 7 13 9 9
Contestant #3 wins.
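The same zipper can be sketched in Python (just an illustration, not part of the original answer): map over two sequences with an adding function.

```python
from operator import add

run1 = [3, 1, 6, 8, 6]
run2 = [4, 6, 7, 1, 3]

# Pair up elements positionally and combine each pair with +.
res = list(map(add, run1, run2))
print(res)  # [7, 7, 13, 9, 9]
```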
So, what does our fib look like?
Well, it starts out with a 0, then we append a 1, then we append the sum of the infinite list with the infinite list shifted by one element. It's easiest to just draw that out:
the first element is zero:
0
the second element is one:
0 1
the third element is the first element plus the first element of the rest (i.e. the second element). We visualize this again like a zipper, by putting the two lists on top of each other.
0 1
+
1
=
1
Now, the element that we just computed is not just the output of the zipWith function, it is at the same time also the input, as it gets appended to both lists (which are actually the same list, just shifted by one):
0 1 1
+ +
1 1
= =
1 2
and so forth:
0 1 1 2
+ + +
1 1 2
= = =
1 2 3
0 1 1 2 3 ^
+ + + + |
1 1 2 3 ^ |
= = = = | |
1 2 3 5---+---+
Or if you draw it a little bit differently and merge the result list and the second input list (which really are the same, anyway) into one:
0 1 1 2 3 ^
+ + + + + |
1 = 1 = 2 = 3 = 5---+
That's how I visualize it, anyway.
As for how this works:
Each term of the fibonacci series is obviously the result of adding the previous two terms.
That's what map is doing here: it applies + to pairs of elements, one from each sequence, until one of the sequences runs out (which they won't in this case, of course). So the result is a sequence of numbers obtained by adding each term of the sequence to the term after it. Then you need lazy-cat to give it a starting point and to make sure the function only returns what it's asked for.
The problem with this implementation is that fib-seq is holding onto the whole sequence for as long as the fib-seq is defined, so it will eventually run you out of memory.
Stuart Halloway's book spends some time on dissecting different implementations of this function, I think the most interesting one is below (it's Christophe Grande's):
(defn fibo []
(map first (iterate (fn [[a b]] [b (+ a b)]) [0 1])))
Unlike the previously posted implementation, already-read elements of the sequence have nothing holding onto them, so this one can keep running without generating an OutOfMemoryError.
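The "don't hold onto the head" idea can be sketched in Python as an analogy (this is not Clojure, just a comparable lazy structure): a generator keeps only two numbers of state and produces values on demand, so already-consumed values are free to be garbage-collected.

```python
from itertools import islice

def fibs():
    # Lazily yield Fibonacci numbers; only the two most recent
    # values are retained between requests.
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

print(list(islice(fibs(), 10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```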
How to get thinking in these terms is a harder question. So far for me it's a matter of getting acquainted with a lot of different ways of doing things and trying them out, while in general looking for ways to apply the existing function library in preference to using recursion and lazy-cat. But in some cases the recursive solution is really great, so it depends on the problem. I'm looking forward to getting the Joy of Clojure book, because I think it will help me a lot with this issue.