Generate dummies in Stata vs. in R

Stata
r, u, s are dummies. I'm wondering whether the following line also generates a dummy n that is 1 if r or u or s == 1, just with the == 1 omitted after r, u, s?
generate byte n = r | u | s
R
Does it make a difference in R whether we generate the variable as a byte, or is it the same in R?

This answer addresses Stata questions only.
In Stata, if r u s are all 0/1 variables, then r | u | s is also 0/1: it will be 1 if any of them is 1, and 0 if and only if all are 0. So it is equivalent to max(r, u, s).
But watch out: if r u s can be 0, 1, or missing, then r | u | s will also be 1 if any of them is missing, because Stata treats missing as nonzero and hence true. max(r, u, s), on the other hand, will be missing only if all of them are missing.
If missings are present, then you could use
* 1
gen n = r | u | s if !missing(r, u, s)
The result will be 1 if any argument r u s is 1, 0 if all arguments are 0, and missing if any argument is missing.
* 2
gen n = (r == 1) | (u == 1) | (s == 1)
The result will be 1 if any argument is 1 and 0 otherwise. "Otherwise" is anything from all 0s to all missings.
* 3
gen n = inlist(1, r, u, s)
#3 is equivalent to #2.
In all cases, specifying byte is good practice to save on storage, but not material otherwise.

What does 'if((mask | u)==u)' mean?

This is the recurrence relation of the maximum sum subset problem.
The complete code is:
if ((mask | u) == u)
    dp[u] = max(max(0, dp[u ^ mask] + array[i]), dp[u]);
What exactly does the following if-statement mean?
if((mask | u) == u)
Thank you in advance!
It means: “are all the bits of mask also set in u?”
So if there is a bit set in mask that is not set in u, this test returns false.
For instance, with mask=0b001 and u=0b011 it returns true. But with mask=0b101 and u=0b011 it returns false, because the third bit of mask is not set in u.
Binary OR for binary values A, B evaluates to (1) if either A = 1 or B = 1. These bitwise operations extend to strings of binary digits, which in C/C++ are most commonly represented by integral types.
  OR   | A = 0 | A = 1 |
-------------------------
 B = 0 |  (0)  |  (1)  |
-------------------------
 B = 1 |  (1)  |  (1)  |
-------------------------
(forgive the ASCII art - more concise illustrations and links are here)
mask = {m(n - 1), m(n - 2), .., m(1), m(0)} : (n) binary digits (bits) m(i)
u = {u(n - 1), u(n - 2), .., u(1), u(0)} : (n) binary digits (bits) u(i)
Let's consider (m(i) | u(i)) == u(i) for each i = 0, .., n - 1; should any of these bit-wise comparisons be false, then the whole expression ((mask | u) == u) evaluates as false.
From the OR table we can conclude that a per-bit comparison is false if and only if m(i) = 1 and u(i) = 0: in that case m(i) | u(i) == (1) OR (0) == (1), which does not equal u(i) == (0).
A more concise way of expressing the issue is that if mask has a bit at a position (i) set to (1), and u has a bit at the same position cleared to (0), then (mask | u) cannot equal u.
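A tiny self-contained sketch of the test, using the two example values from above and C++14 binary literals (note that (mask | u) == u is the same check as (mask & ~u) == 0):
#include <iostream>

// Returns true when every bit set in mask is also set in u.
bool is_subset(unsigned mask, unsigned u) {
    return (mask | u) == u;              // equivalently: (mask & ~u) == 0
}

int main() {
    std::cout << std::boolalpha
              << is_subset(0b001, 0b011) << '\n'    // true:  the only bit of mask is in u
              << is_subset(0b101, 0b011) << '\n';   // false: the 0b100 bit of mask is not in u
}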

A many-to-one mapping in the natural domain using discrete input variables?

I would like to find a mapping f: X --> N, with multiple discrete natural variables X of varying dimension, where f produces a unique number from 0 up to (but not including) the product of all dimensions. For example, assume X = {a, b, c}, with dimensions |a| = 2, |b| = 3, |c| = 2. f should produce the values 0 to 11 (2*3*2 = 12 values in total).
a b c | f(X)
0 0 0 | 0
0 0 1 | 1
0 1 0 | 2
0 1 1 | 3
0 2 0 | 4
0 2 1 | 5
1 0 0 | 6
1 0 1 | 7
1 1 0 | 8
1 1 1 | 9
1 2 0 | 10
1 2 1 | 11
This is easy when all dimensions are equal. Assume binary for example:
f(a=1,b=0,c=1) = 1*2^2 + 0*2^1 + 1*2^0 = 5
Using this naively with varying dimensions we would get overlapping values:
f(a=0,b=1,c=1) = 0*2^2 + 1*3^1 + 1*2^0 = 4
f(a=1,b=0,c=0) = 1*2^2 + 0*3^1 + 0*2^0 = 4
A computationally fast function is preferred as I intend to use/implement it in C++. Any help is appreciated!
Ok, the most important part here is math and algorithmics. You have variable dimensions of sizes (from least significant to most significant) d0, d1, ..., dn. A tuple (x0, x1, ..., xn) with xi < di will represent the following number: x0 + d0 * x1 + d0 * d1 * x2 + ... + d0 * d1 * ... * d(n-1) * xn
In pseudo-code, I would write:
result = 0
loop for i = n to 0 step -1
    result = result * d[i] + x[i]
To implement it in C++, my advice would be to create a class whose constructor takes the number of dimensions and the dimensions themselves (or simply a vector<int> containing the dimensions), and a method that accepts an array or a vector of the same size containing the values. Optionally, you could check that every input value is less than its dimension.
A possible C++ implementation could be:
#include <stdexcept>
#include <vector>

using std::vector;

class F {
    vector<int> dims;
public:
    F(vector<int> d) : dims(d) {}

    // Mixed-radix encoding: x[0] is the least significant "digit".
    int to_int(vector<int> x) {
        if (x.size() != dims.size()) {
            throw std::invalid_argument("Wrong size");
        }
        int result = 0;
        for (int i = dims.size() - 1; i >= 0; i--) {
            if (x[i] >= dims[i]) {
                throw std::invalid_argument("Value >= dimension");
            }
            result = result * dims[i] + x[i];
        }
        return result;
    }
};
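A quick usage sketch (hypothetical values, assuming the class above is in scope): list the dimensions least-significant first, i.e. {|c|, |b|, |a|} = {2, 3, 2}, and the result reproduces the question's table.
#include <iostream>

int main() {
    F f({2, 3, 2});                            // dimensions of c, b, a (least significant first)
    // a = 1, b = 0, c = 1  ->  1*6 + 0*2 + 1*1 = 7, as in the question's table
    std::cout << f.to_int({1, 0, 1}) << '\n';  // prints 7
}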

Determinant of a square binary matrix c++ [duplicate]

Can anyone tell me which is the best algorithm to find the value of the determinant of an N x N matrix?
Here is an extensive discussion.
There are a lot of algorithms.
A simple one is to take the LU decomposition. Then, since
det M = det LU = det L * det U
and both L and U are triangular, the determinant is a product of the diagonal elements of L and U. That is O(n^3). There exist more efficient algorithms.
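If you want to roll it yourself rather than use a library, here is a minimal sketch of that idea in C++ (illustrative only, no claims of numerical robustness): Gaussian elimination with partial pivoting, which is effectively computing U while tracking row swaps, since each swap flips the sign of the determinant.
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Determinant via elimination with partial pivoting, O(n^3).
// The matrix is taken by value so the caller's copy is left untouched.
double determinant(std::vector<std::vector<double>> a) {
    const std::size_t n = a.size();
    double det = 1.0;
    for (std::size_t col = 0; col < n; ++col) {
        // Pick the row with the largest entry in this column as the pivot.
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (a[pivot][col] == 0.0) return 0.0;    // singular matrix
        if (pivot != col) {                      // a row swap flips the sign
            std::swap(a[pivot], a[col]);
            det = -det;
        }
        det *= a[col][col];
        // Eliminate the entries below the pivot.
        for (std::size_t r = col + 1; r < n; ++r) {
            const double factor = a[r][col] / a[col][col];
            for (std::size_t c = col; c < n; ++c)
                a[r][c] -= factor * a[col][c];
        }
    }
    return det;
}
For instance, on the 4 x 4 matrix used in the row-reduction answer below it gives 8 (up to floating-point rounding).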
Row Reduction
The simplest way (and not a bad way, really) to find the determinant of an nxn matrix is by row reduction. By keeping in mind a few simple rules about determinants, we can solve in the form:
det(A) = α * det(R), where R is the row echelon form of the original matrix A, and α is some coefficient.
Finding the determinant of a matrix in row echelon form is really easy; you just find the product of the diagonal. Solving the determinant of the original matrix A then just boils down to calculating α as you find the row echelon form R.
What You Need to Know
What is row echelon form?
See this [link](http://stattrek.com/matrix-algebra/echelon-form.aspx) for a simple definition
**Note:** Not all definitions require 1s for the leading entries, and it is unnecessary for this algorithm.
You Can Find R Using Elementary Row Operations
Swapping rows, adding multiples of another row, etc.
You Derive α from Properties of Row Operations for Determinants
If B is a matrix obtained by multiplying a row of A by some non-zero constant ß, then
det(B) = ß * det(A)
In other words, you can essentially 'factor out' a constant from a row by just pulling it out front of the determinant.
If B is a matrix obtained by swapping two rows of A, then
det(B) = -det(A)
If you swap rows, flip the sign.
If B is a matrix obtained by adding a multiple of one row to another row in A, then
det(B) = det(A)
The determinant doesn't change.
Note that you can find the determinant, in most cases, with only Rule 3 (when the diagonal of A has no zeros, I believe), and in all cases with only Rules 2 and 3. Rule 1 is helpful for humans doing math on paper, trying to avoid fractions.
Example
(I do unnecessary steps to demonstrate each rule more clearly)
| 2 3 3 1 |
A=| 0 4 3 -3 |
| 2 -1 -1 -3 |
| 0 -4 -3 2 |
R2 <-> R3, -α -> α (Rule 2)
| 2 3 3 1 |
-| 2 -1 -1 -3 |
| 0 4 3 -3 |
| 0 -4 -3 2 |
R2 - R1 -> R2 (Rule 3)
| 2 3 3 1 |
-| 0 -4 -4 -4 |
| 0 4 3 -3 |
| 0 -4 -3 2 |
R2/(-4) -> R2, -4α -> α (Rule 1)
| 2 3 3 1 |
4| 0 1 1 1 |
| 0 4 3 -3 |
| 0 -4 -3 2 |
R3 - 4R2 -> R3, R4 + 4R2 -> R4 (Rule 3, applied twice)
| 2 3 3 1 |
4| 0 1 1 1 |
| 0 0 -1 -7 |
| 0 0 1 6 |
R4 + R3 -> R4 (Rule 3)
| 2 3 3 1 |
4| 0 1 1 1 | = 4 ( 2 * 1 * -1 * -1 ) = 8
| 0 0 -1 -7 |
| 0 0 0 -1 |
def echelon_form(A, size):
    """Reduce A (a size x size list of lists) to an echelon form in place.

    Each nonzero entry below the diagonal is eliminated using the row directly
    above it; if that row has a zero in the current column, the two rows are
    swapped. Note that row swaps flip the sign of the determinant (Rule 2
    above), which is not tracked here.
    """
    for i in range(size - 1):
        for j in range(size - 1, i, -1):
            if A[j][i] == 0:
                continue
            else:
                try:
                    # assumes true (float) division for the pivot ratio
                    req_ratio = A[j][i] / A[j - 1][i]
                    # A[j] = A[j] - req_ratio*A[j-1]
                except ZeroDivisionError:
                    # A[j], A[j-1] = A[j-1], A[j]
                    for x in range(size):
                        temp = A[j][x]
                        A[j][x] = A[j - 1][x]
                        A[j - 1][x] = temp
                    continue
                for k in range(size):
                    A[j][k] = A[j][k] - req_ratio * A[j - 1][k]
    return A
If you have done some initial research, you have probably found that with N >= 4, calculating a matrix determinant directly becomes quite complex. Regarding algorithms, I would point you to the Wikipedia article on matrix determinants, specifically the "Algorithmic Implementation" section.
From my own experience, you can easily find an LU or QR decomposition algorithm in existing matrix libraries such as Alglib. The algorithm itself is not exactly simple, though.
I am not too familiar with LU factorization, but I know that in order to get either L or U, you need to make the initial matrix triangular (upper triangular for U, lower triangular for L). However, once you get some n x n matrix A into triangular form T, and assuming the only operation your code uses is Rb - k*Ra, the determinant is just the product of the diagonal: det(A) = T(0,0) x T(1,1) x ... x T(n-1,n-1). Check this link to see what I'm talking about: http://matrix.reshish.com/determinant.php

Need help implementing a Lucas Pseudoprimality test

I am trying to write a function that determines if a number n is prime or composite using the Lucas pseudoprime test; at the moment, I am working with the standard test, but once I get that working I will then write the strong test. I am reading the paper by Baillie and Wagstaff, and following the implementation by Thomas Nicely in the trn.c file.
I understand that the full test involves several steps: trial division by small primes, checking that n is not a square, performing a strong pseudoprimality test to base 2, then finally the Lucas pseudoprime test. I can handle all the other pieces, but I am having trouble with the Lucas pseudoprime test. Here is my implementation, in Python:
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a

def jacobi(a, m):
    a = a % m; t = 1
    while a != 0:
        while a % 2 == 0:
            a = a / 2
            if m % 8 == 3 or m % 8 == 5:
                t = -1 * t
        a, m = m, a # swap a and m
        if a % 4 == 3 and m % 4 == 3:
            t = -1 * t
        a = a % m
    if m == 1:
        return t
    return 0

def isLucasPrime(n):
    dAbs, sign, d = 5, 1, 5
    while 1:
        if 1 < gcd(d, n) > n:
            return False
        if jacobi(d, n) == -1:
            break
        dAbs, sign = dAbs + 2, sign * -1
        d = dAbs * sign
    p, q = 1, (1 - d) / 4
    print "p, q, d =", p, q, d
    u, v, u2, v2, q, q2 = 0, 2, 1, p, q, 2 * q
    bits = []
    t = (n + 1) / 2
    while t > 0:
        bits.append(t % 2)
        t = t // 2
    h = -1
    while -1 * len(bits) <= h:
        print "u, u2, v, v2, q, q2, bits, bits[h] = ",\
            u, u2, v, v2, q, q2, bits, bits[h]
        u2 = (u2 * v2) % n
        v2 = (v2 * v2 - q2) % n
        if bits[h] == 1:
            u = u2 * v + u * v2
            u = u if u % 2 == 0 else u + n
            u = (u / 2) % n
            v = (v2 * v) + (u2 * u * d)
            v = v if v % 2 == 0 else v + n
            v = (v / 2) % n
        if -1 * len(bits) < h:
            q = (q * q) % n
            q2 = q + q
        h = h - 1
    return u == 0
When I run this, isLucasPrime returns False for such primes as 83 and 89, which is incorrect. It also returns False for the composite 111, which is correct. And it returns False for the composite 323, which I know is a Lucas pseudoprime for which isLucasPrime should return True. In fact, isLucasPrime returns False for every n on which I have tested it.
I have several questions:
1) I'm not expert with C/GMP, but it seems to me that Nicely runs through the bits of (n+1)/2 from right-to-left (least significant to most significant) where other authors run through the bits left-to-right. My code shown above runs through the bits left-to-right, but I have also tried running through the bits right-to-left, with the same result. Which order is correct?
2) It looks odd to me that Nicely only updates the u and v variables for a 1-bit. Is this correct? I expected to update all four of the Lucas-chain variables each time through the loop, since the indexes of the chain increase at each step.
3) What have I done wrong?
1) I'm not expert with C/GMP, but it seems to me that Nicely runs through the bits of (n+1)/2 from right-to-left (least significant to most significant) where other authors run through the bits left-to-right. My code shown above runs through the bits left-to-right, but I have also tried running through the bits right-to-left, with the same result. Which order is correct?
Indeed, Nicely goes from least significant to most significant bit. He computes U(2^k) and V(2^k) (and Q^(2^k); all modulo N of course), in the mpzU2m and mpzV2m variables, and has U((N+1) % 2^k) resp V((N+1) % 2^k) stored in mpzU and mpzV. When a 1-bit is encountered, the remainder (N+1) % 2^k changes, and mpzU and mpzV are updated accordingly.
The other way is to compute U(p), U(p+1), V(p) and (optionally) V(p+1) for a prefix p of N+1 and combine those to compute U(2*p+1) and either U(2*p) or U(2*p+2) [ditto for V] depending on whether the next bit after the prefix p is 0 or 1.
Both methods are correct, like you can compute the power x^N going from left to right, having x^p and x^(p+1) as state, or from right to left having x^(2^k) and x^(N % 2^k) as state [and, computing U(n) and U(n+1) is basically computing ζ^n where ζ = (1 + sqrt(D))/2].
I - and others, apparently - find the left-to-right order simpler. I haven't done or read an analysis, it might be that right-to-left is computationally less expensive on average and Nicely chose right-to-left because of that.
2) It looks odd to me that Nicely only updates the u and v variables for a 1-bit. Is this correct? I expected to update all four of the Lucas-chain variables each time through the loop, since the indexes of the chain increase at each step.
Yes, that is correct, because the remainder (N+1) % 2^k == (N+1) % 2^(k-1) if the 2^k bit is 0.
3) What have I done wrong?
A small typo first:
if 1 < gcd(d, n) > n:
should be
if 1 < gcd(d, n) < n:
of course.
More substantially, you use the updates for Nicely's traversal order (right-to-left), but traverse in the other direction. That of course produces wrong results.
Further, when updating v
if bits[h] == 1:
    u = u2 * v + u * v2
    u = u if u % 2 == 0 else u + n
    u = (u / 2) % n
    v = (v2 * v) + (u2 * u * d)
    v = v if v % 2 == 0 else v + n
    v = (v / 2) % n
you use the new value of u, but you ought to use the old value.
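For reference, the standard Lucas-sequence doubling and addition identities behind those updates (written in the notation used above, with D = P^2 - 4Q, the d of the code) are:
U(2k) = U(k)·V(k)
V(2k) = V(k)^2 - 2·Q^k
U(j+k) = (U(j)·V(k) + U(k)·V(j)) / 2
V(j+k) = (V(j)·V(k) + D·U(j)·U(k)) / 2
In the loop, k is the power of two being doubled (the u2/v2 pair) and j is the partial exponent accumulated in u and v; the V(j+k) formula needs U(j), i.e. the value of u from before it was overwritten. With those points fixed: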
def isLucasPrime(n):
    dAbs, sign, d = 5, 1, 5
    while 1:
        if 1 < gcd(d, n) < n:
            return False
        if jacobi(d, n) == -1:
            break
        dAbs, sign = dAbs + 2, sign * -1
        d = dAbs * sign
    p, q = 1, (1 - d) // 4
    u, v, u2, v2, q, q2 = 0, 2, 1, p, q, 2 * q
    bits = []
    t = (n + 1) // 2
    while t > 0:
        bits.append(t % 2)
        t = t // 2
    h = 0
    while h < len(bits):
        u2 = (u2 * v2) % n
        v2 = (v2 * v2 - q2) % n
        if bits[h] == 1:
            uold = u
            u = u2 * v + u * v2
            u = u if u % 2 == 0 else u + n
            u = (u // 2) % n
            v = (v2 * v) + (u2 * uold * d)
            v = v if v % 2 == 0 else v + n
            v = (v // 2) % n
        if h < len(bits) - 1:
            q = (q * q) % n
            q2 = q + q
        h = h + 1
    return u == 0
works (no guarantees, but I think it is correct, and have done some tests, all of which it passed).
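In particular, it should now return True for the primes 83 and 89 and for the Lucas pseudoprime 323 mentioned in the question, and still return False for the composite 111.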

Find n-th set of a powerset

I'm trying to find the n-th set in a powerset. By n-th I mean that the powerset is generated in the following order -- first by size, and then lexicographically -- and so the indices of the sets in the powerset of [a, b, c] are:
0 - []
1 - [a]
2 - [b]
3 - [c]
4 - [a, b]
5 - [a, c]
6 - [b, c]
7 - [a, b, c]
While looking for a solution, all I could find was an algorithm to return the n-th permutation of a list of elements -- for example, here.
Context:
I'm trying to retrieve the entire powerset of a vector V of elements, but I need to do this with one set at a time.
Requirements:
I can only maintain two vectors at the same time, the first one with the original items in the list, and the second one with the n-th set from the powerset of V -- that's why I want an n-th set function here;
I need this to be done not in linear time on the space of solutions -- which means it cannot list all the sets and then pick the n-th one;
my initial idea is to use bits to represent the positions, and get a valid mapping for what I need -- as the "incomplete" solution I posted.
I don't have a closed form for the function, but I do have a bit-hacking non-looping next_combination function, which you're welcome to, if it helps. It assumes that you can fit the bit mask into some integer type, which is probably not an unreasonable assumption given that there are 2^64 possibilities for the 64-element set.
As the comment says, I find this definition of "lexicographical ordering" a bit odd, since I'd say lexicographical ordering would be: [], [a], [ab], [abc], [ac], [b], [bc], [c]. But I've had to do the "first by size, then lexicographical" enumeration before.
// Generate bitmaps representing all subsets of a set of k elements,
// in order first by (ascending) subset size, and then lexicographically.
// The elements correspond to the bits in increasing magnitude (so the
// first element in lexicographic order corresponds to the 2^0 bit.)
//
// This function generates and returns the next bit-pattern, in circular order
// (so that if the iteration is finished, it returns 0).
//
template<typename UnsignedInteger>
UnsignedInteger next_combination(UnsignedInteger comb, UnsignedInteger mask) {
  UnsignedInteger last_one = comb & -comb;
  UnsignedInteger last_zero = (comb + last_one) & ~comb & mask;
  if (last_zero) return comb + last_one + (last_zero / (last_one * 2)) - 1;
  else if (last_one > 1) return mask / (last_one / 2);
  else return ~comb & 1;
}
Line 5 is doing the bit-hacking equivalent of the (extended) regular expression replacement, which finds the last 01 in the string, flips it to 10 and shifts all the following 1s all the way to the right.
s/01(1*)(0*)$/10\2\1/
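(For example, 0101100 becomes 0110001.)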
Line 6 does this one (only if the previous one failed) to add one more 1 and shift the 1s all the way to the right:
s/(1*)0(0*)/\21\1/
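(For example, with a four-element set, 1100 becomes 0111.)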
I don't know if that explanation helps or hinders :)
Here's a quick and dirty driver (the command-line argument is the size of the set, default 5, maximum the number of bits in an unsigned long):
#include <cstdlib>
#include <iostream>

template<typename UnsignedInteger>
std::ostream& show(std::ostream& out, UnsignedInteger comb) {
  out << '[';
  char a = 'a';
  for (UnsignedInteger i = 1; comb; i *= 2, ++a) {
    if (i & comb) {
      out << a;
      comb -= i;
    }
  }
  return out << ']';
}

int main(int argc, char** argv) {
  unsigned int n = 5;
  if (argc > 1) n = std::atoi(argv[1]);
  unsigned long mask = (1UL << n) - 1;
  unsigned long comb = 0;
  do {
    show(std::cout, comb) << std::endl;
    comb = next_combination(comb, mask);
  } while (comb);
  return 0;
}
It's hard to believe that this function might be useful for a set of more than 64 elements, given the size of the enumeration, but it might be useful to enumerate some limited part, such as all subsets of three elements. In this case, the bit-hackery is only really useful if the modification fits in a single word. Fortunately, that's easy to test; you simply need to do the computation as above on the last word in the bitset, up to the test for last_zero being zero. (In this case, you don't need to bitand mask, and indeed you might want to choose a different way of specifying the set size.) If last_zero turns out to be zero (which will actually be pretty rare), then you need to do the transformation in some other way, but the principle is the same: find the first 0 which precedes a 1 (watch out for the case where the 0 is at the end of a word and the 1 at the beginning of the next one); change the 01 to 10, figure out how many 1s you need to move, and move them to the end.
Considering a list of elements L = [a, b, c], the powerset of L is given by:
P(L) = {
[],
[a], [b], [c],
[a, b], [a, c], [b, c],
[a, b, c]
}
Considering each position as a bit, you'd have the mappings:
id | positions - integer | desired set
0 | [0 0 0] - 0 | []
1 | [1 0 0] - 4 | [a]
2 | [0 1 0] - 2 | [b]
3 | [0 0 1] - 1 | [c]
4 | [1 1 0] - 6 | [a, b]
5 | [1 0 1] - 5 | [a, c]
6 | [0 1 1] - 3 | [b, c]
7 | [1 1 1] - 7 | [a, b, c]
As you see, the id is not directly mapped to the integers. A proper mapping needs to be applied, so that you have:
id | positions - integer | mapped - integer
0 | [0 0 0] - 0 | [0 0 0] - 0
1 | [1 0 0] - 4 | [0 0 1] - 1
2 | [0 1 0] - 2 | [0 1 0] - 2
3 | [0 0 1] - 1 | [0 1 1] - 3
4 | [1 1 0] - 6 | [1 0 0] - 4
5 | [1 0 1] - 5 | [1 0 1] - 5
6 | [0 1 1] - 3 | [1 1 0] - 6
7 | [1 1 1] - 7 | [1 1 1] - 7
As an attempt at solving this, I came up with using a binary tree to do the mapping -- I'm posting it so that someone may see a solution from it:
                                        #
                    ____________________|____________________
                  a /                                        \
              _____|_____                               _____|_____
          b   /         \                               /         \
          __|__        __|__                       __|__            __|__
       c /     \       /    \                      /    \          /     \
       [ ]     [c]   [b]   [b, c]                [a]   [a, c]   [a, b]  [a, b, c]
index:  0       3     2      6                    1      5         4        7
Suppose your set has size N.
So, there are (N choose k) sets of size k. You can find the right k (i.e. the size of the nth set) very quickly just by subtracting off (N choose k) from n until n is about to go negative. This reduces your problem to finding the nth k-subset of an N-set.
The first (N-1 choose k-1) k-subsets of your N-set will contain its least element. So, if n is less than (N-1 choose k-1), pick the first element and recurse on the rest of the set. Otherwise, you have one of the (N-1 choose k) other sets; throw away the first element, subtract (N-1 choose k-1) from n, and recurse.
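For example, with the question's [a, b, c] and n = 5: subtracting (3 choose 0) = 1 and then (3 choose 1) = 3 leaves n = 1 with k = 2. Now (2 choose 1) = 2 > 1, so a is in the set and we recurse on {b, c} with n = 1, k = 1; there (1 choose 0) = 1 is not greater than 1, so b is skipped and n becomes 0; finally the 0th 1-subset of {c} is {c}. The result is {a, c}, which is indeed index 5 in the question's listing.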
Code:
#include <stdio.h>

int ch[88][88];                 // memo table for binomial coefficients

int choose(int n, int k) {
    if (n < 0 || k < 0 || k > n) return 0;
    if (!k || n == k) return 1;
    if (ch[n][k]) return ch[n][k];
    return ch[n][k] = choose(n - 1, k - 1) + choose(n - 1, k);
}

// Bitmask of the n-th k-subset of an N-element set (bit i = element i).
int nthkset(int N, int n, int k) {
    if (!n) return (1 << k) - 1;
    if (choose(N - 1, k - 1) > n) return 1 | (nthkset(N - 1, n, k - 1) << 1);
    return nthkset(N - 1, n - choose(N - 1, k - 1), k) << 1;
}

int nthset(int N, int n) {
    for (int k = 0; k <= N; k++)
        if (choose(N, k) > n) return nthkset(N, n, k);
        else n -= choose(N, k);
    return -1; // not enough subsets of [N].
}

int main() {
    int N, n;
    scanf("%i %i", &N, &n);
    int a = nthset(N, n);
    for (int i = 0; i < N; i++) printf("%i", !!(a & 1 << i));
    printf("\n");
}