Say you have a survey dataset with 12 variables that stem from the same question, each reporting one response option (multiple responses are possible for this question). Each variable (i.e. response option) is numeric with yes/no values. I am trying to combine all of these variables into one, so that I can cross-tabulate it with other variables such as village name and produce frequencies of each individual response, and graphs, without extensive formatting. Does anyone have a solution: either a way to combine the variables or a multivariable cross-tab that doesn't require much time spent on formatting?
Example data:
A B C D E F
1 0 1 0 1 0
0 0 1 0 1 1
1 1 1 0 0 0
There are many tricks and techniques here.
Tricks include using egen's concat() function as well as the group() function mentioned by @Dimitriy V. Masterov.
Techniques include special tabulation or listing commands, including tabm and groups on SSC and mrtab at the Stata Journal; on the last, see this article.
See also this article in the Stata Journal for a general discussion of handling multiple responses.
Does egen pattern = group(A-F), label do what you desire? If not, perhaps you can clarify what the desired transformation would look like for the 3 respondents you have shown.
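As a rough illustration of what combining the indicators gives you, here is a minimal sketch in plain Python (not Stata) on the three respondents shown above. It joins each respondent's 0/1 answers into one pattern string, loosely analogous to what concat() or group() would produce, and also counts the "yes" answers per option. The variable names are made up for the illustration.

```python
from collections import Counter

# The three respondents from the example data (columns A-F).
options = ["A", "B", "C", "D", "E", "F"]
rows = [
    [1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
]

# One combined variable per respondent: the 0/1 answers joined into a pattern string.
patterns = ["".join(str(v) for v in row) for row in rows]

# Frequency of each response pattern.
pattern_counts = Counter(patterns)

# Frequency of each individual option (number of "yes" answers per column).
option_counts = {name: sum(row[i] for row in rows) for i, name in enumerate(options)}

print(patterns)
print(option_counts)
```

The pattern variable is convenient for cross-tabs against something like village name, while the per-option counts are what you would usually graph.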
Consider the following example. I begin with an str6 'name' variable, and a year for two entities observed every other year.
clear
input str6 nameStr year
"A" 2002
"A" 2004
"A" 2006
"B" 2002
"B" 2004
"B" 2006
end
Then I use tsfill to balance the panel:
egen id = group(nameStr)
xtset id year
tsfill
The dataset is now:
input str6 nameStr year id
"A" 2002 1
"" 2003 1
"A" 2004 1
"" 2005 1
"A" 2006 1
"B" 2002 2
"" 2003 2
"B" 2004 2
"" 2005 2
"B" 2006 2
end
Now I could use something like xfill to fill in the missing string identifier. Or, based on the related Stata FAQ and the documentation for time-series varlists (help tsvarlist), I expect something like the following to fill in the values of nameStr:
sort id year // not required because the data are still sorted from xtset and tsfill
replace nameStr = nameStr[_n-1] if mi(nameStr) & id[_n-1] == id
and it does.
However, I also expect the following to produce the same behavior, and it does not.
replace nameStr = l.nameStr if mi(nameStr)
Instead Stata returns:
type mismatch
r(109);
While there are several ways to work around this (I've listed two), I'm interested in understanding why this happens. Most similar discussions address cases where two variables of differing types are involved. That isn't the case here, since only one variable is involved.
Stata does not allow time series operators to be applied to string variables. If you think about it you will see that previous (lagging) and following (leading) string values make sense but differences don't, at least not so much. The only simple interpretation of differences would be binary, namely strings at two times are the same or different.
So, Stata is not implying that you can't work with other string values for any panel; it just doesn't support calculations on strings using time series operators.
In addition to the syntax you mention, stripolate from SSC supports string interpolation: see this Statalist thread.
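To make the subscript-based workaround from the question concrete, here is a minimal sketch in plain Python (not Stata) of the same logic: within each id, a missing name takes the value from the previous row. The function name fill_names is hypothetical, and the empty string stands in for Stata's missing string.

```python
# Rows after tsfill, as (nameStr, year, id); "" marks a missing string.
rows = [
    ("A", 2002, 1), ("", 2003, 1), ("A", 2004, 1), ("", 2005, 1), ("A", 2006, 1),
    ("B", 2002, 2), ("", 2003, 2), ("B", 2004, 2), ("", 2005, 2), ("B", 2006, 2),
]

def fill_names(rows):
    """Carry the previous row's name forward when the name is missing and the id matches."""
    filled = []
    # Sort by id, then year, mirroring "sort id year".
    for name, year, pid in sorted(rows, key=lambda r: (r[2], r[1])):
        if name == "" and filled and filled[-1][2] == pid:
            name = filled[-1][0]
        filled.append((name, year, pid))
    return filled

filled = fill_names(rows)
for row in filled:
    print(row)
```

Because each row looks at the already-filled previous row, runs of several consecutive gaps would also be filled, which matches how replace works through observations in order.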
Eigen 3.3.7 documentation for SparseMatrix
http://eigen.tuxfamily.org/dox/group__TutorialSparse.html
seems to contain an error in Sparse matrix format section:
This storage scheme is better explained on an example. The following matrix
0 3 0 0 0
22 0 0 0 17
7 5 0 1 0
0 0 0 0 0
0 0 14 0 8
and one of its possible sparse, column major representation:
Values: 22 7 _ 3 5 14 _ _ 1 _ 17 8
InnerIndices: 1 2 _ 0 2 4 _ _ 2 _ 1 4
OuterStarts: 0 3 5 8 10 12
InnerNNZs: 2 2 1 1 2
If 14 is moved from the third column to the second (i.e. its indices changed from [4,2] to [4,1]), then the first two arrays, Values and InnerIndices, make sense. OuterStarts doesn't seem to be correct for either 14 position, while InnerNNZs makes sense for 14 being in [4,2] element of the matrix, but is inconsistent with Values array.
Is this example incorrect or am I missing something?
In general, what is the best way of figuring out Eigen, besides examining the source code? I normally look at tests and examples, but building most benchmarks and tests for sparse matrices results in compilation errors (were these tests written for an older version of Eigen and not updated for version 3?)...
The key is that the user is supposed to reserve at least as many entries per column as they need. In this example the user only reserved 2 entries for the second column, so if you were to try to add another entry to that column, it would probably require an expensive reallocation, or at least a complicated shift to "steal" an unused entry from another column. (I have no idea how this is implemented.)
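One way to check the arrays is to decode them by hand. The sketch below is plain Python, not Eigen code. It assumes the layout the tutorial describes: column j owns the slots from OuterStarts[j] up to OuterStarts[j+1], and only the first InnerNNZs[j] of those slots are in use. None stands for the "_" placeholders.

```python
# Arrays copied from the question; None stands for the unused "_" slots.
values        = [22, 7, None, 3, 5, 14, None, None, 1, None, 17, 8]
inner_indices = [1, 2, None, 0, 2, 4, None, None, 2, None, 1, 4]
outer_starts  = [0, 3, 5, 8, 10, 12]
inner_nnzs    = [2, 2, 1, 1, 2]

n_rows, n_cols = 5, 5
dense = [[0] * n_cols for _ in range(n_rows)]

for col in range(n_cols):
    start = outer_starts[col]
    # Only the first inner_nnzs[col] slots of this column's block are in use.
    for k in range(start, start + inner_nnzs[col]):
        dense[inner_indices[k]][col] = values[k]

# Slots reserved per column, used or not.
reserved = [outer_starts[j + 1] - outer_starts[j] for j in range(n_cols)]

for row in dense:
    print(row)
print(reserved)
```

Read this way, the arrays reproduce the matrix from the question with 14 in the third column, and the second column has exactly two reserved slots, both occupied by 3 and 5.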
Upon a cursory look at the documentation you linked to, I didn't see anything about moving entries like you're trying to do. I'm not sure that Eigen supports such an operation. (Correct me if I'm wrong.) I'm also not sure why you would want to do that.
Your final question is probably too broad. I'm not an expert at Eigen, but it seems like a mature, powerful, and well-documented library. If you have any specific problems compiling examples, you should post them here or on an Eigen specific forum. Many people at scicomp.SE are well-versed in Eigen and are accommodating.
I have the following data and I would like to apply the log() function:
v1
2
3
4
-1
5
Expected output:
v1
2 0.30 ~ log(2)
3 0.48 ~ log(3)
4 0.60 ~ log(4)
-1 .
5 0.70 ~ log(5)
This is just a simplified version of the problem. There are 35000 observations in my dataset and I could not find any simple rules like drop if v1 <= 0 to solve this problem.
Without screening my data first, one method I have in mind is to use a for loop and run the log() function over the observations. However, I couldn't find any websites explaining how to do that.
Stata returns missing if asked to take the logarithm of zero or negative values. So
generate log_x = log(x)
and
generate log_x = log(x) if x > 0
will have precisely the same result: missing values in the observations with problematic values.
The bigger question here is statistical. Why do you want to take logarithms of such a variable anyway? If your idea is to transform a variable, then other transformations are available. If the variable is a response or outcome variable, then a generalized linear model with logarithmic link will work even if there are some zero or negative values; the idea is just that the mean function should remain positive.
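For readers who think in terms of explicit loops, here is a minimal sketch in plain Python (not Stata) of the behaviour just described, applied to the five values from the question. The helper log_or_missing is hypothetical, and None stands in for Stata's missing value. Note that Stata's log() is the natural logarithm; the 0.30 and 0.48 shown in the question are base-10 values, for which Stata has log10().

```python
import math

def log_or_missing(x):
    """Natural log of x, or None (standing in for Stata's missing) when x <= 0."""
    return math.log(x) if x > 0 else None

v1 = [2, 3, 4, -1, 5]

# One pass over the observations, as generate does implicitly.
log_v1 = [log_or_missing(x) for x in v1]

for x, lx in zip(v1, log_v1):
    print(x, lx)
```

Python's math.log raises an error for non-positive input, which is why the sketch needs an explicit check; in Stata no such check is needed.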
There have been many, many threads raising these issues on Cross Validated and Statalist.
I can't imagine why you think a loop is either needed or helpful here. With generate statements of the kind above, Stata automatically loops over observations.
I'm trying to generate in Stata the mean per year (e.g. 2002-2012) for each industry (by 2 digit SIC codes, so c. 50 different industries)
I found how to do it for one year with:
by sic_2digit, sort: egen test = mean(oancf_at_rsd10) if fyear == 2004
Is there a more efficient way to do this instead of repeating the command 10 times by hand and then adding the values together?
You can specify more than one variable with by:.
by sic_2digit fyear, sort: egen test = mean(oancf_at_rsd10)
Check out the help for by:, which gives the syntax and an example, and also that for collapse.
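Conceptually, grouping on two variables means one mean per (industry, year) cell. Here is a minimal sketch of that idea in plain Python (not Stata); the records are made-up values, used only to show the grouping.

```python
from collections import defaultdict

# Hypothetical records: (sic_2digit, fyear, oancf_at_rsd10); values are made up.
records = [
    (10, 2004, 0.5), (10, 2004, 1.5), (10, 2005, 2.0),
    (20, 2004, 4.0), (20, 2005, 1.0), (20, 2005, 3.0),
]

# Running [sum, count] for every (industry, year) cell.
sums = defaultdict(lambda: [0.0, 0])
for sic, year, value in records:
    cell = sums[(sic, year)]
    cell[0] += value
    cell[1] += 1

means = {key: total / count for key, (total, count) in sums.items()}
print(means)
```

This produces one value per cell, closer to what collapse gives; egen instead attaches the cell mean to every observation in the cell.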
I would like to know how I can construct a regex to tell whether a number in base 2 (binary) is a multiple of 3. I read the thread "Check if a number is divisible by 3", but they don't do it with a regex, and the graph someone drew is wrong (because it doesn't accept even numbers). I have tried ((1+)(0*)(1+))(0), but it doesn't work for some values. Hope you can help me.
UPDATE:
OK, thanks all for your help. Now I know how to draw the NFA; here are the graph and the regular expression:
In the graph, the states are the number in base 10 mod 3.
For example, to reach state 1 you must have read 1; you can then append 1 or 0. If you append 1, you have 11 (3 in base 10), and 3 mod 3 is 0, so you draw the arc to state 0.
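The automaton described here can be simulated directly. The following is a minimal Python sketch, assuming only what the description says: the state is the value read so far, mod 3, and appending a bit doubles the value and adds the bit. The function name divisible_by_3 is made up for the illustration.

```python
def divisible_by_3(bits):
    """Run the three-state automaton: the state is the value read so far, mod 3."""
    state = 0
    for bit in bits:
        # Appending a bit doubles the value and adds the bit.
        state = (2 * state + int(bit)) % 3
    return state == 0

for s in ["0", "1", "11", "110", "1001", "111"]:
    print(s, int(s, 2), divisible_by_3(s))
```

Any regex for this language should agree with this function on every binary string, which makes it a handy reference when testing candidates.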
((0*)((11)*)((1((00) *)1) *)(101 *(0|((00) *1 *) *0)1) *(1(000)+1*01)*) *
And the other regex works, but this is shorter.
Thanks a lot :)
I know this is an old question, but an efficient answer is yet to be given and this question pops up first for "binary divisible by 3 regex" on Google.
Based on the DFA proposed by the author, a ridiculously short regex can be generated by simplifying the routes a binary string can take through the DFA.
The simplest one, using only state A, is:
0*
Including state B:
0*(11)*0*
Including state C:
0*(1(01*0)*1)*0*
Then account for the fact that after returning to state A, the whole process can start again:
0*((1(01*0)*1)*0*)*
Using some basic regex rules, this simplifies to
(1(01*0)*1|0)*
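A quick way to gain confidence in the simplified expression is to test it exhaustively over a range of numbers. This is a minimal Python sketch using the standard re module; the upper bound of 1024 is an arbitrary choice for the check.

```python
import re

# The simplified expression, matched against the whole string.
PATTERN = re.compile(r"(1(01*0)*1|0)*")

def matches(n):
    """True if the binary representation of n is fully matched by the regex."""
    return PATTERN.fullmatch(format(n, "b")) is not None

# Numbers where the regex and the arithmetic test disagree.
mismatches = [n for n in range(1024) if matches(n) != (n % 3 == 0)]
print(mismatches)
```

An empty list means the regex agrees with n % 3 == 0 for every number in the range. Note the use of fullmatch: without anchoring, the pattern would trivially match the empty prefix of any string.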
Have a nice day.
If I may plug my solution for this code golf question! It's a piece of JavaScript that generates regexes (probably inefficiently, but does the job) for divisibility for each base.
This is what it generates for divisibility by 3 in base 2:
/^((((0+)?1)(10*1)*0)(0(10*1)*0|1)*(0(10*1)*(1(0+)?))|(((0+)?1)(10*1)*(1(0+)?)|(0(0+)?)))$/
Edit: comparing to Asmor's, probably very inefficient :)
Edit 2: Also, this is a duplicate of this question.
For anyone who is learning and searching for how to do this:
see this video:
https://www.youtube.com/watch?v=SmT1DXLl3f4&t=138s
Write the state equations and solve them with Arden's theorem.
The way I did it is shown in the image; the result is the same as the one pointed out by @Kert Ojasoo. I hope I did it correctly, because I spent 2 days solving it...
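n + 2n = 3n, so two adjacent bits set to 1 always contribute a multiple of 3. A run with an odd number of adjacent 1s, on its own, is not a multiple of 3.
So I'd propose this regex, which matches only multiples of 3 but misses some of them (for example 1001, which is 9):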
(0*(11)?)+