Stata: looping over observations - stata

My data set looks like this
x1
1
0
0
1
0
0
1
1
In this data set the values following a 1 belong to the same group. For example, the first two zeros belong to group 1, the second two zeros belong to group 2, and so on. I would like to get a final output similar to this. Note that the gap between two 1s is arbitrary:
x1 x2
1 1
0 1
0 1
1 2
0 2
0 2
1 3
1 4
I think I need to write a loop that goes over the observations. But I cannot figure out the logical statements that will accomplish this.

Either
gen x2 = sum(x1)
or
gen x2 = sum(x1 == 1)
is sufficient. There is a loop over observations tacit as usual there, but you don't need an explicit loop.
In detail, sum() here is a cumulative or running sum. In your case, the first solution is simple and adequate. The reason for mentioning the second solution is because it's more general: we can tag the first observation in each block or spell with 1 and then create a running sum to form blocks of 1s, 2s, and so forth.
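For readers outside Stata, the same running-sum idea can be sketched in Python. This is only an illustration under the assumption that the data are held in a plain list; the function name `group_ids` is made up for this example.

```python
from itertools import accumulate

def group_ids(flags):
    """Running count of the 1s seen so far: each 1 opens a new group."""
    # Mirrors sum(x1 == 1): tag each block start with 1, then take a cumulative sum.
    return list(accumulate(1 if f == 1 else 0 for f in flags))

x1 = [1, 0, 0, 1, 0, 0, 1, 1]
x2 = group_ids(x1)
print(x2)  # [1, 1, 1, 2, 2, 2, 3, 4]
```

As in the Stata version, the loop over observations is implicit in the cumulative sum.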


I'm not getting a row number from grepl - doing this in R

I'm trying to determine which is the first row with a cell that contains only digits, "," and "$" in a data frame:
Assessment Area Offices Offices Deposits as of 6/30/16 Deposits as of 6/30/16 Assessment Area Reviews Assessment Area Reviews Assessment Area Reviews
2 Assessment Area # % $ (000s) % Full Scope Limited Scope TOTAL
3 Ohio County 1 50.0% $24,451 52.7% 1 0 1
4 Hart County 1 50.0% $21,931 47.3% 1 0 1
5 OVERALL 2 100% $46,382 100.0% 2 0 2
This code does find the row:
grepl("[0-9]",table_1)
But the code returns:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
I only want to know the row.
Your data could use some cleaning up, but it's not entirely necessary in order to solve your problem. You want to find the first row that contains a dollar sign and an appropriate value. My solution does the following:
1. Iterates over rows.
2. In each row, asks if there's at least one cell that starts with a dollar sign followed by a specific combination of digits and commas (to be explained in greater detail below).
3. Stops when we reach that row.
4. Prints the ID of the row.
The solution involves a for loop, an if statement, and a regular expression.
First of all, here's my attempt to reproduce a data frame. Again, the details don't matter too much. I just wanted to make the "money row" the second row, which is roughly how it appears in your example.
df<- data.frame(
Assessment_Area = c(2,3,4,5),
Offices = c("#",1,1,2),
Dep_Percent_63016 = c("#","50.0%","50.0%","100.0%"),
Dep_Total_63016 = c("$ (000s)", "$24,451", "$21,931","$46,382"),
Assessment_Area_Rev = rep("Blah",4)
)
df
Assessment_Area Offices Dep_Percent_63016 Dep_Total_63016
1 2 # # $ (000s)
2 3 1 50.0% $24,451
3 4 1 50.0% $21,931
4 5 2 100.0% $46,382
Assessment_Area_Rev
1 Blah
2 Blah
3 Blah
4 Blah
Here's the for loop:
library(stringr)
for (i in 1:nrow(df)) {
if (any(str_detect(df[i,],"^\\$\\d{1,3}(,\\d{3})*"))) {
print(i)
break
}
}
The key is the line with the if statement. any returns TRUE if any element of a logical vector is true. In this case the vector is created by applying stringr::str_detect to a row of the df which is indexed as df[i,]. str_detect returns a logical vector - you supply a character vector and an expression to match in the elements of that vector. It returns TRUE or FALSE for each element in the vector which in this case is each cell in a row. So the crux of this is the regular expression:
"^\\$\\d{1,3}(,\\d{3})*"
This is the pattern we're searching for (the money cell) in each row. ^\\$ indicates we want the string to start with the dollar sign. The two backslashes escape the $ character because it's a metacharacter in regular expressions (end anchor). We then want 1-3 digits. This will match any dollar value below $1,000. Then we specify that the expression can contain any number (including 0) of , followed by three more digits. This will cover any dollar value.
Finally, if we encounter a row which contains one of these expressions, the for loop will print the number of the row and end the loop so it will return the lowest row number containing one desired cell. In this example the output is 2. If no appropriate rows are encountered, nothing will happen.
There may be more you want to do once you have that information, but if all you need is the lowest row number containing your money expression then this is sufficient.
A less elegant regular expression which only looks for dollar signs, commas, and digits would be:
"[0-9$,]+"
which is what you asked for although I don't think that's what you really want because that will match something like ,56$,,$$78
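The same search can be sketched in Python with the standard `re` module, for comparison. This is an illustration, not a translation of the R answer: the function name `first_money_row` and the list-of-lists layout are assumptions made for this example, and the rows reuse the reproduced data frame from above.

```python
import re

# Strict pattern from the answer: "$", 1-3 digits, then any number of ",ddd" groups.
MONEY = re.compile(r"^\$\d{1,3}(,\d{3})*")
# Loose pattern: any run of digits, dollar signs and commas.
LOOSE = re.compile(r"[0-9$,]+")

def first_money_row(rows):
    """Return the 1-based index of the first row with a money-like cell, or None."""
    for i, row in enumerate(rows, start=1):
        if any(MONEY.search(str(cell)) for cell in row):
            return i
    return None

rows = [
    [2, "#", "#", "$ (000s)", "Blah"],
    [3, "1", "50.0%", "$24,451", "Blah"],
    [4, "1", "50.0%", "$21,931", "Blah"],
    [5, "2", "100.0%", "$46,382", "Blah"],
]
print(first_money_row(rows))             # 2
print(bool(LOOSE.search(",56$,,$$78")))  # True: the loose pattern accepts junk
```

Returning the index instead of printing it makes the "no matching row" case explicit as `None`.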

list input data into sas from txt file & separate into columns

I am having trouble with a .txt data set that is separated by spaces. It is in list input format and I want to separate it into three columns.
I've tried using row pointers, column pointers, and other methods within the input statement. I've been searching online and I think an array would work in this situation but I'm not sure how to do that (I'm new to SAS).
1 1 2 2 1 1 3 1 2 4 0 3 5 1 3 6 1 2 7 0 3 8 1 3 9 0 1 10 0 2 11 1 1
I want the data to be in columns:
ID Group Time.
1 . 1 . 2
2 . 1 . 1
3 . 1 . 2
The id numbers go up to 90
Use the double trailing at sign (@@), which holds the input line across iterations of the DATA step:
filename mydata "whatever.txt";
data want;
infile mydata;
input id group time @@;
run;
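To make the effect of holding the line concrete, here is a rough Python sketch of the same idea: keep consuming values from one line, three at a time, instead of moving to a new line after each record. The function name `read_triples` is hypothetical and the sample line is the one from the question.

```python
def read_triples(text):
    """Split whitespace-separated values into (id, group, time) records."""
    tokens = text.split()
    usable = len(tokens) - len(tokens) % 3  # ignore an incomplete trailing record
    return [
        tuple(int(t) for t in tokens[i:i + 3])
        for i in range(0, usable, 3)
    ]

line = "1 1 2 2 1 1 3 1 2 4 0 3 5 1 3 6 1 2 7 0 3 8 1 3 9 0 1 10 0 2 11 1 1"
records = read_triples(line)
print(records[:3])  # [(1, 1, 2), (2, 1, 1), (3, 1, 2)]
```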

Stata - How to manipulate data and create observations

I have a tricky question about how to manipulate some data. Suppose I have the following structure of data:
_n group attr value
1 1 height 3
2 1 weight 12
3 1 length 9
4 2 weight 15
5 3 height 4
I want to have all groups have height, weight, and length. If there is initially not a value, I want to have a missing value be put in. Thus the end result would look like this:
_n group attr value
1 1 height 3
2 1 weight 12
3 1 length 9
4 2 height .
5 2 weight 15
6 2 length .
7 3 height 4
8 3 weight .
9 3 length .
I don't know how to do this, but perhaps it would involve reshape?
Another thing I thought about would be to use egen to sum by group. We could figure out that group 1 has 3 members, group 2 has 1 member, and group 3 has 1 member. Then we could perform functions on groups 2 and 3 to get them up to par. But this could get complicated.
This is (1) very easy and (2) usually the wrong way to go.
(1) fillin is dedicated to this task. Your example leaves ambiguous whether attr is a numeric variable with value labels or a string variable, but this works either way:
. clear
. input group str6 attr value
group attr value
1. 1 "height" 3
2. 1 "weight" 12
3. 1 "length" 9
4. 2 "weight" 15
5. 3 "height" 4
6. end
. fillin group attr
. list, sepby(group)
+----------------------------------+
| group attr value _fillin |
|----------------------------------|
1. | 1 height 3 0 |
2. | 1 length 9 0 |
3. | 1 weight 12 0 |
|----------------------------------|
4. | 2 height . 1 |
5. | 2 length . 1 |
6. | 2 weight 15 0 |
|----------------------------------|
7. | 3 height 4 0 |
8. | 3 length . 1 |
9. | 3 weight . 1 |
+----------------------------------+
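What fillin does here can be sketched in Python as a cross product of the observed groups and attributes. This is only an illustration of the idea, not of Stata internals; the function name `fillin` is reused for readability, and `None` stands in for Stata's missing value.

```python
from itertools import product

def fillin(rows):
    """Return every group x attr combination as (group, attr, value, filled_in)."""
    values = {(g, a): v for g, a, v in rows}
    groups = sorted({g for g, _, _ in rows})
    attrs = sorted({a for _, a, _ in rows})
    return [
        (g, a, values.get((g, a)), int((g, a) not in values))
        for g, a in product(groups, attrs)
    ]

data = [
    (1, "height", 3), (1, "weight", 12), (1, "length", 9),
    (2, "weight", 15),
    (3, "height", 4),
]
filled = fillin(data)
for row in filled:
    print(row)
```

The last element of each tuple plays the role of the _fillin indicator: 1 for combinations that were added, 0 for those already present.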
(2) However, what use is this structure? How are you going to relate height, length and weight? Usually, you would be better off with this (assuming same data as sandbox):
reshape wide value, i(group) j(attr) string
renpfix value
list, sepby(group)
Here we end with variables group, height, length and weight. If you still want the long structure, that is now achievable with reshape long.
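The wide layout can be illustrated in Python as one record per group with one field per attribute. This is a sketch under the assumption that the long data are a list of (group, attr, value) tuples; `reshape_wide` is a name invented for this example and `None` marks a missing value.

```python
def reshape_wide(rows):
    """Turn (group, attr, value) rows into {group: {attr: value}}."""
    attrs = sorted({a for _, a, _ in rows})
    wide = {}
    for g, a, v in rows:
        # Start each group with every attribute missing, then fill in what exists.
        wide.setdefault(g, dict.fromkeys(attrs))[a] = v
    return wide

data = [
    (1, "height", 3), (1, "weight", 12), (1, "length", 9),
    (2, "weight", 15),
    (3, "height", 4),
]
wide = reshape_wide(data)
for group, record in sorted(wide.items()):
    print(group, record)
```

With this layout, relating height, length and weight within a group is a matter of reading fields from one record.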
Notes:
For more on fillin, see the help, the manual entry and this expository note.
renpfix in this case zaps prefixes of variable names. The tacit third argument is an empty string, so the prefix value is replaced with an empty string, namely removed. In recent versions of Stata (12 up), that is now easy with rename.
Presumably these data are just a toy example, but I'd be amazed if the wide structure were not more useful for your real data too. If there's a reason for the long structure, you did not tell us about it. (If you really have panel data, that's a different story, but check out tsfill.)

Divide a large image into two non overlapping images whose union is the large image

I am given a large image composed of smaller images, stored as a matrix. I need to find a boundary dividing the large image into two parts (not necessarily equal, but preferably nearly equal) without cutting through a smaller image.
Each small image is represented by a single integer in the larger image matrix.
Ex:
1 1 2 2 2
1 1 2 2 2
3 3 3 4 4
3 3 3 4 4
is the large image matrix composed of 4 small images.
I need to find one such boundary to separate it into two smaller images such that their sizes don't differ by a very large amount.
This is my solution:
1. Start from considering the 1st row.
2. Using binary search, find the start of a boundary. In the above example it will look like this:
1 1 | 2 2 2
1 1 2 2 2
3 3 3 4 4
3 3 3 4 4
3. Proceed down for as long as the dividing line doesn't intersect an image. If the end of the large image is reached, stop.
1 1 | 2 2 2
1 1 | 2 2 2
3 3 3 4 4
3 3 3 4 4
4. Repeat steps 1, 2 and 3 on the remaining rows, and draw a horizontal line from the old dividing line to the new one.
1 1 | 2 2 2
1 1 | 2 2 2
--
3 3 3 4 4
3 3 3 4 4
1 1 | 2 2 2
1 1 | 2 2 2
-----
3 3 3 | 4 4
3 3 3 | 4 4
End of large image...Stop.
Of course, if no vertical line can be found in step 2, we can look for a horizontal line first in a similar way, as in the case of:
1 1 1 1 1
1 1 1 1 1
--
3 3 3 2 2
3 3 3 2 2
and then proceed.
How can I improve on this solution?
Are there better solutions and will my algorithm fail anytime?
I will be coding in C++. A heuristic or greedy solution would be nice as well.
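Steps 2 and 3 of the procedure described in the question can be sketched as follows. This is a simplified illustration in Python, not C++: the helper names `cut_columns` and `extend_cut_down` are hypothetical, the candidate columns are found by a linear scan where the question uses binary search, and the check on the horizontal connecting segment in step 4 is left out.

```python
def cut_columns(row):
    """Columns c where a vertical cut between c-1 and c separates two labels."""
    return [c for c in range(1, len(row)) if row[c - 1] != row[c]]

def extend_cut_down(matrix, start_row, col):
    """Return the first row at which a cut before `col` would slice an image."""
    r = start_row
    while r < len(matrix) and matrix[r][col - 1] != matrix[r][col]:
        r += 1
    return r  # equals len(matrix) if the cut reaches the bottom

image = [
    [1, 1, 2, 2, 2],
    [1, 1, 2, 2, 2],
    [3, 3, 3, 4, 4],
    [3, 3, 3, 4, 4],
]
print(cut_columns(image[0]))         # [2]
print(extend_cut_down(image, 0, 2))  # 2: the cut is blocked by image 3
print(cut_columns(image[2]))         # [3]
print(extend_cut_down(image, 2, 3))  # 4: bottom of the large image reached
```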
If the image is large enough for this to make sense, you could use local differences to guide your choice of boundaries.
Here is an example implemented in MATLAB for simplicity, but you will get the picture.
Suppose we create an image similar to the one you defined:
img = [ ones(20,20), 2*ones(20,30); ones(10,20), 2*ones(10,30); 3*ones(20,30), 4*ones(20,20)]
This command creates a 50x50 image containing a 20x30 sub-image 1, a 30x30 sub-image 2, a 30x20 sub-image 3 and a 20x20 sub-image 4 (sizes given as width x height), as depicted graphically below:
Ideally you would like to get the boundaries between these "trays" representing the values 1 to 4. One way to do so is to shift the image one pixel left/right and one pixel top/bottom and subtract the shifted image from the original. This will produce another image with nonzero values only at the boundary positions.
See for example in MATLAB:
mask=((img-shift(img,1) + img-shift(img',1)')~=0);
This will create a mask by adding the difference of the right-shifted image and the original with the difference of the bottom-shifted image and the original, and, finally, by comparing the result with zero (zero values will be all pixel values except in boundaries). Function shift just shifts values of a matrix right or left. There is no need to put the code here since I just want to show the concept.
So you will end up with the following mask image:
This mask has been cropped by one pixel at the right and bottom, since the previous subtractions produce a border that is not needed.
In this image, true values (white pixels) are on the last pixel of the previous image, i.e. image 1 ends at the 1st boundary and image 2 begins at the next pixel, so image 1 is bounded by x=20 and y=30, and so on for the other sub-images.
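The masking idea can be sketched in plain Python as follows. This is an illustration, not a port of the MATLAB line: it compares each pixel with its right and lower neighbour and combines the two tests with a logical OR, where the MATLAB line adds the two differences, so that a horizontal and a vertical difference cannot cancel each other. The function name `boundary_mask` is made up for this example.

```python
def boundary_mask(img):
    """True where a pixel differs from its right or lower neighbour.

    The result is one pixel smaller than img on the right and at the bottom.
    """
    rows, cols = len(img), len(img[0])
    return [
        [img[r][c] != img[r][c + 1] or img[r][c] != img[r + 1][c]
         for c in range(cols - 1)]
        for r in range(rows - 1)
    ]

# Same layout as the MATLAB example: 30 rows of 1s and 2s, then 20 rows of 3s and 4s.
img = [[1] * 20 + [2] * 30 for _ in range(30)] + \
      [[3] * 30 + [4] * 20 for _ in range(20)]
mask = boundary_mask(img)

print(len(mask), len(mask[0]))  # 49 49
print(mask[0][19])              # True: last column of sub-image 1 (x=20)
print(mask[29][0])              # True: last row of sub-image 1 (y=30)
```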

J, the unfindable verb

1 0 0 1 verb 1 2 3 4
result:1 4
The verb drops the items from the list on the right that have a 0 in the list on the left. I can remember seeing this verb in the Vocabulary but I can't find it again. Does anybody know this verb?
It's #.
Explanation: Such verbs (1 or 2 symbols, rarely 3) are called primitives. The # primitive is called Tally as a monad (it tallies the items, returning the count along the first dimension), and Copy as a dyad, where it copies each item of the right argument as many times as indicated by the corresponding item of the left argument. Of course, in this case, your right and left arguments must have the same length (or one of them must be a scalar).
Example:
1 0 0 1 # 1 2 3 4
1 4
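For readers who don't know J, the behaviour of dyadic # on lists can be approximated in Python. This is a rough sketch covering only flat lists and a scalar left argument; the function name `copy` is chosen here to echo the J name and is not a standard library function.

```python
def copy(counts, items):
    """Repeat each item as many times as the matching count says."""
    if isinstance(counts, int):           # scalar left argument applies to every item
        counts = [counts] * len(items)
    if len(counts) != len(items):
        raise ValueError("length error")  # arguments must agree in length
    return [item for n, item in zip(counts, items) for _ in range(n)]

print(copy([1, 0, 0, 1], [1, 2, 3, 4]))  # [1, 4]
print(copy([2, 0, 1], [7, 8, 9]))        # [7, 7, 9]
print(copy(2, [1, 2]))                   # [1, 1, 2, 2]
```

With a left argument made only of 0s and 1s, this reduces to the filtering behaviour asked about in the question.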