Stata - How to manipulate data and create observations - row

I have a tricky question about how to manipulate some data. Suppose I have the following structure of data:
_n group attr value
1 1 height 3
2 1 weight 12
3 1 length 9
4 2 weight 15
5 3 height 4
I want every group to have height, weight, and length. Where a value does not exist initially, I want a missing value to be put in. Thus the end result would look like this:
_n group attr value
1 1 height 3
2 1 weight 12
3 1 length 9
4 2 height .
5 2 weight 15
6 2 length .
7 3 height 4
8 3 weight .
9 3 length .
I don't know how to do this, but perhaps it would involve reshape?
Another thing I thought about would be to use egen to sum by group. We could figure out that group 1 has 3 members, group 2 has 1 member, and group 3 has 1 member. Then we could perform functions on groups 2 and 3 to get them up to par. But this could get complicated.

This is (1) very easy and (2) usually the wrong way to go.
(1) fillin is dedicated to this task. Your example leaves ambiguous whether attr is a numeric variable with value labels or a string variable, but this works either way:
. clear
. input group str6 attr value
group attr value
1. 1 "height" 3
2. 1 "weight" 12
3. 1 "length" 9
4. 2 "weight" 15
5. 3 "height" 4
6. end
. fillin group attr
. list, sepby(group)
+----------------------------------+
| group attr value _fillin |
|----------------------------------|
1. | 1 height 3 0 |
2. | 1 length 9 0 |
3. | 1 weight 12 0 |
|----------------------------------|
4. | 2 height . 1 |
5. | 2 length . 1 |
6. | 2 weight 15 0 |
|----------------------------------|
7. | 3 height 4 0 |
8. | 3 length . 1 |
9. | 3 weight . 1 |
+----------------------------------+
(2) However, what use is this structure? How are you going to relate height, length and weight? Usually, you would be better off with this (assuming same data as sandbox):
reshape wide value, i(group) j(attr) string
renpfix value
list, sepby(group)
Here we end with variables group, height, length and weight. If you still want the long structure, that is now achievable with reshape long.
Notes:
For more on fillin, see the help, the manual entry and this expository note.
renpfix in this case zaps prefixes of variable names. The tacit third argument is an empty string, so the prefix value is replaced with an empty string, that is, removed. In recent versions of Stata (12 up), that is now easy with rename (for example, rename value* *).
Presumably these data are just a toy example, but I'd be amazed if the wide structure were not more useful for your real data too. If there's a reason for the long structure, you did not tell us about it. (If you really have panel data, that's a different story, but check out tsfill.)
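For readers who want to see the logic behind (1) outside Stata, here is a minimal Python sketch of the same rectangularization: every combination of group and attr that occurs in the data gets a row, and combinations that were not observed get a missing value. The function name fill_in and the use of None for missing are assumptions made for illustration; this is not how fillin is implemented.

```python
from itertools import product

# Toy data from the question: (group, attr, value)
rows = [
    (1, "height", 3),
    (1, "weight", 12),
    (1, "length", 9),
    (2, "weight", 15),
    (3, "height", 4),
]

def fill_in(rows):
    """Return one row per (group, attr) combination; None marks a filled-in value."""
    observed = {(g, a): v for g, a, v in rows}
    groups = sorted({g for g, _, _ in rows})
    attrs = sorted({a for _, a, _ in rows})
    # Cross every group with every attr, looking up the value where one exists.
    return [(g, a, observed.get((g, a))) for g, a in product(groups, attrs)]

filled = fill_in(rows)
for row in filled:
    print(row)
```

As in the Stata listing, the rows come out sorted by group and then by attr, so group 2 gains height and length rows with missing values.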

Related

list input data into sas from txt file & separate into columns

I am having trouble with a .txt data set that is separated by spaces. It is in list input format and I want to separate it into three columns.
I've tried using row pointers, column pointers, and other methods within the input statement. I've been searching online and I think an array would work in this situation but I'm not sure how to do that (I'm new to SAS).
1 1 2 2 1 1 3 1 2 4 0 3 5 1 3 6 1 2 7 0 3 8 1 3 9 0 1 10 0 2 11 1 1
I want the data to be in columns:
ID Group Time
1 1 2
2 1 1
3 1 2
The id numbers go up to 90
Use the double trailing @ (@@), which holds the input line so that successive iterations of the data step keep reading from it:
filename mydata "whatever.txt";
data want;
infile mydata;
input id group time @@;
run;
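To make the effect of holding the line concrete, here is a rough Python sketch of the same idea: the whole stream of values is consumed three at a time, each triple becoming one record. The helper name read_triples is an assumption for illustration, and the sample string is the data from the question.

```python
def read_triples(text):
    """Split whitespace-separated values into (id, group, time) records."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a multiple of three values")
    # Take the tokens three at a time, converting each to an integer.
    return [
        tuple(int(t) for t in tokens[i:i + 3])
        for i in range(0, len(tokens), 3)
    ]

sample = "1 1 2 2 1 1 3 1 2 4 0 3 5 1 3 6 1 2 7 0 3 8 1 3 9 0 1 10 0 2 11 1 1"
records = read_triples(sample)
for record in records:
    print(record)
```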

SAS_Specific Functions

I have a question that is easiest to ask with an example. My data set is:
Group Value
1 10
1 8
1 12
2 13
2 11
2 7
I want to add two columns to this data set. The first column should contain the maximum of Value within each group, and the second column should contain the minimum of Value within each group. So the result should look like this:
Group Value Max Min
1 10 12 8
1 8 12 8
1 12 12 8
2 13 13 7
2 11 13 7
2 7 13 7
12 - because there are 3 numbers (10,8,12) in group number 1 and 12 is maximum among these values.
13 - because there are 3 numbers (13,11,7) in group number 2 and 13 is maximum among these values.
8 - because there are 3 numbers (10,8,12) in group number 1 and 8 is minimum among these values.
7 - because there are 3 numbers (13,11,7) in group number 2 and 7 is minimum among these values.
I hope I have explained it clearly.
Many thanks in advance.
Try:
proc sql;
select *,max(value) as max,min(value) as min from have group by group;
quit;
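The query works because the group summaries are merged back onto every detail row. A minimal Python sketch of that two-pass logic, assuming the data are held as (group, value) pairs and using the hypothetical helper name add_group_extremes:

```python
# Data from the question: (group, value)
have = [(1, 10), (1, 8), (1, 12), (2, 13), (2, 11), (2, 7)]

def add_group_extremes(rows):
    """Append the per-group maximum and minimum to every row."""
    highest, lowest = {}, {}
    # First pass: collect the maximum and minimum of each group.
    for group, value in rows:
        highest[group] = max(highest.get(group, value), value)
        lowest[group] = min(lowest.get(group, value), value)
    # Second pass: attach the group summaries to each original row.
    return [(g, v, highest[g], lowest[g]) for g, v in rows]

want = add_group_extremes(have)
for row in want:
    print(row)
```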

Stata applying group1's info to other groups if group1 is the same

Suppose I have
group1 group2 info
100 1 .
100 1 .
200 1 10
200 2 20
300 2 .
Then, for group1, copy group1's "info" to other group1s if group2 is the same.
So the result will be like this.
group1 group2 info
100 1 10
100 1 10
200 1 10
200 2 20
300 2 20
I tried to do this using bysort but couldn't think of a way to do this..
The question is puzzling because the example implies that the variable group1 is irrelevant. I'll take the example rather than the wording as being definitive.
The solution by @timat is along the right lines, but does nothing to check a sensible constraint that non-missing values in a group should be identical.
One approach hinges on the fact that most egen functions ignore missing values to the extent possible. Hence there is just one distinct non-missing value if and only if the maximum and minimum in each group are identical (and not missing) and it can be copied to replace missing values within groups of observations. (If all values are missing, nothing problematic occurs.)
clear
input group1 group2 info
100 1 .
100 1 .
200 1 10
200 2 20
300 2 .
end
bysort group2: egen max = max(info)
by group2: egen min = min(info)
replace info = max if max == min & missing(info)
list, sepby(group2)
+------------------------------------+
| group1 group2 info max min |
|------------------------------------|
1. | 100 1 10 10 10 |
2. | 100 1 10 10 10 |
3. | 200 1 10 10 10 |
|------------------------------------|
4. | 200 2 20 20 20 |
5. | 300 2 20 20 20 |
+------------------------------------+
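The check that maximum and minimum agree can also be sketched in Python, for readers who want to see the logic spelled out. This is an illustration only: None stands in for Stata's missing value, rows are (group2, info) pairs, and fill_within_groups is a hypothetical name.

```python
def fill_within_groups(rows):
    """Fill missing info from the group's single distinct non-missing value."""
    extremes = {}  # group -> (min, max) over non-missing values
    for group, info in rows:
        if info is not None:
            lo, hi = extremes.get(group, (info, info))
            extremes[group] = (min(lo, info), max(hi, info))
    filled = []
    for group, info in rows:
        # Copy only when the group has exactly one distinct non-missing value.
        if info is None and group in extremes:
            lo, hi = extremes[group]
            if lo == hi:
                info = lo
        filled.append((group, info))
    return filled

rows = [(1, None), (1, None), (1, 10), (2, 20), (2, None)]
result = fill_within_groups(rows)
print(result)

# A group with conflicting values is left untouched.
conflict = fill_within_groups([(1, None), (1, 5), (1, 6)])
print(conflict)
```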
This would work, because missing values sort last, so info[1] is the smallest non-missing value in each group:
bysort group2 (info): replace info = info[1]
Found solution here:
http://www.stata.com/statalist/archive/2006-10/msg00928.html

Stata: looping over observations

My data set looks like this
x1
1
0
0
1
0
0
1
1
In this data set the values following a 1 belong to the same group. For example, the first two zeros belong to group 1, the second two zeros belong to group 2, and so on. I would like to get a final output similar to this. Note that the gap between two 1s is arbitrary:
x1 x2
1 1
0 1
0 1
1 2
0 2
0 2
1 3
1 4
I think I need to write a loop that goes over the observations. But I cannot figure out the logical statements that will accomplish this.
Either
gen x2 = sum(x1)
or
gen x2 = sum(x1 == 1)
is sufficient. There is a loop over observations tacit as usual there, but you don't need an explicit loop.
In detail, sum() here is a cumulative or running sum. In your case, the first solution is simple and adequate. The reason for mentioning the second solution is that it is more general: we can tag the first observation in each block or spell with 1 and then create a running sum to form blocks of 1s, 2s, and so forth.
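The running-sum idea is not specific to Stata. As a small illustration in Python, using the x1 values from the question, the second form tags each block start with 1 and accumulates the tags:

```python
from itertools import accumulate

x1 = [1, 0, 0, 1, 0, 0, 1, 1]

# Tag the first observation of each block with 1, then take a running sum.
x2 = list(accumulate(int(value == 1) for value in x1))

for a, b in zip(x1, x2):
    print(a, b)
```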

Divide a large image into two non overlapping images whose union is the large image

I have a large image composed of smaller images, stored as a matrix. I need to find a boundary dividing the large image into two parts (not necessarily equal, but preferably nearly equal) without cutting through a smaller image.
Each small image is represented by a single integer in the larger image matrix.
Ex:
1 1 2 2 2
1 1 2 2 2
3 3 3 4 4
3 3 3 4 4
is the large image matrix composed of 4 small images.
I need to find one such boundary to separate it into two smaller images such that their sizes don't differ by a very large amount.
This is my solution:
1. Start from considering the 1st row.
2. Using binary search find the start of a boundary. In above example it will be like
1 1 | 2 2 2
1 1 2 2 2
3 3 3 4 4
3 3 3 4 4
3. Proceed down for as long as the dividing line does not cut through an image. If the end of the large image is reached, stop.
1 1 | 2 2 2
1 1 | 2 2 2
3 3 3 4 4
3 3 3 4 4
4. Repeat steps 1, 2 and 3 on the remaining rows, and draw a horizontal line from the old dividing line to the new one.
1 1 | 2 2 2
1 1 | 2 2 2
--
3 3 3 4 4
3 3 3 4 4
1 1 | 2 2 2
1 1 | 2 2 2
-----
3 3 3 | 4 4
3 3 3 | 4 4
End of large image...Stop.
Of course, if no vertical line can be found in step 2, we can look for a horizontal line first in a similar way, as in the case of:
1 1 1 1 1
1 1 1 1 1
--
3 3 3 2 2
3 3 3 2 2
and then proceed.
How can I improve on this solution?
Are there better solutions and will my algorithm fail anytime?
I will be coding in C++. A heuristic/ greedy solution will be nice as well.
If the image is large enough for this to make sense, you could use local differences to guide your boundary selection.
Here is an example implemented in MATLAB for simplicity but you will get the picture:
suppose we create an image similar to the one you defined:
img = [ ones(20,20), 2*ones(20,30); ones(10,20), 2*ones(10,30); 3*ones(20,30), 4*ones(20,20)]
This command creates a 50x50 image, having a 20x30 sub-image 1, a 30x30 sub-image 2, a 30x20 sub-image 3 and a 20x20 sub-image 4 (sizes given as width x height), as depicted graphically below:
Ideally you would like to get the boundaries between these "trays" representing the values 1 to 4. One way to do so is to shift the image one pixel left/right and one pixel top/bottom and subtract it with the original. This will produce another image with values only in the boundary positions.
See for example in MATLAB:
mask = ((img-shift(img,1))~=0) | ((img-shift(img',1)')~=0);
This creates a mask by comparing the difference between the right-shifted image and the original with zero, doing the same for the bottom-shifted image, and combining the two results with a logical OR (the differences are zero at all pixels except those on boundaries). Testing each difference separately, rather than adding them first, avoids a horizontal and a vertical difference cancelling each other out. The function shift just shifts the values of a matrix right or left. There is no need to put its code here since I just want to show the concept.
So you will end-up with the following mask image:
This mask has been cropped by one pixel at the right and bottom, since the previous subtractions produce a border that is not needed.
In this image, true values (white pixels) are on the last pixel of the previous image, i.e. image 1 ends at the 1st boundary and image 2 begins at the next pixel, so image 1 is bounded by x=20 and y=30, and so on for the other sub-images.
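Since the shift function is not shown, here is a self-contained Python sketch of the same concept, assuming the image is a list of rows of integer labels. Instead of shifting and subtracting, it compares each pixel directly with its right and lower neighbours, which marks the last pixel of each sub-image in the same way. The name boundary_mask is hypothetical, and the small matrix is the example from the question.

```python
def boundary_mask(img):
    """Mark pixels whose right or lower neighbour belongs to another sub-image."""
    rows, cols = len(img), len(img[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Pixels on the last column or row have no neighbour to compare with.
            right_differs = c + 1 < cols and img[r][c] != img[r][c + 1]
            below_differs = r + 1 < rows and img[r][c] != img[r + 1][c]
            mask[r][c] = right_differs or below_differs
    return mask

img = [
    [1, 1, 2, 2, 2],
    [1, 1, 2, 2, 2],
    [3, 3, 3, 4, 4],
    [3, 3, 3, 4, 4],
]
mask = boundary_mask(img)
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

A candidate dividing line can then be traced along the marked pixels: in this example the second row is marked all the way across, so a horizontal cut below it separates images 1 and 2 from images 3 and 4 without cutting through any of them.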