Replace zeros with missing values in certain cases - stata

I was wondering if anyone knew an easier way of doing the following:
I have a dataset of health facility caseload by year, where each observation is one health facility. Facilities were 'brought online' in different years, so some have zeros before they have values for caseload. Also, some 'discontinue', as in they did provide services, but don't any more. I would like to replace the zeros with missing values for the years in which a facility discontinued. In the following example, the 3rd and 4th facilities discontinued, so I'd like missing for y2014 for the 3rd and y2013 & y2014 for the 4th.
y2011 y2012 y2013 y2014
0 0 76 82
0 0 29 13
0 0 25 0
5 10 0 0
0 0 17 24
I tried the following, which worked, but I'm going to have many years worth of data to work on (2000-2014), so was wondering if there was a more efficient way.
replace y2014=. if y2014==0 & (y2013>0 | y2012>0 | y2011>0)
replace y2013=. if y2013==0 & ( y2012>0 | y2011>0)
replace y2012=. if y2012==0 & ( y2011>0)
I messed around with egen rowlast to identify the facilities with a zero in the last year (meaning they discontinued), but then wasn't sure where to go with it.

Your problem would benefit from a loop over the variables.
We'll initialise started to 0, change our mind about started when we see a positive value, and change any subsequent 0s to missings if started is 1.
gen started = 0
forval y = 2000/2014 {
replace started = 1 if y`y' > 0
replace y`y' = . if started == 1 & y`y' == 0
}
Note that this scheme allows re-starts.
A more general comment is that this is not the better data structure for such panel or longitudinal data. This particular problem is not too challenging, but most problems with such data will be easier after reshape long.
See here for a survey of "rowwise" technique in Stata.

Related

combining multiple items to create one dummy variable

I have 7 items/variables in Stata that address the same survey question. These 7 items are each different weight control behaviors (diet, exercise, pills, etc.). I am trying to combine these variables to create a single weight control behavior dummy variable that is coded as yes (did engage in weight control) and no (did not engage in weight control).
The response options for each variable look something like this for a given weight control behavior
dieted
11438 0 not marked
2771 1 marked
16 6 refused
6508 7 legitimate skip
13 8 don’t know
Here is my code. I re-coded 6,7,8 for all 7 vars as missing:
tab1 h1gh30a-h1gh30g,m`
foreach X of varlist h1gh30a-h1gh30g {
replace `X'=. if `X' > 1
}
egen wgt_control= rowmax(h1gh30a-h1gh30g)
ta wgt_control
gen wgt_control_new=wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
ta wgt_control_new
I used rowmax() to combine all 7 items but my issue is that the response option 0 or No doesn't appear when I tabulate it. I only get those who responded yes=1.
Here is a suggestion with a reproducible example for what I think is the cleanest approach. I also included some unsolicited advice about survey data best practices
* Example generated by -dataex-. For more info, type help dataex
clear
input double(h1gh30a h1gh30b h1gh30c)
1 1 1
1 0 1
6 1 8
0 0 0
7 6 8
end
* Explicit coding is better, so if possible, which it is with 7 vars,
* create a local with the vars are explicitly listed
local wgt_controls h1gh30a h1gh30b h1gh30c
* Recode is a better command to use here. And do not destroy information,
* there is a survey data quality assurance difference between respondent
* refusing to answer, not knowing or question skipped. You can replace this
* survey codes with these extended missing values that behaves like missing values
* but retain the differences in the survey codes
recode `wgt_controls' (6=.a) (7=.b) (8=.c)
* While rowmax() could be used, I think it seems like anymatch() fits
* what you are trying to do better
egen wgt_control = anymatch(`wgt_controls'), values(1)
There is no minimal reproducible example here, so we can't reproduce the problem independently.
From your code, it seems that h1gh30a-h1gh30g are recoded so that all are 0, 1 or missing, so their maximum takes one of the same values.
gen wgt_control_new = wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
seems to boil down to cloning the variable:
gen wgt_control_new = wgt_control
In short, I can't see a reason in your code why you should never see 0 as a possible result.
EDIT
A minimal check on whether there are zeros that aren't showing up as they should might be
egen max = rowmax(h1gh30a-h1gh30g)
list high30a-high30g if max == 0
```

Is there a way to count from last condition x?

I have a complex set of data that can return 3 different conditions per row. I need to be able to count the last x rows matching one of the specific conditions.
The following formula has been working well for me, but I have discovered a glitch in one instance of this formula (the formula is replicated at least a dozen times)
=ArrayFormula(LOOKUP(9.99999999999999E+307,IF(FREQUENCY(IF(AQ:AQ)=1,ROW(AQ:AQ)),IF(AQ:AQ<>1,ROW(AQ:AQ)))=0,FREQUENCY(IF(AQ:AQ=1,ROW(AQ:AQ)),IF(AQ:AQ=0,ROW(AQ:AQ))))))
Current criteria are as such:
0: Condition x met - Reset counter
1: Condition y met - Increment counter
2: Condition z met - Ignore this row
Therefore this:
1
2
2
2
1
1
0
1
1
1
Should output: 3
This:
1
2
0
2
2
1
2
1
Should output: 2
However the glitch I have encountered isn't resetting the counter when 0 is reached, for example:
1
2
1
2
1
1
2
2
2
2
0
Should output: 0
But in fact is outputting: 4
I have tested all possible conditions with that specific data set and I cannot rectify the issue. I believe there is an error in the formula (specifically the 9.99999999999999E+307) but I wrote it so long ago that I cannot successfully debug it. I have tried 1E+306 but the result is the same.
EDIT1: Upon request I have included as stripped down version of the sheet as I can while recreating the issue.
https://docs.google.com/spreadsheets/d/1SOXiFMEQelqptBvjcabMZGNgG60TRRbe_b65rzT1bi0/edit?usp=sharing
If you scroll to the bottom of the sheet you can see Col AQ has a 0, as a result the value in the cell AF2 should be 0.
You will notice in the sheet that I am using Named Ranges.
EDIT2: player0's answer was PERFECT!! <3
I modified the new formula to adapt to my spreadsheet so it could accommodate Named Ranges and drop-down lists. This question helped me a lot with that:
Convert column index into corresponding column letter
The final formula (just FYI) turned out to be:
=ARRAYFORMULA(COUNTIF(
INDIRECT(REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")&
MAX(IF((INDIRECT($A$1 & Z$1 & "L")=0)*(INDIRECT($A$1 & Z$1 & "L")<>""),
ROW(INDIRECT($A$1 & Z$1 & "L"))+1,5))&":"&
REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")), 1))
=ARRAYFORMULA(COUNTIF(INDIRECT("A"&
MAX(IF((A2:A=0)*(A2:A<>""), ROW(A2:A)+1, ROW(A2)))&":A"), 1))
spreadsheet demo

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution, and since I have no clue why the solution only works for the two sample cases lited in the description link, I'd like to ask anyone for comments and hints on how to correct my code. Thank you very much for your time and assistance ahead of time.
import sys
def football(string):
countries = map(str.split, string.split('|'))
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
results = []
for i in range(len(teams)):
results.append([teams[i]+':'])
for j in range(len(countries)):
if teams[i] in countries[j]:
results[i].append(str(j+1))
for i in range(len(results)):
results[i] = results[i][0]+','.join(results[i][1:])
return '; '.join(results) + '; '
if __name__ == '__main__':
lines = [line.rstrip() for line in open(sys.argv[1])]
for line in lines:
print football(line)
After deliberately failing an attempt to checkout the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better codes or great suggestions on improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
for test in test_cases:
if test:
team_supporters = {}
for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
for team in map(int, nation_teams.split()):
team_supporters.setdefault(team, []).append(nation)
print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
The output we build using str.format calls and the default space separator of the print function. If the final semicolon wasn't desired, I'd have used print("; ".join("{}:{}.format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even be necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.

Calculating the distance between characters

Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
I.E. A DB row could be:
| user_id | img_id | img_loc |
| 1 | 1 | /img.jpg|
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database. Pull the image text in to a variable with OCR and check if user_id is found anywhere in the variable. If not, flag the record/image in a log, if so do nothing and move on.
My example is simple, in the real world I have a guarantee that user_id wouldn't accidentally show up on the wrong form (it is of a specific format that has its own significance)
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be. Sometimes a 7 = 1 or a 9 = 7, etc. The result is a large number of false positives. Especially among images with low quality scans.
I've addressed some of the image quality issues with some processing on my side - increase image size, adjust the black/white threshold and had satisfying results. I'd like to add the ability for the prog to recognize, for example, that "81*7*23103" is not very far from "81*9*23103"
The only way I know how to do that is to check for strings >= to the length of what I'm looking for. Calculate the distance between each character, calc an average and give it a limit on what is a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
I figure I'm reinventing the wheel and wanted to share this for feedback. I see a huge increase in run time and a performance hit doing all these string operations over what I'm currently doing.

Perl RegEx for Matching 11 column File

I'm trying to write a perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the x flag to ignore whitespace, and importantly allow whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it Impossible.
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
#fields = split /\t/, $line;
if (#fields == 11) { # less than 11 fields is probably header/footer
$the_5th_column = $fields[4];
...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it using substr() or unpack() easier than you can using regex. You can use regex to parse out the data, but most of us who've been programming Perl a while also learned that regex is not the first tool to grab a lot of times. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing correctly because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files here's some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others allow wrap columns to the next line (like a spreadsheet would in a column that is not wide enough. It's not too hard to figure out how to deal with that, but how to do it is being left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor then saved it. Reading that is a two step process of getting rid of the wrapping carriage-returns, then re-processing the data to parse out the columns. As an added task they also have column headings in the data for page breaks.
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.