I am trying to count values of many string variables (hh_1_age hh_2_age hh_3_age etc.) based on multiple conditions, and send output to a new variable schoolage.
The closest I think I've gotten is coded below...but I'm still getting error messages. I'm hoping I'm on the right track and it's just a small syntax error:
generate schoolage = .
foreach var of varlist hh_* {
count if `v'=="6 - 10 years of age" | `v'=="11 - 14 years of age"
}
This requires a great deal of guesswork compared with the small amount of information given. Please note the suggestions here for good Stata questions, which include giving a reproducible example based on real(istic) data.
You may have data on individual members of a household and wish to count how many are of school age, that meaning aged between 6 and 14 years.
gen schoolage = 0
foreach v of var hh_*_age {
replace schoolage = schoolage + inlist(`v', "6 - 10 years of age", "11 - 14 years of age")
}
would be one solution to that.
Note that this assumes that the variables in question really are string, as you report. Further, checks for equality are entirely literal: all characters must match in turn.
For most purposes holding data on each individual in a separate observation is a much better practice than your question seems to imply.
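For readers more at home outside Stata, the same row-wise count can be sketched in Python. This is only an illustration under assumptions: each household is represented as a plain dict, and the names `SCHOOL_AGE_LABELS` and `count_school_age` are invented here rather than taken from the question.

```python
# Hypothetical labels, copied from the two categories named in the question.
SCHOOL_AGE_LABELS = {"6 - 10 years of age", "11 - 14 years of age"}


def count_school_age(household: dict) -> int:
    """Count the hh_*_age fields whose value is exactly one of the labels."""
    return sum(
        value in SCHOOL_AGE_LABELS
        for name, value in household.items()
        if name.startswith("hh_") and name.endswith("_age")
    )


household = {
    "hh_1_age": "6 - 10 years of age",
    "hh_2_age": "11 - 14 years of age",
    "hh_3_age": "35 - 44 years of age",
}
print(count_school_age(household))
```

As with the Stata `inlist()` check, the comparison is literal, so a label with a stray trailing space would not be counted.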
I have two datasets, one that lives within my agency and another that comes from an external source. Theoretically, all my agency's data should be matchable as a subset of the external data, but the problem is that there's no consistency in how PHN + street addresses are being recorded externally.
Our data = 100 West 10 Street
Their data = 100W 10th St / 100 W. 10 St. / 100west 10TH Street (you get the idea)
We have a lot of data, but they have even more, and both our data change on a daily basis, so it's infeasible to change formats one by one.
So I have two questions, coming from a SAS novice who's learned through work and lots of Googling, so please bear with me.
1 - Is there a way to do a quick non-perfect/fuzzy matching of the two datasets on addresses if they're not totally consistent in format? I understand that I'd have to go through the results, but I wanted a quick way to eliminate most of the non-matches immediately with minimal clean-up beforehand.
2 - If 1 isn't possible, what is the best approach to clean up the external data and to make the addresses more consistent? Should I keep the PHN + Street together, or keep them as separate variables? I started looking into prxchange and while it's definitely useful, it's not perfect. For example:
Address = left(prxchange('s / ST | ST. / STREET /', -1, cat(' ', address, ' ')));
Works great until it hits addresses at St Marks, for example, and converts the St to STREET.
The other problem is that I have to account for all the possible variations in spelling, abbreviations, periods, etc., which I'm doing now the old-fashioned way in Excel, but this leaves room for error.
Also, if some of the addresses have been compressed, such as 10west instead of 10 west, what is the best way to add a space or separate out entirely? Everything has been read in in the text format, and again there's no consistency in the number of characters to do a simple substring.
Thanks!
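One way to picture the clean-up described above is a small normalizer, sketched here in Python rather than SAS purely for illustration. Everything in it is an assumption: the function name `normalize_address`, the tiny `DIRECTIONS` and `SUFFIXES` tables, and the rule that a street-type abbreviation is expanded only when it is the last word (which is what keeps "St Marks" intact). Real address data would need far larger tables and would still leave ambiguous cases.

```python
import re

# Illustrative lookup tables only; real data needs many more entries.
DIRECTIONS = {"N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST"}
SUFFIXES = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}


def normalize_address(raw: str) -> str:
    text = raw.upper().replace(".", " ")
    # Drop ordinal endings: 10TH -> 10, 3RD -> 3.
    text = re.sub(r"\b(\d+)(?:ST|ND|RD|TH)\b", r"\1", text)
    # Split a number from a word glued to it: 100WEST -> 100 WEST.
    text = re.sub(r"\b(\d+)([A-Z]+)\b", r"\1 \2", text)
    tokens = [DIRECTIONS.get(tok, tok) for tok in text.split()]
    # Expand a street type only in final position, so "ST MARKS" survives.
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)


for variant in ("100W 10th St", "100 W. 10 St.", "100west 10TH Street"):
    print(normalize_address(variant))
```

With both datasets passed through the same function, an exact match on the normalized string can eliminate most non-matches before any manual review.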
I have an input text file containing information (Age, Name, Job, Salary...) on 3 individuals. Now those keywords could be misspelled. I have to parse the three lines and be able to compare them with a template later on.
My question is just on how I should approach this. I started by parsing each line into a vector<string> but I don't know how I can then look at each element of the vector and read the different pieces of information even if they contain spelling mistakes.
Any help would be greatly appreciated!
Name: Kevin, Jeb: Accountant, Yers of Experience: 5, Salery: 10000
Name: Susan, Job: Restaurant Owner, Years of Experience: 5, Salary: 14000
Side Note: The information in each line does not have to be in this order; each line can display the fields in a random order.
Since you need the result, not the process, the easiest way is the most straightforward one. You say that:
Typos can be of two types: Deletion (i.e. Titl instead of Title) and Substitution (i.e. experiance instead of experience).
I assume that each typo type (pun intended) can only occur once per word (otherwise the task makes little sense). So here is your line:
Name: Susan, Job: Restaurant Owner, Years of Experience: 5, Salary: 14000
After splitting it by the commas, you will get 4 parts:
Name: Susan
Job: Restaurant Owner
Years of Experience: 5
Salary: 14000
Now, each part has a "key" and a "value", which are easy to separate by splitting on ":". The values are of two fundamental types: integers for salary and years of experience, and strings for name and job.
Take the ones that have integers as values. Between them, it's easy to tell years of experience and salary apart, because "years of experience" even with typos is a much longer string than "salary".
Now take the ones with string values. This one is harder because you can't use the keys' lengths to tell them apart. However, the words "Name" and "Job" do not share any characters. So if a key contains at least two characters from the word "Name", then it's the "Name" key, and vice versa.
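The approach above can be sketched as follows. The question is about C++, so treat this Python version as a language-neutral illustration; the names `canonical_key` and `parse_line` are made up here. It assumes integer values are written with digits only, and it picks the string key by whichever of "name" and "job" shares more distinct characters with the typed key, which is a slight generalization of the "at least two characters" rule.

```python
def canonical_key(key: str, value: str) -> str:
    """Map a possibly misspelled key to one of the four expected keys."""
    typed = key.strip().lower()
    if value.strip().isdigit():
        # Integer-valued fields: choose the expected key of closest length.
        candidates = ("years of experience", "salary")
        return min(candidates, key=lambda name: abs(len(name) - len(typed)))
    # String-valued fields: "name" and "job" share no letters, so count
    # how many distinct letters the typed key has in common with each.
    candidates = ("name", "job")
    return max(candidates, key=lambda name: len(set(name) & set(typed)))


def parse_line(line: str) -> dict:
    record = {}
    for part in line.split(","):
        key, _, value = part.partition(":")
        record[canonical_key(key, value)] = value.strip()
    return record


print(parse_line("Name: Kevin, Jeb: Accountant, Yers of Experience: 5, Salery: 10000"))
```

Because each part is classified on its own, the order of the fields within a line does not matter.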
I'm trying to generate in Stata the mean per year (e.g. 2002-2012) for each industry (by 2 digit SIC codes, so c. 50 different industries)
I found how to do it for one year with:
by sic_2digit, sort: egen test = mean(oancf_at_rsd10) if fyear == 2004
Is there a more efficient way to do this instead of repeating the command 10 times by hand and then adding the values together?
You can specify more than one variable with by:.
by sic_2digit fyear, sort: egen test = mean(oancf_at_rsd10)
Check out the help for by:, which gives the syntax and an example, and also that for collapse.
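To make explicit what grouping on two variables does, here is a rough Python sketch of the same idea: every row receives the mean of its own (industry, year) group. The function name `add_group_means` is hypothetical, rows are assumed to be plain dicts using the variable names from the question, and missing values are assumed absent.

```python
from collections import defaultdict


def add_group_means(rows, value="oancf_at_rsd10", new="test"):
    """Attach to each row the mean of `value` within its (sic_2digit, fyear) group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = (row["sic_2digit"], row["fyear"])
        sums[group] += row[value]
        counts[group] += 1
    result = []
    for row in rows:
        group = (row["sic_2digit"], row["fyear"])
        result.append({**row, new: sums[group] / counts[group]})
    return result


rows = [
    {"sic_2digit": 20, "fyear": 2004, "oancf_at_rsd10": 1.0},
    {"sic_2digit": 20, "fyear": 2004, "oancf_at_rsd10": 3.0},
    {"sic_2digit": 20, "fyear": 2005, "oancf_at_rsd10": 5.0},
]
for row in add_group_means(rows):
    print(row)
```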
I was practicing the dynamic programming problem on SPOJ. But I have no idea how to solve this one.
Can anyone please help ME in solving http://www.spoj.pl/problems/ACODE/ problem on SPOJ
Thanks!
Alice and Bob need to send secret messages to each other and are
discussing ways to encode their messages:
Alice: “Let’s just use a very simple code: We’ll assign ‘A’ the code
word 1, ‘B’ will be 2, and so on down to ‘Z’ being assigned 26.”
Bob: “That’s a stupid code, Alice. Suppose I send you the word ‘BEAN’
encoded as 25114. You could decode that in many different ways!”
Alice: “Sure you could, but what words would you get? Other than
‘BEAN’, you’d get ‘BEAAD’, ‘YAAD’, ‘YAN’, ‘YKD’ and ‘BEKD’. I think
you would be able to figure out the correct decoding. And why would
you send me the word ‘BEAN’ anyway?”
Bob: “OK, maybe that’s a bad example, but I bet you that if you got a
string of length 5000 there would be tons of different decodings and
with that many you would find at least two different ones that would
make sense.”
Alice: “How many different decodings?”
Bob: “Jillions!”
For some reason, Alice is still unconvinced by Bob’s argument, so she
requires a program that will determine how many decodings there can be
for a given string using her code.
Input
Input will consist of multiple input sets. Each set will consist of a
single line of at most 5000 digits representing a valid encryption
(for example, no line will begin with a 0). There will be no spaces
between the digits. An input line of ‘0’ will terminate the input and
should not be processed.
Output
For each input set, output the number of possible decodings for the
input string. All answers will be within the range of a 64 bit signed
integer.
Example
Input:
25114
1111111111
3333333333
0
Output:
6
89
1
Starting from the left, do the following:
Find how many words the sequence can be interpreted as up to this point (call that, say, x[k]), using a finite number of the values previously calculated for earlier points along the sequence.
Move to the next point.
If you still can't get it, take a look at the Welcome to Code Jam problem. It is somewhat similar and has readily available explanations.
If you have a string of digits S, there are two possible cases:
1) only the first digit corresponds to a letter
2) the first two digits together correspond to a letter, BUT only if they don't form a number greater than 26.
Let S be of size n. Let f(Si) be the number of decodings of the last i digits. Note that you have to find f(Sn).
Using the above two rules, you can write a relation as :
If first two digits form a number <= 26 :
f ( Sk ) = f (Sk-1) + f (Sk-2)
If first two digits form a number > 26 :
f ( Sk ) = f (Sk-1)
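A minimal sketch of this recurrence in Python follows. It runs left to right over prefixes, as the first answer suggests, rather than over suffixes, and keeps only the last two counts. The zero handling is an addition not spelled out in the answers above: a '0' cannot stand alone, and a two-digit pair only counts when it lies between 10 and 26. The function name `count_decodings` is made up for this example.

```python
def count_decodings(digits: str) -> int:
    """Number of ways to decode `digits` with A=1 ... Z=26."""
    two_back, one_back = 1, 1  # the empty prefix has exactly one decoding
    for i, ch in enumerate(digits):
        current = 0
        if ch != "0":
            # The current digit stands for a letter on its own.
            current += one_back
        if i >= 1 and "10" <= digits[i - 1:i + 1] <= "26":
            # The previous and current digits together form one letter.
            current += two_back
        two_back, one_back = one_back, current
    return one_back


for line in ("25114", "1111111111", "3333333333"):
    print(count_decodings(line))
```

On the three sample inputs from the problem statement this gives 6, 89 and 1, matching the expected output.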
I have some SQLCLR code for working with Regular Expresions. But now that it is getting migrated into Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 and 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]' + REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years, these are things that will be much easier (and probably more efficient) to check if the numeric value is in a table rather than convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
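As one illustration of validating before the data reaches the database, the format checks above could live in application code. The sketch below uses Python only as an example language; the function names are hypothetical, and the rules are just the ones stated earlier (15 digits starting with 3, or 16 digits starting with 4 to 7, for cards; ten digits not starting with 0 or 1 for US phone numbers). Like the LIKE patterns, it checks format only, not whether a card number is actually valid.

```python
import re


def strip_separators(value: str) -> str:
    """Remove spaces, dashes and parentheses before checking the format."""
    return re.sub(r"[\s()\-]", "", value)


def looks_like_card_number(value: str) -> bool:
    digits = strip_separators(value)
    return re.fullmatch(r"3[0-9]{14}|[4-7][0-9]{15}", digits) is not None


def looks_like_us_phone(value: str) -> bool:
    digits = strip_separators(value)
    return re.fullmatch(r"[2-9][0-9]{9}", digits) is not None


print(looks_like_card_number("4111-1111-1111-1111"))
print(looks_like_us_phone("(212) 555-0100"))
```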
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or make complex use of the built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.