SAS converting character to numeric in data step - sas

I cannot figure out why SAS is converting some character variables to numeric in a data step. I've confirmed that each variable is a character variable with a length of 1 using PROC CONTENTS. Below is the code that is generating this problem. I've not found any answers through Google searches that make sense for this issue.
data graddist.graddist11;
set graddist.graddist10;
if (ACCT2301_FA16 | BIOL1106_FA16 | BIOL1107_FA16 | BIOL1306_FA16 |
BIOL2101_FA16 | BIOL2301_FA16 | CHEM1111_FA16 | CHEM1305_FA16 | CHEM1311_FA16 |
ECON2301_FA16 | ENGL1301_FA16 | ENGL1302_FA16 | ENGR1201_FA16 | GEOG1313_FA16 |
HIST1301_FA16 | HIST1302_FA16 | MARK3311_FA16 | MATH1314_FA16 | MATH2413_FA16 |
MATH2414_FA16 | PHIL2306_FA16 | POLS2305_FA16 | POLS2306_FA16 |
PSYC1301_FA16 | PSYC2320_FA16) in ('A','B','C','D','F','W','Q','I') then FA16courses_b=1;
else FA16courses_b=0;
Thank you,
Brian

Unfortunately, the IN operator does not work like that in SAS. It compares only a single value on the left to a list of values on the right. Your code evaluates everything to the left of the IN operator first, which yields a TRUE/FALSE value (in SAS this is a numeric value, where 0 or missing is FALSE and anything else is TRUE). It is then effectively saying:
if (0) in ('A'....'I') then...
or
if (1) in ('A'....'I') then...
You would need to rewrite the test so that each variable is compared separately, for example:
if ACCT2301_FA16 in ('A'....'I')
or BIOL1106_FA16 in ('A'....'I')
or ...

As Robert notes, you can't compare multiple variables to multiple values in a single IN comparison.
One way to do this with a long list of variables is FINDC. Concatenate all the variables into one string, put the characters you are looking for into a second string, and compare: FINDC returns the position of the first character in the first string that appears in the character list, or 0 if none does.
data graddist10;
length ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16 $1;
input
ACCT2301_FA16 $
BIOL1106_FA16 $
BIOL1107_FA16 $
;
datalines;
X Y Z
A A B
B A B
. . .
A X .
X . B
W Y A
;;;;
run;
data graddist11;
set graddist10;
if findc(catx('|',of ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16),
('ABCDFWQI')) then FA16courses_b=1;
else FA16courses_b=0;
run;
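For readers who want to check the logic outside SAS, here is a minimal Python sketch of the same concatenate-and-search idea. It is an illustration rather than a translation of the SAS runtime: it assumes missing values are represented as empty strings, and the name has_any_grade is made up for this example.

```python
GRADES = "ABCDFWQI"  # the grade characters from the question


def has_any_grade(values, grades=GRADES):
    """Return 1 if any character of any value is in `grades`, else 0."""
    # Mimic catx('|', ...): join the nonblank values with a separator
    # that is not itself a grade character.
    joined = "|".join(v.strip() for v in values if v.strip())
    # Mimic findc(...): does any character of the joined string
    # appear in the character list?
    return int(any(ch in grades for ch in joined))


# Rows from the datalines above; "" stands in for a missing value.
rows = [
    ("X", "Y", "Z"),
    ("A", "A", "B"),
    ("B", "A", "B"),
    ("", "", ""),
    ("A", "X", ""),
    ("X", "", "B"),
    ("W", "Y", "A"),
]
flags = [has_any_grade(r) for r in rows]
print(flags)
```

Only the first row (no grade characters) and the all-missing row should come out as 0.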

You made two syntax mistakes but they cancelled each other out and SAS considered your statement valid. First the IN () operator can only take one value on the left. Second the | token means logical or and is not used to separate items in a list.
The use of | between the variable names caused SAS to reduce the left hand side to a single value and so your use of IN was then syntactically correct. That is also why SAS noted that it had converted your character variables to numeric values so that they could be evaluated as boolean logic.
It is not clear what logic you meant for testing of those multiple values.
You might be able to use the verify() function. Perhaps something like this, but watch out for when all of the values are missing. Note that verify() returns a nonzero position when some character is not in the list, so this flags rows with at least one value outside the grade list.
FA16courses_b = 0 ^= verify(cats(of ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16),'ABCDFWQI');
But it is probably easier to loop over the variables and test them one by one. So for example you could set the target variable to zero and then conditionally set it to one when one of the variables meets the condition. As written, this flags a row when some variable does not hold one of the listed grades; remove the not to flag rows where some variable does.
FA16courses_b=0;
array c ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16 ;
do i=1 to dim(c) while (FA16courses_b=0);
if not indexc(c(i),'ABCDFWQI') then FA16courses_b=1;
end;

Related

Matching a string containing special characters against a string without special characters

I'm trying to match strings using a Google Sheets query where both strings are uncertain if they contain special characters. I'll try to explain as best as I can:
I have a table of data
+-----+------+-----+-----+
| A | B | C | D |
|x | y |ø |z |
|xx | yy |á |zz |
|xxx | yyy |e |zzz |
+-----+------+-----+-----+
My query function would look something like this:
=QUERY(A1:D3, "SELECT * WHERE (C = 'ø') OR (C = 'è') OR (C = 'a')")
Currently, this query returns only 1 row, because (C = 'ø') is an exact match with 'ø', while none of the other conditions match.
For (C = 'è'), we can just replace all of the accented characters in the string with their un-accented equivalent.
In this case 'è' becomes 'e' and has a match - now the query will return a second row.
(I found a nice way to replace all accented characters in a string here.)
Finally, here is where my main problem sits: (C = 'a'). I can't figure out a way to make it match 'á', unless I check every accented variant of 'a', but that just seems silly.
It's not possible to do something like "... WHERE (CUSTOM_FUNCTION(C) = 'a')" either, sadly.
As I previously mentioned, either side of the match may or may not contain accented/special characters.
I should also mention that it won't be just a single character; it will be a whole string.
If anyone has any possible solutions to this, it would be greatly appreciated.
Instead of the QUERY formula, you could use a FILTER formula.
=FILTER(A2:D22,REGEXMATCH(C2:C22,"ø|è|á")=TRUE)
(Please adjust ranges to your needs. You can also add/remove more special characters.)
Functions used:
FILTER
REGEXMATCH
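The underlying idea of matching regardless of accents on either side is to fold both strings to the same form before comparing. A rough Python sketch of that idea follows. It is not something QUERY or FILTER can call directly, and the helper name fold_accents is hypothetical. It relies on Unicode decomposition, so letters that have no decomposition, such as 'ø', are left unchanged.

```python
import unicodedata


def fold_accents(text):
    """Strip combining accents after decomposing the text (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Column C from the question's table and the three search terms.
column_c = ["ø", "á", "e"]
wanted = ["ø", "è", "a"]

# Fold both sides, then compare the folded forms.
wanted_folded = {fold_accents(w) for w in wanted}
matches = [c for c in column_c if fold_accents(c) in wanted_folded]
print(matches)
```

Because both sides are folded, it does not matter which side carries the accent, and the same function works on whole strings as well as single characters.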

Extract term within a string that matches a variable

I have a large dataset with two string variables: people_attending and special_attendee:
*Example generated by -dataex-. To install: ssc install dataex
clear
input str148 people_attending str16 special_attendee
"; steve_jobs-apple_CEO; kevin_james-comedian; michael_crabtree-football_player; sharon_stone-actor; bill_gates-microsoft_CEO; kevin_nunes-politician" "michael_crabtree"
"; rob_lowe-actor; ted_cruz-politician; niki_minaj-music_artist; lindsey_whalen-basketball_coach" "niki_minaj"
end
The first variable varies in length and contains a list of every person who attended an event along with their title. Name and title are separated by a dash, and attendees are separated by a semi-colon and space. The second variable is an exact match of one of the names contained in the first variable.
I want to create a third variable that extracts the title for whichever person is listed in the second variable. In the above example, I would want the new variable to be "football_player" for observation 1 and "music_artist" for observation 2.
Here is a way to do this using a simple regular expression:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(0) if ustrregexm(wanted, ">(.*?);")
replace wanted = substr(wanted, 3, strpos(wanted, ";")-3)
list wanted
+-----------------+
| wanted |
|-----------------|
1. | football_player |
2. | music_artist |
+-----------------+
In the first step you substitute the name with a marker >. Then you extract the relevant substring using the regular expression. In the final step, you clean up.
EDIT:
The third step can be omitted if you slightly modify the code as follows:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(1) if ustrregexm(wanted, ">-(.*?);")
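The same extraction can be sketched in Python to make the pattern explicit. This is only an illustration, assuming the name appears exactly once and is followed by a dash and the title; the function name title_for is hypothetical. Matching up to the next semicolon or the end of the string also covers an attendee who is last in the list, a case the patterns above would miss because they require a trailing semicolon.

```python
import re


def title_for(people_attending, special_attendee):
    """Return the title that follows `special_attendee-`, or None."""
    # Escape the name so it is matched literally, then capture
    # everything after the dash up to the next semicolon (or the end).
    pattern = re.escape(special_attendee) + r"-([^;]*)"
    match = re.search(pattern, people_attending)
    return match.group(1).strip() if match else None


obs1 = ("; steve_jobs-apple_CEO; kevin_james-comedian; "
        "michael_crabtree-football_player; sharon_stone-actor; "
        "bill_gates-microsoft_CEO; kevin_nunes-politician")
obs2 = ("; rob_lowe-actor; ted_cruz-politician; "
        "niki_minaj-music_artist; lindsey_whalen-basketball_coach")

print(title_for(obs1, "michael_crabtree"))
print(title_for(obs2, "niki_minaj"))
```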

How to find strings beginning with X?

I am trying to identify strings that begin with X using the function regexm() in Stata.
My code:
for var lookin: count if regexm(X, "X")
I have tried using double quotes, square brackets, adding the options for the other characters in the string X[0-9][0-9] etc. but to no avail.
I expect the resultant number to be about 1000, but it returns 0.
The following works for me:
clear
input str22 foo
"Xhello"
"this is a X sentence"
"X a silly one"
"but serves the purpose"
end
generate tag = strmatch(foo, "X*")
list
+------------------------------+
| foo tag |
|------------------------------|
1. | Xhello 1 |
2. | this is a X sentence 0 |
3. | X a silly one 1 |
4. | but serves the purpose 0 |
+------------------------------+
count if tag
2
This is the regular expression solution based on the above example:
generate tag = regexm(foo, "^X")
The for command in Stata is ancient and now undocumented syntax. If you really are using a very old version of Stata, you should say so in the question.
X is the default loop element which is substituted everywhere it is found.
Hence your syntax -- looping over a single variable -- reduces to
count if regexm(lookin, "lookin")
and even without a data example we can believe that the answer is 0.
This would be legal and is closer to what you seek:
for Y in var lookin : count if regexm(Y, "X")
but the regular expression is wrong, as @Pearly Spencer points out.
Incidentally,
count if strpos(lookin, "X") == 1
is a direct alternative to your code.
In any Stata that supports regexm() you should be looping with foreach or forvalues.
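The difference between an anchored and an unanchored test is easy to see in a small Python sketch using the example data above. This is just an illustration of the three ideas (prefix test, anchored regular expression, unanchored search), not Stata code, and the variable names are made up.

```python
import re

foo = [
    "Xhello",
    "this is a X sentence",
    "X a silly one",
    "but serves the purpose",
]

# Prefix test, comparable to strmatch(foo, "X*") or strpos(foo, "X") == 1.
tag_prefix = [int(s.startswith("X")) for s in foo]

# Anchored regular expression, comparable to regexm(foo, "^X").
tag_regex = [int(re.search(r"^X", s) is not None) for s in foo]

# Unanchored search, comparable to regexm(foo, "X"): this also
# tags strings that merely contain an X somewhere.
tag_unanchored = [int("X" in s) for s in foo]

print(tag_prefix, tag_regex, tag_unanchored)
```

The first two lists agree and tag two strings; the unanchored version tags three, which is why the pattern needs the leading ^.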

Convert equation from string in postgresql

I am trying to write a query that takes in a string, where an equation in the form
x^3 + 0.0046x^2 - 0.159x +1.713
is expected. The equation is used to calculate new values in the output table from a list of existing values. Hence I will need to convert whatever the input equation string is into an equation that postgresql can process, e.g.
power(data.value,3) + 0.0046 * power(data.value,2) - 0.159 * data.value + 1.713
A few comforting constraints in this task are
The equation will always be in the form of a polynomial, e.g. sum(A_n * x^n)
The user will always use 'x' to represent the variable in the input equation
I have been pushing my queries into a string and executing it at the end, e.g.
_query TEXT;
SELECT 'select * from ' INTO _query;
SELECT _query || 'product.getlength( ' || min || ',' || max || ')' INTO _query;
RETURN QUERY EXECUTE _query;
Hence I know I only need to somehow
Replace each 'x' with 'data.value'
Find all the places in the equation string where a number immediately precedes an 'x', and add a '*'
Find all exponential operations (x^n) in the equation string and convert them to power(x,n)
This may very well be something very trivial for a lot of people, unfortunately postgresql is not my best skill and I have already spent way more time than I can afford to get this working. Any type of help is highly appreciated, cheers.
Your 9am-noon time frame is over, but here goes.
Every term of the polynomial has 4 elements:
Addition/subtraction modifier
Multiplier
Parameter, always x in your case
Power
The problem is that these elements are not always present. The first term has no addition element, although it could have a subtraction sign - which is then typically connected to the multiplier. Multipliers are only given when not equal to 1. The parameter is not present in the last term and neither is a power in the last two terms.
With optional capture groups in regular expression parsing you can sort out this mess and PostgreSQL has the handy regexp_matches() function for this:
SELECT * FROM
regexp_matches('x^3 + 0.0046x^2 - 0.159x +1.713',
'\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms);
The regular expression says this:
\s* Read 0 or more spaces.
([+-]?) Capture 0 or 1 plus or minus sign.
\s* Read 0 or more spaces.
([0-9.]*) Capture a number consisting of digits and a decimal point, if present.
(x?) Capture the parameter x. This is necessary to differentiate between the last two terms, see query below.
\^? Read the power symbol, if present. Must be escaped because an unescaped ^ is the start-of-string anchor.
([0-9]*) Capture an integer number, if present.
The g modifier repeats this process for every matching pattern in the string.
On your string this yields, in the form of string arrays:
| terms |
|-----------------|
| {'','',x,3} |
| {+,0.0046,x,2} |
| {-,0.159,x,''} |
| {+,1.713,'',''} |
| {'','','',''} |
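(The last row of empty strings appears because every element of the pattern is optional, so the pattern also matches the empty string at the end of the input. It does no harm in the query below, where it produces a NULL term that sum() ignores.)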
With this result, you can piece your query together:
SELECT id, sum(term)
FROM (
SELECT id,
CASE WHEN terms[1] = '-' THEN -1
WHEN terms[1] = '+' THEN 1
WHEN terms[3] = 'x' THEN 1 -- If no x then NULL
END *
CASE terms[2] WHEN '' THEN 1. ELSE terms[2]::float
END *
value ^ CASE WHEN terms[3] = '' THEN 0 -- If no x then 0 (x^0)
WHEN terms[4] = '' THEN 1 -- If no power then 1 (x^1)
ELSE terms[4]::int
END AS term
FROM data
JOIN regexp_matches('x^3 + 0.0046x^2 - 0.159x +1.713',
'\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms) ON true
) sub
GROUP BY id
ORDER BY id;
SQLFiddle
This assumes you have an id column to join on. If all you have is a value then you can still do it but you should then wrap the above query in a function that you feed the polynomial and the value. The power is assumed to be integral but you can easily turn that into a real number by adding a dot . to the regular expression and a ::float cast instead of ::int in the CASE statement. You can also support negative powers by adding another capture group to the regular expression and a case statement in the query, same as for the multiplier term; I leave this for your next weekend hackfest.
This query will also handle "odd" polynomials such as -4.3x^3+ 101.2 + 0.0046x^6 - 0.952x^7 +4x just so long as the pattern described above is maintained.
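To experiment with the term pattern outside the database, the same regular expression can be tried in Python. The sketch below is an illustration under the same assumptions as above (polynomial in x, non-negative integer powers); parse_polynomial and evaluate are hypothetical helper names, and the empty match at the end of the string is simply skipped.

```python
import re

# Same pattern as in the query: sign, multiplier, parameter, power.
TERM = re.compile(r"\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)")


def parse_polynomial(equation):
    """Return a list of (coefficient, exponent) pairs."""
    terms = []
    for sign, number, var, power in TERM.findall(equation):
        if not number and not var:
            continue  # empty match (e.g. at the end of the string)
        coefficient = float(number) if number else 1.0
        if sign == "-":
            coefficient = -coefficient
        if not var:
            exponent = 0           # no x: constant term (x^0)
        elif not power:
            exponent = 1           # x without a power (x^1)
        else:
            exponent = int(power)
        terms.append((coefficient, exponent))
    return terms


def evaluate(equation, value):
    """Evaluate the polynomial at `value`."""
    return sum(c * value ** e for c, e in parse_polynomial(equation))


equation = "x^3 + 0.0046x^2 - 0.159x +1.713"
print(parse_polynomial(equation))
print(evaluate(equation, 2))
```

Parsing into (coefficient, exponent) pairs mirrors the CASE expressions in the query, so the two can be compared term by term.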

Is there a way to usefully index a text column containing regex patterns?

I'm using PostgreSQL, currently version 9.2 but I'm open to upgrading.
In one of my tables, I have a column of type text that stores regex patterns.
CREATE TABLE foo (
id serial,
pattern text,
PRIMARY KEY(id)
);
CREATE INDEX foo_pattern_idx ON foo(pattern);
Then I do queries on it like this:
INSERT INTO foo (pattern) VALUES ('^abc.*$');
SELECT * FROM foo WHERE 'abc literal string' ~ pattern;
I understand that this is sort of a reverse LIKE or reverse pattern match. If it was the other, more common way, if my haystack was in the database, and my needle was anchored, I could use a btree index more or less effectively depending on the exact search pattern and data.
But the data that I have is a table of patterns and other data associated with the patterns. I need to ask the database which rows have patterns that match my query text. Is there a way to make this more efficient than a sequential scan that checks every row in my table?
There is no way.
Indexes require IMMUTABLE expressions. The result of your expression depends on the input string. I don't see any other way than to evaluate the expression for every row, meaning a sequential scan.
Related answer with more details for the IMMUTABLE angle:
Does PostgreSQL support "accent insensitive" collations?
The difference is that there is no workaround for your case, which is impossible to index. An index needs to store constant values in its tuples, and none are available here because the resulting value for every row is computed from the input. And you cannot transform the input without looking at the column value.
Postgres index usage is bound to operators and only indexes on expressions left of the operator can be used (due to the same logical constraints). More:
Can PostgreSQL index array columns?
Many operators define a COMMUTATOR, which allows the query planner / optimizer to flip the indexed expression to the left. Simple example: the commutator of = is =, and the commutator of > is <, and vice versa. The documentation:
the index-scan machinery expects to see the indexed column on the left of the operator it is given.
The regular expression match operator ~ has no commutator, again, because that's not possible. See for yourself:
SELECT oprname, oprright::regtype, oprleft::regtype, oprcom
FROM pg_operator
WHERE oprname = '~'
AND 'text'::regtype IN (oprright, oprleft);
oprname | oprright | oprleft | oprcom
---------+----------+-----------+------------
~ | text | name | 0
~ | text | text | 0
~ | text | character | 0
~ | text | citext | 0
And consult the manual here:
oprcom ... Commutator of this operator, if any
...
Unused column contain zeroes. For example, oprleft is zero for a prefix operator.
I have tried before and had to accept that it's impossible in principle.