How to find strings beginning with X? - regex

I am trying to identify strings that begin with X using the function regexm() in Stata.
My code:
for var lookin: count if regexm(X, "X")
I have tried using double quotes, square brackets, and adding patterns for the other characters in the string (X[0-9][0-9], etc.), but to no avail.
I expect the resultant number to be about 1000, but it returns 0.

The following works for me:
clear
input str22 foo
"Xhello"
"this is a X sentence"
"X a silly one"
"but serves the purpose"
end
generate tag = strmatch(foo, "X*")
list
     +------------------------------+
     |                    foo   tag |
     |------------------------------|
  1. |                 Xhello     1 |
  2. |   this is a X sentence     0 |
  3. |          X a silly one     1 |
  4. | but serves the purpose     0 |
     +------------------------------+
count if tag
2
This is the regular expression solution based on the above example:
generate tag = regexm(foo, "^X")

for in Stata is ancient and now undocumented syntax, unless you are using a very old version of Stata, in which case you would do better to flag that.
X is the default loop element, which is substituted everywhere it is found.
Hence your syntax -- looping over a single variable -- reduces to
count if regexm(lookin, "lookin")
and even without a data example we can believe that the answer is 0.
This would be legal and is closer to what you seek:
for Y in var lookin : count if regexm(Y, "X")
but the regular expression is wrong, as @Pearly Spencer points out.
Incidentally,
count if strpos(lookin, "X") == 1
is a direct alternative to your code.
In any Stata that supports regexm() you should be looping with foreach or forvalues.
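For illustration, a minimal foreach sketch of what you seem to want (assuming the variable is lookin and the goal is strings beginning with X, as in the answer above):
foreach v of varlist lookin {
    count if regexm(`v', "^X")
}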

Related

Matching a string containing special characters against a string without special characters

I'm trying to match strings using a Google Sheets query, where it is uncertain whether either string contains special characters. I'll try to explain as best as I can:
I have a table of data
+-----+------+-----+-----+
|  A  |  B   |  C  |  D  |
| x   | y    | ø   | z   |
| xx  | yy   | á   | zz  |
| xxx | yyy  | e   | zzz |
+-----+------+-----+-----+
My query function would look something like this:
=QUERY(A1:D3, "SELECT * WHERE (C = 'ø') OR (C = 'è') OR (C = 'a')")
Currently, this query only returns one row, because (C = 'ø') is an exact match with 'ø'; none of the others have a match.
For (C = 'è'), we can just replace all of the accented characters in the string with their un-accented equivalent.
In this case 'è' becomes 'e' and has a match - now the query will return a second row.
(I found a nice way to replace all accented characters in a string here.)
Finally, here is where my main problem sits: (C = 'a'). I can't figure out a way to make it match 'á', unless I check every accented variant of 'a', but that just seems silly.
It's not possible to do something like "... WHERE (CUSTOM_FUNCTION(C) = 'a')" either, sadly.
As I previously mentioned, either side of the match may or may not contain accented/special characters.
I should also mention that it won't be just a single character; it will be a whole string.
If anyone has any possible solutions to this, it would be greatly appreciated.
Instead of the QUERY formula, you could use a FILTER formula.
=FILTER(A2:D22,REGEXMATCH(C2:C22,"ø|è|á")=TRUE)
(Please adjust ranges to your needs. You can also add/remove more special characters.)
Functions used:
FILTER
REGEXMATCH

SAS converting character to numeric in data step

I cannot figure out why SAS is converting some character variables to numeric in a data step. I've confirmed that each variable is a character variable with a length of 1 using PROC CONTENTS. Below is the code that is generating this problem. I've not found any answers through google searches that make sense for this issue.
data graddist.graddist11;
set graddist.graddist10;
if (ACCT2301_FA16 | BIOL1106_FA16 | BIOL1107_FA16 | BIOL1306_FA16 |
BIOL2101_FA16 | BIOL2301_FA16 | CHEM1111_FA16 | CHEM1305_FA16 | CHEM1311_FA16 |
ECON2301_FA16 | ENGL1301_FA16 | ENGL1302_FA16 | ENGR1201_FA16 | GEOG1313_FA16 |
HIST1301_FA16 | HIST1302_FA16 | MARK3311_FA16 | MATH1314_FA16 | MATH2413_FA16 |
MATH2414_FA16 | PHIL2306_FA16 | POLS2305_FA16 | POLS2306_FA16 |
PSYC1301_FA16 | PSYC2320_FA16) in ('A','B','C','D','F','W','Q','I') then FA16courses_b=1;
else FA16courses_b=0;
Thank you,
Brian
Unfortunately, the in operator does not work like that in SAS. It only compares a single value on the left to a list of values on the right. Your code first evaluates everything to the left of the in operator and returns a TRUE/FALSE value (in SAS this is a numeric value, where 0 = FALSE and anything other than 0 = TRUE). It is then essentially saying:
if (0) in ('A'....'I') then...
or
if (1) in ('A'....'I') then...
You would need to rewrite your equivalence test in some other way such as :
if ACCT2301_FA16 in ('A'....'I')
or BIOL1106_FA16 in ('A'....'I')
or ...
As Robert notes, you can't compare multiple variables to multiple values at once in a single comparison.
The way to do this with a long string is with FINDC: concatenate the variables into one string, put the characters you are looking for into a character list, and compare; FINDC looks for any character from the character list in the string.
data graddist10;
length ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16 $1;
input
ACCT2301_FA16 $
BIOL1106_FA16 $
BIOL1107_FA16 $
;
datalines;
X Y Z
A A B
B A B
. . .
A X .
X . B
W Y A
;;;;
run;
data graddist11;
set graddist10;
if findc(catx('|',of ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16),
('ABCDFWQI')) then FA16courses_b=1;
else FA16courses_b=0;
run;
You made two syntax mistakes but they cancelled each other out and SAS considered your statement valid. First the IN () operator can only take one value on the left. Second the | token means logical or and is not used to separate items in a list.
The use of | between the variable names caused SAS to reduce the left hand side to a single value and so your use of IN was then syntactically correct. That is also why SAS noted that it had converted your character variables to numeric values so that they could be evaluated as boolean logic.
It is not clear what logic you meant for testing of those multiple values.
You might be able to use the verify() function. Perhaps something like this, but watch out for when all of the values are missing.
FA16courses_b = 0 ^= verify(cats(of ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16),'ABCDFWQI');
But it is probably easier to loop over the variables and test them one by one. So for example you could set the target variable to zero and then conditionally set it to one when one of the variables meets the condition.
FA16courses_b=0;
array c ACCT2301_FA16 BIOL1106_FA16 BIOL1107_FA16 ;
do i=1 to dim(c) while (FA16courses_b=0);
if indexc(c(i),'ABCDFWQI') then FA16courses_b=1;
end;

Convert equation from string in postgresql

I am trying to write a query that takes in a string, where an equation in the form
x^3 + 0.0046x^2 - 0.159x +1.713
is expected. The equation is used to calculate new values in the output table from a list of existing values. Hence I will need to convert whatever the input equation string is into an equation that postgresql can process, e.g.
power(data.value,3) + 0.0046 * power(data.value,2) - 0.159 * data.value + 1.713
A few comforting constraints in this task are
The equation will always be in the form of a polynomial, e.g. sum(A_n * x^n)
The user will always use 'x' to represent the variable in the input equation
I have been pushing my queries into a string and executing it at the end, e.g.
_query TEXT;
SELECT 'select * from ' INTO _query;
SELECT _query || 'product.getlength( ' || min || ',' || max || ')' INTO _query;
RETURN QUERY EXECUTE _query;
Hence I know I only need to somehow:
Replace the 'x's with 'data.value'
Find all the places in the equation string where a number immediately precedes an 'x', and add a '*'
Find all exponential operations (x^n) in the equation string and convert them to power(x,n)
This may very well be trivial for a lot of people; unfortunately, PostgreSQL is not my strongest skill and I have already spent way more time than I can afford to get this working. Any type of help is highly appreciated, cheers.
Your 9am-noon time frame is over, but here goes.
Every term of the polynomial has 4 elements:
Addition/subtraction modifier
Multiplier
Parameter, always x in your case
Power
The problem is that these elements are not always present. The first term has no addition element, although it could have a subtraction sign - which is then typically connected to the multiplier. Multipliers are only given when not equal to 1. The parameter is not present in the last term and neither is a power in the last two terms.
With optional capture groups in regular expression parsing you can sort out this mess and PostgreSQL has the handy regexp_matches() function for this:
SELECT * FROM
regexp_matches('x^3 + 0.0046x^2 - 0.159x +1.713',
'\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms);
The regular expression says this:
\s* Read 0 or more spaces.
([+-]?) Capture 0 or 1 plus or minus sign.
\s* Read 0 or more spaces.
([0-9.]*) Capture a number consisting of digit and a decimal dot, if present.
(x?) Capture the parameter x. This is necessary to differentiate between the last two terms, see query below.
\^? Read the power symbol, if present. Must be escaped because ^ is the constraint character.
([0-9]*) Capture an integer number, if present.
The g modifier repeats this process for every matching pattern in the string.
On your string this yields, in the form of string arrays:
| terms |
|-----------------|
| {'','',x,3} |
| {+,0.0046,x,2} |
| {-,0.159,x,''} |
| {+,1.713,'',''} |
| {'','','',''} |
(I have no idea why the last line with all empty strings comes out. Maybe a real expert can explain that.)
With this result, you can piece your query together:
SELECT id, sum(term)
FROM (
    SELECT id,
           CASE WHEN terms[1] = '-' THEN -1
                WHEN terms[1] = '+' THEN 1
                WHEN terms[3] = 'x' THEN 1   -- If no x then NULL
           END *
           CASE terms[2] WHEN '' THEN 1. ELSE terms[2]::float
           END *
           value ^ CASE WHEN terms[3] = '' THEN 0   -- If no x then 0 (x^0)
                        WHEN terms[4] = '' THEN 1   -- If no power then 1 (x^1)
                        ELSE terms[4]::int
                   END AS term
    FROM data
    JOIN regexp_matches('x^3 + 0.0046x^2 - 0.159x +1.713',
                        '\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms) ON true
) sub
GROUP BY id
ORDER BY id;
SQLFiddle
This assumes you have an id column to join on. If all you have is a value then you can still do it but you should then wrap the above query in a function that you feed the polynomial and the value. The power is assumed to be integral but you can easily turn that into a real number by adding a dot . to the regular expression and a ::float cast instead of ::int in the CASE statement. You can also support negative powers by adding another capture group to the regular expression and a case statement in the query, same as for the multiplier term; I leave this for your next weekend hackfest.
This query will also handle "odd" polynomials such as -4.3x^3+ 101.2 + 0.0046x^6 - 0.952x^7 +4x just so long as the pattern described above is maintained.
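As a rough sketch of that wrap-it-in-a-function idea: the name eval_polynomial and its signature are invented here, and it evaluates the polynomial for a single value instead of joining against a table.
CREATE OR REPLACE FUNCTION eval_polynomial(_poly text, _value float)
RETURNS float
LANGUAGE sql AS
$func$
SELECT sum(
         CASE WHEN terms[1] = '-' THEN -1
              WHEN terms[1] = '+' THEN 1
              WHEN terms[3] = 'x' THEN 1   -- if no x then NULL; row ignored by sum()
         END
       * CASE terms[2] WHEN '' THEN 1. ELSE terms[2]::float END
       * _value ^ CASE WHEN terms[3] = '' THEN 0
                       WHEN terms[4] = '' THEN 1
                       ELSE terms[4]::int
                  END)
FROM regexp_matches(_poly,
       '\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms)
$func$;
SELECT eval_polynomial('x^3 + 0.0046x^2 - 0.159x +1.713', 2);  -- 8 + 0.0184 - 0.318 + 1.713 = 9.4134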

Bison: Shift Reduce Conflict

I believe I am having trouble understanding how shift/reduce conflicts work. I understand that bison can look ahead by one token, so I don't understand why I am having this issue.
In my language a List is defined as a set of numbers or lists between [ ].
For example [] [1] [1 2] [1 [2] 3] are all valid lists.
Here are the definitions that are causing problems
value: num
| stringValue
| list
;
list: LEFTBRACE RIGHTBRACE
| LEFTBRACE list RIGHTBRACE
| num list
| RIGHTBRACE
;
The conflict comes from the number: the parser doesn't know whether to shift by the list rule or reduce by the value rule. I am confused because can't it check to see whether a list follows the number?
Any insight on how I should proceed would be greatly appreciated.
I think I'd define things differently, in a way that avoids the problem to start with, something like:
value: num
| stringvalue
| list
;
items:
| items value
;
list: LEFTBRACE items RIGHTBRACE;
Edit: Separating lists of numbers from lists of strings can't be done cleanly unless you eliminate empty lists. The problem that arises is that you want to allow an empty list to be included in a list of numbers or a list of strings, but looking at the empty list itself doesn't let the parser decide which. For example:
[ [][][][][][][][] 1 ]
To figure out what kind of list this is, the parser would have to look ahead all the way to the 1 -- but an LALR(N) parser can only look ahead N symbols to make that decision. Yacc (and Byacc, Bison, etc.) only do LALR(1), so they can only look ahead one symbol. That leaves a few possibilities:
eliminate the possibility of empty lists entirely
Have the lexer treat an arbitrary number of consecutive empty lists as a single token
use a parser generator that isn't limited to LALR(1) grammars
Inside of a yacc grammar, however, I don't think there's much you can do -- your grammar simply doesn't fit yacc's limitations.
With a bottom up parser it is generally a good idea to avoid right recursion which is what you have in the starred line of the grammar below.
list: LEFTBRACE RIGHTBRACE
| LEFTBRACE list RIGHTBRACE
**| num list**
| RIGHTBRACE
Instead have you thought about something like this?
value: value num
    | value string
    | value list
    | num
    | string
    | list
    ;
list: LEFTBRACE RIGHTBRACE
    | LEFTBRACE value RIGHTBRACE
    ;
This way you have no right recursion and the nesting logic of the grammar is more simply expressed.

Escape function for regular expression or LIKE patterns

To forgo reading the entire problem, my basic question is:
Is there a function in PostgreSQL to escape regular expression characters in a string?
I've probed the documentation but was unable to find such a function.
Here is the full problem:
In a PostgreSQL database, I have a column with unique names in it. I also have a process which periodically inserts names into this field, and, to prevent duplicates, if it needs to enter a name that already exists, it appends a space and parentheses with a count to the end.
i.e. Name, Name (1), Name (2), Name (3), etc.
As it stands, I use the following code to find the next number to add in the series (written in plpgsql):
var_name_id := 1;
SELECT CAST(substring(a.name from E'\\((\\d+)\\)$') AS int)
INTO var_last_name_id
FROM my_table.names a
WHERE a.name LIKE var_name || ' (%)'
ORDER BY CAST(substring(a.name from E'\\((\\d+)\\)$') AS int) DESC
LIMIT 1;
IF var_last_name_id IS NOT NULL THEN
var_name_id = var_last_name_id + 1;
END IF;
var_new_name := var_name || ' (' || var_name_id || ')';
(var_name contains the name I'm trying to insert.)
This works for now, but the problem lies in the WHERE statement:
WHERE a.name LIKE var_name || ' (%)'
This check doesn't verify that the % in question is a number, and it doesn't account for multiple parentheses, as in something like "Name ((1))", and if either case existed a cast exception would be thrown.
The WHERE statement really needs to be something more like:
WHERE a.r1_name ~* var_name || E' \\(\\d+\\)'
But var_name could contain regular expression characters, which leads to the question above: Is there a function in PostgreSQL that escapes regular expression characters in a string, so I could do something like:
WHERE a.r1_name ~* regex_escape(var_name) || E' \\(\\d+\\)'
Any suggestions are much appreciated, including a possible reworking of my duplicate name solution.
To address the question at the top:
Assuming standard_conforming_strings = on, as has been the default since Postgres 9.1.
Regular expression escape function
Let's start with a complete list of characters with special meaning in regular expression patterns:
!$()*+.:<=>?[\]^{|}-
Wrapped in a bracket expression most of them lose their special meaning - with a few exceptions:
- needs to be first or last or it signifies a range of characters.
] and \ have to be escaped with \ (in the replacement, too).
After adding capturing parentheses for the back reference below we get this regexp pattern:
([!$()*+.:<=>?[\\\]^{|}-])
Using it, this function escapes all special characters with a backslash (\) - thereby removing the special meaning:
CREATE OR REPLACE FUNCTION f_regexp_escape(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT regexp_replace($1, '([!$()*+.:<=>?[\\\]^{|}-])', '\\\1', 'g')
$func$;
Add PARALLEL SAFE (because it is) in Postgres 10 or later to allow parallelism for queries using it.
Demo
SELECT f_regexp_escape('test(1) > Foo*');
Returns:
test\(1\) \> Foo\*
And while:
SELECT 'test(1) > Foo*' ~ 'test(1) > Foo*';
returns FALSE, which may come as a surprise to naive users,
SELECT 'test(1) > Foo*' ~ f_regexp_escape('test(1) > Foo*');
Returns TRUE as it should now.
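Applied to the WHERE clause in the question (with standard_conforming_strings = on, so no E'' needed), this might look like the following sketch, with parentheses added so the concatenation binds before ~*:
WHERE a.name ~* (f_regexp_escape(var_name) || ' \(\d+\)')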
LIKE escape function
For completeness, here is the counterpart for LIKE patterns, where only three characters are special:
\%_
The manual:
The default escape character is the backslash but a different one can be selected by using the ESCAPE clause.
This function assumes the default:
CREATE OR REPLACE FUNCTION f_like_escape(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT replace(replace(replace($1
       , '\', '\\')  -- must come 1st
       , '%', '\%')
       , '_', '\_');
$func$;
We could use the more elegant regexp_replace() here, too, but for the few characters, a cascade of replace() functions is faster.
Again, PARALLEL SAFE in Postgres 10 or later.
Demo
SELECT f_like_escape('20% \ 50% low_prices');
Returns:
20\% \\ 50\% low\_prices
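Likewise, the LIKE check from the question could become something like this sketch:
WHERE a.name LIKE f_like_escape(var_name) || ' (%)'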
how about trying something like this, substituting var_name for my hard-coded 'John Bernard':
create table my_table(name text primary key);
insert into my_table(name) values ('John Bernard'),
('John Bernard (1)'),
('John Bernard (2)'),
('John Bernard (3)');
select max(regexp_replace(substring(name, 13), ' |\(|\)', '', 'g')::integer+1)
from my_table
where substring(name, 1, 12)='John Bernard'
and substring(name, 13)~'^ \([1-9][0-9]*\)$';
 max
-----
   4
(1 row)
one caveat: I am assuming single-user access to the database while this process is running (and so are you in your approach). If that is not the case then the max(n)+1 approach will not be a good one.
Are you at liberty to change the schema? I think the problem would go away if you could use a composite primary key:
create table foo (
    name text not null,
    number integer not null,
    primary key (name, number)
);
It then becomes the duty of the display layer to display Fred #0 as "Fred", Fred #1 as "Fred (1)", &c.
If you like, you can create a view for this duty. Here's the data:
=> select * from foo;
  name  | number
--------+--------
 Fred   |      0
 Fred   |      1
 Barney |      0
 Betty  |      0
 Betty  |      1
 Betty  |      2
(6 rows)
The view:
create or replace view foo_view as
select *,
       case
           when number = 0 then name
           else name || ' (' || number || ')'
       end as name_and_number
from foo;
And the result:
=> select * from foo_view;
  name  | number | name_and_number
--------+--------+-----------------
 Fred   |      0 | Fred
 Fred   |      1 | Fred (1)
 Barney |      0 | Barney
 Betty  |      0 | Betty
 Betty  |      1 | Betty (1)
 Betty  |      2 | Betty (2)
(6 rows)
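Under this schema, computing the next number for a given name also reduces to a one-line aggregate. A sketch, reusing var_name from the question (for example inside the same plpgsql function); the concurrency caveat mentioned in the previous answer still applies:
INSERT INTO foo (name, number)
SELECT var_name, coalesce(max(number) + 1, 0)
FROM foo
WHERE name = var_name;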