Matching a string containing special characters against a string without special characters - regex

I'm trying to match strings using a Google Sheets query where both strings are uncertain if they contain special characters. I'll try to explain as best as I can:
I have a table of data
+-----+------+-----+-----+
| A | B | C | D |
|x | y |ø |z |
|xx | yy |á |zz |
|xxx | yyy |e |zzz |
+-----+------+-----+-----+
My query function would look something like this:
=QUERY(A1:D3, "SELECT * WHERE (C = 'ø') OR (C = 'è') OR (C = 'a')")
Currently, using this query will only return 1 row because, (C = 'ø') is an exact match with 'ø', however none of the others have a match.
For (C = 'è'), we can just replace all of the accented characters in the string with their un-accented equivalent.
In this case 'è' becomes 'e' and has a match - now the query will return a second row.
(I found a nice way to replace all accented characters in a string here.)
Finally, here is where my main problem sits: (C = 'a'). I can't figure out a way to make it match 'á', unless I check every accented variant of 'a', but that just seems silly.
It's not possible to do something like "... WHERE (CUSTOM_FUNCTION(C) = 'a')" either, sadly.
As I previously mentioned, either side of the match may or may not contain accented/special characters.
I should also mention that it wont be just a single character, it will be a whole string.
If anyone has any possible solutions to this, it would be greatly appreciated.

Instead of the QUERY formula, you could use a FILTER formula.
=FILTER(A2:D22,REGEXMATCH(C2:C22,"ø|è|á")=TRUE)
(Please adjust ranges to your needs. You can also add/remove more special characters.)
Functions used:
FILTER
REGEXMATCH

Related

A single Regex Match Entire String first to then break up into multiple components

I am trying to come up with a RegEx (POSIX like) in a vendor application that returns data looking like illustrated below and presents a single line of data at a time so I do not need to account for multiple rows and need to match a row indvidually.
It can return one or more values in the string result
The application doesn't just let me use a "\d+\.\d+" to capture the component out of the string and I need to map all components of a row of data to a variable unfortunately even if I am going to discard it or otherwise it returns a negative match result.
My data looks like the following with the weird underscore padding.
USER | ___________ 3.58625 | ___________ 7.02235 |
USER | ___________ 10.02625 | ___________ 15.23625 |
The syntax is supports is
Matches REGEX "(Var1 Regex), (Var2 Regex), (Var3 Regex), (Var 4 regex), (Var 5 regex)" and the entire string must match the aggregation of the RegEx components, a single character off and you get nothing.
The "|" characters are field separators for the data.
So in the above what I need is a RegEx that takes it up to the beginning of the numeric and puts that in Var1, then capture the numeric value with decimal point in var 2, then capture up to the next numeric in Var 3, and then keep the numeric in var 4, then capture the space and end field | character into var 5. Only Var 2 and 4 will be useful but I have to capture the entire string.
I have mainly tried capturing between the bars "|" using ^.*\|(.*).\|*$ from this question.
I have also tried the multiple variable ([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+) mentioned in this question.
I seem to be missing something to get it right when I try using them via RegExr and I feel like I am missing something pretty simple.
In RegExr I never get more than one part of the string I either get just the number, the equivalent of the entire string in a single variable, or just the number which don't work in this context to accomplish the required goal.
The only example the documentation provides is the following from like a SysLog entry of something like in this example I'm consolidating there with "Fault with Resource Name: Disk Specific Problem: Offline"
WHERE value matches regex "(.)Resource Name: (.), Specific Problem: ([^,]),(.)"
SET _Rrsc = var02
SET _Prob = var03
I've spun my wheels on this for several hours so would appreciate any guidance / help to get me over this hump.
Something like this should work:
(\D+)([\d.]+)(\D+)([\d.]+)(.*)
Or in normal words: Capture everything but numbers, capture a decimal number, capture everything but numbers, capture a decimal number, capture everything.
Using USER | ___________ 10.02625 | ___________ 15.23625 |
$1 = USER | ___________  
$2 = 10.02625
$3 =  | ___________  
$4 = 15.23625
$5 =  |

Vim - regex for changing bool variable checking

I am working on a C project an I want to change all bool-variable checking from
if(!a)
to
if(a == false)
in order to make the code easier to read(I want to do the same with while statements).
Anyway I'm using the following regex, which searches for an exclamation mark followed by a lowercase character and for the last closing parenthesis on the line.
%s/\(.*\)!\([a-z]\)\(.*\))\([^)]+\)/\1\2\3 == false)\4/g
I'm sorry for asking you to look over it but i can't understand why it would fail.
Also, is there an easier way of solving this problem and of using vim regex in general?
One solution should be this one:
%s/\(.*\)(\(\s*\)!\(\w\+\))/\1(\3 == false)/gc
Here, we do the following:
%s/\(.*\)(\(\s*\)!\(\w\+\))/\1(\3 == false)/gc
\--+-/|\--+--/|\---+--/|
| | | | | finally test for a single `)`.
| | | | (\3): then for one or more word characters (the var name).
| | | the single `!`
| | (\2): then for any amount of white space before the `!`
| the single `(`
(\1): test for any characters before a single `(`
Then, it's replaced by the first, third pattern, and then appends the text == false, opening and closing the parentheses as needed.
To do this in vim, you could use the following:
%s/\(if(\)!\([^)]\+\)/\1\2==false/c
make sure that only if(!var)-constructs are matched, you could change that to while for the next task
c asks for confirmation for every occurence
As #Kent said this is not a small undertaking. However for the simple case of just if(!a) it can be done.
:%s/\<if(\zs!\(\k\+\)\ze)/\1 == false/c
Explanation:
Start by making sure if is at a word bound by \<. This ensures it isn't part of some function name.
\zs and \ze set the start and end of the match respectively.
Capture the variable via the keyword class \k (\w works too) ending up with \(\k\+\)
For extra safety use the c flag to confirm each substation.
Thoughts:
This will need to be updated for other constructs, e.g. while
May need to make alterations for extra white-space, e.g. \<if\s*(\s*\zs!\(\k\+\)\ze\s*)
May want to use [a-z0-9_] instead of \k or \w to avoid capturing macros
There are instances where you may not have a construct: foo = !a && b;
This only handles the false cases. Doing a == true may be far trickier
Depending on your case it might be safest to just do the following:
:%s/!\([a-z0-9]\+\)/\1 == false/gc
On top of the answers already presented, I would say that the code does not smell like it needs refactoring. For a global regex replacement, the primary problem is to find
all bool-variables
and distinguish them from pointers, etc.

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curcny
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curcny.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letter listed below in expr and can have two digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do?
Assuming there are no leading or trailing whitespaces and only upcase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab nor whether it covers Perl Compatible Regex. If it fails, try e.g. with instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space1.
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more genereic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
1 This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and or trailing spaces at the edge, there is a very simple command for that:
monthyear = trim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach, basically this searches the letter before your year number:
s = 'MCDZ3 Curcny'
p = regexp(s,'\d')
s(min(p)
s(min(p)-1:max(p))

Escape function for regular expression or LIKE patterns

To forgo reading the entire problem, my basic question is:
Is there a function in PostgreSQL to escape regular expression characters in a string?
I've probed the documentation but was unable to find such a function.
Here is the full problem:
In a PostgreSQL database, I have a column with unique names in it. I also have a process which periodically inserts names into this field, and, to prevent duplicates, if it needs to enter a name that already exists, it appends a space and parentheses with a count to the end.
i.e. Name, Name (1), Name (2), Name (3), etc.
As it stands, I use the following code to find the next number to add in the series (written in plpgsql):
var_name_id := 1;
SELECT CAST(substring(a.name from E'\\((\\d+)\\)$') AS int)
INTO var_last_name_id
FROM my_table.names a
WHERE a.name LIKE var_name || ' (%)'
ORDER BY CAST(substring(a.name from E'\\((\\d+)\\)$') AS int) DESC
LIMIT 1;
IF var_last_name_id IS NOT NULL THEN
var_name_id = var_last_name_id + 1;
END IF;
var_new_name := var_name || ' (' || var_name_id || ')';
(var_name contains the name I'm trying to insert.)
This works for now, but the problem lies in the WHERE statement:
WHERE a.name LIKE var_name || ' (%)'
This check doesn't verify that the % in question is a number, and it doesn't account for multiple parentheses, as in something like "Name ((1))", and if either case existed a cast exception would be thrown.
The WHERE statement really needs to be something more like:
WHERE a.r1_name ~* var_name || E' \\(\\d+\\)'
But var_name could contain regular expression characters, which leads to the question above: Is there a function in PostgreSQL that escapes regular expression characters in a string, so I could do something like:
WHERE a.r1_name ~* regex_escape(var_name) || E' \\(\\d+\\)'
Any suggestions are much appreciated, including a possible reworking of my duplicate name solution.
To address the question at the top:
Assuming standard_conforming_strings = on, like it's default since Postgres 9.1.
Regular expression escape function
Let's start with a complete list of characters with special meaning in regular expression patterns:
!$()*+.:<=>?[\]^{|}-
Wrapped in a bracket expression most of them lose their special meaning - with a few exceptions:
- needs to be first or last or it signifies a range of characters.
] and \ have to be escaped with \ (in the replacement, too).
After adding capturing parentheses for the back reference below we get this regexp pattern:
([!$()*+.:<=>?[\\\]^{|}-])
Using it, this function escapes all special characters with a backslash (\) - thereby removing the special meaning:
CREATE OR REPLACE FUNCTION f_regexp_escape(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT regexp_replace($1, '([!$()*+.:<=>?[\\\]^{|}-])', '\\\1', 'g')
$func$;
Add PARALLEL SAFE (because it is) in Postgres 10 or later to allow parallelism for queries using it.
Demo
SELECT f_regexp_escape('test(1) > Foo*');
Returns:
test\(1\) \> Foo\*
And while:
SELECT 'test(1) > Foo*' ~ 'test(1) > Foo*';
returns FALSE, which may come as a surprise to naive users,
SELECT 'test(1) > Foo*' ~ f_regexp_escape('test(1) > Foo*');
Returns TRUE as it should now.
LIKE escape function
For completeness, the pendant for LIKE patterns, where only three characters are special:
\%_
The manual:
The default escape character is the backslash but a different one can be selected by using the ESCAPE clause.
This function assumes the default:
CREATE OR REPLACE FUNCTION f_like_escape(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT replace(replace(replace($1
, '\', '\\') -- must come 1st
, '%', '\%')
, '_', '\_');
$func$;
We could use the more elegant regexp_replace() here, too, but for the few characters, a cascade of replace() functions is faster.
Again, PARALLEL SAFE in Postgres 10 or later.
Demo
SELECT f_like_escape('20% \ 50% low_prices');
Returns:
20\% \\ 50\% low\_prices
how about trying something like this, substituting var_name for my hard-coded 'John Bernard':
create table my_table(name text primary key);
insert into my_table(name) values ('John Bernard'),
('John Bernard (1)'),
('John Bernard (2)'),
('John Bernard (3)');
select max(regexp_replace(substring(name, 13), ' |\(|\)', '', 'g')::integer+1)
from my_table
where substring(name, 1, 12)='John Bernard'
and substring(name, 13)~'^ \([1-9][0-9]*\)$';
max
-----
4
(1 row)
one caveat: I am assuming single-user access to the database while this process is running (and so are you in your approach). If that is not the case then the max(n)+1 approach will not be a good one.
Are you at liberty to change the schema? I think the problem would go away if you could use a composite primary key:
name text not null,
number integer not null,
primary key (name, number)
It then becomes the duty of the display layer to display Fred #0 as "Fred", Fred #1 as "Fred (1)", &c.
If you like, you can create a view for this duty. Here's the data:
=> select * from foo;
name | number
--------+--------
Fred | 0
Fred | 1
Barney | 0
Betty | 0
Betty | 1
Betty | 2
(6 rows)
The view:
create or replace view foo_view as
select *,
case
when number = 0 then
name
else
name || ' (' || number || ')'
end as name_and_number
from foo;
And the result:
=> select * from foo_view;
name | number | name_and_number
--------+--------+-----------------
Fred | 0 | Fred
Fred | 1 | Fred (1)
Barney | 0 | Barney
Betty | 0 | Betty
Betty | 1 | Betty (1)
Betty | 2 | Betty (2)
(6 rows)

How can I use a regular expression to match something in the form 'stuff=foo' 'stuff' = 'stuff' 'more stuff'

I need a regexp to match something like this,
'text' | 'text' | ... | 'text'(~text) = 'text' | 'text' | ... | 'text'
I just want to divide it up into two sections, the part on the left of the equals sign and the part on the right. Any of the 'text' entries can have "=" between the ' characters though. I was thinking of trying to match an even number of 's followed by a =, but I'm not sure how to match an even number of something.. Also note I don't know how many entries on either side there could be. A couple examples,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33) = '51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
Should extract,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33)
and,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637) = '227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Should extract,
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637)
and,
'227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Also note the little tilda bit at the end of the first section is optional, so I can't just look for that.
Actually I wouldn't use a regex for that at all. Assuming your language has a split operation, I'd first split on the | character to get a list of:
'51NL9637X33'
'ISL6262ACRZ-T'
'QFN'(~51NL9637X33) = '51NL9637X33'
'ISL6262ACRZ-T'
'INTERSIL'
'QFN7SQ-HT1_P49'
'()'
Then I'd split each of them on the = character to get the key and (optional) value:
'51NL9637X33' <null>
'ISL6262ACRZ-T' <null>
'QFN'(~51NL9637X33) '51NL9637X33'
'ISL6262ACRZ-T' <null>
'INTERSIL' <null>
'QFN7SQ-HT1_P49' <null>
'()' <null>
You haven't specified why you think a regex is the right tool for the job but most modern languages also have a split capability and regexes aren't necessarily the answer to every requirement.
I agree with paxdiablo in that regular expressions might not be the most suitable tool for this task, depending on the language you are working with.
The question "How do I match an even number of characters?" is interesting nonetheless, and here is how I'd do it in your case:
(?:'[^']*'|[^=])*(?==)
This expression matches the left part of your entry by looking for a ' at its current position. If it finds one, it runs forward to the next ' and thereby only matching an even number of quotes. If it does not find a ' it matches anything that is not an equal sign and then assures that an equal sign follows the matched string. It works because the regex engine evaluates OR constructs from left to right.
You could get the left and right parts in two capturing groups by using
((?:'[^']*'|[^=])*)=(.*)
I recommend http://gskinner.com/RegExr/ for tinkering with regular expressions. =)
As paxdiablo said, you almost certainly don't want to use a regex here. The split suggestion isn't bad; I myself would probably use a parser here—there's a lot of structure to exploit. The idea here is that you formally specify the syntax of what you have—sort of like what you gave us, only rigorous. So, for instance: a field is a sequence of non-single-quote characters surrounded by single quotes; a fields is any number of fields separated by white space, a |, and more white space; a tilde is non-right-parenthesis characters surrounded by (~ and ); and an expr is a fields, optional whitespace, an optional tilde, a =, optional whitespace, and another fields. How you express this depends on the language you are using. In Haskell, for instance, using the Parsec library, you write each of those parsers as follows:
import Text.ParserCombinators.Parsec
field :: Parser String
field = between (char '\'') (char '\'') $ many (noneOf "'\n")
tilde :: Parser String
tilde = between (string "(~") (char ')') $ many (noneOf ")\n")
fields :: Parser [String]
fields = field `sepBy` (try $ spaces >> char '|' >> spaces)
expr :: Parser ([String],Maybe String,[String])
expr = do left <- fields
spaces
opt <- optionMaybe tilde
spaces >> char '=' >> spaces
right <- fields
(char '\n' >> return ()) <|> eof
return (left, opt, right)
Understanding precisely how this code works isn't really important; the basic idea is to break down what you're parsing, express it in formal rules, and build it back up out of the smaller components. And for something like this, it'll be much cleaner than a regex.
If you really want a regex, here you go (barely tested):
^\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?(\(~[^)\n]*\))?\s*=\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?\s*$
See why I recommend a parser? When I first wrote this, I got at least two things wrong which I picked up (one per test), and there's probably something else. And I didn't insert capturing groups where you wanted them because I wasn't sure where they'd go. Now yes, I could have made this more readable by inserting comments, etc. And after all, regexen have their uses! However, the point is: this is not one of them. Stick with something better.