Autohotkey RegExReplace Skip unmatched pattern - regex

How can I skip an unmatched line in the input when replacing with a regex?
For example, below are the contents of my test.txt:
elkay_iyer#yahoo.com
elkay_qwer#yahoo.com
elke engineering ltd.,#yahoo.com
elke0265#yahoo.com
elke#yahoo.com
Below is my AutoHotkey script with the regex code:
ReplaceEmailsRegEx := "i)([a-z0-9]+(\.*|\_*|\-*))+#([a-z][a-z0-9\-]+(\.|\-*\.))+[a-z]{2,6}"
RemoveDuplicateCharactersRegEx := "s)(.)(?=.*\1)"
Try{
FileRead, EmailFromTxtFile, test.txt
OtherThanEmails :=RegExReplace(EmailFromTxtFile,ReplaceEmailsRegEx)
Chars :=RegExReplace(OtherThanEmails,RemoveDuplicateCharactersRegEx)
Loop{
StringReplace, OtherThanEmails, OtherThanEmails, `r`n`r`n,`r`n, UseErrorLevel
If ErrorLevel = 0
Break
}
If (StrLen(OtherThanEmails)){
Msgbox The Characters found other than email:`n%OtherThanEmails%
}
}
catch e {
ErrorString:="what: " . e.what . "file: " . e.file . " line: " . e.line . " msg: " . e.message . " extra: " . e.extra
Msgbox An Exception was thrown`n%ErrorString%
}
Return
When it runs the replacement on test.txt, it throws an error:
e.what contains 'RegExReplace', e.line is 10
It executes without error when I remove the 3rd email from test.txt. So how can I change my regex to skip the problematic string?

The problem you have is catastrophic backtracking due to the nested quantifier in the beginning: ([a-z0-9]+(\.*|\_*|\-*))+. Here, the ., _ and - are all optional due to the * quantifier and thus your pattern gets reduced to ([a-z0-9]+)+.
I suggest "unrolling" the first subpattern to make it linear:
i)[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*#([a-z][-a-z0-9]+\.)+[a-z]{2,6}
Or
i)[a-z0-9]+(?:([._-])\1*[a-z0-9]+)*#(?:[a-z][-a-z0-9]+\.)+[a-z]{2,6}
You may even remove \1* if you do not allow more than one ., _ or - between "words".
Also, there is no need to use \-* with alternation in (\.|\-*\.), as the hyphen is already matched by the preceding character class; thus, this subpattern can be reduced to \..
See the regex demo
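As a rough illustration of the unrolled pattern, here is a minimal Python sketch. It assumes Python's re module rather than AutoHotkey's PCRE engine, and the names EMAIL_RE, SAMPLE and leftover_lines are made up for this example. It strips every matched address from the sample lines and reports whatever is left over:

```python
import re

# First unrolled pattern from the answer above (domain group made non-capturing).
# "#" is the separator used in the sample data.
EMAIL_RE = re.compile(
    r"[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*#(?:[a-z][-a-z0-9]+\.)+[a-z]{2,6}",
    re.IGNORECASE,
)

SAMPLE = "\n".join([
    "elkay_iyer#yahoo.com",
    "elkay_qwer#yahoo.com",
    "elke engineering ltd.,#yahoo.com",
    "elke0265#yahoo.com",
    "elke#yahoo.com",
])


def leftover_lines(text):
    """Remove every matched address and return the non-empty lines that remain."""
    stripped = EMAIL_RE.sub("", text)
    return [line for line in stripped.splitlines() if line.strip()]


print(leftover_lines(SAMPLE))
```

The third line has no alphanumeric character directly before the #, so nothing in it matches and it is reported as a leftover instead of stalling the engine.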

Related

Remove only non-leading and non-trailing spaces from a string in Ruby?

I'm trying to write a Ruby method that will return true only if the input is a valid phone number, which means, among other rules, it can have spaces and/or dashes between the digits, but not before or after the digits.
In a sense, I need a method that does the opposite of String#strip! (remove all spaces except leading and trailing spaces), plus the same for dashes.
I've tried using String#gsub!, but when I try to match a space or a dash between digits, then it replaces the digits as well as the space/dash.
Here's an example of the code I'm using to remove spaces. I figure once I know how to do that, it will be the same story with the dashes.
def valid_phone_number?(number)
phone_number_pattern = /^0[^0]\d{8}$/
# remove spaces
number.gsub!(/\d\s+\d/, "")
return number.match?(phone_number_pattern)
end
What happens is if I call the method with the following input:
valid_phone_number?(" 09 777 55 888 ")
I get false because the gsub! call transforms the number into " 0788 ", i.e. it gets rid of the digits around the spaces as well as the spaces. What I want it to do is just to get rid of the inner spaces, so as to produce " 0977755888 ".
I've tried
number.gsub!(/\d(\s+)\d/, "") and number.gsub!(/\d(\s+)\d/) { |match| "" } to no avail.
Thank you!!
If you want to return a boolean, you might for example use a pattern that accepts leading and trailing spaces, and matches 10 digits (as in your example data) where there can be optional spaces or hyphens in between.
^ *\d(?:[ -]?\d){9} *$
For example
def valid_phone_number?(number)
phone_number_pattern = /^ *\d(?:[ -]*\d){9} *$/
return number.match?(phone_number_pattern)
end
See a Ruby demo and a regex demo.
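For comparison, the same validation can be sketched in Python. This is only an illustration under the assumption that exactly 10 digits are wanted, as in the answer above; PHONE_RE and valid_phone_number are names chosen for the example:

```python
import re

# Same pattern as above; fullmatch anchors at both ends, so ^ and $ are not needed.
PHONE_RE = re.compile(r" *\d(?:[ -]*\d){9} *")


def valid_phone_number(number):
    """True if the input holds exactly 10 digits with optional inner spaces/dashes."""
    return PHONE_RE.fullmatch(number) is not None


for candidate in [" 09 777 55 888 ", "09-777-55-888", "09 777 55 88", "-0977755888"]:
    print(repr(candidate), valid_phone_number(candidate))
```

Because the pattern only validates, the input string is never modified, which sidesteps the gsub! problem from the question.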
To remove spaces and hyphens in between digits, try:
(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)
See an online regex demo
(?: - Open non-capture group;
\d+ - Match 1+ digits;
| - Or;
\G(?!^)\d+ - Assert position at the end of the previous match (but not at the start of the line), followed by 1+ digits;
)\K - Close non-capture group and reset matching point;
[- ]+ - Match 1+ space/hyphen;
(?=\d) - Assert position is followed by digits.
p " 09 777 55 888 ".gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, '')
Prints: " 0977755888 "
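Python's standard re module supports neither \K nor \G, so a direct port of that pattern is not possible there. A different but related technique, sketched here as an assumption rather than a translation, is to wrap the separator run in a lookbehind and a lookahead so the digits are asserted but never consumed (strip_inner_separators is a hypothetical name):

```python
import re


def strip_inner_separators(number):
    """Delete runs of spaces/hyphens that sit directly between two digits."""
    # (?<=\d) and (?=\d) only assert the neighbouring digits; they are not
    # part of the match, so they survive the substitution.
    return re.sub(r"(?<=\d)[- ]+(?=\d)", "", number)


print(repr(strip_inner_separators(" 09 777 55 888 ")))
```

This is also why the original /\d\s+\d/ attempt removed digits: there the digits were part of the match itself.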
Using a very simple regex (/\d/ tests for a digit):
str = " 09 777 55 888 "
r = str.index(/\d/)..str.rindex(/\d/)
str[r] = str[r].delete(" -")
p str # => " 0977755888 "
Passing a block to gsub is an option, capture groups available as globals:
>> str = " 09 777 55 888 "
# simple, easy to understand
>> str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }
=> " 0977755888 "
# a different take on @steenslag's answer, to avoid using range.
>> s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s
=> " 0977755888 "
Benchmark, not that it matters that much:
require 'benchmark'
n = 1_000_000
puts(Benchmark.bmbm do |x|
# just a match
x.report("match") { n.times {str.match(/^ *\d(?:[ -]*\d){9} *$/) } }
# use regex in []=
x.report("[//]=") { n.times {s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s } }
# use range in []=
x.report("[..]=") { n.times {s = str.dup; r = s.index(/\d/)..s.rindex(/\d/); s[r] = s[r].delete(" -"); s } }
# block in gsub
x.report("block") { n.times {str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }} }
# long regex
x.report("regex") { n.times {str.gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, "")} }
end)
Rehearsal -----------------------------------------
match 0.997458 0.000004 0.997462 ( 0.998003)
[//]= 1.822698 0.003983 1.826681 ( 1.827574)
[..]= 3.095630 0.007955 3.103585 ( 3.105489)
block 3.515401 0.003982 3.519383 ( 3.521392)
regex 4.761748 0.007967 4.769715 ( 4.772972)
------------------------------- total: 14.216826sec
user system total real
match 1.031670 0.000000 1.031670 ( 1.032347)
[//]= 1.859028 0.000000 1.859028 ( 1.860013)
[..]= 3.074159 0.003978 3.078137 ( 3.079825)
block 3.751532 0.011982 3.763514 ( 3.765673)
regex 4.634857 0.003972 4.638829 ( 4.641259)

Character not at beginning of line; not followed or preceded by character

I'm trying to isolate a " character when (simultaneously):
it's not in the beginning of the line
it's not followed by the character ";"
it's not preceded by the character ";"
E.g.:
Line: "Best Before - NO MATCH
Line: Best Before"; - NO MATCH
Line: ;"Best "Before - NO MATCH
Line: Best "Before - MATCH
My best solution is (?<![;])([^^])(")(?![;]) but it's not working correctly.
I also tried (?<![;])(")(?![;]), but it's only partial (missing the "not at the beginning" part)
I don't understand why I'm spelling the "AND not at the beginning" wrong.
Where am I missing it?
If you want to allow partial matches, you can extend the lookbehind with an alternation not asserting the start of the string to the left.
The semi colon [;] does not have to be between square brackets.
(?<!;|^)"(?!;)
Regex demo
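A small Python sketch of this partial-match variant follows. Python's re module rejects (?<!;|^) because the lookbehind alternatives differ in width, so the sketch assumes two separate lookbehinds are an acceptable equivalent; QUOTE_RE and quote_positions are names invented for the example:

```python
import re

# Two separate lookbehinds replace (?<!;|^), since Python's re module requires
# lookbehind alternatives to have the same width.
QUOTE_RE = re.compile(r'(?<!;)(?<!^)"(?!;)', re.MULTILINE)


def quote_positions(line):
    """Return the index of every double quote that satisfies all three rules."""
    return [m.start() for m in QUOTE_RE.finditer(line)]


for line in ['"Best Before', 'Best Before";', ';"Best "Before', 'Best "Before']:
    print(repr(line), quote_positions(line))
```

Note that the third sample line does produce a hit on its second quote: each quote is judged on its own, which is what "partial matches" means here.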
If you want to match the " only when there is no occurrence of ;" to its left or right on the line, and an infinite quantifier in a lookbehind assertion is allowed:
(?<!^.*;(?=").*|^)"(?!;|.*;")
Regex demo
In Notepad++ you can use
^.*(?:;"|";).*$(*SKIP)(*F)|(?<!^)"
Regex demo
You can use the fact that not preceded by ; means that it's also not the first character on the line to simplify things
[^;]"(?:[^;]|$)
This gives you
Match a character that's not a ; (so there must be a character and thus the next character can't be the start of the line)
Match a "
Match a character that's not a ; or the end of the line
I know you are asking for a regex solution, but, almost always, strings can also be filtered using string methods in whatever language you are working in.
For the sake of completeness, to show that regex is not your only available tool here, here is a short javascript using the string methods:
myString.charAt()
myString.includes()
Working Example:
const checkLine = (line) => {
switch (true) {
// DOUBLE QUOTES AT THE BEGINNING
case(line.charAt(0) === '"') :
return console.log(line, '// NO MATCH');
// DOUBLE QUOTES IMMEDIATELY FOLLOWED BY SEMI-COLON
case(line.includes('";')) :
return console.log(line, '// NO MATCH');
// DOUBLE QUOTES IMMEDIATELY PRECEDED BY SEMI-COLON
case(line.includes(';"')) :
return console.log(line, '// NO MATCH');
default:
return console.log(line, '// MATCH');
}
}
checkLine('"Best Before');
checkLine('Best Before";');
checkLine(';"Best "Before');
checkLine('Best "Before');
Further Reading:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/includes

Regex Express Return All Chars before a '/' but if there are 2 '/' Return all before that

I have been trying to get a regex expression to return me the following in the following situations.
XX -> XX
XXX -> XXX
XX/XX -> XX
XX/XX/XX -> XX/XX
XXX/XXX/XX -> XXX/XXX
I tried the following regexes, but they do not work.
^[^/]+ => https://regex101.com/r/xvCbNB/1
=========
([A-Z])\w+ => https://regex101.com/r/xvCbNB/2
They are close but are not there.
Any Help would be appreciated.
You want to get all text from the start till the last occurrence of a specific character or till the end of string if the character is missing.
Use
^(?:.*(?=\/)|.+)
See the regex demo and the regex graph:
Details
^ - start of string
(?:.*(?=\/)|.+) - a non-capturing group that matches either of the two alternatives; if the first one matches, the second won't be tried:
.*(?=\/) - any 0+ chars other than line break chars, as many as possible, up to but excluding the last /
| - or
.+ - any 1+ chars other than line break chars, as many as possible.
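The alternation above can be tried out with a short Python sketch. The function name before_last_slash and the choice to return an empty string when nothing matches are assumptions of the example:

```python
import re

# Same alternation as above; "/" needs no escaping in a Python pattern.
PREFIX_RE = re.compile(r"^(?:.*(?=/)|.+)")


def before_last_slash(text):
    """Return everything before the last "/", or the whole text if there is none."""
    match = PREFIX_RE.match(text)
    return match.group(0) if match else ""


for sample in ["XX", "XXX", "XX/XX", "XX/XX/XX", "XXX/XXX/XX"]:
    print(sample, "->", before_last_slash(sample))
```

The greedy .* first runs to the end of the string and then backtracks until the lookahead finds a /, which is why the last slash is the one that counts.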
It will be easier to use a replace here to match / followed by non-slash characters before end of line:
Search regex:
/[^/]*$
Replacement String:
""
Updated RegEx Demo 1
If you're looking for a regex match then use this regex:
^(.*?)(?:/[^/]*)?$
Updated RegEx Demo 2
Any special reason it has to be a regular expression? How about just splitting the string at the slashes, remove the last item and rejoin:
function removeItemAfterLastSlash(string) {
const list = string.split(/\//);
if (list.length === 1) {
return string;
}
list.pop();
return list.join("/");
}
Or look for the last slash and remove it together with everything after it:
function removeItemAfterLastSlash(string) {
const index = string.lastIndexOf("/");
if (index === -1) {
return string;
}
return string.slice(0, index);
}

QRegExp not finding expected string pattern

I am working in Qt 5.2, and I have a piece of code that takes in a string and enters one of several if statements based on its format. One of the formats searched for is the letters "RCV", followed by a variable number of digits, a decimal point, and then one more digit. There can be more than one of these values in the line, separated by "|"; for example, it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9". Right now I am using the QRegExp class to find this, like this:
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^[RCV\d+\.\d\|?]+$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
I want it to use the if statement, but it keeps going into the else statement. Is there something I'm doing wrong with the regular expression?
Your expression should be written the way @vahancho mentioned in a comment:
if(QRegExp("^[RCV\\d+\\.\\d\\|?]+$").exactMatch(value))
If you use C++11, then you can use its raw strings feature:
if(QRegExp(R"(^[RCV\d+\.\d\|?]+$)").exactMatch(value))
Aside from escaping the backslashes, which others have mentioned in answers and comments, note this part of the question:
There can be more than one of these values in the line, separated by "|"; for example, it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9".
[RCV\d+\.\d\|?] may not be doing what you expect. Perhaps you want () instead of []:
/^
[RCV\d+\.\d\|?]+ # More than one of characters from the list:
# R, C, V, a digit, a +, a dot, a digit, a |, a ?
$/x
/^
(
RCV\d+\.\d # RCV, some digits, a dot, followed by a digit
\|? # Optional: a |
)+ # Quantifier of one or more
$/x
Also, maybe you could revise the regex such that each | requires the group to be matched again:
/^
(RCV\d+\.\d) # RCV, some digits, a dot, followed by a digit
(
\|(?1) # A |, then match subpattern 1 (Above)
)* # Quantifier of zero or more, so a single value still matches
$/x
To check that the line consists only of valid occurrences, while also requiring a | before every occurrence after the first (your implementation would not require the | between values):
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^RCV\\d+\\.\\d(\\|RCV\\d+\\.\\d)*$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
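The difference between the character class and the grouped pattern is easy to see in a quick Python sketch. This uses Python's re module purely for illustration, not QRegExp, and CHAR_CLASS_RE, GROUPED_RE and is_valid are names made up for the example:

```python
import re

# The original character class: any mix of R, C, V, digits, +, ., | and ?.
CHAR_CLASS_RE = re.compile(r"[RCV\d+\.\d\|?]+")
# The grouped version: one value, then zero or more "|value" repetitions.
GROUPED_RE = re.compile(r"RCV\d+\.\d(?:\|RCV\d+\.\d)*")


def is_valid(value):
    """True only for RCV<digits>.<digit> values joined by single | characters."""
    return GROUPED_RE.fullmatch(value) is not None


print(CHAR_CLASS_RE.fullmatch("C+?|.") is not None)  # the class accepts junk
for value in ["RCV000030249.2|RCV000035360.2", "RCV0123456.1",
              "RCV12345.1RCV678.9", "RCV1.2|", "C+?|."]:
    print(value, is_valid(value))
```

Once the backslashes are escaped correctly, the character-class version would accept the test value, but it would equally accept strings that look nothing like the intended format.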

Regular Expression to check typical 'string' (type) syntax

I'm trying to manufacture a regular expression that will capture a valid, arbitrary string (as you might type it) from languages like Ruby and PHP, e.g.:
"lol" // valid
'was' // valid
"\"say\"" // valid
'\'what\'' // valid
"m"y" // invalid
'ba'd' // invalid
"th\\"is" // invalid
'su\\'cks' // invalid
I'm a little stuck trying to match the escaped quotes in the content correctly whilst failing on double-escape-then-quote.
Any help appreciated!
This matches your first 4 lines and rejects the last 4:
^(["'])(\\.|(?!\\|\1).)*\1$
A quick explanation:
^ # the start of the input
(["']) # match a single- or double quote and store it in group 1
( # open group 2
\\. # a backslash followed by any char
| # OR
(?!\\|\1). # if no backslash or the quote matched in group 1 can be seen ahead, match any char
)* # close group 2 and repeat it zero or more times
\1 # the same quote as matched in group 1
$ # the end of the input
Here's a little PHP demo:
<?php
$tests = array(
'"lol"',
"'was'",
'"\\"say\\""',
"'\\'what\\''",
'"m"y"',
"'ba'd'",
'"th\\\\"is"',
"'su\\\\'cks'"
);
foreach($tests as $test) {
if(preg_match('/^(["\'])(\\\\.|(?!\\\\|\1).)*\1$/', $test)) {
echo "valid : " . $test . "\n";
}
else {
echo "invalid : " . $test . "\n";
}
}
?>
which produces:
valid : "lol"
valid : 'was'
valid : "\"say\""
valid : '\'what\''
invalid : "m"y"
invalid : 'ba'd'
invalid : "th\\"is"
invalid : 'su\\'cks'
as can be seen on ideone: http://ideone.com/60mtE
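The same check can be sketched in Python, where raw strings avoid the doubled backslashes needed in the PHP version. The names STRING_LITERAL_RE, is_valid_literal and TESTS are assumptions of this example, and group 2 is made non-capturing since its contents are not used:

```python
import re

# Same pattern as above: opening quote, then escaped pairs or plain characters
# that are neither a backslash nor the opening quote, then the same quote again.
STRING_LITERAL_RE = re.compile(r'''^(["'])(?:\\.|(?!\\|\1).)*\1$''')


def is_valid_literal(text):
    """True if text is a single, properly terminated quoted string."""
    return STRING_LITERAL_RE.match(text) is not None


TESTS = [
    r'"lol"',
    r"'was'",
    r'"\"say\""',
    r"'\'what\''",
    r'"m"y"',
    r"'ba'd'",
    r'"th\\"is"',
    r"'su\\'cks'",
]

for test in TESTS:
    print("valid  " if is_valid_literal(test) else "invalid", ":", test)
```

Consuming a backslash together with the character after it is what makes \\" fail: the two backslashes are eaten as one escaped pair, leaving the quote unescaped.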