I am trying to fix a regular expression used in tokenization so that it matches everything, including '(' and ')', but does not match ( and ) when they are not surrounded by apostrophes.
Use case examples which should be matched:
'('AN')'
'('AN
AN')'
...and every other possibility involving '(' or ')' combined or not with any string
Currently, it looks like this:
[^\)\(]+
The most successful result I have obtained so far is:
[^\)\(]+|\'.*?\'
This manages to correctly match expressions like: '('AN')' , '(' , ')' , AN , '('')' , '()'.
But it fails for: AN'(' , AN')' , '('AN , ')'AN.
NOTE: I have done some research, and found that the regex engine involved is quite old (around 1980s) and is called PCLNT (I am not 100% sure about its name). I mention this because in some other situations when I dealt with regular expressions, the regex engines available online showed the correct result, but in my application it did not even compile.
Any help would be great, also if anyone knows anything about this possible engine and its documentation please guide me.
This regex will match a sequence of any combination of characters other than parentheses, or anything between apostrophes. It then optionally matches a single apostrophe followed by any sequence of non-special characters, in order to catch unpaired apostrophes:
([^()']*|'[^']*')*('[^'()]*)?
I know nothing about the regex library you are using, but I don't think there's anything out of the ordinary in that regex.
I think what you're looking for is any string containing '(' OR any string containing ')'
.*(\'\(\').*|.*(\'\)\').*
Example here
It seems you want to match anything except a bracket without an adjacent apostrophe:
^('[()]|[()]'|[^()])+$
See live demo.
Note that you don’t have to escape brackets in a character class.
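A quick way to check that last pattern against the examples from the question is a small Python sketch. It assumes Python's re module treats this syntax the same way as the engine in the question, which may not hold for an old engine; the name TOKEN and the two sample lists are only illustrative.

```python
import re

# Pattern from the answer above: a parenthesis is accepted only when an
# apostrophe sits directly before or directly after it.
TOKEN = re.compile(r"^('[()]|[()]'|[^()])+$")

# Examples taken from the question.
should_match = ["'('AN')'", "'('AN", "AN')'", "AN'('", "')'AN", "AN"]
# Bare parentheses, which the question wants rejected.
should_fail = ["(AN)", "AN(", ")"]

for text in should_match:
    assert TOKEN.match(text) is not None, text
for text in should_fail:
    assert TOKEN.match(text) is None, text
```

Running the same lists through the target engine would show quickly whether it accepts the syntax at all.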
Related
I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
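The difference between the two bracket types can be seen in a minimal sketch. It is written in Python rather than JavaScript, on the assumption that both engines treat these two constructs the same way; the variable names are made up for illustration.

```python
import re

# Square brackets: ONE character out of the set w, d, |, o, r, q.
char_class = re.compile(r"[wd|word|qw]")
# Parentheses with alternation: one of the three whole words.
group = re.compile(r"(?:wd|word|qw)")

assert char_class.fullmatch("|") is not None  # the pipe is just a set member
assert char_class.fullmatch("wd") is None     # a set matches a single character
assert group.fullmatch("wd") is not None
assert group.fullmatch("word") is not None
assert group.fullmatch("|") is None
```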
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
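The effect of the unescaped dot can be checked with a short sketch. It uses Python for illustration and assumes the JavaScript engine treats . and \. the same way; the sample URL and the names loose and strict are invented for the example.

```python
import re

# Same expression with and without the escaped dot in "baidu.com".
loose = re.compile(r".*baidu.com.*[/?].*(wd|word|qw)=")
strict = re.compile(r".*baidu\.com.*[/?].*(wd|word|qw)=")

good_url = "http://www.baidu.com/s?word=regex"  # made-up sample URL
odd_text = "hbaidu-com/wd="                     # not a baidu.com URL at all

assert loose.match(good_url) is not None
assert strict.match(good_url) is not None
assert loose.match(odd_text) is not None   # '.' happily accepted the '-'
assert strict.match(odd_text) is None      # '\.' requires a real dot
```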
I am doing a mass method replace in my C# codebase. I have lines of code that look like the following:
Assert.That(Edit.FundsTable.GetCellByIndexes(0, 2).Text.Contains("Employer Request IPM A"));
The problem is that initially when the GetCellByIndexes call was made, we had another method that basically did the same thing, leaving us doing the exact same task 2 ways. The more standard way that we are changing it to is the following:
Assert.That(Edit.FundsTable.Cells[0, 2].Text.Contains("Employer Request IPM A"));
I am trying to do a VS replace-all to move GetCellByIndexes calls to Cells calls. The issue is with the right paren. I can do a replace all from
GetCellByIndexes(
to
Cells[
very easily. The problem is changing the right paren of the method call to a square bracket. Does anyone know how to identify the first right paren after the "GetCellByIndexes" string utilizing Regex?
Use
GetCellByIndexes\(([^()]+)\)
Replace with Cells[$1]. See proof.
Code                 Explanation
GetCellByIndexes     'GetCellByIndexes'
\(                   '('
(                    group and capture to $1:
  [^()]+             any character except: '(', ')' (1 or more times (matching the most amount possible))
)                    end of $1
\)                   ')'
In general search for:
GetCellByIndexes\(\s*(\d+)\s*,\s*(\d+)\s*\)
replace with
Cells[$1, $2]
This works both in the search-and-replace of Visual Studio and when programming in C#.
Note that this will only work if the indexes are numbers... If they are something more complex (variables, or functions) then it becomes more interesting (and complex).
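Both suggested patterns can be tried outside Visual Studio with a small sketch. It uses Python's re.sub purely for illustration, so the replacement groups are written \1 and \2 instead of $1 and $2; the strings with_vars and nested are hypothetical inputs, not lines from the codebase in the question.

```python
import re

line = 'Assert.That(Edit.FundsTable.GetCellByIndexes(0, 2).Text.Contains("Employer Request IPM A"));'

# First answer: capture everything up to the first ')' after the call.
general = re.compile(r"GetCellByIndexes\(([^()]+)\)")
# Second answer: capture exactly two numeric indexes.
numeric = re.compile(r"GetCellByIndexes\(\s*(\d+)\s*,\s*(\d+)\s*\)")

converted = general.sub(r"Cells[\1]", line)
converted_numeric = numeric.sub(r"Cells[\1, \2]", line)

# Arguments that are not plain numbers are handled only by the first pattern.
with_vars = "table.GetCellByIndexes(row, col).Text"
# A nested call contains parentheses, so neither pattern touches it.
nested = "table.GetCellByIndexes(GetRow(), 2).Text"

print(converted)
print(general.sub(r"Cells[\1]", with_vars))
print(general.sub(r"Cells[\1]", nested))
```

Lines like the nested one would still need a manual edit or a more elaborate pattern.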
I'm editing some data, and my end goal is to conditionally substitute , (comma) chars with .(dot). I have a crude solution working now, so this question is strictly for suggestions on better methods in practice, and determining what is possible with a regex engine outside of an enhanced programming environment.
I gave it a good college try, but 6 hours is enough mental grind for a Saturday, and I'm throwing in the towel. :)
I've been through about 40 SO posts on regex recursion, substitution, etc, the wiki.org on the definitions and history of regex and regular language, and a few other tutorial sites. The majority is centered around Python and PHP.
The working, crude regex (facilitating loops / search and replace by hand):
(^.*)(?<=\()(.*?)(,)(.*)(?=\))(.*$)
A snip of the input:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),
room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),
And the desired output:
room_ass=01: macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*.3.5.7.),
room_ass=01: macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*.4.6.8.),
room_ass=01: macro_id=03: name=All, pgm_audio=1, list=(1.2*.3.4.5.6.7.8.),
That's all. Just replace the , with ., but only inside ( ).
This is one conceptual (not working) method I'd like to see, where the middle group<3> would loop recursively:
(^.*)(?<=\()([^,]*)([,|\d|\*]\3.*)(?=\))(.*$)
( ^ )
..where each recursive iteration would shift across the data, either 1 char or 1 comma at a time:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
iter 1-| ^ |
2-| ^ |
3-| ^ |
4-| ^|
or
A much simpler approach would be to just tell it to mask/select all , between the (), but I struck out on figuring that one out.
I use text editors a lot for little data editing tasks like this, so I'd like to verify that SublimeText can't do it before I dig into Python.
All suggestions and criticisms welcome. Be gentle. <--#n00b
Thanks in advance!
-B
Not much magic needed. Just check whether there's a closing ) ahead, without any ( in between.
,(?=[^)(]*\))
See this demo at regex101
However, it does not check for an opening (. It's a common approach and probably a duplicate.
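Here is a minimal sketch of the lookahead approach applied to one line of the sample input. It uses Python's re module as a stand-in for the editor's engine, which is an assumption; the name COMMA_BEFORE_CLOSE is invented for the example.

```python
import re

# A comma that is followed, with no parenthesis in between, by a closing ')'.
COMMA_BEFORE_CLOSE = re.compile(r",(?=[^)(]*\))")

line = "room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),"
fixed = COMMA_BEFORE_CLOSE.sub(".", line)
print(fixed)

# The caveat mentioned above: no opening '(' is required, so a stray ')'
# is enough to trigger the replacement.
stray = COMMA_BEFORE_CLOSE.sub(".", "a,b)")
print(stray)
```

Commas outside the parentheses, including the trailing one after the closing ), are left alone because no ) follows them without a ( in between.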
This is a complete guess because I don't use SublimeText; the assumption here is that SublimeText uses PCRE regular expressions.
Note that you mention "recursive". I don't believe you mean Regular Expression Recursion, which doesn't fit the problem here.
Something like this might work...
You'll need to test to make sure this isn't matching other things in your document and to see if SublimeText even supports this...
This is based on using the \K operator to "keep" what comes before it out of the match - you can find other uses of it as a PCRE alternative (workaround) to variable-length look-behinds not being supported by PCRE.
Regular Expression
\((?:(?:[^,\)]+),)*?(?:[^,\)]+)\K,
Regex Description
Match the opening parenthesis character \(
Match the regular expression below (?:(?:[^,\)]+),)*?
  Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
  Match the regular expression below (?:[^,\)]+)
    Match any single character NOT present in the list below [^,\)]+
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
      The literal character “,” ,
      The closing parenthesis character \)
  Match the character “,” literally ,
Match the regular expression below (?:[^,\)]+)
  Match any single character NOT present in the list below [^,\)]+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
    The literal character “,” ,
    The closing parenthesis character \)
Keep the text matched so far out of the overall regex match \K
Match the character “,” literally ,
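If the job ends up in Python after all, note that the standard re module does not support \K. A different technique gives the same result there: match each whole parenthesised group and rewrite it in a callback. The sketch below is that swapped-in approach, not a translation of the \K pattern, and the function name dots_inside_parens is made up.

```python
import re

def dots_inside_parens(line: str) -> str:
    """Replace commas with dots, but only inside ( ... ) groups."""
    # Each match is one complete "(...)" group without nested parentheses;
    # the callback rewrites only the text of that group.
    return re.sub(r"\([^()]*\)", lambda m: m.group(0).replace(",", "."), line)

sample = "room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),"
print(dots_inside_parens(sample))
```

Because the pattern requires both an opening and a closing parenthesis, a stray ) on its own does not trigger any replacement.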
The code uses the following regular expression
img[src~=(?i)\\.(png|jpe?g)]
I'm not sure if the . is escaped or the \
the \ is escaped, which appears to be an error given what it's trying to do....
actually, you've taken that out of context. that's probably in a string. if it's in a string, then it's escaping the slash, and then that slash is escaping the dot.
the ~= means "ends with" and the (?i) switches it into case-insensitive mode.
errr... now that i think about it, that actually looks like a hybrid between a CSS selector (probably used in jquery) and a regex (being familiar with both syntaxes, I thought nothing of it!). The ~= doesn't do anything in a regex (they're literal chars) the [ and ] represent a character set though.
So...I don't know what the result of this is. I suspect someone got confused and tried mixing the two.
It means match case insensitively, any string that ends in:
\.png
\.jpeg
\.jpg
But this is dependent on context. If used in a context where \ needs to be escaped at a higher level (such as a string literal), then it means match case insensitively:
.png
.jpeg
.jpg
In this expression, the '\' is escaped, and the resulting single '\' in turn escapes the '.'
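The two levels of escaping can be illustrated with a short sketch. It assumes the selector sits inside a string literal in the host language and uses Python to show the idea, since Python string literals also turn \\ into a single backslash; the variable names and sample file names are invented.

```python
import re

# What the source file contains inside a string literal: two backslashes.
from_string_literal = "(?i)\\.(png|jpe?g)"
# What the regex engine actually receives: one backslash before the dot.
assert from_string_literal == r"(?i)\.(png|jpe?g)"

# Anchor at the end to mimic "ends in .png, .jpeg or .jpg".
ends_with_image_ext = re.compile(from_string_literal + "$")

assert ends_with_image_ext.search("photo.JPG") is not None  # case-insensitive
assert ends_with_image_ext.search("scan.jpeg") is not None
assert ends_with_image_ext.search("photoXpng") is None      # the dot is literal
```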
Regular expressions are not my strong point.
I can do simple stuff, but this one has just got my goat !!
So could someone give me a hand with this one.
Here's the comment in the code :
// If utf8 detection didnt work before, strip those weird characters for an underscore, as a last resort.
eregi_replace("[^a-z0-9 \-\.\(\)\/\\]","_",$str);
I need to convert it to preg_replace (here's what I tried):
preg_replace("{[^a-z0-9 \-\.\(\)\/\\]}i","_",$str);
Any regex pros out there who can give me a hand?
You need to specify a regexp delimiter such as # or /
preg_replace("#[^a-z0-9 \-\.\(\)\/\\]#i","_",$str);
So you should enclose your regular expression in those delimiter characters.
First, I believe the { and } are fine as delimiters separating the expression from the flags, but I know there are some regex flavors that don't support them, so it might be a good idea to just use something like ! or #
Second, I am not sure how the expression before worked, because AFAIK escaping with a \ character does not work with ERE expressions. You have to represent special characters like ^, -, and ] by their position within the class (^ cannot be the first character, ] must be the first character, and - must be either the first or the last character). The - character in the first expression would be interpreted as a range specifier (in this case a character in the range between \ and \). Additionally, the \ characters are treated literally, so you've got a confusing looking and largely redundant regex.
The replacement expression, however, needs to be in preg notation/flavor, so there are rule changes:
Very few things need to be escaped in a character class, even with the new rules
The \ character needs to be escaped twice - once for the string, and then one more time for the regex - otherwise, it will escape the closing bracket ]
Assuming you want to match a dash (or rather match something OTHER than a dash), it needs to be moved to the end of the class
So, here is some code (link) that I believe does what you need it to do:
$source = 'hello! ##$%^&* wazzup-dawg?.()/\\[]{}<>:"';
$blah = preg_replace('![^a-z0-9 .()/\\\\-]!i','_',$source);
print($blah);
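For comparison, the same character class can be sketched in Python, where a raw string removes one of the two escaping levels. This is only an illustration of the escaping rules listed above, not a replacement for the PHP code; the name NOT_ALLOWED is made up.

```python
import re

# Same sample text as the PHP example; '\\' is a single backslash here.
source = 'hello! ##$%^&* wazzup-dawg?.()/\\[]{}<>:"'

# Inside the class, '\\' is one literal backslash and '-' comes last, so it
# is a literal dash rather than a range. Nothing else needs escaping.
NOT_ALLOWED = re.compile(r"[^a-z0-9 .()/\\-]", re.IGNORECASE)

cleaned = NOT_ALLOWED.sub("_", source)
print(cleaned)
```

Letters, digits, spaces, dots, parentheses, slashes, backslashes and dashes survive; every other character becomes an underscore.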
preg_replace("{[^a-z0-9]-.()/\/}i","_",$str)
works just fine.
I tried it with all # and / and { and they all worked.