Regular expression to match any word followed by a literal string - regex

So I have the following:
^[a-zA-Z]+\b(myword+-)\b*
which I thought would match
^ start of string
[a-zA-Z] any alpha character
+ of one or more characters
\b followed by a word break
(myword+-) followed by myword which could include one or more special characters
\b followed by a word break
* followed by anything at all
One: it does not work - it does not match anything
Two: any special characters included in (myword+-) throw an error
I could escape the special characters, but I don't know in advance what they might be, so I would have to escape all the possibilities, or perhaps I could just escape every character, as in (\m\y\w\o\r\d\+\-)
Edited to add:
Sorry, I knew I should have given more information
I have a series of strings to search through in the form:
extra android-sdk and more that is of no interest
extra android-ndk and more that is of no interest
extra anjuta-extra and more that is of no interest
community c++-gtk-utils and more that is of no interest
and I have a list of items to search for in the strings:
android-sdk
android-ndk
extra
c++-gtk-utils
The item should only match if the second word in the string is an exact match to the item, so:
android-sdk will match the first string
android-ndk will match the second string
extra will NOT match the third string
c++-gtk-utils will match the fourth string
So (myword+-) is the item I am searching for "which could include one or more special characters"
Thanks for the help
Andrew

OK, with the help from above I worked it out.
This regex does exactly what I wanted, bear in mind that I am working in tcl (note the spaces to delimit the search word):
^[a-zA-Z]+\y extra \y *
where the search word is "extra".
It is necessary to escape any characters in the search string which may be interpreted by regex as quantifiers etc., e.g. +
So this will also work:
^[a-zA-Z]+\y dbus-c\+\+ \y *
Andrew

Strong recommendation: if you want to match literal strings, don't use regular expressions.
If we have this sample data:
set strings {
    {extra android-sdk and more that is of no interest}
    {extra android-ndk and more that is of no interest}
    {extra anjuta-extra and more that is of no interest}
    {community c++-gtk-utils and more that is of no interest}
}
set search_strings {
    android-sdk
    android-ndk
    extra
    c++-gtk-utils
}
Then, to find matches in the 2nd word of each string, we'll just use the eq string equality operator
foreach string $strings {
    foreach search $search_strings {
        if {[lindex [split $string] 1] eq $search} {
            puts "$search matches $string"
        }
    }
}
outputs
android-sdk matches extra android-sdk and more that is of no interest
android-ndk matches extra android-ndk and more that is of no interest
c++-gtk-utils matches community c++-gtk-utils and more that is of no interest
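For comparison, the same split-and-compare idea can be sketched in Python rather than Tcl. This is only an illustration: the function name second_word_matches is an assumption made up for the example, and Python's split() collapses runs of whitespace, which differs slightly from Tcl's split.

```python
def second_word_matches(line: str, item: str) -> bool:
    """Return True when the second whitespace-separated word equals item."""
    words = line.split()
    return len(words) > 1 and words[1] == item


strings = [
    "extra android-sdk and more that is of no interest",
    "extra android-ndk and more that is of no interest",
    "extra anjuta-extra and more that is of no interest",
    "community c++-gtk-utils and more that is of no interest",
]
search_strings = ["android-sdk", "android-ndk", "extra", "c++-gtk-utils"]

# Collect (item, line) pairs using plain string equality, no regex involved.
matches = [
    (item, line)
    for line in strings
    for item in search_strings
    if second_word_matches(line, item)
]

for item, line in matches:
    print(f"{item} matches {line}")
```

Because no pattern is compiled, characters such as + and - in the search item need no escaping at all.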
If you insist on regular expression matching, you can escape any special characters to take away their usual regex meaning. Here, we'll take the brute force approach: any non-word chars will get escaped, so that the pattern may look like ^\S+\s+c\+\+\-gtk\-utils
foreach string $strings {
    foreach search $search_strings {
        set pattern "^\\S+\\s+[regsub -all {\W} $search {\\&}]"
        if {[regexp $pattern $string]} {
            puts "$search matches $string"
        }
    }
}
I was hoping to be able to make a portion of a regular expression to be a literal string, like
set pattern "^\\S+\\s+(***=$string)"
set pattern "^\\S+\\s+((?q)$string)"
but both failed.
Tcl regular expressions are documented at
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
Also note your pattern ^[a-zA-Z]+\b(myword+-)\b* does not provide for any whitespace between the first and second words.
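The escaping approach above can also be sketched in Python, where re.escape does the work of the regsub call. The function name matches_second_word is hypothetical, and the trailing lookahead (?=\s|$) is an addition that is not in the Tcl pattern above: it is there on the assumption that the second word must match exactly rather than merely start with the search item.

```python
import re


def matches_second_word(line: str, item: str) -> bool:
    """Match: first word, whitespace, the literal item, then whitespace or end."""
    pattern = r"^\S+\s+" + re.escape(item) + r"(?=\s|$)"
    return re.search(pattern, line) is not None


print(matches_second_word(
    "community c++-gtk-utils and more that is of no interest",
    "c++-gtk-utils",
))
print(matches_second_word(
    "extra anjuta-extra and more that is of no interest",
    "extra",
))
```

Without the lookahead, a search item such as android would also match a line whose second word is android-sdk.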

Disclaimer: Since your question lacks information about what input and output are expected, I will try to explain why your regex isn't working at all. Since this is not a full answer, you might not want to mark it as accepted, and may prefer to wait for someone to give you an example of a working solution once you provide the necessary information.
Notes:
quantifier characters (*, +, ? etc.) are applied to literal character or character class (a.k.a character group, namely characters/ranges inside [ ]) - when in your regex you write (myword+-) the only thing the + sign is applied to is letter 'd', nothing else.
what is myword in your regex? If you want a set of characters use [ ] combined with character ranges and/or character tokens such as \w (all word characters, such as letters and some special characters) or \d (all digit characters)
you also seem to misunderstand and misuse groups ("( )"), character classes ("[ ]") and quantifier notation ("{ }")
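The first note can be checked with a small experiment. The sketch below uses Python's re module (an assumption, since the question is about Tcl, but the quantifier rule is the same) to show that in myword+- the + repeats only the letter d, and that escaping turns the whole thing into a literal.

```python
import re

# Unescaped: "+" is a quantifier applied to the single character "d".
quantified = re.compile(r"myword+-")

# Escaped: every character, including "+" and "-", is matched literally.
literal = re.compile(re.escape("myword+-"))

repeated_d = quantified.fullmatch("myworddd-") is not None
repeated_word = quantified.fullmatch("mywordmyword-") is not None
plus_as_text = quantified.fullmatch("myword+-") is not None
literal_plus = literal.fullmatch("myword+-") is not None

print(repeated_d, repeated_word, plus_as_text, literal_plus)
```

So the unescaped pattern accepts extra d characters but never the text myword+- itself, while the escaped pattern accepts only that exact text.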

Related

replace single-quote with double-quote, if and only if quote is after specific string

I'm working in notepad++, and using its find-replace dialog box.
NP++ documentation states: Notepad++ regular expressions use the Boost regular expression library v1.70, which is based on PCRE (Perl Compatible Regular Expression) syntax. ref: https://npp-user-manual.org/docs/searching
What I'm trying to do should be simple, but I'm a regex novice, and after 2-3 hrs of web searches and playing with online regex testers, I give up.
I want to replace all single quotes ' with double quote " , but if and only if the ' is to the RIGHT of one or more #, ie inside a python comment.
For example,
list1 = ['apple','banana','pear'] # All 'single quotes' to LEFT of # remained unchanged.
list2 = ['tomato','carrot'] # All 'single quotes' to RIGHT of one or more # are replaced
# # with "double quotes", like this.
The np++ file is over 800 lines, manual replacement would be tedious & error prone. Advice appreciated.
This regex should do what you want:
(^[^#]*#|(?<!^)\G)[^'\n]*\K'
It looks for a ' which is preceded by either
^[^#]*# : start of line and some number of non-# characters followed by a #; or
(?<!^)\G : \G matches at the start of the text or at the end of the previous match; the negative lookbehind for start of line (?<!^) rules out the first case, meaning that it only matches at the end of the previous match
and then some number of characters that are neither ' nor newline, [^'\n]* (excluding newline prevents the match from running on past the end of the line).
We then use \K to reset the match, so that everything before that is discarded from the match, and the regex only matches the '.
That can then be replaced with ".
Demo on regex101
Update
You can avoid matching apostrophes within words by only matching ones that are either preceded or followed by a non-word character:
(^[^#]*#|(?<!^)\G)[^'\n]*\K('(?=\W)|(?<=\W)')
Demo on regex101
Update 2
You can also deal with the case where there are # characters in strings by qualifying the first part of the regex with the requirement for there to be matched pairs of quotes beforehand:
(?:^[^'#]*(?:'[^']*'[^#']*)*[^'#]*#|(?<!^)\G)[^'\n]*\K(?:'(?=\W)|(?<=\W)')
Demo on regex101
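If the file can be processed outside Notepad++, the same goal can be reached without \G and \K, neither of which Python's standard re module supports. The sketch below swaps in a different technique: it splits each line at the first # and replaces quotes only in the part after it. The function name requote_comment is made up for the example, and the code assumes that the first # on a line really starts a comment (so, like the first regex above, it does not handle # inside string literals or apostrophes within words).

```python
def requote_comment(line: str) -> str:
    """Replace ' with " only to the right of the first # on the line."""
    code, hash_mark, comment = line.partition("#")
    # When there is no "#", hash_mark and comment are empty and the line is unchanged.
    return code + hash_mark + comment.replace("'", '"')


sample = "list2 = ['tomato','carrot'] # All 'single quotes' to RIGHT of #"
print(requote_comment(sample))
```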

Regex to catch 14 digits number that starts with 25 within a text string [duplicate]

I'm quite new to regular expressions. Could you help me to create a pattern which matches whole words containing a specific part? For example, if I have a text string "Perform a regular expression match" and if I search for express, it should give me expression, if I search for form, it should give me Perform and so on. Got the idea?
preg_match('/\b(express\w+)\b/', $string, $matches); // matches expression
preg_match('/\b(\w*form\w*)\b/', $string, $matches); // matches perform,
// formation, unformatted
Where:
\b is a word boundary
\w+ is one or more "word" characters*
\w* is zero or more "word" characters
See the manual on escape sequences for PCRE.
* Note: although not really a "word character", the underscore _ is also included in the character class \w.
This matches 'Perform':
\b(\w*form\w*)\b
This matches 'expression':
\b(\w*express\w*)\b
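The same pattern shape works in other regex flavours too. As a rough sketch in Python (the answers above use PHP's preg_match), with a hypothetical helper words_containing that escapes the search part before building the pattern:

```python
import re


def words_containing(part: str, text: str) -> list:
    """Return every whole word in text that contains part."""
    pattern = r"\b\w*" + re.escape(part) + r"\w*\b"
    return re.findall(pattern, text)


text = "Perform a regular expression match"
print(words_containing("form", text))
print(words_containing("express", text))
```

The search is case-sensitive here; passing re.IGNORECASE to re.findall would relax that.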

Regex Lookahead/behind to find character unless followed by the same

I'm really not good with Regex and have been messing about to achieve the following all morning:
I want to find unicode characters, i.e. "\00026", in an SQL string before saving to the database and escape the "\" by replacing it with "\\", unless it already has two "\" characters.
\\(?=[0])(?<![\\])
Is what I have written, which as I understand it does:
find the "\" character, positive look ahead for a "0", and look behind to check it isn't preceded by a "\"
But it's not working, so clearly I have misunderstood!
I can shorten it to \\(?=[0])
But then I get the "\" before the 0, even if it is preceded by another "\"
So how do I do:
Replace("\00026", "regex", "\\") to get "\\00026"
AND ensure that
Replace("\\00026", "regex", "\\") also gives "\\00026"
All help much appreciated!
EDIT:
This must parse an entire string and replace all occurrences, not just the first as well - just to be clear. Also I am using VB.net if it makes much difference.
Let me explain why your regex does not work.
\\ - Matches \
(?=[0]) - Checks (not matches) if the next character is 0
(?<![\\]) - Checks (but not matches) if the preceding character (that is \) is not \.
The last condition will always fail the match, as \ is \. So, not much sense, right?
If you want to match / in /000xx whole strings (e.g. separated with spaces), where x is any digit, you can use
\B(?<!/)/(?!/)(?=000\d{2})
See demo (go to Context tab)
To match the string even in context like w/00023, you can remove \B:
(?<!/)/(?!/)(?=000\d{2})
If you do not care about 0s, but just any digits:
(?<!/)/(?!/)(?=\d)
And in case you have \ (not /), just replace / with \\ in the above regular expressions.
You can use the following regex:
(?<!/)/(?=0)
And replace with //
See DEMO
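For the backslash case the question actually asks about, the lookbehind idea can be sketched as follows. This is Python rather than VB.net, and the names LONE_BACKSLASH and escape_lone_backslashes are invented for the illustration; it assumes that a backslash needs doubling only when it is followed by a digit and not already preceded by another backslash.

```python
import re

# A backslash that is not preceded by a backslash and is followed by a digit.
LONE_BACKSLASH = re.compile(r"(?<!\\)\\(?=\d)")


def escape_lone_backslashes(text: str) -> str:
    """Double every lone backslash that introduces a digit sequence."""
    # A function replacement sidesteps backslash processing in the template.
    return LONE_BACKSLASH.sub(lambda m: "\\\\", text)


single = "\\00026"        # one backslash followed by 00026
doubled = "\\\\00026"     # two backslashes followed by 00026
print(escape_lone_backslashes(single))
print(escape_lone_backslashes(doubled))
```

In an already doubled pair, the first backslash is followed by a backslash rather than a digit and the second is preceded by one, so neither matches and the text is left alone. re.sub replaces all occurrences, which covers the requirement to process the entire string.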

Groovy : RegEx for matching Alphanumeric and underscore and dashes

I am working on Grails 1.3.6 application. I need to use Regular Expressions to find matching strings.
It needs to find whether a string has anything other than Alphanumeric characters or "-" or "_" or "*"
An example string looks like:
SDD884MMKG_JJGH1222
What i came up with so far is,
String regEx = "^[a-zA-Z0-9*-_]+\$"
The problem with above is it doesn't search for special characters at the end or beginning of the string.
I had to add a "\" before the "$", or else it will give a compilation error.
- Groovy:illegal string body character after dollar sign;
Can anyone suggest a better RegEx to use in Groovy/Grails?
Problem is unescaped hyphen in the middle of the character class. Fix it by using:
String regEx = "^[a-zA-Z0-9*_-]+\$";
Or even shorter:
String regEx = "^[\\w*-]+\$";
By placing an unescaped - in the middle of character class your regex is making it behave like a range between * (ASCII 42) and _ (ASCII 95), matching everything in this range.
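The range effect is easy to see with a character that lies between * and _ in ASCII, such as + (ASCII 43). The sketch below uses Python's re module instead of Groovy, on the assumption that character class ranges behave the same way in both; the variable names are illustrative only.

```python
import re

# Hyphen in the middle: "*-_" is read as the range from "*" to "_".
with_range = re.compile(r"^[a-zA-Z0-9*-_]+$")

# Hyphen at the end: "*", "_" and "-" are three literal characters.
hyphen_last = re.compile(r"^[a-zA-Z0-9*_-]+$")

range_accepts_plus = with_range.match("A+B") is not None
fixed_accepts_plus = hyphen_last.match("A+B") is not None
fixed_accepts_sample = hyphen_last.match("SDD884MMKG_JJGH1222") is not None

print(range_accepts_plus, fixed_accepts_plus, fixed_accepts_sample)
```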
In Groovy the $ character in a string is used for interpolation (e.g. Hello ${name}). Because these so-called GStrings are only processed when the string is delimited with double quotes ("), that is where the extra escaping of $ is needed.
Groovy also lets you write strings without that feature by surrounding them with ' (single quotes). Yet the easiest way to write a regexp is the syntax with /.
assert "SDD884MMKG_JJGH1222" ==~ /^[a-zA-Z0-9*-_]+$/
See Regular Expressions for further "shortcuts".
The other points from @anubhava remain valid!
It's easier to reverse it:
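String regEx = "[^a-zA-Z0-9*_-]" /* This matches any single character _but_ alnum and *-_; use it with find (=~), because anchoring it as ^[^...]+$ would only match strings made up entirely of disallowed characters */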

Regular Expression:- String can contain any characters but should not be empty

My requirement is
"A string should not be blank or empty"
Eg., A String can contain any number of characters or strings followed by any special characters but should never be empty for eg., a string can contain "a,b,c" or "xyz123abc" or "12!#$#%&*()9" or " aa bb cc "
So, this is what i tried
Regex for blank or space:-
^\s*$
^ is the beginning of string anchor
$ is the end of string anchor
\s is the whitespace character class
* is zero-or-more repetition of
I'm stuck on how to negate the regex ^\s*$ so that it accepts any string like "a,b,c" or "xyz" or "12!#$#%&*()9"
Any help is appreciated.
No need for a regex. In Groovy you have the isAllWhitespace method:
groovy:000> "".allWhitespace
===> true
groovy:000> " \t\n ".allWhitespace
===> true
groovy:000> "something".allWhitespace
===> false
So asking !yourString.allWhitespace should tell you if your string is something else than empty or blank :)
\S
\S matches any non-white space character
Each character class has its own anti-class defined, so for \w you have \W, for \s you have \S, for \d you have \D, etc.
http://www.regular-expressions.info/charclass.html
Your regex engine may not support \S. If that is the case, you can use [^ \t\v]. If you support Unicode (which you should), there are more kinds of space characters that you should watch for.
If both you and your regex engine support Unicode AND the engine does not support \S, then you'll probably want to use the following (if you care about people entering different Unicode space types):
[^ \r\f\t\v\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u2028\u2029\u202F\u205F\u3000\uFEFF]
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
To me, two simple ways to express it are (neither needs anchoring):
s.trim() =~ /.+/
or
s =~ /\S+/
The first assumes you know how trim() works, the second assumes you know the meaning of \S.
Of course
!s.allWhitespace
is perfect, again if you know it exists.
The following regular expression will ensure that a string contains at least 1 non-whitespace character.
^(?!\s*$).+
Note: I am not familiar with Groovy, but I would imagine there are native functions (trim, empty, etc.) that test this more naturally than a regular expression.
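As a quick cross-check of two of the approaches above, here is a sketch in Python (not Groovy) of the \S search and the trim-based test. The function names are made up for the example; str.strip() plays the role of trim().

```python
import re


def is_not_blank_regex(s: str) -> bool:
    """True when s contains at least one non-whitespace character."""
    return re.search(r"\S", s) is not None


def is_not_blank_strip(s: str) -> bool:
    """Same test without a regex: strip whitespace and see what is left."""
    return bool(s.strip())


for candidate in ["", " \t\n ", "a,b,c", " aa bb cc "]:
    print(repr(candidate), is_not_blank_regex(candidate), is_not_blank_strip(candidate))
```

Both functions agree on these samples, so the choice is mostly a matter of readability.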
Is this in a Grails domain class?
If so, just use the blank constraint.