RegEx giving non characters with \w

c = re.split(r'\w+', message)
print(c)
message contains '!nano speak', but the regex is giving me this in return:
>>> ['!', ' ', '\r\n']
I'm very new to regex, but this seems like something I should get, and I can't seem to find this problem in search. It seems like it's doing exactly the opposite, and I'm sure it's a lower-case w.

re.split uses the regex as a delimiter to split the string. You set the delimiter to be one or more word characters (letters, digits, or underscore), so it returns everything between the words.
In order to get the tokens defined by the regex you can use re.findall:
>>> re.findall(r'\w+', '!nano speak')
['nano', 'speak']
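To see the two functions side by side, here is a small sketch. The sample string, including its trailing "\r\n", is an assumption inferred from the output shown in the question.

```python
import re

# Assumed input: the trailing "\r\n" is inferred from the question's output.
message = "!nano speak\r\n"

# re.split treats every match as a delimiter and returns what lies between matches.
between = re.split(r"\w+", message)

# re.findall returns the matches themselves.
words = re.findall(r"\w+", message)

print(between)  # ['!', ' ', '\r\n']
print(words)    # ['nano', 'speak']
```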

\w matches a word character (alphanumeric or underscore), so in the string "!nano speak" it matches everything except "!" and the space. re.split then splits on "nano" and "speak", so you get "!", " " and the trailing "\r\n".
To remove all non-letter characters, you can use re.sub:
re.sub("[^a-zA-Z]+", "", "!nano speak")
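Note that this substitution also deletes the space between the words. A rough sketch of both behaviours, using an assumed sample input; the variable names are only illustrative:

```python
import re

text = "!nano speak\r\n"  # assumed sample input

# Drops everything except ASCII letters, including the space between words.
letters_only = re.sub(r"[^a-zA-Z]+", "", text)

# Keeps the words and rejoins them with single spaces.
words_spaced = " ".join(re.findall(r"\w+", text))

print(letters_only)  # nanospeak
print(words_spaced)  # nano speak
```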

Related

How to remove all non word characters except ä,ö and ü from a text using RegExp

I have a file and I want to remove all non-word characters from it, with the exception of ä, ö and ü, which are mutated vowels in the German language. Is there a way to do word.gsub!(/\W/, '') and put exceptions in it?
Example:
text = "übung bzw. äffchen"
text.gsub!(/\W/, '').
Now it would return "bungbzwffchen". It deletes the non word characters, but also removes the mutated vowels ü and ä, which I want to keep.
You may be able to define a list of exclusions by using some kind of negative-lookback thing, but the simplest I think would be to just use \w instead of \W and negate the whole group:
word.gsub!(/[^\wÄäÖöÜü]/, '')
You could also use word.gsub(/[^\p{Letter}]/, ''), that should get rid of any characters that are not listed as "Letter" in unicode.
You mention German vowels in your question; it's worth noting that the German alphabet also includes the Eszett (sharp s): ẞ / ß
Update:
To answer your original question, to define a list of exclusions, you use the "negative look-behind" (?<!pat):
word.gsub(/\W(?<![ÄäÖöÜüẞß])/, '')
You could use the && operator within a character class:
text = "übung bzw. äffchen ÄÖÜ"
text.gsub(/[\W&&[^äÄöÖüÜ]]/, '')
#=> "übungbzwäffchenÄÖÜ"
The regular expression reads, "match a character in the set formed by intersecting the set of all non-word characters with the set of all characters other than those in the string 'äÄöÖüÜ'". See Regexp (search for "&& operator").
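For comparison, Python's re module has no && set intersection, so a rough equivalent is the negated-class form. This is only a sketch: it assumes the re.ASCII flag is wanted so that \w is ASCII-only (mirroring the Ruby behaviour described above), and that the umlauts in the text are precomposed characters.

```python
import re

text = "übung bzw. äffchen ÄÖÜ"

# re.ASCII restricts \w to [a-zA-Z0-9_]; the umlauts are whitelisted
# explicitly inside the negated character class.
cleaned = re.sub(r"[^\wäÄöÖüÜ]", "", text, flags=re.ASCII)

print(cleaned)  # übungbzwäffchenÄÖÜ
```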

Regular expression to match any word followed by a literal string

So I have the following:
^[a-zA-Z]+\b(myword+-)\b*
which I thought would match
^ start of string
[a-zA-Z] any alpha character
+ of one or more characters
\b followed by a word break
(myword+-) followed by myword which could include one or more special characters
\b followed by a word break
* followed by anything at all
One: it does not work - it does not match anything
Two: any special characters included in (myword+-) throw an error
I could escape the special characters, but I don't know in advance what they might be, so I would have to escape all the possibilities, or perhaps I could just escape every character, as in (\m\y\w\o\r\d\+\-)
Edited to add:
Sorry, I knew I should have given more information
I have a series of strings to search through in the form:
extra android-sdk and more that is of no interest
extra android-ndk and more that is of no interest
extra anjuta-extra and more that is of no interest
community c++-gtk-utils and more that is of no interest
and I have a list of items to search for in the strings:
android-sdk
android-ndk
extra
c++-gtk-utils
The item should only match if the second word in the string is an exact match to the item, so:
android-sdk will match the first string
android-ndk will match the second string
extra will NOT match the third string
c++-gtk-utils will match the fourth string
So (myword+-) is the item I am searching for "which could include one or more special characters"
Thanks for the help
Andrew
OK, with the help from above I worked it out.
This regex does exactly what I wanted, bear in mind that I am working in tcl (note the spaces to delimit the search word):
^[a-zA-Z]+\y extra \y *
where the search word is "extra".
It is necessary to escape any characters in the search string which may be interpreted by regex as quantifiers etc., e.g. +
So this will also work:
^[a-zA-Z]+\y dbus-c\+\+ \y *
Andrew
Strong recommendation: if you want to match literal strings, don't use regular expressions.
If we have this sample data:
set strings {
{extra android-sdk and more that is of no interest}
{extra android-ndk and more that is of no interest}
{extra anjuta-extra and more that is of no interest}
{community c++-gtk-utils and more that is of no interest}
}
set search_strings {
android-sdk
android-ndk
extra
c++-gtk-utils
}
Then, to find matches in the 2nd word of each string, we'll just use the eq string equality operator
foreach string $strings {
foreach search $search_strings {
if {[lindex [split $string] 1] eq $search} {
puts "$search matches $string"
}
}
}
outputs
android-sdk matches extra android-sdk and more that is of no interest
android-ndk matches extra android-ndk and more that is of no interest
c++-gtk-utils matches community c++-gtk-utils and more that is of no interest
If you insist on regular expression matching, you can escape any special characters to take away their usual regex meaning. Here, we'll take the brute force approach: any non-word chars will get escaped, so that the pattern may look like ^\S+\s+c\+\+\-gtk\-utils
foreach string $strings {
foreach search $search_strings {
set pattern "^\\S+\\s+[regsub -all {\W} $search {\\&}]"
if {[regexp $pattern $string]} {
puts "$search matches $string"
}
}
}
I was hoping to be able to make a portion of a regular expression to be a literal string, like
set pattern "^\\S+\\s+(***=$string)"
set pattern "^\\S+\\s+((?q)$string)"
but both failed.
Tcl regular expressions are documented at
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
Also note your pattern ^[a-zA-Z]+\b(myword+-)\b* does not provide for any whitespace between the first and second words.
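The same escape-then-match idea can be sketched in Python, where re.escape does the escaping. The function name second_word_matches and the trailing (?!\S) check (which requires the second word to end right after the item) are additions for illustration, not part of the Tcl code above.

```python
import re

lines = [
    "extra android-sdk and more that is of no interest",
    "extra android-ndk and more that is of no interest",
    "extra anjuta-extra and more that is of no interest",
    "community c++-gtk-utils and more that is of no interest",
]
items = ["android-sdk", "android-ndk", "extra", "c++-gtk-utils"]


def second_word_matches(item, line):
    # re.escape neutralises metacharacters such as "+";
    # (?!\S) requires the second word to end right after the item.
    pattern = r"^\S+\s+" + re.escape(item) + r"(?!\S)"
    return re.search(pattern, line) is not None


matches = [(item, line) for line in lines for item in items
           if second_word_matches(item, line)]
for item, line in matches:
    print(f"{item} matches {line}")
```

With the sample data this reports android-sdk, android-ndk and c++-gtk-utils, and nothing for extra.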
Disclaimer: Since your question doesn't say what input and output are expected, I'll try to explain why your regex isn't working at all. Since this isn't a full answer, you might not want to mark it as accepted; once you provide the necessary information, someone may be able to give you a working solution.
Notes:
quantifier characters (*, +, ? etc.) apply to the preceding literal character, character class (a.k.a. character group, namely characters/ranges inside [ ]), or group - when you write (myword+-) in your regex, the + sign applies only to the letter 'd', nothing else.
what is myword in your regex? If you want a set of characters, use [ ] combined with character ranges and/or character tokens such as \w (all word characters: letters, digits and underscore) or \d (all digit characters)
you also seem to misunderstand and misuse groups ("( )"), character classes ("[ ]") and quantifier notation ("{ }")
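The first note can be checked quickly. This sketch uses Python's re module purely as a convenient test bench (the question itself is about Tcl), and the variable names are illustrative only.

```python
import re

# "+" binds only to the preceding "d": "mywor", one or more "d", then "-".
plus_binds_to_d = re.fullmatch(r"myword+-", "myworddd-") is not None

# The unescaped pattern therefore does not match the literal text "myword+-".
literal_plus = re.fullmatch(r"myword+-", "myword+-") is not None

# Escaping the metacharacters makes the pattern match the text literally.
escaped = re.fullmatch(re.escape("myword+-"), "myword+-") is not None

print(plus_binds_to_d, literal_plus, escaped)  # True False True
```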

Insert \n after the word after a regex match

Let's say I have a long string of text, like a paragraph or more, and there is a specific word that appears frequently; let's call it KEY.
I want to insert a "\n" after the word that comes after each KEY.
So if I have the string KEY Hello this is KEY an example. I want it to look like KEY Hello \nthis is KEY an \nexample
If the task were slightly simpler, and I just wanted to add \n after KEY then I could easily do that with, gsub("KEY","KEY \n",string), but I don't think regex has an elegant way of selecting the word after a match, and even if it did I'm not sure I could use it in a gsub.
What would be a good way to add the \n's where I want them?
You can use a capture group and refer back to it. You have to decide how to handle certain scenarios and the specifics of your case, as Wiktor Stribiżew pointed out.
For the example case presented, look for KEY followed by a space, followed by one or more non-whitespace characters (\\S+), followed by a space:
gsub("(KEY \\S+ )", "\\1\n", string, perl = TRUE)
If you want to be more general in what can follow "KEY", then you can add a character class including what you'll allow (or \s for any whitespace character or \W for any non-alphanumeric/underscore characters, as Wiktor points out). Something like this:
gsub("(KEY[., ;!?]\\S+ )", "\\1\n", string, perl = TRUE)
gsub("(KEY\\s\\S+ )", "\\1\n", string, perl = TRUE)
gsub("(KEY\\W+\\S+ )", "\\1\n", string, perl = TRUE)
Putting whatever punctuation you want in the character class part [., ;!?]
Wiktor's variation(s) may be a bit more robust:
gsub("(KEY\\s+\\S+\\s*)", "\\1\n", string) # \s = white-space character
# \S = non-white-space character
gsub("(KEY\\W+\\w+\\s*)", "\\1\n", string) # \w for alphanumeric/underscore
# \W for the opposite of \w.
These variants don't require a space after the next word (\\s* for 0 or more white-space characters) and they can match one or more whitespace characters after KEY or one or more non-alphanumerics/underscores after KEY.
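For readers more used to Python, the capture-group-plus-backreference idea looks roughly like this with re.sub. It is a sketch of the same technique, not a translation of every gsub variant above.

```python
import re

text = "KEY Hello this is KEY an example."

# Capture KEY, the whitespace after it, the next word, and any trailing
# whitespace, then re-emit the capture (\1) followed by a newline.
result = re.sub(r"(KEY\s+\S+\s*)", r"\1\n", text)

print(repr(result))  # 'KEY Hello \nthis is KEY an \nexample.'
```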

Groovy : RegEx for matching Alphanumeric and underscore and dashes

I am working on Grails 1.3.6 application. I need to use Regular Expressions to find matching strings.
It needs to find whether a string has anything other than Alphanumeric characters or "-" or "_" or "*"
An example string looks like:
SDD884MMKG_JJGH1222
What i came up with so far is,
String regEx = "^[a-zA-Z0-9*-_]+\$"
The problem with above is it doesn't search for special characters at the end or beginning of the string.
I had to add a "\" before the "$", or else it will give an compilation error.
- Groovy:illegal string body character after dollar sign;
Can anyone suggest a better RegEx to use in Groovy/Grails?
Problem is unescaped hyphen in the middle of the character class. Fix it by using:
String regEx = "^[a-zA-Z0-9*_-]+\$";
Or even shorter:
String regEx = "^[\\w*-]+\$";
By placing an unescaped - in the middle of character class your regex is making it behave like a range between * (ASCII 42) and _ (ASCII 95), matching everything in this range.
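The accidental range is easy to demonstrate. The sketch below uses Python's re module, which treats hyphens inside character classes the same way; the names buggy, fixed and sample are illustrative only.

```python
import re

buggy = re.compile(r"^[a-zA-Z0-9*-_]+$")  # "*-_" is a range from "*" (42) to "_" (95)
fixed = re.compile(r"^[a-zA-Z0-9*_-]+$")  # a trailing "-" is a literal hyphen

sample = "abc@def.ghi"  # "@" (64) and "." (46) both fall inside the accidental range

print(bool(buggy.match(sample)))                 # True
print(bool(fixed.match(sample)))                 # False
print(bool(fixed.match("SDD884MMKG_JJGH1222")))  # True
```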
In Groovy the $ char in a string is used to handle replacements (e.g. Hello ${name}). These so-called GStrings are only processed when the string is delimited with " characters, which is why you need the extra escaping there.
Groovy also lets you write strings without that feature by surrounding them with ' (single quotes). Yet the easiest way to write a regexp is the slashy syntax with /.
assert "SDD884MMKG_JJGH1222" ==~ /^[a-zA-Z0-9*_-]+$/
See Regular Expressions for further "shortcuts".
The other points from @anubhava remain valid!
It's easier to reverse it and search (find, e.g. with =~, rather than a full match) for a single disallowed character:
String regEx = "[^a-zA-Z0-9*_-]" /* This finds any character that is _not_ alnum, *, - or _ */

Remove special characters from a string except whitespace

I am looking for a regular expression to remove all special characters from a string, except whitespace. And maybe replace all multi- whitespaces with a single whitespace.
For example "[one# !two three-four]" should become "one two three-four"
I tried using str = Regex.Replace(strTemp, "^[-_,A-Za-z0-9]$", "").Trim() but it does not work. I also tried few more but they either get rid of the whitespace or do not replace all the special characters.
[ ](?=[ ])|[^-_,A-Za-z0-9 ]+
Try this, replacing each match with an empty string. See the demo:
http://regex101.com/r/lZ5mN8/69
Use the regex [^\w\s-] to remove all characters other than word characters, whitespace and hyphens, then collapse double spaces:
Regex.Replace("[one# !two three-four]", "[^\w\s-]", "").Replace("  ", " ").Trim()
METHOD:
Instead of replace, use replaceAll, e.g.:
String InputString = "[one# !two three-four]";
String testOutput = InputString.replaceAll("[\\[!,*)@#%(&$_?.^\\]]", "").replaceAll("( )+", " ");
Log.d("THE OUTPUT", testOutput);
This will give an output of one two three-four.
EXPLANATION:
.replaceAll("[\\[!,*)@#%(&$_?.^\\]]", "") replaces ALL the special characters listed between the outer brackets []
.replaceAll("( )+", " ") replaces each run of one or more spaces with a single space
REPLACING THE - symbol:
To remove - as well, just add the escaped symbol to the regex like this: .replaceAll("[\\[\\-!,*)@#%(&$_?.^\\]]", "")
Hope this helps :)
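The two-step approach (strip unwanted characters, then collapse whitespace) can also be sketched in Python. The function name clean and the exact set of kept characters (taken from the character class in the question) are assumptions for illustration.

```python
import re


def clean(text):
    # Drop everything except letters, digits, "-", "_", "," and whitespace.
    kept = re.sub(r"[^-_,A-Za-z0-9\s]+", "", text)
    # Collapse runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", kept).strip()


print(clean("[one# !two   three-four]"))  # one two three-four
```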