If a word contains alphanumeric characters and begins with one or more non-alphanumeric characters, how can I split off each leading non-alphanumeric character as a separate word? Likewise, whether or not the first rule was applied, if the word contains alphanumeric characters and ends with one or more non-alphanumeric characters, how can I split off each trailing non-alphanumeric character as a separate word?
For example, if I have a
string = "John had a meeting with 3managers! %nervous:( t^ria7 #manager's.!"
The output should look like this
"John had a meeting with 3managers ! % nervous : ( t^ria7 # manager's . !"
The (new) idea is to split the string on whitespace and then apply an alternation regex to each word. In the end, the parts are glued together again.
The expression in question:
^(\W+)|(\W+)$
This matches a run of non-word characters at either the beginning or the end of the string; see a demo on regex101.com.
In Python, you need to check which group was captured to insert the whitespace on the appropriate side:
import re

string = """John had a meeting with 3managers! %nervous:( t^ria7 #manager's."""

def replacer(match):
    # group 1 is a leading run of non-word characters, group 2 a trailing run
    if match.group(1) is not None:
        return '{} '.format(match.group(1))
    else:
        return ' {}'.format(match.group(2))

rx = re.compile(r'^(\W+)|(\W+)$')
string = " ".join([rx.sub(replacer, word) for word in string.split()])
print(string)
This yields
John had a meeting with 3managers ! % nervous :( t^ria7 # manager's .
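Note that the output above keeps a run such as :( together, whereas the question asks for every leading or trailing character to become its own word. A minimal sketch of one way to get that, assuming "alphanumeric" means ASCII letters and digits; split_word and split_text are hypothetical helper names, not part of the answer above:

```python
import re

# leading run of non-alphanumerics, a core that starts and ends with an
# alphanumeric character, then a trailing run of non-alphanumerics
WORD_RX = re.compile(
    r'^([^A-Za-z0-9]*)([A-Za-z0-9](?:.*[A-Za-z0-9])?)([^A-Za-z0-9]*)$'
)

def split_word(word):
    m = WORD_RX.match(word)
    if m is None:
        # no alphanumeric character at all: leave the word untouched
        return [word]
    lead, core, trail = m.groups()
    # list() turns each leading/trailing character into its own item
    return list(lead) + [core] + list(trail)

def split_text(text):
    return " ".join(part for word in text.split() for part in split_word(word))

print(split_text("John had a meeting with 3managers! %nervous:( t^ria7 #manager's.!"))
# John had a meeting with 3managers ! % nervous : ( t^ria7 # manager's . !
```

Characters inside the core, such as the ^ in t^ria7 or the apostrophe in manager's, stay where they are because only the outer runs are split.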
I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?
Assuming you are only looking for words without digits and underscores (\w would include those), I'd suggest:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookarounds to assert that the word is surrounded only by whitespace (or by the start/end of the string). You may also use word boundaries if you prefer and swap those lookarounds for \b. Also, depending on your application, you can probably replace the inline case-insensitive switch with a flag. For example, if you happen to use JavaScript, /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
var myString = "The cat ate the dog\nand the mouse";
// A regex literal keeps the backslashes; inside a plain string, "\S" would collapse to "S".
var myRegexp = /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi;
var m = myRegexp.exec(myString);
while (m != null) {
    console.log(m[1]);
    m = myRegexp.exec(myString);
}
If you want to match a word using \w, you might also use a negated character class that matches any character except a or a newline.
Then match a word that contains at least one a, preceded by a word boundary \b:
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of string (or of each line when the multiline flag is set)
[^a\n\r]*\b Match zero or more characters other than a or a newline, then assert a word boundary
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo
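For what it's worth, here is a minimal sketch of how the \b version of this pattern could be run in Python. It assumes the multiline flag so that ^ anchors at every line; first_words_with_a is a hypothetical helper name:

```python
import re

# re.MULTILINE makes ^ match at the start of every line, not just the string
RX = re.compile(r'^[^a\n\r]*\b([^\Wa]*a\w*)', re.MULTILINE)

def first_words_with_a(text):
    # with a single capture group, findall returns the group for each match
    return RX.findall(text)

print(first_words_with_a("The cat ate the dog\nand the mouse"))
# ['cat', 'and']
```

The pattern is case-sensitive as written, so a word whose only a is an uppercase A would not be found unless re.IGNORECASE is added.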
The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
with the multiline and case-insensitive flags set. For each match, capture group 1 contains the first word (comprised only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a". Because the pattern ends with \r?\n, a final line without a line terminator is not matched either.
Demo
The expression can be broken down as follows.
(?=                    # begin a positive lookahead
  (                    # begin capture group 1
    \b[a-z]*a[a-z]*\b  # match a whole word containing an "a"
  )                    # end capture group 1
)                      # end the positive lookahead
.*\r?\n                # match the remainder of the line including the
                       # line terminator
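If the regex does not have to do everything in one pass, a simpler alternative (a different technique from the lookahead above) is to split the text into lines and take the first match in each. A rough sketch in Python, with first_a_word_per_line as a hypothetical helper name:

```python
import re

# a whole word made only of letters that contains at least one "a"
WORD_WITH_A = re.compile(r"\b[a-z]*a[a-z]*\b", re.IGNORECASE)

def first_a_word_per_line(text):
    found = []
    for line in text.splitlines():
        m = WORD_WITH_A.search(line)  # search stops at the first match
        if m:
            found.append(m.group())
    return found

print(first_a_word_per_line("The cat ate the dog\nand the mouse"))
# ['cat', 'and']
```

This version does not care whether the last line ends with a line terminator.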
I am trying to capture n consecutive capitalized words. My current code is
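import re
n = 5
a = 'This is a Five Gram With Five Caps and it also contains a Two Gram'
# interpolate n into the quantifier; a literal "{n}" in the pattern is not a repeat count
re.findall(' ([A-Z]+[a-z|A-Z]* ){%d}' % n, a)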
Which returns the following:
['Caps ']
It's identifying the fifth consecutive capitalized word, but I would like it to return the entire string of capitalized words. In other words:
[' Five Gram With Five Caps ']
Note that | doesn't act as an OR inside a character class; it matches | literally. The other issue here is that findall returns the whole match unless the pattern contains a group (although Python's documentation doesn't really make this clear):
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups
So this is why you're getting the result of the first capture group: its last repetition matched the final capitalized word, Caps.
The simple solution is to change your capturing group to a non-capturing group. I've also changed the space at the start to \b so as to not match an additional whitespace (which I presume you were planning on trimming anyway).
See code in use here
import re
r = re.compile(r"\b(?:[A-Z][a-zA-Z]* ){5}")
s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(r.findall(s))
See regex in use here
\b(?:[A-Z][a-zA-Z]* ){5}
\b Assert position as a word boundary
(?:[A-Z][a-zA-Z]* ){5} Match the following exactly 5 times
    [A-Z] Match an uppercase ASCII letter once
    [a-zA-Z]* Match any ASCII letter any number of times
    " " Match a single space
Result: ['Five Gram With Five Caps ']
Additionally, you may use the regex \b[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}\b instead. This will allow matches at the start/end of the string as well as anywhere in the middle without grabbing extra whitespace. Another alternative may include (?:^|(?<= ))[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}(?= |$)
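Since the question keeps the count in a variable n, here is a small sketch of how the first alternative could be parameterized. capitalized_runs is a hypothetical helper name, and the repeat count is n - 1 because the first word is matched outside the group:

```python
import re

def capitalized_runs(text, n):
    # first capitalized word, then n - 1 repetitions of " Word"
    pattern = r"\b[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){%d}\b" % (n - 1)
    return re.findall(pattern, text)

s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(capitalized_runs(s, 5))  # ['Five Gram With Five Caps']
print(capitalized_runs(s, 2))  # ['Five Gram', 'With Five', 'Two Gram']
```

Because the group is non-capturing, findall returns the whole match, and matches do not overlap, which is why n = 2 yields 'With Five' rather than 'Gram With'.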
Wrap the whole pattern in a capturing group; findall then returns one tuple per match, and the full run of words is its first element:
(([A-Z]+[a-z|A-Z]* ){5})
Demo
Here is the problem:
Replace input string with the following: The first and last characters, separated by the count of distinct characters between the two.
Any non-alphabetic character in the input string should appear in the output string in its original relative location.
Here is the code I have thus far:
word = input("Please enter a word: ")
first_character = word[0]
last_character = word[-1]
unique_characters = (list(set(word[1:-1])))
unique_count = str(len(unique_characters))
print(first_character[0],unique_count,last_character[0])
For the second part, I have thought about using regex, however I have not been able to wrap my head around regex as it is not something I ever use.
You can use
import re
pat = r"\b([^\W\d_])([^\W\d_]*)([^\W\d_])\b"
s = "Testers"
print(re.sub(pat, (lambda m: "{0}{1}{2}".format(m.group(1), len(''.join(set(m.group(2)))), m.group(3))), s))
See the IDEONE demo.
The regex breakdown:
\b - word boundary (use ^ if you test an individual string)
([^\W\d_]) - Group 1 capturing a letter (ASCII only in Python 2 unless the re.U flag is set; in Python 3, str patterns match Unicode letters by default and re.ASCII restricts them to ASCII)
([^\W\d_]*) - Group 2 capturing zero or more letters
([^\W\d_]) - Group 3 capturing a letter at...
\b - the trailing word boundary (replace with $ if you handle individual strings)
In the replacement pattern, the len(''.join(set(m.group(2)))) is counting the number of unique letter occurrences (see this SO post).
If 2-letter words such as Ts should be left unchanged (Ts > Ts) instead of becoming T0s, replace the * quantifier with + in the second group.
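To show how this behaves on input that contains non-alphabetic characters, here is the same pattern wrapped in a function. This is only a sketch; abbreviate is a hypothetical name, and len(set(...)) is used as a shorter equivalent of the len(''.join(set(...))) expression above:

```python
import re

WORD = re.compile(r"\b([^\W\d_])([^\W\d_]*)([^\W\d_])\b")

def abbreviate(text):
    def repl(m):
        first, middle, last = m.groups()
        # count the distinct letters between the first and last letter
        return "{}{}{}".format(first, len(set(middle)), last)
    return WORD.sub(repl, text)

print(abbreviate("Testers"))        # T4s
print(abbreviate("Hello, world!"))  # H2o, w3d!
```

The comma, space and exclamation mark are never part of a match, so they stay in their original positions.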
I want to match entire words (or strings really) that contain only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, and either at the end of the string or followed by whitespace. If you can't use lookbehind (for example, if you're using a JavaScript engine that predates ES2018) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
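The difference is easy to see by running both variants on the sample sentence. A quick sketch in Python, which supports lookbehind; whole_tokens and consuming_variant are hypothetical names used only for this comparison:

```python
import re

text = "dog god ogd, dogs o"

def whole_tokens(s):
    # lookarounds check the neighbours without consuming them
    return re.findall(r"(?<!\S)[dog]+(?!\S)", s)

def consuming_variant(s):
    # the trailing (?:$|\s) eats the space that the next word needs in front
    return re.findall(r"(?:^|\s)([dog]+)(?:$|\s)", s)

print(whole_tokens(text))       # ['dog', 'god', 'o']
print(consuming_variant(text))  # ['dog', 'o']
```

In the second result "god" is missing because the space before it was already consumed by the match for "dog".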
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Example in javascript
Example in php
Anything between square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges such as [0-9] or [a-z], but a class still matches only 1 character. The + and * are quantifiers: + matches 1 or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}): {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses () is captured as a group, which can be used in callbacks, for instance when you want to return the captured values or use them as replacements in the string. The (?!...) construct is a negative lookahead: the match fails if the text right after it matches the pattern inside, which ensures that a run of the letters is not matched when it is followed by a word character or a comma.
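A few throwaway checks in Python may make the quantifiers and the lookahead more concrete. This is just an illustration; the sample strings other than the dog sentence are made up, and the function names are hypothetical:

```python
import re

def two_digit_chunks(s):
    # {2} means exactly two repetitions of the class
    return re.findall(r"[0-9]{2}", s)

def short_words(s):
    # {1,3} means between one and three repetitions
    return re.findall(r"\b[a-z]{1,3}\b", s)

def dog_words(s):
    # the negative lookahead rejects a run followed by a word character or comma
    return re.findall(r"[dog]+(?![\w,])", s)

print(two_digit_chunks("7 42 123"))      # ['42', '12']
print(short_words("a dog barks"))        # ['a', 'dog']
print(dog_words("dog god ogd, dogs o"))  # ['dog', 'god', 'o']
```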
Given the code:
import clr
clr.AddReference('System')
from System.Text.RegularExpressions import *
def ReplaceBlank(st):
    return Regex.Replace(
        st, r'[^a-z\s](\s+)[^a-z\s]',
        lambda s: s.Value.Replace(' ', ''), RegexOptions.IgnoreCase)
I expect the input ABC EDF to return ABCEDF, but it doesn't work. What did I do wrong?
[^a-z\s] with ignore-case flag set matches anything other than letters and whitespace characters. ^ at the beginning of a character class (the thing between []) negates the character class.
To replace blanks, you can simply replace \s+ with empty strings or, if you need to match only letters replace
(?<=[a-z])\s+(?=[a-z])
with an empty string. This regex matches a run of whitespace between two letters; to account for the beginning/end of the string, use
(?<=(^|[a-z]))\s+(?=($|[a-z]))
or
\b\s+\b
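The last one matches whitespace that has a word character (letter, digit or underscore) on each side, so unlike the letter-only versions it also removes spaces next to digits and underscores. Whitespace next to punctuation such as a period, comma or hyphen is not matched, because there is no word boundary between a space and a symbol.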
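For comparison, here is a sketch of the letter-only lookaround version using the standard Python re module instead of the .NET Regex class from the question (the pattern itself is the same). remove_inner_blanks is a hypothetical name:

```python
import re

def remove_inner_blanks(s):
    # drop whitespace only when there is a letter on both sides
    return re.sub(r"(?<=[a-z])\s+(?=[a-z])", "", s, flags=re.IGNORECASE)

print(remove_inner_blanks("ABC EDF"))  # ABCEDF
print(remove_inner_blanks("a  b, c"))  # ab, c
```

In the second call the space after the comma survives, because the character before it is not a letter.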