Say I have the following string:
Something before _The brown "fox" jumped over_ Something after
I want to capture what's between _ and _, but only if there is an even number of quotes " between them. So the above case will be a match.
From the following, only the second and third underscore-delimited parts should be matched:
Some text _fir"st part_ and other text _seco"nd t"est_ and more _thir"d" "t"est_
Note that the second and third ones have 2 and 4 quotes, respectively.
I've tried to do it but I've not been very successful: _ (?= [^_]* " [^_]* " [^_]* _)* .*? _ The spaces are added for readability.
I'm using PHP if it's relevant.
You can use this regex:
_(([^"_]*"){2})+[^"_]*_
Online Demo: http://regex101.com/r/bN9pF1
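To check the pattern outside PHP, here is a minimal Python sketch (Python is only my choice for illustration, and the name EVEN_QUOTES is hypothetical). It runs the regex above over the second sample string from the question:

```python
import re

# Pattern from the answer: an underscore, one or more *pairs* of quotes
# (each quote preceded by text free of quotes and underscores),
# then trailing text and the closing underscore.
EVEN_QUOTES = re.compile(r'_(([^"_]*"){2})+[^"_]*_')

text = 'Some text _fir"st part_ and other text _seco"nd t"est_ and more _thir"d" "t"est_'

matches = [m.group(0) for m in EVEN_QUOTES.finditer(text)]
print(matches)
# ['_seco"nd t"est_', '_thir"d" "t"est_']
```

Note that the pattern needs at least one pair of quotes, so a segment with no quotes at all is not matched.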
Example: "this is the #example te#s5t #abc17$ ++ 123"
Will match: "this is the te ++ 123"
You could use this regex to match the things you don't want, and then replace any matches with an empty string:
\s*#\S*
The regex matches zero or more whitespace characters, followed by a #, then zero or more non-space characters. For your sample data, this gives this is the te ++ 123 as the output.
Note that the whitespace before the # is also removed (when the # starts a word) so that when a whole word is removed (e.g. #example in your sample data) you aren't left with two adjacent spaces in the output string.
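As a quick illustration, the replacement can be sketched in Python with re.sub (the variable names are mine; the question does not say which language is in use):

```python
import re

sample = "this is the #example te#s5t #abc17$ ++ 123"

# Remove each '#' together with the non-space characters after it,
# plus any whitespace immediately before it.
cleaned = re.sub(r"\s*#\S*", "", sample)
print(cleaned)  # this is the te ++ 123
```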
I have a text that I need to split into sub-sentences, but if the text contains special cases such as domain.com or st. moris, it gets split at those points too.
Here is what I got:
val pattern = "(?<=[.](?<![s][t][.]))"
val text = "here is an axample with cases like st. moris and google.com here. second sentence."
val list = text.split(pattern)
list.foreach(println)
I want this code to return
List(
"here is an axample with cases like st. moris and google.com here.",
"second sentence."
)
but instead it returns:
List(
"here is an axample with cases like st.",
" moris and google.",
"com here.",
"second sentence."
)
How can I make it work?
If you want to split on 1+ whitespace characters preceded by a dot that is not itself preceded by st as a whole word, you may use
val pattern = """(?i)(?<=(?<!\bst)\.)\s+"""
Or, if the number of whitespace chars after the dot can be 0, you may add logic to avoid matching a . that is followed by com, org, etc. as whole words:
val pattern = """(?i)(?<=\.(?<!\bst\.)(?!(?:com|org)\b))\s*+(?!$)"""
Details:
(?i) - makes the pattern case insensitive
(?<=(?<!\bst)\.) - a location immediately preceded with a dot that is not immediately preceded with a whole word st
\s+ - 1 or more whitespaces
Or
(?i) - makes the pattern case insensitive
(?<=\.(?<!\bst\.)(?!(?:com|org)\b)) - a location immediately preceded with a dot that is not immediately preceded with a whole word st and not immediately followed with com or org as whole words (add more alternatives if needed after |)
\s*+ - 0 or more whitespaces matched possessively
(?!$) - not at the end of string.
See Scala demo #1 (Scala demo #2):
val pattern = """(?i)(?<=(?<!\bst)\.)\s+"""
// val pattern = """(?i)(?<=\.(?<!\bst\.)(?!(?:com|org)\b))\s*+(?!$)""" // Pattern #2
val text = "here is an axample with cases like st. moris and google.com here. second sentence."
val list = text.split(pattern)
list.foreach(println)
Output:
here is an axample with cases like st. moris and google.com here.
second sentence.
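For readers who want to try the first idea outside Scala, here is a rough Python equivalent. This is a sketch under an assumption: Python's re module only accepts fixed-width lookbehinds, so I have written the check as two consecutive fixed-width lookbehinds, which is my own reformulation and not the exact pattern above.

```python
import re

text = "here is an axample with cases like st. moris and google.com here. second sentence."

# Split on whitespace that follows a dot, unless that dot ends the whole word "st".
# (?<=\.)      the position is right after a dot
# (?<!\bst\.)  and the three characters before it are not the word "st" plus the dot
pattern = re.compile(r"(?<=\.)(?<!\bst\.)\s+", re.IGNORECASE)

sentences = pattern.split(text)
for sentence in sentences:
    print(sentence)
```

The dot in google.com is never a split point here because no whitespace follows it.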
Your code returns that result because the pattern tells split where to break the string, and one of the symbols it names is ".".
So the string is split as soon as the "." after st is reached.
You have two options: either remove the "." after st and google, or put a different separator symbol before the word "second", split on that symbol, and drop "." from the pattern.
So this one works for me, and can be expanded with different exclusions in the text
((.+(st\.|mr\.|mrs\.))*.+?\.( |$))
Maybe there will be some sub-matches in the group, but you should look only for full matches. Here is the regex101.com example
As you see on the right, only two matches.
To add more exclusions, append the abbreviations you want to treat as exceptions to the (st\.|mr\.|mrs\.) alternation.
Domain names are excluded by this part: \.( |$). It says that a sentence must end with a dot followed by a space or the end of the line.
Reply if it works in your environment.
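To see the full matches without regex101, the pattern can be tried in Python like this (a minimal sketch; the question uses Scala, and the variable names here are mine):

```python
import re

text = "here is an axample with cases like st. moris and google.com here. second sentence."

# Pattern from the answer: optionally consume text up to a listed abbreviation,
# then lazily continue to the first dot followed by a space or the end of the string.
pattern = re.compile(r"((.+(st\.|mr\.|mrs\.))*.+?\.( |$))")

# Only the full match (group 0) is of interest; strip the trailing space it may include.
sentences = [m.group(0).strip() for m in pattern.finditer(text)]
print(sentences)
```

Since the abbreviation check is a plain substring test, any word that happens to end in "st." (for example "best.") would also be treated as an exclusion.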
Hard to word this correctly, but TL;DR.
I want to match, in a given text sentence (let's say "THE TREE IS GREEN") if any space is doubled (or more).
Example:
"In this text,
THE TREE IS GREEN should not match,
THE TREE  IS GREEN should
and so should THE  TREE   IS GREEN
but double-spaced  TEXT SHOULD  NOT BE FLAGGED outside the pattern."
My initial approach would be
/THE( {2,})TREE( {2,})IS( {2,})GREEN/
but this only matches if all spaces are double in the sequence, therefore I'd like to make any of the groups trigger a full match. Am I going the wrong way, or is there a way to make this work?
You can use a negative lookahead, if your regex flavor supports it.
First rule out the exact sentence that should not match (in your case the single-spaced "THE TREE IS GREEN"), then follow it with the more general pattern that catches the spacing variants you want.
(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)
https://regex101.com/r/EYDU6g/2
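Here is a small Python sketch of that idea. The test lines are my own reconstruction of the question's example, with the doubled spaces placed arbitrarily and built through a helper variable (gap, a hypothetical name) so they stay visible:

```python
import re

# Reject the exact single-spaced sentence, then accept any spacing variant.
pattern = re.compile(r"(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)")

gap = " " * 2  # a doubled space

lines = [
    "THE TREE IS GREEN should not match,",
    f"THE TREE{gap}IS GREEN should",
    f"and so should THE{gap}TREE{gap}IS GREEN",
    f"but double-spaced{gap}TEXT SHOULD{gap}NOT BE FLAGGED outside the pattern.",
]

# Keep only the lines where the sentence occurs with at least one widened gap.
flagged = [line for line in lines if pattern.search(line)]
for line in flagged:
    print(repr(line))
```

Only the second and third lines are flagged: the first is rejected by the lookahead, and the last never contains the sentence at all.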
You can just search for the spaces that you're looking for:
/ {2,}/ will work to match two or more of the space character. (https://regexr.com/4h4d4)
You can capture the results by surrounding it with parentheses - /( {2,})/
You may want to broaden it a bit.
/\s{2,}/ will match any doubling of whitespace.
(\s - means any whitespace - space, tab, newline, etc.)
No need to match the whole string, just the piece that's of interest.
If I am not mistaken you want the whole match if there is a part present where there are 2 or more spaces between 2 uppercased parts.
If that is the case, you might use:
^.*[A-Z]+[ ]{2,}[A-Z]+.*$
^ Start of string
.*[A-Z]+ match any char except a newline 0+ time, then match 1+ times [A-Z]
[ ]{2,} Match 2 or more times a space (used square brackets for clarity)
[A-Z]+ Match 1+ times an uppercase char
.*$ Match any char except a newline 0+ times until the end of the string
You could do this:
import re

# One or more spaces between the words, so every spacing variant is found.
pattern = r"THE +TREE +IS +GREEN"
test_str = ("In this text,\n"
            "THE TREE IS GREEN should not match,\n"
            "THE TREE  IS GREEN should\n"
            "and so should THE  TREE   IS GREEN\n"
            "but double-spaced  TEXT SHOULD  NOT BE FLAGGED outside the pattern.")
matches = re.finditer(pattern, test_str)
for match in matches:
    # Skip the correctly single-spaced sentence; print only the variants with extra spaces.
    if match.group() != 'THE TREE IS GREEN':
        print(match.group())
To better clean my forum message corpus, I would like to remove the spaces before punctuation and add one after it where needed, using two regular expressions. The latter was no problem ((?<=[.,!?()])(?! )), but I'm having trouble with the former.
I used this expression: \s([?.!,;:"](?:\s|$))
But it's by far not flexible enough:
It matches even if there's already a space(or more) before the punctuation character
It doesn't match if there's not a space after the punctuation character
It doesn't match any unlisted punctuation character (but I guess I can use [:punct:] for that, at the end of the day)
Finally, both expressions match decimal points, which they should not
How can I eventually rewrite the expression to meet my needs?
Example Strings and expected output
This is the end .Hello world! # This is the end. Hello world! (remove the leading, add the trailing)
This is the end, Hello world! # This is the end, Hello world! (ok!)
This is the end . Hello world! # This is the end. Hello world! (remove the leading, ok the trailing)
This is a .15mm tube # This is a .15 mm tube (ok since it's a decimal point)
Use \p{P} to match any punctuation character. Use \h* instead of \s*, because \s would also match newline characters.
(?<!\d)\h*(\p{P}+)\h*(?!\d)
Replace the matched strings by \1<space>
> x <- c('This is the end .Stuff', 'This is the end, Stuff', 'This is the end . Stuff', 'This is a .15mm tube')
> gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl=T)
[1] "This is the end. Stuff" "This is the end, Stuff" "This is the end. Stuff"
[4] "This is a .15mm tube"
Here's an expression that detects the substrings that need to be replaced:
\s*\.\s*(?!\d)
You need to replace these by: . (a dot and a space)
Here's a demo link of how this works: http://regex101.com/r/zB2bY3/1
Explanation of the regex:
\s* - matches whitespace, any number of chars (0 - unbounded)
\. - matches a dot
\s* - same as above
(?!\d) - negative lookahead. It means that the string, in order to be matched, must not be followed by a digit (this handles your last test case).
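The replacement step can be sketched in Python as follows (fix_dot_spacing is a hypothetical helper name, and the inputs are the question's example strings):

```python
import re


def fix_dot_spacing(text: str) -> str:
    """Collapse whitespace around each dot into '. ', skipping dots followed by a digit."""
    return re.sub(r"\s*\.\s*(?!\d)", ". ", text)


samples = [
    "This is the end .Hello world!",
    "This is the end, Hello world!",
    "This is the end . Hello world!",
    "This is a .15mm tube",
]
for s in samples:
    print(fix_dot_spacing(s))
```

One thing to watch for: a dot at the very end of the string also gets a space appended, so you may want to strip the result afterwards.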
Can you please provide me with a regular expression that would
Allow only alphanumeric characters
Contain exactly one hyphen in the entire string
Not allow hyphens or spaces at the start or end of the string
Not allow consecutive spaces or hyphens
Allow a hyphen and a single space next to each other
Valid - "123-Abc test1","test- m e","abc slkh-hsds"
Invalid - " abc ", " -hsdj sdsd hjds- "
Thanks for helping me out on the same. Your help is much appreciated
/^([a-zA-Z0-9] ?)+-( ?[a-zA-Z0-9])+$/
See demo here.
EDIT:
If there can't be a space on both sides of the hyphen, then there needs to be a little more:
/^([a-zA-Z0-9] ?)+-(((?<! -) )?[a-zA-Z0-9])+$/
                    ^^^^^^^^ ^
Alternatively, if negative lookbehind assertions aren't supported (e.g. in JavaScript), then an equivalent regex:
/^([a-zA-Z0-9]( (?!- ))?)+-( ?[a-zA-Z0-9])+$/
              ^ ^^^^^^^ ^
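To compare the first pattern with the lookahead variant, here is a short Python sketch. The names LOOSE and STRICT are mine, and "abc - def" is an extra test string I added to show the difference; the other strings come from the question.

```python
import re

# First pattern from the answer: allows a space on both sides of the hyphen.
LOOSE = re.compile(r"^([a-zA-Z0-9] ?)+-( ?[a-zA-Z0-9])+$")
# Lookahead variant: a space before the hyphen is refused when "- " follows it.
STRICT = re.compile(r"^([a-zA-Z0-9]( (?!- ))?)+-( ?[a-zA-Z0-9])+$")

candidates = ["123-Abc test1", "test- m e", "abc slkh-hsds", " abc ", "abc - def"]

# Map each candidate to (accepted by LOOSE, accepted by STRICT).
results = {s: (bool(LOOSE.match(s)), bool(STRICT.match(s))) for s in candidates}
for s, (loose_ok, strict_ok) in results.items():
    print(f"{s!r}: loose={loose_ok} strict={strict_ok}")
```

The two patterns agree on the question's examples and differ only on "abc - def", which the first accepts and the variant rejects.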
1. Only alphanumeric (hyphen and space included, otherwise it'd make no sense):
^[\da-zA-Z -]+$
This is the main part that matches the string and makes sure that every character is in the given set, i.e. digits and ASCII letters as well as space and hyphen (the use of which is restricted by the following parts).
2. Only one hyphen, and none at the start or end of the string:
(?=^[^-]+-[^-]+$)
This is a lookahead assertion making sure that the string starts and ends with at least one non-hyphen character. A single hyphen is required in the middle.
3. No space at the start or end of the string:
(?=^[^ ].*[^ ]$)
Again a lookahead, similar to the one above. They could be combined into one, but it looks much messier and is harder to explain.
4. No consecutive spaces (consecutive hyphens are already ruled out by 2. above):
(?!.*  )
Putting it all together:
(?!.*  )(?=^[^ ].*[^ ]$)(?=^[^-]+-[^-]+$)^[\da-zA-Z -]+$
Quick PowerShell test:
PS> $re='(?!.*  )(?=^[^ ].*[^ ]$)(?=^[^-]+-[^-]+$)^[\da-zA-Z -]+$'
PS> "123-Abc test1","test- m e","abc slkh-hsds"," abc ", " -hsdj sdsd hjds- " -match $re
123-Abc test1
test- m e
abc slkh-hsds
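The same check can be sketched in Python. The name RULES is hypothetical, and I have written the consecutive-space lookahead as " {2}" (equivalent to two literal spaces) so that both spaces stay visible:

```python
import re

RULES = re.compile(
    r"(?!.* {2})"          # no two consecutive spaces anywhere
    r"(?=^[^ ].*[^ ]$)"    # no space at the start or end
    r"(?=^[^-]+-[^-]+$)"   # exactly one hyphen, not at either end
    r"^[\da-zA-Z -]+$"     # only digits, ASCII letters, space and hyphen
)

candidates = ["123-Abc test1", "test- m e", "abc slkh-hsds", " abc ", " -hsdj sdsd hjds- "]

# Keep the strings that satisfy every rule.
valid = [s for s in candidates if RULES.match(s)]
print(valid)
# ['123-Abc test1', 'test- m e', 'abc slkh-hsds']
```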
Use this regex:
^(.+-.+)[\da-zA-Z]+[\da-zA-Z ]*[\da-zA-Z]+$