How can I match a line that contains neither of two words?
For example, I would like to match a line that contains neither Chapter nor Part. So neither of these two lines is a match:
("Chapter 2 The Economic Problem 31" "#74")
("Part 2 How Markets Work 51" "#94")
while this is a match
("Scatter Diagrams 21" "#64")
My Python-style regex looks like (?<!(Chapter|Part)).*?\n. I know it is not right and would appreciate your help.
Try this:
^(?!.*(Chapter|Part)).*
@MRAB's solution will work, but here's another option:
(?m)^(?:(?!\b(?:Chapter|Part)\b).)*$
The . matches one character at a time, after the lookahead checks that it's not the first character of Chapter or Part. The word boundaries (\b) make sure it doesn't incorrectly match part of a longer word, like Partition.
The ^ and $ are start- and end anchors; they ensure that you match a whole line. $ is better than \n because it also matches the end of the last line, which won't necessarily have a linefeed at the end. The (?m) at the beginning modifies the meaning of the anchors; without that, they only match at the beginning and end of the whole input, not of individual lines.
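As a quick check of both answers, here is a minimal Python sketch that runs the two patterns over the sample lines from the question. The variable names are made up for illustration, and the first pattern is written with a non-capturing group and a trailing $ so that findall returns whole lines.

```python
import re

# Sample lines from the question; the last line has no trailing newline.
text = (
    '("Chapter 2 The Economic Problem 31" "#74")\n'
    '("Part 2 How Markets Work 51" "#94")\n'
    '("Scatter Diagrams 21" "#64")'
)

# Lookahead checked once, at the start of each line.
once_per_line = re.compile(r"^(?!.*(?:Chapter|Part)).*$", re.MULTILINE)

# Lookahead checked before every character, with word boundaries.
per_character = re.compile(r"(?m)^(?:(?!\b(?:Chapter|Part)\b).)*$")

matches_a = once_per_line.findall(text)
matches_b = per_character.findall(text)
print(matches_a)
print(matches_b)

# The word boundaries only matter for longer words such as "Partition".
partition_a = once_per_line.fullmatch("Partition Tables 9")
partition_b = per_character.fullmatch("Partition Tables 9")
print(partition_a is None, partition_b is not None)
```

Both patterns keep only the Scatter Diagrams line. They differ on a line containing Partition: the pattern without word boundaries rejects it, the one with \b accepts it.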
Related
How do I create a regular expression to match a word at the beginning of a string?
We are looking to match stop at the beginning of a string and anything can follow it.
For example, the expression should match:
stop
stop random
stopping
If you wish to match only lines beginning with stop, use
^stop
If you wish to match lines beginning with the word stop followed by a space:
^stop\s
Or, if you wish to match lines beginning with the word stop, but followed by either a space or any other non-word character you can use (your regex flavor permitting)
^stop\W
On the other hand, the following matches any word at the beginning of a string in most regex flavors (in these flavors \w matches the opposite of \W)
^\w+
If your flavor does not have the \w shortcut, you can use
^[a-zA-Z0-9]+
Be aware that this second idiom matches only letters and digits, no symbols at all (not even the underscore, which \w matches in most flavors).
Check your regex flavor's manual to find out which shortcuts are allowed, exactly what they match, and how they deal with Unicode.
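To see how these variants differ, here is a small Python sketch run over a few test strings. The sample strings and dictionary keys are my own, and the ^stop\b row is an extra that is not part of the answer above; it is included only because it also accepts a bare stop.

```python
import re

samples = ["stop", "stop random", "stopping", "stop!", "don't stop"]

patterns = {
    "prefix": r"^stop",           # anything beginning with "stop"
    "then_space": r"^stop\s",     # "stop" followed by whitespace
    "then_nonword": r"^stop\W",   # "stop" followed by a non-word character
    "word_boundary": r"^stop\b",  # extra: "stop" as a whole word (assumption, not from the answer)
}

# For each pattern, collect the samples it matches.
results = {
    name: [s for s in samples if re.search(pattern, s)]
    for name, pattern in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```

Note that ^stop\s and ^stop\W both need a character after stop, so neither matches a string that consists of stop alone.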
Try this:
/^stop.*$/
Explanation:
/ characters delimit the regular expression (i.e. they are not part of the regex per se)
^ means match at the beginning of the line
. followed by * means match any character (.), any number of times (*)
$ means to the end of the line
If you would like to enforce that stop be followed by a whitespace, you could modify the RegEx like so:
/^stop\s+.*$/
\s means any whitespace character
+ following the \s means there has to be at least one whitespace character after the stop word
Note: Keep in mind that the RegEx above requires that the stop word be followed by whitespace! So it wouldn't match a line that only contains: stop
If you want to match the word stop and anything after it, not only at the start of the line, you may use: \bstop.*\b - the word followed by the rest of the line.
Or if you want to match just the word wherever it occurs in the string, use \bstop[a-zA-Z]* - only the words starting with stop.
Or, for stop at the start of a line, use ^stop[a-zA-Z]* - the first word only.
For the whole line, use ^stop.* - without a multiline flag this only matches the first line of the string.
And if you want to match the entire string starting with stop, including newlines, use: /^stop.*/s - the s flag lets . match newlines.
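The effect of the flags is easiest to see side by side. A minimal Python sketch, assuming a three-line test string of my own; re.MULTILINE plays the role of the m flag and re.DOTALL the role of the s flag:

```python
import re

text = "stop one\nstopping two\nplease stop"

# No flags: ^ matches only at the very start, and . stops at the newline.
no_flags = re.findall(r"^stop.*", text)

# MULTILINE: ^ also matches right after each newline.
per_line = re.findall(r"^stop.*", text, re.MULTILINE)

# DOTALL: . matches newlines too, so .* runs to the end of the string.
whole_string = re.findall(r"^stop.*", text, re.DOTALL)

# First word of each line only.
first_words = re.findall(r"^stop[a-zA-Z]*", text, re.MULTILINE)

print(no_flags)
print(per_line)
print(whole_string)
print(first_words)
```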
As @SharadHolani said, this won't match every word beginning with "stop", only one at the beginning of a line, like "stop going".
@Waxo gave the right answer:
This one is slightly better, if you want to match any word beginning with "stop" and containing nothing but letters from A to Z.
\bstop[a-zA-Z]*\b
This would match all
stop (1)
stop random (2)
stopping (3)
want to stop (4)
please stop (5)
But
/^stop[a-zA-Z]*/
would only match (1) through (3), but not (4) and (5)
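Here is a short Python sketch of that comparison, using the five numbered examples as input. The helper first_match is a name I made up for illustration.

```python
import re

samples = ["stop", "stop random", "stopping", "want to stop", "please stop"]

anywhere = re.compile(r"\bstop[a-zA-Z]*\b")  # a "stop" word anywhere in the string
at_start = re.compile(r"^stop[a-zA-Z]*")     # only at the start of the string


def first_match(pattern, s):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(s)
    return m.group() if m else None


found_anywhere = [first_match(anywhere, s) for s in samples]
found_at_start = [first_match(at_start, s) for s in samples]

print(found_anywhere)
print(found_at_start)
```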
If you want to match anything that starts with "stop" including "stop going", "stop" and "stopping" use:
^stop
If you want to match the word stop followed by anything as in "stop going", "stop this", but not "stopped" and not "stopping" use:
^stop\W
/stop[a-zA-Z]*/
will match any stop word (stop, stopped, stopping, etc.). With + instead of *, at least one letter would have to follow, so a bare stop would not match.
However, if you just want to match "stop" at the start of a string
/^stop/
will do :D
If you want the string to start with "stop", you can use the following pattern.
"^stop.*"
This will match strings starting with stop, followed by anything.
/^stop.*$/i
The i flag makes the match case-insensitive.
I'd advise against a simple regular expression approach to this problem. There are too many words that are substrings of other unrelated words, and you'll probably drive yourself crazy trying to overadapt the simpler solutions already provided.
You'll want at least a naive stemming algorithm (try the Porter stemmer; free implementations are available in most languages) to process the text first. Keep the stemmed text and the original text in two separate space-split arrays. Make sure each non-alphabetical character also gets its own index in these arrays. Whatever list of words you're filtering, stem those too.
The next step is to find the array indices whose stems match your list of stemmed 'stop' words. Remove those indices from the unprocessed array, and then rejoin on spaces.
This is only slightly more complicated, but it is a much more reliable approach. If you have any doubts about the value of a more NLP-oriented approach, you might want to do some research into clbuttic mistakes.
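A rough Python sketch of those steps follows. Everything in it is an assumption made for illustration: naive_stem is a crude suffix stripper standing in for a real Porter stemmer, and remove_words and the sample sentence are hypothetical.

```python
import re


def naive_stem(word: str) -> str:
    """Crude stand-in for a real stemmer such as Porter's (illustration only)."""
    word = word.lower()
    for suffix in ("ping", "ped", "ing", "ed", "s"):
        # Only strip a suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def remove_words(text: str, filter_words: list[str]) -> str:
    # Split into words; every non-alphabetical character gets its own index.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)
    stems = [naive_stem(t) for t in tokens]
    blocked = {naive_stem(w) for w in filter_words}
    # Drop the original tokens whose stem is in the filter list, then rejoin on spaces.
    kept = [tok for tok, stem in zip(tokens, stems) if stem not in blocked]
    return " ".join(kept)


cleaned = remove_words(
    "Please stop, he stopped but the unstoppable bus kept going.", ["stop"]
)
print(cleaned)
```

Rejoining on spaces leaves a space before each punctuation mark, so real code would probably want a smarter join. The point is only that stop and stopped are removed while unstoppable survives.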
Can you try this:
https://regex101.com/r/P3qfKG/1
reg = /stop(\w+| [^ ]+|$)/gm
It will match stop on its own, words that start with stop, and stop followed by the next word.
I have this regular expression:
\..*?\.
But it only selects between two periods, not every punctuation mark, and it also selects across multiple lines.
Would modifying this expression to only take in one line at a time work somehow, if there's also a way to group punctuation into where we have a period?
Just to make things simpler, at this time I only need the expression to recognize periods, exclamation points, and question marks. I don't need it to register commas.
Thanks to Nathan and Agumander below, I know to substitute [.!?] in place of \. now, but I'm still having trouble with the other half of my question.
Just to make sure I'm being more clear, using [.!?].*?[.!?]\s will highlight text between punctuation marks, but across multiple lines. So I can't use it to bookmark only the lines that have multiple punctuation marks.
Placing characters inside a pair of square brackets will match to any of the enclosed characters. In your case you'd want [.?!]
If you want to match any sentence that has two of these, then you'll be looking for a pair of [.!?] separated by zero or more of any character.
The regex that matches strings with more than one of the set [.?!] would then be [.!?].*[.!?]
To make . match newlines, you'd add the s modifier to your regex.
...so the full regex would be /[.!?].*[.!?]/s
Ok I figured it out. Thanks to Agumander and Nathan above I substituted [.!?] in for the two \. in my original regex:
\..*?\. became [.!?].*[.!?]
Putting \s at the end of the regex made it select the entire document in Notepad++.
The last issue I had was remembering to turn off "matches newline."
Agumander, I think you're asking for a regex that basically finds multiple punctuation marks on a single line. So here's one way to do it.
Here's the text I'm going to match. The regex will match the first line in its entirety, but will not match the second.
Here's a line with multiple punctuation. The entire line will match the regex!
This line does not have multiple punctuation.
Regex
^.*(?:[\.?!].*){2,}$
Explanation
^ -- Start matching at the beginning of a line
.* -- match any character 0 or more times
(?: -- start a new non-capturing group
[.?!] -- find a character matching a period, question mark, or exclamation point.
.* -- match any character 0 or more times
)
{2,} -- repeat the previous group 2 or more times. This is how we ensure there's at least two punctuation marks before considering it a match.
$ -- end of line anchor, basically stop matching at the end of a line
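To try this outside an editor, here is a minimal Python sketch using the two sample lines above. It assumes re.MULTILINE so that ^ and $ work per line; . does not match newlines unless re.DOTALL is set, which corresponds to leaving "matches newline" off.

```python
import re

text = (
    "Here's a line with multiple punctuation. The entire line will match the regex!\n"
    "This line does not have multiple punctuation."
)

# Inside a character class the period needs no backslash.
pattern = re.compile(r"^.*(?:[.?!].*){2,}$", re.MULTILINE)

matching_lines = pattern.findall(text)
print(matching_lines)
```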
I have texts similar to the following (whitespace intended), which I run a RegEx on line by line:
Smith-Petersen      X1l
Jonas Henry
Foord. 82a     221.
12345 Somewhere
I now want to use the RegEx to capture anything before 3 or more whitespaces occur (which might or might not occur) in the first match group. The allowed chars:
[a-zA-Z0-9,. '\-AÖÜäöüß]
What I want is : Smith-Petersen, Jonas Henry, Foord. 82a and 12345 Somewhere.
After trying desperately, I hope to find help with this here. I just can't get it to work, since my expression grabs the blanks and what follows and puts them into the first group as well. Is there a way to reverse how the RegEx matches? Can anyone help me with this?
Assuming by "may or may not occur" you mean the line may end before 3 spaces are encountered:
^\s*([-a-zA-Z0-9,\.'AÖÜäöüß ]+?)(?=\s{3}|\s{0,2}$)
This regex is using a positive look ahead to assert that either there's 3 spaces following or there's up to 2 spaces then end-of-input.
The anchor to start of input avoids matching the junk at the end of the longer lines.
Your target is in group 1.
See a live demo on rubular
Here is my approach.
^ *([a-zA-Z0-9,.'AÖÜäöüß-]+(?: {1,2}[a-zA-Z0-9,.'AÖÜäöüß-]+)*)
What you want is in match group 1. This regex uses only greedy operators and works on all four cases found in your sample text.
Basically it matches all words at the beginning of a line that are separated from one another by no more than two spaces. Once more than 2 spaces are found, the match is completed.
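Both patterns can be checked with a few lines of Python. The exact spacing in the sample lines is my assumption (runs of three or more spaces before the trailing junk, plus one line with leading spaces), and the variable names are made up. The character classes are copied as posted.

```python
import re

# Assumed spacing: 3+ spaces separate the wanted text from the trailing junk.
lines = [
    "Smith-Petersen      X1l",
    "  Jonas Henry",
    "Foord. 82a     221.",
    "12345 Somewhere",
]

# Lazy match plus lookahead: stop before 3 spaces, or before up to 2 spaces and end of input.
lookahead = re.compile(r"^\s*([-a-zA-Z0-9,\.'AÖÜäöüß ]+?)(?=\s{3}|\s{0,2}$)")

# Greedy only: words separated from one another by at most two spaces.
greedy = re.compile(r"^ *([a-zA-Z0-9,.'AÖÜäöüß-]+(?: {1,2}[a-zA-Z0-9,.'AÖÜäöüß-]+)*)")

via_lookahead = [lookahead.match(line).group(1) for line in lines]
via_greedy = [greedy.match(line).group(1) for line in lines]

print(via_lookahead)
print(via_greedy)
```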