Matching the character before a linebreak, excluding whitespaces?

So I currently have a regex (https://regex101.com/r/zBE4Ju/1) that highlights the words before and after a linebreak. This is nice, but the issue is sometimes there are whitespaces after the word that appears BEFORE the line break. So they end up
You can see on my regex101 how the issue happens, and I have outlined the problem. I need to recognize the word before and after the line break, regardless of if there is a space after the word.
(\w*(?:[\n](?![\n])\w*)+)
You can see it in action here https://regex101.com/r/zBE4Ju/3
Expected: Line 1
Actual: Line 3

You can use $1 from:
/([^ ]+) *(\r|\n)/gm
https://regex101.com/r/o87VP7/5
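As a quick check outside regex101, here is a minimal Python sketch of the same pattern. The sample text and variable names are made up for illustration. The /gm flags are dropped because the pattern has no anchors, and finditer already walks over every match.

```python
import re

# Pattern from the answer above, without the /.../gm delimiters and flags.
pattern = re.compile(r"([^ ]+) *(\r|\n)")

# Hypothetical sample: the first line has trailing spaces before the line break.
text = "alpha beta   \ngamma delta\nepsilon"

# Group 1 is the word directly before the (optional) spaces and the line break.
words = [m.group(1) for m in pattern.finditer(text)]
print(words)
```

The last line of the sample has no line break after it, so "epsilon" is not reported.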

If you want to highlight the last "word" in the sentence followed by possible spaces and a newline, you could repeat 0+ times a group matching 1+ non whitespace chars followed by 1+ spaces.
Then capture in a group matching non whitespace chars (\S+) and match possible spaces followed by a newline.
^ *(?:\S+ +)*(\S+) *\r?\n
Explanation
^ Start of string
 * Match 0+ times a space
(?: Non capturing group
\S+ + Match 1+ non whitespace chars and 1+ spaces
)* Close non capturing group and repeat 0+ times (to also match a single word at the beginning)
(\S+) Capture group 1, match 1+ times a non whitespace char
 *\r?\n Match 0+ times a space followed by a newline
Regex demo
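A minimal Python sketch of this pattern, assuming the multiline flag is wanted so that ^ anchors at every line start. The sample text is hypothetical.

```python
import re

# Same pattern as above; re.MULTILINE lets ^ match at the start of each line.
pattern = re.compile(r"^ *(?:\S+ +)*(\S+) *\r?\n", re.MULTILINE)

# Hypothetical sample: two spaces follow "three" before the line break.
text = "one two three  \nfour\nfive six"

# With a single capture group, findall returns the group 1 values.
last_words = pattern.findall(text)
print(last_words)
```

The final line is skipped because the pattern requires a newline after the word.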

Related

How to match names separated by "and" excluding "and" itself using regex?

I am trying to solve http://play.inginf.units.it/#/level/10
I have some strings as follows:
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},
I need to match the names in bold. I tried the following regex:
(?<=author={).+(?=})
But it matches the entire string inside {}. I understand why is it so but how can I break the pattern with and?

It took me a little while to get the samples to show up in your link. What about:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+
See an online demo
(?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and the literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match ' and ';
\K - Reset starting point of reported match;
(?:(?! and |},).)+ - Match 1+ chars, each one only if ' and ' or '},' does not start at that position.
Above will also match 'others' as per last sample in linked test. If you wish to exclude 'others' then maybe add the option to the negated list as per:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+
See an online demo
In the comment section we established that the above would not work on the linked website. Apparently it is JS based, which has no \K or \G but does support variable-length lookbehind. Therefore try:
(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*
See the demo
(?<= - Open lookbehind;
\bauthor={ - Match word-boundary and literally 'author={';
(?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
\b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
(?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching comma/space separated parts of a name.

If using a lookbehind assertion is supported and matching word characters, you might use:
(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\bauthor={ Match author={ preceded by a word boundary
[^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
) Close the lookbehind
[A-Z] Match an uppercase char A-Z
[^\s,]*, Optionally match non whitespace chars except , and then match ,
(?: Non capture group to repeat as a whole part
\s+[A-Z][^\s,]* Match 1+ whitespace chars, uppercase char A-Z, optional non whitespace chars except ,
)+ Close the non capture group and repeat it 1 or more times
\b a word boundary
See a regex101 demo.
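Python's built-in re module has neither \K nor \G, and its lookbehind must be fixed-width, so none of the patterns above run there unchanged. The sketch below swaps in a different, two-step technique: capture the author field first, then split it on the literal separator. The function name author_names and the sample string are assumptions for illustration.

```python
import re

def author_names(bibtex):
    """Return the individual names from every author={...}, line."""
    names = []
    # Step 1: capture the text between 'author={' and the closing '},' of each line.
    for field in re.findall(r"^\s*author=\{(.*)\},\s*$", bibtex, flags=re.MULTILINE):
        # Step 2: split on the separator ' and ', skipping the 'others' placeholder.
        names.extend(n for n in re.split(r"\s+and\s+", field) if n != "others")
    return names

sample = """title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},"""

print(author_names(sample))
```

The "and" inside the title line is left alone because only lines starting with author={ are considered.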

Regex to capture everything after optional token

I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*) which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*) which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.

About the patterns that you tried:
 - (.*) This pattern will match the first occurrence of - followed by the rest of the line. It will match too much for the second example, as the .* will also match the second occurrence of -
(?<= - )(.*) This pattern will match the same as the first example without the - as it asserts that it should occur directly to the left
(?! - )(.+)| - (.+) This pattern uses a negative lookahead which asserts what is directly to the right is not (?! - ). As none of the examples start with - , the whole line will be matched directly after the negative lookahead due to .+ and the second part after the alternation | will not be evaluated
If the first group of letters can be of varying length, you could make the match either specific matching 1 or more uppercase characters [A-Z]+ or 1+ word characters \w+.
To get a more broad match, you could match 1 or more non whitespace characters using \S+
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
Explanation
^ Start of string
(?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
\K Clear the match buffer (Forget what is currently matched)
\S+ Match 1+ non whitespace characters
(?: Non capture group
\h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
\S+ Match 1+ non whitespace chars
)* Close non capture group and repeat 0+ times to match more "words" separated by spaces
Regex demo
Edit
To match an optional hyphen and trailing single character, you could add an optional non capturing group (?:-\h\S\h*)?$ and assert the end of the string if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
Regex demo
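For readers who are not on PCRE, here is a hedged Python adaptation of the edited pattern. Python's re module supports neither \h nor \K, so this sketch uses [ \t] as an approximation of horizontal whitespace and a capture group in place of \K. The names PATTERN and capture are made up for illustration.

```python
import re

PATTERN = re.compile(
    r"^(?:\S+[ \t]-[ \t])?"          # optional leading token such as 'AAA - '
    r"(\S+(?:[ \t](?!-[ \t])\S+)*"   # words separated by blanks, stopping before ' - '
    r"[ \t]*(?:-[ \t]\S[ \t]*)?)$"   # optional trailing '- D' style suffix
)

def capture(field):
    """Return the part after the optional leading token, or None if nothing matches."""
    m = PATTERN.match(field)
    return m.group(1) if m else None

for field in ("AAA - Something Here", "AAA - Something Here - D", "Something Here"):
    print(repr(capture(field)))
```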

You may use
^(?:.*? - )?\K.*?(?= - | *$)
^(?:.*?\h-\h)?\K.*?(?=\h-\h|\h*$)
See the regex demo
Details
^ - start of string
(?:.*? - )? - an optional non-capturing group matching any 0+ chars other than line break chars, as few as possible, up to the first space, hyphen, space
\K - match reset operator
.*? - any 0+ chars other than line break chars as few as possible
(?= - | *$) - space, hyphen, space or 0+ spaces till the end of string should follow immediately on the right.
Note that \h matches any horizontal whitespace chars.

^(?:[A-Z]+ - \K)?.*\S
demo
Since "Something Here" can be anything, there's no reason to specially describe the eventual last letter in the pattern. You don't need something more complicated.
With this pattern I assume that you are not interested in the trailing spaces, that's why I ended it with \S. If you want to keep them, remove the \S and change the previous quantifier to +.
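The same idea can be sketched in Python, which has no \K, by moving the wanted part into a capture group instead. The helper name after_token is hypothetical.

```python
import re

# The optional 'AAA - ' prefix stays outside the group; group 1 ends at the last non-space char.
pattern = re.compile(r"^(?:[A-Z]+ - )?(.*\S)")

def after_token(field):
    m = pattern.match(field)
    return m.group(1) if m else None

print(after_token("AAA - Something Here"))
print(after_token("AAA - Something Here - D  "))
print(after_token("Something Here"))
```

Because .* is greedy and the group must end in \S, trailing spaces are dropped while a trailing " - D" is kept.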

Regex separate digits and chars into groups

I am trying to separate any string into 2 groups, digits and chars, and eliminate all whitespace between these 2 groups. After the first digit, chars are allowed.
The (\D*)(\S+) works well so far for me, except for the whitespace after the first group of chars.
Here is my regex demo.

You could exclude matching the whitespace chars as well using a negated character class [^\d\s]+ matching 1+ times any char except a whitespace char or a digit.
You can match optional whitespace chars using \s*
([^\d\s]+)\s*(\S+)
Explanation
( Capture group 1
[^\d\s]+ Match 1+ chars except a digit or whitespace char
) Close group
\s* Match 0+ whitespace chars
(\S+) Capture group 2, match 1+ times a non whitespace char
Regex demo
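A short Python sketch of this pattern. The input strings and the helper name split_groups are invented for illustration.

```python
import re

pattern = re.compile(r"([^\d\s]+)\s*(\S+)")

def split_groups(s):
    """Return (group 1, group 2) with the whitespace between them dropped."""
    m = pattern.search(s)
    return m.groups() if m else None

print(split_groups("abc   123def"))
print(split_groups("abc123"))
```

Since \s* sits outside both groups, any whitespace between the leading chars and the rest is matched but not captured.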

How to get the symbol and everything between the symbol

I have been trying to capture a run of a symbol ('!') together with the word(s) that follow it, separated by a space.
Example:
!!! !!! intense beatdown
Right now I can only get "!!! intense", but what I want is the whole phrase:
!!! intense beatdown
Here is the regex that I'm using:
import re

text = '!!! !!! intense beatdown'
matches = re.findall(r'(\!+ \w+)', text)

Use this regex:
!!!\s([!\s]+.+)

You could match 1 or more exclamation marks followed by matching 1+ word chars.
Then repeat a non capturing group 0+ times matching 1+ word chars separated by a space.
!+ \w+(?: \w+)*
In parts
!+ Match 1+ times !
 \w+ Match a space and 1+ word chars
(?: Non capturing group
 \w+ Match a space and 1+ word chars
)* Close group and repeat 0+ times using *
Regex demo
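Plugged into the code from the question, the pattern can be tried like this (a minimal sketch; only the pattern differs from the original snippet):

```python
import re

text = '!!! !!! intense beatdown'

# No capture groups here, so findall returns the full match.
matches = re.findall(r'!+ \w+(?: \w+)*', text)
print(matches)
```

The first "!!!" is skipped because it is followed by another "!" rather than a word character.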

Replace Word with #Word after first #

Would anyone kindly help with a regex for Notepad++ to replace Word with #Word (only after the first occurrence of #)?
#Celebrity #Glad #Known #Lord Byron #British #Poet
should become
#Celebrity #Glad #Known #Lord #Byron #British #Poet

To replace Word with #Word only after the first occurrence of #, you could use an alternation:
Find what
(?>^[^#]*#\w+\h*|#\w+\h*|\G)\K(\w+\h*)
Replace with
#\1
Regex demo
Explanation
(?> Atomic group
^[^#]*#\w+\h* Match from the start of the string not a # 0+ times using a negated character class followed by matching a #. Then match 1+ times a word character followed by 0+ times a horizontal whitespace character.
| Or
#\w+\h* Match #, a word character 1+ times followed by a horizontal whitespace character 0+ times
| Or
\G Assert position at the end of the previous match
) Close atomic group
\K Forget what was previously matched
(\w+\h*) Capture in a group 1+ word characters followed by 0+ times a horizontal whitespace character

You can use the following regex to match and replace:
\s([^#]\w+)
It starts by matching a whitespace character, then it creates a group that starts with a character other than '#' and continues with one or more word characters.
You then replace with:
' #$1'
That will add '#' to the words that don't start with it.
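Outside Notepad++, the same find and replace can be sketched with Python's re.sub, where the group reference is written \1 instead of $1. The variable names are illustrative.

```python
import re

text = "#Celebrity #Glad #Known #Lord Byron #British #Poet"

# A whitespace char followed by a word that does not begin with '#'.
result = re.sub(r"\s([^#]\w+)", r" #\1", text)
print(result)
```

Note that [^#]\w+ needs at least two characters, so a single-letter word would be left untouched by this pattern.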
