Regex ignores negative lookahead - regex

I've got the following string:
#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889
And want to extract the part between #n and the following # with regex. The result should then be:
"John Doe"
This works with the following regex:
(?<=#n\s).(?:(?!#).)*
However, if the string looks as follows:
#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889
The regex returns:
"#a some University"
But I need it to return an empty string. Can someone help me with this problem?

You may do that by extracting one or more chars other than # after #n and a whitespace:
(?<=#n\s)[^#]+
The (?<=#n\s) positive lookbehind matches a location immediately preceded by #n and a whitespace, and [^#]+ matches one or more chars other than #. When a # follows directly, as in the second string, there is no match at all.
If there can be one or more whitespaces after #n, you can use a capturing group. In PySpark, it will look like
from pyspark.sql.functions import col, regexp_extract
df.withColumn("result", regexp_extract(col("source"), r"#n\s+([^#]+)", 1))
With #n\s+([^#]+), you match #n, one or more whitespaces, and then capture one or more non-#s into Group 1. regexp_extract returns an empty string when the pattern does not match.
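For readers without a Spark session at hand, the capturing-group pattern can also be tried with Python's built-in re module. This is only a minimal sketch: extract_name is a hypothetical helper, and returning "" when nothing matches is an assumption chosen to mirror the empty string the asker wants.

```python
import re

# "#n", one or more whitespace chars, then capture a run of non-# chars
NAME_RE = re.compile(r"#n\s+([^#]+)")

def extract_name(record: str) -> str:
    """Return the text after '#n ' up to the next '#', or '' if there is none."""
    match = NAME_RE.search(record)
    return match.group(1) if match else ""

with_name = "#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889"
without_name = "#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889"

print(repr(extract_name(with_name)))     # 'John Doe'
print(repr(extract_name(without_name)))  # ''
```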

Related

Regex Non greedy match between two characters

I have this following string:
message_id = "7bb19406-f97a-47b3-b42c-40868d2cef5b-1661496224#example.com"
I would like to extract the last part between - .* # which is 1661496224
With lookahead and lookbehind, the match starts at the first - but I want it to start at the last -:
#!/bin/ruby
message_id = "7bb19406-f97a-47b3-b42c-40868d2cef5b-1661496224#example.com"
message_id.match(/(?<=\-).*(?=\#)/)
output:
#<MatchData "f97a-47b3-b42c-40868d2cef5b-1661496224">
How can I capture the shortest match (1661496224) between the two characters?
Assuming that the message_id does not contain spaces, you might use:
(?<=-)[^-#\s]+(?=#)
If there cannot be any more # chars or hyphens between the # and the end of the string, you can add that to the assertion:
(?<=-)[^-#\s]+(?=#[^\s#-]*$)
Another option with a capture group:
-([^-#\s]+)#[^-#\s]*$
You can match all the non-dash characters that are followed by a #:
[^-]+(?=#)
Demo: https://ideone.com/bnDxd2
Inspired by this post on SO; what about:
s = "7bb19406-f97a-47b3-b42c-40868d2cef5b-1661496224#example.com"
puts s[/-([^-#]*)#/,1] #assuming a single '#', otherwise:
puts s.scan(/-([^-#]*)#/).last.first
Both print:
1661496224
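The lookaround and capture-group patterns from the first answer carry over to Python's re module unchanged. The sketch below is only an illustration of those two patterns on the sample string; the variable names are made up for the example.

```python
import re

message_id = "7bb19406-f97a-47b3-b42c-40868d2cef5b-1661496224#example.com"

# Lookaround version: a run without '-', '#' or whitespace,
# preceded by '-' and followed by '#'
lookaround = re.search(r"(?<=-)[^-#\s]+(?=#)", message_id)

# Capture-group version, anchored to the end of the string
captured = re.search(r"-([^-#\s]+)#[^-#\s]*$", message_id)

print(lookaround.group())  # 1661496224
print(captured.group(1))   # 1661496224
```

Excluding - from the character class is what forces the match to start after the last hyphen: any earlier start would have to cross a hyphen before reaching the #.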

replaceAll regex to remove last - from the output

I was able to achieve some of the output but not the right one. I am using replaceAll with a regex; the sample code is below.
final String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
System.out.println(label.replaceAll(
"([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)", "$3"));
I want this output:
abc-nyd-request-xyxpt
but I am getting:
abc-nyd-request-xyxpt-
Here is the code: https://ideone.com/UKnepg
You may use this .replaceFirst solution:
String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
label.replaceFirst("(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$", "$1");
//=> "abc-nyd-request-xyxpt"
RegEx Details:
(?:[^-]*-){2}: Match 2 repetitions of zero or more non-hyphen characters followed by a hyphen
(.+?): Match 1+ of any characters, as few as possible, and capture in group #1
(?:--1)?: Match optional --1
-: Match a -
[^-]+: Match 1+ non-hyphen characters
$: End
The following works for your example case
([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)
https://regex101.com/r/VNtryN/1
The [^-] at the end of group 3 keeps it from capturing a trailing -, and the -+ after it allows the separator to be more than a single dash, which is what makes it match the double --.
With your shown samples and attempts, please try the following regex. It creates one capturing group that can be used in the replacement; use $1 as the replacement in your function.
^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*
Explanation: Adding detailed explanation for above regex.
^(?:.*?-){2} ##From the start of the value, a non-capturing group with a lazy match up to the nearest -, repeated 2 times.
([^-]*(?:-[^-]*){3}) ##The first and only capturing group: a run of non-dash characters, followed by 3 repetitions of a - plus a run of non-dash characters, which gives the required output.
--.* ##Matches -- and everything after it to the end of the value.
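As a quick cross-check, the three patterns from the answers above can be run side by side. The sketch below uses Python's re.sub instead of Java's replaceFirst/replaceAll (Python writes $1 as \1), so treat it as an approximation of the Java behaviour rather than a drop-in equivalent; the candidates list is just a name chosen for the example.

```python
import re

label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9"

# (pattern, replacement) pairs taken from the three answers
candidates = [
    (r"(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$", r"\1"),
    (r"([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)", r"\3"),
    (r"^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*", r"\1"),
]

# count=1 mirrors replaceFirst: only the first match is replaced
results = [re.sub(pattern, repl, label, count=1) for pattern, repl in candidates]
for result in results:
    print(result)  # abc-nyd-request-xyxpt
```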

How can i add conditional statements in Regex

I have 2 strings
1) abc-def
2) abc-
and I have written the regex group (?<Myid>[a-zA-Z0-9-]+), which works fine for the first string.
However, in the second string I don't need the "-"; only abc should be selected. How can I add a condition here?
I would phrase your regex as:
(?<Myid>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)
This pattern says to match:
[a-zA-Z0-9]+ match one or more alphanumeric characters
(?:-[a-zA-Z0-9]+)* followed by a dash and more alphanumeric characters, zero or more times
Just appending a negated character class for the dash at the end should suffice here, I guess.
i.e. (?<Myid>[a-zA-Z0-9-]+[^-])
Demo: https://regex101.com/r/PetK6Q/1
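The first pattern above (alphanumerics with an optional dash-separated tail) can be checked against both sample strings in Python. Note that Python's re module spells named groups (?P<name>...) rather than (?<name>...); my_id is a hypothetical helper, and returning "" on no match is an assumption made for the example.

```python
import re

# Alphanumerics, then zero or more "-alphanumerics" groups,
# so a dash is only matched when something follows it
MYID_RE = re.compile(r"(?P<Myid>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)")

def my_id(text: str) -> str:
    """Return the Myid group of the first match, or '' if nothing matches."""
    match = MYID_RE.search(text)
    return match.group("Myid") if match else ""

print(my_id("abc-def"))  # abc-def
print(my_id("abc-"))     # abc
```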

Regex matching a text after a specific string until another specific string

If I have the following example:
X-FileName: pallen (Non-Privileged).pst
Here is our forecast
Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>
How can I select the text
Here is our forecast
after "X-FileName .... \n" until "Message-ID" excluded?
I read about lookahead and lookbehind and tried this, but it didn't work:
(?<=X-FileName:(\n)+$).+(?=Message-ID:)
This should do it:
(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:) (group #1 is the match)
Explanation:
(?:X-FileName:[^\n]+) matches X-FileName: followed by one or more characters that aren't newlines, without capturing it (?:).
\n+ matches one or more consecutive newlines.
([^\n]+) matches and captures one or more consecutive characters that aren't newlines.
\n+, again, matches one or more consecutive newlines.
(?:Message-ID:) matches Message-ID: without capturing it (?:).
Edit: as #WiktorStribiżew mentioned though, splitting your text into lines may be an easier/cleaner way to retrieve what you want.
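A minimal Python sketch of the pattern above, with the non-capturing groups dropped since they do not change what is matched. It assumes the body is separated from the two header lines by blank lines; because of \n+, the pattern works the same with single newlines.

```python
import re

# Assumed layout: header line, blank line, body, blank line, header line
text = (
    "X-FileName: pallen (Non-Privileged).pst\n"
    "\n"
    "Here is our forecast\n"
    "\n"
    "Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>"
)

match = re.search(r"X-FileName:[^\n]+\n+([^\n]+)\n+Message-ID:", text)
body = match.group(1) if match else None
print(body)  # Here is our forecast
```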
There are two approaches here, and they depend on the broader context. If your expected substring is the second paragraph, just split with \n\n (or \r\n\r\n) and get the second item from the resulting list.
If it is a text inside some larger text, use a regex.
See a Python demo:
import re
s = '''X-FileName: pallen (Non-Privileged).pst

Here is our forecast

Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>'''
# Non-regex way for the string in the exact same format
# (paragraphs separated by a blank line)
print(s.split('\n\n')[1])
# Regex way to get some substring in a known context
m = re.search(r'X-FileName:.*[\r\n]+(.+)', s)
if m:
    print(m.group(1))
The regex means:
X-FileName: - a literal substring
.* - any 0+ chars other than line break chars
[\r\n]+ - 1 or more CR or LF chars
(.+) - Group 1: one or more chars other than line break chars, as many as possible.

Regex: exclude trailing .0 but include all strings

I have a number of floats/strings that look as follows:
12339.0
133339
159.0
dfkkei
something
32439
Some of them have a trailing .0. How can I show all the numbers without the trailing .0 using a regular expression, including the items that are not numbers? I tried something like this, hoping it would exclude all .0 from the capture group, but it doesn't work: (.*)(:?.0)?
https://regex101.com/r/sC6jO2/1
You may use a simpler regex:
\.0+$
And replace with an empty string.
The regex matches a . (\.) followed with 1 or more zeros (0+) up to the end of string ($).
If you plan to match two groups as in your initial attempt, use
^(.*?)(?:\.0+)?$
Here,
^ - start of string
(.*?) - Group 1 capturing any 0+ chars other than a newline, as few as possible (=lazily), up to a
(?:\.0+)? - optional sequence of . + one or more zeros
$ - at the end of the string.
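Both approaches can be tried on the sample values with Python's re module. This is a minimal sketch in which each value is treated as its own string; the list and variable names are made up for the example.

```python
import re

values = ["12339.0", "133339", "159.0", "dfkkei", "something", "32439"]

# Option 1: delete a trailing ".0" (or ".00", ...) at the end of the string
stripped = [re.sub(r"\.0+$", "", v) for v in values]

# Option 2: capture everything before the optional trailing ".0" in group 1
grouped = [re.match(r"^(.*?)(?:\.0+)?$", v).group(1) for v in values]

print(stripped)  # ['12339', '133339', '159', 'dfkkei', 'something', '32439']
print(grouped)   # ['12339', '133339', '159', 'dfkkei', 'something', '32439']
```

Non-numeric items pass through untouched in both cases because neither pattern requires digits before the dot.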