This seems like it should be easy, but I can't figure out which permutation of regex matching will extract the whole string after the first number in the string. I can extract the string before the first number like so:
gsub( "\\d.*$", "", "DitchMe5KeepMe" )
Any idea how to write the regex pattern such that the string after the first number is kept?
Instead of lazy dot matching, I'd rely on a \D non-digit character class and use sub to make just one replacement:
sub( "^\\D*\\d", "", "DitchMe5KeepMe" )
Here,
^ - matches the start of a string
\D* - matches zero or more non-digits
\d - matches a digit
NOTE: to remove everything up to and including the whole first number, add a + after the last \d to match one or more digits.
See the IDEONE demo.
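For readers working outside R, here is a minimal Python sketch of the same idea. The function names are made up for illustration; count=1 is used to mirror sub(), which makes a single replacement.

```python
import re

def drop_through_first_digit(text: str) -> str:
    # ^\D* skips the leading non-digits and \d consumes the first digit;
    # count=1 limits the call to a single replacement, like R's sub().
    return re.sub(r"^\D*\d", "", text, count=1)

def drop_through_first_number(text: str) -> str:
    # \d+ consumes the whole first run of digits instead of one digit.
    return re.sub(r"^\D*\d+", "", text, count=1)

print(drop_through_first_digit("DitchMe5KeepMe"))
print(drop_through_first_number("DitchMe55KeepMe"))
```

The second variant corresponds to the NOTE above: with \d+ a multi-digit number is removed as a whole rather than only its first digit.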
What I can see is that you want to remove everything until the first number, so you can use this regex and replace it with an empty string:
^.*?\d
I used .*? to make the pattern ungreedy, so if you had DitchMe5Keep8Me it would match DitchMe5; a greedy pattern like .*\d would match DitchMe5Keep8 instead.
Regex 101 Demo
R Fiddle Demo
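The difference between the lazy and the greedy version can be checked with a short Python sketch (the variable names are only illustrative):

```python
import re

sample = "DitchMe5Keep8Me"

# Lazy: .*? stops at the first digit, so only "DitchMe5" is removed.
lazy_result = re.sub(r"^.*?\d", "", sample)

# Greedy: .* runs to the last digit, so "DitchMe5Keep8" is removed.
greedy_result = re.sub(r"^.*\d", "", sample)

print(lazy_result)
print(greedy_result)
```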
You can also use str_extract from stringr:
library(stringr)
str_extract("DitchMe5KeepMe", "(?<=\\d).*$")
[1] "KeepMe"
which will extract everything after the first digit.
str_extract("DitchMe5KeepMe6keepme", "(?<=\\d).*$")
[1] "KeepMe6keepme"
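The same lookbehind pattern also works with Python's re module. This is only a sketch, and after_first_digit is a hypothetical helper name:

```python
import re
from typing import Optional

def after_first_digit(text: str) -> Optional[str]:
    # (?<=\d) requires a digit immediately before the match start;
    # the first position where that holds is right after the first digit.
    match = re.search(r"(?<=\d).*$", text)
    return match.group(0) if match else None

print(after_first_digit("DitchMe5KeepMe"))
print(after_first_digit("DitchMe5KeepMe6keepme"))
```

When the string contains no digit at all, the helper returns None instead of a string.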
Related
I thought I understood how the non-greedy modifier works, but am confused by the following result:
Regular Expression: (,\S+?)_sys$
Test String: abc,def,ghi,jkl_sys
Desired result: ,jkl_sys <- last field including comma
Actual result: ,def,ghi,jkl_sys
Use case is that I have a comma separated string whose last field will end in "_sys" (e.g. ,sometext_sys). I want to match only the last field and only if it ends with _sys.
I am using the non-greedy (?) modifier to return the shortest possible match (only the last field including the comma), but it returns all but the first field (i.e. the longest match).
What am I missing?
I used https://regex101.com/ to test, in case you want to see a live example.
You can use
,[^,]+_sys$
The pattern matches:
, Match the last comma
[^,]+ Match 1+ occurrences of any char except ,
_sys Match literally
$ End of string
See a regex demo.
If you don't want to match newlines or other whitespace:
,[^\s,]+_sys$
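As a sketch of why the lazy quantifier does not help here: the engine reports the leftmost match, and laziness only affects how far a match extends to the right, not where it starts. The Python comparison below uses the test string from the question.

```python
import re

text = "abc,def,ghi,jkl_sys"

# The match starts at the FIRST comma; \S+? then has to grow across the
# later commas until _sys$ can finally match.
lazy = re.search(r"(,\S+?)_sys$", text).group(0)

# [^,]+ cannot cross a comma, so attempts at the earlier commas fail
# and only the last field can match.
negated = re.search(r",[^,]+_sys$", text).group(0)

print(lazy)
print(negated)
```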
It sounds like you're looking for a string that ends with "_sys", sits at the end of the source string, and is preceded by a comma.
,\s*(\w+_sys)$
I added the \s* to allow for optional whitespace after the comma.
No non-greedy modifiers necessary.
The parens are around \w+_sys so you can capture just that string, without the comma and optional whitespace.
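A small Python sketch of the capture-group version; last_sys_field is a hypothetical helper name, and the inputs are made-up variations of the question's test string.

```python
import re
from typing import Optional

def last_sys_field(text: str) -> Optional[str]:
    # Group 1 holds the field without the comma and optional whitespace.
    match = re.search(r",\s*(\w+_sys)$", text)
    return match.group(1) if match else None

print(last_sys_field("abc,def,ghi,jkl_sys"))
print(last_sys_field("abc,def,ghi, jkl_sys"))
print(last_sys_field("abc,def_sys,ghi"))
```

Note that \w does not match a comma, which is what keeps the match from spanning more than one field.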
I want to replace everything other than letters, spaces, and digits at the end of the string with an empty string. In other words, any digits or spaces that come at the start or in the middle of the string should be replaced with an empty string.
Example
Input        Output
Ndd12        Ndd12
12Ndd12      Ndd12
Ndd 12       Ndd 12
Nav G45up    Nav Gup
Attempted Code
regexp_replace(df1[col_name], "(^[A-Za-z]+[0-9 ])", "")
You may use:
\d+(?!\d*$)|[^\w\n]+(?!([A-Z]|$))
RegEx Demo
Explanation:
\d+(?!\d*$): Match 1+ digits that are not followed by 0+ digits and end of line
|: OR
[^\w\n]+(?!([A-Z]|$)): Match 1+ non-word characters that are not followed by an uppercase letter or the end of the line
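To see the digit branch in isolation, here is a Python sketch that applies only the first alternative, \d+(?!\d*$), to the sample inputs from the question. The second alternative is deliberately left out, and strip_inner_digits is a hypothetical name.

```python
import re

def strip_inner_digits(text: str) -> str:
    # Remove runs of digits unless only digits remain between the run
    # and the end of the string, i.e. keep a trailing number.
    return re.sub(r"\d+(?!\d*$)", "", text)

for sample in ["Ndd12", "12Ndd12", "Ndd 12", "Nav G45up"]:
    print(repr(sample), "->", repr(strip_inner_digits(sample)))
```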
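If you use Python, you can do this with regular expressions via the re module.
import re
s = "Nav G45up"  # example input
# Remove every character that is not an ASCII letter or digit (spaces included).
new_string = re.sub(r"[^a-zA-Z0-9]", "", s)
Inside a character class, a leading ^ negates the class, so [^a-zA-Z0-9] matches any character that is not a letter or digit.
Regular expressions are available in most other languages as well, so an equivalent pattern should work there too.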
I came up with this regex to capture all characters that you want to remove from the string.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+
Do
regexp_replace(df1[col_name], "^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+", "")
Regex Demo
Explanation:
^\d+ - captures all digits in a sequence from the start.
(?<=\w)\d+(?![\d\s]) - uses a positive lookbehind for a word character and a negative lookahead for a digit or whitespace character, and matches a sequence of digits in the middle of the string (for example, the digits in G45up).
(?<=\s)\s+ - uses a positive lookbehind for a whitespace character followed by one or more whitespace characters, matching all additional spaces.
Note : This regex could be inefficient when matching large strings as it uses expensive look-arounds.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+|(?<=\w)\W|\W(?=\w)|(?<!\w)\W|\W(?!\w)
I am trying to find a regex pattern to fix a localize issue.
The usual delimiters are "." "," or "_", which I have stored in an array of delimiters.
I'm trying to find a pattern which matches any of these delimiters when the number also ends with one or more 0s.
For example: 3,000 or 3,0 or 3.0 or 3.00
You could try positive lookahead
If indeed your data always has one or more 0 after any delimiter, using a positive lookahead ( (?=0+) in this case) might be what you are looking for...
More precisely, for the numbers you gave:
/([_.,](?=0+))/g
should do the trick!
You could try it out and experiment with regex here!
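A quick way to try the lookahead idea is the Python sketch below. The sample string is made up for illustration. Note that (?=0+) only checks that a 0 comes directly after the delimiter, not that the whole fractional part consists of zeros.

```python
import re

sample = "3,000 3.0 3_00 3.5"

# Delimiters that are immediately followed by at least one 0.
delimiters = re.findall(r"[_.,](?=0+)", sample)

# Removing those delimiters leaves "3.5" untouched.
cleaned = re.sub(r"[_.,](?=0+)", "", sample)

print(delimiters)
print(cleaned)
```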
We could likely start with an expression similar to:
\d+[.,](\d+)?[0]
and add additional boundaries to it, if we like.
For instance, if we wish to capture the delimiters, we would be adding a capturing group:
\d+([.,])(\d+)?[0]
Demo
Or if we wish to remove delimiters, we would expand it to:
(\d+)([.,])(\d+)?([0])
and replace it with:
$1$3$4
Demo
Test
const regex = /(\d+)([.,])(\d+)?([0])/gm;
const str = `3,000
3,0
3.0
3.00`;
const subst = `$1$3$4`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
You could add the delimiters to a character class [_.,] and use word boundaries \b to prevent the number from being part of a larger word.
If they are the only value, you might also use anchors to assert the start ^ and the end $ of the string.
\b\d+[_.,]\d*0\b
That will match:
\b Word boundary
\d+ Match 1+ digits
[_.,] Match any of the characters listed in the character class
\d*0 Match 0+ digits followed by a zero
\b Word boundary
Regex demo
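The breakdown above can be checked with a short Python sketch. The sample string extends the question's examples with two made-up values (3.5 and 3,05) that should not match.

```python
import re

pattern = re.compile(r"\b\d+[_.,]\d*0\b")

sample = "3,000 3,0 3.0 3.00 3.5 3,05"

# Only the numbers whose last digit after the delimiter is 0 match.
matches = pattern.findall(sample)
print(matches)
```

In 3,05 the 0 is followed by another digit, so the closing \b fails and the value is skipped.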
Let's say I have the string ">=3.0.0". The regex [^\D](\S+\d+) helps to strip it down to 3.0.0 (https://regex101.com/r/XUxyEM/1). But when I have the string ">=6", the regex [^\D](\S+\d+) returns an empty result (https://regex101.com/r/XUxyEM/2), whilst I want to have 6 as the result. How can I change my regex so that it works for both cases? Sorry, I'm kind of a regex newbie.
If you want to capture a digit followed by any number of digits and dots, after any number of non-digits, use
\D*(\d[\d.]*)
See the regex demo
The following regex
\d[\d.]*$
will match a digit followed by 0+ digits and/or dots at the end of the string. See another demo.
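Both patterns can be tried against the two strings from the question with this Python sketch (the function names are hypothetical):

```python
import re

def version_via_group(text: str) -> str:
    # \D* skips the leading non-digits; group 1 is a digit followed
    # by any mix of digits and dots.
    return re.search(r"\D*(\d[\d.]*)", text).group(1)

def version_at_end(text: str) -> str:
    # Anchored variant: the digits-and-dots run must reach the end of the string.
    return re.search(r"\d[\d.]*$", text).group(0)

for sample in [">=3.0.0", ">=6"]:
    print(version_via_group(sample), version_at_end(sample))
```

The original [^\D](\S+\d+) fails on ">=6" because, after [^\D] consumes the 6, \S+ still needs at least one more character.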
I have this regex
(?<=TG00).*?(?=#)
which extracts all strings between TG00 and #. Demo: https://regex101.com/r/04oqua/1
Now, from above results I want to extract only the string which contains TG40 155963. How can I do it?
Try this pattern:
TG00[^#]*TG40 155963[^#]*#
This pattern just says to find the string TG40 155963 in between TG00 and an ending #. For the sample data in your demo there were 3 matches.
Demo
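Here is a Python sketch of that pattern. The sample text is invented for illustration and is not the data from the linked demo.

```python
import re

# Hypothetical records, each running from TG00 to the next '#'.
text = (
    "TG00 aaa TG40 155963 bbb#"
    "TG00 ccc TG40 999999 ddd#"
    "TG00 eee TG40 155963 fff#"
)

# [^#]* cannot cross a '#', so the search for TG40 155963 stays
# inside a single TG00 ... # record.
records = re.findall(r"TG00[^#]*TG40 155963[^#]*#", text)
print(records)
```

Unlike the original lookaround pattern, the matches here include the TG00 prefix and the closing #.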
For some reason, appending .*? to your lookbehind results in an engine error, but it works fine with a lookahead. The regex below does not match your text exactly, but it does extract it via a capture group.
(?<=TG00).*?(TG40 155963)(?=.*?#)
You can use this regex with a lookahead and negated character class:
(?<=TG00)(?=[^#]*TG40 155963)[^#]+(?=#)
RegEx Demo
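The lookaround pattern above can also be tried in Python, where the fixed-width lookbehind (?<=TG00) is supported. The sample text below is invented for illustration.

```python
import re

# Hypothetical records, each running from TG00 to the next '#'.
text = (
    "TG00 aaa TG40 155963 bbb#"
    "TG00 ccc TG40 999999 ddd#"
    "TG00 eee TG40 155963 fff#"
)

pattern = r"(?<=TG00)(?=[^#]*TG40 155963)[^#]+(?=#)"

# Only the text between TG00 and # is returned, and only for
# records that contain TG40 155963.
between = re.findall(pattern, text)
print(between)
```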
RegEx Explanation:
(?<=TG00): Assert that we have TG00 at previous position
(?=[^#]*TG40 155963): Lookahead to assert we have string TG40 155963 after 0 or more non-# characters, ahead
[^#]+: Match 1+ non-# characters