Regexp: replace all digits in a URL after the third slash

How can I use a regexp to replace every digit in a URL after the third slash with a # character?
The number of # characters must equal the number of replaced digits.
Digits can appear in more than one path segment.
Their position is not fixed, but they always come after the third slash.
Examples:
/path/to/something/1234/end
/path/to/something/12/1234/end
To:
/path/to/something/####/end
/path/to/something/##/####/end
I tried to use an expression, but it does not give the desired result:
"(?<=/)\\d+(?=/|$), #####"
This regexp is needed to implement the grok pattern in Logstash (gsub function).
P.S. Why after the third slash? Because digits can also appear earlier in the path, and those must not be changed (/path/to_1/something/1234/end)

You can use
(?:\G(?!^)|^((?:/[^/]*){3}/))(\D*)\d
as regex and $1$2# as replacement.
See the regex demo.
Details:
(?:\G(?!^)|^((?:/[^/]*){3}/)) - end of the previous match (\G(?!^)) or (|) start of string, then three occurrences of / followed by zero or more non-slash chars, and then a slash char, all captured into Group 1 (^((?:/[^/]*){3}/))
(\D*) - Group 2: any zero or more non-digits
\d - a digit
The replacement is a concatenation of Group 1 + Group 2 values and a # char.
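Python's built-in re module has no \G anchor, so the pattern above cannot be used there as is. A minimal sketch of the same idea, assuming Python and a hypothetical helper name mask_digits: capture the prefix with the Group 1 sub-pattern from above, then replace every digit in the remainder with #.

```python
import re

# Same prefix as Group 1 above: three "/segment" parts plus the next slash.
# The second group is simply the rest of the string.
PREFIX = re.compile(r"^((?:/[^/]*){3}/)(.*)$")

def mask_digits(url: str) -> str:
    """Replace each digit after the protected prefix with '#' (hypothetical helper)."""
    m = PREFIX.match(url)
    if m is None:
        # Fewer than three segments: leave the string unchanged.
        return url
    return m.group(1) + re.sub(r"\d", "#", m.group(2))

print(mask_digits("/path/to/something/12/1234/end"))
```

Digits inside the first three segments, as in /path/to_1/something/1234/end, are left alone because they fall inside the captured prefix.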

Related

Regex with wildcard search?

I created a Regex to check a string for the following situation:
first 4 chars are numbers
following by a point
following by 3 numbers
following by a point
following by 4 to 8 numbers or letters
ie: 1234.123.125B
My Regex: ^[0-9]{4}[.][0-9]{3}[.][0-9a-zA-Z]{4,8}$
But now I need a wildcard search: The Regex should also match if there is a '*' after the first 8 characters. For example:
1234.123.12* MATCH
1234.123* MATCH
1234.123.45B9* MATCH
1234.12* NO MATCH
1234.12345* NO MATCH
How can I add the wildcard search to my Regex?
Thank you
You may use this regex with alternation:
^\d{4}\.\d{3}(?:\*|\.[\da-zA-Z]{0,7}\*|\.[\da-zA-Z]{4,8})$
RegEx Demo
RegEx Details:
^: Start
\d{4}\.\d{3}: Match 4 digits + 1 dot + 3 digits
(?:\*|\.[\da-zA-Z]{0,7}\*|\.[\da-zA-Z]{4,8}): Match a single * OR a dot, 0 to 7 digits/letters and a * OR a dot and 4 to 8 digits/letters
$: End
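A quick way to check the alternation pattern against the samples from the question is a short script. This is only a sketch, assuming Python's re module; the name is_match is made up for illustration.

```python
import re

# The alternation pattern from the answer above, unchanged.
WILDCARD = re.compile(r"^\d{4}\.\d{3}(?:\*|\.[\da-zA-Z]{0,7}\*|\.[\da-zA-Z]{4,8})$")

def is_match(s: str) -> bool:
    """Return True when s is a full value or a valid wildcard search."""
    return WILDCARD.match(s) is not None

for sample in ["1234.123.125B", "1234.123.12*", "1234.123*",
               "1234.123.45B9*", "1234.12*", "1234.12345*"]:
    print(sample, "MATCH" if is_match(sample) else "NO MATCH")
```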
My assumptions are that:
You don't allow wildcards to be mid-string
Nor do you want to allow wildcards after the full pattern (e.g.: 1234.123.12345678*).
So, alternatively, you could use something like:
^\d{4}\.\d{3}(?!.*\*.)(?![^*]{0,4}$)[.*][*\da-zA-Z]{0,8}$
See the online demo.
^ - Start string anchor.
\d{4}\.\d{3} - Four digits, a dot and another three digits.
(?!.*\*.) - Negative lookahead for zero or more characters followed by an asterisk and another character other than newline.
(?![^*]{0,4}$) - Negative lookahead for zero to four characters other than an asterisk before the end string anchor.
[.*] - A literal dot or asterisk.
[*\da-zA-Z]{0,8} - Zero to eight characters from the character class.
$ - End string anchor.
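The two assumptions behind the lookahead version can be exercised directly. The following is a sketch assuming Python's re module; allowed is a hypothetical helper name, and the last two inputs are extra samples that are not in the question.

```python
import re

# The lookahead-based pattern from the answer above, unchanged.
LOOKAHEAD = re.compile(r"^\d{4}\.\d{3}(?!.*\*.)(?![^*]{0,4}$)[.*][*\da-zA-Z]{0,8}$")

def allowed(s: str) -> bool:
    """Return True when s passes the lookahead-based pattern."""
    return LOOKAHEAD.match(s) is not None

print(allowed("1234.123.12*"))        # wildcard at the end
print(allowed("1234.123.1*2"))        # wildcard mid-string
print(allowed("1234.123.12345678*"))  # wildcard after the full pattern
```

The first call accepts the input, while the mid-string wildcard and the wildcard after a full 8-character tail are both rejected.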

Removing trailing zeros using REPLACE regex

Remove trailing zeros from a number with 4 decimals
Sample expected output:
1.7500 -> 1.75
1.1010 -> 1.101
1.0000 -> 1
I am new to regex, so I tried this one first, but it does not work:
REPLACE ALL OCCURRENCES OF REGEX '^\.[0]\d{0,3}' IN lv_rate WITH space.
Need help for the right regex to use. Thanks!
EDIT: SHIFT lv_rate RIGHT DELETING TRAILING '0' is not an option.
Try replacing on the following regex pattern:
\.?0+$
Use empty string as the replacement. This will match an optional decimal point, followed by trailing zeroes until the end of the string. See the demo below to see this pattern working.
Demo
This answer assumes that all inputs would always have a decimal component. If not, then we would need to add additional logic.
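The question targets ABAP, but the pattern itself can be tried in any regex engine. A minimal sketch in Python (a swapped-in language, with a hypothetical helper name strip_zeros) shows both the expected results and why the decimal-component assumption matters:

```python
import re

def strip_zeros(value: str) -> str:
    """Remove an optional decimal point plus trailing zeros at the end of the string."""
    return re.sub(r"\.?0+$", "", value)

for sample in ["1.7500", "1.1010", "1.0000"]:
    print(sample, "->", strip_zeros(sample))

# Without a decimal component the zeros of the integer itself are removed:
print("100", "->", strip_zeros("100"))
```

The last call turns 100 into 1, which is the case the assumption above rules out.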
If you want to remove trailing zeros from a number with 4 decimals, one option is to use a capturing group and use group 1 in the replacement.
^(\d+(?=\.\d{4}$)(?:\.\d*[1-9])?)\.?0+$
In parts
^ Start of string
( Capture group 1
\d+ Match 1+ digits
(?=\.\d{4}$) Assert what is on the right is a . and 4 digits
(?:\.\d*[1-9])? Optionally match a . and digits up to the last digit 1-9
) Close group 1
\.?0+ Match an optional . and 1 or more times a zero
$ End of string
Regex demo
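As a sketch of how the capture-group variant behaves, here it is in Python (again a swapped-in language, not ABAP; strip_zeros_4dp is a made-up name). The replacement \1 keeps only group 1, and inputs that do not have exactly 4 decimals are left untouched because the lookahead fails.

```python
import re

# The pattern from the answer above, unchanged.
FOUR_DECIMALS = re.compile(r"^(\d+(?=\.\d{4}$)(?:\.\d*[1-9])?)\.?0+$")

def strip_zeros_4dp(value: str) -> str:
    """Drop trailing zeros only when the number has exactly 4 decimals."""
    return FOUR_DECIMALS.sub(r"\1", value)

for sample in ["1.7500", "1.1010", "1.0000", "1.50", "100"]:
    print(sample, "->", strip_zeros_4dp(sample))
```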

RegEx - if then else

I am trying to work out a regex expression but struggle with conditionals. I have a list of 100s of URLs that look like this:
/name/something/details/55334
/name/page/1/2
/name/somethingdifferent/34523
/name/page/1
/name/something/553/1
The bottom line is that I want to remove everything from the first number onward, except when the segment just before the number is the word 'page'.
1. /name/something/details/
2. /name/page/1/2
3. /name/somethingdifferent/
4. /name/page/1
5. /name/something
I will be removing it with Google Analytics Content Grouping or potentially with DataStudio. I already removed /name/ so I have:
1. /something/details/55334
2. /page/1/2
3. /somethingdifferent/34523
4. /page/1
5. /something/553/1
but want to add another rule and remove the numbers so I get:
1. /something/details/
2. /page/1/2
3. /somethingdifferent/
4. /page/1
5. /something
I have already tried:
\(?(?=(page\/[0-9]+))(\2)|(\/\d+)
following the syntax of:
(?(?=condition))(IF)|(ELSE)
but it highlights all numbers after text.
Thanks for your help.
sampak
Try ^(\/page.*|[^0-9]*), works with your example.
A Version incl. name: ^(page[\/\d]*|[^\d\s])*
One option might be to match not a whitespace or digit while not matching /page.
Then match a forward slash and 1+ digits followed by any char 0+ times to omit that from the result.
^((?:(?!\/page)[^\d\s])*\/)\d.*
In parts
^ Start of string
( Capture group 1
(?: Non capturing group
(?!\/page) Negative lookahead, assert what is directly to the right is not /page
[^\d\s] Match any char except a digit or whitespace char
)* Close non capturing group and repeat 0+ times
\/ Match /
) Close group 1
\d.* Match a digit followed by any char except a newline 0+ times
In the replacement use the first capturing group
Regex demo
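To see the replacement in action on the five sample paths (with /name already removed), here is a small sketch. It assumes Python's re module rather than Google Analytics or Data Studio, and trim_numbers is a hypothetical name.

```python
import re

# The pattern from the answer above; "/" needs no escaping in Python.
NUMBER_TAIL = re.compile(r"^((?:(?!/page)[^\d\s])*/)\d.*")

def trim_numbers(path: str) -> str:
    """Keep group 1 and drop the numeric tail, unless the path starts with /page."""
    return NUMBER_TAIL.sub(r"\1", path)

for path in ["/something/details/55334", "/page/1/2",
             "/somethingdifferent/34523", "/page/1", "/something/553/1"]:
    print(path, "->", trim_numbers(path))
```

Note that the last sample becomes /something/ with a trailing slash, which is consistent with the other results but differs slightly from item 5 in the question.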
If you also want to remove /name you could use:
^\/name((?:(?!\/page)[^\d\s])*\/)\d.*
Regex demo

Trying to match zero outside the word boundaries

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches N zeros outside of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters and not e.g. "_".
The next source char is _, so the regex can also include _.
And now the more difficult part: the second group to be captured
actually has 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a slightly shorter way. Notice that both alternatives start
with [A-Z]+\d+.
So we can put this fragment first and include only the rest
as a non-capturing group ((?:...)) with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
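The final regex can be checked against all four file names with a short script. This is a sketch assuming Python's re module, with one deliberate change: the possessive quantifier {2,}+ is written as plain {2,}, because Python only supports possessive quantifiers from version 3.11 on. For these four inputs the captured text is the same. The name second_group is made up for illustration.

```python
import re

# Assumption: "{2,}+" from the answer is relaxed to "{2,}" for portability.
NAME = re.compile(r"^([A-Z]+\d+)_([A-Z]{2,}(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))")

def second_group(filename: str):
    """Return the text captured by group 2, or None when there is no match."""
    m = NAME.match(filename)
    return m.group(2) if m else None

for name in ["FQC19515_TCELL001_20190319_165944.pdf",
             "FQC19515_TBNK001_20190319_165944.pdf",
             "FLW194640_T20NK022_20190323_131348.pdf",
             "FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf"]:
    print(name, "->", second_group(name))
```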
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digit and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo
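\K is a PCRE feature and is not available in every engine. Where it is missing, a capturing group gives the same text. A sketch assuming Python's re module (which has no \K), with a hypothetical helper name five_chars:

```python
import re

# Capturing group in place of \K: the 5 characters after the first underscore.
FIVE = re.compile(r"^[A-Z]+\d+_([A-Z0-9]{5})")

def five_chars(filename: str):
    """Return the 5 capitals/digits after the first underscore, or None."""
    m = FIVE.match(filename)
    return m.group(1) if m else None

for name in ["FQC19515_TCELL001_20190319_165944.pdf",
             "FQC19515_TBNK001_20190319_165944.pdf",
             "FLW194640_T20NK022_20190323_131348.pdf",
             "FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf"]:
    print(name, "->", five_chars(name))
```

Because the width is fixed at 5, the TBNK sample yields TBNK0 rather than TBNK, so this approach only fits when every wanted code is exactly 5 characters long.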

Regex: exclude trailing .0 but include all strings

I have a number of floats/strings that look as follows:
12339.0
133339
159.0
dfkkei
something
32439
Some of them have a trailing .0. How can I show all the numbers without the trailing .0 using a regular expression, including the items that are not numbers? I tried something like this, hoping it would exclude all .0 from the capture group, but it doesn't work: (.*)(:?.0)?
https://regex101.com/r/sC6jO2/1
You may use a simpler regex:
\.0+$
And replace with an empty string, see regex demo.
The regex matches a . (\.) followed by 1 or more zeros (0+) up to the end of string ($).
If you plan to match two groups as in your initial attempt, use
^(.*?)(?:\.0+)?$
See this regex demo
Here,
^ - start of string
(.*?) - Group 1 capturing any 0+ chars other than a newline, as few as possible (=lazily), up to a
(?:\.0+)? - optional sequence of . + one or more zeros
$ - at the end of the string.
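Both approaches can be compared side by side. The sketch below assumes Python's re module and single-line inputs; strip_dot_zero and first_group are hypothetical helper names.

```python
import re

def strip_dot_zero(s: str) -> str:
    """Replacement approach: delete a trailing '.0', '.00', and so on."""
    return re.sub(r"\.0+$", "", s)

def first_group(s: str) -> str:
    """Capture approach: return Group 1 of ^(.*?)(?:\\.0+)?$."""
    m = re.match(r"^(.*?)(?:\.0+)?$", s)
    # A multi-line input may not match; fall back to the input unchanged.
    return m.group(1) if m else s

for item in ["12339.0", "133339", "159.0", "dfkkei", "something", "32439"]:
    print(item, "->", strip_dot_zero(item), "/", first_group(item))
```

For each of the six sample values the two functions return the same text, and the non-numeric items pass through unchanged.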