Regex to Extract Last Part of URL that Contains User ID Strings - regex

I'm having a hard time figuring this one out and could use some help.
I'm using Google Analytics filters to reduce the number of unique pages being reported in our app by stripping out ID strings from the URLs that are coming in.
What I need is a regex that will look for URLs that have these IDs in the URL. Here's what sets them apart from the rest of the URL:
ID strings are always the last part of the URL
ID strings always contain both letters and numbers
ID strings are always either 16- or 32-characters in length
ID strings can show up twice in a URL
ID strings can end with either a "/" or without
Here are some example URLs that show how they appear in our reporting:
/app/6be031b9672be9b5/
/app/admin/client/settings/6be031b9672be9b5
/app/subscribers/ea33fb38c9efc4dc0367819f23434f99/
/app/subscribers/customfieldsettings/0359c487066727ae/
/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/
The second part of my question is that this regex should also group everything before these ID strings into a capturing group so that I can call that group later on in the filter, effectively stripping out these ID strings to look like the following:
/app/6be031b9672be9b5/ --> /app/
/app/subscribers/ea33fb38c9efc4dc0367819f23434f99/ --> /app/subscribers/
etc.
I've tried a couple different approaches but none seem to work perfectly, so I could really use the help, thank you!

Here's a solution:
^(.*?)(?:\/[a-zA-Z0-9]{16}|\/[a-zA-Z0-9]{32}){0,2}\/?$
Demo
This will remove the last part or 2 parts of URLs which are 16 or 32 characters long and contain only letters and digits.
You can make sure these parts contain both letters and numbers like this, if the tool supports lookaheads:
^(.*?)(?:\/(?=.{0,15}?\d)(?=.{0,15}?[a-zA-Z])[a-zA-Z0-9]{16}|\/(?=.{0,31}?\d)(?=.{0,31}?[a-zA-Z])[a-zA-Z0-9]{32}){0,2}\/?$
Demo
This adds assertions to the pattern.
Breakdown:
^(.*?) # Start of URL
(?:
\/ # a slash
(?=.{0,15}?\d) # check there's a digit at most 16 chars ahead
(?=.{0,15}?[a-zA-Z]) # check there's a letter at most 16 chars ahead
[a-zA-Z0-9]{16} # check the next 16 chars are digits or letters
| # .. or:
\/ # a slash
(?=.{0,31}?\d) # check there's a digit at most 32 chars ahead
(?=.{0,31}?[a-zA-Z]) # check there's a letter at most 32 chars ahead
[a-zA-Z0-9]{32} # check the next 32 chars are digits or letters
){0,2} # .. at most 2 times
\/?$ # optional slash at end
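As a quick sanity check, here is a minimal Python sketch that runs the lookahead version against the sample URLs from the question. Python's re module is only a stand-in for the Google Analytics filter engine, which is an assumption. strip_ids is a hypothetical helper name, and re-appending the trailing slash is a guess at the desired output format, since group 1 itself ends before the slash.

```python
import re

# Lookahead version of the pattern from the answer above, copied verbatim.
ID_PATTERN = re.compile(
    r"^(.*?)(?:\/(?=.{0,15}?\d)(?=.{0,15}?[a-zA-Z])[a-zA-Z0-9]{16}"
    r"|\/(?=.{0,31}?\d)(?=.{0,31}?[a-zA-Z])[a-zA-Z0-9]{32}){0,2}\/?$"
)


def strip_ids(url):
    """Return the URL without its trailing ID segments (hypothetical helper)."""
    match = ID_PATTERN.match(url)
    if match is None:
        return url
    # Group 1 holds everything before the IDs, without the final slash.
    return match.group(1) + "/"


SAMPLE_URLS = [
    "/app/6be031b9672be9b5/",
    "/app/admin/client/settings/6be031b9672be9b5",
    "/app/subscribers/ea33fb38c9efc4dc0367819f23434f99/",
    "/app/subscribers/customfieldsettings/0359c487066727ae/",
    "/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/",
]

for url in SAMPLE_URLS:
    print(url, "-->", strip_ids(url))
```

In an actual filter you would reference the first capturing group in the output field instead of calling a function.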

This will do it:
([a-z0-9]+)(?:\/?$)
Demo
Explanation:
([a-z0-9]+) matches and captures the alphanumeric part
(?:\/?$) matches (but doesn't capture) the optional final / and then the end of the string ($)

Modified - I totally missed that there can be 1 or 2 IDs at the end.
Oh well, revised, for what it's worth.
# (?i)^(.*?)/((?:(?=[^/]{0,31}[a-f])(?=[^/]{0,31}[0-9])(?:[a-f0-9]{16}|[a-f0-9]{32})(?:(?:/[a-z])?/?$|/)){1,2})$
(?i) # Case insensitive modifier
^ # BOS, begin the ride ..
( .*? ) # (1), Creep up on the first ID
/ # Trim this / junk
( # (2 start), 1-2 ID's separated by a /
(?:
(?= [^/]{0,31} [a-f] ) # Use largest range (32), Must be a letter AND number
(?= [^/]{0,31} [0-9] )
(?: # One of 16 or 32 length
[a-f0-9]{16}
| [a-f0-9]{32}
)
(?:
(?: / [a-z] )? # optional / letter
/? $ # /? EOS for end of 1 or 2
| # or,
/ # / between 2 only
)
){1,2}
) # (2 end)
$ # EOS, rides over !!
Sample output:
** Grp 0 - ( pos 195 , len 63 )
/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/
** Grp 1 - ( pos 195 , len 12 )
/app/reports
** Grp 2 - ( pos 208 , len 50 )
6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/
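To reproduce the sample output outside a regex tester, here is a small Python sketch of the same expanded pattern. It assumes Python's re module behaves like the engine used above for this pattern, and it passes the case-insensitive flag as an argument instead of the inline (?i). The name TWO_ID_PATTERN is made up for the example.

```python
import re

# The expanded pattern from the answer, written in verbose mode.
TWO_ID_PATTERN = re.compile(
    r"""
    ^
    ( .*? )                        # (1) everything before the first ID
    /
    (                              # (2) one or two IDs separated by a /
      (?:
        (?= [^/]{0,31} [a-f] )     # segment must contain a letter
        (?= [^/]{0,31} [0-9] )     # and a digit
        (?: [a-f0-9]{16} | [a-f0-9]{32} )
        (?: (?: / [a-z] )? /? $ | / )
      ){1,2}
    )
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)

url = "/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/"
m = TWO_ID_PATTERN.match(url)
print(m.group(1))  # the part before the IDs
print(m.group(2))  # the ID segment(s)
```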

Related

Parsing digits and decimals out of string with re

I have a string that looks like this:
'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
I need to parse the last set of numbers, the ones between the last period and the closing paren (in this case, 241384081) out of the string, keeping in mind that there may be one or more sets of parenthesis in the filename "yada_yada.mov."
So far I have this:
mo = re.match('.*([0-9])\)$', data1)
...where data1 is the string. But that is only returning the very last digit.
Any help, please?
Thanks!
You may use
(\d[\d.]*)\)$
See the regex demo.
Details
(\d[\d.]*) - Capturing group 1: a digit and then any amount of . and digits, 0 or more times
\) - a )
$ - end of string.
See the Python demo:
import re
s='Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
m = re.search(r'(\d[\d.]*)\)$', s)
if m:
    print(m.group(1)) # => 22.4338.241384081
    # print(m.group(1).replace(".", "")) # => 224338241384081
Alternative patterns:
(\d+(?:\.\d+)*)\)$ # To match digits and then 0 or more repetitions of . + digits
(\d+(?:\.\d+)*)\)\s*$ # To allow any 0+ trailing whitespaces
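The patterns above capture the whole dotted number. If only the last set of digits is wanted, as the question asks, a sketch along these lines might do. Both variants are my own suggestions rather than part of the answer, and the variable names are arbitrary.

```python
import re

s = 'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'

# Option 1: capture only the digits directly before the closing paren.
m = re.search(r'(\d+)\)\s*$', s)
last_number = m.group(1) if m else None

# Option 2: reuse the answer's pattern and keep the part after the last dot.
m2 = re.search(r'(\d[\d.]*)\)$', s)
last_part = m2.group(1).rsplit('.', 1)[-1] if m2 else None

print(last_number)  # => 241384081
print(last_part)    # => 241384081
```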

How does this regex for FQDNs (excluding .arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with a character that isn't one of ., a, r or p' - it's a negated character class.
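A short Python sketch makes the effect visible, using Python's re module as a stand-in for the regex101 tester (an assumption). It also shows why google.uk fails: [a-zA-Z]{2,63} followed by one more character from the negated class needs at least three characters after the last dot.

```python
import re

# Pattern from the question, unchanged.
question_re = re.compile(
    r"(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)"
)

results = {}
for name in ["google.uk", "google.com", "example.data", "test.arpa"]:
    results[name] = bool(question_re.search(name))
    print(name, results[name])
```

Only google.com matches here. example.data is rejected merely because it ends in the letter a, which has nothing to do with .arpa.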
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Positive and negative lookaheads only assert what does (or doesn't) follow the current position; they don't consume any characters. That can have some unexpected outcomes if you're matching variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead though - you'll capture www, but you won't match test (with e.g. the g flag), because the www isn't followed by .arpa, but the test is.
https://regex101.com/r/hU6tP0/5
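The two examples above can be replayed in Python roughly like this (re.findall plays the role of the g flag; Python's re is an assumed stand-in for the tester):

```python
import re

text = "www.test.arpa"

# [\w.]+ swallows the whole string, so the lookahead runs at the very end,
# where nothing follows and "not followed by arpa" is trivially true.
m = re.search(r"([\w.]+)(?!arpa)$", text)
print(m.group(1))  # => www.test.arpa

# Here the lookahead runs right after each dot instead.
labels = re.findall(r"(\w+)\.(?!arpa)", text)
print(labels)  # => ['www']
```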
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
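A minimal sketch of the two-separate-tests idea, in Python: one regex checks the general shape and a plain string test rejects .arpa. The names FQDN_SHAPE and is_allowed_fqdn are hypothetical, and the shape regex is just the question's pattern without the character class.

```python
import re

# Test 1: general FQDN shape (length limit, dot-separated labels, letter-only TLD).
FQDN_SHAPE = re.compile(
    r"^(?=.{4,253}$)(?:[a-z0-9]{1,63}\.)+[a-z]{2,63}$", re.IGNORECASE
)


def is_allowed_fqdn(name):
    """Return True for names with a valid shape that do not end in .arpa."""
    if not FQDN_SHAPE.match(name):
        return False
    # Test 2: a plain string check, no lookahead needed.
    return not name.lower().endswith(".arpa")


for name in ["google.uk", "domain.arpa", "domain.arpanet", "another.domain.arpa.net"]:
    print(name, is_allowed_fqdn(name))
```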
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To prevent the string from ending in .arpa you need to use a negative lookahead (?!...), so modify it like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've reworked the regex to simplify it (I've also incorporated Sobrique's suggestion, adding an important detail):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legend
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing group 1 (CG1); a non-capturing group '(?:...)' would work too
[a-z0-9]{1,63} # any sequence of letters or digits with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE CG1 - repeat its content one or more times
(?!arpa$) # after the last literal dot '.', the string must not end with 'arpa' (I've added '$' to Sobrique's suggestion; without it, '.arpanet' would be rejected too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # close the regex delimiter and add the case-insensitive flag, so [a-z] also matches [A-Z]
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var t;
// pop() walks the list from the last entry to the first
while (t = tests.pop()) {
  document.getElementById("r").innerHTML += '"' + t + '"<br/>';
  document.getElementById("r").innerHTML += 'Valid domain? ' + (re.test(t) ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"></div>

Regex Social Security number validation with dummy characters

I am modifying existing code that displays a SS#. I am trying to figure out the existing validation although I know next to nothing about regular expressions. What I need to do is refactor the existing validation to ALSO accept dummy characters (probably upper-case "X") for the first 5 places, displaying only the last 4 effectively. All this w/o messing up the existing validation. What I pass into the control will depend on roles within the application, either the full number, 000000000 or XXXXX0000. Any suggestions would be greatly appreciated.
<dx:ASPxTextBox ID="SSN" runat="server" CssClass="ContractTextEntry"
MaxLength="9" Width="145px" AutoPostBack="True"
ValidationSettings-RegularExpression-ValidationExpression="^(?!000)(?!666)(?!9)\d{3}([- ]?)(?!00)\d{2}\1(?!0000)\d{4}$">
<MaskSettings Mask="000-00-0000" PromptChar=" " />
<ValidationSettings SetFocusOnError="True">
<RegularExpression ErrorText="Please enter a valid SSN" />
</ValidationSettings>
</dx:ASPxTextBox>
If you just want to accept X as well as a digit in your first 5 numerals then it's a fairly straightforward modification:
^(?!000)(?!666)(?!9)[X0-9]{3}([- ]?)(?!00)[X0-9]{2}\1(?!0000)\d{4}$
All I've done is replace a couple of instances of \d (meaning any digit) with [X0-9] (meaning X or a character in the range 0-9).
FYI - the {3} following the first [X0-9] means repeated 3 times (and the {2} on the 2nd instance means repeated 2 times).
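Here is a rough Python check of that modified expression. Python's re is only a stand-in for the .NET validator, which is an assumption, and the sample values are made up. Note the fourth value: because each position independently accepts X or a digit, mixed input also passes.

```python
import re

ssn_re = re.compile(
    r"^(?!000)(?!666)(?!9)[X0-9]{3}([- ]?)(?!00)[X0-9]{2}\1(?!0000)\d{4}$"
)

samples = ["123456789", "123-45-6789", "XXXXX6789", "1X2-3X-4567", "000456789"]
accepted = {value: bool(ssn_re.match(value)) for value in samples}

for value, ok in accepted.items():
    print(value, ok)
```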
Since you presumably require that the first 5 are either all X's or all digits, a conditional can enforce that.
I think .NET supports conditionals, but I'm not sure whether a group number works as the condition.
I know it supports a group name as the condition.
# ^(?!000)(?!666)(?!9)(?:(XXX)|\d{3})([- ]?)(?!00)(?(1)XX|\d{2})\2(?!0000)\d{4}$
^
(?! 000 )
(?! 666 )
(?! 9 )
(?:
( XXX ) # (1), XXX
| \d{3} # Or digits
)
( [- ]? ) # (2), Separator
(?! 00 )
(?(1) # Conditional, did group 1 match ?
XX # yes, get XX
| \d{2} # no, get digits
)
\2 # Backref to separator
(?! 0000 )
\d{4}
$
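For what it's worth, Python's re module accepts the same conditional syntax with a group number, so the pattern can be tried there. This says nothing definitive about .NET; it is only a sketch with made-up sample values.

```python
import re

masked_ssn_re = re.compile(
    r"^(?!000)(?!666)(?!9)(?:(XXX)|\d{3})([- ]?)(?!00)(?(1)XX|\d{2})\2(?!0000)\d{4}$"
)

samples = [
    "123-45-6789",   # all digits
    "XXXXX6789",     # fully masked, no separator
    "XXX-XX-6789",   # fully masked, with separator
    "1X2-3X-4567",   # mixed digits and X
    "XXX456789",     # masked area number, unmasked group number
    "123XX6789",     # unmasked area number, masked group number
]
accepted = {value: bool(masked_ssn_re.match(value)) for value in samples}

for value, ok in accepted.items():
    print(value, ok)
```

Only the first three values are accepted; the conditional forces the second block to be XX exactly when the first block was XXX.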

Matching percentages

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!
Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, it does depend on the regex engine and the options whether they match line begins/ends or file begins/ends. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in programming, use the m flag.
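The effect of the multiline option can be seen in a small Python sketch. The pattern is a simplified version of the one in the question, without the whitespace handling, and the three-line input is made up for the example.

```python
import re

text = "0\n315%\n2,104.86%"
percent = r"^\d+(?:,\d+)*(?:\.\d*)?%$"

# Without the flag, ^ only matches at the very start of the text,
# and the first line "0" has no percent sign.
without_flag = re.findall(percent, text)

# With re.MULTILINE, ^ and $ also match at every line break.
with_flag = re.findall(percent, text, re.MULTILINE)

print(without_flag)  # => []
print(with_flag)     # => ['315%', '2,104.86%']
```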
It could be done like this.
Edit: So after realizing its a line thing, this is the regex now.
Note(s) -
Uses multiline mode like Bergi's.
Also, you CANNOT just use the \s whitespace class in this.
It doesn't matter what mode is used, \s WILL match CRLF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means match whitespace except CRLF characters. It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n linebreaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL
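To illustrate the point about \s crossing line breaks, here is a Python sketch comparing the pattern above with a variant that uses \s. Python's re is assumed to behave like the tester for these constructs, and the names strict and loose are arbitrary.

```python
import re

# Pattern from the answer: only horizontal whitespace is allowed.
strict = re.compile(
    r"^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$",
    re.MULTILINE,
)
# Same pattern with \s, which also matches line breaks.
loose = re.compile(
    r"^\s*-?\s*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))\s*%\s*$",
    re.MULTILINE,
)

split_input = "-\n000,000000.22\n%"
print(bool(loose.search(split_input)))   # => True, matches across three lines
print(bool(strict.search(split_input)))  # => False

for line in ["- 0 %", "2,104.86%", ", , , ,,,, 0,0,000,00,00,,,0%"]:
    print(repr(line), bool(strict.search(line)))
```

The last sample line, which was the false positive in the question, is rejected by the anchored pattern.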

Match any except list of values - oracle regex

I need an Oracle regex that will match a file-name in the format ABCD_EFG_YYYYMMDD_HH(24)MISS.csv, except if the time-part is one of three specific values: 110000, 140000, or 180000.
So, for example, it should match the file-name ABCD_EFG_20120925_110001.csv, but not the file-name ABCD_EFG_20120925_110000.csv.
The following non-Oracle regex works:
^ABCD_EFG_[0-9]*_(?!110000|140000|180000)[0-9]*\.csv$
but I don't know how to write it as an Oracle regex.
Oracle doesn't support lookahead assertions, so you'll have to spell out all the valid matches:
^ABCD_EFG_[0-9]*_([02-9]|1[0235679]|1[148]0{0,3}[1-9])[0-9]*\.csv$
should work (assuming that the time part is always 6 digits long).
Explanation:
ABCD_EFG_ # Match ABCD_EFG_
[0-9]*_ # Match first number (date part) and _
( # Match a number that starts with
[02-9] # 0 or 2-9
| # or
1[0235679] # 1, followed by 0, 2, 3, 5, 6, 7, or 9
| # or
1[148] # 11, 14, or 18
0{0,3} # followed by up to three zeroes
[1-9] # but then one digit 1-9
) # End of alternation
[0-9]* # Fill the rest with any digits
\.csv # Match .csv (mind the backslash!)
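One way to gain some confidence in the spelled-out version is to compare it against the lookahead version for every valid HHMMSS value. The sketch below does that in Python, which is only a stand-in for Oracle's regex engine (an assumption). The date 20120925 is taken from the question, and all_times is a hypothetical helper.

```python
import re

lookahead_re = re.compile(
    r"^ABCD_EFG_[0-9]*_(?!110000|140000|180000)[0-9]*\.csv$"
)
spelled_out_re = re.compile(
    r"^ABCD_EFG_[0-9]*_([02-9]|1[0235679]|1[148]0{0,3}[1-9])[0-9]*\.csv$"
)


def all_times():
    """Yield every valid 6-digit HHMMSS string."""
    for h in range(24):
        for m in range(60):
            for s in range(60):
                yield f"{h:02d}{m:02d}{s:02d}"


# Collect every time for which the two patterns disagree.
mismatches = []
for t in all_times():
    name = f"ABCD_EFG_20120925_{t}.csv"
    if bool(lookahead_re.match(name)) != bool(spelled_out_re.match(name)):
        mismatches.append(t)

print(mismatches)  # => []
```

This only covers 6-digit time parts, in line with the assumption stated in the answer.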