Using regex to get string after final occurrence of / in a URL - regex

I have a large list of URLs such as:
https://www.walmart.com/ip/Cabbage-Patch-Kids-Naptime-Babies-Doll-Blonde-Hair-Blue-Eye-Girl/45792420
https://www.walmart.com/ip/My-Life-As-18-inch-Schoolgirl-Doll-Blonde/336940687
https://www.walmart.com/ip/My-Life-As-18-inch-Everyday-Girl-Doll-African-American/52730785
I need to find everything after the final / (for example 45792420) on each line of the file.
I'm using Sublime Text 3 to do the search with regex.
I created the following regex:
\/(?:.(?!\/))+$
However, it returns the / along with the string, rather than just the string that occurs after the /.
For example /45792420
How can I just get whatever comes after the final / ?

Just use \K to reset the start of the match, so that nothing before the \K is included in the result:
\/\K(?:.(?!\/))+$

I would use a zero-width assertion (lookbehind) for this:
(?<=\/)\d+$
If the last part is not all digits, you can search for word characters:
(?<=\/)\w+$
Based on your regex, you can simply add a lookbehind (zero-width assertion):
(?<=\/)(?:.(?!\/))+$
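As a quick check outside Sublime Text, the lookbehind version can be tried in JavaScript. This is only a sketch, assuming an engine with lookbehind support (for example a recent Node.js); the variable names are made up for illustration. The m flag makes $ match at the end of every line:

```javascript
// The three sample URLs from the question, one per line.
const urls = [
  "https://www.walmart.com/ip/Cabbage-Patch-Kids-Naptime-Babies-Doll-Blonde-Hair-Blue-Eye-Girl/45792420",
  "https://www.walmart.com/ip/My-Life-As-18-inch-Schoolgirl-Doll-Blonde/336940687",
  "https://www.walmart.com/ip/My-Life-As-18-inch-Everyday-Girl-Doll-African-American/52730785",
].join("\n");

// (?<=\/)       the match must be preceded by a slash (the slash is not part of the match)
// (?:.(?!\/))+  one or more characters, none of which is followed by a slash
// $             with the m flag, the end of each line
const lastSegment = /(?<=\/)(?:.(?!\/))+$/gm;

// With the g flag, match() returns every match as a plain array of strings.
const ids = urls.match(lastSegment);
console.log(ids);
```

Note that \K itself is not available in JavaScript regular expressions, which is why the lookbehind form is used here.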

Related

Data Studio Regex (Google RE2) to Extract Subdirectory from Path

I'm working with a Google Data Studio field that has a page URL Path contained within it. Examples:
/
/sample-url
/sample-url-2/
/#sample-url-5/
/sample-url-3/sample-url-4
/sample-url-3/sample-url-6
In each one, I want to capture the first subdirectory in a custom formula/field: from the first slash, up to but excluding the second slash if there is one, and just the first slash if that's the whole path. I would be open to capturing the second slash when there is one if that would make the solution simpler, but I'm guessing it's more complicated that way. I tried the following:
REGEXP_EXTRACT(Field, "^/[^/]+/$")
But it didn't work; everything returned null. What is wrong with that string?
The ^/[^/]+/$ pattern matches a string that starts with a / char, then contains one or more chars other than / and then ends with a / char. So, you can only match strings like /abc/, /123abc/, /abc-1 2 3.?!/, etc.
You can use
REGEXP_EXTRACT(Field, "^(/[^/]*)")
NOTE: REGEXP_EXTRACT requires a capturing group in the pattern, the content captured is the return value.
Here, ^ matches the start of string and (/[^/]*) is a capturing group with ID 1 that matches a / char and then any zero or more chars other than / (with [^/]*).
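The behaviour of the pattern can be checked outside Data Studio with a small JavaScript sketch. It assumes that RE2 and JavaScript treat this simple pattern the same way; extractFirstSegment is a hypothetical helper standing in for REGEXP_EXTRACT, not a Data Studio function:

```javascript
// Sample paths from the question.
const paths = [
  "/",
  "/sample-url",
  "/sample-url-2/",
  "/#sample-url-5/",
  "/sample-url-3/sample-url-4",
  "/sample-url-3/sample-url-6",
];

// Hypothetical stand-in for REGEXP_EXTRACT: returns capture group 1, or null if nothing matches.
function extractFirstSegment(path) {
  const m = path.match(/^(\/[^\/]*)/);
  return m ? m[1] : null;
}

const firstSegments = paths.map(extractFirstSegment);
console.log(firstSegments);
```

Because [^/]* allows zero characters, the bare / path still matches instead of returning null.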

How to apply correct regex?

I have a special task which requires lots of regex and javascript parsing.
My head is almost exploding, so maybe I'm just tired and missed some small thing. I'm not a newbie to regex, so perhaps someone can point me in the right direction and show me where I made a mistake.
So I have this regex code:
((?<=\ffmpg=).+(?=////u0026cs=nt))
to get the substring between two strings. The first string is:
ffmpg= and the match should start right after it and end just before the other string, //u0026cs=nt, starts.
The problem is that it only works when the HTML page contains a single parameter with that name, but the source HTML contains tens of ffmpg parameters, each followed by the same end string cs=nt.
I can't even make the regex count characters, because the number of characters is different every time you visit the HTML page, sometimes +3, sometimes +10. So the only way is to get the string between param1 and param2.
This is the string I need to get: 1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012
This is the source html example:
\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\
I have copied the same block three times for this purpose, because the HTML source is very big and I doubt I can upload it here.
Thanks for your help.
In your question, you use (?<=\ffmpg=), where \f matches a form feed character, which is not present in the example data. If you meant to use \\f, it matches a backslash followed by f, which is also not present in the example data.
You could get the match using a capturing group instead of lookarounds, as lookbehinds are not supported by all browsers.
If you just want a single match, you can omit the /g global flag.
If you use .+ you will match too much: .+ first matches to the end of the string and then backtracks only until \\u0026cs=nt can match, which is at its last occurrence.
What you could do instead is be specific about what you allow to match, which for the current string is the character class [AC0-9%]+
You could broaden the character class with a range, for example A-Z instead of AC, and add more chars or ranges as required.
ffmpg=([AC0-9%]+)\\\\u0026cs=nt
For example
const regex = /ffmpg=([AC0-9%]+)\\\\u0026cs=nt/;
const str = `\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\`;
console.log(str.match(regex)[1]);
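To see the difference between the greedy .+ and a restricted character class, here is a small sketch on a shortened, made-up sample. It assumes the data contains a single literal backslash before u0026, as in the source HTML shown in the question, and it uses the [A-F0-9%] class, which covers the hex digits of URL-encoded text:

```javascript
// Shortened, made-up sample with the same shape as the question's data.
// String.raw keeps each backslash as a literal character.
const sample = String.raw`\u0026ffmpg=1%2C2\u0026cs=nt\u0026token=x\u0026ffmpg=3%2C4\u0026cs=nt\u0026token=y`;

// Greedy .+ runs on to the LAST "\u0026cs=nt" in the string, so it swallows too much.
const greedy = sample.match(/ffmpg=(.+)\\u0026cs=nt/)[1];

// The restricted class cannot cross the backslash, so every value is matched on its own.
const values = [...sample.matchAll(/ffmpg=([A-F0-9%]+)\\u0026cs=nt/g)].map((m) => m[1]);

console.log(greedy); // contains both ffmpg values and everything in between
console.log(values); // one entry per ffmpg parameter
```

With the g flag and matchAll every occurrence is returned; without it only the first one is.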
Try this:
(?<=ffmpg=)([A-F0-9%]+)
Explanation
Since your string consists only of URL-encoded characters, you can use the [A-F0-9%]+ character class to capture it. It will stop before the next parameter starts, because the backslash there is not in the class.

Find a number which follows a specific string in JSON using regex

I have JSON which contains a string like
"Kiransinh": 1443095486000,
I want to find the number which follows the "Kiransinh": string,
so the output will be 1443095486000.
So far I was able to find the "Kiransinh": string using
("Kiransinh"[ :]+)
To match the numbers after Kiransinh you can use:
/"Kiransinh": \d+,/
You almost got it. Just add \d+ or [0-9]+ after your pattern and capture it in a group (group 1, to be precise). Also, the character class [ :]+ makes the : optional, which I assume you don't want.
Your pattern should look like this:
"Kiransinh":\s*(\d+)
Then read capture group 1.
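A minimal JavaScript sketch of reading group 1 follows. The surrounding JSON text is made up; only the "Kiransinh" entry comes from the question:

```javascript
// Made-up JSON text; only the "Kiransinh" entry is taken from the question.
const jsonText = '{ "Name": "x", "Kiransinh": 1443095486000, "Other": 42 }';

// \s* tolerates any whitespace after the colon; (\d+) captures the digits as group 1.
const match = jsonText.match(/"Kiransinh":\s*(\d+)/);

// match is null when the key is missing, so guard before reading the group.
const kiransinhValue = match ? match[1] : null;
console.log(kiransinhValue);
```

The captured value is a string; convert it with Number(...) if a numeric value is needed.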

Regex split and concatenate path base and pattern with filename deleting part of path between them

I have a URL like this:
a) <a href=\"http://example.com/path-pattern-to-match/subPath/onemoreSubpath/arbitrary-number-of-subpaths/someArticle1\">
or:
b) <a href=\"http://example.com/path-pattern-to-match/someArticle2\">
I need to keep the start of the <a> tag, the base URL and the path pattern, and concatenate them with its someArticle. Everything in between needs to be deleted.
Case 'b' remains untouched. Case 'a' needs to become:
<a href=\"http://example.com/path-pattern-to-match/someArticle1\">
Please answer with a regex; that is what I need. Other solutions using Perl or a bash script could be interesting if well explained, but please avoid suggesting some parsing module or function only to say that regex is not the best solution, without offering a real solution.
PS: I need to parse a non multiline file.
someArticle is variable.
If you have look-behind support, use
(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/)(?:[^\/]+\/)*([^\/>"]*)(?=\\">)
EXPLANATION
(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/) - a fixed width lookbehind making sure we have <a href=\"http://example.com/path-pattern-to-match/ literal text in front of...
(?:[^\/]+\/)* - 0 or more sequences of 1 or more characters other than / ([^\/]+) followed with a literal / (i.e. subpaths)
([^\/>"]*) - A capturing group that matches our keyword "someArticle" (0 or more characters other than ", >, or /).
(?=\\">) - A positive lookahead checking if there is a \"> right after the preceding subpattern.
Using the $1 replacement string, you can remove the subpaths and keep the "someArticle" part.
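The replacement can be sketched in JavaScript as follows, assuming an engine with lookbehind support (for example a recent Node.js) and input that contains the literal \" sequences shown in the question. The variable names are made up for illustration:

```javascript
// The pattern from above, with the g flag so every link on the line is processed.
const stripSubpaths =
  /(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/)(?:[^\/]+\/)*([^\/>"]*)(?=\\">)/g;

// String.raw keeps the backslash in \" as a literal character, as in the question's input.
const caseA = String.raw`<a href=\"http://example.com/path-pattern-to-match/subPath/onemoreSubpath/arbitrary-number-of-subpaths/someArticle1\">`;
const caseB = String.raw`<a href=\"http://example.com/path-pattern-to-match/someArticle2\">`;

// "$1" replaces the matched subpaths plus article name with the article name alone.
const fixedA = caseA.replace(stripSubpaths, "$1");
const fixedB = caseB.replace(stripSubpaths, "$1");

console.log(fixedA);
console.log(fixedB);
```

Case b has no subpaths, so the match is just the article name and the replacement leaves the string unchanged.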

Regex: Negative lookahead after list match

Consider the following input string (part of a CSS file):
url('...');
url(example.png);
The objective is to take the url part using regex and do something with it. So the first part is easy:
url\(['"]?(.+?)['"]?\)
Basically, it takes the contents from inside url(...), with optional quote characters. Using this regex I get the following matches:
...
example.png
So far so good. Now I want to exclude the urls which include 'data:image' in their text. I think a negative lookahead is the proper tool for that, but using it like this:
url\(['"]?(?!data:image)(.+?)['"]?\)
gives me the following result for the first url:
'...
Not only does it fail to exclude this match, but the matched string itself now includes the quote character at the beginning. If I use + instead of the first ? like this:
url\(['"]+(?!data:image)(.+?)['"]?\)
it works as expected, url is not matched. But this doesn't allow the optional quote in url (since + is 1 or more). How should I change the regex to exclude given url?
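The optional quote can backtrack to match nothing, so the lookahead ends up being tested at the quote instead of at the start of the url. Checking the negative lookahead before every character avoids that: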
url\((['"]?)((?:(?!data:image).)+?)\1?\)
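A short JavaScript sketch of this pattern on made-up CSS follows. The two data:image rules and the selectors are invented for illustration; the first two urls come from the question:

```javascript
// Made-up CSS; only url('...') and url(example.png) come from the question.
const css = [
  "a { background: url('...'); }",
  "b { background: url(example.png); }",
  "c { background: url('data:image/png;base64,AAAA'); }",
  'd { background: url("data:image/gif;base64,BBBB"); }',
].join("\n");

// Group 1: optional opening quote. Group 2: the url, consumed one character at a time,
// each character only if "data:image" does not start at that position.
// \1? then allows the same quote (or nothing) before the closing parenthesis.
const urlPattern = /url\((['"]?)((?:(?!data:image).)+?)\1?\)/g;

const found = [...css.matchAll(urlPattern)].map((m) => m[2]);
console.log(found);
```

Note that the url is now in group 2, because group 1 holds the optional quote.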