Match URL params with regex

I'm trying to extract URL params with a regex.
Here is an example string: param1=val1&param2=val2&adv=val3&param3=val4&param4=val5
This is the regex I'm using right now:
(\&)([^=]+)\=([^&]+)
I can't figure out how to match the first param. I want param1 to be in group 2 and val1 in group 3, like the rest of the param matches.
https://regex101.com/r/Qzxyyo/1
How can I do this?
Edit:
This seems to work (meaning param1 is in group 2 and val1 is in group 3), but I don't understand why it works or whether it is reliable:
(\&|^)([^=]+)\=([^&]+)

(\&)([^=]+)\=([^&]+)
Break it down!
\& look for a & character
[^=]+ match characters up until it hits a =
\= match the = character
[^&]+ match characters up until &
() These define the groups!
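Why the (\&|^) variant from the edit works can be checked with a short sketch. It is written in Python purely for illustration (the question does not name a language), and the variable names are made up. The ^ alternative matches the empty string at the start of the input, so the first pair gets a group 1 just like the pairs that start with &.

```python
import re
from urllib.parse import parse_qsl

query = "param1=val1&param2=val2&adv=val3&param3=val4&param4=val5"

# Group 1 is either a literal & or the start of the string (^),
# group 2 is the parameter name, group 3 is the value.
pattern = re.compile(r"(&|^)([^=]+)=([^&]+)")

pairs = [(m.group(2), m.group(3)) for m in pattern.finditer(query)]
print(pairs)

# The standard library parser gives the same pairs for this input.
print(parse_qsl(query))
```

Note that [^&]+ requires at least one character, so a parameter with an empty value (for example a=&b=1) would be skipped by this pattern.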

Related

Regular expression to extract string from urls

I need to extract a string from an URL. Here are some examples:
Input: https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html – Output: bas-026-009
Input: https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html – Output: aw18-245-b86
Input: https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html – Output: ss20-028-e70
I want to be able to extract the string that goes from the first character after the "/eur_en/" until the third dash. Can someone help me? Thanks
You're looking for regexp: \/eur_en\/([^-]+-[^-]+-[^-]+)
Play & test it at regex101: https://regex101.com/r/RvGROG/1
You need something like this:
const urls = [
"https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
"https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
"https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
const rg = new RegExp(`\/eur_en\/([^-]+-[^-]+-[^-]+)`)
const strs = urls.map(url => url.match(rg)[1])
console.log(strs)
// Output:
// [
// "bas-026-009",
// "aw18-245-b86",
// "ss20-028-e70"
// ]
Of course, this is a simple example. In real code, don't forget that .match returns null when nothing matches, so check the result before reading index 1.
The first element of the returned array is the full matched string; the second and any following elements are the substrings captured by the parentheses.
We can improve and complicate our regex like so:
\/((?:[^-\/]+-){2}[^-\/]+)
It lets us drop the specific /eur_en/ anchor and control the number of dash-separated parts.
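The generalized pattern and the "check for a failed match" advice can be sketched together. This is a Python illustration rather than the JavaScript used in the answer, and extract_code is a hypothetical helper name.

```python
import re

# Generalized pattern from the answer: three dash-separated parts right after a slash.
CODE_RE = re.compile(r"/((?:[^-/]+-){2}[^-/]+)")

def extract_code(url):
    """Return the first three dash-separated parts after a slash, or None."""
    match = CODE_RE.search(url)
    return match.group(1) if match else None

urls = [
    "https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
    "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
    "https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
codes = [extract_code(u) for u in urls]
print(codes)

# A URL without three dash-separated parts yields None instead of raising.
print(extract_code("https://www.example.net/eur_en/index.html"))
```

Returning None for a non-matching URL keeps the caller from indexing into a failed match.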
The expression you're looking for is the following:
/(?<=eur_en\/)[^-]*-[^-]*-[^-]*/
Here is how it works:
(?<=eur_en\/): looks behind for eur_en/ but does not include it in the output
[^-]*: matches any run of characters that are not a dash, so it takes everything up to (not including) the first dash; the literal - after it then matches that dash
[^-]*: the same again, taking everything up to the second dash, which the next literal - matches
[^-]*: the same again, taking everything up to (not including) the third dash, where the match ends.
/(?<=\/eur_en\/)\w+-\w+-\w+/g
Tokens
Description
(?<=\/eur_en\/)
Lookbehind: if /eur_en/ is found, match whatever follows it.
\w+-\w+-\w+
Three runs of one or more word characters (\w = [A-Za-z0-9_]) separated by two literal hyphens.
Review: https://regex101.com/r/Ge0zA3/1
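Both lookbehind patterns can be tried outside regex101 as well. The sketch below uses Python's re module, which supports fixed-width lookbehinds such as this one; the sample URL comes from the question and the variable names are illustrative.

```python
import re

url = "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html"

# The lookbehind asserts that /eur_en/ precedes the match without consuming it,
# so group(0) is already the wanted text and no capture group is needed.
negated = re.search(r"(?<=/eur_en/)[^-]*-[^-]*-[^-]*", url)
word = re.search(r"(?<=/eur_en/)\w+-\w+-\w+", url)
print(negated.group(0))
print(word.group(0))

# \w also matches the underscore, not only letters and digits.
print(re.fullmatch(r"\w+", "a_b") is not None)
```

Both searches print aw18-245-b86 for this URL. The [^-]* version accepts any character other than a dash, while \w+ is stricter and would stop at characters such as a dot.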

Python Regex - How to extract the third portion?

My input is of this format: (xxx)yyyy(zz)(eee)fff, where x, y, z, e and f are all digits. The fff part is optional.
Input: x = (123)4567(89)(660)
Expected output: only the eee part, i.e. the number inside the 3rd "()", which is 660 in my example.
I am able to achieve this so far:
re.search("\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.
You could find all the matches enclosed in parentheses () with findall and print the third one:
import re
n = "(123)4567(89)(660)999"
r = re.findall(r"\(\d*\)", n)  # raw string so the backslashes reach the regex engine unchanged
print(r[2])
Output:
(660)
The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'
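Both suggestions can be sketched as follows. The end-anchored pattern is an assumption about how the optional fff digits would be handled (it allows trailing digits before the end of the string); the second pattern is the one given above.

```python
import re

samples = ["(123)4567(89)(660)", "(123)4567(89)(660)999"]

# Assumed variant of the "at-end" idea: (eee), then optional digits (fff), then end of string.
end_anchored = re.compile(r"\((\d+)\)\d*$")

# Pattern from the answer: skip the 13 characters of (xxx)yyyy(zz), then capture eee.
fixed_prefix = re.compile(r"[0-9()]{13}\((\d{3})\)")

results = [
    (end_anchored.search(s).group(1), fixed_prefix.search(s).group(1))
    for s in samples
]
print(results)
```

The fixed-prefix pattern depends on the first three parts always being 13 characters long, so it only fits input that really has the (xxx)yyyy(zz) shape.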
If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
import re
x = '(123)4567(89)(660)'
print(re.search(r"(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))
Output:
(660)
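The same idea generalizes to the n-th parenthesized group by repeating the non-capturing group n - 1 times. A small sketch, where nth_parenthesized is a hypothetical helper that captures only the digits and returns None instead of raising when there is no such group:

```python
import re

def nth_parenthesized(text, n):
    """Return the digits inside the n-th (...) group (1-based), or None."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Skip n - 1 groups of "(digits)" plus any non-"(" characters after each,
    # then capture the digits of the next group.
    pattern = r"(?:\(\d+\)[^(]*){%d}\((\d+)\)" % (n - 1)
    match = re.search(pattern, text)
    return match.group(1) if match else None

x = "(123)4567(89)(660)999"
print(nth_parenthesized(x, 3))
print(nth_parenthesized(x, 4))
```

For this input the third group is 660, and asking for a fourth group gives None.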

How to replace partial groups in python regex?

I have a regex
(obligor_id): (\d+);(obligor_id): (\d+):
A sample match like below:
Match 1
Full match 57-95 `obligor_id: 505732;obligor_id: 505732:`
Group 1. 57-67 `obligor_id`
Group 2. 69-75 `505732`
Group 3. 76-86 `obligor_id`
Group 4. 88-94 `505732`
I am trying to partially replace the full match to the following:
obligor_id: 505732;obligor_id: 505732: -> obligor_id: 505732;
There are two ways to achieve this:
replace groups 3 and 4 with an empty string
replace groups 1 and 2 with an empty string, and then replace group 4 with (\d+);
How can I do either of these in Python? I know there is a re.sub function, but I only know how to replace the whole match, not part of it.
Thanks in advance.
You can change the capturing groups and reference them in the replacement string:
import re
s = 'obligor_id: 505732;obligor_id: 505732:'
re.sub(r'(obligor_id: \d+;)(obligor_id: \d+:)', r'\1', s)
# => 'obligor_id: 505732;'
Thanks for the answers and advice.
For future readers, this is how I achieved both:
re.sub(regex, r'\1: \2;', str)
re.sub(regex, r'\3: \4;', str)
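To see that the two replacements keep different halves of the match, here is a sketch with distinct IDs. The sample values are made up for illustration; the regex is the four-group pattern from the question.

```python
import re

regex = r"(obligor_id): (\d+);(obligor_id): (\d+):"
text = "x obligor_id: 111;obligor_id: 222: y"

# \1 and \2 refer to the first name/number pair, \3 and \4 to the second.
keep_first = re.sub(regex, r"\1: \2;", text)
keep_second = re.sub(regex, r"\3: \4;", text)
print(keep_first)
print(keep_second)

# A function can also build the replacement from the match object.
keep_first_fn = re.sub(regex, lambda m: f"{m.group(1)}: {m.group(2)};", text)
print(keep_first_fn)
```

Only the matched span is rewritten; the surrounding "x " and " y" are left untouched.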

Scala regex: find a matching URL within a sentence

import java.util.regex._
object RegMatcher extends App {
  val str="facebook.com"
  val urlpattern="(http://|https://|file://|ftp://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?"
  var regex_list: Set[(String, String)] = Set()
  val url=Pattern.compile(urlpattern)
  var m=url.matcher(str)
  if (m.find()) {
    regex_list += (("date", m.group(0)))
    println("match: " + m.group(0))
  }
  val str2="url is ftp://filezilla.com"
  m=url.matcher(str2)
  if (m.find()) {
    regex_list += (("date", m.group(0)))
    println("str 2 match: " + m.group(0))
  }
}
This returns
match: facebook.com
str 2 match: url is ftp:
How do I change the regex pattern so that both strings are matched correctly?
What do the symbols actually mean in a regex? I am very new to regex. Please help.
I read your regex as:
0 or 1 (? modifier) of the schemes (http://, https://, etc.)
followed by 0 or 1 instance of www.,
followed by 1 or more (+ modifier) alphanumeric characters,
followed by any character (. is a regex special character, remember, standing for any one character),
followed by 0 or more (* modifier) alphanumerics,
followed by any character (. again),
followed by 3 lowercase letters ({3} being an exact count modifier),
followed by 0 or 1 of any character (.?),
optionally followed by one or more lowercase letters (the trailing ? makes the ([a-z]+) group optional).
If you plug your regex into regex101.com, you'll not only see a similar breakdown (without any errors I might have made, though I think I nailed it), but you'll also have a chance to test various strings against it. Then, once you have your regexes working the way you want, you can bring them back to your script. It's a solid workflow for both learning regexes and developing an expression for a particular purpose.
If you drop your regex and your inputs into regex 101, you'll see why you're getting the output you see. But here's a hint: when you ask your regular expression to match "url is ftp://filezilla.com", nothing excludes "url is" from being part of the match. That's why you're not matching the scheme you want. Regex101 really is a great way to investigate this further.
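The effect of the unescaped dots can be reproduced with a short sketch. It uses Python's re module instead of java.util.regex, which should behave the same way for these constructs. The tightened pattern at the end is only one possible assumption about what should count as a URL here, not a complete URL grammar.

```python
import re

original = (r"(http://|https://|file://|ftp://)?(www.)?([a-zA-Z0-9]+)"
            r".[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?")
text = "url is ftp://filezilla.com"

# Each unescaped . matches any character, including the spaces in "url is ftp".
loose_match = re.search(original, text).group(0)
print(loose_match)

# Assumed tightening: escape the dots so they only match a literal period.
tightened = re.compile(
    r"(?:(?:https?|ftp|file)://)?(?:www\.)?"
    r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-z]{2,}"
)
print(tightened.search(text).group(0))
print(tightened.search("facebook.com").group(0))
```

With the dots escaped, "url is" can no longer be absorbed into the match, so the search moves on until it reaches ftp://filezilla.com.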
The regex can be updated to
((ftp|https?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,})
This is all I needed.

Scala - Explanation for regex statement

Assuming I have a dataframe called df and a regex as follows:
import scala.util.matching.Regex
var df2 = df
val regex = new Regex("_(.)")
for (col <- df.columns) {
  df2 = df2.withColumnRenamed(col, regex.replaceAllIn(col, { M => M.group(1).toUpperCase }))
}
I know that this code renames the columns of df2 so that a column called "user_id" becomes userId.
I understand what the withColumnRenamed and replaceAllIn functions do. What I do not understand is this part: { M => M.group(1).toUpperCase }
What is M? What is group(1)?
I can guess what is happening because I know that the expected output is userId but I do not think I fully understand how this is happening.
Could someone help me understand this? Would really appreciate it.
Thanks!
M is just the parameter name for the match, and group(1) refers to the first group captured by the regex. Consider this example:
World Cup
If you want to match the example above with a regex, you could write something like \w+\s\w+. However, you can make use of groups and write it this way:
(\w+)\s(\w+)
Parentheses in a regex indicate groups. In the example above, the first (\w+) is group 1, which matches World. The second (\w+) is group 2, which matches Cup. If you want the whole match, use group 0.
See the groups in action here on the right side:
https://regex101.com/r/v0Ybsv/1
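The group numbering can also be checked in code. A minimal Python sketch of the same World Cup example (Python is used here only for illustration; the numbering works the same way as in the Scala question):

```python
import re

match = re.match(r"(\w+)\s(\w+)", "World Cup")
print(match.group(0))  # whole match
print(match.group(1))  # first parenthesized group
print(match.group(2))  # second parenthesized group
```

This prints World Cup, then World, then Cup.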
The signature of the replaceAllIn method is
replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String): String
So M is a Match, and it has a group method, which returns
The matched string in group i, or null if nothing was matched
A group in a regex is what's matched by the (sub)regex in parentheses (here that is ., i.e. one character). You can have several capturing groups, and you can name them or refer to them by index. You can read more about capturing groups in the Scala API docs for Regex.
So { M => M.group(1).toUpperCase } means that every match is replaced with the character that follows _, converted to upper case.
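For comparison, the same rename can be sketched in Python, where re.sub accepts a function that receives each match, much like the replacer passed to replaceAllIn. to_camel is a hypothetical helper name, and this is an analogue of the Scala code rather than a Spark example.

```python
import re

def to_camel(name):
    # For every "_x", drop the underscore and upper-case the captured character x.
    return re.sub(r"_(.)", lambda m: m.group(1).upper(), name)

print(to_camel("user_id"))
print(to_camel("first_name_last"))
```

The whole match is _i (group 0), but only group 1, the character after the underscore, ends up in the result, which is why user_id becomes userId.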