Extract string using regex?

Sample Data:
+---------------------------------------------------------------------------------+
|refererurl |
+---------------------------------------------------------------------------------+
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|http://mbappgiwwg33nfz2gk43dn4xgo4tpmnsxe6joozuwk5y8.com/ |
|http://mbappgewtgobzgu4dcmrtgy888888.com/ |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|http://mbappgiwwg33nfz2gk43dn4xgo4tpmnsxe6joozuwk5y8.com/ |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|null |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|http://mbappgiwwg33nfz2gk43dn4xgo4tpmnsxe6joozuwk5y8.com/ |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|http://mbappgiwwg33nfz2gk43dn4xgo4tpmnsxe6joozuwk5y8.com/ |
|https://www.tesco.com/direct/party-gifts-flowers/helium-canisters/cat31450037.cat|
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
|https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html |
+---------------------------------------------------------------------------------+
I want a regex as follows:
a. A regex that extracts the website name, working backward from '.com' to the start of the name and including '.com' itself.
For example:
https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html --> tesco.com
http://mbappgiwwg33nfz2gk43dn4xgo4tpmnsxe6joozuwk5y8.com --> mbappgiwwg33nfz2gk43dn4xgo4tpmnsxe6joozuwk5y8.com

The following regex seems to work here:
[^.\/]+\.com
Note that this doesn't consider possible URLs like tesco.co.uk.com, in which case we would need to do more work.
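As a rough illustration, the same pattern (with the dot before com escaped) can be tried in Python. This is only a sketch: the function name extract_com_domain is made up for the example, and it assumes that null values arrive either as None or as the literal string "null".

```python
import re

# One or more characters that are neither '.' nor '/', followed by a literal ".com".
COM_DOMAIN_RE = re.compile(r"[^./]+\.com")


def extract_com_domain(url):
    """Return the 'name.com' part of a URL, or None if there is none."""
    if url is None:
        return None
    match = COM_DOMAIN_RE.search(url)
    return match.group(0) if match else None


if __name__ == "__main__":
    samples = [
        "https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html",
        "http://mbappgewtgobzgu4dcmrtgy888888.com/",
        "null",
    ]
    for sample in samples:
        print(sample, "-->", extract_com_domain(sample))
```

Because search() returns the leftmost match, "www" is skipped (it is followed by ".tesco", not ".com") and the match starts at "tesco".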

Try this one:
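(?:https?:\/\/(?:www\.)?)(.*?)\/
The host name is in capture group 1. It should also work for domains with multi-part suffixes, provided the URL includes the scheme, for example:
https://www.example.co.uk/qsdqsd.html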
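A minimal Python sketch of this second approach, which reads the host from a capture group instead of relying on ".com". The helper name extract_host is hypothetical, and the dot after www is escaped here so that it only matches a literal dot.

```python
import re

# Scheme, optional "www.", then lazily capture everything up to the next "/".
HOST_RE = re.compile(r"https?://(?:www\.)?(.*?)/")


def extract_host(url):
    """Return the host without scheme and 'www.', or None if the URL does not match."""
    match = HOST_RE.match(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_host("https://www.tesco.com/groceries/dfp/dfp-beaa1a3b14.html"))
    print(extract_host("https://www.example.co.uk/qsdqsd.html"))
    print(extract_host("www.example.co.uk/qsdqsd.html"))  # no scheme, so no match
```

Note that a value without http:// or https:// in front does not match this pattern, and neither does a URL with no "/" after the host.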

Related

Extract multiple values from a string for each id

I want to extract matches from a string column for each id. How can I achieve that?
+--------+---------------------------------------+
| id | text |
+--------+---------------------------------------+
| fsaf12 | Other Questions,Missing Document |
| sfas11 | Others,Missing Address,Missing Name |
+--------+---------------------------------------+
Desired output:
+--------+------------------+
| id | extracted |
+--------+------------------+
| fsaf12 | Other Questions |
| fsaf12 | Missing Document |
| sfas11 | Others |
| sfas11 | Missing Address |
| sfas11 | Missing Name |
+--------+------------------+
You can use regexp_split_to_table for your requirement like below:
WITH t1 AS (
    SELECT 'fsaf12' AS id, 'Other Questions,Missing Document' AS text UNION ALL
    SELECT 'sfas11', 'Others,Missing Address,Missing Name'
)
SELECT id, regexp_split_to_table(text, ',') AS extracted
FROM t1;
OUTPUT
| id | extracted |
|-----------|-----------------------|
| fsaf12 | Other Questions |
| fsaf12 | Missing Document |
| sfas11 | Others |
| sfas11 | Missing Address |
| sfas11 | Missing Name |
Postgres is not my forte at all, but based on this older post on SO you could try to use unnest(). I included a TRIM() to remove possible leading or trailing spaces after a split:
SELECT id, TRIM(unnest(string_to_array(text, ','))) as "extracted" FROM t1;
Or, if you want to use regexp_split_to_table():
SELECT id, regexp_split_to_table(text, '\s*,\s*') as "extracted" FROM t1;
Here we match zero or more whitespace characters, a literal comma, and again zero or more whitespace characters.
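To see what the \s*,\s* separator does outside of Postgres, here is a small Python sketch of the same split. It is only an illustration of the pattern, not a replacement for the SQL above; the function name split_to_rows and the in-memory list of (id, text) tuples are assumptions made for the example.

```python
import re

# Same separator as in the SQL: optional whitespace, a comma, optional whitespace.
SEPARATOR_RE = re.compile(r"\s*,\s*")


def split_to_rows(rows):
    """Turn (id, 'a, b, c') pairs into one (id, value) pair per listed value."""
    result = []
    for row_id, text in rows:
        for part in SEPARATOR_RE.split(text.strip()):
            result.append((row_id, part))
    return result


if __name__ == "__main__":
    sample = [
        ("fsaf12", "Other Questions,Missing Document"),
        ("sfas11", "Others , Missing Address,Missing Name"),
    ]
    for row in split_to_rows(sample):
        print(row)
```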

Matching a string where the initial part is variable and the end part is fixed

The following is the list of instance names in the output of the nova command.
nova list
+--------------------------------------+-----------------------------------------+--------+------------+-------------+------------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-----------------------------------------+--------+------------+-------------+------------------------------------------+
| 6cdc00a7-cfe3-4bfe-bbb1-7980ac1c04c0 | haproxy-instance-vms22updateconfar | ACTIVE | - | Running | Orch-Mgmt=10.32.1.40 |
| d0528617-39cd-4098-b34c-0977f5a18414 | gunicon-instance-vms22updateconfar | ACTIVE | - | Running | vms2.1-net=192.168.0.248 |
| e89dd43d-8021-47c6-9f55-39d8bce3c11b | nsoshim-instance-vms22updateconfar | ACTIVE | - | Running | App-Mgmt=10.20.0.126 |
| b7ea9059-834c-4196-8706-54cfaab3d177 | haproxy-instance-vms22update | ACTIVE | - | Running | App-Mgmt=10.20.0.89 |
| 2d4d22e5-b844-413f-8d36-f8b3eb3dea32 | gunicon-instance-vms22update | ACTIVE | - | Running | App-Mgmt=10.20.0.46 |
| 41c4fdc0-3058-4e39-8207-2c02a611ee22 | nsoshim-instance-vms22update | ACTIVE | - | Running | App-Mgmt=10.20.0.217 |
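+--------------------------------------+-----------------------------------------+--------+------------+-------------+------------------------------------------+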
SUBDOMAIN=vms22update
nova list | grep "\-instance-$SUBDOMAIN"
gunicon-instance-vms22updateconfar
haproxy-instance-vms22updateconfar
nsoshim-instance-vms22updateconfar
gunicon-instance-vms22update
haproxy-instance-vms22update
nsoshim-instance-vms22update
I want to see only the instances whose names end with vms22update.
I tried nova list | grep "-instance-^$SUBDOMAIN$"
but it does not list anything.
@Chris_vr: Thanks for the hint; posting my comment as an answer:
You could try this:
nova list | awk -F"|" '{print $3}' | sed 's/ *$//' | grep -E "vms22update\$"
1. Get the output by executing nova list.
2. Split each line on | and print the third field (the Name column).
3. Remove the trailing whitespace.
4. grep for lines ending with vms22update.
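The same steps can be sketched in Python for cases where the table text is already captured in a string. This is an illustration under assumptions: the function name instances_ending_with is hypothetical, and the input is assumed to have the column layout shown in the question (ID first, Name second).

```python
import re


def instances_ending_with(nova_output, subdomain):
    """Return instance names that end with '-instance-<subdomain>'."""
    # Anchor at the end of the name so 'vms22updateconfar' is rejected.
    suffix_re = re.compile(rf"-instance-{re.escape(subdomain)}$")
    names = []
    for line in nova_output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows split into: '', ID, Name, Status, ... so Name is index 2.
        if len(cells) > 2 and suffix_re.search(cells[2]):
            names.append(cells[2])
    return names


if __name__ == "__main__":
    table = (
        "+----+------------------------------------+--------+\n"
        "| ID | Name                               | Status |\n"
        "+----+------------------------------------+--------+\n"
        "| 1  | haproxy-instance-vms22updateconfar | ACTIVE |\n"
        "| 2  | haproxy-instance-vms22update       | ACTIVE |\n"
        "+----+------------------------------------+--------+\n"
    )
    print(instances_ending_with(table, "vms22update"))
```

Stripping each cell before matching is what makes the end-of-string anchor work; in the raw table the name is followed by padding spaces and another column.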

Regex priority of match (forward and rear looking regex)

I have a monster regex at the moment, and am currently looking at how this best functions.
My regex is listed below, and I am curious whether there is a way to prioritize the alternatives in one function rather than just look for a specific match wherever it may exist.
Example:
If my string contains a match for ([\d]+/[\d]+) or ([\d]+ / [\d]+), it would pick that first.
If that match does not exist but ([\d]+-[\d]+) or ([\d]+ - [\d]+) does, it would pick that match.
After that, if ([\d]+) matches, it would pick that match as the end marker. If none of those exist, it would then just move on to any of the other matches.
So my question is:
With Regex is there any way to prioritize which match to take first?
example: Some of my address strings are in the format of 1 - 12 example street,
often the regex will pull 12 example street rather than taking 1 - 12 example street.
Thanks!
The full regex is listed below:
New Regex("( ([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+) | CAR
SMOULDERING | GAS BOTTLE EXPLOSION | INPUT | OFF | OPPOSITE | CNR |
SPARKING | INCIC1 | INCIC3 | STRUC1 | STRUC3 | G&SC1 | G&SC3 | ALARC1 |
ALARC3 | NOSTC1| NOSTC3 | RESCC1 | RESCC3 | HIARC1 | HIARC3 | CAR
ACCIDENT - POSS PERSON TRAPPED | EXPLOSIONS HEARD | WASHAWAY AS A
RESULT OF ACCIDENT | ENTRANCE | ENT |FIRE| LHS | RHS | POWER LINES
ARCING AND SPARKING | SMOKE ISSUING FROM FAN | CAR FIRE | FIRE ALARM
OPERATING | GAS LEAK | GAS PIPE | NOW OUT | ACCIDENT | SMOKING | ROOF |
GAS | REQUIRED | FIRE | LOCKED IN CAR | SMOKE RISING | SINGLE CAR
ACCIDENT | ACCIDENT | FIRE)(.*?)(?=\SVSE| M | SVC | SVSW | SVNE | SVNW
)", RegexOptions.RightToLeft)
Change the order of the first three alternatives so that the more specific range patterns are tried before the bare number:
(\d+-\d+) | (\d+ - \d+) | (\d+ )
instead of:
([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+)
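The effect of alternative order can be checked with a small Python sketch (the question uses .NET, so this only illustrates the general idea, not RegexOptions.RightToLeft behaviour). The second half shows one possible way to get a strict priority across the whole string by trying separate patterns in turn; the names PRIORITISED and first_by_priority are made up for the example.

```python
import re

address = "1 - 12 example street"

# Alternatives are tried left to right at each position, so the bare number wins here...
print(re.search(r"\d+|\d+ - \d+", address).group(0))
# ...and the range wins once it is listed first.
print(re.search(r"\d+ - \d+|\d+", address).group(0))

# Strict priority over the whole string: try each pattern in order of preference.
PRIORITISED = [
    re.compile(r"\d+(?: / |/)\d+"),  # 3/5 or 3 / 5
    re.compile(r"\d+(?: - |-)\d+"),  # 1-12 or 1 - 12
    re.compile(r"\d+"),              # bare number
]


def first_by_priority(text):
    """Return the match of the highest-priority pattern found anywhere in text."""
    for pattern in PRIORITISED:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


print(first_by_priority("FIRE 7 example road 3/5"))
```

Reordering alternatives only changes which alternative wins at a given starting position. If a lower-priority pattern can match earlier in the string than a higher-priority one, separate passes like the loop above are one way to enforce the preference.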

Regex to capture dialog in Virginia Woolf's novel The Waves?

A bunch of us English grad students are studying dialog in Virginia Woolf's novel The Waves, and I've been trying to mark up the novel in TEI. To do this, it would be useful to write a regex that captures the dialog. Thankfully, The Waves is extremely regular, and almost all the dialog is in the form:
'Now they have all gone,' said Louis. 'I am alone. They have gone into the house for breakfast,'
But could continue for several paragraphs. I'm trying to write a regex to match all the paragraphs of a given speaker.
This is discussed briefly in Chris Foster's blog post, where he suggests something like /'([\^,]+,)' said Louis, '(*)'/, although this would only match single paragraphs, I think. This is how I'm thinking through it:
For every paragraph containing the text "said Louis" (or any other character's name) in the first line of the paragraph, match every line until reaching another character's speech, i.e. "said Rhoda."
I could probably do this with a ton of awkward Python, but I'd love to know whether this is possible with regex.
It seems, from your link, that the text follows the following rules.
Each "line" is indeed a line in the strict sense, i.e. separated by \n.
Paragraphs are demarcated by two or more consecutive new lines, i.e. \n\n+.
Only the non-directional single quote ' is used to demarcate speech.
Here's a quick attempt (scroll all the way down to view the match groups)—flawed, I'm sure—but there's enough here that should lead you in the right direction. Note how if you concatenate the three capture groups, idiomatically known as $1, $2, and $3, you get each character's speech, including punctuation between the "said" separator. However, notice how certain quirks of language throw this regular expression off—for example, the fact that we do not close quotes at the end of paragraphs, yet open new quotes if the speech continues into the next paragraph, throws off the whole balanced-quotes strategy—and so do apostrophes.
\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))
Reading the pattern from left to right:
\n\n                     new paragraph
.*?                      match as many non-newline characters as needed (but no more than needed)
'                        match a (start-)quote
([^^]+?                  match any character as needed (but no more than needed)
[?]?)                    match a question mark, optionally
,?                       match a comma, optionally
'                        match an (end-)quote
 said (?:[A-Z][a-z]+)    match the "said Bernard"
(?:([.])  |, )           match either a period followed by two spaces, or a comma followed by one space
'                        match a (start-)quote
([^^]+?)                 match any character as needed (but no more than needed)
'                        match an (end-)quote
(?=...)                  assert that this end-quote is followed by a string of non-quote characters, then
                         zero or more strings of quoted non-quote characters, then another string of non-
                         quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
Rubular matches (an excerpt):
Match 3
1. But when we sit together, close
2.
3. we melt into each
other with phrases. We are edged with mist. We make an
unsubstantial territory.
Match 4
1. I see the beetle
2. .
3. It is black, I see; it is green,
I see; I am tied down with single words. But you wander off; you
slip away; you rise up higher, with words and words in phrases.
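For comparison, the paragraph-based idea from the question (a speaker's turn runs from the paragraph that names them until the next paragraph that names someone else) can be sketched in a few lines of Python. This is a rough sketch under the assumptions listed in the answer: paragraphs are separated by blank lines, and the "said Name" tag appears on the first line of the paragraph that opens a speech. The names SPEAKER_RE and speeches_by_speaker are invented for the example.

```python
import re
from collections import defaultdict

# A closing quote, then "said Name" followed by a period or comma.
SPEAKER_RE = re.compile(r"' said ([A-Z][a-z]+)[.,]")


def speeches_by_speaker(text):
    """Group paragraphs under the most recently named speaker."""
    speeches = defaultdict(list)
    current = None
    for paragraph in re.split(r"\n\n+", text.strip()):
        first_line = paragraph.split("\n", 1)[0]
        match = SPEAKER_RE.search(first_line)
        if match:
            current = match.group(1)  # a new speaker takes over
        if current is not None:
            speeches[current].append(paragraph)
    return dict(speeches)


if __name__ == "__main__":
    sample = (
        "'Now they have all gone,' said Louis. 'I am alone.\n\n"
        "'They have gone into the house for breakfast,\n"
        "and I am left standing by the wall.'\n\n"
        "'I see a ring,' said Bernard, 'hanging above me.'"
    )
    for speaker, paragraphs in speeches_by_speaker(sample).items():
        print(speaker, len(paragraphs))
```

Keeping whole paragraphs sidesteps the unbalanced-quote and apostrophe problems mentioned above, at the cost of leaving the "said Louis" tag inside the captured text.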

Regex named grouping

Can you have dynamic naming in regex groups? Something like
reg = re.compile(r"(?PText|Or|Something).*(?PTextIWant)")
r = reg.find("TextintermingledwithTextIWant")
r.groupdict()["Text"] == "TextIWant"
So that depending on what the beginning was, group["Text"] == TextIWant
Updated to make the question more clear.
Some regex engines support this, some don't. This site says that Perl, Python, PCRE (and thus PHP), and .NET support it, all with slightly different syntax:
+--------+----------------------------+----------------------+------------------+
| Engine | Syntax | Backreference | Variable |
+--------+----------------------------+----------------------+------------------+
| Perl | (?<name>...), (?'name'...) | \k<name>, \k'name' | %+{name} |
| | (?P<name>...) | \g{name}, (?&name)* | |
| | | (?P>name)* | |
+--------+----------------------------+----------------------+------------------+
| Python | (?P<name>...) | (?P=name), \g<name> | m.group('name') |
+--------+----------------------------+----------------------+------------------+
| .NET | (?<name>...), (?'name'...) | \k<name>, \k'name' | m.Groups['name'] |
+--------+----------------------------+----------------------+------------------+
| PCRE | (?<name>...), (?'name'...) | \k<name>, \k'name' | Depends on host |
| | (?P<name>...) | \g{name}, \g<name>* | language. |
| | | \g'name'*, (?&name)* | |
| | | (?P>name)* | |
+--------+----------------------------+----------------------+------------------+
This is not a complete list, but it's what I could find. If you know more flavors, add them! The backreference forms with a * are those which are "recursive" as opposed to just a back-reference; I believe this means they match the pattern again, not what was matched by the pattern. Also, I arrived at this by reading the docs, but there could well be errors—this includes some languages I've never used and some features I've never used. Let me know if something's wrong.
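As a quick check of the Python row in the table, the following sketch uses a named group, a backreference to it inside the pattern, and the \g<name> form in a replacement string. The sample sentence and the group name word are arbitrary choices for the example.

```python
import re

# (?P<word>...) defines the group; (?P=word) must match the same text again.
DOUBLED_WORD_RE = re.compile(r"(?P<word>\w+) (?P=word)")

sentence = "it was the the best"
match = DOUBLED_WORD_RE.search(sentence)

print(match.group("word"))   # access by name
print(match.groupdict())     # all named groups as a dict

# \g<word> refers to the named group inside the replacement string.
print(DOUBLED_WORD_RE.sub(r"\g<word>", sentence))
```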
Your question is worded kind of funny, but I think what you are looking for is a non-capturing group. Make it like this:
(?:Must_Match_This_First)What_You_Want(?:Must_Match_This_Last)
The ?: is what designates that a group matches but does not capture.
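A small Python sketch of the difference, using the strings from the question. Note that a non-capturing group still takes part in the overall match, so the wanted text is wrapped in an ordinary capturing group here; that wrapping is an addition made for the example.

```python
import re

# The prefix must match but is not captured; only the wanted text gets a group.
pattern = re.compile(r"(?:Text|Or|Something).*(TextIWant)")

match = pattern.search("TextintermingledwithTextIWant")
print(match.groups())   # only the capturing group appears here
print(match.group(0))   # the whole match still includes the prefix
```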
You could first build the string in a dynamic way and then pass it to the Regex engine.
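One way to read this suggestion, sketched in Python: first find out which prefix the string starts with, then build a second pattern whose group is named after that prefix. Everything here is an assumption made for illustration, including the function name match_with_dynamic_name and its default arguments; it also assumes each prefix is a valid Python identifier, since group names must be.

```python
import re


def match_with_dynamic_name(text, prefixes=("Text", "Or", "Something"), wanted="TextIWant"):
    """Return {prefix: wanted_text}, naming the group after the prefix that matched."""
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    first = re.match(rf"(?:{alternatives})", text)
    if first is None:
        return None
    name = first.group(0)  # e.g. "Text"; must be a valid identifier
    # Build the second pattern dynamically, using the prefix as the group name.
    pattern = re.compile(rf"{re.escape(name)}.*(?P<{name}>{re.escape(wanted)})")
    match = pattern.search(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(match_with_dynamic_name("TextintermingledwithTextIWant"))
    print(match_with_dynamic_name("OrSomethingElseTextIWant"))
```

A simpler alternative, if the group name itself does not matter, is to capture both parts with fixed names and build the dictionary afterwards from the two captured values.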