How to compare two strings using RegEx

I have a collection in MongoDB which holds a list of URLs like below.
In my business logic, for some requirements, I want to check whether the called URL matches any of the records in the DB.
For example, suppose req.originalUrl gives me
/logistics/initiator/5ee7a0be36acdc46ae0576d6/users
But the above URL obviously contains the actual ID, 5ee7a0be36acdc46ae0576d6.
What I tried:
I tried manually concatenating req.baseUrl and req.route.path, but that still gives me the string below
/logistics/initiator/:initiator/users
which again cannot be compared.

Replace the ID with \{\w+\}, and use that as a regular expression to match against the url column in the table. So the regexp should be /logistics/initiator/\{\w+\}/users.
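The idea above can be sketched in Python. This is only an illustration under stated assumptions: the stored URLs are assumed to use {name} placeholders (as the \{\w+\} in the answer suggests), the ID is assumed to be a 24-character hex MongoDB ObjectId, and the names url_to_pattern and stored_urls are hypothetical.

```python
import re

# Assumption: IDs are MongoDB ObjectIds, i.e. a whole path segment of 24 hex characters.
OBJECT_ID = re.compile(r"(?<=/)[0-9a-fA-F]{24}(?=/|$)")

def url_to_pattern(url):
    """Turn a concrete URL into a regex where each ID segment becomes \\{\\w+\\}."""
    literal_parts = OBJECT_ID.split(url)
    # Escape the literal pieces so characters such as '.' are matched verbatim.
    return re.compile(r"\{\w+\}".join(re.escape(part) for part in literal_parts))

# Hypothetical records, assumed to be stored with {name} placeholders.
stored_urls = [
    "/logistics/initiator/{initiator}/users",
    "/logistics/carrier/{carrier}/users",
]

pattern = url_to_pattern("/logistics/initiator/5ee7a0be36acdc46ae0576d6/users")
matches = [u for u in stored_urls if pattern.fullmatch(u)]
print(matches)
```

In a real application the same regular expression could be handed to the database query instead of filtering in memory; that part depends on how the collection is actually laid out.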

Related

Regular expressions (RegEx) to filter string from URLs in Google Analytics

I want to filter a string from the URLs in Google Analytics. This can be done using the Views > Filter > Exclude using RegEx, but I have been unable to get it to work.
An outline of how these filters are set up can be found here; however, I cannot work out how to isolate the string using RegEx. I believe it will need one filter per URL type.
The URLs follow this format:
/software/11F372288FA/pagename
/software/13F412C5FA/pagename/summary
/software/XIL1P0BFXCKM81/pagename2
I need to exclude this part of the URL:
/11F372288FA/
So that the URL data (e.g. Session time) is recorded against:
/software/pagename
/software/pagename/summary
/software/pagename2
I have worked out that I can isolate the string using the following RegEx
^\/validate\/(..........)\/accounts\/summary$
It is not very elegant and would require a filter for every URL type.
Thanks for the help!
I'm not certain this will work in your exact case, but instead of using regex it might be easier to build a new string from the start of the URL through the end of "/software/" and append everything from "pagename" onward. In Java this might look something like:
String newString = oldString.substring(0, 10) + oldString.substring(oldString.indexOf("pagename"));
Note that this only works if the "/software/" prefix at the start is always the same length and you are only excluding what sits between "software" and "pagename".
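If fixed offsets are too brittle, a single substitution can drop the ID segment whatever its length. The following is a minimal Python sketch of that idea, not a Google Analytics filter definition; it assumes the ID is always the path segment directly after /software, and the name strip_id is hypothetical.

```python
import re

# Capture "/software", skip exactly one path segment, capture the rest.
ID_SEGMENT = re.compile(r"^(/software)/[^/]+(/.*)$")

def strip_id(path):
    """Remove the segment that follows /software; leave other paths untouched."""
    return ID_SEGMENT.sub(r"\1\2", path)

for path in [
    "/software/11F372288FA/pagename",
    "/software/13F412C5FA/pagename/summary",
    "/software/XIL1P0BFXCKM81/pagename2",
]:
    print(strip_id(path))
```

Because [^/]+ cannot cross a slash, one pattern covers all three URL shapes from the question instead of one filter per URL type.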

BQSQLException: Cannot parse regular expression: invalid perl operator (see full error in post, as title cannot contain certain characters)

Please find dummy data and my attempted solution at the end of this post.
I started learning REGEX in the last several days and am creating a REGEX to exclude any private IP addresses from my dataset. My dataset has a column url, which shows from which IP address a company performed an action. This column contains all kinds of IP addresses in the url format.
I have created a query that should output only non-local IP addresses (which are part of a URL). The query is as follows:
WITH table_1 AS(
SELECT 'http://localhost:9999' AS url UNION ALL
SELECT 'https://localhost:0000' AS url UNION ALL
SELECT 'http://stackoverflow.com/challenge' AS url UNION ALL
SELECT 'https://arseniyaskingquestion.ru/SO' AS url
)
SELECT url
FROM table_1
WHERE url NOT IN (SELECT DISTINCT url
FROM table_1
WHERE REGEXP_CONTAINS(url, r'((http(s)?):\/\/)(((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}$|(.*\.local)|(.*local\.)|(.*localhost)|(.*\.internal)|(.*csb.)|(.*codesandbox)|(.*lvh\.me)|(.*.ngrok.)|(.*nip\.io)|(.*.test)).*'))
ORDER BY url DESC
When I run this query, I get the following error message: BQSQLException: Cannot parse regular expression: invalid perl operator: (?!
I searched StackOverflow and found one solution here, but I could not implement it successfully using REGEXP_REPLACE. I kept getting other errors as I tried to implement it, even after reading the Google BigQuery documentation.
As you can see from my code snippet, I am trying to output only non-local IP addresses (which are a part of a full url link). Therefore, the expected output is:
url
----------------------------------
http://stackoverflow.com/challenge
https://arseniyaskingquestion.ru/SO
Is the LIKE clause not suitable?
WHERE LOWER(url) NOT LIKE '%localhost%'
The primary issue with your regex is the use of the negative lookahead (?!$).
Google BigQuery uses re2 and it omits support for lookarounds.
At regex101 you should develop your regexes using the Golang option since that is re2-based. See https://regex101.com/r/HAV5J1/1/ and it will explain why your regex is failing.
Additionally your subquery seems wildly inefficient:
WHERE url NOT IN (SELECT DISTINCT url
FROM table_1
WHERE REGEXP_CONTAINS(url, r'((http(s)?):\/\/)(((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}$|(.*\.local)|(.*local\.)|(.*localhost)|(.*\.internal)|(.*csb.)|(.*codesandbox)|(.*lvh\.me)|(.*.ngrok.)|(.*nip\.io)|(.*.test)).*'))
Could it not be condensed to:
WHERE NOT REGEXP_CONTAINS(url, r'MY_REGEX')
or:
WHERE REGEXP_CONTAINS(url, r'MY_REGEX') = false
I have no experience with BigQuery.
Your regex contains a lookahead and RE2 does not support lookaheads.
If you are satisfied with your pattern all in all, and you just need to fix that lookahead issue, you can unwrap the IP matching part and use
https?:\/\/((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}$|.*\.local|.*local\.|.*localhost|.*\.internal|.*csb.|.*codesandbox|.*lvh\.me|.*.ngrok.|.*nip\.io|.*.test).*
See the RE2 regex demo.
See this IP matching regex page for details.
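The lookaround-free approach can be tried outside BigQuery. The sketch below uses Python's re module purely as a stand-in: it is not RE2 and it accepts lookarounds, so it cannot reproduce the error. The pattern is a trimmed-down illustration that covers only the IPv4 and localhost alternatives and uses only constructs that RE2 also supports. The name is_local and the URL http://192.168.0.1 are assumptions added for the example.

```python
import re

# One IPv4 octet (0-255), written without any lookahead.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# First octet, then exactly three ".octet" groups, then a port, a path or the end.
LOCAL_URL = re.compile(
    r"https?://(?:"
    + OCTET + r"(?:\." + OCTET + r"){3}(?:[:/]|$)"
    + r"|[^/]*localhost"
    + r")"
)

def is_local(url):
    """True when the host part is a bare IPv4 address or contains 'localhost'."""
    return LOCAL_URL.search(url) is not None

urls = [
    "http://localhost:9999",
    "https://localhost:0000",
    "http://stackoverflow.com/challenge",
    "https://arseniyaskingquestion.ru/SO",
    "http://192.168.0.1",  # hypothetical extra row, not in the question's data
]
non_local = [u for u in urls if not is_local(u)]
print(non_local)
```

Filtering with "not is_local(u)" mirrors the WHERE NOT REGEXP_CONTAINS(...) form suggested above, with no subquery involved.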

How to extract sub-directories from the URL using 'REGEXP_EXTRACT' in Data Studio

I'm trying to extract the product name from the URL between the 2 slashes using REGEXP_EXTRACT. For example, I want to extract ace-5 from the URLs below:
www.abc.com/products/phones/ace-5/
www.abc.com/products/phones/ace-5/?cid=dm66363&bid
www.abc.com/products/phones/ace-5/?fbclid=iwar30dpnmmpwppnla7
www.abc.com/products/phones/ace-5/?et_cid=em_367029&et_rid=130
I have a RegEx to extract the Domain Name but it is not something I'm actually looking for. Below is the RegEx:
REGEXP_EXTRACT(page,'^[^.]+.([^.]+)')
It gives the following result: abc
Assuming that the product name would always be the fixed fourth path element, we can try:
REGEXP_EXTRACT(page, '(?:[^\/]+\/){3}([^\/]+).*')
or, if the above would not work:
REGEXP_EXTRACT(page, '[^\/]+\/[^\/]+\/[^\/]+\/([^\/]+).*')
Here is a demo for the above:
Demo
I don't have the same pages in my GDS, so I tried to recreate this with my own data source, i.e. pages from Google Analytics.
You may use the expression below, which extracts the value after the two slashes, as per your requirement.
REGEXP_EXTRACT(Page,'[^/]+/[^/]+/([^/]+)')
You need to create a calculated column with this formula. Once you have created it, you might need to add an additional filter to remove the rows with a null value.
Example Page: "/products/phones/ace-5/"
The calculated column value will be "ace-5"
Note that this regex only gives you the word that follows phones/; if a record has nothing after that, it returns null.
The REGEXP_EXTRACT Calculated Field below does the trick, extracting all characters after the 3rd / till the next instance of /:
REGEXP_EXTRACT(Page, "^(?:[^/]+/){3}([^/]+)")
Google Data Studio Report and a GIF to elaborate
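To see how the last pattern behaves on the URLs from the question, here is a small Python sketch. Python's re module stands in for the Data Studio function, so treat it as an illustration of the pattern rather than of REGEXP_EXTRACT itself; the name product_name is hypothetical.

```python
import re

# Skip three "segment/" groups from the start, then capture the fourth segment.
PRODUCT = re.compile(r"^(?:[^/]+/){3}([^/]+)")

def product_name(page):
    """Return the fourth path element, or None when the pattern does not match."""
    match = PRODUCT.search(page)
    return match.group(1) if match else None

pages = [
    "www.abc.com/products/phones/ace-5/",
    "www.abc.com/products/phones/ace-5/?cid=dm66363&bid",
    "www.abc.com/products/phones/ace-5/?fbclid=iwar30dpnmmpwppnla7",
    "www.abc.com/products/phones/ace-5/?et_cid=em_367029&et_rid=130",
]
names = [product_name(p) for p in pages]
print(names)
```

One thing the sketch makes visible: for a value with a leading slash, such as /products/phones/ace-5/, the anchored pattern finds no match in Python because [^/]+ cannot match an empty first segment, so the segment count depends on whether the field includes the hostname.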

Mariadb: Regexp_substr not working with non-matching group regular expression

I am using a query to pull a user ID from a column that contains text. This is for a forum system I am using and want to get the User id portion out of a text field that contains the full message. The query I am using is
SELECT REGEXP_SUBSTR(message, '(?:member: )(\d+)'
) AS user_id
from posts
where message like '%quote%';
Now, ignoring the fact that's ugly SQL and not final, I just need to get to the point where it reads the user ID. The following is an example of the text that you would see in the message column
[QUOTE="Moony, post: 967760, member: 22665"]I'm underwhelmed...[/QUOTE]
Hopefully we aren’t done yet and this is nothing but a depth signing!
Is there something different about regular expressions when used in MariaDB's REGEXP_SUBSTR? This should be PCRE; the pattern works in regex testers and should be read correctly. It should look for the group "member: ", then take the digits after it, giving a single match on each of those posts.
This is an ugly hack/workaround that uses a lookbehind for "member: " and a lookahead for the following "]; however, it will not work if there are multiple quotes in a post
(?<=member: ).+(?="])
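The reason the original attempt returns more than the ID is that REGEXP_SUBSTR returns the whole matched substring, not a capture group, so a non-capturing prefix is still part of the result. A lookbehind keeps the prefix out of the match. The sketch below shows the difference in Python, whose re module is used here only as a stand-in for PCRE; the name extract_member_id is hypothetical. In MariaDB itself the backslash would likely need to be doubled inside the string literal (for example '\\d+').

```python
import re

# The lookbehind asserts "member: " without including it in the match.
MEMBER_ID = re.compile(r"(?<=member: )\d+")

def extract_member_id(message):
    """Return the first member ID found in a quoted post, or None."""
    match = MEMBER_ID.search(message)
    return match.group(0) if match else None

message = (
    '[QUOTE="Moony, post: 967760, member: 22665"]I\'m underwhelmed...[/QUOTE]\n'
    "Hopefully we aren't done yet and this is nothing but a depth signing!"
)

# Whole match with the question's pattern versus the lookbehind version.
whole_match = re.search(r"(?:member: )(\d+)", message).group(0)
print(whole_match)
print(extract_member_id(message))
```

Using \d+ instead of .+ also sidesteps the multiple-quotes problem, since the match stops at the first non-digit and the first ID is returned.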

How to extract and validate the id from a youtube video link?

I found this RegEx for extracting YouTube IDs:
#^http(?:s?)://(?:www\.)?youtu(?:be\.com/watch\?(?:.*?&(?:amp;)?)?v=|\.be/)([\w\-]+)(?:&(?:amp;)?[\w\?=]*)?#i
Now I'm trying to modify the RegEx to extract the YouTube ID from a YouTube URL in this format:
http://www.youtube.com/watch?v=ESUYMoJVpYo&feature=share&a=rRL4kwOAewcP9KzId6Ks4A
How do I make sure I get the ID extracted from all possible URL formats?
URLs aren't normally parsed with regular expressions. If you want to modify them in any way, you probably shouldn't use a regex for it.
URLs use what's called a Query String to pass parameters to a page. The beginning of the query string is marked by a question mark and followed by an ampersand delimited list of name/value pairs.
For example, using your own url: http://www.youtube.com/watch?v=ESUYMoJVpYo&feature=share&a=rRL4kwOAewcP9KzId6Ks4A
Page request: www.youtube.com/watch
Whole query string: ?v=ESUYMoJVpYo&feature=share&a=rRL4kwOAewcP9KzId6Ks4A
Name/Value pairs:
v -> ESUYMoJVpYo
feature -> share
a -> rRL4kwOAewcP9KzId6Ks4A
If you want to parse/modify the URL, do so by breaking down the query string. That'll be much more reliable than trying to write a RegEx for it.
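The breakdown above can be reproduced with a URL parser instead of a regex. This is a minimal sketch using Python's standard urllib.parse module; it only covers the watch?v=... form discussed here, and the function name youtube_id_from_query is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def youtube_id_from_query(url):
    """Return the value of the 'v' query parameter, or None if it is absent."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)  # maps each name to a list of values
    values = params.get("v")
    return values[0] if values else None

url = "http://www.youtube.com/watch?v=ESUYMoJVpYo&feature=share&a=rRL4kwOAewcP9KzId6Ks4A"
params = parse_qs(urlparse(url).query)
video_id = youtube_id_from_query(url)
print(params)
print(video_id)
```

The order of the parameters no longer matters with this approach, which is where the regex in the question tends to break. Short links such as youtu.be carry the ID in the path rather than the query string and would need separate handling.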