Match URL string in different situations with a single regex

I am trying to match a url in four different situations:
With no attributes
Link without other attr
With other attributes
Link with other attr
With no standard href
<span data-link="example.com/reviews/audi/a6/">Link with no href</span>
Just the URL
example.com/reviews/audi/a6
In all of them I want to do the same thing: move reviews to the end of the path, without an extra /.
I am using this regex to handle the ones that have another attribute, by identifying the space after the closing ":
("example\.com)\/(reviews|used-cars)\/(.*[^\/$])(\/?)(" )
But if the URL is followed by "> instead, it goes wrong and matches all the way to the end of the class attribute:
("example\.com)\/(reviews|used-cars)\/(.*[^\/$])(\/?)(">)
https://regex101.com/r/9xbdme/1

You can use
Find: ("?example\.com)/(reviews|used-cars)/([^"\s]*[^/"\s])/?("[\s>])?
Replace: $1/$3/$2/$4
See the regex demo.
Details:
("?example\.com) - Group 1: an optional ", example.com string
/ - a slash
(reviews|used-cars) - Group 2: reviews or used-cars string
/ - a slash
([^"\s]*[^/"\s]) - Group 3: zero or more chars other than whitespace and " (as many as possible) and then a char other than a whitespace, " and /
/? - an optional slash
("[\s>])? - Group 4 (optional): a " and then a > or whitespace.
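To see how the Find/Replace pair above behaves on the four situations, here is a minimal Python sketch. The function name move_section_to_end and the sample attribute values are made up for illustration; Python writes the replacement as \1/\3/\2/\4 instead of $1/$3/$2/$4, and from Python 3.5 on an unmatched optional group is substituted as an empty string.

```python
import re

# Same pattern as the "Find" expression above.
PATTERN = re.compile(
    r'("?example\.com)/(reviews|used-cars)/([^"\s]*[^/"\s])/?("[\s>])?'
)

def move_section_to_end(text):
    # Hypothetical helper: \1/\3/\2/\4 mirrors the "Replace" string.
    # Group 4 is optional; when it does not take part in the match,
    # Python 3.5+ inserts an empty string for it.
    return PATTERN.sub(r'\1/\3/\2/\4', text)

if __name__ == "__main__":
    samples = [
        'example.com/reviews/audi/a6',
        '<span data-link="example.com/reviews/audi/a6/">Link with no href</span>',
        '<a href="example.com/used-cars/audi/a6" class="c">Link with other attr</a>',
    ]
    for s in samples:
        print(move_section_to_end(s))
```

Note that with this replacement string the moved segment is followed by a single /, including in the bare-URL case.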

Related

Denodo Dive Split function URL

I'm trying the below code to select the last part of the URL:
select 'http://www.XX.com/download/apple-Selection-products/beauty-soap-ICs' , field_1[0].string
from
(
select SPLIT('([^\/]+$)', 'http://www.XX.com/download/apple-Selection-products/beauty-soap-ICs')field_1
)
However, the result is not what I expect. For this URL:
http://www.XX.com/download/apple-Selection-products/beauty-soap-ICs
the result should be:
beauty-soap-ICs
but I'm getting the wrong result.
Any help will be appreciated. The URL may or may not end in a /.
You can use the REGEXP function here:
SELECT REGEXP('http://www.XX.com/download/apple-Selection-products/beauty-soap-ICs', '.*/([^/]+)/?$', '$1') AS result
See the regex demo
Details:
.* - any zero or more chars other than line break chars as many as possible
/ - a / char
([^/]+) - Group 1 ($1 refers to this group value): one or more chars other than /
/? - an optional / char
$ - end of string.
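To sanity-check the pattern outside Denodo, here is a minimal Python sketch. It only exercises the regex itself, on the assumption that re.sub behaves like the REGEXP call above for these inputs; the helper name last_path_segment is made up for illustration.

```python
import re

def last_path_segment(url):
    # Same pattern as in the REGEXP call; \1 plays the role of $1.
    # The whole URL is matched and replaced by the captured last segment.
    return re.sub(r'.*/([^/]+)/?$', r'\1', url)

if __name__ == "__main__":
    base = 'http://www.XX.com/download/apple-Selection-products/beauty-soap-ICs'
    print(last_path_segment(base))        # without a trailing slash
    print(last_path_segment(base + '/'))  # with a trailing slash
```

Both calls print beauty-soap-ICs, because the optional /? absorbs a trailing slash when there is one.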

Regex, substitute part of a string always at the end

I am trying to rewrite a string so that one part of this URL always moves to the end:
google.com/to_the_end/faa/
google.com/to_the_end/faa/fee/
google.com/to_the_end/faa/fee/fii
Using this
(google\.com)\/(to_the_end)\/([a-zA-Z0-9._-]+)
$1/$3/$2
It works for the first example, but I need something more versatile: no matter how many folders there are, to_the_end should always become the last folder in the URL string.
Desired output
google.com/faa/to_the_end
google.com/faa/fee/to_the_end/
google.com/faa/fee/fii/to_the_end/
You can use
(google\.com)\/(to_the_end)\/(.*[^\/])\/?$
See the regex demo.
Details:
(google\.com) - Group 1: google.com
\/ - a / char
(to_the_end) - Group 2: to_the_end
\/ - a / char
(.*[^\/]) - Group 3: any zero or more chars other than line break chars as many as possible and then a char other than a / char
\/? - an optional / char
$ - end of string.
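A minimal Python sketch of this pattern, assuming it is paired with the same $1/$3/$2 replacement the question already uses (the answer does not restate it). The helper name move_to_end is made up for illustration.

```python
import re

def move_to_end(url):
    # Group 3 captures everything after to_the_end/ up to the last
    # non-slash character, so an optional trailing slash is dropped.
    return re.sub(r'(google\.com)/(to_the_end)/(.*[^/])/?$', r'\1/\3/\2', url)

if __name__ == "__main__":
    for u in ('google.com/to_the_end/faa/',
              'google.com/to_the_end/faa/fee/',
              'google.com/to_the_end/faa/fee/fii'):
        print(move_to_end(u))
```

Under that assumption none of the results ends in a slash; append one to the replacement string if the trailing slash is wanted.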

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75
The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
  /^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;
/**
 * Validates the Account ID according to the NEAR protocol
 * [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
 *
 * @param accountId - The Account ID string you want to validate.
 */
export function validateAccountId(accountId: string): boolean {
  return (
    accountId.length >= 2 &&
    accountId.length <= 64 &&
    ACCOUNT_ID_REGEX.test(accountId)
  );
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.
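If the form validation runs somewhere other than JavaScript, the same check is easy to port. Below is a minimal Python sketch of the SDK function quoted above, run against the examples from the question. The names ACCOUNT_ID_RE and is_valid_account_id are made up here, not part of any NEAR library. Note that this check does not require a .near suffix, so wrong.near.suffix passes it.

```python
import re

# Port of ACCOUNT_ID_REGEX. [a-z0-9] is used instead of \d because Python's
# \d also matches non-ASCII digits by default. The ^ and $ anchors are
# replaced by fullmatch() below.
ACCOUNT_ID_RE = re.compile(
    r'(?:(?:[a-z0-9]+[-_])*[a-z0-9]+\.)*(?:[a-z0-9]+[-_])*[a-z0-9]+'
)

def is_valid_account_id(account_id):
    # Hypothetical helper mirroring validateAccountId: length 2..64 plus regex.
    return (2 <= len(account_id) <= 64
            and ACCOUNT_ID_RE.fullmatch(account_id) is not None)

if __name__ == "__main__":
    for name in ('example.near',
                 'sub.ex.near',
                 'wrong.near.suffix',
                 'shouldnotendwithperiod.near.',
                 'space should fail.near',
                 'touchingDotsShouldfail..near'):
        print(name, is_valid_account_id(name))
```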
Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match word characters (letters, digits, underscore)
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1
If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A slightly broader variant that matches any non-whitespace characters except the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo
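The two suffix-based patterns above can be compared on the question's examples with a short Python sketch. The names WORD_PARTS_RE, BROADER_RE and has_near_suffix are made up for illustration. Because both patterns require a .near or .testnet suffix, they reject the 64-character implicit account ID that the question lists as valid, and neither enforces the 2 to 64 length limit.

```python
import re

# The ^ and $ anchors are dropped because fullmatch() anchors both ends.
WORD_PARTS_RE = re.compile(r'\w+(?:\.\w+)*\.(?:testnet|near)')
BROADER_RE = re.compile(r'[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)')

def has_near_suffix(name, pattern=WORD_PARTS_RE):
    # Hypothetical helper: True when the whole name matches the pattern.
    return pattern.fullmatch(name) is not None

if __name__ == "__main__":
    for name in ('example.near', 'wrong.near.suffix', 'my-account.near'):
        print(name,
              has_near_suffix(name),
              has_near_suffix(name, BROADER_RE))
```

The hyphenated name shows the difference between the two: \w does not include -, so only the broader variant accepts it.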

Sanitize url path with regex

I'm trying to sanitize a url path from the following elements
ids (1, 14223423, 24fb3bdc-8006-47f0-a608-108f66d20af4)
filenames (things.xml, doc.v2.final.csv)
domains (covered under filenames)
emails (foo@bar.com)
Sample:
/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv/email/foo@bar.com?who=knows
Desired outcome:
/upload/email
I have something that works... but I'm not proud (written in Ruby)
# Remove params from the path (everything after the ?)
route = req.path&.split('?')&.first
# Remove filenames with singular extensions, domains, and emails
route = route&.gsub(/\b[\w-]*@?[\w-]+\.[\w-]+\b/, '')
# Remove ids from the path (any string that contains a number)
route = "/#{route&.scan(/\b[a-z_]+\b/i)&.join('/')}".chomp('/')
I can't help but think this can be done simply with something like \/([a-z_]+)\/?, but the \/? is too loose, and \/ is too restrictive.
Perhaps you can remove the parts starting with a / and that contain at least a dot or a digit.
Replace the match with an empty string.
/[^/\d.]*[.\d][^/]*
Rubular regex demo
/ Match a forward slash
[^/\d.]* Match 0+ times any char except / or . or a digit
[.\d] Match either a . or a digit
[^/]* Match 0+ times any char except /
Output
/upload/email
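The same block-list substitution, sketched in Python to show the pattern is not Ruby-specific. The function name strip_noisy_segments is made up for illustration, and the sample assumes the email in the path is written with @.

```python
import re

def strip_noisy_segments(path):
    # Remove every /segment that contains at least one dot or digit.
    # The query string has no slash in it, so it is removed together
    # with the last segment, which contains a dot.
    return re.sub(r'/[^/\d.]*[.\d][^/]*', '', path)

if __name__ == "__main__":
    sample = ('/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv'
              '/email/foo@bar.com?who=knows')
    print(strip_noisy_segments(sample))  # /upload/email
```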
In Ruby, you can use a bit of code to simplify your checks, similar to what you did:
text = text.split('?').first.split('/').select{ |x| not x.match?(/\A[^@]*@\S+\z|\d/) }.join("/")
See the Ruby demo. Note how much this approach simplifies the email and digit checking.
Details
text.split('?').first - split the string with ? and grab the first part
.split('/') - splits with / into subparts
.select{ |x| not x.match?(/\A[^@]*@\S+\z|\d/) } - only keep the items that do not match the \A[^@]*@\S+\z|\d regex, which matches either \A[^@]*@\S+\z (start of string, zero or more chars other than @, an @ char, then one or more non-whitespace chars, and end of string) or a digit
.join("/") - join the resulting items with /.
So, I think it's better to go with the allow list here, rather than a block list. Seems like it's more predictable to say "we only keep words with letters and underscores".
# Keep path w/o params
route = req.path.to_s.split('?').first
# Keep words that only contain letters or _
route = route.split('/').keep_if { |chunk| chunk[/^[a-z_]+$/i] }
# Put the path back together
route = "/#{route.join('/')}".chomp('/')
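For comparison, here is the allow-list idea as a minimal Python sketch. The function name sanitize_path is made up for illustration, and the sample assumes the email in the path is written with @.

```python
import re

def sanitize_path(path):
    # Keep the path without its query string.
    route = path.split('?', 1)[0]
    # Keep only chunks made entirely of letters or underscores.
    kept = [chunk for chunk in route.split('/')
            if re.fullmatch(r'[A-Za-z_]+', chunk)]
    # Put the path back together; an empty result stays empty,
    # as with the chomp('/') in the Ruby version.
    return '/' + '/'.join(kept) if kept else ''

if __name__ == "__main__":
    sample = ('/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv'
              '/email/foo@bar.com?who=knows')
    print(sanitize_path(sample))  # /upload/email
```

The allow list errs on the side of dropping anything unexpected: a chunk with a hyphen, such as user-settings, is removed as well.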

How to delete duplicate numbers in Notepad++?

I've been trying to use ^(.*?)$\s+?^(?=.*^\1$) but it doesn't work.
I have this scenario:
9993990487 - 9993990487
9993990553 - 9993990553
9993990554 - 9993990559
9993990570 - 9993990570
9993990593 - 9993990596
9993990594 - 9993990594
I want to collapse the lines where both numbers are the same, and I expect the following:
9993990487
9993990553
9993990554 - 9993990559
9993990570
9993990593 - 9993990596
9993990594
I would really appreciate some help, since I have 20k+ numbers to filter. Another program might work too, but Notepad++ is the only one I have available on this PC.
Thanks,
Josue
You may use
^(\d+)\h+-\h+\1$
Replace with $1.
See the regex demo.
Details
^ - start of a line
(\d+) - Group 1: one or more digits
\h+-\h+ - a - char enclosed with 1+ horizontal whitespaces
\1 - an inline backreference to Group 1 value
$ - end of a line.
The replacement is a $1 placeholder that replaces the match with the Group 1 value.
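If a scripting language is available, the same substitution can be tried outside Notepad++. Below is a minimal Python sketch; the function name collapse_identical_ranges is made up for illustration, and [ \t]+ stands in for \h+ because Python's re module does not support \h.

```python
import re

def collapse_identical_ranges(text):
    # ^ and $ match at every line thanks to re.MULTILINE.
    # A line "N - N" is replaced by "N"; lines whose two numbers
    # differ do not match and are left untouched.
    return re.sub(r'^(\d+)[ \t]+-[ \t]+\1$', r'\1', text, flags=re.MULTILINE)

if __name__ == "__main__":
    data = "\n".join([
        "9993990487 - 9993990487",
        "9993990553 - 9993990553",
        "9993990554 - 9993990559",
        "9993990570 - 9993990570",
        "9993990593 - 9993990596",
        "9993990594 - 9993990594",
    ])
    print(collapse_identical_ranges(data))
```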