Sanitize url path with regex

Sanitize url path with regex - regex

I'm trying to sanitize a url path from the following elements
ids (1, 14223423, 24fb3bdc-8006-47f0-a608-108f66d20af4)
filenames (things.xml, doc.v2.final.csv)
domains (covered under filenames)
emails (foo#bar.com)
Sample:
/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv/email/foo#bar.com?who=knows
Desired outcome:
/upload/email
I have something that works... but I'm not proud (written in Ruby)
# Remove params from the path (everything after the ?)
route = req.path&.split('?')&.first
# Remove filenames with singlular extentions, domains, and emails
route = route&.gsub(/\b[\w-]*#?[\w-]+\.[\w-]+\b/, '')
# Remove ids from the path (any string that contains a number)
route = "/#{route&.scan(/\b[a-z_]+\b/i)&.join('/')}".chomp('/')
I can't help but think this can be done simply with something like \/([a-z_]+)\/?, but the \/? is too loose, and \/ is too restrictive.

Perhaps you can remove the parts starting with a / and that contain at least a dot or a digit.
Replace the match with an empty string.
/[^/\d.]*[.\d][^/]*
Rubular regex demo
/ Match a forward slash
[^/\d.]* Match 0+ times any char except / or . or a digit
[.\d] Match either a . or a digit
[^/]* Match 0+ times any char except /
Output
/upload/email

In Ruby, you can use a bit of code to simplify your checks in a similar way you did:
text = text.split('?').first.split('/').select{ |x| not x.match?(/\A[^#]*#\S+\z|\d/) }.join("/")
See the Ruby demo. Note how much this approach simplifies the email and digit checking.
Details
text.split('?').first - split the string with ? and grab the first part
.split('/') - splits with / into subparts
.select{ |x| not x.match?(/\A[^#]*#\S+\z|\d/) } - only keep the items that do not match \A[^#]*#\S+\z|\d regex: \A[^#]*#\S+\z - start of string, any zero or more chars other than #, a # char, then any zero or more non-whitespace chars and end of string, or a digit
.join("/") - join the resulting items with /.

So, I think it's better to go with the allow list here, rather than a block list. Seems like it's more predictable to say "we only keep words with letters and underscores".
# Keep path w/o params
route = req.path.to_s.split('?').first
# Keep words that only contain letters or _
route = route.split('/').keep_if { |chunk| chunk[/^[a-z_]+$/i] }
# Put the path back together
route = "/#{route.join('/')}".chomp('/')

Related

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75

The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
/^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;
/**
* Validates the Account ID according to the NEAR protocol
* [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
*
* #param accountId - The Account ID string you want to validate.
*/
export function validateAccountId(accountId: string): boolean {
return (
accountId.length >= 2 &&
accountId.length <= 64 &&
ACCOUNT_ID_REGEX.test(accountId)
);
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.

Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match alphanumeric characters
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1

If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A bit broader variant matching whitespace character excluding the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo

Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from text column in openrefine. some cells have just the email, but others have the name and email in john doe <john#doe.com> format. I have been using the following GREL/regex but it does not return the entire email address. For the above exaple I'm getting ["n#doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+#[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.

The n is captured because you are using .* before the capturing group, and since it can match any 0+ chars other than line break chars greedily the only char that can land in Group 1 during backtracking is the char right before #.
If you can get partial matches git rid of the .* and use
/[^<\s]+#[^\s>]+/
See the regex demo
Details
[^<\s]+ - 1 or more chars other than < and whitespace
# - a # char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+#[^\s>]+', value)
if m:
res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match .*<([^<]+#[^>]+)>.* where .* will not gobble the name since it will stop before an obligatory <.

If some cells contain just the email, it's probably better to use the #wiktor-stribiżew's partial match. In the development version of Open Refine, there is now a value.find() function that can do this, but it will only be officially implemented in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+#[^\s>]+", value)[0]
Result :

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?

Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101

Thanks to #Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.

You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea was to break it into name value pairs, using a negative lookahead specified as the key to stop a match and start a new one. If this helps
var data = #"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = #"
(?<Key>[a-zA-Z_\s\d]+) # Key is any alpha, digit and _
= # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+) # Value is any combinations of text with space(s)
(\s|$) # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$) # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
.OfType<Match>()
.Select(mt => new
{
LHS = mt.Groups["Key"].Value,
RHS = mt.Groups["Value"].Value
});
Results:

Regex to create url friendly string

I want to create a url friendly string (one that will only contain letters, numbers and hyphens) from a user input to :
remove all characters which are not a-z, 0-9, space or hyphens
replace all spaces with hyphens
replace multiple hyphens with a single hyphen
Expected outputs :
my project -> my-project
test project -> test-project
this is # long str!ng with spaces and symbo!s -> this-is-long-strng-with-spaces-and-symbos
Currently i'm doing this in 3 steps :
$identifier = preg_replace('/[^a-zA-Z0-9\-\s]+/','',strtolower($project_name)); // remove all characters which are not a-z, 0-9, space or hyphens
$identifier = preg_replace('/(\s)+/','-',strtolower($identifier)); // replace all spaces with hyphens
$identifier = preg_replace('/(\-)+/','-',strtolower($identifier)); // replace all hyphens with single hyphen
Is there a way to do this with one single regex ?

Yeah, #Jerry is correct in saying that you can't do this in one replacement as you are trying to replace a particular string with two different items (a space or dash, depending on context). I think Jerry's answer is the best way to go about this, but something else you can do is use preg_replace_callback. This allows you to evaluate an expression and act on it according to what the match was.
$string = 'my project
test project
this is # long str!ng with spaces and symbo!s';
$string = preg_replace_callback('/([^A-Z0-9]+|\s+|-+)/i', function($m){$a = '';if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';}return $a;}, $string);
print $string;
Here is what this means:
/([^A-Z0-9]+|\s+|-+)/i This looks for any one of your three quantifiers (anything that is not a number or letter, more than one space, more than one hyphen) and if it matches any of them, it passes it along to the function for evaluation.
function($m){ ... } This is the function that will evaluate the matches. $m will hold the matches that it found.
$a = ''; Set a default of an empty string for the replacement
if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';} If our match (the value stored in $m[1]) contains multiple spaces or hyphens, then set $a to a dash instead of an empty string.
return $a; Since this is a function, we will return the value and that value will be plopped into the string wherever it found a match.
Here is a working demo

I don't think there's one way of doing that, but you could reduce the number of replaces and in an extreme case, use a one liner like that:
$text=preg_replace("/[\s-]+/",'-',preg_replace("/[^a-zA-Z0-9\s-]+/",'',$text));
It first removes all non-alphanumeric/space/dash with nothing, then replaces all spaces and multiple dashes with a single one.

Since you want to replace each thing with something different, you will have to do this in multiple iterations.
Sorry D:

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grabs" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?

You didn't specify a language so I'll just usre Perl. So basically instead of matching everything, I just matched exactly what I thought you needed. Correct me if I am wrong please.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
\$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Output:

This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.

You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my #items = split '&', $query; # Each item is something like param=123
foreach my $item (#items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}

My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789

This script will help you.
First, i check, is there any symbol like ?.
Then, i kill first part of line (left from ?).
Next, i split line by &, where each value splitted by =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my #params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (#params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js