How to split a string by \ in perl? - regex

use Data::Dumper qw(Dumper);
@arr=split('\/',"\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe");
print(Dumper \@arr);
output:
$VAR1 = [
'Program FilesMicrosoft VisualStudioVC98BinRebase.exe'
];
Required output:
$VAR1 = [
'Program Files',
'Microsoft VisualStudio',
'VC98',
'Bin',
'Rebase.exe'
];

You are splitting on the forward slash / (escaped by \), while you clearly need the \.
Since \ itself escapes things, you need to escape it, too
use warnings;
use strict;
use feature 'say';
my $str = '\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe';
my @ary = split /\\/, $str;
shift @ary if $ary[0] eq '';
say for @ary;
which prints the path components, one per line.
Since this string begins with a \, the first element of @ary is an empty string, as that is what precedes the first \. We remove it from the array with shift, after checking that it is indeed empty.
Note that for this string one must use '', or the operator form q(...), since the double quotes try to interpolate the presumed escapes \P, \M (etc) in the string, failing with a warning. It is a good idea to use '' for literal strings and "" (or qq()) when you need to interpolate variables.
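As a small illustration of the quoting point above, the following sketch writes the same text three ways. The shortened path is made up for the example and is not from the question.

```perl
use strict;
use warnings;

my $single  = '\Program Files\VC98';     # backslashes kept as typed
my $q_form  = q(\Program Files\VC98);    # same thing, operator form
my $doubled = "\\Program Files\\VC98";   # in "", each \ must be written as \\

print "all three strings are equal\n"
    if $single eq $q_form and $single eq $doubled;
```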
Another way to do this is with regex
my @ary = $str =~ /[^\\]+/g;
The negated character class, [^...], with the (escaped) \ matches any character that is not \. The quantifier + means that at least one such match is needed while it matches as many times as possible. Thus this matches a sequence of characters up to the first \.
With the modifier /g the matching keeps going through the string to find all such patterns.
Assigning to an array puts the match operator in list context, in which the list of matches is returned and assigned to @ary. In scalar context only true (1) or false (empty string) is returned.
No capturing () are needed here, since we want everything that is matched.
Generally the () in list context are needed so that only the captured matches are returned.
With this we don't have to worry about an empty string at the beginning since there is no match before the first \, as there are no characters before it while we requested at least one non-\.
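The difference between list and scalar context described above can be seen in a short sketch. The variable names $found and $count are introduced here only for illustration.

```perl
use strict;
use warnings;

my $str = '\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe';

my @ary   = $str =~ /[^\\]+/g;        # list context: all matches
my $found = $str =~ /[^\\]+/;         # scalar context: 1 (true) on success
my $count = () = $str =~ /[^\\]+/g;   # list context forced by (), then counted

print scalar(@ary), " components, found=$found, count=$count\n";
```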
But working with paths is common and there are ready tools for that, which take care of details. The core module File::Spec is multi-platform and its splitdir breaks the path into components
use File::Spec;
my @path_components = File::Spec->splitdir($str);
The first element is again an empty string if the path starts with \ (or / on Unix/Mac).
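One caveat worth a sketch: File::Spec picks its rules from the operating system the script runs on, so on Unix a plain File::Spec->splitdir would not treat \ as a separator. Assuming the goal is to split a Windows-style path no matter where the script runs, the core module File::Spec::Win32 can be called directly:

```perl
use strict;
use warnings;
use File::Spec::Win32;

my $str = '\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe';

# Call the Win32 flavour directly so that \ is treated as a separator
# regardless of the operating system the script runs on.
my @parts = File::Spec::Win32->splitdir($str);

# Drop the leading empty string that stands for the root.
shift @parts if @parts && $parts[0] eq '';

print "$_\n" for @parts;
```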
Thanks to Sinan Ünür for a comment.

Related

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
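To make Note 1 concrete, here is a minimal sketch that runs the same match in both contexts. The names $scalar_result and $list_result are chosen just for this example.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

my $scalar_result = ( $words =~ /anywhere\s+(\S+)/ );   # scalar context
my ($list_result) = ( $words =~ /anywhere\s+(\S+)/ );   # list context

print "scalar: $scalar_result\n";   # success flag, not the word
print "list:   $list_result\n";     # the captured word
```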
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding expression to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the \w character class (word characters) for the word:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using the x modifier, so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as an equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex also matches it:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
Here ([^\s]+) captures the word between two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Perl split string based on forward slash

I am new to Perl, so this is basic question. I have a string as shown below. I am interested in taking date out of it, so thinking of splitting it using slash
my $path = "/bla/bla/bla/20160306";
my $date = (split(/\//,$path))[3];#ideally 3 is date position in array after split
print $date;
However, I don't see the expected output, but instead I see 5 getting printed.
Since the path starts with the pattern / itself, split returns a list with an empty string first (to the left of the first /); one element more. Thus the posted code miscounts by one and returns the one before last element (subdirectory) in the path, not the date.
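A quick way to see the off-by-one is to print every field together with its index. This is only an illustrative sketch, and @fields is a name chosen for the example.

```perl
use strict;
use warnings;

my $path   = "/bla/bla/bla/20160306";
my @fields = split m{/}, $path;

# The empty string before the first / occupies index 0.
for my $i (0 .. $#fields) {
    print "$i: '$fields[$i]'\n";
}
```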
If date is always the last thing in the string you can pick the last element
my $date = (split '/', $path)[-1];
where I've used '' as delimiters so as not to have to escape /. (This, however, may confuse, since the separator pattern is a regex and // conveys that, while '' may appear to merely quote a string.)
This can also be done with regex
my @parts = $path =~ m{([^/]+)}g;
With this there can be no initial empty string. Or, the last part can be picked out of the full list as above, with ($path =~ m{...}g)[-1], but if you indeed only need the last bit then extract it directly
my ($last_part) = $path =~ m{.*/(.*)};
Here the "greedy" .* matches everything in the string up to the last instance of the next subpattern (/ here), thus getting us to the last part of the path, which is then captured. The regex match operator returns its matches only when it is in the list context so parens on the left are needed.
Which brings us to the fact that you are parsing a path, and there are libraries dedicated to that.
For splitting a path into its components one tool is splitdir from File::Spec
use File::Spec;
my @parts = File::Spec->splitdir($path);
If the path starts with a / we'll again get an empty string for the first element (by design, see docs). That can then be removed, if it is there
shift @parts if $parts[0] eq '';
Again, the last element alone can be had like in the other examples.
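If only the last component is wanted, the core module File::Basename may be a simpler fit. It is not mentioned in the answer above, so take this as a possible alternative and not as the recommended approach:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $path = "/bla/bla/bla/20160306";

# basename returns the last component of the path.
my $date = basename($path);

print "$date\n";
```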
Simply bind it to the end:
(\d+)$
# look for digits at the end of the string
See a demo on regex101.com. The capturing group is only for clarification though not really needed in this case.
In Perl this would be (I am a PHP/Python guy, so bear with me when it is ugly)
my $path = "/bla/bla/bla/20160306";
$path =~ /(\d+)$/;
print $1;
See a demo on ideone.com.
Try this.
Use a look-ahead in the split pattern. Because the look-ahead consumes nothing, each piece keeps its leading /. Then remove that slash with a substitution.
my $path = "/a/b/c/20160306";
my $date = (split(/(?=\/)/,$path))[3];
$date=~s/^\///;
print $date;
Alternatively, use pattern matching with a capture group.
my $path = "/a/b/c/20160306";
my @data = $path =~m/\/(\w+)/g;
print $data[3];

Perl regex weird behavior: works with smaller strings, fails with longer repetitions of the smaller ones

Here is a regex in Perl that I use to identify strings that match this pattern: any number of occurrences of any character except single quote ' or backslash \; ' and \ are allowed only in their escaped forms, \' and \\ respectively; and finally the string has to end with a (non-escaped) single quote '.
foo.pl
#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
foreach $line( <STDIN> ) {
$line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
$matchString = $1;
print "matchString:$matchString\n"
}
}
It seems to work fine for strings like :
./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
Then I create a file with the following recurring pattern :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG
By creating a string by repeating this pattern one or more times and adding a single quote ' at the end should match the reg exp. I created a file called zz3 with 16 repetitions of the above pattern. I created then a file called ZZ6 with 18 repetitions of zz3 and another one called ZZ7 with the contents of ZZ6 + one additional instance of zz3, hence 19 repetitions of zz3.
By adding a single quote at the end of zz3 it results in a match. By adding a single quote at the end of ZZ6 it also results in a match as expected.
Now here is the tough part, by adding a single quote at the end of ZZ7 does not result in a match!
here is a link to the 3 files :
https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing
The perl version I am using is v5.16.3 on FreeBSD, but I tried various versions on both FreeBSD and Linux with identical results. It seems to me that either perl has a problem with the size going from 34274 bytes (ZZ6) to 36178 bytes (ZZ7), or I am missing something badly.
Your regular expression leads to catastrophic backtracking because you have nested quantifiers.
If you change it to
(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))
(using possessive quantifiers to avoid backtracking), it should work.
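For comparison, here is a sketch of a differently shaped pattern that avoids nesting lazy quantifiers altogether: each step of the loop consumes either a run of ordinary characters or exactly one escaped pair. It is not the pattern from the answer above, it has only been reasoned through for short inputs like the one shown, and it needs Perl 5.10 or later for the possessive quantifiers.

```perl
use strict;
use warnings;

# One alternative per "unit": a run of ordinary characters, or one
# escaped pair (\' or \\). Possessive ++ and *+ forbid backtracking.
my $re = qr/^((?:[^\\']++|\\[\\'])*+')/;

# Single-quoted here-doc: backslashes are taken literally.
my $line = <<'END';
\'sddsd\\sdsdsds'rest
END
chomp $line;

if ( $line =~ $re ) {
    print "matchString:$1\n";
}
else {
    print "no match\n";
}
```

Because nothing inside the loop can match the empty string and nothing can be given back, a failing input is rejected in a single pass.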
I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.
Following this discussion it is clear that perl cannot match any repetition of non dot (.) patterns for more than 32766(=32K-2) times.
The solution is to masquerade the \\ and \' sequences with some chars that are certain to not appear in the input, such as Device Ctrl1 (\x11) and Device Ctrl2 (\x12), (presented as ^Q, ^R in vi respectively) :
$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;
then try to match non greedily any input till the first single quote.
$dataField =~ m/(^.*?\')/s;
$matchString = $1;
and finally substitute the above Ctrl chars back to their initial values
$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;
This is very fast. Another solution would be to parse till the first single quote and count the number of \'s. If it is even then we have found our last non escaped single quote in the text so we have found our desired match, otherwise the single quote is an escape one and thus considered part of the text, so we keep this value and iterate to the next single quote and repeat the same logic, by concatenating the value to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes.
Perl regex's seem to be much faster than Perl code.

after matching pattern how to add dash after string in perl.regex

i have this type of data:
please help me out i am new to regular expressions,and please explain each step while answering.thanks..
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
i want to extract only this data from above lines:
7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
Then, in the part after the first underscore: if it starts with two consecutive letters, as in AX1A, they should be written as AX_; a single digit and a single letter become -1_ and -A_ respectively. So after applying this pattern, AX1A becomes AX_-1_-A_, and all other data should remain the same.
Similarly, the next line has W1A. It starts with a single letter W, which should be converted to -W_; the next character is a single digit, so it is converted by the same pattern to -1_; the last one is treated the same way. So it becomes -W_-1_-A_.
We are only interested in applying the regex to the part that follows the leading digits and underscore.
_AX1A_
_W1A_
_U1A_
_AV21NA_
output should be:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
use strict;
use warnings;
use feature 'say';
my $match
= qr/
( \d+ # group of digits
_ # followed by an underscore
) # end group
( \p{Alpha}+ ) # group of alphas
( \d+ ) # group of digits
( \p{Alpha}* ) # group of alphas
( \w+ ) # group of word characters
/x
;
while ( my $record = <$input> ) { # record of input
# match and capture
if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
say $pre
# if the alpha has length 1, add a dash before it
. ( length $pre_alpha == 1 ? '-' : '' )
# then the alpha
. $pre_alpha
# then the underscore
. '_'
# test if the length of the number is 1 and the length of the
# trailing alpha string is 1
. ( length( $num ) == 1 && length( $post_alpha ) == 1
# if true, apply a dash before each
? "-$num\_-$post_alpha"
# otherwise treat as AV21NA in example.
: "$num\_$post_alpha"
)
. $post
;
}
}
I don't know all the ins and outs of what you need stripped, but I'll extrapolate and let you clarify if this doesn't do quite what you need.
For the first step, extracting the 1X50_RE_ and 1X50_LI, you could search for those strings and replace them with nothing.
Next, to split your second letter/number code into your small chunks, you can use a pair of matches, using a look-ahead on each. However, since you only want to mess with that second code chunk, I'd split the overall line up first, work on the second chunk, and then join the pieces back together again.
while (<$input>) {
# Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
s/1X50_(RE|LI)_//;
my @pieces = split /_/; # split the line into pieces at each underscore
# Just working with the second chunk. /g, means do it for all matches found
$pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
$pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1_-A
# Join the pieces back together again
$_ = join '_', @pieces;
print;
}
The $_ is the variable many Perl operations work on if you don't specify. The <$input> reads the next line of the file handle named $input into $_. The s///, split, and print functions work on $_ when not given. The =~ operator is the way you tell Perl to use $pieces[1] (or whichever variable you are working on) instead of $_ for regular expression operations. (For split or print, you'd pass the variables as the argument instead, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)
Oh, and to explain the regular expressions a bit:
s/1X50_(RE|LI)_//;
This is matching anything containing 1X50_RE or 1X50_LI (the (|) is a list of alternatives) and replacing them with nothing (the empty // at the end).
Looking at one of the other lines:
s/([A-Z])(?=[0-9])/$1_-/g;
The plain parentheses (...) around [A-Z] cause $1 to be set to whatever letter is matched inside (in this case a letter, A-Z). The (?=...) parenthesis cause a zero-width positive look-ahead assertion. That means the regular expression only matches if the very next thing in the string matches the expression (a digit, 0-9), but that part of the match is not included as part of the string that is replaced.
The replacement part, $1_-, replaces the matched text (the letter matched by [A-Z], not the digit seen by the look-ahead) with the value captured by the parentheses, (...), followed by the _- you require.
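The effect of the two look-ahead substitutions can be traced on a single code chunk. This sketch only shows the mechanism on AX1A; it writes ${1} with braces to make the variable boundary explicit.

```perl
use strict;
use warnings;

my $code = 'AX1A';

# Letter followed by a digit: keep the letter, append "_-".
# The digit is only looked at, so it stays in place for the next step.
$code =~ s/([A-Z])(?=[0-9])/${1}_-/g;   # AX1A   -> AX_-1A

# Digit followed by a letter: keep the digit, append "_-".
$code =~ s/([0-9])(?=[A-Z])/${1}_-/g;   # AX_-1A -> AX_-1_-A

print "$code\n";
```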
#!/usr/bin/perl -w
use strict;
while (<>) {
next if /^\s*$/;
chomp;
## Remove those parts of the line we do not want
## You do not specify what, if anything, is constant about
## the parts you do not want. One of the following cases should
## serve.
## i) Remove the string _1X50_ and the next characters between
## two underscores:
s/_1X50_.+?_/_/;
## ii) keep the first 2 and last 3 sections of each line.
## Uncomment this line and comment the previous one to use this:
#s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;
## The line now contains only those regions we are
## interested in. Split on '_' to collect an array of the
## different parts (@a):
my @a=split(/_/);
## $a[1] is the second string, eg AX1A,W1A etc.
## We search for one or more letters, followed by one or more digits
## followed by one or more letters. The 'i' operand makes the match
## case Insensitive and the 'g' operand makes the search global, allowing
## us to capture the matches in the @matches array.
my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);
## So, for each of the matched strings, if the length of the match
## is less than 2, add a '-' to the beginning of the string:
foreach my $match (@matches) {
if (length($match)<2) {
$match="-" . $match;
}
}
## Now replace the original $a[1] with each string in
## @matches, connected by '_':
$a[1]=join("_", @matches);
## Finally, build the string $kk by joining each element
## of the line (@a) by a '_', and print:
my $kk=join("_", @a);
print "$kk\n";
}
Perhaps something like this:
while (<DATA>) {
s/1X50_(LI|RE)_//;
s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
print;
}
__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
output:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
zostay's suggestion of splitting the line may make things easier if you are a regex beginner. However, avoiding the split is optimal from a performance perspective. Here is how to do it without splitting:
open IN_FILE, "filename" or die "Whoops! Can't open file.";
while (<IN_FILE>)
{
s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/
or print "line didn't match: $_";
s/1X50_(LI|RE)_//;
print;
}
Breaking down the first pattern:
s/// is the search-and-replace operator.
^ match the beginning of the line
\d{7}_ match seven digits, followed by an underscore
\K the "keep" operator. Whatever was matched before it won't be part of the string that is replaced, much like a look-behind.
() each set of parentheses specifies a chunk of the match that will be captured. These will be put into the match variables $1, $2, etc. in order.
[A-Z]{1,2} match one or two capital letters. You can probably figure out what the other two sections in parentheses mean.
-${1}-${2}-${3} replace what matched with the first three match variables, each preceded by a dash. The only reason for the curly braces is to make clear what the variable name is.
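To see what \K changes, compare a substitution with and without it. The record id_42_rest is a made-up value used only for this sketch, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical record, used only to show what \K does.
my $record = 'id_42_rest';

# "id_" must match, but \K drops it from the part being replaced,
# so only the digits are swapped out.
(my $with_k = $record) =~ s/^id_\K\d+/99/;

# Without \K the prefix is part of the match and has to be put back
# by hand via a capture group.
(my $without_k = $record) =~ s/^(id_)\d+/${1}99/;

print "$with_k\n$without_k\n";
```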

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grab" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must start with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty well.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language, so I'll just use Perl. Basically, instead of matching everything, I just matched exactly what I thought you needed. Please correct me if I am wrong.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my @items = split '&', $query; # Each item is something like param=123
foreach my $item (@items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
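Building on the split-into-a-hash idea, here is a sketch that also covers the cases from the question: the trailing slash is optional, the query string is optional, and a path such as /parent/child/firstparam=... with no question mark is rejected. The function name parse_route is hypothetical, and the route is hard-coded on the assumption that only /parent/child matters.

```perl
use strict;
use warnings;

# Returns a hash reference of parameters, or nothing if the route
# does not match. parse_route is a name made up for this example.
sub parse_route {
    my ($route) = @_;

    # Path, optional trailing slash, then optionally "?" plus the query.
    return unless $route =~ m{^/parent/child/?(?:\?(.*))?$};
    my $query = defined $1 ? $1 : '';

    my %params;
    for my $pair ( split /&/, $query ) {
        my ( $name, $value ) = split /=/, $pair, 2;
        $params{$name} = $value if defined $value;
    }
    return \%params;
}

my $params = parse_route('/parent/child/?secondparam=def456&firstparam=abc123');
print "$_ = $params->{$_}\n" for sort keys %$params;
```

Keeping the route check and the parameter extraction separate means new parameters need no change to the regex.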
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, I check whether the line contains a ?.
Then I remove the first part of the line (everything up to and including the ?).
Next, I split the rest on &, and split each piece on =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my @params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (@params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}