Concatenate regex \s+ \w+ ... perl - regex

I have entries like this:
XYZABC------------HGTEZCW
ZERTAE------------RCBCVQE
I would like to get just HGTEZCW and RCBCVQE.
I would like to use a generic regex.
$temp=~ s/^\s+//g; (1)
$temp=~ s/^\w+[-]+//g; (2)
If I use (1) followed by (2), it works: I get HGTEZCW, then RCBCVQE, and so on.
I would like to know whether it is possible to do that in one line, like this:
$temp=~ s/^\s+\w+[-]+//g; (3)
When I use (3), I get this result: XYZABC------------HGTEZCW
I don't understand why it is not possible to combine (1) and (2) in one line.
Sorry, my entries were:
XYZABC------------HGTEZCW
ZERTAE------------RCBCVQE
Also, regex (1) removes the whitespace, and when I then use regex (2), it removes XYZABC------------ .
But the combination (3) doesn't work.
I still have XYZABC------------HGTEZCW
@Tim So there always is whitespace at the start of each string?
yes

Your regex (1) removes whitespace from the start of the string, so it does nothing on your example strings.
Regex (2) removes all word characters from the start of the string plus any dashes that follow, leaving whatever comes after those dashes.
If you combine both, there is no whitespace that \s+ could match, and therefore the entire regex fails.
To fix this, simply make the whitespace optional. Also, you don't need to enclose the - in brackets:
$temp=~ s/^\s*\w+-+//g;
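To see the difference between the required \s+ and the optional \s*, here is a minimal sketch. The helper name strip_prefix and the indented sample entry are assumptions made for illustration; only the two patterns come from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the substitution to a copy and return the result.
sub strip_prefix {
    my ($regex, $text) = @_;
    (my $copy = $text) =~ s/$regex//;
    return $copy;
}

my $plain    = 'XYZABC------------HGTEZCW';       # no leading whitespace
my $indented = '   ZERTAE------------RCBCVQE';    # assumed: three leading spaces

for my $entry ($plain, $indented) {
    my $required = strip_prefix(qr/^\s+\w+-+/, $entry);   # needs at least one space
    my $optional = strip_prefix(qr/^\s*\w+-+/, $entry);   # zero or more spaces
    print "required: '$required'  optional: '$optional'\n";
}
```

With \s+ the entry without leading whitespace comes back unchanged, because the pattern as a whole cannot match; with \s* both entries are reduced to their last field.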

This should do the trick.
$Str = '
XYZABC------------HGTEZCW
ZERTAE------------RCBCVQE
';
@Matches = ($Str =~ m#^.+-(\w+)$#mg);
print join "\n",@Matches ;

If you only need the last seven characters of each entry, you could capture them like this:
my ($last) = $temp =~ /(.{7})$/;

Related

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e with [ee] if the string is exactly 2 chars long and ends with "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
    chomp $row;
    if( $row =~ /%/){
        my @reg = split /%/, $row;
        #if no replacement, put empty string
        if($#reg == 0){
            push(@reg,"");
        }
        print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
        push @regs, [ @reg ];
    }
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1\[ee\]/ (the brackets are escaped so that Perl does not read $1[ee] as an array element)
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option, depending on what you are looking for. It will match a two-character string consisting of one a-z character followed by one 'e' on its own line, with possible whitespace before or after. It will replace this with the single a-z character followed by ' [ee]':
`s/^\s*([a-z])e\s*$/\1 [ee]/`
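A small sketch of the capture-group approach, for anyone who wants to try it quickly. The sub name mark_final_e and the test words are assumptions for illustration. Note that inside a character class the | is a literal character, not an alternation, so [c|d|j] also matches a pipe; the sketch therefore writes the class as [cdjlmnst].

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a two-letter word ending in "e".
sub mark_final_e {
    my ($word) = @_;
    # The space after $1 keeps Perl from parsing "$1[" as an array subscript.
    $word =~ s/^([cdjlmnst])e$/$1 [ee]/;
    return $word;
}

# "elle" is too long and "|e" starts with a pipe, so both stay unchanged.
print mark_final_e($_), "\n" for ('le', 'me', 'elle', '|e');
```

If the output must not contain the space, escaping the brackets in the replacement (\[ee\]) avoids the same array-subscript ambiguity.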
^(\S)e$
Try this. Replace by $1 [ee]. See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use the following regex, which matches two characters; you can then replace the match with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo:
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1\[$2$2\]/;
See demo https://regex101.com/r/iD9oN4/1

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the substitution keeps the start of the string in front of $1 in this case.
Try this pattern
foreach (@list) {
    s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
    print $_ . "\n";
}
Explanations:
I use the branch reset feature here (i.e. (?|...()...|...()...)), which lets the capturing groups of several alternatives share a single number ($1 here). That way you avoid a second substitution to trim a zero from the left of the capture.
To remove everything from the beginning up to the number, I use:
.*? # any characters, zero or more times
# (the ? makes the * quantifier lazy, so it matches as little as possible)
_? # an optional underscore
Note that you can make sure you have only 2 digits by adding a lookahead that checks that no digit follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
    if (/(\d{2})/) {
        $_ = "$1.txt";
    }
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
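The point that s/// rewrites only the matched portion can be checked side by side. This is just a sketch: the sub names unanchored and anchored are made up for the comparison, and the anchored pattern is one possible way to consume the prefix, not the only one.

```perl
use strict;
use warnings;

# The pattern from the question: the match starts at the first two digits,
# so anything before them (e.g. "R3_") is left in place.
sub unanchored {
    my ($file) = @_;
    $file =~ s/(\d{2}).*\.txt$/$1.txt/;
    $file =~ s/^0+//;
    return $file;
}

# Assumed variant: ^.*? makes the match start at the beginning of the string,
# so the prefix is part of the match and gets replaced as well.
sub anchored {
    my ($file) = @_;
    $file =~ s/^.*?(\d{2}).*\.txt$/$1.txt/;
    $file =~ s/^0+//;
    return $file;
}

for my $file ('R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt') {
    printf "%-20s unanchored: %-12s anchored: %s\n",
        $file, unanchored($file), anchored($file);
}
```

The unanchored version reproduces the output reported in the question (R3_05.txt, T3_12.txt, 1.txt), while the anchored one yields 5.txt, 12.txt and 1.txt.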
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/.*(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The problem is that the first part of your s/// matches what you think it does, but the second part isn't replacing what you think it should. s/// only replaces the text that was matched. Thus, to replace something like T3_, you have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grab" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must start with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language, so I'll just use Perl. Basically, instead of matching everything, I matched exactly what I thought you needed. Please correct me if I am wrong.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<=  # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
  =   # Match the character “=” literally
)
\w    # Match a single character that is a “word character” (letters, digits, and underscores)
+?    # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?=   # Assert that the regex below can be matched, starting at this position (positive lookahead)
      # Match either the regular expression below (attempting the next alternative only if this one fails)
  &   # Match the character “&” literally
  |   # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
  \W  # Match a single character that is a “non-word character”
  |   # Or match regular expression number 3 below (the entire group fails if this one fails to match)
  $   # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Output:
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
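For readers who want to try the fixed-parameter-list approach in Perl, here is a rough sketch. It builds the same kind of pattern from a reusable piece; the variable names $pair and $route and the sub param_values are assumptions, and the leading ^ is the optional anchor mentioned above.

```perl
use strict;
use warnings;

# One "name=value" pair with an optional trailing "&"; the value is captured.
my $pair = qr/(?:firstparam|secondparam|thirdparam)=(\w+)&?/;

# Each interpolation of $pair contributes one capture group, numbered left to right.
my $route = qr{^/parent/child/?\?$pair(?:$pair)?(?:$pair)?};

# Hypothetical helper: return the captured values, skipping unused groups.
sub param_values {
    my ($url) = @_;
    return grep { defined } ($url =~ $route);
}

for my $url ('/parent/child?secondparam=def456&firstparam=abc123',
             '/parent/child/?thirdparam=ghi789',
             '/parent/child/firstparam=abc123&secondparam=def456') {
    my @values = param_values($url);
    print "$url => ", (@values ? join(', ', @values) : 'no match'), "\n";
}
```

The values come back in the order they appear in the URL, not by parameter name, so telling firstparam from secondparam would need the names captured as well.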
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
    print "Matched route.\n";
    # But NOT for this part.
    my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
    my @items = split '&', $query; # Each item is something like param=123
    foreach my $item (@items) {
        my ($param, $value) = split '=', $item;
        $params{$param} = $value; # Put the parameters in a hash for easy access.
        print "$param set to $value \n";
    }
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explanation:
/(?:\w+/)* matches /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? matches child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ matches text like child
..|) matches nothing (empty)
If you need only the query string, the pattern can be reduced to:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, I check whether the line contains a ? at all.
Then I remove the first part of the line (everything up to and including the ?).
Next, I split the rest on &, and split each resulting piece on =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
use feature 'say'; # needed for say
for my $string (split /\n/, $r) {
    if (index($string,'?') != -1) {
        # cut off everything up to and including the ?
        substr($string, 0, index($string,'?')+1, "");
        #say "string = ".$string;
        if (index($string,'=') != -1) {
            my @params = map { [ split /=/, $_ ] } split /&/, $string;
            $"="\n";
            say "$_->[0] === $_->[1]" for (@params);
            say "######next########";
        }
        else {
            #print "there is no params!"
        }
    }
    else {
        #say "there is no params!";
    }
}

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".*?" with "[^(\?=)]*?", I'm now trying:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
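The effect of that one change can be seen by running both patterns over the sample subject line. This is only a sketch; the helper name third_fields is made up, and it simply collects the second capture group of every match.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the text field ($2) of every match in $subject.
sub third_fields {
    my ($regex, $subject) = @_;
    my @texts;
    while ($subject =~ /$regex/g) {
        push @texts, $2;
    }
    return @texts;
}

my $subject = '=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=';

# Greedy .+ runs to the end of the line and backtracks only as far as the
# last "?Q?", so the whole line becomes a single match.
my @greedy = third_fields(qr/=\?(.+)\?.\?(.*?)\?=/, $subject);

# [^?]+ cannot cross a question mark, so each =?...?= pair matches separately.
my @strict = third_fields(qr/=\?([^?]+)\?.\?(.*?)\?=/, $subject);

print scalar(@greedy), " match(es) with .+: @greedy\n";
print scalar(@strict), " match(es) with [^?]+: ", join(' | ', @strict), "\n";
```

As for [^(\?=)]: inside brackets the parentheses, ? and = are just individual characters, so that class excludes single characters and cannot express "not the two-character sequence ?=".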
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Match as few characters as possible, up to the next '?=' in the input.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Match as few characters as possible, up to the next '?=' in the input.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?', followed by a '?'. This is done twice, to skip the string 'utf-8?Q?'
([^?]*) # Capture the following characters up to the next '?'
/g # Repeat the match as many times as possible until the end of the string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.

Regex with lookahead

I can't seem to make this regex work.
The input is as follows. It's really all on one row, but I have inserted line breaks after each \r\n so that it's easier to read, so no check for space characters is needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and detect the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Would this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
/
\G
(
(?:
[0-9]{2}-[0-9]{2}\r\n
){2}
(?:
(?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
)*
)
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
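The snippet above works on the literal four-character text \r\n (note the q() quoting). If the input contains real carriage-return/line-feed characters instead, the same lookahead split can be sketched as follows; the sub name split_records and the way the sample input is assembled are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split the input in front of every pair of consecutive
# dates, then split each record into its CRLF-separated columns.
sub split_records {
    my ($input) = @_;
    # The lookahead is zero-width, so the two dates stay in the record they start.
    my @records = split /(?=(?:\d{2}-\d{2}\r\n){2})/, $input;
    return map { [ split /\r\n/, $_ ] } @records;
}

# Sample data from the question, joined with real CRLF line endings.
my $input = join '', map { "$_\r\n" }
    ('01-03', '01-04', 'TEXTONE', 'STOCKHOLM', '350,00',
     '12-29', '01-03', 'TEXTTWO', 'COPENHAGEN', '10,80');

print join("\t", @$_), "\n" for split_records($input);
```

Because the split point is a lookahead, nothing is consumed between records, which is why the dates of the second record are not swallowed by the first.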