How do I remove the special characters within multiple lines in Regex? - regex

I'm trying to solve a problem that wants to display the given text from a file omitting the special characters and modifying the multi-line input to a single-formatted output in only the language of Perl/Regex (no other languages like XML, etc.). Here's the given text in my flight.txt file:
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
The required output is:
Holland, T. "Aeronautics Engineer" 200 06/09/1969 Flight from DC to VA.
As you can see, I need the output on a single line: the first name should be reduced to its initial, the major should be wrapped in double quotes, and the date separator should change from - to /.
This is what I have in my code so far:
#!/bin/perl
use strict;
use warnings;
my $filename = "flights.txt"
open(my $input, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$input>){
my $text = <>;
$text =~ s/<[^>]*>//g;
print $text;
}
close $input
Please suggest what to do next and how to format the output of the given file. I'm new to regex and Perl, so I need help.

Here's your homework problem, as you noted in a comment to ikegami's answer:
Create the Perl script “code.pl” that print lines that contain an opening and closing XML heading tag from “flights.txt”. Valid tags are pilot, major, company, price, date, and details regardless of the case. Tags may also have any arbitrary content inside of them. You may assume that a '<' or a '>' character will not appear inside of the attribute's value portion
Let's forget that your input is XML, for all the reasons that ikegami has already explained. The entire thing is a contrived example to get you to practice some particular regex feature. I'll go through a process of solving this problem, but also reveal later what I think the instructor expects.
First, you only need to think about one line at a time, so you don't care about nodes where the opening and closing is on separate lines, such as <start> and </start>, or <flight> and </flight>. You want to find lines such as:
<node>...</node>
The pattern is that there is some string you match near the start of the line, and that match has to show up later in the line. I think your intended task is to practice backreferences. Writing good exercises is tough, and people fall back on things, such as XML, that are familiar. My Learning Perl Exercises is more thoughtful about this.
Your basic program needs to look something like this first attempt. Read lines of input, skip the ones that don't match your pattern, and output the rest. Whenever you see ... in this answer, that's just something I need to fill in and is not Perl syntax (ignoring the yada operator, which cannot appear in a regex):
use strict;
use warnings;
while( <> ) {
next unless m/ ... /;
print;
}
I'll mostly ignore that program structure and focus on the match operator, m//. I'll update the pattern as I step through this.
The trick, then, is what goes in the pattern. You know you have to match something that looks like an XML open tag (again, ignoring that this is XML because it's not a good example for input). That starts with < and ends with > with some stuff in the middle. This pattern uses the /x flag to make whitespace insignificant. I can spread out the pattern so I can grok it easier:
m/ < ... > /x;
So what can go inside the angle brackets? In the input, which I'm pretending isn't XML, the stuff inside the angles follows these rules, which you could read about in the XML standard if this were XML:
case-sensitive
starts with a letter or underscore
can contain letters, digits, hyphens, underscores, and periods
cannot start with xml in any case
Let's ignore that last one for a moment because I don't think it's part of the simple exercise you need to do. And the rules are actually slightly more complicated.
Case sensitive is easy. We aren't going to use the /i flag on the match operator, so we get that for free.
Starts with a letter or underscore. That's pretty easy. Since I'm pretending this is not XML, I'm not going to support all the Unicode scripts that current XML will allow. I'll restrict that to ASCII, and use a character class to represent all the letters that I'll allow right after the <:
m/ < [a-zA-Z_] ... > /x;
After that, I can have letters and underscores, but now also have hyphens, digits, and periods. As an aside, many such things have a set of characters for the start of an "identifier" (ID_Start) and a wider set for the rest (ID_Continue). Perl has similar rules for its variable names.
I use a second character class for the continuation. There's a slight gotcha here because you want a literal hyphen, but that also forms a range in the character class. That is, it forms a range unless it's at the end. The . in a character class is literal .:
m/ < [a-zA-Z_] [a-zA-Z_0-9.-]+ > /x;
With this pattern, you get much more than you wanted. The output is every line that has a start tag. Note that it does not match <flight number="12345"> because this pattern doesn't handle attributes, which is fine because I'm pretending this isn't XML:
<start>
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
The end tag has the same name as the start tag. In our input, there's one start tag and one end tag per line, and since I look at one line at a time, I can ignore many things that an XML parser has to care about. Now I spread my pattern over several lines because /x allows me to do that, and /x also allows me to add comments so I remember what each part of the pattern does. The / in the end tag is also the match operator delimiter, so I escape that as \/:
m/
< [a-zA-Z_] [a-zA-Z_0-9.-]+ > # start tag
... # the interesting text
< \/ ... > # end tag
/x;
I need to fill in the ... parts. The "interesting text" part is easy. I'll match anything. The .* greedily matches zero or more non-newline characters:
m/
< [a-zA-Z_] [a-zA-Z_0-9.-]+ > # start tag
.* # the interesting text, greedily
< \/ ... > # end tag
/x;
But, I don't really want * to be greedy. I don't want it to match the end tag, so I can add the non-greedy modifier ? to the .*:
m/
< [a-zA-Z_] [a-zA-Z_0-9.-]+ > # start tag
.*? # the interesting text, non-greedily
< \/ ... > # end tag
/x;
Now I need to fill in the name portion of the end tag. It has to be the same as the start name. By surrounding the start name in (...), I capture that part of the string that matched. That goes into the capture buffer $1. I can then re-use that exact match within the pattern with a "back reference" (the point of your problem, I'm guessing). A backreference starts with a \ and uses the number of the capture buffer you want to use. So, \1 uses the exact text matched in $1; not the same pattern but the actual text matched:
m/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
.*? # the interesting text, non-greedily
< \/ \1 > # end tag
/x;
Now the output excludes <start> because it doesn't have an end tag:
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
If you modified your data to change </date> to </data>, that line wouldn't match because the start and end tags are different.
But, what you really want is the text in the middle, so you need to capture that too. You can add another capture buffer. As the second set of parens, this is the buffer $2, and doesn't disturb $1 or \1:
m/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
( .*? ) # $2, the interesting text, non-greedily
< \/ \1 > # end tag
/x;
But now you want to print the interesting text, not the entire line, so I'll print the $2 capture buffer instead of the entire line. Remember, these buffers are only valid after a successful match, but I've skipped the lines where it doesn't match, so I'm fine:
use strict;
use warnings;
while( <DATA> ) {
next unless m/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
(.*?) # $2, the interesting text, non-greedily
< \/ \1 > # end tag
/x;
print $2;
}
print "\n"; # end all the output!
This gets me close. I'm missing some whitespace between elements (And note there is a leading space before Holland):
Holland, TomAeronautics EngineerBoeing20006-09-1969Flight from DC to VA.
I can add a space at the end of each print:
print $2, ' ';
Now you have your output:
Holland, Tom Aeronautics Engineer Boeing 200 06-09-1969 Flight from DC to VA.
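As a rough sketch, the same loop can be wrapped so it runs on its own. This version reads from an in-memory string instead of DATA and joins the captures so there's no trailing space. The helper name interesting_text and the shortened sample input are my own assumptions for illustration, not part of the exercise:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between matching start and end
# tags, one entry per input line that has both tags on the same line.
sub interesting_text {
    my ($input) = @_;
    my @found;
    for my $line ( split /\n/, $input ) {
        next unless $line =~ m/
            <                               # start tag
              ([a-zA-Z_] [a-zA-Z_0-9.-]+)   # capture 1, the tag name
            >
            (.*?)                           # capture 2, the interesting text
            < \/ \1 >                       # end tag, same name as capture 1
            /x;
        push @found, $2;
    }
    return @found;
}

# Shortened sample input (an assumption; the real file has more lines).
my $sample = <<'HERE';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<price>200</price>
</flight>
</start>
HERE

print join( ' ', interesting_text($sample) ), "\n";
```

Returning a list keeps the matching separate from the printing, so the caller decides how to join or transform the pieces.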
What the answer probably is
I'm guessing that the answer you'll see is much simpler. If you ignore all the rules about names and only handle exactly the input from the problem, you can probably get away with this:
m/ <(.*?)> (.*?) < \/ \1 > /x
As an exercise simply to practice back references, that's fine. But, you'll eventually create problems handling real XML like that. Note that $1 could capture all of flight number="12345" because this doesn't exclude whitespace or the other disallowed characters.
Let's go a bit deeper
The pattern I showed was pretty complicated, especially if you are just learning things. I can precompile the pattern and save it in a scalar, then use that scalar inside the match operator:
use strict;
use warnings;
my $pattern = qr/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
( .*? ) # the interesting text, non-greedily
< \/ \1 > # end tag
/x;
while( <DATA> ) {
next unless m/$pattern/;
print $2, ' ';
}
This way, the mechanics of the while loop are distinct from the particulars. The complexity of the pattern doesn't affect my ability to understand the loop.
Now, having done that, I'll get more complicated. So far I used numbered captures and backreferences, but I might mess that up if I add more captures. If there's another capture before the start tag, the start tag capture is no longer $1, which means \1 now refers to the wrong thing. Instead of numbers, I can give them my own labels with the (?<LABEL>...) feature that Perl stole from Python. The back reference to that label is \k<LABEL>:
my $pattern = qr/
< # start tag
(?<tag> # labeled capture
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
( .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
I can even label the "interesting text" portion:
my $pattern = qr/
< # start tag
(?<tag>
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
The rest of the program still works because these labels are aliases to the numbered capture variables. However, I don't want to rely on that (hence, the label). The hash %+ has the values in the labeled captures, and the label is the key. The interesting text is in $+{text}:
while( <DATA> ) {
next unless m/$pattern/;
print $+{'text'}, ' ';
}
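Here's a small sketch of where labeled captures pay off: since both the tag name and the text have labels, I can build a hash that maps one to the other. The helper name tag_texts and the two-line input are assumptions for illustration:

```perl
use strict;
use warnings;

my $pattern = qr/
    <
      (?<tag> [a-zA-Z_] [a-zA-Z_0-9.-]+ )   # labeled capture for the name
    >
    (?<text> .*? )                          # labeled capture for the text
    < \/ \k<tag> >                          # end tag must repeat the name
    /x;

# Hypothetical helper: map each tag name to its text, via the %+ hash.
sub tag_texts {
    my ($input) = @_;
    my %text_for;
    for my $line ( split /\n/, $input ) {
        next unless $line =~ $pattern;
        $text_for{ $+{tag} } = $+{text};
    }
    return \%text_for;
}

my $fields = tag_texts("<price>200</price>\n<date>06-09-1969</date>\n");
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

Note that a hash keeps only the last value per tag name, so this only suits input where each tag appears once.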
The rule I ignored
Now, there was the rule that I ignored. A tag name cannot start with xml in any case. That's tied to an XML feature I'll ignore here. I'll change my input to include an xmlmeal node:
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<xmlmeal> chicken</xmlmeal>
</flight>
</start>
I match that xmlmeal node because I haven't done anything to follow the rule. I can add a negative lookahead assertion, (?!...) to exclude that. As an assertion (\b and \A are other assertions), the lookahead does not consume text; it merely matches a condition. I use (?!xml) to mean "wherever I am right now, xml cannot be next":
my $pattern = qr/
< # start tag
(?<tag>
(?!xml)
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
That's fine and it won't show " chicken" in the output. But, what if the input tag name was XMLmeal? I've only excluded the lowercase version. I need to exclude much more:
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<XMLmeal>chicken</XMLmeal>
<xmldrink>diet coke</xmldrink>
<Xmlsnack>almonds</Xmlsnack>
</flight>
</start>
I can get fancier. I'm not using the /i flag for case insensitivity because the start and end tag need to match exactly. I can, however, turn on case insensitivity for part of a pattern with (?i), and everything past that will ignore case:
my $pattern = qr/
< # start tag
(?<tag>
(?i) # ignore case starting here
(?!xml)
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
But, inside grouping parentheses, the (?i) is in effect only until the end of that group. I can limit which part of my pattern ignores case. The (?: ... ) groups without capturing (so doesn't disturb what $1 or $2 capture):
(?: (?i) (?!xml) )
Now my pattern excludes those three tags I added:
my $pattern = qr/
< # start tag
(?<tag>
(?: (?i) (?!xml) ) # not XmL in any case
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
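To see the scoped (?i) on its own, here's a minimal sketch that tests only the name rule against a few candidate names. The helper is_allowed_name and the list of candidates are assumptions for illustration:

```perl
use strict;
use warnings;

# Sketch: accept a tag name unless it starts with "xml" in any case.
my $name = qr/
    \A
    (?: (?i) (?!xml) )          # case-insensitive only inside this group
    [a-zA-Z_] [a-zA-Z_0-9.-]+
    \z
    /x;

# Hypothetical helper: true when the candidate passes the name rule.
sub is_allowed_name { return $_[0] =~ $name ? 1 : 0 }

for my $candidate (qw( pilot XMLmeal xmldrink Xmlsnack Pilot )) {
    printf "%-8s %s\n", $candidate,
        is_allowed_name($candidate) ? 'ok' : 'rejected';
}
```

Only pilot and Pilot should come out as ok; the three names that begin with some casing of xml are rejected by the lookahead.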
Some Mojo
So far, none of what I've presented handles attributes in the tags, which you want to ignore anyway. You should be able to add those to the regex yourself. But, I'll shift gears into other ways to handle XML like things.
Here's a Mojolicious program that understands XML and can extract things. Since it's a real Document Object Model (DOM) parser, it doesn't care about lines.
#!perl
use Mojo::DOM;
my $not_xml = <<~'HERE';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
HERE
Mojo::DOM->new( $not_xml )->xml(1)
->find( 'flight *' )
->map( 'text' )
->each( sub { print "$_ " } );
print "\n";
The find uses a CSS Selector to decide what it wants to process. The selector flight * is all the child nodes inside flight (so, any child tag no matter its name). The map calls the text method on each portion of the tree that find produces, and each outputs each result. It's very simple because someone has already done all of the hard work.
But, Mojo::DOM is not appropriate for every situation. It wants to know the entire tree at once, and for very large documents that's a burden on memory. There are "streaming" parsers that can handle that.
Twiggy
The problem you present in the original question is different than the homework that you posted in the comments. You want to transform text based on which tag it comes from. This is a different sort of problem altogether.
XML::Twig is useful for processing different node types differently. It has the added advantage that it doesn't need the entire XML tree in memory at one time.
Here's an example that uses two different handlers for the pilot and major portion. When Twig runs into those nodes, it calls the appropriate subroutine that you referenced in twig_handlers. I won't explain the particular Perl features here:
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => {
pilot => \&pilot,
major => \&major,
},
);
sub pilot {
my( $twig, $e ) = @_;
my $text = $e->text;
$text =~ s/,\s.\K.*/./;
print $text, ' ';
$twig->purge;
}
sub major {
my( $twig, $e ) = @_;
print '"' . $e->text . '"' . ' ';
$twig->purge;
}
my $xml = <<~'HERE';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
HERE
$twig->parse($xml);
This outputs:
Holland, T. "Aeronautics Engineer"
Now you'd complete that with subroutines for all of the other things that you want to process.

Forenote
Based on comments made after this answer was posted, this is an assignment where the teacher is encouraging the OP to make numerous bad assumptions about XML. They are teaching them to do exactly what one should never do. If the teacher were to define a format, that would be fine; it wouldn't be XML but merely something inspired by XML. But they didn't do that. They explicitly stated it was XML. I can't help the OP any further because
I won't teach how to do this incorrectly,
doing it correctly without using an existing module would require a time expenditure that's too large,
doing it correctly without using an existing module would fall outside the scope of the site, and
I don't even know what the teacher wants (having been provided the exact wording of the assignment).
What follows is an answer to the question asked (as opposed to a solution to the OP's homework).
Answer
You are trying to parse XML. There are existing XML parsers you can use instead of spending considerable effort writing your own. I personally use XML::LibXML.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $doc = XML::LibXML->new->parse_file("flight.txt");
for my $flight_node ($doc->findnodes("/start/flight")) {
my $pilot = $flight_node->findvalue("pilot");
my $major = $flight_node->findvalue("major");
my $price = $flight_node->findvalue("price");
my $date = $flight_node->findvalue("date");
my $details = $flight_node->findvalue("details");
say "$pilot \"$major\" $price $date $details";
}

Just to give you some hints:
Your code is "ok" but
my $text = <>;
in your while loop is wrong. You already have the line in $row, so just use $row instead.
and your row also contains a linefeed at the end, so before printing it out you might remove this.
chomp($row);
So wrapping it up:
chomp($row);
$row =~ s/<[^>]*>//g;
print $row . " ";
might be the code in your while-loop you are looking for. And for extra grades, start thinking how to remove unnecessary white space at the beginning/end.
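Putting those hints together, one possible sketch of the whole transformation looks like this. It treats the file as line-oriented text rather than real XML, with all the caveats the other answers give. The helper name format_flight is hypothetical, and dropping the company field is an assumption based on the required output shown in the question:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the flight lines into the single-line output
# from the question. Assumes one element per line, as in the sample file.
sub format_flight {
    my ($input) = @_;
    my @parts;
    for my $row ( split /\n/, $input ) {
        # Keep only lines with an opening and closing tag of the same name;
        # the \s* on both sides trims whitespace around the text.
        next unless $row =~ m{<(\w+)>\s*(.*?)\s*</\1>};
        my ( $tag, $text ) = ( $1, $2 );

        next if $tag eq 'company';    # assumption: not in the required output

        if    ( $tag eq 'pilot' ) { $text =~ s/,\s*(\w)\w*/, $1./ }  # first initial
        elsif ( $tag eq 'major' ) { $text = qq{"$text"} }            # add quotes
        elsif ( $tag eq 'date' )  { $text =~ tr{-}{/} }              # - becomes /

        push @parts, $text;
    }
    return join ' ', @parts;
}

my $sample = <<'HERE';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
HERE

print format_flight($sample), "\n";
# prints: Holland, T. "Aeronautics Engineer" 200 06/09/1969 Flight from DC to VA.
```

To read from the file instead, you could slurp flights.txt into a string and pass that to the same function.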

Related

How to split text into "steps" using regex in perl?

I am trying to split texts into "steps"
Lets say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!"
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex so help would be great!
I've tried many combination like:
split /(\s\d.)/
But it splits the numbering away from text
I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
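As a quick runnable check of that split, here is a sketch using the sample string from the question; the quoting in the print statement is only there to mirror the requested output:

```perl
use strict;
use warnings;

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# The lookahead (?=\d+\.) checks for the next step number without consuming
# it, so split removes only the whitespace in front of each number.
my @steps = split /\s+(?=\d+\.)/, $steps;

print qq{"$_"\n} for @steps;
```

Because the lookahead is zero-width, the numbering stays attached to the text of each step.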
All step-descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;
This works only if there are surely no numbers in the steps' description, like any approach relying on matching a number (even if followed by a period, for decimal numbers)†
If there may be numbers in there, we'd need to know more about the structure of the text.
Another delimiting pattern to consider is punctuation that ends a sentence (. and ! in these examples), if there are no such characters in steps' description and there are no multiple sentences
my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;
Augment the list of patterns that end an item's description as needed, say with a ?, and/or ." sequence as punctuation often goes inside quotes.‡
If an item can have multiple sentences, or use end-of-sentence punctuation mid-sentence (as a part of a quotation perhaps) then tighten the condition for an item's end by combining footnotes -- end-of-sentence punctuation and followed by number+period
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;
If this isn't good enough either then we'd really need a more precise description of that text.
† An approach using a "numbers-period" pattern to delimit item's description, like
/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;
(or in a lookahead in split) fails with text like
1. Only $2.50   or   1. Version 2.4.1   ...
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
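The following sample code demonstrates the power of regex by filling the %steps hash in one line of code.
Once you have the data, you can slice and dice it any way your heart desires.
Inspect the sample for compliance with your problem. Note that step 4 ends with ! rather than a period, so this pattern does not capture it, as the output shows.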
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%steps,$re);
$str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re = qr/(\d+)\.(\D+)\./;
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort keys %steps;
Output
$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that'
};
1. Do this
2. Then do that
3. And then maybe that

Perl / Regex String Manipulation for multiple matches

I have the following string:
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
I need to strip everything in this string apart from:
Feed="EUREX-EMDI"
State="CLOSED"
IsTainted="0"
I have managed to get "Feed="EUREX-EMDI"" with the following code:
s/^[^Feed]*(?=Feed)//;
So it now looks like:
Feed="EUREX-EMDI" IPPort="224.0.50.0:59098" State="CLOSED" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="2191840" />
However I now don't know how to look for the next part "State="CLOSED"" in the string whilst ignoring my already found "Feed="EUREX-EMDI"" match
The perl idiom for this type of thing is a multiple assignment from regex capture groups. Assuming you can always count on the items of interest being in the same order and format (quoting):
($feed, $state, $istainted) = /.*(Feed="[^"]*").*(State="[^"]*").*(IsTainted="[^"]*")/;
Or if you only want to capture the (unquoted) values themselves, change the parentheses (capture groups):
($feed, $state, $istainted) = /.*Feed="([^"]*)".*State="([^"]*)".*IsTainted="([^"]*)"/;
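If the order of the items can't be relied on, one alternative sketch is to collect every name="value" pair into a hash and then pick out the ones of interest. The helper name attributes is hypothetical, the input line is shortened, and this shares the usual limits of regex-based parsing (it assumes double-quoted values with no embedded quotes):

```perl
use strict;
use warnings;

# Hypothetical helper: collect every name="value" pair into a hash, so the
# order of the attributes in the string no longer matters.
sub attributes {
    my ($string) = @_;
    # In list context, a /g match with two groups returns name, value,
    # name, value, ... which assigns directly into a hash.
    my %value_for = $string =~ /(\w+)="([^"]*)"/g;
    return \%value_for;
}

my $line = '<Multicast ID="0/m1" Feed="EUREX-EMDI" State="CHECK" IsTainted="0" />';
my $att  = attributes($line);

print join( ' ', map { qq{$_="$att->{$_}"} } qw( Feed State IsTainted ) ), "\n";
```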
Please, don't try and parse XML with a regex. It's brittle. XML is contextual, and regular expressions aren't. So at best, it's a dirty hack, and one that may one day break without warning for the most inane reasons.
See: RegEx match open tags except XHTML self-contained tags for more.
However, XML is structured, and it's actually quite easy to work with - provided you use something well suited to the job: A parser.
I like XML::Twig. XML::LibXML is also excellent, but has a bit of a steeper learning curve. (You also get XPath which is like regular expressions, but much more well suited for XML)
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#create a list of what we want to keep. This map just turns it
#into a hash.
my %keep = map { $_ => 1 } qw ( IsTainted State Feed );
#parse the XML. If it's a file, you may want "parsefile" instead.
my $twig = XML::Twig->parse( \*DATA );
#iterate the attributes.
foreach my $att ( keys %{ $twig->root->atts } ) {
#delete the attribute unless it's in our 'keep' list.
$twig->root->del_att($att) unless $keep{$att};
}
#print it. You may find set_pretty_print useful for formatting XML.
$twig->print;
__DATA__
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
Outputs:
<Multicast Feed="EUREX-EMDI" IsTainted="0" State="CHECK"/>
That preserves the attributes, and gives you valid XML. But if you just want the values:
foreach my $att ( qw ( Feed State IsTainted ) ) {
print $att, "=", $twig->root->att($att),"\n";
}
This will strip all but those strings.
$str =~ s/(?s)(?:(?!(?:Feed|State|IsTainted)\s*=\s*".*?").)*(?:((?:Feed|State|IsTainted)\s*=\s*".*?")|$)/$1/g;
If you want to include a space separator, make the replacement ' $1'.
Explained
(?s) # Dot - all
(?: # To be removed
(?!
(?: Feed | State | IsTainted )
\s* = \s* " .*? "
)
.
)*
(?: # To be saved
( # (1 start)
(?: Feed | State | IsTainted )
\s* = \s* " .*? "
) # (1 end)
| $
)

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grabs" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language so I'll just use Perl. So basically instead of matching everything, I just matched exactly what I thought you needed. Correct me if I am wrong please.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my @items = split '&', $query; # Each item is something like param=123
foreach my $item (@items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
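As a sketch of how that approach could cover the cases in the question, the route check can make the trailing slash and the query string optional while still insisting on a ? before any parameters. The helper name parse_route is hypothetical, and the pattern is my assumption about what the asker wants, not a tested production rule:

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref of parameters, or undef when the
# route does not match. The query string is optional, but when parameters
# are present they must follow a "?".
sub parse_route {
    my ($route) = @_;
    return unless $route =~ m{\A/parent/child/?(?:\?(.*))?\z};

    my %params;
    for my $item ( split /&/, $1 // '' ) {
        my ( $param, $value ) = split /=/, $item, 2;
        $params{$param} = $value;
    }
    return \%params;
}

for my $route (
    '/parent/child/?firstparam=abc123&secondparam=def456',
    '/parent/child/firstparam=abc123&secondparam=def456',
) {
    my $params = parse_route($route);
    print defined $params ? "match: $params->{firstparam}\n" : "no match\n";
}
```

The second route is rejected because nothing but an optional / and an optional ?-prefixed query may follow /parent/child before the end of the string.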
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
Output:
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, I check whether the line contains a ?.
Then I remove the first part of the line (everything up to and including the ?).
Next, I split the rest on &, and split each piece on =.
use feature 'say'; # say is not enabled by default
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my @params = map { [ split /=/, $_ ] } split /&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (@params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}

perl regex for extracting multiline blocks

I have text like this:
00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
So, I don't have a block end, just a new block start.
I want to recursively get all blocks:
1 = 00:00 stuff
2 = 00:01 more stuff
multi line
and going
etc
The code below only gives me this:
$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';
What am I doing wrong?
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);
Version 5.10.0 introduced named capture groups that are useful for matching nontrivial patterns.
(?'NAME'pattern)
(?<NAME>pattern)
A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.
If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.
The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
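As a small illustration of the documentation quoted above, here is a sketch that uses both spellings of a named group and reads the results from %+. The sample strings and the group names (time, desc, num) are chosen for this example only.

```perl
use strict;
use warnings;

# Both spellings of a named group in one pattern; results are read from %+.
my $line = '00:01 more stuff';
my ($time, $desc);
if ($line =~ /^(?<time>\d\d:\d\d)\s+(?'desc'.*)$/) {
    ($time, $desc) = ($+{time}, $+{desc});
    print "time=$time desc=$desc\n";
}

# Two groups share the name "num". Only one of them takes part in any
# given match, and $+{num} reports the leftmost one that is defined.
my @nums;
for my $s ('id=42', '#7') {
    push @nums, $+{num} if $s =~ /id=(?<num>\d+)|\#(?<num>\d+)/;
}
print "@nums\n";
```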
Named capture groups allow us to name subpatterns within the regex as in the following.
use 5.10.0; # named capture buffers
my $block_pattern = qr/
(?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))
(?(DEFINE)
# timestamp at logical beginning-of-line
(?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])
# runs of spaces or tabs
(?<_sp> [ \t]+)
# description is everything through the end of the record
(?<_desc>
# s switch makes . match newline too
(?s: .+?)
# terminate before optional whitespace (which we remove) followed
# by either end-of-string or the start of another block
(?= (?&_sp)? (?: $ | (?&_time)))
)
)
/x;
Use it as in
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
while ($text =~ /$block_pattern/g) {
print "time=[$+{time}]\n",
"desc=[[[\n",
$+{desc},
"]]]\n\n";
}
Output:
$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]
time=[00:01]
desc=[[[
more stuff
multi line
and going
]]]
time=[00:02]
desc=[[[
still
have
]]]
This should do the trick. Beginning of next \d\d:\d\d is treated as block end.
use strict;
my $Str = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
00:03 still
have' ;
my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);
print join "--\n", @Blocks;
Your problem is that .*? is non-greedy in the same way that .* is greedy. When it is not forced, it matches as little as possible, which in this case is the empty string.
So, you'll need something after the non-greedy match to anchor up your capture. I came up with this regex:
my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;
As you can see, I removed the /m option so that $ in the look-ahead assertion matches only at the end of the string.
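To see the difference side by side, here is a minimal sketch that runs the question's pattern and a lookahead-terminated one on the same text. The second pattern is a variation of mine, not the exact regex from this answer: it keeps /m so that ^ can mark the start of each block, and uses \z for end-of-string because \z is not affected by /m.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\nand going\n00:02 still\nhave\n";

# Nothing follows the lazy group, so it is satisfied with zero characters.
my @lazy = $text =~ /^([0-9]{2}:[0-9]{2})(.*?)/gms;

# A lookahead after the lazy part forces it to run up to the next
# timestamp at the start of a line, or to the very end of the string.
my @blocks = $text =~ /^([0-9]{2}:[0-9]{2}.*?)(?=^[0-9]{2}:[0-9]{2}|\z)/gms;

print scalar(@lazy), " captures, every second one empty\n";
print "[$_]\n" for @blocks;
```

Each element of @blocks keeps its trailing newline; chomp the elements if you do not want it.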
You might also consider this solution:
my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
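One caveat with the split approach: the lookahead matches a timestamp anywhere, not only at the start of a line. The sketch below uses a made-up input whose first description contains a time of day, and compares the split above with a variant anchored by ^ under /m. The anchored variant is my suggestion, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical input: the first description happens to contain a time of day.
my $text = "00:00 meeting moved to 10:30 today\n00:01 next item\nsecond line\n";

my @loose  = split /(?=[0-9]{2}:[0-9]{2})/, $text;    # also splits before "10:30"
my @strict = split /^(?=[0-9]{2}:[0-9]{2})/m, $text;  # only at the start of a line

print scalar(@loose), " vs ", scalar(@strict), " blocks\n";
print "[$_]\n" for @strict;
```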

Need help with Perl reg ex?

Here are the forms the lines in my text file can take.
S1,F2 title including several white spaces (abbr) single,Here<->There,reply
S1,F2 title including several white spaces (abbr) single,Here<->There
S1,F2 title including several white spaces (abbr) single,Here<->There,[reply]
How to change my reg ex to work on all the three forms above?
/^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)[,](.*?)$/
I tried replacing (.*?)$/ with [.*?]$/. It doesn't work. I guess I shouldn't use [] (square brackets) to match the possible word [reply] (including the []).
Actually, my more general question is how to better match characters that may or may not be present, using a Perl regex. I looked through the online perldoc pages, but at my level of Perl knowledge it is hard to find the useful information. That's why I also asked some stupid questions.
I'd appreciate your comments and suggestions.
What about using negated character classes:
/^S(\d),F(\d)\s+([^()]*?)\s+\(([^()]+)\)\s+([^,]*),([^,]*)(?:,(.*?))?$/
When incorporated into this script:
#!/bin/perl
use strict;
use warnings;
while (<>)
{
chomp;
my($s,$f,$title,$abbr,$single,$here,$reply) =
$_ =~ m/^S(\d),F(\d)\s+([^()]*?)\s+\(([^()]+)\)\s+([^,]*),([^,]*)(?:,(.*?))?$/;
$reply ||= "<no reply>";
print "S$s F$f <$title> ($abbr) $single : $here : $reply\n";
}
And run on the original data file, it produces:
S1 F2 <title including several white spaces> (abbr) single : Here<->There : reply
S1 F2 <title including several white spaces> (abbr) single : Here<->There : <no reply>
S1 F2 <title including several white spaces> (abbr) single : Here<->There : [reply]
You should probably also use the 'xms' suffix to the expression to allow you to document it more easily:
#!/bin/perl
use strict;
use warnings;
while (<>)
{
chomp;
my($s,$f,$title,$abbr,$single,$here,$reply) =
$_ =~ m/^
S(\d) , # S1
F(\d) \s+ # F2
([^()]*?) \s+ # Title
\(([^()]+)\) \s+ # (abbreviation)
([^,]*) , # Single
([^,]*) # Here or There
(?: , (.*?) )? # Optional reply
$
/xms;
$reply ||= "<no reply>";
print "S$s F$f <$title> ($abbr) $single : $here : $reply\n";
}
I confess I'm still apt to write one-line monsters - I'm trying to mend my ways.
Brackets in a regular expression are reserved for declaring sets of characters that you want to match. So, to match a literal bracket, you need to escape it (\[ or \]) or enclose it in a character class ([[] or []]), which looks rather obfuscated.
Try (\[.*?\]|.*?) to allow for the optional brackets.
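A minimal sketch of both ways to write a literal bracket, plus the suggested alternation. The test strings are just the two reply forms from the question, and the ^ and $ anchors are added for the demonstration.

```perl
use strict;
use warnings;

# Two ways to match a literal bracket: a backslash escape,
# or a one-character class.
my $escaped  = qr/^\[reply\]$/;
my $in_class = qr/^[[]reply[]]$/;

my (%result, %field);
for my $s ('[reply]', 'reply') {
    $result{$s} = [ ($s =~ $escaped ? 1 : 0), ($s =~ $in_class ? 1 : 0) ];

    # The alternation tries the bracketed form first, then falls back
    # to a lazy match that the trailing $ stretches to the whole string.
    ($field{$s}) = $s =~ /^(\[.*?\]|.*?)$/;

    print "$s: escaped=$result{$s}[0] class=$result{$s}[1] field=$field{$s}\n";
}
```

Note that the lazy .*? alternative only captures something useful because the $ after the group forces it to extend; with nothing after it, it would match the empty string.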
Try
/^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)(,(\[reply\]|reply))?$/
This will match the optional (?) part ,(\[reply\]|reply) which is either ,[reply] or ,reply, i.e.,
(nothing)
,reply
,[reply]
BTW, your [,] means "one character of the following: ,". Exactly the same as a literal , within the regex. If you wanted to make your [,](.*?)$ work, you should use (,(.+))?$ to match either nothing or a comma followed by any (non-empty) string.
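Here is a short sketch of that last suggestion applied to the three lines from the question. It uses the question's pattern with only the tail changed; the inner group is written as non-capturing, (?:,(.+))?, so that the reply lands in the seventh capture. The '<none>' placeholder is my own choice.

```perl
use strict;
use warnings;

my @lines = (
    'S1,F2 title including several white spaces (abbr) single,Here<->There,reply',
    'S1,F2 title including several white spaces (abbr) single,Here<->There',
    'S1,F2 title including several white spaces (abbr) single,Here<->There,[reply]',
);

# The question's pattern with its tail replaced by an optional group.
my $re = qr/^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)(?:,(.+))?$/;

my @replies;
for my $line (@lines) {
    my @f = $line =~ $re or next;   # skip lines that do not match at all
    # The seventh capture is undef when the optional group did not take part.
    push @replies, defined $f[6] ? $f[6] : '<none>';
}
print join(' | ', @replies), "\n";
```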
EDIT
If the following are also valid:
S1,F2 title including several white spaces (abbr) single,Here<->There,[reply
S1,F2 title including several white spaces (abbr) single,Here<->There,reply]
Then you could use (,\[?reply\]?)? at the end.
You can make the last part optional by using the (?:..)? as:
^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)(?:,(.*))?$