I have the following string:
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
I need to strip everything in this string apart from:
Feed="EUREX-EMDI"
State="CLOSED"
IsTainted="0"
I have managed to get "Feed="EUREX-EMDI"" with the following code:
s/^[^Feed]*(?=Feed)//;
So it now looks like:
Feed="EUREX-EMDI" IPPort="224.0.50.0:59098" State="CLOSED" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="2191840" />
However I now don't know how to look for the next part "State="CLOSED"" in the string whilst ignoring my already found "Feed="EUREX-EMDI"" match
The Perl idiom for this type of thing is a list assignment from regex capture groups. Assuming you can always count on the items of interest being in the same order and format (quoting):
($feed, $state, $istainted) = /.*(Feed="[^"]*").*(State="[^"]*").*(IsTainted="[^"]*")/;
Or, if you only want to capture the (unquoted) values themselves, move the parentheses (capture groups):
($feed, $state, $istainted) = /.*Feed="([^"]*)".*State="([^"]*)".*IsTainted="([^"]*)"/;
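As a runnable sketch of the second form, here it is applied to the string from the question. The `\b` anchors and the `defined` check are additions for illustration and are not part of the answer above; the sketch still assumes the attributes always appear in this order.

```perl
use strict;
use warnings;

my $line = '<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />';

# List assignment from the three capture groups. \b keeps each name from
# matching inside a longer attribute name.
my ( $feed, $state, $istainted ) =
    $line =~ /\bFeed="([^"]*)".*\bState="([^"]*)".*\bIsTainted="([^"]*)"/;

if ( defined $feed ) {
    print qq{Feed="$feed" State="$state" IsTainted="$istainted"\n};
}
else {
    print "no match\n";
}
# prints: Feed="EUREX-EMDI" State="CHECK" IsTainted="0"
```

If the match fails, all three variables are undefined, which is why the sketch tests `$feed` before printing.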
Please don't try to parse XML with a regex. It's brittle. XML is contextual, and regular expressions aren't. So at best, it's a dirty hack, and one that may one day break without warning for the most inane reasons.
See: RegEx match open tags except XHTML self-contained tags for more.
However, XML is structured, and it's actually quite easy to work with - provided you use something well suited to the job: A parser.
I like XML::Twig. XML::LibXML is also excellent, but has a somewhat steeper learning curve. (You also get XPath, which is like regular expressions but much better suited to XML.)
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#create a list of what we want to keep. This map just turns it
#into a hash.
my %keep = map { $_ => 1 } qw ( IsTainted State Feed );
#parse the XML. If it's a file, you may want "parsefile" instead.
my $twig = XML::Twig->parse( \*DATA );
#iterate the attributes.
foreach my $att ( keys %{ $twig->root->atts } ) {
#delete the attribute unless it's in our 'keep' list.
$twig->root->del_att($att) unless $keep{$att};
}
#print it. You may find set_pretty_print useful for formatting XML.
$twig->print;
__DATA__
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
Outputs:
<Multicast Feed="EUREX-EMDI" IsTainted="0" State="CHECK"/>
That preserves the attributes, and gives you valid XML. But if you just want the values:
foreach my $att ( qw ( Feed State IsTainted ) ) {
print $att, "=", $twig->root->att($att),"\n";
}
This will strip all but those strings.
$str =~ s/(?s)(?:(?!(?:Feed|State|IsTainted)\s*=\s*".*?").)*(?:((?:Feed|State|IsTainted)\s*=\s*".*?")|$)/$1/g;
If you want to include a space separator, make the replacement ' $1'.
Explained
(?s) # Dot - all
(?: # To be removed
(?!
(?: Feed | State | IsTainted )
\s* = \s* " .*? "
)
.
)*
(?: # To be saved
( # (1 start)
(?: Feed | State | IsTainted )
\s* = \s* " .*? "
) # (1 end)
| $
)
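If removing text in place is not a requirement, an alternative sketch is to collect the wanted attributes with a global match in list context and join them. This is a different technique from the substitution above. It assumes attribute values never contain a double quote, and it keeps the attributes in whatever order they appear in the input.

```perl
use strict;
use warnings;

my $str = '<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />';

# With /g in list context, the match returns every capture in input order.
my @keep = $str =~ /\b((?:Feed|State|IsTainted)\s*=\s*"[^"]*")/g;

my $stripped = join ' ', @keep;
print "$stripped\n";    # Feed="EUREX-EMDI" State="CHECK" IsTainted="0"
```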
Related
I am trying to split texts into "steps"
Let's say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex so help would be great!
I've tried many combination like:
split /(\s\d.)/
But it splits the numbering away from text
I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
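A complete, runnable version of that split, using the sample text from the question (the quoting in the output is only there to mirror the requested format):

```perl
use strict;
use warnings;

my $steps = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';

# Split on whitespace that is followed by "digits and a period". The
# lookahead does not consume the number, so it stays with its step.
my @steps = split /\s+(?=\d+\.)/, $steps;

print qq{"$_"\n} for @steps;
# "1.Do this."
# "2.Then do that."
# "3.And then maybe that."
# "4.Complete!"
```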
All step-descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;
This works only if there are surely no numbers in the steps' description, like any approach relying on matching a number (even if followed by a period, for decimal numbers)†
If there may be numbers in there, we'd need to know more about the structure of the text.
Another delimiting pattern to consider is punctuation that ends a sentence (. and ! in these examples), if there are no such characters in steps' description and there are no multiple sentences
my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;
Augment the list of patterns that end an item's description as needed, say with a ?, and/or ." sequence as punctuation often goes inside quotes.‡
If an item can have multiple sentences, or use end-of-sentence punctuation mid-sentence (as a part of a quotation perhaps) then tighten the condition for an item's end by combining footnotes -- end-of-sentence punctuation and followed by number+period
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;
If this isn't good enough either then we'd really need a more precise description of that text.
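To see the combined condition at work, here is a small sketch run against a made-up text (the sample string is hypothetical, chosen to contain a decimal number and a quoted exclamation inside the step descriptions):

```perl
use strict;
use warnings;

# Hypothetical input: step 1 contains "2.50", step 2 ends inside quotes.
my $steps = '1. Only $2.50 today. 2. Say "done!" 3. Finish.';

# An item ends at sentence punctuation that is followed either by the next
# "number + period" or by the end of the string.
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;

print "$_\n" for @s;
# 1. Only $2.50 today.
# 2. Say "done!"
# 3. Finish.
```

The period in `$2.50` is not taken as the end of the item because it is not followed by whitespace and a numbered step.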
† An approach using a "numbers-period" pattern to delimit item's description, like
/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;
(or in a lookahead in split) fails with text like
1. Only $2.50 or 1. Version 2.4.1 ...
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
The following sample code demonstrates the power of regex by filling the %steps hash in one line of code.
Once you have the data, you can slice and dice it any way your heart desires.
Check the sample against your actual problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%steps,$re);
$str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re = qr/(\d+)\.(\D+)\./;
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort keys %steps;
Output (step 4 is absent because "Complete!" ends in ! rather than the . that the pattern requires)
$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that'
};
1. Do this
2. Then do that
3. And then maybe that
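A possible variation, sketched here as an assumption about what is wanted rather than as part of the answer above, accepts either `.` or `!` as the closing punctuation and sorts the keys numerically so that a step 10 would not sort before step 2:

```perl
use strict;
use warnings;

my $str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';

# Same idea as above, but the closing punctuation may be "." or "!".
my $re    = qr/(\d+)\.(\D+)[.!]/;
my %steps = $str =~ /$re/g;

# Numeric sort of the step numbers.
print "$_. $steps{$_}\n" for sort { $a <=> $b } keys %steps;
# 1. Do this
# 2. Then do that
# 3. And then maybe that
# 4. Complete
```

Like the original, this still relies on step descriptions containing no digits.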
I'm trying to solve a problem that wants to display the given text from a file omitting the special characters and modifying the multi-line input to a single-formatted output in only the language of Perl/Regex (no other languages like XML, etc.). Here's the given text in my flight.txt file:
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
The required output is:
Holland, T. "Aeronautics Engineer" 200 06/09/1969 Flight from DC to VA.
As you can see, I need the output on a single line. The first name should be reduced to its initial, the major should be wrapped in double quotes, and the date separator should change from - to /.
This is what I have in my code so far:
#!/bin/perl
use strict;
use warnings;
my $filename = "flights.txt"
open(my $input, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$input>){
my $text = <>;
$text =~ s/<[^>]*>//g;
print $text;
}
close $input
Please suggest me on what to do next and how to format the output of the given file. I'm new to Regex & Perl so need help.
Here's your homework problem, as you noted in a comment to ikegami's answer:
Create the Perl script “code.pl” that print lines that contain an opening and closing XML heading tag from “flights.txt”. Valid tags are pilot, major, company, price, date, and details regardless of the case. Tags may also have any arbitrary content inside of them. You may assume that a '<' or a '>' character will not appear inside of the attribute's value portion
Let's forget that your input is XML, for all the reasons that ikegami has already explained. The entire thing is a contrived example to get you to practice some particular regex feature. I'll go through a process of solving this problem, but also reveal later what I think the instructor expects.
First, you only need to think about one line at a time, so you don't care about nodes where the opening and closing is on separate lines, such as <start> and </start>, or <flight> and </flight>. You want to find lines such as:
<node>...</node>
The pattern is that there is some string you match near the start of the line, and that match has to show up later in the line. I think your intended task is to practice backreferences. Writing good exercises is tough, and people fall back on things, such as XML, that are familiar. My Learning Perl Exercises is more thoughtful about this.
Your basic program needs to look something like this first attempt. Read lines of input, skip the ones that don't match your pattern, and output the rest. Whenever you see ... in this answer, that's just something I need to fill in and is not Perl syntax (ignoring the yada operator, which cannot appear in a regex):
use strict;
use warnings;
while( <> ) {
next unless m/ ... /;
print;
}
I'll mostly ignore that program structure and focus on the match operator, m//. I'll update the pattern as I step through this.
The trick, then, is what goes in the pattern. You know you have to match something that looks like an XML open tag (again, ignoring that this is XML because it's not a good example for input). That starts with < and ends with > with some stuff in the middle. This pattern uses the /x flag to make whitespace insignificant. I can spread out the pattern so I can grok it easier:
m/ < ... > /x;
So what can go inside the angle brackets? In the input, which I'm pretending isn't XML, the stuff inside the angles follows these rules, which you could read about in the XML standard if this were XML:
case-sensitive
starts with a letter or underscore
can contain letters, digits, hyphens, underscores, and periods
cannot start with xml in any case
Let's ignore that last one for a moment because I don't think it's part of the simple exercise you need to do. And the rules are actually slightly more complicated.
Case sensitive is easy. We aren't going to use the /i flag on the match operator, so we get that for free.
Starts with a letter or underscore. That's pretty easy. Since I'm pretending this is not XML, I'm not going to support all the Unicode scripts that current XML will allow. I'll restrict that to ASCII, and use a character class to represent all the letters that I'll allow right after the <:
m/ < [a-zA-Z_] ... > /x;
After that, I can have letters and underscores, but now also have hyphens, digits, and periods. As an aside, many such things have a set of characters for the start of an "identifier" (ID_Start) and a wider set for the rest (ID_Continue). Perl has similar rules for its variable name.
I use a second character class for the continuation. There's a slight gotcha here because you want a literal hyphen, but that also forms a range in the character class. That is, it forms a range unless it's at the end. The . in a character class is literal .:
m/ < [a-zA-Z_] [a-zA-Z_0-9.-]+ > /x;
With this pattern, you get much more than you wanted. The output is every line that has a start tag. Note that it does not match <flight number="12345"> because this pattern doesn't handle attributes, which is fine because I'm pretending this isn't XML:
<start>
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
The end tag has the same name as the start tag. In our input, there's one start tag and one end tag per line, and since I look at one line at a time, I can ignore many things that an XML parser has to care about. Now I spread my pattern over several lines because /x allows me to do that, and /x also allows me to add comments so I remember what each part of the pattern does. The / in the end tag is also the match operator delimiter, so I escape that as \/:
m/
< [a-zA-Z_] [a-zA-Z_0-9.-]+ > # start tag
... # the interesting text
< \/ ... > # end tag
/x;
I need to fill in the ... parts. The "interesting text" part is easy. I'll match anything. The .* greedily matches zero or more non-newline characters:
m/
< [a-zA-Z_] [a-zA-Z_0-9.-]+ > # start tag
.* # the interesting text, greedily
< \/ ... > # end tag
/x;
But, I don't really want * to be greedy. I don't want it to match the end tag, so I can add the non-greedy modifier ? to the .*:
m/
< [a-zA-Z_] [a-zA-Z_0-9.-]+ > # start tag
.*? # the interesting text, non-greedily
< \/ ... > # end tag
/x;
Now I need to fill in the name portion of the end tag. It has to be the same as the start name. By surrounding the start name in (...), I capture that part of the string that matched. That goes into the capture buffer $1. I can then re-use that exact match within the pattern with a "back reference" (the point of your problem, I'm guessing). A backreference starts with a \ and uses the number of the capture buffer you want to use. So, \1 uses the exact text matched in $1; not the same pattern but the actual text matched:
m/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
.*? # the interesting text, non-greedily
< \/ \1 > # end tag
/x;
Now the output excludes <start> because it doesn't have an end tag:
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
If you modified your data to change </date> to </data>, that line wouldn't match because the start and end tags are different.
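A quick way to convince yourself of that is to run the pattern against a few hand-written lines. This is a small sketch; the three test lines are made up for the purpose:

```perl
use strict;
use warnings;

my $pattern = qr/
    <                              # start tag
    ([a-zA-Z_] [a-zA-Z_0-9.-]+)    # $1
    >
    .*?                            # the interesting text, non-greedily
    < \/ \1 >                      # end tag, same name as the start tag
/x;

my @lines = ( '<date>06-09-1969</date>', '<date>06-09-1969</data>', '<start>' );

my %matched;
for my $line (@lines) {
    $matched{$line} = ( $line =~ $pattern ) ? 1 : 0;
    printf "%-26s %s\n", $line, $matched{$line} ? 'matches' : 'does not match';
}
```

Only the first line should report a match: the second has mismatched names, and the third has no end tag at all.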
But, what you really want is the text in the middle, so you need to capture that too. You can add another capture buffer. As the second set of parens, this is the buffer $2, and doesn't disturb $1 or \1:
m/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
( .*? ) # $2, the interesting text, non-greedily
< \/ \1 > # end tag
/x;
But now you want to print the interesting text, not the entire line, so I'll print the $2 capture buffer instead of the entire line. Remember, these buffers are only valid after a successful match, but I've skipped the lines where it doesn't match, so I'm fine:
use strict;
use warnings;
while( <DATA> ) {
next unless m/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
(.*?) # $2, the interesting text, non-greedily
< \/ \1 > # end tag
/x;
print $2;
}
print "\n"; # end all the output!
This gets me close. I'm missing some whitespace between elements (And note there is a leading space before Holland):
Holland, TomAeronautics EngineerBoeing20006-09-1969Flight from DC to VA.
I can add a space at the end of each print:
print $2, ' ';
Now you have your output:
Holland, Tom Aeronautics Engineer Boeing 200 06-09-1969 Flight from DC to VA.
What the answer probably is
I'm guessing that the answer you'll see is much simpler. If you ignore all the rules about names and only handle exactly the input from the problem, you can probably get away with this:
m/ <(.*?)> (.*?) < \/ \1 > /x
As an exercise simply to practice back references, that's fine. But, you'll eventually create problems handling real XML like that. Note that $1 could capture all of flight number="12345" because this doesn't exclude whitespace or the other disallowed characters.
Let's go a bit deeper
The pattern I showed was pretty complicated, especially if you are just learning things. I can precompile the pattern and save it in a scalar, then use that scalar inside the match operator:
use strict;
use warnings;
my $pattern = qr/
< # start tag
([a-zA-Z_] [a-zA-Z_0-9.-]+) # $1
>
( .*? ) # the interesting text, non-greedily
< \/ \1 > # end tag
/x;
while( <DATA> ) {
next unless m/$pattern/;
print $2, ' ';
}
This way, the mechanics of the while loop are distinct from the particulars. The complexity of the pattern doesn't affect my ability to understand the loop.
Now, having done that, I'll get more complicated. So far I used numbered captures and backreferences, but I might mess that up if I add more captures. If there's another capture before the start tag, the start tag capture is no longer $1, which means \1 now refers to the wrong thing. Instead of numbers, I can give them my own labels with the (?<LABEL>...) feature that Perl stole from Python. The back reference to that label is \k<LABEL>:
my $pattern = qr/
< # start tag
(?<tag> # labeled capture
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
( .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
I can even label the "interesting text" portion:
my $pattern = qr/
< # start tag
(?<tag>
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
The rest of the program still works because these labels are aliases to the numbered capture variables. However, I don't want to rely on that (hence, the label). The hash %+ has the values in the labeled captures, and the label is the key. The interesting text is in $+{text}:
while( <DATA> ) {
next unless m/$pattern/;
print $+{'text'}, ' ';
}
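Having the tag name available in `$+{tag}` also makes it possible to treat each field differently, which is what the original question asked for. The following is only a sketch under stated assumptions: the `%format` table, the decision to drop `company`, and the in-memory `@lines` list are all hypothetical choices made to reproduce the required output line.

```perl
use strict;
use warnings;

my $pattern = qr/
    <
    (?<tag> [a-zA-Z_] [a-zA-Z_0-9.-]+ )
    >
    (?<text> .*? )
    < \/ \k<tag> >
/x;

# Hypothetical per-tag formatters. A tag without an entry is printed
# unchanged; a formatter that returns undef drops the field.
my %format = (
    pilot   => sub { my ($t) = @_; $t =~ s/,\s*(\w)\w*/, $1./; return $t },
    major   => sub { return qq{"$_[0]"} },
    company => sub { return undef },
    date    => sub { my ($t) = @_; $t =~ tr{-}{/}; return $t },
);

my @lines = (
    '<start>',
    '<flight number="12345">',
    '<pilot> Holland, Tom</pilot>',
    '<major>Aeronautics Engineer</major>',
    '<company>Boeing</company>',
    '<price>200</price>',
    '<date>06-09-1969</date>',
    '<details>Flight from DC to VA.</details>',
    '</flight>',
    '</start>',
);

my @fields;
for my $line (@lines) {
    next unless $line =~ $pattern;
    # Copy the captures first: later matches overwrite %+.
    my ( $tag, $text ) = ( $+{tag}, $+{text} );
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    $text = $format{$tag}->($text) if $format{$tag};
    push @fields, $text if defined $text;
}

my $out = join ' ', @fields;
print "$out\n";
# Holland, T. "Aeronautics Engineer" 200 06/09/1969 Flight from DC to VA.
```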
The rule I ignored
Now, there was the rule that I ignored. A tag name cannot start with xml in any case. That's tied to an XML feature I'll ignore here. I'll change my input to include an xmlmeal node:
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<xmlmeal> chicken</xmlmeal>
</flight>
</start>
I match that xmlmeal node because I haven't done anything to follow the rule. I can add a negative lookahead assertion, (?!...) to exclude that. As an assertion (\b and \A are other assertions), the lookahead does not consume text; it merely matches a condition. I use (?!xml) to mean "wherever I am right now, xml cannot be next":
my $pattern = qr/
< # start tag
(?<tag>
(?!xml)
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
That's fine and it won't show " chicken" in the output. But, what if the input tag name was XMLmeal? I've only excluded the lowercase version. I need to exclude much more:
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<XMLmeal>chicken</XMLmeal>
<xmldrink>diet coke</xmldrink>
<Xmlsnack>almonds</Xmlsnack>
</flight>
</start>
I can get fancier. I'm not using the /i flag for case insensitivity because the start and end tag need to match exactly. I can, however, turn on case insensitivity for part of a pattern with (?i), and everything past that will ignore case:
my $pattern = qr/
< # start tag
(?<tag>
(?i) # ignore case starting here
(?!xml)
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
But, inside grouping parentheses, the (?i) is in effect only until the end of that group. I can limit which part of my pattern ignores case. The (?: ... ) groups without capturing (so doesn't disturb what $1 or $2 capture):
(?: (?i) (?!xml) )
Now my pattern excludes those three tags I added:
my $pattern = qr/
< # start tag
(?<tag>
(?: (?i) (?!xml) ) # not XmL in any case
[a-zA-Z_] [a-zA-Z_0-9.-]+
)
>
(?<text> .*? ) # the interesting text, non-greedily
< \/ \k<tag> > # end tag
/x;
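To check that claim, this sketch runs the final pattern over the extra nodes from the modified input (the `@lines` list simply repeats those lines in memory):

```perl
use strict;
use warnings;

my $pattern = qr/
    <
    (?<tag>
        (?: (?i) (?!xml) )    # not xml, XML, Xml, ... at the start
        [a-zA-Z_] [a-zA-Z_0-9.-]+
    )
    >
    (?<text> .*? )
    < \/ \k<tag> >
/x;

my @lines = (
    '<pilot> Holland, Tom</pilot>',
    '<XMLmeal>chicken</XMLmeal>',
    '<xmldrink>diet coke</xmldrink>',
    '<Xmlsnack>almonds</Xmlsnack>',
);

my @kept;
for my $line (@lines) {
    push @kept, $+{text} if $line =~ $pattern;
}

print "[$_]\n" for @kept;    # [ Holland, Tom]
```

Only the pilot line survives; the brackets in the output make the leading space visible.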
Some Mojo
So far, none of what I've presented handles attributes in the tags, which you want to ignore anyway. You should be able to add those to the regex yourself. But, I'll shift gears into other ways to handle XML-like things.
Here's a Mojolicious program that understands XML and can extract things. Since it's a real Document Object Model (DOM) parser, it doesn't care about lines.
#!perl
use Mojo::DOM;
my $not_xml = <<~'HERE';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
HERE
Mojo::DOM->new( $not_xml )->xml(1)
->find( 'flight *' )
->map( 'text' )
->each( sub { print "$_ " } );
print "\n";
The find uses a CSS Selector to decide what it wants to process. The selector flight * is all the child nodes inside flight (so, any child tag no matter its name). The map calls the text method on each portion of the tree that find produces, and each outputs each result. It's very simple because someone has already done all of the hard work.
But, Mojo::DOM is not appropriate for every situation. It wants to know the entire tree at once, and for very large documents that's a burden on memory. There are "streaming" parsers that can handle that.
Twiggy
The problem you present in the original question is different from the homework that you posted in the comments. You want to transform text based on which tag it comes from. That is a different sort of problem altogether.
XML::Twig is useful for processing different node types differently. It has the added advantage that it doesn't need the entire XML tree in memory at one time.
Here's an example that uses two different handlers for the pilot and major portion. When Twig runs into those nodes, it calls the appropriate subroutine that you referenced in twig_handlers. I won't explain the particular Perl features here:
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => {
pilot => \&pilot,
major => \&major,
},
);
sub pilot {
my( $twig, $e ) = @_;
my $text = $e->text;
$text =~ s/,\s.\K.*/./;
print $text, ' ';
$twig->purge;
}
sub major {
my( $twig, $e ) = @_;
print '"' . $e->text . '"' . ' ';
$twig->purge;
}
my $xml = <<~'HERE';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
HERE
$twig->parse($xml);
This outputs:
Holland, T. "Aeronautics Engineer"
Now you'd complete that with subroutines for all of the other things that you want to process.
Forenote
Based on comments made after this answer was posted, this is an assignment where the teacher is encouraging the OP to make numerous bad assumptions about XML. They are teaching them to do exactly what one should never do. If the teacher were to define a format, that would be fine; it wouldn't be XML but merely something inspired by XML. But they didn't do that. They explicitly stated it was XML. I can't help the OP any further because
I won't teach how to do this incorrectly,
doing it correctly without using an existing module would require a time expenditure that's too large,
doing it correctly without using an existing module would fall outside the scope of the site, and
I don't even know what the teacher wants (having been provided the exact wording of the assignment).
What follows is an answer to the question asked (as opposed to a solution to the OP's homework).
Answer
You are trying to parse XML. There are existing XML parsers you can use instead of spending considerable effort writing your own. I personally use XML::LibXML.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $doc = XML::LibXML->new->parse_file("flight.txt");
for my $flight_node ($doc->findnodes("/start/flight")) {
my $pilot = $flight_node->findvalue("pilot");
my $major = $flight_node->findvalue("major");
my $price = $flight_node->findvalue("price");
my $date = $flight_node->findvalue("date");
my $details = $flight_node->findvalue("details");
say "$pilot \"$major\" $price $date $details";
}
Just to give you some hints:
Your code is "ok" but
my $text = <>;
in your while loop is wrong. You already have the line in $row, so just use $row instead.
Your row also contains a linefeed at the end, so you may want to remove it before printing:
chomp($row);
So wrapping it up:
chomp($row);
$row =~ s/<[^>]*>//g;
print $row . " ";
might be the code in your while-loop you are looking for. And for extra grades, start thinking how to remove unnecessary white space at the beginning/end.
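Putting those hints together, here is one possible sketch of the whole loop, including the whitespace trimming. To keep it self-contained it reads from an in-memory filehandle instead of the real file; that substitution, and the choice to collect the fields in `@fields`, are assumptions made for the example.

```perl
use strict;
use warnings;

my $data = <<'END';
<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>
END

# Open the string as if it were a file (stands in for the real input file).
open my $input, '<', \$data or die "Could not open in-memory file: $!";

my @fields;
while ( my $row = <$input> ) {
    chomp $row;
    $row =~ s/<[^>]*>//g;      # drop anything that looks like a tag
    $row =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    push @fields, $row if length $row;
}
close $input;

my $out = join ' ', @fields;
print "$out\n";
# Holland, Tom Aeronautics Engineer Boeing 200 06-09-1969 Flight from DC to VA.
```

This only strips the tags; reformatting the name, major and date is left as the next step.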
I need to get the date which could be in 3 possible format.
11/20/2012
11.20.2012
11-20-2012
How could I achieve this in Perl. I'm trying RegEx to get what I want. Here's my code.
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012"); #array values may vary in every run
foreach my $date (@dates){
$date =~ /[-.\/\d+]/g;
print "Date: $date \n";
}
I want the output to be. (code above doesn't print anything)
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
Where am I wrong? Please Help. Thanks
Note: I want to achieve this without using any CPAN module as much as possible. I know there are a lot of CPAN modules that could provide what I want.
Your code almost produces what you want. I assume your input is a bit more complicated, or you have posted code that you are not actually running.
Either way, the problem is this
$date =~ /[-.\/\d+]/g;
First off, your + quantifier is inside the character class; it should come after it. Second, this is just a pattern match: you need to use it in list context and store its return value:
my ($match) = $date =~ /[-.\/\d]+/g;
print "Date: $match\n";
Then it will return the first of the strings found that contains one or more of dash, period, slash or a number. Be aware that it will match other things as well, as it is a rather unstrict regex.
Why does it work? Because a pattern match in list context returns a list of the matches when the global /g modifier is used.
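The difference between the contexts can be seen in a short sketch (the sample string with two dates is made up for the illustration):

```perl
use strict;
use warnings;

my $text = 'from 11/20/2012 to 11-21-2012';

my $found   = $text =~ /[-.\/\d]+/;      # scalar context: only true or false
my ($first) = $text =~ /([-.\/\d]+)/;    # list context: the captured text
my @all     = $text =~ /[-.\/\d]+/g;     # list context with /g: every match

print "found: ", ( $found ? "yes" : "no" ), "\n";    # found: yes
print "first: $first\n";                              # first: 11/20/2012
print "all:   @all\n";                                # all:   11/20/2012 11-21-2012
```

Without the parentheses around `$first`, the assignment would be in scalar context and store only the success flag.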
I highly recommend the DateTime::Format::Strptime module, which has a rich set of functionality. Think not only of parsing strings, but also of checking that the date is valid.
Why not search for the formats one at a time?
=~ m!(\d{2}/\d{2}/\d{4}|\d{4}\.\d{2}\.\d{2}|\d{2}-\d{2}-\d{4})!
should do the trick. Other than that, there's a module dealing with dates called DateTime.
Try matching the formats in turn. The regex below matches any of your permitted separators (/, ., or -) and then requires the same separator via backreference (\2 or \3). Otherwise, you have three possible separators times two possible positions for the year to make six alternatives in your pattern.
#! /usr/bin/env perl
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr<
\b # begin on word boundary
(
(?: [0-9][0-9] ([-/.]) [0-9][0-9] \2 [0-9][0-9][0-9][0-9])
| (?: [0-9][0-9][0-9][0-9] ([-/.]) [0-9][0-9] \3 [0-9][0-9])
)
\b # end on word boundary
>x;
foreach my $date (@dates) {
if (my($match) = $date =~ /$date_pattern/) {
print "Date: $match\n";
}
}
Output:
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
On my first try at the code above, I had \2 in the YYYY-MM-DD alternative where I should have had \3, which failed to match. To spare us counting parentheses, version 5.10.0 added named capture buffers.
Named Capture Buffers
It is now possible to name capturing parenthesis in a pattern and refer to the captured contents by name. The naming syntax is (?<NAME>....). It's possible to backreference to a named buffer with the \k<NAME> syntax. In code, the new magical hashes %+ and %- can be used to access the contents of the capture buffers.
Using this handy feature, the code above becomes
#! /usr/bin/env perl
use 5.10.0; # named capture buffers
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
\b # begin on word boundary
(?<date>
(?: [0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9][0-9][0-9])
| (?: [0-9][0-9][0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9])
)
\b # end on word boundary
!x;
foreach my $date (@dates) {
if ($date =~ /$date_pattern/) {
print "Date: $+{date}\n";
}
}
and produces the same output.
The code above still contains a lot of repetition. Using the (DEFINE) special case combined with named captures, we can make the pattern much nicer.
#! /usr/bin/env perl
use 5.10.0;
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
\b (?<date> (?&YMD) | (?&DMY)) \b
(?(DEFINE)
(?<SEP> [-/.])
(?<YYYY> [0-9][0-9][0-9][0-9])
(?<MM> [0-9][0-9])
(?<DD> [0-9][0-9])
(?<YMD> (?&YYYY) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&DD))
(?<DMY> (?&DD) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&YYYY))
)
!x;
foreach my $date (@dates) {
if ($date =~ /$date_pattern/) {
print "Date: $+{date}\n";
}
}
Yes, the subpattern named DMY also matches dates in MDY form. For now it suffices, and you ain’t gonna need it.
I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grab" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language, so I'll just use Perl. Instead of matching everything, I matched exactly what I thought you needed. Please correct me if I am wrong.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
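If you are not on PHP, the same idea carries over. Below is a rough Python sketch, assuming the three parameter names used above; `match_route`, `NAME` and `PAIR` are names invented for this example. `m.group(0)` plays the role of `$matches[0]`, and `m.groups()` holds the captured values, with `None` for optional groups that did not take part in the match.

```python
import re

# Build the three-slot pattern from one reusable piece instead of repeating it.
NAME = r"(?:firstparam|secondparam|thirdparam)"
PAIR = rf"(?:{NAME}=(\w+)&?)"
ROUTE = re.compile(rf"/parent/child/?\?{PAIR}{PAIR}?{PAIR}?")

def match_route(url):
    """Return (whole_match, captured_values), or None if url does not match."""
    m = ROUTE.match(url)
    if m is None:
        return None
    return m.group(0), m.groups()

print(match_route("/parent/child?firstparam=abc123&secondparam=def456"))
print(match_route("/parent/child/firstparam=abc123&secondparam=def456"))  # prints None
```

The second call returns `None` because the pattern requires a literal "?" before the first parameter.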
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my @items = split '&', $query; # Each item is something like param=123
foreach my $item (@items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
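To show how little changes when porting this, here is a hedged Python sketch of the same approach: a regex for the route, then plain string splitting for the parameters. The function name `parse_route` is my own, and a dict stands in for the Perl hash.

```python
import re

def parse_route(route):
    """Return a dict of query parameters, or None if the route does not match."""
    # Use the regex only to recognise the route and grab the query string.
    m = re.match(r"/parent/child/\?(.*)", route)
    if m is None:
        return None
    params = {}
    # Split the query string by hand instead of with a bigger regex.
    for item in m.group(1).split("&"):
        if not item:
            continue  # skip empty pieces, e.g. when the query string is empty
        name, _, value = item.partition("=")
        params[name] = value
    return params

params = parse_route("/parent/child/?first=123&second=345&third=678")
if params is not None and params.get("first") == "123":
    print("first is 123")
```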
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explanation:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you only need the query string, the pattern reduces to:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, I check whether the line contains a ? at all.
Then I cut off the first part of the line (everything up to and including the ?).
Next, I split the rest on &, and split each piece on = to get the name and the value.
use v5.10; # enables say
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my @params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (@params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}
If I have some HTML like this:
<b>1<i>2</i>3</b>
And the following regex:
\<[^\>\/]+\>(.*?)\<\/[^\>]+\>
Then it will match:
<b>1<i>2</i>
I want it to only match HTML where the start and end tags are the same. Is there a way to do this?
Thanks,
Joe
Is there a way to do this?
Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.
Numbered Captures
Pretending for the nonce that HTML <i> and <b> tags are always devoid of attributes, and moreover, neither overlap nor nest, we have this simple solution:
#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
say "$1: $2" while m{
< ( [ib] ) >
(
(?:
(?! < /? \1 > ) .
) *
)
</ \1 >
}gsix;
}
Which when run, produces this:
$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
Named Captures
It would be better to use named captures, which leads to this equivalent solution:
#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
say "$+{name}: $+{contents}" while m{
< (?<name> [ib] ) >
(?<contents>
(?:
(?! < /? \k<name> > ) .
) *
)
</ \k<name> >
}gsix;
}
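For readers outside Perl, the non-nested case can be sketched with Python's standard `re` module too, since it supports named groups and backreferences (it does not support the recursion used further down). This is only an illustration under the same assumptions as above: attribute-free <i> and <b> tags that neither overlap nor nest. The names `TAG` and `tagged_spans` are invented for the example.

```python
import re

TAG = re.compile(
    r"""
    < (?P<name> [ib] ) >                  # opening tag, captured as "name"
    (?P<contents>
        (?: (?! </? (?P=name) > ) . )*    # any char that does not start the same tag
    )
    </ (?P=name) >                        # closing tag must repeat the same name
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

def tagged_spans(text):
    """Return (tag name, contents) pairs for each matching tag pair in text."""
    return [(m.group("name"), m.group("contents")) for m in TAG.finditer(text)]

print(tagged_spans("i got <i>foo</i> and <b>bar</b> bits go here"))
# prints [('i', 'foo'), ('b', 'bar')]
```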
Recursive Captures
Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it therefore requires a recursive pattern to solve. Remembering that the trivial pattern to parse nested parens recursively is simply:
( \( (?: [^()]++ | (?-1) )*+ \) )
I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.
#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration
while (my $line = <>) {
my @input = ( $line );
while (@input) {
my $cur = shift @input;
while ($cur =~ m{
< (?<name> [ib] ) >
(?<contents>
(?:
[^<]++
| (?0)
| (?! </ \k<name> > )
.
) *+
)
</ \k<name> >
}gsix)
{
say "$+{name}: $+{contents}";
push @input, $+{contents};
}
}
}
Which when demo’d produces this:
$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar
That’s still fairly simple, so if it works on your data, go for it.
Grammatical Patterns
However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.
As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.
For example, this knows the attributes germane to the <i> (or <b>) tag. Here we defined regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs but now for regexes:
(?(DEFINE) # begin regex subroutine defs for grammatical regex
(?<i_tag_end> < / i > )
(?<i_tag_start> < i (?&attributes) > )
(?<attributes> (?: \s* (?&one_attribute) ) *)
(?<one_attribute>
\b
(?&legal_attribute)
\s* = \s*
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?&standard_attribute)
| (?&event_attribute)
)
(?<standard_attribute>
class
| dir
| ltr
| id
| lang
| style
| title
| xml:lang
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<event_attribute>
on click
| on dbl click
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?&quoted_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.
SUMMARY
I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:
You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
You might consider such concepts as recursive and grammatical patterns too complicated for you to easily understand.
You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.
Any one or more of those might well apply. In which case, don’t do it this way.
For simple canned examples, this route is easy. The more robust you want this to work on things you’ve never seen before, the harder this route becomes.
Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or even worse, Javascript. Those are barely any better than the Unix grep program, and in some ways, are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.
But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.
Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.
This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so equitably well.
You probably don't want to use regular expressions with HTML.
But if you still want to do this you need to take a look at backreferences.
Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
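As a small illustration, here is a sketch in Python that takes the pattern from the question and replaces the closing-tag part with a backreference. `PAIRED` and `first_pair` are names I made up. It still breaks when a tag is nested inside a tag of the same name, so treat it as a demonstration of backreferences rather than an HTML parser.

```python
import re

# Group 1 captures the tag name; \1 forces the closing tag to repeat it.
PAIRED = re.compile(r"<([^>/]+)>(.*?)</\1>")

def first_pair(html):
    """Return the first start/end tag pair with matching names, or None."""
    m = PAIRED.search(html)
    return m.group(0) if m else None

print(first_pair("<b>1<i>2</i>3</b>"))
# prints <b>1<i>2</i>3</b>
```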
Related issues:
RegEx match open tags except XHTML self-contained tags