Here are the line formats in my text file:
S1,F2 title including several white spaces (abbr) single,Here<->There,reply
S1,F2 title including several white spaces (abbr) single,Here<->There
S1,F2 title including several white spaces (abbr) single,Here<->There,[reply]
How can I change my regex to work on all three forms above?
/^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)[,](.*?)$/
I tried replacing (.*?)$/ with [.*?]$/, but it doesn't work. I guess I shouldn't use [] (square brackets) to match the possible word [reply] (including the []).
Actually, my more general question is how to match optional characters better in Perl regular expressions. I looked through the online perldoc pages, but at my level of Perl knowledge it is hard to find the useful information. That is why I am asking such basic questions.
I would appreciate your comments and suggestions.
What about using negated character classes:
/^S(\d),F(\d)\s+([^()]*?)\s+\(([^()]+)\)\s+([^,]*),([^,]*)(?:,(.*?))?$/
When incorporated into this script:
#!/bin/perl
use strict;
use warnings;
while (<>)
{
chomp;
my($s,$f,$title,$abbr,$single,$here,$reply) =
$_ =~ m/^S(\d),F(\d)\s+([^()]*?)\s+\(([^()]+)\)\s+([^,]*),([^,]*)(?:,(.*?))?$/;
$reply ||= "<no reply>";
print "S$s F$f <$title> ($abbr) $single : $here : $reply\n";
}
And run on the original data file, it produces:
S1 F2 <title including several white spaces> (abbr) single : Here<->There : reply
S1 F2 <title including several white spaces> (abbr) single : Here<->There : <no reply>
S1 F2 <title including several white spaces> (abbr) single : Here<->There : [reply]
You should probably also use the 'xms' suffix to the expression to allow you to document it more easily:
#!/bin/perl
use strict;
use warnings;
while (<>)
{
chomp;
my($s,$f,$title,$abbr,$single,$here,$reply) =
$_ =~ m/^
S(\d) , # S1
F(\d) \s+ # F2
([^()]*?) \s+ # Title
\(([^()]+)\) \s+ # (abbreviation)
([^,]*) , # Single
([^,]*) # Here or There
(?: , (.*?) )? # Optional reply
$
/xms;
$reply ||= "<no reply>";
print "S$s F$f <$title> ($abbr) $single : $here : $reply\n";
}
I confess I'm still apt to write one-line monsters - I'm trying to mend my ways.
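For comparison, here is a minimal Python sketch of the same commented pattern, with re.VERBOSE standing in for Perl's /x. The pattern is carried over from the answer above; the names LINE_RE and parse_line are made up for this illustration.

```python
import re

# Same pattern as the Perl answer; re.VERBOSE plays the role of /x.
LINE_RE = re.compile(r"""
    ^
    S(\d) ,                # S1
    F(\d) \s+              # F2
    ([^()]*?) \s+          # title
    \( ([^()]+) \) \s+     # (abbreviation)
    ([^,]*) ,              # single
    ([^,]*)                # Here or There
    (?: , (.*?) )?         # optional reply
    $
""", re.VERBOSE)


def parse_line(line):
    """Return the seven fields as a tuple, or None if the line does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    fields = list(m.groups())
    if not fields[6]:      # reply group absent (None) or empty
        fields[6] = "<no reply>"
    return tuple(fields)


if __name__ == "__main__":
    base = "S1,F2 title including several white spaces (abbr) single,Here<->There"
    for suffix in (",reply", "", ",[reply]"):
        print(parse_line(base + suffix))
```

The optional non-capturing group (?: , (.*?) )? is what lets all three line forms match; when it is skipped, the reply group is simply unset.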
Brackets in a regular expression are reserved for declaring a set of characters (a character class) that you want to match. So, to match a literal bracket, you need to escape it (\[ or \]) or enclose it in a class ([[] or []]), which admittedly looks obfuscated.
Try (\[.*?\]|.*?) to allow for the optional brackets.
Try
/^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)(,(\[reply\]|reply))?$/
This will match the optional (?) part ,(\[reply\]|reply) which is either ,[reply] or ,reply, i.e.,
(nothing)
,reply
,[reply]
BTW, your [,] means "one character of the following: ,". Exactly the same as a literal , within the regex. If you wanted to make your [,](.*?)$ work, you should use (,(.+))?$ to match either nothing or a comma followed by any (non-empty) string.
EDIT
If the following are also valid:
S1,F2 title including several white spaces (abbr) single,Here<->There,[reply
S1,F2 title including several white spaces (abbr) single,Here<->There,reply]
Then you could use (,\[?reply\]?)? at the end.
You can make the last part optional by using the (?:..)? as:
^S(\d),F(\d)\s+(.*?)\((.*?)\)\s+(.*?),(.*?)(?:,(.*))?$
I am trying to split texts into "steps"
Let's say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex, so help would be great!
I've tried many combinations like:
split /(\s\d.)/
But it splits the numbering away from the text.
I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
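The same lookahead split can be sketched in Python with re.split. This is only a port of the one-liner above for illustration; the variable name parts is arbitrary.

```python
import re

steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!"

# Split on whitespace only where a step number ("digits + period") follows.
# The lookahead (?=...) checks for the number without consuming it,
# so the numbering stays attached to its step.
parts = re.split(r"\s+(?=\d+\.)", steps)

for part in parts:
    print(part)
```

Because the lookahead is zero-width, only the whitespace is removed and each element still starts with its number.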
All step-descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;
This works only if there are surely no numbers in the steps' description, like any approach relying on matching a number (even if followed by a period, for decimal numbers)†
If there may be numbers in there, we'd need to know more about the structure of the text.
Another delimiting pattern to consider is punctuation that ends a sentence (. and ! in these examples), if there are no such characters in steps' description and there are no multiple sentences
my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;
Augment the list of patterns that end an item's description as needed, say with a ?, and/or ." sequence as punctuation often goes inside quotes.‡
If an item can have multiple sentences, or use end-of-sentence punctuation mid-sentence (as part of a quotation, perhaps), then tighten the condition for an item's end by combining the two approaches from the footnotes: end-of-sentence punctuation that is followed by number+period
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;
If this isn't good enough either then we'd really need a more precise description of that text.
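As a rough illustration, the combined condition (closing punctuation followed by the next item number or the end of the text) can be tried out in Python. This is a sketch, not a definitive parser: the sample text is invented, and \Z is used because older versions of Python's re module do not accept Perl's \z.

```python
import re

ITEM_RE = re.compile(r"""
    [0-9]+\.                  # item number and its period
    .*?                       # description, as short as possible
    (?: \." | !" | [.!] )     # end-of-sentence punctuation, possibly before a closing quote
    (?= \s+[0-9]+\. | \Z )    # must be followed by the next item number or end of text
""", re.VERBOSE | re.DOTALL)

# Invented sample: a quoted sentence, a bare number mid-item, two sentences in one item.
text = '1.Do "this." 2.Version 2 is out. Read it! 3.Complete!'

items = ITEM_RE.findall(text)
for item in items:
    print(item)
```

The period after "out" does not end item 2 because it is not followed by a number and a period; the item only ends at "it!".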
† An approach using a "numbers-period" pattern to delimit item's description, like
/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;
(or in a lookahead in split) fails with text like
1. Only $2.50 or 1. Version 2.4.1 ...
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
The following sample code demonstrates the power of regex by filling the %steps hash in one line of code.
Once you have the data, you can slice and dice it any way your heart desires.
Inspect the sample for compliance with your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%steps,$re);
$str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re = qr/(\d+)\.(\D+)\./;
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort keys %steps;
Output
$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that'
};
1. Do this
2. Then do that
3. And then maybe that
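Note that step 4 is missing from the output because it ends with "!" rather than ".". A small Python sketch of the same idea shows both the direct port and a variant that also accepts "!"; the variant and the name all_steps are assumptions for illustration, not part of the answer above.

```python
import re

text = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!"

# Direct port of the Perl pattern: each match yields a (number, description) pair,
# and dict() turns the list of pairs into a mapping.
steps = dict(re.findall(r"(\d+)\.(\D+)\.", text))

# Variant (an assumption): a lazy description that may end in "." or "!".
all_steps = dict(re.findall(r"(\d+)\.(\D+?)[.!]", text))

for number in sorted(all_steps, key=int):
    print(f"{number}. {all_steps[number]}")
```

Both patterns still rely on the descriptions containing no digits.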
I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
   Note 1                         Note 2
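Note 1: in list context, a match with capture groups returns the list of captured items, so the assignment target needs to be a list. In scalar context (as in your attempt) it only returns whether the match succeeded.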
Note 2: allow one or more blanks after anywhere
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the word character class \w (and its negation \W):
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables individually, like (my $a, our $UGLY_GLOBAL, local $_, @_).
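In Python the list-versus-scalar distinction does not exist in this form: re.search returns a match object, and the capture has to be pulled out explicitly. The sketch below ports the \b anywhere \W+ (\w+) idea; the helper name word_after is hypothetical.

```python
import re


def word_after(text, keyword):
    """Return the word following `keyword`, skipping spaces and punctuation."""
    # \b anchors the keyword at a word boundary; \W+ skips any non-word
    # characters (spaces, ";", ",") before the captured word.
    pattern = r"\b" + re.escape(keyword) + r"\W+(\w+)"
    m = re.search(pattern, text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(word_after("Today i'm not going anywhere except to office.", "anywhere"))
    print(word_after("Today i'm not going anywhere; except to office.", "anywhere"))
```

Returning None when there is no match mirrors the Perl behaviour, where $w_after stays undefined if the pattern fails.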
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
([^\s]+) captures the word between two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);
I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there will be instances where a comma is immediately followed by a space and then another word, as that is the correct way of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
                                                     ^        ^
                                                     |        |
                                     note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have tried a number of ways of doing this but cannot work out the best way to construct the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match ($&) on commas and pushes the results onto @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
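For readers who want to experiment outside Perl, here is a minimal Python sketch of the same match-then-split loop. The pattern is the one from the answer (with a non-capturing group); the names LIST_RE and comma_lists are invented for the example.

```python
import re

# One or more "not comma, not whitespace" runs joined by commas with no spaces.
LIST_RE = re.compile(r"[^,\s]+(?:,[^,\s]+)+")


def comma_lists(text):
    """Collect the items of every space-free comma-separated list in text."""
    items = []
    for m in LIST_RE.finditer(text):
        items.extend(m.group(0).split(","))
    return items


if __name__ == "__main__":
    sample = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose"
    print("\n".join(comma_lists(sample)))
```

Ordinary prose such as "planet, however, this" produces no match, because the character after each comma is a space.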
Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Keep only the elements that have no comma at either end but have one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
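The same right-to-left pipeline reads left-to-right as a Python list comprehension. This is only a sketch of the split/grep/map idea; comma_items is a hypothetical name.

```python
def comma_items(line):
    """Split on whitespace, keep tokens with inner commas only, then split those."""
    return [
        part
        for token in line.split()                      # split the line on spaces
        if "," in token                                # must contain a comma...
        and not token.startswith(",")                  # ...but not at the start
        and not token.endswith(",")                    # ...and not at the end
        for part in token.split(",")                   # split each kept token
    ]


if __name__ == "__main__":
    print(comma_items("dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"))
```

Adding another filter, such as rejecting tokens that contain ",,", is a matter of appending one more condition.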
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my @matches = map {
if ($_ =~ /$regex/) {
# $1 is only meaningful after a successful match
[split /,/, $1];
}
else {
[]
}
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (@arr, $&);
}
the regular expression matches three cases :
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
I have entries like this:
XYZABC------------HGTEZCW
ZERTAE------------RCBCVQE
I would like to get just HGTEZCW and RCBCVQE .
I would like to use a generic regex.
$temp=~ s/^\s+//g; (1)
$temp=~ s/^\w+[-]+//g; (2)
If I use (1) followed by (2), it works:
I get HGTEZCW, then RCBCVQE ...
I would like to know if it is possible to do that in one line, like:
$temp=~ s/^\s+\w+[-]+//g; (3)
When I use (3), I get this result: XYZABC------------HGTEZCW
I don't understand why it is not possible to combine (1) and (2) in one line.
Sorry, my entries were:
XYZABC------------HGTEZCW
ZERTAE------------RCBCVQE
Also, regex (1) removes the spaces, and when I then use regex (2), it removes XYZABC------------ .
But the combination (3) doesn't work.
I have this: XYZABC------------HGTEZCW
@Tim So there always is whitespace at the start of each string?
yes
Your regex (1) removes whitespace from the start of the string. So it does nothing on your example strings.
Regex (2) removes all alphanumerics from the start of the string plus any following dashes, returning whatever follows the last dash.
If you combine both, the regex fails because there is no whitespace \s+ could match - therefore the entire regex fails.
To fix this, simply make the whitespace optional. Also you don't need to enclose the - in brackets:
$temp=~ s/^\s*\w+-+//g;
This should do the trick.
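The difference between \s+ and \s* is easy to check in isolation. The following Python sketch applies both versions to one entry without leading whitespace and one with it; the second entry's leading spaces are an assumption added for the demonstration.

```python
import re

entries = ["XYZABC------------HGTEZCW", "   ZERTAE------------RCBCVQE"]

# \s+ demands at least one whitespace character, so the first entry is left alone.
strict = [re.sub(r"^\s+\w+-+", "", e) for e in entries]

# \s* also accepts "no whitespace at all", so both entries are trimmed.
relaxed = [re.sub(r"^\s*\w+-+", "", e) for e in entries]

print(strict)
print(relaxed)
```

A substitution whose pattern cannot match leaves the string untouched, which is exactly the symptom described in the question.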
$Str = '
XYZABC------------HGTEZCW
ZERTAE------------RCBCVQE
';
@Matches = ($Str =~ m#^.+-(\w+)$#mg);
print join "\n",@Matches ;
If you only need the last seven characters of each entry, you could do the following:
my ($last7) = $temp =~ /(.{7})$/;
I was wondering how to turn a paragraph into bullet sentences.
before:
sentence1. sentence2. sentence3. sentence4. sentence5. sentence6. sentence7.
after:
sentence1.
sentence2.
sentence3.
sentence4.
sentence5.
Since all the other answers so far show how to do it in various programming languages and you have tagged the question with Vim, here's how to do it in Vim:
:%s/\.\(\s\+\|$\)/.\r\r/g
I've used two carriage returns to match the output format you showed in the question. There are a number of alternative regular expression forms you could use:
" Using a look-behind
:%s/\.\@<=\( \|$\)/\r\r/g
" Using 'very magic' to reduce the number of backslashes
:%s/\v\.( |$)/.\r\r/g
" Slightly different formation: this will also break if there
" are no spaces after the full-stop (period).
:%s/\.\s*$\?/.\r\r/g
and probably many others.
A non-regexp way of doing it would be:
:let s = getline('.')
:let lineparts = split(s, '\.\@<=\s*')
:call append('.', lineparts)
:delete
See:
:help pattern.txt
:help change.txt
:help \@<=
:help :substitute
:help getline()
:help append()
:help split()
:help :d
You can use a regex
/\.( |$)/g
That will match the end of the sentence, then you can add newlines.
Or you can use some split function with ". " (dot space) and "." (dot), then join with newlines.
Just replace every space that follows a period, /(?<=\.) /, with two newline characters, \n\n (the look-behind leaves the period in place). The syntax would of course depend on the language you are using.
Using Perl:
perl -e "$_ = <>; s/\.\s*/.\n/g; print"
Longer, somewhat more readable version:
my $input = 'foo. bar. baz.';
$input =~ s/
\. # A literal '.'
\s* # Followed by 0 or more space characters
/.\n/gx; # g for all occurrences, x to allow comments and whitespace in regex
print $input;
Using Python:
import re
input = 'foo. bar. baz.'
print(re.sub(r'\.\s*', '.\n', input))
An example using Ruby:
ruby-1.9.2 > a = "sentence1. sentence2. sentence3. and array.split(). the end."
=> "sentence1. sentence2. sentence3. and array.split(). the end."
ruby-1.9.2 > puts a.gsub(/\.(\s+|$)/, ".\n\n")
sentence1.
sentence2.
sentence3.
and array.split().
the end.
It goes like, for every . followed by (1 whitespace character or more, or followed by end of line), replace it with just . and two newline characters.
using awk
$ awk '{$1=$1}1' OFS="\n" file
sentence1.
sentence2.
sentence3.
sentence4.
sentence5.
sentence6.
sentence7.
In PHP:
<?php
$input = "sentence. sentence. sentence.";
$output = preg_replace("/(.*?)\\.[\\s]+/", "$1\n", $input);
?>
Also, regular expressions are a blast, but not necessary for this problem. You can also try:
<?php
$input = "sentence. sentence. sentence.";
$arr = explode('.', $input);
foreach ($arr as $k => $v) $arr[$k] = trim($v);
$output = implode("\n", $arr);
?>
I figured out how to do this in RegExr
Search String is
(\-=?\s+)
--
Replace String is
\n\n
This is the generated information for the current regex
RegExp: /(\-=?\s+)/g
pattern: (\-=?\s+)
flags: g
capturing groups: 1
group 1: (\-=?\s+)
This will find every - in the sentence below and replace it with two newlines
Sentence 1- Sentence 2- Sentence 3- Sentence 4- Sentence 5-
The end result is
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
I have a really simple, naive solution using a capturing group:
:%s/\([.!?]\)\s*/\1\r\r/g
The main drawback is that it won't handle ellipses or multiple punctuation marks.
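One possible way around the ellipsis limitation, sketched here in Python rather than Vim, is to split on the whitespace that follows closing punctuation instead of on the punctuation itself. The function name to_bullets is made up, and the approach still breaks on abbreviations such as "e.g. this".

```python
import re


def to_bullets(paragraph):
    """Put each sentence on its own line, separated by a blank line."""
    # Split on whitespace preceded by ., ! or ?; runs such as "..." or "?!"
    # contain no whitespace, so they stay attached to their sentence.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return "\n\n".join(sentences)


if __name__ == "__main__":
    print(to_bullets("sentence1. sentence2. sentence3."))
    print(to_bullets("Wait... What?! Done."))
```

Since the look-behind consumes nothing, the punctuation is kept and only the separating whitespace is replaced.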