Perl Pattern Matching Question - regex

I am trying to match patterns in Perl and need some help.
I need to delete from a string anything that matches [xxxx], i.e. an opening bracket, whatever is inside it, and the first closing bracket that follows.
So I am trying to replace the opening bracket, its contents, and the first closing bracket with a space, using the following code:
if($_ =~ /[/)
{
print "In here!\n";
$_ =~ s/[(.*?)]/ /ig;
}
Similarly, I need to match <xxxx>, i.e. an opening angle bracket, whatever is inside it, and the first closing angle bracket.
I am doing that using the following code:
if($_ =~ /</)
{
print "In here!\n";
$_ =~ s/<(.*?)>/ /ig;
}
This somehow does not seem to work. My sample data is below:
'Joanne' <!--Her name does NOT contain "Kathleen"; see the section "Name"--> "'Jo'" 'Rowling', OBE [http://news bbc co uk/1/hi/uk/793844 stm Caine heads birthday honours list] BBC News 17 June 2000 Retrieved 25 October 2000 , [http://content scholastic com/browse/contributor jsp?id=3578 JK Rowling Biography] Scholastic com Retrieved 20 October 2007 better known as 'J K Rowling' ,<ref name=telegraph>[http://www telegraph co uk/news/uknews/1531779/BBCs-secret-guide-to-avoid-tripping-over-your-tongue html Daily Telegraph, BBC's secret guide to avoid tripping over your tongue, 19 October 2006] is a British <!--do not change to "English" or "Scottish" until issue is resolved --> author best known as the creator of the [[Harry Potter]] fantasy series, the idea for which was conceived whilst on a train trip from Manchester to London in 1990 The Potter books have gained worldwide attention, won multiple awards, sold more than 400 million copies and been the basis for a popular series of films, in which Rowling had creative control serving as a producer in two of the seven installments [http://www businesswire com/news/home/20100920005538/en/Warner-Bros -Pictures-Worldwide-Satellite-Trailer-Debut%C2%A0Harry Business Wire - Warner Bros Pictures mentions J K Rowling as producer ]
Any help would be appreciated. Thanks!

You need to use this:
1 while s/\[[^\[\]]*\]/ /g;
Demo:
% echo "i have [some [square] brackets] in [here] and [here] today."| perl -pe '1 while s/\[[^\[\]]*\]/NADA/g'
i have NADA in NADA and NADA today.
Versus the failing:
% echo "i have [some [square] brackets] in [here] and [here] today." | perl -pe 's/\[.*?\]/NADA/g'
i have NADA brackets] in NADA and NADA today.
The recursive regular expression I leave as an exercise for the reader. :)
EDIT: Eric Strom kindly provided a recursive solution, so you don't have to use 1 while:
% echo "i have [some [square] brackets] in [here] and [here] today." | perl -pe 's/\[(?:[^\[\]]*|(?R))*\]/NADA/g'
i have NADA in NADA and NADA today.
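For comparison, the innermost-first loop above can be sketched outside Perl as well. This is a minimal Python sketch and not part of the original answers; the name strip_brackets and its placeholder argument are made up for illustration, and it assumes the placeholder itself contains no brackets. Python's re module has no (?R) recursion, so the sketch repeats the substitution until nothing changes, the same idea as 1 while s///g.

```python
import re

# Matches one innermost [...] group: no brackets are allowed inside it.
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def strip_brackets(text, placeholder=" "):
    """Replace bracketed groups, including nested ones, with placeholder.

    Assumes placeholder contains no brackets, otherwise the loop never ends.
    """
    while True:
        # subn returns the new string and the number of replacements made.
        text, count = INNERMOST.subn(placeholder, text)
        if count == 0:
            return text
```

Each pass removes the innermost groups, so a group nested n levels deep disappears after n passes.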

$_ =~ /someregex/ will not modify $_
Just a note, $_ =~ /someregex/ and /someregex/ do the same thing.
Also, you don't need to check for the existence of [ or <, nor do you need the grouping parentheses:
s/\[.*?\]/ /g;
s/<.*?>/ /g;
will do the job you want.
Edit: changed code to match the fact you're modifying $_

Square brackets have special meaning in the regex syntax, so escape them: /\[.*?\]/. (You also don't need the parentheses here, and doing case-insensitive matching is pointless.)
It's been a long time since I had to wrestle with Perl, but testing $_ with a plain match does not modify $_; only s/// does. You don't need the test anyway; just run the replacement, and if the pattern doesn't match anywhere, then it won't do anything.
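To make the escaping point concrete, here is a small Python sketch of the same two substitutions. It is only an illustration: the function name strip_markup and the shortened sample string (including its URL) are invented here, not taken from the question's data.

```python
import re

def strip_markup(text):
    """Replace each [...] and <...> span (up to the first closer) with a space."""
    text = re.sub(r"\[.*?\]", " ", text)  # square brackets must be escaped
    text = re.sub(r"<.*?>", " ", text)    # angle brackets need no escaping
    return text

# Hypothetical sample, loosely modelled on the question's data.
sample = ("Rowling <ref name=telegraph>[http://example Daily Telegraph] "
          "is a British <!--comment--> author")
cleaned = " ".join(strip_markup(sample).split())  # collapse runs of spaces
```

As in the Perl version, the non-greedy .*? stops at the first closing bracket, so nested brackets such as [[Harry Potter]] leave a stray ] behind.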

RegEx for matching words preceding commas, with exceptions

The section of text I'm targeting always begins with “Also there is” and ends with a period. The single names in between the commas are what I'm trying to target (i.e. "randomperson" in the example below). These names will always be different. It gets tricky because there are other things present that are not single-word “names”. Maybe I can match everything in between the commas ONLY IF it's a single word/name, but I can't seem to figure that one out. The list of names could be much longer or even shorter, so the expression must be dynamic and not just match a set number of names.
Targeted Text:
Also there is a reinforced stone wall, a wooden wall, a stone wall,
randomperson, a lumbering earth elemental, randomperson, randomperson,
randomperson.
(broken over multiple lines for readability)
How do I solve this problem?
Code
sed -r ':a
s/, ([a-zA-Z]*)([,\.])/\n##\1\n\2/
ta
' | sed -n 's/##//gp'
Output
randomperson
randomperson
randomperson
randomperson
Explanation:
Start a loop
sed -r ':a
Find each occurrence of ', oneword,' or ', oneword.' and move the word onto its own line, prefixed with ##. The ## is a magic marker to identify the extracted names later
s/, ([a-zA-Z]*)([,\.])/\n##\1\n\2/
End loop
ta
Filter lines based on ## to extract only oneword
' | sed -n 's/##//gp'
In a program
my $text = "Also there is a reinforced stone wall, a wooden wall, a stone wall, "
. "randomperson, a lumbering earth elemental, randomperson, "
. "randomperson, randomperson.";
my @single_words =
grep { split == 1 }
split /\s*,|\.|\!|;\s*/,
($text =~ /Also there is (.*)/)[0];
The regex on $text gets text after that initial phrase, then split returns the list of strings between commas (or other punctuation), and grep filters out strings that have more than one word†.
On the command line
echo "Also there is a reinforced stone wall, a wooden wall,..., randomperson,..." |
perl -wnE'say for
grep { split == 1 }
split /\s*,|\.|\!|;\s*/, (/Also there is (.*)/)[0]'
The same as above.
Please show us what you have tried for additional explanations and commentary.
† A lone split uses defaults, split ' ', $_, where ' ' is a special pattern that splits on \s+ and discards leading and trailing space. But in the expression split == 1 the split is in scalar context (imposed by the operator ==, which needs a single value on both sides), so it returns the number of elements in the list, which is then compared against 1.
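The same split-then-filter idea can be sketched in Python. This is a rough equivalent under the same assumptions as the Perl version (the list follows "Also there is" and items are separated by punctuation); the function name single_word_names is invented for the example.

```python
import re

def single_word_names(text):
    """Return the one-word items listed after 'Also there is'."""
    match = re.search(r"Also there is (.*)", text, re.DOTALL)
    if match is None:
        return []
    # Split the list on punctuation, then keep items made of exactly one word.
    items = re.split(r"[,.!;]", match.group(1))
    return [item.strip() for item in items if len(item.split()) == 1]

text = ("Also there is a reinforced stone wall, a wooden wall, a stone wall, "
        "randomperson, a lumbering earth elemental, randomperson, "
        "randomperson, randomperson.")
names = single_word_names(text)
```

Here item.split() plays the role of Perl's lone split: it splits on runs of whitespace, so its length is the word count of the item.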

Regex expression matching block of lines

I have this kind of file:
Analysis of its root cause:
Blablablablabla
blabablabkjhjk
kjbsqbdqbds
Details of the fix
blablabla
Analysis of its root cause:
fddsfsdfsdfdsfs
blnskdbbqbbb
xxxxggggggg
Details of the fix
blablabla
Analysis of its root cause is repeated x times in the file. I would like to get the block of text delimited by "Analysis of its root cause" and "Details of the fix".
Thanks a lot for your help.
I'm pretty sure there is some better way to do this, but that's what I could manage:
/(?(?<=Analysis of its root cause:\n)((.*\n)*)(?=Details of the fix\n))/gU
I'm using positive lookahead and lookbehind, and the following modifiers:
g - global - Don't return after first match
U - Ungreedy - Make quantifiers lazy
Try it online: https://regex101.com/r/xpz7pg/2
Not a regex answer, but using perl
Put your lines into a single file.
perl -e '$/="Analysis of its root cause:"; #Sets the record delimiter
while(<>){ #Iterates over the file, record by record
chomp; #Removes the delimiter
if ($_ =~ /\n(.*?)\nDetails of the fix\n(.*)\n/s){ #Captures the text before and after "Details of the fix"; /s lets . match newline
print "ONE:$1\nTWO:$2\n"} # $1 is the analysis, $2 is the details
}' file.txt
Output
ONE:Blablablablabla
blabablabkjhjk
kjbsqbdqbds
TWO:blablabla
ONE:fddsfsdfsdfdsfs
blnskdbbqbbb
xxxxggggggg
TWO:blablabla
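If only the analysis blocks are needed, a lazy multi-line match is enough. The following Python sketch shows one way to do it, assuming the two delimiter lines appear exactly as in the sample file; the names BLOCK and root_cause_blocks are made up for the example.

```python
import re

# Lazily capture everything between the two delimiter lines.
# MULTILINE anchors ^ and $ at line boundaries; DOTALL lets . cross newlines.
BLOCK = re.compile(
    r"^Analysis of its root cause:\n(.*?)^Details of the fix$",
    re.DOTALL | re.MULTILINE,
)

def root_cause_blocks(text):
    """Return the text of each block, without its trailing newline."""
    return [block.rstrip("\n") for block in BLOCK.findall(text)]
```

Calling root_cause_blocks(open("file.txt").read()) would return one string per "Analysis of its root cause:" section.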

Can adding a particular number to a bunch of "time" strings, be done in Regex

I have an "srt" file (the standard movie-subtitle format) like the one at this link: http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitle timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds (+9) to every time entry with a regex?
It's fine even if the milliseconds are set to 000, but the addition of 9 seconds should adhere to the "60 seconds = 1 minute & 60 minutes = 1 hour" rules.
Also, the subtitle text after each timing entry must not be altered by the regex.
By the way, the time format for each time string is "Hours:Minutes:Seconds,Milliseconds".
Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash
offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)
while IFS= read -r line; do
if [[ "$line" =~ $datematch ]]; then
# Gather the start and end times from the regex
start=${BASH_REMATCH[1]}
end=${BASH_REMATCH[3]}
# Replace the time in this line with a printf pattern
linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
# Calculate new times
case "$os" in
Darwin|*BSD)
newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
;;
Linux)
newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
;;
esac
# And print the result
printf "$linefmt" "$newstart" "$newend"
else
# No adjustments required, print the line verbatim.
printf '%s\n' "$line"
fi
done
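Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc. On the BSD/OSX branch, give the offset an explicit sign (+9 or -9), because date -v treats an unsigned value as "set the seconds to this" rather than "add this many seconds".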
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.
No, sorry, you can't. Regular expressions describe regular languages, the most restricted class in the Chomsky hierarchy (see https://en.wikipedia.org/wiki/Chomsky_hierarchy), and they cannot calculate.
But with a general-purpose language like Perl it will work.
It could be a one liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}' movie.srt
with movie.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in the "sub plus9{...}", if you want another delta.
How does it work?
We are looking for lines that match
dd:dd:dd something dd:dd:dd something
and then we call a sub, which adds 9 seconds to matched group one ($1) and group three ($3). All other lines are printed unchanged.
added
If you want to put the perl oneliner in a file, say plus9.pl, you can add newlines ;-)
if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
print plus9($1).$2.plus9($3).$4."\n";
} else {
print $_
}
sub plus9{
($h,$m,$s)=split(/:/,shift);
$t=(($h*60+$m)*60+$s+9);
$h=int($t/3600);
$r=$t-($h*3600);
$m=int($r/60);
$s=$r-($m*60);
return sprintf "%02d:%02d:%02d", $h, $m, $s;
}
Regular expressions strictly do matching and cannot add/subtract. You can match each datetime string using Python, for example, add 9 seconds to that, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups so it's really easy to get each section (you won't need msecond but it's there for visualization, I guess). Note that Python's re module spells named groups as (?P<hour>...) rather than (?<hour>...).
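The match-then-compute approach described in these answers can be sketched in Python as follows. This is only a sketch under stated assumptions: timestamps look like HH:MM:SS,mmm, timing lines are the ones containing "-->", and results below zero are clamped to 00:00:00,000. The names shift_timestamp and shift_srt are invented for the example.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_timestamp(match, offset_ms):
    """Return the matched timestamp moved by offset_ms milliseconds."""
    h, m, s, ms = (int(group) for group in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    total = max(total, 0)             # never go below 00:00:00,000
    total, ms = divmod(total, 1000)   # split off milliseconds
    total, s = divmod(total, 60)      # split off seconds
    h, m = divmod(total, 60)          # what remains is hours and minutes
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def shift_srt(text, offset_seconds):
    """Shift every timing line in SRT text; subtitle lines are left untouched."""
    offset_ms = int(offset_seconds * 1000)
    out = []
    for line in text.splitlines(keepends=True):
        if "-->" in line:
            line = TIMESTAMP.sub(lambda mt: shift_timestamp(mt, offset_ms), line)
        out.append(line)
    return "".join(out)
```

The regex only finds the timestamps; the carry from seconds to minutes to hours happens in ordinary arithmetic inside the callback, which is the part a regex cannot do on its own.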

Regex code for address separated by commas

How can I extract the state, i.e. the text just before the third comma, using only a regex?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?
It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler) as @Jerry points out, assuming the language/tool you're using supports that.
You can use this regex:
(?:[^,]*,){2}([^,]*)
And use captured group # 1 for your desired String.
TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'
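For completeness, both approaches from this thread (field-splitting and a capturing regex) can be sketched in Python. The variable names are arbitrary; the address is the one from the question.

```python
import re

address = "54 West 21st Street Suite 603, New York,New York,United States, 10010"

# Field-splitting: split on commas and trim the surrounding spaces.
fields = [field.strip() for field in address.split(",")]
state, country = fields[2], fields[3]

# Regex alternative: skip two comma-terminated fields, capture the next two.
match = re.match(r"(?:[^,]*,){2}([^,]*),([^,]*)", address)
```

With the regex, match.group(1) holds the state and match.group(2) the country; neither capture is trimmed, so call .strip() on them if the input may have spaces after the commas.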

Bash and regex problem : check for tokens entered into a Coke vending machine

Here is a "challenge question" I've got from Linux system programming lecture.
Any of the following strings will give you a Coke if you kick:
L = { aaaa, aab, aba, baa, bb, aaaa"a", aaaa"b", aab"a", … ab"b"a, ba"b"a, ab"bbbbbb"a, ... }
The letters shown in wrapped double quotes indicate coins that would have fallen through (but those strings are still part of the language in this example).
Exercise (a bit hard): show this is the language of a regular expression
And this is what I've got so far :
#!/usr/bin/bash
echo "A bottle of Coke costs you 40 cents"
echo -e "Please enter tokens (a = 10 cents, b = 20 cents) in a
sequence like 'abba' :\c"
read tokens
#if [ $tokens = aaaa ]||[ $tokens = aab ]||[ $tokens = bb ]
#then
# echo "Good! now a coke is yours!"
#else echo "Thanks for your money, byebye!"
if [[ $tokens =~ 'aaaa|aab|bb' ]]
then
echo "Good! now a coke is yours!"
else echo "Thanks for your money, byebye!"
fi
Sadly it doesn't work... it always outputs "Thanks for your money, byebye!" I believe something is wrong with the syntax... We weren't provided with any good reference book, and the only instruction from the professor was to consult "anything you find useful online" and "research the problem yourself" :(
I know how I could do it in a programming language such as Java, but getting it done with a bash script + regex seems not "a bit hard" but in fact "too hard" for anyone with little knowledge of something as advanced as "lookahead" (is this the terminology?)
I don't know if there is a way to express the following concept in the language of regex:
Valid entry would consist of exactly one of the three components : aaaa, aab and bb, regardless of the order in a component, followed by an arbitrary sequence of a or b's
So this is what it should be like :
(a{4}Ua{2}bUb{2})(aUb)*
where the content in first braces is order irrelevant.
Thanks a lot in advance for any hints and/or tips :)
Oh, and forget about returning any money back to the user, they are all mine :)
Edit: now the code works, thanks to Stephen, who pointed out my careless typo.
you don't need a regular expression.
case $tokens in
aaaa|aab|bb) echo "coke!";;
*) echo "no coke";;
esac
Note that the above checks for exactly "aaaa", "aab" or "bb". If you don't care which characters come after that, use a wildcard:
case $tokens in
aaaa*|aab*|bb*) echo "coke!";;
*) echo "no coke";;
esac
You have:
if [[ $token =~ 'aaaa|aab|bb' ]]
$token should be $tokens
and remove the quotes from the regex.
if [[ $tokens =~ aaaa|aab|bb ]]
But, now you'll have to work on removing the quotes.
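As for the exercise itself, here is one possible reading, sketched in Python rather than bash. It assumes (from the quoted "fallen through" examples such as ab"b"a) that a coin which would push the total past 40 cents falls through and is ignored, and that every coin after 40 cents is reached also falls through. The regex and all function names below are my own, not from the lecture; the brute-force check compares the regex with a direct simulation of the machine.

```python
import re
from itertools import product

# One candidate regex under the reading above: reach exactly 40 cents
# (b coins inserted at 30 cents fall through), then accept any further coins.
COKE = re.compile(r"(?:bb|aab|(?:aaa|ab|ba)b*a)[ab]*")

def gives_coke_regex(tokens):
    return COKE.fullmatch(tokens) is not None

def gives_coke_simulated(tokens):
    """Simulate the machine directly: a = 10 cents, b = 20 cents, price = 40."""
    total = 0
    for coin in tokens:
        value = {"a": 10, "b": 20}.get(coin)
        if value is None:
            return False  # not a valid token
        if total + value <= 40:
            total += value  # coin accepted
        # otherwise the coin falls through and the total is unchanged
    return total == 40

def agree_up_to(max_len):
    """Check that both functions agree on every a/b string up to max_len."""
    return all(
        gives_coke_regex("".join(s)) == gives_coke_simulated("".join(s))
        for n in range(max_len + 1)
        for s in product("ab", repeat=n)
    )
```

The same pattern works unquoted in bash if it is anchored, for example [[ $tokens =~ ^(bb|aab|(aaa|ab|ba)b*a)[ab]*$ ]], since bash's =~ uses extended regular expressions and has no non-capturing groups.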