Perl regex replacement in file skips newlines - regex

I am currently trying to format text for use with pandoc, but my regex replacement isn't working.
Here is my code so far:
#Importing the file
my $filename = 'example.md';
my $file = path($filename);
my $data = $file ->slurp_utf8;
#Placing code blocks into an array and replacing them in the file
my @code_block_values;
my $i = 0;
while($i > 50) {
@code_bock_values[$i] = ($data =~ /\n\t*```[^`]+(```){1}\n/);
$data =~ s/\n\t*```[^`]+(```){1}\n/(code_block)/;
$i = $i + 1;
}
#Replacing the code blocks
$i = 0;
while($i < 50) {
$data =~ s/\(code_block\)/$code_block_values[$i]/;
$i = $i + 1;
}
print $data;
$file->spew_utf8( $data );
I realize that this probably isn't the most efficient way to be doing this, but right now I'm just trying to get it working.
Basically, I am using GitHub-flavored Markdown for typing up my notes, and then trying to convert them to a PDF file with pandoc. I am doing some other formatting beforehand, but I have to extract the code blocks first (they are delimited by triple backticks, ```).
The following is a sample of such a code block:
```bash
#!/bin/bash
echo "Enter a number"
read count
if [ $count -eq 100 ]
then
echo "Example-3: Count is 100"
elif [ $count -gt 100 ]
then
echo "Example-3: Count is greater than 100"
else
echo "Example-3: Count is less than 100"
fi
```
As far as I can tell, the regex is capturing everything that I need (as tested by an online regex tester), but Perl is only inserting newlines at certain points, specifically newlines followed by a tab.
The previous example translates to:
```bash #!/bin/bash echo "Enter a number" read count if [ $count -eq 100 ] then
echo "Example-3: Count is 100" elif [ $count -gt 100 ] then
echo "Example-3: Count is greater than 100" else
echo "Example-3: Count is less than 100" fi
```
As you can see, the tabs are also completely removed. I copied all of the file contents over from Atom, and the different tab widths are as copied from the editor (not sure whether that makes a difference). I edited the shell script in Vim, but the notes in Atom itself.
I am new to Perl, so any help would be appreciated.

My guess:
#!/usr/bin/perl
use strict;
use warnings;
# Slurps DATA after __END__ into a scalar
my $data = do { local $/; <DATA> };
my @code_block_values;
# Extract and replace code blocks with '(code_block)'
while ($data =~ s/ (``` .*? ```) /(code_block)/xs) {
push @code_block_values, $1;
}
printf "\n--| Replaced:\n%s", $data;
# Restore '(code_block)' with actual content
$data =~ s/ \(code_block\) / shift @code_block_values /xge;
printf "\n--| Restored:\n%s", $data;
__END__
```bash
#!/bin/bash
echo "Enter a number"
read count
if [ $count -eq 100 ]
then
echo "Example-3: Count is 100"
elif [ $count -gt 100 ]
then
echo "Example-3: Count is greater than 100"
else
echo "Example-3: Count is less than 100"
fi
```
```perl
#!/usr/bin/perl
print "Hello World\n";
```
Output:
--| Replaced:
(code_block)
(code_block)
--| Restored:
```bash
#!/bin/bash
echo "Enter a number"
read count
if [ $count -eq 100 ]
then
echo "Example-3: Count is 100"
elif [ $count -gt 100 ]
then
echo "Example-3: Count is greater than 100"
else
echo "Example-3: Count is less than 100"
fi
```
```perl
#!/usr/bin/perl
print "Hello World\n";
```
If you want to learn more about Perl's regular expressions, perlre and perlretut are good reads.

Related

unexpected result using regex in bash

I am trying to write a regex that validates a number in the range -100 to 100.
The regex I came up with is ^[-+]?([0-9][0-9]?|100)$.
I am looking for the pattern inside a string, not just an integer by itself.
This is my script:
#!/bin/bash
a="input2.txt"
while read -r line; do
mapfile -t d <<< "$line"
for i in "${d[@]}"; do
if [[ "$i" =~ ^[-+]?([0-9][0-9]?|100)$ ]]; then
echo "$i"
fi
done
done < "$a"
this is my input file:
add $s1 $s2 $s3
sub $t0
sub $t1 $t0
addi $t1 $t0 75
lw $s1 -23($s2)
the actual result is nothing.
the expected result:
75 -23($s2)
[...] denotes a set of characters, where the dash can be used to specify a character range. For instance, [4-6u-z] in a regexp means one of the characters 4, 5, 6, u, v, w, x, y, z. Your expression [1-200] simply matches the characters (digits) 0, 1 and 2.
In your case, I would therefore proceed in two steps: first, extract the initial numeric part from your string, and then use an arithmetic comparison on the result. For example (not tested!):
if [[ $i =~ ^-?[0-9]+ ]]
then
intval=${BASH_REMATCH[0]}
if (( intval >= -200 && intval <= 1000 ))
then
....
See the bash man page for an explanation of the BASH_REMATCH array.
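A complete version of this two-step idea might look like the sketch below. It assumes the -100 to 100 range from the question and that tokens are separated by whitespace; the function name print_in_range is made up for illustration.

```shell
#!/usr/bin/env bash
# print_in_range LINE
# Prints every whitespace-separated token of LINE whose leading integer
# lies in the range -100..100 (the range is taken from the question).
print_in_range() {
    local -a tokens
    local token value
    local re='^([-+]?)([0-9]+)'   # optional sign, then the leading digits

    read -ra tokens <<< "$1"
    for token in "${tokens[@]}"; do
        [[ $token =~ $re ]] || continue
        # 10# forces base 10 so that "08" is not rejected as invalid octal.
        value=$(( 10#${BASH_REMATCH[2]} ))
        if [[ ${BASH_REMATCH[1]} == - ]]; then
            value=$(( -value ))
        fi
        if (( value >= -100 && value <= 100 )); then
            printf '%s\n' "$token"
        fi
    done
}

# Hypothetical usage: feed the input file through the function line by line.
# while IFS= read -r line; do print_in_range "$line"; done < input2.txt
```

Doing the range check arithmetically keeps the regex trivial and makes the bounds easy to change later.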
#first store your file in an array so that we could pass thru the words
word_array=( $(<filename) )
for i in "${word_array[@]}"
do
if [[ $i =~ ^([[:blank:]]{0,1}-?[0-9]+)([^[:digit:]]?[^[:blank:]]*)$ ]]
#above line looks for the pattern while separating the number and an optional string
#that may follow like ($s2) using '()' so that we could access each part using BASH_REMATCH later.
then
#now we have only the number which could be checked to fall within a range
[ ${BASH_REMATCH[1]} -ge -100 ] && [ ${BASH_REMATCH[1]} -le 100 ] && echo "$i"
fi
done
Sample Output
75
-23($s2)
Note: the pattern might need a bit more testing, but you should get the idea.

Fast way to detect if variable is multiline or not in bash?

Looking for a quick way in BASH to test whether my variable is single or multiline? I thought the following would work but it always comes back as no
input='foo
bar'
regex="\n" ; [[ $regex =~ "${input}" ]] && { echo 'yes' ; } || { echo 'no' ; }
You don't need regex as you can use glob pattern to check for this:
[[ $str == *$'\n'* ]] && echo "multiline" || echo "single line"
$str == *$'\n'* will return true if any newline is found in $str.
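Wrapped in a small function, the glob test could look like the sketch below (is_multiline is an illustrative name, not something from the question). It also sidesteps the two problems in the original attempt: the operands of =~ were reversed, and "\n" inside double quotes is a backslash followed by n, not a newline.

```shell
#!/usr/bin/env bash
# is_multiline STRING: succeeds (exit status 0) if STRING contains a newline.
is_multiline() {
    [[ $1 == *$'\n'* ]]
}

input='foo
bar'
if is_multiline "$input"; then
    echo "multiline"
else
    echo "single line"
fi
```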
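Change your regex as shown below. The string being tested goes on the left of =~ and the regex on the right, and the bracket expression has to contain real CR and LF characters (written here with $'...' quoting), because a backslash inside [...] is taken literally:
$ regex=$'[\r\n]'
$ [[ "${input}" =~ $regex ]] && { echo 'yes' ; } || { echo 'no' ; }
yes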
You do not need to use a regex. Just erase everything that is not a "new line", and count characters:
str=$'foo\nbar\nbaz'
Not calling any external program (pure bash):
b=${str//$'\n'}; echo $(( ${#str} - ${#b} ))
The number printed is the number of new lines.
An alternative solution is to cut the variable at the \n, and find if it got shorter:
b="${str%%$'\n'*}"; (( ${#b} < ${#str} )) && echo "Multiline"
Note that this will fail if the line ending is \r (a lone CR, as in classic Mac OS).
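To make the counting approach robust against CRLF and lone-CR line endings, one option is to normalize them to \n first. A minimal sketch, with count_line_breaks as a made-up helper name:

```shell
#!/usr/bin/env bash
# count_line_breaks STRING: prints the number of line breaks in STRING,
# treating CRLF, lone CR and LF each as one line break.
count_line_breaks() {
    local s=$1
    s=${s//$'\r\n'/$'\n'}   # CRLF -> LF
    s=${s//$'\r'/$'\n'}     # lone CR -> LF
    local stripped=${s//$'\n'/}
    echo $(( ${#s} - ${#stripped} ))
}

str=$'foo\nbar\nbaz'
count_line_breaks "$str"   # prints 2
```

A result greater than 0 means the variable is multiline.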

Split a string on a pattern using bash and/or awk

I have a file that is formatted like
file header string(s)
"section title" : [status]
unknown
text
"next section" : [different_status]
different
amount of
strings
I want to break this into sections such as
file header string(s)
and
"section title" : [status]
unknown
text
and
"next section" : [different_status]
different
amount of
strings
though it isn't critical to capture that header string.
As you can see, the pattern I can depend on for splitting is
"string in quotes" : [string in square brackets]
This delimiting string needs to also be captured.
What is a simple way to do this within a bash script? I predict something in awk will do it, but my awk-fu is weak.
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
my $output = 0;
open my $OUT, '>', "section-$output" or die $!;
while (<>) {
if (/"[^"]*" : \[[^\]]*\]/) {
$output++;
open $OUT, '>', "section-$output" or die $!;
}
print {$OUT} $_;
}
This does the trick in pure Bash:
#!/bin/bash
# The regex lives in a variable: a quoted pattern after =~ is matched literally.
re='^"[^"]*" : \[[^]]*\]'
i=0
while IFS= read -r line; do
    [[ $line =~ $re ]] && (( ++i ))
    (( i > 0 )) && echo "SECTION_$i: $line"
done < "$1"
Update: improved regex.
Should be a one-liner in awk. Assuming I'm interpreting your dividing lines correctly, what about this?
awk '/^"[^"]+" : \[[^]]+\]$/ { printf("\n"); } 1' inputfile > outputfile
The "1" at the end is a shortcut that says "print the current line". The condition and action pair before it inserts a blank line whenever the current line matches the pattern.
You could alternately do the same thing in a sed one-liner:
sed -r '/^"[^"]+" : \[[^]]+\]$/{x;p;x;}' inputfile > outputfile
This uses the magic of sed's "hold space". You can man sed for details of how x works.
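If each section should end up in its own file (as in the Perl solution) rather than being separated by blank lines, the same pattern can drive an awk redirect. This is a sketch only: split_sections and the section- prefix are illustrative names, and any header lines before the first delimiter go to section-0.

```shell
#!/usr/bin/env bash
# split_sections FILE [PREFIX]
# Writes the lines of FILE to PREFIX0, PREFIX1, ... and starts a new
# output file at every line of the form "title" : [status].
split_sections() {
    local input=$1 prefix=${2:-section-}
    awk -v prefix="$prefix" '
        /^"[^"]+" : \[[^]]+\]$/ {
            if (out != "") close(out)   # do not keep old files open
            n++
        }
        {
            out = prefix (n + 0)        # n is 0 until the first delimiter
            print > out
        }
    ' "$input"
}

# Hypothetical usage:
# split_sections notes.txt section-
```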

How can I retrieve logs between two timeframes using regex?

I have a huge log file, where each line is a log entry with its own timestamp. How can I retrieve the entries between two specified timestamps (e.g. start time 22:00:00, end time 23:00:00)?
Using bash, it is possible to generate a regex based only on the two input timestamps, which you can then pass to a command such as grep:
#!/bin/bash
#This is a bare-bone recursive script that accepts input of a start/end timeframe
#in the 24-hour format, based on which a regex statement will be generated that will
#select all values in between those two timeframes. Please note that this script
#has undergone some optimization (I apologize for the reading difficulty, but it
#did not occur to me until now that someone might have a use for such a script).
#You are free to use, distribute, or modify this script to your heart's content. Any
#constructive feedback is welcome. I did my best to eliminate all bugs, but should
#you find any case where the generated regex is INCORRECT for some timestamps, please
#let me know.
echo $0 - Args passed: $1 $2
START_TIME=$(echo $1 | tr -d ":")
END_TIME=$(echo $2 | tr -d ":")
DIFF_CHAR=""
REGEX=""
function loop ()
{
diffcharValue=$1
maxVal=$2
while [ $diffcharValue -lt $maxVal ]
do
REGEX="${REGEX}:[0-5][0-9]"
let diffcharValue+=2
done
}
function regexGen ()
{
diffChar=$1
start=$2
end=$3
if [ $diffChar -le 6 ]; then
regexGen $(($diffChar + 1)) $start $end 0 #the fourth arg is a recursion indicator: whether the function was called recursively or not
let diffChar-=1
diffCharMinusOne=$(($diffChar - 1))
startBegin=${start:0:$diffCharMinusOne}
endBegin=${end:0:$diffCharMinusOne}
startCut=${start:$diffCharMinusOne:1}
endCut=${end:$diffCharMinusOne:1}
endStartDiff=$(($endCut-$startCut))
if [ $(($diffChar % 2)) -eq 0 ]; then
if [ $4 -eq 0 ]; then
REGEX="${REGEX}$startBegin[$startCut-9]"
loop $diffChar 6
REGEX="${REGEX}|$endBegin[0-$endCut]"
loop $diffChar 6
REGEX="${REGEX}|"
elif [ $endStartDiff -gt 1 ]; then
if [ $endStartDiff -eq 2 ]; then
REGEX="${REGEX}$startBegin[$(($startCut+1))]"
else
REGEX="${REGEX}$startBegin[$(($startCut+1))-$(($endCut-1))]"
fi
loop $diffChar 6
echo $REGEX
else
echo ${REGEX%?}
fi
else
if [ $4 -eq 0 ]; then
if [ $startCut -lt 5 ]; then
REGEX="${REGEX}$startBegin[$startCut-5][0-9]"
loop $diffChar 5
REGEX="${REGEX}|"
fi
if [ $endCut -gt 0 ]; then
REGEX="${REGEX}$endBegin[0-$endCut][0-9]"
loop $diffChar 5
REGEX="${REGEX}|"
fi
elif [ $endStartDiff -gt 1 ]; then
if [ $diffCharMinusOne -gt 0 ]; then
REGEX="${REGEX}$startBegin"
fi
if [ $endStartDiff -eq 2 ]; then
REGEX="${REGEX}[$(($startCut+1))][0-9]"
else
REGEX="${REGEX}[$(($startCut+1))-$(($endCut-1))][0-9]"
fi
loop $diffChar 5
echo $REGEX
else
echo ${REGEX%?}
fi
fi
fi
}
for a in {0..5}
do
if [ ${END_TIME:$a:1} -gt ${START_TIME:$a:1} ];then
DIFF_CHAR=$(($a+1))
break
fi
done
result=$(regexGen $DIFF_CHAR $START_TIME $END_TIME 1 | sed 's/\([0-9][0-9]\)/\1:/g')
echo $result
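As an alternative to generating a regex, fixed-width HH:MM:SS timestamps can be compared as plain strings, since their lexicographic order is also their chronological order. The sketch below is a different technique from the script above. It assumes that the timestamp is the first whitespace-separated field of every line and that the range does not cross midnight; logs_between is a hypothetical name.

```shell
#!/usr/bin/env bash
# logs_between START END FILE
# Prints the lines of FILE whose first field (HH:MM:SS) lies between
# START and END, both inclusive.
logs_between() {
    local start=$1 end=$2 file=$3
    # "22:00:00" does not look like a number, so awk compares as strings.
    awk -v s="$start" -v e="$end" '$1 >= s && $1 <= e' "$file"
}

# Hypothetical usage:
# logs_between 22:00:00 23:00:00 app.log
```

If the timestamp sits in a different column, the field reference $1 would need to change accordingly.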

change number to english Perl

Hi, can you check my script and tell me where the problem is? Sorry, I'm new to Perl. I want to convert numbers to English words, for example 1400 -> one thousand four hundred. I already use
Lingua::EN::Numbers qw(num2en num2en_ordinal);
this is my input file.txt
I have us dollar 1200
and the output should be. "I have us dollar one thousand two hundred"
this is my script
#!/usr/bin/perl
use utf8;
use Lingua::EN::Numbers qw(num2en num2en_ordinal);
if(! open(INPUT, '< snuker.txt'))
{
die "cannot open input file: $!";
}
select OUTPUT;
while($lines = <INPUT>){
$lines =~ s/usd|USD|Usd|uSd|UsD/us dollar/g;
$lines =~ s/\$/dollar /g;
$lines =~ s/rm|RM|Rm|rM/ringgit malaysia /g;
$lines =~ s/\n/ /g;
$lines =~ s/[[:punct:]]//g;
$lines =~ s/(\d+)/num2en($lines)/g; #this is where it should convert to english words
print lc($lines); #print lower case
}
close INPUT;
close OUTPUT;
close STDOUT;
the output i got is "i have us dollar num2en(i have us dollar 1200 )"
thank you
You need to refer to the capture using $1 instead of passing the $lines in your last regex where you also need an e flag at the end so that it is evaluated as an expression. You can use i flag to avoid writing all combinations of [Uu][Ss][Dd]...:
while($lines = <INPUT>){
$lines =~ s/usd/us dollar/ig;
$lines =~ s/\$/dollar /g;
$lines =~ s/rm/ringgit malaysia /ig;
$lines =~ s/\n/ /g;
$lines =~ s/[[:punct:]]//g;
$lines =~ s/(\d+)/num2en($1)/ge; #this is where it should convert to english words
print lc($lines), "\n"; #print lower case
}
You’re missing the e modifier on the regex substitution:
$ echo foo 42 | perl -pe "s/(\d+)/\$1+1/g"
foo 42+1
$ echo foo 42 | perl -pe "s/(\d+)/\$1+1/ge"
foo 43
See man perlop:
Options are as with m// with the addition of the following replacement
specific options:
        e    Evaluate the right side as an expression.
Plus you have to refer to the captured number ($1), not the whole string ($lines), but I guess you have already caught that.
The problem here is that you are confusing regexps with functions. In the line where you try to do the conversion, you're not calling the function num2en; instead, you're replacing the number with the literal text num2en($lines). Here's a suggestion for you:
my ($text, $number) = $lines =~ /^(.*?)(\d+)/; # split the line into a text part and a number part
print lc($text . num2en($number)); # print first the text, then the converted number