Perl6: Capturing Windows newline in a string with regex - regex

Disclaimer: I've cross-posted this over at PerlMonks.
In Perl5, I can quickly and easily print out the hex representation of the \r\n Windows-style line ending:
perl -nE '/([\r\n]{1,2})/; print(unpack("H*",$1))' in.txt
0d0a
To create a file with Windows line endings on Unix for testing, create an in.txt file containing a single line with a line ending. Then run: perl -ni -e 's/\n/\r\n/g;print' in.txt (or, in vi/vim, create the file and do :set ff=dos).
I have tried many things in Perl6 to do the same thing, but I can't get it to work no matter what I do. Here's my most recent test:
use v6;
use experimental :pack;
my $fn = 'in.txt';
my $fh = open $fn, chomp => False; # I've also tried :bin
for $fh.lines -> $line {
if $line ~~ /(<[\r\n]>**1..2)/ {
$0.Str.encode('UTF-8').unpack("H*").say;
}
}
Outputs 0a, as do:
/(\n)/
/(\v)/
First, I don't even know if I'm using unpack() or the regex properly. Second, how do I capture both elements (\r\n) of the newline in P6?

Perl 6 automatically chomps the line separator off for you, which means it isn't there when you try to do a substitution.
Perl 6 also creates synthetic characters if there are combining characters. So if you want a base 16 representation of your input, use the encoding 'latin1' or use methods on $*IN that return a Buf.
This example just appends CRLF to the end of every line.
( The last line will always end with 0D 0A even if it didn't have a line terminator )
perl6 -ne 'BEGIN $*IN.encoding("latin1"); #`( basically ASCII )
$_ ~= "\r\n"; #`( append CRLF )
put .ords>>.fmt("%02X");'
You could also turn off the autochomp behaviour.
perl6 -ne 'BEGIN {
$*IN.encoding("latin1");
$*IN.chomp = False;
};
s/\n/\r\n/;
put .ords>>.fmt("%02X");'

Ok, so what my goal was (I'm sorry I didn't make that clear when I posted the question) was I want to read a file, capture the line endings, and write the file back out using the original line endings (and not the endings for the current platform).
I got a proof of concept working now. I'm very new to Perl 6, so the code probably isn't very p6-ish, but it does do what I needed it to.
Code tested on FreeBSD:
use v6;
use experimental :pack;
my $fn = 'in.txt';
my $outfile = 'out.txt';
# write something with a windows line ending to a new file
my $fh = open $fn, :w;
$fh.print("ab\r\ndef\r\n");
$fh.close;
# re-open the file
$fh = open $fn, :bin;
my $eol_found = False;
my Str $recsep = '';
# read one byte at a time, or else we'd have to slurp the whole
# file, as I can't find a way to differentiate EOL from EOF
while $fh.read(1) -> $buf {
my $hex = $buf.unpack("H*");
if $hex ~~ /(0d|0a)/ {
$eol_found = True;
$recsep = $recsep ~ $hex;
next;
}
if $eol_found {
if $hex !~~ /(0d|0a)/ {
last;
}
}
}
$fh.close;
my %recseps = (
'0d0a' => "\r\n",
'0d' => "\r",
'0a' => "\n",
);
my $nl = %recseps<<$recsep>>;
# write a new file with the saved record separator
$fh = open $outfile, :w;
$fh.print('a' ~ $nl);
$fh.close;
# re-read file to see if our newline stuck
$fh = open $outfile, :bin;
my $buf = $fh.read(1000);
say $buf;
Output:
Buf[uint8]:0x<61 0d 0a>
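For comparison, the same "detect the line ending, then reuse it" idea can be sketched in Perl 5, building on the one-liner from the question. The helper names detect_eol and detect_eol_in_file are made up for this illustration, and only the first 4 KiB of the file are inspected, so a terminator that straddles that boundary would be misread.

```perl
use strict;
use warnings;

# Return the first line terminator found in a string of raw bytes:
# "\r\n", "\n" or "\r"; undef if the data contains no terminator.
sub detect_eol {
    my ($data) = @_;
    return $data =~ /(\r\n|\n|\r)/ ? $1 : undef;
}

# Hypothetical helper: read the start of a file with no newline translation.
sub detect_eol_in_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open $path: $!";
    read $fh, my $chunk, 4096;    # assumption: the first 4 KiB is enough
    close $fh;
    return detect_eol($chunk // '');
}

my $eol = detect_eol("ab\r\ndef\r\n");
print unpack('H*', $eol), "\n";   # prints 0d0a
```

The detected string can then be appended to each record when writing the output file, instead of relying on the platform default.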

Related

How can I properly stop and start metacharacter interpolation in regexp in Perl

Editing to be more concise, pardon.
I need to be able to grep from an array using a string that may contain one of the following characters: '.', '+', '/', '-'. The string is captured from the user. The array contains each line of the file I'm searching through. (I read the file into the array up front to avoid keeping it open while the user is interacting with the program, because it is on a cron and I do not want to have it open when the cron runs.) Each line has a unique identifier within it, which is the basis for the search string used in the regexp. The code below shows the grep statement I am using. I use our and my in my programs so that the variables I want access to in all namespaces are available, and the ones I use only in subroutines are not. If you want to try to replicate the issue, this is the code:
#!/usr/bin/perl -w
use strict;
use Switch;
use Data::Dumper;
our $pgm_path = "/tmp/";
our $device_info = "";
our @new_filetype1 = ();
our @new_filetype2 = ();
our @dev_info = ();
our @pgm_files = ();
our %arch_rtgs = ();
our $file = "/path/file.csv";
open my $fh, '<', $file or die "Couldn't open $file!\n";
chomp(our @source_file = <$fh>);
close $fh;
print "Please enter the device name:\n";
chomp(our $dev = <STDIN>);
while ($device_info eq "") {
# Grep the device info from the sms file
my @sms_device = grep(/\Q$dev\E/, @source_file);
if (scalar(@sms_device) > 1) {
my $which_dup = find_the_duplicate(\@sms_device);
if ($which_dup eq "program") {
print "\n-> $sms_dev <- must be a program name instead of a device name." .
"\nChoose the device from the list you are working on, specifically.\n";
foreach my $fix (@sms_device) {
my @fix_array = split(',', $fix);
print "$fix_array[1]\n";
undef @fix_array;
}
chomp($sms_dev = <STDIN>);
} else { $device_info = $which_dup; }
} elsif (scalar(@sms_device) == 1) {
($device_info) = @sms_device;
@sms_device = ();
}
}
When I try the code with an anchor:
my @sms_device = grep(/\Q$dev\E^/, @source_file);
No more activity from the program is noticed. It just sits there like it's waiting on some more input from the user. This is not what I expected to happen. The reason I would like to anchor the search pattern is because there are many, many examples of similarly named devices that have the same character order as the search pattern, but also include additional characters that are ignored in the regexp evaluation. I don't want them to be ignored, in the sense that they are included in matches. I want to force an exact match of the string in the variable.
Thanks in advance for wading through my terribly inexperienced code and communication attempts at detailing my problem.
The device id followed by the start of the string? /\Q$dev\E^/ makes no sense. You want the device id to be preceded by the start of the string and followed by the end of the string.
grep { /^\Q$dev\E\z/ }
Better yet, let's avoid spinning up the regex engine for nothing.
grep { $_ eq $dev }
For example,
$ perl -e'my $dev = "ccc"; CORE::say for grep { /^\Q$dev\E\z/ } qw( accc ccc ccce );'
ccc
$ perl -e'my $dev = "ccc"; CORE::say for grep { $_ eq $dev } qw( accc ccc ccce );'
ccc
I would use quotemeta. Here is an example of how it compares:
my $regexp = '\t';
my $metaxp = quotemeta ($regexp);
while (<DATA>) {
print "match \$regexp - $_" if /$regexp/;
print "match \$metaxp - $_" if /$metaxp/;
}
__DATA__
This \t is not a tab
This is a tab
(there is literally a tab in the second line)
The meta version will match line 1, because quotemeta turned "\t" into essentially "\\t" (a literal backslash followed by t), and the non-meta (original) version will match line 2, because the regex engine interprets \t as a tab.
match $metaxp - This \t is not a tab
match $regexp - This is a tab
Hopefully you get my meaning.
I think adding $regexp = quotemeta ($regexp) (or doing it when you capture the standard input) should meet your need.

How to print lines in between two patterns?

I would like to print everything between lines #cluster t.# has ### elements (including this line) and #cluster t.#+1 has ### elements (preferably omitting this line) from my input file into corresponding numbered output files (clust(#).txt). The script thus far creates the appropriate numbered files, without any content.
#!/usr/bin/perl
use strict;
use warnings;
open(IN,$ARGV[0]);
our $num = 0;
while(my $line = <IN>) {
if ($line =~ /^\#cluster t has (\d+) elements/) {
my $clust = "full";
open (OUT, ">clust$clust.txt");
} elsif ($line =~ m/^\#cluster t.(\d+.*) has (\d+) elements/) {
my $clust = $1;
$num++;
open (OUT, ">clust$clust.txt");
print OUT, $_ if (/$line/ ... /$line/);
}
}
Update Re-arranged so that the version based on my final understanding of input comes first. Also edited for clarity.
Detect the line that starts the section to be written to its own file and open the suitable file; otherwise just write to the filehandle (that corresponds to the current output file).
An example input file, in my understanding, data_range.txt
#cluster t.1 has 100 elements
data 1
data 1 1
#cluster t.2 has 200 elements
data 2
#cluster t.3 has 300 elements
Print t.N and the lines following up to the next t.N, to a file clust(N).txt.
use warnings;
use strict;
my $file = shift || 'data_range.txt';
open my $fh, $file or die "Can't open $file: $!";
my $fh_out;
my $clustline = qr/\#cluster t\.([0-9]+) has [0-9]+ elements/;
while (<$fh>)
{
if (/$clustline/) {
my $outfile = "clust($1).txt";
open $fh_out, '>', $outfile or die "Can't open $outfile: $!";
}
print $fh_out $_;
}
For each line with #cluster a new file with the corresponding number is opened, closing the previous one since we use the same filehandle. All following lines, including that one, belong to that file and they are printed there.
The code above assumes that the first line in the file is a #cluster line, and that all lines in this file belong to one of the output files. If this may not be so, then we need to be more careful: (1) use a flag for when the writing starts and (2) add a branch that allows skipping lines.
my $started_writing = 0;
my $clustline = qr/\#cluster t\.([0-9]+) has [0-9]+ elements/;
while (<$fh>)
{
if (/$clustline/) {
my $fout = "clust($1).txt";
open $fh_out, '>', $fout or die "Can't open $fout for writing: $!";
$started_writing = 1;
}
elsif (not $started_writing) { # didn't get to open output files yet
next;
}
elsif (/dont_write_this_line/) { # condition for lines to skip altogether
next;
}
print $fh_out $_;
}
All of this assumes that a #cluster line cannot repeat with the same number. You'd lose output data if that happened, so add a test if you aren't sure of your input (or open output files in append mode).
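One way to add that test is sketched below, assuming a hash that records which cluster numbers have already been seen. The sub name open_mode_for and the sample lines are invented for this illustration; the returned mode would be passed to open in place of the hard-coded '>'.

```perl
use strict;
use warnings;

my %seen;   # cluster number => count of times its header line was seen

# Choose the open mode for a cluster's output file: truncate on the first
# occurrence, append on a repeat so that earlier data is not overwritten.
sub open_mode_for {
    my ($n) = @_;
    return $seen{$n}++ ? '>>' : '>';
}

my $clustline = qr/^\#cluster t\.([0-9]+) has [0-9]+ elements/;
my @modes;
for my $line ("#cluster t.1 has 100 elements", "data 1",
              "#cluster t.2 has 200 elements",
              "#cluster t.1 has 5 elements") {
    if ($line =~ $clustline) {
        my $n = $1;
        push @modes, "$n:" . open_mode_for($n);
    }
}
print "@modes\n";   # prints 1:> 2:> 1:>>
```

In the loop above this would become open $fh_out, open_mode_for($1), $fout or die ..., so a repeated cluster number extends its file rather than replacing it.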
With either we get output clust(1).txt
#cluster t.1 has 100 elements
data 1
data 1 1
and clust(2).txt
#cluster t.2 has 200 elements
data 2
and clust(3).txt with the #cluster t.3 line.
Original version, with the initial understanding of input and requirements
The range operator is nearly tailor-made for this. It keeps track of its true/false state across repeated calls. It turns true once its left operand evaluates true and stays true until its right operand evaluates true; it is still true on that evaluation and turns false on the next one. There is more to it, please see the docs.
Made-up input file data_range.txt
#cluster t.1 has 100 elements
#cluster t.2 has 200 elements
#cluster t.3 has 300 elements
#cluster t.4 has 400 elements
#cluster t.5 has 500 elements
Print everything between marker-lines 2 and 4, including the starting line but not the ending one.
use warnings;
use strict;
my $file = 'data_range.txt';
open my $fh, $file or die "Can't open $file: $!";
# Build the start and end patterns
my $beg = qr/^\#cluster t\.2 has 200 elements$/;
my $end = qr/^\#cluster t\.4 has 400 elements$/;
while (<$fh>)
{
if (/$beg/ .. /$end/) {
print if not /$end/;
}
}
This prints lines 2 and 3. The .. operator turns true once the line ($_) matches $beg and stays true up to and including the line that matches $end; it is false again from the next line on. On its own it would therefore include both the start and the end lines, so we also test for the end marker and do not print that line.
If you would rather use the literal marker lines you can test strings for equality
my $beg = q(#cluster t.2 has 200 elements);
my $end = q(#cluster t.4 has 400 elements);
while (my $line = <$fh>)
{
chomp($line);
if ($line eq $beg .. $line eq $end) {
print "$line\n" if $line ne $end;
}
}
This works the same way as the example above. Note that now we have to chomp since the newline would foil eq test (and then we add \n for printing).
Here is a more concise way:
perl -ne 'print if /^foo/ .. /^base/' file.txt
Sample input
Lorem ipsum dolor
sit amet,
consectetur adipiscing
foo
bar
base
elit,
sed do
Output
foo
bar
base

Perl newbie: trying to find string in array of strings

I need to match a string against an array of strings. The string that I am searching for should be able to contain wildcards.
#!/usr/bin/perl
#
## disable buffered I/O which would lead
## to deadloops for the Apache server
$| = 1;
#
## read URLs one per line from stdin
while (<>) {
my $line = $_;
my @array1 = ("abc","def","ghi");
$found = 0;
if (/$line/i ~~ @array1)
{
print "found\n";
}
else
{
print "not found\n";
}
}
I test this script with the input of abc and it returns not found
perl ./mapscript.pl
abc
not found
Your input has a newline at the end. Add:
chomp $line;
right after
my $line = $_;
Use chomp(my $input = $_) instead of my $input = $_ inside your while loop to remove the newline.
A newline at the end always exists when reading with <>. See chomp.
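Putting the answers together, a minimal sketch might look like the following. It assumes the input is meant to be used as a case-insensitive regex (which is how wildcards such as a.c would work), and the sub name find_matches is made up for this example. Input that is not a valid regex would make the match die, so real code may want to validate it first.

```perl
use strict;
use warnings;

my @array1 = ("abc", "def", "ghi");

# Return the candidates that match the user's input, which is treated as a
# case-insensitive regex so that wildcards such as "a.c" or "^gh" work.
sub find_matches {
    my ($input, @candidates) = @_;
    chomp $input;                     # drop the trailing newline from <>
    return grep { /$input/i } @candidates;
}

for my $input ("abc\n", "A.C\n", "xyz\n") {
    my @hits = find_matches($input, @array1);
    print @hits ? "found\n" : "not found\n";
}
# prints "found", "found", "not found" on separate lines
```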

processing multiline pattern in perl

I am working on adding support for new languages to my mobile platform. I have to add an entry for each language in several files, so I thought I would do it using Perl. To automate this process, I am having trouble working out how to match multi-line patterns in Perl.
Here is my scenario :
const mmi_imeres_mode_details_struct g_ime_mode_array_int[] =
{
{
INPUT_MODE_NONE,
0,
0,
0,
0,
0,
0
},
{
INPUT_MODE_MULTITAP_LOWERCASE_ABC,
STR_INPUT_METHOD_MENU_MULTITAP_abc,
WGUI_IME_MULTITAP_LOWERCASE_ABC_IMG,
INPUT_MODE_DEFAULT_ALTERNATE_METHOD,
MMI_IME_ALL_EDITORS | MMI_IME_ENGLISH_ONLY_MODE | MMI_IME_ALPHABETIC | MMI_IME_LOWERCASE,
MMI_IMM_WRITING_LANGUAGE_ENGLISH,
"en-US"
},
}
First I had a problem because in Perl a file is read one line at a time, so I first read my whole file into a single variable.
my $newstr = '';
open (FH, "$filename") || die "Could not open file.\n";
while(<FH>)
{
$newstr = $newstr.$_;
}
Now, can someone help me search for the text within { } when it spans multiple lines?
First, there's a better idiom for slurping a file:
my $newstr;
{
open my $fh, '<', $filename or die "Could not open file $filename.\n$!\n";
local $/ = undef;
$newstr = <$fh>;
}
Next, you can set the /s modifier on your regexp, which treats the string as a single line by allowing '.' (dot) to match anything including newlines. But even that's not really necessary since you won't be using 'dot' in your regexp anyway.....
while(
$newstr =~ m/
\{ # Match the opening bracket.
([^}]*) # Capture any number of characters that exclude '}'
\} # Match the closing bracket.
/gx # Use /g for multiple matches, and /x for readability.
) {
print "$1\n";
}
Another solution would be to set your input record separator, $/, to '}'. That way you're reading the file in as chunks that end with a closing bracket. Nifty trick.
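The record-separator trick could look roughly like this. The sample data and the in-memory filehandle are assumptions made to keep the sketch self-contained; with a real file you would open $filename instead.

```perl
use strict;
use warnings;

my $source = "{\n  A,\n  0\n},\n{\n  B,\n  1\n},\n";

# Read from an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \$source or die "Can't open in-memory file: $!";

my @blocks;
{
    local $/ = '}';                 # each "line" now ends at a closing brace
    while (my $chunk = <$fh>) {
        # Keep only what lies between the last '{' and the terminating '}'.
        push @blocks, $1 if $chunk =~ /\{([^{}]*)\}\z/;
    }
}
close $fh;

print scalar(@blocks), " blocks\n";   # prints 2 blocks
```

Because the pattern takes the text after the last opening brace in each chunk, an enclosing outer brace (as in the array initializer above) does not end up in the captured block.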

Perl - Start reading from specific line, and only get first column of it line, until end

I have a text file that looks like the following:
Line 1
Line 2
Line 3
Line 4
Line 5
filename2.tif;Smpl/Pix & Bits/Smpl are missing.
There are 5 lines that are always the same, and on the 6th line is where I want to start reading data. Upon reading data, each line (starting from line 6) is delimited by semicolons. I need to just get the first entry of each line (starting on line 6).
For example:
Line 1
Line 2
Line 3
Line 4
Line 5
filename2.tif;Smpl/Pix & Bits/Smpl are missing.
filename4.tif;Smpl/Pix & Bits/Smpl are missing.
filename6.tif;Smpl/Pix & Bits/Smpl are missing.
filename8.tif;Smpl/Pix & Bits/Smpl are missing.
Output desired would be:
filename2.tif
filename4.tif
filename6.tif
filename8.tif
Is this possible, and if so, where do I begin?
This uses the Perl 'autosplit' (or 'awk') mode:
perl -n -F'/;/' -a -e 'next if $. <= 5; print "$F[0]\n";' < data.file
See 'perlrun' and 'perlvar'.
If you need to do this in a function which is given a file handle and a number of lines to skip, then you won't be using the Perl 'autosplit' mode.
sub skip_N_lines_read_column_1
{
my($fh, $N) = @_;
my $i = 0;
my @files = ();
while (my $line = <$fh>)
{
next if $i++ < $N;
my($file) = split /;/, $line;
push @files, $file;
}
return @files;
}
This initializes a loop, reads lines, skipping the first N of them, then splitting the line and capturing the first result only. That line with my($file) = split... is subtle; the parentheses mean that the split has a list context, so it generates a list of values (rather than a count of values) and assigns the first to the variable. If the parentheses were omitted, you would be providing a scalar context to a list operator, so you'd get the number of fields in the split output assigned to $file - not what you needed. The file name is appended to the end of the array, and the array is returned. Since the code did not open the file handle, it does not close it. An alternative interface would pass the file name (instead of an open file handle) into the function. You'd then open and close the file in the function, worrying about error handling.
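The difference between the two contexts can be seen in a small standalone sketch, using one of the sample lines from the question:

```perl
use strict;
use warnings;

my $line = "filename2.tif;Smpl/Pix & Bits/Smpl are missing.\n";

my ($first) = split /;/, $line;   # list context: first field
my $count   = split /;/, $line;   # scalar context: number of fields

print "$first\n";   # prints filename2.tif
print "$count\n";   # prints 2
```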
And if you need the help with opening the file, etc, then:
use Carp;
sub open_skip_read
{
my($name) = @_;
open my $fh, '<', $name or croak "Failed to open file $name ($!)";
my @list = skip_N_lines_read_column_1($fh, 5);
close $fh or croak "Failed to close file $name ($!)";
return @list;
}
#!/usr/bin/env perl
#
# name_of_program - what the program does as brief one-liner
#
# Your Name <your_email@your_host.TLA>
# Date program written/released
#################################################################
use 5.10.0;
use utf8;
use strict;
use autodie;
use warnings FATAL => "all";
# ⚠ change to agree with your input: ↓
use open ":std" => IN => ":encoding(ISO-8859-1)",
OUT => ":utf8";
# ⚠ change for your output: ↑ — *maybe*, but leaving as UTF-8 is sometimes better
END {close STDOUT}
our $VERSION = 1.0;
$| = 1;
if (@ARGV == 0 && -t STDIN) {
warn "reading stdin from keyboard for want of file args or pipe";
}
while (<>) {
next if 1 .. 5;
my $initial_field = /^([^;]+)/ ? $1 : next;
# ╔═══════════════════════════╗
# ☞ your processing goes here ☜
# ╚═══════════════════════════╝
} continue {
close ARGV if eof;
}
__END__
Kinda ugly but, read out the dummy lines and then split on ; for the rest of them.
my $logfile = '/path/to/logfile.txt';
open(FILE, $logfile) || die "Couldn't open $logfile: $!\n";
for (my $i = 0 ; $i < 5 ; $i++) {
my $dummy = <FILE>;
}
while (<FILE>) {
my (@fields) = split /;/;
print $fields[0], "\n";
}
close(FILE);