Find strings in C source code by using a regex in Perl

I am studying regular expressions in Perl.
I want to write a script that accepts a C source code file and finds the strings in it.
This is my code:
my $file1 = @ARGV;
open my $fh1, '<', $file1;
while (<>)
{
    @words = split(/\s/, $_);
    $newMsg = join '', @words;
    push @strings, ($newMsg =~ m/"(.*\\*.*\\*.*\\*.*)"/) if ($newMsg =~ /".*\\*.*\\*.*\\*.*"/);
    print Dumper(\@strings);
    foreach (@strings)
    {
        print "strings: $_\n";
    }
}
but I have a problem matching a multi-line string like this one:
const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";
What should I do?

Here is a fun solution. It uses MarpaX::Languages::C::AST, an experimental C parser. We can use the c2ast.pl program that ships with the module to convert a C source file to an abstract syntax tree, which we dump to a file (using Data::Dumper). We can then extract all strings with a bit of magic.
Unfortunately, the AST objects have no methods, but as they are autogenerated, we know how they look on the inside.
They are blessed arrayrefs.
Some contain a single unblessed arrayref of items.
Others contain zero or more items (lexemes or objects).
Lexemes are arrayrefs with two fields of location information, and the string contents at index 2.
This information can be extracted from the grammar.
The code:
use strict; use warnings;
use Scalar::Util 'blessed';
use feature 'say';

our $VAR1;
require "test.dump"; # populates $VAR1

my @strings = map extract_value($_), find_strings($$VAR1);
say for @strings;

sub find_strings {
    my $ast = shift;
    return $ast if $ast->isa("C::AST::string");
    return map find_strings($_), map flatten($_), @$ast;
}

sub flatten {
    my $thing = shift;
    return $thing if blessed($thing);
    return map flatten($_), @$thing if ref($thing) eq "ARRAY";
    return (); # we are not interested in other references, or unblessed data
}

sub extract_value {
    my $string = shift;
    return unless blessed($string->[0]);
    return unless $string->[0]->isa("C::AST::stringLiteral");
    return $string->[0][0][2];
}
A rewrite of find_strings from recursion to iteration:
sub find_strings {
    my @unvisited = @_;
    my @found;
    while (my $ast = shift @unvisited) {
        if ($ast->isa("C::AST::string")) {
            push @found, $ast;
        } else {
            push @unvisited, map flatten($_), @$ast;
        }
    }
    return @found;
}
The test C code:
/* A "comment" */
#include <stdio.h>
static const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";
int main() {
    printf("Hello %s:\n%s\n", "World", text2);
    return 0;
}
I ran the commands
$ perl $(which c2ast.pl) test.c -dump >test.dump;
$ perl find-strings.pl
which produced the output:
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"World"
"Hello %s\n"
""
""
""
""
""
""
Notice that there are some empty strings that are not from our source code; they come from the included files. Filtering those out by origin would not be impossible, but it is a bit impractical.
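If all you want is to drop the empty literals themselves, a naive filter appended to the script above would do (a sketch; it cannot tell which file a string came from, it simply discards empty ones):

# naive filter: keep only non-empty string literals
my @nonempty = grep { $_ ne '""' } @strings;
say for @nonempty;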

It appears you're trying to use the following regular expression to capture multiple lines in a string:
my $your_regexp = m{
    (
        .*    # anything
        \\*   # any number of backslashes
        .*    # anything
        \\*   # any number of backslashes
        .*    # anything
        \\*   # any number of backslashes
        .*    # anything
    )
}x;
But it looks more like a grasp at straws than a deliberately thought-out plan.
So you've got two problems:
find everything between double quotes (")
handle the situation where there might be multiple lines between those quotes
Regular expressions can match across multiple lines. The /s modifier does this. So try:
my $your_new_regexp = m{
    \"          # opening quote mark
    ([^\"]+)    # anything that's not a quote mark, capture
    \"          # closing quote mark
}xs;
You might actually have a 3rd problem:
remove trailing backslash/newline pairs from strings
You could handle this by doing a search-replace:
foreach ( @strings ) {
    $_ =~ s/\\\n//g;
}
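Putting those pieces together, here is a minimal sketch (an assumption on my part: the whole file is slurped into one string so the regex can see multi-line literals; escaped quotes are handled as naively as above):

use strict;
use warnings;

my $code = do { local $/; <> };   # slurp the whole C file at once

my @strings;
while ( $code =~ m{ " ( [^"]+ ) " }xsg ) {
    my $str = $1;
    $str =~ s/\\\n//g;            # drop trailing backslash/newline pairs
    push @strings, $str;
}

print "strings: $_\n" for @strings;

Run it as perl extract.pl source.c (the script name is just a placeholder).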

Here is a simple way of extracting all strings in a source file. There is an important decision to make first: do we preprocess the code? If not, we may miss strings that are generated via macros, and we would also have to treat preprocessor directives (lines starting with #) as comments.
As this is a quick-and-dirty solution, syntactic correctness of the C code is not an issue. We will however honour comments.
Now if the source is preprocessed (with gcc -E source.c), then multiline strings are already folded into one line! Also, comments are already removed. Sweet. The only comments that remain mention line numbers and source files for debugging purposes. Basically, all we have to do is:
$ gcc -E source.c | perl -nE'
next if /^#/; # skip line directives etc.
say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'
Output (with the test file from my other answer as input):
""
"__isoc99_fscanf"
""
"__isoc99_scanf"
""
"__isoc99_sscanf"
""
"__isoc99_vfscanf"
""
"__isoc99_vscanf"
""
"__isoc99_vsscanf"
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
So yes, there is a lot of garbage here (it seems to come from __asm__ blocks), but this works astonishingly well.
Note the regex I used: /(" (?:[^"\\]+ | \\.)* ")/x. The pattern inside the capture can be explained as
" # a literal '"'
(?: # the begin of a non-capturing group
[^"\\]+ # a character class that matches anything but '"' or '\', repeated once or more
|
\\. # an escape sequence like '\n', '\"', '\\' ...
)* # zero or more times
" # closing '"'
What are the limitations of this solution?
We need a preprocessor.
This code was tested with gcc
clang also supports the -E option, but I have no idea how the output is formatted.
Character literals are a failure mode, e.g. myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
We also extract strings from other source files. (false positives)
Oh wait, we can fix the last bit by parsing the source file comments which the preprocessor inserted. They look like
# 29 "/usr/include/stdio.h" 2 3 4
So if we remember the current filename and compare it to the filename we want, we can skip unwanted strings. This time, I'll write it as a full script instead of a one-liner.
use strict; use warnings;
use autodie;        # automatic error handling
use feature 'say';

my $source    = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;

my $file;
while (<$preprocessed>) {
    $file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
    next if /^#/;
    next if $file ne qq("$source");
    say $1 while /($string_re)/xg;
}
Usage: $ perl extract-strings.pl source.c
This now produces the output:
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
If you cannot use the convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. Basically, you want to slurp in the whole file at once, not iterate it line by line. Then you skip over any comments, and do not forget to ignore preprocessor directives as well. After that, we can extract the strings as usual. In essence, you have to rewrite the grammar
Start → Comment Start
Start → String Start
Start → Whatever Start
Start → End
to a regex. As the above is a regular language, this isn't too hard.
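A rough sketch of that idea (assuming the file fits in memory; it skips block comments, line comments, and character literals before capturing string literals, and it does not attempt full preprocessor handling):

use strict;
use warnings;
use feature 'say';

my $code = do { local $/; <> };     # slurp the whole source file

while (
    $code =~ m{
          /\* .*? \*/                   # skip block comments
        | // [^\n]*                     # skip line comments
        | ' (?: [^'\\] | \\. )* '       # skip character literals
        | ( " (?: [^"\\] | \\. )* " )   # capture string literals
    }xsg
) {
    say $1 if defined $1;
}

Multi-line literals will still contain their backslash/newline pairs, which you can strip afterwards with s/\\\n//g.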

Related

Perl regex: How to parse a string from " to " without \"?

I have to parse the line "abc\",","\"," with a regex in Perl
and get the results "abc\"," and "\",".
I do this:
while (/(\s*)/gc) {
    if (m{\G(["])([^\1]+)\1,}gc) {
        say $2;
    }
}
but it is wrong, because this regexp goes to the last ",.
My question is: how can I jump over the \" and stop at the first ",?
The following program performs matches according to your specification:
while (<>) {
    @arr = ();
    while (/("(?:\\"|[^"])*")/) {
        push @arr, $1;
        $_ = $';
    }
    print join(' ', @arr), "\n";
}
Input file input.txt:
"abc", "def"
"abc\",","\","
Output:
$ ./test.pl < input.txt
"abc" "def"
"abc\"," "\","
It can be improved to match more strictly, because in this form a lot of input is possible that may not be desirable, but it serves as a first pointer. Additionally, it is better to parse a CSV file with the corresponding module rather than with regular expressions, but you have not stated whether your input really is a CSV file.
Don't reinvent the wheel. If you have CSV, use a CSV parser.
use Text::CSV_XS qw( );

my $string = '"abc\",","\","';

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
$csv->parse($string);
my @fields = $csv->fields();
Regexes aren't the best tool for this task. The standard Text::ParseWords module does this easily.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::ParseWords;

my $line = '"abc\",","\","';
my @fields = parse_line(',', 1, $line);
for (0 .. $#fields) {
    say "$_: $fields[$_]";
}
The output is:
0: "abc\","
1: "\","
split /(?<!\\)",(?<!\\)"/, $_
(preceded by cleaning the boundaries of $_ with s/^"// && s/"$//;, because the enclosing outer quotes did not need to be part of the input string definition, but you have them)
directly returns the array you want (there is no need for an external loop, because the looping happens inside the core Perl function split; you might add \s* around the comma depending on how the string is provided).
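A minimal sketch of that approach on the sample line (note that the fields come back without their enclosing quotes):

my $line = '"abc\",","\","';
$line =~ s/^"//;    # strip the leading outer quote
$line =~ s/"$//;    # strip the trailing outer quote
my @fields = split /(?<!\\)",(?<!\\)"/, $line;
print "$_\n" for @fields;    # first field: abc\",   second field: \",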
But (just a note, since you didn't mention it) there could be a deeper case.
If \" stands for ", you probably also have \\ standing for \, so you might encounter \\\" and \\". The latter (more generally, an even number of backslashes before a ") is complicated to handle with a one-line regexp, because look-behind is only implemented for fixed widths: the unsupported form (?<!\\(?:\\\\)*)" - which would also correctly recognise a delimiter quote that follows an escaped backslash, i.e. not treat the " in the sequence \\" as an escaped quote \" - cannot be used, and less efficient code than mine would be required. Again, this is only a marginal consideration for the case where \\ hypothetically has to be interpreted too.

Using string variables containing literal escapes in a Perl substitution

I'm new to Perl and I found behaviour which I don't understand and can't solve.
I'm making a small find-and-replace program, and there are some things I need to do. I have a bunch of files that I need to process, and a list of find/replace rules in an external text file. For the replacing I need three special things:
Replacing utf-8 characters (Czech diacritics)
Work with adding/removing lines (so working in a slurp mode)
Use a regular expressions
I want a program that works alone, so I wrote it so that it takes three arguments:
The file to work on
What to find
What to replace.
I'm sending the parameters in a loop from a bash script, which parses the rules list and loads the other files.
My problem is when I have a "\n" string in a rules list and I send it to the Perl script. If it's in the first part of replacement (in the find section) it looks for a newline correctly, but when it's in the second part (the replace section) it just prints \n instead of a newline.
I tried hardcoding "\n" to the string right into the variable instead of passing it from the list and then it works fine.
What's the reason Perl doesn't interpret the "\n" string there, and how can I make it work?
This is my code:
list.txt - One line from the external replacement list
1\. ?\\n?NÁZEV PŘÍPRAVKU;\\n<<K1>> NÁZEV PŘÍPRAVKU;
farkapitoly.sh - The bash script for parsing list.txt and cycling through all of the files and calling the Perl script
...
FILE="/home/tmp.txt"
while read LINE
do
    FIND=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $1}'`
    REPLACE=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $2}'`
    perl -CA ./pathtiny.pl "$FILE" "$FIND" "$REPLACE"
done < list.txt
...
pathtiny.pl - The Perl script for find and replace
#!/usr/bin/perl
use strict;
use warnings;
use Modern::Perl;
use utf8; # Enable typing Unicode in Perl strings
use open qw(:std :utf8); # Enable Unicode to STDIN/OUT/ERR and filehandles
use Path::Tiny;
my $file = path("$ARGV[0]");
my $searchStr = "$ARGV[1]";
my $replaceStr = "$ARGV[2]";
# $replaceStr="\n<<K1>> NÁZEV PRÍPRAVKU"; # if I hardcode it here \n is replaced right away
print("Search String:", "$searchStr", "\n");
print("Replace String:", "$replaceStr", "\n\n");
my $guts = $file->slurp_utf8;
$guts =~ s/$searchStr/$replaceStr/gi;
$file->spew_utf8($guts);
If it's important, I'm using Linux Mint 13 64-bit on VirtualBox (under Win 8.1) and I have Perl v5.14.2. Every file is UTF-8 with Linux endings.
Example files can be found on pastebin. this should end up like this.
But examples varies a lot. I need a universal solution to write down newline in a replacement string so it replaces correctly.
The problem is that the replacement string is read literally from the file, so if your file contains
xx\nyy
then you will read exactly those six characters. Also, the replacement part of a substitution is evaluated as if it were in double quotes. So your replacement string is "$replaceStr", which interpolates the variable and goes no further, so you will again have xx\nyy in the new string. (By the way, please avoid using capital letters in local Perl identifiers, as in practice they are reserved for globals such as Module::Names.)
The answer lies in using eval, or its equivalent - the /e modifier on the substitution.
If I write
my $str = '<b>';
my $r = 'xx\nyy';
$str =~ s/b/$r/;
then the replacement string is interpolated to xx\nyy, as you have experienced.
A single /e modifier evaluates the replacement as an expression instead of just a double-quoted string, but of course $r as an expression is xx\nyy again.
What you need is a second /e modifier, which does the same evaluation as a single /e and then does an additional eval of the result on top. For this it is cleanest if you use qq{ .. } as you need two levels of quotation.
If you write
$str =~ s/b/qq{"$r"}/ee
then perl will evaluate qq{"$r"} as an expression, giving "xx\nyy", which, when evaluated again will give you the string you need - the same as the expression 'xx' . "\n" . 'yy'.
Here's a full program
use strict;
use warnings;
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/qq{"$r"}/ee;
print $s;
output
<xx
yy>
But don't forget that, if your replacement string contains any double quotes, like this
my $r = 'xx\n"yy"'
then they must be escaped before putting them through the substitution, as the expression itself also uses double quotes.
All of this is quite hard to grasp, so you may prefer the String::Escape module which has an unbackslash function that will change a literal \n (and any other escapes) within a string to its equivalent character "\n". It's not a core module so you probably will need to install it.
The advantage is that you no longer need a double evaluation, as the replacement string can be just unbackslash $r, which gives the right result when evaluated as an expression. It also handles double quotes in $r without any problem, as the expression doesn't use double quotes itself.
The code using String::Escape goes like this
use strict;
use warnings;
use String::Escape 'unbackslash';
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/unbackslash $r/e;
print $s;
and the output is identical to that of the previous code.
Update
Here is a refactoring of your original program that uses String::Escape. I have removed Path::Tiny as I believe it is best to use Perl's built-in inplace-edit extension, which is documented under the General Variables section of perlvar.
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use 5.010;

use open qw/ :std :utf8 /;
use String::Escape qw/ unbackslash /;

our @ARGV;
my ($file, $search, $replace) = @ARGV;

print "Search String: $search\n";
print "Replace String: $replace\n\n";

@ARGV = ($file);
$^I = '';

while (<>) {
    s/$search/unbackslash $replace/eg;
    print;
}
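A hypothetical invocation (the script and file names are placeholders; the search and replace strings would come from your rules list):

$ perl replace.pl /home/tmp.txt 'NÁZEV PŘÍPRAVKU;' '\n<<K1>> NÁZEV PŘÍPRAVKU;'

The literal \n in the replacement argument is turned into a real newline by unbackslash before the substitution runs.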
You got \n as the content of the string (as two characters, \ and n, not as one newline).
Perl interprets \n as a newline only when it appears as a literal escape in your code.
The quick fix would be:
my $replaceStr = eval qq("$ARGV[2]"); # evaling the string makes Perl interpret the \n escape
or, if you don't like eval, you can use the String::Escape CPAN module (the unbackslash function).
You're wanting a literal string to be treated as if it were a double quoted string. To do that you'll have to translate any backslash followed by another character.
The other experts have shown you how to do that over the entire string (which is risky since it uses eval with unvalidated data). Alternatively, you could use a module, String::Escape, which requires an install (not a high bar, but too high for some).
However, the following translates the replacement string itself in a safe way, and then it can be used like a normal value in your other search and replace:
use strict;
use warnings;
my $r = 'xx\nyy';
$r =~ s/(\\.)/qq{"$1"}/eeg; # Translate \. as a double quoted string would
print $r;
Outputs:
xx
yy

Print the name gathered after a regular expression in Perl

OK, so I need to gather people's names from a txt file using Perl. I can find the names and print them out with NAME: in front using a regular expression, but I need to gather just the person's name. I want to do this using regular expressions, because there are multiple different names to gather in each file.
Example of file input:
NAME: Bigelow, Patrick R DATE: 28 Apr 2014
code so far:
if (/NAME:/) {
    my @arr = /NAME:\s\S*\s\S*\s\S*/g or next;
    print "$_\n" for @arr;
}
Better to use two lines of code: one to strip out the name (between NAME: ... DATE:), and one to split the name on the comma.
my ($tmp) = /(?<=NAME: )(.*?)(?=\s+DATE:)/g or next;
my @arr = split(/,\s+/, $tmp);
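Wrapped in a minimal read loop (a sketch using the sample line from the question in a __DATA__ section):

while (<DATA>) {
    my ($tmp) = /(?<=NAME: )(.*?)(?=\s+DATE:)/g or next;
    my @arr = split(/,\s+/, $tmp);
    print "$_\n" for @arr;    # prints: Bigelow   then   Patrick R
}

__DATA__
NAME: Bigelow, Patrick R DATE: 28 Apr 2014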
You can use s/regex/replace/ to do what you want. I avoid using postfix loops (they're easy to miss and a pain to fix when you add more lines) and the use of $_ (it does not improve readability, is inconsistently implemented, and is global in scope):
use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    TEXT_FILE => 'file.txt',
    NAME_LINE => qr/^NAME:\s+/,
};

open my $name_fh, "<", TEXT_FILE;   # Autodie eliminates need for testing if this worked

while ( my $line = <$name_fh> ) {
    chomp $line;
    next unless $line =~ NAME_LINE;   # Skip lines if they don't contain a name
    ( my $name = $line ) =~ s/NAME:\s+(.+?)\s+DATE.*/$1/;
    say qq("$name");
}
close $name_fh;
The regular expression s/NAME:\s+(.+?)\s+DATE.*/$1/ needs a bit of explaining:
The parentheses (...) are a capture group. I can refer to what they captured with $1 on the substitution side.
The +? is a non-greedy qualifier: I match the smallest string that still lets the regular expression match. The trick is that \s+DATE.* will match as much as possible. Between the two, I eliminate all of the spaces between the end of the name and the DATE string. It's really not that efficient, but it works well.
The qq(...) is just a fancy way of doing double quotes. It makes it easy to put quotes inside your quotes.
I use say instead of print because say automatically adds the \n at the end. This becomes important where appending . "\n" to something can cause problems.
For example:
print join ": ", #foo . "\n"; # Doesn't work.
say join ": ", #foo; # This works fine.

PowerShell multiple string replacement efficiency

I'm trying to replace 600 different strings in a very large (30 MB+) text file. I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times, and there are well over 150,000 lines in the text file, that means a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
    'something0' = 'somethingelse0'
    'something1' = 'somethingelse1'
    'something2' = 'somethingelse2'
    'something3' = 'somethingelse3'
    'something4' = 'somethingelse4'
    'something5' = 'somethingelse5'
    'X:\Group_14\DACU' = '\\DACU$'
    '.*[^xyz]' = 'oO{xyz}'
    'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
    # Return replacement value for each matched value.
    $matchedValue = $matchInfo.Groups[0].Value
    $replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
    foreach { $r.Replace( $_, $matchEval ) } |
    Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';

%replacements = (
    'something0' => 'somethingelse0',
    'something1' => 'somethingelse1',
    'something2' => 'somethingelse2',
    'something3' => 'somethingelse3',
    'something4' => 'somethingelse4',
    'something5' => 'somethingelse5',
    'X:\Group_14\DACU' => '\\DACU$',
    '.*[^xyz]' => 'oO{xyz}',
    'moresomethings' => 'moresomethingelses'
);

foreach (keys %replacements) {
    push @strings, qr/\Q$_\E/;
    $replacements{$_} =~ s/\\/\\\\/g;
}

$pattern = join '|', @strings;

while (<INPUT>) {
    s/($pattern)/$replacements{$1}/g;
    print OUTPUT;
}

close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression (see the small demo after this list). But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
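A small demo of the @{[ ... ]} interpolation trick mentioned above (a sketch):

my $n = 3;
print "twice $n is @{[ $n * 2 ]}\n";    # prints: twice 3 is 6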
I also have no idea how to solve this in powershell, but I do know how to solve it in Bash and that is by using a tool called sed. Luckily, there is also Sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in windows. If the first command complains you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the powershell switch statement:
$string = gc $filePath
$string | % {
    switch -regex ($_) {
        'something0' { 'somethingelse0' }
        'something1' { 'somethingelse1' }
        'something2' { 'somethingelse2' }
        'something3' { 'somethingelse3' }
        'something4' { 'somethingelse4' }
        'something5' { 'somethingelse5' }
        'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
        ...
        (600 More Lines...)
        ...
        default { $_ }
    }
} | ac "C:\log.txt"

Removing newline character from a string in Perl

I have a string that is read from a text file on Ubuntu Linux, and I am trying to delete the newline character from its end.
I have tried every way I know. With s/\n|\r/-/ (to check whether it finds and replaces any newline characters) it does replace something, but the output still goes to the next line when I print it. Moreover, when I use chomp or chop, the string is completely deleted. I could not find any other solution. How can I fix this problem?
use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";
open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
my #strings;
my #fileNames;
my #erroredFileNames;
my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;
while (<MYINPUTFILE>)
{
    my ($line) = $_;
    my ($line2) = $_;
    if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
        #chop($line2);
        $line2 =~ s/^\n+//;
        print $line2 . " WRONG FORMAT!\n";
    }
    else {
        #print "INSERTED:".$13."\n";
        my($id) = $13;
        my($name) = $2;
        print $name . "\t" . $id . "\n";
        unshift(@fileNames, $line2);
        unshift(@strings, $line2 =~ /[^\W_]+/g);
    }
}
close(MYINPUTFILE);
The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.
The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.
use v5.10; # minimal Perl version for \R support
use utf8; # source is in UTF-8
use warnings qw(FATAL utf8); # encoding errors raise exceptions
use open qw(:utf8 :std); # default open mode, `backticks`, and std{in,out,err} are in UTF-8
while (<>) {
    s/\R\z//;
    ...
}
You are probably experiencing a line ending from a Windows file causing issues. For example, a string such as "foo bar\n", would actually be "foo bar\r\n". When using chomp on Ubuntu, you would be removing whatever is contained in the variable $/, which would be "\n". So, what remains is "foo bar\r".
This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:
my $var = "foo bar\r\n";
chomp $var;
print "$var\n"; # Remove and put back newline
But when you concatenate the string with another string, you overwrite the first string, because \r moves the cursor back to the beginning of the line. For example:
print "$var: WRONG\n";
It would effectively be "foo bar\r: WRONG\n", but the \r would cause the following text to be printed back over the start of the first part:
foo bar\r # \r resets position
: WRONG\n # Second line prints and overwrites
This is more obvious when the first line is longer than the second. For example, try the following:
perl -we 'print "foo bar\rbaz\n"'
And you will get the output:
baz bar
The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:
$line =~ s/[\r\n]+$//;
Also, be aware that your other code is somewhat horrific. What, for example, do you think $13 contains? That would be the string captured by the 13th set of parentheses in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 sets of parentheses.
You declare two sets of $id and $name. One outside the loop and one at the top. This is very poor practice, IMO. Only declare variables within the scope they need, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.
Why use $line and $line2 when they have the same value? Just use $line.
And seriously, what is up with this:
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?
First off, since it is an if-else, just swap the branches around and reverse the regular expression. Second, [^\W_] is a double negation, which is rather confusing. Why not just use [A-Za-z0-9]? You can split this up to make it easier to parse:
if ($line =~ /^(.+)(\.docx)\s*$/) {
    my $pre = $1;
    my $ext = $2;
You can wipe the linebreaks with something like this:
$line =~ s/[\n\r]//g;
When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.
I also wouldn't do this type of thing:
print $line2." WRONG FORMAT!\n";
You can do
print "$line2 WRONG FORMAT!\n";
... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.
You can do something like:
=~ tr/\n//d
But really, chomp should work:
while (<filehandle>) {
    chomp;
    ...
}
Also s/\n|\r// only replaces the first occurrence of \r or \n. If you wanted to replace all occurrences you would want the global modifier at the end s/\r|\n//g.
Note: if you're including \r for Windows, it usually ends its lines with \r\n, so you would want to replace both (e.g. s/(?:\r\n|\n)//); of course, the statement above (s/\r|\n//g) with the global modifier takes care of that anyway.
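A quick check of the global form (a sketch):

my $s = "foo bar\r\n";
$s =~ s/\r|\n//g;            # strips both the CR and the LF
print length($s), "\n";      # prints 7 ("foo bar")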
$variable = join('', split(/\n/, $variable));