perl regexp for multi line file - regex

I have patterns in a file which look like this:
db::parameter nf
-data. Value. \
-data2. Value2. \
db::parameter ww
-data1. Value1. \
-data2. Value2. \
I need a regexp that captures each whole pattern, starting from db, into a variable.
I tried to match the pattern until an empty line shows up:
while(<$infile>){
chomp;
if( $_=~/db:parameter\s+$/){
print $_;}
}
P.S. I know the regexp is totally wrong, but I'm not that good at regexps.

If you want to use an empty line as a record separator, may I suggest using paragraph mode?
$/ = ""; # set input record separator to empty string
while (<>) { # proceed as usual
Using the empty string is a special case, as described in the perlvar documentation for $/:
Setting $/ to "\n\n" means something slightly different than setting to "" , if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.
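The difference between the two separators can be illustrated outside Perl. The following is a rough Python analogy, not Perl's actual implementation; the function names and the sample text are made up for the illustration.

```python
import re

def paragraphs_collapsing(text):
    # Analogy for $/ = "": any run of two or more newlines is one separator.
    return [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]

def paragraphs_literal(text):
    # Analogy for $/ = "\n\n": split on exactly two newlines, nothing smarter.
    return text.split("\n\n")

# Hypothetical sample with three consecutive empty lines between the records.
sample = "db::parameter nf\n-data Value\n\n\n\ndb::parameter ww\n-data1 Value1\n"

print(paragraphs_collapsing(sample))
print(paragraphs_literal(sample))
```

The collapsing version yields two clean records, while the literal version produces a spurious empty record between them.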


Keep all lines of a list with identical beginning (Notepad++)

From a list, how to keep all occurrences of those lines only whose "first part or beginning" (defined from the beginning of the line to the ^ character) is present in other lines? (The pattern of lines in the list: beginning-of-line^rest_of_line_012345)
The type of characters, length, etc. after the ^ is irrelevant (but needs to be kept). Every line has only one (1) ^ character. The "beginning" string that determines identity must be present in the same (analogous) position in other lines (i.e., from the beginning of the line to ^, and must be exact match). (Lines contain characters that trouble regex, such as \/()*., so these need to be summarily escaped.)
For example: Original list:
abc^123
0xyz^xxx
aaa-123^123
aaa-12^0xyz
0xyz^098
00xyz^098
0xyz^x111xx
Keep all occurrences of lines with identical first part:
0xyz^xxx
0xyz^098
0xyz^x111xx
This elegant script by @Lars Fischer ((.*)\R(\2\R?)+)*\K.* (after pre-sorting) keeps all occurrences of duplicate lines, but it considers the entire line (it was designed to do so).
In this Q, I am looking for a solution that considers only the "beginning" of the line to see if it occurs more than once, and if yes, then keep the entire line. Any guidance?
Note: in this solution the characters # and % are used based on the assumption that these characters do not show up ANYWHERE in the file to begin with. If that's not the case for you, just use different patterns that you know don't show up anywhere in the file, such as ##### and %%%%%.
Start by sorting the file Lexicographically with Notepad++ by going to Edit -> Line Operations -> Sort Lines Lexicographically Ascending
Do a regex Find-and-Replace (UNcheck the box for ". matches newline"):
Find what:
^(.*?)\^[^\r\n]+[\r\n]+(\1\^.*?[\r\n]+)*\1\^.*?$
Replace with:
#$&%
Now do another regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
%.*?#
Replace with:
\r\n
Finally, do one last regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
^.*?#|%.*
Replace with nothing.
You said in comments that a perl script is OK for you.
#!/usr/bin/perl
use Modern::Perl;
my %values;
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file': $!";
while(<$fh>) {
chomp;
# get the prefix value
my ($prefix) = split('\^', $_);
# append the whole line to the array stored under its prefix
push @{$values{$prefix}}, $_;
}
foreach (keys %values) {
# skip prefixes that have only one line
next if scalar @{$values{$_}} == 1;
local $" = "\n";
say "@{$values{$_}}";
}
Output:
0xyz^xxx
0xyz^098
0xyz^x111xx
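For comparison, the same grouping idea can be sketched in Python. This is only an illustrative equivalent of the Perl script above, assuming every line contains exactly one ^ as stated in the question; the function name keep_shared_prefixes is invented here.

```python
from collections import defaultdict

def keep_shared_prefixes(lines):
    # Group whole lines by the text before the first '^'.
    groups = defaultdict(list)
    for line in lines:
        prefix = line.split("^", 1)[0]
        groups[prefix].append(line)
    # Keep only the groups whose prefix occurs on more than one line.
    return [line for group in groups.values() if len(group) > 1 for line in group]

lines = [
    "abc^123", "0xyz^xxx", "aaa-123^123", "aaa-12^0xyz",
    "0xyz^098", "00xyz^098", "0xyz^x111xx",
]
print("\n".join(keep_shared_prefixes(lines)))
```

Because the prefix is compared as a plain string, characters such as \/()*. need no escaping, and no pre-sorting is required.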

Use sed to remove string results in empty file

I have large text files, in which sometimes long lines are broken into multiple lines by writing a = and then a newline character. (Enron email data from Kaggle). Since even words are broken this way and I want to do some machine learning with the data, I'd like to remove those breaks. As far as I can see the combination =\n is only used for these breaks, so if I remove those, I have the same information without the breaks and nothing gets lost.
I cannot use tr because it only replaces 1 character, but I have two characters to replace.
The sed command I am using so far to no avail is:
sed --in-place --quiet --regexp-extended 's/=\n//g' email_aa_edit
where email_aa_edit is a part of the Enron mail data (I used split to split it) and is my input file. However, this only produces an empty file and I am not sure why. AFAIK = is not a special character on its own and the newline should be \n.
What is the correct way of removing those =\n occurrences?
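You can't remove newline characters with a plain substitution because sed works line by line, so the pattern space never contains the trailing newline. (The file ends up empty because --quiet suppresses the automatic printing and the script never prints anything explicitly, so --in-place writes back nothing.) It is possible, though, if you append the next line to the pattern space: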
sed ':a;/=$/{N;s/=\n//;ta}' file
details:
:a; # defines a label "a"
/=$/ { # if the line ends with =
N; # append the next line to the pattern space
s/=\n//; # replace the =\n
ta # jump to label "a" when something is replaced (that's always the case
# except if the last line ends with =)
}
Note: if your file uses the Windows newline sequence, change \n to \r\n.
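If a script is an option, the same join can be sketched in Python. This is a minimal illustration rather than a replacement for the sed command: it assumes the text fits in memory, and the function name join_soft_breaks is made up for the example.

```python
def join_soft_breaks(text):
    # Remove "=" followed by a line ending, handling both CRLF and LF files.
    return text.replace("=\r\n", "").replace("=\n", "")

# Hypothetical sample: a word broken across lines, once per line-ending style.
sample = "long wo=\nrd here\nnext=\r\nline\n"
print(repr(join_soft_breaks(sample)))
```

For very large files, reading and joining line by line, as the sed loop does, would be the safer design.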

Matching the end of line $ in perl; print showing different behavior with chomp

I am reading a file and matching a regex for lines with a hex number at the start, followed by a few dot-separated hex values, followed by an optional array name which may contain an optional index. For example:
010c10 00000000.00000000.0000a000.02300000 myFooArray[0]
while (my $rdLine = <RDHANDLE>) {
chomp $rdLine;
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$/) {
...
My source file containing these hex strings is also script generated. This match works fine for some files but other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition.
If I modify the condition to not have the end $, lines match as expected.
Another curious thing is for debugging this, I added a print statement like this:
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+/) {
print "Hey first part matched for $rdLine \n";
}
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)/) {
print "Hey second part matched for $rdLine \n";
}
The output on the terminal for the following input eats the first character :
010000 00000000 foo
"ey first part matched for 010000 00000000 foo
ey second part matched for 010000 00000000 foo"
If I remove the chomp, it prints the Hey correctly instead of just ey.
Any clues appreciated!
"other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition"
Although you deny it, I am certain that your file contains a single space character directly before the end of the line. You should check by using Data::Dump to display the true contents of each file record. Like this
use Data::Dump;
dd \$read_line;
It is probably best to use
$read_line =~ s/\s+\z//;
in place of chomp. That will remove all spaces and tabs, as well as line endings like carriage-return and linefeed from the end of each line.
"If I remove the chomp, it prints the Hey correctly instead of just ey."
It looks like you are working on a Linux machine, processing a file that was generated on a Windows platform. Windows uses the two characters CR LF as a record separator, whereas Linux uses just LF, so a chomp removes just the trailing LF, leaving CR to cause the start of the string to be overwritten.
If it wasn't for your secondary problem of having trailing whitespace, the best solution here would be to replace chomp $read_line with $read_line =~ s/\R\z//. The \R escape matches the Unicode idea of a line break sequence (including CR LF), and was introduced in version 10 of Perl 5. However, the aforementioned s/\s+\z// will deal with your line endings as well, and should be all that you need.
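The effect of a leftover carriage return on an end-anchored pattern can be reproduced in Python as well. This is a sketch under the assumption that the input line ends in CR LF; the variable names are illustrative, and the pattern is the one from the question.

```python
import re

PATTERN = re.compile(
    r"^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$"
)

# Assumed input: a line written with Windows line endings.
raw = "010c10 00000000.00000000.0000a000.02300000 myFooArray[0]\r\n"

lf_only = raw[:-1]        # roughly what chomp does on Linux: drops LF, keeps CR
stripped = raw.rstrip()   # like s/\s+\z//: drops all trailing whitespace

print(bool(PATTERN.search(lf_only)))    # the stray CR blocks the final $
print(bool(PATTERN.search(stripped)))   # matches once the CR is gone
```

The stray CR also explains the terminal output: it moves the cursor back to column 0, so the trailing space printed after $rdLine overwrites the H of Hey.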
Borodin is right, \r\n is the culprit.
I used a less elegant solution, but it works:
$rdLine =~ s/\r//g;
followed by:
chomp $rdLine;

Regular expression in TCL

I have to parse this format using regexp in TCL.
Here is the format
wl -i eth1 country
Q1 (Q1/27) Q1
I'm trying to use the word country as a keyword to parse the format 'Q1 (Q1/27) Q1'.
I can do it if it is in a same line as country using the following regexp command.
regexp {([^country]*)country(.*)} $line match test country_value
But how can i tackle the above case?
Firstly, the regular expression you are using isn't doing quite the right thing in the first place, because [^country] matches a set of characters that consists of everything except the letters in country (so it matches from the h in eth1 onwards only, given the need to have country afterwards).
By default, Tcl uses the whole string to match against and newlines are just ordinary characters. (There is an option to make them special by also specifying -line, but it's not on by default.) This means that if I use your whole string and feed it through regexp with your regular expression, it works (well, you probably want to string trim $country_value at some point). This means that your real problem is in presenting the right string to match against.
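Both points can be checked with a quick experiment. The sketch below uses Python's re module rather than Tcl, so it is only an analogy: re.DOTALL is passed to mimic Tcl's default of treating newlines as ordinary characters, and the sample string is the one from the question.

```python
import re

text = "wl -i eth1 country\nQ1 (Q1/27) Q1"

# [^country] is a character set: any single character except c, o, u, n, t, r, y.
m = re.search(r"([^country]*)country(.*)", text, re.DOTALL)

print(repr(m.group(1)))          # only the tail of "eth1 " survives
print(repr(m.group(2).strip()))  # the line after "country"
```

The first group starts at the h of eth1 because the t before it belongs to the excluded set.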
If you're presenting lines one at a time (read from a file, perhaps) and you want to use a match against one line to trigger processing in the next, you need some processing outside the regular expression match:
set found_country 0
while {[gets $channel line] >= 0} {
if {$found_country} {
# Process the data...
puts "post-country data is $line"
# Reset the flag
set found_country 0
} elseif {[regexp {(.*) country$} $line -> leading_bits]} {
# Process some leading data...
puts "pre-country data is $leading_bits"
# Set the flag to handle the next line specially
set found_country 1
}
}
If you want to skip blank lines completely, put an if {$line eq ""} continue before the if {$found_country} ....
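The flag-based loop carries over directly to other languages. Here is a Python sketch of the same idea, including the blank-line skip; the function name country_values and the sample input are assumptions made for the illustration.

```python
import re

def country_values(lines):
    found_country = False
    values = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            continue                      # skip blank lines completely
        if found_country:
            values.append(line)           # the line right after "... country"
            found_country = False         # reset the flag
        elif re.search(r"(.*) country$", line):
            found_country = True          # handle the next line specially
    return values

print(country_values(["wl -i eth1 country\n", "\n", "Q1 (Q1/27) Q1\n"]))
```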

Handle commas in quoted strings in Tcl

I'm using the following line in Tcl to parse a comma-separated line of fields. Some of the fields may be quoted so they can contain commas:
set line {12,"34","56"}
set fresult [regsub -all {(\")([^\"]+)(\",)|([^,\"]+),} $line {{\2\4} } fields]
puts $fields
{12} {34} "56"
(It's a bit strange that the last field is quoted instead of braced but that's not the problem here)
However, when there is a comma in the quote, it does not work:
set line {12,"34","56,78"}
set fresult [regsub -all {(\")([^\"]+)(\",)|([^,\"]+),} $line {{\2\4} } fields]
puts $fields
{12} {34} "{56} 78"
I would expect:
{12} {34} {56,78}
Is there something wrong with my regexp or is there something Tcl-ish going on?
One option that comes to mind is using the CSV functionality in TclLib. (No reason to reinvent the wheel unless you have to...)
http://tcllib.sourceforge.net/doc/csv.html
Docs Excerpt
::csv::split ?-alternate? line {sepChar ,} {delChar "}
Converts a line in CSV format into a list of the values contained in the line. The character used to separate the values from each other can be defined by the caller, via sepChar, but this is optional. The default is ",". The quoting character can be defined by the caller, but this is optional. The default is '"'. If the option -alternate is specified a slightly different syntax is used to parse the input. This syntax is explained below, in the section FORMAT.
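The same "use a CSV parser instead of a regex" approach is shown below in Python, using its standard csv module in place of tcllib. This only illustrates the idea and says nothing about how ::csv::split is implemented.

```python
import csv

line = '12,"34","56,78"'

# csv.reader takes an iterable of lines; a one-element list parses a single line.
fields = next(csv.reader([line]))
print(fields)
```

A dedicated parser also covers cases the regex attempts miss, such as empty fields.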
The problem seems to be an extra comma: you only accept quoted strings if they have a comma after them, and you do the same for non-quoted tokens. This works (note that neither alternative ends with a comma):
set fresult [regsub -all {(\")([^\"]+)(\")|([^,\"]+)} $line {{\2\4} } fields]
Working Example: http://ideone.com/O2hss
You can safely keep the commas out of the pattern. The regex engine will keep searching for new matches: it will skip a comma it cannot match, and start at the next character.
Bonus: this version also handles escaped quotes written as \" (if you need doubled quotes instead, you should be able to adapt it easily by using "" in place of \\.):
set fresult [regsub -all {"((?:[^"\\]|\\.)+)"|([^,"]+)} $line {{\1\2} } fields]
Example: http://ideone.com/ztkBh
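The tokenizing pattern is not Tcl-specific. As a sketch, here it is with Python's re.findall, which returns one (quoted, bare) pair per match; the helper name split_fields is invented for the example.

```python
import re

# First alternative: a quoted field that may contain backslash escapes.
# Second alternative: a run of characters that are neither commas nor quotes.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)+)"|([^,"]+)')

def split_fields(line):
    # Exactly one of the two groups is non-empty for each match.
    return [quoted or bare for quoted, bare in TOKEN.findall(line)]

print(split_fields('12,"34","56,78"'))
```

Like the regsub version, this drops empty fields silently, since neither alternative can match an empty string.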
Use the following regsub
% set line {12,"34","56,78"}
% regsub -all {(,")|(",)|"} $line " " line
% set line
12 34 56,78 <<< Result
Here all the occurrences of ," or ", or " (in order) are replaced by space
As you said to @Kobi, if you allow for empty fields, you should allow for empty strings "":
{((\")([^\"]*)(\")|([^,\"]*))(,|$)} where the fields of interest shifted to 3 and 5
Expanded: { ( (\")([^\"]*)(\") | ([^,\"]*) ) (,|$) } I admit, I don't know if tcl allows (?:) non-capture grouping.