Use sed to remove string results in empty file - regex

I have large text files, in which sometimes long lines are broken into multiple lines by writing a = and then a newline character. (Enron email data from Kaggle). Since even words are broken this way and I want to do some machine learning with the data, I'd like to remove those breaks. As far as I can see the combination =\n is only used for these breaks, so if I remove those, I have the same information without the breaks and nothing gets lost.
I cannot use tr because it only replaces 1 character, but I have two characters to replace.
The sed command I am using so far to no avail is:
sed --in-place --quiet --regexp-extended 's/=\n//g' email_aa_edit
where email_aa_edit is my input file, a part of the Enron mail data (I used split to split it). However, this only produces an empty file and I am not sure why. As far as I know, = is not a special character on its own and the newline should be \n.
What is the correct way of removing those =\n occurrences?

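Your file ends up empty because --quiet suppresses sed's automatic printing, so with --in-place nothing is written back. Beyond that, s/=\n// can never match: sed works line by line and the pattern space does not contain the trailing newline. It becomes possible if you append the next line to the pattern space: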
sed ':a;/=$/{N;s/=\n//;ta}' file
details:
:a; # defines a label "a"
/=$/ { # if the line ends with =
N; # append the next line to the pattern space
s/=\n//; # replace the =\n
ta # jump to label "a" when something is replaced (that's always the case
# except if the last line ends with =)
}
Note: if your file uses the Windows newline sequence (\r\n), each broken line ends in =\r, so change /=$/ to /=\r$/ and =\n to =\r\n (GNU sed understands \r).
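For comparison, the same join can be done outside sed. This is only a minimal Python sketch, not part of the answer above: the function name join_soft_breaks is made up for illustration, and it assumes (as the question does) that = directly followed by a line break is only ever a soft break.

```python
import re

def join_soft_breaks(text):
    """Remove every '=' that is immediately followed by a line break.

    Accepts both LF and CRLF line endings.
    """
    return re.sub(r"=\r?\n", "", text)

sample = "This is a long li=\nne that was bro=\nken.\nNext line.\n"
print(join_soft_breaks(sample), end="")
```

Because the whole text is treated as one string, no label or loop is needed; the trade-off is that the file has to fit in memory.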

Matching the end of line $ in perl; print showing different behavior with chomp

I am reading a file and matching a regex against lines that start with a hex number, followed by a few dot-separated hex values, followed by an optional array name which may contain an optional index. For example:
010c10 00000000.00000000.0000a000.02300000 myFooArray[0]
while (my $rdLine = <RDHANDLE>) {
chomp $rdLine;
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$/) {
...
My source file containing these hex strings is also script generated. This match works fine for some files, but other files produced through the exact same script (i.e. no extra spaces, formats etc.) do not match when the last $ is present in the match condition.
If I modify the condition to not have the end $, lines match as expected.
Another curious thing is for debugging this, I added a print statement like this:
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+/) {
print "Hey first part matched for $rdLine \n";
}
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)/) {
print "Hey second part matched for $rdLine \n";
}
The output on the terminal for the following input eats the first character :
010000 00000000 foo
"ey first part matched for 010000 00000000 foo
ey second part matched for 010000 00000000 foo"
If I remove the chomp, it prints the Hey correctly instead of just ey.
Any clues appreciated!
"other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition"
Although you deny it, I am certain that your file contains a single space character directly before the end of the line. You should check by using Data::Dump to display the true contents of each file record. Like this
use Data::Dump;
dd \$read_line;
It is probably best to use
$read_line =~ s/\s+\z//;
in place of chomp. That will remove all spaces and tabs, as well as line endings like carriage-return and linefeed from the end of each line.
"If I remove the chomp, it prints the Hey correctly instead of just ey."
It looks like you are working on a Linux machine, processing a file that was generated on a Windows platform. Windows uses the two characters CR LF as a record separator, whereas Linux uses just LF, so a chomp removes just the trailing LF, leaving CR to cause the start of the string to be overwritten.
If it weren't for your secondary problem of having trailing whitespace, the best solution here would be to replace chomp $read_line with $read_line =~ s/\R\z//. The \R escape matches the Unicode idea of a line break sequence, and was introduced in version 10 of Perl 5. However, the aforementioned s/\s+\z// will deal with your line endings as well, and should be all that you need.
Borodin is right, \r\n is the culprit.
I used a less elegant solution, but it works:
$rdLine =~ s/\r//g;
followed by:
chomp $rdLine;
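To see the effect in isolation, here is a small Python sketch of the same situation (not the poster's Perl). The helper names are hypothetical and the regex is copied from the question. It assumes the file was read on Linux, so only the LF is stripped by the chomp-like step and the CR stays in the string.

```python
import re

# Same shape as the pattern in the question, anchored at the end of the line.
PATTERN = re.compile(
    r"^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$"
)

def matches_after_chomp(raw):
    """Strip only a trailing LF (what chomp does on Linux), then match."""
    line = raw[:-1] if raw.endswith("\n") else raw
    return PATTERN.match(line) is not None

def matches_after_rstrip(raw):
    """Strip all trailing whitespace, including CR and LF, then match."""
    return PATTERN.match(raw.rstrip()) is not None

unix_line = "010c10 00000000.00000000.0000a000.02300000 myFooArray[0]\n"
windows_line = "010c10 00000000.00000000.0000a000.02300000 myFooArray[0]\r\n"

print(matches_after_chomp(unix_line))      # True
print(matches_after_chomp(windows_line))   # False: the CR sits before the end
print(matches_after_rstrip(windows_line))  # True
```

The leftover CR also explains the missing "H": printing the line moves the cursor back to column 0, and the space that follows overwrites the first character.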

perl regexp for multi line file

I have patterns in a file which look like this:
db::parameter nf
-data. Value. \
-data2. Value2. \
db::parameter ww
-data1. Value1. \
-data2. Value2. \
I need a regexp which will take the whole pattern into a variable, starting from db.
I tried to match the pattern until an empty line shows up:
while (<$infile>) {
chomp;
if ($_ =~ /db:parameter\s+$/) {
print $_;
}
}
P.S. I know the regexp is totally wrong, but I'm not that good at regexps.
If you want to use an empty line as a record separator, may I suggest using paragraph mode?
$/ = ""; # set input record separator to empty string
while (<>) { # proceed as usual
Using the empty string is a special case, as described in the perlvar documentation for $/:
Setting $/ to "\n\n" means something slightly different than setting to "" , if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.
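The difference is easier to see with a concrete input. Below is a rough Python sketch of the "" behaviour, not Perl's actual implementation: the function name read_paragraphs is hypothetical, and unlike Perl it drops the separating newlines from each record.

```python
import re

def read_paragraphs(text):
    """Split text into records separated by one or more blank lines."""
    # Two or more consecutive newlines count as ONE separator, which mirrors
    # setting $/ to "" rather than to "\n\n".
    records = re.split(r"\n{2,}", text.strip("\n"))
    return [record for record in records if record]

text = (
    "db::parameter nf\n"
    "-data. Value. \\\n"
    "-data2. Value2. \\\n"
    "\n\n\n"  # several empty lines in a row still separate only two records
    "db::parameter ww\n"
    "-data1. Value1. \\\n"
    "-data2. Value2. \\\n"
)

for record in read_paragraphs(text):
    print(repr(record))
```

Each record then holds one complete db::parameter block, continuation lines included, so it can be matched as a single string.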

Regex: Match any character (including whitespace) except a comma

I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To match across multiple lines with sed, you need its multiline commands.
EDIT:
sed handles newlines a little differently. It provides three commands for multiline operations: N, P and D. The linked explanation (Working with Multiple Lines) discusses all three.
My guess is that the N command is the missing piece here. Adding N appends the next line to the pattern space, which lets the regex see \n in the string.
An example from the linked tutorial:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
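If it helps to see the loop spelled out, here is a rough Python rendering of the same logic. It is a sketch only: the function name replace_up_to_comma and its replacement parameter are made up for illustration, and it works on a list of lines rather than a real pattern space.

```python
def replace_up_to_comma(lines, replacement='func("ok"'):
    """Join lines from a 'func' line until a comma appears, then replace
    everything before the first comma (like the sed loop above)."""
    out = []
    it = iter(lines)
    for line in it:
        if "func" not in line:
            out.append(line)
            continue
        buf = line
        # Keep appending lines (sed's N) until the buffer contains a comma.
        while "," not in buf:
            nxt = next(it, None)
            if nxt is None:  # ran out of input without finding a comma
                break
            buf += "\n" + nxt
        if "," in buf:
            # Same effect as s/[^,]*/.../ : replace the text before the comma.
            buf = replacement + buf[buf.index(","):]
        out.append(buf)
    return out

source = [
    'func("bla bla bla"',
    '"asdfasdfasdfasdfasdfasdf"',
    '"asdfasdfasdf", "more strings")',
]
print("\n".join(replace_up_to_comma(source)))
```

With the three input lines from the example, this prints the same single line as the sed script.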

How to match the end of a line but not a paragraph with a regex in Vim?

I am trying to join all lines in a paragraph, but not join one paragraph with the next.
In my text file, the paragraph is not defined by blank lines in between them, but with a period at the end of the line. There could be white spaces after the period but it still defines the end of the paragraph.
So, I wanted to do a macro that jumps to the next end of line, not stopping on those lines that have a period at the end.
I used this regex:
[^\.\s][\s]*$
Meaning: find any character that is not a period nor a whitespace, optionally followed by whitespaces to the end of the line.
I would then apply the J command to join the matched line with the next one, and then repeat.
It works fine on RegexPal, but in Vim it stops at lines that have a period and two spaces.
What am I doing wrong?
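Inside a [] collection Vim does not treat \s as whitespace, so [^\.\s] and [\s]* do not behave the way they do on RegexPal. Instead of using the regex in a macro in conjunction with the J command, how about using a regex substitution to remove linebreaks? This seems to work for me: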
:%s/[^\.]\s*\zs$\n\(^\s*$\n\)*/ /
Explanation:
[^\.]\s*\zs$\n -- lines not ending with a period; start the replacement before the linebreak.
\(^\s*$\n\)* -- include any further lines containing only whitespace
This regex is then replaced with a space.
If the cursor is located at the first line of a paragraph,
one can join its lines with
:,/\.\s*$/j
To do the same for all paragraphs in a buffer, use the command
:g/^/,/\.\s*$/j
This should get you part way there: use shime's regexp (\.\s*$) to identify lines you want to join, then use :v//j! to join each such line to the next line.
Then repeat the :v//j! command until done. (Define a macro to do it: :map v :v//j!<cr> then just hit v repeatedly.)
A better solution, if you're on a *NIX-like machine is:
awk '/\.[ \t]*$/ { printf("%s\n", $0); next } { printf("%s", $0) } END { printf("\n") }' <your_file >your_other_file
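The same idea, lines accumulate until one ends in a period, can also be sketched in Python. This is an illustration rather than any of the posters' methods: join_paragraphs is a hypothetical name, and like Vim's J it puts a single space between joined lines.

```python
import re

# A paragraph ends at a period, optionally followed by trailing whitespace.
END_OF_PARAGRAPH = re.compile(r"\.\s*$")

def join_paragraphs(lines):
    """Join consecutive lines into paragraphs ending at a period."""
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        if END_OF_PARAGRAPH.search(line):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # text after the last period, if any
        paragraphs.append(" ".join(current))
    return paragraphs

lines = ["First line", "continues here.  ", "Second paragraph", "ends."]
for paragraph in join_paragraphs(lines):
    print(paragraph)
```

Note that a line ending in a period plus two spaces is still recognised as the end of a paragraph, which is the case that failed in the original macro.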

sed - behaviour of holdspace

I have (from the sed website http://sed.sourceforge.net/sed1line.txt) this one-liner:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
Its purpose is to print a paragraph only if it contains AAA, BBB and CCC (all three, in any order).
My understanding of the script:
'/./' matches every line which is not empty
'{}' all commands within the braces handle the matched lines
'H' appends the matched lines to the hold space
'$!d' delete from the pattern space everything but the last line
'x' swaps the pattern space and the hold space
'/AAA/!d' search for AAA paragraph and print it
What is not clear to me:
In the holdspace should be several separate lines (for each paragraph), why am I able to search the whole paragraph? Are the lines in the holdspace merged to one line?
And how does sed know when one paragraph ends and the other begins in the holdspace?
Why do I have to append '$!d', why is not '$d' sufficient? Why am I not able to omit the '-n' and use '$p' instead of '$!d' in this case?
Thank you very much for every comment!
My test data (match every paragraph with XX in it):
YYaaaa
aaa1
aaa2
aXX3
aaa4
YYbbbb
bbb1
bbb2
YYcccc
ccc1
ccc2
ccc3
cXX4
ccc5
YYdddd
ddd1
dXX2
Following command is used:
sed -ne '/./{H;$!d};x;/XX/p' test2
Versions:
$ sed --version
GNU sed-Version 4.2.1
$ bash --version
GNU bash, Version 4.2.10(1)-release (x86_64-pc-linux-gnu)
It collects a paragraph as individual lines into the hold space (H), then when you hit an empty line, /./ fails and it falls through to the x which basically zaps the hold space for the next paragraph.
In order to correctly handle the final paragraph, it needs to cope with a paragraph which is not followed by an empty line, therefore it falls through from the last line as if it were followed by an empty line. This is a common idiom for scripts which collect something up through a particular pattern (or, to put it differently, it's a common error for such scripts to fail to handle the last collected data at end of file).
So in other words, if we are looking at a non-empty line, add it to the hold space, and unless it's the last line in the file, delete it and start over from the beginning of the script with the next input line. (Perhaps your understanding of d was not complete? This is what $!d means.)
Otherwise, we have an empty line, or end of file, and the hold space contains zero or more lines of text (one paragraph, possibly empty). Exchange them into the pattern space (the current, empty, line conveniently moves to the hold space) and examine the pattern space. If it fails to match one of our expressions, delete it. Otherwise, the default action is to print the entire pattern space.
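To make the buffer handling concrete, here is a small Python simulation of the test command sed -ne '/./{H;$!d};x;/XX/p'. It is a sketch, not how sed is implemented: the function name paragraphs_matching is hypothetical, and input lines are assumed to arrive without their trailing newline.

```python
def paragraphs_matching(lines, needle):
    """Simulate: sed -n '/./{H;$!d};x;/NEEDLE/p' on a list of lines."""
    hold = ""      # the hold space is ONE buffer with embedded newlines
    printed = []
    last = len(lines) - 1
    for i, pattern in enumerate(lines):
        if pattern != "":                  # /./  : non-empty line
            hold = hold + "\n" + pattern   # H    : append newline + line
            if i != last:                  # $!d  : not the last line, so
                continue                   #        start the next cycle
        pattern, hold = hold, pattern      # x    : exchange the two buffers
        if needle in pattern:              # /NEEDLE/p
            printed.append(pattern)
    return printed

data = ["YYaaaa", "aaa1", "aXX3", "", "YYbbbb", "bbb1", "", "YYdddd", "dXX2"]
for paragraph in paragraphs_matching(data, "XX"):
    print(paragraph)
```

This answers the first question directly: the hold space is a single string, so the regex sees the whole paragraph at once. Each printed paragraph starts with a newline because H always inserts one before the appended line.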