How to match CR CR LF newline pattern [duplicate] - regex

This question already has answers here:
Does awk CR LF handling break on cygwin?
(2 answers)
Closed 4 years ago.
In Windows 10 environment I have to check how many CSV files (separator is ";") in a directory have this odd newline pattern: CR CR LF (or \r\r\n if you prefer).
However, I can match \r\r neither with grep nor with awk. In awk I've also tried changing RS to ; and FS to an unused character (#), but apparently awk matches a single CR, not CR CR. So awk on Windows sees CR CR LF as CR LF, and FNR outputs a number of records equal to that of any other "normal end-line" file.
The strange thing is that with Notepad++ I can clearly see CR CR LF (causing an extra line break, e.g. in Excel), and with its built-in regex finder, searching for \r\r\n matches all the lines. Is it not possible to force awk to act on the raw text file without some CRs being removed?
The file is like this (I've simplified a little): 5 lines, each with 4 fields separated by ; and CR CR LF at the end of each line. Opening it with Notepad++ (or Excel) I see 10 lines.
I hoped that the following GNU awk script would return 16 5
BEGIN {RS = ";";FS = "#"; linecount = 0}
/\r\r/ {linecount = linecount + 1}
END {print FNR, linecount}
However, it returns 16 0. If I search to match /\r/ instead, I obtain 16 5.
So basically I'm afraid that the Windows CMD shell is stripping out one of the two consecutive CRs (or, to put it better, is replacing each CR LF pair with LF) before passing the stream to gawk. I was wondering whether it is possible to avoid this, because I want to use gawk to detect how many files have this weird CR CR LF newline.
I believe a very similar question has been posted here:
In Perl, how to match two consecutive Carriage Returns?

After realizing there is a duplicate (thanks @tripleee):
Under MS-Windows, gawk (and many other text programs) silently translates end-of-line \r\n to \n on input and \n to \r\n on output. A special BINMODE variable (c.e.) allows control over these translations and is interpreted as follows:
If BINMODE is "r" or one, then binary mode is set on read (i.e., no translations on reads).
If BINMODE is "w" or two, then binary mode is set on write (i.e., no translations on writes).
If BINMODE is "rw" or "wr" or three, binary mode is set for both read and write.
BINMODE=non-null-string is the same as BINMODE=3 (i.e., no translations on reads or writes). However, gawk issues a warning message if the string is not one of "rw" or "wr".
source: https://www.gnu.org/software/gawk/manual/gawk.html#PC-Using
To keep awk in its original POSIX style, you should use BINMODE=3. With awk (or any unmodified version), you should easily be able to do it by checking whether the record ends with \r\r. This is because awk by default splits a file into records using RS="\n". As GOW is using GNU awk, you have the following options:
count files:
awk '/\r\r$/{f++; nextfile} END {print f,"files match"}' BINMODE=3 *.csv
count files and print filename:
awk '/\r\r$/{f++; print FILENAME; nextfile} END {print f,"files match"}' BINMODE=3 *.csv
count files, print filename and lines:
awk '(FNR==1){if (c) {print fname, c; f++}; c=0; fname=FILENAME}
/\r\r$/{c++}
END {if (c) {print fname, c; f++}; print f,"files match"}' BINMODE=3 *.csv
note: remove BINMODE=3 on any POSIX system.
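For a quick sanity check (hypothetical filename; on a POSIX system gawk performs no CR/LF translation, so BINMODE can be dropped as the note says):

```shell
# build a small test file whose 5 lines end in CR CR LF
printf 'a;b;c;1\r\r\na;b;c;2\r\r\na;b;c;3\r\r\na;b;c;4\r\r\na;b;c;5\r\r\n' > crcrlf.csv

# count lines ending in CR CR; on Windows add BINMODE=3 before the filename
awk '/\r\r$/{c++} END{print c+0}' crcrlf.csv
```

This prints 5, one per CR CR LF line, since the default RS="\n" strips only the LF and leaves both CRs at the end of each record.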

You can try GNU grep's -z and -P switch, try this:
grep -zcP "\r\r\n" *.csv | awk -F: "$2{c++}END{print c}"
So I created a file like you said by this:
awk 'BEGIN{ORS="\r\r\n"; OFS=";"; for(i=1;i<11;i++)print "aa","bb","cc",i>"strange.csv"}'
And I can search \r\r\n in the csv files like this:
> grep -zcP "\r\r\n" *.csv
file1.csv:0
file2.csv:0
file3.csv:0
file_a.csv:0
file_b.csv:0
results.csv:0
strange.csv:1
And combine it with awk:
awk -F: "$2{c++}END{print c}"
to get the count:
> grep -zcP "\r\r\n" *.csv | awk -F: "$2{c++}END{print c}"
1
OR, just use awk alone:
> awk 'BEGIN{RS="";}/\r\r\n/{c++;nextfile}END{print c}' *.csv
1
So both the grep and awk examples above read the whole file at once instead of dealing with each line in turn.

Related

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but the newlines cause errors, so I need to replace them with a space.
I already run some other perl commands on these files for other errors, so I would like to have a solution in a one-line perl command.
I've found some other similar questions on Stack Overflow, but they don't quite do the same thing, and I can't find a solution for my problem in the answers mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed by a | char (matched with the ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and when the second line doesn't start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of the file is reached.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the file, and the \R++(?!\d{4,}\|) pattern matches any run of one or more line breaks (\R++) not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot backtrack into the line-break-matching part of the pattern.
With your shown samples, this could be done simply in an awk program. Written and tested in GNU awk; it should work in any awk. This should work fast even on huge files (better than slurping the whole file into memory, having mentioned that the OP may use it on huge files).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking if the number of " on this line is odd; an odd count means a quote is NOT closed properly.
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val $0;val=""} ##Else(if val NOT NULL) then print val current line and nullify val here.
next ##next will skip further statements from here.
}
1 ##In case the number of " on a line is EVEN it will skip the above (gsub) condition and simply print the line.
' Input_file ##Mentioning Input_file name here.

Using awk or sed to merge / print lines matching a pattern (oneliner?)

I have a file that contains the following text:
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
I am hoping to achieve the following result using awk or sed (as a oneliner would be a bonus! [Perl oneliner would be acceptable as well]):
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
I'd like to have each line that matches 'subject' combined with the corresponding line that matches 'object', in the order each one is listed, separated with a comma. May I see an example of this done with awk, sed, or perl? (Preferably as a oneliner if possible?)
I have tried some uses of awk to perform this, I am still learning I should add:
awk '{if ($0 ~ /subject/) pat1=$1; if ($0 ~ /object/) pat2=$2} {print $0,pat2}'
But does not do what I thought it would! So I know I have the syntax wrong. If I were to see an example that would greatly help so that I can learn.
Not perl or awk, but easier.
$ pr -2ts, file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
Explanation
-2 2 columns
t ignore print header (filename, date, page number, etc)
s, use comma as the column separator
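A self-contained sketch (made-up sample) showing the same pr invocation:

```shell
# two subject lines followed by two object lines
printf 'subject:a\nsubject:b\nobject:x\nobject:y\n' > pairs.txt

# 2 columns, no header, comma as the column separator
pr -2ts, pairs.txt
```

pr fills the first column top to bottom before starting the second, which is exactly why the pairing works here: the subjects land in column one and the objects in column two.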
I'd do it something like this in perl:
#!/usr/bin/perl
use strict;
use warnings;
my @subjects;
while ( <DATA> ) {
m/^subject:(\w+)/ and push @subjects, $1;
m/^object:(\w+)/ and print "subject:",shift @subjects,",object:", $1,"\n";
}
__DATA__
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
Reduced down to one liner, this would be:
perl -ne '/^(subject:\w+)/ and push @s, $1; /^object/ and print shift(@s), ",", $_' file
grep, paste and process substitution
$ paste -d , <(grep 'subject' infile) <(grep 'object' infile)
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
This treats the output of grep 'subject' infile and grep 'object' infile like files due to process substitution (<( )), then pastes the results together with paste, using a comma as the delimiter (indicated by -d ,).
sed
The idea is to read and store all subject lines in the hold space, then for each object line fetch the hold space, get the proper subject and put the remaining subject lines back into hold space.
First the unreadable oneliner:
$ sed -rn '/^subject/H;/^object/{G;s/\n+/,/;s/^(.*),([^\n]*)(\n|$)/\2,\1\n/;P;s/^[^\n]*\n//;h}' infile
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
-r is for extended regex (no escaping of parentheses, + and |) and -n does not print by default.
Expanded, more readable and explained:
/^subject/H # Append subject lines to hold space
/^object/ { # For each object line
G # Append hold space to pattern space
s/\n+/,/ # Replace first group of newlines with a comma
# Swap object (before comma) and subject (after comma)
s/^(.*),([^\n]*)(\n|$)/\2,\1\n/
P # Print up to first newline
s/^[^\n]*\n// # Remove first line (can't use D because there is another command)
h # Copy pattern space to hold space
}
Remarks:
When the hold space is fetched for the first time, it starts with a newline (H adds one), so the newline-to-comma substitution replaces one or more newlines, hence the \n+: two newlines for the first time, one for the rest.
To anchor the end of the subject part in the swap, we use (\n|$): either a newline or the end of the pattern space – this is to get the swap also on the last line, where we don't have a newline at the end of the pattern space.
This works with GNU sed. For BSD sed as found in MacOS, there are some changes required:
The -r option has to be replaced by -E.
There has to be an extra semicolon before the closing brace: h;}
To insert a newline in the replacement string (swap command), we have to replace \n by either '$'\n'' or '"$(printf '\n')"'.
Since you specifically asked for a "oneliner" I assume brevity is far more important to you than clarity so:
$ awk -F: -v OFS=, 'NR>1&&$1!=p{f=1}{p=$1}f{print a[++c],$0;next}{a[NR]=$0}' file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
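For the record, a more spelled-out (but less clever) awk sketch of the same pairing idea: buffer every subject line in an array, then pop one off for each object line:

```shell
printf 'subject:asdfghj\nsubject:qwertym\nobject:cfvvmkme\nobject:rjo4j2f2\n' > inout.txt

# s[] holds subject lines in order; m indexes the next unused one
awk '/^subject/{s[++n]=$0; next} /^object/{print s[++m] "," $0}' inout.txt
```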

Sed substitution for text

I have this kind of text:
0
4.496
4.496
Plain text.
7.186
Plain text
10.949
Plain text
12.988
Plain text
16.11
Plain text
17.569
Plain text
ESP:
Plain text
I am trying to make a sed substitution because I need all aligned after the numbers, something like:
0
4.496
4.496 Plain text.
7.186 Plain text
10.949 Plain text
12.988 Plain text
16.11 Plain text
17.569 Plain text ESP:Plain text
But I'm trying with different combinations of commands with sed but I can't keep a part of the matching pattern
sed -r 's/\([0-9]+\.[0-9]*\)\s*/\1/g'
I am trying to remove all \n after the numbers and align the text, but it does not work. I need to align text with text too.
I tried this as well:
sed -r 's/\n*//g'
But without results.
Thank you
This is a bit tricky. Your approach does not work because sed operates in a line-based manner (it reads a line, runs the code, reads another line, runs the code, and so forth), so it doesn't see the newlines unless you do special things. We'll have to completely override sed's normal control flow.
With GNU sed:
sed ':a $!{ N; /\n[0-9]/ { P; s/.*\n//; }; s/\n/ /; ba }' filename
This works as follows:
:a # Jump label for looping (we'll build our own loop, with
# black jack and...nevermind)
$! { # Unless the end of the input was reached:
N # fetch another line, append it to what we already have
/\n[0-9]/ { # if the new line begins with a number (if the pattern space
# contains a newline followed by a number)
P # print everything up to the newline
s/.*\n// # remove everything before the newline
}
s/\n/ / # remove the newline if it is still there
ba # go to a (loop)
}
# After the end of the input we drop off here and implicitly
# print the last line.
The code can be adapted to work with BSD sed (as found on *BSD and Mac OS X), but BSD sed is a bit picky about labels and jump instructions. I believe
sed -e ':a' -e '$!{ N; /\n[0-9]/ { P; s/.*\n//; }; s/\n/ /; ba' -e '}' filename
should work.
This GNU awk command can also handle this:
awk -v RS= 'BEGIN{ FPAT="(^|\n)[0-9]+(\\.[0-9]+)?(\n[^0-9][^\n]*)*" }
{for (i=1; i<=NF; i++) {f=$i; sub(/^\n/, "", f); gsub(/\n/, " ", f); print f}}' file
0
4.496
4.496 Plain text.
7.186 Plain text
10.949 Plain text
12.988 Plain text
16.11 Plain text
17.569 Plain text ESP: Plain text
sed -n '1h;1!H;$!b
x;s/\n\([^0-9]\)/ \1/gp' YourFile
-n: print only on demand
1h;1!H;$!b;x: load the whole file into the hold buffer until the end of input (at the end, the whole file is in the work buffer)
s/\n\([^0-9]\)/ \1/gp: replace every newline followed by a non-digit character with a space character and print the result.
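To see it in action on the sample from the question (truncated to the first six lines):

```shell
printf '0\n4.496\n4.496\nPlain text.\n7.186\nPlain text\n' > nums.txt

# slurp into the hold buffer, then join each newline + non-digit with a space
sed -n '1h;1!H;$!b
x;s/\n\([^0-9]\)/ \1/gp' nums.txt
```

Numbers stay on their own lines; each text line is pulled up behind the preceding line.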
This might work for you (GNU sed):
sed ':a;N;s/\n\([[:alpha:]]\)/ \1/;ta;P;D' file
Process two lines at a time. If the second line begins with an alphabetic character, remove the preceding newline, append another line, and repeat. If the second line does not begin with an alphabetic character, print and then delete the first line and its newline. The second line now becomes the first line and the process repeats.
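A minimal demonstration with GNU sed (same truncated sample as in the question):

```shell
printf '0\n4.496\n4.496\nPlain text.\n7.186\nPlain text\n' > aligned.txt

# join any line starting with a letter onto the previous line
sed ':a;N;s/\n\([[:alpha:]]\)/ \1/;ta;P;D' aligned.txt
```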

tr '\n\t+' command not working in shell bash?

Text1 Text2
(3 tabs) text 3
(4 tabs) text 4
(2 tabs) text 5
Text2 Text7
(2 tabs) Text8
I have a text file in the above format. Basically what I want to do is that, I want to replace consecutive newline and tabs with a special char. I am using this command
tr '\n\t+' '#'
I am expecting this output
Text1 Text2#text 3#text 4#text 5
Text2 Text7#Text8
This regex works fine with Eclipse's find and replace (also with EditPlus). However, tr puts everything on one line.
Can anyone tell me what the problem with tr and this regex is? And what is the resolution?
That is a wrong use of the tr command. It lets you translate one character (class) into another, but you cannot use it for regex string replacements like this.
You can use gnu sed instead:
sed ':a;N;$!ba;s/\n\t\+/#/g;' file
Text1 Text2#text 3#text 4#text 5
Text2 Text7#Text8
There are 2 parts of this sed command:
:a;N;$!ba;: Appends the next line to the pattern space via the N command (a loop that reads the entire input up front before applying the string substitution)
s/\n\t\+/#/g; Replaces every newline followed by 1 or more tabs by #
EDIT: Here is a non-gnu sed version that worked on OSX also:
sed -e ':a' -e 'N' -e '$!ba' -e $'s/\\n\t\t*/#/g' file
@anubhava's helpful answer explains why tr doesn't work here, but the pure sed solution has a slight drawback (aside from being somewhat difficult to understand): it reads the entire input file into memory before performing the desired string substitution (which may be perfectly fine for smaller files).
IF you:
have GNU awk or mawk
and don't mind combining awk and sed
here's a solution that doesn't read the entire input all at once:
awk -v RS='\n\t+' -v ORS=# '1' file | sed '$d'
-v RS='\n\t+' assigns to RS, the [input] record separator, which breaks the input into records (potentially spanning lines) wherever a newline is followed by at least 1 tab. Note that it's the use of a regex as the record separator that is not POSIX-compliant and thus requires GNU awk or mawk.
-v ORS=# assigns # to variable ORS, the output record separator.
1 constitutes the entire awk program in this case: it is a common shortcut that is effectively the same as {print}, i.e., it simply outputs each input record, followed by ORS, the output record separator.
However, since every record, including the last one, is terminated with ORS, we end up with \n# at the end of the output, which is undesired.
sed '$d' simply deletes that last line from the output ($ matches the last line, and d deletes it).
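Putting the pipeline together on the sample from the question (tabs written explicitly with printf):

```shell
# recreate the sample: continuation lines start with tabs
printf 'Text1 Text2\n\t\t\ttext 3\n\t\t\t\ttext 4\n\t\ttext 5\nText2 Text7\n\t\tText8\n' > tabs.txt

# join each "newline + tabs" boundary with #, then drop the stray trailing record
awk -v RS='\n\t+' -v ORS=# '1' tabs.txt | sed '$d'
```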

how to rejoin words that are split accross lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line (i.e. the word has '-\n' inserted in it).
I would like to rejoin all such split words in a text file (in a Linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor on Windows that did regex search/replace with newlines in the search expression, but am unaware of such a tool on Linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but it uses fmt to re-form the paragraphs after perl has mangled them.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
Also, if you're doing OCR, you can use this Perl one-liner to convert Unicode dashes to ASCII dash characters. Note the -CS option to tell perl about UTF-8.
# 0x2009 - 0x2015 em-dashes to ascii dash
perl -CS -pe 'tr/\x{2009}\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has Windows line endings, you'll need to catch the CR LF with something like:
cat file | perl -p -e 's/-\s\n//'
Hey, this is my first answer post, here goes:
'-\n' is, I suspect, the line-feed sequence. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (null,zero), so you use 's' to substitute, forward-slash, then the unwanted string (more on that in a sec), then forward-slash again, then nothing (what it's being substituted with), then forward-slash, and then the scale (as in do you want to apply the edit to a single line or more). In this case I will select 'g' which represents global, as in the whole text file. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing because the backslash in the string needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//' file_with_hyphens
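Since -z is a GNU extension that splits input on NUL bytes rather than newlines, the whole file lands in one pattern space and \n can be matched directly; add the g flag to join every split word, e.g.:

```shell
printf 'recog-\nnition of hyphen-\nated text\n' > ocr.txt

# remove every "-" immediately followed by a line break
sed -z 's/-\n//g' ocr.txt
```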