Remove \n newline if string contains keyword - regex

I'd like to know if I can remove a \n (newline) only if the current line has one ore more keywords from a list; for instance, I want to remove the \n if it contains the words hello or world.
Example:
this is an original
file with lines
containing words like hello
and world
this is the end of the file
And the result would be:
this is an original
file with lines
containing words like hello and world this is the end of the file
I'd like to use sed, or awk and, if needed, grep, wc or whatever commands work for this purpose. I want to be able to do this on a lot of files.

Using awk you can do:
awk '/hello|world/{printf "%s ", $0; next} 1' file
this is an original
file with lines
containing words like hello and world this is the end of the file

here is simple one using sed
sed -r ':a;$!{N;ba};s/((hello|world)[^\n]*)\n/\1 /g' file
Explanation
:a;$!{N;ba} read whole file into pattern, like this: this is an original\nfile with lines\ncontaining words like hell\
o\nand world\nthis is the end of the file$
s/((hello|world)[^\n]*)\n/\1 /g search the key words hello or world and remove the next \n,
g command in sed substitute stands to apply the replacement to all matches to the regexp, not just the first.

A non-regex approach:
awk '
BEGIN {
# define the word list
w["hello"]
w["world"]
}
{
printf "%s", $0
for (i=1; i<=NF; i++)
if ($i in w) {
printf " "
next
}
print ""
}
'
or a perl one-liner
perl -pe 'BEGIN {#w = qw(hello world)} s/\n/ / if grep {$_ ~~ #w} split'
To edit the file in-place, do:
awk '...' filename > tmpfile && mv tmpfile filename
perl -i -pe '...' filename

This might work for you (GNU sed):
sed -r ':a;/^.*(hello|world).*\'\''/M{$bb;N;ba};:b;s/\n/ /g' file
This checks if the last line, of a possible multi-line, contains the required string(s) and if so reads another line until end-of-file or such that the last line does not contain the/those string(s). Newlines are removed and the line printed.

$ awk '{ORS=(/hello|world/?FS:RS)}1' file
this is an original
file with lines
containing words like hello and world this is the end of the file

sed -n '
:beg
/hello/ b keep
/world/ b keep
H;s/.*//;x;s/\n/ /g;p;b
: keep
H;s/.*//
$ b beg
' YourFile
a bit harder due to check on current line that may include a previous hello or world already
principle:
on every pattern match, keep the string in hold buffer
other wise, load hold buffer and remove \n (use of swap and empty the current line due to limited buffer operation available) and print the content
Add a special case of pattern in last line (normaly hold so not printed otherwise)

Related

Replace newline in quoted strings in huge files

I have a few huge files with values seperated by a pipe (|) sign.
The strings our quoted but sometimes there is a newline in between the quoted string.
I need to read these files with external table from oracle but on the newlines he will give me errors. So I need to replace them with a space.
I do some other perl commands on these files for other errors, so I would like to have a solution in a one line perl command.
I 've found some other similar questions on stackoverflow, but they don't quite do the same and I can't find a solution for my problem with the solution mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } #$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and once the second line doesn't start with four digits followed with a | char (see the [0-9]\{4\}| POSIX BRE regex pattern), the or more line break between the two is replaced with a space. The search and replace repeats until no match or the end of file.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' <<< "$s"
With -0777, you slurp the file and the \R++(?!\d{4,}\|) pattern matches any one or more line breaks (\R++) not followed with four or more digits followed with a | char. The ++ possessive quantifier is required to make (?!...) negative lookahead to disallow backtracking into line break matching pattern.
With your shown samples, this could be simply done in awk program. Written and tested in GNU awk, should work in any awk. This should work fast even on huge files(better than slurping whole file into memory, having mentioned that OP may use it on huge files).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking condition if number of " are EVEN or not, because if they are NOT even then it means they are NOT closed properly.
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val $0;val=""} ##Else(if val NOT NULL) then print val current line and nullify val here.
next ##next will skip further statements from here.
}
1 ##In case number of " are EVEN in any line it will skip above condition(gusb one) and simply print the line.
' Input_file ##Mentioning Input_file name here.

How can I merge multiple blocks/lines with sed or regex?

Is it possible to merge multiple blocks/lines into a "single" line?
So basically if the next line starts with the same "#Msg" tag then append it to the previous line. (Hard to explain, but my example speaks for itself) (The blocks are separated by a new/blank line)
My input file looks like this:
#Msg,00000
#Msg,00001
#Msg,00002
#Msg,00003
#Msg,00004
#Msg,00005
#Msg,00006
#Msg,00007
#Msg,00008
#Msg,00009
#Msg,00010
#Msg,00011
Output should be like this:
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
Any advice is very welcome.
This would be pretty easy to do in Perl:
perl -00 -ple 'tr/\n/ /'
-e CODE specifies the program.
-p wraps a read/write line loop around it (by default it reads from STDIN, but you can also specify one or more filenames on the command line).
-00 specifies that the input "lines" are actually paragraphs.
-l has two effects: Incoming line terminators are automatically stripped from lines, and outgoing lines get line terminators added to them (and because we used -00 (paragraph mode), our line terminator is actually \n\n).
To recap:
We read the input one paragraph at a time. For each paragraph, we remove any trailing newlines. We then translate every newline to a space. Finally we output the transformed paragraph, followed by \n\n.
No point in trying to produce a shorter code than is possible with Perl!
Collect lines from the input file in list group until a blank line appears. Then output the contents of group, empty it and start again. When end-of-file is encountered output whatever is in group, if it is non-empty.
group = []
with open('vollschauer.txt') as vollschauer:
for line in vollschauer:
line = line.rstrip()
if line:
group.append(line)
else:
if group:
print (' '.join(group))
print()
group = []
if group:
print (' '.join(group))
group = []
$ awk -v RS= -v ORS='\n\n' '{$1=$1}1' file
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
If you insist on using sed, this should do the trick:
sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
It takes different tags into account. Such tags won't be grouped together (that's what I understood is the desired behavior):
$ cat file
#Msg,00000
#Msg,00001
#Hello,00002
#Hello,00003
#What,00004
#What,00005
$ sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
#Msg,00000 #Msg,00001
#Hello,00002
#Hello,00003
#What,00004 #What,00005
Note that this solution uses GNU sed.
This might work for you (GNU sed):
sed ':a;N;/^$/M!s/\n/ /;ta' file
Gather up lines, replacing each newline by a space until an empty line.
N.B. The use of the M flag on the repexp /^$/ which matches an empty line on a pattern space containing multiple lines.

Using awk or sed to merge / print lines matching a pattern (oneliner?)

I have a file that contains the following text:
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
I am hoping to achieve the following result using awk or sed (as a oneliner would be a bonus! [Perl oneliner would be acceptable as well]):
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
I'd like to have each line that matches 'subject' and 'object' combined in the order that each one is listed, separated with a comma. May I see an example of this done with awk, sed, or perl? (Preferably as a oneliner if possible?)
I have tried some uses of awk to perform this, I am still learning I should add:
awk '{if ($0 ~ /subject/) pat1=$1; if ($0 ~ /object/) pat2=$2} {print $0,pat2}'
But does not do what I thought it would! So I know I have the syntax wrong. If I were to see an example that would greatly help so that I can learn.
not perl or awk but easier.
$ pr -2ts, file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
Explanation
-2 2 columns
t ignore print header (filename, date, page number, etc)
s, use comma as the column separator
I'd do it something like this in perl:
#!/usr/bin/perl
use strict;
use warnings;
my #subjects;
while ( <DATA> ) {
m/^subject:(\w+)/ and push #subjects, $1;
m/^object:(\w+)/ and print "subject:",shift #subjects,",object:", $1,"\n";
}
__DATA__
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
Reduced down to one liner, this would be:
perl -ne '/^(subject:\w+)/ and push #s, $1; /^object/ and print shift #s,$_' file
grep, paste and process substitution
$ paste -d , <(grep 'subject' infile) <(grep 'object' infile)
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
This treats the output of grep 'subject' infile and grep 'object' infile like files due to process substitution (<( )), then pastes the results together with paste, using a comma as the delimiter (indicated by -d ,).
sed
The idea is to read and store all subject lines in the hold space, then for each object line fetch the hold space, get the proper subject and put the remaining subject lines back into hold space.
First the unreadable oneliner:
$ sed -rn '/^subject/H;/^object/{G;s/\n+/,/;s/^(.*),([^\n]*)(\n|$)/\2,\1\n/;P;s/^[^\n]*\n//;h}' infile
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
-r is for extended regex (no escaping of parentheses, + and |) and -n does not print by default.
Expanded, more readable and explained:
/^subject/H # Append subject lines to hold space
/^object/ { # For each object line
G # Append hold space to pattern space
s/\n+/,/ # Replace first group of newlines with a comma
# Swap object (before comma) and subject (after comma)
s/^(.*),([^\n]*)(\n|$)/\2,\1\n/
P # Print up to first newline
s/^[^\n]*\n// # Remove first line (can't use D because there is another command)
h # Copy pattern space to hold space
}
Remarks:
When the hold space is fetched for the first time, it starts with a newline (H adds one), so the newline-to-comma substitution replaces one or more newlines, hence the \n+: two newlines for the first time, one for the rest.
To anchor the end of the subject part in the swap, we use (\n|$): either a newline or the end of the pattern space – this is to get the swap also on the last line, where we don't have a newline at the end of the pattern space.
This works with GNU sed. For BSD sed as found in MacOS, there are some changes required:
The -r option has to be replaced by -E.
There has to be an extra semicolon before the closing brace: h;}
To insert a newline in the replacement string (swap command), we have to replace \n by either '$'\n'' or '"$(printf '\n')"'.
Since you specifically asked for a "oneliner" I assume brevity is far more important to you than clarity so:
$ awk -F: -v OFS=, 'NR>1&&$1!=p{f=1}{p=$1}f{print a[++c],$0;next}{a[NR]=$0}' file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1

How to replace a text sequence that includes "\n" in a text file

This may sound duplicated, but I can't make this works.
Consider:
_ = space
- = minus sign
particle_little.csv is a file of this form:
waste line to be deleted
__data__data__data
_-data__data_-data
__data_-data__data
I need to get a standard csv format in particle_std.csv, like this:
data,data,data
-data,data,-data
data,-data,data
I am trying to use tail and tr to do that conversion, here I split the command:
tail -n +2 particle_little.csv to delete the first line
| tr -s ' ' to remove duplicated spaces
| tr '/\b\n \b/' '\n' to delete the very beginning space
| tr ' ' ',' to change spaces for commas
> particle_std.csv to put it in a output file
But I get this (without the 4th step):
data
data
data
-data
...
Finally, the file is huge, so it is almost impossible to open in editors (I know there are super editors that maybe can)
I would suggest that you used awk:
$ cat file
waste line to be deleted
data data data
-data data -data
data -data data
$ awk -v OFS=, '{ $1 = $1 } NR > 1' file
data,data,data
-data,data,-data
data,-data,data
The script sets the output field separator OFS to , and reassigns the first field to itself $1 = $1, causing awk to touch each line (and replace the spaces with commas). Lines after the first, where NR > 1, are printed (the default action is to print the line).
So if I'm reading you right - ignore lines that don't start with whitespace. Comma separate everything else.
I'd suggest perl:
perl -lane 'next unless /^\s/; print join ",", #F';
This, when given:
waste line to be deleted
data data data
-data data -data
data -data data
On STDIN (Or specified in a filename) outputs:
data,data,data
-data,data,-data
data,-data,data
This is because:
-l strips linefeeds (and replaces them after each print);
-a autosplits on any whitespace
-n wraps it in a while ( <> ) { loop which iterates line by line - functionally it means it works just like sed/grep/tr and reads STDIN or files specified as args.
-e allows specifying a perl snippet.
In this case:
skip any lines that don't start with \s or any whitespace.
any other lines, join the fields (#F generated by -a) with , as delimiter. (This auto-inserts a linefeed because -l)
Then you can either redirect the output to a file (>output.csv) or use -i.bak to edit inplace.
You should probably use sed or awk for this:
sed -e 1d -e 's/^ *//' -e 's/ */,/g'
One way to do it in Awk is:
awk 'NR == 1 { next }
{ pad=""; for (i = 1; i <= NF; i++) { printf "%s%s", pad, $i; pad="," } print "" }'
but there's a better way to do it in Awk:
awk 'BEGIN { OFS=","} NR == 1 { next } { $1 = $1; print }' data
The BEGIN block sets the output field separator; the assignment $1 = $1; forces Awk to rework the output line; the print prints it.
I've left the first Awk version around because it shows there's more than one way to do it, and in some circumstances, such methods can be useful. But for this task, the second Awk version is better — simpler, more compact (and isomorphic with Tom Fenech's answer).

how to replace the next string after match (every) two blank lines?

is there a way to do this kind of substitution in Awk, sed, ...?
I have a text file with sections divived into two blank lines;
section1_name_x
dklfjsdklfjsldfjsl
section2_name_x
dlskfjsdklfjsldkjflkj
section_name_X
dfsdjfksdfsdf
I would to replace every "section_name_x" by "#section_name_x", this is, how to replace the next string after match (every) two blank lines?
Thanks,
Steve,
awk '
(NR==1 || blank==2) && $1 ~ /^section/ {sub(/section/, "#&")}
{
print
if (length)
blank = 0
else
blank ++
}
' file
#section1_name_x
dklfjsdklfjsldfjsl
#section2_name_x
dlskfjsdklfjsldkjflkj
#section_name_X
dfsdjfksdfsdf
hm....
Given your example data why not just
sed 's/^section[0-9]*_name.*/#/' file > newFile && mv newFile file
some seds support sed -i OR sed -i"" to overwrite the existing file, avoiding the && mv ... shown above.
The reg ex says, section must be at the beginning of the line, and can optionally contain a number or NO number at all.
IHTH
In gawk you can use the RT builtin variable:
gawk '{$1="#"$1; print $0 RT}' RS='\n\n' file
* Update *
Thanks to #EdMorton I realized that my first version was incorrect.
What happens:
Assigning to $1 causes the record to be rebuildt, which is not good in this cases since any sequence of white space is replaced by a single space between fields, and by the null string in the beginning and at the end of the record.
Using print adds an additional newline to the output.
The correct version:
gawk '{printf "%s", "#" $0 RT}' RS='\n\n\n' file