sed substitution including newlines - regex

I want to change a text file so that any line beginning with "Length:" is appended to the previous line.
I'm aware that sed 's/\nLength:/ Length:/' isn't going to work because sed is line based.
Googling for "How to match newlines in sed" did turn up a complex sed method for joining a pattern to the next line but I couldn't figure out how to adapt it.
Help would be appreciated.

In awk you can use something like:
awk '/^/&&!/^Length/{printf "\n"}{printf "%s",$0}' infile
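This prints \n before every line (the line start ^ always matches) except those that begin with Length, so each Length line ends up appended to the previous one. Note that the output starts with an empty line and has no trailing newline.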

If the file isn't too large, you can use a Perl command line in slurp mode (the whole file is loaded before processing):
perl -0777 -pe 's/\R(?=Length:)//g' file
-0777 switches on the slurp mode
pattern:
\R any kind of newlines
(?=...) lookahead assertion
If there are no consecutive lines starting with Length: you can use this sed command:
sed -n ':a;/\nLength:/!{$p;N;ba;}; s/\n\(Length:\)/\1/;p;' file
details:
:a; # define the label "a"
/\nLength:/! { # if "\nLength:" doesn't match then:
$p; # if last line, print
N; # append the next line to the pattern space
ba; # go to label "a"
};
s/\n\(Length:\)/\1/; # perform the replacement (sed refers to the group as \1, not $1)
p; # print
Another way with awk, using the record separator:
awk 'BEGIN{RS="\nLength:";ORS="Length:"}1' file | head -n -1

This might work for you (GNU sed):
sed 'N;/\nLength:/s/\n/ /;P;D' file
This appends the next line to the current line in the pattern space. If the appended line begins with the required string, the newline is replaced with a space (if you do not want the space, replace the newline with nothing). P then prints the first line of the pattern space, D deletes it, and the process repeats. After D, the former second line is now the first; if the join happened, the pattern space no longer contains a newline, so a new line is read in automatically before N appends the next one.
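For comparison, the same join can be sketched outside of sed. This is a minimal Python sketch, not part of the original answers: the function name join_continuations and the sample lines are made up for illustration, and it assumes the lines fit in memory as a list.

```python
def join_continuations(lines, prefix="Length:", sep=" "):
    """Append every line that starts with `prefix` to the line before it."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if out and line.startswith(prefix):
            # Continuation line: glue it onto the previous output line.
            out[-1] += sep + line
        else:
            out.append(line)
    return out


if __name__ == "__main__":
    # Hypothetical sample input; any iterable of lines (e.g. a file) works.
    sample = ["Title: a\n", "Length: 10\n", "Title: b\n", "Length: 20\n"]
    for joined in join_continuations(sample):
        print(joined)
```

Unlike the N;P;D one-liner, this version also joins several consecutive Length: lines onto the same preceding line.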

Related

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but it gives me errors on the newlines, so I need to replace them with a space.
I run some other perl commands on these files to fix other errors, so I would like a solution as a one-line perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing, and I couldn't adapt the solutions mentioned there to my problem.
The statement I tried, which isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads consecutive lines into the pattern space. If the appended line doesn't start with four or more digits followed by a | char (see the \n[0-9]\{4,\}| POSIX BRE pattern), the line break between the two is replaced with a space and the loop repeats. Otherwise, or at the end of the file, the first line of the pattern space is printed.
With perl, if the file is huge but still fits into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the run of line breaks.
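The slurp-and-substitute idea can also be sketched in Python. This is only an illustration under stated assumptions: the record-start rule (four or more digits followed by |) is taken from the answers above, the names BROKEN_NEWLINE and join_broken_records are hypothetical, and the text is assumed to use \n line endings (as it does when a file is opened in text mode). Instead of a possessive quantifier, the lookahead also rejects a following newline and the end of the text, so blank-line runs before a record start and the final newline are left alone.

```python
import re

# A newline run is "broken" when it is NOT followed by another newline,
# by a record start (4+ digits and a pipe), or by the end of the text.
BROKEN_NEWLINE = re.compile(r"\n+(?!\n|\d{4,}\||\Z)")


def join_broken_records(text):
    """Replace line breaks that fall inside a record with a single space."""
    return BROKEN_NEWLINE.sub(" ", text)


if __name__ == "__main__":
    sample = (
        '4454|"test string"|20-05-1999|"test 2nd string"\n'
        '4455|"test newline\nin string"||"test another 2nd string"\n'
        '4456|"another string"|19-03-2021|"here also a newline\n"\n'
        "4457|x\n"
    )
    print(join_broken_records(sample), end="")
```

Like the perl one-liner, this holds the whole file in memory, so it is only suitable when the file fits.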
With your shown samples, this can be done simply in awk. Written and tested in GNU awk; it should work in any awk. It should also be fast on huge files, since it avoids slurping the whole file into memory (the OP mentioned the files may be huge).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val " " $0;val=""};next} 1' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##gsub returns the number of " on the line; if that number is odd, a quoted string is opened or closed on this line.
if(val==""){ val=$0 } ##If val is empty, this line opens the string, so save the current line in val.
else {print val " " $0;val=""} ##Else this line closes the string: print val, a space and the current line, then empty val.
next ##next will skip further statements from here.
}
1 ##If the number of " on a line is even, the condition above (the gsub one) is skipped and the line is simply printed.
' Input_file ##Mentioning Input_file name here.
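The quote-counting idea can be sketched as a streaming Python generator. This is a hedged illustration rather than a drop-in replacement: merge_quoted_lines is a made-up name, and it assumes that an odd running count of " means a quoted field is still open (doubled "" escapes keep the parity intact). Unlike the two-line awk version, it keeps collecting lines when a quoted string spans three or more lines.

```python
def merge_quoted_lines(lines, quote='"', sep=" "):
    """Yield logical records, joining physical lines until quotes balance."""
    parts = []        # physical lines of the record being assembled
    quotes_seen = 0   # running count of quote characters in `parts`
    for line in lines:
        line = line.rstrip("\n")
        parts.append(line)
        quotes_seen += line.count(quote)
        if quotes_seen % 2 == 0:
            # Every opened quote has been closed: the record is complete.
            yield sep.join(parts)
            parts, quotes_seen = [], 0
    if parts:
        # Input ended inside a quoted string: emit what was collected.
        yield sep.join(parts)


if __name__ == "__main__":
    sample = [
        '4455|"test newline\n',
        'in string"||"test another 2nd string"\n',
        '4456|"another string"|19-03-2021|"here also a newline\n',
        '"\n',
    ]
    for record in merge_quoted_lines(sample):
        print(record)
```

Because it only ever holds one logical record, it can be fed an open file object directly, which keeps memory use small on huge files.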

How can I merge multiple blocks/lines with sed or regex?

Is it possible to merge multiple blocks/lines into a "single" line?
So basically if the next line starts with the same "#Msg" tag then append it to the previous line. (Hard to explain, but my example speaks for itself) (The blocks are separated by a new/blank line)
My input file looks like this:
#Msg,00000

#Msg,00001
#Msg,00002

#Msg,00003
#Msg,00004

#Msg,00005

#Msg,00006
#Msg,00007
#Msg,00008

#Msg,00009

#Msg,00010
#Msg,00011
Output should be like this:
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
Any advice is very welcome.
This would be pretty easy to do in Perl:
perl -00 -ple 'tr/\n/ /'
-e CODE specifies the program.
-p wraps a read/write line loop around it (by default it reads from STDIN, but you can also specify one or more filenames on the command line).
-00 specifies that the input "lines" are actually paragraphs.
-l has two effects: Incoming line terminators are automatically stripped from lines, and outgoing lines get line terminators added to them (and because we used -00 (paragraph mode), our line terminator is actually \n\n).
To recap:
We read the input one paragraph at a time. For each paragraph, we remove any trailing newlines. We then translate every newline to a space. Finally we output the transformed paragraph, followed by \n\n.
No point in trying to produce a shorter code than is possible with Perl!
Collect lines from the input file in list group until a blank line appears. Then output the contents of group, empty it and start again. When end-of-file is encountered output whatever is in group, if it is non-empty.
group = []
with open('vollschauer.txt') as vollschauer:
    for line in vollschauer:
        line = line.rstrip()
        if line:
            group.append(line)
        else:
            if group:
                print(' '.join(group))
                print()
                group = []
if group:
    print(' '.join(group))
    group = []
$ awk -v RS= -v ORS='\n\n' '{$1=$1}1' file
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
If you insist on using sed, this should do the trick:
sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
It takes different tags into account. Such tags won't be grouped together (that's what I understood is the desired behavior):
$ cat file
#Msg,00000
#Msg,00001
#Hello,00002
#Hello,00003
#What,00004
#What,00005
$ sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
#Msg,00000 #Msg,00001
#Hello,00002
#Hello,00003
#What,00004 #What,00005
Note that this solution uses GNU sed.
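The tag-aware grouping can also be sketched in Python with itertools.groupby. This is an illustration only: merge_by_tag and the sample tags are hypothetical, and the tag is assumed to be everything before the first comma. Lines with an empty tag (which includes blank lines) are passed through unchanged, so they still separate blocks.

```python
from itertools import groupby


def merge_by_tag(lines, sep=" "):
    """Join runs of consecutive lines that share the same tag."""
    stripped = (line.rstrip("\n") for line in lines)
    out = []
    # The tag is the text before the first comma, e.g. "#Msg".
    for tag, run in groupby(stripped, key=lambda s: s.split(",", 1)[0]):
        run = list(run)
        if tag == "":
            out.extend(run)            # blank lines are kept as they are
        else:
            out.append(sep.join(run))  # one output line per run of a tag
    return out


if __name__ == "__main__":
    sample = ["#Msg,1\n", "#Msg,2\n", "#Err,3\n", "\n", "#Err,4\n", "#Err,5\n"]
    for line in merge_by_tag(sample):
        print(line)
```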
This might work for you (GNU sed):
sed ':a;N;/^$/M!s/\n/ /;ta' file
Gather up lines, replacing each newline with a space, until an empty line is reached.
N.B. The M flag on the regexp /^$/ lets it match an empty line within a pattern space that contains multiple lines.

How to get the last line that contains \n?

By definition a line must end with newline character (\n) (ref.). But for the purpose of this post, I will consider any series of characters as a line whether or not it finishes with \n.
The command tail -n 1 returns the last line whether or not it ends with \n. How can one get from a file the last line that ends with \n whether or not this line is the last line or the second-to-last line of the file?
Here's one way you could do it using Perl:
perl -ne '$s = $_ if /\n$/ }{ print $s' file
The script reads each line of the file one by one and assigns it to the variable $s if it ends with \n. Once the file has been read, $s is printed. If the last line didn't end with a newline, then the penultimate line will be printed, as shown below:
$ cat file
first line
second
third$ perl -ne '$s = $_ if /\n$/ }{ print $s' file
second
Note that I intentionally left in the $ to show the prompt, which appears at the end of the last line of the file because that line has no newline character.
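The same "remember the last line that ends with \n" logic can be sketched in Python. This is a minimal illustration with a made-up function name; it assumes the input is a binary stream (an open file in "rb" mode or an io.BytesIO), so that line endings are seen exactly as stored.

```python
import io


def last_terminated_line(stream):
    """Return the last line of a binary stream that ends with a newline."""
    last = None
    for line in stream:
        # Iterating a binary stream keeps the trailing b"\n" when present.
        if line.endswith(b"\n"):
            last = line
    return last


# Same content as the example file above: the final line has no newline.
demo = io.BytesIO(b"first line\nsecond\nthird")
print(last_terminated_line(demo))  # b'second\n'
```

It returns None when no line in the input is newline-terminated.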
cat -vte file|grep "\$$"|tail -1
What about this? Or some other way with cat -vte
This way the extra $ will be removed:
echo -en "Enter\nEnter again\nNo enter this time"|cat -vte|grep "\$$"|sed 's/\$$//g'|tail -1
+1 variant for linux (Perl regexp, positive look-ahead assertion, show matched part only):
echo -en "Enter\nEnter again\nNo enter this time"|cat -vte|grep -Po ".*(?=\\\$$)"|tail -1

Regex to move second line to end of first line

I have several lines with certain values, and I want to merge every second line (every line beginning with <name>) onto the end of the preceding line, which ends with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so end up with something like
<id>rd://data1/8b</id><name>DM_test1</name>
The reason it came out like this is that I used two piped xpath queries.
Regex
Simply remove the newline at the end of a line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (the latter is Perl syntax). On Linux, search for (<\/id>)\n and replace it with the same thing.
awk
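The ideal solution uses awk. The idea is simple: when the line number is odd, print the line without a newline; otherwise print it with one. The line is passed to printf through "%s" so that any % characters in the data are not treated as format specifiers.
awk '{ if(NR % 2) { printf "%s", $0 } else { print $0 } }' file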
sed
Using sed we place a line in the hold space when it contains <id>, and append the line to it when it's a <name> line. Then we remove the newline and print the hold buffer by exchanging it with the pattern space.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
Using pr we can achieve a similar goal:
pr -a -t -s --columns 2 file
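The odd/even pairing used by the awk answer can also be sketched in Python. This is an illustrative sketch with a hypothetical function name; it assumes the lines strictly alternate (<id> line, then <name> line), and a leftover unpaired line at the end is kept on its own.

```python
from itertools import zip_longest


def merge_pairs(lines):
    """Concatenate each odd-numbered line with the line that follows it."""
    it = (line.rstrip("\n") for line in lines)
    # Passing the same iterator twice makes zip_longest consume two lines
    # per step; a missing partner at the end is filled with "".
    return [first + second for first, second in zip_longest(it, it, fillvalue="")]


if __name__ == "__main__":
    sample = [
        "<id>rd://data1/8b</id>\n",
        "<name>DM_test1</name>\n",
        "<id>rd://data2/76f</id>\n",
        "<name>DM_test_P</name>\n",
    ]
    for merged in merge_pairs(sample):
        print(merged)
```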

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script that captures the string 'b687' in a variable. It is really a pattern (the letter 'b' followed by n digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in the example above). Is there an easier way to do so, perhaps using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will search for a word consisting of b followed by exactly three digits.
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
This reads the file and, on the line that matches the pattern HEROKU_POSTGRESQL_AQUA_URL, prints the 4th whitespace-separated field, which in this case is b687.
Your other option is to use sed:
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned (b followed by digits) and printing only that. The -n tells sed not to print lines by default. The rest of the command is a substitution: .* matches any string at the beginning of the line, followed by a space and a \(...\) group containing the regex that matches your b#### string. The replacement \1 keeps only that group, and the p at the end tells sed to print the result (needed because -n suppressed the default printing).
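For comparison, the same extraction can be sketched in Python with the re module. This is only an illustration: find_backup_id and BACKUP_ID are made-up names, and the pattern assumes the id is a standalone word made of b followed by one or more digits.

```python
import re

# "b" followed by one or more digits, as a whole word.
BACKUP_ID = re.compile(r"\bb\d+\b")


def find_backup_id(text):
    """Return the first b<digits> token in the text, or None if absent."""
    match = BACKUP_ID.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = (
        "HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687\n"
        "Capturing... done\n"
        "Storing... done\n"
    )
    print(find_backup_id(sample))  # b687
```

The word boundaries keep the b in "backup" from matching, since it is not followed by a digit.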