Regex to move second line to end of first line

I have several lines with certain values, and I want to append every second line (that is, every line beginning with <name>) to the end of the preceding line, which ends with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so that I end up with something like this:
<id>rd://data1/8b</id><name>DM_test1</name>
The output came out like this because I used two piped XPath queries.

Regex
Simply remove the newline at the end of each line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (the latter is Perl syntax). On Linux, search for (<\/id>)\n and replace it with the same thing.
awk
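The ideal solution uses awk. The idea is simple: when the line number is odd, we print the line without a newline; otherwise we print it with a newline. Passing the line through "%s" keeps any % characters in the data from being interpreted as format specifiers.
awk '{ if (NR % 2) { printf "%s", $0 } else { print $0 } }' file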
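For readers who would rather script this, the same odd/even idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name join_pairs and the sample list are made up here, and the sketch assumes the input strictly alternates <id> and <name> lines.

```python
def join_pairs(lines):
    """Join each odd-numbered line with the line that follows it."""
    merged = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if number % 2:
            # Odd line: start a new output line.
            merged.append(text)
        else:
            # Even line: glue it onto the line before it.
            merged[-1] += text
    return merged


# Sample data copied from the question.
sample = [
    "<id>rd://data1/8b</id>\n",
    "<name>DM_test1</name>\n",
    "<id>rd://data2/76f</id>\n",
    "<name>DM_test_P</name>\n",
]

for row in join_pairs(sample):
    print(row)
```

If the input does not alternate cleanly, matching on the <id> and <name> tags (as the sed answer does) is the safer choice.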
sed
Using sed we place a line in the hold space when it contains <id> and append the line to it when it's a <name> line. Then we remove the newline and print the hold buffer by exchanging it with the pattern space.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
Using pr we can achieve a similar goal:
pr -s --columns 2 file

Related

I am having trouble with a regexp to remove some \n

I'm trying to define a regexp to remove some carriage returns in a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used in https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
Which should get two groups, one before and one after some newline.
Looking at regex101, it reports that the groups are captured correctly.
But the result is wrong, because it still introduces an invisible newline.
I also tried to use sed, but the result is exactly the same.
So, the question is: Where am I wrong?
sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input)
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
On lines containing an unpaired double quote, this reads the next line into the pattern space (N) and removes the newline.
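The quote-counting rule can also be sketched in Python. This is a hedged illustration rather than a translation of the one-liners above: the function name join_open_quotes and the shortened sample rows are made up here, and unlike the Perl and sed commands (which join exactly one following line) it keeps joining until the quotes balance. It also assumes no escaped or doubled quotes inside fields.

```python
def join_open_quotes(lines):
    """Merge a line with the following line(s) while its double quotes are unpaired."""
    result = []
    pending = ""
    for line in lines:
        pending += line.rstrip("\r\n")
        if pending.count('"') % 2 == 0:
            # Every quote has a partner, so the record is complete.
            result.append(pending)
            pending = ""
    if pending:
        # Input ended inside an open quote; emit what we have.
        result.append(pending)
    return result


# Shortened, made-up rows in the style of the question's fragment.
rows = [
    '1122;"BP JET WASH IP2 9RP\n',
    '";"";Hamilton\n',
    '7;"ok"\n',
]

for record in join_open_quotes(rows):
    print(record)
```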
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then the previous line is printed and the current line is remembered. The last line needs to be printed separately, as there's no following line to trigger its printing.
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}),sep=>";")' file.csv
It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
   csv
      in             => *ARGV,
      sep            => ";",
      blank_is_undef => 1,
      quote_empty    => 1,
      on_in          => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.
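The "use an existing parser" advice carries over to other languages. Below is a minimal sketch using Python's standard csv module; it is not taken from the answers above. The function name strip_field_newlines and the shortened sample row are made up for illustration, and the sketch uses the module's default minimal quoting, so unlike the Perl program it does not preserve the "" quoting of empty fields.

```python
import csv
import io


def strip_field_newlines(text, delimiter=";"):
    """Parse delimited text and return it with newlines removed from every field."""
    # newline="" leaves line endings untouched so the csv module can handle
    # newlines that sit inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow([f.replace("\r", "").replace("\n", "") for f in row])
    return out.getvalue()


# Shortened, made-up row in the style of the question's fragment.
broken = '200;GBP;"BP JET WASH IP2 9RP\n";"";Hamilton\n'
print(strip_field_newlines(broken), end="")
```

For real files, open them with newline="" as the csv documentation recommends and pass the file objects to csv.reader and csv.writer directly.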

How can I merge multiple blocks/lines with sed or regex?

Is it possible to merge multiple blocks/lines into a "single" line?
So basically if the next line starts with the same "#Msg" tag then append it to the previous line. (Hard to explain, but my example speaks for itself) (The blocks are separated by a new/blank line)
My input file looks like this:
#Msg,00000

#Msg,00001
#Msg,00002

#Msg,00003
#Msg,00004

#Msg,00005

#Msg,00006
#Msg,00007
#Msg,00008

#Msg,00009

#Msg,00010
#Msg,00011
Output should be like this:
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
Any advice is very welcome.
This would be pretty easy to do in Perl:
perl -00 -ple 'tr/\n/ /'
-e CODE specifies the program.
-p wraps a read/write line loop around it (by default it reads from STDIN, but you can also specify one or more filenames on the command line).
-00 specifies that the input "lines" are actually paragraphs.
-l has two effects: Incoming line terminators are automatically stripped from lines, and outgoing lines get line terminators added to them (and because we used -00 (paragraph mode), our line terminator is actually \n\n).
To recap:
We read the input one paragraph at a time. For each paragraph, we remove any trailing newlines. We then translate every newline to a space. Finally we output the transformed paragraph, followed by \n\n.
No point in trying to produce a shorter code than is possible with Perl!
Collect lines from the input file in list group until a blank line appears. Then output the contents of group, empty it and start again. When end-of-file is encountered output whatever is in group, if it is non-empty.
group = []
with open('vollschauer.txt') as vollschauer:
    for line in vollschauer:
        line = line.rstrip()
        if line:
            group.append(line)
        else:
            if group:
                print(' '.join(group))
                print()
                group = []
    if group:
        print(' '.join(group))
        group = []
$ awk -v RS= -v ORS='\n\n' '{$1=$1}1' file
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
If you insist on using sed, this should do the trick:
sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
It takes different tags into account. Such tags won't be grouped together (that's what I understood is the desired behavior):
$ cat file
#Msg,00000
#Msg,00001
#Hello,00002
#Hello,00003
#What,00004
#What,00005
$ sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
#Msg,00000 #Msg,00001
#Hello,00002
#Hello,00003
#What,00004 #What,00005
Note that this solution uses GNU sed.
This might work for you (GNU sed):
sed ':a;N;/^$/M!s/\n/ /;ta' file
Gather up lines, replacing each newline by a space until an empty line.
N.B. Note the use of the M flag on the regexp /^$/, which lets it match an empty line within a pattern space containing multiple lines.

sed substitution including newlines

I want to change a text file so that any line beginning with "Length:" is appended to the previous line.
I'm aware that sed 's/\nLength:/ Length:/' isn't going to work because sed is line based.
Googling for "How to match newlines in sed" did turn up a complex sed method for joining a pattern to the next line but I couldn't figure out how to adapt it.
Help would be appreciated.
In awk you can use something like:
awk '/^/&&!/^Length/{printf "\n"}{printf "%s",$0}' infile
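This prints \n before every line except those that begin with Length, so each Length line stays attached to the previous one. Note that the output therefore starts with an empty line and has no trailing newline.
If the file isn't too large, you can use a Perl command line in slurp mode (which loads the whole file content before processing):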
perl -0777 -pe 's/\R(?=Length:)//g' file
-0777 switches on the slurp mode
pattern:
\R any kind of newlines
(?=...) lookahead assertion
If there are no consecutive lines starting with Length:, you can use this sed command:
sed -n ':a;/\nLength:/!{$p;N;ba;}; s/\n\(Length:\)/\1/;p;' file
details:
:a;                       # define the label "a"
/\nLength:/! {            # if "\nLength:" doesn't match then:
    $p;                   # if last line, print
    N;                    # append the next line to the pattern space
    ba;                   # go to label "a"
};
s/\n\(Length:\)/\1/;      # perform the replacement (sed backreferences are written \1, not $1)
p;                        # print
Another way with awk, using the record separator:
awk 'BEGIN{RS="\nLength:";ORS="Length:"}1' file | head -n -1
This might work for you (GNU sed):
sed 'N;/\nLength:/s/\n/ /;P;D' file
This appends the next line to the present line in the pattern space and if the appended line begins with the required string it replaces the newline with a space (if you do not want the space just replace the newline with nothing). The first line is then printed and deleted and the process repeated (the second line is now the first unless the condition was met in which case a line is automatically read in and then the first command appends the next).
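The same "glue a Length: line onto its predecessor" logic can be sketched in Python. This is an illustration under assumptions, not one of the original answers: the function name append_length_lines, its parameters, and the sample lines are all made up here. Unlike some of the commands above, it also copes with several consecutive Length: lines.

```python
def append_length_lines(lines, prefix="Length:", joiner=" "):
    """Append every line that starts with `prefix` to the line before it."""
    merged = []
    for line in lines:
        text = line.rstrip("\r\n")
        if text.startswith(prefix) and merged:
            # Continuation line: attach it to the previous output line.
            merged[-1] += joiner + text
        else:
            merged.append(text)
    return merged


# Made-up sample input; the question does not show its data.
sample_lines = [
    "Title: first\n",
    "Length: 3:15\n",
    "Title: second\n",
    "Length: 4:02\n",
]

for row in append_length_lines(sample_lines):
    print(row)
```

Pass joiner="" to join the lines without the separating space.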

How can I delete the last word in the current line, but only if a pattern occurs on the next line?

The contents of the file are
some line DELETE_ME
some line this_is_the_pattern
If the this_is_the_pattern occurs in the next line, then delete the last word (in this case DELETE_ME) in the current line.
How can I do this using sed or awk? My understanding is that sed is more appropriate for this task than awk, because awk is suited to operations on data stored in tabular format. If my understanding is incorrect, please let me know.
$ awk '/this_is_the_pattern/{sub(/[^[:space:]]+$/, "", last)} NR>1{print last} {last=$0} END{print last}' file
some line
some line this_is_the_pattern
How it works
This script uses a single variable called last which contains the previous line in the file. In summary, if the current line contains the pattern, then the last word is removed from last. Otherwise, last is printed as is.
In detail, taking each command in turn:
/this_is_the_pattern/{sub(/[^[:space:]]+$/, "", last)}
If this line has the pattern, remove the final word from the last line.
NR>1{print last}
For each line after the first line, print the last line.
last=$0
Save the current line in variable last.
END{print last}
Print the last line from the file.
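The previous-line buffer described above can be sketched in Python as well. This is a minimal illustration, not the answerer's code: the function name drop_last_word_before is made up, and it tests for the pattern as a plain substring rather than as a regular expression.

```python
import re


def drop_last_word_before(lines, pattern="this_is_the_pattern"):
    """Remove the final word of any line whose successor contains `pattern`."""
    output = []
    previous = None
    for line in lines:
        text = line.rstrip("\r\n")
        if previous is not None:
            if pattern in text:
                # The current line has the pattern: trim the buffered line.
                previous = re.sub(r"\S+$", "", previous)
            output.append(previous)
        previous = text  # remember the current line for the next round
    if previous is not None:
        output.append(previous)  # the last line has no successor
    return output


# Sample data copied from the question.
for row in drop_last_word_before(
    ["some line DELETE_ME\n", "some line this_is_the_pattern\n"]
):
    print(row)
```

As in the awk version, the whitespace in front of the removed word is left in place.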
awk 'NR>1 && /this_is_the_pattern/ {print t;}
NR>1 && !/this_is_the_pattern/ {print f;}
{f=$0;$NF="";t=$0}
END{print f}' input-file
Note that this will modify whitespace in any lines in which the last field is removed, squeezing runs of whitespace into a single space.
You could simplify this to:
awk 'NR>1 { print( /this_is_the_pattern/? t:f)}
{f=$0;$NF="";t=$0}
END{print f}' input-file
and you can resolve the squeezed whitespace issue with:
awk 'NR>1 { print( /this_is_the_pattern/? t:f)}
{f=$0;sub(" [^ ]*$","");t=$0}
END{print f}' input-file
You could use tac to cat the file backwards, so that you see the pattern first. Then set a flag and delete the last word on the next line you see. Then at the end, reverse the file through tac back to the original order.
tac file | awk '/this_is_the_pattern/{f=1;print;next} f{sub(/ [^ ]+$/, "");f=0} 1' | tac
Use the hold buffer to keep the previous line in memory (PAGE stands in for the search pattern here):
sed -n 'H;1h;1!{x;/\nPAGE/ s/[^ ]*\(\n\)/\1/;P;s/.*\n//;h;$p;}' YourFile
Use a loop, but the same concept:
sed -n ':cycle
N;/\nPAGE/ s/[^ ]*\(\n\)/\1/;P;s/.*\n//;$p;b cycle' YourFile
In both cases, the last word of the previous line is removed, even when the search pattern occurs on two consecutive lines.
Both scripts work with the two most recently read lines: test whether the pattern is on the later one and delete the word from the earlier one if so, then print the first line, remove it, and cycle.
The idiomatic awk solution is simply to keep a buffer of the previous line (or N lines in the general case) so you can test the current line and then modify and/or print the buffer accordingly:
$ awk '
NR>1 {
    if (/this_is_the_pattern/) {
        sub(/[^[:space:]]+$/,"",prev)
    }
    print prev
}
{ prev = $0 }
END { print prev }
' file
some line
some line this_is_the_pattern

SED: addressing two lines before match

I want to print the line located two lines before the match (pattern).
I tried the following:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
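The three-line sliding window that the script keeps in the pattern space can be sketched in Python with a bounded deque. This is an illustration only, not part of the original answer: the function name lines_two_before is made up, the sample words mirror the test file used further down the thread, and the match is a plain substring test.

```python
from collections import deque


def lines_two_before(lines, needle):
    """Return every line that sits two lines above a line containing `needle`."""
    window = deque(maxlen=3)  # holds the current line and the two before it
    hits = []
    for line in lines:
        window.append(line.rstrip("\r\n"))
        if len(window) == 3 and needle in window[-1]:
            hits.append(window[0])  # the oldest line in the window
    return hits


words = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
print(lines_two_before(words, "six"))
```

Because the deque never holds more than three lines, memory use stays constant regardless of file size.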
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
Here is a somewhat simpler, easier-to-read solution with grep (it does need one pipe):
grep -B2 'pattern' file_name | sed -n '1,2p'
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This prints the line two lines before each line that matches the pattern.
I've tested your sed command but the result is strange (and obviously wrong), and you didn't give any explanation. You will have to save three lines in a buffer (named hold space), do a pattern search with the newest line and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the old of the three lines saved and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It searches for six and outputs:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
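This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end, it prints the line two before that line number (if the pattern matches more than once, only the last match counts).
PS: this may be slow and memory-hungry with large files, since the whole file is kept in the array.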
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (file{,} is shell brace expansion for file file).
In the first pass it finds the pattern and stores the line number in the variable f.
In the second pass it prints the line two before the line number stored in f.