RegEx for a multiple line search and replace using sed

I need a regex that starts at a \n in the middle of a record (whatever comes before it is arbitrary) and extends through the 15 digits and 49 padding characters (matched as whitespace in my attempt) at the start of the second line. All of that should be replaced with nothing, so that the second line joins the first.
Attempt
sed -r -e '{N;s/\n[[:digit:]]{15}[[:space:]]{49}//}'
Input
QC H0H 0H0 CA
:70:NOFX TRADE TR
100000100200621 ADE RELATED WOOD PURCHASE
The linefeed after TRADE TR needs to be removed, together with the digits and padding that follow it, so that ADE RELATED joins TR and spells TRADE.
Desired Output
QC H0H 0H0 CA
:70:NOFX TRADE TRADE RELATED WOOD PURCHASE

This might work for you (GNU sed):
sed -E 'N;s/\n[[:digit:]]{15}[[:space:]]{49}//;P;D' file
This opens up a two line window and amends the second of them if the substitute command matches. It always prints the first of the two lines and then removes it.
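A quick way to check the two-line window is to build a sample with exactly 49 blanks of padding. This is only a sketch: the sample text and the join_wrapped name are made up for illustration, and GNU sed is assumed.

```shell
# Build the padding with printf so the 49 blanks are not miscounted by hand.
pad=$(printf '%49s' '')
sample=$(printf '%s\n' \
    'QC H0H 0H0 CA' \
    ':70:NOFX TRADE TR' \
    "100000100200621${pad}ADE RELATED WOOD PURCHASE")

# join_wrapped is a hypothetical wrapper around the command from the answer.
join_wrapped() {
    # N appends the next line, s removes newline + 15 digits + 49 blanks,
    # P prints up to the first newline, D deletes that part and restarts.
    sed -E 'N;s/\n[[:digit:]]{15}[[:space:]]{49}//;P;D'
}

printf '%s\n' "$sample" | join_wrapped
```

Because D restarts the cycle without discarding the remainder of the pattern space, every line gets a chance to be the first line of the window, so a wrapped record is joined no matter whether it starts on an odd or an even line.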

With GNU sed:
$ sed -Ez 's/\n[[:digit:]]{15}[[:space:]]{49}//' file
QC J0B 2Y0 CA
:70:NOFX TRADE TRADE RELATED WOOD PURCHASE

Related

Backreference with sed

I am trying to rearrange headers in my fasta file. I thought I could select the first 10 characters, then the rest of the line, and then backreference the first selection to move it to the end.
AY843768_1 Genus species 12S
would then be
Genus species 12S AY843768_1
I need to use sed for a learning exercise, but I am unsure how to make the selections to switch the 10-digit ID to the end of the line for every header in my file.
sed 's/^\(.\{10\}\)(.*$)/\2 \1/g' file1.fasta > file2.fasta
This will do it:
sed 's/^\(.\{10\}\) \(.*\)/\2 \1/' file1.fasta > file2.fasta
The backslashes on the second pair of parentheses were missing (and the space between the two groups has to be matched explicitly).
.* already matches the rest of the line, so $ isn't needed.
g isn't needed because multiple matches per line aren't possible when the expression begins with ^.
Anyway, using ERE instead of BRE makes it more readable:
sed -E 's/^(.{10}) (.*)/\2 \1/' file1.fasta > file2.fasta
OT: Same (or similar) can be achieved with bash only:
while read -r id rest; do echo "$rest $id"; done <file1.fasta >file2.fasta
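To see the backreferences at work, the ERE version can be run on the sample header. This is a small sketch; swap_id is a hypothetical name and the ACGTACGT line is an invented stand-in for a line that should not change.

```shell
# swap_id is a hypothetical wrapper around the ERE command from the answer.
swap_id() {
    # Group 1 = first 10 characters, group 2 = everything after the space.
    sed -E 's/^(.{10}) (.*)/\2 \1/'
}

# The header is rewritten; a line without "10 chars + space" passes through.
printf '%s\n' 'AY843768_1 Genus species 12S' 'ACGTACGT' | swap_id
```

Lines that do not match the pattern are printed unchanged, which is one difference from the bash loop, since the loop rewrites every line it reads.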

add characters each two places within sed

I am working with csv files. They are seismic catalogs from a database, and I need to arrange them in USGS format before starting further steps.
My input data format is:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909,7,23,170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913,12,14,024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
The USGS input format is
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T02:45:00,-17.780,-63.170,5.6,0,PRE-GEM-ISC
To "convert" my input to USGS format I did the following steps:
archi='catalog.txt'
sed 's/,/-/1' $archi > temp1.dat # to change "," to "-"
sed 's/,/-/1' temp1.dat > temp2.dat # same as above
sed 's/,/T/1' temp2.dat > temp3.dat # To add T between date and time
sed -i.bak "1 s/^.*$/DatesT,Latitude,Longitude,Magnitude,Depth,Catalog/" temp3.dat #to preserve the header.
I have the following output:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
I tried to implement the following command:
sed 's/.\{13\}/&: /g' temp3.dat > temp4.dat
Unfortunately it did not work as I expected, because the position where the colon has to go is not at the same column on every line.
Do you have any idea to improve my code?
One way using GNU sed:
sed -r 's/([0-9]{4}),([0-9]{1,2}),([0-9]{1,2}),([0-9]{2})([0-9]{2})([0-9]{2})(,.*)/\1-\2-\3T\4:\5:\6\7/' file
This splits each line into capture groups: the first column is group 1, the second column is group 2, and so on. The fourth column is captured two digits at a time (groups 4 to 6), so the colons can be inserted in the replacement.
You can do:
cat initialfile.csv|perl -p -e "s/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/\1-\2-\3T\4:\5:\6,\7,\8,\9/g"
or for inline edit:
perl -p -i -e "s/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/\1-\2-\3T\4:\5:\6,\7,\8,\9/g" initialfile.csv
which should output USGS format
This might work for you (GNU sed):
sed -E '1!s/^([^,]*),([^,]*),([^,]*),(..)(..)/\1-\2-\3T\4:\5:/' file
Skip the header: 1! applies the substitution to every line except the first.
Replace the delimiters after the first and second fields (all fields are delimited by a comma ,) with a dash -.
Replace the delimiter after the third field with T.
Split the fourth field into three equal parts and separate the parts with a colon :.
N.B. The last part of the fourth field stays as is, so it does not need to be matched.
Sometimes as programmers we become too focused on data and would be better served by looking at the problem as an artist and coding what we see.
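The field-by-field steps above can be checked against the two sample rows from the question. A minimal sketch, assuming GNU sed; to_usgs is a hypothetical name used only for this example.

```shell
# to_usgs is a hypothetical wrapper around the one-pass command.
to_usgs() {
    # 1! leaves the header alone; groups 1-3 are year, month, day,
    # groups 4-5 are the first two pairs of digits of the time field.
    sed -E '1!s/^([^,]*),([^,]*),([^,]*),(..)(..)/\1-\2-\3T\4:\5:/'
}

printf '%s\n' \
    'DatesT,Latitude,Longitude,Magnitude,Depth,Catalog' \
    '1909,7,23,170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC' \
    '1913,12,14,024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC' | to_usgs
```

Doing everything in one substitution avoids the temp1.dat to temp3.dat chain, and matching fields instead of a fixed column count copes with one-digit and two-digit months and days.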

Vim EX command to remove non-duplicate records

I have a large file which I am trying to reduce to only neighboring duplicated record id lines. (It's been sorted already)
Example:
AB12345  10987654321 Andy Male
AB12345  10987654321 Andrea Female
CD34567  98765432100 Andrea Female
EF45678  54321098765 Bobby Tables
should remove lines 3-4 leaving lines 1-2.
The following regex pattern finds just the duplicate lines successfully, but the subsequent command removes some but not all of the non-matching lines.
:/\v^(\a{2}\d{5}\s{2}\d{11}).*\n(\1.*)+
:g!/\v^(\a{2}\d{5}\s{2}\d{11}).*\n(\1.*)+/d
Why aren't all the non-matching lines being deleted?
Not a Vim solution, but this should work:
$ grep -F -f <(awk -v OFS='  ' '{print $1, $2}' data.txt | sort | uniq -d) data.txt
The <(...) is a bashism, and OFS='  ' has exactly two spaces.
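If process substitution is not available, a similar result can be had from awk alone by reading the file twice. This is a different technique from the grep pipeline above, offered as a sketch; keep_duplicates is a hypothetical name, and the key is assumed to be the first two whitespace-separated fields.

```shell
# keep_duplicates is a hypothetical helper: it prints only the lines whose
# first two fields occur more than once in the file named by $1.
keep_duplicates() {
    # First pass (NR == FNR) counts each key; second pass prints repeated keys.
    awk 'NR == FNR { count[$1 FS $2]++; next } count[$1 FS $2] > 1' "$1" "$1"
}

tmp=$(mktemp)
printf '%s\n' \
    'AB12345  10987654321 Andy Male' \
    'AB12345  10987654321 Andrea Female' \
    'CD34567  98765432100 Andrea Female' \
    'EF45678  54321098765 Bobby Tables' > "$tmp"
keep_duplicates "$tmp"
rm -f "$tmp"
```

Since awk splits on runs of whitespace by default, this version does not depend on there being exactly two spaces between the ID and the number.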
There's no "magic" version of :global
Possible solution: drop \v and escape the special characters instead, like this:
:g!/^\(\a\{2}\d\{5}\s\{2}\d\{11}\).*\n\(\1.*\)\+/d
You can also reuse the previous search pattern by leaving the pattern empty, like this: :g!//d
Extra links
very magic
Simplifying regular expressions using magic and no-magic

Remove the data before the second repeated specified character in linux

I have a text file containing data like the following:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to linux and trying sed command with unsuccessful attempts for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?
You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a string of text to fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
The rest of the command is just piping the output. cating the file into cut, and then saving the result to an output file.
Or, using sed (matching exactly two hyphen-terminated fields, so any later hyphens are kept):
cat myfile.txt | sed -e 's/^[^-]*-[^-]*-//'
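Both approaches can be compared on a line that has more than two hyphens. A small sketch: with_cut and with_sed are hypothetical names, and AB-X-12-34 is an invented input added to show what happens to later hyphens.

```shell
# Hypothetical wrappers, one per approach.
with_cut() { cut -d- -f3-; }                  # field 3 onward, hyphens kept
with_sed() { sed 's/^[^-]*-[^-]*-//'; }       # strip exactly two fields

printf '%s\n' 'AB-QWERTU-55c3375fb1ee8bcd8c491e24b2' 'AB-X-12-34' | with_cut
printf '%s\n' 'AB-QWERTU-55c3375fb1ee8bcd8c491e24b2' 'AB-X-12-34' | with_sed
```

A greedy pattern such as .*- would instead cut up to the last hyphen on the line, which only gives the same result when every line has exactly two hyphens.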

Print line after multiline match with sed

I am trying to create a script to pull out an account code from a file. The file itself is long and contains a lot of other data, but I have included below an excerpt of the part I am looking at (there is other content before and after this excerpt).
The section of the file I am interested in sometimes looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
VIN No.
AAAAAA01 9999 1000 30 days
and sometimes it looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
AAAAAA01 9999 1000 30 days
(one field cut off the end, where that field had been wrapping down onto its own line)
I know I can use | tr -s ' ' | cut -d ' ' -f 1 to pull the code once I have the line it is on, but that is not a set line number (the content before this section is dynamic).
I am starting by trying to handle the case with the extra field; I figure it will be easy enough to make that an optional match with ?
The number of spaces used to separate the fields can change as this is essentially OCRed.
A few of my attempts so far - (assume the file is coming in from STDIN)
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s\+VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\n\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\n\s*VIN No\.\s*/{n;p;}'
These all failed to match anything.
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*/,/\s\*VIN No\.\s*/{n;p;}'
This at least matched something, but frustratingly printed the VIN No. line, followed by every second line after it. It also seems like it would be more difficult to mark as an optional part of the expression.
So, given an input of the full file (including either of the above excerpts), I am looking for an output of either
AAAAAA01 9999 1000 30 days
(which I can then trim to the required data) or AAAAAA01 if there is an easier way of getting straight to that.
This might work for you (GNU sed):
sed -n '/Account/{n;/VIN No\./n;p}' file
Use sed with the -n switch, this makes sed act like grep i.e. only print lines explicitly using the commands P or (this case) p.
/Account/ match a line with the pattern Account
For the above match only:
n normally this would print the current line and then read the next line into the pattern space, but as the -n is in action no printing takes place. So now the pattern space contains the next line.
/VIN No\./n If the current line contains VIN No. then effectively empty the pattern space and read in the next line.
p print whatever is currently in the pattern space.
So this is a condition within a condition. When we encounter Account, print either the following line or the line following that.
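The nested condition can be tried on both shapes of the excerpt. This is a sketch assuming GNU sed; account_code is a hypothetical name, and the awk stage is just one way to take the first field.

```shell
# account_code is a hypothetical wrapper: sed picks the data line,
# awk prints its first whitespace-separated field.
account_code() {
    sed -n '/Account/{n;/VIN No\./n;p}' | awk '{ print $1 }'
}

header='Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.'

# Case 1: the wrapped "VIN No." line is present and gets skipped.
printf '%s\n' "$header" 'VIN No.' 'AAAAAA01 9999 1000 30 days' | account_code
# Case 2: the data line follows the header directly.
printf '%s\n' "$header" 'AAAAAA01 9999 1000 30 days' | account_code
```

Note that /Account/ matches any line containing that word, so in the full file a stricter header pattern may be needed if the word appears elsewhere.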
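With GNU awk (\s is a gawk extension, and + must not be escaped in awk regexes):
awk '/^\s*Account\s+Customer Order No\.\s+Whse\s+Payment Terms\s+Stock No\.\s+Original Invoice No\.\s*$/ {
    getline                            # move to the line after the header
    if (/^\s*VIN No\.\s*$/) getline    # skip the wrapped VIN No. line if present
    print
    exit
}'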
Going strictly off your input, in both cases the desired field is on the last line. So to print the first field of the last line,
awk 'END {print $1}'
Result
AAAAAA01