sed reverse order label reading - regex

I would like to use sed to parse a file and prints only the last i labels within a field. Each label is separate by a ..
If I select i=3, with a file that contains the following lines:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
I would like, if there are at least 3 labels, the output lines to be:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Currently :
sed 's;\(^[^|]\+\)|.*\.\([^\.]\+\.[^\.]\+\.[^\.]\+\)|\([^|]\+$\);\1|\2|\3;' test.txt
produces:
begin_text|label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
and I do not get why a matching occurs for line 3. I also suppose ther eis a better way to do reverse order label reading.
Any comment/suggestion is appreciated.

Using awk will make the job easier.
awk 'split($2,a,".")>=i' FS="|" i=3 file
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Explanation
split(string, array, fieldsep)
split returns the number of elements created.

Related

How can i remove a line only if it is followed by a line that starts with the same character?

I need some help with sed or awks.
How can i remove a line only if it is followed by a line that starts with the same character (in this case >)?
Example I have this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
I want to get this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
Note that not all the lines have the same numbers but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce that would be really great.
Thank you so much!
If the whole file follows that pattern (some number of lines starting with >, of which you want only the last, followed by a single line that should always be printed), you could use something like this:
awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
If the line starts with >, then it is remembered (stored in the variable latest) but not printed. If the line doesn't start with >, then it is printed, but only after first printing whatever was most recently stored in latest.
The conditional means each printed > line will appear only once, even if there are multiple non-> lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:
awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'
The needed result can be easily achieved by just using uniq command with -w(--check-chars=N) option:
cat testfile | uniq -w 3
The output:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
-w, --check-chars=N
compare no more than N characters in lines
http://man7.org/linux/man-pages/man1/uniq.1.html
It will compare the first N characters of each line to make decision for repeated lines
try: if your data is same as given sample Input_file then following may help you in same.
awk '/^>/{A=$0;next} {print A ORS $0;A=""}' Input_file
This might work for you (GNU sed):
sed 'N;/^>.*\n>/!P;D' file
Read two lines into the pattern space and do not print the first of these lines if the first and second lines begin with >.
sed 'N;/^>.*\n\w/!D' file #(GNU sed)
N: read next line into the pattern space. /^>.*\n\w/!D: delete the first line if the first line starts with ">" and the second line doesn't begin with a letter

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And i want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
so i want to remove the duplicates in each line before the: |
and only let the first one there.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator "|" in this case after which the individual columns can be accessed column-wise starting with $1. In this answer, am maintaining a unique-value array seen and printing the line only if the value from $1 is not seen previously.

Calculating in NP++ with regular expressions

My document has x|y| in the beginning of each line, where x,y are integers between 0 and 300. E.g.:
1|1|text1
1|2|text2
1|3|text3
Now I want to make the following simple change: Each second number of every line should be subtracted by 1. So the above lines should be changed to
1|0|text1
1|1|text2
1|2|text3
Is that possible?
Okay, so here's something funny you can do, assuming the text is formatted like you indicated:
Do a first search and replace, replacing (\d+\|(\d+)\|.*) with ####\2\n\1. this will grab the index and output
1|1|text1
####1
1|2|text2
####2
1|3|text3
####3
do second search and replace to update the index of the row following the #### marker, replacing
####(\d)\n(\d+\|)\d+(\|.*)
with
\2\1\3
You need to update manually the first and last line, and you're good to go !
Since the numbers in the second column are following each other, all you need to do is increment their row.
A perl way to do the job:
perl -i.back -ape 's/\|(\d+)\|/"|".($1-1)."|"/e' in.txt
This replace all second number by this number minus one directly in the file (in.txt).
This file is save before in in.txt.back
Input file before:
1|1|text1
1|2|text2
1|3|text3
2|1|text4
3|1|text5
3|2|text6
after:
1|0|text1
1|1|text2
1|2|text3
2|0|text4
3|0|text5
3|1|text6
Regex can't do math. If the file is huge I suggest you to look into a simple python script or small C++ program to do the parsing job for you. Good luck.
missing awk ??? awwwww !!
awk -F'[|]' 'NR % 2 {num = $2-1;print $1"|"num"|"$3}' text.txt
where text.txt is the file containing your contents.
you can write the output into a file using the > operator
like this
awk -F'[|]' 'NR % 2 {num = $2-1;print $1"|"num"|"$3}' text.txt > final.txt

Find text enclosed by patterns using sed

I have a config file like this:
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
Now I want to match the lines in between [directive] and [otherdirective] without matching [directive] or [otherdirective].
Also if [otherdirective] is not found all lines till the end of file should be returned. The [...] might contain any number or letter.
Attempt
I tried this using sed like this:
sed -r '/\[directive\]/,/\[[[:alnum:]+\]/!d
The only problem with this attempt is that the first line is [directive]and the last line is [otherdirective].
I know how to pipe this again to truncate the first and last line but is there a sed solution to this?
You can use the range, as you were trying, and inside it use // negated. When it's empty it reuses last regular expression matched, so it will skip both edge lines:
sed -n '/\[directive\]/,/\[otherdirective\]/ { //! p }' infile
It yields:
This lines I want
Very much text here
So interesting
Here is a nice way with awk to get section of data.
awk -v RS= '/\[directive\]/' file
[directive]
This lines I want
Very much text here
So interesting
When setting RS to nothing RS= it divides the file up in records based on blank line.
So when searching for [directive] it will print that record.
Normally a record is one line, but due to the RS (record selector) is change, it gives the block.
Okay damn after more tries I found the solution or merely one solution:
sed -rn '/\[buildout\]/,/\[[[:alnum:]]+\]/{
/\[[[:alnum:]]+\]/d
p }'
is this what you want?
\[directive\](.*?)\[
Look here

Reorder columns in a CSV file by number

i am trying to replace the second column with the second to last column and also remove the three last column. For example, I have this sample.csv
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
I want to output:
1,5,3
a,e,c
g,k,i
I am using this command:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-3}'1 sample.csv
which works perfectly when I view the csv file in excel. however, when I look at the .csv file in notepad, I notice that the last item on one row is connected to the first item in the next row. so I am getting
1,5,3a,e,cg,k,i
Can anyone give me any advice on how to fix the problem so I can get the .csv file to have a new paragraph for each row like the desired output? Thanks.
Adding a carriage return(\r) to the end of each line should help:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-3;sub(/$/,"\r");}'1 sample.csv
Code for GNU sed:
sed -r 's/(\w,)\w,(\w,)\w,(\w,)\w/\1\3\2/' file
$cat file
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
$sed -r 's/(\w,)\w,(\w),\w,(\w,)\w/\1\3\2/' file
1,5,3
a,e,c
g,k,i