Calculating in Notepad++ with regular expressions

My document has x|y| at the beginning of each line, where x and y are integers between 0 and 300. E.g.:
1|1|text1
1|2|text2
1|3|text3
Now I want to make the following simple change: the second number of every line should be decremented by 1, so the above lines should be changed to
1|0|text1
1|1|text2
1|2|text3
Is that possible?

Okay, so here's something funny you can do, assuming the text is formatted like you indicated:
Do a first search and replace, replacing (\d+\|(\d+)\|.*) with \1\n####\2. This copies each line's second number onto a #### marker line below it, and outputs
1|1|text1
####1
1|2|text2
####2
1|3|text3
####3
Do a second search and replace to update the number of the row following the #### marker, replacing
####(\d+)\n(\d+\|)\d+(\|.*)
with
\2\1\3
You need to fix the first and last line manually, and you're good to go!
Since the numbers in the second column are consecutive, subtracting 1 from a line's number is the same as copying the number from the line above it.

A perl way to do the job:
perl -i.back -ape 's/\|(\d+)\|/"|".($1-1)."|"/e' in.txt
This replaces each second number with that number minus one, directly in the file (in.txt).
The original file is saved beforehand as in.txt.back.
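The key piece is the /e modifier, which tells perl to evaluate the replacement as a Perl expression rather than a literal string; that is what allows the arithmetic. A quick way to try it on a single line:
$ echo '1|2|text2' | perl -pe 's/\|(\d+)\|/"|".($1-1)."|"/e'
1|1|text2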
Input file before:
1|1|text1
1|2|text2
1|3|text3
2|1|text4
3|1|text5
3|2|text6
After:
1|0|text1
1|1|text2
1|2|text3
2|0|text4
3|0|text5
3|1|text6

Regex can't do math. If the file is huge, I suggest you look into a simple Python script or a small C++ program to do the parsing job for you. Good luck.
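For instance, a minimal Python sketch along those lines (an illustration only, assuming every line really starts with the x|y| prefix; in.txt and out.txt are placeholder names):
python3 -c '
import sys
# decrement the second |-separated field on every line
for line in sys.stdin:
    x, y, rest = line.split("|", 2)
    sys.stdout.write("%s|%d|%s" % (x, int(y) - 1, rest))
' < in.txt > out.txt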

missing awk ??? awwwww !!
awk -F'[|]' '{num = $2-1; print $1"|"num"|"$3}' text.txt
where text.txt is the file containing your contents.
you can write the output into a file using the > operator
like this
awk -F'[|]' '{num = $2-1; print $1"|"num"|"$3}' text.txt > final.txt
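If you have GNU awk 4.1 or later, you can also edit the file in place with the GNU-only -i inplace extension instead of redirecting to a second file:
gawk -i inplace -F'[|]' '{num = $2-1; print $1"|"num"|"$3}' text.txt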

Related

How can I remove a line only if it is followed by a line that starts with the same character?

I need some help with sed or awk.
How can I remove a line only if it is followed by a line that starts with the same character (in this case >)?
For example, I have this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
I want to get this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
Note that not all the lines have the same numbers, but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce, that would be really great.
Thank you so much!
If the whole file follows that pattern (some number of lines starting with >, of which you want only the last, followed by a single line that should always be printed), you could use something like this:
awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
If the line starts with >, then it is remembered (stored in the variable latest) but not printed. If the line doesn't start with >, then it is printed, but only after first printing whatever was most recently stored in latest.
The conditional means each printed > line will appear only once, even if there are multiple non-> lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:
awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'
A very similar result can be achieved by just using the uniq command with its -w (--check-chars=N) option:
cat testfile | uniq -w 3
The output:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
-w, --check-chars=N
compare no more than N characters in lines
http://man7.org/linux/man-pages/man1/uniq.1.html
It compares only the first N characters of each line when deciding whether lines are repeated. Note, however, that uniq keeps the first line of each group of duplicates, so where the desired output keeps the last header of a run (>5_SRR1422294), this keeps the first one (>5_SRR1422298) instead.
Try this: if your data is the same as the given sample Input_file, then the following may help:
awk '/^>/{A=$0;next} {print A ORS $0;A=""}' Input_file
This might work for you (GNU sed):
sed 'N;/^>.*\n>/!P;D' file
Read two lines into the pattern space; print the first of them only if the two lines do not both begin with >, then delete the first line and start the next cycle.
sed 'N;/^>.*\n\w/!D' file #(GNU sed)
N: read the next line into the pattern space. /^>.*\n\w/!D: delete the first line unless it starts with > and the next line begins with a word character; for this data, that drops any > header that is immediately followed by another > header.

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there is some regex- or awk-like tool that can append commas to the end of a line when a row is missing them?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad every row after the header with empty cell values, based on the number of columns in the header row, so that the column count is not hard-coded:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
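The trick is the assignment $nc=$nc: assigning to field nc forces awk to rebuild the record with OFS, creating any missing trailing fields as empty strings. A quick demonstration, with the column count hardcoded to 3 for illustration:
$ echo 'Unit,id' | awk 'BEGIN{FS=OFS=","} {nc=3; $nc=$nc} 1'
Unit,id,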
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed:
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , -v OFS=, 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To balance out all the awk solutions, the following could be a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for 2 commas in a line
:v applies a global command to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

How to parse a text file with particular compound expression filtering in shell scripting

I want to parse a text file and extract particular rows: every row that contains the words "cluster", "week", and "8.2" should be written to the output file.
sample text in the file
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~monthly~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
2013032308470272~800000102507~Cluster-Mode~yearly~8.1.2~V6240
Desired output, written into another text file by the above-mentioned filters:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I have written some code using the awk command; however, the output file contains rows that are outside the scope of the filters.
Code used to extract the text:
awk '/Cluster/ && /WEEK/ && /8.2/ { print $NF > "/u/nbsvc/Data/Lookup/derived_asup_2010404_201409_2.txt" }' /u/nbsvc/Data/Lookup/cmode_asup_lookup.txt
obtained output
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
Note: the first line of obtained output is not needed in the desired output. How can I change my script to only get the line that I want?
To remove any ambiguity and false matches on partial fields or the wrong field, THIS is the command you need to run:
$ awk -F'~' '$3~/^Cluster/ && $4=="WEEK" && $5~/^8\.2/' file
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I don't think that awk is needed at all here. Just use grep to match the line that you're interested in:
grep 'Cluster.*WEEK.*8\.2' file > output_file
The .* matches zero or more of any character and > is used to redirect the output to a new file. I have escaped the . in between "8.2" so that it is interpreted literally, rather than matching any character (although it would work either way).
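To see what the escaping guards against, compare the two on a made-up line that contains 8x2:
$ echo 'version 8x2' | grep '8.2'
version 8x2
$ echo 'version 8x2' | grep '8\.2'
(no match, so no output)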
There is actually a little more to my requirement: I need to read this text file, split each line, push the values into an array, and then check whether the values match my pattern. If they match, the line should be written to an output text file; otherwise it is simply ignored. I did it like this:
cat /inputfolder_path/lookup_filename.txt | awk '{IGNORECASE = 1; line = $0; split(line, a, "~"); if (a[1] ~ /201404/ && a[3] ~ /Cluster/ && a[4] ~ /WEEK/ && a[5] ~ /8\.2/) {print $0}}' > /outputfolder_path/derived_output_filename.txt
This is working exactly for my requirement.
Just thought to share this with everyone, as it may help someone.
Thanks,
Siva

sed reverse order label reading

I would like to use sed to parse a file and print only the lines whose middle field contains at least i labels. Each label is separated by a dot (.).
If I select i=3, with a file that contains the following lines:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
If there are at least 3 labels, I would like the line to be printed, so the output would be:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Currently:
sed 's;\(^[^|]\+\)|.*\.\([^\.]\+\.[^\.]\+\.[^\.]\+\)|\([^|]\+$\);\1|\2|\3;' test.txt
produces:
begin_text|label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
and I do not get why a match occurs for line 3. I also suppose there is a better way to do reverse-order label reading.
Any comment/suggestion is appreciated.
Line 3 is not actually matching: without -n, sed prints every line, and when the substitution finds nothing to replace, the line simply passes through unchanged. Using awk will make the job easier.
awk 'split($2,a,".")>=i' FS="|" i=3 file
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Explanation
split(string, array, fieldsep)
split returns the number of elements created.
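So the condition split($2,a,".")>=i holds, and the default action (print) fires, only when the second field contains at least i labels. For example, a quick check of split's return value:
$ echo 'a.b.c' | awk '{print split($0, parts, ".")}'
3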

Find text enclosed by patterns using sed

I have a config file like this:
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
Now I want to match the lines in between [directive] and [otherdirective] without matching [directive] or [otherdirective].
Also if [otherdirective] is not found all lines till the end of file should be returned. The [...] might contain any number or letter.
Attempt
I tried this using sed:
sed -r '/\[directive\]/,/\[[[:alnum:]]+\]/!d'
The only problem with this attempt is that the first line is [directive] and the last line is [otherdirective].
I know how to pipe this again to truncate the first and last line but is there a sed solution to this?
You can use the range, as you were trying, and inside it use a negated empty regex //. An empty regex reuses the last regular expression matched, so //!p skips both edge lines and prints everything else:
sed -n '/\[directive\]/,/\[otherdirective\]/ { //! p }' infile
It yields:
This lines I want
Very much text here
So interesting
Here is a nice way with awk to get a section of data.
awk -v RS= '/\[directive\]/' file
[directive]
This lines I want
Very much text here
So interesting
Setting RS to nothing (RS=) divides the file up into records based on blank lines.
So when searching for [directive], awk prints that whole record.
Normally a record is one line, but because RS (the record separator) has been changed, a record here is the whole block.
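Note that this prints the [directive] header line as well. If you also need to drop that line, a possible variant (still assuming blank-line-separated sections) treats each line of the record as a separate field:
awk -v RS= -F'\n' '/^\[directive\]/{for (i = 2; i <= NF; i++) print $i}' file
This lines I want
Very much text here
So interesting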
Okay, after more tries I found the solution, or at least one solution:
sed -rn '/\[buildout\]/,/\[[[:alnum:]]+\]/{
/\[[[:alnum:]]+\]/d
p }'
Is this what you want?
\[directive\](.*?)\[