Conditional gsub action

Conditional gsub action - if-statement

Following this answer, I am trying to reproduce a conditional statement where a substitution occurs whenever a line matches (the pattern matches dates). If there is no match, the line should be printed as it is.
#!/bin/bash
cleaner(){
./date_remove.awk $1
}
cleaner $1 > "out"
where 'date_remove.awk' is
#! /usr/bin/awk -f
date = /(^|[^[:alpha:]])[[:digit:]]{2}[[:space:]]{1,}[[:alpha:]]{3,8}[[:space:]]{1,}[[:digit:]]{4}([^[:alpha:]]|$)/ {gsub(date, "")} !date {print}
At this point the substitution does not happen. 'gsub' should return only the matched phrases, but it does not return anything. Only the unmatched lines are printed correctly. I am pretty sure it is a syntax problem, but I cannot figure out where.
Input:
ci sono 4444444444444Quattro mele
sentiamoci il 16 Ottobre 2018
deciIIIIIIdiamo il 17 ottabre 2017
Manipolo di eroi 55555555555
17 mele
18 ott 2020 llllllLLLLLLLLLLLL
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
Actual output:
ci sono 4444444444444Quattro mele
Manipolo di eroi 55555555555
17 mele
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
Expected output:
ci sono 4444444444444Quattro mele
sentiamoci il
deciIIIIIIdiamo il
Manipolo di eroi 55555555555
17 mele
llllllLLLLLLLLLLLL
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000

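That is not quite correct: gsub() does not return the matched phrases. It returns the number of substitutions made and modifies its target in place. What you need is a way to capture the matched text for the subsequent replacement.
The problem with your attempt is that date = /.../ does not store the regexp. A bare /.../ is shorthand for $0 ~ /.../, so date receives 1 or 0, and gsub(date, "") then looks for the text 1 instead of a date. On top of that, !date {print} prints only the lines that did not match, which is why the date lines disappear from the output. To work with the matched text, use match(), which sets RSTART and RLENGTH, and use those in the replacement: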
awk '
match($0, /(^|[^[:alpha:]])[[:digit:]]{2}[[:space:]]{1,}[[:alpha:]]{3,8}[[:space:]]{1,}[[:digit:]]{4}([^[:alpha:]]|$)/) {
str=substr($0, RSTART, RLENGTH); sub(str," ",$0 );
}1' file
The example above replaces the matched text, i.e. the date strings below, with a single white space.
16 Ottobre 2018
17 ottabre 2017
18 ott 2020
One could use sub() or gsub() depending on the number of occurrences of the regex in the line. Applying the command above removes those date strings from the file and produces the result below.
ci sono 4444444444444Quattro mele
sentiamoci il
deciIIIIIIdiamo il
Manipolo di eroi 55555555555
17 mele
llllllLLLLLLLLLLLL
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
Notice the 1 after the {..} block. It is an always-true pattern with no action, so awk applies the default action and prints every line, modified or not.
Put in an awk script, it would look like
#!/usr/bin/awk -f
match($0, /(^|[^[:alpha:]])[[:digit:]]{2}[[:space:]]{1,}[[:alpha:]]{3,8}[[:space:]]{1,}[[:digit:]]{4}([^[:alpha:]]|$)/) {
str=substr($0, RSTART, RLENGTH)
sub(str," ",$0 )
}1
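For comparison, here is a minimal Python sketch of the same "substitute on match, otherwise keep the line" idea. It is not part of the answer above: the names DATE_RE and strip_dates are made up for illustration, and the pattern assumes ASCII letters only, using lookarounds instead of consuming the neighbouring characters. re.subn returns both the new string and the number of substitutions, which mirrors the count that gsub() returns.

```python
import re

# Assumed translation of the question's pattern: two digits, a 3-8 letter
# month name, four digits, and no ASCII letter directly before or after.
DATE_RE = re.compile(r"(?<![A-Za-z])\d{2}\s+[A-Za-z]{3,8}\s+\d{4}(?![A-Za-z])")

def strip_dates(line):
    """Return (line without date-like phrases, number of removals)."""
    cleaned, count = DATE_RE.subn("", line)
    return cleaned, count

if __name__ == "__main__":
    for sample in ["sentiamoci il 16 Ottobre 2018", "17 mele"]:
        print(strip_dates(sample))
```

A line without a date comes back unchanged with a count of 0, so no separate "print the line as it is" branch is needed.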

Related

Remove "." from digits

I have a string in the following way =
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into :
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
that is I only convert a.b -> ab where a and b are integer
waiting for help

Assuming you are using Python, you can use capture groups in the regex, either numbered or named, and then use the groups in the replacement while leaving out the dot.
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: you reference each pattern group (the content in parentheses) by its index.
text = re.sub(r"(\d+)\.(\d+)", r"\1\2", text)
Named: you reference each pattern group by a name you specify.
text = re.sub(r"(?P<before>\d+)\.(?P<after>\d+)", r"\g<before>\g<after>", text)
Each of these returns:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However, you should be aware that leaving out the . in decimal numbers changes their value, so be careful with whatever you do with these numbers afterwards.

Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."

Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."

echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'
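The same lookaround idea can be sketched in Python. Python's re module has no \K, so this sketch uses a lookbehind for the digit before the dot instead; DOT_BETWEEN_DIGITS and drop_inner_dots are illustrative names, not taken from any answer here.

```python
import re

# A dot that has a digit on both sides; the digits themselves are not consumed.
DOT_BETWEEN_DIGITS = re.compile(r"(?<=\d)\.(?=\d)")

def drop_inner_dots(text):
    """Remove dots that sit directly between two digits."""
    return DOT_BETWEEN_DIGITS.sub("", text)

if __name__ == "__main__":
    print(drop_inner_dots("1.2 1.23 12.34 1. .2"))
```

Because the neighbouring digits are only inspected, chained values such as 1.2.3 lose every inner dot in a single pass.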

SED Replace after certain pattern - value in brackets

I have files where I need to replace all occurrences of
AC %blabla% with AC (%blabla%+PAR).
ROT: S 3 BL 3900 SPEED 20
BEN: L 15
ROT: S 2 BLL (DimZ/2+25) BLR (DimZ/2-29) SPEED 20
BEN: L 14-0.5 A 116 AC -1
ROT: S 2 BLR (DimZ/2-29) BLL (DimZ/2-20) SPEED 20
CLA: L 133 A 64 AC -1
ROT: S 1 BLL (DimZ/2-29) BLR (DimZ/2+25) SPEED 20
BEN: L 11-0.5 AC -90
BEN: L 95 AC 1.5
E.g.:
AC -1 should be AC (-1+PAR) afterwards.
AC 90 should be AC (90+PAR) afterwards.
What I've tried is:
sed "s/\( AC"."\)/\1(/"
But that doesn't even always add the "("...
I get:
ROT: S 3 BL 3900 SPEED 20
BEN: L 15
ROT: S 2 BLL (DimZ/2+25) BLR (DimZ/2-29) SPEED 20
BEN: L 14-0.5 A 116 AC (-1
ROT: S 2 BLR (DimZ/2-29) BLL (DimZ/2-20) SPEED 20
CLA: L 133 A 64 AC -1
ROT: S 1 BLL (DimZ/2-29) BLR (DimZ/2+25) SPEED 20
BEN: L 11-0.5 AC (-90
BEN: L 95 AC (1.5
Could someone please help me?
Thank you.

You can use the following POSIX BRE compliant regex with sed:
sed "s/\( AC \)\([^[:space:]]*\)/\1(\2+PAR)/" file
See the online sed demo
If you have GNU sed, I suggest
sed -E "s/\b(AC\s+)(\S+)/\1(\2+PAR)/" file
See another demo.
Regex details
\( AC \) - Group 1: space, AC, space (so, no match for BAC, for example)
\([^[:space:]]*\) - Group 2: zero or more non-whitespace chars
\1(\2+PAR) - the replacement is the concatenated Group 1 value + ( + Group 2 value and +PAR).
GNU sed regex details
\b - a word boundary
(AC\s+) - Group 1: AC and one or more whitespaces
(\S+) - Group 2: one or more non-whitespace chars.

$ sed -E 's/(AC )([^ ]*)/\1(\2+PAR)/' ip.txt
ROT: S 3 BL 3900 SPEED 20
BEN: L 15
ROT: S 2 BLL (DimZ/2+25) BLR (DimZ/2-29) SPEED 20
BEN: L 14-0.5 A 116 AC (-1+PAR)
ROT: S 2 BLR (DimZ/2-29) BLL (DimZ/2-20) SPEED 20
CLA: L 133 A 64 AC (-1+PAR)
ROT: S 1 BLL (DimZ/2-29) BLR (DimZ/2+25) SPEED 20
BEN: L 11-0.5 AC (-90+PAR)
BEN: L 95 AC (1.5+PAR)
-E to enable Extended Regular Expressions
Use sed 's/\(AC \)\([^ ]*\)/\1(\2+PAR)/' if -E isn't supported
(AC ) to match and capture AC followed by space
use ( AC ) to avoid partial match or use \b(AC ) if word boundary is supported
([^ ]*) to capture non-space characters
\1(\2+PAR) required output format
What's wrong with OP's attempt:
"s/\( AC"."\)/\1(/" will be treated as concatenation of s/\( AC followed by . followed by \)/\1(/
can be simplified to sed 's/\( AC.\)/\1(/' --> use single quotes unless double is required
\( AC.\) will match space followed by AC followed by any character only once
\1( will give you captured portion followed by (
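As a cross-check of the two-group approach, here is a small Python sketch under the same assumptions as the GNU sed variant (a word boundary before AC, whitespace, then a run of non-whitespace). AC_VALUE and wrap_ac are hypothetical names used only for this illustration.

```python
import re

# Group 1: "AC" plus the whitespace after it; group 2: the value to wrap.
AC_VALUE = re.compile(r"\b(AC\s+)(\S+)")

def wrap_ac(line):
    """Rewrite 'AC value' as 'AC (value+PAR)'; other lines pass through."""
    return AC_VALUE.sub(r"\1(\2+PAR)", line)

if __name__ == "__main__":
    for sample in ["BEN: L 14-0.5 A 116 AC -1", "BEN: L 15"]:
        print(wrap_ac(sample))
```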

Remove duplicate words and just print lines in which this occurs

I have a challenge to look in a file if a sentence contains 2 identical consecutive words. If so, you print the word; otherwise, you don't print the sentence.
Example:
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
abc h h h h
After running the program the output will be:
dea 123 zy45
12
xyz%$#! kk
abc h h h
3
This is what I have so far:
sed '/\([^\([^ ]\+\)[ ]\+\1]\)/d' F4 >|tmp
I got this so far but this is only separating between the sentences that have the double word and sentences that don't.

Your sed expression was quite close. However, it needed some reworking to make it work:
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' file
dea 123 zy45
12
xyz%$#! kk
abc h h h
The idea is the one you already implemented: match a given word with [^ ] and see if you match it again with \1. What I added is that all of this gets replaced with \1\2, that is, a single copy of the word plus whatever separator followed the pair, so the repeated word disappears.
Instead of [^ ] it is also useful to use \S, and instead of [ ], \s. Note also the usage of \b as a word boundary to prevent false positives like fedorqui qui, and the usage of \1(\s|$) to prevent other false positives like hello helloa (thanks WalterA for the examples!). The group (\s|$) matches either a space or the end of the line. A trailing \b would not do here: it only matches between a word character and a non-word character, so it fails after a word that ends in punctuation, as in the xyz%$#! kk case.
To prevent all lines from being printed, we use sed -n. This way, we only print (with p) the lines on which the substitution succeeded.
Note the usage of -r to avoid escaping the capture groups. Without it, the command would be:
sed -n 's/\b\([^ ]\+\)[ ]\+\1\([ ]\|$\)/\1\2/p' file
Let's test it with a more comprehensive input:
$ cat a
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
fedorqui qui
hello helloa
abc h h h h
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' a
dea 123 zy45
12
xyz%$#! kk
abc h h h

I was looking for a sed solution that seemed to be easy. Perhaps in this case awk is better (F4 is the input file):
awk '{
for (i=2; i<=NF; i++) {
if ($(i-1)==$i) {
$i="";
printf("%s\n", $0);
break;
}
}
}' F4
I am not completely happy with this solution, since it leaves a doubled field separator in $0 after deleting the repeated word; strictly speaking, though, the OP did not say that a space or tab should be deleted too.
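The field-by-field comparison can also be sketched in Python, where rejoining the remaining words avoids the doubled separator. This is only an illustration: drop_first_repeat is a made-up name, and the sketch assumes that collapsing runs of whitespace into single spaces is acceptable.

```python
def drop_first_repeat(line):
    """Return the line without its first repeated adjacent word,
    or None when no two adjacent words are identical."""
    words = line.split()
    for i in range(1, len(words)):
        if words[i] == words[i - 1]:
            # Rebuild the line from the words around the repeated one.
            return " ".join(words[:i] + words[i + 1:])
    return None

if __name__ == "__main__":
    for sample in ["dea 123 123 zy45", "abc cd abc cd", "abc h h h h"]:
        result = drop_first_repeat(sample)
        if result is not None:
            print(result)
```

Like the awk loop, it stops at the first repeated pair, so abc h h h h loses a single h.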

Extract a dictionary of substring in bash?

I have files of the kind:
(1), (2), (3), (4), (5), (6), (10), (11), (12), (13), (14), (15), (16), (17), (18), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), (38), (39), (40), (41), (42), (43), (51), (52), (53), (54), (55), (56), (57), (58), (62), (63), (64), (65), (66), (67), (68), (69), (70), (71), (72), (73), (74) Use method number 1. (7), (8), (9), (19), (20), (21), (22), (23), (59), (60), (61) Use method number 2. (44), (45), (46), (47), (48), (49), (50) Use method number 3.
I would like to build a dictionary containing the numbers between parentheses and link them to the sentences of the type: "Use method number #". So, in this case:
1,2,3,4,5...74 --> Use method number 1.
7,8,9,19....61 --> Use method number 2.
Currently I am building a complex while loop that reads regexes (^ *\([0-9]+\)), extracts each number, deletes the match and starts again until the regex is no longer found, and then extracts the sentence. But this performs poorly and is tedious to maintain.
Have you got any suggestions on how to improve this through more compact methods other than the while loop?
I am not bothered by the dictionary structure, do not consider it right now if it does not imply modifying the method.
EDIT. ADDING REAL DATA STRING:
(12), (13), (14), (15) P.S.: 3 días en cultivo de invernadero. Efectuar un máximo de 6 aplicaciones por
campaña a intervalos de 7 días utilizando un volumen máximo de caldo de 600 l/Ha. y un máximo de
7,5 Kg de cobre inorgánico por campaña.
(28) Tratamiento en otoño, pulverizando hasta una altura de 1,5 m.
(44), (45), (46), (47), (48), (49), (50), (51) Efectuar sólo tratamientos desde la cosecha hasta la
floración, limitando la aplicación a 1200 l. de caldo/Ha. y un máximo de 3 aplicaciones por campaña
(con un intervalo de tratamientos de 14 días) y un máximo de 7,5 Kg. de cobre inorgánico/Ha.por
campaña.

You can use sed:
sed -r 's/( *\(|\))//g;s/\./\n/g' input.txt
This assumes that your input file does not contain line breaks. If it contains line breaks the command needs to get modified a bit.
Explanation:
The first command s/( *\(|\))//g removes the parentheses and additional whitespace. The second command s/\./\n/g replaces each dot with a newline.
Oh, I missed that you want to add an additional -->. If you really need that, the second sed command needs to be modified:
sed -r 's/( *\(|\))//g;s/U[^.]+\./--> \0\n/g' input.txt
Now the second command searches for the sequence from U up to a dot, prepends a --> to it, and adds the newline after the dot.
Output:
1,2,3,4,5,6,10,...,74 --> Use method number 1.
7,8,9,19,20,21,22,23,59,60,61 --> Use method number 2.
44,45,46,47,48,49,50 --> Use method number 3.
One more thing: the above commands add an additional newline at the end of the output. You can suppress that by adding a third sed command, s/\n$//, which removes the newline before the end of the output:
sed -r 's/( *\(|\))//g;s/U[^.]+\./--> \0\n/g;s/\n$//' input.txt

Quite an idiomatic GNU awk solution:
awk -v RS="Use method number [0-9]." \
-v OFS=" --> " \
'NF{gsub(/\s*|\(|\)/, ""); print $0, RT}' file
Test
$ awk -v RS="Use method number [0-9]." -v OFS=" --> " 'NF{gsub(/\s*|\(|\)/, ""); print $0, RT}' a
1,2,3,4,5,6,10,11,12,13,14,15,16,17,18,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,51,52,53,54,55,56,57,58,62,63,64,65,66,67,68,69,70,71,72,73,74 --> Use method number 1.
7,8,9,19,20,21,22,23,59,60,61 --> Use method number 2.
44,45,46,47,48,49,50 --> Use method number 3.
Explanation
-v RS="Use method number [0-9]." sets the record separator to the string "Use method number X.", X being a digit.
-v OFS=" --> " sets the output field separator used by print.
NF{gsub(/\s*|\(|\)/, ""); print $0, RT} main code
-- NF {} if there is at least one field, proceed.
-- gsub(/\s*|\(|\)/, "") remove all spaces, ( and ) from the string.
-- print $0, RT print the replaced string together with the record separator that was used ("Use method number X."). Using RT instead of RS so that we catch the value of the specific X used in the string.
From man awk:
RT
The record terminator. Gawk sets RT to the input text that matched
the character or regular expression specified by RS.
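Since the question asks for a dictionary, here is a rough Python sketch of the same record-splitting idea. It assumes the simplified input where every block ends in a sentence of the form "Use method number N." and does not handle the free-form sentences of the real data; build_method_map is an illustrative name.

```python
import re

def build_method_map(text):
    """Map each parenthesised number to the 'Use method number N.' sentence
    that closes its block (assumes the simplified sample format only)."""
    mapping = {}
    # Group 1: everything up to the sentence; group 2: the sentence itself.
    for match in re.finditer(r"(.*?)(Use method number \d+\.)", text, re.DOTALL):
        numbers, sentence = match.groups()
        for number in re.findall(r"\((\d+)\)", numbers):
            mapping[int(number)] = sentence
    return mapping

if __name__ == "__main__":
    sample = "(1), (2) Use method number 1. (7), (8) Use method number 2."
    print(build_method_map(sample))
```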

You can very intuitively do it with an ed script:
:: ed.script ::
# first you split your data in multiple lines
,s/\(\(([0-9]*), \)*([0-9]*)\)/\
\1\
/g
# then for each matching line with numbers, you remove unwanted chars
# and append " --> " to the next line
,g/\(\(([0-9]*), \)*([0-9]*)\)/\
s/[)( ]//g\
a\
-->\
.
# and finally you join lines
,g/^ -->/-1,+1j
# save if you want
w
Then you launch it with the following command:
cat ed.script | ed -s file.txt
That was the intuitive part... and it works with your sample data.

Match a word just once - AWK

I was reading the GNU awk manual but I didn't find a regular expression with which I can match a string just once.
For example from the files aha_1.txt, aha_2.txt, aha_3.txt, .... I would like to print the second column $2 from the first time ana appears in the files (aha_1.txt, aha_2.txt, aha_3.txt, ....). In addition, the same thing when pedro appears.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
Meanwhile I did this but it counts all the cases not just the first time
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)

Just call the exit command after printing the first value(second column in the line which starts with the string ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454

Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the fist time ana appears in each file and the output has to be one file.
That's slightly different as a question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
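The "first match only" logic can also be sketched in Python, where returning from the loop plays the role of the counter. This is a sketch under assumptions: first_second_column is a made-up name, it compares the first field exactly (as in the $1~/^ana$/ variant) and works on any iterable of lines, including an open file.

```python
def first_second_column(lines, name):
    """Return the second field of the first line whose first field is `name`,
    or None if no such line exists."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name:
            return fields[1]  # stop at the first hit
    return None

if __name__ == "__main__":
    sample = ["luis 321 487", "ana 454 345", "pedro 341 435", "ana 941 345"]
    print(first_second_column(sample, "ana"))
    print(first_second_column(sample, "pedro"))
```

For several files, one would call it once per file, which resets the search the same way the FILENAME check does.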
