I have a string like this:
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into :
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
That is, I only want to convert a.b -> ab where a and b are integers.
waiting for help
Assuming you are using Python, you can use capture groups in a regex, either numbered or named, and then reference the groups in the replacement while leaving out the dot (.).
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: you reference each pattern group (the content in parentheses) by its index.
text = re.sub(r"(\d+)\.(\d+)", r"\1\2", text)
Named: you reference each pattern group by a name you specified.
text = re.sub(r"(?P<before>\d+)\.(?P<after>\d+)", r"\g<before>\g<after>", text)
Either of which returns:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However, you should be aware that leaving out the . in decimal numbers changes their value, so be careful with whatever you do with these numbers afterwards.
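To make that caveat concrete, here is a small sketch (the variable names are only illustrative) that lists each decimal number next to the integer it becomes once the dot is dropped. The scale factor depends on how many digits followed the dot, so the change is not uniform.

```python
import re

text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."

# Pair each decimal number with the integer it becomes once the dot is removed.
changes = [
    (float(m.group(0)), int(m.group(1) + m.group(2)))
    for m in re.finditer(r"(\d+)\.(\d+)", text)
]

for before, after in changes:
    print(before, "->", after)
```

Here 4.0 is scaled by 10 while 3.50 is scaled by 100, which is worth keeping in mind if the numbers are doses.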
Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'
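For comparison, a rough Python sketch of the same idea. Python's standard re module has no \K, so this version assumes a lookbehind is an acceptable substitute; the function name is made up for illustration. Because nothing but the dot is consumed, chained dots such as 1.2.3 are also handled, whereas the capture-group pattern (\d+)\.(\d+) would leave 12.3.

```python
import re

# Match a dot only when it sits directly between two digits.
# Lookbehind and lookahead are zero-width, so only the dot itself is removed.
DOT_BETWEEN_DIGITS = re.compile(r"(?<=\d)\.(?=\d)")

def strip_decimal_dots(s):
    return DOT_BETWEEN_DIGITS.sub("", s)

print(strip_decimal_dots("1.2 1.23 12.34 1. .2"))
print(strip_decimal_dots("1.2.3"))
```

Compiling the pattern once plays the same role as hoisting the regex into BEGIN in the Ruby one-liner.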
I have a challenge: check whether a sentence in a file contains two identical consecutive words. If so, you print the word; otherwise, you don't print the sentence.
Example:
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
abc h h h h
After running the program the output will be:
dea 123 zy45
12
xyz%$#! kk
abc h h h
3
This is what I have so far:
sed '/\([^\([^ ]\+\)[ ]\+\1]\)/d' F4 >|tmp
I got this far, but it only separates the sentences that have a doubled word from the sentences that don't.
Your sed expression was quite close. However, it needed some tweaking to make it work:
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' file
dea 123 zy45
12
xyz%$#! kk
abc h h h
The idea is the one you already implemented: match a given word with [^ ] and see if you match it again with \1. What I added is replacing all of this with \1\2, that is, one copy of the word plus the separator that followed it, so the repeated word disappears.
Instead of [^ ] you can use \S, and instead of [ ], \s. Note also the use of \b as a word boundary to prevent false positives like fedorqui qui, and the use of \1(\s|$) to prevent other false positives like hello helloa (thanks WalterA for the examples!). \s|$ matches either whitespace or the end of the line. A trailing \b would not work here: \b only matches at the transition between a word character and a non-word character, so it fails after a word that ends in punctuation, as in the xyz%$#! kk case.
To prevent all lines from being printed, we use sed -n. This way, we only print (with p) the lines on which the substitution succeeded.
Note the use of -r, which avoids having to escape the capture groups. Without it, the command would be:
sed -n 's/\b\([^ ]\+\)[ ]\+\1\([ ]\|$\)/\1\2/p' file
Let's test it with a more comprehensive input:
$ cat a
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
fedorqui qui
hello helloa
abc h h h h
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' a
dea 123 zy45
12
xyz%$#! kk
abc h h h
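The same pattern can be tried outside sed. Below is a minimal Python sketch, assuming the goal is to collapse only the first repeated word on each line and to print only lines that changed; the function name and the list of lines are illustrative, not part of the original answer.

```python
import re

# A word, whitespace, the same word again, then whitespace or end of line.
# Group 2 is put back in the replacement so the following separator survives.
REPEATED = re.compile(r"\b(\S+)\s+\1(\s|$)")

def dedupe_first_repeat(line):
    """Return the line with its first repeated word collapsed, or None."""
    new_line, count = REPEATED.subn(r"\1\2", line, count=1)
    return new_line if count else None

lines = [
    "abc2 1 def2 3 abc2", "F4", "--------------", "dea 123 123 zy45",
    "12 12", "abc cd abc cd", "xyz%$#! xyz%$#! kk", "xyzxyz",
    "fedorqui qui", "hello helloa", "abc h h h h",
]

result = [d for d in map(dedupe_first_repeat, lines) if d is not None]
print("\n".join(result))
```

Using count=1 mirrors the sed command, which has no g flag and therefore replaces only the first match per line.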
I was looking for a sed solution, which seemed like it would be easy. Perhaps awk is better in this case (F4 is the input file):
awk '{
    for (i=2; i<=NF; i++) {
        if ($(i-1)==$i) {
            $i="";
            printf("%s\n", $0);
            break;
        }
    }
}' F4
I am not completely happy with this solution, since it leaves a double field separator in $0 after deleting the doubled word, but taken literally the OP did not say that a space or tab should be deleted too.
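The field-comparison idea can be sketched in Python as well. This version is an illustration under one assumption: it rejoins the fields with single spaces, so the double separator goes away at the cost of normalizing any original whitespace. The function name is hypothetical.

```python
def drop_first_adjacent_duplicate(line):
    """Return the line without its first repeated field, or None if none repeats."""
    fields = line.split()
    for i in range(1, len(fields)):
        if fields[i] == fields[i - 1]:
            del fields[i]             # remove the second copy of the word
            return " ".join(fields)   # rejoin with single spaces
    return None

for line in ["dea 123 123 zy45", "12 12", "abc cd abc cd", "abc h h h h"]:
    fixed = drop_first_adjacent_duplicate(line)
    if fixed is not None:
        print(fixed)
```

Comparing whole fields avoids the word-boundary subtleties of the regex approach, since fedorqui qui and hello helloa are simply different fields.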
I have files of the kind:
(1), (2), (3), (4), (5), (6), (10), (11), (12), (13), (14), (15), (16), (17), (18), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), (38), (39), (40), (41), (42), (43), (51), (52), (53), (54), (55), (56), (57), (58), (62), (63), (64), (65), (66), (67), (68), (69), (70), (71), (72), (73), (74) Use method number 1. (7), (8), (9), (19), (20), (21), (22), (23), (59), (60), (61) Use method number 2. (44), (45), (46), (47), (48), (49), (50) Use method number 3.
I would like to build a dictionary containing the numbers between parentheses and link them to the sentences of the type: "Use method number #". So, in this case:
1,2,3,4,5...74 --> Use method number 1.
7,8,9,19....61 --> Use method number 2.
Currently I am building a complex while loop that matches a regex (^ *\([0-9]+\)), extracts each number, deletes the match and starts again until the regex is no longer found, and then extracts the sentence. But this performs poorly and is tedious to maintain.
Have you got any suggestions on how to improve this with more compact methods than the while loop?
I am not bothered by the dictionary structure, do not consider it right now if it does not imply modifying the method.
EDIT. ADDING REAL DATA STRING:
(12), (13), (14), (15) P.S.: 3 días en cultivo de invernadero. Efectuar un máximo de 6 aplicaciones por
campaña a intervalos de 7 días utilizando un volumen máximo de caldo de 600 l/Ha. y un máximo de
7,5 Kg de cobre inorgánico por campaña.
(28) Tratamiento en otoño, pulverizando hasta una altura de 1,5 m.
(44), (45), (46), (47), (48), (49), (50), (51) Efectuar sólo tratamientos desde la cosecha hasta la
floración, limitando la aplicación a 1200 l. de caldo/Ha. y un máximo de 3 aplicaciones por campaña
(con un intervalo de tratamientos de 14 días) y un máximo de 7,5 Kg. de cobre inorgánico/Ha.por
campaña.
You can use sed:
sed -r 's/( *\(|\))//g;s/\./\n/g' input.txt
This assumes that your input file does not contain line breaks. If it does, the command needs to be modified a bit.
Explanation:
The first command s/( *\(|\))//g removes the parentheses and the whitespace in front of each opening parenthesis. The second command s/\./\n/g replaces each dot with a newline.
Oh, I missed that you want to add a -->. If you really need that, the second sed command needs to be modified:
sed -r 's/( *\(|\))//g;s/U[^.]+\./--> \0\n/g' input.txt
Now the second command searches for the sequence that starts with U and runs up to the next dot, prepends --> to it, and adds a newline after the dot.
Output:
1,2,3,4,5,6,10,...,74 --> Use method number 1.
7,8,9,19,20,21,22,23,59,60,61 --> Use method number 2.
44,45,46,47,48,49,50 --> Use method number 3.
One more thing: the above command adds an extra newline at the end of the output. You can suppress that by adding a third sed command s/\n$// which removes that newline from the end of the output:
sed -r 's/( *\(|\))//g;s/U[^.]+\./--> \0\n/g;s/\n$//' input.txt
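Since the question ultimately asks for a dictionary, here is a minimal Python sketch of one way to build it directly. It assumes every sentence has the form "Use method number N." as in the first sample (the real data string would need a different sentence pattern), and the names block and methods are illustrative.

```python
import re

text = ("(1), (2), (3) Use method number 1. "
        "(7), (8), (9) Use method number 2. "
        "(44), (45) Use method number 3.")

# Group 1: a run of "(N)" items separated by commas.
# Group 2: the sentence that follows the run.
block = re.compile(r"((?:\(\d+\)(?:,\s*)?)+)\s*(Use method number \d+\.)")

methods = {}
for numbers, sentence in block.findall(text):
    for n in re.findall(r"\d+", numbers):
        methods[int(n)] = sentence

print(methods[8])
```

A single pass with findall replaces the extract-delete-repeat loop described in the question.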
Quite an idiomatic GNU awk solution:
awk -v RS="Use method number [0-9]." \
    -v OFS=" --> " \
    'NF{gsub(/\s*|\(|\)/, ""); print $0, RT}' file
Test
$ awk -v RS="Use method number [0-9]." -v OFS=" --> " 'NF{gsub(/\s*|\(|\)/, ""); print $0, RT}' a
1,2,3,4,5,6,10,11,12,13,14,15,16,17,18,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,51,52,53,54,55,56,57,58,62,63,64,65,66,67,68,69,70,71,72,73,74 --> Use method number 1.
7,8,9,19,20,21,22,23,59,60,61 --> Use method number 2.
44,45,46,47,48,49,50 --> Use method number 3.
Explanation
-v RS="Use method number [0-9]." sets the record separator to the string "Use method number X.", X being a digit.
-v OFS=" --> " sets the output field separator.
NF{gsub(/\s*|\(|\)/, ""); print $0, RT} main code
-- NF {} if there is at least one field, proceed.
-- gsub(/\s*|\(|\)/, "") removes all whitespace, ( and ) from the record.
-- print $0, RT prints the cleaned record together with the record terminator that was matched ("Use method number X."). RT is used instead of RS so that we get the actual X that appeared in the input.
From man awk:
RT
The record terminator. Gawk sets RT to the input text that matched
the character or regular expression specified by RS.
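For readers without gawk, the RS/RT mechanism can be approximated in Python. The sketch below relies on the fact that re.split keeps the separators when the pattern contains a capturing group; the shortened sample text and the variable names are illustrative only.

```python
import re

text = ("(1), (2), (3) Use method number 1. "
        "(7), (8), (9) Use method number 2. "
        "(44), (45) Use method number 3.")

# The capturing group makes re.split return each separator too,
# which plays the role of gawk's RT. Records and separators alternate.
parts = re.split(r"(Use method number [0-9]\.)", text)

rows = []
for record, terminator in zip(parts[0::2], parts[1::2]):
    numbers = re.sub(r"[\s()]", "", record)  # drop whitespace and parentheses
    if numbers:                               # same role as the NF check
        rows.append(numbers + " --> " + terminator)

print("\n".join(rows))
```

The trailing empty string that re.split produces after the last separator is dropped by zip, much as the NF test skips an empty final record.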
You can very intuitively do it with an ed script:
:: ed.script ::
# first you split your data in multiple lines
,s/\(\(([0-9]*), \)*([0-9]*)\)/\
\1\
/g
# then for each matching line with numbers, you remove unwanted chars
# and append " --> " to the next line
,g/\(\(([0-9]*), \)*([0-9]*)\)/\
s/[)( ]//g\
a\
-->\
.
# and finally you join lines
,g/^ -->/-1,+1j
# save if you want
w
Then you launch it with the following command:
cat ed.script | ed -s file.txt
That was the intuitive part... and it works with your sample data.
I was reading the GNU awk manual but I didn't find a regular expression with which I can match a string just once.
For example, from the files aha_1.txt, aha_2.txt, aha_3.txt, ... I would like to print the second column ($2) of the line where ana first appears in the files. In addition, I want the same thing for pedro.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
Meanwhile I did this, but it prints all the matches, not just the first one:
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)
Just call the exit command after printing the first value (the second column of the first line whose first field is ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454
Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the first time ana appears in each file and the output has to be one file.
That's a slightly different question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
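The per-file bookkeeping can also be sketched in Python. This is only an illustration: the files are represented as in-memory lists of lines rather than read from disk, the helper name first_values is made up, and it matches on the whole first column (as in the $1~/^ana$/ answer) rather than anywhere in the line. Unlike a bare exit, it handles ana and pedro together.

```python
def first_values(lines, names):
    """Map each name to column 2 of the first line whose column 1 equals it."""
    found = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in names and fields[0] not in found:
            found[fields[0]] = fields[1]
            if len(found) == len(names):
                break  # every name seen once: stop early, like awk's exit
    return found

# Sample data copied from the question, keyed by file name.
files = {
    "aha_1.txt": ["luis 321 487", "ana 454 345", "pedro 341 435", "ana 941 345"],
    "aha_2.txt": ["pedro 201 723", "gusi 837 134", "ana 319 518",
                  "cindy 738 278", "ana 984 265"],
}

for filename, lines in files.items():
    print(filename, first_values(lines, {"ana", "pedro"}))
```

Calling the helper once per file gives each file a fresh found dictionary, which corresponds to resetting the counter when FILENAME changes in the awk version.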