SED Replace after certain pattern - value in brackets - regex

I have files where i need to replace all occourences of
AC %blabla% with AC (%blabla%+PAR).
ROT: S 3 BL 3900 SPEED 20
BEN: L 15
ROT: S 2 BLL (DimZ/2+25) BLR (DimZ/2-29) SPEED 20
BEN: L 14-0.5 A 116 AC -1
ROT: S 2 BLR (DimZ/2-29) BLL (DimZ/2-20) SPEED 20
CLA: L 133 A 64 AC -1
ROT: S 1 BLL (DimZ/2-29) BLR (DimZ/2+25) SPEED 20
BEN: L 11-0.5 AC -90
BEN: L 95 AC 1.5
E.g.:
AC -1 should be AC (-1+PAR) afterwards.
AC 90 should be AC (90+PAR) afterwards.
What i've tried is:
sed "s/\( AC"."\)/\1(/"
But that doesn't even always add the "("...
I get:
ROT: S 3 BL 3900 SPEED 20
BEN: L 15
ROT: S 2 BLL (DimZ/2+25) BLR (DimZ/2-29) SPEED 20
BEN: L 14-0.5 A 116 AC (-1
ROT: S 2 BLR (DimZ/2-29) BLL (DimZ/2-20) SPEED 20
CLA: L 133 A 64 AC -1
ROT: S 1 BLL (DimZ/2-29) BLR (DimZ/2+25) SPEED 20
BEN: L 11-0.5 AC (-90
BEN: L 95 AC (1.5
Could someone please help me?
Thank you.

You can use the following POSIX BRE compliant regex with sed:
sed "s/\( AC \)\([^[:space:]]*\)/\1(\2+PAR)/" file
See the online sed demo
If you have GNU sed, I suggest
sed -E "s/\b(AC\s+)(\S+)/\1(\2+PAR)/" file
See another demo.
Regex details
\( AC \) - Group 1: space, AC, space (so, no match for BAC, for example)
\([^[:space:]]*\) - Group 2: zero or more non-whitespace chars
\1(\2+PAR) - the replacement is the concatenated Group 1 value + ( + Group 2 value and +PAR).
GNU sed regex details
\b - a word boundary
(AC\s+) - Group 1: AC and one or more whitespaces
(\S+) - Group 2: one or more non-whitespace chars.

$ sed -E 's/(AC )([^ ]*)/\1(\2+PAR)/' ip.txt
ROT: S 3 BL 3900 SPEED 20
BEN: L 15
ROT: S 2 BLL (DimZ/2+25) BLR (DimZ/2-29) SPEED 20
BEN: L 14-0.5 A 116 AC (-1+PAR)
ROT: S 2 BLR (DimZ/2-29) BLL (DimZ/2-20) SPEED 20
CLA: L 133 A 64 AC (-1+PAR)
ROT: S 1 BLL (DimZ/2-29) BLR (DimZ/2+25) SPEED 20
BEN: L 11-0.5 AC (-90+PAR)
BEN: L 95 AC (1.5+PAR)
-E to enable Extended Regular Expressions
Use sed 's/\(AC \)\([^ ]*\)/\1(\2+PAR)/' if -E isn't supported
(AC ) to match and capture AC followed by space
use ( AC ) to avoid partial match or use \b(AC ) if word boundary is supported
([^ ]*) to capture non-space characters
\1(\2+PAR) required output format
What's wrong with OP's attempt:
"s/\( AC"."\)/\1(/" will be treated as concatenation of s/\( AC followed by . followed by \)/\1(/
can be simplified to sed 's/\( AC.\)/\1(/' --> use single quotes unless double is required
\( AC.\) will match space followed by AC followed by any character only once
\1( will give you captured portion followed by (

Related

Remove "." from digits

I have a string in the following way =
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into :
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
that is I only convert a.b -> ab where a and b are integer
waiting for help
Assuming you are using Python. You can use captured groups in regex. Either numbered captured group or named captured group. Then use the groups in the replacement while leaving out the ..
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: You reference the pattern group (content in brackets) by their index.
text = re.sub("(\d+)\.(\d+)", "\\1\\2", text)
Named: You reference the pattern group by a name you specified.
text = re.sub("(?P<before>\d+)\.(?P<after>\d+)", "\g<before>\g<after>", text)
Which each returns:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However you should be aware that leaving out the . in decimal numbers will change their value. So you should be careful with whatever you are doing with these numbers afterwards.
Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'

Conditional gsub action

According to this answer, I am trying to reproduce a conditional statement where, in the event of a match, a substitusion occurs (it matches dates). If no match happens, the line is printed as it is.
#!/bin/bash
cleaner(){
./date_remove.awk $1
}
cleaner $1 > "out"
where 'date_remove.awk' is
#! /usr/bin/awk -f
date = /(^|[^[:alpha:]])[[:digit:]]{2}[[:space:]]{1,}[[:alpha:]]{3,8}[[:space:]]{1,}[[:digit:]]{4}([^[:alpha:]]|$)/ {gsub(date, "")} !date {print}
At this point the substitution does not happens. 'gsub' should return only the matched phrases, but it does not return anything, actually. Just unmatched phrases are printed correctly. At this point, I am pretty sure is a problem of syntax, but I cannot figure out where.
Input:
ci sono 4444444444444Quattro mele
sentiamoci il 16 Ottobre 2018
deciIIIIIIdiamo il 17 ottabre 2017
Manipolo di eroi 55555555555
17 mele
18 ott 2020 llllllLLLLLLLLLLLL
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
Actual output:
ci sono 4444444444444Quattro mele
Manipolo di eroi 55555555555
17 mele
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
Expected output:
ci sono 4444444444444Quattro mele
sentiamoci il
deciIIIIIIdiamo il
Manipolo di eroi 55555555555
17 mele
llllllLLLLLLLLLLLL
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
It is not quite correct, gsub() does not return the matched phrases on its own. It just returns the count of substitutions made. Your problem is dealing with how to store the matching group for subsequent string replacement.
The problem with your attempt is the regexp matched within /../ is not stored explicitly, you need to make it be stored by using match() or index() and use that in the replacement part,
awk '
match($0, /(^|[^[:alpha:]])[[:digit:]]{2}[[:space:]]{1,}[[:alpha:]]{3,8}[[:space:]]{1,}[[:digit:]]{4}([^[:alpha:]]|$)/) {
str=substr($0, RSTART, RLENGTH); sub(str," ",$0 );
}1' file
The example above would replace the captured group i.e. your date strings below and replace them with a single white space.
16 Ottobre 2018
17 ottabre 2017
18 ott 2020
One could use sub() or gsub() depending on the number of occurrences of the regex in the line. Applying the command above would remove the those date strings from the file and produce a result as below.
ci sono 4444444444444Quattro mele
sentiamoci il
deciIIIIIIdiamo il
Manipolo di eroi 55555555555
17 mele
llllllLLLLLLLLLLLL
una mela e mezza
2 mAAAeleA
0000 asd a0 0 ad000
Notice the {..}1 after we do the string replace. It is needed to reconstruct the line after the appropriate replacements are done.
Putting it in awk script it would look like
#!/usr/bin/awk -f
match($0, /(^|[^[:alpha:]])[[:digit:]]{2}[[:space:]]{1,}[[:alpha:]]{3,8}[[:space:]]{1,}[[:digit:]]{4}([^[:alpha:]]|$)/) {
str=substr($0, RSTART, RLENGTH)
sub(str," ",$0 )
}1

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
And I need to extract the words and numbers that appears 2-4 times.- {2,4}
I've tried many regex lines and even regex101.
I cant really put my finger on what's not working.
this is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
Native grep doesn't supoort \w and {} notations. You have to use extended regular expressions.
Use
-E option as,
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use
-w to match words, so that it matches the entire words instead of partial.
-w, --word-regexp
The expression is searched for as a word (as if surrounded by [[:<:]]' and[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
Note
You can eliminated use of an un-necessary cat by providing file as input to grep instead.
You were very close; within character class notation [], the special notation \w is being treated literally, put it out of []:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It will scan the whole file, find out the words with occurrence (in the file) in this range [2,4]
Output is:
uu
ab
88
1
Using AWK, this solution counts the word instances per line not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
Delete to clear the array for each new line. Use fields as hash for array indexes and increment it's value by one. Print the index (field) with values between 2 and 4 inclusive.
Output:
ab 1 33
ab 88 33
Perl implementation for a file small enough to process its content as a single string:
$/ = undef;
$_ = <>;
#_ = /(\b\w+\b)/gs;
my %h; $h{$_}++ for #_;
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}
Save it into a script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w with word boundaries around) into #_ array. Then it counts the number of times each word occurred in the file and stores the result into %h hash. The final block prints only the items that occurred 2, 3, or 4 times, no more and no less.
Note, in Perl you should always use strict; and use warnings; in order to detect issues at early phase.

How can I extract Twitter #handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter #handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*#([\w+])).*$
While the expression above deletes all lines without #handles, I'd like the RegEx to delete everything before and after the #handle as well as lines without #handles.
Example:
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Desired result:
#katyperry
#justinbieber
#taylorswift13
Thanks in advance for any help!
Something like this:
cat file | perl -ne 'while(s/(#[a-z0-9_]+)//gi) { print $1,"\n"}'
This will also work if you have lines with multiple #handles in.
A Twitter handle regex is #\w+. So, to remove everything else, you need to match and capture the pattern and use a backreference to this capture group, and then just match any character:
(#\w+)|.
Use DOTALL mode to also match newline symbols. Replace with $1 (or \1, depending on the tool you are using).
See demo
Strait REGEX Tested in Caret:
#.*[^)]
The above will search for and any given and exclude close parenthesis.
#.*\b
The above here does the same thing in Caret text editor.
How to awk and sed this:
Get usernames as well:
$ awk '/#.*/ {print}' test
katyperry KATY PERRY (#katyperry)
justinbieber Justin Bieber (#justinbieber)
taylorswift13 Taylor Swift (#taylorswift13)
Just the Handle:
$ awk -F "(" '/#.*/ {print$2}' test | sed 's/)//g'
#katyperry
#justinbieber
#taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Replace first two whitespace occurrences with a comma using sed

I have a whitespace delimited file with a variable number of entries on each line. I want to replace the first two whitespaces with commas to create a comma delimited file with three columns.
Here's my input:
a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
And here's my desired output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
I'm trying to use perl regular expressions in a sed command but I can't quite get it to work. First I try capturing a word, followed by a space, then another word, but that only works for lines 1, 2, and 5:
$ cat test | sed -r 's/(\w)\s+(\w)\s+/\1,\2,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try capturing whitespace, a word, and then more whitespace, but that gives me the same result:
$ cat test | sed -r 's/\s+(\w)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try doing this with the .? wildcard, but that does something funny to line 4.
$ cat test | sed -r 's/\s+(.?)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh,,77 88 99
z,y,2 3 33
Any help is much appreciated!
How about this:
sed -e 's/\s\+/,/' | sed -e 's/\s\+/,/'
It's probably possible with a single sed command, but this is sure an easy way :)
My output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Try this:
sed -r 's/\s+(\S+)\s+/,\1,/'
Just replaced \w (one "word" char) with \S+ (one or more non-space chars) in one of your attempts.
You can provide multiple commands to a single instance of sed by just providing multiple -e arguments.
To do the first two, just use:
sed -e 's/\s\+/,/' -e 's/\s\+/,/'
This basically runs both commands on the line in sequence, the first doing the first block of whitespace, the second doing the next.
The following transcript shows this in action:
pax$ echo 'a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
' | sed -e 's/\s\+/,/' -e 's/\s\+/,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Sed s/// supports a way to say which occurrence of a pattern to replace: just add the n to the end of the command to replace only the nth occurrence. So, to replace the first and second occurrences of whitespace, just use it this way:
$ sed 's/ */,/1;s/ */,/2' input
a,b ,1 2 3 3 2 1
c,d ,44 55 66 2355
line,http://google.com 100,200 300
ef,jh ,77 88 99
z,y 2,3 33
EDIT: reading another proposed solutions, I noted that the 1 and 2 after s/ */,/ is not only unnecessary but plainly wrong. By default, s/// just replaces the first occurrence of the pattern. So, if we have two identical s/// in sequence, they will replace the first and the second occurrence. What you need is just
$ sed 's/ */,/;s/ */,/' input
(Note that you can put two sed commands in one expression if you separate them by a semicolon. Some sed implementations do not accept the semicolon after the s/// command; use a newline to separate the commands, in this case.)
A Perl solution is:
perl -pe '$_=join ",", split /\s+/, $_, 3' some.file
Not sure about sed/perl, but here's an (ugly) awk solution. It just prints fields 1-2, separated by commas, then the remaining fields separated by space:
awk '{
printf("%s,", $1)
printf("%s,", $2)
for (i=3; i<=NF; i++)
printf("%s ", $i)
printf("\n")
}' myfile.txt