Identify & replace 2nd instance of search term in string... VBA RegEx doesn't have lookbehind functionality

I have a list of strings in the format below:
xxxxxxxxxx xxxxxxxxxxxxx 100PS xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 250PS xxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 350PS xxxxxxxxxxxxx xxxxxxxxx xxxx
xxxxxxxxxx xxxxxxxxxxxxx 100PS xxxxxxxxxxxxx 100PS xxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 200PS xxxxxxxxxxxxxxxx 200PS xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 100PS xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 250PS xxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 350PS xxxxxxxxxxxxx xxxxxxxxx xxxx
In Excel/VBA, I am trying to remove the duplicate value from each string, i.e. 100PS and 200PS where they are printed out twice. Using VBA and RegEx I've come up with:
(?<=\d\d\dPS\s.*)(\d\d\dPS\s)
This seems to work when testing it online and in other languages, but VBA does not support lookbehind, and this is absolutely wrecking my brain.
The value always consists of \d\d\d (3 digits) followed by PS and ends with \s, but all the xxxxxx text around it can differ every time and have different lengths, etc.
How would I possibly choose the duplicate PS value with regex?
I have looked through Stack Overflow and found a couple of regex examples, but they don't seem to work in VBA.
Any help is greatly appreciated,
Thanks

Have you considered a worksheet formula?
=SUBSTITUTE(A1,MID(A1,SEARCH("???PS",A1),6),"",2)
SEARCH accepts the ? wildcard, so MID extracts the first three-digit PS token together with its trailing space, and SUBSTITUTE then removes the second occurrence of that exact text; lines with no duplicate are left unchanged.

See regex in use here
(\s(\d{3}PS)\s.*\s)\2\s
(\s(\d{3}PS)\s.*\s) Capture the following into capture group 1
\s Matches a single whitespace character
(\d{3}PS) Capture the following into capture group 2
\d{3} Matches any 3 digits
PS Match this literally
\s Matches a single whitespace character
.* Matches any character (except \n) any number of times
\s Matches a single whitespace character
\2 Matches the text that was most recently captured by capture group 2
\s Matches a single whitespace character
Replacement: $1 (puts capture group 1 back into the string)
Result:
xxxxxxxxxx xxxxxxxxxxxxx 100PS xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 250PS xxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 350PS xxxxxxxxxxxxx xxxxxxxxx xxxx
xxxxxxxxxx xxxxxxxxxxxxx 100PS xxxxxxxxxxxxx xxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 200PS xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 100PS xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 250PS xxxxxxxxxxx xxxxxxxxx xxxxxx
xxxxxxxxxx xxxxxxxxxxxxx 350PS xxxxxxxxxxxxx xxxxxxxxx xxxx
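In VBA this can be wired up roughly like the sketch below, using the VBScript.RegExp object; because the pattern relies on a backreference (\2) rather than lookbehind, it works with VBA's regex engine. The Sub name and the A1:A8 range are placeholders, so adjust them to your data and treat the code as an untested sketch.
Sub RemoveDuplicatePS()
    Dim re As Object
    Dim cell As Range
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "(\s(\d{3}PS)\s.*\s)\2\s"   ' backreference \2 stands in for the unsupported lookbehind
    re.Global = False                        ' the sample lines contain at most one duplicated nnnPS token
    For Each cell In ActiveSheet.Range("A1:A8")   ' hypothetical range; point this at your strings
        cell.Value = re.Replace(cell.Value, "$1") ' $1 puts capture group 1 back, dropping the duplicate
    Next cell
End Sub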

filter a regex result from a get-content command

I have a text file which has on each line a sentence of this form:
XXXX - hi
XXXX - hello
XXXX - whatever
WW - blabla
WW - blblbl
CCC - nice
CCC - common
CCC - itsux
CCC - regex
BBBB_BBB - flibidibalala
What I'm trying to do is create a regex with PowerShell to sort this content like this:
XXXX
WW
CCC
BBBB_BBB
I want to sort the lines of the file so that the part before the " - " appears only once.
I've tried things like this:
Get-Content coucou2.txt -Filter '(\w - )?'
Get-Content coucou2.txt -Filter '\w - ?'
Get-Content coucou2.txt -Filter '\w - {1}'
Get-Content coucou2.txt -Filter '(\w - ){1}'
Get-Content coucou2.txt | Select-String '\w - {1}'
Get-Content coucou2.txt | Select-String '(\w - ){1}'
Get-Content coucou2.txt | Select-String '(\w - )?'
Get-Content coucou2.txt | Select-String '\w - ?'
But none of them worked.
Does someone have an idea or just a clue to help me, please?
The following solution using -Split will suffice.
# sort.txt file contains the strings in your example randomized
Get-Content sort.txt
WW - blblbl
CCC - nice
CCC - itsux
CCC - regex
BBBB_BBB - flibidibalala
XXXX - whatever
WW - blabla
CCC - common
XXXX - hello
XXXX - hi
# Code to sort and output sorted strings
Get-Content sort.txt | ForEach-Object {
    ($_ -split " - ")[0]
} | Sort-Object -Desc -Unique
XXXX
WW
CCC
BBBB_BBB
The method above splits (-split) each line (one at a time), delimiting by " - ", and then grabs the first item ([0]) from the resulting split. The pipe into Sort-Object sorts in descending order (-Desc) and outputs only unique objects (-Unique) (kudos to Lieven). You could also use Group-Object here and grab the .Name property, which will output the unique strings. See about_Split and Sort-Object. Also, see Group-Object.
If you are dead set on regex, you can use the -replace operator, but this includes duplicates:
(Get-Content sort.txt) -Replace "(\w+) - .*",'$1' | Sort-Object -Desc
XXXX
XXXX
XXXX
XX
WW
WW
CCC
CCC
CCC
CCC
BBBB_BBB
BB
Using the same method as above displaying no duplicates:
(Get-Content sort.txt) -Replace "(\w+) - .*",'$1' | Sort-Object -Desc -Unique
XXXX
XX
WW
CCC
BBBB_BBB
BB
See About Comparison Operators to find more information on -Replace.
Grouping your groups might be more interesting:
> Get-Content .\coucou2.txt|Group-Object {($_ -split ' ')[0]}
Count Name Group
----- ---- -----
3 XXXX {XXXX - hi, XXXX - hello, XXXX - whatever}
2 WW {WW - blabla, WW - blblbl}
4 CCC {CCC - nice, CCC - common, CCC - itsux, CCC - regex}
1 BBBB_BBB {BBBB_BBB - flibidibalala}
> Get-Content .\coucou2.txt|Group-Object {($_ -split ' ')[0]} -NoElement
Count Name
----- ----
3 XXXX
2 WW
4 CCC
1 BBBB_BBB
> (Get-Content .\coucou2.txt|Group-Object {($_ -split ' ')[0]} -NoElement).Name
XXXX
WW
CCC
BBBB_BBB

awk to extract lines in file that contain matching pattern and variable digit

I am trying to use awk to extract the lines whose $2 contains exon (some digit from 1 to 99) sequence. The text will always be the same but the digit will be variable.
file (tab-delimited)
Tier 2 exon 10 sequence xxxxx
Tier 2 full sequence yyyyy
Tier 1 exon 5 sequence aaaaa
desired output (tab-delimited)
Tier 2 exon 10 sequence xxxxx
Tier 1 exon 5 sequence aaaaa
awk
awk '$2 ~ /^exon [0-9][0-9] sequence$/' file
using awk
awk '/exon\s+[0-9]+\s+sequence/ {print $0}' file
or grep
grep -P 'exon\s+[0-9]+\s+sequence' file
awk -F'\t' '$2 ~ /exon [1-9][0-9]? sequence/' file
Note that the regexp for 1-99 is [1-9][0-9]?, not [0-9][0-9]?, as the latter would include 0 (as well as 00, 01, etc.).
awk '$3 ~ /exon/' file
Tier 2 exon 10 sequence xxxxx
Tier 1 exon 5 sequence aaaaa
Given:
awk 'BEGIN{FS="\t"; OFS="|"} $1=$1' file
Tier 2|exon 10 sequence|xxxxx
Tier 2|full sequence|yyyyy
Tier 1|exon 5 sequence|aaaaa
(i.e., the tabs are where the | are above)
You can do:
$ awk -F"\t" '$2~/exon[ ]+[0-9][0-9]?/' /tmp/file
Tier 2 exon 10 sequence xxxxx
Tier 1 exon 5 sequence aaaaa

Using regex with sed search and replace

I want to remove some dynamic text from the log file. I am able to extract it using regex and grep -oP; however, the same regex is not working with the sed command.
Sample data (for reading convenience, the concerned data is between ABCDEF and LMNOP only):
XXX 2 13:53:35 XXXX0-0-0 XXXXXXXX[3513]: ABCDEF[XXXX]: 1472846015.555671: LMNOP(79): XXXXXXXXXXXXX - XXXXXX XX XXX XXX XXXXX XX XXXXX XXXX XXX XXXX XXX
Following is the data I want to remove from the log file. I am able to extract it using regex + grep:
grep -Po ']: [0-9]{10}\.[0-9]{6}:' sample
]: 1472846015.555671:
Now, if I use the same regex with the sed command it's not helping. Any suggestions?
I used the following command with sed and it returned the file unchanged.
sed "s/]: [0-9]{10}\.[0-9]{6}://" input
or
awk '{gsub(/]: [0-9]{10}\.[0-9]{6}:/,"")}1' input
I need the following output:
XXX 2 13:53:35 XXXX0-0-0 XXXXXXXX[3513]: ABCDEF[XXXX LMNOP(79): XXXXXXXXXXXXX - XXXXXX XX XXX XXX XXXXX XX XXXXX XXXX XXX XXXX XXX
OR even better:
XXX 2 13:53:35 XXXX0-0-0 XXXXXXXX[3513]: ABCDEF[XXXX]::LMNOP(79): XXXXXXXXXXXXX - XXXXXX XX XXX XXX XXXXX XX XXXXX XXXX XXX XXXX XXX
With sed, use:
sed "s/]: [0-9]\{10\}\.[0-9]\{6\}: /]::/" input
#1 of the "s/#1/#2/" instruction searches for the pattern, but you need to escape the curly braces (\{ and \}). The match is then replaced with #2, which adds the ]: back, since it was consumed by the search pattern. If you need :: instead, add it to the replacement pattern, as above.
But maybe you don't need to search for and replace ]: at all; just replace the digits and the dot with : using this command (it works for your example):
sed "s/ [0-9]\{10\}\.[0-9]\{6\}: /:/" input
You can choose to use sed with extended regex. But note that the -r option is a GNU extension and so may not be portable. Here is the same sed as suggested by @Konstantin Morenko, but without the backslashes for the { and }. The extended regex option is -r or --regexp-extended:
sed -r "s/ [0-9]{10}\.[0-9]{6}: /:/" input

grep: how to find ALL the lines between two expressions

We have a HUGE file (of numbers), and we want to get ALL the lines between two expressions, e.g.,
232445 -9998.01 xxxxxxxxxx
234566 -9998.02 xxxxxxxxx
.
.
324444 -8000.012 xxxxxxx
344444 -8000.0 xxxx
and the expressions are -9998.01 and -8000.0, so we tried:
$ grep -A100000 '[0-9] -9998.[0-9]' mf.in | grep -B100000 '[0-9] -8000.[0-9]' mf.in > mfile.out
And this is OK ... ALL the lines in between are retrieved ... of course, 100000 is big enough to keep ALL the lines in between ... but what if we are wrong? i.e., what if there are more than 100000 lines in between? How can we take ALL the lines in between without a numeric specification after -A and -B?
PS: I was unable to use sed with similar "[ ...]" expressions
PS2: the columns have more digits (only 4 columns are shown here)
-1931076.0 -9998.96235 1.0002741998076021 0.0191476198569163
-1931075.0 -9998.95962 1.0000742544770280 0.0192495084654059
-1931074.0 -9998.95688 0.9998778097258081 0.0193725608470694
With awk:
awk '$2 ~ /^-9998.01$/{p=1} p{print} $2 ~ /^-8000.0$/{p=0}' file
Test:
$ cat file
232445 -9998.00 xxxxxxxxxx
232445 -9998.01 xxxxxxxxxx
234566 -9998.02 xxxxxxxxx
234566 -9998.03 xxxxxxxxx
234566 -9998.05 xxxxxxxxx
....
....
324444 -8000.011 xxxxxxx
324444 -8000.012 xxxxxxx
344444 -8000.0 xxxx
344444 -8000.1 xxxx
$ awk '$2 ~ /^-9998.01$/{p=1} p{print} $2 ~ /^-8000.0$/{p=0}' file
232445 -9998.01 xxxxxxxxxx
234566 -9998.02 xxxxxxxxx
234566 -9998.03 xxxxxxxxx
234566 -9998.05 xxxxxxxxx
....
....
324444 -8000.011 xxxxxxx
324444 -8000.012 xxxxxxx
344444 -8000.0 xxxx
sed already has this functionality built in, using this expression:
/regex1/,/regex2/ p => the p command prints all lines between the two matching lines (the start line matching regex1 and the end line matching regex2, both inclusive in the output).
Here is an example with your file format:
$ cat file
124235 -69768.77 xxx
232445 -9998.01 xxxxxxxxxx
234566 -9998.02 xxxxxxxxx
12345 -124.66 xxxx
324444 -8000.012 xxxxxxx
344444 -8000.0 xxxx
344444 -7000.0 xxxx
$ sed -nr '/^[0-9]+\s-9998.[0-9]+\s/,/^[0-9]+\s-8000.[0-9]+\s/ p' file
232445 -9998.01 xxxxxxxxxx
234566 -9998.02 xxxxxxxxx
12345 -124.66 xxxx
324444 -8000.012 xxxxxxx
344444 -8000.0 xxxx
$
Well, it might not be the best answer, but the easy fix for your command would be to use the file's number of lines as the argument to -A and -B, so you're sure you cannot miss any lines:
NB_LINES=$(wc -l mf.in | awk '{print $1}')
grep -A$NB_LINES '[0-9] -9998.[0-9]' mf.in | grep -B$NB_LINES '[0-9] -8000.[0-9]' > mfile.out
Though, tbh, in pure shell it's very likely I'd do something similar. Or I'd write a small Python script that would look like:
import re
LINE_RE = re.compile(r'[^ ]+ (-[0-9]+\.[0-9]+) .*')
with open('mf.in', 'r') as fin:
    with open('mf.out', 'w') as fout:
        for line in fin:
            match = LINE_RE.match(line)
            if match:
                if float(match.groups()[0]) > -9998.0:
                    fout.write(line)
                elif float(match.groups()[0]) < -8000.0:
                    break
N.B.: this script is just to expose the algorithmic idea, and being blindly coded and untested, it might need some tweaking to actually work.
HTH

Sed to extract text between two strings

Please help me in using sed.
I have a file like below.
START=A
xxxxx
xxxxx
END
START=A
xxxxx
xxxxx
END
START=A
xxxxx
xxxxx
END
START=B
xxxxx
xxxxx
END
START=A
xxxxx
xxxxx
END
START=C
xxxxx
xxxxx
END
START=A
xxxxx
xxxxx
END
START=D
xxxxx
xxxxx
END
I want to get the text between START=A and END.
I used the query below.
sed '/^START=A/, / ^END/!d' input_file
The problem here is, I am getting
START=A
xxxxx
xxxxx
END
START=D
xxxxx
xxxxx
END
instead of
START=A
xxxxx
xxxxx
END
Sed finds greedily.
Please help me in resolving this.
Thanks in advance.
Can I use AWK to achieve the above?
sed -n '/^START=A$/,/^END$/p' data
The -n option means don't print by default; then the script says 'do print between the line containing START=A and the next END'.
You can also do it with awk:
A pattern may consist of two patterns separated by a comma; in this case, the action is performed for
all lines from an occurrence of the first pattern through an occurrence of the second.
(from man awk on Mac OS X).
awk '/^START=A$/,/^END$/ { print }' data
Given a modified form of the data file in the question:
START=A
xxx01
xxx02
END
START=A
xxx03
xxx04
END
START=A
xxx05
xxx06
END
START=B
xxx07
xxx08
END
START=A
xxx09
xxx10
END
START=C
xxx11
xxx12
END
START=A
xxx13
xxx14
END
START=D
xxx15
xxx16
END
The output using GNU sed or Mac OS X (BSD) sed, and using GNU awk or BSD awk, is the same:
START=A
xxx01
xxx02
END
START=A
xxx03
xxx04
END
START=A
xxx05
xxx06
END
START=A
xxx09
xxx10
END
START=A
xxx13
xxx14
END
Note how I modified the data file so it is easier to see where the various blocks of data printed came from in the file.
If you have a different output requirement (such as 'only the first block between START=A and END', or 'only the last ...'), then you need to articulate that more clearly in the question.
Basic version ...
sed -n '/START=A/,/END/p' yourfile
More robust version...
sed -n '/^ *START=A *$/,/^ *END *$/p' yourfile
Your sed expression has a space before END, i.e. / ^END/. So sed matches the starting pattern but never matches the ending pattern, and keeps printing until the end of the file. Use sed '/^START=A/, /^END/!d' input_file (notice /^END/).