Extract string pattern in new variable based on other string variable

Consider the following variable:
clear
input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end
I can use the regexm() function to flag the observations that contain abc, cde or def:
generate new = regexm(string, "abc|cde|def")
list
| string        new |
|-------------------|
| abc bcd cde     1 |
| def efg fgh     1 |
| ghi hij ijk     0 |
How can I get the following?
| string        wanted  |
|-----------------------|
| abc bcd cde   abc cde |
| def efg fgh   def     |
| ghi hij ijk           |
This question is an extension of the one answered here:
Create new string variable with partial matching of another

I read this as your
1. Having a list of allowed words.
2. Wanting the words in a string that occur among the allowed words.
It is fashionable to seek a fancy regular expression solution for such problems, but your example at least yields to a plain loop over the words that exist. Be aware, however, that inlist() has advertised limits.
clear
input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end
generate wanted = ""
generate wc = wordcount(string)
summarize wc, meanonly
quietly forvalues j = 1/`r(max)' {
    replace wanted = wanted + " " + word(string, `j') if inlist(word(string, `j'), "abc", "cde", "def")
}
replace wanted = trim(wanted)
list
     +----------------------------+
     |      string    wanted   wc |
     |----------------------------|
  1. | abc bcd cde   abc cde    3 |
  2. | def efg fgh       def    3 |
  3. | ghi hij ijk              3 |
     +----------------------------+
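For readers who find it easier to follow in Python, the same word-by-word filter can be sketched as below. This is only an illustration of the logic, not part of the Stata solution; the names `ALLOWED` and `keep_allowed` are hypothetical and chosen for this example.

```python
# Hypothetical names: ALLOWED and keep_allowed are illustrative only.
ALLOWED = {"abc", "cde", "def"}


def keep_allowed(text, allowed=ALLOWED):
    """Return the whitespace-separated words of text that appear in allowed."""
    return " ".join(word for word in text.split() if word in allowed)


rows = ["abc bcd cde", "def efg fgh", "ghi hij ijk"]
wanted = [keep_allowed(row) for row in rows]
print(wanted)  # ['abc cde', 'def', '']
```

As in the loop above, a word is kept only when it equals one of the allowed words exactly.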

This is the solution using a regular expression:
clear
input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end
generate wanted = ustrregexra(string, "(\b((?!(abc|cde|def))\w)+\b)", " ")
replace wanted = strtrim(stritrim(wanted))
list
     +-----------------------+
     |      string    wanted |
     |-----------------------|
  1. | abc bcd cde   abc cde |
  2. | def efg fgh       def |
  3. | ghi hij ijk           |
     +-----------------------+
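The pattern above blanks out every word in which none of the listed strings starts at any position, then squeezes the leftover spaces. A minimal sketch of the same idea with Python's `re` module follows; it assumes Python's regex dialect behaves like the one Stata uses for this particular pattern, and the names `NOT_WANTED` and `keep_matching` are hypothetical.

```python
import re

# Match a whole word only if abc, cde or def does not start at any of its
# characters (a negative lookahead checked before every word character).
NOT_WANTED = re.compile(r"\b(?:(?!abc|cde|def)\w)+\b")


def keep_matching(text):
    """Blank out the unwanted words, then collapse and trim the whitespace."""
    blanked = NOT_WANTED.sub(" ", text)
    return " ".join(blanked.split())


for row in ["abc bcd cde", "def efg fgh", "ghi hij ijk"]:
    print(repr(keep_matching(row)))
```

In this sketch a word such as xabc would survive, because it contains abc; the plain loop over words keeps exact matches only.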

Related

How to select lines between two marker patterns which may occur multiple times with awk/sed and delete those lines

Suppose a file contains:
abc
def1
ghi1
mno
abc
def2
ghi2
jkl2
abc
pqr
stu
mno
The starting pattern is abc and the ending pattern is mno, so I need the output to be:
def2
ghi2
jkl2
That is, anything between abc and mno should be removed, and I want anything after abc, including abc, to be printed if it is not ended by mno.
I tried sed '/^abc$/,/^mno$/{//!b};d' file, but it deletes all lines except those between the abc and mno lines.

If you split the input on abc\n you might get away with removing those that contain mno, e.g. with GNU awk:
awk '!/mno/' RS='abc\n' infile
def2
ghi2
jkl2

This might work for you (GNU sed):
sed -n '/abc/{:a;x;z;x;:b;n;/abc/{:c;g;s/.//p;ba};/mno/d;H;$bc;bb};p' file
Any lines before or after a cycle of abc through mno will be printed as normal. Any lines between a cycle of abc and a following abc will be printed, less the first and last lines of the cycle. All other lines will be deleted, i.e. a cycle between abc and mno with no abc lines between. For the special case of a cycle between abc and the end of the file with no abc or mno lines between, all lines except the first will be printed.

Here is a simple way to do it that works with any awk:
awk 'BEGIN{begpat="^abc$"; endpat="^mno$" }
($0 ~ endpat) { n=0; buffer=""; next }
($0 ~ begpat) { n=1; if(buffer) print buffer; buffer=""; next }
(n==0) { print; next }
{buffer = (buffer ? buffer ORS : "") $0 }
END { if(buffer) print buffer }' file
The following logic table explains how this works:
from\to | begpat | endpat | END      BEGIN: BEGIN OF FILE
--------+--------+--------+-----     END  : END OF FILE
BEGIN   |   1    |   1    |  1
--------+--------+--------+-----     1: print lines between [from,to]
begpat  |   1    |   0    |  1       0: don't print lines between [from,to]
--------+--------+--------+-----
endpat  |   1    |   1    |  1
The above shows that there is only an exception when begpat is met. Only in this case we build a buffer to decide if we have to print.
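The buffering logic of the awk script can also be sketched in Python, which may make the state changes easier to trace. This is an illustrative translation, not a replacement for the awk answer; the function name `filter_blocks` and the variable `inside` (playing the role of `n`) are hypothetical.

```python
import re


def filter_blocks(lines, begpat=r"^abc$", endpat=r"^mno$"):
    """Drop abc..mno blocks; keep lines after an abc that no mno closes."""
    out = []
    buffer = []      # lines seen since the last begpat, fate still undecided
    inside = False   # True once a begpat has been seen and not yet closed
    for line in lines:
        if re.search(endpat, line):
            inside = False
            buffer = []          # block was closed: discard it
        elif re.search(begpat, line):
            inside = True
            out.extend(buffer)   # previous block was never closed: keep it
            buffer = []
        elif not inside:
            out.append(line)     # outside any block: print as normal
        else:
            buffer.append(line)  # inside a block: wait and see
    out.extend(buffer)           # end of input: an unclosed block is kept
    return out


sample = "abc def1 ghi1 mno abc def2 ghi2 jkl2 abc pqr stu mno".split()
print(filter_blocks(sample))  # ['def2', 'ghi2', 'jkl2']
```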

MariaDB REGEXP_REPLACE backreferences not works in FUNCTION

I have this simple expression:
SELECT REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0'));
+-------------------------------------------------------------------------+
| REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0')) |
+-------------------------------------------------------------------------+
| aaa 00000000123 bbb 00000000456 ccc |
+-------------------------------------------------------------------------+
But if I put this into a FUNCTION:
CREATE FUNCTION test()
RETURNS VARCHAR(4000)
BEGIN
RETURN REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0'));
END
It doesn't work:
SELECT test();
+-----------------------------------+
| test() |
+-----------------------------------+
| aaa 0000000001 bbb 0000000001 ccc |
+-----------------------------------+
How can I fix it? (10.1.38-MariaDB-0+deb9u1)

Use '\1' instead of '\\1'....
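As a side note, the output in the question shows that the padding is applied to the replacement template rather than to the matched digits: each number comes out as 8 zeros followed by the digits. The Python sketch below reproduces that effect with the `re` module. It is an analogy only, says nothing about how MariaDB parses escapes inside a stored function, and uses variable names that are purely illustrative.

```python
import re

text = "aaa 123 bbb 456 ccc"

# The padding runs on the two-character template r"\1" before any matching
# happens, so 8 zeros end up in front of the placeholder.
template = r"\1".rjust(10, "0")
padded_template = re.sub(r"([0-9]+)", template, text)

# Padding the captured digits themselves needs a callable replacement.
padded_match = re.sub(r"([0-9]+)", lambda m: m.group(1).zfill(10), text)

print(padded_template)  # aaa 00000000123 bbb 00000000456 ccc
print(padded_match)     # aaa 0000000123 bbb 0000000456 ccc
```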

Removing symbols and making a tab delimited file while keeping all the words after a certain string in one column

I have a file full of such lines:
>Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
>Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
>Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
What I want to get is something like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8]|midbrain (mesencephalon)[3/8]|other[7/8]
Such that all the words after "positive" are in one column of their own separated by a pipe, and all the columns are separated by tab.
This is what I did:
sed -E 's/ *[>\|:-] */\t/g' mouse_genome_vista1.txt > mouse_genome_vista2.txt
sed "s/^[ \t]*//" -i mouse_genome_vista2.txt
My output was like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8] midbrain (mesencephalon)[3/8] other[7/8]
Mouse chr16 90449561 90451327 element 1672 positive forebrain[4/8] heart[6/8]
Mouse chr3 137446183 137449401 element 4 positive heart[3/4]
This works when there is just one word after "positive": it sits alone in its column. When there is more than one, however, I get multiple columns. For instance, hindbrain, midbrain and other each end up in their own tab-delimited column, whereas I want them pipe-separated in a single column.

You may try this with perl or awk:
[|:-](?=.*positive)|positive\s+\K\|
Sample Perl solution (note that it operates on a string, not on a file):
use strict;
my $str = 'Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
';
my $regex = qr/[|:-](?=.*positive)|positive\s+\K\|/xmp;
my $subst = "\t";  # a real tab; '\\t' would insert a literal backslash followed by t
my $result = $str =~ s/$regex/$subst/rg;
print $result;
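An alternative that avoids a single large regex is to split each line on the pipes first and only then decide how to rejoin the pieces. The Python sketch below does that; it assumes every line contains a field that is exactly "positive", and the function name `to_tsv` is hypothetical.

```python
import re


def to_tsv(line):
    """Tab-separate the leading fields; pipe-join everything after 'positive'."""
    fields = [f.strip() for f in line.strip().lstrip(">").split("|")]
    idx = fields.index("positive")  # assumption: this field is always present
    head, tail = fields[: idx + 1], fields[idx + 1:]
    cols = []
    for field in head:
        # chr9:95713136-95716028 becomes three columns; other fields pass through
        cols.extend(re.split(r"[:-]", field))
    cols.append("|".join(tail))
    return "\t".join(cols)


line = (">Mouse|chr9:95713136-95716028 | element 1367 | positive | "
        "hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]")
print(to_tsv(line))
```

To process a whole file, the function can be applied to each line read from it; a line without a "positive" field raises ValueError in this sketch.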

awk remove substring using regex

I have a pipe-delimited file that looks like this:
34ab1 | aaa bbb ccc fff vf | 2015-01-01
35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01
I want to remove everything that starts with bbb and ends with fff.
I used this:
BEGIN {
    FS = OFS = "|"
}
{
    sub(/[0-9].*[0-9]/, "", $2); sub(/bbb.*fff/, "", $2);
    print
}
The regex for the numbers worked, but the second regex didn't.
The output I want:
34ab1 | aaa vf | 2015-01-01
35ab1 | aaa vi | 2015-01-01

Use a single gsub function for both.
BEGIN {
    FS = OFS = "|"
}
{
    gsub(/[0-9].*[0-9]|bbb.*fff/, "", $2);
    print
}
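The same single-pass alternation can be sketched in Python for comparison. This is an illustration under the assumption that the file is split on the pipe character exactly as awk does with FS = "|"; the function name `clean_second_field` is hypothetical.

```python
import re


def clean_second_field(line):
    """Remove digit runs and bbb...fff spans from the second pipe-separated field."""
    fields = line.split("|")
    # Like gsub, re.sub replaces every match of either alternative.
    fields[1] = re.sub(r"[0-9].*[0-9]|bbb.*fff", "", fields[1])
    return "|".join(fields)


print(clean_second_field("34ab1 | aaa bbb ccc fff vf | 2015-01-01"))
print(clean_second_field("35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01"))
```

In this sketch the removed span leaves two adjacent spaces behind (for example between aaa and vf), so a further whitespace squeeze is needed if single spacing matters.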

How to change case of back references?

I'm trying to modify a back reference in PowerShell but am having no luck :(
This is my example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper()) | `$1 |"
If I run it I get this:
| Jane Doe | 456 |
But I'm really expecting this:
| JANE DOE | 456 |
If I run the following (the same as above but without the '()' on the call to ToUpper):
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper) | `$1 |"
I get this:
| string ToUpper(), string ToUpper(System.Globalization.CultureInfo culture) | 456 |
So it would appear that PowerShell knows that the back reference '$2' is a string but why can't I get PowerShell to convert it to upper case?
Terry

[Regex]::Replace('456,Jane Doe',
'^(\d{3}),(.*)$',
{
param($m)
'| ' + $m.Groups[2].Value.ToUpper() + ' | ' + $m.Groups[1].Value + ' |'
}
)
Not very pretty, I admit. And you sadly cannot use script blocks as the replacement in the -replace operator in Windows PowerShell (PowerShell 6.1 and later do accept one).

Just to explain what is happening: in "| $(`"`$2`".ToUpper()) | `$1 |", PowerShell evaluates the subexpression $(`"`$2`".ToUpper()) before passing the string to the -replace operator, rather than after the replace operation has occurred.
In other words, ToUpper is called on the string value $2, resulting in | $2 | $1 | being used for the replace operation. You can see this by including a letter in the subexpression string, for example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"zz `$2`".ToUpper()) | `$1 |"
This has an effective replace string of | ZZ $2 | $1 |, giving | ZZ Jane Doe | 456 | as the result.
Similarly, the second version, which omits the parentheses, "| $(`"`$2`".ToUpper) | `$1 |", is evaluated as "some string".ToUpper, which puts the array of overload definitions for the ToUpper method on System.String in the replace string.
To keep the replace operation as a one-liner, Joey's answer using the MatchEvaluator overload to Regex.Replace works well. Or you might do the string formatting yourself based on the results of a -match:
if( '456,Jane Doe' -match '^(\d{3}),(.*)$' ) {
'| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
}
If this needs to be replaced in the context of a larger string, you can always do a literal replace to get the final result:
PS> $r = '| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
PS> 'A longer string with 456,Jane Doe in it.'.Replace( $matches[0], $r )
A longer string with | JANE DOE | 456 | in it.
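The same distinction between a replacement string that is evaluated up front and a replacement function that runs per match exists in other regex APIs. The Python sketch below shows both with `re.sub`; it is an analogy to the PowerShell behaviour described above, not PowerShell code, and the variable names are illustrative.

```python
import re

line = "456,Jane Doe"
pattern = r"^(\d{3}),(.*)$"

# .upper() runs on the template text itself, before any matching happens,
# so the backreferences are inserted unchanged.
eager = re.sub(pattern, r"| \2 | \1 |".upper(), line)

# A callable receives each match object, so the captured text can be modified.
lazy = re.sub(pattern, lambda m: f"| {m.group(2).upper()} | {m.group(1)} |", line)

print(eager)  # | Jane Doe | 456 |
print(lazy)   # | JANE DOE | 456 |
```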
