awk remove substring using regex

I have a pipe-delimited file that looks like this:
34ab1 | aaa bbb ccc fff vf | 2015-01-01
35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01
I want to remove everything that starts with bbb and ends with fff.
I used this:
BEGIN {
    FS = OFS = "|"
}
{
    sub(/[0-9].*[0-9]/, "", $2)
    sub(/bbb.*fff/, "", $2)
    print
}
The regex for the numbers worked, but the second regex didn't.
The output I want:
34ab1 | aaa vf | 2015-01-01
35ab1 | aaa vi | 2015-01-01

Use a single gsub call for both:
BEGIN {
    FS = OFS = "|"
}
{
    gsub(/[0-9].*[0-9]|bbb.*fff/, "", $2)
    print
}
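For comparison, here is a minimal Python sketch of the same idea, limited to the bbb...fff part. It is not a translation of the awk answer: the function name strip_span is made up for illustration, and the pattern also consumes the whitespace in front of bbb (an assumption on my part) so that only one space is left, as in the desired output.

```python
import re

def strip_span(line: str) -> str:
    """Remove the 'bbb ... fff' span from the second pipe-delimited field."""
    fields = line.split("|")
    # \s* also removes the blank before "bbb"; .* is greedy, so the match
    # runs up to the last "fff" in the field.
    fields[1] = re.sub(r"\s*bbb.*fff", "", fields[1])
    return "|".join(fields)

rows = [
    "34ab1 | aaa bbb ccc fff vf | 2015-01-01",
    "35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01",
]
cleaned = [strip_span(row) for row in rows]
for row in cleaned:
    print(row)
```

Because .* is greedy, a field containing several fff tokens would lose everything up to the last one; a non-greedy .*? would stop at the first.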

Related

MariaDB REGEXP_REPLACE backreferences do not work in a FUNCTION

I have this simple expression:
SELECT REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0'));
+-------------------------------------------------------------------------+
| REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0')) |
+-------------------------------------------------------------------------+
| aaa 00000000123 bbb 00000000456 ccc |
+-------------------------------------------------------------------------+
But if I put this into a FUNCTION:
CREATE FUNCTION test()
RETURNS VARCHAR(4000)
BEGIN
    RETURN REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0'));
END
it doesn't work:
SELECT test();
+-----------------------------------+
| test() |
+-----------------------------------+
| aaa 0000000001 bbb 0000000001 ccc |
+-----------------------------------+
How can I fix it? (10.1.38-MariaDB-0+deb9u1)

Use '\1' instead of '\\1'.
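One thing the first query's output shows is that LPAD runs before REGEXP_REPLACE: it pads the two-character backreference string to ten characters, so each number ends up with eight extra zeros rather than being padded to ten digits. The Python sketch below reproduces that effect and contrasts it with padding each match itself. It only illustrates the order of evaluation; Python's re module is not MariaDB, and the variable names are assumptions.

```python
import re

text = "aaa 123 bbb 456 ccc"

# Padding the template first: r"\1" is two characters long, so rjust adds
# eight zeros in front of the backreference before any matching happens.
template = r"\1".rjust(10, "0")
prefixed = re.sub(r"([0-9]+)", template, text)
print(prefixed)

# Padding each match instead: the callable receives the match object, so
# zfill sees the actual digits.
padded = re.sub(r"[0-9]+", lambda m: m.group(0).zfill(10), text)
print(padded)
```

The first result matches the output of the plain SELECT above; the second is what a true pad-to-ten-digits would look like.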

How to use a regular expression in awk or sed to find all homopolymers in a DNA sequence?

Background
Homopolymers are sub-sequences of DNA made of consecutive identical bases, like AAAAAAA. An example in Python that extracts them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
My effort
I made a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN {
    FS = ""
}
{
    homopolymer = $1;
    base = $1;
    for (i = 2; i <= NF; i++) {
        if ($i == base) {
            homopolymer = homopolymer "" base;
        } else {
            print homopolymer;
            homopolymer = $i;
            base = $i;
        }
    }
    print homopolymer;
}'
Output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Question
How can I use regular expressions in awk or sed to get the same result?

grep -o will get you that in one line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
sed is not the best tool for this, but since the OP asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This is GNU sed.
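The same back-reference pattern can be tried in Python, as a rough sketch. Note that re.findall would return only the captured single letter here, so this version uses re.finditer and takes the whole match; restricting the class to [ACGT] is my assumption about the alphabet.

```python
import re

dna = "ACCCGGGTTTAACCGGACCCAA"

# ([ACGT]) captures one base; \1* then matches zero or more repeats of it.
# group(0) is the full run, not just the captured base.
runs = [m.group(0) for m in re.finditer(r"([ACGT])\1*", dna)]
print(runs)
```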

Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{
    split($0, chars, "")
    for (i = 1; i <= length($0); i++) {
        if (chars[i] != chars[i+1]) {
            printf("%s\n", chars[i])
        } else {
            printf("%s", chars[i])
        }
    }
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
split divides the one-line string you send to awk into individual characters and stores them in the array chars[]. We then go through the entire array and compare each character with the next one, if (chars[i]!=chars[i+1]). If they are equal, we print the character without a newline and move on to the next one. If the next one is different, we print the character followed by \n, a newline, which ends the current run.
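The compare-with-the-neighbour idea has a close standard-library counterpart in Python, itertools.groupby, which groups consecutive equal items. This is a sketch of an alternative technique, not a translation of the awk script.

```python
from itertools import groupby

dna = "ACCCGGGTTTAACCGGACCCAA"

# groupby starts a new group each time the character changes, which is the
# same boundary the awk loop detects with chars[i] != chars[i+1].
runs = ["".join(group) for _, group in groupby(dna)]
for run in runs:
    print(run)
```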

Write a regular expression to parse the following string

I am using the sed command and I want to parse the following string:
Mr. XYZ Mr. ABC, PQR
Ward-2, abc vs. MG Road, Pune,
Pune Dist.,
(Appellant) (Respondent)
I want to parse the above string and separate the appellant part from the respondent part.
That is, I want the following output by using the sed command:
Mr. XYZ Ward-2, abc(Appellant) as one output and Mr. ABC, PQR MG Road, Pune, Pune Dist.,(Respondent) as another.
I used the following command but did not get the proper output:
sed -n '/assessment year/I{ :loop; n; /Respondent/Iq; p; b loop}' abc.txt

sed is always the wrong tool for any job that involves looking at multiple lines. Just use awk, it's what it was invented for. This uses GNU awk for a couple of extensions (FIELDWIDTHS and \s):
$ cat tst.awk
BEGIN { FIELDWIDTHS="30 7 99" }
{
    for (i=1;i<=NF;i++) {
        gsub(/^\s*|\s*$/,"",$i)
        if ($i != "") {
            rec[i] = (rec[i]=="" ? "" : rec[i] " ") $i
        }
    }
}
/^\(/ {
    print rec[1]
    print rec[3]
    delete rec
}
$
$ awk -f tst.awk file
Mr. XYZ Ward-2, abc (Appellant)
Mr. ABC, PQR MG Road, Pune, Pune Dist., (Respondent)
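The fixed-width idea can be sketched in Python with plain slicing. The column widths (30 and 7) are taken from the FIELDWIDTHS setting above; the original spacing of the sample is not preserved in the question, so the helper make_row that rebuilds aligned rows is purely hypothetical, as are the other names.

```python
def make_row(left: str, middle: str, right: str) -> str:
    """Hypothetical helper: lay three values out in 30/7/rest columns."""
    return left.ljust(30) + middle.ljust(7) + right

def split_columns(line: str, left_width: int = 30, gap_width: int = 7):
    """Return the left and right columns, ignoring the middle 'vs.' column."""
    left = line[:left_width].strip()
    right = line[left_width + gap_width:].strip()
    return left, right

rows = [
    make_row("Mr. XYZ", "", "Mr. ABC, PQR"),
    make_row("Ward-2, abc", "vs.", "MG Road, Pune,"),
    make_row("", "", "Pune Dist.,"),
    make_row("(Appellant)", "", "(Respondent)"),
]

appellant, respondent = [], []
for row in rows:
    left, right = split_columns(row)
    if left:
        appellant.append(left)
    if right:
        respondent.append(right)

appellant_text = " ".join(appellant)
respondent_text = " ".join(respondent)
print(appellant_text)
print(respondent_text)
```

If the real file is not aligned to these widths, the slices would need adjusting, just as FIELDWIDTHS would.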

I achieved this in the following way using Ruby:
appellant_respondent = %x(sed -n '/assessment year/I{ :loop; n; /respondent/Iq; p; b loop}' #{@file_name}).split("\n")
appellant_name_array = []
respondent_name_array = []
appellant_respondent.delete("")
appellant_respondent.each do |names|
  names_array = names.split(/\s+\s+/)
  appellant_name_array << names_array.first if names_array.first != ""
  respondent_name_array << names_array.last if names_array.last != ""
end
@item[:appellant] = appellant_name_array.join(' ').gsub(/\s+vs\.*\s+/i, ' ').strip
@item[:respondent] = respondent_name_array.join(' ').gsub(/\s+vs\.*\s+/i, ' ').strip

Replace - with ' ' with gawk

I have this data set
Format
ID date delimited-characters
Here is a sample file
FILE data.txt
004 06/23/1962 AAA-BBB-CCC-DDD
023 11/22/1963 AAA-BBB-CCC-DDD
070 06/23/1963 AAA-BBB-CCC-DDD
My gawk script works fine like this:
call gawk 'BEGIN { BLANK = " " } { print $2 BLANK $3 }' lottery.midday.txt
and I receive just the date and the data, which is what I want:
06/23/1962 AAA-BBB-CCC-DDD
11/22/1963 AAA-BBB-CCC-DDD
06/23/1963 AAA-BBB-CCC-DDD
But my problem is that I don't know how to substitute the dashes with blank spaces.
These are my attempts:
gawk 'BEGIN { BLANK = " " } { print $3 BLANK $2 } data.txt
gawk 'BEGIN { BLANK = " " } { b=$3 gsub(/-/, " ") print} {print nb BLANK $2 }' data.txt
gawk { BLANK = " " } {print nb BLANK $2; gsub(/-/, " "); print }
gawk 'BEGIN { BLANK = " " RESULT=$3} {print gsub(/-/, " ", RESULT)} { print $3 BLANK $2 }' data.txt

Try this:
awk '{gsub(/-/," ",$3);print $2,$3}' file
With your input example, the line above outputs:
06/23/1962 AAA BBB CCC DDD
11/22/1963 AAA BBB CCC DDD
06/23/1963 AAA BBB CCC DDD
P.S. I just found that we have the same username! ^_^
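For readers more at home in Python, roughly the same transformation looks like this. It is a sketch that assumes every line has exactly three whitespace-separated fields, as in the sample; the variable names are illustrative.

```python
lines = [
    "004 06/23/1962 AAA-BBB-CCC-DDD",
    "023 11/22/1963 AAA-BBB-CCC-DDD",
    "070 06/23/1963 AAA-BBB-CCC-DDD",
]

converted = []
for line in lines:
    # Fields: ID, date, dash-delimited data (the ID is dropped, as in the awk line)
    _id, date, data = line.split()
    converted.append(f"{date} {data.replace('-', ' ')}")

for row in converted:
    print(row)
```

Like gsub on $3 in the awk one-liner, the replacement touches only the third field, so a dash elsewhere on the line would be left alone.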

How to change case of back references?

I'm trying to modify a back reference in PowerShell but am having no luck :(
This is my example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper()) | `$1 |"
If I run it I get this:
| Jane Doe | 456 |
But I'm really expecting this:
| JANE DOE | 456 |
If I run the following (the same as above but without the '()' on the call to ToUpper):
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper) | `$1 |"
I get this:
| string ToUpper(), string ToUpper(System.Globalization.CultureInfo culture) | 456 |
So it would appear that PowerShell knows that the back reference '$2' is a string, but why can't I get PowerShell to convert it to upper case?
Terry

[Regex]::Replace('456,Jane Doe',
    '^(\d{3}),(.*)$',
    {
        param($m)
        '| ' + $m.Groups[2].Value.ToUpper() + ' | ' + $m.Groups[1].Value + ' |'
    }
)
Not very pretty, I admit. And sadly you cannot use script blocks as the replacement in the -replace operator in Windows PowerShell (PowerShell 6.1 and later do accept one).

Just to explain what is happening: in "| $(`"`$2`".ToUpper()) | `$1 |", PowerShell evaluates the $(...) subexpression before passing the string to the -replace operator, rather than after the replace operation has occurred.
In other words, ToUpper is called on the literal string $2, which contains no letters and so is unchanged, resulting in | $2 | $1 | being used for the replace operation. You can see this by including a letter in the subexpression string, for example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"zz `$2`".ToUpper()) | `$1 |"
This has an effective replace string of | ZZ $2 | $1 |, giving | ZZ Jane Doe | 456 | as the result.
Similarly, the second version, which omits the parentheses, "| $(`"`$2`".ToUpper) | `$1 |", is evaluated as "some string".ToUpper, which puts the list of overload definitions for the ToUpper method on System.String into the replace string.
To keep the replace operation as a one-liner, Joey's answer using the MatchEvaluator overload to Regex.Replace works well. Or you might do the string formatting yourself based on the results of a -match:
if( '456,Jane Doe' -match '^(\d{3}),(.*)$' ) {
    '| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
}
If this needs to be replaced in the context of a larger string, you can always do a literal replace to get the final result:
PS> $r = '| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
PS> 'A longer string with 456,Jane Doe in it.'.Replace( $matches[0], $r )
A longer string with | JANE DOE | 456 | in it.
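The underlying technique in the MatchEvaluator answer, computing the replacement from the match object after the match has been made, is not specific to PowerShell. A short Python sketch of the same pattern follows; the function name to_row is hypothetical.

```python
import re

def to_row(match: re.Match) -> str:
    """Build the replacement from the captured groups at match time."""
    return f"| {match.group(2).upper()} | {match.group(1)} |"

# Passing a callable instead of a template string means upper() runs on the
# captured text, not on the literal placeholder.
result = re.sub(r"^(\d{3}),(.*)$", to_row, "456,Jane Doe")
print(result)
```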

We Keep Coding
