Regex, get what's after the second occurrence of a string

I have a string of the following format:
TEXT####TEXT####SPECIALTEXT
I need to get the SPECIALTEXT, basically what is after the second occurrence of the ####. I can't get it done. Thanks

The regex (?:.*?####){2}(.*) contains what you're looking for in its first group.

If you are using shell and can use awk for it:
From a file:
awk 'BEGIN{FS="####"} {print $3}' input_file
From a variable:
awk 'BEGIN{FS="####"} {print $3}' <<< "$input_variable"
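If awk is not an option, plain POSIX parameter expansion can do the same job. This is only a sketch: after_second is a hypothetical helper name, and it does not check that the delimiter really occurs twice. Unlike print $3, it keeps any further #### in the remainder, which matches what the (.*) group of the regex above captures.

```shell
# Hypothetical helper: print everything after the second occurrence of a delimiter.
# Usage: after_second STRING DELIMITER
# Note: rest and delim are plain (global) shell variables to stay POSIX-compatible.
after_second() {
    rest=$1
    delim=$2
    # "#" removes the shortest matching prefix, i.e. up to the first delimiter.
    rest=${rest#*"$delim"}
    # Repeat to drop everything up to the second delimiter.
    rest=${rest#*"$delim"}
    printf '%s\n' "$rest"
}

after_second 'TEXT####TEXT####SPECIALTEXT' '####'
```

For the sample string this should print SPECIALTEXT.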

Related

AWK: how to match a comma

I want to return lines from awk with a pattern "C," or ".,C" or ".,C,.*".
For example:
Valid
C,G
G,C
G,C,A
Invalid
G,CC
My code is below:
echo G,CC | awk '$0 ~ /^C,+.*|.*,C,*.*/ {print $0}'
output:
G,CC
I expected it to return nothing, but it returns "G,CC".
How do I solve this problem?
Edit:
Based on the answers from @Emma and @perreal, I used a shorter command line to solve my problem:
awk '$0 ~ /^C,.*|.*,C,.*|.*,C$/ {print $0}'
Until now, it works well. Thanks for your help!!

Could you please try the following.
awk '!/CC/ && /^C,+.*|.*,C,*.*/' Input_file

The + is not necessary in ^C,+.*, since you already match the comma and also match whatever comes after.
The * right after the second comma is not correct in .*,C,*.*. It makes the comma optional so it can also match G,CC (.*, matches G, and C,* matches CC).
This should work:
awk '$0 ~ /^[GCA](,[GCA])*$/ && /C/ {print $0}'

My guess is that maybe this would also work:
awk '$0 ~ /^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/ {print $0}'
Advice
Mr. Rankin advises that this is equivalent to awk '/^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/'.
A bare regex pattern is matched against the whole record, and printing
the record is the default action.

$ awk '/(^|,)C(,|$)/' file
C,G
G,C
G,C,A
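The pattern above requires C to be preceded by the start of the line or a comma, and followed by a comma or the end of the line. A quick way to check it against the valid and invalid samples from the question, as a sketch (match_c_field is a hypothetical helper name, and CC,G is an extra sample added here for illustration):

```shell
# Hypothetical helper: keep only lines where C is a whole comma-separated field.
match_c_field() {
    # (^|,) anchors C to the start of the line or a preceding comma;
    # (,|$) requires a comma or the end of the line right after it.
    awk '/(^|,)C(,|$)/'
}

# G,CC and CC,G should be filtered out.
printf '%s\n' 'C,G' 'G,C' 'G,C,A' 'G,CC' 'CC,G' | match_c_field
```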

More alternatives
In other words, you want to select lines with "C" as word? If yes, here are 2 solutions:
grep -w C
grep -E '\<C\>'
The first one tells grep to match only whole words. The second one uses the begin-word and end-word patterns. These patterns can be used with GNU awk too (\< and \> are GNU extensions):
awk '/\<C\>/ {print}'
A completely different solution (and different from the other answers too) is to add commas at both ends before matching against ,C,:
awk '"," $0 "," ~ /,C,/ {print}'

sed regex cut string after match

I tested a regex on http://regexr.com/ and it works as expected.
How can I run this by using sed?
/^.*?OU=([^,]*)/g
The test string looks like:
mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test,OU=Tese Sites,DC=Test,DC=local;test.local
And the output is:
mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test
So it should cut the string before the second OU= starts.
Thanks

sed is not the best tool for this case when you have to deal with text that contains "columns" and can be split. Here are two possibilities, one with sed and the other with awk:
s="mario.test;Mario Test;Mario;Test;123;+001122334455,CN=Mario Test,OU=AT-Linz,OU=Tese Sites,DC=Test,DC=local;test.local"
echo "$s" | sed 's/OU=/й/' | sed 's/\([^й]*\)й\([^,]*\).*/\1OU=\2/'
echo "$s" | awk -F",OU=" '{print $1 ",OU=" $2}'
See the online demo
The awk solution splits with ,OU= substring and then joins the first and second column with the separator (since it is hardcoded, it is easy to put it back).
The sed solution uses two passes: 1) mark the first OU= by replacing it with a character that does not occur in the input (ideally a control character; a Cyrillic letter is used here for better "visibility"), 2) match the whole line, capture the parts to keep in groups, and rebuild the result from the backreferences.
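A single sed pass may also be enough here, since sed always picks the leftmost match and so the OU= in the pattern is the first one on the line. A minimal sketch, assuming the value after OU= never contains a comma (cut_after_first_ou is a hypothetical helper name):

```shell
# Hypothetical helper: truncate each line right after the first OU=... value.
cut_after_first_ou() {
    # sed picks the leftmost match, so OU= here is the first occurrence;
    # [^,]* captures its value and .* discards the rest of the line.
    sed 's/\(OU=[^,]*\).*/\1/'
}

printf '%s\n' 'mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test,OU=Tese Sites,DC=Test,DC=local;test.local' | cut_after_first_ou
```

Lines without any OU= are passed through unchanged.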

Your question isn't clear but from reading your comments, are either of these what you're looking for?
$ awk -F, '{print $1 FS $2}' file
mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test
$ awk -F'CN=[^,]+,OU=|,' '{print $1 $2}' file
mario.test;Mario Test;Mario;Test;123;+001122334455;AT-Test

Finding and replacing the last space at or before nth character works with sed but not awk, what am I doing wrong?

I have a string in a test.csv file like this:
here is my string
when I use sed it works just as I expect:
cat test.csv | sed -r 's/^(.{1,9}) /\1,/g'
here is,my string
Then when I use awk it doesn't work and I'm not sure why:
cat test.csv | awk '{gsub(/^(.{1,9}) /,","); print}'
,my string
I need to use awk because, once I get this figured out, I will select only one column and split it into two columns with the added comma. I'm using extended regex with sed ("-r") and was wondering whether awk supports it, but I don't know if that really is the problem or not.

awk does not support back references in gsub, so the whole match (including the text captured by the group) is replaced with ",". If you are on GNU awk, then gensub can be used to do what you need.
echo "here is my string" | awk '{print gensub(/^(.{1,9}) /,"\\1,","G")}'
here is,my string
Note the use of double \ inside the quoted replacement part. gensub is documented in the GNU awk manual.
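If GNU awk is not available, the same result can probably be had without a regex by scanning backwards for a space with substr. This is a sketch rather than a drop-in replacement: split_at_space is a hypothetical helper name, and its argument is the last column in which the space may appear (10 corresponds to the {1,9} in the regex, i.e. up to 9 characters followed by a space).

```shell
# Hypothetical helper: replace the last space at or before column N with a comma.
# Usage: split_at_space N   (reads lines on stdin)
split_at_space() {
    awk -v n="$1" '{
        # Scan backwards from column n; stop before column 1 so that
        # at least one character precedes the space, as in ^(.{1,9}) .
        for (i = n; i > 1; i--) {
            if (substr($0, i, 1) == " ") {
                $0 = substr($0, 1, i - 1) "," substr($0, i + 1)
                break
            }
        }
        print
    }'
}

printf '%s\n' 'here is my string' | split_at_space 10
```

Lines with no space in that range are printed unchanged, as with the sed command.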

Print matched pattern with AWK

For example, I have this data:
/home/test/dat1.txt
/home/test/dat2.txt
/home/test/test1/dat3.txt
/home/test/test2/dat4.txt
/home/test/test3/test4/dat5.txt
I need to print only the name and extension, that output should be:
dat1.txt
dat2.txt
dat3.txt
dat4.txt
dat5.txt
I need to use the awk command. Can anyone help?
I use this regular expression: '/\/*\.txt/{print ???}

If you are going to use awk, you do not need a regex for this purpose.
You can just tell awk to print the last field, using a field separator of /.
awk -F'/' '{print $NF}' Input.txt
As hd1's comment already noted, NF is the number of fields on the current input record (in this case line). Since awk starts indexing fields at $1, $NF gives you the last field.
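To see how the fields are counted, it can help to print NF next to $NF. In this sketch (last_field_info is a hypothetical helper name) the leading / produces an empty first field, so a path with three components has four fields:

```shell
# Hypothetical helper: print the field count and the last field of each line.
last_field_info() {
    # With -F'/', the text before the leading slash is an empty field $1.
    awk -F'/' '{print NF, $NF}'
}

printf '%s\n' '/home/test/dat1.txt' | last_field_info
```

For this input the fields are "", home, test and dat1.txt, so the line printed should be "4 dat1.txt".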

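You could use this short awk command:
awk -F/ '$0=$NF' Input.txt
Here the assignment itself is the pattern, so a line is skipped when its last field is empty (or evaluates to 0). If you need those lines printed too, use: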
awk -F/ '{$0=$NF}1' Input.txt
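The difference between the two forms shows up on a line that ends with /. A small sketch to compare them (the function names and the sample paths are made up for illustration):

```shell
# Hypothetical helpers wrapping the two variants.
last_field_short() {
    # The assignment is the pattern: an empty last field is false, so no print.
    awk -F/ '$0=$NF'
}

last_field_always() {
    # The assignment is an action; the trailing 1 is an always-true pattern.
    awk -F/ '{$0=$NF}1'
}

# The second input line ends with a slash, so its last field is empty.
printf '%s\n' '/a/b.txt' '/a/dir/' | last_field_short
printf '%s\n' '/a/b.txt' '/a/dir/' | last_field_always
```

The first command should print only b.txt, the second b.txt followed by an empty line.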

Extract substring using regex shell

I have a string that contains multiple occurrences of the following form:
element 1 tag1{field1:"text",field2:"text"...},tag2{field1:"text",field2:"text"...},..
element 2 tag1{field1:"text",field2:"text"...},tag2{field1:"text",field2:"text"...},..
Using the shell, I want to extract the field1 value of tag1 for every element.
My attempt:
sed -n "s/.*\"tag1\":{\"fiel1\":\"\(.*\),\"fiel2\".*/\1/gp"
I only get the final one, not all of them.
EDIT: The problem is that the whole text is in one single string and the regex only gets me one occurrence.
Thanks

You can try this,
sed 's/\(.*tag1{field1:"\)\([^"]*\)\(".*\)/\2/g' yourfile

perl -pe 's/tag1\{field1:\"([^\"]*)".*/$1/g' your_file
Or
awk -F":|," '{print $2}'

sed -n 's/.*[[:space:]]\{1,\}tag1{field1:"\([^"]*\)".*/\1/gp' YourFile
based on text sample
element 1 tag1{field1:"text",field2:"text"...},tag2{field1:"text",field2:"text"...},..
element 2 tag1{field1:"text",field2:"text"...},tag2{field1:"text",field2:"text"...},..

Using awk
awk -F\" '{print $2}'
or, to make sure it is only extracted from lines that contain field1:
awk -F\" '/field1/ {print $2}'
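When the whole text sits on one line, as the EDIT in the question says, a loop around awk's match() can pull out every occurrence instead of just one per line. This is a sketch under the assumption that the data looks exactly like the sample (tag1{field1:"..."} with no escaped quotes inside the value); extract_tag1_field1 is a hypothetical helper name.

```shell
# Hypothetical helper: print every field1 value of tag1, one per line.
extract_tag1_field1() {
    awk '{
        s = $0
        # The literal prefix tag1{field1:" is 13 characters long, and the
        # match also includes the closing quote, hence the -14 below.
        while (match(s, /tag1[{]field1:"[^"]*"/)) {
            print substr(s, RSTART + 13, RLENGTH - 14)
            # Continue searching in the text after this match.
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'element 1 tag1{field1:"a",field2:"b"},tag2{field1:"c",field2:"d"},element 2 tag1{field1:"e",field2:"f"}' | extract_tag1_field1
```

For this made-up input it should print a and e on separate lines, skipping the field1 of tag2.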
