I have a SAM file with an RX:Z: field containing 12 bases separated in the middle by a hyphen, i.e. RX:Z:CTGTGC-TCGTAA
I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file as the read names contain them, like 1713704_EP0004-T
I have mostly been trying tr, but this just removes all hyphens from the file:
tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam
input is a large SAM file of >10,000,000 lines like this:
1902336-103-016_C1D1_1E-T:34 99 chr1 131341 36 146M = 131376 182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MC:Z:147M MD:Z:83T62cD:i:4 cE:f:0 PG:Z:bwa RG:Z:A MI:Z:34 NM:i:1 cM:i:3 MQ:i:36 UQ:i:45 AS:i:141 XS:i:136 RX:Z:CTGTGC-TCGTAA
Desired output (i.e. the last field):
1902336-103-016_C1D1_1E-T:34 99 chr1 131341 36 146M = 131376 182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MC:Z:147M MD:Z:83T62cD:i:4 cE:f:0 PG:Z:bwa RG:Z:A MI:Z:34 NM:i:1 cM:i:3 MQ:i:36 UQ:i:45 AS:i:141 XS:i:136 RX:Z:CTGTGCTCGTAA
How do I solve this problem?
awk
awk '{sub(/-/,"",$NF)}1' file
is what you need.
Explanation
From your example it is clear that you're concerned only with the last field.
NF is the total number of fields in a record, hence $NF is the last field.
sub(/-/,"",$NF) replaces the - in the last field with an empty string; because sub() modifies $NF in place, the rebuilt record is what gets printed.
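One caveat: SAM columns are tab-separated, and when awk rebuilds a record after a field is modified, it joins the fields with OFS, which defaults to a single space. A safer sketch sets both separators to a tab explicitly (filenames taken from the question):
awk 'BEGIN{FS=OFS="\t"} {sub(/-/,"",$NF)}1' sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam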
GNU sed
Because the hyphen to be removed is always the last one on the line,
sed -Ei 's/^(.*)-/\1/' file
will also work. It has the added advantage that it performs an in-place edit.
Explanation
The -E option enables the extended regular expression engine.
(.*) is a greedy match: . matches any character, * any number of times. Because it is greedy, it matches everything up to the last hyphen on the line.
The () makes sed remember what was matched.
In the replacement we put back just the captured part \1 (1 because there is only one pair of parentheses; note that you can have as many as you like) without the hyphen, effectively removing it from the last field, where it occurs.
Note: GNU awk supports -i inplace as well (since gawk 4.1).
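For reference, the in-place gawk form would look like this (gawk 4.1 or later; the same tab caveat as above applies):
gawk -i inplace 'BEGIN{FS=OFS="\t"} {sub(/-/,"",$NF)}1' sample.fq.unaln.umi.sam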
I've solved this problem using pysam, which is faster and safer and requires less disk space, since no SAM file is needed. It's not perfect; I'm still learning Python and have only used pysam for half a day.
import pysam
import sys
from re import sub

# Expect a BAM file as the only argument
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Build the output file name from the input name
inbamfn = sys.argv[1]
outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)

inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves the RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    umifix = umi[:6] + umi[7:]  # drop the hyphen at position 7
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print('Processed', n, 'reads:\n',
      w, 'UMIs written.\n',
      str(int((w / n) * 100)) + '% of UMIs fixed')
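Saved as, say, fixRX.py (the name is arbitrary), the script would be run as:
python fixRX.py sample.fq.unaln.umi.bam
# writes sample.fq.unaln.umi.fixRX.bam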
The best solution is to work with BAM rather than SAM files, and to use a proper BAM parser/writer library, such as htslib.
Lacking that, you can cobble something together by searching for the regular expression ^RX:Z: in the optional tags (columns 12 and up).
Working with columns, while possible, is hard with sed. Instead, here's how to do this in awk. Note that SAM columns are tab-separated, so we set both the input and output field separators to a tab; otherwise awk would rebuild any modified record with spaces:
awk -F '\t' -v OFS='\t' '{
    for (i = 12; i <= NF; i++) {
        if ($i ~ /^RX:Z:/) gsub("-", "", $i)
    }
}
1' file.sam
And here’s a roughly equivalent solution as a Perl “one-liner”:
perl -lape '
    for (@F[11 .. $#F]) {
        s/-//g if /^RX:Z:/;
    }
    $_ = join("\t", @F);
' file.sam
To perform the replacement in the original file, you can pass the option -i.bak to perl (this will create a backup file.sam.bak; if you don’t want the backup, omit the extension).
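For example, a sketch of the in-place invocation (the .bak extension is arbitrary):
perl -i.bak -lape 'for (@F[11 .. $#F]) { s/-//g if /^RX:Z:/ } $_ = join("\t", @F);' file.sam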
Is this pattern present on many records that you want to edit, and always at the end of the line? If so:
sed -E 's/^(.*)(\s..:.:......)-(......\s*)$/\1\2\3/' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam
Hello everyone. I'm working on a list of URLs from which I only need to grep the file names that end with .asp or .aspx, without duplicates. I came across a solution that removes everything before the last / and after .asp.
I tried this regex, which removes everything before the last /:
([^\/]+$)
e.g.
abc/abc/abc/xyz.asp >> xyz.asp
But if there is a / after .asp, it starts selecting after that /:
abc/abc/abc/xyz.asp?ijk=lmn/opq >> opq, which I do not want.
I want to grep only strings that contain .asp or .aspx, and remove every character before the last / and after the extension.
In simple words, I want to grep the filename.asp or filename.aspx only.
Sample Input
https://www.redacted.com/abc/xyz.aspx?something=something
Sample output:
xyz.aspx
Sample input:
https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http%3A%2F%2Fwww.redacted.com%2Fasp%2Fanotherfile-asp%2F_%2FCRID--7%2Fthirdfile.asp%3Fui%3Dhash
sample output:
file.aspx, anotherfile-asp, thirdfile.asp
With your shown samples, you could try the following in GNU awk, using a regex as the record separator RS together with the matched-text variable RT and the split() function.
awk -v RS='[^.]*[-\\.]aspx?' '
RT{
num=split(RT,arr,"[/%2F]")
for(i=1;i<=num;i++){
if(arr[i]~/[-.]asp/){
print arr[i]
}
}
}
' Input_file
If your file contains both of the lines shown in your question, the output will be as follows:
xyz.aspx
file.aspx
anotherfile-asp
thirdfile.asp
Explanation: set RS (the record separator) to the regex [^.]*[-\\.]aspx? for the whole Input_file. Then, in the main program, split each matched record text (RT) on the characters /, %, 2, and F (which dissolves the URL-encoded %2F separators), and print any part that contains -asp or .asp, as shown in the sample output above.
This is Python, but the regex should work elsewhere.
import re
s1 = "https://www.redacted.com/abc/xyz.aspx?something=something"
s2 = "https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http%3A%2F%2Fwww.redacted.com%2Fasp%2Fanotherfile-asp%2F_%2FCRID--7%2Fthirdfile.asp%3Fui%3Dhash"
# We want the set of things that is not a slash, until we get to .asp or
# .aspx, followed either by ? or end of string.
name = r"[^/]*\.aspx?((?=\?)|$)"
for s in (s1, s2):
    print(re.search(name, s).group())
Output:
xyz.aspx
file.aspx
Another option could be using awk: first split the line on the URL-encoded parts that should not be in the result.
Then, from all the parts, match only the strings that do not contain / and end in asp (with an optional x) preceded by either - or .
awk '
{
  n = split($0, a, /(%[A-F0-9]+)+/)
  for (i = 1; i <= n; i++) {
    if (match(a[i], /[^\/]+[.-]aspx?/)) {
      print substr(a[i], RSTART, RLENGTH)
    }
  }
}
' file
Output
file.aspx
anotherfile-asp
thirdfile.asp
xyz.aspx
If grep -P is supported, you might also use a Perl-compatible regular expression that skips the URL encoded parts:
grep -oP "(?:%[A-F0-9]+)+(*SKIP)(*F)|(?:(?!%[A-F0-9])[^/])*[-.]aspx?" file
I want to use grep/awk/sed to extract matched strings from each line of a log file and then place them into a CSV file.
The strings to extract from the first record are: 1434, 53, http://www.espn.com/
If the input is:
2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
My solution so far:
For now I used grep to get all lines containing the keyword 'FETCH DONE' (these lines contain the strings I am looking for).
I did come up with a regular expression that matches the data I need, but when I grep it and write it to the file, each string is printed on a new line, which is not quite what I am looking for.
The grep commands and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'URL,LoadTime,Objects\n'>test1.csv #add header
grep 'FETCH DONE' input.log | grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))'>>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I tried using awk to match multiple regexes and print a comma in between. I couldn't get it to work at all for some reason, even though my regex matches the correct strings.
Another idea I have is to use sed to replace some of the '\n' characters with ',':
for(i=1;i<=n;i++)
if(i % 3 != 0){
sed REPLACE "\n" with "," on i-th line
}
I'm pretty sure there is a more efficient way of doing it.
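For what it's worth, that line-joining idea can be done without an explicit loop: since each record yields three matches, paste can rejoin every group of three lines. A sketch, with input.log standing in for the log file; note the columns come out in loadTime, objects, URL order and would still need reordering:
grep 'FETCH DONE' input.log |
  grep -Po '(?<=loadTime=)\d+|(?<=objects=)\d+|(?<=URL=)\S+' |
  paste -d, - - -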
Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'
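A roughly equivalent GNU awk sketch (gawk only, since the three-argument match() that captures groups into an array is a gawk extension; the field layout is taken from the sample log):
gawk 'BEGIN{print "URL,LoadTime,Objects"}
match($0, /loadTime=([0-9]+) ms, objects=([0-9]+).*URL=(.*)/, m){
    print m[3] "," m[1] "," m[2]
}' input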
For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original version of the file has a few more columns and millions of rows than the following example, but it should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of the following criteria would be a bonus, but is not mandatory if you can get it to work:
use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (although I'm fairly certain XC is almost first, but I am unsure how to check). In the output, however, the XC-string should always be before the XM-string, if at all possible.
The answers to "awk extract multiple groups from each line" come awfully close, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
Looking forward to your solutions!
Thanks,
Felix
With sed you can capture non-space characters after XC:Z: and XM:Z:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file
You can add a second s command for reversed values:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\1, \2/;p;' file
The following awk solution may help.
awk '
/XC:Z:/{
match($0,/XC:[^ ]*/);
num=split(substr($0,RSTART,RLENGTH),a,":");
match($0,/XM:[^ ]*/);
num1=split(substr($0,RSTART,RLENGTH),b,":");
print a[num],b[num1]
}' Input_file
Output will be as follows.
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
If we don't know the order in which XC and XM appear
You can try this sed
sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile
Explanation:
sed -E '
h
# keep the line in the hold space
s/(XC:Z:.*XM:Z:)//;x;tA
# if XC:Z: comes before XM:Z:, go to A; but first restore the pattern space with x
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XM:Z: comes before XC:Z:; grab the interesting parts and reorder them
b
# It is all for this line
:A
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XC:Z: comes before XM:Z:; grab the interesting parts
' infile
another awk
$ awk '{c=p=""; # need to reset c and p before each line
for(i=1;i<=NF;i++) # for all fields in the line
if($i~/^XC:Z:/) c=substr($i,6) # check pattern from the start of field
else if($i~/^XM:Z:/) p=substr($i,6) # if didn't match check other other pattern
if(c && p) print c,p}' file # if both matched print
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
This will print the last matches if there are multiple instances on the same line. Here is another one with slightly different characteristics.
$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
{c=p="";
for(i=1;i<=NF;i++) {
c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
if(c && p)
{print c,p; next}}}' file
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
This will print the last of the repeated match before the first match of the other. If they appear in pairs, it will print the first pair.
Using POSIX awk, you can only use the string-function match(s,ere) as defined by IEEE Std 1003.1-2008 :
match(s, ere)
Return the position, in characters, numbering from 1, in
string s where the extended regular expression ere occurs, or zero if
it does not occur at all. RSTART shall be set to the starting position
(which is the same as the returned value), zero if no match is found;
RLENGTH shall be set to the length of the matched string, -1 if no
match is found.
The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character advancing the searched string). When the pattern is found in the line $0, then you only need to extract the important parts, which start 5 characters later.
The following code does the above:
awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
{match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
{print xc","xm}' <file>
As you can see, the first line extracts XM, the second XC and the third prints the outcome with comma-separator ",".
Remark - The following assumptions are made here :
each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
If you are willing to use gawk, then you could use the patsplit function for string operations (Ref. here). You can do this with a single regex /X[CM]:Z:[^[:blank:]]*/. This gives you the requested strings directly, in a single call, including the XC:Z: or XM:Z: prefix. Afterwards you can easily sort them and extract the last parts.
The following lines do exactly the same in gawk
gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
{xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
{print substr(xc,6)","substr(xm,6)}' <file>
Nonetheless, I believe the awk solution is cleaner from a symmetric point of view.
I've got a file looking like this:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|some|supplementary|information
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|supplementary|information
I'm looking for a regexp to use with sed / awk / (e)grep (it actually doesn't matter to me which of these as all would be fine) to find the following in the above mentioned text:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
I want to get back a |100| line if it is followed by at least two |110| lines before another |100| line appears. The result should contain the initial |100| line together with all |110| lines that follow but not the following |100| line.
sed -ne '/|100|/,/|110|/p'
provides me a list of all |100| lines which are followed by at least one |110| line. But it doesn't check, if the |110| line has been repeated more than once. I get back results I don't look for.
sed -ne '/|100|/,/|100|/p'
returns a list of all |100| lines and the content between the next |100| line including the next |100| line.
Trying to find lines between search patterns has always been a nightmare for me. I spent hours of trial and error on similar problems which finally worked, but I never really understood why. I hope someone might be so kind as to save me the headache this time and maybe explain how the pattern does the work. I'm quite sure I'll face this kind of problem again, and then I can finally help myself.
Thank you for any help on this one!
Regards
Manuel
I'd do this in awk.
awk -F'|' '$2==100&&c>2{print b} $2==100{c=1;b=$0;next} $2==110&&c{c++;b=b RS $0;next} {c=0}' file
Broken out for easier reading:
awk -F'|' '
# If we're starting a new section and conditions have been met, print buffer
$2==100 && c>2 {print b}
# Start a section with a new count and a new buffer...
$2==100 {c=1;b=$0;next}
# Add to buffer
$2==110 && c {c++;b=b RS $0;next}
# Finally, zero everything if we encounter lines that don't fit the pattern
{c=0;b=""}
' file
Rather than using a regex, this steps through the file using the field delimiters you've specified. Upon seeing the "start" condition, it begins keeping a buffer. As subsequent lines match your "continue" condition, the buffer grows. Once we see the start of a new section, we print the buffer if the counter is big enough.
Works for me on your sample data.
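One caveat: as written, a qualifying section at the very end of the file is never printed, because the buffer is only flushed when the next |100| line arrives. A sketch that adds an END rule to flush the final buffer:
awk -F'|' '
$2==100 && c>2 {print b}
$2==100        {c=1; b=$0; next}
$2==110 && c   {c++; b=b RS $0; next}
               {c=0; b=""}
END            {if (c>2) print b}
' file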
Here's a GNU awk specific answer: use |100| as the record separator, |110| as the field separator, and look for records with at least 3 fields.
gawk '
BEGIN {
# a newline, the first pipe-delimited column, then the "100" value
RS="(\n[^|]+[|]100[|])"
FS="[|]110[|]"
}
NF >= 3 {print RT $0} # RT is the actual text matching the RS pattern
' file
In AWK, the field separator is set to a pipe character, and the second field of each line is compared to 100 and 110. $0 represents a whole line from the input file.
BEGIN { FS = "|" }
{
    if ($2 == 100) {
        one_hundred = 1;
        one_hundred_one = 0;
        var0 = $0
    }
    if ($2 == 110) {
        one_hundred_one += 1;
        if (one_hundred_one == 1 && one_hundred == 1) var1 = $0;
        if (one_hundred_one == 2 && one_hundred == 1) var2 = $0;
    }
    if (one_hundred == 1 && one_hundred_one == 2) {
        print var0
        print var1
        print var2
    }
}
awk -f foo.awk input.txt
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I used * instead, and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
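On newer macOS/BSD seds, the -E flag enables extended regexes, where + does work, so this sketch should behave the same:
sed -En 's/^.*abc([0-9]+)xyz.*$/\1/p' example.txt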
You can use sed to do this:
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppresses the automatic printing of each line
-r enables extended regexes, so you don't have to escape the capture-group parentheses ()
\1 is the capture-group match
/g matches globally
/p prints the result
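Run against the question's example.txt, that prints just the captured digits:
$ sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp' example.txt
12345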
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl; the -n option instructs Perl to read one line at a time from STDIN and execute the code, and the -e option specifies the instruction to run.
The instruction runs a regexp on the line read and, if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end as well, e.g.:
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
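Applied to the question's example.txt, that gives:
$ sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//' example.txt
12345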
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from the pattern space (a line).
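To see the greediness concretely with GNU sed: the first command below prints an empty line, because the leading .* consumes everything and \([0-9]*\) then matches the empty string; forcing at least one digit still only captures the last one.
$ echo 'abc12345xyz' | sed -e 's/.*\([0-9]*\).*/\1/'

$ echo 'abc12345xyz' | sed -e 's/.*\([0-9][0-9]*\).*/\1/'
5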
You can use GNU awk's match() with a third array argument to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks for the pattern [0-9]+ where it occurs between abc and xyz and prints just the digits. In the second form, \K discards abc from the reported match instead of using a look-behind.
perl has the cleanest syntax, but if you don't have perl (it's not always there, I understand), then the only way to use gawk and components of a regex is the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/, "\\1", "g"); }' < file
The output for the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need the .* before and after the abc([0-9]+)xyz to get rid of the text before and after the number in the substitution. Keeping abc and xyz in the pattern also stops the greedy leading .* from swallowing all but the last digit.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm that the lengths of $1 and $3 are both zero.
** Edited after realizing a zero-length $2 would trip up my previous solution.
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, with a complex regex that matches multiple times per line or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even-numbered columns:
# those will be the parts of "just the match"
for (i = 2; i <= NF; i += 2) print $i
}'
If you also run OFS = ""; $1 = $1;, then instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, the entire $0 is laid out in a data1-sep1-data2-sep2-... pattern, all while $0 looks EXACTLY the same as when you first read the line: a straight-up print will be byte-for-byte identical to printing immediately upon reading.
Once I tested this to the extreme, using a regex that matches valid UTF-8 characters. It took maybe 30 seconds for mawk2 to process a 167 MB text file with plenty of CJK Unicode, all read at once into $0, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single character of either ASCII or multi-byte UTF-8.
You can do it with the shell:
while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            t="${line##*abc}"
            echo "num is ${t%%xyz*}";;
    esac
done < "file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
    print $0;
    next;
}
{
    # default: do nothing
}
Or, equivalently, as a one-liner:
gawk '/.*abc([0-9]+)xyz.*/' file
(Note that this prints the whole matching line, not just the captured group.)