Output name of named pattern in sed or grep - regex

I'm looking for a way to output the name of the named pattern that matched in a regular expression.
The regex can contain n patterns, each named idn, with no duplicates:
(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)
input-file:
aba
cec
fav
gex
hur
output (any of the following):
id1
id4
id1;id4
1
4
1;4
Is there any way to do it with sed or grep on a Linux OS? The input file is a text file of 200-500 MB.
I know that PHP includes pattern names in its match array, but I'd prefer not to use it.
Any other solution is also welcome, but it should use basic Linux commands.

Here's a simple Perl script which does what you ask.
perl -nle 'if (m/(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)/) {
for my $pat (keys %+) { print $pat } }' filename
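For reference, running that one-liner against the sample input above (saved here as a hypothetical input.txt) prints one group name per matching line:
$ perl -nle 'if (m/(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)/) {
for my $pat (keys %+) { print $pat } }' input.txt
id1
id4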

Extract Filename before date Bash shellscript

I am trying to extract part of the filename - everything before the date and suffix. I am not sure of the best way to do it in a bash script. Regex?
The names are part of the filename, and I am trying to store them in a shell-script variable. The prefixes will not contain strange characters, and the suffix will always be the same. The files are stored in a directory, and I will use a loop to extract the relevant portion of each filename.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
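As a quick sketch of how those operators differ (using a made-up file name):
file=archive_2017-10-12_backup.out
echo "${file%_*}"     # shortest suffix match removed: archive_2017-10-12
echo "${file%%_*}"    # longest suffix match removed: archive
echo "${file#*_}"     # shortest prefix match removed: 2017-10-12_backup.out
echo "${file##*_}"    # longest prefix match removed: backup.out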
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster and more efficient than spawning sed, awk, etc. for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex additionally lets you test that a date-like string \d\d\d\d-\d\d-\d\d follows the _. It is up to you which you need.
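A minimal sketch of that check, building on the loop above:
for fn in *.out; do
  cap="${fn%_*}"
  if [[ $cap == "$fn" ]]; then
    echo "$fn: no _ found, nothing trimmed"
  else
    printf "%s => %s\n" "$fn" "$cap"
  fi
done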
Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
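If you want to apply that pattern from the shell rather than in an editor, one option (assuming a GNU grep built with PCRE support, for the -P flag) would be:
$ grep -oP '^\w+(?=_)' file.txt
EXAMPLE_FILE
EXAMPLE_FILE_2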
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
If you are iterating through the list of files with the .out extension, here is a bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
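For the record, what that one-liner does: -F_ splits each line on underscores, NF-=1 drops the last field (and, since the new NF is non-zero, the expression is true, so the rebuilt record is printed), and OFS=_ makes awk rejoin the remaining fields with underscores. Note that decrementing NF works in GNU awk and most other implementations, but is strictly undefined in POSIX.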
Could you please try the following awk solution too, which will take care of all the .out files; note that this has been written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
Also, my awk version is old, so I am using --re-interval; if you have the latest version of awk you may not need it.
Explanation and non-one-liner form of solution: adding a non-one-liner form of the solution here too, with an explanation.
awk --re-interval ' ##Using --re-interval to support ERE intervals in my OLD awk version; with a newer awk it can be removed.
FNR==1{ ##When the very first line of any Input_file is being read, do the following.
if(val){ ##If the variable named val is NOT NULL, then do the following.
close(val) ##Close the Input_file whose name is stored in val, so that we never hit the TOO MANY OPEN FILES limit; each file is closed once we are done with it.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}"); ##Split FILENAME (the current Input_file name) into the array named array, using a separator of 4 digits-2 digits-2 digits; this matches the YYYY-MM-DD part, so the file-name prefix ends up in the first element.
print array[1]; ##Print the first element of the array.
val=FILENAME; ##Store the current Input_file name in the variable val so it can be closed later.
nextfile ##Skip the remaining lines of the current file and jump to the next file, to save some CPU cycles.
}
' *.out ##Mentioning all *.out Input_file(s) here.

Edit line names with a new name containing an incremented value

This seems like a simple task to me, but getting it to work is turning out to be more difficult than I thought:
I have a fasta file containing several million lines of text (only a few hundred individual sequence entries), and the sequence names are long. I want to replace all characters after the header > with Contig $n, where $n is an integer starting at 1 that is incremented for each replacement.
An example input sequence name:
>NODE:345643RD:Cov_456:GC47:34thgd
ATGTCGATGCGT
>NODE...
ATGCGCTTACAC
Which I then want to output like this
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
So maybe a Perl script? I know some basics, but I'd like to read in a file and then output a new file with the changes, and I'm unsure of the best way to do this. I've seen some Perl one-liner examples but none did what I wanted. Here is my attempt:
$n = 1
if {
s/>.*/(Contig)++$n/e
++$n
}
$ awk '/^>/{$0=">Contig "++n} 1' file
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
Try something like this:
#!/usr/bin/perl -w
use strict;
open (my $fh, '<', 'example.txt') or die "Cannot open example.txt: $!";
open (my $fh1, '>', 'example2.txt') or die "Cannot open example2.txt: $!";
my $n = 1;
# For each line of the input file
while(<$fh>) {
# Try to update the name, if successful, increment $n
if ($_ =~ s/^>.*/>Contig$n/) { $n++; }
print $fh1 $_;
}
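Save it as, say, rename_contigs.pl (the name is arbitrary), make sure example.txt is in the current directory, and run perl rename_contigs.pl; the renamed sequences are written to example2.txt.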
When you use the /e modifier, Perl evaluates the replacement side of the substitution as a Perl expression. Try something like
s/>.*/">Contig " . ++$n/e
perl -i -pe 's/>.*/">Contig " . ++$c/e;' file.txt
Output:
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
I'm not an awk expert (far from it), but I solved this out of curiosity, and because sed doesn't have variables (its possibilities are limited).
One possible gawk solution could be
awk -v n=1 '/^>/{print ">Contig " n; n++; next}1' <file

Extracting string from html file or curl output

I have HTML files, some of which are "minified", which means a whole website can be on just one line.
I want to extract the value of ?idsite=, which contains numbers. The HTML contains something like this: img src="//stats.domains.com/piwik.php?idsite=44.
So the plain output should be "44".
I tried grep, but it echoes the whole line and just highlights the value.
With perl it could be something like:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| perl -nE 'say /.*idsite=(..)\"/ '
(This assumes that idsite is always two characters! :-) Your regex will most likely need to be more sophisticated than this.)
Putting the snippet from the page you reference above in an HTML file (non-minified) and substituting 44 for the parameter variable, this bit of perl will extract the "44":
perl -nE 'say /.*idsite=(..)/ if /idsite/ ' idsite.html
Translating the one liner to a sed command line would be similar:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| sed -En "s/^.*idsite=(..)\"/\1/p"
This is POSIX sed from FreeBSD (it should work on OS X); the -E switch enables "modern" (extended) regexes.
Doing it in awk is left as an exercise for another community member :-)
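In case someone does want an awk variant, a minimal sketch using the standard match() function and its RSTART/RLENGTH variables (same sample input as above) could be:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| awk 'match($0, /idsite=[0-9]+/) { print substr($0, RSTART+7, RLENGTH-7) }'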
Here is a perl way to extract only the trailing digits of strings like src="//stats.domains.com/piwik.php?idsite=44", run on a bash command line:
echo "$src" | perl -ne '$_ =~ m/(\d+$)/; print $1'
Here is a python way to do the same thing:
import re
print ', '.join( re.findall(r'\d+$', src))
If there will be a lot of src strings to process, it would be best to compile the regex when using Python as follows:
import re
p = re.compile(r'\d+$')
print ', '.join(p.findall(src))
The import and the compilation only have to be done once.
Here is a Ruby way to do it:
puts src.scan( /\d+$/ ).first
In all cases the regexes end with "$" which matches the end of the string. That is why they match and extract only digits (\d+) at the end of the string.
If you don't need to check whether the idsite is in the value of a src attribute, then all you need is
perl -nE'say $1 if /\bidsite=(\d+)/' myfile.html
$ cat site.html
lorem ipsum idsite='4934' fasdf a
other line
$ sed -n '/idsite/ { s/.*idsite=\([0-9]\+\).*$/\1/; p }' < site.html
4934
Let me know in case you need an explanation of what is going on.

RegEx to Remove Unwanted text

I'm still kind of new to RegEx in general. I'm trying to retrieve the names from a field so I can split them for further use (using Pentaho Data Integration/Kettle for the data extraction). Here's an example of the string I'm given:
CN=Name One/OU=Site/O=Domain;CN=Name Two/OU=Site/O=Domain;CN=Name Three/OU=Site/O=Domain
I would like to have the following format returned:
Name One;Name Two;Name Three
Kettle uses Java Regular Expressions.
That sounds like you want search-and-replace based on a regex. How to do that correctly depends on your language, but with sed I would do it like this:
echo "CN=Name One/OU=Site/O=Domain;CN=Name Two/OU=Site/O=Domain;CN=Name Three/OU=Site/O=Domain" |\
sed 's/CN=\([^\/]*\)[^;]*/\1/g'
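which outputs:
Name One;Name Two;Name Three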
If you intend to split it later anyway, you probably want to just match the names and return them in a loop. Example code in Perl:
#!/usr/bin/perl
$line="CN=Name One/OU=Site/O=Domain;CN=Name Two/OU=Site/O=Domain;CN=Name Three/OU=Site/O=Domain";
for $match ($line =~ /CN=([^\/]*)/g ){
print "Name: $match\n";
}
Assuming you have it in file.txt:
sed -e 's/\/OU=Site\/O=Domain//g' -e 's/CN=//g' file.txt
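This strips the /OU=Site/O=Domain parts and then the CN= prefixes, giving the same output:
Name One;Name Two;Name Three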

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on Solaris 10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
local pattern=$1
local string=$2
local result=${string/${pattern}*/}
[ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
if [[ "$var" =~ "$regex" ]]; then
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo "capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
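For example, with the hypothetical values var='abc a123 xyz' and regex='a([0-9]+)', it prints:
capture[0]: a123
capture[1]: 123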
You might like your awk options better, though. There's a match function which gives you the index you want; it'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source oobash, a string library written in bash in OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill