Extract a substring from a string in shell script - regex

$sea = xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tgz
I want to extract only DMJK.UK.1.0-32133-Z-1, i.e. the first occurrence; the search string always starts with DMJK. The input may also look like:
$sea = anything-anything.xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tgz matchessadds.dsdsds.21212.anything-anything
or
$sea = anything-anything.xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tar.bz2 matchessadds.dsdsds.21212.anything-anything
I need to extract everything starting from DMJK up to, but not including, ".tgz", ".tar" or ".tar.bz2".
The output should be DMJK.UK.1.0-32133-Z-1 in the cases above.
The main string can be anything that includes the format
"anything.anything-DMJK.XXX.[0-9].[0-9].[0-9]-3213309-[A-Z]-[0-9].tar.tgz anything-anything "
The values change on the fly; the output could be, e.g., DMJK.XXX.1.3.0-3213309-Z-13.
I tried this, but it's not working:
echo $sea | sed 's#.*(DMJK.*).t*#\1#g'

There are better ways to do this, but you can fix your current solution with:
sed -E 's#.*(DMJK.*)\.(tgz|tar\.bz2|tar)#\1#g'
(Well, sort of. This will perhaps not work as desired on a string with .tar.bz5. It's not entirely clear what you want to do in that case, though.)
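For inputs that carry extra text after the archive name, one possible sketch splits the job in two: grep -o pulls out the first DMJK token up to the next space, then sed strips a trailing archive extension. The function name extract_name is hypothetical, and the sketch assumes the wanted token contains no spaces.

```shell
# Hypothetical helper: print the first DMJK... token from stdin without its archive extension.
# Assumes the token contains no spaces and ends in .tgz, .tar or .tar.bz2.
extract_name() {
  grep -oE 'DMJK[^ ]*' | head -n 1 | sed -E 's/\.(tgz|tar\.bz2|tar)$//'
}

sea='anything-anything.xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tar.bz2 matchessadds.dsdsds.21212.anything-anything'
printf '%s\n' "$sea" | extract_name   # prints DMJK.UK.1.0-32133-Z-1
```

Anchoring the extension with $ after isolating the token means an unknown suffix such as .tar.bz5 is left untouched rather than half-stripped.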

regex in sed removing only the first occurrence from every line

I have the following file I would like to clean up
cat file.txt
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
My desired output is:
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
I would like to remove everything between ":" and the first occurrence of "or".
I tried sed 's/MNS:d*?or /MNS:/g', though it removes the second "or" as well.
I tried every option in https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/
to no avail. Should I create alias sed='perl -pe'? It seems that sed does not properly support regex.
Perl is more suitable here because we need lazy (non-greedy) matching.
perl -pe 's|(:.*?or +)(.*)|:$2|' Input_file
By using .*?or we match only up to the nearest or in the line.
This might work for you (GNU sed):
sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n */:/}' file
If a line contains : followed by the word or, then substitute the first occurrence of the word or with a unique delimiter (e.g. \n) and then remove everything between : and the unique delimiter, keeping the : itself.
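A minimal sketch of the delimiter technique on a hypothetical input line, assuming GNU sed (for \< \> and \n in the replacement). The second substitution writes the colon back, so only the text between : and the first word or is dropped. The function name is made up for illustration.

```shell
# Hypothetical helper: drop the text between ":" and the first standalone word "or".
# Requires GNU sed.
drop_before_first_or() {
  # step 1: turn the first word "or" into a newline (cannot occur inside a line otherwise)
  # step 2: replace ":" through that newline (plus following spaces) with ":"
  sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n */:/}'
}

printf '%s\n' 'MNS:abc def or N+ GYPA*01 or GYPA*M' | drop_before_first_or
# prints MNS:N+ GYPA*01 or GYPA*M
```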
Regarding "I would like to remove everything between ":" and the first occurrence of "or"": no, you wouldn't. The first occurrence of or in the 2nd line of sample input is at the start of orweqqwe. The text immediately after : looks like it could be any set of characters, so couldn't it contain a standalone or, e.g. MNS:2 or eqqwe or M+ GYPA*02 or GYPA*N?
Given that and the fact it's apparently a fixed number of characters to be removed on every line, it seems like this is what you should really be using:
$ sed 's/:.\{14\}/:/' file
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
If or is guaranteed to occur exactly twice per line, as in the provided example, please try:
sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/' file.txt
Result:
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
Otherwise Perl, which supports shortest (lazy) matching, is the better solution, as in RavinderSingh13's answer.
ex supports lazy matching with \{-}:
ex -s '+%s/:\zs.\{-}or //g|wq' input_file
The pattern :\zs.\{-}or matches any character after the first : up to the first or.

Regex a string with unknown number of parameters

Let's say I have millions of strings in a text file in this format:
st=expand&c=22&t=button&k=fun HTTP
This is a string we can look at as a hash with keys st, c, t and k. Some of the strings in the text file might not have a given &KEY=VALUE present and might thus look like this:
st=expand&k=fun HTTP
How would one use sed to change the string to the following?
expand,,,fun
That is, even though the key=value pair isn't present, we still add a comma. We can assume that we have a fixed key set [st,c,t,k].
What I've tried is something like (just an idea!!)
sed 's/\(st=\|c=\|t=\|k=\)\([\(^\&\|HTTP\)])\(\&\|HTTP\)/\3,/g' big_file
but obviously, if c isn't there, it isn't adding a comma since it doesn't find any. Any ideas how to approach this? Using awk might also be acceptable (or any other fast text-processing utility)
Thanks!
Input data example
st=expand&c=22&t=button&k=fun HTTP
c=22&t=button&k=fun HTTP
st=expand&c=22&t=party&k=fun HTTP
st=expand&c=22&k=fun HTTP
st=expand HTTP
HTTP
Output data
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
You can use this sed:
sed -E 's/(st=([^& ]*)|)(.*c=([^& ]*)|)(.*t=([^& ]*)|)(.*k=([^& ]*)|) HTTP/\2,\4,\6,\8/' file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
Whenever you have name=value pairs in your input data, it's simplest and clearest and usually most efficient to create a name->value array and then print the values by name in whatever order you want, e.g.:
$ cat tst.awk
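BEGIN { FS="[&= ]"; OFS="," }   # split on &, = and space; join output with commas
{
    delete n                    # forget the previous line's pairs
    for (i=1; i<NF; i+=2) {     # fields alternate: name, value
        n[$i] = $(i+1)
    }
    print n["st"], n["c"], n["t"], n["k"]   # missing keys print as empty strings
}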
$ awk -f tst.awk file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
Another pattern for sed to try:
sed -r "s/(st=(\w+))?(&?c=(\w+))?(&t=(\w+))?(&k=(\w+))?( HTTP)/\2,\4,\6,\8/g" big_file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
How about something like this? It's not perfectly strict, but as long as your data follows the format you described on every line, it will work.
Regex:
^(?:st=([^&\n]*))?&?(?:c=([^&\n]*))?&?(?:t=([^&\n]*))?&?(?:k=([^&\n]*))? HTTP$ (must be run once per line or with multi-line and global options enabled)
Substitution:
\1,\2,\3,\4
Try it here: https://regex101.com/r/nE1oP7/2
EDIT: If you are using sed, you will need to change the non-capturing groups to regular ones ((?:) to ()) and update the backreferences accordingly (\2,\4,\6,\8). Demo: http://ideone.com/GNRNGp
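As a sketch of that conversion, the non-capturing groups become plain groups under sed -E and the values land in \2, \4, \6 and \8. The function name kv_to_csv is hypothetical, and the sketch assumes the keys appear in the fixed order st, c, t, k and that every line ends in " HTTP".

```shell
# Hypothetical helper: turn "st=..&c=..&t=..&k=.. HTTP" lines into "st,c,t,k" CSV.
# Assumes keys appear in the order st, c, t, k and each line ends in " HTTP".
kv_to_csv() {
  sed -E 's/^(st=([^&]*))?&?(c=([^&]*))?&?(t=([^&]*))?&?(k=([^&]*))? HTTP$/\2,\4,\6,\8/'
}

printf '%s\n' 'st=expand&c=22&t=button&k=fun HTTP' 'st=expand&c=22&k=fun HTTP' | kv_to_csv
# prints:
# expand,22,button,fun
# expand,22,,fun
```

Lines that do not match the whole pattern pass through unchanged, which makes malformed input easy to spot in the output.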

Sed search one regex, return another

I have the following csv file:
hd1,100
hd2,200
I'd like to change it so it reads like this:
hard1drive,100
hard2drive,200
I thought sed could help:
sed 's/hd[0-9]/hard[0-9]drive/' < infile.csv
but instead of the desired output I get:
hard[0-9]drive,100
hard[0-9]drive,200
Is there any way I can 'capture' the number from the search parameter and insert it within the replace parameter within sed, or am I going to have to use another command?
Use capturing groups
sed 's/hd\([0-9]\)/hard\1drive/'
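A quick way to try the capture group on the sample data, as a small sketch. The function name rename_drives is made up for illustration.

```shell
# Hypothetical helper: rewrite hdN to hardNdrive on each line of stdin.
rename_drives() {
  # \([0-9]\) captures the digit; \1 re-inserts it in the replacement
  sed 's/hd\([0-9]\)/hard\1drive/'
}

printf 'hd1,100\nhd2,200\n' | rename_drives
# prints:
# hard1drive,100
# hard2drive,200
```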
An option without grouping, where & stands for the whole match:
kent$ echo "hd1,100
hd2,200"|sed 's/d[0-9]/ar&drive/'
hard1drive,100
hard2drive,200

Trying to match a string and add it to the replacement

I am trying to add a string to the end of a line using sed and regex.
I have the following string:
disable_functions = pcntl_alarm,pcntl_fork,pcntl_waitpid,pcntl_wait,pcntl_wifexited,pcntl_wifstopped,pcntl_wifsignaled,pcntl_wexitstatus,pcntl_wtermsig,pcntl_wstopsig,pcntl_signal,pcntl_signal_dispatch,pcntl_get_last_error,pcntl_strerror,pcntl_sigprocmask,pcntl_sigwaitinfo,pcntl_sigtimedwait,pcntl_exec,pcntl_getpriority,pcntl_setpriority,
and I'm trying to append the following string to the end of it:
exec,system,shell_exec,passthru,
My attempt is as follows:
sed -ie 's/disable_functions = .*[a-zA-z,]$/disable_functions = $1exec,system,shell_exec,passthru,/gI' /etc/php5/apache2/php.ini
It seems to just insert a literal $1, producing disable_functions = $1exec,system,shell_exec,passthru,
Where am I going wrong?
Try this:
sed '/^disable_functions/s/$/exec,system,shell_exec,passthru,/' /etc/php5/apache2/php.ini
If the output seems OK, then add -i switch to modify the file.
Here, $ means end of line.
Your original command is missing the capture group that sets the value of \1 (not $1):
sed -i -e 's/disable_functions = \(.*[a-zA-Z,]\)$/disable_functions = \1exec,system,shell_exec,passthru,/gI' /etc/php5/apache2/php.ini
But as sputnick points out, you simply need to find the appropriate line and append the desired text; there's no need to match the old values and reinsert them.
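The append-at-end-of-line approach can be tried safely on a scratch file first. The sketch below is illustrative only: the helper name append_disabled, the temporary file and the memory_limit line are all hypothetical, and the text to append is assumed to contain no / or & characters (both are special in a sed replacement).

```shell
# Hypothetical helper: print file $1 with text $2 appended to the disable_functions line.
# Assumes $2 contains no "/" or "&". The file itself is not modified.
append_disabled() {
  sed "/^disable_functions/s/\$/$2/" "$1"
}

ini=$(mktemp)   # scratch file standing in for php.ini
printf '%s\n' 'memory_limit = 128M' 'disable_functions = pcntl_alarm,pcntl_fork,' > "$ini"
append_disabled "$ini" 'exec,system,shell_exec,passthru,'
# prints:
# memory_limit = 128M
# disable_functions = pcntl_alarm,pcntl_fork,exec,system,shell_exec,passthru,
rm -f "$ini"
```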

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in OSX terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexs from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie...but definitely a newbie to regex..)
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!
I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python3 -c 'import sys, ldap.dn; print(ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0])'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.
Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.
Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* rest
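Applied to the sample DN from the question, the sed expression can be sketched as below. The function name cn_of is hypothetical. Note that [^,]* stops at the first comma, so a DN with an escaped comma in the common name would be cut short.

```shell
# Hypothetical helper: print the leading CN value of each DN read from stdin.
# Stops at the first comma, so escaped commas inside the CN are not handled.
cn_of() {
  sed 's/^CN=\([^,]*\).*/\1/'
}

printf '%s\n' 'CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu' | cn_of   # prints First Last
```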
With awk, splitting each line on = and , (see http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):
awk -F'[=,]' '$1=="CN"{print $2}' foo.txt
I like awk too, so I print the first comma-separated field starting from its fourth character:
DSCL | awk -F, '{print substr($1,4)}' > filterednames.txt
This regex will parse a distinguished name, giving name and val capture groups for each match.
When DN strings contain commas, they are meant to be quoted. This regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))