Convention on how to pass multiple regexps on the command line - regex

I'm writing a small command line utility that will need to take several arguments each of which can be a list of regular expressions. Is there a convention on how to do that?
Here is an example of what I have in mind
mycliutility -i regexp1,regexp2 -o regexp3,regexp4 somefilename
So I'm asking whether, for example, a comma is a good separator for the regular expressions, and what/how to escape it if the separator needs to appear in a regexp.
I'm expecting/hoping that the need to use a comma (or whatever) in a regexp is rare, so I would like to use a syntax that is as lightweight as possible.
Pointers to existing CLI tools that take arguments like that are welcome.
EDIT
It is also possible that the regexps come from a Java Properties file, and for this reason it would be 'cleaner' if multiple regexps on the command line were treated as one argument (so the syntax would be the same on the CLI and in the properties file); see this example.properties file:
iexps = regexp1,regexp2
oexps = regexp3,regexp4

If the regexes are simple alternatives, a single regex of the form regex1|regex2 may well be the simplest solution.
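As a rough sketch of the single-regex idea, several patterns can be joined with | before being handed to a tool such as grep -E. The function name join_alternatives and the sample patterns are made up for illustration; each pattern is wrapped in parentheses so that anchors and inner alternations stay local to it (note that this shifts the numbering of any capture groups).

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments into one ERE alternation.
join_alternatives() {
  local joined='' p
  for p in "$@"; do
    # Add a "|" separator before every pattern except the first.
    joined+="${joined:+|}($p)"
  done
  printf '%s\n' "$joined"
}

combined=$(join_alternatives 'foo[0-9]+' '^bar$')
# Count the sample lines that match either alternative.
matches=$(printf '%s\n' 'foo42' 'bar' 'foobar' | grep -cE "$combined")
printf '%s -> %s matching lines\n' "$combined" "$matches"
```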
If you need to parse comma-separated regexes out of the property file anyway, you'd better use the same syntax on the command line as well. Game over.
One thing I thought of, but don't really recommend, is to wrap the regex inside a pair of delimiters, outside of which a comma delimiter would be unambiguous. Slashes are popular as regex delimiters in sed, Awk, Perl, and PHP; but PHP should act as a warning example, because the preg_replace syntax has a pesky problem with double quoting ("/regex/" is a regex between slash delimiters inside a double-quoted string).
No, a comma is not a good separator, because it can validly occur inside a regular expression.
My recommendation would be to use an option parser which allows you to specify the same option name multiple times, so you can say
mycliutility -i regexp1 -i regexp2 -o regexp3 -o regexp4 somefilename
If your implementation language is Python and you are using optparse, for example, look at the action='append' behavior.
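The same repeated-option idea can be sketched in bash with the getopts builtin, assuming the utility were a shell script. The names parse_patterns, ipatterns, opatterns and file are hypothetical; only the -i/-o letters come from the question. Each occurrence of an option appends one regexp to an array, so commas inside a regexp need no escaping.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: collect repeated -i/-o options into arrays.
parse_patterns() {
  ipatterns=() opatterns=()
  local opt OPTIND=1          # reset getopts state for this call
  while getopts 'i:o:' opt; do
    case $opt in
      i) ipatterns+=("$OPTARG") ;;   # one regexp per -i
      o) opatterns+=("$OPTARG") ;;   # one regexp per -o
      *) return 2 ;;                 # unknown option
    esac
  done
  shift $((OPTIND - 1))
  file=${1-}                  # first remaining argument, if any
}

parse_patterns -i 'a,b' -i 'c+' -o '^x{1,3}$' somefilename
printf 'i: %s\n' "${ipatterns[@]}"
printf 'o: %s\n' "${opatterns[@]}"
printf 'file: %s\n' "$file"
```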

Related

Can OR expressions be used in ${var//OLD/NEW} replacements?

I was testing some string manipulation stuff in a bash script and I've quickly realized it doesn't understand regular expressions (at least not with the syntax I'm using for string operations), then I've tried some glob expressions and it seems to understand some of them, some not. To be specific:
FINAL_STRING=${FINAL_STRING//<title>/$(get_title)}
is the main operation I'm trying to use and the above line works, replacing all occurrences of <title> with $(get_title) on $FINAL_STRING... and
local field=${1/#*:::/}
works, assigning $1 with everything from the beginning to the first occurrence of ::: replaced by nothing (removed). However, # does what I'd expect ^ to do. Plus, when I tried to use the {,,} glob expression here:
FINAL_STRING=${FINAL_STRING//{<suffix>,<extension>}/${SUFFIX}}
to replace any occurrence of <suffix> OR <extension> with ${SUFFIX}, it does not work.
So I see it doesn't take regexes and it also doesn't take glob patterns... so what does it take? Is there an exhaustive listing of the symbols/expressions understood by plain bash string operations (particularly substring replacement)? Or are *, ?, #, ##, % and %% the only valid ones?
(I'm trying to rely only on plain bash, without calling sed or grep to do what I want)
The gory details can be found in the bash manual, Shell Expansions section. The complete picture is surprisingly complex.
What you're doing is described in the Shell Parameter Expansion section. You'll see that the pattern in
${parameter/pattern/string}
uses the Filename Expansion rules, and those don't include Brace Expansion - that is done earlier when processing the command line arguments. Filename expansion "only" does ?, * and [...] matching (unless extglob is set).
But parameter expansion does a bit more than just filename expansion, notably the anchoring you noticed with # or %.
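For the OR case from the question, one possible sketch is to turn on extglob and use the @(a|b) pattern form, which matches exactly one of the listed alternatives. The sample values of SUFFIX and FINAL_STRING are invented for the demonstration.

```shell
#!/usr/bin/env bash
# extglob enables extended patterns such as @(a|b); set it before use.
shopt -s extglob

SUFFIX='.txt'                                        # sample value
FINAL_STRING='report<suffix> and backup<extension>'  # sample value

# Replace every occurrence of <suffix> OR <extension> with $SUFFIX.
FINAL_STRING=${FINAL_STRING//@(<suffix>|<extension>)/$SUFFIX}
printf '%s\n' "$FINAL_STRING"
```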
bash does in fact handle regexes, specifically through the [[ =~ ]] operator; after a successful match, the matched text and any capture groups are available in the array variable BASH_REMATCH, from which you can assign them to a variable. It's funky, but it works.
See: http://www.linuxjournal.com/content/bash-regular-expressions
Note this is a bash-only hack feature.
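A minimal sketch of that operator, using a made-up key:::value line similar to the ::: field from the question. Keeping the regex in a variable (re here) avoids quoting surprises on the right-hand side of =~.

```shell
#!/usr/bin/env bash
line='key:::value'           # sample input
re='^([^:]+):::(.*)$'        # two capture groups around the ::: separator

if [[ $line =~ $re ]]; then
  # BASH_REMATCH[0] is the whole match; [1], [2], ... are the groups.
  key=${BASH_REMATCH[1]}
  value=${BASH_REMATCH[2]}
  printf 'key=%s value=%s\n' "$key" "$value"
fi
```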
For code that works in shells besides bash as well, the old school way of doing something like this is indeed to use #/##/%/%% along with a loop around a case statement (which supports basic * glob matching).
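One way that old-school loop might look, as a sketch only: replace_all is a hypothetical helper that replaces every literal occurrence of its second argument, using just case, %% and #. It uses the global variables rest and result because plain POSIX sh has no local.

```shell
#!/bin/sh
# Hypothetical helper. usage: replace_all STRING OLD NEW
replace_all() {
  [ -n "$2" ] || return 1    # an empty OLD would loop forever
  rest=$1
  result=''
  while :; do
    case $rest in
      *"$2"*)
        # Keep the text before the first OLD, then append NEW.
        result=$result${rest%%"$2"*}$3
        # Drop everything up to and including that first OLD.
        rest=${rest#*"$2"}
        ;;
      *) break ;;
    esac
  done
  printf '%s\n' "$result$rest"
}

replace_all 'a<title>b<title>c' '<title>' 'X'
```

Quoting "$2" inside the patterns makes it match literally, so glob characters in OLD are not treated as wildcards.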

How to extract a part of a header in a Fasta file using a Linux command

I have a Fasta file with unique headers, and I would like to extract a part of each header using a regular expression in Unix.
For example, my Fasta file starts with this header:
>jgi|Penbr2|47586|fgenesh1_pm.1_#_25
and I would like to extract just the last part of this header like:
>fgenesh1_pm.1_#_25
I tried this regular expression in the vim editor, but it did not work:
:%s/^([^|]+\|){3}//g
or
:%s/^([A-Z][0-9]+\|){3}//g
I would appreciate any suggestions.
You can use sed:
sed -e 's/>.*|/>/' fasta-file
i.e. everything from > up to and including the last | is replaced by > (the .* is greedy).
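A quick way to try this without a real file, as a sketch: the wrapper name shorten_headers and the sample sequence line are invented, and the ^ anchor is an addition so that only lines starting with > are touched.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: reads the named files, or stdin if none are given.
shorten_headers() {
  sed -e 's/^>.*|/>/' "$@"
}

out=$(printf '%s\n' '>jgi|Penbr2|47586|fgenesh1_pm.1_#_25' 'ACGTACGT' |
  shorten_headers)
printf '%s\n' "$out"
```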
I don't know whether the leading > is also part of your text; I'll assume it is not.
Since you tagged the question with vim, I'll just post the vim solution.
You can take advantage of the greediness of the regex:
In vim:
%s/.*|//
will leave the last part; this is the easiest way.
In vim you can also use \zs, \ze and non-greedy matching:
%s/\zs.\{-}\ze[^|]\+$//
Of course, if you like grouping, you can use \(...\) to group instead of \zs and \ze.
In your code, you grouped with just (...) without escaping. I don't know how you configured the magic setting in your vimrc; with the default, you have to escape ( and ) to give them their special meaning (grouping here), just as in BRE. Do a :h magic and find the table to see the difference.
In vim do :h terms to get detailed information.

Bash script - variable expansion within backtick/grep regex string

I'm trying to expand a variable in my bash script inside of the backticks, inside of the regex search string.
I want the $VAR to be substituted.
The lines I am matching are like:
start....some characters.....id:.....some characters.....[variable im searching for]....some characters....end
var=`grep -E '^.*id:.*$VAR.*$' ./input_file.txt`
Is this a possibility?
It doesn't seem to work. I know I can normally expand a variable with "$VAR", but won't this just search directly for those characters inside the regex? I am not sure what takes precedence here.
Variables do expand in backticks but they don't expand in single quotes.
So you need to either use double quotes for your regex string (I'm not sure what your concern with that is) or combine both kinds of quotes.
So either
var=`grep -E "^.*id:.*$VAR.*$" ./input_file.txt`
or
var=`grep -E '^.*id:.*'"$VAR"'.*$' ./input_file.txt`
Also, you might want to use $(grep ...) instead of backticks, since it is the more modern approach, has better syntactic properties, and can be nested.
You need to have the expression in double quotes (and, then, escape anything which needs to be escaped) in order for the variable to be interpolated.
var=$(grep -E "^.*id:.*$VAR.*\$" ./input_file.txt)
(The backslash is not strictly necessary here, but I put it in to give you an idea. Your real expression is perhaps more complex.)
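To see the interpolation at work, here is a small self-contained sketch. The temporary file, its two lines and the value of VAR are all made up. Note that the contents of $VAR become part of the regex, so any metacharacters in it are interpreted as such.

```shell
#!/usr/bin/env bash
VAR='user42'                 # sample search term
input=$(mktemp)              # stand-in for ./input_file.txt
printf '%s\n' \
  'start abc id: xyz user42 more end' \
  'start abc id: xyz other end' > "$input"

# Double quotes let the shell expand $VAR before grep sees the regex.
var=$(grep -E "^.*id:.*${VAR}.*\$" "$input")
printf '%s\n' "$var"
rm -f "$input"
```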

Regular expression trouble

Hey guys - I'm tearing my hair out trying to create a regular expression to match something like:
{TextOrNumber{MoreTextOrNumber}}
Note the matching number of open/close {}. Is this even possible?
Many thanks.
Note the matching number of open/close {}. Is this even possible?
Historically, no. However, modern regular expressions aren’t actually regular and some allow such constructs:
\{TextOrNumber(?R)?\}
(?R) recursively inserts the pattern again. Notice that not many regex engines support that (yet).
If you need to handle an arbitrary number of braces, you can use a parser generator, or apply a regex inside a recursive function. The following is an example of such a recursive function in Ruby.
def parse(s)
  # Match one level: capture the name and the nested remainder, if any.
  if s =~ /^\{([A-Za-z0-9]*)(\{.*\})?\}$/
    puts $1
    parse($2) # $2 is nil at the innermost level, which ends the recursion
  end
end
parse("{foo{bar{baz}}}")
This is not possible with 1 regex if you don't have a recursive extension available. You'll have to match a regex like the following one multiple times
/\{[a-z0-9]+([a-z0-9\{\}]+)?\}/i
capture the "MoreTextOrNumber" and let it match again until you are through or it fails.
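The match-again loop can be sketched in bash with [[ =~ ]]. This assumes the simple shape {name{name{...}}} with at most one nested group per level; the variable names and the slightly stricter regex are illustrative, not the one given above.

```shell
#!/usr/bin/env bash
s='{foo{bar{baz}}}'                       # sample input
re='^\{([A-Za-z0-9]+)(\{.*\})?\}$'        # name plus optional nested part
names=()

# Peel off one level per iteration until nothing matches any more.
while [[ $s =~ $re ]]; do
  names+=("${BASH_REMATCH[1]}")
  s=${BASH_REMATCH[2]}                    # empty at the innermost level
done
printf '%s\n' "${names[@]}"
```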
Not easy but possible
Officially, regular expressions are not designed for parsing nested paired brackets --- and if you try to do this, you run into all sorts of problems. There are other tools (like parser generators, e.g. yacc or bison) that are designed for such structures and can handle them well. But it can be done --- and if you do it right it may even be simpler than a yacc grammar with all the support code to work around the problems of yacc.
Here are some hints:
First of all, my suggestions work best if you have some characters that will never appear in the input. Often, characters like \01 and \02 should never appear, so you can do
s/[\01\02]/ /g;
to make sure they are not there. Otherwise, you may want to escape them (e.g. convert them to text like %0 and %1) with an expression like
s/([\01\02%])/"%".ord($1)/ge;
Notice, that I also escaped the escape character "%".
Now, I suggest parsing brackets from the inside out: replace any substring "{ text }" where "text" does not contain any brackets by a placeholder "\01$number\02" and store the enclosed text in $array[$number]:
$number=1;
while (s/\{([^{}]*)\}/"\01$number\02"/e) { $array[$number]=$1; $number++; }
$array[0]=$_; # $array[0] corresponds to your input
As a final step, you may want to process each element in @array to pull out and process the "\01$number\02" markers. This is easy because they are no longer nested.
I happily use this idea in a few parsers (including separating matching bracket types like "(){}[]" etc).
But before you go down this road, make sure to have used regular expressions in simpler applications: You will run into many small problems and you need experience to resolve them (rather than turning one small problem into two small problems etc.).
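For comparison, the inside-out idea can be sketched in bash as well. This is a loose translation of the Perl loop above, not the same code: parts plays the role of @array, and open/close hold the \01 and \02 marker characters, which are assumed never to occur in the input.

```shell
#!/usr/bin/env bash
s='{a{b}{c{d}}}'             # sample input
re='\{([^{}]*)\}'            # an innermost pair: no braces inside
open=$'\001' close=$'\002'   # marker characters
n=1
parts=()

while [[ $s =~ $re ]]; do
  match=${BASH_REMATCH[0]}
  parts[n]=${BASH_REMATCH[1]}             # text inside this pair
  s=${s/"$match"/$open$n$close}           # replace the pair by a marker
  n=$((n + 1))
done
# $s now holds only the outermost marker; parts[] holds the pieces.
```

Each stored piece may itself contain markers pointing at pieces with lower numbers, which is what makes the later processing non-nested.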

find, replace and escape string linux

I'm trying to find all instances of a string and replace them. The original string looks like this:
<li>Some Text Here</li>
the replacement looks like this:
<li>Something new</li>
What would be a good way to do this in the CLI?
Thanks
I think the sed command would do the job nicely, provided your onclick handler and the "Some Text Here" don't include any nested HTML tags that the regex might confuse for the closing tags of the replacement string.
Searching and replacing in HTML is a guaranteed headache. At some point someone will pass you malformed HTML and even the most carefully crafted regexp will fail horribly.
I'd definitely work with the HTML at the highest possible level of abstraction, preferably a homegrown tool that uses DOM or SAX.
For a quick fix
A command line tool using XSL/XSLT
Standard way of searching and replacing in linux command line is sed, eg:
sed -i -e "s%textyouwantoreplace%newtext%g" yourfilename
The only thing is, sed uses regular expressions, so you'll need to escape anything that might be treated as a wildcard by putting a \ before it, e.g. write \$ instead of $.
-i means: edit the file in-place
-e means: the next thing in the commandline is an expression to evaluate, ie the whole thing in quotes after it
's' means 'substitute'
'g' means 'global' substitution
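Applied to the strings from the question, a sketch might look like the following. The temporary file and its contents are invented, and the edit goes through a second file instead of -i, since the exact form of -i differs between sed implementations. Using % as the delimiter means the / in </li> needs no escaping.

```shell
#!/usr/bin/env bash
file=$(mktemp)               # stand-in for the real HTML file
printf '%s\n' '<ul>' '<li>Some Text Here</li>' '</ul>' > "$file"

# Write the edited text to a new file, then move it over the original.
sed -e 's%<li>Some Text Here</li>%<li>Something new</li>%g' "$file" \
  > "$file.new" && mv "$file.new" "$file"

result=$(cat "$file")
printf '%s\n' "$result"
rm -f "$file"
```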