Grep a filename with a specific underscore pattern - regex

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits, the lastname and firstname and city can vary on size
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to match the standard xx_ at the beginning, then a code of at least 3 letters followed by another underscore, and so on.
Could anybody help ?

Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
# print the matches only if the glob actually expanded to an existing file or symlink
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
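To try the pattern without touching the filesystem, it can be held in a variable and compared against plain strings. This is only a sketch: the function name matches_convention and the sample file names are made up for illustration.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables +( ) and @( ) in patterns

# Same pattern as the glob above, kept in a variable so it can be matched
# against strings instead of directory entries.
pattern='xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf)'

# matches_convention NAME: succeeds if NAME fits the naming convention.
# (hypothetical helper, not part of the original answer)
matches_convention() {
  [[ $1 == $pattern ]]   # unquoted right-hand side is treated as a pattern
}

for name in xx_abc_smith_john_paris.doc xx_ab_smith_john_paris.doc xx_abc_smith_john_paris.txt; do
  if matches_convention "$name"; then
    printf '%s: match\n' "$name"
  else
    printf '%s: no match\n' "$name"
  fi
done
```

Only the first name should be reported as a match: the second has a two-letter code and the third has the wrong extension.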

So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
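One caveat when feeding this to grep -E: POSIX extended regular expressions have no lazy quantifiers, so a ? after {3,} should not be relied on to mean "non-greedy" there. Because each name part is letters only, it cannot run past an underscore anyway, so a plain greedy, anchored version should behave the same. A minimal sketch, with a hypothetical filter_names helper and made-up sample names:

```shell
#!/usr/bin/env bash
# Greedy, anchored ERE. Anchors keep names such as "...doc.bak" from matching.
ere='^xx_[A-Za-z]{3,}_[A-Za-z]{2,}_[A-Za-z]{2,}_[A-Za-z]{2,}\.(docx?|pdf)$'

# filter_names: reads names on stdin, prints those that fit the convention.
# (hypothetical helper; in the question the input would come from ls -1)
filter_names() {
  grep -E "$ere"
}

printf '%s\n' \
  xx_abc_smith_john_paris.doc \
  xx_abcd_li_bo_rome.docx \
  xx_ab_smith_john_paris.pdf \
  yy_abc_smith_john_paris.doc \
  xx_abc_smith_john_paris.doc.bak | filter_names
```

Only the first two names should be printed.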

Related

Linux Bash Script Regex malfunction

I would like to make a bash script, which should decide about the given strings, if they fulfill the term or not.
The terms are:
The string's first 3 character must be "le-"
Between hyphens there can be any number of consonants in any arrangement and just one "e"; it cannot contain any other vowel.
Between hyphens there must be something
The string must not end with hyphen
I made this script:
#!/bin/bash
# Testing regex
while read -r line; do
if [[ $line =~ ^le((-[^aeiou\W]*e+[^aeiou\W]*)+)$ ]]
then
printf "\""$line"\"\t\t\t-> True\n";
else
printf "\""$line"\"\t\t\t-> False\n";
fi
done < <(cat "$@")
It does everything fine, except one thing:
It says true no matter how many hyphens are next to each other.
For example:
It says true for this string "le--le"
I tried this regex expression on websites (like this) and they worked without this malfunction.
All I can think of is that there must be some difference between the web page and Bash on Linux. (All I can see on the web page is that it runs PHP.)
Do you have any idea how I could make it work?
Thank you for your answers!
sweaver2112 rightly points out that the \W is causing you problems, but fails to provide a working example of a bash test regex that does what you ask (at least, i couldn't get it to work).
this seems to do it (adapting Laurel's consonant regex):
[[ "$line" =~ ^le(-[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*)+$ ]]
it matches (e.g.):
le-e
le-e-le
le-e-e-e-e-e
and more generally:
le-([[:consonant:]]*e[[:consonant:]]*)+
and doesn't match (e.g.):
le-
le--le
le-lea-le
also, you can write it more cleanly this way:
c='[b-df-hj-np-tv-z]'
[[ "$line" =~ ^le(-$c*e$c*)+$ ]]
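A quick way to check this against the sample strings above is a small loop. This is a sketch only: the check helper is a made-up name, and the regex is kept in a variable so Bash does not need any quoting tricks inside [[ ]].

```shell
#!/usr/bin/env bash
c='[b-df-hj-np-tv-z]'        # consonants only
re="^le(-$c*e$c*)+\$"        # "le", then one or more "-<consonants>e<consonants>" groups

# check STRING: prints True if STRING satisfies the rules, False otherwise.
# (hypothetical helper for illustration)
check() {
  if [[ $1 =~ $re ]]; then echo True; else echo False; fi
}

for s in le-e le-e-le le-e-e-e-e-e le- le--le le-lea-le; do
  printf '"%s" -> %s\n' "$s" "$(check "$s")"
done
```

The first three strings should print True and the last three False.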
There's at least one problem with your regex: [^aeiou\W] - a negated "non-word", means "word" - and it matches any letter, consonants included. Character classes are inclusive, not exclusive. We're better off just listing all the consonants (and for your case, we'll add 'e' and '-' to the set as well).
So try this one: (edit: using @Laurel's more concise char class)
`(?=^le-)(?!.*--)(?!.*-[^-]*e[^-]*e[^-]*-)[b-hj-np-tv-z-]*[^-]$`
(?=^le-) starts with 'le-'
(?!.*--) no double dashes allowed
(?!.*-[^-]*e[^-]*e[^-]*-) do NOT see two e's between dashes
[b-hj-np-tv-z-]* - consume consonants, e, and dashes (same as [bcdfghjklmnpqrstvwxyze-])
[^-]$ last character must be non-dash

foo[E1,E2,...]* glob matches desired contents, but foo[E1,E2,...]_* does not?

I saw something weird today in the behaviour of the Bash Shell when globbing.
So I ran an ls command with the following Glob:
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]* | grep ":"
the result was as expected
GM12878_Hs_InSitu_MboI_rE1_TagDirectory:
GM12878_Hs_InSitu_MboI_rE2_TagDirectory:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rG1_TagDirectory:
GM12878_Hs_InSitu_MboI_rG2_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
however when I change the same regex by introducing an underscore to this
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_* | grep ":"
my expected result is the complete set as shown above, however what I get is a subset:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
Can someone explain what's wrong in my logic when I introduce an underscore sign before the asterisk?
I am using Bash.
You misunderstand what your glob is doing.
You were expecting this:
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]*
to be a glob of files that have any of those comma-separated segments but that's not what [] globbing does. [] globbing is a character class expansion.
Compare:
$ echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
to what you were trying to get (which is brace {} expansion):
$ echo GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}
GM12878_Hs_InSitu_MboI_rE1 GM12878_Hs_InSitu_MboI_rE2 GM12878_Hs_InSitu_MboI_rF GM12878_Hs_InSitu_MboI_rG1 GM12878_Hs_InSitu_MboI_rG2 GM12878_Hs_InSitu_MboI_rH
You wanted that latter expansion.
Your expansion uses a character class which matches the character E-H, 1-2, and ,; it's identical to:
GM12878_Hs_InSitu_MboI_r[EFGH12,]_*
which, as I expect you can now see, isn't going to match any two character entries (where the underscore-less version will).
* in filesystem globs is not like * in regex. In a regex * means "0 or more of the preceding pattern," but in filesystem globs it means "anything at all of any size". So in your first example, the _ is just part of the "anything" from the * but in the second you're matching any single character within your character class (not the patterns you seem to be trying to define) followed by _ followed by anything at all.
Also, character classes don't work the way you're trying to use them. [...] will match any character within the brackets, so your pattern is actually the same as [EFGH12,] since those are all the letters in class you define.
To get the grouping of patterns you want, you should use { instead of [ like
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_* | grep ":"
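The difference is easy to reproduce in a scratch directory. The directory names below are shortened, hypothetical stand-ins for the ones in the question:

```shell
#!/usr/bin/env bash
# Scratch directory with made-up names modelled on the question.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir rE1_Tag rE2_Tag rF_Tag rG1_Tag rG2_Tag rH_Tag

# [..] is a character class: exactly ONE of E 1 , 2 F G H, then "_".
bracket_matches=$(echo r[E1,E2,F,G1,G2,H]_*)

# {..} is brace expansion: six separate globs are generated before matching.
brace_matches=$(echo r{E1,E2,F,G1,G2,H}_*)

echo "bracket: $bracket_matches"
echo "brace:   $brace_matches"

cd "$OLDPWD" && rm -rf "$tmp"
```

The bracket form should list only rF_Tag and rH_Tag, while the brace form should list all six directories.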
As far as I know, and this article supports me, the square brackets don't work as a choice but as a character set, so using [E1,E2,F,G1,G2,H] actually is equivalent to exactly one occurrence of [EGHF12,]. You can then interpret the second result as "one character of EGHF12, and an underscore", which matches GM12878_Hs_InSitu_MboI_rF_TagDirectory: but not GM12878_Hs_InSitu_MboI_rG1_TagDirectory: (there the r is followed by more than "one occurrence of...").
The first pattern works because you used the asterisk, which matches what is missed by the wrong [...].
A correct expression would be:
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}* | grep ":"

How to substitute a string even it contains regex meta characters using Shell or Perl?

I want to replace a word which may contain regex metacharacters with another word, for example, replace .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change below line
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution can disable interpreted regex meta characters, it will not have the side effect.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
Can anybody help this? Many thanks.
Please note that the word I want to substitute is NOT hard coded, it comes from a input file, you can consider it as variable.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and @:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
sed 's/\.Precilla123/.Precill/g'
Escape the meta character with \.
Be careful: the metacharacters are not the same on both sides. The search pattern is a regex, where []{}()\.*+^$ can be special, whereas in the replacement only & and \ are special (plus the separator, which is whatever character follows the s).
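When the word comes from a variable rather than being hard coded, the escaping can be done programmatically before calling sed. The following is a sketch under a few assumptions: the helper names are hypothetical, the strings contain no newlines, sed is used with its default basic regex syntax, and the sample word with $ and @ is invented for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: make a literal string safe for sed's s/PATTERN/REPLACEMENT/.
# Pattern side: backslash-escape ] [ \ / . * ^ $
escape_pattern()     { printf '%s\n' "$1" | sed 's|[][\\/.*^$]|\\&|g'; }
# Replacement side: backslash-escape \ / &
escape_replacement() { printf '%s\n' "$1" | sed 's|[\\/&]|\\&|g'; }

# literal_replace FROM TO: filter stdin, replacing every literal FROM with TO.
literal_replace() {
  sed "s/$(escape_pattern "$1")/$(escape_replacement "$2")/g"
}

from='.Pre$cilla@123'   # made-up word containing ., $ and @
to='.Precill'
printf '%s\n' '"Pre$cilla@123";.Pre$cilla@123' | literal_replace "$from" "$to"
```

Only the occurrence that really starts with a dot should be replaced; the quoted one should be left alone.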

Regex to find text between second and third slashes

I would like to capture the text that occurs after the second slash and before the third slash in a string. Example:
/ipaddress/databasename/
I need to capture only the database name. The database name might have letters, numbers, and underscores. Thanks.
How you access it depends on your language, but you'll basically just want a capture group for whatever falls between your second and third "/". Assuming your string is always in the same form as your example, this will be:
/.*/(.*)/
If multiple slashes can exist, but a slash can never exist in the database name, you'd want:
/.*/(.*?)/
/.*?/(.*?)/
In the event that your lines always have / at the end of the line:
([^/]*)/$
Alternate split method:
split("/")[2]
The regex would be:
/[^/]*/([^/]*)/
so in Perl, the regex capture statement would be something like:
($database) = $text =~ m!/[^/]*/([^/]*)/!;
Normally the / character is used to delimit regexes but since they're used as part of the match, another character can be used. Alternatively, the / character can be escaped:
($database) = $text =~ /\/[^\/]*\/([^\/]*)\//;
You can even more shorten the pattern by going this way:
[^/]+/(\w+)
Here \w includes characters like A-Z, a-z, 0-9 and _
I would suggest giving a split function priority, since I have seen better performance from it than from regex functions wherever it can be used.
You can use the explode function in PHP, or split in other languages, to do this operation.
anyways, here is regex pattern:
/[\/]*[^\/]+[\/]([^\/]+)/
I know you specifically asked for regex, but you don't really need regex for this. You simply need to split the string by delimiters (in this case a forward slash), then choose the part you need (in this case, the 3rd field - the first field is empty).
cut example:
cut -d '/' -f 3 <<< "$string"
awk example:
awk -F '/' '{print $3}' <<< "$string"
perl expression, using split function:
(split '/', $string)[2]
etc.
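For comparison, here is a small Bash sketch that runs the regex approach and the split approaches side by side on the example string. The function name second_field is made up for illustration.

```shell
#!/usr/bin/env bash
# second_field STRING: print the text between the second and third slash,
# using Bash's own regex matching. (hypothetical helper)
second_field() {
  local re='^/[^/]*/([^/]*)/'
  [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

string='/ipaddress/databasename/'

second_field "$string"                    # regex capture group
cut -d '/' -f 3 <<< "$string"             # field 1 is the empty text before the first /
awk -F '/' '{print $3}' <<< "$string"     # same field numbering as cut
```

All three commands should print databasename.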

regex replace: '[A-Z]'' to [A-Z]' - I can't preserve the letter in the string

My Google-fu is failing me...
I have a file (well over 2 GB) that has a SQL format problem. So I need a regex that will update the following examples (remember, I don't know how many there are or what the letters are):
'N'' should be changed to N'
'L'' should be changed to L'
etc
I've tried (within VIM and sed):
s/'[A-Z]''/$1'/
but that just produces:
'N'' -> '$1'
A backreference in sed is \1, not $1. You also need to capture the letter using \(\) (and probably use the global flag g).
Your sed expression should be:
s/'\([A-Z]\)''/\1'/g
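A quick check of that expression on a sample line; the fix_quotes wrapper and the SQL-like input are made up for the demonstration:

```shell
#!/usr/bin/env bash
# fix_quotes: filter stdin, turning 'X'' into X' for any uppercase letter X.
# (hypothetical wrapper around the sed expression above)
fix_quotes() {
  sed "s/'\([A-Z]\)''/\1'/g"
}

printf '%s\n' "VALUES ('N'', 'L'', 'ok')" | fix_quotes
# prints: VALUES (N', L', 'ok')
```

For the real file the same expression can be applied with sed -i, or with output redirected to a new file, which is safer for a file of that size.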
Give this a shot:
sed "s/\([[:alpha:]]'\)'/\1/g" file
Example Output
$ sed "s/\([[:alpha:]]'\)'/\1/g" <<<"aBcD''eg''H'i"
aBcD'eg'H'i
Note: Since you said you didn't know what letters they would be I assumed they could be lower case. If you know for a fact they are always uppercase, then change [[:alpha:]] to [[:upper:]]. These character classes are preferred over [A-Za-z] and [A-Z], respectively, because they will always work as you expect no matter the locale.