Regex matching filenames - regex

I know this will sound silly to some of you, but I am not good with regular expressions. I came across the following expressions in a function someone else wrote and can't figure out what they were doing.
REGEX 1
[ ! -d ${2%/*}/ ]
REGEX 2
cmp -s $2 ${2##*/}
As you can guess, these expressions are used in a script that updates files and moves them around. I was wondering about the meaning of
${2%/*}/
and
${2##*/}

Let's take an example to understand better:
s='abc/def/foo'
echo "${s%/*}/"
abc/def/
echo "${s##*/}"
foo
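The first expression discards the text after the last / in the input (the example then appends a literal / again).
The second expression discards all the text up to and including the last /.
Neither is a regex: both are shell parameter expansions that take glob patterns. You can find more details in man bash under Parameter Expansion:
${var##*/} removes the longest prefix matching */, that is, everything through the last /.
${var%/*} removes the shortest suffix matching /*, that is, the last / and everything after it.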
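To make the difference between the single and doubled operators concrete, here is a minimal sketch. The path in p is a made-up value used only for illustration.

```shell
#!/bin/bash
# Hypothetical path, used only for illustration.
p='abc/def/foo.tar.gz'

dir_part=${p%/*}     # remove shortest suffix matching "/*"  -> abc/def
top_part=${p%%/*}    # remove longest suffix matching "/*"   -> abc
base_part=${p##*/}   # remove longest prefix matching "*/"   -> foo.tar.gz
tail_part=${p#*/}    # remove shortest prefix matching "*/"  -> def/foo.tar.gz

printf '%s\n' "$dir_part" "$top_part" "$base_part" "$tail_part"
```

Read this way, [ ! -d ${2%/*}/ ] tests whether the directory part of the second argument does not exist, and cmp -s $2 ${2##*/} silently compares that file with a file of the same name in the current directory.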

Related

How to replace spaces after a certain pattern with commas?

I am new to coding and I'm trying to format some bioinformatics data. I am trying to replace all the spaces after GT:GL:GOF:GQ:NR:NV with commas, but not anything outside of the format xx:xx:xx:xx:xx (like the example). I know I need to use sed with a regex, but I'm not very familiar with how to use it. I've never actually used sed before and got confused trying, so any help would be appreciated. Sorry if I formatted this poorly (this is my first post).
EDIT 2: I got actual data from the file this time which may help solve the problem. Removed the bad example.
New Example: I pulled this data from my actual file (this is just two samples), and it is surrounded by other data. Essentially the line has a bunch of data followed by "GT:GL:GOF:GQ:NR:NV ", after this there is more data in the format shown below, and finally there is some more random data. Unfortunately I can't post a full line of the data because it is extremely long and will not fit.
Input
0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0
Output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
With Basic Regular Expressions, you can use character classes and backreferences to accomplish your task, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\)[ ]\([0-9][0-9]*:[0-9][0-9]*\)/\1,\2/g' file
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT BB
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 10:13:12,41:41:1:13,13:131:1:1 AB GT RT
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT
Which basically says:
find and capture any [0-9][0-9]* one or more digits,
separated by a :, and
followed by [0-9][0-9]* one or more digits -- as capture group 1,
match a space following capture group 1, followed by capture group 2 (which uses the same pattern as capture group 1),
then replace the space separating the capture groups with a comma reinserting the capture group text using backreference 1 and 2 (e.g. \1 and \2), finally
make the replacement global (e.g. g) to replace all matching occurrences.
Edit Based On New Input Posted
If you still need all of the original commas added, and you now also want a comma between ,0 0/ (where a comma precedes a single digit, followed by the space to be replaced, followed by a single digit and a forward slash), then all you need to do is make each capture group an alternation: it either captures the original data as above or captures this new segment. You do that by including an OR between the alternatives (written \| in basic regex syntax; this is a GNU extension, so other sed implementations may need -E and a plain |).
For instance by adding \|,[0-9] at the end of the first capture group and \|[0-9][/] at the end of the second, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\|,[0-9]\)[ ]\([0-9][0-9]*:[0-9][0-9]*\|[0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
If you have other caveats in your file, I suggest you post several complete lines of input, and if they are too long, then create a zip, gzip, bzip or xz file and post it to a site like pastebin and add the link to your question.
If all you really care about now is the space in ,0 0/, then you can shorten the sed command to:
$ sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
(note: I've included [[:space:]] to handle any whitespace (space, tab, ...) instead of just the literal [ ] (space) in the new example)
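As a quick check of the [[:space:]] variant, the following sketch feeds the sample line through the same sed command, but with a tab instead of a space between the two samples. The tab is an assumption made for illustration; the posted data uses a space.

```shell
# Sample line from the question, joined by a tab (\t) instead of a space.
line=$(printf '0/1:-1,-1,-1:146:28:14,14:4,0\t0/1:-1,-1,-1:134:6:2,2:1,0')

# Same command as above: replace the whitespace between ",<digit>" and "<digit>/".
fixed=$(printf '%s\n' "$line" | sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g')

printf '%s\n' "$fixed"
```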
Let me know if this fixes the issue.
I'm assuming that the xx:xx:xx or xx:xx:xx:xx can have any number of parts, since some have 3, and some have 4.
This is quite difficult to do reliably with sed, as it does not support lookarounds, which seem like they might be needed for this example.
You can try something like:
perl -pe 's/(?<=\d) (?=\d+(:\d+){2,})/,/g' input.txt
If you've got your heart set on sed, you can try this, but it may miss some cases:
sed -r 's/(:[0-9]+) ([0-9]+:)/\1,\2/g' input.txt
Could you please try the following. It also prints the lines that do NOT match the regex. The regex in match() could probably be made shorter with an interval expression (something like ([0-9]+:){4}[0-9]+), but I only had an old awk to test on, so I couldn't verify that.
awk '
BEGIN{
OFS=","
}
match($0,/GT:GL:GOF:GQ:NR:NV [0-9]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+/){
value=substr($0,RSTART!=1?1:RSTART,RSTART+RLENGTH-1)
value1=substr($0,RSTART+RLENGTH+1)
gsub(/[[:space:]]+/,",",value1)
print value,value1
next
}
1
' Input_file
You may also achieve your desired result without regex, using awk:
awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}' input.txt
Basically, it outputs from field 1 to 5 with the default field separator ("space"), then from field 5 to 7 with the comma separator, then from field 8 onwards with default separator again.
perl myscript.pl '0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'
myscript.pl,
#!/usr/local/ActivePerl-5.20/bin/env perl
my $input = $ARGV[0];
$input =~ s/ /\,/g;
print $input, "\n";
__DATA__
output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
This will remove all spaces, not just the space in question

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another brace inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to work only partially and I am not sure it's right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
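You can try using the sed command.
sed -E 's/\$\{[^}]*\}//g' <input_file> > <output_file>
Here sed removes every ${...} sequence, braces included, and writes the result to a new output file. The -E flag selects extended regular expressions; in sed's default basic syntax, \{ and \} delimit a repetition count rather than matching literal braces.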
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
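If the goal is to drop whole lines that contain a ${...} reference, the same pattern can be handed to grep -Ev. This is only a sketch; the sample lines are invented to stand in for the properties-style input file.

```shell
# Hypothetical sample lines standing in for the input file.
# Single quotes keep ${...} literal instead of letting the shell expand it.
input='hello ${VAR}
plain line
${HELLO_VAR} world
https://${WEB_VAR}
another=plain'

# -E: extended regex syntax, -v: print only the lines that do NOT match.
kept=$(printf '%s\n' "$input" | grep -Ev '\$\{[^}]*\}')

printf '%s\n' "$kept"
```

Because the pattern is not anchored with ^, it matches the reference anywhere on the line, which is where the attempt in the question falls short.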
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
while (match($0,/\$\{[^}]*\}/)) {
$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
}
1
The idea here is that for each line, we check whether the regex matches anything on the line. If it does, we replace the line with the parts on either side of the match, and repeat until nothing matches. (We could alternatively use sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
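For comparison, awk's gsub() removes every match on a line in a single call. The sketch below is an alternative to the loop, not the approach described above, and the input line is made up for illustration.

```shell
# Hypothetical input line with two ${...} references.
line='pre ${A} mid ${B_C} post'

# gsub() deletes every match in $0; the trailing 1 prints the (modified) line.
stripped=$(printf '%s\n' "$line" | awk '{ gsub(/\$\{[^}]*\}/, "") } 1')

printf '%s\n' "$stripped"
```

Note that the spaces on both sides of each removed reference remain, so the result contains double spaces.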
Just use the global wildcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use the wildcard on both sides to extend the regexp to both ends (it doesn't matter whether you anchor it with ^ and $, as the greedy algorithm will try to extend the match as far as possible). Note that the $, { and } must be escaped, as they are reserved by the regexp language.
This can be seen in action here.
note
the title of this question doesn't specify that the substring between the two curly braces should not have a }, and as you want only to match the whole line, then it is not necessary to check for something except a }, the only requirement is that } must be after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.

Grep a filename with a specific underscore pattern

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits, the lastname and firstname and city can vary on size
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to match the standard xx_ at the beginning, then a code of at least 3 letters followed by another underscore, and so on.
Could anybody help ?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
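The same pattern can be tried without touching the filesystem by matching candidate names with [[ ... == ... ]]. This is a sketch only; the filenames are hypothetical.

```shell
#!/bin/bash
shopt -s extglob   # enable the +( ) and @( ) pattern syntax

# The extglob pattern from above, kept in a variable so it can be reused.
pat='xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf)'

accepted=()
# Hypothetical filenames: the second has a 2-letter code, the last a wrong extension.
for name in xx_abc_smith_john_paris.doc xx_ab_smith_john_paris.doc \
            xx_abcd_lee_al_rome.docx xx_abc_smith_john_paris.txt; do
    # Unquoted $pat on the right of == is matched as a glob, not compared literally.
    [[ $name == $pat ]] && accepted+=("$name")
done

printf '%s\n' "${accepted[@]}"
```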
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
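One caveat if the pattern is used with grep -E as in the question: lazy quantifiers such as {3,}? come from Perl-style regex engines, and POSIX extended regex has no lazy quantifiers. The sketch below therefore uses the plain greedy quantifiers and adds ^ and $ anchors, which is an assumption about the intent (match the whole filename). The filenames are hypothetical.

```shell
# Hypothetical filenames for illustration.
names='xx_abc_smith_john_paris.doc
xx_ab_smith_john_paris.doc
xx_abcd_lee_al_rome.docx
yy_abc_smith_john_paris.pdf
xx_abc_smith_john_paris.txt'

# Greedy ERE version of the pattern, anchored to the whole name.
re='^xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)$'

matched=$(printf '%s\n' "$names" | grep -E "$re")
printf '%s\n' "$matched"
```

Since each name part may contain only letters and the parts are separated by literal underscores, greedy and lazy matching accept the same filenames here.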

Bash regex to match dots and characters

I'm trying to use the =~ operator to execute a regular expression pattern against a curl response string.
The pattern I'm currently using is:
name\":\"(\.[a-zA-Z]+)\"
Currently, however, this pattern only extracts values that contain only the characters a-z and A-Z. I need this pattern to also pick up values that contain a '.' character and a '@' character. How would I do this?
Also, is there any way this pattern can be improved performance wise? It takes quite a long time to execute against the string.
Cheers.
I recently ran into this problem in my script that sets my bash prompt according to my git status, and found that it was because of the placement of other things (namely, a hyphen) I wanted to match inside the expression.
For example, I wanted to match a certain part of a git status output, e.g. the part where it says "Your branch is ahead of 'origin/mybranch' by 1 commit."
This was my original pattern:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)' by ([0-9]+) commit".
One day I created a branch that had a . in it and found that my bash prompt wasn't showing me the right thing, and modified the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-.]+)' by ([0-9]+) commit".
I expected it to work just fine, but instead there was no match at all.
After reading a lot of posts, I realized it was because of the placement of the hyphen (-); I had to put it right after the first square bracket, otherwise it would be interpreted as a range (in this case, it was trying to interpret the range _-., which is invalid or just somehow makes the whole expression fall over).
It started working when I changed the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit".
So basically what I meant to say is that it could be something else in your expression (like the hyphen in mine) that is interfering with the matching of the dot and the at sign.
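A minimal sketch of the working form, with the hyphen placed first in the bracket expression so that it is taken literally. The status line and branch name are made up. In the C locale, _ sorts after ., which is why _-. reads as a reversed and therefore invalid range.

```shell
#!/bin/bash
# Hypothetical git status line with a dot and a hyphen in the branch name.
status="Your branch is ahead of 'origin/feature-1.2' by 3 commits."

# Hyphen first inside [...] means a literal hyphen, not a range.
re="Your branch is (ahead of|behind) '([-a-zA-Z0-9_]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit"

if [[ $status =~ $re ]]; then
    direction=${BASH_REMATCH[1]}
    remote=${BASH_REMATCH[2]}
    branch=${BASH_REMATCH[3]}
    count=${BASH_REMATCH[4]}
    printf '%s | %s | %s | %s\n' "$direction" "$remote" "$branch" "$count"
fi
```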
Working example script:
#!/bin/bash
regex='"name":"([a-zA-Z.@]+)"'
input='"name":"internal.action.retry.queue@temp"'
if [[ $input =~ $regex ]]
then
echo "$input matches regex $regex"
for (( i=0; i<${#BASH_REMATCH[@]}; i++))
do
echo -e "\tGroup[$i]: ${BASH_REMATCH[$i]}"
done
else
echo "$input does not match regex $regex"
fi
Just add a dot ('.') and an at sign ('@'):
name\":\"(\.[a-zA-Z.@]+)\"
If you don't need a mandatory dot at the beginning of the value, use this:
\"name\":\"([a-zA-Z.@]+)\"

Turning off greed not working in this regex

I am trying to run the following search (with . made to match newlines either by adding the /s flag in perl or replacing it with \_. in vim):
/<output_channels>.*(?=Story).*?<\/output_channels>/
However the ? isn't turning off greed as it normally does - can anyone explain why? For example, it matches the entire contents of the following file rather than just the first element:
<output_channels>
<output_channel>RSS</output_channel>
<output_channel>Story</output_channel>
</output_channels>
<output_channels>
<output_channel>RSS</output_channel>
</output_channels>
Sorry if I'm missing something obvious.
I put your sample text into a vim buffer, and then executed the command
:%!perl -e '$text = join("", <STDIN>); $text =~ /<output_channels>.*(?=Story).*?<\/output_channels>/s; print $&;'
The result is just the first block of XML. I think this is what you want?
Note that I escaped the / within the regex. Other than this, it is the same one given in your question.
Also note that the equivalent vim RE would be (tested, works):
<output_channels>\_.*\(story\)\@=\_.\{-}<\/output_channels>
See :help perl-patterns for a rundown of the differences between perl and vim REs.
Further note that parsing hierarchical markup with regexps has been known to reawaken ancient demons.
The first .* in your regex is still greedy. You only added ? after the second one.