sed to strip multi-part file extension from file pathname - regex

I'm trying to compose a sed command to remove all trailing extensions from file names, including those that have more than one in sequence separated by '.', e.g.:
/a/b/c.gz -> /a/b/c
/a/b/c.tar.gz -> /a/b/c rather than /a/b/c.tar
Notice that only the filename should be truncated; dots on parent directories are to be preserved.
/a/b.c/d.tar.gz -> /a/b.c/d
never
/a/b.c/d.tar, /a/b or /a/b/d
Therefore, simply removing everything after the first '.' is not a solution.
I have a command that works OK as long as there is at least one '/' in the file name (or rather, the path). I'm not sure how to extend it to also cover the single-element (filename-only) case:
sed 's/^\(.*\/[^.\/]*\)[^\/]*$/\1/' list_of_filepaths.txt \
> output_filepaths_wo_extensions.txt
So, the command above does the right thing with:
./abc.tar.gz, parent/.../abc.tar.gz, /abc.tar.gz
It does not work for single element (only filename) cases:
abc.tar.gz
Of course, this is not surprising since it isn't matching the slash '/' anywhere.
Although adding a second sed command to deal with the '/' free case is trivial, I would like to cover all cases with a single command as it seems to me that it should be possible.
For example, I was hoping that this one would work, but it does not work for either case:
sed 's/^\(.*?\/\)?\([^.\/]*\)[^\/]*$/\1\2/'
So, in this attempt of mine, the first (additional) group would capture the optional '/' containing prefix preceding the last '/'. In case of a slash free file-path that group would simply be empty.
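A likely reason the attempt fails: in sed's default (basic) regex syntax an unescaped `?` is a literal character, and the lazy `.*?` form is not supported at all. The optional-prefix idea itself is sound, and a minimal sketch in Python's `re` shows it covering both cases. The function name `strip_extensions` is made up for illustration. The optional directory part and the dot-free start of the filename sit in one group, so the replacement never refers to an unmatched group.

```python
import re

# Group 1 = optional directory prefix (everything up to the last '/')
# plus the filename up to its first '.'.
# The trailing [^/]* swallows all remaining extensions, which are dropped.
_EXT_RE = re.compile(r'^((?:.*/)?[^./]*)[^/]*$')

def strip_extensions(path):
    """Remove every trailing extension from the last path component."""
    return _EXT_RE.sub(r'\1', path)

for p in ('/a/b/c.gz', '/a/b/c.tar.gz', '/a/b.c/d.tar.gz',
          './abc.tar.gz', 'abc.tar.gz'):
    print(p, '->', strip_extensions(p))
```

Assuming a sed with extended-regex support (`-E`), the same pattern should carry over as `s/^((.*\/)?[^.\/]*)[^\/]*$/\1/`.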

Related

Optional grouping with Regex in Eclipse

I have a file with many different file path locations. Some of them have multiple directory depth and some don't. What I need to do is prepend a directory /WEB_ROOT/ to all file path locations in the file.
For example
index.jsp -> /WEB_ROOT/index.jsp
/instructor/assigned_appts.jsp -> /WEB_ROOT/instructor/assigned_appts.jsp
I have tried this one ([\/_]?[A-Za-z]*).jsp to try and capture the optional _ and / values but this doesn't match properly.
/instructor/assigned_appts.jsp only matches _appts.jsp
I have tried this as well ([\/_]?[A-Za-z])*.jsp which properly matches all expected file paths but when I replace I only get the last letter instead of the full group
So a replace with /WEB_ROOT/$1.jsp gives the following
index.jsp -> /WEB_ROOT/x.jsp
/instructor/assigned_appts.jsp -> /WEB_ROOT/s.jsp
Help please!
You can match the whole line, and as [\/_]? is optional, make sure that you match at least a single char A-Za-z before the .jsp
If you want to replace with group 1 like /WEB_ROOT/$1 you can also capture the .jsp
(.*[A-Za-z]\.jsp)
Not sure if this is supported in Eclipse, but you might also just take the whole match and use $0 instead of group 1
.*[A-Za-z]\.jsp
If .jsp is at the end of the string, you can append an anchor .*[A-Za-z]\.jsp$
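The single-letter result in the question comes from how quantified capture groups behave: a group that repeats keeps only its last iteration. The sketch below reproduces that with Python's `re`, which behaves the same way here as the question reports for Eclipse, and then applies the whole-line capture. The helper name `add_web_root` is hypothetical, and the optional leading `/` placed outside the group is an added assumption so that paths already starting with a slash do not end up with `/WEB_ROOT//`. Python writes the group reference as `\1` rather than `$1`.

```python
import re

# A quantified group keeps only its final iteration, hence the lone letter.
repeated = re.compile(r'([/_]?[A-Za-z])*\.jsp')
print(repeated.match('index.jsp').group(1))                       # x
print(repeated.match('/instructor/assigned_appts.jsp').group(1))  # s

def add_web_root(path):
    """Prepend /WEB_ROOT/ to a .jsp path, capturing the whole path at once."""
    # '^/?' consumes an existing leading slash so it is not duplicated.
    return re.sub(r'^/?(.*[A-Za-z]\.jsp)$', r'/WEB_ROOT/\1', path)

for p in ('index.jsp', '/instructor/assigned_appts.jsp'):
    print(p, '->', add_web_root(p))
```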

Regex that deletes everything except for any tags that contain a specific string

I need a regex that can be applied in the vim editor or in bash (grep command) that will delete everything in a file, leaving only the tags containing a specific string:
<generic>
stuff1
stuff2
stuff3
</generic>
and
<generic>
stuff1
stuff2
DESIRED_STRING
stuff3
</generic>
The first one would be wiped and the second one would remain because of the DESIRED_STRING.
At the end, I need a file with tons of tags that each contain a modifier. This process will be executed several times to separate one huge file into multiple others.
This (?<=\<custom_item\>).*?(?=\<\/custom_item\>) got me to a point where I could match the content inside of the tags. I am not able to filter it, though.
The file will always follow this structure
<tag>
system : "Linux"
type : CHECK
</tag>
Where 'CHECK' is the modifier and the word I am looking for
Thank you!!
You may use this approach using awk:
awk '/<generic>/ { tag=1 }
tag && /DESIRED_STRING/ { p=1 }
tag { s = s $0 RS }
/<\/generic>/ { if (p) printf "%s", s; tag=p=0; s="" }' file
We use 2 flags to track our state here. tag represents state when we are inside open and close tags and p represents a state when we find our desired string while inside the open/close tags.
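For readers more comfortable outside awk, the same two-flag state machine can be sketched in Python. This is a rough line-by-line translation, not a replacement for the awk command; the function name `keep_blocks` and its parameter names are invented for illustration.

```python
def keep_blocks(lines, needle, open_tag='<generic>', close_tag='</generic>'):
    """Return only the lines of tag blocks that contain `needle`."""
    kept = []        # output accumulated so far
    block = []       # lines of the block currently being read (awk: s)
    inside = False   # True between open and close tags (awk: tag)
    found = False    # True once needle was seen in this block (awk: p)
    for line in lines:
        if open_tag in line:
            inside = True
        if inside and needle in line:
            found = True
        if inside:
            block.append(line)
        if close_tag in line:
            if found:
                kept.extend(block)
            # Reset the state for the next block.
            inside = found = False
            block = []
    return kept

sample = """<generic>
stuff1
</generic>
<generic>
stuff1
DESIRED_STRING
</generic>""".splitlines()

print("\n".join(keep_blocks(sample, "DESIRED_STRING")))
```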
Here's an alternative, in Vim: it is much easier to match than to avoid matching, so:
Gmz:1,'z g/DESIRED_STRING/norm yat:$pu<Ctrl-V><Enter><Enter>'zdgg
where <Ctrl-V> and <Enter> are supposed to be keys, not actual text to be entered.
Gmz will set a z mark at the last line. Then, we search for the DESIRED_STRING, and at each one, yank the tag, then paste it to the bottom of the file (in order). Then 'zdgg to delete the original (from the mark z to the top of the file).
Basically, instead of trying to delete everything and making exceptions for the desired content, pull the desired content out first, then delete everything.
Bonus: This will work even with tags that don't align with line breaks (even though OP doesn't have those). For example,
outside<tag>inside
foo DESIRED_STRING inside</tag>outside
will correctly produce
<tag>inside
foo DESIRED_STRING inside</tag>
With Vim regex:
:%s/<\([^>]*\)>\(\_.\(DESIRED_STRING\)\@!\)\{-}<\/\1>//
This regex uses a negative lookahead, \@!, to match all blocks of text not containing DESIRED_STRING. These blocks are then removed with the :%s command.
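A rough Python analogue of the negative-lookahead approach, assuming the whole file fits in memory. It tests the lookahead before each character (the `(?:(?!X).)*?` arrangement), which is slightly different from the Vim pattern's ordering. The function name `drop_blocks_without` is illustrative. As with the `:%s` command, text outside any tag is left untouched.

```python
import re

def drop_blocks_without(text, needle):
    """Delete every <tag>...</tag> block that does not contain `needle`."""
    pattern = re.compile(
        r'<([^>]*)>'                              # opening tag, name in group 1
        r'(?:(?!' + re.escape(needle) + r').)*?'  # any char, unless needle starts here
        r'</\1>\n?',                              # matching close tag and its newline
        re.DOTALL,                                # let '.' cross line breaks
    )
    return pattern.sub('', text)

sample = (
    "<generic>\nstuff1\nstuff2\nstuff3\n</generic>\n"
    "<generic>\nstuff1\nstuff2\nDESIRED_STRING\nstuff3\n</generic>\n"
)
print(drop_blocks_without(sample, "DESIRED_STRING"), end="")
```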

REGEX that leaves one if more than one is present

I have to filter paths that can look like:
some_path//rest
some_path/rest
some_path\\\\rest
some_path\rest
I need to replace some_path//rest with FILTER
some_path/rest// -> FILTER/
some_path/rest\\ -> FILTER\
some_path/rest -> FILTER
some_path/rest/ -> FILTER/
some_path/rest\ -> FILTER\
I am using some_path[\\\\\\\/]+rest to match the middle, but if I use the same class at the end it consumes all the path separators.
I do not know in advance whether the separators will be / or \\; they can be mixed within a single path:
some_path/rest\some_more//and/more\\\\more
Consider using back references. Keep in mind that Python shows each \ escaped with a second \ in the output. This example seems to do what you are looking for:
>>> import re
>>> for test in ('some_path/rest//', 'some_path/rest\\', 'some_path/rest', 'some_path/rest/', 'some_path/rest\\'):
...     re.sub(r"some_path[\\/]+rest([\\/]?)\1*", r"FILTER\1", test)
...
'FILTER/'
'FILTER\\'
'FILTER'
'FILTER/'
'FILTER\\'
>>>
The \1 is a back reference to the previous () group. In the search, it is searching for any number of matches of that item. In the replace, it is just adding in the one item.
You can do it with a simple (without back reference) replace term by using a look ahead.
Use this regex to search:
some_path[\\\\/]+rest(?:([\\\\/])(?=\1))?
and replace the match with just 'FILTER':
re.sub(r"some_path[\\\\/]+rest(?:([\\\\/])(?=\1))?", 'FILTER', path)
This works by matching (ie consuming) the trailing slash only when it is doubled.
To allow for when there's no trailing slashes, the match for trailing slashes is made optional by wrapping in (?:...)? (which is non-capturing, so the back reference is \1, not \2 which is harder to read).
Note that you don't need quite so many backslashes in your regex.
Here's some test code:
import re

for path in ('some_path/rest//', 'some_path/rest\\', 'some_path/rest', 'some_path/rest/', 'some_path/rest\\'):
    print(path + ' -> ' + re.sub(r"some_path[\\\\/]+rest(?:([\\\\/])(?=\1))?", 'FILTER', path))
Output:
some_path/rest// -> FILTER/
some_path/rest\ -> FILTER\
some_path/rest -> FILTER
some_path/rest/ -> FILTER/
some_path/rest\ -> FILTER\

Exclude regular expression match if it contains a string

I'm still learning regular expressions and I seem to be stuck.
I wanted to write a reg exp that matches URL paths like these that contain "bulk":
/bulk-category_one/product
/another-category/bulk-product
to only get the product pages, but not the category pages like:
/bulk-category_one/
/another-category/
So I came up with:
[/].*(bulk).*[/].+|[/].*[/].*(bulk).*
But there's pagination, so when I put the reg exp in Google Analytics, I'm finding stuff like:
/bulk-category/_/showAll/1/
All of them have
/_/
and I don't want any URL paths that contain
/_/
and I can't figure out how to exclude them.
I would go about it this way:
/[^/\s]*bulk[^/]*/[^/\s]+(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*(?!/)
first part:
/ - match the slash
[^/\s]* - match everything that's not a slash and not a whitespace
bulk - match bulk literally
[^/]* - match everything that's not a slash
/ - match the slash
[^/\s]+ - match everything that's not a slash and not a whitespace
(?!/) - ensure there is not a slash afterwards (i.e. url has two parts)
The second part is more of the same, but this time 'bulk' is expected in the second part of the url not the first one.
If you need the word 'product' specifically in the second part of the url one more alternative would be required:
/[^/\s]*bulk[^/]*/[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*product[^/\s]*bulk[^/\s]*(?!/)
If I apply a simple regex to a file FILE
egrep ".*bulk.*product" FILE
which contains your examples above, it only matches the 2 lines with bulk and product. We can, additionally, exclude '/_/':
egrep ".*bulk.*product" FILE | egrep -v "/_/"
Two invocations are often much easier to write and to understand than one big one-size-fits-all expression.
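The same include-then-exclude idea can be sketched in Python with two small patterns instead of one large one. The function name `product_paths` is illustrative, and the last sample path is hypothetical (it is not from the question) and is included only to show the exclusion step doing some work.

```python
import re

INCLUDE = re.compile(r'bulk.*product')  # same idea as: egrep ".*bulk.*product"
EXCLUDE = re.compile(r'/_/')            # same idea as: egrep -v "/_/"

def product_paths(paths):
    """Keep paths that match INCLUDE and do not match EXCLUDE."""
    return [p for p in paths if INCLUDE.search(p) and not EXCLUDE.search(p)]

paths = [
    '/bulk-category_one/product',
    '/another-category/bulk-product',
    '/bulk-category_one/',
    '/another-category/',
    '/bulk-category/_/showAll/1/',
    '/bulk-category/_/product/1/',  # hypothetical paginated path
]
print(product_paths(paths))
```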

How can I use perl regex to remove the first directory name (top level) of a string

I'm making a Wakaba image board using the perl script I can download. However one thing that has perplexed me is the function "expand_filename($)" which will expand the path of the filename.
For every file, including my images, it adds /~ponydash/ to the end (ponydash is the name of my account on the hosting), so I created a debug function to see what it returns:
sub debug_string()
{
    my ($filename)=@_;
    return $filename if($filename=~m!^/!);
    return $filename if($filename=~m!^\w+:!);
    my ($self_path)=$ENV{SCRIPT_NAME}=~m!^(.*/)[^/]+$!;
    return $self_path;
}
And when called in the HTML document with
<var debug_string()>
It would return:
/~ponydash/b/
Now I want to know how I could modify the third to last line to remove the /~ponydash/ part to just leave /b/.
This should return only the second path part to the end of the path:
^\/[^\/]*(\/.*)$
The first / and all following non-slash characters are skipped up to the second slash, which is captured along with the rest of the string.
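A quick way to see what the pattern captures is to try it on the value from the question. The sketch below uses Python's `re` with the pattern unchanged (the escaped slashes are not needed there); the function name `strip_top_dir` is made up for illustration, and returning the input unchanged when nothing matches is an assumption.

```python
import re

# Leading '/', then the first directory name, then capture from the
# second '/' to the end of the string.
_TOP_DIR_RE = re.compile(r'^/[^/]*(/.*)$')

def strip_top_dir(path):
    """Drop the first directory of an absolute path, if there is one."""
    m = _TOP_DIR_RE.match(path)
    return m.group(1) if m else path

print(strip_top_dir('/~ponydash/b/'))  # /b/
```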