How to use regular expressions in wget for rejecting files?

I am trying to download the contents of a website using the wget tool. I used the -R option to reject some file types, but there are other files which I don't want to download. These files are named as follows and don't have any extensions:
string-ID
for example:
newsbrief-02
How can I tell wget not to download these files (the files whose names start with a specified string)?

Since (apparently) v1.14, wget accepts regular expressions: --reject-regex and --accept-regex (with --regex-type posix by default; this can be set to pcre if wget was compiled with libpcre support).
Beware that it seems you can use --reject-regex only once per wget call. That is, you have to use | in a single regex if you want to select on several patterns:
wget --reject-regex 'expr1|expr2|…' http://example.com
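Applied to the question's files, a sketch (http://example.com is a placeholder, and newsbrief- stands in for whatever prefix the unwanted names share):
wget -r --reject-regex 'newsbrief-' http://example.com
--reject-regex is matched against the complete URL, so an unanchored pattern like this should be enough.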

You cannot specify a regular expression with wget's -R option, but you can specify a glob pattern (like a filename pattern in a shell).
The answer looks like:
$ wget -R 'newsbrief-*' ...
You can also use ? and character classes ([]).
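For instance, to reject only the two-digit variants from the question:
$ wget -R 'newsbrief-[0-9][0-9]' ...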
For more information see info wget.

Related

Script to remove characters from a file name for all files in a folder

So basically, I want to write a script that would be able to remove characters from a file name until it hits a letter. For example, if I were to run it in a folder containing the files:
13. abc
0 2 d ef
1.ghi3
It would rename the files to
abc
d ef
ghi3
Try the following:
for f in *; do
    # strip everything up to the first letter; echo makes this a dry run
    echo mv "$f" "$(printf '%s\n' "$f" | sed 's/^[^[:alpha:]]*//')"
done
For safety, the mv command is prefixed with echo; remove the echo to perform the actual renaming.
The above is a POSIX-compliant implementation.
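For instance, with the three files from the question, the dry run prints something like:
$ for f in *; do echo mv "$f" "$(printf '%s\n' "$f" | sed 's/^[^[:alpha:]]*//')"; done
mv 0 2 d ef d ef
mv 1.ghi3 ghi3
mv 13. abc abc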
Note that rename is NOT a POSIX utility, so you can:
neither rely on its presence,
nor rely on it to work the same across platforms.
An overview of popular platforms with respect to rename:
Debian-based platforms such as Ubuntu have a Perl-based rename utility:
It expects Perl expressions, most notably s/// to perform substitutions based on regular expressions.
This is what Avinash Raj's answer relies on; a great option if available.
It supports dry runs with -n.
Fedora has an entirely different utility that comes from the util-linux package:
It supports replacement of literal substrings only.
It has NO dry-run support.
macOS has NO rename utility at all.
Via Homebrew you can install a Perl-based one (brew install rename) whose features are a superset of what the Perl-based implementation on Debian-based platforms offers.
You may use the rename command.
rename 's/^[^a-z]+//' *
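If yours is the Perl-based rename described above, you can preview the result with a dry run first:
rename -n 's/^[^a-z]+//' *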

How do I make Wget name files as part of URL?

Short story:
I want Wget to name downloaded files after what the regex group ([^/]*) captures:
wget -r --accept-regex="^.*/([^/]*)/$" $MYURL
Full story:
I use GNU Wget to recursively download one specific folder of a particular WordPress website. I use a regex to accept only posts and nothing else. Here is how I use it:
wget -r --accept-regex="^.*/([^/]*)/$" $MYURL
It works, and Wget follows all the desired URLs. However, it saves files as .../last_directory/index.html, whereas I want these files to be saved as last_directory.html (the .html part is optional).
Is there a way to do that with Wget alone? Or would you suggest how to do the same thing with sed or similar tools?
You could use sed.
wget -r --accept-regex="^.*/([^/]*)/$" $MYURL | sed 's~\(.*\)/[^.]*~\1~'
Example:
$ echo '/foo/last_directory/index.html' | sed 's~\(.*\)/[^.]*~\1~'
/foo/last_directory.html
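wget itself reports progress on stderr rather than printing the saved paths on stdout, so in practice the sed expression describes the rewrite to apply to the files on disk after the download. A sketch of that step, assuming each post directory contains only index.html:
$ find . -name index.html | while IFS= read -r f; do
      d=${f%/index.html}
      mv "$f" "$d.html" && rmdir "$d"
  done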

How to get wget to accept files with no suffix

I'm using wget (from Perl) to get web pages from a site. I'm really only interested in the html, htm, php, asp, and aspx file types. However, at least one site supplies links using file names with no extension/suffix. I need those too.
My command:
wget -A html,htm,php,asp,aspx
works great, except for the no-suffix links.
I've tried a number of regex strings to try to get the no-suffix pages, but to no avail. wget returns just the main page. So far, the only way to get these files is to open it up to all files (which isn't terrible for this website, but would be terrible for others).
Is there a way, with a regex or otherwise, to tell wget that I want the links with no suffix?
wget version 1.14 seems to support an --accept-regex argument which is matched against the full URL, i.e. something like the following should in theory work (untested):
wget --accept-regex '/[^.]+(?:\.(?:html?|php|aspx?))?$'
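Note that the (?:…) groups are PCRE syntax; wget's --regex-type defaults to posix, so either add --regex-type pcre (if your wget was built with libpcre) or use plain groups, which POSIX ERE supports (equally untested):
wget --accept-regex '/[^.]+(\.(html?|php|aspx?))?$'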
Or perhaps it would be easier to just reject those extensions you do not want?

How can I use regex with the locate command in Linux

I want to use the locate command with a regex, but I am not able to.
I want to find the pip file, which is in the /usr folder. I am trying this:
locate -r "/usr/*pip"
To use globbing characters in your query you shouldn't enable regex mode (as you do with the -r option), so just do:
locate "/usr/*pip"
From the man page:
If --regex is not specified, PATTERNs can contain globbing characters.
If any PATTERN contains no globbing characters, locate behaves as if
the pattern were *PATTERN*.
I would do it like this: locate -r '/usr/.*pip'
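The difference between the two: with -r the pattern is a regular expression, in which * quantifies the preceding character, so /usr/*pip would only match paths like /usrpip or /usr/pip; .* is the regex equivalent of the glob *. Both of the following should therefore find a path such as /usr/bin/pip (assuming pip lives there and the locate database is current):
locate "/usr/*pip"
locate -r '/usr/.*pip'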

inotifywait - exclude regex pattern formatting

I am trying to use inotifywait to watch all .js files under my ~/js directory; how do I format my regex inside the following command?
$ inotifywait -m -r --exclude [REGEX HERE] ~/js
The regex, which according to the man page should be a POSIX extended regular expression, needs to match "all files except those that end in .js", so that those files can in turn be excluded by the --exclude option.
I've tried the (?!) lookaround thing, but it doesn't seem to work in this case. Any ideas or workarounds?
I've tried the (?!) thing
This thing is called negative lookahead and it is not supported by POSIX ERE.
So you have to do it the hard way, i.e. match everything that you want to exclude.
e.g.
\.(txt|xml) etc.
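For example, a sketch for the question's directory, assuming the only non-.js files in the tree end in .txt or .xml:
$ inotifywait -m -r --exclude '\.(txt|xml)$' ~/js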
inotifywait has no include option and POSIX extended regular expressions don't support negation. (Answered by FailedDev)
You can patch the inotify tools to get an --include option. But you need to compile and maintain it yourself. (Answered by browndav)
A quicker workaround is using grep.
$ inotifywait -m -r ~/js | grep '\.js$'
But be aware of grep's buffering if you pipe the output to other commands. Add --line-buffered to make it work with while read. Here is an example:
$ inotifywait -m -r ~/js | grep --line-buffered '\.js$' |
    while read -r path events file; do
        echo "$events happened to $file in $path"
    done
If you just want to watch files that already exist, you can also use find to generate the list of files. Newly created files will not be watched.
$ find ~/js -name '*.js' -print0 | xargs -0 inotifywait -m
If all your files are in one directory, you can also use ostrokach's suggestion. In that case shell expansion is much easier than find and xargs. But again, it won't watch newly created files.
$ inotifywait -m ~/js/*.js
I posted a patch here that adds --include and --includei options that work like negations of --exclude and --excludei:
https://github.com/browndav/inotify-tools/commit/160bc09c7b8e78493e55fc9f071d0c5575496429
Obviously you'd have to rebuild inotify-tools, and this is relatively untested, but hopefully it can make it into mainline or be helpful to someone who comes across this post later.
Make sure you quote the regex if it contains shell-relevant characters (including parentheses).
While this works:
inotifywait --exclude \.pyc .
this does not:
inotifywait --exclude (\.pyc|~) .
You have to quote the entire regular expression:
inotifywait --exclude '.*(\.pyc|~)' .
As of version 3.20.1, inotifywait does include the --include and --includei options.
To see them, run inotifywait --help. For some reason, they aren't documented in the manpages.
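With such a version, the question's goal becomes a one-liner (a sketch, assuming inotifywait 3.20.1 or later):
$ inotifywait -m -r --include '\.js$' ~/js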
You could get most of this with --exclude '\.[^j][^s]' to ignore files unless they contain .js at some point in the filename or path. If you combine it with -r then it will work with arbitrary levels of nesting.
The only drawback is that near-misses such as test.json, files with no dot in their name at all, and such files inside a directory called example.js/ will still be watched, but this is probably somewhat unlikely to matter.
You could probably extend this regex to fix this, but personally I don't think the drawbacks are a big enough deal to worry about.