How do I make Wget name files as part of URL? - regex

Short story:
I want Wget to name each downloaded file after the part of its URL matched by the regex group ([^/]*):
wget -r --accept-regex="^.*/([^/]*)/$" $MYURL
Full story:
I use GNU Wget to recursively download one specific folder under a particular WordPress website. I use a regex to accept only posts and nothing else. Here is how I use it:
wget -r --accept-regex="^.*/([^/]*)/$" $MYURL
It works and Wget follows all the desired URLs. However, it saves files as .../last_directory/index.html, but I want these files to be saved as last_directory.html (.html part is optional).
Is there a way to do that with Wget alone? Or would you suggest how to do the same thing with sed or similar tools?

You could use sed.
wget -r --accept-regex="^.*/([^/]*)/$" $MYURL | sed 's~\(.*\)/[^.]*~\1~'
Example:
$ echo '/foo/last_directory/index.html' | sed 's~\(.*\)/[^.]*~\1~'
/foo/last_directory.html
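The sed expression only transforms text, so one way to apply the same idea to files Wget has already saved is a rename pass after the download. A minimal sketch, assuming the mirror lives under a directory passed as the first argument; the function name flatten_index is made up for illustration:

```shell
# Hypothetical helper: rename every DIR/index.html below "$1" to DIR.html.
flatten_index() {
    # -mindepth 2 skips an index.html sitting directly in the root directory.
    find "$1" -mindepth 2 -type f -name index.html -exec sh -c '
        for f do
            # .../last_directory/index.html -> .../last_directory.html
            mv -- "$f" "${f%/index.html}.html"
        done' sh {} +
}

# Usage (assumption: wget saved the site under ./example.com):
# flatten_index ./example.com
```

The now-empty directories are left in place; they could be removed afterwards if they are not needed.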

Related

Regex with inotifywait to compile two types of file in golang

I use a script to auto-compile in golang with inotifywait. But this script only checks files with the extension .go. I also want to add the .tmpl extension, but the script uses regular expressions. What changes do I have to make to this line to get the desired result?
inotifywait -q -m -r -e close_write -e moved_to --exclude '[^g][^o]$' $1
I've tried to concatenate with | or & and other things like ([^t][^m][^p][^l]|[^g][^o])$ but nothing seems to work.
Rather than trying to use a regex to exclude two types of file, why don't you just only watch those files?
inotifywait -q -m -r -e close_write -e moved_to /path/**/*.{go,tmpl}
To use the ** (which does a recursive match), you may have to enable bash's globstar:
shopt -s globstar
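If files created later also need to be picked up, another option is to watch the whole tree and filter events by extension in the shell. A sketch only: wanted_ext and watch_sources are hypothetical names, and the echo stands in for the compile step.

```shell
# Hypothetical helper: succeed only for names ending in .go or .tmpl.
wanted_ext() {
    case $1 in
        *.go|*.tmpl) return 0 ;;
        *)           return 1 ;;
    esac
}

# Watch everything under "$1" and react only to matching files.
watch_sources() {
    inotifywait -q -m -r -e close_write -e moved_to --format '%w%f' "$1" |
    while IFS= read -r path; do
        if wanted_ext "$path"; then
            echo "changed: $path"   # placeholder for the real build command
        fi
    done
}

# Usage: watch_sources /path/to/project
```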

Bash script to change file extension using regex

I have a lot of files I've copied over from my iPhone file system. They were originally mp3 files, but an app on the iPhone changed their names to random stuff that looks like this:
1c03e04cc1bbfcb0c1237f57f1d0ae2e.mp3?extra=f7NhT68pNkmEbGA_I1WbVShXQ2E2gJAGBKSEyh3hf0hsbLB1cqnXDuepYA5ubcFm_B3KSsrXDuKVtWVAUh_MAPeFiEHXVdg
I only need to remove part of file name after mp3. Please give me a script - there are more than 600 files, and manually it is impossible.
You can use the rename command:
rename "s/mp3\?.*/mp3/" *.mp3*
#!/bin/bash
shopt -s nullglob
for F in *.mp3\?*; do
echo mv -v -- "$F" "${F%%.mp3\?*}.mp3"
done
Save it as a script such as script.sh, then run it with bash /path/to/script.sh from the directory that contains the files.
Remove the echo once the printed mv commands look correct.
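To see what the ${F%%.mp3\?*} expansion in the loop does, it can be tried on a single name first. A small sketch; clean_name is a hypothetical helper and the file name is a shortened made-up sample:

```shell
# Hypothetical helper: drop everything from the first ".mp3?" onward, then re-add ".mp3".
clean_name() {
    printf '%s\n' "${1%%.mp3\?*}.mp3"
}

clean_name 'abc123.mp3?extra=f7NhT68'   # prints abc123.mp3
```

The helper assumes its argument really contains ".mp3?"; the loop above guarantees that through its glob.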

Mirror only files having specific string in file path

I'm trying to mirror only those branches of a directory tree that contain a specific directory name somewhere within the branch. I've spent several hours trying different things to no avail.
A remote FTP site has a directory structure like this:
image_db
    movies
        v2
            20131225
                xyz
                    xyz.jpg
            20131231
                abc
                    abc.jpg
            AllPhotos   <-- this is what I want to mirror
                xyz
                    xyz.jpg
                abc
                    abc.jpg
        v4
            (similar structure to 'v2' above, contains 'AllPhotos')
        ...
    tv_shows
        (similar structure to 'movies', contains 'AllPhotos')
    other
        (different paths, some of which contain 'AllPhotos')
    ...
I am trying to create a local mirror of only the 'AllPhotos' directories, with their parent paths intact.
I've tried variations of this:
lftp -e 'mirror --only-newer --use-pget-n=4 --verbose -X /* -I AllPhotos/ /image_db/ /var/www/html/mir_images' -u username,password ftp.example.com
...where the "-X /*" excludes all directories and "-I AllPhotos/" includes only AllPhotos. This doesn't work, lftp just copies everything.
I also tried variations of this:
lftp -e 'glob -d -- mirror --only-newer --use-pget-n=4 --verbose /image_db/*/*/AllPhotos/ /var/www/html/mir_images' -u username,password ftp.example.com
...and lftp crunches away at the remote directory structure without actually creating anything on my side.
Basically, I want to mirror only those files that have the string 'AllPhotos' somewhere in the full directory path.
Update 1:
If I can do this with wget, rsync, ftpcopy or some other utility besides lftp, I welcome suggestions for alternatives.
Trying wget didn't work for me either:
wget -m -q -I /image_db/*/*/AllPhotos ftp://username:password@ftp.example.com/image_db
...it just gets the whole directory structure, even though the wget documentation says that wildcards are permitted in -I paths.
Update 2:
After further investigation, I am coming to the conclusion that I should probably write my own mirroring utility, although I still suspect I am approaching lftp the wrong way, and that there's a way to make it mirror only files that have a specific string in the absolute path.
One solution:
curl -s 'ftp://domain.tld/path/' |
awk '/^d.*regex/{print $NF}' |
xargs -I% wget -m 'ftp://domain.tld/path/%/'
Or using lftp:
lftp -e 'ls; quit' 'ftp://domain.tld/path' |
awk '/^d.*regex/{print $NF}' |
xargs -I% lftp -e "mirror -e %; quit" ftp://domain.tld/path/
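To select branches by a name that can appear anywhere in the path, not just at the top level, the full paths can be filtered before mirroring. A sketch, assuming a recursive listing with one path per line is available (for example from lftp's find command); allphotos_dirs is a hypothetical helper:

```shell
# Read paths on stdin; print each distinct directory whose last component is AllPhotos.
allphotos_dirs() {
    sed -n 's~^\(.*/AllPhotos\)/.*$~\1~p' | sort -u
}

# Usage (assumption: the listing prints directories with a trailing slash):
# lftp -e 'find /image_db; quit' -u username,password ftp.example.com | allphotos_dirs
```

Each printed directory can then be handed to a separate mirror call.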

How to use regular expressions in wget for rejecting files?

I am trying to download the contents of a website using the wget tool. I used the -R option to reject some file types, but there are some other files which I don't want to download. These files are named as follows and don't have any extension:
string-ID
for example:
newsbrief-02
How can I tell wget not to download these files (the files whose names start with a specified string)?
Since (apparently) v1.14, wget accepts regular expressions via --reject-regex and --accept-regex (with --regex-type posix by default; it can be set to pcre if wget was compiled with libpcre support).
Beware that it seems you can use --reject-regex only once per wget call. That is, to reject on several patterns you have to join them with | in a single regex:
wget --reject-regex 'expr1|expr2|…' http://example.com
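Before starting a long crawl it can help to preview which URLs a combined pattern would drop. A sketch that uses grep -E as a stand-in for wget's POSIX matching (an approximation, not wget's own code path); preview_reject, the URLs and the pattern are all made up for illustration:

```shell
# Print the URLs from stdin that match the pattern in "$1",
# i.e. the ones --reject-regex would be expected to skip.
preview_reject() {
    grep -E -- "$1"
}

printf '%s\n' \
    'http://example.com/news/newsbrief-02' \
    'http://example.com/about.html' \
    'http://example.com/img/logo.png' |
preview_reject '/newsbrief-[^/]*$|\.png$'
```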
You cannot specify a regular expression with wget's -R option, but you can specify a glob pattern (like a file name pattern in a shell).
The answer looks like:
$ wget -R 'newsbrief-*' ...
You can also use ? and character classes [].
For more information see info wget.

inotifywait - exclude regex pattern formatting

I am trying to use inotifywait to watch all .js files under my ~/js directory; how do I format my regex inside the following command?
$ inotifywait -m -r --exclude [REGEX HERE] ~/js
The regex, which according to the man page should be a POSIX extended regular expression, needs to match "all files except those that end in .js", so that these files can in turn be excluded by the --exclude option.
I've tried the (?!) lookaround thing, but it doesn't seem to work in this case. Any ideas or workarounds? Would much appreciate your help on this issue.
"I've tried the (?!) thing"
This thing is called negative lookahead and it is not supported by POSIX ERE.
So you have to do it the hard way, i.e. match everything that you want to exclude.
e.g.
\.(txt|xml) etc.
inotifywait has no include option and POSIX extended regular expressions don't support negation. (Answered by FailedDev)
You can patch the inotify tools to get an --include option. But you need to compile and maintain it yourself. (Answered by browndav)
A quicker workaround is using grep.
$ inotifywait -m -r ~/js | grep '\.js$'
But be aware of grep's buffering if you pipe the output to another command. Add --line-buffered to make it work with while read. Here is an example:
$ inotifywait -m -r ~/js | grep '\.js$' --line-buffered |
while read path events file; do
echo "$events happened to $file in $path"
done
If you just want to watch already existing files, you can also use find to generate the list of files. It will not watch newly created files.
$ find ~/js -name '*.js' | xargs inotifywait -m
If all your files are in one directory, you can also use ostrokach's suggestion. In that case shell expansion is much easier than find and xargs. But again, it won't watch newly created files.
$ inotifywait -m ~/js/*.js
I posted a patch here that adds --include and --includei options that work like negations of --exclude and --excludei:
https://github.com/browndav/inotify-tools/commit/160bc09c7b8e78493e55fc9f071d0c5575496429
Obviously you'd have to rebuild inotify-tools, and this is relatively untested, but hopefully it can make it into mainline or is helpful to someone who comes across this post later.
Make sure you are quoting the regex command, if you are using shell-relevant characters (including ()).
While this is working:
inotifywait --exclude \.pyc .
this is not:
inotifywait --exclude (\.pyc|~) .
You have to quote the entire regular expression:
inotifywait --exclude '.*(\.pyc|~)' .
As of version 3.20.1, inotifywait does include the --include and --includei options.
To see them, run inotifywait --help. For some reason, they aren't documented in the manpages.
You could get most of this with --exclude '\.[^j][^s]' to ignore files unless they contain .js at some point in the filename or path. If you combine it with -r then it will work with arbitrary levels of nesting.
The drawback is that the pattern only inspects the two characters after each dot: names such as data.json, photo.jpg or style.css are still watched, and a .js file below a directory such as conf.d/ is excluded, but this is probably somewhat unlikely.
You could probably extend this regex to fix this, but personally I don't think the drawbacks are a big enough deal to worry about.
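A quick way to see what this pattern catches is to test it against sample paths with grep -E, which approximates inotifywait's POSIX extended matching. The helper name is_excluded and the paths are invented for illustration:

```shell
# Succeed if the path matches the --exclude pattern discussed above.
is_excluded() {
    printf '%s\n' "$1" | grep -Eq '\.[^j][^s]'
}

for p in js/app.js js/notes.txt js/test.js.old js/data.json; do
    if is_excluded "$p"; then
        echo "excluded: $p"
    else
        echo "watched: $p"
    fi
done
```

For these samples the pattern matches notes.txt and test.js.old, but not app.js or data.json.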