How can I reject files named '!' with wget? - regex

I'm using wget to recursively download my university's pages for later analysis and am filtering lots of extensions.
Here's an MWE with the relevant function:
#!/bin/sh
unwanted_extensions='*.apk,*.asc,*.asp,*.avi,*.bat,*.bib,*.bmp,*.bz2,*.c,*.cdf,*.cgi,*.class,*.cpp,*.crt,*.csp,*.css,*.cur,*.dat,*.dll,*.dvi,*.dwg,*.eot,*.eps,*.epub,*.exe,*.f,*.flv,*.for,*.ggb,*.gif,*.gpx,*.gz,*.h,*.heic,*.hpp,*.hqx,*.htc,*.ico,*.jfif,*.jpe,*.jpeg,*.jpg,*.js,*.lib,*.lnk,*.ly,*.m,*.m4a,*.m4v,*.mdb,*.mht,*.mid,*.mp3,*.mp4,*.mpeg,*.mpg,*.mso,*.odb,*.ogv,*.otf,*.out,*.pdb,*.pdf,*.php,*.plot,*.png,*.ps,*.psz,*.py,*.rar,*.sav,*.sf3,*.sgp,*.sh,*.sib,*.svg,*.swf,*.tex,*.tgz,*.tif,*.tiff,*.tmp,*.ttf,*.txt,*.wav,*.webm,*.webmanifest,*.webp,*.wmf,*.woff,*.woff2,*.wxm,*.wxmx,*.xbm,*.xml,*.xps,*.zip'
unwanted_regex='/([a-zA-Z0-9]+)$'
wget_custom ()
{
link="$1"
wget \
--recursive -e robots=off --level=inf --quiet \
--ignore-case --adjust-extension --convert-file-only \
--reject "$unwanted_extensions" \
--reject-regex "$unwanted_regex" --regex-type posix \
"$link"
}
wget_custom "$1"
It works nicely and filters most of the stuff. However, these sites serve many PDF and image files named ! (e.g. biologiacelular.ugr.es/pages/planoweb/!) which I don't need and want to reject. Here's what I've tried that hasn't worked:
Appending ,! to unwanted_extensions
Appending ,%21 to unwanted_extensions
Changing unwanted_regex to '/([a-zA-Z0-9!]+)$'
Changing unwanted_regex to '/([a-zA-Z0-9\!]+)$'
Adding another --reject-regex '/!$'
Adding another --reject-regex '/\!$'
None of these work and I'm out of ideas. How can I filter the ! files? Thank you!

Related

How come file is not excluded with gsutil rsync -x by the Google Cloud Builder?

I am currently running the gsutil rsync cloud build command:
gcr.io/cloud-builders/gsutil
-m rsync -r -c -d -x "\.gitignore" . gs://mybucket/
I am using the -x "\.gitignore" argument here to try and not copy over the .gitignore file, as mentioned here:
https://cloud.google.com/storage/docs/gsutil/commands/rsync
However, when looking in the bucket and the logs, it still says:
2021-04-23T13:29:37.870382893Z Step #1: Copying file://./.gitignore [Content-Type=application/octet-stream]...
So rsync is still copying over the file despite the -x "\.gitignore" argument.
According to the docs -x is a Python regexp, so //./.gitignore should be captured by \.gitignore
Does anyone know why this isn't working and why the file is still being copied?
See the rsync.py source code:
if cls.exclude_pattern.match(str_to_check):
In Python, re.match only returns a match if it occurs at the start of string.
So, in order to find a match anywhere using the -x parameter, you need to prepend the pattern you need to find with .* or with (?s).*:
gcr.io/cloud-builders/gsutil
-m rsync -r -c -d -x ".*\.gitignore" . gs://mybucket/
Note that to make sure .gitignore appears at the end of string, you need to append $, -x ".*\.gitignore$".
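The anchoring behaviour described above can be checked in plain Python. This is a minimal sketch, assuming the string being tested looks like the file://./.gitignore URL from the log line; the exact string that rsync.py compares is an assumption here, not something confirmed from gsutil's source.

```python
import re

# Assumed form of the string being checked; taken from the log line in the
# question, not from gsutil's internals.
candidate = "file://./.gitignore"

# re.match only succeeds if the pattern matches starting at position 0.
anchored = re.compile(r"\.gitignore").match(candidate)

# Prefixing .* lets the match begin at position 0 and still reach the name;
# the trailing $ requires the name to end the string.
prefixed = re.compile(r".*\.gitignore$").match(candidate)

# re.search scans the whole string, which is what the question expected.
searched = re.compile(r"\.gitignore").search(candidate)

print(anchored is None, prefixed is not None, searched is not None)
```

Under that assumption, the bare pattern fails with match but succeeds with search, and the .* prefix makes match behave as intended.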

/bin/sh: jlink: not found. command '/bin/sh -c jlink' returned a non-zero code: 127

The Dockerfile used:
FROM azul/zulu-openjdk-alpine:11 as jdk
RUN jlink \
--module-path /usr/lib/jvm/*/jmods/ \
--verbose \
--add-modules java.base,jdk.unsupported,java.sql,java.desktop \
--compress 2 \
--no-header-files \
--no-man-pages \
--output /opt/jdk-11-minimal
FROM alpine:3.10
ENV JAVA_HOME=/opt/jdk-11-minimal
ENV PATH=$PATH:/opt/jdk-11-minimal/bin
COPY --from=jdk /opt/jdk-11-minimal /opt/jdk-11-minimal
Why can't jlink be found in azul/zulu-openjdk-alpine:11?
The simple answer is jlink is not on the PATH so can't be found.
If you change the RUN line to
RUN /usr/lib/jvm/zulu11/bin/jlink
then it can be found.
However, you still have an error using the wildcard in the module path. Change this to
--module-path /usr/lib/jvm/zulu11/jmods/
and the docker command will complete successfully.
Please, use $JAVA_HOME/bin/jlink.
For historical reasons $JAVA_HOME/bin is not included in PATH, so you need to state it directly.
I had the same problem. It's a known issue in the image: https://github.com/zulu-openjdk/zulu-openjdk/issues/66
I tried version azul/zulu-openjdk-alpine:11.0.7-11.39.15 and it worked.

Using rsync with RegEx

I am using rsync to sync folders and their content between a Linux server and a network storage device to back up files. For this, I am using this command:
rsync -rltPuz -k --chmod=ugo+rwx --prune-empty-dirs --exclude=*backup* --exclude=*.zip --exclude=*.zip.bak --password-file=/rsync_pw.txt /source/ user@storage::Kunden/Jobs
This command runs on the source server via crontab. Everything works fine.
But now I have a little problem. My directories are built like this:
Jobs
    Job1
        new
            all new files
        ready
            all ready files
    Job2
        new
            all new files
        ready
            all ready files
I only need to sync the ready folders and their content. I have experimented with --include and --exclude but did not get what I needed. Is there a way to tell rsync what I want?
Thanks for your time!
You can use find /path/to/Jobs -name ready and pipe its output to rsync, or use the find option -exec and place your rsync call there.
In your example the final command will look like:
find Jobs/ -name 'ready' -exec rsync -rltPuz -k --chmod=ugo+rwx --prune-empty-dirs --exclude=*backup* --exclude=*.zip --exclude=*.zip.bak {}/ dest \;
On my Ubuntu machine it works:
kammala@devuntu:~$ ls -R dest/
dest/:
kammala@devuntu:~$ ls -R Jobs/
Jobs/:
Job1 Job2
Jobs/Job1:
new ready
Jobs/Job1/new:
new1.txt new2.txt some_new_backup.txt
Jobs/Job1/ready:
r1.txt r2.txt some_backup_file.txt
Jobs/Job2:
new ready
Jobs/Job2/new:
new3.txt new4.txt zipped_bckp.zip.bak
Jobs/Job2/ready:
r4.txt r5.txt r6.txt some_zipped_file.zip.bak
kammala@devuntu:~$ find Jobs/ -name 'ready' -exec rsync -rltPuz -k --chmod=ugo+rwx --prune-empty-dirs --exclude=*backup* --exclude=*.zip --exclude=*.zip.bak {}/ dest \;
building file list ...
3 files to consider
./
r1.txt
0 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=1/3)
r2.txt
0 100% 0.00kB/s 0:00:00 (xfr#2, to-chk=0/3)
building file list ...
4 files to consider
./
r4.txt
0 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=2/4)
r5.txt
0 100% 0.00kB/s 0:00:00 (xfr#2, to-chk=1/4)
r6.txt
0 100% 0.00kB/s 0:00:00 (xfr#3, to-chk=0/4)
kammala@devuntu:~$ ls -R dest
dest:
r1.txt r2.txt r4.txt r5.txt r6.txt
Eight years later I find this post after days of pounding on globbing and escaping issues for command option parameters. This was doubly important as my IDE was applying "exclude" options for rsync without quotes or escaping.
CompSci 101:
Glob characters ? * [ ] are expanded by the shell before the command is executed, and they are expanded relative to the current working directory. (Yeah, I forget all the places that this applies, too.) This is why an unquoted pattern might seem to work in some situations.
This includes your option to rsync, --exclude=*.zip. Those parameters need to be either escaped or quoted. So, omitting other options for brevity:
rsync -av --exclude='*backup*' --exclude='*.zip' --exclude='*.zip.bak' /source/ user@storage::Kunden/Jobs
or
rsync -av --exclude=\*backup\* --exclude=\*.zip --exclude=\*.zip.bak /source/ user@storage::Kunden/Jobs
If you are unsure what an include, exclude, or filter combination will do, or what would be sent to, say, a production server, you can test your command with --dry-run (or -n) together with --debug=filter. You'll get a list of the files that are shown to or hidden from the planned transfer.
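The point about expansion happening before the command runs can be illustrated without touching a real rsync. This is a rough sketch that uses Python's glob module as a stand-in for shell pathname expansion; the file names a.zip, b.zip and notes.txt are hypothetical, and shells differ in the details (for example, how an unmatched pattern is handled), so treat it as an approximation.

```python
import glob
import os
import tempfile

# Throwaway directory standing in for the shell's current working directory.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("a.zip", "b.zip", "notes.txt"):  # hypothetical files
        open(os.path.join(workdir, name), "w").close()

    # A bare *.zip word is replaced by the names that match it.
    bare = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(workdir, "*.zip"))
    )

    # For --exclude=*.zip the whole word is the pattern, so it only expands
    # if some file name literally starts with "--exclude=".
    option = glob.glob(os.path.join(workdir, "--exclude=*.zip"))

print(bare)
print(option)
```

Here the bare pattern turns into the two archive names, while the option word matches nothing. A POSIX shell in its default configuration passes an unmatched pattern through unchanged, which is one reason the unquoted form can appear to work until a matching file shows up.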

How can I get the "lein repl" history to work in cygwin?

I'm using Cygwin on Windows 7 and the latest lein, but when I am in the repl, pressing up and down moves me around the repl console instead of showing me history (which is what I expect). I've googled around and seen that this is related to using jline instead of readline (whatever that means) but I don't know how to use this information to fix my problem.
I found the answer here:
I modified the lein startup script to call stty and set jline.terminal, and it seems to work:
stty -icanon min 1 -echo
$LEIN_JAVA_CMD \
-client -XX:+TieredCompilation \
-Djline.terminal=jline.UnixTerminal \
$LEIN_JVM_OPTS \
-Dfile.encoding=UTF-8 \
-Dmaven.wagon.http.ssl.easy=false \
-Dleiningen.original.pwd="$ORIGINAL_PWD" \
-Dleiningen.trampoline-file="$TRAMPOLINE_FILE" \
-cp "$CLASSPATH" \
clojure.main -m leiningen.core.main "$@"
EXIT_CODE=$?
stty icanon echo
I modified that section in the lein script and now up = history.
An alternative to the approach you suggested is to install rlwrap, which is available in Cygwin. It adds Readline capabilities (e.g. command history search and navigation) to any interactive command-line application. If you've used bash for any length of time you will know what these capabilities are.
You will need to start the applications as parameters to the readline wrapper but this can be hidden away using aliases or functions as appropriate:
rlwrap lein repl
The benefit of using rlwrap over your suggestion is that it can add these capabilities to more than just the specific case of the repl.

How do I add in a new template for my lift project (including url setups)?

I just created a hello-world project with the Maven command from the book:
mvn archetype:generate -U \
-DarchetypeGroupId=net.liftweb \
-DarchetypeArtifactId=lift-archetype-blank \
-DarchetypeVersion=1.0 \
-DgroupId=demo.helloworld \
-DartifactId=helloworld \
-Dversion=1.0-SNAPSHOT
And as instructed, I start it with:
mvn jetty:run
Everything works fine until the moment that I would like to add in another template besides:
my-project/src/main/webapp/index.html
For example, I put a pricing page (pricing.html) next to index.html, at "my-project/src/main/webapp/pricing.html". But the following URL does not work for me: http://localhost:8080/pricing
Am I missing anything here?
You need to add it to Boot.scala:
http://simply.liftweb.net/index-3.2.html