How can I reject files named '!' with wget? - regex
I'm using wget to recursively download my university's pages for later analysis and am filtering lots of extensions.
Here's an MWE with the relevant function:
#!/bin/sh
unwanted_extensions='*.apk,*.asc,*.asp,*.avi,*.bat,*.bib,*.bmp,*.bz2,*.c,*.cdf,*.cgi,*.class,*.cpp,*.crt,*.csp,*.css,*.cur,*.dat,*.dll,*.dvi,*.dwg,*.eot,*.eps,*.epub,*.exe,*.f,*.flv,*.for,*.ggb,*.gif,*.gpx,*.gz,*.h,*.heic,*.hpp,*.hqx,*.htc,*.ico,*.jfif,*.jpe,*.jpeg,*.jpg,*.js,*.lib,*.lnk,*.ly,*.m,*.m4a,*.m4v,*.mdb,*.mht,*.mid,*.mp3,*.mp4,*.mpeg,*.mpg,*.mso,*.odb,*.ogv,*.otf,*.out,*.pdb,*.pdf,*.php,*.plot,*.png,*.ps,*.psz,*.py,*.rar,*.sav,*.sf3,*.sgp,*.sh,*.sib,*.svg,*.swf,*.tex,*.tgz,*.tif,*.tiff,*.tmp,*.ttf,*.txt,*.wav,*.webm,*.webmanifest,*.webp,*.wmf,*.woff,*.woff2,*.wxm,*.wxmx,*.xbm,*.xml,*.xps,*.zip'
unwanted_regex='/([a-zA-Z0-9]+)$'
wget_custom ()
{
    link="$1"
    wget \
        --recursive -e robots=off --level=inf --quiet \
        --ignore-case --adjust-extension --convert-file-only \
        --reject "$unwanted_extensions" \
        --reject-regex "$unwanted_regex" --regex-type posix \
        "$link"
}
wget_custom "$1"
It works nicely and filters most of the stuff. However, these sites serve many PDF and image files named ! (e.g. biologiacelular.ugr.es/pages/planoweb/!) which I don't need and want to reject. Here's what I've tried so far:
Appending ,! to unwanted_extensions
Appending ,%21 to unwanted_extensions
Changing unwanted_regex to '/([a-zA-Z0-9!]+)$'
Changing unwanted_regex to '/([a-zA-Z0-9\!]+)$'
Adding another --reject-regex '/!$'
Adding another --reject-regex '/\!$'
None of these work and I'm out of ideas. How can I filter the ! files? Thank you!
Related
How come file is not excluded with gsutil rsync -x by the Google Cloud Builder?
I am currently running the gsutil rsync cloud build command:
gcr.io/cloud-builders/gsutil -m rsync -r -c -d -x "\.gitignore" . gs://mybucket/
I am using the -x "\.gitignore" argument here to try and not copy over the .gitignore file, as mentioned here: https://cloud.google.com/storage/docs/gsutil/commands/rsync
However, when looking in the bucket and the logs, it still says:
2021-04-23T13:29:37.870382893Z Step #1: Copying file://./.gitignore [Content-Type=application/octet-stream]...
So rsync is still copying over the file despite the -x "\.gitignore" argument. According to the docs, -x is a Python regexp, so //./.gitignore should be captured by \.gitignore.
Does anyone know why this isn't working and why the file is still being copied?
See the rsync.py source code:
if cls.exclude_pattern.match(str_to_check):
In Python, re.match only returns a match if it occurs at the start of the string. So, to find a match anywhere using the -x parameter, you need to prepend .* (or (?s).*) to the pattern you are looking for:
gcr.io/cloud-builders/gsutil -m rsync -r -c -d -x ".*\.gitignore" . gs://mybucket/
Note that to make sure .gitignore appears at the end of the string, you need to append $: -x ".*\.gitignore$".
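The difference between matching at the start and matching anywhere can be checked with a few lines of Python. This is only a sketch of the re semantics the answer above describes: the string "./.gitignore" is an assumed stand-in for whatever path gsutil actually passes to the pattern, and the variable names are made up for illustration.

```python
import re

# Assumed example of a path being checked; the exact string gsutil
# passes to the exclude pattern is not confirmed here.
candidate = "./.gitignore"

bare = re.compile(r"\.gitignore")
prefixed = re.compile(r".*\.gitignore$")

# re.match anchors at position 0, where the text is "./" rather than ".g".
match_result = bare.match(candidate)

# re.search scans the whole string and finds the pattern at index 2.
search_result = bare.search(candidate)

# With a leading ".*", match() can consume the "./" prefix first.
prefixed_result = prefixed.match(candidate)

print(match_result)
print(search_result.start())
print(prefixed_result.group(0))
```

Under that assumption, the bare pattern fails with match() even though the text clearly contains .gitignore, which is consistent with the file still being copied.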
/bin/sh: jlink: not found. command '/bin/sh -c jlink' returned a non-zero code: 127
The Dockerfile used:
FROM azul/zulu-openjdk-alpine:11 as jdk
RUN jlink \
    --module-path /usr/lib/jvm/*/jmods/ \
    --verbose \
    --add-modules java.base,jdk.unsupported,java.sql,java.desktop \
    --compress 2 \
    --no-header-files \
    --no-man-pages \
    --output /opt/jdk-11-minimal
FROM alpine:3.10
ENV JAVA_HOME=/opt/jdk-11-minimal
ENV PATH=$PATH:/opt/jdk-11-minimal/bin
COPY --from=jdk /opt/jdk-11-minimal /opt/jdk-11-minimal
Why can't jlink be found in azul/zulu-openjdk-alpine:11?
The simple answer is that jlink is not on the PATH, so it can't be found. If you change the RUN line to RUN /usr/lib/jvm/zulu11/bin/jlink, it can be found. However, you will still get an error from the wildcard in the module path. Change this to --module-path /usr/lib/jvm/zulu11/jmods/ and the docker command will complete successfully.
Use $JAVA_HOME/bin/jlink. For historical reasons $JAVA_HOME/bin is not included in PATH, so you need to give the full path explicitly.
I had the same problem. It's an issue in the image: https://github.com/zulu-openjdk/zulu-openjdk/issues/66. I tried the version azul/zulu-openjdk-alpine:11.0.7-11.39.15 and it worked.
Using rsync with RegEx
I am using rsync to sync folders and their content between a Linux server and a network storage to back up files. For this, I am using this command:
rsync -rltPuz -k --chmod=ugo+rwx --prune-empty-dirs --exclude=*backup* --exclude=*.zip --exclude=*.zip.bak --password-file=/rsync_pw.txt /source/ user@storage::Kunden/Jobs
This command runs on the source via crontab, and everything works fine. But now I have a small problem. My directories are structured like this:
Jobs
    Job1
        new
            all new files
        ready
            all ready files
    Job2
        new
            all new files
        ready
            all ready files
I only need to sync the ready folders and their content. I have experimented with --include and --exclude but did not get what I needed. Is there a way to tell rsync what I want? Thanks for your time!
You can use find /path/to/Jobs -name ready and pipe its output to rsync, or use the find option -exec and place your rsync call there. In your example the final command will look like:
find Jobs/ -name 'ready' -exec rsync -rltPuz -k --chmod=ugo+rwx --prune-empty-dirs --exclude=*backup* --exclude=*.zip --exclude=*.zip.bak {}/ dest \;
On my Ubuntu machine it works:
kammala@devuntu:~$ ls -R dest/
dest/:
kammala@devuntu:~$ ls -R Jobs/
Jobs/:
Job1 Job2
Jobs/Job1:
new ready
Jobs/Job1/new:
new1.txt new2.txt some_new_backup.txt
Jobs/Job1/ready:
r1.txt r2.txt some_backup_file.txt
Jobs/Job2:
new ready
Jobs/Job2/new:
new3.txt new4.txt zipped_bckp.zip.bak
Jobs/Job2/ready:
r4.txt r5.txt r6.txt some_zipped_file.zip.bak
kammala@devuntu:~$ find Jobs/ -name 'ready' -exec rsync -rltPuz -k --chmod=ugo+rwx --prune-empty-dirs --exclude=*backup* --exclude=*.zip --exclude=*.zip.bak {}/ dest \;
building file list ...
3 files to consider
./
r1.txt 0 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=1/3)
r2.txt 0 100% 0.00kB/s 0:00:00 (xfr#2, to-chk=0/3)
building file list ...
4 files to consider
./
r4.txt 0 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=2/4)
r5.txt 0 100% 0.00kB/s 0:00:00 (xfr#2, to-chk=1/4)
r6.txt 0 100% 0.00kB/s 0:00:00 (xfr#3, to-chk=0/4)
kammala@devuntu:~$ ls -R dest
dest:
r1.txt r2.txt r4.txt r5.txt r6.txt
Eight years later, I found this post after days of pounding on globbing and escaping issues for command option parameters. This was doubly important as my IDE was applying "exclude" options for rsync without quotes or escaping.
CompSci 101: the glob characters ? * [ ] are expanded by the shell before the command is executed, and they are expanded based on the current working directory. (Yeah, I forget all the places that this applies, too.) This is why it might seem to work in some situations. This includes your option to rsync, --exclude=*.zip. Those parameters need to be either escaped or quoted. So, omitting other options for brevity:
rsync -av --exclude='*backup*' --exclude='*.zip' --exclude='*.zip.bak' /source/ user@storage::Kunden/Jobs
or
rsync -av --exclude=\*backup\* --exclude=\*.zip --exclude=\*.zip.bak /source/ user@storage::Kunden/Jobs
If you are unsure what the result of an include, exclude, or filter combination is, and what is being sent to, say, a production server, you can test your command with the options --dry-run (or -n) and --debug=filter. You'll get a list of files that are shown or hidden from the planned transfer.
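Why an unquoted --exclude=*.zip "might seem to work" can be shown with a rough simulation. The sketch below is not a shell: expand_word is a hypothetical helper that approximates default POSIX pathname expansion with Python's fnmatch, on the assumption that a word matching no file name is passed through unchanged (the default in sh and bash; other shells or options may behave differently). The file names are invented for the demonstration.

```python
import fnmatch
import os
import tempfile


def expand_word(word, directory):
    """Approximate shell pathname expansion of one unquoted word.

    Hypothetical helper for illustration only: returns the sorted file
    names in `directory` that match `word`, or the word itself when
    nothing matches (assumed default sh/bash behaviour).
    """
    matches = [
        name
        for name in sorted(os.listdir(directory))
        if fnmatch.fnmatchcase(name, word)
        # A leading dot must be matched explicitly, as in the shell.
        and (word.startswith(".") or not name.startswith("."))
    ]
    return matches if matches else [word]


with tempfile.TemporaryDirectory() as workdir:
    for name in ("a.zip", "b.zip"):
        open(os.path.join(workdir, name), "w").close()

    # A bare pattern is replaced by the matching file names.
    bare = expand_word("*.zip", workdir)

    # The whole word, option prefix included, is the pattern, so it
    # usually matches nothing and reaches the command unchanged.
    option = expand_word("--exclude=*.zip", workdir)

    # An oddly named file in the working directory changes the outcome.
    open(os.path.join(workdir, "--exclude=old.zip"), "w").close()
    clobbered = expand_word("--exclude=*.zip", workdir)

print(bare)
print(option)
print(clobbered)
```

In this simulation the unquoted option survives only because no file name happens to match the whole word, which is why quoting or escaping the pattern is the safer habit.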
How can I get the "lein repl" history to work in cygwin?
I'm using Cygwin on Windows 7 and the latest lein, but when I am in the repl, pressing up and down moves me around the repl console instead of showing me history (which is what I expect). I've googled around and seen that this is related to using jline instead of readline (whatever that means) but I don't know how to use this information to fix my problem.
I found the answer here: I modified the lein startup script to call stty and set jline.terminal, and it seems to work:
stty -icanon min 1 -echo
$LEIN_JAVA_CMD \
    -client -XX:+TieredCompilation \
    -Djline.terminal=jline.UnixTerminal \
    $LEIN_JVM_OPTS \
    -Dfile.encoding=UTF-8 \
    -Dmaven.wagon.http.ssl.easy=false \
    -Dleiningen.original.pwd="$ORIGINAL_PWD" \
    -Dleiningen.trampoline-file="$TRAMPOLINE_FILE" \
    -cp "$CLASSPATH" \
    clojure.main -m leiningen.core.main "$@"
EXIT_CODE=$?
stty icanon echo
I modified that section in the lein script and now up = history.
An alternative to the approach you suggested would be to install rlwrap, which is available in Cygwin. This adds Readline capabilities (e.g. command history search and navigation) to any interactive command-line application. If you've used bash for any length of time, you will know what these capabilities are. You need to start the application as a parameter to the readline wrapper, but this can be hidden away using aliases or functions as appropriate:
rlwrap lein repl
The benefit of using rlwrap over your suggestion is that it can add these capabilities to more than just the specific case of the repl.
How do I add in a new template for my lift project (including url setups)?
I just created a hello-world project with the Maven command from the book:
mvn archetype:generate -U \
    -DarchetypeGroupId=net.liftweb \
    -DarchetypeArtifactId=lift-archetype-blank \
    -DarchetypeVersion=1.0 \
    -DgroupId=demo.helloworld \
    -DartifactId=helloworld \
    -Dversion=1.0-SNAPSHOT
And as instructed, I start it with:
mvn jetty:run
Everything works fine until I try to add another template besides:
my-project/src/main/webapp/index.html
For example, I put a pricing page (pricing.html) next to index.html, at my-project/src/main/webapp/pricing.html. But the following URL does not seem to work for me:
http://localhost:8080/pricing
Am I missing anything here?
You need to add it to Boot.scala: http://simply.liftweb.net/index-3.2.html