How to get wget to accept files with no suffix - regex

I'm using wget (from perl) to get web pages from a site. I'm really only interested in the html,htm,php,asp,aspx file types. However, at least one site has supplied links using file names with no extensions/suffix. I need those too.
My:
wget -A html,htm,php,asp,aspx
works great, except for the no suffix links.
I've tried a number of regex strings to try and get the no suffix pages, but to no avail. wget returns just the main page. So far, the only way to get these files is to open it up to all files (which isn't terrible for this website, but would be terrible for others).
Is there either a regex or regular way to specify I want links from wget with no suffixes?

wget version 1.14 seems to support a --accept-regex argument which is matched against the full URL, i.e. something like the following should in theory work (untested):
wget --accept-regex '/[^.]+(?:\.(?:html?|php|aspx?))?$'
Or perhaps it would be easier to just reject those extensions you do not want?

Related

Wget returns images in an unknown format (jpg#f=)

I am running the following line:
wget -P "C:\My Web Sites\REGEX" -r --no-parent -A jpg,jpeg https://www.mywebsite.com/directory1/directory2/
and it stops (no errors) without returning more than a small amount of the website (two files). I am then running this:
wget -P "C:\My Web Sites\REGEX" https://www.mywebsite.com/directory1/directory2/ -m
and expecting to see data only from the directory. As a start, I found out that the script downloaded everything from the website as if I gave the https://www.mywebsite.com/ url. Also, the images are returned with an additional string in the extension (e.g. instead of .jpg I get something like .jpg#f=l=q)
Is there anything wrong in my code that causes that? I only want to get the images from the links that are shown in the directory given initially.
If there is nothing I can change, then I want to only download the files that contain .jpg in their names. Then, I have a prepared script in Python that can rename the files to have the original extension. Worst case, I can try Python instead of the CMD in Windows (page scraping)?
Note that --no-parent doesn't work in this case because the images are saved in a different directory. --accept-regex can be used if there is no way to get the correct extension.
PS: I do this thing in order to learn more about the wget options and protect my future hobby website.
UPD: Any suggestions regarding a Python script are welcome.

How do I ag search into a specific set of folders using the -G option?

How do I ag search into a specific set of folders using the -G option?
Here's an example where I use -G routes, but it's picking up another result from another folder, because routes is still in the path. Yet if I try -G ^routes, it doesn't seem like the regex is taking?
On that note, Atom has a nice path searching syntax which lets me do things like app,routes,!storage (search in app and routes folders, but ignore storage folder). Since I've switched to vim, I'm finding it hard to get a searching workflow down with ack/ag. Anyone have any tips for me?
I suppose you'd really like to try this plugin https://github.com/junegunn/fzf.vim. It will (among other things) run your ag search and present the result in a fuzzyfinder. Then you'll have the very same feature you mention in atom by running :Ag middleware followed by(in the fzf window) 'app/ 'routes/ !storage/'.
about the use of regex for the file pattern, you'll need to quote your regex
ag -G '^routes' middleware
that being said, the '^...' doesn't work as intended here on my computer either, although '\b' does. I started using ripgrep instead of the-silver-searcher not so much because it's faster but rather because it has a better documentation, maybe you'd like to give it a try. rg -g will take a glob which is easier to understand than the "FILEPATTERN" mentioned in the ag man page

Dynamically-created 'zip' command not excluding directories properly

I'm the author of a utilty that makes compressing projects using zip a bit easier, especially when you have to compress regularly, such as for updating projects submitted to an application store (like Chrome's Web Store).
I'm attempting to make quite a few improvements, but have run into an issue, described below.
A Quick Overview
My utility's command format is similar to command OPTIONS DEST DIR1 {DIR2 DIR3 DIR4...}. It works by running zip -r DEST.zip DIR1; a fairly simple process. The benefit to my utility, however, is the ability to use a predetermined file (think .gitignore) to ignore specific files/directories, or files/directories which match a pattern.
It's pretty simple -- if the "ignorefile" exists in a target directory (DIR1, DIR2, DIR3, etc), my utility will add exclusions to the zip -r DEST.zip DIR1 command using the pattern -x some_file or -x some_dir/*.
The Issue
I am running into an issue with directory exclusion, however, and I can't quite figure out why (this is probably be because I am still quite the sh novice). I'll run through some examples:
Let's say that I want to ignore two things in my project directory: .git/* and .gitignore. Running command foo.zip project_dir builds the following command:
zip -r foo.zip project -x project/.git/\* -x project/.gitignore
Woohoo! Success! Well... not quite.
In this example, .gitignore is not added to the compressed output file, foo.zip. The directory, .git/*, and all of it's subdirectories (and files) are added to the compressed output file.
Manually running the command:
zip -r foo.zip project_dir -x project/.git/\* -x project/.gitignore
Works as expected, of course, so naturally I am pretty puzzled as to why my identical, but dynamically-built command, does not work.
Attempted Resolutions
I have attempted a few different methods of resolving this to no avail:
Removing -x project/.git/\* from the command, and instead adding each subdirectory and file within that directory, such as -x project/.git/config -x project/.git/HEAD, etc (including children of subdirectories)
Removing the backslash before the asterisk, so that the resulting exclusion option within the command is -x project/.git/*
Bashing my head on the keyboard in angst (I'm really surprised this didn't work, it usually does)
Some notes
My utility uses /bin/sh; I would prefer to keep it that way for maximum compatibility.
I am aware of the git archive feature -- my use of .git/* and .gitignore in the above example is simply as an example; my utility is not dependent on git nor is used exclusively for projects which are git repositories.
I suspected the problem would be in the evaluation of the generated command, since you said the same command when executed directly did right.
So as the comment section says, I think you already found the correct solution. This happens because if you run that variable directly, some things like globs can be expanded directly, instead of passed to the command. And arguments may be messed up, depending on the situation.
Yes, in that case:
eval $COMMAND
is the way to go.

Using two asterisks to add a file in git

I want to add a file which has a unique file name but a long preceding path (e.g. a/b/c/d/filename.java). Normally I would add this to my repository by doing
git add *filename.java.
However I have also done this before:
git add a/b/c/d/filename*
So I tried to combine the two:
git add *filename*
but this does something weird. It adds every untracked file. I can see possible reasons for failure but they all should occur in one of the previous two commands so I don't know why this is happening.
My question isn't so much about how to add a file to a git repository with just its file name (although that would be useful).
My question is what is my misunderstanding of the * operation which makes me think the above should work.
Info:
I am using Git Bash for Windows, which is based on minGW.
You're looking at globs
(not regular expressions, which are a different pattern-matching language), and they're expanded by your shell, not by git.
If you want to see how they're going to match, just pass the same glob to another command, eg.
$ ls -d *filename.java
vs
$ ls -d *filename*
(I've just added the -d so ls doesn't show the contents of any directories that match)
Since you're using git bash, and it's possible that glob expansion behaves differently from a regular shell, try
$ git add --dry-run --verbose -- *filename*
for example: this should show you how it really expands the glob and what effect that has.
Note the -- ... if you're using globs that might match a filename with a leading -, it's important to make sure git knows it's a filename and not an option.
Unfortunately, this will only show you the files which both match the glob, and have some difference between the index and working copy.
Answer from author:
The dry run helped a lot, here is what I found:
I was forgetting about the bin folder which I haven't added, so when I performed the dry run I realised it was finding two matches: filename.java and filename.class. When I changed the glob to *filename.j* it worked.
My next step was to remove the .class and try the command again: it worked! It is still unexplained why git bash added everything when it found two matches... since the dry run behaves differently from the actual run I think there must be a bug, but I think that discussion is to be held elsewhere (unless somebody thinks it isn't a bug).
You could try with git add ./**/*.java
Note: I tested with zsh, it should also work for bash as well.

How to use regular expressions in wget for rejecting files?

I am trying to download the contents of a website using wget tool. I used -R option to reject some file types. but there are some other files which I don't want to download. These files are named as follows, and don't have any extensions.
string-ID
for example:
newsbrief-02
How I can tell wget not to download these files (the files which their names start with specified string)?
Since (apparently) v1.14 wget accepts regular expressions : --reject-regex and --accept-regex (with --regex-type posix by default, can be set to pcre if compiled with libpcre support).
Beware that it seems you can use --reject-regex only once per wget call. That is, you have to use | in a single regex if you want to select on several regex :
wget --reject-regex 'expr1|expr2|…' http://example.com
You can not specify a regular expression in the wget -R key, but you can specify a template (like file template in a shell).
The answer looks like:
$ wget -R 'newsbrief-*' ...
You can also use ? and symbol classes [].
For more information see info wget.