Apache mod_rewrite mapping path to parameters - regex

I'm moving over from IIS to Apache (on Windows) and struggling with adapting a rewrite rule (using Helicon ISAPI_Rewrite 3 in IIS).
The rule maps what looks like a directory structure path back into a set of query string parameters. There could be any number of parameters in the path.
E.g.
/basket/param1/value1/param2/value2/param3/value3 ...and so on...
Becomes...
/basket?param1=value1&param2=value2&param3=value3 ...and so on...
Rule in ISAPI_Rewrite:
# This rule simply reverts parameters that appear as folders back to standard parameters
# e.g. /search-results/search-value/red/results/10 becomes /search-results?search-value=red&results=10
RewriteRule ^/(.*?)/([^/]*)/([^/]*)(/.+)? /$1$4?$2=$3 [NC,LP,QSA]
I first spotted that Apache doesn't have the 'LP' flag, so swapped it for the N=10 as a test for looping...
RewriteRule ^(.*?)/([^/]*)/([^/]*)(/.+)? $1$4?$2=$3 [NC,N=10,QSA]
However, the Apache error logs show the same parameters being added over and over again until the loop limit set on the N flag is reached, ending in an HTTP 500 error.
Any ideas where I'm going wrong?!?

After much head scratching and engaging my Google-fu, I found the solution to all my problems in another Stack Overflow answer...
https://stackoverflow.com/a/5520004/14054970
Essentially...
apparently there's been an issue with mod_rewrite re-appending
post-fix part in certain cases
https://issues.apache.org/bugzilla/show_bug.cgi?id=38642
The problem:
If multiple RewriteRules within a .htaccess file match, unwanted
copies of PATH_INFO may accumulate at the end of the URI.
If you are on Apache 2.2.12 or later, you can use the DPI flag to
prevent this http://httpd.apache.org/docs/2.2/rewrite/flags.html
I'm using Apache 2.4, so my Rewrite rule now looks as follows (and I'll be adding the DPI flag to all rules to be safe)...
RewriteRule ^(.*?)/([^/]*)/([^/]*)(/.+)? $1$4?$2=$3 [NC,N=1000,QSA,DPI]
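For reference, the mapping the looping rule is meant to produce can be sketched outside Apache. This is only an illustration in Python, not how mod_rewrite works internally; the function name path_to_query is made up, and the parameter order Apache ends up with under QSA may differ from the order shown here.

```python
from urllib.parse import urlencode

def path_to_query(path):
    """Turn /name/k1/v1/k2/v2 into /name?k1=v1&k2=v2."""
    segments = [s for s in path.split("/") if s]
    if len(segments) < 3:
        return path  # no complete name/value pair to fold back
    head, rest = segments[0], segments[1:]
    # An unpaired trailing segment has no value, so it stays in the path.
    leftover = [rest.pop()] if len(rest) % 2 else []
    pairs = list(zip(rest[0::2], rest[1::2]))
    new_path = "/" + "/".join([head] + leftover)
    return new_path + "?" + urlencode(pairs)

print(path_to_query("/basket/param1/value1/param2/value2/param3/value3"))
```

Each pass of the RewriteRule peels one name/value pair off the path, which is why it needs the N flag to loop; the sketch does the same pairing in a single step.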

Related

Apache2 replace substring in RewriteRule capture?

I was wondering - is it possible to replace (all instances) of a substring inside a RewriteRule capture?
Here is an example, that I've started testing in https://htaccess.madewithlove.be/ :
For original/input url: https://example.com/subfold/dl/testDir/subTestDir/test.png
I have the rules:
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/subfold/dl/
RewriteRule ^subfold/dl/(.*)/(.*)$ httpdocs__$1__$2 [NC,L]
... and I get output URL: https://example.com/httpdocs__testDir/subTestDir__test.png
So, essentially, I capture the entire subpath after ^subfold/dl into $1 variable, and the filename into $2; so in this case: $1 = testDir/subTestDir, and $2 = test.png.
So what I want, is to replace all instances of / (forward slash) in $1 with %2F, before it gets applied to the output URL -> so that I would eventually get output URL: https://example.com/httpdocs__testDir%2FsubTestDir__test.png
Is there a way to do this with Apache2 RewriteRule's - and if so, how?
EDIT: Jeez:
... at least give me this message the very first time I paste mysite.com - now I have to waste extra time reworking my entire example, after I spent all this time to make it work to begin with :(... Computers are never going to make life easier, are they? Just more ads, crap and espionage ...
OK, got somewhere (also see somewhat related post Debugging Apache2 RewriteRule (with headers)?)
So, the only thing I could find somewhat working is Using RewriteMap - Apache HTTP Server Version 2.4. Now, since I use a Windows build of Apache, it gets somewhat extra tricky - but is still doable.
There is very scant information online on how to get this working; the crucial information was found here: RewriteMap prg: issue on Windows - Apache Web Server forum at WebmasterWorld - By Pubcon:
RewriteMap program is kicked off IFF the "RewriteEngine On" directive is OUTSIDE as below
In my case, also, the RewriteMap program starts if and only if the RewriteMap directive is OUTSIDE <Location>; AND the "RewriteEngine On" is OUTSIDE <Location> - in any other case, the program does not start.
Second thing we should be careful about is this, from https://httpd.apache.org/docs/2.4/rewrite/rewritemap.html :
When a MapType of prg is used, the MapSource is a filesystem path to
an executable program which will provide the mapping behavior. This
can be a compiled binary file, or a program in an interpreted language
such as Perl or Python.
This program is started once, when the Apache HTTP Server is started,
and then communicates with the rewriting engine via STDIN and STDOUT.
That is, for each map function lookup, it expects one argument via
STDIN, and should return one new-line terminated response string on
STDOUT. If there is no corresponding lookup value, the map program
should return the four-character string "NULL" to indicate this.
External rewriting programs are not started if they're defined in a
context that does not have RewriteEngine set to on.
In other words - the program used HAS to open its STDIN and STDOUT - AND it MUST block continuously; even if what you wanted to do was perl -i -pe's/SEARCH/REPLACE/', that kind of a program reads input, processes, provides output, and exits - and so in this case, it would not do us any good.
So, based on the example given in rewritemap.html - here is a Perl script that replaces forward slash (/) with %2F, while blocking continuously, called convslash.pl, saved in C:\bin\Apache24\bin\
#!C:/msys64/usr/bin/perl.exe
$| = 1; # Turn off I/O buffering
while (<STDIN>) {
s|/|%2F|g; # Replace / with %2F
print $_;
}
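Since the quoted documentation mentions Python as an alternative to Perl, the same read-one-line, answer-one-line loop might look like the sketch below. The names convert and serve are made up for illustration, and this has not been tested as an actual RewriteMap program on Windows.

```python
import io

def convert(key):
    """Replace every forward slash in the lookup key with %2F."""
    return key.replace("/", "%2F")

def serve(stdin, stdout):
    # readline() returns "" only at end of input, so the loop keeps
    # blocking between lookups instead of exiting after the first one.
    for line in iter(stdin.readline, ""):
        stdout.write(convert(line.rstrip("\r\n")) + "\n")
        stdout.flush()  # the rewrite engine waits for each answer

# Small in-memory demonstration of one lookup:
demo_out = io.StringIO()
serve(io.StringIO("my/spec\n"), demo_out)
print(demo_out.getvalue(), end="")  # prints my%2Fspec
```

To use it as a map program, the script would end with a call like serve(sys.stdin, sys.stdout) after importing sys.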
Then, I add this in my httpd.conf:
# the below starts and runs ONLY if RewriteEngine On is outside of <Location>; also a cmd.exe window is started (plus another for perl!)
#RewriteMap doprg "prg:c:/msys64/usr/bin/perl.exe c:/bin/Apache24/bin/convslash.pl"
# the below is slightly better - only one cmd.exe window is started:
RewriteMap doprg "prg:c:/Windows/System32/cmd.exe /c start /b c:/msys64/usr/bin/perl.exe c:/bin/Apache24/bin/convslash.pl"
# we MUST have RewriteEngine On here, outside of location - otherwise the RewriteMap program will never start:
RewriteEngine On
<Location /subfold/dl>
Options -Multiviews
RewriteEngine On
RewriteOptions Inherit
# first RewriteCond - this is just so we can capture the relevant parts into environment variables:
RewriteCond %{REQUEST_URI} ^/subfold/dl/(.*)/(.*)$
RewriteRule ^ - [E=ONE:%1,E=TWO:%2,NE]
# the above RewriteRule does not rewrite - but passes the input string further;
# so here, let's have another such RewriteRule - just so we can set our processed/desired output to a variable, which we can "print" via headers:
RewriteRule ^ - [E=MODDED:subfold/dl/${doprg:%{ENV:ONE}}/%{ENV:TWO},NE]
# the original URL will finally pass through unmodified to the "file handler" which will attempt to map it to the filesystem, it will fail, and return 404.
# the below headers should be returned along with that 404:
Header always set X-ONE "%{ONE}e"
Header always set X-TWO "%{TWO}e"
Header always set X-INPUT "%{INPUT}e"
Header always set X-MODDED "%{MODDED}e"
Header always set X-REQ "expr=%{REQUEST_URI}"
</Location>
So, now I start the server locally (./bin/httpd.exe), and to test this, I issue a request with curl:
$ curl -IkL http://127.0.0.1/subfold/dl/my/spec/test.html
HTTP/1.1 404 Not Found
Date: Mon, 18 Oct 2021 17:08:11 GMT
Server: Apache/2.4.46 (Win32) OpenSSL/1.1.1j
X-ONE: my/spec
X-TWO: test.html
X-INPUT: (null)
X-MODDED: subfold/dl/my%2Fspec/test.html
X-REQ: /subfold/dl/my/spec/test.html
Content-Type: text/html; charset=iso-8859-1
... and finally, we can see in the X-MODDED header, that indeed we managed to replace only a substring in (what would be) the rewritten URL ...
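As a cross-check of the header values above, the capture and replacement steps can be reproduced in a few lines of Python. This is only a sketch of what the RewriteCond pattern and the map do to the request URI; the function name modded is made up.

```python
import re

def modded(request_uri):
    """Mimic the RewriteCond capture plus the slash-replacing map."""
    m = re.match(r"^/subfold/dl/(.*)/(.*)$", request_uri)
    if m is None:
        return None
    # The first group is greedy, so it takes everything up to the last slash.
    one, two = m.groups()
    return "subfold/dl/" + one.replace("/", "%2F") + "/" + two

print(modded("/subfold/dl/my/spec/test.html"))
# prints subfold/dl/my%2Fspec/test.html
```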
Well, I wish that this was documented somehow, and I didn't have to waste like 8 hours of my life to figure this out - but who cares, in couple of years there will be new servers du jour, where all of this will be irrelevant, so more time will have to be wasted - all of it to serve more crap, ads and espionage.

Write a url path parameter to a query string with haProxy

I'm trying to re-write a URL such as
http://ourdomain.com/hotels/vegas?cf=0
to
http://ourdomain.com?d=vegas&cf=0
using haProxy.
We used to do it with Apache using
RewriteRule ^hotels/([^/]+)/?\??(.*)$ ?d=$1&$2 [QSA]
I've tried
reqrep ^([^\ :]*)\ /hotels/(.*) \1\ /?d=\2
But that gives me http://ourdomain.com?d=vegas?cf=0
And
reqrep ^([^\ :]*)\ /hotels/([^/]+)/?\??(.*) \1\ /?d=\2&\3
Just gives me a 400 error.
It would be nice to do it with acl's but I can't see how that would work.
reqrep ^([^\ :]*)\ /hotels/([^/]+)/?\??(.*) \1\ /?d=\2&\3
Just gives me a 400 error.
([^/]+) is too greedy when everything following it /?\??(.*) is optional. It's mangling the last part of the request, leading to the 400.
Remember what sort of data you're working with:
GET /path?query HTTP/1.(0|1)
Replace ([^/]+) with ([^/\ ]+) so that anything after and including the space will be captured by \3, not \2.
Update: it seems that the above is not quite perfect, since the alignment of the ? still doesn't work out. This -- and the original 400 error -- highlight some of the pitfalls with req[i]rep -- it's very low level request munging.
HAProxy 1.6 introduced several new capabilities that make request tweaking much cleaner, and this is actually a good case to illustrate several of them together. Note that these examples also use anonymous ACLs, wrapped in { }. The documentation seems to discourage these a little bit -- but this is only because they're unwieldy to maintain when you need to test the same set of conditions for multiple reasons (named ACLs can of course be more easily reused), but they're perfect for a case like this. Note that the braces must be surrounded by at least 1 whitespace character due to configuration parser limitations.
Variables, scoped to request (go out of scope as soon as a back-end is selected), response (go into scope only after the back-end responds), transaction (persistent from request to response, these can be used before the trip to the back-end and are still in scope when the response comes back), or session (in scope across multiple requests by this browser during this connection, if the browser reuses the connection), can be used to stash values.
The regsub() converter takes the preceding value as its input and returns that value passed through a simple regex replacement.
If the path starts with /hotels/, capture the path, scrub out ^/hotels/ (replacing it with the empty string that appears after the next comma), and stash it in a request variable called req.hotel.
http-request set-var(req.hotel) path,regsub(^/hotels/,) if { path_beg /hotels/ }
Processing of most http-request steps is done in configuration file order, so, at the next instruction, if (and only if) that variable has a value, we use http-request set-path with an argument of / in order to empty the path. Testing the variable is needed so that we don't do this with every request -- only the ones for /hotels/. It might be that you actually need something more like if { path_reg /hotels/.+ } since /hotels/ by itself might be a valid path we should leave alone.
http-request set-path / if { var(req.hotel) -m found }
Then, we use http-request set-query to set the query string to a value created by concatenating the value of the req.hotel variable with & and the original query string, which we obtain using the query fetch.
http-request set-query d=%[var(req.hotel)]&%[query] if { var(req.hotel) -m found }
Note that the query fetch and http-request set-query both have some magical behavior -- they take care of the ? for you. The query fetch does not return it, and http-request set-query does not expect you to provide it. This is helpful because we may need to be able to handle requests correctly whether or not the ? is present in the original request, without having to manage it ourselves.
With the above configuration, GET /hotels/vegas?cf=0 HTTP/1.1 becomes GET /?d=vegas&cf=0 HTTP/1.1.
If the initial query string is completely empty, GET /hotels/vegas HTTP/1.1 becomes GET /?d=vegas& HTTP/1.1. That looks a little strange, but it should be completely valid. A slightly more convoluted configuration to test for the presence of an initial query string could prevent that, but I don't see it being an issue.
So, we've turned 1 line of configuration into 3, but I would argue that those three lines are much more intuitive about what they are accomplishing and it's certainly a less delicate operation than massaging the entire start line of the request with a regex. Here they are, together, with some optional whitespace:
http-request set-var(req.hotel) path,regsub(^/hotels/,) if { path_beg /hotels/ }
http-request set-path / if { var(req.hotel) -m found }
http-request set-query d=%[var(req.hotel)]&%[query] if { var(req.hotel) -m found }
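To see what the three lines do to a request, here is a rough Python model of them. It is only a sketch: the function name rewrite is made up, and it assumes that -m found is true whenever the variable was set, even to an empty string.

```python
import re

def rewrite(path, query):
    """Model the set-var / set-path / set-query sequence."""
    hotel = None
    if path.startswith("/hotels/"):                      # { path_beg /hotels/ }
        hotel = re.sub(r"^/hotels/", "", path, count=1)  # regsub(^/hotels/,)
    if hotel is not None:                                # { var(req.hotel) -m found }
        path = "/"                                       # set-path /
        query = "d=" + hotel + "&" + query               # set-query d=...&...
    return path, query

print(rewrite("/hotels/vegas", "cf=0"))
print(rewrite("/hotels/vegas", ""))
print(rewrite("/other", "x=1"))
```

The second call shows the trailing & that is left behind when the original request had no query string.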
This is a working solution using reqrep
acl is_destination path_beg /hotels/
reqrep ^([^\ :]*)\ /hotels/([^/\ \?]+)/?\??([^\ ]*)(.*)$ \1\ /?d=\2&\3\4 if is_destination
I'm hoping that the acl will remove the need to run regex on everything (hence lightening the load a bit), but I'm not sure that's the case.
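The difference between the three reqrep attempts in this thread is easier to see when they are run against a sample request line. The sketch below uses Python's re module as a stand-in, so spaces are written plainly instead of as "\ ", and it assumes HAProxy's regex engine matches the same way for these patterns.

```python
import re

request_line = "GET /hotels/vegas?cf=0 HTTP/1.1"

# (pattern, replacement) for each attempt discussed above
attempts = {
    "original": (r"^([^ :]*) /hotels/([^/]+)/?\??(.*)", r"\1 /?d=\2&\3"),
    "space excluded": (r"^([^ :]*) /hotels/([^/ ]+)/?\??(.*)", r"\1 /?d=\2&\3"),
    "working": (r"^([^ :]*) /hotels/([^/ ?]+)/?\??([^ ]*)(.*)$", r"\1 /?d=\2&\3\4"),
}

results = {
    name: re.sub(pattern, repl, request_line)
    for name, (pattern, repl) in attempts.items()
}
for name, line in results.items():
    print(f"{name:15} {line}")
```

In the first attempt, ([^/]+) runs all the way to the slash in HTTP/1.1, which destroys the protocol version and explains the 400. Excluding the space keeps the version intact but still leaves the ? inside the captured value. Only the last pattern separates the hotel name from the query string.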

How to guess full file name, having only first 2 letters

I have a directory full of files whose names are prefixed with a sequential, unique number, like so:
/01 - Gruppe #1 - Potatisvalsen.mp3
/02 - Gruppe #1 - Wondrous Love & Hell Broke Loose in Georgia.mp3
Those are accessible at http://mysite/01 - Gruppe #1 - Potatisvalsen.mp3 etc.
I would like to rewrite calls like http://mysite/01.mp3 to the correct full URL as above.
I have tried the "obvious":
RewriteRule ^/(\d+)*\.mp3$ ./$1(.*)\.mp3
But that probably just shows my ignorance :)
Is this possible using mod_rewrite?
mod_rewrite cannot do this kind of shell expansion. You will be better off forwarding these requests to a PHP script and loading the actual file there.
Step 1: Forward to PHP
RewriteRule ^\d{2}\.mp3$ fileloader.php?f=$0 [L,QSA,NC]
Step 2: Inside fileloader.php
Load a list of files from current directory into an associative array
Perform a lookup on those filename using $_GET['f']
Serve the found file
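The lookup in step 2 could look roughly like the following. It is written in Python rather than PHP purely as an illustration of the logic; the function name find_by_prefix is made up, and serving the file is left out.

```python
import os
import re

def find_by_prefix(directory, requested):
    """Map a request like '01.mp3' to the full file name starting with '01'."""
    m = re.fullmatch(r"(\d{2})\.mp3", requested, flags=re.IGNORECASE)
    if m is None:
        return None  # anything else (including path tricks) is rejected
    # Build a prefix -> file name table from the directory listing.
    index = {}
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(".mp3") and name[:2].isdigit():
            index.setdefault(name[:2], name)
    return index.get(m.group(1))
```

Because the prefixes are unique, each two-digit key maps to exactly one file; if two files ever shared a prefix, this sketch would return the first one in sorted order.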

.htaccess rule for getting requested file size

I wonder if there can be such a thing. I wanna check for size of the file and then do htaccess rules based on it. For example:
# like this line
CheckIf {REQUESTED_FILE_SIZE} 50 # MB
AuthName "Login title"
AuthType Basic
AuthUserFile /path/to/.htpasswd
require valid-user
It's clear that I want to make some files with specific file size available to some users only (Using Authentication)
Any idea is appreciated.
Update #1
should be done in htaccess
Update #2
There are so many files and their URLs are already posted in blog. So can't separate larger files to another folder and update each post, also the limitation of file size may change in future.
Update #3
It's a windows server with PHP & helicon App installed
Update #4
Some people got confused about the real issue, and I didn't make it clear either.
.htaccess + PHP file for authentication (uses API) and checking file size + All downloadable files are all in the same server BUT our website is hosted on a different server.
Obviously .htaccess cannot check the requested file size and act accordingly. What you can possibly do is to make use of External Rewriting Program feature of RewriteMap
You need to define a RewriteMap like this in your Apache config first:
RewriteMap checkFileSize prg:/home/revo/checkFileSize.php
Then inside your .htaccess define a rule like this, passing the requested filename to the map:
RewriteRule ^ ${checkFileSize:%{REQUEST_FILENAME}}
%{REQUEST_FILENAME} is passed to PHP script on STDIN.
Then inside /home/revo/checkFileSize.php you can put PHP code to check for size of file and act accordingly like redirect to a URI that shows basic auth dialog.
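The size check itself is small. The answer uses a PHP script; the sketch below shows the same idea in Python, and the return values "large" and "small" as well as the function name classify are made up for illustration. "NULL" is the value the Apache documentation asks a map program to return when there is no answer.

```python
import os

LIMIT_BYTES = 50 * 1024 * 1024  # 50 MB, as in the question

def classify(filename, limit=LIMIT_BYTES):
    """Answer one map lookup: is the file above the size limit?"""
    try:
        size = os.path.getsize(filename)
    except OSError:
        return "NULL"  # missing or unreadable file: no lookup value
    return "large" if size > limit else "small"
```

A real map program would call this once per line read from STDIN and print the result followed by a newline.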
I'd do it in 2 steps:
1: htaccess redirecting all requests to one php script, say you have your files inside /test/ and you wanna make it all handled by /test/index.php, eg:
RewriteEngine On
RewriteBase /test
RewriteCond %{REQUEST_URI} !/test/index.php
RewriteRule ^(.+)? /test/index.php?file=$1 [L]
The RewriteCond is just to avoid loop requests.
2: the index.php script does all the auth logic, based on the requested file size, like this:
define('LIMIT_FILESIZE', 50*1024*1024); // e.g. 50 MB
define('AUTH_USER', 'myuser'); // change it to your liking
define('AUTH_PW', 'mypassword'); // change it to your liking
// Decode the requested name once and use it for every check below
$file = rawurldecode($_GET['file']);
if( !is_file($file) ) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
if( filesize($file) > LIMIT_FILESIZE ) {
    if( !isset($_SERVER['PHP_AUTH_USER']) ) {
        header('WWW-Authenticate: Basic realm="My realm"');
        header('HTTP/1.0 401 Unauthorized');
        echo 'Canceled';
        exit;
    }
    // Reject if either the user name or the password is wrong
    else if( $_SERVER['PHP_AUTH_USER'] != AUTH_USER ||
             $_SERVER['PHP_AUTH_PW'] != AUTH_PW ) {
        header('WWW-Authenticate: Basic realm="My realm"');
        header('HTTP/1.0 401 Unauthorized');
        echo 'Wrong credentials';
        exit;
    }
}
// If we're here, it's fine (filesize is below the limit or user is authenticated)
// offer file for download
$finfo = finfo_open(FILEINFO_MIME_TYPE);
$mime = finfo_file( $finfo, $file );
header("Content-type: ".$mime );
header("Content-Disposition: attachment; filename=".basename($file));
readfile( $file ); // stream the bytes; include() would execute any PHP code inside the file
There are many possible improvements to it, but it should work as a basis.
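The decision the script has to make can be stated compactly: small files are always served, large files only when both the user name and the password match. A minimal Python sketch of that decision, with a made-up function name status_for, might be:

```python
def status_for(size, limit, supplied, expected):
    """Return the HTTP status for a download request.

    supplied is None when no credentials were sent, otherwise a
    (user, password) tuple; expected is the (user, password) to match.
    """
    if size <= limit:
        return 200  # small file: no authentication needed
    if supplied is None:
        return 401  # large file, nothing sent: ask for credentials
    # A wrong user name or a wrong password is enough to reject.
    return 200 if supplied == expected else 401
```

For example, status_for(100, 50, ("myuser", "wrong"), ("myuser", "mypassword")) yields 401 even though the user name is correct.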
I think both answers from @Paolo Stefan and @anubhava are valid (+1).
Note that using RewriteMap will enforce a modification on the apache configuration, not just on the .htaccess file. If you think a little about performance you should, in fact, put all the things you have in .htaccess files into <directory /same/filesystem/path/as/the/.htaccess/file/> directives and put an AllowOverride None in the VirtualHost configuration. You will certainly gain some speed by avoiding File I/O and dynamic check-for-configuration-files settings for apache for each query (.htaccess files are bad, really). So the fact that RewriteMap is not available in .htaccess should not be a problem.
These two answers provide a way to dynamically alter the authentication headers based on the file size. Now, one important fact you forgot to mention in your question is that the files are not on the same server as the PHP ones, and also that you do not want to launch a script on each download.
With the current @anubhava solution you would have an OS call for the file size at each access, and of course this script would have to run on the file storage server.
One solution could be to store a file-size index somewhere (a dedicated database or key-value store?). You could feed this index after each download, or you could manage some asynchronous tasks to maintain it. Then, in the file storage server's Apache configuration, you will have to launch a script checking for file sizes. Using RewriteMap you have several options:
- use a very fast script with the prg: keyword (written in C, Perl, anything, you are not tied to PHP) that queries this file-size index in the data store, or even fastdbd: to execute the SQL query directly in Apache. But this means a query on each request, so there are other solutions.
- use a mapping file directly with the txt: keyword, holding the precomputed size for each filename; no more queries.
- even better, use a hashmap of this file with the dbm: keyword.
With the last two options, the file size index is the text file or the hashed version of this text file. Apache caches the map and refreshes the cache on restart or when the modification time of the file changes. So you would just need to recompute this hashmap after each download to obtain a very fast file size check in a RewriteRule as shown by anubhava, but using
RewriteMap checkFileSize dbm:/path/to/filesize_precomputed_index.map
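Producing the precomputed index is a matter of walking the download directory and writing one "key value" line per file, which is the layout a txt: map expects. The sketch below is an assumption about how that could be done: the function name build_size_index is made up, keys are paths relative to the root, and names containing whitespace are skipped because a txt: map separates key and value by whitespace.

```python
import os

def build_size_index(root):
    """Return 'relative/path size_in_bytes' lines for every file under root."""
    lines = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            key = os.path.relpath(full, root).replace(os.sep, "/")
            if any(ch.isspace() for ch in key):
                continue  # would be misread as key plus value
            lines.append(f"{key} {os.path.getsize(full)}")
    return sorted(lines)
```

The resulting lines can be written to a text file for a txt: map, or converted to a DBM file for the dbm: variant (Apache ships the httxt2dbm tool for that conversion).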
You could also try to use mod_security on the file servers and check the Content-Length header to add the HTTP Auth. Check this thread for a beginning of answer on that subject. But mod_security configuration is not an easy task.
Not directly, but with a little modification on the file system side you can do it.
If this is a system where users are uploading as well as downloading, you can key off of the Content-Length header in mod_rewrite to put the larger files into a separate directory. (or if you can in some other way move them either way on upload)
Next, in the directory with the larger files you set it to require auth for all files.
Finally use the existence tests in mod_rewrite to transparently redirect the user to the large_files directory if the file is there.
If you can't move the files, you could use symlinks to setup the two layouts.
According to this mod_rewrite supports a "-s" CondPattern which checks if the specified TestString is a regular file with non-zero size. The attached patch expands the "-s" option to support comparisons against arbitrary sizes.
So the line from the question:
CheckIf {REQUESTED_FILE_SIZE} 50 # MB
should work with this .htaccess code:
RewriteCond %{REQUEST_FILENAME} -s=52428800
At the time I asked the question Apache 2.4 was out so this answer could be right.
Apache Expressions
As of Apache 2.4, a new capability called Expressions was introduced. According to this, you can use some functions within your directives and also in conjunction with RewriteCond.
Fortunately, there is a function named filesize() whose result can be compared with the integer comparison operators: -eq -gt -ge -lt -le -ne
That said, the documentation has little information about this function. Below are the corresponding rules for comparing a file size:
RewriteCond expr "filesize('%{REQUEST_FILENAME}') -gt 52428800"
RewriteRule ** **
Using Expressions inside RewriteCond should be done within this syntax:
RewriteCond expr "..."
BNF:
expr ::= "true" | "false"
| "!" expr
| expr "&&" expr
| expr "||" expr
| "(" expr ")"
| comp
You can find complete BNF at documentation page.

Rewrite engine. How to translate URL

I am new to regular expressions and rewrite engine
I want to translate:
domain.com/type/id
on
domain.com/index.php?type=type&id=id
I use
RewriteRule (\w+)/(\d+)$ ./index.php?id=$1&type=$2
It works almost fine and I am able to get the two variables, but the website has a problem with including other files. My main URL is: http://domain.com/repos/site and after trying to type a URL like http://domain.com/repos/site/ee/9, Firebug says:
"NetworkError: 404 Not Found - http://domain.com/repos/site/ee/lib/geoext/script/geoext.js"
It seems the site takes "ee" as part of the URL, not as a GET variable.
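Yes, you will certainly have to change your paths. How each form of path is resolved:
- href="mypath": resolved relative to the directory of the current URL, so on http://domain.com/repos/site/ee/9 it becomes http://domain.com/repos/site/ee/mypath
- href="./mypath": same as above
- href="/mypath": resolved from the site root (http://domain.com/mypath), whatever the current URL is. This is the behavior you want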
Note: you can also use "../" to come back to the parent directory of where you are.
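The resolution rules can be checked quickly with Python's urljoin, which applies the same relative-reference rules a browser uses. The page URL is the one from the question; the root-relative path /repos/site/lib/... is an assumption about where the script really lives.

```python
from urllib.parse import urljoin

page = "http://domain.com/repos/site/ee/9"

hrefs = [
    "lib/geoext/script/geoext.js",              # relative: breaks on /ee/9
    "./lib/geoext/script/geoext.js",            # same as the first
    "../lib/geoext/script/geoext.js",           # one directory up
    "/repos/site/lib/geoext/script/geoext.js",  # root-relative: always the same
]

resolved = {href: urljoin(page, href) for href in hrefs}
for href, url in resolved.items():
    print(f"{href:42} -> {url}")
```

The first two forms resolve to the /repos/site/ee/lib/... address that Firebug reported as a 404, while the root-relative form points at the same file no matter how deep the rewritten URL is.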