htaccess REGEX to accept a particular set of characters in URL - regex

I have a particular set of characters that I want to mark as acceptable:
a-z,A-Z,0-9,*,%,/,\,_.
So far I have been able to create an .htaccess file that enables this for:
a-z,A-Z,0-9,_,
but the rest of the characters are not working. The URL gives a 404 error. Here is the regex I am using:
[a-zA-Z0-9_]
UPDATE
PS: I have checked many of the SO questions related to htaccess, but my problem is not solved. They were helpful, but whenever I try to add the remaining characters to the character set of my regex, it gives a 404 error.

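Almost all characters lose their special meaning inside a character class, including * and /. The backslash is the exception: it is still the escape character there, so a literal backslash has to be written as \\.
Try this: [a-zA-Z0-9_/%*\\]+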
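A quick way to sanity-check a character class like this outside Apache is Python's re module. This is only a sketch: it assumes Python's handling of these simple constructs matches Apache's PCRE, and the helper name is_allowed is made up for illustration.

```python
import re

# Character set from the question: letters, digits, _, /, %, * and a backslash.
# Inside [...] the characters * % / are literal. The backslash is still an
# escape character, so a literal backslash is written as \\ in the regex.
ALLOWED = re.compile(r"[a-zA-Z0-9_/%*\\]+")

def is_allowed(path: str) -> bool:
    """Return True if every character of path belongs to the allowed set."""
    return ALLOWED.fullmatch(path) is not None

print(is_allowed("abc_123/x%20*"))  # True
print(is_allowed("dir\\file"))      # True: the string holds one backslash
print(is_allowed("has-dash"))       # False: '-' is not in the set
```

Note that mod_rewrite matches against the URL path after Apache has processed it, so what reaches the pattern may differ from what was typed in the browser.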

Related

How to only show id value on url path with htaccess?

What I have right now is
https://www.example.com/link.php?link=48k4E8jrdh
What I want to accomplish is to get this URL instead =
https://www.example.com/48k4E8jrdh
I looked on the internet but with no success :(
Could someone help me and explain how this works?
This is what I have right now (Am I in the right direction?)
RewriteEngine On
RewriteRule ^([^/]*)$ /link.php?link=$1
RewriteRule ^([^/]*)$ /link.php?link=$1
This is close, except that it will also match /link.php (the URL being rewritten to) so will result in an endless rewrite-loop (500 Internal Server Error response back to the browser).
You could avoid this loop by simply making the regex more restrictive. Instead of matching anything except a slash (ie. [^/]), you could match anything except a slash and a dot, so it won't match the dot in link.php, and any other static resources for that matter.
For example:
RewriteRule ^([^/.]*)$ link.php?link=$1 [L]
You should include the L flag if this is intended to be the last rule. Strictly speaking you don't need it if it is already the last rule, but otherwise if you add more directives you'll need to remember to add it!
If the id in the URL should only consist of letters and digits, as in your example, then consider just matching what is needed (eg. [a-zA-Z0-9]). Generally, the regex should be as restrictive as required. Also, how many characters are you expecting? Currently you allow from nothing to "unlimited" length.
Just in case it's not clear, you do still need to change the actual URLs you are linking to in your application to be of the canonical form. ie. https://www.example.com/48k4E8jrdh.
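To see why the original pattern loops and the stricter one does not, here is a minimal sketch in Python. It assumes the paths are matched without a leading slash, as in a per-directory .htaccess context, and the helper name matches is hypothetical.

```python
import re

LOOSE = re.compile(r"^([^/]*)$")    # original rule: anything without a slash
STRICT = re.compile(r"^([^/.]*)$")  # suggested rule: no slash and no dot

def matches(pattern, path):
    """Return True if the rewrite pattern would match this path."""
    return pattern.search(path) is not None

for path in ["48k4E8jrdh", "link.php", "style.css"]:
    # LOOSE matches link.php as well, which is what feeds the rewrite loop.
    print(path, matches(LOOSE, path), matches(STRICT, path))
```

The stricter pattern leaves link.php and other files with an extension alone, so the rewritten request is not rewritten again.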
UPDATE:
It works but now the site always sees that page regardless if it is link.php or not? So what happens now is this: example.com/idu33d3dh#dj3d3j And if I just do this: example.com then it keeps coming back to link.php
This is because the regex ^([^/.]*)$ matches 0 or more characters (denoted by the * quantifier), so it also matches the empty path of the homepage. You probably want to match at least one character (or some minimum number)? For example, to match between 1 and 32 characters change the quantifier from * to {1,32}. ie. ^([^/.]{1,32})$.
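The difference between the two quantifiers can be checked with a short Python sketch (the helper name hit is an assumption for illustration; the empty string stands in for a request to the bare domain):

```python
import re

ZERO_OR_MORE = re.compile(r"^([^/.]*)$")   # also matches the empty path
ONE_TO_32 = re.compile(r"^([^/.]{1,32})$")  # needs 1 to 32 characters

def hit(pattern, path):
    """Return True if the pattern matches the given path."""
    return pattern.search(path) is not None

print(hit(ZERO_OR_MORE, ""))      # True: the homepage gets rewritten too
print(hit(ONE_TO_32, ""))         # False: the homepage is left alone
print(hit(ONE_TO_32, "a" * 33))   # False: longer than the allowed maximum
```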
Incidentally, the fragment identifier (fragid) (ie. everything after the #) is not passed to the server so this does not affect the regex used (server-side). The fragid is only used by client-side code (JavaScript and HTML) so is not strictly part of the link value.
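As a small illustration of which part of the URL is the fragment, Python's urllib.parse can split the example from the comment. This only shows how the URL decomposes; it is the browser that leaves the fragment out of the request.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.example.com/idu33d3dh#dj3d3j")

print(parts.path)      # /idu33d3dh  (this is what the server-side rule sees)
print(parts.fragment)  # dj3d3j      (stays in the browser)
```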

regular expressions: catch any URLs of the domain example.com

I'm trying to write a regexp for the case below. I have made multiple attempts, but in vain.
I need to catch any URLs of the domain site.com. I tried using the regexp ^site.com/*$
but it does not recognize them.
I'm just looking for a regexp which matches site.com/*
With your expression ^site.com/*$ you match all strings that start with site.com and have zero or more trailing / characters (/*).
If you want to match any string starting with site.com/ you might want to try ^site\.com/.*$.
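The two expressions can be compared side by side with a short Python sketch. The sample strings are made up for illustration, and the context the asker needs the regex for is unknown, so treat this as a syntax check only.

```python
import re

ORIGINAL = re.compile(r"^site.com/*$")     # unescaped dot, then zero or more slashes
SUGGESTED = re.compile(r"^site\.com/.*$")  # literal dot, a slash, then anything

samples = ["site.com", "site.com///", "site.com/foo", "siteXcom/"]

# Map each sample to (matched by ORIGINAL, matched by SUGGESTED).
results = {s: (bool(ORIGINAL.search(s)), bool(SUGGESTED.search(s))) for s in samples}

for sample, outcome in results.items():
    print(sample, outcome)
```

The last sample shows why the dot should be escaped: without the backslash it matches any character, not just a literal dot.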
There are already a lot of other regex questions regarding domain names on SO, but your question is not clear to me in what context you are trying to do this, or what is the actual goal you want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), so it needs to be escaped, and if you want to capture site.com/foo you will want a pattern that does not limit the number of characters after the slash. I'd do this with groups.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
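For readers who prefer to try it locally, here is a rough Python equivalent of the grouped pattern. The function name split_url is hypothetical, and the unnecessary escape on the slash is dropped.

```python
import re

PATTERN = re.compile(r"^(site\.com/)(.+)$")

def split_url(url):
    """Return (domain part, rest) if the URL matches, otherwise None."""
    m = PATTERN.match(url)
    return m.groups() if m else None

print(split_url("site.com/foo/bar"))  # ('site.com/', 'foo/bar')
print(split_url("site.com/"))         # None: .+ needs at least one character
```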
Your regex ^site.com/*$ only matches strings like the following:
e.g. site.com/ site.com//////// site.com
because the asterisk (*) in regex means "match 0 or more of the preceding token".
So this should work:
^site.com\/.*$

JMeter Proxy exclusion patterns still being recorded

I am using JMeter to record traffic in my browser. In my URL Patterns to Exclude are:
.*\.jpg,
.*\.js,
.*\.png
Which looks like they should block these patterns (I've even tested it with a regex tester here)
Yet, I still see plenty of these files get pulled up. In a related forum someone had a similar issue, but his was caused by having additional url parameters afterwards (eg www.website.com/image.jpg?asdf=thisdoesntmatch). However this doesn't seem to be the case here. Can anyone point me in the right direction?
As already mentioned in the question comments it is probably a problem with the trailing characters. The pattern matcher is executed against the complete url including parameters.
So a URL such as http://example.com/layout.css?id=123 is not matched by the pattern .*\.css. The JMeter HTTP Request sampler separates the path and the parameters, so this might not be obvious when you look at the URL.
Solution: Change the pattern to allow trailing characters: .*\.css.*
Explained
.* Any sequence of characters (zero or more)
\. The literal . (dot) character
css The character sequence css
.* Any sequence of characters (zero or more)
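The effect of the trailing .* can be reproduced in Python. This sketch assumes, as the answer states, that the pattern has to match the complete URL including parameters; re.fullmatch is used to mimic that, and the helper name excluded is made up.

```python
import re

def excluded(url, pattern):
    """Return True if the pattern matches the entire URL."""
    return re.fullmatch(pattern, url) is not None

url = "http://example.com/layout.css?id=123"

print(excluded(url, r".*\.css"))    # False: the query string is left over
print(excluded(url, r".*\.css.*"))  # True: trailing characters are allowed
```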
Maybe you can do the opposite: leave the URL Patterns to Exclude blank and negate those patterns in the URL Patterns to Include box:
(?!.*\.(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.*)
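A negative-lookahead include pattern of this kind can be tried out in Python. The pattern below, (?!.*\.(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.*), is my reading of the intended form and is an assumption, as are the sample URLs and the helper name included.

```python
import re

# Reject any URL that contains a dot followed by one of the listed extensions.
INCLUDE = re.compile(r"(?!.*\.(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.*)")

def included(url):
    """Return True if the whole URL passes the include pattern."""
    return INCLUDE.fullmatch(url) is not None

print(included("http://example.com/page.html"))        # True
print(included("http://example.com/app.js"))           # False
print(included("http://example.com/image.jpg?x=1"))    # False
print(included("http://example.com/data.json"))        # False, see note
```

One thing to watch: the extension is not anchored, so .json is rejected as well because it starts with .js.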

Regex with special characters

I am using some url rewriting for a site I am currently working on. I have the rewriting working fine for numbers, letters and - using this:
RewriteRule ^([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)$ characters.php?realm=$1&name=$2 [NC]
However I need the first part of the rule to allow for ' to be allowed, and I need the second part of the rule to allow for special characters such as ú, æ, ä, ç, and pretty much all of the characters located here that resemble a letter.
I know special characters are bad, but I am not the one that allowed them to be used in character names. I just need them to be allowed in my rule so that the characters with those names can access my application.
Thanks.
Edit: The first part works now. Experimenting with the second part at the moment.
Edit #2: Trying both solutions for the second part (([^/]+) and excluding certain characters) allows the information to display instead of resulting in a 404 error. However, it is causing my CSS not to be displayed; instead of /css it is trying to call /css/error. According to Chrome, it is causing a redirect loop for the CSS file.
The only way something should be redirected to /error is if the character data is invalid. This application is being used to pull character information from the Blizzard character API, so it is essential that the accented characters can be used in the rewriting.
I'm not sure if this is important or not, but when I try to allow just ú along with a-z and A-Z, I get a 404 error stating that the page cannot be found, but instead of displaying the ú it displays Ãº in its place.
^([a-zA-Z0-9_'-]+)/([^excluded characters here]+)$
Adding the "'" inside the first brackets should allow it to be matched. As for the second part, you may want to use a negated character class and list the characters that you do not allow.
Edited to move the single quote before the "-".
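Here is a rough Python check of that shape of rule, using ([^/]+) for the second part as mentioned in the question's second edit. The sample paths and the function name parse are made up for illustration, and Python's regex behaviour is assumed to match Apache's for these constructs.

```python
import re

# The hyphen sits last in the first class so that it is taken literally.
RULE = re.compile(r"^([a-zA-Z0-9_'-]+)/([^/]+)$")

def parse(path):
    """Return (realm, name) if the path matches the rule, otherwise None."""
    m = RULE.match(path)
    return m.groups() if m else None

print(parse("Kel'Thuzad/Súper"))  # ("Kel'Thuzad", 'Súper')
print(parse("css/style.css"))     # ('css', 'style.css')
```

The second result hints at a possible cause of the CSS problem: any two-segment path, including one under /css, also matches such a broad rule.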
Simplest solution I could come up with: ú displays as Ãº, for example, so I simply included Ãº in the rule. Not sure if it is the best way, but it works for now.
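A two-character stand-in for ú is what you would see if its UTF-8 bytes were read as a single-byte encoding. The sketch below assumes Latin-1 as that encoding, which is a guess about this particular setup.

```python
original = "ú"

# ú is encoded as two bytes in UTF-8.
utf8_bytes = original.encode("utf-8")

# Reading those two bytes as Latin-1 yields two separate characters.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)    # b'\xc3\xba'
print(misread)       # Ãº
print(len(misread))  # 2
```

That would explain why a rule containing the two-character form matches: the pattern is then compared byte for byte with the UTF-8 encoded path.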

causing headache in RewriteRule

I am struggling with a very basic regex problem in my .htaccess file that I hope someone may be able to shed some light on. The basic premise is that I would like to teach Apache to switch any .html extension into a .var extension. I had thought that the rule would be positively trivial:
RewriteRule ^([^.]+)\.html$ $1.var
But the [^.] part simply doesn't work. Bizarrely, it works like so
RewriteRule ^([^A-Z]+)\.html$ $1.var
I do not understand why this latter rule works. Assume I am looking for a file called "index.html" then $1 should match to "index." and the ".html" bit should actually fail to match.
To widen the scope of the question slightly, I am actually racking my brain on how to implement a multi-lingual site. I don't like Apache's MultiView option because it forces upon me a flat directory structure with file extensions that aren't recognizable to many development tools. I could go the .var type-map route but am finding that the default config for Apache doesn't support this all that well either (hence my excursions into regex land). So while I am using mod_rewrite, I am thinking that I might go the whole hog: whenever a request for a name.html file is received and this file does not exist, check whether there exists a XX/name.html file instead, where "XX" is the language code according to the user's preferences.
This would give me a neater directory structure, though it perhaps does not perform as well as the .var approach in a situation where the language preference of the user's browser is not supported by my site (in which situation .var would substitute EN or similar).
Any thoughts? Thanks.
Why don't you just use ^(.*)\.html$? This will match any string that ends in .html. After all, filenames can contain more than one dot.
[^A-Z]+ matches index if the regex is applied case-sensitively. Perhaps that's the reason? Why [^.]+ should fail is beyond me, though.
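Outside a character class, the . matches everything but newlines; inside a character class it is just a literal dot.
Inside of a character class, a leading ^ means "not".
The + means one or more of the preceding character class.
So when you write ([^.]+), that says "match one or more characters that are not a dot". For index.html that captures index, so the pattern itself is sound; it only fails for names that contain another dot before ".html".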
^([^A-Z]+)\.html$ works because it matches one or more characters that are not uppercase letters. If you have any uppercase letters before the ".html" in your URL, this one will fail too.
Tim Pietzcker's suggestion is correct: just use ^(.*)\.html$, keeping in mind that this won't work in the odd case that you have newlines in your URL.
In the odd case that you actually have URLs with newlines in them, you can use ^([\d\D]+)\.html$, which will match digits and non-digits (i.e. everything) up until the ".html".
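For anyone wanting to check what the three patterns from this thread capture, here is a minimal Python sketch. It assumes Python's re behaves like Apache's PCRE for these simple patterns, and the helper name stem is hypothetical.

```python
import re

NO_DOT = re.compile(r"^([^.]+)\.html$")      # the asker's first rule
NO_UPPER = re.compile(r"^([^A-Z]+)\.html$")  # the rule that "bizarrely" worked
ANY = re.compile(r"^(.*)\.html$")            # the suggested replacement

def stem(pattern, name):
    """Return the captured part before .html, or None if there is no match."""
    m = pattern.match(name)
    return m.group(1) if m else None

print(stem(NO_DOT, "index.html"))    # index
print(stem(NO_DOT, "a.b.html"))      # None: a second dot breaks [^.]+
print(stem(ANY, "a.b.html"))         # a.b
print(stem(NO_UPPER, "Index.html"))  # None: uppercase letter is excluded
```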