Regex help needed! - regex

I'm working on a multilingual application that uses IIS7-based URL rewriting.
I'd like the following URL mappings:
1. fr-ca > index.aspx?l=&lc=fr-ca
2. fr-ca/ > index.aspx?l=&lc=fr-ca
3. fr-ca/568/sometitle > index.aspx?l=568&lc=fr-ca
4. 568/sometitle > index.aspx?l=568&lc=
Essentially, the initial fr-ca is optional.
My current rule:
<match url="^(fr-ca.)?([^/][0-9]+)?/*" />
Fails on #1
Another attempt:
<match url="^(fr-ca)?(.[0-9]+)?/*" />
Passes all requirements, except the back reference {R:2} gives /568 in this case.
I suppose I could add another rule that adds a / to the end of a fr-ca only, but that doesn't seem right.
Thanks for any help! Regex drives me bonkers.

Should you be doing it this way? Instead of seeing which url they hit (and redirecting to what I presume is a translated version of the page), you could actually check the Accept-Language header... probably even from within that ASP page.
This means that they see the language they want to see it in from the beginning, and without clicking on that dumb little flag at the top of the page. It doesn't rely on GeoIP, or user interaction.
Check it out.

Sorted it out myself.
^(fr-ca)?/?([0-9]+)?
passes:
fr-ca
fr-ca/
fr-ca/9
fr-ca/9/
fr-ca/9/sometitle
fr-ca/9/sometitle/
fr-ca/9/sometitle/anothertitle
9
9/
9/sometitle
9/sometitle/
9/sometitle/anothertitle
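
To check the pattern above against the four mappings from the question, here is a minimal sketch in Python. It assumes Python's `re` module treats this simple pattern the same way the IIS rewrite engine does (not verified against IIS itself), and the helper name `rewrite_target` is hypothetical.

```python
import re

# The pattern from the answer above: optional locale, optional slash, optional numeric id.
PATTERN = re.compile(r"^(fr-ca)?/?([0-9]+)?")


def rewrite_target(path: str) -> str:
    """Build the rewritten URL from the two back-references ({R:1} and {R:2} in IIS)."""
    match = PATTERN.match(path)  # always matches, because every part is optional
    locale = match.group(1) or ""
    listing_id = match.group(2) or ""
    return f"index.aspx?l={listing_id}&lc={locale}"


for path in ["fr-ca", "fr-ca/", "fr-ca/568/sometitle", "568/sometitle"]:
    print(path, "->", rewrite_target(path))
```

Because the slash sits outside both groups, the second back-reference holds only the digits, which is what the earlier attempt got wrong.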

Related

Rewrite engine. How to translate URL

I am new to regular expressions and rewrite engine
I want to translate:
domain.com/type/id
on
domain.com/index.php?type=type&id=id
I use
RewriteRule (\w+)/(\d+)$ ./index.php?type=$1&id=$2
It works almost fine and I am able to get the two variables, but the website has a problem including other files. My main URL is http://domain.com/repos/site, and after typing a URL like http://domain.com/repos/site/ee/9, Firebug says:
"NetworkError: 404 Not Found - http://domain.com/repos/site/ee/lib/geoext/script/geoext.js"
It seems the site treats "ee" as part of the URL path, not as a GET variable.
Yes, you will have to change your paths. The browser resolves relative paths against the URL it sees, not against the rewritten one:
- href="mypath": resolved relative to the directory of the current URL, so on http://domain.com/repos/site/ee/9 it points to /repos/site/ee/mypath
- href="./mypath": same as above
- href="/mypath": resolved from the site root, regardless of the current URL. This is the behavior you want
Note: you can also use "../" to go up to the parent directory of where you are.
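
The resolution rules can be reproduced with a short Python sketch. It uses the standard `urllib.parse.urljoin`, which follows the same relative-reference rules a browser applies; the script path is the one from the 404 message, and the root-relative form `/repos/site/lib/...` is an assumption about where the files actually live.

```python
from urllib.parse import urljoin

# The pretty URL the browser sees after the rewrite rule is in place.
page = "http://domain.com/repos/site/ee/9"

# A plain relative path is resolved against the "directory" of the page URL.
relative = urljoin(page, "lib/geoext/script/geoext.js")

# "./" behaves the same way.
dot_relative = urljoin(page, "./lib/geoext/script/geoext.js")

# A root-relative path ignores the current directory entirely.
root_relative = urljoin(page, "/repos/site/lib/geoext/script/geoext.js")

print(relative)
print(dot_relative)
print(root_relative)
```

The first two results contain the extra "ee" segment, which is exactly the URL reported in the 404.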

How do I imitate Twitter's URL shortener?

The main question is a bit short, so I'll elaborate.
I'm building an app for Twitter with which you can do the basic actions (get posts, post, reply, etc.).
I figured it would be a good idea to check the 140-character limit in my app.
So far so good; then someone asked if I could also handle the URL shortening.
At the moment I have a regex that picks up most URLs (in fact too many), takes their length, and either adds or deducts the difference from the 140 maximum.
It's still a bit buggy, but I can manage that.
Now my problem:
Twitter seems quite picky about what it considers a URL.
I've got the most basic ones (starting with http(s):// and such), but Twitter also readily replaces bare domains under some TLDs ((www.)google.com and [whatever].net/.biz/.info are just a few of them),
but not .nl, .de, or .tk.
I was wondering whether someone has found out which ones they shorten and which ones they don't.
Because I'm pretty sure my regex isn't the best either, I'll drop it here as well:
((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)|([\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)
http://support.twitter.com/articles/78124-how-to-shorten-links-urls# indicates that all URLs posted to Twitter will be rewritten to be exactly 19 characters long.
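
If every link really is rewritten to a fixed length, the character count can be computed by substituting a placeholder of that length for each detected URL. The sketch below is only an illustration: the figure 19 is the one quoted above, and `URL_PATTERN` is a deliberately simplified assumption that catches only links with an explicit http(s) scheme.

```python
import re

SHORTENED_LENGTH = 19  # length quoted from the support article above
TWEET_LIMIT = 140

# Simplified assumption: only URLs with an explicit scheme are detected here.
URL_PATTERN = re.compile(r"https?://\S+")


def effective_length(text: str) -> int:
    """Length of the text once every detected URL counts as SHORTENED_LENGTH characters."""
    return len(URL_PATTERN.sub("x" * SHORTENED_LENGTH, text))


def fits(text: str) -> bool:
    return effective_length(text) <= TWEET_LIMIT


print(effective_length("see http://example.com/a/very/long/path now"))
```

Note that a URL shorter than 19 characters makes the count go up, so the adjustment can work in both directions.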
I am using this: var url_expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi; Nobody has complained :)
I figured it out. I found a pretty important line on the TLD Wikipedia page: all country TLDs are two characters long, and conversely all two-character TLDs are country codes. With that in mind, I tested a bunch of them with Twitter, and I'm pretty sure I now know which URLs Twitter shortens and which it doesn't.
All URLs starting with http:// or https://
All URLs like [something].[non-country TLD] # .com .biz .mobi etc. (except .arpa and .aero)
All URLs like [something].[something].[valid TLD] # including countries
Links like http://[user]:[pass]@[something].[tld] will NOT be shortened
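
The four observations above can be written down as a small classifier before attempting a single regex. This is a rough sketch of the rules as stated, not of Twitter's actual behavior: the function name `probably_shortened` is hypothetical, the generic TLD set is an assumption borrowed from the regex in this post, and any two-letter suffix is treated as a country code.

```python
import re

# Assumption: generic TLD list borrowed from the regex in this post (.arpa and .aero excluded).
GENERIC_TLDS = {
    "com", "asia", "cat", "coop", "edu", "int", "tel", "pro", "org", "net",
    "gov", "mil", "biz", "info", "mobi", "name", "jobs", "museum", "travel",
}


def probably_shortened(candidate: str) -> bool:
    lowered = candidate.lower()
    has_scheme = lowered.startswith(("http://", "https://"))
    without_scheme = re.sub(r"^https?://", "", lowered)
    # Everything before the first "/", "?" or "#" is the host (plus optional user info).
    authority = re.split(r"[/?#]", without_scheme, maxsplit=1)[0]
    if "@" in authority:   # user:pass@host links are not shortened
        return False
    if has_scheme:         # any http:// or https:// URL is shortened
        return True
    labels = authority.split(".")
    if len(labels) < 2 or not all(labels):
        return False
    tld = labels[-1]
    if len(labels) == 2:   # bare domain.tld: generic TLDs only
        return tld in GENERIC_TLDS
    # something.something.tld: generic or two-letter country TLD
    return tld in GENERIC_TLDS or re.fullmatch(r"[a-z]{2}", tld) is not None


for sample in ["google.com", "example.nl", "www.example.nl", "http://user:pass@example.com"]:
    print(sample, probably_shortened(sample))
```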
Now to build a regex for it; I'll post it here as soon as I think I have it :D
This is what I have so far:
/(^(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?:(?:[-\w]+\.)+(?:com|asia|cat|coop|edu|int|tel|pro|org|net|gov|mil|biz|info|mobi|name|jobs|museum|travel|([a-z]{2})))(?::[\d]{1,5})?(?:(?:(?:\/(?:[-\w~!$+|.,=\(\)]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)/gim;
One major flaw remains: it also accepts [domain].[tld], which Twitter doesn't.
I hope this will help someone in the future. I'm pretty sure there isn't much easy-to-find information about this on the web (or at least I couldn't find any).

Can't finish my web-site validator regular expression

I am preparing for my exams and I am stuck on regex validation. I would like to validate an entered website URL. I've searched for a solution here, but have not found one that fulfills my needs. For example, these links should be accepted:
http://www.yahoo.com/cheers/peter.aspx
http://www.yahoo.com/asd/
http://www.regularexpressions.com/reference.html
http://www.gandon.com/
and this should not:
http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx
For the moment, the closest expression I've got is:
http://(www\.)([^\.]+)(\.com)(/([^\.]+)(\.html|\.aspx))?
It may be a bit rough, since this is my first time working with regexes.
But in a regex tester (I am using regexpal) it highlights/accepts:
http://www.yahoo.com from #2 (without /asd/)
http://www.yahoo.com/cheers/peter/steven/mar s.aspx from #6 (although there are spaces)
http://www.radsoftware.com from #5 (but should not accept it at all)
http://www.gandon.com from #4 (without / , but it is not so critical)
What should be changed in my regex?
P.S. Sorry for such a long story, I am just a beginner.
The only difference I see is whether the domain has a multi-part suffix (like co.uk or com.au).
So that is what I check for:
^.*www\.[a-zA-Z]*\.[a-zA-Z]{1,3}/([a-zA-Z].*|)
This just checks that the host has a single TLD, optionally followed by more parts in the URL.
I do NOT check whether it starts with http://, as that is not strictly required for a URL. I also do not check the document type (html or aspx), since that can vary.
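
As a quick check, the single-TLD idea can be run against the sample links from the question. This is a sketch in Python using the answer's pattern with the literal dots escaped (an unescaped `.` would match any character); the helper name `is_accepted` is hypothetical.

```python
import re

# The answer's pattern, with the literal dots escaped.
SINGLE_TLD = re.compile(r"^.*www\.[a-zA-Z]*\.[a-zA-Z]{1,3}/([a-zA-Z].*|)")


def is_accepted(url: str) -> bool:
    return SINGLE_TLD.match(url) is not None


samples = [
    "http://www.yahoo.com/cheers/peter.aspx",
    "http://www.yahoo.com/asd/",
    "http://www.regularexpressions.com/reference.html",
    "http://www.gandon.com/",
    "http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx",
]
for url in samples:
    print(url, is_accepted(url))
```

The pattern requires a slash directly after the first suffix, which is why the `.com.au` address falls through.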

ColdFusion -- Do I need URLDecode with form POSTs? / URLDecode randomly removes one character

I'm using a WYSIWYG to allow users to format text. This is the error-causing text:
<p><span style="line-height: 115%">This text starts with a 'T'</span></p>
The error is that the 'T' in "This", or whatever the first letter happens to be, is randomly removed when using URLDecode and saving to the DB. Removing URLDecode on the server side seems to fix it without any negative side-effects (the DB contains the same information).
The documentation says that
Query strings in HTTP are always URL-encoded.
Is this really the case? If so, why doesn't removing URLDecode seem to mess everything up?
So two questions:
Why is URLDecode causing the first text character to be removed like this (it seems to only happen when the line-height property is present)?
Do I really need (or would I even want) to use URLDecode before putting POSTed data into the database?
Edit: I made a test page to echo back the decoded text, and URLDecode is definitely removing that character, but I have no idea why.
I believe decoding is done automatically when the form scope is populated. That's why characters after % (the character used for encoding) are removed: you are trying to decode the string a second time.
For security reasons, you might be interested in stripping script tags, or even cleaning up the HTML using a whitelist. Try searching CFLib.org for applicable functions.
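
The general hazard of decoding twice can be shown outside ColdFusion. The sketch below uses Python's `urllib.parse` purely as an analogy; Python leaves malformed escapes untouched, so it does not reproduce the exact dropped character, and the sample text is invented for illustration.

```python
from urllib.parse import quote, unquote

# Invented sample: user text that happens to contain a percent sign followed by hex digits.
typed = "discount 50%41 today"

on_the_wire = quote(typed, safe="")     # what the browser sends in the POST body
decoded_once = unquote(on_the_wire)     # what the server hands to the application
decoded_twice = unquote(decoded_once)   # what an extra, manual decode produces

print(on_the_wire)
print(decoded_once)
print(decoded_twice)
```

After one decode the text is back to what the user typed; the second decode reinterprets the literal `%41` as an escape and corrupts it.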

Mod_rewrite syntax with query strings

Embarrassing as this may be, I've hit a wall with mod_rewrite trying to come up with what seems to be a simple rule.
I'd like to accomplish the following mapping:
/cat/subcat which may have a "?PageId=123" afterwards
should become
/cat.php?cid=148 or (/cat.php?cid=148&PageId=123)
So for example, the following 2 mappings would occur:
/cat/subcat => /cat.php?cid=148 (the 148 part can be ignored, it's taken care of)
/cat/subcat?PageId=2 => /cat.php?cid=148&PageId=2
Note that there's an & in the second clause... The parameter will always be PageId
Can this be done?
Thanks so much in advance!
Apparently a little elbow grease worked (after 5 hours)...
It turns out the rule is just:
RewriteRule ^/cat/subcat /cat.php?cid=148 [QSA]
I was missing the QSA flag, which appends the original query string to the rewritten URL instead of discarding it.
-Adam
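
To make the effect of QSA concrete, here is a small Python simulation of the two mappings from the question. It only mimics the merging behavior for this one rule and is not how mod_rewrite is implemented; the function name `rewrite_with_qsa` is hypothetical.

```python
def rewrite_with_qsa(path: str, query: str = "") -> str:
    """Mimic the rule above: rewrite /cat/subcat and append the original query string."""
    if not path.startswith("/cat/subcat"):
        # Rule does not apply; return the request unchanged.
        return f"{path}?{query}" if query else path
    target = "/cat.php?cid=148"
    # QSA: the incoming query string is appended with "&" instead of being dropped.
    return f"{target}&{query}" if query else target


print(rewrite_with_qsa("/cat/subcat"))
print(rewrite_with_qsa("/cat/subcat", "PageId=2"))
```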