Lowercasing only the file extension in a requested URL - regex

THE ISSUE
I believe I need to rewrite only the file extension of PDF files requested by end users to be lowercase, but don't want to lowercase the entire URL.
BACKGROUND
I have a web application which includes links to a few PDF files. These files have names in mixed case, but the file extensions are lowercase (Bobs_Your_Mom.pdf).
An older version of this application was just static web content. In that version, the files had uppercase file extensions (Bobs_Your_Mom.PDF)
For the sake of this example, I have no access to change the names of the PDF files, or to force all URLs in the application to be lowercase.
In front of this application is an Apache webserver acting as a ReverseProxy. Traffic coming in on :80 gets redirected to :443 and the proxy redirects the traffic through the internal firewall to the backend server, etc etc.
Presently, there is no manipulation of the URL requested by the end user via their web browser. However, though the old 'site' and new 'application' are obviously different from a technical perspective… roughly two months after the switch I am still getting requests for the relevant PDF files with uppercase file extensions (.PDF).
The web application actually doesn't expose the PDF files directly the way the old site did and the user has to take some special action to even make that request now.
I had been hoping this issue would settle as 404 errors received would alert people to changes having been made, etc. But it has not, and I continue to receive 404's for the uppercase file names even as of a few minutes ago.
TRIED
Check Code for Errors
I have validated with the developers and manually myself that no reference exists to PDF files in all caps in the application. This is actually how I discovered the old site did have it this way.
Ask Devs to Change App / Lowercase Entire URI / Business to Change File Names
Developers have indicated no ability/time to alter the application to force all URIs lowercase within the application itself, and the business has indicated the client doesn't want us to actually alter the file names in any way.
301/302 REDIRECTS
This change isn't really for SEO, the old file names redirecting to the new file names would be ok if I knew the 404s were coming from the same set of users with bookmarks (the old site was live for two months before switching to the web app). But requests are coming from entirely new users in entirely different geographic regions, and I cannot make sense of how so many random users would have a bookmark to a URL which existed for only two months without much publicity.
DUDE DUPLICATE ISSUE / CHECK OTHER POSTS
From what I have seen, others have needed help rewriting whole URIs, or parts of URIs other than the file extension, or simply hiding the extension altogether.
I am not sufficiently skilled with regex to figure this one out on my own (regex is a life struggle for me). I can't really make heads or tails of the expressions in the other posts which makes understanding what I change and why as confusing for me as regex looks to my grandmother.
YOUR HELP
With dozens of what I believe to be unnecessarily negative user experiences each day, I am hoping mod_rewrite and Apache can come to my rescue. (ALERT: I am regex illiterate).
Normally on the stacks I like to ask just to be pointed in the right direction. I believe users (including myself) should be able to piece things together and get things working with only some guidance.
In this case, I have no-one around me sufficiently talented with regular expressions to assist in this quest of mine to simply convert .PDF to .pdf whenever requested in-flight.
If I can get help to convert:
Im_An_Example.PDF
to
Im_An_Example.pdf
You will be my savior this day and win 25 whole internets.
FINAL SOLUTION
The final solution, suggested by @signal2013, is as follows:
RewriteEngine on
RewriteRule ^(.*)\.PDF$ http://exmple.com/$1.pdf [R=301,L]
The solution is simple, and I acknowledge I was making it much more complex in my mind when trying to solve this on my own.

Yep, like @marekful said, your question is a bit too long. Are you looking for something like this? It can go in a .htaccess file.
RewriteEngine on
RewriteRule ^(.*)\.PDF$ http://exmple.com/$1.pdf [R=301,L]
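To see what the pattern in that rule does, here is a minimal sketch of the same substitution in Python rather than Apache configuration. The function name `lowercase_pdf_extension` is made up for illustration, and the dot before `PDF` is escaped so that it matches only a literal period. The pattern is simple enough that it should behave the same way in mod_rewrite.

```python
import re

# Capture everything before a trailing ".PDF".
# The dot is escaped so it matches a literal period, not any character.
PDF_UPPER = re.compile(r"^(.*)\.PDF$")


def lowercase_pdf_extension(path: str) -> str:
    """Rewrite a trailing .PDF to .pdf; any other path is returned unchanged."""
    return PDF_UPPER.sub(r"\1.pdf", path)


print(lowercase_pdf_extension("Im_An_Example.PDF"))  # Im_An_Example.pdf
print(lowercase_pdf_extension("Bobs_Your_Mom.pdf"))  # Bobs_Your_Mom.pdf
```

Only the extension is touched, so the mixed case in the rest of the file name survives.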

Related

How does an XSS script get executed?

For example,
http://testsite.test/<script>alert("TEST");</script>
I know that browsers either send a request for the url if it contains only domain and resource path. If there is a query string, it gets sent by GET method. But how exactly is a script executed in the client's browser?
And why would anyone "enable" XSS?
I'm learning XSS, so please help me out!
For a very basic example, say you have a form where you ask for your user's name. Then on the next page, you (as the application developer) write "Hello, [anything the user entered]". The problem is that if the user entered something like <script>alert(1)</script> for his name, this would be printed in a vulnerable page as is, and run in the browser. This is called reflected xss, and this is only the very tip of the iceberg, for example your users might store their real name in a database and a query on a different page might list user names, which may also contain javascript which in this case would be run in a different user's browser.
The solution btw is output encoding, which in practice means replacing certain characters with safe ones, like < or > with &lt; and &gt; so that when the script tag is written, it won't be executed (but this is just a small portion of html encoding, sometimes a different one is needed).
So the point is that XSS is not deliberately enabled by a developer, quite the opposite, everybody wants to avoid it. It's a vulnerability of the code. However, sometimes it is not that straightforward, especially for people not being aware of secure coding practices.
Please note that there is much more to XSS than I have mentioned here, both on the how it can manifest itself and the how you can prevent it side.
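As a rough illustration of the output encoding described above, here is a small Python sketch using the standard library's `html.escape`. The `greet` function and the surrounding markup are invented for the example. It covers only the HTML body context, which is just one of the encodings an application may need.

```python
import html


def greet(name: str) -> str:
    """Build the greeting with the user-supplied name HTML-encoded."""
    return "<p>Hello, " + html.escape(name) + "</p>"


print(greet("<script>alert(1)</script>"))
# <p>Hello, &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

The browser then displays the script tag as text instead of running it.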

Match browsers set to Scandinavian languages based on "Accept-Language"

Question
I am trying to match browsers set to Scandinavian languages based on HTTP header "Accept-Language".
My regex is:
^(nb|nn|no|sv|se|da|dk).*
My question is whether this is sufficient, and whether anyone knows of any other odd Scandinavian (but "valid") language codes or obscure browser bugs causing false positives.
Used for
The regex is used for displaying an English link at the top of the Norwegian web pages (Norwegian is the primary language and sits at the root of the domain and sub-domains). The link takes you to the English web pages (the secondary language, in a folder under the root) when the browser language is not Scandinavian. The link can be closed / "opted out of" with a hash stored in JavaScript localStorage if the user doesn't want to see the link again. We decided not to use IP geo-location because of limited time to implement.
Depending on the language you are working in there may be code in place you can use to parse this easily, e.g. this post: Parse Accept-Language header in Java <-- Also provides a good code example
Further, are you sure you want to anchor your regex to the start of the string? Several languages can be provided (the first is intended to be "I prefer x but also accept the following"): http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Otherwise your regex should work fine based on the what you were asking and here is a list of all browser language codes: http://www.metamodpro.com/browser-language-codes
I would also - in your shoes, make the "switch to X language" link easy to find for all users until they had opted not to see it again. I would expect many people may have a preference set by default in their browser but find a site actually using it to be unexpected i.e. a user experience like:
I prefer english but don't know enough to change this setting and have never had a reason to before as so few sites make use of it.
That regular expression is enough if you are testing each item in accept-language individually.
If not individually, there are 2 problems:
One of the expected languages might appear not at the beginning of the header but later in it.
Some of the expected language abbreviations could appear as a qualifier of a completely different language.
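Both problems go away if the header is split into its items first and only the primary language subtag of each item is tested. The following Python sketch shows one way to do that. The function names are hypothetical, the set of codes is copied from the regex in the question, and treating "any acceptable language with q > 0" as a match is an assumption about the desired policy.

```python
# Codes copied from the question's regex; adjust as needed.
SCANDINAVIAN = {"nb", "nn", "no", "sv", "se", "da", "dk"}


def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Split an Accept-Language header into (tag, q) pairs, highest q first."""
    entries = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # q defaults to 1 when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        entries.append((tag, q))
    # sorted() is stable, so items with equal q keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def prefers_scandinavian(header: str) -> bool:
    """True if any acceptable item (q > 0) has a Scandinavian primary subtag."""
    for tag, q in parse_accept_language(header):
        primary = tag.split("-")[0]
        if q > 0 and primary in SCANDINAVIAN:
            return True
    return False


print(prefers_scandinavian("en-US,en;q=0.8,nb;q=0.6"))  # True
print(prefers_scandinavian("en-SE,en;q=0.9"))           # False
```

The second call shows the qualifier case: `SE` is only the region part of `en-SE`, so it does not count as a match.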

Custom client app - need ability to control where documents are saved

Okay SO. I need some guidance. I apologize for the length of this post, but I need to provide some details:
I've got someone who is interested in having me do a small project for them. The application in general is a fairly straightforward employee record keeping / documentation app, but it makes pretty heavy use of templated Word and Lotus documents. The idea is you select the employee “event” such as commendation, promotion, discipline, etc., and it loads the appropriate template doc and you fill it in from there, and later you can select an employee, view all the “events,” and view the individual documents associated with each one.
Thus, the app must know where the .docs are saved when the user is done.
The client actually has a v1 of this app (it doesn’t do any management of the files or anything, just launches Word/Lotus with the document you wanted to view in a new instance, presumably via a system() call.) We’ve not gotten into a detailed requirements phase, but the client and I agree that for this to really work, some kind of control over where the user saves the .doc’s to is going to be critical , because otherwise the app provides them with the new copy of the template doc, they "Save as" somewhere else, and the app is pointing to the blank copy it provided them with.
Obviously, I can’t think of a way to achieve “Save as” restriction/control in any way via just launching a new instance of Word. The client has the idea of an embedded Word/Lotus instance in the app with the template doc when you choose one, but I have a few reservations about that:
I’ve dug around online and I’ve read that whichever version of Word I borrow MSWORD.OLB from will be the one the end user would require?
I’ve tried to do the MSDN example of embedding a Word doc from here, but as I’ve come to get used to, the MSDN example doesn’t even compile.
Even if I CAN figure out how to embed a .doc file into their application, I don’t know that I could control the use of “Save as…”
All of this STILL hasn’t touched on Lotus (!)
So… instinctively, I feel the embedded Word/Lotus thing has to be more work than it’s worth in the end.
So I’ve had a few other ideas brewing around.
One is looking into using Office XML (and if there’s a lotus equivalent), and get the user’s “inputs” separately and generate the document on the fly each time. I’m not particularly thrilled with that idea, but I think it COULD work, provided I just use old features to try and stay far backwards compatible.
Get user’s “inputs” separately and generate a document in HTML. Meh. Works, very cross platform and easily parsed and understood, but not good if you want to be able to email it to someone (who emails a .html? Works, yes, very unconventional which to the average user will throw them off) and even worse if you need to email it to someone for revisions…
Perhaps some kind of editable PDF? I know there are PDF libraries out there, and the more I stew on it, the more this sounds like the best option, though I’ve not done much work with PDFs and I don’t know how easily embeddable they are / what options one has when creating them. I know they can be save-disabled, I’ve had that with my bloody state taxes before.
I need some input here. Here’s the TLDR questions:
Is launching a new instance of Word for each .doc as bad as I feel, given user can “Save as” document wherever and then application is left pointing to a blank document?
Is trying to support embedded Word as big of a trouble as I feel like it is / more work than it’s worth / likely to cause problems with supporting multiple versions of Word? (Forward compatibility as well as currently released versions?)
What are thoughts on the PDF plan?
Any other good ideas?
Word does allow for programming some "Save" and "Save As" control via its object model. Any subroutines coded in VBA and placed into your Word template will be copied into all documents generated from that template. Additionally, most menu and Ribbon commands can be intercepted by creating a module containing subroutines named for the intercepted commands. So, for example, if a module contains a sub named FileSaveAs(), any code in that sub will be executed instead of the standard File|Save As command. Lastly, this code will replace Save As commands executed via keystroke, toolbar, menu, or Ribbon.
The code below will launch a dialog box to a predetermined path whenever a "Save" or "Save As" command is executed:
Sub FileSave()
    ControlSaveLocation
End Sub
Sub FileSaveAs()
    ControlSaveLocation
End Sub
Sub ControlSaveLocation()
    Dim Directory As String
    Directory = "C:\Documents\"
    With Application.Dialogs(wdDialogFileSaveAs)
        .Name = Directory
        .Show
    End With
End Sub
Hope this helps.

Everything inside < > lost, not seen in html?

I have many source/text files, say file.cpp or file.txt. Now, I want to see all my code/text in a browser, so that it will be easy for me to navigate many files.
My main motive for doing all this is, I am learning C++ myself, so whenever I learn something new, I create some sample code and then compile and run it. Also, along these codes, there are comments/tips for me to be aware of. And then I create links for each file for easy navigation purpose. Since, there are many such files, I thought it would be easy to navigate it if I use this html method. I am not sure if it is OK or good approach, I would like to have some feedback.
What I did was save file.cpp/file.txt as file.html and then use the pre and code html tags for formatting, plus some more html tags needed for viewing html files.
But when I use it, everything inside < > is lost
eg. #include <iostream> is just seen as #include, and <iostream> is lost.
Is there any way to see it, is there any tag or method that I can use ?
I can use the regular HTML escape codes &lt; and &gt; for this, to see < >, but since I have many include files, changing it for all of them is a bit time-consuming, so I want to know if there is any other idea.
So is there any other solution than s/</&lt;/ and s/>/&gt;/
I would also like to know if there any other ideas/tips than just converting cpp file into html.
What I want to have is,
in my main page something like this,
tip1 Do this
tip2 Do that
When I click tip1, it will open tip1.html which has my codes for that tip. And also there is back link in tip1.html, which will take me back to main page on clicking it. Everything is OK just that everything inside < > is lost,not seen.
Thanks.
You might want to take a look at online tools such as CodeHtmler, which allows you to copy into the browser, select the appropriate language, and it'll convert to HTML for you, together with keyword colourisation etc.
Or, do like many other people and put your documentation in Doxygen format (/** */) with code samples in @verbatim/@endverbatim tags. Doxygen is good stuff.
A few ideas:
If you serve the files as mimetype text/plain, the browser should display the text for you.
You could also possibly configure your browser to assume .cpp is text/plain.
Instead of opening the files directly in the browser, you could serve them with a web server that can change the characters for you.
You could also use SyntaxHighlighter to display the code on the client side using JavaScript.
It is pretty much essential that somewhere along the line you use a program to prevent the characters '<>&' from being (mis-)interpreted by your browser (and expand significant repeated blanks into '&nbsp;'). You have a couple of options for when/how to do that. You could use static HTML, simply converting each file once before putting it into the web server document hierarchy. This has the least conversion overhead if the files are looked at more often than they are modified. Alternatively, you can configure your web server to serve the pages via a filter program (CGI, or something more sophisticated) and serve the output of that in lieu of the file. The advantage is that files are only converted when needed; the disadvantage is that the files are converted each time they are needed. You could get fancy and consider a caching solution - convert the file on first demand but retain the converted file for future use. The main downside there is that the web server needs to be able to write to where the converted file is cached - not necessarily a good idea for security reasons. (A minimalist approach to security requires the document hierarchy to be owned by and only writable by one user, say webmaster, and the web server runs as another user, say webserver. Now the web server cannot do any damage because it cannot write anywhere in the document hierarchy. Simple; effective; restrictive.)
The program can be a simple Perl script or a simple C program (the C source for webcode 1.3 is available here).
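For the static conversion option, such a program can be very small. Below is a sketch in Python (rather than Perl or C) that escapes '&', '<' and '>' and wraps the text in pre and code tags. The function name `source_to_html` and the page skeleton are assumptions for illustration. Inside a pre element whitespace is preserved, so this version does not need to expand blanks.

```python
import html


def source_to_html(source: str, title: str) -> str:
    """Wrap source text in a minimal HTML page, escaping &, < and > first."""
    escaped = html.escape(source, quote=False)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>'
        + html.escape(title)
        + "</title></head>\n"
        "<body><pre><code>" + escaped + "</code></pre></body></html>\n"
    )


page = source_to_html("#include <iostream>\nint main() { return 0; }\n", "tip1")
print(page)
```

Run once per file, this would turn file.cpp into a file.html in which `#include <iostream>` shows up in full.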

Regular expressions: Differences between browsers

I'm increasingly becoming aware that there must be major differences in the ways that regular expressions will be interpreted by browsers.
As an example, a co-worker had written this regular expression, to validate that a file being uploaded would have a PDF extension:
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$
This works in Internet Explorer, and in Google Chrome, but does NOT work in Firefox. The test always fails, even for an actual PDF. So I decided that the extra stuff was irrelevant and simplified it to:
^.+\.pdf$
and now it works fine in Firefox, as well as continuing to work in IE and Chrome.
Is this a quirk specific to asp:FileUpload and RegularExpressionValidator controls in ASP.NET, or is it simply due to different browsers supporting regex in different ways? Either way, what are some of the latter that you've encountered?
Regarding the actual question: The original regex requires the value to start with a drive letter or UNC device name. It's quite possible that Firefox simply doesn't include that with the filename. Note also that, if you have any intention of being cross-platform, that regex would fail on any non-Windows system, regardless of browser, as they don't use drive letters or UNC paths. Your simplified regex ("accept anything, so long as it ends with .pdf") is about as good of a filename check as you're going to get.
However, Jonathan's comment to the original question cannot be overemphasized. Never, ever, ever trust the filename as an adequate means of determining its contents. Or the MIME type, for that matter. The client software talking to your web server (which might not even be a browser) can lie to you about anything and you'll never know unless you verify it. In this case, that means feeding the received file into some code that understands the PDF format and having that code tell you whether it's a valid PDF or not. Checking the filename may help to prevent people from trying to submit obviously incorrect files, but it is not a sufficient test of the files that are received.
(I realize that you may know about the need for additional validation, but the next person who has a similar situation and finds your question may not.)
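As a very small first step toward checking content rather than the name, the received bytes can be tested for the `%PDF-` signature that a PDF file is expected to start with. This Python sketch is only that header check, with a made-up function name. It does not prove the file is a valid PDF, which still requires handing it to code that understands the format.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap first check: does the upload begin with the PDF signature?"""
    return data.startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n..."))        # True
print(looks_like_pdf(b"<html>not a pdf</html>"))  # False
```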
As far as I know firefox doesn't let you have the full path of an upload. Interpretation of regular expressions seems irrelevant in this case. I have yet to see any difference between modern browsers in regular expression execution.
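The effect of the missing path can be reproduced outside any browser. This sketch runs both patterns from the question through Python's `re` module, which is enough for this comparison even though it is not the browser's JavaScript engine. The sample file names are invented.

```python
import re

# The two patterns from the question, unchanged.
ORIGINAL = re.compile(r"^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$")
SIMPLIFIED = re.compile(r"^.+\.pdf$")

# A full Windows path (as some browsers report it) and a bare file name.
for value in (r"C:\docs\report.pdf", "report.pdf"):
    print(value, bool(ORIGINAL.match(value)), bool(SIMPLIFIED.match(value)))
```

The original pattern accepts the full path but rejects the bare file name, while the simplified pattern accepts both, which is consistent with the behaviour described in the question.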
If you're using JavaScript, not enclosing the regex in slashes causes an error in Firefox.
Try doing var regex = /^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$/;
As Dave mentioned, Firefox does not give the path, only the file name. Also as he mentioned, it doesn't account for differences between operating systems. I think the best check you could do would be to check if the file name ends with PDF. Also, this doesn't ensure it's a valid PDF, just that the file name ends with PDF. Depending on your needs, you may want to verify that it's actually a PDF by checking the content.
I have not noticed a difference between browsers in regards to the pattern syntax. However, I have noticed a difference between C# and Javascript as C#'s implementation allows back references and Javascript's implementation does not.
I believe JavaScript REs are defined by the ECMA standard, and I doubt there are many differences between JS interpreters. I haven't found any, in my programs, or seen mentioned in an article.
Your message is actually a bit confusing, since you throw ASP stuff in there. I don't see how you conclude it is the browser's fault when you talk about server-side technology or generated code. Actually, we don't even know if you are talking about JS on the browser, validation of upload field (you can no longer do it, at least in a simple way, with FF3) or on the server side (neither FF nor Opera nor Safari upload the full path of the uploaded file. I am surprised to learn that Chrome does like IE...).