Regex parse subdomain and redirect

I have been trying to figure this out for nearly 2 hours now and can't get it working, so I hope some of you can help me out. Please note I am pretty new to PHP and regex.
OK, so I am trying to set up a redirect that sends visitors to certain pages based on text in the subdomain. I have wildcards set up for my subdomains, and any subdomain URL now redirects to the PHP file containing the code below.
So if someone comes to anytexthere.domain.com they arrive at domain.com/redirect.php
This php file will be used to send them to the correct relevant URL.
To do this I am going to use subdomains such as anythinghere-1.domain.com and then have my code check what appears in the subdomain after - and before the .domain. If it equals in this case 1 go to whatever URL matches it and so on.
Code so far:
<?php
$host = $_SERVER['SERVER_NAME'];
preg_replace('(?<=-).*?(?=\.)', $host, $matches);
$url = $matches;
switch ($url)
{
case "1":
header("Location: http://www.website.com/page-here/");
break;
case "test":
header("Location: http://www.website.com/this-is-a-test/");
break;
case "another":
header("Location: http://www.website.com/another-page-here/");
break;
default:
header("Location: http://www.website.com");
break;
}
?>
As you can see I have cases there that use text e.g "another" instead of numbers because I will need it to work if I choose to use text or numbers after the - in the subdomain.
This code as it is now is giving me the following errors:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in thelocationhere on line 4
and
Warning: Cannot modify header information - headers already sent by (output started at thelocationhere:4) in thelocationagain on line 22
I believe the second error above has something to do with the default case.
I really appreciate any help with this. I have spent a lot of time searching all over for snippets of code and testing them but I haven't got it to work on my own.
Thanks,
Peter

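You have a small mistake: you used preg_replace instead of preg_match. You also need to wrap your regular expression in delimiters such as /, so it should be:
preg_match('/(?<=-).*?(?=\.)/', $host, $matches);
Note that preg_match fills $matches with an array, so the matched text is in $matches[0]; that is the value to pass to the switch.
By the way, you didn't mention how you are making everyone who tries to access *.domain.com reach domain.com/redirect.php, but I'm assuming you are using htaccess and the RewriteEngine. You could do the redirect directly in the RewriteRule.
"Headers already sent" means some output was sent to the browser before the header() call. Here the warning itself points at the cause: "output started at thelocationhere:4" is the same line as the preg_replace warning, so the printed warning is what produced the output. If the error persists after fixing the regex, check whether the file was saved as UTF-8 without BOM, whether anything is printed before the redirect, and whether there are stray spaces at the beginning of the file.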
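For readers who want to try the pattern outside PHP, here is a minimal Python sketch of the same idea: extract the text between the first "-" and the next "." in the host name, then look it up in a table. The ROUTES table and the function name redirect_target are illustrative assumptions that mirror the switch statement in the question; they are not part of the original code.

```python
import re

# Same pattern as the corrected preg_match call:
# the text after a "-" and before the next "."
SUBDOMAIN_KEY = re.compile(r"(?<=-).*?(?=\.)")

# Hypothetical routing table mirroring the switch statement in the question
ROUTES = {
    "1": "http://www.website.com/page-here/",
    "test": "http://www.website.com/this-is-a-test/",
    "another": "http://www.website.com/another-page-here/",
}
DEFAULT_URL = "http://www.website.com"


def redirect_target(host: str) -> str:
    """Return the URL to redirect to for a given host name."""
    match = SUBDOMAIN_KEY.search(host)
    # group(0) is the whole match, the equivalent of $matches[0] in PHP
    key = match.group(0) if match else None
    return ROUTES.get(key, DEFAULT_URL)


print(redirect_target("anythinghere-1.domain.com"))
print(redirect_target("anytexthere.domain.com"))
```

Note that the pattern starts at the first hyphen, so a host such as my-site-another.domain.com would yield "site-another" rather than "another"; anchoring on the last hyphen would need a different pattern.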

Related

How to replace characters in an nginx variable string?

Is there a way I can replace non alphanumeric characters returned with $request_uri with a space (or a +)?
What I'm trying to do is redirect all 404s in one of my sites to its search engine, where the query is the URI requested. So, I have a block in my nginx.conf containing:
error_page 404 = @notfound;
location @notfound {
return 301 $scheme://$host/?s=$request_uri;
}
While this does indeed work, the URLs it returns contain the actual URIs, complete with -, _ and / characters, so the search always returns 0 results.
For instance, given this URL: https://example.com/my-articles, the redirect ends up as: https://example.com/?s=/my-articles
What I would like to end up with (ultimately) is this: https://example.com/?s=my+articles (though a + at the beginning works fine too: https://example.com/?s=+my+articles).
I will need to do this without Lua or Perl modules. So, how can I accomplish this?

You may need to tweak this depending upon how far down your directory structure you want the replacement to go, but this is the basic concept.
Named location for initial capture of 404s:
location @notfound {
rewrite (.*) /search$1 last;
}
Named locations are a bit limiting, so all this does is add /search/ to the beginning of the URI which returned 404. The last flag tells Nginx to break out of the current location and select the best location to process the request based on the rewritten URI, so we need a block to catch that:
location ^~ /search/ {
internal;
rewrite ^/search/(.*)([^a-z0-9\+])(.*)$ /search/$1+$3 last;
rewrite ^/search/(.*)$ /?s=$1 permanent;
}
The internal directive makes this location accessible only to the Nginx process itself; any client request to this block will return 404.
The first rewrite changes the last character that is not a lowercase letter, digit or + into a + and then asks Nginx to reevaluate the rewritten URI.
The location block is defined with the ^~ modifier, which means requests matching this location will not be evaluated against any regex defined location blocks, so this block should keep catching the rewritten requests.
Once all the non word characters are gone the first rewrite will no longer match so the request will be passed to the next rewrite, which removes the /search from the front of the URI and adds the query string.
My logs look like this:
>> curl -L -v http://127.0.0.1/users-forum-name.1
<< "GET /?s=users+forum+name+1 HTTP/1.1"
>> curl -L -v http://127.0.0.1/users-forum-name/long-story/some_underscore
<< "GET /?s=users+forum+name+long+story+some+underscore"
You get the idea..
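To see why the two rewrite rules converge, the loop can be simulated outside nginx. The Python sketch below is only a model of the rewrite cycle, not nginx itself: the function name to_search_query is made up for illustration, and it assumes the regex behaves the same way in both engines for these inputs.

```python
import re

# Same expression as the first rewrite rule: greedy (.*) means the
# single-character group in the middle matches the LAST offending character.
STEP = re.compile(r"^/search/(.*)([^a-z0-9+])(.*)$")


def to_search_query(uri: str) -> str:
    """Model the @notfound -> /search/ rewrite cycle for one request URI."""
    path = "/search" + uri  # rewrite (.*) /search$1 last;
    while True:
        match = STEP.match(path)
        if match is None:
            break  # nothing left to replace, fall through to the second rewrite
        # rewrite ... /search/$1+$3 last;
        path = "/search/" + match.group(1) + "+" + match.group(3)
    # rewrite ^/search/(.*)$ /?s=$1 permanent;
    return "/?s=" + path[len("/search/"):]


print(to_search_query("/users-forum-name.1"))
print(to_search_query("/users-forum-name/long-story/some_underscore"))
```

Each pass replaces exactly one character, so a URI with many separators needs many internal cycles. As far as I know nginx caps the number of such cycles per request, so very long URIs may hit that limit; test with realistic paths before relying on this.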

You can use the Lua module and transform this variable into what you need using Lua string functions. I am using OpenResty, which is basically nginx with Lua enabled, but the nginx Lua module on its own will do fine. It provides directives that let you run Lua inside the nginx configuration, either inline within a location using content_by_lua_block / access_by_lua_block, or from a separate file using content_by_lua_file / access_by_lua_file. The documentation is at https://github.com/openresty/lua-nginx-module#content_by_lua .
Here is an example from my app:
location ~/.*\.jpg$ {
set $test '';
access_by_lua_block {
ngx.var.test = string.sub(ngx.var.uri, 2)
}
root /var/www/luaProject/img/;
try_files $uri /index.html;
}

It is generally a bad idea to automatically issue redirects from 404 Not Found pages to elsewhere — the user might have simply mistyped a single character in the URL (e.g., on a mobile phone whilst copying the URL from a flier and having a "fat finger"), which would be very easy to correct once they see a 404 and the obvious typo in the address bar, yet may require starting from scratch if your search-engine doesn't deliver.
If you still want to do it, it might be more efficient to do it within the search engine itself — after all, if your search engine isn't capable of searching by URL, and correcting typos, then it doesn't sound like a very useful search engine, now does it?
If you still want to do it within the nginx alone in front of the search engine, then you can use the fact that http://nginx.org/r/rewrite directives essentially let you implement any sort of a DFA — Deterministic Finite Automaton — but, depending on the number of replacements required, it may result in too many cycles and somewhat inflexible rulesets.
Take a look at the following resources on recursive replacements of given characters within the URL for other characters:
How to replace underscore to dash with Nginx
nginx rewrite rule to remove - and _
https://serverfault.com/questions/477103/how-do-i-verify-site-ownership-on-google-webmaster-tools-through-nginx-conf
http://mdoc.su/

nginx - URL encode query string

I have an nginx reverse-proxy which needs to pass on the query string it receives. However this query string it receives is not well formatted and can contain JSON that is not URL encoded i.e. it contains curly brackets i.e. {}, commas, colons and double quotes! Unfortunately, I have no control over this and this causes the downstream server to barf when parsing the string.
Is there a way to correctly URL encode this string before proxying it?
I can replace the curly brackets as I know there will only be one instance of each using the config:
if ($args ~* '(.*){(.*)}(.*)') {
set $args $1%7B$2%7D$3;
rewrite (.*)$ $1;
}
proxy_pass http://127.0.0.1:8080;
However, I don't know in advance how many fields the JSON will have so it's difficult to use the same logic as above for the rest of the object.
I should also mention that I don't think this is related to nginx url-decoding parameters as I am not using a URI in the proxy_pass.
Thanks!
UPDATE: For the time being, the JSON object seems to be sending the same properties so this is what I've used as a workaround. It's pretty hideous and will break if the number of properties changes but does the job for now.
if ($args ~* '(.*){"(.*)":"(.*)","(.*)":"(.*)","(.*)":"(.*)","(.*)":"(.*)","(?<group10>.*)":"(?<group11>.*)"}(?<group12>.*)') {
set $args $1%7B%22$2%22%3A%22$3%22%2C%22$4%22%3A%22$5%22%2C%22$6%22%3A%22$7%22%2C%22$8%22%3A%22$9%22%2C%22${group10}%22%3A%22${group11}%22%7D${group12};
rewrite (.*)$ $1;
}
proxy_pass http://127.0.0.1:8080;
Note that since this returns more than 9 regex groups, I had to name groups 10, 11 and 12 otherwise they get interpreted as $1 + the digit 0, 1 or 2.
Is there a more robust way of doing this?

Personally, I don't like a solution with a single if statement, because it doesn't look very readable, flexible or maintainable. You may see whether having a combination of location or rewrite statements, where each one handles a specific encoding case, may work; see http://mdoc.su/ for a fun project that's very heavy with internal redirects, although I believe at one point nginx may have a limit on the total number of indirections.
Otherwise, provided that you cannot modify the backend, another option is to automatically redirect misbehaving clients and/or requests to an auxiliary backend, whose only purpose is to re-encode the string correctly, providing an X-Accel-Redirect HTTP Response Header as its output (as per http://nginx.org/r/proxy_ignore_headers), which nginx will use to make a subsequent internal redirect / request to the actual backend.
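As a rough idea of what such an auxiliary backend would have to do, here is a Python sketch of the re-encoding step alone. The function name reencode_query is hypothetical, and the sketch assumes the raw JSON never contains a literal & or = and that any % already in the string starts a valid escape; neither assumption comes from the question.

```python
from urllib.parse import quote


def reencode_query(raw_query: str) -> str:
    """Percent-encode unsafe characters in an already-assembled query string.

    '=' and '&' are kept as separators, and '%' is kept so that sequences
    which are already escaped are not encoded a second time.
    """
    return quote(raw_query, safe="=&%")


raw = 'a=1&filter={"k":"v","n":"2"}'
print(reencode_query(raw))
```

Unlike the regex workaround in the question, this does not depend on how many properties the JSON object has.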

Get spamassassin to drop emails containing a specific REGEX in attached filenames

newbie asking first question :)
I'm running a mail server (Ubuntu/Postfix/Dovecot) with SpamAssassin. Most of the known spam is flagged (RBLs, and obvious UCE) except for this particular malspam in attached zip files like "order_info_654321.zip", "paymet_document_123456.zip", and so on, when it doesn't fit any other SA rules. I'd like to procure a rule which drops the matching offenders into oblivion.
After fiddling with regex101.com, I've come up with an expression that matches these patterns exclusively:
/\w+[_][0-9]{6}.zip$/img
Question is... How to format it all, get it to work, and where to put it? So far, I edited /etc/spamassassin/local.cf, added this to the bottom, and restarted:
mimeheader TROJAN_ATTACHED Content-Type =~ /\w+[_][0-9]{6}.zip$/img
describe ZIP_ATTACHED email contains a zip trojan attachment
score TROJAN_ATTACHED 99.
But it doesn't seem to do the magic. Where else can I look for this?
Thank you all,
Keijo.-

Your regex is wrong. You do not need the $ character at the end, because the filename is not necessarily at the end of the Content-Type header. Instead, you can use a word boundary anchor, \b. In my rules, I have the following, and it works perfectly:
mimeheader MIME_FAIL Content-Type =~ /\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh|reg)\b/i
describe MIME_FAIL Blacklisted file extension detected
score MIME_FAIL 5
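The difference between $ and \b is easy to check with a quick script. The header value below is a made-up example of what a Content-Type header with a quoted filename might look like; the two patterns are the question's expression with the dot escaped, once anchored with $ and once with \b.

```python
import re

# Hypothetical header value: the filename is followed by a closing quote,
# so it is not at the very end of the string.
header = 'application/zip; name="order_info_654321.zip"'

anchored = re.compile(r"\w+_[0-9]{6}\.zip$", re.IGNORECASE)
bounded = re.compile(r"\w+_[0-9]{6}\.zip\b", re.IGNORECASE)

print(anchored.search(header))  # no match: the closing quote follows ".zip"
print(bounded.search(header))   # match: "\b" only needs a word boundary
```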

First up, SA doesn't drop e-mails by default, but it can score them so high for spam content that they never show up in anyone's inbox. Second, the "ingredients" I started with were incorrect, and they also broke SA's ability to function at all.
This actually did the trick when added to /etc/spamassassin/local.cf:
full TROJAN_ZIPUNDS /\w*[_][\d]{1,6}\.zip/img
score TROJAN_ZIPUNDS 99
describe TROJAN_ZIPUNDS RM zip attached trojan underscore
Even though these spammers altered from zip to rar, to underscores to dashes, different filenames, and so on, creating rules to counter them became simple after succeeding with the first one. Here's what I added too:
full TROJAN_RARDASH /\w*[-][\d]{1,6}\.rar/img
score TROJAN_RARDASH 99
describe TROJAN_RARDASH RM rar attached trojan dash
Also, as first described, I needed to specifically block certain zip file names which soon morphed to rar and dashes, so, morphing the regex and appending as a rule triad to spamassassin's local.cf (and restarting) is currently holding, until next spam wave :-)
Finally, this is a very very blunt workaround, so anyone with expertise on the subject is more than welcome to chime in.

You are using the wrong mime header to check for the filename. Use this instead:
mimeheader TROJAN_ATTACHED Content-Disposition =~ /\w+[_][0-9]{6}.zip/img
Also make sure you have the MimeHeader plugin loaded.
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

regex exclude hit if text contains a second string match

I am trying to search through log files to see if any warnings have appeared, so that I can flag them in a Jenkins pipeline using the Jenkins plugin "Text Finder".
However, I have a case where I do not want hits on the string "CRIT" in the log file if the line also contains plms.
E.g.
I have the following text in the log file:
<CRIT> 23-Jun-2014::10:57:13.649 Upgrade committed
<CRIT> 23-Jun-2014::10:57:13.703 no registration found for callpoint plmsView/get_next of type=external
I am not interested in having a warning for the second line, so I have added the following regex to Text Finder in Jenkins:
WARN|ERROR|<ERR>|/^(?=<CRIT>)(?=^(?:(?!plms).)*$).*$/
This should get a hit on CRIT only if the string does not also contain plms, i.e the first line, but I do not get a hit on either line.
I got the code from here: Combine Regexp
Could someone please help me correct this? Thanks!

You should use something like this:
WARN|ERROR|<ERR>|<CRIT>(?!.*?no registration found)
Change the no registration found part to match the <CRIT> message you want to exclude.
This expression also matches the line:
<INFO> User WARNER registered
so you should consider using something like:
^(WARN|ERROR|<ERR>|<CRIT>(?!.*?no registration found))
that matches only if the tokens are at the beginning of the line (change the tokens accordingly).
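The anchored version can be checked against the sample log lines with a short Python script. This is only a sketch: Text Finder uses Java regular expressions, which I am assuming behave the same as Python's for this pattern, and the excluded text is set to plms, as the question asks, instead of "no registration found".

```python
import re

# <CRIT> counts as a hit only when "plms" does not appear later on the line
PATTERN = re.compile(r"^(?:WARN|ERROR|<ERR>|<CRIT>(?!.*plms))")

log_lines = [
    "<CRIT> 23-Jun-2014::10:57:13.649 Upgrade committed",
    "<CRIT> 23-Jun-2014::10:57:13.703 no registration found for callpoint plmsView/get_next of type=external",
    "<INFO> User WARNER registered",
]

hits = [line for line in log_lines if PATTERN.search(line)]
for line in hits:
    print(line)
```

Only the first line is reported: the second is excluded by the negative lookahead, and the third no longer matches because WARN has to appear at the start of the line.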

This should work for you:
^<CRIT>(.(?!plms))*$

URL general format

I have written a C++ program that allows URLs to be posted onto YouTube. It works by taking in the URL as input either from you typing it into the program or from direct input, and then it will replace every '/', '.' in the string with '*'. This modified string is then put on your clipboard (this is solely for Windows-users).
Of course, before I can even call the program usable, it has to go back: I will need to know when '.', '/' are used in URLs. I have looked at this article: http://en.wikipedia.org/wiki/Uniform_Resource_Locator , and know that '.' is used when dealing with the "master website" (in the case of this URL, "en.wikipedia.org"), and then '/' is used afterwards, but I have been to other websites, http://msdn.microsoft.com/en-us/library/windows/desktop/ms649048%28v=vs.85%29.aspx , where this simply isn't the case (it even replaced '(', ')' with "%28", "%29", respectively!)
I also seemed to have requested a .aspx file, whatever that is. Also, there is a '.' inside the parentheses in that URL. I have even tried to view the regular expressions (I don't quite fully understand those yet...) regarding URLs. Could someone tell me (or link me to) the rules regarding the use of '.', '/' in URLs?

Can you explain why you are doing this convoluted thing? What are you trying to achieve? It may be that you don't need to know as much as you think, once you answer that question.
In the mean time here is some information. A URL is really comprised of a number of sections
http: - the "scheme" or protocol used to access the resource. "HTTP", "HTTPS",
"FTP", etc are all examples of a scheme. There are many others
// - separates the protocol from the host (server) address
myserver.org - the host. The host name is looked up against a DNS (Domain Name System)
service and resolved to an IP address - the "phone number" of the machine
which can serve up the resource (like "98.139.183.24" for www.yahoo.com)
www.myserver.org - the host with a prefix. Sometimes the same domain (`myserver.org`)
connects multiple servers (or ports) and you can be sent straight to the
right server with the prefix (mail., www., ftp., ... up to the
administrators of the domain). Conventionally, a server that serves content
intended for viewing with a browser has a `www.` prefix, but there's no rule
that says this must be the case.
:8080/ - sometimes, you see a colon followed by up to five digits after the domain.
this indicates the PORT on the server where you are accessing data
some servers allow certain specific services on just a particular port
they might have a "public access" website on port 80, and another one on 8080
the https:// protocol defaults to port 443, there are ports for telnet, ftp,
etc. Add these things only if you REALLY know what you are doing.
/the/pa.th/ this is the path relative to DOCUMENTROOT on the server where the
resource is located. `.` characters are legal here, just as they are in
directory structures.
file.html
file.php
file.asp
etc - usually the resource being fetched is a file. The file may have
any of a great number of extensions; some of these indicate to the server that
instead of sending the file straight to the requester,
it has to execute a program or other instructions in this file,
and send the result of that
Examples of extensions that indicate "active" pages include
(this is not nearly exhaustive - just "for instance"):
.php = contains a php program
.py = contains a python program
.js = contains a javascript program
(usually called from inside an .htm or .html)
.asp = "active server page" associated with a
Microsoft Internet Information Server
?something=value&somethingElse=%23othervalue%23
parameters that are passed to the server can be shown in the URL.
This can be used to pass parameters, entries in a form, etc.
Any character might be passed here - including '.', '&', '/', ...
But you can't just write those characters in your string...
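Rather than working out these sections by hand, a URL parser can split them for you. The sketch below uses Python's standard urllib.parse on a URL assembled from the example pieces above; the combined URL itself is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://www.myserver.org:8080/the/pa.th/file.php"
       "?something=value&somethingElse=%23othervalue%23")

parts = urlsplit(url)
print(parts.scheme)    # the protocol
print(parts.hostname)  # the host, "." separates its labels
print(parts.port)      # the port after the colon
print(parts.path)      # the path, "/" separates segments and "." is legal inside them
print(parts.query)     # everything after "?"

# parse_qs splits the parameters and undoes the percent-escaping
params = parse_qs(parts.query)
print(params)
```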
Now comes the fun part.
URLs cannot contain certain characters (quite a few, actually). To get around this, there is a mechanism called "escaping" a character. Typically this means replacing the character with its hexadecimal code, prefixed with a % sign. Thus, you frequently see a space character represented as %20, for example. You can find a handy list here.
There are many functions available for converting "illegal" characters in a URL automatically to a "legal" value.
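As one example of such functions, Python's standard library provides quote and unquote. The sketch below runs them on the last path segment of the MSDN URL from the question, which is where the %28 and %29 came from; the choice of safe="=" is only there to reproduce that exact form.

```python
from urllib.parse import quote, unquote

encoded = "ms649048%28v=vs.85%29.aspx"

# unquote turns %28 and %29 back into "(" and ")"
decoded = unquote(encoded)
print(decoded)

# quote escapes them again; "=" is listed as safe so it is left alone
reencoded = quote(decoded, safe="=")
print(reencoded)

# a space becomes %20
print(quote("a b"))
```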
To learn about exactly what is and isn't allowed, you really need to go back to the original specifications. See for example
http://www.ietf.org/rfc/rfc1738.txt
http://www.ietf.org/rfc/rfc2396.txt
http://www.ietf.org/rfc/rfc3986.txt
I list them in chronological order - the last one being the most recent.
But I repeat my question -- what are you really trying to do here, and why?
