Adapting Regular Expression in Django URL to match filepath

I am currently working on a web application that, for one of its functions, takes the location of a malware file as input.
The location is passed via the views file. However, after altering the models section of the application, I found it could no longer parse the full filepath.
The URL pattern below works for the following pcap as input:
8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap
url(r'^analyse/(?P<pcap>[\w\-]+\.pcap)$', views.analyse, name='analyse'),
However, this pattern no longer matches when the pcap parameter contains the full filepath:
/home/freddie/malwarepcaps/8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap
Any suggestions or pointers on how I would alter the regular expression to accommodate the full filepath in the string being passed to the route would be very much appreciated.

regex: ((/\w+?)+/)?([\w-]+\.pcap)
django regex: ^analyse(?P<pcap>((/\w+?)+/)?([\w-]+\.pcap))$
Note that there is no slash after analyse because the slash is now part of pcap.
So analyse/home/freddie/malwarepcaps/foo-bar.pcap should match this pattern, and pcap will be equal to /home/freddie/malwarepcaps/foo-bar.pcap
test:
https://pythex.org/?regex=((%2F%5Cw%2B%3F)%2B%2F)%3F(%5B%5Cw-%5D%2B%5C.pcap)&test_string=8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap%20%0A%2Fhome%2Ffreddie%2Fmalwarepcaps%2F8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap&ignorecase=0&multiline=0&dotall=0&verbose=0
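The pattern can also be checked locally. Below is a minimal sketch using Python's re module, which is what Django's url() patterns are compiled with. The helper name extract_pcap is made up for illustration and is not part of Django.

```python
import re

# The Django pattern from the answer above, compiled directly with re.
PCAP_ROUTE = re.compile(r'^analyse(?P<pcap>((/\w+?)+/)?([\w-]+\.pcap))$')


def extract_pcap(url_path):
    """Return the captured pcap value, or None if the path does not match.

    Hypothetical helper for illustration only.
    """
    match = PCAP_ROUTE.match(url_path)
    return match.group('pcap') if match else None


if __name__ == '__main__':
    print(extract_pcap('analyse/home/freddie/malwarepcaps/foo-bar.pcap'))
    print(extract_pcap('analyse/foo-bar.pcap'))
```

One thing this check brings out: because the directory part requires at least one \w+ component followed by a slash, a bare file name directly after a slash (analyse/foo-bar.pcap) is not matched by this pattern, and directory names containing characters outside \w (such as a hyphen) are not matched either.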
PS: I think it would be better to move such a parameter (the path, e.g. /home/f/m/f.pcap) into the query string (for a GET request) or into the HTTP body (for a POST request),
so that the parameter is easier to obtain without URL matching.
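A rough sketch of the query-string idea, using only the standard library so it runs without a Django project. The /analyse/ endpoint and the pcap parameter name are assumptions chosen for illustration; inside a Django view the same value would normally be read with request.GET.get('pcap').

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_analyse_url(pcap_path):
    """Build a URL that carries the file path as a query parameter.

    The '/analyse/' endpoint and the 'pcap' key are hypothetical.
    """
    return '/analyse/?' + urlencode({'pcap': pcap_path})


def read_pcap_param(url):
    """Read the 'pcap' parameter back out of the URL's query string."""
    query = parse_qs(urlsplit(url).query)
    values = query.get('pcap')
    return values[0] if values else None


if __name__ == '__main__':
    url = build_analyse_url('/home/freddie/malwarepcaps/foo-bar.pcap')
    print(url)
    print(read_pcap_param(url))
```

With this layout the route itself can stay a fixed string, and the slashes in the path are percent-encoded instead of being interpreted as URL segments.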

Related

Regular Expression - How to capture a file name that exists in a URL

I am a relative newcomer to regex and I have been trying to figure out the best way to isolate a file name that exists in a URL.
The structure of the URL is like this, with several examples:
https://www.somefileshare.com/s/fdfdeertyyus/Luke%20Movie%202.m4v?dl=0
https://www.somefileshare.com/s/fddderttfdf/Ariana%20Movie%20.mov?dl=0
https://www.somefileshare.com/s/fdderfddefdf/Dans%20AudioFile.m4a?dl=0
The portion of the URL before the file name (/fdfdeertyyus in the first example) is dynamic and therefore cannot be used as a qualifier. Also, the extension of the file is not static and can be any of a myriad of types, such as .mov, .mp4, .wma or .m4a.
The file name I am interested in is Luke%20Movie%202.m4v (from the first example).
The following Regular Expression:
(?<=.com\/s\/).*?(?=\?dl)
yields
fdfdeertyyus/Luke%20Movie%202.m4v
fddderttfdf/Ariana%20Movie%20.mov
fdderfddefdf/Dans%20AudioFile.m4a
respective to the examples provided.
The test editor that I am using to isolate the file name requires \ in front of forward slashes.
It is a requirement that I use regex to accomplish this, and the language the regex will be run in varies: C, C++, Java and Swift.
Any advice would be greatly appreciated.
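One possible direction, sketched here in Python only to test the idea: instead of anchoring on .com/s/, match the last run of non-slash characters that is followed by ?dl. The pattern [^/]+(?=\?dl) and the helper name are suggestions, not something taken from the question, and lookahead support would still need to be confirmed in each target regex engine.

```python
import re
from urllib.parse import unquote

# Non-slash characters immediately followed by "?dl" (lookahead, not consumed).
FILE_NAME = re.compile(r'[^/]+(?=\?dl)')


def file_name_from_url(url):
    """Return the percent-encoded file name, or None if nothing matches."""
    match = FILE_NAME.search(url)
    return match.group(0) if match else None


if __name__ == '__main__':
    url = 'https://www.somefileshare.com/s/fdfdeertyyus/Luke%20Movie%202.m4v?dl=0'
    name = file_name_from_url(url)
    print(name)           # encoded form, as it appears in the URL
    print(unquote(name))  # decoded form, with %20 turned back into spaces
```

Because the dynamic directory segment ends with a slash, excluding slashes from the match keeps it out of the result without needing to know its value.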

Batch rename URLs in XML file

I have a list with URLs and IPs for Office365 in XML format. Now I'd like to either write a script or use a text editor's search and replace function (regex) to automatically change some of these URLs.
Example:
These URLs
<address>scus-odc.officeapps.live.com</address>
<address>scus-roaming.officeapps.live.com</address>
<address>sea-odc.officeapps.live.com</address>
Should be changed to
<address>*.officeapps.live.com</address>
<address>*.officeapps.live.com</address>
<address>*.officeapps.live.com</address>
I would appreciate any input on this issue. Thanks in advance.
Here is what I have tried so far:
1) Search for ..(?=[^.].[^.]*$) and replace with an empty string. This does a good job, but unfortunately it removes the preceding text as well.
2) As pointed out by Tim, the list consists of FQDNs with different domains. The list is available from https://go.microsoft.com/fwlink/?LinkId=533185 (this list includes all FQDNs; the IPs will get deleted).
3) Solved with the help of Sergio's input. The solution was to
search for (>)[^.\n\s]+ and substitute with \1\*
I will have to write another script to delete the multiple domains, but that was not part of the question, so I consider this issue closed. Thank you for your input.

You can use the regex:
(>)[^.\n\s]+
and substitute with \1\*
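As a script, the same substitution might look like the sketch below. It assumes one address element per line, as in the example, and uses Python's replacement syntax, where the asterisk needs no backslash. The function name is made up for illustration.

```python
import re


def wildcard_hosts(xml_text):
    """Replace the first host label after each '>' with '*'.

    Assumes one <address> element per line, as in the example above.
    """
    return re.sub(r'(>)[^.\n\s]+', r'\1*', xml_text)


if __name__ == '__main__':
    sample = (
        '<address>scus-odc.officeapps.live.com</address>\n'
        '<address>scus-roaming.officeapps.live.com</address>\n'
        '<address>sea-odc.officeapps.live.com</address>\n'
    )
    print(wildcard_hosts(sample))
```

The closing > of each element is followed by a newline, which the character class excludes, so only the host label after the opening tag is rewritten.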

Regex to parse the "Accept" header

I'm working on a REST API. The client is using the Accept header in their request to send in stuff like
...application/vnd.mywebsite+json; version=1... or
...application/vnd.mywebsite+xml; version=2....
Currently, I am parsing the headers and picking out the media type and version to serve with string functions:
json and 1
xml and 2
I was wondering if I could do that faster with a regex.
How can I pull out the format and version from an "Accept" header in the request? I suppose, I would need to make 2 regex calls to get this done, and that's okay.
Update:
Using the answer below, I tried extracting those values using ColdFusion, but the pattern just matches the whole string.
Ideally, I want an array of 2 elements, i.e. ['json', '1']. Any ideas?
<cfscript>
arrTitles = reMatch(
"application/vnd.website\+([A-Za-z]+);\s*version=(\d+)",
"application/vnd.website+json; version=2"
);
writedump(arrTitles);
</cfscript>
Please refer to this runnable example.

You could use something simple like this:
application/vnd.mywebsite\+([A-Za-z]+);\s*version=(\d+)
The type (json or xml) would be in capturing group 1, the version in group 2.
You can see it working here.
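For comparison, here is a small Python sketch of pulling both capture groups out in one call. It is not ColdFusion and does not address the reMatch behaviour described in the update; it only shows the two groups the pattern is meant to produce. The dot in vnd.mywebsite is escaped here so it matches a literal dot, and parse_accept is a hypothetical helper name.

```python
import re

# Group 1: format (json, xml, ...); group 2: version number.
ACCEPT_RE = re.compile(r'application/vnd\.mywebsite\+([A-Za-z]+);\s*version=(\d+)')


def parse_accept(header):
    """Return [format, version] from an Accept header, or None if absent."""
    match = ACCEPT_RE.search(header)
    return list(match.groups()) if match else None


if __name__ == '__main__':
    print(parse_accept('application/vnd.mywebsite+json; version=1'))
    print(parse_accept('application/vnd.mywebsite+xml; version=2'))
```

A single search is enough here, since both values come back as groups of the same match.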

Are my regex just wrong or is there a buggy behaviour in td-agent's format behaviour?

I am using fluentd, elasticsearch and kibana to organize logs. Unfortunately, these logs are not written using any standard like apache, so I had to come up with the regex for the format myself. I used this site here to verify that they are working: http://fluentular.herokuapp.com/ .
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website that is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this one:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead though, I have this ?bug? that pri is always shortened to DEBU. Same for ERROR which becomes ERRO, only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug, still it confuses me and any help is greatly appreciated.
I'm not sure I can link the complete config file, because I don't personally own these log files and I am trying to keep this at a level where my boss won't get mad at me for posting sensitive information. Should it definitely be needed, I will post it later on, after having asked him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO, next the date , next what we call the subject which is always written in [ ] and finally just a message.
Here is a link to fluentular with the format I am using and a teststring that produces the right result in fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?

How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
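In your format string, you were using . and ... where your spaces and colon should be. I'm not too sure why this works in Fluentular, but you should have matched the \: explicitly, as well as each space between the values. Note also that [INFO] is a character class that matches any single one of the letters I, N, F or O rather than the word INFO, so ([INFO]|[DEBUG]|[ERROR])+ matches any run of those letters and can give up its last letter to the ... that follows it.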
So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
  type tail
  path /var/log/foo/bar.log
  pos_file /var/log/td-agent/foo-bar.log.pos
  tag foo.bar
  format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
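The difference between the two format expressions can be reproduced outside td-agent. The sketch below uses Python's re module rather than Ruby, so named groups are written (?P<name>...) instead of (?<name>...); otherwise the patterns are the ones from the question and from this answer. It runs them against the sample line exactly as printed in the question, with a single space after the colon.

```python
import re

LINE = ("DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] "
        "Start Activitiy 'barbecue' zu verabeiten.")

# Pattern from the question: character classes and "." placeholders.
ORIGINAL = re.compile(
    r"(?P<pri>([INFO]|[DEBUG]|[ERROR])+)...(?P<date>(\d{2}\.\d{2}\.\d{4}))."
    r"(?P<time>(\d{2}:\d{2}:\d{2})).\[(?P<subject>(.*))\].(?P<msg>(.*))"
)

# Pattern from the answer: literal alternatives, explicit colon and spaces.
FIXED = re.compile(
    r"(?P<pri>(INFO|ERROR|DEBUG)):\s+(?P<date>(\d{2}\.\d{2}\.\d{4}))\s"
    r"(?P<time>(\d{2}:\d{2}:\d{2}))\s\[(?P<subject>(.*))\]\s(?P<msg>(.*))"
)

if __name__ == '__main__':
    print(ORIGINAL.search(LINE).group('pri'))  # pri from the question's pattern
    print(FIXED.search(LINE).group('pri'))     # pri from the answer's pattern
```

On this line the question's pattern has to hand the final G of DEBUG to the three-dot placeholder before the date can match, which is where the shortened value comes from.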
I would also take a look into comparing Logstash vs. Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and it makes formatting your fields much easier because you are providing an abstraction layer, but you essentially will get the same data.
And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.

What is the mappings.ts file and how should it be set up in Tritium?

I'm using the Moovweb SDK and using Tritium. I want my mobile site to behave like my desktop site. I have different URLs pointing to my homepage. Should I use regex? A common element? And what's the best syntax for matching the path?

The mappings.ts file in the scripts directory is where particular pages are matched. The file is imported in html.ts and allows us to say "when a certain page is matched, make the following transformations."
Most projects already have a mappings file generated. A simple layout looks like this:
match($path) {
  with(/home/) {
    log("--> Importing pages/home.ts in mappings.ts")
    #import pages/home.ts
  }
}
Every time you start working on a new page, you need to set up a new "map".
First: Match with a unique path
The Tritium above matches the path for the homepage. The path is the part of a URL after the domain. For example, in www.example.com/search/item, "www.example.com" is the domain and "/search/item" is the path.
The /home/ matcher specifies the "home" part with a regular expression. You could also use a plain string if necessary:
with("home")
If Tritium matches the path with the matcher, it will import the home page.
It's probably true that the homepage of a site doesn't actually contain the word home. Most homepages are the URL without any matcher. A better string matcher could be:
match($path) {
  with ("/")
}
Or, using regex:
with(/index|^\/$/) {
As you can see, the with() function of the mappings file is where knowledge of regex can really come in handy. Check out our short guide on regex. Sometimes it will be simpler, such as (/search/).
Remember to match on the most unique aspect of the URL possible. If two with() functions match the same URL, then the one that appears first in the mappings file will be used. If you cannot find a unique URL matcher for different page types, you may have to match via other means.
Why Use Regex?
It might seem easier to use a string rather than a regex matcher. However, regex provides a lot more flexibility over which URLs are matched.
For example, a site could use a string of numbers in its product page URLs. Using a normal string matcher would not be practical: you'd have to list out all the numbers possible for all the items on the site. An easier way would be to use regex to say, "If there's a string of 5 digits, continue!" (The code for matching 5 digits: /\d{5}/.)
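The matching behaviour described so far (regex matchers, and the first matching with() winning) can be mimicked in a few lines of Python. This is only an analogy to make the logic concrete, not Tritium; the product and search page files are hypothetical names.

```python
import re

# Ordered like a mappings file: the first pattern that matches is used.
# pages/product.ts and pages/search.ts are made-up names for illustration.
ROUTES = [
    (re.compile(r'index|^/$'), 'pages/home.ts'),
    (re.compile(r'\d{5}'), 'pages/product.ts'),
    (re.compile(r'search'), 'pages/search.ts'),
]


def resolve(path):
    """Return the page file for the first pattern found in the path."""
    for pattern, page in ROUTES:
        if pattern.search(path):
            return page
    return None


if __name__ == '__main__':
    for path in ('/', '/products/48213', '/search/item', '/search/12345'):
        print(path, '->', resolve(path))
```

The last path contains both "search" and five digits, so it resolves to whichever matcher is listed first, which is why the most specific matcher should come early.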
Second: Log the match
When matching a particular path, you should also use log() statements so you know exactly what's getting imported. The log statement will be printed in the command line window, so you can see if your regular expression accurately matches your path.
match($path) {
  with(/index|^\/$/) {
    log("--> importing pages/home.ts in mappings.ts")
  }
}
Third: Import the file
Finally, use the #import function to include the page-specific Tritium file.
match($path) {
  with(/index|^\/$/) {
    log("--> importing pages/home.ts in mappings.ts")
    #import pages/home.ts
  }
}
