Most clever way to parse a Facebook OAuth 2 access token string - regex

It's a bit late, but I'm disappointed in myself for not coming up with something more elegant. Anyone have a better way to do this...
When you pass an OAuth code to Facebook, it response with a query string containing access_token and expires values.
access_token=121843224510409|2.V_ei_d_rbJt5iS9Jfjk8_A__.3600.1273741200-569255561|TxQrqFKhiXm40VXVE1OBUtZc3Ks.&expires=4554
Although if you request permission for offline access, there's no expires and the string looks like this:
access_token=121843224510409|2.V_ei_d_rbJt5iS9Jfjk8_A__.3600.1273741200-569255561|TxQrqFKhiXm40VXVE1OBUtZc3Ks.
I attempted to write a regex that would suffice for either condition. No dice. So I ended up with some really ugly Ruby:
s = s.split("=")
#oauth = {}
if s.length == 3
#oauth[:access_token] = s[1][0, s[1].length - 8]
#oauth[:expires] = s[2]
else
#oauth[:access_token] = s[1]
end
I know there must be a better way!

Split on the & symbol first, and then split each of the results on =? That's the method that can cope with the order changing, since it parses each key-value pair individually.
Alternatively, a regex that should work would be...
/access_token=(.*?)(?:&expires=(.*))/

If the format is strict, then you use this regex:
access_token=([^&]+)(?:&expires=(.*))?
Then access_token value is in \1, and expires, if there's any, will be in \2.

Query string parsing usually involves these steps:
If it is a complete URL, take the part after the first ?
split the rest on &
for each of the resulting name=value pairs, split them on =
URL-decode to the value and the name (!)
stuff the result into a data structure of your liking
Using regex is possible, but makes invalid assumptions about the state of URL-encoding the string is in and fails if the order of the query string changes, which is perfectly allowed and therefore cannot be ruled out. Better to encapsulate the above in a little parsing function or use one of the existing URL-handling libraries of your platform.

Related

Burp : Data is read from document.location and passed to $() via the following statements

How to deal with this code ?
Data is read from document.location and passed to $() via the following statements:
var url = document.location.toString();
$('.nav-tabs a[href="#' + url.split('#')[1] + '"]').tab('show');
The question is this case would be what could come after the # in the URL. Most browsers will URL-encode special chars, but there may still be ways to get around it, especially in legacy browsers.
Regardless of that, I think it's a good idea to validate data before you use it. If you can restrict the accepted values to let's say a-z (and possibly numbers and dashes), there shouldn't be a way to exploit it.
var tab = document.location.toString().split('#')[1];
if (tab && !/[^a-zA-Z0-9]/.test(tab)) $('.nav-tabs a[href="#' + tab + '"]').tab('show');

GoogleTagManager | Parsing URL - With or Without regex

I want to pass into a variable, the language of the user.
But, my client can't/didn't pass this information trough datalayer. So, the unique solution I've is to use the URL Path.
Indeed - The structure is:
http://www.website.be/en/subcategory/subsubcategory
I want to extract "en" information
No idea to get this - I check on Stack, on google, some people talk about regex, other ones about CustomJS, but, no result on my specific setup.
Do you have an idea how to proceed on this point ?
Many thanks !!
Ludo
Make sure the built in {{Page Path}} variable is enabled. Create a custom Javascript variable.
function() {
var parts = {{Page Path}}.split("/");
return parts[1];
}
This splits the path by the path delimiter "/" and gives you an array with the parts. Since the page path has a leading slash (I think), the first part is empty, so you return the second one (since array indexing starts with 0 the second array element has the index 1).
This might need a bit of refinement (for pages that do not start with a language signifier, if any), but that's the basic idea.
Regex is an alternative (via the regex table variable), but the above solution is a little easier to implement.

Google Bigtable python api rowkey regex filter issues

Recently I've been trying to get data from our Bigtable tables using python. I'm able to connect and authenticate with the api, and I can get some sample data, but when I go to add a simple rowkey regex filter I get an empty data set, even though I know there should be data there.
All the rowkeys have a format like this:
XY_1234567_Z
where X and Y are capital letters A-Z and Z is a number 0-9. The _1234567_ is the constant that I provide. So basically I need to get all rows where rowkey is everything that contains _1234567_ for example.
This is the regex I use:
^.._1234567_.$
And this an example of my current code:
...
tbl = instance.table(tableID)
regex = ("^.._" + str(rowID) + "_.$").encode()
fltr = RowKeyRegexFilter(regex)
row_data = tbl.read_rows(filter_=fltr)
print(row_data.rows)
row_data.rows always ends up being an empty dict. I've tried removing encode() and just sending a string, and I've also tried a different regex to be more specific like this "([A-Z][A-Z])_" + str(rowID) + "_([0-9])" which still didn't work. If I try to do row_data.consume_next(), it hangs for a while and eventually gives me a StopIteration error. I've also tested the regex with regex101 and that seems to be fine, so I'm not sure where the issue is.
Looks like you've already figured it out, but please see the documentation for the python Data API [1, 2]. Calling row_data.consume_next() will fetch the next ReadRowsResponse in the stream and store it in row_data.rows. consume_next() will raise a StopIteration exception when there are no more results to consume. Alternatively, you can call row_data.consume_all() to consume all of the results from the stream (up to an optional limit).
[1] https://googleapis.dev/python/bigtable/latest/data-api.html?highlight=consume_next#stream-many-rows-from-a-table
[2] https://gcloud-python-bigtable.readthedocs.io/en/data-api-complete/row-data.html#gcloud_bigtable.row_data.PartialRowsData
It seems all I needed to do was row_data.consume_next() to get the set of data I requested. Stupid mistake. The row_data object is initially an empty dict but is populated once consume_next() reads the next item in the stream. This brought up a new issue where consume_next() would hang if read_rows() didn't find any match, but this is for another question.

Is there a way to search terms in order with RegexpQuery in lucene?

I would like to search my indexed documents in order using RegexpQuery.
For example I have 2 Document
text: Oracle unveils better than expected quarterly results.
text: Research In Motion shares gained almost 13 per cent on the Toronto Stock Exchange Friday, a day after the smartphone maker posted better than expected quarterly results.
So far I tried this but I got no luck.
Query regexq = new RegexpQuery(new Term("text", "^.+better.+quarterly.+results"));
Is there another way of implementing this?
Thanks
I believe a PhraseQuery fits what you are looking for better. You can use PhraseQuery.setSlop(int) to allow terms to appear between the terms of the query. This would like like:
Query pq = new PhraseQuery();
pq.add(new Term("text", "better"));
pq.add(new Term("text", "quarterly"));
pq.add(new Term("text", "results"));
pq.setSlop(10); //Or whatever is an appropriate slop value for you.
This sort of query is also supported by the standard QueryParser, as seen here, like:
text:"better quarterly results"~10
I think a PhraseQuery is most definitely the better implementation here, but...
Regarding RegexpQuery:
I believe it is intended to compare terms against the regex, and since the phrase you are searching for (I am assuming) is tokenized, no single Term matches your whole regex. You would need to index the entire field as a single Term to make this work, using StringField, KeywordAnalyzer, or similar.
I believe it works like Matcher.matches(), rather than Matcher.find(), which is to say, it must match the entire input term, rather than a portion of it. So, if you had specified "text" as a StringField, you would need to add a .* to the end to consume the rest of the input.
On a similar note, I'm not sure if it supports the use of the character "^" as the start of input, being that it is redundant in that case. I don't see it specified in Lucene's Regexp, but I have seen reference to it's use, so I'm not sure whether it would be accepted or not.
To summarize, a RegexpQuery could work like:
Query regexq = new RegexpQuery(new Term("text", ".+better.+quarterly.+results.*"));
If you used a StringField, or KeywordAnalyzer index the entire field as a single Term.
With the leading wildcard in your regexp, though, you could expect very poor performance from it (See the warning at the top of the RegexpQuery documentation).

RegEx to parse or validate Base64 data

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that can not fully rely on the input data to follow the RFC specs. So, the issues I face are issues like perhaps Base64 data that may not be broken up into 78 (I think it's 78, I'd have to double check the RFC, so don't ding me if the exact number is wrong) character lines, or that the lines may not end in CRLF; in that it may have only a CR, or LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers desire to parse mime at all costs, versus ones that go strictly by the book, or rather RFC; if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Do to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.
[^-A-Za-z0-9+/=]|=[^=]|={3,}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommend that you read RFC4648 that Gumbo mentioned in his answer as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
From the RFC 4648:
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.
So it depends on the purpose of usage of the encoded data if the data should be considered as dangerous.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
This one is good, but will match an empty String
This one does not match empty string :
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$
The cited RFC considers the empty string as valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10) therefore the above regex also does.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);
might be what you want. It produces
This is simple ASCII Base64 for StackOverflow exmaple.
The best regexp which I could find up till now is in here
https://www.npmjs.com/package/base64-regex
which is in the current version looks like:
module.exports = function (opts) {
opts = opts || {};
var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';
return opts.exact ? new RegExp('(?:^' + regex + '$)') :
new RegExp('(?:^|\\s)' + regex, 'g');
};
Here's an alternative regular expression:
^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$
It satisfies the following conditions:
The string length must be a multiple of four - (?=^(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings
To validate base64 image we can use this regex
/^data:image/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}
private validBase64Image(base64Image: string): boolean {
const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
return base64Image && regex.test(base64Image);
}
The shortest regex to check RFC-4648 compiliance enforcing canonical encoding (i.e. all pad bits set to 0):
^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$
Actually this is the mix of this and that answers.
I found a solution that works very well
^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$
It will match the following strings
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy
while it won't match any of those invalid
YW5#IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
My simplified version of Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
Simplification is that it doesn't check that its length is a multiple of 4. If you need that - use other answers. Mine is focusing on simplicity.
To test it: https://regex101.com/r/zdtGSH/1