How to parse GET tokens from URL with regular expression - regex

Given a URL with GET arguments such as
http://www.domain.com?key1=value1+value2+value3&key2=value4+value5
I wish to capture all the values for a given key (into separate references if possible). For example, if the desired key was key1, I would want to capture value1 in \1 (or $1 depending on language), value2 in \2, and value3 in \3.
My flawed regex is:
/[?&](?:key1)=((?:[^+&]+[+&$])+)/
which yields 0 results.
I am writing this in c++ using ECMA syntax, but I think I could convert a solution or advice from any regex flavor to ECMA. Any help would be appreciated.

This has been answered before and there are compact scripts written for it.
Regular expressions are not optimal for extracting query string values. At the end of this answer, I will give you an expression which can extract the value(s) for a given field into separate references. But note that extracting the parameters one at a time with regular expressions takes comparatively long, whereas they can all be extracted very quickly with no regular expression engine at all. For instance, http://www.htmlgoodies.com/beyond/javascript/article.php/3755006/How-to-Use-a-JavaScript-Query-String-Parser.htm
What language are you trying to use to extract these parameters, C++?
If you are using JavaScript, you can use the small functions mentioned in the article above, i.e.,
function ptq(q)
{
/* parse the query */
var x = q.replace(/;/g, '&').split('&'), i, name, t;
/* q changes from string version of query to object */
for (q={}, i=0; i<x.length; i++)
{
t = x[i].split('=', 2);
name = unescape(t[0]);
if (!q[name])
q[name] = [];
if (t.length > 1)
{
q[name][q[name].length] = unescape(t[1]);
}
/* next two lines are nonstandard, allowing programmer-friendly Boolean parameters */
else
q[name][q[name].length] = true;
}
return q;
}
function param() {
return ptq(location.search.substring(1).replace(/\+/g, ' '));
}
Once you have that code included in your page's scripts, you can parse the current URL's data by doing query = param(); and then using the value of query.key1, etc.
You can parse other query-string formatted data by using the ptq() function directly, i.e., query_object = ptq(query_string).
If you are using another language and regular expressions are the way you want to do it, then this would return all values matching key1, for instance:
/key1=([^&;]*)/g
That will return all the values for a given field name (in a standard query string, repeated fields are written like this: key1=value1&key1=value2&key1=value3, etc.).
The way you ask your question makes it sound like you want to create your own programmer-friendly way of supplying values (i.e., by constructing your own custom URLs rather than receiving data from form submissions through browsers) in which your values are separated by spaces (spaces are encoded as + signs in an HTTP GET query string, and as %20 in generic query strings).
You could make a complicated regular expression to do this in one step, but it is faster to match the entire field (all the values and the + signs as well), and then split the result at the + signs.
For each of the results from the regular expression I indicate, you can extract the plus-sign-separated values by applying /[^+]+/g (or by simply splitting the result at the + signs).
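The two-step approach described above (match the whole field, then split it at the + signs) can be sketched as follows. This is only an illustration: the helper name valuesForKey is hypothetical, and it assumes the key should be matched literally and only directly after ?, & or ;.

```javascript
// Hypothetical helper: collect every '+'-separated value for one key.
function valuesForKey(url, key) {
  // Escape regex metacharacters in the key so it is matched literally.
  const safeKey = key.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Step 1: match the entire field. Anchoring on '?', '&' or ';' keeps
  // "key1" from matching inside a longer name such as "mykey1".
  const field = new RegExp('[?&;]' + safeKey + '=([^&;#]*)', 'g');
  const values = [];
  for (const m of url.matchAll(field)) {
    // Step 2: split the field at the plus signs, dropping empty pieces.
    for (const piece of m[1].split('+')) {
      if (piece !== '') values.push(decodeURIComponent(piece));
    }
  }
  return values;
}

const url = 'http://www.domain.com?key1=value1+value2+value3&key2=value4+value5';
console.log(valuesForKey(url, 'key1')); // [ 'value1', 'value2', 'value3' ]
```

The same two steps carry over to C++ with std::regex in ECMAScript mode, since nothing here depends on JavaScript-only regex features.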

Related

parse URL params in Perl

I am working on some tutorials to explain things like GETs and POSTs and need to parse the URI manually. The following Perl code works, but I am trying to do two things:
list each key/value
be able to look up one specific value
What I do NOT care about is replacing the special chars to spaces or anything, the one value I need to get should be a number. In other languages I have used, the regular expression in question should group each key/value into one grouping with a part 1/part 2, does Perl do the same? If so, how do I put that into a map?
my @paramList = split /(?:\?|&|;)([^=]+)=([^&|;]+)/, $ENV{'REQUEST_URI'};
if(@paramList)
{
print "<h1>The Params</h1><ul>";
foreach my $i (@paramList) {
if($i) {
print "<li>$i</li>";
}
}
print "</ul>";
}
Per the request, here is a basic example of the input:
REQUEST_URI = /cgi-bin/printenv_html.pl?customer_name=fdas&phone_number=fdsa&email_address=fads%40fd.com&taxi=van&extras=tip&pickup_time=2020-01-14T20%3A45&pickup_place=&dropoff_place=Airport&comments=
goal is the following where the left of the equal is the key, and the right is the value:
customer_name=fdas
phone_number=fdsa
email_address=fads%40fd.com
taxi=van
extras=tip
pickup_time=2020-01-14T20%3A45
pickup_place=
dropoff_place=Airport
comments=
How about feeding your list of key-value pairs into a hash?
my %paramList = $ENV{'REQUEST_URI'} =~ /(?:\?|&|;)([^=]+)=([^&|;]+)/g;
(no reason for the split as far as I can tell)
This relies crucially on there being an even-sized list of matches, where each "before-=" thing becomes a key in the hash, with the value being its pairing "after-=" thing.
To also capture "pairs" without a value (like comments=), change the + after the last character class in the pattern to *.
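For comparison, the same idea (pair each "before-=" capture with its "after-=" capture and store them in a map) can be sketched in JavaScript. The function name parseParams is hypothetical, and the sketch already uses * so that empty values such as comments= are kept; values are left percent-encoded, as in the question.

```javascript
// Hypothetical helper: build a Map of key -> raw (still percent-encoded) value.
function parseParams(requestUri) {
  const params = new Map();
  // ([^=&;]+) captures the key; ([^&;]*) captures the value, and the *
  // lets the value be empty.
  const pair = /[?&;]([^=&;]+)=([^&;]*)/g;
  for (const [, key, value] of requestUri.matchAll(pair)) {
    params.set(key, value);
  }
  return params;
}

const uri = '/cgi-bin/printenv_html.pl?customer_name=fdas&taxi=van&pickup_place=&comments=';
const params = parseParams(uri);
for (const [key, value] of params) console.log(key + '=' + value); // list each pair
console.log(params.get('taxi')); // look up one specific value: van
```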

Scala regex find/replace with additional formatting

I'm trying to replace parts of a string that contains what should be dates, but which are possibly in an impermissible format. Specifically, all of the dates are in the form "mm/dd/YYYY" and they need to be in the form "YYYY-mm-dd". One caveat is that the original dates may not exactly be in the mm/dd/YYYY format; some are like "5/6/2015". For example, if
val x = "where date >= '05/06/2017'"
then
x.replaceAll("'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})'", "'$3-$1-$2'")
performs the desired replacement (returns "2017-05-06"), but for
val y = "where date >= '5/6/2017'"
this does not return the desired replacement (returns "2017-5-6" -- for me, an invalid representation). With the Joda Time wrapper nscala-time, I've tried capturing the dates and then reformatting them:
import com.github.nscala_time.time.Imports._
import org.joda.time.DateTime
val f = DateTimeFormat.forPattern("yyyy-MM-dd")
y.replaceAll("'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'",
"'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime("$1"))+"'")
But this fails with a java.lang.IllegalArgumentException: Invalid format: "$1". I've also tried using the f interpolator and padding with 0s, but it doesn't seem to like that either.
Are you not able to do additional processing on the captured groups ($1, etc.) inside the replaceAll? If not, how else can I achieve the desired result?
Backreferences like $1 can only be used inside string replacement patterns. In your code, "$1" is passed to parseDateTime as a literal string before replaceAll ever runs, so it is not a backreference any longer.
You may use a "callback" with replaceAllIn to actually get the match object and access its groups to further manipulate them:
val pattern = "'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'".r
val result = pattern.replaceAllIn(y, m => "'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(m.group(1)))+"'")
Regex.replaceAllIn is overloaded and can take a Match => String.
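The callback technique is not specific to Scala. As a rough illustration in JavaScript, String.prototype.replace also accepts a function that receives the captured groups. The sketch below simply zero-pads the month and day instead of going through a date library, so unlike the Joda-based version it does not validate the date; the name isoDates is hypothetical.

```javascript
// Rewrite quoted m/d/yyyy dates as yyyy-mm-dd, zero-padding month and day.
function isoDates(text) {
  const datePattern = /'(\d{1,2})\/(\d{1,2})\/(\d{4})'/g;
  // The callback receives the full match followed by each captured group.
  return text.replace(datePattern, (whole, month, day, year) =>
    "'" + year + '-' + month.padStart(2, '0') + '-' + day.padStart(2, '0') + "'");
}

console.log(isoDates("where date >= '5/6/2017'")); // where date >= '2017-05-06'
```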

How to replace parts of a string in lua "in a single pass"?

I have the following string of anchors (where I want to change the contents of the href) and a Lua table of replacements, which maps each word to the text it should be replaced with:
s1 = '<a href="word7">'
replacementTable = {}
replacementTable["word1"] = "potato1"
replacementTable["word2"] = "potato2"
replacementTable["word3"] = "potato3"
replacementTable["word4"] = "potato4"
replacementTable["word5"] = "potato5"
The expected result should be:
<a href="word7">
I know I could do this by iterating over each element in the replacementTable and processing the string each time, but my gut feeling tells me that if the string is very big and/or the replacement table becomes big, this approach is going to perform poorly.
So I thought it would be best if I could do the following: apply the regular expression to find all the matches, get an iterator over the matches, and replace each match with its value in the replacementTable.
Something like this would be great (writing it in JavaScript because I don't know yet how to write lambdas in Lua):
var newString = patternReplacement(s1, '<a[^>]* href="([^"]*)"', function(match) { return replacementTable[match] })
Where the first parameter is the string, the second one the regular expression and the third one a function that is executed for each match to get the replacement. This way I think s1 gets parsed once, being more efficient.
Is there any way to do this in Lua?
In your example, this simple code works:
print((s1:gsub("%w+",replacementTable)))
The point is that gsub already accepts a table of replacements.
In the end, the solution that worked for me was the following one:
local updatedBody = string.gsub(body, '(<a[^>]* href=")(/[^"%?]*)([^"]*")', function(leftSide, url, rightSide)
local replacedUrl = url
if (urlsToReplace[url]) then replacedUrl = urlsToReplace[url] end
return leftSide .. replacedUrl .. rightSide
end)
It kept out any querystring parameter giving me just the URI. I know it's a bad idea to parse HTML bodies with regular expressions but for my case, where I required a lot of performance, this was performing a lot faster and just did the job.
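The question sketched its idea in JavaScript pseudocode, so for reference here is roughly what a working single-pass version could look like there. It uses String.prototype.replace with a callback, which plays the same role as the function argument to gsub. The helper name replaceHrefs and the two-entry table are assumptions made for this example.

```javascript
// Hypothetical lookup table, mirroring replacementTable from the question.
const replacementTable = { word1: 'potato1', word2: 'potato2' };

function replaceHrefs(html, table) {
  // Groups: (1) text up to the opening quote, (2) the href value, (3) the closing quote.
  return html.replace(/(<a[^>]* href=")([^"]*)(")/g, (whole, left, href, right) => {
    // Values missing from the table are left unchanged.
    const known = Object.prototype.hasOwnProperty.call(table, href);
    return left + (known ? table[href] : href) + right;
  });
}

console.log(replaceHrefs('<a href="word1"><a href="word7">', replacementTable));
// <a href="potato1"><a href="word7">
```

The string is scanned once, and each match costs a single table lookup, regardless of how many entries the table holds.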

Regular expression for custom syntax in text input

I'm supposed to enforce a certain search-syntax in a text input, and after watching several RegEx videos and tutorials, I'm still having difficulties creating a regex for my purpose.
The expression structure should be something like that:
$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8
The string may start with a free text search that may contain any character other than the delimiter, which is a comma (free text must come first, and the string may consist ONLY of the free text search).
After the free text come comma-separated parts, each made of a field name consisting only of letters [a-zA-Z], followed by an operator (=|<|>|<>), followed by a field search value that may be anything but a comma.
Between the commas that separate the parts there may be spaces (\s*).
The free text part or at least one field=value must appear in order for the string to be valid.
Did anyone understand the question? :)
^[^,]*(?:,\s*[a-zA-Z]+(?:[=><]|<>)[^,]+)*$? – Rawing
Thanks, that seems to work. Why did you use non-capturing groups?
He did it most probably because he didn't assume that the groups are to be captured (you didn't specify that).
Plus - if I start out the string with a comma, it is valid, whereas I
want it to not be valid (if there's no free text at the beginning).
That can be accomplished by changing the first * to a +, i. e. ^[^,]+…
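A quick way to check the effect of that change is to run the adjusted pattern against a few inputs. The sketch below assumes the full expression from the comment above with the first * replaced by +, and with <> listed before the single-character operators; the sample strings are made up for illustration.

```javascript
// Free text must come first (+ instead of *), then optional ", key<op>value" parts.
const searchSyntax = /^[^,]+(?:,\s*[a-zA-Z]+(?:<>|[=<>])[^,]+)*$/;

console.log(searchSyntax.test('$earch://site.com?y=7, app=app1, wu>7, cnt<>8')); // true
console.log(searchSyntax.test('free text only')); // true
console.log(searchSyntax.test(', app=app1'));     // false: no free text at the start
```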
I'm using javascript. I want to be able to separate each key=value
pair (including the possible free text as a group), and within that
group I would like to be able to capture key, operator, and value as
separate entities (or groups)
That's not doable with only one RegExp invocation, see e. g. How to capture an arbitrary number of groups in JavaScript Regexp? Here's an example solution:
s = '$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8'
part = /,\s*([a-zA-Z]+)(<>|[=><])([^,]+)/
re = RegExp('^([^,]+)('+part.source+')*$')
freetext = re.exec(s)[1] // validate s and take free text as 1st capture group of re
if (freetext)
{ document.write('free text:', freetext, '<p>')
parts = RegExp(part.source, 'g')
m = s.slice(freetext.length).match(parts) // now get all key=value pairs into m[]
if (m)
{ field = []
for (i = 0; i < m.length; ++i)
{ f = m[i].match(part) // and now capture key, operator and value from m[i]
field[i] = { key:f[1], operator:f[2], value:f[3] }
for (property in field[i]) // display them
document.write(property, ':', field[i][property], '; ')
document.write('<p>')
}
document.write(field.length, ' key/value pairs total<p>')
}
}

Railo, remove some double quotes from SerializeJSON result

Let's say I have:
<cfscript>
arrButtons = [
{
"name" = "Add",
"bclass" = "add",
"onpress" = "addItem"
},
{
"name" = "Edit",
"bclass" = "edit",
"onpress" = "editItem"
},
{
"name" = "Delete",
"bclass" = "delete",
"onpress" = "deleteItem"
}
];
jsButtons = SerializeJSON(arrButtons);
// result :
// [{"onpress":"addItem","name":"Add","bclass":"add"},{"onpress":"editItem","name":"Edit","bclass":"edit"},{"onpress":"deleteItem","name":"Delete","bclass":"delete"}]
</cfscript>
For every onpress item, I need to remove the double quotes from its value to match the JS library requirement (the onpress value must be a callback function).
How do I remove the double quotes using a regular expression?
The final result must be:
[{"onpress":addItem,"name":"Add","bclass":"add"},{"onpress":editItem,"name":"Edit","bclass":"edit"},{"onpress":deleteItem,"name":"Delete","bclass":"delete"}]
No double quotes surrounding addItem, editItem, and deleteItem.
Edit 2012-07-13
Why do I need this? I created a CFML function whose result is a collection of JS that will be used in many files. The jsButtons object will be used as one of the options available in the JS library. One of that function's arguments is an array of structs (the default is arrButtons), and the supplied argument value can be merged with the default value.
Since we can't (in CFML) write the onpress value without double quotes, I have to add double quotes to that value, convert the (CFML) array of structs to JSON (which is just a string), and remove the double quotes before placing it in the JS library option.
With Railo, we can declare the struct as a linked struct to make sure the keys keep the same order for loops or conversion (in the example above, onpress is always the last key in the struct). With a linked struct and a fixed key order, we can remove the double quotes with a simple Replace function, but of course we can't guarantee that every programmer who uses the CFML function remembers to use a linked struct with the same key order as in the example above.
I'm not sure this is actually necessary - depending on how/where you're dealing with the JS callbacks, it might be possible to use the string function names to reference the function without needing to remove the quotes (i.e. object[button.onpress]).
However, since you asked, here is a regex solution:
jsButtons = jsButtons.replaceAll('(?<="onpress":)"([^"]+)"','$1');
The regex there is made up of two parts:
(?<="onpress":) -- lookbehind to ensure we are dealing with the text "onpress":
"([^"]+)" -- match the quotes and capture their contents.
The $1 on the replacement side is to replace the matched text (i.e. the entire quoted value) with the first capture group (i.e. the contents of the quotes).
If case-sensitivity of "onpress" might be an issue, you can prefix the regex with (?i) to ignore case.
If there will be multiple different events (not just "onpress") you can update the relevant part of the expression above to be (?<="on(?:press|hover|squeek)":) etc.
Note: All the above relies on the format output from serializeJson not changing - if it's possible that there might be comments, whitespace, single quotes, or anything else in future then a longer expression would be needed to cater for those - which is part of why you should investigate if you even need regex to solve this problem in the first place.
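To see the lookbehind replacement in action outside CFML, here is a small JavaScript sketch of the same pattern. It assumes a runtime with lookbehind support (current Node.js and modern browsers) and, like the expression above, a compact serialization with no whitespace after the colon. The sample string is shortened from the question's output.

```javascript
const json = '[{"onpress":"addItem","name":"Add","bclass":"add"},' +
             '{"onpress":"editItem","name":"Edit","bclass":"edit"}]';

// (?<="onpress":) is the lookbehind; "([^"]+)" matches the quoted value
// and captures its contents, which $1 then puts back without the quotes.
const unquoted = json.replace(/(?<="onpress":)"([^"]+)"/g, '$1');

console.log(unquoted);
// [{"onpress":addItem,"name":"Add","bclass":"add"},{"onpress":editItem,"name":"Edit","bclass":"edit"}]
```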
What you're wanting to output is not JSON, so using SerializeJSON is a kludge.
Is there any reason you are putting it into a ColdFusion Array first, instead of writing the Javascript directly?
JSON is purely meant to be a data description language. Per
http://www.json.org, it is a "lightweight data-interchange format." -
not a programming language.
Per http://en.wikipedia.org/wiki/JSON, the "basic types" supported
are:
Number (integer, real, or floating point)
String (double-quoted Unicode with backslash escaping)
Boolean (true and false)
Array (an ordered sequence of values, comma-separated and enclosed in square brackets)
Object (collection of key:value pairs, comma-separated and enclosed in curly braces)
null
--Source
I guess in this case you can simply use serialize(). That should do the trick...
Gert