How to extract a keyword from a string in Regex - regex

For example, I want to extract the variable id from this string: validateId(id, name) && id=="123". However, I do not want to extract the word id from other places, such as inside validateId. The result should therefore be only the two standalone occurrences of id: the argument in validateId(id, name) and the left-hand side of id=="123".
I have tried matching on the opening bracket in validateId(, but I am not managing to extract exactly what I need.
Note: id is just an example, I want to be able to accept any form of keywords. However, I can then change the regex accordingly to the keyword that I want.

I'm assuming that the keywords that you want to extract are always found in the same pattern. In this example I've added another.
import re
test_str = '''validateId(id, name) && id=="123"
ParameterValue(Test, name) && Test=="123"'''
Match the pattern of the provided test string. This will find id and Test.
reg_keyword = r'(?<=\()(\w+)|(?<=&&\s)(\w+)'
re.findall returns a list of tuples here, one tuple per match with one element per capture group. Each keyword is matched twice (once by each alternative), so we keep only the matches from the first group and drop the empty values.
keyword = [k[0] for k in re.findall(reg_keyword, test_str) if k[0]]
Output
['id', 'Test']
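If the keyword is known in advance, a word-boundary pattern is another option. The following is a minimal sketch, assuming the keyword consists of word characters and should only match as a whole word; the helper name find_keyword is hypothetical and not part of the answer above.

```python
import re

def find_keyword(keyword, text):
    """Return the (start, end) span of every whole-word occurrence of keyword."""
    # re.escape guards against regex metacharacters in the keyword;
    # \b only matches at a word boundary, so the "Id" inside validateId is skipped.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return [m.span() for m in pattern.finditer(text)]

spans = find_keyword("id", 'validateId(id, name) && id=="123"')
print(spans)
```

Returning spans rather than the matched text keeps the position of each occurrence, which is what distinguishes the standalone id from the one embedded in validateId.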

Related

Scala regex find/replace with additional formatting

I'm trying to replace parts of a string that contains what should be dates, but which are possibly in an impermissible format. Specifically, all of the dates are in the form "mm/dd/YYYY" and they need to be in the form "YYYY-mm-dd". One caveat is that the original dates may not exactly be in the mm/dd/YYYY format; some are like "5/6/2015". For example, if
val x = "where date >= '05/06/2017'"
then
x.replaceAll("'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})'", "'$3-$1-$2'")
performs the desired replacement (returns "2017-05-06"), but for
val y = "where date >= '5/6/2017'"
this does not return the desired replacement (returns "2017-5-6" -- for me, an invalid representation). With the Joda Time wrapper nscala-time, I've tried capturing the dates and then reformatting them:
import com.github.nscala_time.time.Imports._
import org.joda.time.DateTime
val f = DateTimeFormat.forPattern("yyyy-MM-dd")
y.replaceAll("'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'",
"'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime("$1"))+"'")
But this fails with a java.lang.IllegalArgumentException: Invalid format: "$1". I've also tried using the f interpolator and padding with 0s, but it doesn't seem to like that either.
Are you not able to do additional processing on the captured groups ($1, etc.) inside the replaceAll? If not, how else can I achieve the desired result?
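Backreferences such as $1 are only interpreted inside the replacement string that replaceAll receives. In your code, "$1" is passed to parseDateTime as a literal string before replaceAll ever runs, so at that point it is not a backreference.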
You may use a "callback" with replaceAllIn to actually get the match object and access its groups to further manipulate them:
val pattern = "'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'".r
// y is a val, so bind the result to a new name instead of reassigning y
val result = pattern.replaceAllIn(y, m => "'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(m.group(1)))+"'")
Regex.replaceAllIn is overloaded and can take a Match => String.
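For comparison, the same callback idea can be sketched in Python, where re.sub accepts a function that receives each match object. This is only an illustration under the assumption that every quoted date is in month/day/year order; the names DATE, to_iso and normalize_dates are hypothetical.

```python
import re
from datetime import datetime

# A quoted date with 1- or 2-digit month and day and a 4-digit year.
DATE = re.compile(r"'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'")

def to_iso(match):
    # The callback receives the match object, so the captured group can be
    # parsed and reformatted before it is substituted back into the string.
    parsed = datetime.strptime(match.group(1), "%m/%d/%Y")
    return "'" + parsed.strftime("%Y-%m-%d") + "'"

def normalize_dates(text):
    return DATE.sub(to_iso, text)

print(normalize_dates("where date >= '5/6/2017'"))
```

Parsing the date instead of padding digits by hand also rejects impossible values such as month 13, which raise a ValueError here.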

Regular expression for custom syntax in text input

I'm supposed to enforce a certain search-syntax in a text input, and after watching several RegEx videos and tutorials, I'm still having difficulties creating a regex for my purpose.
The expression structure should be something like that:
$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8
It may start with a free-text search that may contain any character other than the delimiter, which is a comma (the free text must come first, and the string may consist ONLY of the free-text search).
After the free text come comma-separated parts, each made up of a field name consisting only of letters ([a-zA-Z]), followed by an operator (=|<|>|<>), followed by a field search value that may be anything but a comma.
Between the commas that separate the parts there may be spaces (\s*).
The free text part or at least one field=value must appear in order for the string to be valid.
Did anyone understand the question? :)
^[^,]*(?:,\s*[a-zA-Z]+(?:[=><]|<>)[^,]+)*$? – Rawing
Thanks, that seems to work. Why did you use non-capturing groups?
He did it most probably because he didn't assume that the groups are to be captured (you didn't specify that).
Plus - if I start out the string with a comma, it is valid, whereas I
want it to not be valid (if there's no free text at the beginning).
That can be accomplished by changing the first * to a +, i. e. ^[^,]+…
I'm using javascript. I want to be able to separate each key=value
pair (including the possible free text as a group), and within that
group I would like to be able to capture key, operator, and value as
separate entities (or groups)
That's not doable with only one RegExp invocation, see e. g. How to capture an arbitrary number of groups in JavaScript Regexp? Here's an example solution:
var s = '$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8'
var part = /,\s*([a-zA-Z]+)(<>|[=><])([^,]+)/
var re = RegExp('^([^,]+)(' + part.source + ')*$')
var valid = re.exec(s) // null when s does not follow the syntax
var freetext = valid && valid[1] // free text is the 1st capture group of re
if (freetext) {
    document.write('free text:', freetext, '<p>')
    var parts = RegExp(part.source, 'g')
    var m = s.slice(freetext.length).match(parts) // now get all key=value pairs into m[]
    if (m) {
        var field = []
        for (var i = 0; i < m.length; ++i) {
            var f = m[i].match(part) // and now capture key, operator and value from m[i]
            field[i] = { key: f[1], operator: f[2], value: f[3] }
            for (var property in field[i]) // display them
                document.write(property, ':', field[i][property], '; ')
            document.write('<p>')
        }
        document.write(field.length, ' key/value pairs total<p>')
    }
}
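The validate-then-extract approach above can also be sketched in Python, where re.findall returns the key, operator and value of every part in one call. This is a rough equivalent rather than a translation; the names PART, FULL and parse_search are hypothetical, and stripping whitespace from the values is an added assumption.

```python
import re

# One ", key<op>value" part; "<>" is listed first so it is not read as "<".
PART = r",\s*([a-zA-Z]+)(<>|[=<>])([^,]+)"
# Whole input: mandatory free text followed by any number of parts.
FULL = re.compile(r"^([^,]+)(?:" + PART + r")*$")

def parse_search(s):
    """Return (free_text, [(key, operator, value), ...]) or None if s is invalid."""
    m = FULL.match(s)
    if m is None:
        return None
    free_text = m.group(1)
    # Second pass over the remainder: one tuple per key/operator/value part.
    fields = [(k, op, v.strip()) for k, op, v in re.findall(PART, s[len(free_text):])]
    return free_text, fields

query = "$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8"
print(parse_search(query))
```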

Postgresql - How do I extract the first occurrence of a substring in a string using a regular expression pattern?

I am trying to extract a substring from a text column using a regular expression, but in some cases, there are multiple instances of that substring in the string.
In those cases, I am finding that the query does not return the first occurrence of the substring. Does anyone know what I am doing wrong?
For example:
If I have this data:
create table data1
(full_text text, name text);
insert into data1 (full_text)
values ('I 56, donkey, moon, I 92')
I am using
UPDATE data1
SET name = substring(full_text from '%#"I ([0-9]{1,3})#"%' for '#')
and I want to get 'I 56' not 'I 92'
You can use regexp_matches() instead:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1];
As no additional flag is passed, regexp_matches() only returns the first match - but it returns an array so you need to pick the first (and only) element from the result (that's the [1] part)
It is probably a good idea to limit the update to only rows that would match the regex in the first place:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]
where full_text ~ 'I [0-9]{1,3}'
Try the following expression. It will return the first occurrence:
SUBSTRING(full_text, 'I [0-9]{1,3}')
In PostgreSQL 10+ you can use regexp_match():
select regexp_match('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
Quote from documentation:
In most cases regexp_matches() should be used with the g flag, since
if you only want the first match, it's easier and more efficient to
use regexp_match(). However, regexp_match() only exists in PostgreSQL
version 10 and up. When working in older versions, a common trick is
to place a regexp_matches() call in a sub-select...

How to parse GET tokens from URL with regular expression

Given a URL with GET arguments such as
http://www.domain.com?key1=value1+value2+value3&key2=value4+value5
I wish to capture all the values for a given key (into separate references if possible). For example if the desired key was key1 i would want to capture value1 in \1 (or $1 depending on language), value2 in \2, and value3 in \3.
My flawed regex is:
/[?&](?:key1)=((?:[^+&]+[+&$])+)/
which yields 0 results.
I am writing this in c++ using ECMA syntax, but I think I could convert a solution or advice from any regex flavor to ECMA. Any help would be appreciated.
This has been answered before and there are compact scripts written for it.
Regular expressions are not optimal for extracting query string values. At the end of this answer, I will give you an expression which can extract the value(s) for a given field into separate references. But note that extracting the parameters one at a time with regular expressions takes a lot of time, whereas they can all be extracted very quickly with no regular expression engine at all. For instance, http://www.htmlgoodies.com/beyond/javascript/article.php/3755006/How-to-Use-a-JavaScript-Query-String-Parser.htm
What language are you trying to use to extract these parameters, C++?
If you are using JavaScript, you can use the small functions mentioned in the article above, i.e.,
function ptq(q)
{
    /* parse the query */
    var x = q.replace(/;/g, '&').split('&'), i, name, t;
    /* q changes from string version of query to object */
    for (q = {}, i = 0; i < x.length; i++)
    {
        t = x[i].split('=', 2);
        name = unescape(t[0]);
        if (!q[name])
            q[name] = [];
        if (t.length > 1)
        {
            q[name][q[name].length] = unescape(t[1]);
        }
        /* next two lines are nonstandard, allowing programmer-friendly Boolean parameters */
        else
            q[name][q[name].length] = true;
    }
    return q;
}
function param() {
    /* the + must be escaped: an unescaped /+/ is a syntax error */
    return ptq(location.search.substring(1).replace(/\+/g, ' '));
}
Once you have that code included in your page's scripts, you can parse the current URL's data by doing query = param(); and then using the value of query.key1, etc.
You can parse other query-string formatted data by using the ptq() function directly, i.e., query_object = ptq(query_string).
If you are using another language and regular expressions are the way you want to do it, then this would return all values matching key1, for instance:
/key1=([^&;]*)/g
That will return all the values with a certain field name (which in the query string definition, are written like this, key1=value1&key1=value2&key1=value3, etc.).
The way you ask your question makes it sound like you want to create your own programmer-friendly way of supplying values (i.e., by constructing your own custom URLs rather than receiving data from form submissions through browsers) in which your values are separated by spaces (spaces are encoded as + signs in an HTTP GET query string, and as %20 in generic query strings).
You could make a complicated regular expression to do this in one step, but it is faster to match the entire field (all the values and the + signs as well), and then split the result at the + signs.
For each of the results from the regular expression I indicate, you can extract the plus-sign separated values by simply doing /[^+]+/g (using + rather than * so that empty matches are not returned).
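The two-step idea (match the whole field, then split at the + signs) can be sketched in Python as follows. The function names are hypothetical; the second function shows the parser-based route using the standard library's urllib.parse, which decodes + to a space.

```python
import re
from urllib.parse import parse_qs, urlsplit

def values_by_regex(url, key):
    # Step 1: capture the whole field for key; step 2: split it at '+'.
    m = re.search(r"[?&;]" + re.escape(key) + r"=([^&;]*)", url)
    return m.group(1).split("+") if m else []

def values_by_parser(url, key):
    # parse_qs decodes '+' to a space, so split on whitespace afterwards.
    fields = parse_qs(urlsplit(url).query)
    return [v for joined in fields.get(key, []) for v in joined.split()]

url = "http://www.domain.com?key1=value1+value2+value3&key2=value4+value5"
print(values_by_regex(url, "key1"))
print(values_by_parser(url, "key2"))
```

The regex version only returns the first field with the given name, while the parser version collects the values of every repeated key1=... field.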

Struggling with regex logic: how do I remove a param from a url query string?

I'm comparing 2 URL query strings to see if they're equal; however, I want to ignore a specific query parameter (always with a numeric value) if it exists. So, these 2 query strings should be equal:
firstName=bobby&lastName=tables&paramToIgnore=2
firstName=bobby&lastName=tables&paramToIgnore=5
So, I tried to use a regex replace using the REReplaceNoCase function:
REReplaceNoCase(myQueryString, "&paramToIgnore=[0-9]*", "")
This works fine for the above example. I apply the replace to both strings and then compare. The problem is that I can't be sure that the param will be the last one in the string... the following 2 query strings should also be equal:
firstName=bobby&lastName=tables&paramToIgnore=2
paramToIgnore=5&firstName=bobby&lastName=tables
So, I changed the regex to make the preceding ampersand optional... "&?paramToIgnore=[0-9]*". But - these strings will still not be equal as I'll be left with an extra ampersand in one of the strings but not the other:
firstName=bobby&lastName=tables
&firstName=bobby&lastName=tables
Similarly, I can't just remove preceding and following ampersands ("&?paramToIgnore=[0-9]*&?") as if the query param is in the middle of the string I'll strip one ampersand too many in one string and not the other - e.g.
firstName=bobby&lastName=tables&paramToIgnore=2
firstName=bobby&paramToIgnore=5&lastName=tables
will become
firstName=bobby&lastName=tables
firstName=bobbylastName=tables
I can't seem to get my head around the logic of this... Can anyone help me out with a solution?
If you can't be sure of the order in which the parameters appear, I would recommend that you don't compare the strings directly.
I recommend splitting the strings up like this:
String stringA = "firstName=bobby&lastName=tables&paramToIgnore=2";
String stringB = "firstName=bobby&lastName=tables&paramToIgnore=5";
String[] partsA = stringA.split("&");
String[] partsB = stringB.split("&");
Then go through the arrays and make the paramToIgnore entries equal:
for (int i = 0; i < partsA.length; i++)
{
    if (partsA[i].startsWith("paramToIgnore")) {
        partsA[i] = "IgnoreMePlease";
    }
}
for (int j = 0; j < partsB.length; j++)
{
    if (partsB[j].startsWith("paramToIgnore")) {
        partsB[j] = "IgnoreMePlease";
    }
}
Then you can sort and compare the arrays to see if they are equal:
Arrays.sort(partsA);
Arrays.sort(partsB);
boolean b = Arrays.equals(partsA, partsB);
I'm pretty sure it's possible to make this more compact and give it better performance. But if you compare the raw strings as you do now, you always have to care about the order of your parameters.
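The split, filter, sort and compare steps above can be condensed into a short Python sketch using the standard library's parse_qsl. The function name same_query and the choice to drop the ignored parameter entirely (rather than replace it with a placeholder) are assumptions made for illustration.

```python
from urllib.parse import parse_qsl

def same_query(a, b, ignored="paramToIgnore"):
    """Compare two query strings, ignoring parameter order and one parameter."""
    def normalized(qs):
        # keep_blank_values=True so "x=" is not silently discarded.
        pairs = parse_qsl(qs, keep_blank_values=True)
        return sorted(p for p in pairs if p[0] != ignored)
    return normalized(a) == normalized(b)

print(same_query("firstName=bobby&lastName=tables&paramToIgnore=2",
                 "paramToIgnore=5&firstName=bobby&lastName=tables"))
```

Dropping the parameter also makes a string that lacks it compare equal to one that has it, which matches the "if it exists" wording of the question.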
You can use the QueryStringDeleteVar UDF on cflib to remove the query string variables you want to ignore from both strings, then compare them.
Do it in two steps:
first remove your param, as you described in your example
then, with a separate regex, remove any ampersand left at the beginning or the end of the query, and collapse any double/triple/... ampersands in the middle of the query
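A minimal Python sketch of these two steps, assuming the ignored parameter always has a numeric (possibly empty) value as in the question; the function name strip_param is hypothetical.

```python
import re

def strip_param(query, name="paramToIgnore"):
    # Step 1: remove name=<digits> where it starts the string or follows '&'.
    query = re.sub(r"(?:^|(?<=&))" + re.escape(name) + r"=[0-9]*", "", query)
    # Step 2: collapse repeated ampersands, then trim any left at either end.
    return re.sub(r"&{2,}", "&", query).strip("&")

for q in ("firstName=bobby&lastName=tables&paramToIgnore=2",
          "paramToIgnore=5&firstName=bobby&lastName=tables",
          "firstName=bobby&paramToIgnore=5&lastName=tables"):
    print(strip_param(q))
```

This only normalizes the leftover ampersands; the two query strings still need their parameters in the same order to compare equal.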
How about having an 'or' in the RegEx to match an ampersand at the start or the end?
&paramToIgnore=[0-9]*|paramToIgnore=[0-9]*&
Seems to do the job when testing in regexpal.com
Try changing it to:
REReplaceNoCase(myQueryString, "&?paramToIgnore=[0-9]+", "")
Using plus instead of star matches one or more of the preceding characters. It won't match anything but 0-9, so if there is another parameter after it, the match stops at the first character that is not a digit.
Alternatively, you could use:
REReplaceNoCase(myQueryString, "&?paramToIgnore=[^&]*", "")
This matches any run of characters other than an ampersand. Because of the star, it also covers the case where the parameter exists but has no value, which is probably something you'd want to account for.