Regular expression for validating input string in R

I am trying to write a regular expression in R to validate user input and run the program accordingly.
Three types of queries are expected; all are character vectors.
query1 = "Oct4[Title/Abstract] AND BCR-ABL1[Title/Abstract]
AND stem cells[Title] AND (2000[PDAT] :2015[PDAT])"
query2 <- c("26527521","26711930","26314551")
The following code works, but the challenge is restricting special characters in both cases:
all(grepl("[A-Za-z]+", query, perl = TRUE)) # evaluates to FALSE for query2
or, as @sebkopf suggested:
all(grepl("^[0-9 ,]+$", query)) # evaluates to TRUE only for query2
However, query1 also takes a year as input, which means numeric input should be accepted for query1. To add complexity, space , . - [] () are allowed in query1. The format for query2 should be ONLY numbers, separated by commas or spaces. Anything else should throw an error.
How can I incorporate both of these conditions into an R regular expression, so that the following if conditions are validated accordingly and run the respective code?
if (grepl("regex for queries 1 & 2", query, perl = TRUE)) {
  # run code 1
} else {
  print("these characters are not allowed # ! & % # * ~ ` _ = +")
}
if (grepl("regex for query 3", query, perl = TRUE)) {
  # run code 2
} else {
  print("these characters are not allowed # ! & % # * ~ ` _ = + [] () - . ")
}

In your current regexes you are just looking for an occurrence of the pattern ("[A-Za-z]+") anywhere in the query. If you want to allow only certain character patterns, you need to make sure the pattern matches the whole query, using "^...$".
With regular expressions there are always multiple ways of doing anything, but to provide an example that matches a query while excluding specific special characters (everything else allowed), you could use the following (wrapped in all to account for your query3 being a vector):
all(grepl("^[^#!&%#*~`_=+]+$", query)) # evaluates to TRUE for your query1, 2 & 3
To instead do a positive match that only accepts queries consisting of numbers plus spaces and commas:
all(grepl("^[0-9 ,]+$", query)) # evaluates to TRUE only for query3
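The difference that anchoring makes is easy to demonstrate. Here is a quick sketch of the same idea in Python (the anchoring behaviour is the same across regex engines; the sample strings are taken from query2 above):

```python
import re

queries = ["26527521", "26711930", "26314551"]

# Unanchored: succeeds if the allowed characters appear anywhere,
# so a string containing forbidden characters still passes.
assert all(re.search(r"[0-9 ,]+", q) for q in queries)
assert re.search(r"[0-9 ,]+", "26527521!") is not None   # too permissive

# Anchored (^...$, or fullmatch): the whole string must consist
# of allowed characters only.
assert all(re.fullmatch(r"[0-9 ,]+", q) for q in queries)
assert re.fullmatch(r"[0-9 ,]+", "26527521!") is None    # rejected as intended
```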


regular expression

I am trying to find a regular expression that satisfies the following needs.
It should treat all spaces as separators until a colon has been passed twice. After that, it should continue to use spaces as separators until a third colon is identified. This third colon should be used as a separator as well, but any spaces before and after this specific colon should not be. After this special colon has been identified, no further separators should be recognized, whether space or colon.
2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController : z rest as async texting: json, special character, spacses.....
I would like the separators here to be identified as follows (separator shown as X):
2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....
2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....
Exactly 8 separators are found here.
Any ideas how to do this via regular expression?
My current approach does not work; I tried to do it like the following. Any ideas about this?
Update:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s
That's my current syntax to identify the separators.
Update 2:
The regex above is faulty.
Update 3:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
This is the current regex. The target approach is to call:
df = pd.read_csv(file_name,
                 sep="(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
                 names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
                 engine='python',
                 nrows=100)
This could later be extended to dask, which would give me the chance to parse multiple log files into one dataframe.
The last column, line, is not identified correctly, for reasons as yet unknown.
If that log format is sufficiently regular, you can take the lines apart much more easily with str.split.
The assumptions are that none of the first eight fields have an internal space, and that all of them are always present (or, if not all are present, that the last field, which starts after the colon, is also not present). You can then use the maxsplit argument to str.split in order to stop splitting when the ninth field starts:
def separate(logline):
    fields = logline.split(maxsplit=8)  # 8 space-separated fields + the rest
    if len(fields) > 8:
        # Fix up the ninth field. Perhaps you want to remove the colon:
        fields[8] = fields[8][1:]
        # or perhaps you want the text starting at the first non-whitespace
        # character after the colon:
        #
        # if fields[8][0] == ':':
        #     fields[8] = fields[8].split(maxsplit=1)[1]
        #
        # etc.
    return fields
>>> logline = ( "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
... + " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
... + " c.w.f.w.NiceController"
... + " : z rest as async texting: json, special character, spaces.....")
>>> separate(logline)
['2019-12-28', '13:00:00.112', 'DEBUG', 'n-somethingspecial.at', '---',
'[9999-118684]', '3894ß8349ß84930ßaa14e38eae18e3ebf',
'c.w.f.w.NiceController',
' z rest as async texting: json, special character, spaces.....']
Solution
The current outcome of my problem can be solved via the following regular expression.
(?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
Minor adaptations may still have to be made, but for now it works pretty well.

Is there a way to extract a string while parsing using Scala combinators?

I have a language which accepts strings similar to this: !A(){assertion}.
I want assertion to be treated as a string without quotation marks, meaning that anything can be placed where assertion stands. I am currently using the parser combinators in Scala, and for the assertion part I am using a regex function as below:
def regex(r: Regex): Parser[String] = new Parser[String] {
  def apply(in: Input) =
    r.findPrefixMatchOf(in.source.subSequence(in.offset, in.source.length)) match {
      case Some(matched) =>
        Success(matched.group(1).toString, in.drop(in.offset + matched.end - 1))
      case None =>
        Failure("string matching regex `" + r + "' expected but " + in.first + " found", in)
    }
}
However, I have an issue because Input's drop function takes as its parameter the number of elements to drop from the beginning, and the number of elements consumed by the regex function is unknown.
Example
If I have !A(){num >= count} and the regex function extracts num >= count, I would have extracted 3 elements in total. Therefore, I need to drop those 3 elements from the input, i.e. it should return Success(matched.group(1).toString, in.drop(3)). However, I cannot find a way to know the number of elements to drop from the input.

How would you write a regex for input validation so certain symbols can't be repeated?

I'm attempting to write a regex to prevent certain user input in mathematical expressions (e.g. '1+1' would be valid whereas '1++1' should be invalidated).
Acceptable characters include digits 0-9 (\d works in lieu of 0-9), + - * / ( ) and white-space.
I've attempted to put together a regex, but I can't find anything in Python's regular expression syntax that would validate (or, consequently, invalidate) certain characters when typed together.
(( is ok
++, --, +-, */, are not
I hope there is a simple way to do this, but I anticipate if there isn't, I will have to write regex's for every possible combination of characters I don't want to allow together.
I've tried:
re.compile(r"[\d\s*/()+-]")
re.compile(r"[\d]\[\s]\[*]\[/]\[(]\[)]\[+]\[-]")
I expect to be able to invalidate the expression if someone were to type "1++1"
Edit: Someone suggested the below link is similar to my question...it is not :)
Validate mathematical expressions using regular expression?
Probably the way to go is to invert your logic:
abort if the regex detects any invalid combination; there are far fewer of those than there are valid combinations.
So e.g.:
re.compile(r"\+\+")
Also, is it possible at all to enumerate all valid terms? If the length of a term is not limited, it is impossible to enumerate all valid terms.
Perhaps one option might be to check the string for the unwanted combinations:
[0-9]\s*(?:[+-][+-]|\*/)\s*[0-9]
For example
import re

pattern = r"[0-9]\s*(?:[+-][+-]|\*/)\s*[0-9]"
strings = [
    'This is test 1 -- 1',
    'This is test 2',
    'This is test 3+1',
    'This is test 4 */4'
]
for s in strings:
    res = re.search(pattern, s)
    if not res:
        print("Valid: " + s)
Result
Valid: This is test 2
Valid: This is test 3+1
Below is a snippet from my code. This is hardly the solution I was originally looking for, but it does accomplish what I was trying to do: when a user curls the API endpoint with 1++1, it returns "Forbidden" based on the regex for math2 below; when a user curls with 1+1, it returns "OK" instead.
# Returns True if the expression is valid, False if not.
def validate_expression(calc):
    math = re.compile(r"^[\d\s*/()+-]+$")  # the whole string must use allowed characters
    math2 = re.compile(r"\+\+")            # a doubled plus anywhere makes it invalid
    print('2' " " 'validate_expression')   # debug trace
    if math.search(calc) is not None and math2.search(calc) is None:
        return True
    else:
        return False
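A more general variant of the same idea is sketched below (not from the original post; is_valid is a hypothetical name, and the forbidden-pair class is an assumption generalizing the accepted answer's pattern to any two adjacent operators):

```python
import re

ALLOWED = re.compile(r"^[\d\s*/()+-]+$")           # whole string: permitted characters only
FORBIDDEN_PAIR = re.compile(r"[+\-*/]\s*[+\-*/]")  # two operators in a row, e.g. ++ or */

def is_valid(expr):
    # Valid when every character is allowed AND no operator pair occurs.
    # Repeated parentheses like "((" still pass, as required.
    return bool(ALLOWED.search(expr)) and FORBIDDEN_PAIR.search(expr) is None

print(is_valid("1+1"))      # True
print(is_valid("1++1"))     # False
print(is_valid("(1+2)*3"))  # True
```

Note this rejects any operator pair, including "1 */ 4" with spaces between the operators, which matches the behaviour requested in the question.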

Convert equation from string in postgresql

I am trying to write a query that takes in a string in which an equation of the form
x^3 + 0.0046x^2 - 0.159x +1.713
is expected. The equation is used to calculate new values in the output table from a list of existing values. Hence I need to convert the input equation string into an expression that PostgreSQL can process, e.g.
power(data.value,3) + 0.0046 * power(data.value,2) - 0.159 * data.value + 1.713
A few comforting constraints in this task are
The equation will always be in the form of a polynomial, e.g. sum(A_n * x^n)
The user will always use 'x' to represent the variable in the input equation
I have been pushing my queries into a string and executing it at the end, e.g.
_query TEXT;
SELECT 'select * from ' INTO _query;
SELECT _query || 'product.getlength( ' || min || ',' || max || ')' INTO _query;
RETURN QUERY EXECUTE _query;
Hence I know I only need to somehow:
Replace the 'x's with 'data.value'
Find all the places in the equation string where a number
immediately precedes an 'x', and add a '*'
Find all exponential operations (x^n) in the equation string and
convert them to power(x,n)
This may very well be trivial for a lot of people; unfortunately PostgreSQL is not my strongest skill, and I have already spent way more time than I can afford to get this working. Any help is highly appreciated, cheers.
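The three rewrites above can be prototyped as plain string substitutions. Here is a sketch in Python (to_sql is a hypothetical name; the regexes assume non-negative integer powers and 'x' as the variable, per the stated constraints):

```python
import re

def to_sql(expr):
    # Hypothetical helper: rewrite a polynomial in x into a
    # PostgreSQL-ready expression over data.value.
    expr = re.sub(r"(\d)x", r"\1 * x", expr)                   # 0.0046x -> 0.0046 * x
    expr = re.sub(r"x\^(\d+)", r"power(data.value,\1)", expr)  # x^n -> power(data.value,n)
    expr = expr.replace("x", "data.value")                     # remaining bare x
    return expr

print(to_sql("x^3 + 0.0046x^2 - 0.159x +1.713"))
# power(data.value,3) + 0.0046 * power(data.value,2) - 0.159 * data.value +1.713
```

The order of the substitutions matters: the '*' insertion and power() conversion must run before the bare 'x' replacement, or 'x^3' would be mangled first.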
Your 9am-noon time frame is over, but here goes.
Every term of the polynomial has 4 elements:
Addition/subtraction modifier
Multiplier
Parameter, always x in your case
Power
The problem is that these elements are not always present. The first term has no addition element, although it could have a subtraction sign, which is then typically attached to the multiplier. Multipliers are only given when not equal to 1. The parameter is not present in the last term, and neither is a power in the last two terms.
With optional capture groups in regular expression parsing you can sort out this mess and PostgreSQL has the handy regexp_matches() function for this:
SELECT * FROM
regexp_matches('x^3 + 0.0046x^2 - 0.159x +1.713',
'\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms);
The regular expression says this:
\s* Read 0 or more spaces.
([+-]?) Capture 0 or 1 plus or minus sign.
\s* Read 0 or more spaces.
([0-9.]*) Capture a number consisting of digit and a decimal dot, if present.
(x?) Capture the parameter x. This is necessary to differentiate between the last two terms, see query below.
\^? Read the power symbol, if present. It must be escaped because ^ is a special character in regular expressions.
([0-9]*) Capture an integer number, if present.
The g modifier repeats this process for every matching pattern in the string.
On your string this yields, in the form of string arrays:
| terms |
|-----------------|
| {'','',x,3} |
| {+,0.0046,x,2} |
| {-,0.159,x,''} |
| {+,1.713,'',''} |
| {'','','',''} |
(The last line of all empty strings appears because every element of the pattern is optional, so the pattern also matches the empty string; with the 'g' flag this yields one final zero-length match at the end of the input.)
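As a cross-check, Python's re module treats the same pattern the same way; because every part of the pattern is optional, there is an extra zero-length match at the very end of the input:

```python
import re

# The same capture pattern used in the regexp_matches() call above.
pattern = r"\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)"
terms = re.findall(pattern, "x^3 + 0.0046x^2 - 0.159x +1.713")
for t in terms:
    print(t)
# ('', '', 'x', '3')
# ('+', '0.0046', 'x', '2')
# ('-', '0.159', 'x', '')
# ('+', '1.713', '', '')
# ('', '', '', '')   <- zero-length match at the end of the string
```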
With this result, you can piece your query together:
SELECT id, sum(term)
FROM (
    SELECT id,
           CASE WHEN terms[1] = '-' THEN -1
                WHEN terms[1] = '+' THEN 1
                WHEN terms[3] = 'x' THEN 1   -- If no x then NULL
           END *
           CASE terms[2] WHEN '' THEN 1. ELSE terms[2]::float
           END *
           value ^ CASE WHEN terms[3] = '' THEN 0   -- If no x then 0 (x^0)
                        WHEN terms[4] = '' THEN 1   -- If no power then 1 (x^1)
                        ELSE terms[4]::int
                   END AS term
    FROM data
    JOIN regexp_matches('x^3 + 0.0046x^2 - 0.159x +1.713',
                        '\s*([+-]?)\s*([0-9.]*)(x?)\^?([0-9]*)', 'g') AS r (terms) ON true
) sub
GROUP BY id
ORDER BY id;
This assumes you have an id column to join on. If all you have is a value then you can still do it but you should then wrap the above query in a function that you feed the polynomial and the value. The power is assumed to be integral but you can easily turn that into a real number by adding a dot . to the regular expression and a ::float cast instead of ::int in the CASE statement. You can also support negative powers by adding another capture group to the regular expression and a case statement in the query, same as for the multiplier term; I leave this for your next weekend hackfest.
This query will also handle "odd" polynomials such as -4.3x^3+ 101.2 + 0.0046x^6 - 0.952x^7 +4x just so long as the pattern described above is maintained.

subset doesn't recognize regex

I just cannot get this working. I want to subset all the rows containing "mail". I use this:
Email <- subset(Total_Content, source == ".*mail.*")
I have rows like this ones:
"snt152.mail.live.com",
"mailing.serviciosmovistar.com",
"blu179.mail.live.com"
But when using View(Email),
I just get an empty data.frame (I only see the columns). I shouldn't need to escape any metacharacters, because I need the "." to mean "any character" and the "*" to mean "zero or more times", right? Thanks.
Well, no, it doesn't - it's not meant to. You're not passing a regular expression to be evaluated against each row; you're just passing a character string. It doesn't know that . and * are regex characters because it's not performing a regex search. It's returning all rows where source is the literal string ".*mail.*" - which in this case is 0 rows.
What you probably want to be doing (I'm assuming this is a data.frame, here) is:
Email <- Total_Content[grepl(x = Total_Content$source, pattern = ".*mail.*"),]
grepl produces a set of boolean values of whether each entry in Total_Content$source matched the pattern. Total_Content[boolean_vector,] limits to those rows of Total_Content where the equivalent boolean is TRUE.
Why not use subset with a logical regex function?
Email <- subset(Total_Content, grepl(".*mail.*", source) )
The subset function does create a local environment for the evaluation of expressions that are used in either the 'subset' (row targets) or the 'select' (column targets) arguments.
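For comparison, the same boolean-mask idea in Python/pandas (a sketch with made-up sample rows; Total_Content and source mirror the R names, and str.contains plays the role of grepl):

```python
import pandas as pd

# Hypothetical sample data mirroring the R data.frame.
Total_Content = pd.DataFrame({"source": [
    "snt152.mail.live.com",
    "mailing.serviciosmovistar.com",
    "example.com",
]})

# str.contains builds a boolean vector, like grepl in R;
# indexing with it keeps only the matching rows, like subset.
mask = Total_Content["source"].str.contains("mail")
Email = Total_Content[mask]
print(Email)
```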