Scala: Tokenizing simple arithmetic expressions

Scala: Tokenizing simple arithmetic expressions - regex

How can I split 23+3*5 or 2 + 3*5 into the list List("23", "+", "3", "*", "5")?
I tried methods such as split and splitAt, but none gave the desired result.
I want the string to be split at the arithmetic operators.

Try something like
"2 + 4 - 3 * 5 / 7 / 3".split("(?=[+/*-])|(?<=[+/*-])").map(_.trim)
In this particular example, it gives you:
Array(2, +, 4, -, 3, *, 5, /, 7, /, 3)
(?= ) is a lookahead and (?<= ) is a lookbehind. Essentially, the pattern cuts the string before and after every operator. Note that - is in the last position of [+/*-]; placed between two other characters, it would be interpreted as a character range (e.g. [a-z]).
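A minimal Scala sketch of the split-based approach, wrapped in a helper so it can be reused. The names SplitTokenizer and tokenize are illustrative only, and the extra filter for empty strings is an added assumption to cope with whitespace between consecutive operators.

```scala
object SplitTokenizer {
  // Zero-width split points: just before or just after any operator.
  // The hyphen is last in the class, so it is a literal and not a range.
  private val Boundary = "(?=[+/*-])|(?<=[+/*-])"

  // Hypothetical helper: cut at operators, trim spaces, drop empty pieces.
  def tokenize(expr: String): List[String] =
    expr.split(Boundary).toList.map(_.trim).filter(_.nonEmpty)
}

object SplitTokenizerDemo extends App {
  println(SplitTokenizer.tokenize("23+3*5"))
  println(SplitTokenizer.tokenize("2 + 3*5"))
}
```

Note that this approach keeps anything between operators as a token, so it does not validate that the operands are numbers.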

I suggest matching on what you want as tokens.
e.g.
"\\d+|[-+*/]".r.findAllIn(" 23 + 4 * 5 / 7").toList
// List(23, +, 4, *, 5, /, 7)

Related

Regex match character after lazy grouping

I want to match specifically the comma "," after the two groups ( AS) and (.*?).
I have a positive lookbehind that skips the AS, but I can't get the match to skip the lazy wildcard group.
Regex:
(?<= AS)(.*?)(,)
Sample text
SELECT LEFT(CustomerCode, 5) AS SMSiteCode, SUBSTRING(CustomerCode, 6, LEN(CustomerCode) - 5) AS SMCustCode, SUBSTRING(AgreeNo, 6, LEN(AgreeNo) - 5)
AS SMAgreeNo, CAST(SeqNo AS int) AS SeqNo, SUBSTRING(TrxDate, 7, 2) + SUBSTRING(TrxDate, 4, 2) + SUBSTRING(TrxDate, 1, 2) AS TrxDate, TrxTime,
CAST(Charge AS bit) AS Charge, CASE WHEN LEN(AnalysisCode) > 5 THEN SUBSTRING(AnalysisCode, 6, LEN(AnalysisCode) - 5)
ELSE AnalysisCode END AS AnalysisCode, CAST(ISNULL(Description, N'') AS nvarchar(100)) AS Description, CAST(TaxAmt AS money) AS TaxAmt,
CAST(TotAmt AS money) AS TotAmt, CAST(Match AS bigint) AS Match, CAST(Confirmed AS bit) AS Confirmed, CAST(Balance AS money) AS Balance,
CAST(QtyBal AS money) AS QtyBal, CAST(ISNULL(Drawer, N'') AS nvarchar(50)) AS Drawer, SUBSTRING(DateBanked, 7, 2) + SUBSTRING(DateBanked, 4, 2)
+ SUBSTRING(DateBanked, 1, 2) AS DateBanked, CAST(ISNULL(BankBranch, N'') AS nvarchar(50)) AS BankBranch, CAST(Qty AS float) AS Qty, CAST(ISNULL(Narration,
N'') AS nvarchar(100)) AS Narration, SUBSTRING(DateFrom, 7, 2) + SUBSTRING(DateFrom, 4, 2) + SUBSTRING(DateFrom, 1, 2) AS DateFrom, SUBSTRING(DateTo, 7, 2)
+ SUBSTRING(DateTo, 4, 2) + SUBSTRING(DateTo, 1, 2) AS DateTo, CAST(PrintNarration AS bit) AS PrintNarration, CAST(DiscAmt AS float) AS DiscAmt,
CAST(ISNULL(CCAuthNo, N'') AS nvarchar(20)) AS CCAuthNo, CAST(ISNULL(CCTransID, N'') AS nvarchar(20)) AS CCTransID, CAST(UserLogin AS nvarchar(20))
AS UserLogin, CAST(Reconciled AS bit) AS Reconciled, SUBSTRING(DateReconciled, 7, 2) + SUBSTRING(DateReconciled, 4, 2) + SUBSTRING(DateReconciled, 1, 2)
AS DateReconciled, CAST(PrimaryKey AS bigint) AS PrimaryKey, SUBSTRING(InvDate, 7, 2) + SUBSTRING(InvDate, 4, 2) + SUBSTRING(InvDate, 1, 2) AS InvDate,
CAST(InvNo AS int) AS InvNo FROM SomeDatabase.dbo.tblTransaction WHERE IsDate(trxTime) = 1

You could try \K, but make sure to switch the engine in RegExr from JavaScript to PCRE (top right of the screen).
\K is defined as:
Sets the given position in the regex as the new "start" of the match. This means that nothing preceding the K will be captured in the overall match.
With \K, you could try something like this:
(?<= AS).*?\K(,)
Example: https://regex101.com/r/X3AdbH/1/

If \K is supported, you could get your matches without a lookbehind or a capturing group by matching AS and then using a negated character class to match any char except a comma.
AS [^,]+\K,
Explanation
AS - matches a space, AS, and a space
[^,]+ - matches any char except a comma, 1+ times
\K, - forgets what was matched so far and matches a comma
Regex demo
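Engines without \K, such as the java.util.regex engine behind Scala's Regex, can reach the same comma by capturing it and reading the group's offset. The following is a minimal sketch under that assumption; CommaAfterAlias and commaOffsets are hypothetical names, not part of any library.

```scala
import scala.util.matching.Regex

object CommaAfterAlias {
  // Space, AS, space, then 1+ non-comma characters; the comma itself is group 1.
  private val Pattern: Regex = " AS [^,]+(,)".r

  // Hypothetical helper: offsets of every comma that follows " AS <something>".
  def commaOffsets(sql: String): List[Int] =
    Pattern.findAllMatchIn(sql).map(_.start(1)).toList
}

object CommaAfterAliasDemo extends App {
  println(CommaAfterAlias.commaOffsets("SELECT a AS x, b AS y, c"))
}
```

The overall match still includes the AS part here; only the group offset isolates the comma, which is the job \K does in PCRE.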

Your expression looks fine as it is. If you wish, you could restrict the first capturing group to specific characters, for example:
(?<= AS)([A-Za-z\d\s]+)(,)
The expression is explained in the top-right panel of regex101.com, where you can also explore, simplify, or modify it and see how it matches against sample inputs.

Regex Queries for Lines

I am trying to figure out a couple regular expressions for the below cases:
Lines with a length divisible by n, but not by m for integers n and m
Lines that do not contain a certain number n of a given character, but may contain more or less
I am a newcomer and would appreciate any clarification on these.

I've used JavaScript for my examples.
For the first one the key is to note that 'multiples' are just repeats. So /(...)+/ matches 3 characters and then repeats that match as many times as it can. Each repetition of the group doesn't need to match the same 3 characters, but the repetitions do need to be consecutive. Suitable anchoring using ^ and $ ensures you're checking the exact line length, and a negative lookahead (?!...) can be used to negate.
e.g. Multiple of 5 but not 3:
/^(?!(.{3})+$)(.{5})+$/gm
Note that JavaScript uses / to mark the beginning and end of the expression, and gm are modifiers to perform global, multiline matches. I wasn't clear what you meant by matching 'lines', so I've assumed the string itself contains newline characters that must be taken into consideration. If you had, say, an array of lines and could check each one individually, then things get slightly easier, or a lot easier in the case of your second question.
Demo for the first question:
var chars = '12345678901234567890',
str = '';
for (var i = 1 ; i <= chars.length ; ++i) {
str += chars.slice(0, i) + '\n';
}
console.log('Full text is:');
console.log(str);
console.log('Lines with length that is a multiple of 2 but not 3:');
console.log(lineLength(str, 2, 3));
console.log('Lines with length that is a multiple of 3 but not 2:');
console.log(lineLength(str, 3, 2));
console.log('Lines with length that is a multiple of 5 but not 3:');
console.log(lineLength(str, 5, 3));
function lineLength(str, multiple, notMultiple) {
return str.match(new RegExp('^(?!(.{' + notMultiple + '})+$)(.{' + multiple + '})+$', 'gm'));
}
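The same lookahead pattern can be sketched in Scala, the language of the main question. This is an illustration under assumptions: LineLengthRegex is a hypothetical name, the (?m) flag stands in for JavaScript's m modifier, and findAllIn plays the role of the g modifier.

```scala
object LineLengthRegex {
  // Hypothetical helper mirroring lineLength above.
  // (?m) makes ^ and $ match at the start and end of every line.
  def lineLength(text: String, multiple: Int, notMultiple: Int): List[String] = {
    val pattern =
      ("(?m)^(?!(?:.{" + notMultiple + "})+$)(?:.{" + multiple + "})+$").r
    pattern.findAllIn(text).toList
  }
}

object LineLengthRegexDemo extends App {
  // Lines of length 1 to 20, as in the JavaScript demo.
  val text = (1 to 20).map(i => "12345678901234567890".take(i)).mkString("\n")
  LineLengthRegex.lineLength(text, 5, 3).foreach(println)
}
```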
For the second question I couldn't come up with a nice way to do it. This horror show is what I ended up with. It wasn't too bad to match n occurrences of a particular character in a line but matching 'not n' proved difficult. I ended up matching {0,n-1} or {n+1,} but the whole thing doesn't feel so great to me. I suspect there's a cleverer way to do it that I'm currently not seeing.
var str = 'a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb';
console.log('Full string:');
console.log(str);
console.log('Lines with 1 occurrence of a:');
console.log(mOccurrences(str, 'a', 1));
console.log('Lines with 2 occurrences of a:');
console.log(mOccurrences(str, 'a', 2));
console.log('Lines with 3 occurrences of a:');
console.log(mOccurrences(str, 'a', 3));
console.log('Lines with not 1 occurrence of a:');
console.log(notMOccurrences(str, 'a', 1));
console.log('Lines with not 2 occurrences of a:');
console.log(notMOccurrences(str, 'a', 2));
console.log('Lines with not 3 occurrences of a:');
console.log(notMOccurrences(str, 'a', 3));
function mOccurrences(str, character, m) {
return str.match(new RegExp('^[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + m + '}[^' + character + '\n]*$', 'gm'));
}
function notMOccurrences(str, character, m) {
return str.match(new RegExp('^([^' + character + '\n]*(' + character + '[^' + character + '\n]*){0,' + (m - 1) + '}[^' + character + '\n]*|[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + (m + 1) + ',}[^' + character + '\n]*)$', 'gm'));
}
The key to how that works is that it tries to find n occurrences of a separated by sequences of [^a], with \n thrown in to stop it walking onto the next line.
In a real world scenario I would probably do the splitting into lines first as that makes things much easier. Counting the number of occurrences of a particular character is then just:
str.replace(/[^a]/g, '').length;
// Could use this instead, but note that in JS match returns null when there are no matches, so .length would throw
str.match(/a/g).length;
Again, this assumes a JavaScript environment. If you were using regexes in an environment where you could literally only pass in the regex as an argument then it's back to my earlier horror show.
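A minimal sketch of the "split into lines first" idea in Scala, covering both questions without any regex. LineFilters and its method names are hypothetical, and empty lines are excluded from the first check on the assumption that they should behave like the (.{n})+ pattern, which needs at least one repetition.

```scala
object LineFilters {
  // Lines whose length is a positive multiple of n but not a multiple of m.
  def lengthMultipleOfButNot(text: String, n: Int, m: Int): List[String] =
    text.split("\n").toList
      .filter(l => l.nonEmpty && l.length % n == 0 && l.length % m != 0)

  // Lines containing any number of `ch` other than exactly n.
  def withoutExactly(text: String, ch: Char, n: Int): List[String] =
    text.split("\n").toList.filter(_.count(_ == ch) != n)
}

object LineFiltersDemo extends App {
  println(LineFilters.lengthMultipleOfButNot("12345\n123456\n1234567890", 5, 3))
  println(LineFilters.withoutExactly("a\naa\nab\nbab", 'a', 1))
}
```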

Regex: match and tokenize in Scala

I am trying to extract certain patterns from the input string. These patterns are +, - , *, / , (, ), log , integer and float numbers.
Here's example for the needed behavior:
//input string
var str = "log6*(12+5)/2-34.2"
//wanted result
var rightResp = Array("log","6","*","(","12","+","5",")","/","2","-","34.2")
I have tried to do this for some time but I have to admit that regex is not my specialty. Next piece of code shows where I am stuck:
import scala.util.matching.Regex
var str = "log6*(12+5)/2-34.2"
val pattern = new Regex("(\\+|-|log|\\*|\\/|[0-9]*\\.?[0-9]*)")
pattern.findAllIn(str).toArray
The result is not good because the brackets ( and ) are not matched, and the numbers, both integer (6, 12, 5, 2) and float (34.2), are messed up. Thanks for your help!

You can use
[+()*/-]|log|[0-9]*\\.?[0-9]+
See regex demo
The regex contains 3 alternatives joined with the help of | alternation operator.
[+()*/-] - matches a single literal character: +, (, ), *, /, - (note that the hyphen is not escaped as it is at the end of the character class)
log - a literal letter sequence log
[0-9]*\\.?[0-9]+ - an integer or float number; it accepts values like 6, .05 and 5.55, as it matches...
[0-9]* - 0 or more digits
\\.? - an optional (1 or 0) literal period
[0-9]+ - 1 or more digits.
Here is a Scala code sample:
import scala.util.matching.Regex
object Main extends App {
  val str = "log6*(12+5)/2-34.2"
  // Operators and parentheses, the literal log, or an integer/float number
  val pattern = new Regex("[+()*/-]|log|[0-9]*\\.?[0-9]+")
  val res = pattern.findAllIn(str).toArray
  // mkString works directly on the array
  println(res.mkString(", "))
}
Result: log, 6, *, (, 12, +, 5, ), /, 2, -, 34.2
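One property of findAllIn worth keeping in mind is that it silently skips any character that matches none of the alternatives. The sketch below, with the hypothetical names ArithmeticTokens and tokenize, uses the same pattern in a triple-quoted string (so the backslash is not doubled) to show how the number alternative behaves on a few edge cases.

```scala
object ArithmeticTokens {
  // Same three alternatives as above: operator/parenthesis, log, or number.
  private val Token = """[+()*/-]|log|[0-9]*\.?[0-9]+""".r

  // Hypothetical helper: every token found, in order of appearance.
  def tokenize(input: String): List[String] = Token.findAllIn(input).toList
}

object ArithmeticTokensDemo extends App {
  println(ArithmeticTokens.tokenize("log6*(12+5)/2-34.2"))
  // The trailing dot in "5." matches no alternative and is dropped.
  println(ArithmeticTokens.tokenize(".05+5."))
  // A second dot starts a new number token.
  println(ArithmeticTokens.tokenize("1.2.3"))
}
```

If unrecognized input should be reported rather than skipped, the tokens would need to be checked against the original string, which this sketch does not do.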

Extract number (with different delimiters) from string using REGEXP_SUBSTR in plsql

I need to extract patterned numbers from an input string. I have the following patterns:
xxx-xxx-xxxx
xxx xxx-xxxx
xxx xxx xxxx
I am using this query to find matching string:
select REGEXP_substr('phn: 678 987-0987 Date: 12/2029',
'[0-9]{3}(\-|\ |\ )[0-9]{3}(\-|\--|\ )[0-9]{4}')
from dual;
I also want to extract the following patterns:
xxxxxx-xxxx
xxxxxxxxxx
etc...
Where do I modify the query?

Change your regex to,
[0-9]{3}(\-|\ |\ )?[0-9]{3}(\-|\--|\ )?-?[0-9]{4}
DEMO
(\-|\ |\ )? makes the whole group optional, and -? makes the hyphen optional. A ? after a token makes that token optional.
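To illustrate the effect of optional separators outside the database, here is a minimal Scala sketch. It is a simplification and not the exact pattern above: it assumes each separator is at most one hyphen or space, and PhoneNumbers and firstMatch are hypothetical names.

```scala
object PhoneNumbers {
  // Assumption: each separator is an optional single hyphen or space.
  private val Phone = "[0-9]{3}[- ]?[0-9]{3}[- ]?[0-9]{4}".r

  // Hypothetical helper: first phone-like number in the text, if any.
  def firstMatch(text: String): Option[String] = Phone.findFirstIn(text)
}

object PhoneNumbersDemo extends App {
  println(PhoneNumbers.firstMatch("phn: 678 987-0987 Date: 12/2029"))
  println(PhoneNumbers.firstMatch("phn: 678987-0987"))
  println(PhoneNumbers.firstMatch("phn: 6789870987"))
}
```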

Regular expressions are not always a good approach, since they can be resource-intensive. I would still use the old SUBSTR + INSTR technique:
16777216 * to_number(substr(ip, 1, instr(ip, '.', 1, 1) - 1))
+ 65536 * to_number(substr(ip, instr(ip, '.', 1, 1) + 1, instr(ip, '.', 1, 2) - instr(ip, '.', 1, 1) - 1))
+ 256 * to_number(substr(ip, instr(ip, '.', 1, 2) + 1, instr(ip, '.', 1, 3) - instr(ip, '.', 1, 2) - 1))
+ to_number(substr(ip, instr(ip, '.', 1, 3) + 1))
An IP number is simply a 32-bit (4-byte) integer presented in "dotted quad" format.
Each byte holds a value between 0 and 255,
so converting to a number and using BETWEEN is as efficient as possible.
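The arithmetic behind the SUBSTR + INSTR expression can be sketched in Scala as follows. DottedQuad and toLong are hypothetical names, and the range check on each octet is an added assumption rather than something the SQL above performs.

```scala
object DottedQuad {
  // Hypothetical helper: "a.b.c.d" -> a*16777216 + b*65536 + c*256 + d.
  def toLong(ip: String): Long = {
    val octets = ip.split('.').map(_.toInt)
    require(
      octets.length == 4 && octets.forall(o => o >= 0 && o <= 255),
      s"not a dotted quad: $ip"
    )
    // Multiplying the accumulator by 256 at each step yields the same weights.
    octets.foldLeft(0L)((acc, octet) => acc * 256 + octet)
  }
}

object DottedQuadDemo extends App {
  println(DottedQuad.toLong("192.168.1.1"))
}
```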

regex for position matching with OR condition

Newbie to regex and looking for help in creating regexp to seek out following:
The data items consists of six character strings as shown in example below
1) "100100"
2) "110011"
3) "010000"
4) "110011"
5) "111111"
6) "000111"
Need to use regexp to find data with say
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1 in 2nd position: Items 2, 4 and 5 should be matched
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched

Given your samples, these will work:
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1.....|...1..
1 in 2nd position: Items 2, 4 and 5 should be matched
.1....
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched
....11
Or if you want to match any of these rules, combine them with the | (or) operator.
Example:
http://regexpal.com/?flags=g&regex=(1.....%7C...1...%7C.1....%7C....11)&input=100100%0A%0A110011%0A%0A010000%0A%0A110011%0A%0A111111%0A%0A000111

If it is always strings with only 1s and 0s, you should treat them as binary numbers and use logical operators to find the matches.
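A minimal sketch of that binary-number idea in Scala. BitPositions, anyOf and allOf are hypothetical names; the mask is written as a six-character string with a 1 at each position of interest, so "100100" means the 1st and 4th positions.

```scala
object BitPositions {
  // Parse a string of 0s and 1s as a binary number.
  private def bits(s: String): Int = Integer.parseInt(s, 2)

  // True if at least one position marked in mask holds a 1 in item (OR rule).
  def anyOf(item: String, mask: String): Boolean =
    (bits(item) & bits(mask)) != 0

  // True if every position marked in mask holds a 1 in item (AND rule).
  def allOf(item: String, mask: String): Boolean =
    (bits(item) & bits(mask)) == bits(mask)
}

object BitPositionsDemo extends App {
  val items = List("100100", "110011", "010000", "110011", "111111", "000111")
  // 1 in the 1st position OR 1 in the 4th position
  println(items.filter(BitPositions.anyOf(_, "100100")))
  // 1 in the 5th AND 6th position
  println(items.filter(BitPositions.allOf(_, "000011")))
}
```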

Try this regex
([1][0-1]{2}[1][0-1]{2})|([0-1][1][0-1]{4})|([0-1]{4}[1]{2})
Find the explanation and demo here http://www.regex101.com/r/vD9jE7

Here's an example. Change dots with zeros if necessary. /^(11..|.1.1)11$/
^ # beginning of string
( # either
11.. # 11 and any 2 char
| # or
.1.1 # any char, 1, any char, 1
)
11
$ # end of string

We Keep Coding
