regex to catch assembler C-command

I'm taking the Nand-2-Tetris course. We are asked to write an assembler. A C-command has the form dest=comp;jump, where each part is optional.
I was trying to write a regex to make everything easier - I want to be able to compile the expression on a given line, and just by the group number, know which part of the expression I'm using. For example, for the expression: A=M+1;JMP I want to get group(1) = A, group(2) = M and group(3) = JMP.
My problem is that each part is optional, so I don't know exactly how to write this regex. So far I have come up with:
(A?M?D?)\s=([^;\s]*)\s?(?=;[\s]*([a-zA-Z]{1,4})|$)
This works for most cases, but not always as I expect. For example, a line without a dest part (D;JGT) won't match. I have tried a positive lookahead, but it didn't work.

The RegEx that you are looking for is as follows:
(?P<dest>[AMD]{1,3}=)?(?P<comp>[01\-AMD!|+&><]{1,3})(?P<jump>;[JGTEQELNMP]{3})?
Let's break it down into parts:
(?P<dest>[AMD]{1,3}=)? - matches an optional destination in which to store the computation result.
(?P<comp>[01\-AMD!|+&><]{1,3}) - matches the computation instruction.
(?P<jump>;[JGTEQELNMP]{3})? - matches an optional jump directive.
Note that the dest and jump parts of every C-Instruction are optional.
When present, dest is followed by = and jump is preceded by ;, and both signs are captured as part of their groups.
Hence, you will have to strip these signs:
if dest is not None:
    dest = dest.rstrip("=")
if jump is not None:
    jump = jump.lstrip(";")
Finally, you will get the desired C-Instruction parsing:
For the line A=A+M;JMP you will get:
dest = 'A'
comp = 'A+M'
jump = 'JMP'
For the line D;JGT you will get:
dest = None
comp = 'D'
jump = 'JGT'
And for the line M=D you will get:
dest = 'M'
comp = 'D'
jump = None
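Putting the pieces above together, a minimal Python sketch might look like this. The function name parse_c_instruction and the choice of re.fullmatch are assumptions made for illustration; the answer itself does not say how the pattern is applied.

```python
import re

# The pattern from the answer, split over three lines for readability.
C_INSTRUCTION = re.compile(
    r"(?P<dest>[AMD]{1,3}=)?"
    r"(?P<comp>[01\-AMD!|+&><]{1,3})"
    r"(?P<jump>;[JGTEQELNMP]{3})?"
)


def parse_c_instruction(line):
    """Return (dest, comp, jump); dest and jump are None when absent."""
    # Assumption: the whole (stripped) line must be a C-instruction.
    match = C_INSTRUCTION.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a C-instruction: {line!r}")
    dest, comp, jump = match.group("dest", "comp", "jump")
    if dest is not None:
        dest = dest.rstrip("=")  # drop the trailing '=' captured with dest
    if jump is not None:
        jump = jump.lstrip(";")  # drop the leading ';' captured with jump
    return dest, comp, jump


for line in ("A=A+M;JMP", "D;JGT", "M=D"):
    print(line, "->", parse_c_instruction(line))
```

Comments and blank lines are not handled here; they would have to be removed before the line is passed in.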

Not quite sure what you want to do, but based on your examples you can make a regular expression like this (note that it requires a ;, so a line without a jump part, such as M=D, will not match):
([\w]+)[=]?([\w])*[+-]*[\w]*;([\w]+)
Then for that line:
A=M+1;JMP
You'll get the following:
Full match A=M+1;JMP
Group 1 A
Group 2 M
Group 3 JMP
And for that line:
D;JGT
You'll get:
Full match D;JGT
Group 1 D
Group 3 JGT
See example here: https://regex101.com/r/v8t4Ma/1

Related

Replacing Everything Except specific pattern BigQuery

I would like to use regex to replace everything (except a specific pattern) with empty string in BigQuery. I have following values:
AX/88/8888888
AX/99/999999
AX/11/222222 - AX/22/33333 - AX/999/99999
BX/99/9999
1234455121
AX/00/888888 // BX/890/90890
NULL
[XYZ-ASA
BX/890/90890 + AX/10/1010101
AX/99/9999M
AX/111/111,AX-99
AX/11/222222 BX/99/99 AX/22/33333
The pattern always starts with "AX", followed by a slash (/), some digits, another slash (/) and more digits, i.e. it is always AX/\d+/\d+.
I would like to remove anything (any character, bracket, digit, etc.) that doesn't follow the pattern mentioned above.
If the pattern doesn't match at all, meaning that no part of the value matches AX/\d+/\d+, I would like to return the original text as the final output. In the dataset above, the only such cases are BX/99/9999, 1234455121, NULL and [XYZ-ASA.
Where the pattern matches but is accompanied by other text, for example AX/00/888888 // BX/890/90890 and AX/111/111,AX-99, the non-matching parts ([// BX/890/90890] and [,AX-99]) need to be removed, so that only AX/00/888888 and AX/111/111 are returned as the final output.
The expected output from the above example is following:
AX/88/8888888
AX/99/999999
AX/11/222222 AX/22/33333 AX/999/99999
BX/99/9999
1234455121
AX/00/888888
NULL
[XYZ-ASA
AX/10/1010101
AX/99/9999
AX/111/111
AX/11/222222 AX/22/33333
Later I would like to split all the values by space, to get each AX/xx/xx on a separate row where a value contains several of them. For example, case 3 from above would produce 3 rows:
AX/88/8888888
AX/99/999999
AX/11/222222
AX/22/33333
AX/999/99999
BX/99/9999
1234455121
AX/00/888888
NULL
[XYZ-ASA
AX/10/1010101
AX/99/9999
AX/111/111
AX/11/222222
AX/22/33333
Use the query below:
select coalesce(result, col) as col
from your_table
left join unnest(regexp_extract_all(col, r'AX/\d+/\d+')) result
Applied to the sample data in your question, this returns one row per AX/\d+/\d+ match and keeps the original value where nothing matches.
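To see what the query does row by row, here is a rough Python sketch of the same extract-all-with-fallback logic. The function name extract_ax is made up for illustration, and Python's re module stands in for BigQuery's regexp_extract_all, so this only approximates what BigQuery does.

```python
import re

AX_PATTERN = re.compile(r"AX/\d+/\d+")


def extract_ax(value):
    """Return one output row per AX/digits/digits match, or the original value."""
    if value is None:
        # Mirrors coalesce(result, col) when col itself is NULL.
        return [None]
    matches = AX_PATTERN.findall(value)
    # No match at all: fall back to the original text, like the left join + coalesce.
    return matches if matches else [value]


samples = [
    "AX/11/222222 - AX/22/33333 - AX/999/99999",
    "BX/99/9999",
    "AX/00/888888 // BX/890/90890",
    "AX/99/9999M",
    None,
]
for sample in samples:
    print(sample, "->", extract_ax(sample))
```

Each element of the returned list corresponds to one output row, so no separate split-by-space step is needed.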

How to extract part of a string with slash constraints?

Hello I have some strings named like this:
BURGERDAY / PPA / This is a burger fest
I've tried using regex to get it but I can't seem to get it right.
The output should just get the final string of This is a burger fest (without the first whitespace)
Here, we can capture our desired output after the last slash and any whitespace that follows it:
.+\/\s+(.+)
where (.+) collects what we wish to return.
const regex = /.+\/\s+(.+)/gm;
const str = `BURGERDAY / PPA / This is a burger fest`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
Advice
Based on revo's advice, we can also use this expression, which is much better:
\/ +([^\/]*)$
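According to Bohemian's advice, escaping the forward slash may not be required, depending on the language we use. In JavaScript, this applies when the pattern is passed to the RegExp constructor as a string; inside a /.../ literal the slash still has to be escaped: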
.+/\s+(.+)
Also, we assume in target content, we would not have forward slash, otherwise we can change our constraints based on other possible inputs/scenarios.
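As a cross-check in a language where the forward slash never needs escaping, here is a small Python sketch of revo's expression. The helper name last_segment is hypothetical, and like the expressions above it assumes the target text contains no forward slash.

```python
import re

# revo's expression without the escapes: a slash, one or more spaces,
# then everything up to the end of the string that is not a slash.
LAST_SEGMENT = re.compile(r"/ +([^/]*)$")


def last_segment(text):
    """Return the text after the last 'slash plus spaces', or None if absent."""
    match = LAST_SEGMENT.search(text)
    return match.group(1) if match else None


print(last_segment("BURGERDAY / PPA / This is a burger fest"))
# prints: This is a burger fest
```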
Note: This is a Python-ish answer (my mistake). I'll leave it here for its value, as it could apply to many languages.
Another approach is to split it and then rejoin it.
data = 'BURGERDAY / PPA / This is a burger fest'
Here it is in four steps:
parts = data.split('/') # break into a list by '/'
parts = parts[2:] # get a new list excluding the first 2 elements
output = '/'.join(parts) # join them back together with a '/'
output = output.strip() # strip spaces from each side of the output
And in one concise line:
output = str.join('/', data.split('/')[2:]).strip()
Note: I feel that str.join(..., ...) is more readable than '...'.join(...) in some contexts. It is the identical call though.

regex replace : if not followed by letter or number

Okay, so I wanted a regex to parse uncontracted (if that's what it is called) IPv6 addresses.
Example IPv6 address: 1050:::600:5:1000::
What I want returned: 1050:0000:0000:600:5:1000:0000:0000
My try at this:
ip:gsub("%:([^0-9a-zA-Z])", ":0000")
The first problem with this: it replaces both the first and the second :
So :: gets replaced with :0000
Replacing it with :0000: wouldn't work because then the result would end with a :. Also, this would not parse the newly added :, resulting in: 1050:0000::600:5:1000:0000:
So what would I need this regex to do?
Replace every : by :0000 if it isn't followed by a number or letter
Main problem: :: gets replaced instead of 1 :
gsub and other functions from Lua's string library use Lua Patterns which are much simpler than regex. Using the pattern more than once will handle the cases where the pattern overlaps the replacement text. The pattern only needs to be applied twice since the first time will catch even pairings and the second will catch the odd/new pairings of colons. The trailing and leading colons can be handled separately with their own patterns.
ip = "1050:::600:5:1000::"
ip = ip:gsub("^:", "0000:"):gsub(":$", ":0000")
ip = ip:gsub("::", ":0000:"):gsub("::", ":0000:")
print(ip) -- 1050:0000:0000:600:5:1000:0000:0000
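The reason two passes are needed can be seen in a small Python sketch of the same idea. This is only an illustration: str.replace and re.sub stand in for Lua's gsub, and the intermediate variable names are made up.

```python
import re

ip = "1050:::600:5:1000::"

# Handle a leading or trailing colon separately, as in the Lua version.
ip = re.sub(r"^:", "0000:", ip)
ip = re.sub(r":$", ":0000", ip)

# First pass: replacements do not overlap, so in ':::' only the first
# pair is rewritten and a new '::' is left behind.
first_pass = ip.replace("::", ":0000:")
print(first_pass)   # 1050:0000::600:5:1000:0000:0000

# Second pass: picks up the pairs created by the first pass.
second_pass = first_pass.replace("::", ":0000:")
print(second_pass)  # 1050:0000:0000:600:5:1000:0000:0000
```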
There is no single statement pattern to do this but you can use a function to do this for any possible input:
function fill_ip(s)
  local ans = {}
  for s in (s..':'):gmatch('(%x*):') do
    if s == '' then s = '0000' end
    ans[ #ans+1 ] = s
  end
  return table.concat(ans,':')
end
--examples:
print(fill_ip('1050:::600:5:1000::'))
print(fill_ip(':1050:::600:5:1000:'))
print(fill_ip('1050::::600:5:1000:1'))
print(fill_ip(':::::::'))
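For comparison, the same fill-in-the-empty-groups idea can be sketched in Python with a split and a join. Note two assumptions: unlike the Lua pattern with %x, this version does not check that the groups are hexadecimal, and it follows the question's convention that every empty slot is one group (it is not standard IPv6 :: expansion).

```python
def fill_ip(address):
    """Replace every empty group between colons with '0000'."""
    # split(':') yields '' for each empty slot, including leading/trailing ones.
    return ":".join(group or "0000" for group in address.split(":"))


for example in ("1050:::600:5:1000::", ":1050:::600:5:1000:", ":::::::"):
    print(fill_ip(example))
```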

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curncy
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curncy.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letters listed below in expr and can have two-digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do that?
Assuming there are no leading or trailing whitespaces and only uppercase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab, nor whether it supports Perl Compatible Regular Expressions. If it fails, try e.g. a literal space instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
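Since the expression was not tested in Matlab, here is a quick check of its grouping in Python instead. The helper name split_ticker is invented for this sketch, and Python's re module may differ from Matlab's regexp in details.

```python
import re

TICKER = re.compile(
    r"^([A-Z]{2,4}|[A-Z]\s)"          # root: 2-4 letters, or 1 letter + whitespace
    r"([FGHJKMNQUVXZ]\d{1,2}) "       # month letter and 1-2 digit year
    r"(Curncy|Equity|Index|Comdty)$"  # yellow key
)


def split_ticker(ticker):
    """Return (root, letter_year, yellow_key), or None if the ticker doesn't match."""
    match = TICKER.match(ticker)
    return match.groups() if match else None


print(split_ticker("MCDZ3 Curncy"))  # ('MCD', 'Z3', 'Curncy')
print(split_ticker("S H4 Comdty"))   # ('S ', 'H4', 'Comdty')
```

For a one-letter root the trailing whitespace is part of the first group, so it may need stripping afterwards.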
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space [1].
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more generic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
[1] This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and/or trailing spaces at the edges, there is a very simple command for that:
monthyear = strtrim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach, basically this searches the letter before your year number:
s = 'MCDZ3 Curncy'
p = regexp(s,'\d')
s(min(p)-1:max(p))

Regex Returning extra empty Value

Set Regex = New RegExp
Regex.Pattern = """[^""]*""|[^,]*"
Regex.Global = True
' I have a For loop here to loop through records
text = Cells.Item(r, 7).Value
For Each Match In Regex.Execute(text)
    count = count + 1
Next Match
This is my regex code, which pulls its data from a table.
When I run the code in debug mode, the PCBaa count comes up as two, c3 and c4 come up as 14, and C6-c36 come up as 36. Is my regex code wrong for extracting the codes between the commas?
OK, I have tried that myself. First off, it seems you don't reset the count value to 0 after each line. That could be intentional, but just so you know.
The second thing is that the regular expression works nearly fine but always gives you double the amount, because it matches a zero-length string after each real match.
So for the last line (C6-C26) it matches:
1) "C6" 2) "" 3) "C7" 4) "" ... and so on.
To be honest, I'm a little bit surprised myself and don't know exactly why that's the case for now.
But the solution is pretty easy: since you don't want any zero-length strings in the result (so they don't get counted), you simply have to exchange the * for a +, which tells the regular expression to match only if there's at least one character.
So your regular expression string should look like:
Regex.Pattern = """[^""]+""|[^,]+"
Why you've got a count of 14 on the c3, c4 line surprises me... I got a 4, which makes sense because of the double counting due to the zero-length matches.
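The double counting can be reproduced outside VBA. The following Python sketch is only an analogy: Python's re module is not the VBScript RegExp engine, and the sample text "C3,C4" is an assumed stand-in for the cell contents. It shows how [^,]* also matches the empty string at each comma and at the end of the text, while [^,]+ does not.

```python
import re

text = "C3,C4"  # assumed sample value; the original table is not available

star_matches = re.findall(r'"[^"]*"|[^,]*', text)
plus_matches = re.findall(r'"[^"]*"|[^,]+', text)

# On Python 3.7+ the '*' version also reports the empty matches:
print(star_matches)  # ['C3', '', 'C4', '']
print(plus_matches)  # ['C3', 'C4']

print(len(star_matches), len(plus_matches))  # 4 2
```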