C++ regex capture group confusion - c++

I'm implementing the nand2tetris Assembler in C++ (I'm pretty new to C++), and I'm having a lot of trouble parsing a C-instruction using regex. Mainly I really don't understand the return value of regex_search and how to use it.
Setting aside the various permutations of a C instruction, the current example I'm having trouble with is D=D-M. The result should have dest = "D"; comp = "D-M".
With the code below, the regex appears to find the right pieces (confirmed on regex101.com), but either the result is not quite right or I don't know how to get at it. See the debugger screenshot. matches[n].second (which appears to point at the correct comp value) is not a string but an iterator.
Note that the 3rd capture group is correctly empty for this example.
auto regex_str = regex("([AMD]{1,3}=)?([01\-AMD!|+&><]{1,3})?(;[A-Z]{3})?");
regex_search(assemblyCode, matches, regex_str);
string dest = matches[1]; // this automatically casts some object (submatch) into a string?
string comp = matches[2];
string jump = matches[3];
I will note, though, that D=D+M works, but not D=D-M!

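gcc warns about the unknown escape sequence \- (Demo). In an ordinary string literal gcc then treats \- as a plain -, so the regex engine sees [01-AMD!|+&><], where 1-A is a character range that does not include -. That is why D=D+M matches and D=D-M does not.
You have to escape the backslash: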
std::regex("([AMD]{1,3}=)?([01\\-AMD!|+&><]{1,3})?(;[A-Z]{3})?");
or use a raw string literal:
std::regex(R"(([AMD]{1,3}=)?([01\-AMD!|+&><]{1,3})?(;[A-Z]{3})?)");
Demo
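As for the question about what matches[n] actually is: each element is a std::sub_match, which holds a pair of iterators (first, second) into the searched string and converts to std::string either implicitly or through .str(). The following is a minimal sketch under some assumptions of mine: the names CInstruction and parseCInstruction are hypothetical, std::regex_match is used instead of regex_search so that the whole line must match, and non-capturing groups are added so that the = and ; separators stay out of the captured text.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical container for the three fields of a C-instruction.
struct CInstruction {
    std::string dest;
    std::string comp;
    std::string jump;
};

// Returns true and fills `out` when `line` looks like dest=comp;jump
// (dest and jump are optional). The raw string literal keeps \- intact.
bool parseCInstruction(const std::string& line, CInstruction& out) {
    static const std::regex pattern(
        R"rx((?:([AMD]{1,3})=)?([01\-AMD!|+&><]{1,3})(?:;([A-Z]{3}))?)rx");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return false;
    }
    // sub_match::str() copies the range [first, second) into a std::string;
    // a group that did not participate in the match yields an empty string.
    out.dest = m[1].str();
    out.comp = m[2].str();
    out.jump = m[3].str();
    return true;
}
```

With this sketch, "D=D-M" yields dest "D" and comp "D-M", and the jump field is left empty.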

Related

Can't get an Array of matches using Regular Expression

const stringWithDate: string = "4/7/20 This is a date!";
const reg: RegExp = new RegExp("^(\d{1,2}\/\d{1,2}\/\d{1,2})").compile();
const exist: boolean = reg.test(stringWithDate)
const matches: RegExpExecArray | null = reg.exec(stringWithDate);
console.log(exist);
console.log(matches);
I am trying to get the date (4/7/20) extracted from stringWithDate. When I log the value of 'exist' it says true, but the matches array is [""]. I'm not sure what I'm doing wrong here. I know the regex isn't that good, but I know it works because I tried the same in Python and here. As far as I can tell it should give me "4/7/20" from stringWithDate, but that isn't happening.
There are two problems:
You're not allowing for the fact your backslashes are in a string literal.
You're not passing anything into compile.
1. Backslashes
Remember that in a string literal, a backslash is an escape character, so the \d in your string is an unnecessary escape of d, which results in just d. So your actual regular expression is:
^(d{1,2}/d{1,2}/d{1,2})
Use the literal form instead:
const reg: RegExp = /^(\d{1,2}\/\d{1,2}\/\d{1,2})/; // No `compile`, see next point
Live Example:
const stringWithDate/*: string*/ = "4/7/20 This is a date!";
const reg/*: RegExp*/ = /^(\d{1,2}\/\d{1,2}\/\d{1,2})/; // No `compile`, see next point
const exist/*: boolean*/ = reg.test(stringWithDate)
const matches/*: RegExpExecArray | null*/ = reg.exec(stringWithDate);
console.log(exist);
console.log(matches);
2. compile
compile accepts a new expression to compile, replacing the existing expression. By not passing an expression in as an argument, you're getting the expression (?:), which matches the blank at the beginning of your string.
You don't need compile (spec | MDN). It's an Annex B feature (supposedly only in JavaScript engines in web browsers). Here's what the spec has to say in a note about it:
The compile method completely reinitializes the this object RegExp with a new pattern and flags. An implementation may interpret use of this method as an assertion that the resulting RegExp object will be used multiple times and hence is a candidate for extra optimization.
...but JavaScript engines can figure out whether a regular expression needs optimization without your telling them.
If you wanted to use compile, you'd do it like this:
const reg: RegExp = /x/.compile(/^(\d{1,2}\/\d{1,2}\/\d{1,2})/);
The contents of the initial regular expression are completely replaced with the pattern and flags from the one passed into compile.
Side note: There's no reason for the type annotations on any of those consts. TypeScript will correctly infer them.

How to access the results of .match as string value in Crystal lang

In many other programming languages, there is a function which takes a regular expression as a parameter and returns an array of string values. This is true of JavaScript and Ruby. The .match in Crystal, however, 1) does not seem to accept the global flag and 2) does not return an array but rather a struct of type Regex::MatchData. (https://crystal-lang.org/api/0.25.1/Regex/MatchData.html)
As an example the following code:
str = "Happy days"
re = /[a-z]+/i
matches = str.match(re)
puts matches
returns Regex::MatchData("Happy")
I am unsure how to convert this result into a string or why this is not the default as it is in the inspiration language (Ruby). I understand this question probably results from my inexperience dealing with structs and compiled languages but I would appreciate an answer in hopes that it might also help someone else coming from a JS/Ruby background.
What if I want to convert to a string merely the first match?
puts "Happy days"[/[a-z]+/i]?
puts "Happy days".match(/[a-z]+/i).try &.[0]
It will try to match a string against /[a-z]+/i regex and if there is a match, Group 0, i.e. the whole match, will be output. Note that the ? after [...] will make it fail gracefully if there is no match found. If you just use puts "??!!"[/[a-z]+/i], an exception will be thrown.
See this online demo.
If you want functionality similar to String#scan, which returns all matches found in the input, you may use (only the shortened version is left, as per @Amadan's remark):
matches = str.scan(re).map(&.[0])
Output of the code above:
["Happy", "days"]
Note that:
String#scan returns an array with one Regex::MatchData per match.
You can call [0] on a match to get the matched text; .string returns the whole original string instead (here "Happy days").
Actually, the posted example returns a #<MatchData "Happy"> in Ruby, which also has no "global" flag; that's what String#scan(Regex) is for, as mentioned by others.
If you want only a single match without going through Regex::MatchData, you can use String#[](Regex):
str = "Happy days"
p str[/[a-z]+/i] # => "Happy"

How to replace parts of a string in lua "in a single pass"?

I have the following string of anchors (where I want to change the contents of the href) and a lua table of replacements, which tells which word should be replaced for:
s1 = '<a href="word7">'
replacementTable = {}
replacementTable["word1"] = "potato1"
replacementTable["word2"] = "potato2"
replacementTable["word3"] = "potato3"
replacementTable["word4"] = "potato4"
replacementTable["word5"] = "potato5"
The expected result should be:
<a href="word7">
I know I could do this by iterating over each element in the replacementTable and processing the string each time, but my gut feeling tells me that if the string is very big and/or the replacement table becomes big, this approach is going to perform poorly.
So I thought it would be best if I could do the following: apply the regular expression to find all the matches, get an iterator over the matches, and replace each match with its value in the replacementTable.
Something like this would be great (written in JavaScript because I don't yet know how to write lambdas in Lua):
var newString = patternReplacement(s1, '<a[^>]* href="([^"]*)"', function(match) { return replacementTable[match] })
Where the first parameter is the string, the second one the regular expression and the third one a function that is executed for each match to get the replacement. This way I think s1 gets parsed once, being more efficient.
Is there any way to do this in Lua?
In your example, this simple code works:
print((s1:gsub("%w+",replacementTable)))
The point is that gsub already accepts a table of replacements.
In the end, the solution that worked for me was the following one:
local updatedBody = string.gsub(body, '(<a[^>]* href=")(/[^"%?]*)([^"]*")', function(leftSide, url, rightSide)
    local replacedUrl = url
    if (urlsToReplace[url]) then replacedUrl = urlsToReplace[url] end
    return leftSide .. replacedUrl .. rightSide
end)
It kept out any querystring parameter giving me just the URI. I know it's a bad idea to parse HTML bodies with regular expressions but for my case, where I required a lot of performance, this was performing a lot faster and just did the job.
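For comparison with the gsub callback above, the same single-pass idea can be sketched in C++, where std::regex_replace takes no callback and the loop over matches has to be written by hand. This is only an illustration under my own assumptions: replaceHrefs and the simplified href pattern are hypothetical and not taken from either answer.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <unordered_map>

// Walks the input once; every href value found in `table` is swapped for its
// replacement, and everything else is copied through unchanged.
std::string replaceHrefs(const std::string& body,
                         const std::unordered_map<std::string, std::string>& table) {
    static const std::regex anchor(R"rx((<a[^>]* href=")([^"]*)("))rx");
    std::string out;
    auto last = body.cbegin();
    for (std::sregex_iterator it(body.cbegin(), body.cend(), anchor), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);          // text between matches
        const auto found = table.find(m[2].str());
        out += m[1].str();                     // left side, up to the opening quote
        out += (found != table.end()) ? found->second : m[2].str();
        out += m[3].str();                     // closing quote
        last = m[0].second;
    }
    out.append(last, body.cend());             // tail after the last match
    return out;
}
```

A key that is missing from the table leaves the href untouched, which mirrors the fallback in the Lua function.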

Using a Variable in an AS3, Regexp

Using Actionscript 3.0 (Within Flash CS5)
A standard regex to match any digit is:
var myRegexPattern:Regex = /\d/g;
What would the regex look like to incorporate a string variable to match?
(this example is an 'IDEAL' not a 'WORKING' snippet) ie:
var myString:String = "MatchThisText"
var myRegexPatter_WithString:Regex = /\d[myString]/g;
I've seen some workarounds which involve creating multiple regex instances, then combine them by source, with the variable in question, which seems wrong. OR using the flash string to regex creator, but it's just plain sloppy with all the double and triple escape sequences required.
There must be some pain free way that I can't find in the live docs or on google. Does AS3 hold this functionality even? If not, it really should.
Or am I missing a much easier way of avoiding this task, one that I'm simply naive to because I'm new to regex?
I've actually blogged about this, so I'll just point you there: http://tyleregeto.com/using-vars-in-regular-expressions-as3 It talks about the possible solutions, but there is no ideal one like you mention.
EDIT
Here is a copy of the important parts of that blog entry:
Here is a regex to strip the tags from a block of text.
/<("[^"]*"|'[^']*'|[^'">])*>/ig
This nifty expression works like a charm. But I wanted to update it so the developer could limit which tags it stripped to those specified in a array. Pretty straight forward stuff, to use a variable value in a regex you first need to build it as a string and then convert it. Something like the following:
var exp:String = 'start-exp' + someVar + 'more-exp';
var regex:RegExp = new RegExp(exp);
Pretty straight forward. So when approaching this small upgrade, that's what I did. Of course one big problem was pretty clear.
var exp:String = '/<' + tag + '("[^"]*"|'[^']*'|[^'">])*>/';
Guess what, invalid string! Better escape those quotes in the string. Whoops, that will break the regex! I was stumped. So I opened up the language reference to see what I could find. The "source" parameter, (which I've never used before,) caught my eye. It returns a String described as "the pattern portion of the regular expression." It did the trick perfectly. Here is the solution:
var start:RegExp = /</;
var end:RegExp = /("[^"]*"|'[^']*'|[^'">])*>/ig;
var complete:RegExp = new RegExp(start.source + tag + end.source);
You can reduce it down to this for convenience:
var complete:RegExp = new RegExp(/</.source + tag + /("[^"]*"|'[^']*'|[^'">])*>/.source, 'ig');
As Tyler correctly points out (and his answer works just fine), you can assemble your regex as a string and then pass this string to the RegExp constructor with the new RegExp("pattern", "flags") syntax.
function assembleRegex(myString) {
    var re = new RegExp('\\d' + myString, "i");
    return re;
}
Note that when using a string to store a regex pattern, you do need to add some extra backslashes to get it to work right (e.g. to get a \d in the regex, you need to specify \\d in the string). Note also that the string pattern does not use the forward slash delimiters. In other words, the following two statements are equivalent:
var re1 = /\d/ig;
var re2 = new RegExp("\\d", "ig");
Additional note: You may need to process the myString variable to escape any backslashes it might contain (if they are to be interpreted as literal). If this is the case the function becomes:
function assembleRegex(myString) {
    // the g flag is needed so that every backslash is escaped, not just the first
    myString = myString.replace(/\\/g, '\\\\');
    var re = new RegExp('\\d' + myString);
    return re;
}
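The same concern (a variable spliced into a pattern may contain characters the regex engine treats specially) comes up with C++ std::regex, which has no built-in escaping helper. The sketch below is my own illustration rather than part of either answer: escapeForRegex and digitFollowedBy are hypothetical names, and it escapes all ECMAScript metacharacters instead of backslashes only.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Prefixes every regex metacharacter in `text` with a backslash so the text
// is matched literally once it is spliced into a larger pattern.
std::string escapeForRegex(const std::string& text) {
    static const std::string special = R"rx(\^$.|?*+()[]{})rx";
    std::string escaped;
    escaped.reserve(text.size() * 2);
    for (char c : text) {
        if (special.find(c) != std::string::npos) {
            escaped += '\\';
        }
        escaped += c;
    }
    return escaped;
}

// Builds the equivalent of /\d<text>/i from a runtime string.
std::regex digitFollowedBy(const std::string& text) {
    return std::regex("\\d" + escapeForRegex(text), std::regex::icase);
}
```

As in the AS3 version, the \d has to be written as "\\d" because the pattern passes through a string literal first.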

Regex - If contains '%', can only contain '%20'

I am wanting to create a regular expression for the following scenario:
If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'.
So if there was for instance, %25 it would be rejected. For instance, the following string would be valid:
http://www.test.com/?&Name=My%20Name%20Is%20Vader
But these would fail:
http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25
%%%25
Any help would be greatly appreciated,
Kyle
EDIT:
The scenario in a nutshell is that a link is written to an encoded state and then launched via JavaScript. No decoding works. I tried .net decoding and JS decoding, each having the same result - The results stay encoded when executed.
Doesn't require a %:
/^[^%]*(%20[^%]*)*$/
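A quick way to try the pattern above against the examples from the question is a small C++ check. This is a sketch only; the function name matchesOnlyPercent20 is made up for illustration, and std::regex_match already requires the whole string to match, so the anchors are kept purely to mirror the original pattern.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when every '%' in `s` is the start of a "%20" sequence.
bool matchesOnlyPercent20(const std::string& s) {
    static const std::regex pattern(R"rx(^[^%]*(%20[^%]*)*$)rx");
    return std::regex_match(s, pattern);
}
```

Note that this pattern also accepts strings with no % at all and rejects %%, which some of the other answers treat as valid.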
Which language are you using?
Most languages have a Uri Encoder / Decoder function or class.
I would suggest you decode the string first and then check for valid (or invalid) characters.
i.e. something like /[\w ]/ (empty is a space)
With a regex in the first place you need to respect that www.example.com/index.html?user=admin&pass=%%250 means that the pass really is "%250".
Another solution if look-arounds are not available:
^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$
Reject the string if it matches %[^2][^0]
I think that would find what you need
/^([^%]|%%|%20)+$/
Edit: Added case where %% is valid string inside URI
Edit2: And fixed it for case where it should fail :-)
Edit3:
In case you need to use it in an editor (which would explain why you can't use a more programmatic way), then you have to correctly escape all special characters; for example, in Vim that regex should look like:
/^\([^%]\|%%\|%20\)\+$/
Maybe a better approach is to deal with that validation after you decode that string:
string name = HttpUtility.UrlDecode(Request.QueryString["Name"]);
/^([^%]|%20)*$/
This requires a test against the "bad" patterns. If we're allowing %20 - we don't need to make sure it exists.
As others have said before, %% is valid too... and %%25 would be %25
The below regex matches anything that doesn't fit into the above rules
/(?<![^%]%)%(?!(20|%))/
The first brackets check whether there is a % before the character (meaning that it's %%) and also checks that it's not %%%. it then checks for a %, and checks whether the item after doesn't match 20
This means that if anything is identified by the regex, then you should probably reject it.
I agree with dominic's comment on the question. Don't use Regex.
If you want to avoid scanning the string twice, you can just iteratively search for % and then check that it is being followed by 20 and nothing else. (Update: allow a % after to be interpreted as a literal %nnn sequence)
// pseudo code
pos = 0
while ((pos = mystring.find('%', pos)) != NOT_FOUND)
{
    if mystring[pos+1] = "%" then
        pos = pos + 2 // ok, this is a literal, skip ahead
    else if mystring.substring(pos+1, 2) = "20" then
        pos = pos + 3 // ok, this is %20, skip past it
    else
        return false; // string is invalid
    end if
}
return true;
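The loop in the pseudo code above can be sketched in C++ roughly as follows. The function name hasOnlyAllowedPercents is hypothetical, and the sketch follows this answer's rule that %% counts as a literal percent sign, which is looser than the original question's wording.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scans the string once; each '%' must start either "%%" or "%20".
bool hasOnlyAllowedPercents(const std::string& s) {
    std::size_t pos = 0;
    while ((pos = s.find('%', pos)) != std::string::npos) {
        if (s.compare(pos + 1, 1, "%") == 0) {
            pos += 2;  // "%%" is a literal percent sign, skip both characters
        } else if (s.compare(pos + 1, 2, "20") == 0) {
            pos += 3;  // "%20" is allowed, skip past it
        } else {
            return false;  // any other sequence, or a trailing '%', is invalid
        }
    }
    return true;
}
```

Because std::string::compare clamps the requested length to the end of the string, a % in the last position is handled without reading out of bounds.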