My RAKU Code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        # LAB 1 TASK 2 #
        ##########################
        ## Add regex substitutions to remove superfluous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that matches a single character, as long as it is an open parenthesis (represented by \() or a caret (^). Placing them inside the angle bracket/square bracket combination makes this an enumerated character class.
Then, S introduces the non-destructive substitution, i.e., a quoting construct that applies a regex-based substitution to the topic variable $_ without modifying it; it just returns the modified value. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) into play, meaning (in the case of the substitution) "please make this substitution as many times as possible". The final / marks the end of the replacement text, and since it is adjacent to the second /, the replacement is the empty string. That means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the .map method, documented here, can be called on any Iterable (List, Array and the like) and returns a sequence built from each element of the Iterable, transformed by the Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item in @x, modified by substituting every appearance of the substring foo with bar" (nothing will be altered in @x). A nice place to start learning about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (that has my scope)
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;
sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)
I am trying to figure out how I would go about extracting text from a text file if it matches the same pattern as a second text file and putting the extracted values into another text file.
I have never done anything like this before so I don't even know where to start.
So as an example, in file 1 we might have something like this:
else
{
if (func_133212(13))
{
if (unk_0x44334545("test"))
{
if (!0x22224334545("test"))
{
0x44444237945("test", true);
}
}
if (Global_2398334.f_502.f_11 >= 2)
{
if (unk_0x44334545("test2"))
{
if (!0x22224334545("test2"))
{
0x44444237945("test2", true);
}
}
}
}
And then in file 2 we have something like:
else
{
if (func_12312(13))
{
if (unk_0x433877545("test"))
{
if (!unk_0x3434344("test"))
{
unk_0x42224442111("test", true);
}
}
if (Global_23445454.f_502.f_11 >= 2)
{
if (unk_0x433877545("test2"))
{
if (!unk_0x3434344("test2"))
{
unk_0x42224442111("test2", true);
}
}
}
}
The program would recognize they have the same pattern and extract the unk_ 's into a list with the unk_ from file 1 on the left and the unk_ from file 2 on the right like so:
unk_0x44334545, unk_0x433877545
etc. etc.
I know this is quite complicated, so any help is really appreciated. Let me know if you need more info or anything like that. I'm just trying to get an idea of how to go about doing this.
Thanks :)
How do you define "same pattern"? Does whitespace matter, for example? And do you want to do this for any language or just for one?
One quick algorithm for one specific language could be:
Load each file in memory as a single string.
Tokenize these strings on the special separator characters (such as: {}()>=.;), but keep those tokens in your sequences. This transforms your input strings into lists of tokens.
Trim all tokens of whitespace. This allows you to ignore whitespace differences. For example, your goal is to get a sequence like this from the first file: else, {, if, (, func_133212, (, 13, ), ...
Now you can compare the two lists linearly and you can output the pairs that don't match. You can also add some logic that says that for outputting a pair, you need a match of both the element before and after - thus, two successive mismatches would stop your comparison.
Something like this would work for the example you showed and could also be adjusted to work for more complicated examples. Also, if you need to handle larger files, you could stream in their contents and compare them gradually, but that would require more coding.
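The steps above can be sketched roughly as follows in C++. This is only an illustration under assumptions: the function names tokenize and mismatchedPairs are made up, the separator set is a guess based on the sample files, and every mismatching token pair is reported (not only the unk_ ones), so a filter on the unk_ prefix may still be needed.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Split text into tokens. Each separator character becomes its own token,
// whitespace only delimits, and everything else accumulates into a word.
// The separator set is an assumption chosen for the sample input.
std::vector<std::string> tokenize(const std::string& text) {
    const std::string separators = "{}()<>=!.,;\"";
    std::vector<std::string> tokens;
    std::string word;
    auto flush = [&] {
        if (!word.empty()) { tokens.push_back(word); word.clear(); }
    };
    for (char c : text) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            flush();
        } else if (separators.find(c) != std::string::npos) {
            flush();
            tokens.push_back(std::string(1, c));
        } else {
            word += c;
        }
    }
    flush();
    return tokens;
}

// Walk both token lists in step and collect the pairs that differ.
// A pair is only accepted when its neighbours match; two successive
// mismatches are treated as "the files have diverged" and stop the scan.
std::vector<std::pair<std::string, std::string>>
mismatchedPairs(const std::vector<std::string>& a, const std::vector<std::string>& b) {
    std::vector<std::pair<std::string, std::string>> pairs;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] == b[i]) continue;
        const bool prevOk = (i == 0) || a[i - 1] == b[i - 1];
        const bool nextOk = (i + 1 >= n) || a[i + 1] == b[i + 1];
        if (!prevOk || !nextOk) break;
        pairs.emplace_back(a[i], b[i]);
    }
    return pairs;
}
```

Reading each file into a std::string and passing both through tokenize before calling mismatchedPairs would give the left/right list described in the question.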
I am downloading a webpage and converting into a string using LWP::Simple. When I copy the results into an editor I find multiple instances of the pattern I'm looking for "data-src-hq".
While I'm trying to do something more complex using regex, I am starting in baby steps so I can properly learn how to use regex. I started off by just matching "data-src-hq" with the following code:
if($html =~ /data-src-hq/ism)
{
print "match\n";
}
else
{
print "nope\n";
}
My code returns "nope". However, if I modify the pattern search to just "data" or "data-src" I do get a match. The same happens no matter how I use and combine the string and multiline modifier.
My understanding is that a hyphen is not a special character unless it's within brackets, am I missing something simple?
How to fix this?
You are likely getting two outputs, one of match and one of nope. Your code is missing the keyword else:
See your code's current execution here
if($html =~ /data-src-hq/ism)
{
print "match\n";
}
{
print "nope\n";
}
Should be:
See this code's execution here
if($html =~ /data-src-hq/ism)
{
print "match\n";
}
else {
print "nope\n";
}
Otherwise, your code is fine and works to identify whether data-src-hq exists in $html.
So why does your existing code output nope?
That's because {} is a basic block (see Basic BLOCKs in Perl's documentation). An excerpt from the documentation:
A BLOCK by itself (labeled or not) is semantically equivalent to a
loop that executes once. Thus you can use any of the loop control
statements in it to leave or restart the block. (Note that this is NOT
true in eval{}, sub{}, or contrary to popular belief do{} blocks,
which do NOT count as loops.) The continue block is optional.
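The same pitfall can be reproduced outside Perl. As an analogy only (this is C++, not Perl, and the function names withoutElse and withElse are hypothetical), a bare block that follows an if without else is an independent statement and therefore always runs:

```cpp
#include <string>
#include <vector>

// Without "else": the second block is not attached to the if,
// so it executes regardless of whether the pattern was found.
std::vector<std::string> withoutElse(const std::string& html) {
    std::vector<std::string> out;
    if (html.find("data-src-hq") != std::string::npos)
    {
        out.push_back("match");
    }
    {
        out.push_back("nope");
    }
    return out;
}

// With "else": exactly one of the two blocks executes.
std::vector<std::string> withElse(const std::string& html) {
    std::vector<std::string> out;
    if (html.find("data-src-hq") != std::string::npos)
    {
        out.push_back("match");
    }
    else
    {
        out.push_back("nope");
    }
    return out;
}
```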
I am currently working on a program that can parse a chemical formula and return molecular weight and percent composition. The following code works very well with compounds such as H2O, LiOH, CaCO3, and even C12H22O11. However, it is not capable of understanding compounds with polyatomic ions that lie within parentheses, such as (NH4)2SO4.
I am not looking for someone to necessarily write the program for me, but just give me a few tips on how I might accomplish such a task.
Currently, the program iterates through the inputted string, raw_molecule, first finding each element's atomic number, to store in a vector (I use a map<string, int> to store names and atomic #). It then finds the quantities of each element.
bool Compound::parseString() {
    map<string,int>::const_iterator search;
    string s_temp;
    int i_temp;
    for (int i=0; i<=raw_molecule.length(); i++) {
        if ((isupper(raw_molecule[i]))&&(i==0))
            s_temp=raw_molecule[i];
        else if(isupper(raw_molecule[i])&&(i!=0)) {
            // New element- so, convert s_temp to atomic # then store in v_Elements
            search=ATOMIC_NUMBER.find (s_temp);
            if (search==ATOMIC_NUMBER.end())
                return false;// There is a problem
            else
                v_Elements.push_back(search->second); // Add atomic number into vector
            s_temp=raw_molecule[i]; // Replace temp with the new element
        }
        else if(islower(raw_molecule[i]))
            s_temp+=raw_molecule[i]; // E.g. N+=a which means temp=="Na"
        else
            continue; // It is a number/parentheses or something
    }
    // Whatever's in temp must be converted to atomic number and stored in vector
    search=ATOMIC_NUMBER.find (s_temp);
    if (search==ATOMIC_NUMBER.end())
        return false;// There is a problem
    else
        v_Elements.push_back(search->second); // Add atomic number into vector
    // --- Find quantities next --- //
    for (int i=0; i<=raw_molecule.length(); i++) {
        if (isdigit(raw_molecule[i])) {
            if (toInt(raw_molecule[i])==0)
                return false;
            else if (isdigit(raw_molecule[i+1])) {
                if (isdigit(raw_molecule[i+2])) {
                    i_temp=(toInt(raw_molecule[i])*100)+(toInt(raw_molecule[i+1])*10)+toInt(raw_molecule[i+2]);
                    v_Quantities.push_back(i_temp);
                }
                else {
                    i_temp=(toInt(raw_molecule[i])*10)+toInt(raw_molecule[i+1]);
                    v_Quantities.push_back(i_temp);
                }
            }
            else if(!isdigit(raw_molecule[i-1])) { // Look back to make sure the digit is not part of a larger number
                v_Quantities.push_back(toInt(raw_molecule[i])); // This will not work for polyatomic ions
            }
        }
        else if(i<(raw_molecule.length()-1)) {
            if (isupper(raw_molecule[i+1])) {
                v_Quantities.push_back(1);
            }
        }
        // If there is no number, there is only 1 atom. Between O and N for example: O is upper, N is upper, O has 1.
        else if(i==(raw_molecule.length()-1)) {
            if (isalpha(raw_molecule[i]))
                v_Quantities.push_back(1);
        }
    }
    return true;
}
This is my first post, so if I have included too little (or maybe too much) information, please forgive me.
While you might be able to do an ad-hoc scanner-like thing that can handle one level of parens, the canonical technique used for things like this is to write a real parser.
And there are two common ways to do that...
Recursive descent
Machine-generated bottom-up parser based on a grammar-specification file.
(And technically, there is a third category, PEG, that is machine-generated-top-down.)
Anyway, for case 1, you need to code a recursive call to your parser when you see a ( and then return from this level of recursion on the ) token.
Typically a tree-like internal representation is created; this is called a syntax tree, but in your case, you can probably skip that and just return the atomic weight from the recursive call, adding to the level you will be returning from the first instance.
For case 2, you need to use a tool like yacc to turn a grammar into a parser.
Your parser understands certain things. It knows that when it sees N, this means "Atom of Nitrogen type". When it sees O, it means "Atom of Oxygen type".
This is very similar to the concept of identifiers in C++. When the compiler sees int someNumber = 5;, it says, "there exists a variable named someNumber of int type, into which the number 5 is stored". If you later use the name someNumber, it knows that you're talking about that someNumber (as long as you're in the right scope).
Back to your atomic parser. When your parser sees an atom followed by a number, it knows to apply that number to that atom. So O2 means "2 Atoms of Oxygen type". N2 means "2 Atoms of Nitrogen type."
This means something for your parser. It means that seeing an atom isn't sufficient. It's a good start, but it is not sufficient to know how many of that atom exists in the molecule. It needs to read the next thing. So if it sees O followed by N, it knows that the O means "1 Atom of Oxygen type". If it sees O followed by nothing (the end of the input), then it again means "1 Atom of Oxygen type".
That's what you have currently. But it's wrong. Because numbers don't always modify atoms; sometimes, they modify groups of atoms. As in (NH4)2SO4.
So now, you need to change how your parser works. When it sees O, it needs to know that this is not "Atom of Oxygen type". It is a "Group containing Oxygen". O2 is "2 Groups containing Oxygen".
A group can contain one or more atoms. So when you see (, you know that you're creating a group. Therefore, when you see (...)3, you see "3 Groups containing ...".
So, what is (NH4)2? It is "2 Groups containing [1 Group containing Nitrogen followed by 4 Groups containing Hydrogen]".
The key to doing this is understanding what I just wrote. Groups can contain other groups. There is nesting in groups. How do you implement nesting?
Well, your parser looks something like this currently:
NumericAtom ParseAtom(input)
{
    Atom = ReadAtom(input); //Gets the atom and removes it from the current input.
    if(IsNumber(input)) //Returns true if the input is looking at a number.
    {
        int Count = ReadNumber(input); //Gets the number and removes it from the current input.
        return NumericAtom(Atom, Count);
    }
    return NumericAtom(Atom, 1);
}
vector<NumericAtom> Parse(input)
{
    vector<NumericAtom> molecule;
    while(IsAtom(input))
        molecule.push_back(ParseAtom(input));
    return molecule;
}
Your code calls ParseAtom() until the input runs dry, storing each atom+count in an array. Obviously you have some error-checking in there, but let's ignore that for now.
What you need to do is stop parsing atoms. You need to parse groups, which are either a single atom, or a group of atoms denoted by () pairs.
Group ParseGroup(input)
{
    Group myGroup; //Empty group
    if(IsLeftParen(input)) //Are we looking at a `(` character?
    {
        EatLeftParen(input); //Removes the `(` from the input.
        myGroup.SetSequence(ParseGroupSequence(input)); //RECURSIVE CALL!!!
        if(!IsRightParen(input)) //Groups started by `(` must end with `)`
            throw ParseError("Inner groups must end with `)`.");
        else
            EatRightParen(input); //Remove the `)` from the input.
    }
    else if(IsAtom(input))
    {
        myGroup.SetAtom(ReadAtom(input)); //Group contains one atom.
    }
    else
        throw ParseError("Unexpected input."); //error
    //Read the number.
    if(IsNumber(input))
        myGroup.SetCount(ReadNumber(input));
    else
        myGroup.SetCount(1);
    return myGroup;
}
vector<Group> ParseGroupSequence(input)
{
    vector<Group> groups;
    //Groups continue until the end of input or `)` is reached.
    while(!IsRightParen(input) and !IsEndOfInput(input))
        groups.push_back(ParseGroup(input));
    return groups;
}
The big difference here is that ParseGroup (the analog to the ParseAtom function) will call ParseGroupSequence. Which will call ParseGroup. Which can call ParseGroupSequence. Etc. A Group can either contain an atom or a sequence of Groups (such as NH4), stored as a vector<Group>
When functions can call themselves (either directly or indirectly), it is called recursion. Which is fine, so long as it doesn't recurse infinitely. And there's no chance of that, because it will only recurse every time it sees (.
So how does this work? Well, let's consider some possible inputs:
NH3
ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the Group's count to 1, then returns the Group.
Back in ParseGroupSequence, we store the returned group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.
(NH3)2
ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
ParseGroup sees an (, which is the start of a Group. It eats this character (removing it from the input) and calls ParseGroupSequence on the Group.
ParseGroupSequence isn't at the end of input or ), so it calls ParseGroup.
ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the group's count to 1, then returns the Group.
Back in ParseGroupSequence, we store the returned group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Back in ParseGroupSequence, we store the returned group in the sequence, then iterate in our loop. We don't see the end of input, but we do see ). So we return the current vector<Group>.
Back in the first call to ParseGroup, we get the vector<Group> back. We stick it into our current Group as a sequence. We check to see if the next character is ), eat it, and continue. We see a 2, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Now, way, way back at the original ParseGroupSequence call, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.
This parser uses recursion to "descend" into each group. Therefore, this kind of parser is called a "recursive descent parser" (there's a formal definition for this kind of thing, but this is a good lay-understanding of the concept).
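For reference, here is one way the pseudocode might be turned into compilable C++. It is a sketch under assumptions, not the only possible layout: the Group struct, the FormulaParser class and the countAtoms helper are names invented for this illustration, atoms are recognised purely by shape (one uppercase letter followed by lowercase letters) rather than checked against a periodic table, and a vector of the still-incomplete Group type is used as a member, which is guaranteed since C++17.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// A Group holds either a single atom (atom non-empty) or a sequence of sub-groups.
struct Group {
    std::string atom;
    std::vector<Group> sequence;
    int count = 1;
};

class FormulaParser {
public:
    explicit FormulaParser(std::string text) : input_(std::move(text)) {}

    // Parse the whole input; a leftover ')' at top level is an error.
    std::vector<Group> parse() {
        std::vector<Group> groups = parseGroupSequence();
        if (!atEnd()) throw std::runtime_error("Unexpected ')'.");
        return groups;
    }

private:
    bool atEnd() const { return pos_ >= input_.size(); }
    char peek() const { return atEnd() ? '\0' : input_[pos_]; }
    static bool isUpper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
    static bool isLower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    // Groups continue until the end of input or ')' is reached.
    std::vector<Group> parseGroupSequence() {
        std::vector<Group> groups;
        while (!atEnd() && peek() != ')') groups.push_back(parseGroup());
        return groups;
    }

    Group parseGroup() {
        Group g;
        if (peek() == '(') {
            ++pos_;                              // eat '('
            g.sequence = parseGroupSequence();   // recursive call
            if (peek() != ')') throw std::runtime_error("Inner groups must end with ')'.");
            ++pos_;                              // eat ')'
        } else if (isUpper(peek())) {
            g.atom += input_[pos_++];
            while (isLower(peek())) g.atom += input_[pos_++];
        } else {
            throw std::runtime_error("Unexpected input.");
        }
        if (isDigit(peek())) {                   // optional count, default 1
            g.count = 0;
            while (isDigit(peek())) g.count = g.count * 10 + (input_[pos_++] - '0');
        }
        return g;
    }

    std::string input_;
    std::size_t pos_ = 0;
};

// Flatten the tree into total atom counts, multiplying through nested groups.
void countAtoms(const std::vector<Group>& groups, int multiplier,
                std::map<std::string, int>& totals) {
    for (const Group& g : groups) {
        if (!g.atom.empty()) totals[g.atom] += g.count * multiplier;
        else countAtoms(g.sequence, g.count * multiplier, totals);
    }
}
```

Calling countAtoms(FormulaParser("(NH4)2SO4").parse(), 1, totals) fills totals with N: 2, H: 8, S: 1, O: 4, which can then be multiplied by atomic weights.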
It is often helpful to write down the rules of the grammar for the strings you want to read and recognise. A grammar is just a bunch of rules which say what sequence of characters is acceptable, and by implication which are not acceptable. It helps to have the grammar before and while writing the program, and might be fed into a parser generator (as described by DigitalRoss)
For example, the rules for the simple compound, without polyatomic ions, look like:
Compound: Component { Component }
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
For example, the function that implements the 'Quantity' rule just needs to read one or more digit characters and convert them to an integer. The function that implements the Atom rule reads enough characters to figure out which atom it is, and stores that away.
A nice thing about recursive descent parsers is that the error messages can be quite helpful, of the form "Expecting an Atom name, but got %c" or "Expecting a ')' but reached the end of the string". It is a bit complicated to recover after an error, so you might want to throw an exception at the first error.
So are polyatomic ions just one level of parentheses? If so, the grammar might be:
Compound: Component { Component }
Component: Atom [Quantity] | '(' Component { Component } ')' [Quantity];
Atom: 'H' | 'He' | 'Li' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
Or is it more complex, so that the notation must allow for nested parentheses? Once that is clear, you can figure out an approach to parsing.
I do not know the entire scope of your problem, but recursive descent parsers are relatively straightforward to write, and look adequate for your problem.
Consider re-structuring your program as a simple Recursive Descent Parser.
First, you need to change the parseString function to take a string to be parsed, and the current position from which to start the parse, passed by reference.
This way you can structure your code so that when you see a ( you call the same function at the next position, get a Composite back, and consume the closing ). When you see a ) by itself, you return without consuming it. This lets you consume formulas with unlimited nesting of ( and ), although I am not sure if it is necessary (it's been more than 20 years since the last time I saw a chemical formula).
This way you'd write the code for parsing composite only once, and re-use it as many times as needed. It will be easy to supplement your reader to consume formulas with dashes etc., because your parser will need to deal only with the basic building blocks.
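A minimal sketch of that restructuring might look like the following. The names parseComposite and parseFormula are hypothetical, the "Composite" is represented here simply as a map from element symbol to count (an assumption made for brevity), and symbols are not validated against a table of real elements.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

using Composite = std::map<std::string, int>;

// Parse components starting at pos until end of input or an unconsumed ')'.
// pos is passed by reference so the caller continues where this call stopped.
Composite parseComposite(const std::string& formula, std::size_t& pos) {
    auto is = [&](int (*pred)(int)) {
        return pos < formula.size() && pred(static_cast<unsigned char>(formula[pos])) != 0;
    };
    Composite totals;
    while (pos < formula.size() && formula[pos] != ')') {
        Composite part;
        if (formula[pos] == '(') {
            ++pos;                                   // consume '('
            part = parseComposite(formula, pos);     // same function, next position
            if (pos >= formula.size() || formula[pos] != ')')
                throw std::runtime_error("Expecting a ')' but reached the end of the string");
            ++pos;                                   // consume the closing ')'
        } else if (is(std::isupper)) {
            std::string symbol(1, formula[pos++]);
            while (is(std::islower)) symbol += formula[pos++];
            part[symbol] = 1;
        } else {
            throw std::runtime_error(std::string("Expecting an atom name, but got '")
                                     + formula[pos] + "'");
        }
        int quantity = 0;
        bool hasDigits = false;
        while (is(std::isdigit)) {
            quantity = quantity * 10 + (formula[pos++] - '0');
            hasDigits = true;
        }
        if (!hasDigits) quantity = 1;                // no number means one
        for (const auto& entry : part) totals[entry.first] += entry.second * quantity;
    }
    return totals;
}

// Top-level entry point: the whole string must be consumed.
Composite parseFormula(const std::string& formula) {
    std::size_t pos = 0;
    Composite result = parseComposite(formula, pos);
    if (pos != formula.size()) throw std::runtime_error("Unexpected ')'");
    return result;
}
```

Because the recursive call returns counts directly, no syntax tree is built; the trade-off is that the original grouping is lost once the counts are merged.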
Maybe you can get rid of brackets before parsing. You need to find how many "brackets in brackets" (sorry for my english) are there and rewrite it like that beginning with the "deepest":
(NH4(Na2H4)3Zn)2SO4 (this formula doesn't mean anything, actually...)
(NH4Na6H12Zn)2SO4
N2H8Na12H24Zn2SO4
no brackets left, let's run your code with N2H8Na12H24Zn2SO4
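A rough sketch of this bracket-removal pass, assuming well-formed element symbols (one uppercase letter plus optional lowercase letters) and using made-up function names scaleFragment and removeBrackets:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

namespace {
bool isDigitAt(const std::string& s, std::size_t i) {
    return i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])) != 0;
}
bool isLowerAt(const std::string& s, std::size_t i) {
    return i < s.size() && std::islower(static_cast<unsigned char>(s[i])) != 0;
}
}  // namespace

// Multiply every element count in a bracket-free fragment by factor,
// e.g. ("Na2H4", 3) becomes "Na6H12". A resulting count of 1 is left implicit.
std::string scaleFragment(const std::string& frag, int factor) {
    std::string out;
    std::size_t i = 0;
    while (i < frag.size()) {
        out += frag[i++];                               // first letter of the symbol
        while (isLowerAt(frag, i)) out += frag[i++];    // rest of the symbol
        int count = 0;
        bool hasDigits = false;
        while (isDigitAt(frag, i)) { count = count * 10 + (frag[i++] - '0'); hasDigits = true; }
        if (!hasDigits) count = 1;
        const int scaled = count * factor;
        if (scaled != 1) out += std::to_string(scaled);
    }
    return out;
}

// Repeatedly expand the innermost "(...)n" until no brackets remain.
std::string removeBrackets(std::string formula) {
    for (;;) {
        const std::size_t close = formula.find(')');
        if (close == std::string::npos) break;
        // The last '(' before the first ')' opens the innermost group.
        const std::size_t open = formula.rfind('(', close);
        if (open == std::string::npos) throw std::invalid_argument("Unmatched ')'");
        std::size_t end = close + 1;
        int factor = 0;
        bool hasDigits = false;
        while (isDigitAt(formula, end)) { factor = factor * 10 + (formula[end++] - '0'); hasDigits = true; }
        if (!hasDigits) factor = 1;
        const std::string inner = formula.substr(open + 1, close - open - 1);
        formula = formula.substr(0, open) + scaleFragment(inner, factor) + formula.substr(end);
    }
    if (formula.find('(') != std::string::npos) throw std::invalid_argument("Unmatched '('");
    return formula;
}
```

Note that an element can appear more than once in the flattened string (H occurs twice in the example), so the downstream code has to add up repeated symbols.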
We've become fairly adept at generating various regular expressions to match input strings, but we've been asked to try to validate these strings iteratively. Is there an easy way to iteratively match the input string against a regular expression?
Take, for instance, the following regular expression:
[EW]\d{1,3}\.\d
When the user enters "E123.4", the regular expression is met. How do I validate the user's input while they type it? Can I partially match the string "E1" against the regular expression?
Is there some way to say that the input string only partially matched the input? Or is there a way to generate sub-expressions out of the master expression automatically based on string length?
I'm trying to create a generic function that can take any regular expression and throw an exception as soon as the user enters something that cannot meet the expression. Our expressions are rather simple in the grand scheme of things, and we are certainly not trying to parse HTML :)
Thanks in advance.
David
You could do it only by making every part of the regex optional, and repeating yourself:
^([EW]|[EW]\d{1,3}|[EW]\d{1,3}\.|[EW]\d{1,3}\.\d)$
This might work for simple expressions, but for complex ones this is hardly feasible.
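As an illustration of this idea with the standard library only, the expanded expression can be fed to std::regex (ECMAScript syntax) and tested against whatever the user has typed so far. The function name isValidPrefix is hypothetical, and the expanded pattern still has to be written by hand for each expression.

```cpp
#include <regex>
#include <string>

// Returns true when the text typed so far could still grow into (or already is)
// a full match of [EW]\d{1,3}\.\d, using the hand-expanded alternation.
bool isValidPrefix(const std::string& typed) {
    static const std::regex prefixes(
        R"(^([EW]|[EW]\d{1,3}|[EW]\d{1,3}\.|[EW]\d{1,3}\.\d)$)");
    return std::regex_match(typed, prefixes);
}
```

With this pattern the empty string is rejected, so an input field would need to treat "nothing typed yet" as a separate case.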
Hard to say... If the user types an "E", that matches the beginning but not the rest. Of course, you don't know if they will continue to type "123.4" or if they will just hit "Enter" (I assume you use "Enter" to indicate the end of input) right away. You could use groups to test that all 3 groups match, such as:
([EW])(\d{1,3})(\.\d)
After the first character, try to match the first group. After the next few inputs, match the first AND second group, and when they enter the '.' and last digit you have to find a match for all 3 groups.
You could use partial matches if your regex lib supports it (as does Boost.Regex).
Adapting the is_possible_card_number example on this page to the example in your question:
#include <boost/regex.hpp>

#include <cassert>
#include <stdexcept>
#include <string>

// Return false for partial match, true for full match, or throw for
// impossible match
bool
CheckPartialMatch(const std::string& Input, const boost::regex& Regex)
{
    boost::match_results<std::string::const_iterator> what;
    if(0 == boost::regex_match(Input, what, Regex, boost::match_default | boost::match_partial))
    {
        // the input so far could not possibly be valid so reject it:
        throw std::runtime_error(
            "Invalid data entered - this could not possibly be a match");
    }
    // OK so far so good, but have we finished?
    if(what[0].matched)
    {
        // excellent, we have a result:
        return true;
    }
    // what we have so far is only a partial match...
    return false;
}
int main()
{
    const boost::regex r("[EW]\\d{1,3}\\.\\d");
    // The input is incomplete, so we expect a "false" result
    assert(!CheckPartialMatch("E1", r));
    // The input completely satisfies the expression, so expect a "true" result
    assert(CheckPartialMatch("E123.4", r));
    try{
        // Input can't match the expression, so expect an exception.
        CheckPartialMatch("EX3", r);
        assert(false);
    }
    catch(const std::runtime_error&){
    }
    return 0;
}