Say I want to parse jQuery selector syntax and turn it into tokens.
Should I parse the input as an array of bytes? As a string with std.string? Char by char, or is there a Boyer-Moore search somewhere in Phobos? D has the fastest regex, so maybe regex?
If someone could link to any good parsers written in D that would also be appreciated.
Pegged is a simple-to-use parser generator.
I wrote a little CSS selector thing in my dom.d file:
https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff
Grab just the files dom.d and characterencodings.d if you want to play with it.
The way I did it is to use std.string. I wouldn't call this idiomatic or even good... but it was simple to write and got the job done for me. Selector strings are so short I don't think speed would matter much anyway.
For the html parser, I did that char by char. A more idiomatic way would probably be to be templated on an input range and return an output range. I did something more like this for a toy example a while ago:
http://arsdnet.net/dcode/lex.d
Again, I won't say this is the ideal D way... or even a good D way, but it is one possibility that can be made to work.
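For illustration only, here is a minimal sketch of the char-by-char approach. It is written in Python rather than D, it is not the code from dom.d or lex.d, and the function name and token kinds are made up for the example. It covers only tags, #ids, .classes and the basic combinators.

```python
def tokenize_selector(selector):
    """Split a simple CSS/jQuery-style selector into (kind, text) tokens."""
    tokens = []
    i = 0
    while i < len(selector):
        ch = selector[i]
        if ch.isspace():
            while i < len(selector) and selector[i].isspace():
                i += 1
            # Whitespace only counts as a descendant combinator between two
            # simple selectors, not next to an explicit combinator or at the ends.
            at_edge = not tokens or i == len(selector)
            if not at_edge and tokens[-1][0] != "combinator" and selector[i] not in ">+~,":
                tokens.append(("combinator", " "))
        elif ch in "#.":
            # '#id' or '.class': read the identifier that follows the marker.
            start = i + 1
            i = start
            while i < len(selector) and (selector[i].isalnum() or selector[i] in "-_"):
                i += 1
            tokens.append(("id" if ch == "#" else "class", selector[start:i]))
        elif ch.isalpha():
            # A tag name such as 'div' or 'p'.
            start = i
            while i < len(selector) and (selector[i].isalnum() or selector[i] == "-"):
                i += 1
            tokens.append(("tag", selector[start:i]))
        elif ch in ">+~,":
            tokens.append(("combinator", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens
```

Attribute selectors, pseudo-classes and `*` are deliberately left out; they would each be another branch in the same loop.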
I want to write a program that takes a string like x^2+1 and understands it.
I want to ask the user to enter a function, and I want to be able to process and evaluate it. Any ideas?
char s[100];
s <- "x*I+2"
x=5;
I=2;
res=calc(s);
I think it could be done with some kind of string analysis, but that seems very hard to me.
I have another idea: use tcc from the main program to compile, run, and then delete a separate program (or maybe a function) that has the string s in it.
I would create a temp file every time, ask tcc to compile it, and run it with exec or something similar.
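/* tmp.cpp: */
#include <math.h>
#include <stdio.h>
int main(/* input args */) {
    double x = 5, I = 2; /* values written into the file by the generator */
    printf("%g\n", x*I+2); /* the user's expression, pasted in verbatim */
    return 0;
}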
The tmp.cpp file would be created dynamically.
Thanks in advance.
I am not sure what you expect. The full code is too long to give as an answer, but the general idea is not very complex, and it is within reach even for a hobbyist programmer.
You need to define a grammar, tokenize the string, and recognize operators, constants and variables.
Then you would probably put the expression into a tree and come up with a way to substitute the variables... and you can evaluate!
You need to have some kind of a parser. The easiest way to make math operations parsable is to have them written in RPN. You can, however, write your own parser using parser libraries such as Boost.Spirit or Yacc.
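To show why RPN is the easy case, here is a rough sketch of a stack-based RPN evaluator. It is in Python rather than C++, and the function name and the whitespace-separated token format are assumptions made for the example.

```python
import operator

# Binary operators supported by this sketch.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}

def eval_rpn(expression, variables=None):
    """Evaluate a whitespace-separated RPN expression such as 'x I * 2 +'."""
    variables = variables or {}
    stack = []
    for token in expression.split():
        if token in OPS:
            # An operator consumes the two most recent operands.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))  # raises ValueError on an unknown token
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]
```

With this, the question's x*I+2 would be entered as `x I * 2 +` and evaluated with `eval_rpn("x I * 2 +", {"x": 5, "I": 2})`. There is no precedence or parenthesis handling at all, which is the whole appeal.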
I have used function parser with success.
According to its website it also supports std::complex, but I have never used that.
As luck would have it, I recently wrote one!
Look for {,include/}lib/MathExpression/Term. It handles complex numbers but you can easily adapt it for plain old floats.
The licence is GPL 2.
The theory in brief, when you have an expression like
X*(X+2)
Your highest level parser can parse expressions of the form A + B + C... In this case A is the whole expression.
You recurse to parse an operator of higher precedence, A * B * C... In this case A is X and B is (X+2)
Keep recursing until you're parsing either basic tokens such as X or hit an opening parenthesis, in which case push some kind of stack to track where you are and recurse into the parentheses with the top-level low-precedence parser.
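The recursion described above can be sketched roughly like this. This is Python, not the GPL'd MathExpression code mentioned earlier, and every name in it is made up for the example. It evaluates while parsing instead of building a tree, and it handles only + - * / and parentheses on real numbers.

```python
import re

def tokenize(text):
    # Numbers, identifiers, or single-character operators/parentheses.
    # Characters outside this set are silently skipped in this sketch.
    return re.findall(r"\d+\.?\d*|[A-Za-z_]\w*|[-+*/()]", text)

class ExpressionParser:
    def __init__(self, text, variables):
        self.tokens = tokenize(text)
        self.pos = 0
        self.variables = variables

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_sum(self):
        # Lowest precedence: A + B - C ...
        value = self.parse_product()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.parse_product()
            else:
                value -= self.parse_product()
        return value

    def parse_product(self):
        # Higher precedence: A * B / C ...
        value = self.parse_atom()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.parse_atom()
            else:
                value /= self.parse_atom()
        return value

    def parse_atom(self):
        # A number, a variable, or a parenthesised sub-expression.
        token = self.take()
        if token is None:
            raise ValueError("unexpected end of expression")
        if token == "(":
            value = self.parse_sum()  # back to the low-precedence parser
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token in self.variables:
            return float(self.variables[token])
        try:
            return float(token)
        except ValueError:
            raise ValueError(f"unexpected token {token!r}") from None

def evaluate(text, variables=None):
    parser = ExpressionParser(text, variables or {})
    value = parser.parse_sum()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return value
```

Here the call stack itself plays the role of the "stack to track where you are": each opening parenthesis is one more nested call to `parse_sum`. Errors are reported by raising exceptions, in the spirit of the advice below.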
I recommend you use RAII and throw exceptions when there are parse errors.
Use a recursive descent parser.
Sample: it's in German, but it is a small and powerful solution.
look here
Here is exactly what you are searching for. Change the function read_varname to detect a variable like 'x' or 'I'.
In C++ I need to convert a string to any type at runtime where I do not know what type I might be getting in the string. I have heard there is a lexical_cast in boost that I can use, but what would be the most effective way to implement it?
I might get a bunch of string like this from a client: Date="25/08/2010", Someval="2", Blah="25.5".
Now I want to be able to convert these strings to their types; e.g., Someval is obviously an int, and the Date could be a boost::date or whatever. The point is, I don't know at this time in what order these would be given to me, so it's hard to write some code that will perform a bunch of casts.
I could use a bunch of if/else statements or a switch/case statements, however I'm thinking that there is possibly a better way to do this.
I'm not looking for something different from lexical_cast, I can totally use that; I am looking to see if someone knows a better way than doing this:
std::string str = "256";
int a = lexical_cast<int>(str);
//now check if the cast worked, if not, try another...
This is too much of a guessing game, and if I have 10 possible types for any given string, it sounds a bit inefficient, especially if it has to do thousands of these at any given time.
Can anybody advise?
Alex Brown notes - the example string is a fragment of the XML data that comes from the client.
Use an XML parser to read XML data, it will do almost all of the legwork for you, and deal with the ordering issues. Then you simply need to ask the parser for the data you need for the calculation.
Details differ with different XML parsers - go find one, read the documentation. If you need more help, come back here with an XML parser question.
GMan is right: you cannot cast an arbitrary string to, for example, a Date type if the underlying data structure is different. You can, however, parse the content and instantiate a new object using the data in the string. std::atoi(), for example, parses a C string into an int.
You need to parse the string, not cast it.
What you're describing is actually a parser. Even the trial-and-error approach using lexical_cast is really just a (crude) parser.
I suggest to clarify the format of the input string and then, if it's simple enough, write a Recursive descent parser by hand to parse the input string into whatever data structure is convenient for your need.
You could use a VARIANT type of struct (i.e. one member for every possible result, plus a "type" field specifying which one it is, drawn from a big enum of types), and a ConvertStringToVariant() function.
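A rough sketch of the ConvertStringToVariant() idea, in Python rather than C++: the result is a (type name, value) pair, and the parsers are tried from most to least specific. The list of types and the dd/mm/yyyy date format are assumptions taken from the example strings in the question.

```python
from datetime import date, datetime

def parse_date(text: str) -> date:
    # Assumes the dd/mm/yyyy layout used in the question's example.
    return datetime.strptime(text, "%d/%m/%Y").date()

# Ordered from most to least specific; the first parser that accepts wins.
PARSERS = [("int", int), ("float", float), ("date", parse_date)]

def convert_string_to_variant(text):
    """Return a (type_name, value) pair, falling back to the raw string."""
    for type_name, parse in PARSERS:
        try:
            return type_name, parse(text)
        except ValueError:
            continue
    return "string", text
```

This is still trial and error, just kept in one table. If the field name (Date, Someval, Blah) reliably determines the type, a dictionary from field name to parser would avoid the guessing entirely.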
"This is too much of a guessing game, and if I have 10 possible types, for any given string"
If you're concerned about this, you need a lexical analyzer, such as flex or Boost.Spirit.
It will still be a guessing game, but a more "informed" guessing one.
One of the more interesting "programming languages" I've been stuck with lately is MediaWiki templates. You can do a surprising amount of stuff with the limited syntax they give you, but recently I've run into a problem that stumps me: using string functions on template arguments. What I'd like to do (somewhat simplified) is:
{{myTemp|a=1,2,3,4}}
then write a template that can do some sort of magic like
You told me _a_ starts with {{#split:{{{a}}}, ",", 0}}
At present, I can do this with embedded javascript, capturing regexp matching, and document.write, but a) it's huge, b) it's hacky, and c) it will break horribly if anybody turns off javascript. (Note that "split" is merely an example; concatenate, capturing-regexp matching, etc., would be even better)
I realize the right solution is to have the caller invoke the template with separate arguments, but for various reasons that would be hard in my particular case. If it's simply not possible, I guess that's the answer, but if there is some way to have templates do string-manipulation on the back end, that'd be great.
Concatenate is easy. To assign x = y concat z
{{#vardefine:x|{{{y}}}{{{z}}}}}
And, to add to Mark's answer, there are also RegexParserFunctions
Ceterum censeo: MediaWiki will never be not hacky.
You can do this with extensions, e.g. StringFunctions. But see also ParserFunctions and ParserFunctions/Extended. (You'll find a lot more examples in the Category:Parser function extensions.)
A great overview is at Help:Extension:ParserFunctions.
Is it possible/practical to build a single regular expression that matches hierarchical data?
For example:
<h1>Action</h1>
<h2>Title1</h2><div>data1</div>
<h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
<h2>Title3</h2><div>data3</div>
I would like to end up with matches.
"Action", "Title1", "data1"
"Action", "Title2", "data2"
"Adventure", "Title3", "data3"
As I see it, this would require knowing that there is a hierarchical structure at play here. If I code the pattern to capture the H1, it only matches the first entry of that hierarchy; if I don't code for H1, then I can't capture it. I was wondering if there are any special tricks I can employ to solve this.
This is a .NET project.
The solution is to not use regular expressions. They're not powerful enough for this sort of thing.
What you want is a parser - since it looks like you're trying to match HTML, there are plenty to choose from.
It's generally considered bad practice to attempt to parse HTML/XML with RegEx, precisely because it's hierarchical. You COULD use a recursive function to do so, but a better solution in this case is to use a real XML parser. I couldn't give you better advice than that without knowing the platform you're using.
EDIT: Regex is also very slow, which is another reason it's bad for processing HTML; however, I don't know that an XML/DOM processor is likely to be faster since it's likely to use a lot more memory.
If you JUST want data from a simple document like you've demonstrated, and/or if you want to build a solution yourself, it's not that tough to do. Just build a simple, recursive state-based stream processor that looks for tags and passes the contents to the next recursive level.
For example:
- In a recursive function, seek out a "<" character.
- Now find a ">" character.
- Preserve everything you find until the next "<" character.
- Find a ">" character.
- Pass whatever you found between those tags into the recursive function.
You'd have to work out error checking yourself, but the base case (when you return back up to the previous level) is just when there's nothing else to find.
Maybe this helps, maybe not. Good luck to you.
Regex does not work for this type of data. It is not regular, per se.
You should use an XML parser for this.
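As a sketch of what "use a parser" looks like for the sample above, here is a version using the HTML parser from Python's standard library as a stand-in for whatever parser your platform offers. The class and function names are made up for the example, and it assumes the flat h1 / h2 / div layout shown in the question.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Remember the most recent h1 and h2, and emit a row for every div."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.h1 = None
        self.h2 = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between elements
        if self.current_tag == "h1":
            self.h1 = text
        elif self.current_tag == "h2":
            self.h2 = text
        elif self.current_tag == "div":
            self.rows.append((self.h1, self.h2, text))

def extract_rows(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The state that a single regex cannot carry (which h1 are we currently under?) is just two fields on the parser object here. Nested tags inside the div would need more bookkeeping than this sketch has.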
Are there particular cases where native text manipulation is more desirable than regex?
In particular, in .NET?
Note:
Regex appears to be a highly emotive subject, so I am wary of asking such a question. This question is not inviting personal/professional opinions on regex, only specific situations where a solution including its use is not as good as language native commands (including those which have underlying code using regex) and why.
Also, note that Desirable can mean performance, can mean code-readability; it does not mean panacea, as each solution for a problem has its benefits and limitations.
Apologies if this is a duplicate, I have searched SO for a similar question.
I prefer text manipulation over regular expressions to parse delimited string input. It's far simpler (for me at least) to issue a string split than to manage a regular expression.
Given some text:
value1, value2, value3
You can parse the line easily:
var values = myString.Split(',');
I'm sure there's a better way but with regular expressions you'd have to do something like:
var match = Regex.Match(myString, "^([^,]*),([^,]*),([^,]*)$");
var value1 = match.Groups[1].Value;
...
When you can do it simply with native text manipulation, it is usually preferable (simpler to read & better performance) not to use regex.
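The same comparison as a small sketch in Python (not .NET), using the sample line from the answer above. The variable names are only for illustration.

```python
import re

line = "value1, value2, value3"

# Native string handling: split on the delimiter, then trim each field.
by_split = [field.strip() for field in line.split(",")]

# Regex equivalent: one capture group per field, so the pattern has to
# change whenever the number of fields does.
match = re.fullmatch(r"\s*([^,]*?)\s*,\s*([^,]*?)\s*,\s*([^,]*?)\s*", line)
by_regex = list(match.groups())
```

Both produce the same three fields, but the split version keeps working unchanged for two fields or twenty.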
Personal rule of thumb: if it's tricky or relatively longer to do it "manually" and that performance gain is negligible, don't. Else do.
Examples where I would not use a regex:
split
simple find & replace
long text
loop
existing native functions (like, in PHP, strrchr, ucwords...)
Using a regex basically means embedding a tiny program, written in a different programming language, in the middle of your program. I'll ignore the inefficiency of using a regex over native string manipulation, because it probably isn't relevant in most cases.
I prefer native text manipulation over regex any time native text manipulation will be easier to follow for other people. Which is true quite frequently, since plenty of the people around me are not strongly familiar with regex. Unless working with something that is very much about parsing (via regex) they should not need to be!
Regular expressions are usually slower, less readable, and harder to debug than native string manipulation.
The main case where I'll prefer regex over string manipulation is when I want to be able to have different ways to parse strings depending on the source, and the types of sources will increase over time. Native string manipulation is not really practical in this case. I've had cases where I've stuck a regex column in a database...
RegEx's are very flexible and powerful, because they are in many ways similar to an eval() statement. That being said, depending on the implementation, they can be a bit slow. Normally, this is not an issue, however, if they can be avoided in a particularly costly loop, that can boost performance.
That being said, I tend to use them, and only worry about performance when the app is "done" and I have real benchmarks to prove I need to tweak performance. i.e, avoid premature optimization.
Whenever the same result can be achieved with a reasonable amount of code.
Regular expressions are very powerful, but they tend to get hard to read. If you can do the same with simple string operations that usually means that the code gets easier to manage and maintain.
There is some overhead in setting up the object and parsing the expression. For simpler string manipulation you can get better performance with simple string methods.
Example:
Getting the file name from a file path (yes, I know that the Path class should be used for that, it's just an example...)
string name = Regex.Match(path, @"([^\\]+)$").Groups[0].Value;
vs.
string name = path.Substring(path.LastIndexOf('\\') + 1);
The second solution is straightforward and does the minimal work needed to get the result. The regular expression solution produces the same result, but it does more work to parse the string, and it produces a bunch of objects that are not needed for the result.
Parsing and executing a regex means the host language has to hand processing off to its regex "engine". This adds overhead, so wherever native string manipulation could be used instead, it is preferable for speed (and readability!).
I'll usually just use text manipulation for simple string replacements (e.g. replacing tokens in a template with actual values). You could certainly do this with Regex, but replacements are much easier.
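A small sketch of the template case in Python, with both variants side by side. The `{name}` placeholder syntax and the function names are assumptions made for the example.

```python
import re

def fill_template(template, values):
    """Plain replacement: one str.replace call per token."""
    result = template
    for name, value in values.items():
        result = result.replace("{" + name + "}", str(value))
    return result

def fill_template_regex(template, values):
    """Regex variant: find every {token} and look it up, leaving unknown ones alone."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )
```

For a handful of known tokens the first version is the easier one to read. One difference worth knowing: the plain version replaces in sequence, so a substituted value that itself contains `{other}` may get replaced again, while the regex version makes a single pass.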
Yes. Example:
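#include <string.h>

const char* basename (const char* path)
{
    /* Return what follows the last '/', or the whole path if there is none. */
    const char* p = strrchr(path, '/');
    return (p != NULL) ? (p+1) : path;
}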