Swift: fastest way to parse HTML - regex

I have a large file of source code that I need to parse some specific text out of. I want to get it done as fast as possible. What would be the fastest way to do this in Swift? These are all the options I could think of?
Using a third-party library of string functions- I've tried this. It works well, but I imagine this is much slower compared to other lower level methods in general, unless there are some notably fast ones out there specifically for Swift.
Using a third-party HTML parser. I've looked into a few, but I'm not sure if they will suit my needs. Before I proceed with this, I just want to know if these are generally faster, if there are any notabley fast ones out there, and if I'm able to tweak them to get specifically what I want from the source code.
Using String or NSString. From what I understand, using String vs NSString should give no difference in speed. I am quite comfortable with this approach, and it's lower level than some of the other ones, so should I expect fairly fast performance?
Using regular expressions. I've been told that since these are lower-level, they should ideally be the fastest. I've used regular expressions before, but not in ios. Is it easy to do string parsing with NSRegularExpression, and is it faster?
Thank you!

Came upon this link while researching your question: http://benedictcohen.co.uk/blog/archives/74
The authors explains an older approach to what #CodaFi suggested, but there is a relevant update at the end you should check out:
The easiest way to parse HTML is to treat it as XML and use the
NSXMLParser. iOS comes with LibTidy which is capable of fixing a
multitude of markup sins. Use LibTidy to create clean XML and pass
this XML to NSXMLParser. Only use the approach outlined above if it’s
not possible to use NSXMLParser.
So perhaps option 4 or 5 for you to check out?

Related

Tiny C++ YAML reader/writer

I'm writing an embedded C++ program, and need to add serialization/deserialization. The format should be human readable and writeable, and I would much prefer to use (a subset of) a standard format like YAML. I also prefer YAML to JSON since it is more concise.
While yaml-cpp has the exact functionality I'd like, the source code is almost 300K and would almost double my code size, which seems excessive to me just in order to add human readable serialization/deserialization.
Before I start writing my own reader/writer for a subset of YAML, I'd like to first check whether this already exists? I have not been able to find one, but would much prefer to use existing code rather than rolling my own. Are there any C or C++ YAML readers/writers out there of, say, 50K code or less? I only need functionality for the basic data structures (scalar, array, hash), not any advanced stuff.
With many thanks in advance.
The Oops library is doing what you are looking for. It is written for serialization using reflection and supports YAML format as well.
https://bitbucket.org/barczpe/oops

XML vs YAML vs JSON for a 2D RPG [duplicate]

This question already has answers here:
Is there a C++ library to read JSON documents into C++ objects? [duplicate]
(4 answers)
Closed 8 years ago.
I can't figure out whether or not to use XML, YAML, or JSON for a C++ 2D RPG.
Here are my thoughts:
I need something which is simple to save not just player data, but environment data, such as object (x, y) coordinates; load times; dates; graphics configurations, etc.
I need something flexible, easy to use, and definitely light weight, but powerful to handle the above.
Which is the best choice? I have experience with JSON in JavaScript, but not C++. Are there any good references for parsing JSON in C++ if this is the route to go?
Edit
Honestly, if a text file seems like the simplest and most effective solution for something like this (especially if I can just write it to binary), then I'm all ears.
Edit 2
Feel free to provide other suggestions as well.
I would use the simplest thing that satisfies your requirements.
If you don't need hierarchical storage, then flat tabular files are so much easier to deal with than anything else. All you have to do is read lines off disk and split on tab.
If you are looking at more of key/value pair type storage (as opposed to lists of things), then INI files can be reasonable. This format has a lot of flexibility, though reasoning about it can less approachable when you start doing more complicated things than it was designed for.
If you need hierarchical, it's possible that JSON would be simpler. There are JSON libraries in wide range of languages, and it sounds like you already familiar.
https://stackoverflow.com/questions/245973/whats-the-best-c-json-parser
sqlite may be another option. There be dragons in SQL, but with a nice C++ wrapper around sqlite, it can be manageable. The primary benefit would be ACID, in my opinion.
The YAML spec looks somewhat lengthy, so I can guess that it has more kitchen sinks. Just skimming the libyaml docs, the API looks somewhat like SAX interfaces that I've used in the past. I have no a posteriori knowledge of it, but I would be reticent to start using it without a good reason.
XML sucks to deal with, don't opt in to it. There lots of reasons for this. I think the most relevant one in my mind is that it's prone to make the code that uses it more complicated than it would be otherwise. Any system I've seen designed with XML, reasoning about the XML is more complicated than the design interests that its trying to support. There are valid uses for it, though it's rare that another storage system wouldn't have been just as adequate.
Regardless of which one you choose, write as little code as you can managing it. You really want to write the classes your engine will use first. Then worry about serializing them. If you let your serialization influence your class design, you'll probably regret it. :)

Good Option for XML Edit/Replace

I have a huge (100k+ lines, 5MB+) XML which acts as a database for my C++ Application. The structure of the XML is quite straight forward, for example, it has chunks of:
<foo>
<bar prop="true"/>
<baz>blah</baz>
</foo>
The nesting of tags is several levels deep and there are many items with multiple properties. What is a good way to find and replace chunks of this kind of a file? For example, assume that the above section is repeated a few dozen times and in each chunk the value of the tag <baz> is different. I'd like to make edits such as:
Setting all the values contained in tag <baz> to a given value.
Remove chunks containing certain values
Etc.
So far, I've learnt of the following methods for accomplishing this:
Find/Replace: A no-brainer, trivial solution and also my last fall-back. This approach, IMHO is the most time consuming, error prone and painful method. The absolute last resort.
RegExes: Use regular expressions to match blocks of interest and edit them using replacement expressions. Kinda like this blog entry: http://blogs.msdn.com/b/vseditor/archive/2004/08/12/213770.aspx. But I feel this would be error prone and there could be a bunch of missed items if the regex is not exactly right the first time around.
Parser & Save: Whip up a quick program to parse the XML using Xerces or XML DOM Interfaces (or some other XML library), read the XML in, manipulate it as desired and save back to disk. Again, this approach is a slow process, but once its up and running, easy to make modifications and more flexible then RegExes.
Are there any better ways to deal with this?
(EDIT: Thanks for all the redo it to use a DB suggestions, I know its a huge mess but by "better ways to deal with this" I meant the "find/replace" part. )
If you don't want to put the entire document in memory, I would read it using a SAX parser. As you read it, you append the transformed document to a second (or a temp) file. I think it could be pretty fast, and use only a little memory footprint.
Are there any better ways to deal with this?
If you must use XML, you could use an XML database such as BDB XML (which has C++ APIs). It supports XQuery, transactions, etc.
Other options include TinyXML which I've used with success in the past. Quick and easy to use, not necessarily the fastest on a file of that size, but it will get the job done.
What are your actual memory constraints? 5MB is large but not enormous by current RAM standards.
I would use DOM with XPath if you can, it will be a lot less development work than SAX or other stream-based parsing. My problem with SAX is that if you are really using this as a in-memory DB, that implies random access on-demand and SAX is not well-suited for that - you will have to parse and reserialize over and over, whereas once you have the DOM at least you can play with it as you like.
Echo comments about to store in-RAM database info too. Plenty of alternatives that are better suited to this than XML. Maybe you could implement a tactical solution using DOM/XPath and investigate rip-and-replace as a longer-term project.

Writing a tokenizer, where to begin?

I'm trying to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible, for each token, and in theory I know how I could put that in code.
I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.
I have no idea how to write this tokenizer, are there any "neat" ways of doing it, like something that closely resembles the syntax itself?
Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file to a general internal tokenized format, process some things and output again.
Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.
My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.
I'm an Old Fart® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early 80's and it has returned the effort to learn them many, many times over.
BTW, if you have anything approaching a BNF of the language, lex/yacc can be laughably easy to work with.
Boost.Spirit.Qi would be my first choice.
Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as boost regex) or scanners (such as Boost tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately-complex parser using these tools leads to code that is hard to understand and maintain.
The Qi tutorials even finish by implementing a parser for an XMLish language; writing a grammar for CSS should be considerably easier.

Parsing a string in C++

I have a huge set of log lines and I need to parse each line (so efficiency
is very important).
Each log line is of the form
cust_name time_start time_end (IP or URL )*
So ip address, time, time and a possibly empty list of ip addresses or urls separated by semicolons. If there is only ip or url in the last list there is no separator. If there
is more than 1, then they are separated by semicolons.
I need a way to parse this line and read it into a data structure. time_start or
time_end could be either system time or GMT. cust_name could also have multiple strings
separated by spaces.
I can do this by reading character by character and essentially writing my own parser.
Is there a better way to do this ?
Maybe Boost RegExp lib will help you.
http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html
I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.
Using regular expressions (boost::regex is a nice implementation for C++) you can easily separate different parts of your string - cust_name, time_start ... and find all that urls\ips
Second step is more detailed parsing of that groups if needed. Dates for example you can parse using boost::datetime library (writing custom parser if string format isn't standard).
Why do you want to do this in C++? It sounds like an obvious job for something like perl.
Consider using a Regular Expressions library...
Custom input demands custom parser. Or, pray that there is an ideal world and errors don't exist. Specially, if you want to have efficiency. Posting some code may be of help.
for such a simple grammar you can use split, take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
UPDATE changed answer drastically!
I have a huge set of log lines and I need to parse each line (so efficiency is very important).
Just be aware that C++ won't help much in terms of efficiency in this situation. Don't be fooled into thinking that just because you have a fast parsing code in C++ that your program will have high performance!
The efficiency you really need here is not the performance at the "machine code" level of the parsing code, but at the overall algorithm level.
Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure,
Storing huge data structure in memory is very inefficient, no matter what language you're using!
What you need to do is "fetch" one line at a time, convert it to a data structure, and deal with it, then, and only after you're done with the data structure, you go and fetch the next line and convert it to a data structure, deal with it, and repeat.
If you do that, you've already solved the major bottleneck.
For parsing the line of text, it seems the format of your data is quite simplistic, check out a similar question that I asked a while ago: C++ string parsing (python style)
In your case, I suppose you could use a string stream, and use the >> operator to read the next "thing" in the line.
see this answer for example code.
Alternatively, (I didn't want to delete this part!!)
If you could write this in python it will be much simpler. I don't know your situation (it seems you're stuck with C++), but still
Look at this presentation for doing these kinds of task efficiently using python generator expressions: http://www.dabeaz.com/generators/Generators.pdf
It's a worth while read.
At slide 31 he deals with what seems to be something very similar to what you're trying to do.
It'll at least give you some inspiration.
It also demonstrates quite strongly that performance is gained not by the particular string-parsing code, but the over all algorithm.
You could try to use a simple lex/yacc|flex/bison vocabulary to parse this kind of input.
The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's an issue of what data structure you build & save.