I have to read and parse an HTML file and populate a data structure (in C++). I'm planning to do the parsing in Perl (so I can use some other Perl modules). My requirements are:
Get the file from the GUI (C++ code).
Pass it to Perl.
Parse the file on the Perl side (a Perl script using some other Perl modules) and populate the C++ structure.
Continue working on the C++ side with the populated structure.
I've been reading about extending and embedding Perl, but I can't figure out the correct procedure. Any help would be greatly appreciated.
In your reading did you find perlembed in Perl's documentation? That's the definitive resource for learning how to embed Perl in a C/C++ program. The author of the document was one of the original mod_perl developers, I believe.
I don't think that embedding Perl for a trivial task would be the easiest solution when compared to doing a system call to perl and parsing the result, but for more involved needs it's certainly a solution.
I've used SWIG to connect C++ and Python. The documentation says it works for Perl, too.
Yet another alternative is to have Perl drive your C++ code. Write a function that has a Perl-side implementation that calls a corresponding C-side implementation. See man perlxs and man perlxstut for more info.
Edit: Or read it online at http://perldoc.perl.org/perlxs.html and http://perldoc.perl.org/perlxstut.html.
I'm trying to find a simple/elegant command-line solution for a process that is often used in scripts. Something like: (Fictional example)
CopyWithReplace <SourceFile> <DestFile> -m <match regular expression> -r <replacement regular expression>
It would copy the text file with the matched text replaced as specified. Ideally, the find/replace would happen in the pipeline, rather than as a secondary step. (Destinations quite often are remote locations, and long distance WAN links are often not as fast and reliable as desired.)
What would be the simplest** way to achieve this scriptable functionality in a Windows environment?
** Simplest = easy to write batch code, fewest 3rd party tools, etc. Bonus points for a reasonably standard Regex implementation.
This can be achieved with sed.
The basic usage pattern, for a substitution as you described, is:
sed 's/regexp/replacement/g' inputFileName > outputFileName
sed is a Unix utility, but there are several ways of using it in Windows if you wish. This StackOverflow post lists the various options available.
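If sed isn't an option but a Python interpreter happens to be installed, the same copy-with-substitution can be sketched in a few lines. This is only an illustration, not a drop-in CopyWithReplace: the function name copy_with_replace is made up here, the file is read with the platform's default text encoding, and Python's regex syntax differs from sed's in places.

```python
import re


def copy_with_replace(src_path, dest_path, pattern, replacement):
    """Copy src_path to dest_path, substituting regex matches line by line."""
    regex = re.compile(pattern)
    # newline="" leaves the original line endings (e.g. CRLF) untouched.
    with open(src_path, "r", newline="") as src, \
         open(dest_path, "w", newline="") as dest:
        for line in src:
            # Like sed's s/regexp/replacement/g: replace every match on the line.
            dest.write(regex.sub(replacement, line))
```

Because the replacement happens while each line is being written, the destination (for example a remote share) is written once and never needs a second pass.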
I need to extract all links from an HTML page using regular expressions in C++. Can anybody help me, please?
This is a hard job for a regex, and in C++ it's even harder. I actually wrote a parser for a project I did for school a few years ago. You can use this if you find that it works, but I would test it on what you want before you rely on it for anything important.
Feel free to modify/use it, whatever
I realized there were some mistakes in my code, and that I should probably include the header file. Also included is the cmakelists file but it's trivial. The ParserTest.cpp file basically lets you parse links from an input string from the command line.
http://www.mediafire.com/?0u5ppq0gzgdyg
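To give a feel for what a regex-only approach looks like and where it falls short, here is a rough sketch in Python (not the parser linked above, and not C++). The pattern and the name extract_links are my own for illustration. It only handles quoted href values in a tags and will be fooled by comments, scripts, and attributes such as data-href.

```python
import re

# Matches href="..." or href='...' inside an <a ...> tag. A deliberate
# simplification: unquoted values, comments, and scripts are not handled.
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return the href values of anchor tags, in document order."""
    return [match.group(2) for match in HREF_RE.finditer(html)]
```

As with the parser above, test it on the pages you actually care about before relying on it.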
If I'm given a .doc file with special tags in it such as [first_name], how do I go about replacing all occurrences of it with something like "Clark"? A simple binary replacement only works if the replacement string is the exact same length.
Haskell, C, and C++ answers would be best, but any compiled language would do. I'd also prefer to do this without an external library since it has to be deployed on Windows and Linux and cross-platform dependency handling is a bitch.
To summarize...
.doc -> magic program -> .doc with strings replaced
You could use the Word COM component ("Word.Application") on Windows to open the file, do the replacements, save the file, and close it. However, this is Windows-only and can be buggy.
Another thing you could do is use the OpenOffice.org command line interface to convert the file to the ODF format, unzip the file (ODF is mostly zipped XML), do the replacements with the files inside, re-zip the file, and re-convert it to .doc format. However, OpenOffice.org doesn't always read Word files correctly (especially if there is a lot of complex formatting) and it can make it harder to distribute (users must either have OpenOffice.org or you must distribute it with your program).
Also, if you have a file in the .docx format, you can unzip it, do the replacements, and re-zip it.
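The unzip / replace / re-zip route for .docx can be sketched with nothing but a standard library. The following is a minimal Python illustration, assuming the XML parts are UTF-8 and that each tag such as [first_name] sits in one piece in the XML; Word may split a tag across several runs, in which case a plain string replace will not find it. The function name replace_in_docx is hypothetical.

```python
import zipfile
from xml.sax.saxutils import escape


def replace_in_docx(src_path, dest_path, replacements):
    """Copy a .docx, applying {placeholder: value} replacements to its XML parts."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dest_path, "w", zipfile.ZIP_DEFLATED) as dest:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename.endswith(".xml"):
                text = data.decode("utf-8")  # assumption: parts are UTF-8
                for placeholder, value in replacements.items():
                    # Escape &, < and > so the result stays well-formed XML.
                    text = text.replace(placeholder, escape(value))
                data = text.encode("utf-8")
            # Non-XML parts (images etc.) are copied through unchanged.
            dest.writestr(item, data)
```

Called as replace_in_docx("in.docx", "out.docx", {"[first_name]": "Clark"}), it writes a new archive and leaves the original untouched. It does nothing for the old binary .doc format.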
First read the Word Document Specification.
If that hasn't terrified you, then you should find it fairly straightforward to figure out how to read and write it. It must be possible; Word manages to do it most of the time.
You probably have to use .NET programming (VB or C#) to create a Word.Application object and then use the MS Word object model to manipulate your document.
Why do you want to be using C/C++/Haskell or another compiled language? I'm not too familiar with Haskell, but in general I would say that C is not a great language for performing text processing. A lot of interpreted languages (Perl, Python, etc.) also have powerful regular expression libraries that are suited for finding and replacing phrases.
With that said, as the other posters have noted, you will still have to deal with the eccentricities of the .doc format.
I've been trying out the fnparse library written by Joshua Choi in Clojure, and I'm having difficulty working out how to call the rules on the text that I want to parse. I've been experimenting with cat, which is part of the new release. Let's take the example code listed. Could anyone give me some ideas on how I could call the rule on an expression?
Thank you!
Thanks for trying out FnParse 3.
In general, you use the edu.arizona.fnparse/match form (as well as the complementary find, substitute, and substitute-1 forms) to use the rules that you create. Check their documentation strings.
Sorry about the confusion—I should have added an example of match in math.clj—but take a look at the bottom of the sample Clojure parser. Even though the Clojure parser uses FnParse Hound, match works the same way in both Cat and Hound.
Is there any way to write a regex that can be used to find files with different extensions?
This works in Bash:
find . -regex '.*\.\(pdf\|chm\|doc\)'
Assuming you have a list of files and you are looking for .pdf, .chm and .doc, you can check it with:
\.pdf$|\.chm$|\.doc$
The regex above should work if you check it against individual filenames.
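As a quick way to try that pattern out, here is a small Python sketch. It factors the shared \. and $ out of the alternation, which matches the same names; the case-insensitive flag and the name has_wanted_extension are additions of mine, not part of the regex above.

```python
import re

# Equivalent to \.pdf$|\.chm$|\.doc$ with the common parts factored out.
# re.IGNORECASE is an extra assumption so that "NOTES.DOC" also matches.
EXTENSION_RE = re.compile(r"\.(pdf|chm|doc)$", re.IGNORECASE)


def has_wanted_extension(filename):
    """Return True if filename ends in .pdf, .chm, or .doc."""
    return EXTENSION_RE.search(filename) is not None
```

Note that the $ anchor is what keeps names like report.docx or archive.pdf.bak from matching.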
I'm sure there is, but the question you should be asking is "What's the best way to find files which have specific extensions?".
Regular expressions are not the best answer to every question.
I would suggest just getting a list of all files and passing them into a function like IsThisFileOneIWant(fileName,extensionList). That's far easier than trying to shoehorn the use of regular expressions into your problem.
Something like this should do it:
function IsThisFileOneIWant(fileName, extensionList):
    for each extension in extensionList:
        if fileName.endsWith(extension):
            return true
    return false
Done in pseudo-code since it should be simple enough to turn into any other language.
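For instance, a Python rendering of that pseudo-code might look like the following. Treat it as one possible translation: the snake_case name is just Python convention, and the case-insensitive comparison is an assumption that the pseudo-code doesn't make.

```python
def is_this_file_one_i_want(file_name, extension_list):
    """Return True if file_name ends with any extension in extension_list."""
    # str.endswith accepts a tuple, so one call replaces the explicit loop.
    # Lower-casing both sides is an added assumption (".SH" matches ".sh").
    wanted = tuple(extension.lower() for extension in extension_list)
    return file_name.lower().endswith(wanted)
```

Filtering a directory listing is then a one-liner, e.g. [name for name in os.listdir(".") if is_this_file_one_i_want(name, [".sh", ".c"])] after importing os.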
If you must have a regex, it's going to look something like (based on the values in your question):
"ASPX$|ASCX$|\.js$|\.rpt$|\.xml$"
but it depends entirely on the RE engine that you want to use. For example, here's the output from an egrep command in my work directory:
pax#paxbox1:~/work$ ls -1 | egrep '\.sh$|\.c$'
backup0.sh
backup1.sh
eclipse.sh
monbt.sh
qq.c
qq.sh
xx yy.sh