RTF / doc / docx text extraction in program written in C++/Qt - c++

I am writing some program in Qt/C++, and I need to read text from Microsoft Word/RTF/docx files.
And I am looking for some command-line program that can do that extraction. It may be several programs.
The closest thing I found is DocToText, but it has several bugs, so I can't use it.
I also have Microsoft Word installed on the PC. Maybe there is some way to read the text using it (I have no idea how to use COM)?

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]*>//g': Remove the tags themselves (anything between < and >)
grep -v '^[[:space:]]*$': Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despite being a bit of an ugly hack ;)
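For what it's worth, wiring that pipeline into a Qt program might look something like this sketch (the /bin/sh path is Unix-only, and it assumes unzip, grep and sed are on the PATH):
#include <QProcess>
#include <QString>

// Run the extraction pipeline through a shell and capture its stdout.
QString extractDocxText(const QString &path)
{
    QString cmd = QString("unzip -p '%1' | grep '<w:t' | sed 's/<[^<]*>//g' "
                          "| grep -v '^[[:space:]]*$'").arg(path);
    QProcess proc;
    proc.start("/bin/sh", QStringList() << "-c" << cmd);
    proc.waitForFinished(-1);   // block until the pipeline exits
    return QString::fromUtf8(proc.readAllStandardOutput());
}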

Try Apache Tika

I recommend not using COM, as that would defeat the point of using a portable library like Qt in the first place.
You might want to use the classic catdoc or a similar tool such as wvWare.
Note that although the catdoc author claims that catdoc doesn't work under Windows, there is a posting from 2001 that states the opposite.

To read .doc files you can use the structured storage API. A .doc is basically a structured storage repository with various streams corresponding to the various parts of the document.
Be warned that it is quite a hairy API and that even using this API, a .doc file can be quite messy to look at.
Of course this is still Windows-only, but at least it's not COM, just a plain old C API.
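A bare-bones sketch of the storage side (the file name is a placeholder; actually decoding what is inside the WordDocument stream is where the mess really starts):
#include <windows.h>
#include <objbase.h>
#pragma comment(lib, "ole32.lib")

int wmain()
{
    IStorage *storage = nullptr;
    // Open the .doc as a structured storage repository, read-only.
    HRESULT hr = StgOpenStorage(L"test.doc", nullptr,
                                STGM_READ | STGM_SHARE_DENY_WRITE,
                                nullptr, 0, &storage);
    if (FAILED(hr)) return 1;

    IStream *stream = nullptr;
    // The main document content lives in the "WordDocument" stream.
    hr = storage->OpenStream(L"WordDocument", nullptr,
                             STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &stream);
    if (SUCCEEDED(hr)) {
        char buf[4096];
        ULONG read = 0;
        while (SUCCEEDED(stream->Read(buf, sizeof(buf), &read)) && read > 0) {
            // ... dig the text runs out of the raw bytes here ...
        }
        stream->Release();
    }
    storage->Release();
    return 0;
}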

This might help. It is cross-platform and has an API: http://www.winfield.demon.nl/
Otherwise, the IFilter approach is the way to go if this is Windows-only. It will allow you to parse anything that has an IFilter registered on your system. Here are some examples: http://the-lazy-programmer.com/blog/?p=8 . I have used IFilter from the C# end of things quite a bit.
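In C++, a minimal IFilter consumer might look roughly like this sketch (it assumes a filter is registered for the file's extension, and trims error handling; a FILTER_S_LAST_TEXT result would carry one final buffer, which is skipped here for brevity):
#include <windows.h>
#include <filter.h>     // IFilter, STAT_CHUNK, CHUNK_TEXT
#include <filterr.h>    // FILTER_* status codes
#include <ntquery.h>    // LoadIFilter
#include <iostream>
#pragma comment(lib, "query.lib")

int wmain(int argc, wchar_t *argv[])
{
    if (argc < 2) return 1;
    CoInitialize(nullptr);

    IFilter *filter = nullptr;
    // Loads whichever IFilter is registered for this file's extension.
    if (SUCCEEDED(LoadIFilter(argv[1], nullptr,
                              reinterpret_cast<void **>(&filter)))) {
        ULONG flags = 0;
        filter->Init(IFILTER_INIT_APPLY_INDEX_ATTRIBUTES, 0, nullptr, &flags);

        STAT_CHUNK chunk;
        while (filter->GetChunk(&chunk) == S_OK) {
            if (!(chunk.flags & CHUNK_TEXT)) continue;   // skip property chunks
            WCHAR buf[4096];
            ULONG cch = ARRAYSIZE(buf);
            while (filter->GetText(&cch, buf) == S_OK) {
                std::wcout.write(buf, cch);
                cch = ARRAYSIZE(buf);
            }
        }
        filter->Release();
    }
    CoUninitialize();
    return 0;
}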

Related

Reading data from an online text file in C++

Alright. What I'm after is, what seems to me, fairly simple.
I've got File I/O down to a fine art for basic text files.
But, what I need now, is a way to read a text file that's online.
Let's say, something like: http://www.iamawebsite.com.au/file.txt
I CAN download the file and store it locally, but that would produce a lot more pain for me in the future, and more so for redistribution of the end program, so if I can get around doing that, I will be forever grateful. (Also, if possible, I'd like to refrain from using any additional libraries. If I have to use one, I will, but if there's a way around that, I'm happy.)
I have looked around for a while on ways to do similar tasks, but they seem to be going for more than what I'm after, and skipping the small steps which are the ones I can't quite get.
(If it helps, using Windows 8, Visual Studio 2010 Ultimate, needs to work in Windows 7 and 8 if possible)
I tried many different things and I couldn't get anything to work without overcomplicating things to a ridiculous level. I also tried libcurl, but I couldn't manage to link it properly; not sure why.
I ended up using a combination of Batch and Powershell scripts, simple and powerful, and best of all it works.
If anybody is interested:
Batch script:
powershell -ExecutionPolicy Bypass "& 'fileUrl\name.ps1'"
Powershell script:
$webClient = New-Object System.Net.WebClient;
$url = "http://www.iamawebsite.com.au/iamafile.txt";
$file = "whereToSaveFile\desiredNameOfFile.txt";
$webClient.DownloadFile($url, $file);
I have both my Batch and Powershell files in the same directory, just to make it a little easier on myself
Thanks!
Maybe try the URLDownloadToCacheFile function.
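If that route appeals, a sketch (URL taken from the question; urlmon does the caching for you):
#include <windows.h>
#include <urlmon.h>
#include <fstream>
#include <iostream>
#include <string>
#pragma comment(lib, "urlmon.lib")

int main()
{
    char path[MAX_PATH];
    // Downloads into the IE cache and returns the local path of the copy.
    HRESULT hr = URLDownloadToCacheFileA(nullptr,
                                         "http://www.iamawebsite.com.au/file.txt",
                                         path, MAX_PATH, 0, nullptr);
    if (FAILED(hr)) return 1;

    std::ifstream in(path);          // from here it's ordinary file I/O
    std::string line;
    while (std::getline(in, line))
        std::cout << line << '\n';
    return 0;
}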
One way would be to use InternetOpen, then InternetOpenUrl, and finally InternetReadFile.
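Roughly like this sketch (WinINet, so Windows-only; error handling kept minimal, URL from the question):
#include <windows.h>
#include <wininet.h>
#include <iostream>
#include <string>
#pragma comment(lib, "wininet.lib")

int main()
{
    HINTERNET session = InternetOpenA("MyApp", INTERNET_OPEN_TYPE_PRECONFIG,
                                      nullptr, nullptr, 0);
    if (!session) return 1;

    HINTERNET url = InternetOpenUrlA(session,
                                     "http://www.iamawebsite.com.au/file.txt",
                                     nullptr, 0, INTERNET_FLAG_RELOAD, 0);
    if (url) {
        std::string contents;
        char buf[4096];
        DWORD read = 0;
        // InternetReadFile reports success with read == 0 at end of stream.
        while (InternetReadFile(url, buf, sizeof(buf), &read) && read > 0)
            contents.append(buf, read);
        std::cout << contents;
        InternetCloseHandle(url);
    }
    InternetCloseHandle(session);
    return 0;
}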

Convert .odt .doc .ods files to .txt files

I want to convert all my .odt, .doc, .xls, and .pdf files to .txt files, using a shell script or a Perl script.
There's a program for .odt files and the like:
odt2txt - available in the repos.
$ unoconv --format=txt document1.odt
Should produce document1.txt.
OpenOffice has a built-in document converter capable of handling a bunch of formats- take a look at unoconv: http://dag.wieers.com/home-made/unoconv/
That being said, I have had some troubles getting that to work in the past- If you're having trouble, take a look at similar programs for AbiWord (another open source word processor).
For Word documents, you can try antiword, at least on Linux. It's a command-line utility that takes a Word document as an argument and spits out the text from that document (as best as it can figure) to standard output. Maybe you can specify an output file too; I can't remember the details of how it works. I haven't used it in a while. Not sure if it can handle OO documents.
It's certainly possible to do this, though there is something strange and impenetrable about the OO project and its documentation that makes things like this hard to research and follow. However, OO has the capability to convert all of those types, not just the OO native ones, and it can do it via two different forms of automatic control.
These are the two general approaches.
You can start OO and tell it to execute a macro which does this job for you for a given file. You then just have to write the macro and a script to loop over your files. The syntax is something like
$ oowriter -headless filename macro://dir/Standard.Module1.sMySub
The other thing OO has is a network API. This is based on something called UNO.
$ oowriter -accept=accept-string
Notifies the OpenOffice.org software that upon the creation of
"UNO Acceptor Threads", a "UNO Accept String" will be used.
You will need some sort of client library. I think they have one for Python at least. Using this technology a Python program or some other scripting language with an OO client library could drive the program and convert all the files. Since OO reads MSO, it should be able to do all of them.
Open the file in LibreOffice. Click on "File", then "Save As", and scroll down to find the text option. Click that and it will be saved as a text file.
FYI, I had an *.ODT file that was 339.2 KB in size. When I saved it as text, the file shrank to only 5.0 KB. Another reason for saving your files as text files.
For the Microsoft formats, look into the wvWare tools.
Open the .ods file normally in LibreOffice
Highlight text to be converted
Open a terminal
Run vi
Press "i" to get insert mode
Press ctrl-shift-v
Done!
Need some formatting?
Save the file (as, say, filename)
Get out of vi
Run:
$ cat filename | column > filename2
This worked in openSUSE running KDE
Substitute "kwrite" for "vi" if you want

HD Regular Expression Search

I am working on a project for my computer security class and I have a couple questions. I had an idea to write a program that would search the whole hard drive looking for email addresses. I am just looking for addresses stored in plain text since it would be hard to find anything otherwise. I figured the best way to find addresses would be to use a regular expression.
I wrote an application in C# that works fairly well, but I would like to see if anyone has any better ideas. I am completely open to writing this in another language, since I'm assuming C# isn't the best for this type of thing. So far, the application I created starts at C:/ and recursively locates all files on the drive, skipping those that aren't accessible. It also skips all common image, video, audio, and compressed files, as well as files over 512 MB. This speeds it up quite a bit, but there is a small chance that a large file could contain something useful. It takes about 12 seconds to generate the list of files, and I'm guessing about an hour to check them all. One downside is that it uses about 50% CPU while scanning.
I'm looking for ideas on how to improve the search. Is there a faster way, a more efficient way, a more thorough way, things like that? I was trying to think if there was any way that you could tell if the file would contain plain text strings or not. Just let me know if you have any cool ideas. Thanks.
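For concreteness, the core of the scan described above might look like this in C++17 (the extension list and the email regex here are simplified stand-ins, not what the original C# version uses):
#include <cctype>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <regex>
#include <set>
#include <string>

namespace fs = std::filesystem;

int main()
{
    const std::set<std::string> skip = {".jpg", ".png", ".mp3", ".avi", ".zip"};
    const std::regex email(R"([\w.+-]+@[\w-]+\.[\w.]+)");   // rough pattern
    const std::uintmax_t maxSize = 512ull * 1024 * 1024;    // 512 MB cap

    std::error_code ec;   // swallow "access denied" instead of throwing
    for (fs::recursive_directory_iterator
             it("C:/", fs::directory_options::skip_permission_denied, ec), end;
         it != end; it.increment(ec)) {
        if (ec || !it->is_regular_file(ec)) continue;

        std::string ext = it->path().extension().string();
        for (char &c : ext) c = static_cast<char>(std::tolower(c));
        if (skip.count(ext) || it->file_size(ec) > maxSize) continue;

        std::ifstream in(it->path(), std::ios::binary);
        std::string line;
        while (std::getline(in, line)) {
            std::smatch m;
            if (std::regex_search(line, m, email))
                std::cout << it->path() << ": " << m.str() << '\n';
        }
    }
    return 0;
}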
To be honest, the easiest existing way to do this is to use grep. As you improve your program, compare your speeds to it, and when you get close, stop worrying about optimizing. Alternatively, take a look at its source for an example of an existing product that does what you're looking for.
As noted elsewhere, tools already exist for this if you install Win32 ports of UNIX tools. Alternatively, the Windows equivalent is:
for /r c:\ %i in (*.*) do findstr /i /r "regular expression" "%i"
You should just use grep + find. grep is optimized for searching files fast, and find is optimized for producing lists of appropriate files for jobs like this. People have spent a long time optimizing these tools; there's no need to reinvent the wheel.

ctags best practicies

I'm working on a 1M+ LOC C/C++ project on Solaris (remotely, via VNC or SSH). I have a daily-updated copy of the source code on my local machine too (Windows, just for browsing the code).
I use VIM and ctags combo (on both Solaris and Windows) but I'm not happy with results / speed. What settings for ctags would you recommend? There are a lot of options what should be tagged and how. Should I use single tag file per project, per dir or perhaps just one for everything?
Using anything less than one for everything doesn't really make sense to me. Being able to quickly jump around your project is what tags are for in the first place. For instance, our code is divided into 3 main sections, Include/, Processes/, Libraries/. Without being able to jump between these I would be incredibly unproductive.
Personally I use cscope (its C++ parsing isn't great, but it's OK, and its Vim integration is better than plain ctags), but when I do use ctags I usually just add --c++-kinds=+p.
I use etags:
find src1 src2 src3 | grep -v "\\.svn" | xargs etags --append
In Emacs, position the cursor on an identifier and press M-. ([Alt] + [period], or [Esc] followed by [period]).
I don't know how it compares to your setup as far as speed goes, or if you're willing to use emacs. I'm just posting in case you want to try some alternatives.

Do you know of a good program for editing/translating resource (.rc) files?

I'm building a C++/MFC program in a multilingual environment. I have one main (national) language and three international languages. Every time I add a feature to the program I have to keep the international languages up-to-date with the national one. The resource editor in Visual Studio is not very helpful because I frequently end up leaving a string, dialog box, etc., untranslated.
I wonder if you guys know of a program that can edit resource (.rc) files and
Build a file that includes only the strings to be translated and their respective IDs, and accept the same (or similar) file in another language (this would be helpful since the translation is usually done by someone else), or
Handle the translations itself, allowing you to view the same string in different languages at the same time.
In my experience, internationalization requires a little more than translating strings. Many strings, when translated, require more space in a dialog. Because of this, it's useful to be able to customize the dialogs for each language. Otherwise you have to create dialogs with extra space for the translated strings, which then looks less than optimal when displayed in English.
Quite a while back I was using a translation tool for an MFC application but the company that produced the software stopped selling it. When I tried to find a reasonably priced replacement I did not find one.
Check out Lingobit Localizer. Expensive, but well worth it.
Here's a script I use to generate resource files for testing in different languages. It just parses a response from Babelfish, so clearly the translation will be about as high quality as one done by a drunken monkey, but it's useful for testing and such.
for i in $trfile
do
key=`echo $i | sed 's/^\(.*\)=\(.*\)$/\1/g'`
value=`echo $i | sed 's/^\(.*\)=\(.*\)$/\2/g'`
url="http://babelfish.altavista.com/tr?doit=done&intl=1&tt=urltext&lp=$langs&btnTrTxt=Translate&trtext=$value"
wget -O foo.html -U "$agent" "$url" &> /dev/null
tx=`grep "<td bgcolor=white class=s><div style=padding:10px;>" foo.html`
tx=`echo $tx | iconv -f latin1 -t utf-8 | sed 's/<td bgcolor=white class=s><div style=padding:10px;>\(.*\)<\/div><\/td>/\1/g'`
echo $key=$tx
done
rm foo.html
Check out appTranslator, it's relatively cheap and works rather well. The guy developing it is really responsive to enhancement requests and bug reports, so you get really good support.
You might take a look at Sisulizer http://www.sisulizer.com. Expensive though. We're evaluating it for use at my company to manage the headache of ongoing translation. I read on their About page that the company was founded by people who left Multilizer and other similar companies.
If there isn't one, it would be pretty easy to loop through all the strings in a resource and compare them to the international resources. You could probably do this with a simple grid.
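That loop is easy enough to sketch with plain Win32, assuming each language is built as a resource-only DLL (the DLL names and the ID range here are invented for illustration):
#include <windows.h>
#include <cstdio>

int main()
{
    // Load each language's resources as data, without running any code.
    HMODULE en = LoadLibraryExW(L"res_en.dll", nullptr, LOAD_LIBRARY_AS_DATAFILE);
    HMODULE de = LoadLibraryExW(L"res_de.dll", nullptr, LOAD_LIBRARY_AS_DATAFILE);
    if (!en || !de) return 1;

    for (UINT id = 1; id < 0x10000; ++id) {
        wchar_t a[512], b[512];
        int lenEn = LoadStringW(en, id, a, 512);
        int lenDe = LoadStringW(de, id, b, 512);
        if (lenEn > 0 && lenDe == 0)                 // present in English only
            wprintf(L"ID %u is untranslated: %s\n", id, a);
    }
    FreeLibrary(en);
    FreeLibrary(de);
    return 0;
}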
In the end we ended up building our own external tools to manage this. Our devs work in the English string table, and every automated build sends the strings that have been added, changed, or deleted to our translation manager. He can also run a report at any time from an old build to determine what is required for translation.
Check out RC-WinTrans. It's a commercial tool that my company uses. It basically imports our .RC files (or .resx files) into a database, which we send to a different office for translation. The tool can then export a translated .RC file (or .resx file) for each language from the database. It even has a basic dialog box editor so the translator can adjust the size of various controls in the dialog box to be sure the translated text fits.
It also accepts a number of command line arguments and has a COM automation interface so you can integrate it into a build process more easily. It works quite well for us and we literally have thousands and thousands of strings and dialog boxes, etc.
(We currently have version 7 so what I've said might be a little bit different than their latest version 8.)
Also try AppTranslator: http://www.apptranslator.com/. It has a built-in resource editor so that translators can, for example, enlarge a text box when need be. It has separate versions for developers and translators, and much more.
We are using Multilizer (http://www.multilizer.com/), and although it's sometimes a bit tricky to use, in the end, with a bit of patience, it works pretty well.
We even have a translation web site where translators can download our projects and then upload the translations using Multilizer command-line features.
Managing localization and translations using .rc files and Visual Studio is not a good idea. It's much smarter (though counter-intuitive) to start localization through the exe. Read here why: http://www.apptranslator.com/misconceptions.html
I've written this recently, which integrates into VS:
https://github.com/ekkis/Powershell/blob/master/MT.ps1
Largely because I was unsatisfied with the solutions out there. You'll need to get a client ID from M$ (but they give you 2M words/month of translation for free - not bad).
ResxCrunch will be out soon, it will edit multiple resource files in multiple languages in one single table.