PoDoFo Extract text + coords from a pdf - c++

I have been trying for a while to use the PoDoFo C++ library to extract text and lines (with their respective coordinates), but I have not found a way to do this.
This is what I have so far:
#include <iostream>
#include <stdio.h>
#include <vector>
#include <podofo/podofo.h>
using namespace PoDoFo;
using namespace std;
int main( int argc, char* argv[] )
{
const char* filename = "hello.pdf";
PdfVecObjects *x = new PdfVecObjects();
PdfParser parser(x, filename);
parser.ParseFile("hello.pdf");
for (TIVecObjects obj = x->begin(); obj != x->end(); obj++){
PdfObject * a = x->RemoveObject(obj);
// THIS IS MY PROBLEM VVVVVVVVVV
cout << a->Reference().ToString() << endl;
}
return 0;
}
However, this only gives me incredibly basic information (seems to be object number)
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
1 0 R
2 0 R
3 0 R
4 0 R
5 0 R
6 0 R
7 0 R
8 0 R
9 0 R
10 0 R
11 0 R
I want to print out the coordinates of an object, and if it's a line or text. If it's text, I would also like to be able to print out the text. Does anyone that knows this library better than I do know what I could do to fix this?

This answer will show you how to extract the text.
To get text positioning information, you will also have to process the following commands:
Tc, Tw, Tz, TL, T*, Tr and Tm.
You definitely need to download the PDF spec from Adobe to get all the details. There is a chapter devoted entirely to text processing. It is well worth your time to print out that chapter as you will be referring to it a lot. Everything you need to know is in there, but it's not always obvious.
You will also need to use a bit of Linear Algebra. Nothing too complicated, though.
Since there are many ways to achieve the same results, it is important to implement all the commands thoroughly, even if the documents you are going to process might not seem to need certain features. For example: I ran across a document which set all text sizes to one point, which threw off all my calculations until I realized it was using the text scaling factor to set the actual font sizes.

Use the PoDoFo tool "podofotxtextract" (in the tools folder of the PoDoFo package). It extracts text from a PDF and gives you the x,y coordinates.

Related

Reading text files with inconsistent format

I am trying to perform some operations on a text file containing a repetition of a C based string and some numbers. My code successfully carried out the operation on the first set but it would not get to the remaining sets.
Please see the content of the text file below:
Max Scherzer 2017
6.2 4 2 2 2 7
6.0 4 3 1 2 10
mod Cameron 2018
6.4 4 1 2 1 3
6.0 4 3 5 2 8
John Brandonso 2019
6.1 1 3 5 2 7
6.5 4 7 3 4 10
I have used .eof() and it completely messed up what I am doing.
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
using namespace std;
int main()
{
char playername [25];
int season;
ifstream gamefilein;
gamefilein.open("C:\\Users\\troy\\Desktop\\GAME_SCORE\\gameinfo.txt");
if(!gamefilein)
{
cout<<"unable to open file";
}
double IP;
int H,R,ER,BB,K;
int counter=0;
double totalscore=0;
while(!gamefilein.fail())
{
gamefilein.get(playername,25);
gamefilein>>season;
cout<<playername<<season<<endl;
cout<<"Game Scores:"<<endl;
while(gamefilein>>IP>>H>>R>>ER>>BB>>K)
{
int IPa=IP;
int IPb=(IP-IPa)*10;
int IPc=0;
if(IPa>4)
{
IPc=IPa-4;
}
int score=50+(IPa*3)+(IPb*1)+(IPc*2)+(K*1)-(H*2)-(ER*4)-((R-ER)*2)-(BB*1);
cout<<score<<endl;
counter++;
totalscore+=score;
}
cout<<"Number of Games Started: "<<counter<<endl;
cout<<fixed<<setprecision(2)<<"Average Game Score: "<<(totalscore/counter)<<endl<<endl;
}
gamefilein.close();
return 0;
}
I get the below result, but I want the same result for the rest of the information in the text file, for example, I am expecting two more results like the one I have below.
Max Scherzer 2017
Game Scores:
63
64
Number of Games Started: 2
Average Game Score: 63.50
Aren't you reading the file as a char array?
If I read this correctly, you are trying to extract an int and a double from a char array that holds the numbers as a STRING, right?
E.g. the string "6.2" is different from the double 6.2 in memory, which is why it can't work.
You also seem to have a lot of spaces, which you should not forget about either.
Where do you get that string to begin with? I would recommend changing the creation of that file to a more convenient format, e.g. csv or json.
I just solved my problem myself. The problem occurred when the inner loop that reads the integers and doubles finished a data set and ran into the character-based string at the start of the next one: the failed extraction left the stream in a failed state. So I inserted a call to the clear member function at the point where I check for end of file
(gamefilein.clear())
and that solved my problem.
Thanks for attempting to help.

parsing /proc/stat using c++

I'm new to C++ and fairly new to Linux. I have a simple project that needs to parse CPU stats from the /proc/stat file and compute CPU usage. I have tried doing it entirely in a bash script, but what I need is C++. I just need a little help. /proc/stat gives a lot of numbers, and I know the different columns represent different things, like User, Nice, System, Idle, etc. For example, I just want to get the Idle value and store it as an integer using C++. How would I do it? Please help. What I have tried so far is just getting the whole line I need using ifstream and getline()
std::ifstream filestat("/proc/stat");
std::string line;
std::getline(filestat,line);
and what I get is this.
cpu 349585 0 30513 875546 0 935 0 0 0 0
To clarify my question: for example, I want to get the 875546 value and store it in an integer using C++. How would I do it? Thank you.
The format of stat is described in detail under the proc(5) manual page.
You can see it either by running the command man 5 proc from a Linux terminal or online.
The methods described in the other answers for parsing the stat file are fine for academic purposes, but a production-grade parser should take extra precautions.
If you need a production-grade parser in C++ for files in /proc, you can check out pfs - A library for parsing the procfs. (Disclaimer: I'm the author of the library)
The biggest issue is usually the comm field (the second field of the per-process /proc/[pid]/stat file).
According to the man pages, this field is a string that should be "scanned" using some scanf flavor and the formatter %s. But that is wrong!
The comm field is controlled by the application (it can be set using prctl(PR_SET_NAME, ...)) and can easily include spaces or brackets, causing 99% of the parsers out there to fail.
And a simple change like that won't just return a bad comm value, it will also throw off all the values that come after it.
The right way to parse the file is one of the following:
Option #1:
Read the entire content of the file
Find the first occurrence of '('
Find the last occurrence of ')'
Assign to comm the string between those indices
Parse the rest of the file after the last occurrence of ')'
Option #2:
Read the PID (the first value in the file)
Read 18 bytes (16 is the largest comm value + 2 for the wrapping brackets)
Extract the comm value from that buffer just like we did for option #1
Find out the actual length of the value, fix your stream and continue reading from there
You really need to study up on how file input works. This should be simple enough. You just need to ignore the first 3 characters "cpu" and then read through 4 integer values:
unsigned n;
if(std::ifstream("/proc/stat").ignore(3) >> n >> n >> n >> n)
{
// use n here...
std::cout << n << '\n';
}
Alternatively if you already have the line (maybe you are reading the file one line at a time) you can use std::istringstream to turn the line into a new input stream:
std::ifstream filestat("/proc/stat");
std::string line;
std::getline(filestat, line);
unsigned n;
if(std::istringstream(line).ignore(3) >> n >> n >> n >> n)
{
// use n here...
std::cout << n << '\n';
}
There are several ways to solve the problem. You can use a regular expression library to get that part of the string, or, if you know it is always going to be the 5th element, you can use this:
std::string text = "cpu 349585 0 30513 875546 0 935 0 0 0 0";
std::istringstream iss(text);
std::vector<std::string> results(std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>());
int data = std::stoi( results[4] ); //check size before accessing
std::cout << data << std::endl;
I hope it helps.

Misunderstood the usage of cin.unget() while handling I/O files C++

For example, I have a file named Bjarne.txt and in it there's the integers:
16 2 3 4
I have made a program to read the integers inside the file and output them in the console window. However, I'm trying to use cin.unget() to understand what it actually does. Here's the source code:
#include <iostream>
#include <fstream>
using namespace std;
int main () {
ifstream ifs("Bjarne.txt");
int a;
for(int i = 0;i<4;++i){
ifs>>a;
cout<<endl<<a;
if(i==0){
ifs.unget();
}
}
return 0;
}
And the output is:
16 6 2 3
Why is the output like that? (It should be 16 2 3 4.) It only occurs when I put ifs.unget() in the program, so my questions are: what is the purpose of cin.unget() when working with files, and why is the number 6 (part of 16) being output?
Thanks in advance for any help.
Something wrong with the documentation?
Makes the most recently extracted character available again.
At the end of your first loop iteration, 6 was the last extracted character (as the final digit of the extracted formatted int with value 16).
Unget does exactly that: it un-gets it.
The next operation has the 6 to work with. So, surprise, you get 6 next time.

Reading a file of number into an array while skipping first two values every 1026 entries

I am trying to read in a text file of numbers in which there are 2 values at the beginning that I do not care about, followed by 1024 values that I do care about. The file has approximately 100000 entries, and I need to do a calculation on each set of 1024 values. The format is something like
1
1025
3000
3572
3579
4023
3593
2930
.
.
.
1
1025
.
.
.
So basically the 1 and the 1025 are header values describing the data set, which I need to ignore. Then I need to read every value after those header values into an array so I can run calculations on the values in the array. I was thinking of using while(!file.eof()), but I cannot think of how to have the code skip those two numbers while it reads through the 100000 entries. I am pretty new to C++. I usually use GUIs to do my data analysis, but I am on a project that requires me to use C++, so I'm really out of my comfort zone here. I appreciate any help I can get.
There are a lot of ways you can do it. The most straightforward example I could think of was:
#include <iostream>
#include <string>
int main()
{
int i = 0;
std::string s;
while( std::cin >> s )
{
if( i++ < 2 ) continue;
std::cout << s << '\n';
if( i == 1026 ) i = 0; // 2 header values + 1024 data values per block
}
}

Using C++ libraries in an R package

What is the best way to make use of a C++ library in R, hopefully preserving the C++ data structures? I'm not at all a C++ user, so I'm not clear on the relative merits of the available approaches. The R-ext manual seems to suggest wrapping every C++ function in C. However, at least four or five other means of incorporating C++ exist.
Two of them are packages with a similar lineage, Rcpp (maintained by the prolific overflower Dirk Eddelbuettel) and RcppTemplate (both on CRAN). What are the differences between the two?
Another package, rcppbind, is available on R-Forge and claims to take a different approach to binding C++ and R (I'm not knowledgeable enough to tell).
The package inline, available on CRAN, claims to allow inline C/C++. I'm not sure how this differs from the built-in functionality, aside from allowing the code to be inline with R.
And finally there is RSwig, which appears to be in the wild, but it is unclear how well supported it is, as the author's page hasn't been updated for years.
My question is: what are the relative merits of these different approaches? Which are the most portable and robust, and which are the easiest to implement? If you were planning to distribute a package on CRAN, which of the methods would you use?
First off, a disclaimer: I use Rcpp all the time. In fact, when RcppTemplate (as Rcpp had been renamed by then) had already been orphaned and without updates for two years, I started to maintain it under its initial name of Rcpp (under which it had been contributed to RQuantLib). That was about a year ago, and I have made a couple of incremental changes that you can find documented in the ChangeLog.
Now RcppTemplate has very recently come back after a full thirty-five months without any update or fix. It contains interesting new code, but it appears that it is not backwards compatible so I won't use it where I already used Rcpp.
Rcppbind was not very actively maintained whenever I checked. Whit Armstrong also has a templated interface package called rabstraction.
Inline is something completely different: it eases the compile / link cycle by 'embedding' your program as an R character string that then gets compiled, linked, and loaded. I have talked to Oleg about having inline support Rcpp which would be nice.
Swig is interesting too. Joe Wang did great work there and wrapped all of QuantLib for R. But when I last tried it, it no longer worked due to some changes in R internals. According to someone from the Swig team, Joe may still work on it. The goal of Swig is larger libraries anyway. This project could probably do with a revival but it is not without technical challenges.
Another mention should go to RInside which works with Rcpp and lets you embed R inside of C++ applications.
So to sum it up: Rcpp works well for me, especially for small exploratory projects where you just want to add a function or two. Its focus is ease of use, and it allows you to 'hide' some of the R internals that are not always fun to work with. I know of a number of other users whom I have helped on and off via email. So I would say go for this one.
My 'Intro to HPC with R' tutorials have some examples of Rcpp, RInside and inline.
Edit: So let's look at a concrete example (taken from the 'HPC with R Intro' slides and borrowed from Stephen Milborrow who took it from Venables and Ripley). The task is to enumerate all possible combinations of the determinant of a 2x2 matrix containing only single digits in each position. This can be done in clever vectorised ways (as we discuss in the tutorial slides) or by brute force as follows:
#include <Rcpp.h>
RcppExport SEXP dd_rcpp(SEXP v) {
SEXP rl = R_NilValue; // Use this when there is nothing to be returned.
char* exceptionMesg = NULL; // msg var in case of error
try {
RcppVector<int> vec(v); // vec parameter viewed as vector of ints
int n = vec.size(), i = 0;
if (n != 10000)
throw std::length_error("Wrong vector size");
for (int a = 0; a < 9; a++)
for (int b = 0; b < 9; b++)
for (int c = 0; c < 9; c++)
for (int d = 0; d < 9; d++)
vec(i++) = a*b - c*d;
RcppResultSet rs; // Build result set to be returned as list to R
rs.add("vec", vec); // vec as named element with name 'vec'
rl = rs.getReturnList(); // Get the list to be returned to R.
} catch(std::exception& ex) {
exceptionMesg = copyMessageToR(ex.what());
} catch(...) {
exceptionMesg = copyMessageToR("unknown reason");
}
if (exceptionMesg != NULL)
Rf_error(exceptionMesg);
return rl;
}
If you save this as, say, dd.rcpp.cpp and have Rcpp installed, then simply use
PKG_CPPFLAGS=`Rscript -e 'Rcpp:::CxxFlags()'` \
PKG_LIBS=`Rscript -e 'Rcpp:::LdFlags()'` \
R CMD SHLIB dd.rcpp.cpp
to build a shared library. We use Rscript (or r) to ask Rcpp about its header and library locations. Once built, we can load and use this from R as follows:
dyn.load("dd.rcpp.so")
dd.rcpp <- function() {
x <- integer(10000)
res <- .Call("dd_rcpp", x)
tabulate(res$vec)
}
In the same way, you can send vectors, matrices, ... of various R and C++ data types back and forth with ease. Hope this helps somewhat.
Edit 2 (some five+ years later):
So this answer just got an upvote and hence bubbled up in my queue. A lot of time has passed since I wrote it, and Rcpp has gotten a lot richer in features. So I very quickly wrote this
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::IntegerVector dd2(Rcpp::IntegerVector vec) {
int n = vec.size(), i = 0;
if (n != 10000)
throw std::length_error("Wrong vector size");
for (int a = 0; a < 9; a++)
for (int b = 0; b < 9; b++)
for (int c = 0; c < 9; c++)
for (int d = 0; d < 9; d++)
vec(i++) = a*b - c*d;
return vec;
}
/*** R
x <- integer(10000)
tabulate( dd2(x) )
*/
which can be used as follows with the code in a file /tmp/dd.cpp
R> Rcpp::sourceCpp("/tmp/dd.cpp") # or from any other file and path
R> x <- integer(10000)
R> tabulate( dd2(x) )
[1] 87 132 105 155 93 158 91 161 72 104 45 147 41 96
[15] 72 120 36 90 32 87 67 42 26 120 41 36 27 75
[29] 20 62 16 69 19 28 49 45 12 18 11 57 14 48
[43] 10 18 7 12 6 46 23 10 4 10 4 6 3 38
[57] 2 4 2 3 2 2 1 17
R>
Some of the key differences are:
simpler build: just sourceCpp() it; even executes R test code at the end
full-fledged IntegerVector type
exception-handling wrapper automatically added by sourceCpp() code generator