Reading text from binary file like PDF

I have a problem reading a binary file in C++. Currently my code looks like this:
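FILE *s = fopen(source, "rb");
if (s == NULL) {
    fputs("File error", stderr);
    exit(1);
}
fseek(s, 0, SEEK_END);
size_t size = ftell(s);
rewind(s);
char *sbuffer = (char *) malloc(sizeof(char) * size);
if (sbuffer == NULL) {
    fputs("Memory error", stderr);
    exit(2);
}
size_t result = fread(sbuffer, 1, size, s);
if (result != size) {
    fputs("Reading error", stderr);
    exit(3);
}
fclose(s);
// sbuffer is not NUL-terminated, so write exactly size bytes
cout.write(sbuffer, size);
cout << endl;
free(sbuffer);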
However, the characters printed on the terminal are all random characters instead of what I wrote in the PDF file. They look like this:
% P D F - 1 . 3
% ? ? ? ? ? ? ? ? ? ? ?
4 0 o b j
< < / L e n g t h 5 0 R / F i l t e r / F l a t e D e c o d e > >
s t r e a m
x ? ? ? j ? 0 E ? ? ? k ? y Q E # ? ? ? m ? & ? ? # % + ? . ? ? ? ? A i ? 4 z \ 1 G W ? ? - , ? ? ? ( ? ? ? 9 ? ? ? ? ? \ ? } ? ? ? e ? ? ? ? 0 ? ? ? ~ ? , ? ? & 8 ? ? x e 4 ? r
| ? ? ?
? ? ? ? E > a ? ? z & ? Z ? < ? } ' ? ? ? j p ? ? Q 7 0 ? ? ? S % - p ? ? ? 7 D ? ? ? ' Q z Q ? ? ? ? ? ? ? ? ? ? \ 2 ? ? 7 ? ? ? < ? ? D ~ ? ? ?
e n d s t r e a m
e n d o b j
5 0 o b j
2 2 8
e n d o b j
2 0 o b j
And many other characters like the above. I searched for a long time but cannot find out how to get the actual characters out for later processing. By the way, I'm trying to write a compressor which takes a binary file as input and produces one as output. Any help here is highly appreciated!

Only a few file formats, like plain .TXT text files, can be "read" and "understood" directly. Most file formats, including almost any binary format, are exactly that: a format. This implies a certain structure held inside the file, in complete contrast to a .TXT file, which is structureless, or rather, one huge block of pure data.
Open WordPad, Word, or any other at least somewhat intelligent text editor, write some text, and save it as RTF, DOC, ODT or any other non-TXT format. Then save it as a TXT file too.
Download a hex viewer/hex editor. Any one will do. Take one of the free ones; you don't need many features, just one that displays raw binary values in one column and ASCII text in the other. Almost any free hex viewer/editor can do that.
Open and compare those two files. You will immediately see the difference.
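If you would rather not install a tool, the two-column view described above can be approximated in a few lines. The following is only a minimal sketch: the function name hex_dump and its row layout are my own choices for illustration, not taken from any particular hex viewer.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format bytes as hex-viewer rows: offset, hex values, then printable ASCII.
// Bytes that are not printable ASCII are shown as '.' in the ASCII column.
// (hex_dump is a hypothetical helper name used only for this sketch.)
std::string hex_dump(const std::vector<unsigned char>& data,
                     std::size_t bytes_per_row = 16) {
    std::string out;
    char field[32];
    for (std::size_t row = 0; row < data.size(); row += bytes_per_row) {
        std::snprintf(field, sizeof field, "%08zx  ", row);
        out += field;
        std::string ascii;
        for (std::size_t i = 0; i < bytes_per_row; ++i) {
            if (row + i < data.size()) {
                const unsigned char byte = data[row + i];
                std::snprintf(field, sizeof field, "%02x ", byte);
                out += field;
                ascii += std::isprint(byte) ? static_cast<char>(byte) : '.';
            } else {
                out += "   ";  // pad a short final row so the ASCII column lines up
            }
        }
        out += " |" + ascii + "|\n";
    }
    return out;
}
```

Feeding it the first bytes of a PDF should show the readable %PDF marker in the right-hand column, with dots wherever the bytes are not printable text.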
Back to the PDF:
A PDF can even contain graphics interleaved with the text. How would you expect to keep them if the text were "just sitting in the file" as in a TXT? How would the image position/description/data be embedded? A PDF can even contain scripts, if I remember correctly, similar to JavaScript. Executable ones. In a PDF-type document you can have buttons that do something. That's much more complicated than just text in a file. In your dump, the /Filter /FlateDecode entry means that the data between stream and endstream is Deflate-compressed, which is why it looks random.
Binary files usually do not contain text that is plainly readable to your eyes. They have that text structured in blocks, wrapped in metadata about colors, text layout, paging and such, or even special structures for document versioning, authoring, classification, (...). All of this has to be stored somewhere.
Usually, binary files have sections. The first section is usually called the HEADER. Inside, there will be information about format type, format version, file/block/data length, image resolution, and similar. All of this will most probably be kept in binary form: no "800x600" text, just "|00|00|03|20|00|00|02|58|", assuming 32-bit BE. After you have read, decoded and understood the description, you will know where the actual data starts, how the data blocks are laid out, and how to decode them and understand what they contain.
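To make the "800x600" example concrete, here is a small sketch of decoding those eight bytes. The ImageHeader layout (width followed by height) is a made-up format for illustration only; a real format defines its own fields, and you have to read its specification to know them.

```cpp
#include <cstdint>

// Decode a 32-bit big-endian unsigned integer from four consecutive bytes.
std::uint32_t read_u32_be(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Hypothetical header layout, assumed only for this example:
// bytes 0..3 hold the width, bytes 4..7 hold the height.
struct ImageHeader {
    std::uint32_t width;
    std::uint32_t height;
};

ImageHeader parse_header(const unsigned char* bytes) {
    return ImageHeader{read_u32_be(bytes), read_u32_be(bytes + 4)};
}
```

Passing the bytes 00 00 03 20 00 00 02 58 to parse_header gives width 800 and height 600, since 0x320 is 800 and 0x258 is 600. Assembling the value with shifts, rather than casting the buffer to an integer pointer, keeps the result independent of the byte order of the machine running the code.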
edit:
After you understand the difference between text files and binary files, check out the absolute basics at http://en.wikipedia.org/wiki/Entropy_(information_theory). Then try playing with RLE (http://www.daniweb.com/software-development/cpp/code/216388/basic-rle-file-compression-routine) or Huffman (http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html) just to start on something relatively simple. Then read more about Huffman codes, and then, well, you will be reasonably prepared for the task, like ZIP or LZH.
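As a starting point for the RLE experiment, here is a minimal sketch of one common variant, in which every run is stored as a (count, byte) pair. It is not the routine from the linked page; the function names and the one-byte count are my own assumptions.

```cpp
#include <cstddef>
#include <string>

// Run-length encode: each run becomes a (count, byte) pair, count in 1..255.
// std::string is used as a byte container, so embedded zero bytes are fine.
std::string rle_encode(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) {
            ++run;
        }
        out += static_cast<char>(run);
        out += in[i];
        i += run;
    }
    return out;
}

// Inverse of rle_encode: expand each (count, byte) pair.
std::string rle_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        const std::size_t count = static_cast<unsigned char>(in[i]);
        out.append(count, in[i + 1]);
    }
    return out;
}
```

Note that input without long runs grows under this scheme, because every lone byte becomes two. That is exactly what happens with data that is already compressed, such as the FlateDecode streams in a PDF, and it is a good way to see why entropy matters.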

To parse a PDF as text, use a PDF library such as GNU PDF or Poppler.

Related

Excel nesting - IF / AND Query part two?

Hi, I had a query earlier and thought I had cracked it with the help of Richard, but it doesn't appear
I have attached an image and what I am trying to achieve to make my query clearer.
* If E is correct then cell F will be set to match D manually
* If E is yes and F is set to 111 then G will populate with the contents of C
* If E is no and F is set to anything but 111 then it will return 0
* If E is correct then cell F will be set to match D manually
* If E is yes and F is set to 112 then H will populate with the contents of C
* If E is no and F is set to anything but 112 then it will return 0
* If E is correct then cell F will be set to match D manually
* If E is yes and F is set to 118 then I will populate with the contents of C
* If E is no and F is set to anything but 118 then it will return 0
* If E is correct then cell F will be set to match D manually
* If E is yes and F is set to 119 then J will populate with the contents of C
* If E is no and F is set to anything but 119 then it will return 0

It's not 100% clear, but sounds like this is what you're after:
F2 = =IF(E2="Yes",IF(OR(D2=111,D2=112,D2=118,D2=119)=TRUE,D2,""),"")
G2 = =IF(AND(E2="Yes",F2=111)=TRUE,C2,"")
H2 = =IF(AND(E2="Yes",F2=112)=TRUE,C2,"")
I2 = =IF(AND(E2="Yes",F2=118)=TRUE,C2,"")
J2 = =IF(AND(E2="Yes",F2=119)=TRUE,C2,"")
Then just fill down. I've put "" instead of 0 because it's a lot easier to see what's going on without zeros everywhere. You can change them back once you're happy with the outcome.
Incidentally, sometimes it's easier to split a formula across lines. Excel works fine if the formula spans several lines, like the following for F2:
=
IF(
    E2="Yes",
    IF(
        OR(
            D2=111,D2=112,D2=118,D2=119
        )=TRUE,
        D2,
        ""
    ),
    ""
)

Use a regular expression extract substring from data frame columns in R

I am fairly new to R so please go easy on me if this is a stupid question.
I have a dataframe called foo:
> head(foo)
  Old.Clone.Name New.Clone.Name File
1 A              Aa             A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2 B              Bb             B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3 C              Cc             C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4 D              Dd             D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5 E              Ee             F_mask_MF_final_IS2_SAED9-1_02.nrrd
6 F              Ff             F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:
SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02
I then want to use these codes to search another directory for other files that contain the same code.
I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.
I have tried:
library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))
but this just returns [1] NA.
Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.

Hello, if you are reading the data in as a table, then foo[3] is a list (a one-column data frame), and str_extract does not accept lists, only character vectors. You should use lapply to extract the match from every element:
lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))
Result:
[1] "SAEE7-1_02" "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02" "SAED9-1_02"
[6] "SAGP3-1_02"

str_extract(foo[3],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")
seems to work. Somehow, my R gave me
"Error in check_pattern(pattern, string) : could not find function "regex""
when using your original expression.

The following code reproduces what you asked for (just copy and paste it into your R console):
library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd

foo = matrix(foo,ncol=3,byrow=T)
colnames(foo)=foo[1,]
foo = foo[-1,]
foo
str_extract(foo[,3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = T))
The reason you get NA is subtle: R stores matrix entries by column, so for a matrix foo[3] is the element in the 3rd row and 1st column. For a data frame, foo[3] is a one-column data frame rather than a character vector. To select the third column as a vector, use foo[,3], or foo<-data.frame(foo); foo[[3]].

Execute two commands in single-line If -statement

I want to execute two commands as part of a single-line If statement.
Although the snippet below runs without error, the variable $F_Akr_Completed is not set to 1, yet the MsgBox() is displayed properly (with "F is 0").
$F_Akr_Completed = 0
$PID_Chi = Run($Command)
If $F_Akr_Completed = 0 And Not ProcessExists($PID_Chi) Then $F_Akr_Completed = 1 And MsgBox(1,1,"[Info] Akron parser completed. F is " & $F_Akr_Completed)
Any idea why there is no syntax-error reported when it's not functional?

There is no error, because
If x = x Then x And x
is a valid statement, and x And x is a logical expression. There are many ways you can do this, e.g.:
If Not ($F_Akr_Completed Or ProcessExists($PID_Chi)) Then $F_Akr_Completed = 1 + 0 * MsgBox(1,1,"[Info] Akron parser completed. F is " & 1)
But that is bad coding style. AutoIt is a mostly verbose language, and I recommend separating multiple statements.
You can also assign values using the ternary operator:
$F_Akr_Completed = (Not ($F_Akr_Completed Or ProcessExists($PID_Chi))) ? 1 : 0
which is the same as
$F_Akr_Completed = Int(Not ($F_Akr_Completed Or ProcessExists($PID_Chi)))

Multiplication of NDArray in oct file

I am trying to convert an .m file to an .oct file in Octave and part of the .m file code is:-
for hh = 1 : nt
bi_star = bi_star + A( : , : , hh ) * data( : , : , hh + 1 )' ;
end
where both "A" and "data" are of NDArray type. I have tried to extract the values from the NDArrays by using something like
A.extract( 0 , 0 , num_dims-1 , num_dims-1 , hh ) ;
but get the error message
error: ‘class NDArray’ has no member named ‘extract’
when compiling. The only other way I can think of doing this at the moment is to put nested loops inside the hh loop, looping over both "A" and "data" to fill intermediate matrices, and then do the matrix multiplications and additions on those intermediate matrices. However, this seems a very long-winded way of doing things. Is there a more efficient way of accomplishing this?

Thanks to Andy's answer here and its included link to the Octave sources page, I have been able to figure out that I simply need to use:
A.page ( hh )
to accomplish what I want.
Thanks Andy!

filter dplyr's tbl_df using variable names

I am having trouble filtering dplyr's tbl_df (and likewise a regular data.frame). I have a big tbl_df (500x30K) and need to filter it.
So what I would like to do is:
filter(my.tbl_df, row1>0, row10<0)
which would be similar to
df[df$row1>0 & df$row10<0,]
Works great. But I need to build the filter functions dynamically while running, so I need to access the DF/tbl_df columns by one or multiple variables.
I tried something like:
var=c("row1","row10")
op=c(">","<")
val=c(0,0)
filter(my.tbl_df, eval(parse(text=paste(var,op,val,sep=""))))
Which gives me an error: not compatible with LGLSXP
This seems to be deeply rooted in the Cpp code.
I would be thankful for any hint. Pointing out how to do the "string to environment variable" conversion would also be helpful, since I am pretty sure that I am doing it wrong.
With best,
Mario

This is related to this issue. In the meantime, one way could be to construct the whole expression, i.e.:
> my.tbl_df <- data.frame( row1 = -5:5, row10 = 5:-5)
> call <- parse( text = sprintf( "filter(my.tbl_df, %s)", paste(var,op,val, collapse="&") ) )
> call
expression(filter(my.tbl_df, row1 > 0&row10 < 0))
> eval( call )
row1 row10
1 1 -1
2 2 -2
3 3 -3
4 4 -4
5 5 -5
