Read an image or pdf using C++ without external library


After reading about Java and C#, I was wondering whether C++ can also read image and PDF files without external libraries. C++ doesn't have a byte type like Java and C# do, so how can we accomplish the task (again, without using an external library)?
Can anyone give a small demonstration (i.e. a program or code to read, copy, or write image or PDF files)?

You can use unsigned char, or char reinterpreted as an integer type, to parse binary file formats such as PDF or JPEG. You can create a buffer as a std::vector<char> and fill it as follows:
std::vector<char> buffer((
std::istreambuf_iterator<char>(infile)), // Ensure infile was opened with binary attribute
(std::istreambuf_iterator<char>()));
Related questions: Reading and writing binary file

It makes no difference which file you read in binary mode; the only difference is how you interpret the data you get from the file.
It's significantly better to take a ready-to-use library such as libjpeg; there are plenty of them. But if you really want to do this yourself, first define suitable structures and constants (see the links below) to make the code convenient and usable. Then you just read the data and interpret it step by step. The code below is just pseudocode; I didn't compile it.
#include <fstream>
#include <stdexcept>
#include <string>
// define header structure
struct jpeg_header
{
    // 0xffd8 is SOI (start of image); EOI (end of image) is 0xffd9
    enum class marker: unsigned short { soi = 0xffd8, eoi = 0xffd9, sof0 = 0xffc0 ... };
    ...
};
bool is_soi(unsigned short m) { return static_cast<unsigned short>(jpeg_header::marker::soi) == m; }
jpeg_header read_jpeg_header(const std::string& fn)
{
    std::ifstream inf(fn, std::ifstream::binary);
    if (!inf)
    {
        throw std::runtime_error("Can't open file: " + fn);
    }
    inf.exceptions(std::ifstream::failbit | std::ifstream::eofbit);
    // a JPEG file starts with the two-byte SOI marker, stored big-endian
    unsigned short marker = static_cast<unsigned short>(inf.get() << 8);
    marker |= static_cast<unsigned short>(inf.get());
    if (!is_soi(marker))
    {
        throw std::runtime_error("Invalid jpeg header");
    }
    ...
    jpeg_header header;
    // read further and fill header structure
    ...
    return header;
}
To read large blocks of data, use the ifstream::read() and ifstream::readsome() methods. Here is a good example: http://en.cppreference.com/w/cpp/io/basic_istream/read.
Those functions are also faster than stream iterators. It's also better to define your own exception classes derived from std::runtime_error.
For details on the file formats you are interested in, look here:
Structure of a PDF file?
https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format
https://en.wikipedia.org/wiki/JPEG

It would be a strange world to have a systems language like C, and in this case C++, without a byte type :).
Yes, I grant you it has a strange name, unsigned char, but it is still there :).
Just think about the magnitude of redeveloping everything to avoid bytes :). Peripherals, many registers in CPUs and other chips, communication, data protocols: it would all have to be redone :).

Related

most efficient way to Read and write binary

I'm working on a project that I need to first read data from a file, then make some change to it, and then save it to another file (all in binary mode).
For reading, my first try was to open the file with ifstream and read directly from it with read(), but because I need to read many small pieces back to back, I don't think it's a good idea to keep reading directly from the file. Currently I read the file into a structure and plain variables like this:
#include <cstdint>
#include <fstream>
namespace DBinary {
#pragma pack(push, 1)
    struct Structure
    {
        int32_t iData1;
        int16_t iData2;
        int16_t iData3;
        int16_t iData4a;
        int16_t iData4b;
        int32_t iData4c;
    };
#pragma pack(pop)
}
int main()
{
    std::ifstream input(path, std::ios::binary); // path: the input file name, defined elsewhere
    //for reading structure
    DBinary::Structure tstruc{};
    input.read((char*)&tstruc, sizeof(DBinary::Structure));
    //read single value
    uint16_t anint = 0;
    input.read((char*)&anint, sizeof(anint));
}
It works, but I think I can do better, because the file isn't that big. Maybe I can read it fully into memory and then work on it? But I'm not sure what the best way to do that is, or how to do it, because I'm new to C++ and don't have much experience with it.
I also want to be able to freely edit and change the data that I read from files, so it's important for me to support that as well.

I prefer this:
std::fstream fa("/etc/passwd", std::ios_base::in | std::ios_base::binary);
std::stringstream mj;
fa >> mj.rdbuf();
Then you have the whole file content in mj.str().

how to use boost::iostreams::mapped_file_source with a gzipped input file

I am using boost::iostreams::mapped_file_source to read a text file from a specific position to a specific position and to manipulate each line (compiled using g++ -Wall -O3 -o test main.cpp -lboost_iostreams):
#include <cstring>
#include <iostream>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>
int main() {
boost::iostreams::mapped_file_source f_read;
f_read.open("in.txt");
long long int alignment_offset(0);
// set the start point
const char* pt_current(f_read.data() + alignment_offset);
// set the end point
const char* pt_last(f_read.data() + f_read.size());
const char* pt_current_line_start(pt_current);
std::string buffer;
while (pt_current && (pt_current != pt_last)) {
if ((pt_current = static_cast<const char*>(memchr(pt_current, '\n', pt_last - pt_current)))) {
buffer.assign(pt_current_line_start, pt_current - pt_current_line_start + 1);
// do something with buffer
pt_current++;
pt_current_line_start = pt_current;
}
}
return 0;
}
Now I would like this code to handle gzip files as well, so I modified it like this:
#include<iostream>
#include<boost/iostreams/device/mapped_file.hpp>
#include<boost/iostreams/filter/gzip.hpp>
#include<boost/iostreams/filtering_streambuf.hpp>
#include<boost/iostreams/filtering_stream.hpp>
#include<boost/iostreams/stream.hpp>
int main() {
boost::iostreams::stream<boost::iostreams::mapped_file_source> file;
file.open(boost::iostreams::mapped_file_source("in.txt.gz"));
boost::iostreams::filtering_streambuf< boost::iostreams::input > in;
in.push(boost::iostreams::gzip_decompressor());
in.push(file);
std::istream std_str(&in);
std::string buffer;
while (std::getline(std_str, buffer)) {
// do something with buffer
}
}
This code also works well, but I don't know how to set the start point (pt_current) and the end point (pt_last) as in the first program. Could you let me know how I can set the two values in the second program?

The answer is no, that's not possible. The compressed stream would need to have indexes.
The real question is: why? You are using a memory-mapped file. Doing on-the-fly compression/decompression is only going to reduce performance and increase memory consumption.
If you're not short on actual file storage, then you should probably consider a binary representation, or keep the text as it is.
Binary representation could sidestep most of the complexity involved when using text files with random access.
Some inspirational samples:
Simplest way to read a CSV file mapped to memory?
Using boost::iostreams::mapped_file_source with std::multimap
Iterating over mmaped gzip file with boost
What you're basically discovering is that text files aren't random access, and compression makes indexing essentially fuzzy (there is no precise mapping from compressed stream offset to uncompressed stream offset).
Look at the zran.c example in the zlib distribution as mentioned in the zlib FAQ:
28. Can I access data randomly in a compressed stream?
No, not without some preparation. If when compressing you periodically use Z_FULL_FLUSH, carefully write all the pending data at those points, and keep an index of those locations, then you can start decompression at those points. You have to be careful to not use Z_FULL_FLUSH too often, since it can significantly degrade compression. Alternatively, you can scan a deflate stream once to generate an index, and then use that index for random access. See examples/zran.c
¹ you could specifically look at parallel implementations such as e.g. pbzip2 or pigz; These will necessarily use these "chunks" or "frames" to schedule the load across cores

Parsing a binary file. What is a modern way?

I have a binary file with some layout I know. For example let format be like this:
2 bytes (unsigned short) - length of a string
5 bytes (5 x chars) - the string - some id name
4 bytes (unsigned int) - a stride
24 bytes (6 x float - 2 strides of 3 floats each) - float data
The file should look like (I added spaces for readability):
5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5
Here 5 - is 2 bytes: 0x05 0x00. "hello" - 5 bytes and so on.
Now I want to read this file. Currently I do it so:
load file to ifstream
read this stream to char buffer[2]
cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
read a stream to vector<char> and create a std::string from this vector. Now I have string id.
the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
This works, but to me it looks ugly. Can I read directly into an unsigned short, float, string, etc. without creating a char [x]? If not, what is the correct way to cast (I read that the style I'm using is an old style)?
P.S.: While I was writing the question, a clearer formulation came to mind: how do I cast an arbitrary number of bytes from an arbitrary position in a char [x]?
Update: I forgot to mention explicitly that string and float data length is not known at compile time and is variable.

If it is not for learning purpose, and if you have freedom in choosing the binary format you'd better consider using something like protobuf which will handle the serialization for you and allow to interoperate with other platforms and languages.
If you cannot use a third party API, you may look at QDataStream for inspiration
Documentation
Source code

The C way, which would work fine in C++, would be to declare a struct:
#pragma pack(1)
struct contents {
// data members;
};
Note that
You need to use a pragma to make the compiler align the data as-it-looks in the struct;
This technique only works with POD types
And then cast the read buffer directly into the struct type:
std::vector<char> buf(sizeof(contents));
file.read(buf.data(), buf.size());
contents *stuff = reinterpret_cast<contents *>(buf.data());
Now if your data's size is variable, you can split it into several chunks. To read a single binary object from the buffer, a reader function comes in handy:
template<typename T>
const char *read_object(const char *buffer, T& target) {
target = *reinterpret_cast<const T*>(buffer);
return buffer + sizeof(T);
}
The main advantage is that such a reader can be overloaded for more advanced C++ objects:
template<typename CT>
const char *read_object(const char *buffer, std::vector<CT>& target) {
size_t size = target.size();
CT const *buf_start = reinterpret_cast<const CT*>(buffer);
std::copy(buf_start, buf_start + size, target.begin());
return buffer + size * sizeof(CT);
}
And now in your main parser:
int n_floats;
iter = read_object(iter, n_floats);
std::vector<float> my_floats(n_floats);
iter = read_object(iter, my_floats);
Note: As Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.
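One way to sidestep the alignment concern in the note above is to copy the bytes with std::memcpy instead of dereferencing a reinterpret_cast pointer. The sketch below keeps the read_object signatures from this answer but swaps in memcpy; that substitution is not part of the original answer. It still assumes the buffer holds enough bytes and uses the host's byte order and type sizes.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Copy one trivially copyable object out of the buffer; return the advanced pointer.
template <typename T>
const char* read_object(const char* buffer, T& target) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    std::memcpy(&target, buffer, sizeof(T));
    return buffer + sizeof(T);
}

// Overload for vectors: fills target.size() elements, so size the vector first.
template <typename CT>
const char* read_object(const char* buffer, std::vector<CT>& target) {
    static_assert(std::is_trivially_copyable<CT>::value, "CT must be trivially copyable");
    const std::size_t bytes = target.size() * sizeof(CT);
    if (bytes != 0)
        std::memcpy(target.data(), buffer, bytes);
    return buffer + bytes;
}
```

The call sites stay the same as in the parser fragment above: read the count, size the vector, then read the vector.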

Currently I do it so:
load file to ifstream
read this stream to char buffer[2]
cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
That last step risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values aligned at an even address), performance problems (some CPUs will read misaligned values but more slowly; others, like modern x86s, are fine and fast) and/or endianness issues. I'd suggest reading the two characters as unsigned char; then you can say (x[0] << 8) | x[1] or vice versa, depending on the file's byte order. If you read the raw 16-bit value instead and the file is big-endian, use ntohs to correct for endianness.
read a stream to vector<char> and create a std::string from this vector. Now I have string id.
No need... just read directly into the string:
std::string s(the_size, ' ');
if (input_stream.read(&s[0], s.size()) &&
input_stream.gcount() == s.size())
...use s...
the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
Better to read the data directly over the unsigned ints and floats, as that way the compiler will ensure correct alignment.
This works, but for me it looks ugly. Can I read directly to unsigned short or float or string etc. without char [x] creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
struct Data
{
uint32_t x;
float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
input_stream.gcount() == sizeof data)
...use x and y...
Note that the code above reads directly into properly aligned objects instead of into a char array; it's unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string). Again, you may need some post-read conversion with ntohl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it. Indexing past the declared array size of y is not sanctioned by the C++ standard, but in practice it commonly works as long as the memory at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream. Simpler, but with an additional read and so possibly slower: read the uint32_t first, then new float[n] and do a further read into there.
Practically, this type of approach can work and a lot of low level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally....

I actually implemented a quick and dirty binary format parser to read .zip files (following Wikipedia's format description) just last month, and being modern I decided to use C++ templates.
On some specific platforms, a packed struct could work, however there are things it does not handle well... such as fields of variable length. With templates, however, there is no such issue: you can get arbitrarily complex structures (and return types).
A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
using Buffer = std::pair<unsigned char const*, size_t>;
template <typename OffsetReader>
class UInt16LEReader: private OffsetReader {
public:
    UInt16LEReader() {}
    explicit UInt16LEReader(OffsetReader const reader): OffsetReader(reader) {}
    uint16_t read(Buffer const& buffer) const {
        OffsetReader const& reader = *this;
        size_t const offset = reader.read(buffer);
        assert(offset <= buffer.second && "Incorrect offset");
        assert(offset + 2 <= buffer.second && "Too short buffer");
        unsigned char const* begin = buffer.first + offset;
        // http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
        return (uint16_t(begin[0]) << 0)
             + (uint16_t(begin[1]) << 8);
    }
}; // class UInt16LEReader
// Declined for UInt[8|16|32][LE|BE]...
Of course, the basic OffsetReader actually has a constant result:
template <size_t O>
class FixedOffsetReader {
public:
size_t read(Buffer const&) const { return O; }
}; // class FixedOffsetReader
and since we are talking templates, you can switch the types at leisure (you could implement a proxy reader which delegates all reads to a shared_ptr which memoizes them).
What is interesting, though, is the end-result:
// http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
class LocalFileHeader {
public:
template <size_t O>
using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
template <size_t O>
using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;
UInt32< 0> signature;
UInt16< 4> versionNeededToExtract;
UInt16< 6> generalPurposeBitFlag;
UInt16< 8> compressionMethod;
UInt16<10> fileLastModificationTime;
UInt16<12> fileLastModificationDate;
UInt32<14> crc32;
UInt32<18> compressedSize;
UInt32<22> uncompressedSize;
using FileNameLength = UInt16<26>;
using ExtraFieldLength = UInt16<28>;
using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;
using ExtraField = StringReader<
CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
ExtraFieldLength
>;
FileName filename;
ExtraField extraField;
}; // class LocalFileHeader
This is rather simplistic, obviously, but incredibly flexible at the same time.
An obvious axis of improvement would be to improve chaining since here there is a risk of accidental overlaps. My archive reading code worked the first time I tried it though, which was evidence enough for me that this code was sufficient for the task at hand.
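Stripped of the template machinery, the core of each reader is a byte-by-byte little-endian decode that does not depend on the host's byte order. A minimal free-function sketch (read_u16le and read_u32le are hypothetical names, and exceptions replace the asserts used above) could look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Decode a little-endian 16-bit value at the given offset, with bounds checking.
inline std::uint16_t read_u16le(const unsigned char* data, std::size_t size,
                                std::size_t offset) {
    if (offset > size || size - offset < 2)
        throw std::out_of_range("read_u16le: buffer too short");
    return static_cast<std::uint16_t>(data[offset] | (data[offset + 1] << 8));
}

// Decode a little-endian 32-bit value at the given offset, with bounds checking.
inline std::uint32_t read_u32le(const unsigned char* data, std::size_t size,
                                std::size_t offset) {
    if (offset > size || size - offset < 4)
        throw std::out_of_range("read_u32le: buffer too short");
    return  static_cast<std::uint32_t>(data[offset])
         | (static_cast<std::uint32_t>(data[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(data[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(data[offset + 3]) << 24);
}
```

Applied to the first bytes of a ZIP local file header, read_u32le(buf, n, 0) yields the signature (0x04034b50 for the bytes 50 4B 03 04), and read_u16le(buf, n, 8) yields the compression method, matching the offsets listed in LocalFileHeader.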

I had to solve this problem once. The data files were packed FORTRAN output. Alignments were all wrong. I succeeded with preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer to a struct. The idea is to describe the data in an include file:
BEGIN_STRUCT(foo)
UNSIGNED_SHORT(length)
STRING_FIELD(length, label)
UNSIGNED_INT(stride)
FLOAT_ARRAY(3 * stride)
END_STRUCT(foo)
Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef and define the macros again to generate unpacking functions, followed by another include, etc.
NB I first saw this technique used in gcc for abstract syntax tree-related code generation.
If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).
It's amazing to me how often it pays to think in terms of generating code rather than writing it by hand, at least in low level foundation code like this.
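A compact single-file variant of this idea is sketched below. Instead of re-including a description file, the field list is itself a macro that takes another macro as its argument (often called an X-macro); that is a substitution for the include-file mechanism described above, not the original code. The sketch covers fixed-size fields only, and the names HEADER_FIELDS, Header and unpack are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical field list. Each entry is X(type, name).
#define HEADER_FIELDS(X)       \
    X(std::uint16_t, length)   \
    X(std::uint32_t, stride)   \
    X(float, scale)

// First expansion: generate the struct declaration.
struct Header {
#define DECLARE_FIELD(type, name) type name;
    HEADER_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Second expansion: generate the unpacking code, field by field.
// Reads tightly packed bytes (no padding) in host byte order and
// returns the number of bytes consumed.
inline std::size_t unpack(const unsigned char* p, Header& out) {
    std::size_t off = 0;
#define UNPACK_FIELD(type, name)                     \
    std::memcpy(&out.name, p + off, sizeof(type));   \
    off += sizeof(type);
    HEADER_FIELDS(UNPACK_FIELD)
#undef UNPACK_FIELD
    return off;
}
```

Because each field is copied separately, padding inside Header does not matter. Adding a field to HEADER_FIELDS updates both the struct and the unpacker.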

It is better to declare a structure with 1-byte packing (how to do that depends on the compiler). Write using that structure, and read using the same structure. Put only POD types in the structure, so no std::string etc. Use this structure only for file I/O or other inter-process communication; use a normal struct or class to hold the data for further use in the C++ program.

Since all of your data is variable, you can read the two blocks separately and still use casting:
struct id_contents
{
uint16_t len;
char id[];
} __attribute__((packed)); // assuming gcc, ymmv
struct data_contents
{
uint32_t stride;
float data[];
} __attribute__((packed)); // assuming gcc, ymmv
class my_row
{
const id_contents* id_;
const data_contents* data_;
size_t size_;
public:
my_row(const char* buffer) {
id_= reinterpret_cast<const id_contents*>(buffer);
size_ = sizeof(*id_) + id_->len;
data_ = reinterpret_cast<const data_contents*>(buffer + size_);
size_ += sizeof(*data_) +
data_->stride * sizeof(float); // or however many, 3*float?
}
size_t size() const { return size_; }
};
That way you can use Mr. kbok's answer to parse correctly:
const char* buffer = getPointerToDataSomehow();
my_row data1(buffer);
buffer += data1.size();
my_row data2(buffer);
buffer += data2.size();
// etc.

I personally do it this way:
// some code which loads the file in memory
#pragma pack(push, 1)
struct someFile { int a, b, c; char d[0xEF]; };
#pragma pack(pop)
someFile* f = (someFile*) (file_in_memory);
int filePropertyA = f->a;
Very effective way for fixed-size structs at the start of the file.

Use a serialization library. Here are a few:
Boost serialization and Boost fusion
Cereal (my own library)
Another library called cereal (same name as mine but mine predates theirs)
Cap'n Proto

The Kaitai Struct library provides a very effective declarative approach, which has the added bonus of working across programming languages.
After installing the compiler, you will want to create a .ksy file that describes the layout of your binary file. For your case, it would look something like this:
# my_type.ksy
meta:
  id: my_type
  endian: be # for big-endian, or "le" for little-endian
seq: # describes the actual sequence of data one-by-one
  - id: len
    type: u2 # unsigned short in C++, two bytes
  - id: my_string
    type: str
    size: 5
    encoding: UTF-8
  - id: stride
    type: u4 # unsigned int in C++, four bytes
  - id: float_data
    type: f4 # a four-byte floating point number
    repeat: expr
    repeat-expr: 6 # repeat six times
You can then compile the .ksy file using the kaitai struct compiler ksc:
# wherever the compiler is installed
# -t specifies the target language, in this case C++
/usr/local/bin/kaitai-struct-compiler my_type.ksy -t cpp_stl
This will create a my_type.cpp file as well as a my_type.h file, which you can then include in your C++ code:
#include <fstream>
#include <iostream>
#include <kaitai/kaitaistream.h>
#include "my_type.h"
int main()
{
std::ifstream ifs("my_data.bin", std::ifstream::binary);
kaitai::kstream ks(&ifs);
my_type_t obj(&ks);
std::cout << obj.len() << '\n'; // you can now access properties of the object
return 0;
}
Hope this helped! You can find the full documentation for Kaitai Struct here. It has a load of other features and is a fantastic resource for binary parsing in general.

I use the Ragel tool to generate pure C procedural source code (no tables) for microcontrollers with 1-2K of RAM. It does not use any file I/O or buffering, and it produces both easy-to-debug code and a .dot/.pdf file with the state machine diagram.
Ragel can also output Go, Java, etc. code for parsing, but I did not use these features.
The key feature of Ragel is the ability to parse any byte-based data, but you can't dig into bit fields. The other problem is that Ragel can parse regular structures but offers no recursion or grammar-based parsing.

Writing to file using c and c++

When I write a file from C using fwrite, which accepts a void pointer as data, the result is not readable in a text editor.
struct index
{
index(int _x, int _y):x(_x), y(_y){}
int x, y;
};
index i(4, 7);
FILE *stream;
fopen_s(&stream, "C:\\File.txt", "wb");
fwrite(&i, sizeof(index), 1, stream);
But when I write with C++ using ofstream in binary mode, it is readable. Why doesn't it come out the same as the file written using fwrite?

This is the way to write binary data using a stream in C++:
struct C {
int a, b;
} c;
#include <fstream>
int main() {
std::ofstream f("foo.txt",std::ios::binary);
f.write((const char*)&c, sizeof c);
}
This shall save the object in the same way as fwrite would. If it doesn't for you, please post your code with streams - we'll see what's wrong.

C++'s ofstream stream insertion (operator<<) only does text. The difference between opening an iostream in binary vs text mode is whether or not end-of-line character conversion happens. If you want to write a binary format where a 32-bit int takes exactly 32 bits, use the C functions in C++.
Edit on why fwrite may be the better choice:
ostream's write method is more or less a clone of fwrite (except it is a little less useful, since it only takes a byte array and a length instead of fwrite's 4 params), but by sticking to fwrite there is no way to accidentally use stream insertion in one place and write in another. It is more or less a safety mechanism. While you gain that margin of safety you lose a little flexibility: you can no longer make an iostream derivative that compresses output without changing any file-writing code.

Portable way to get file size in C/C++

I need to determine the byte size of a file.
The coding language is C++ and the code should work on Linux, Windows and any other operating system. This implies using standard C or C++ functions/classes.
This trivial need apparently has no trivial solution.

Using the standard streams you can write:
std::ifstream ifile(....);
ifile.seekg(0, std::ios_base::end);//seek to end
//now get current position as length of file
ifile.tellg();
If you are dealing with a write-only file (std::ofstream), the methods are slightly different:
ofile.seekp(0, std::ios_base::end);
ofile.tellp();

You can use the stat family of system calls (not part of standard C or C++, but widely available):
#ifdef WIN32
_stat64()
#else
stat64()
#endif

If you only need the file size this is certainly overkill but in general I would go with Boost.Filesystem for platform-independent file operations.
Amongst other attribute functions it contains
template <class Path> uintmax_t file_size(const Path& p);
You can find the reference here. Although the Boost libraries may seem huge, I found that they often implement things very efficiently. You could also extract only the function you need, but this might prove difficult as Boost is rather complex.

std::intmax_t file_size(std::string_view const& fn)
{
std::filebuf fb;
return fb.open(fn.data(), std::ios::binary | std::ios::in) ?
std::intmax_t(fb.pubseekoff({}, std::ios::end, std::ios::in)) :
std::intmax_t(-1);
}
We sacrifice 1 bit for the error indicator and standard disclaimers apply when running on 32-bit systems. Use std::filesystem::file_size(), if possible, as std::filebuf may dynamically allocate buffers for file io. This would make all the iostream-based methods wasteful and slow. Files were/are meant to be streamed, though much more so in the past than today, which relegates file sizes to secondary importance.
Working example.

Simples:
std::ifstream ifs;
ifs.open("mybigfile.txt", std::ios::binary);
ifs.seekg(0, std::ios::end);
std::streampos pos = ifs.tellg();

Portability requires you to use the least common denominator, which would be C (not C++).
The method that I use is the following.
#include <stdio.h>
long filesize(const char *filename)
{
    FILE *f = fopen(filename,"rb"); /* open the file read-only, in binary mode */
    long size = -1; /* -1 signals an error */
    if (f == NULL) /* the file could not be opened */
        return size;
    if (fseek(f,0,SEEK_END)==0) /* seek was successful */
        size = ftell(f);
    fclose(f);
    return size;
}

The prize for absolute inefficiency would go to:
auto file_size(std::string_view const& fn)
{
std::ifstream ifs(fn.data(), std::ios::binary);
return std::distance(std::istreambuf_iterator<char>(ifs), {});
}
Example.

Often we want to get things done in the most portable manner, but in certain situations, especially ones like this, I would strongly recommend using system APIs for best performance.
