We want to get the line/column of an XPath query result in pugixml:
pugi::xpath_query query_child(query_str);
std::string value = Convert::toString(query_child.evaluate_string(root_node));
We can retrieve the offset, but not the line/column:
unsigned int offset = query_child.result().offset;
If we re-parse the file we can convert offset => (line, column), but that is not efficient.
Is there an efficient method to achieve this?
result().offset is the last parsed offset in the query string; it will be equal to 0 if the query was parsed successfully, so it is not an offset into the XML file.
For XPath queries that return strings, the concept of 'offset in the XML file' is not defined - e.g. what would you expect for a concat("a", "b") query?
For XPath queries that return nodes, you can get the offset of the node data in the file. Unfortunately, for parsing performance and memory consumption reasons, this information can't be obtained without reparsing. There is a task in the TODO list to make this easier (i.e. doable with a couple of lines of code), but it's going to take a while.
So, assuming you want to find the offset of a node that is the result of an XPath query, the only way is to get the XPath query result as a node set (query.evaluate_node_set or node.select_single_node/select_nodes), get the offset (node.offset_debug()) and convert it to line/column manually.
You can prepare a data structure for offset -> line/column conversion once, and then use it multiple times; for example, the following code should work:
#include <vector>
#include <algorithm>
#include <utility>
#include <cstddef>
#include <cassert>
#include <cstdio>

typedef std::vector<ptrdiff_t> offset_data_t;

bool build_offset_data(offset_data_t& result, const char* file)
{
    FILE* f = fopen(file, "rb");
    if (!f) return false;

    ptrdiff_t offset = 0;

    char buffer[1024];
    size_t size;

    while ((size = fread(buffer, 1, sizeof(buffer), f)) > 0)
    {
        for (size_t i = 0; i < size; ++i)
            if (buffer[i] == '\n')
                result.push_back(offset + i);

        offset += size;
    }

    fclose(f);
    return true;
}

std::pair<int, int> get_location(const offset_data_t& data, ptrdiff_t offset)
{
    offset_data_t::const_iterator it = std::lower_bound(data.begin(), data.end(), offset);
    size_t index = it - data.begin();

    return std::make_pair(1 + index, index == 0 ? offset : offset - data[index - 1]);
}
This does not handle Mac-style linebreaks and does not handle tabs; this can be trivially added, of course.
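For illustration, here is a rough sketch of how these pieces fit together (the file name and XPath expression are placeholders; note that offset_debug() returns -1 when no offset is available):
#include <pugixml.hpp>
#include <cstdio>

int main()
{
    const char* file = "data.xml";

    pugi::xml_document doc;
    if (!doc.load_file(file)) return 1;

    offset_data_t offsets;
    if (!build_offset_data(offsets, file)) return 1;

    // Evaluate the query as a node set instead of a string
    pugi::xpath_node match = doc.select_single_node("//book/title");
    if (match && match.node().offset_debug() >= 0)
    {
        std::pair<int, int> loc = get_location(offsets, match.node().offset_debug());
        std::printf("line %d, column %d\n", loc.first, loc.second);
    }

    return 0;
}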
I am building a simple multi-client server for string matching. I handle multiple clients at the same time by using sockets and select. The only job the server does is this: a client connects to the server and sends a needle (of size less than 10 GB) and a haystack (of arbitrary size) as a stream through a network socket. Needle and haystack are arbitrary binary data.
The server needs to search the haystack for all occurrences of the needle (as an exact string match) and send the number of needle matches back to the client. The server needs to process clients on the fly and be able to handle any input in a reasonable time (that is, the search algorithm has to have linear time complexity).
To do this I obviously need to split the haystack into small parts (possibly smaller than the needle) in order to process them as they are coming through the network socket. That is, I need a search algorithm that can handle a haystack that is split into parts and search in it the same way strstr(...) does.
I could not find any standard C or C++ library function, nor a Boost library object, that can handle a string by parts. If I am not mistaken, the algorithms in strstr(), string.find() and Boost searching/knuth_morris_pratt.hpp can only handle the search when the whole haystack is in a contiguous block of memory. Or is there some trick I could use to search a string by parts that I am missing? Do you know of any C/C++ library that is able to cope with such large needles and haystacks, i.e. that can handle haystack streams or search the haystack by parts?
I did not find any useful library by googling, and hence I was forced to create my own variation of the Knuth-Morris-Pratt algorithm that is able to remember its own state (shown below). However, I do not consider it an optimal solution, as a well-tuned string searching algorithm would surely perform better in my opinion, and it would mean less worry about debugging later.
So my question is:
Is there some more elegant way to search a large haystack stream by parts, other than creating my own search algorithm? Is there any trick for using the standard C string library for this? Is there some C/C++ library that is specialized for this kind of task?
Here is (part of) the code of my modified KMP algorithm:
#include <cstdlib>
#include <cstring>
#include <cstdio>

class knuth_morris_pratt {
    const char* const needle;
    const size_t needle_len;
    const int* const lps; // longest proper prefix-suffix table (skip table)

    // suffix_len is the length of the longest haystack_part suffix matching
    // some prefix of the needle. suffix_len must be shorter than needle_len.
    // The offset is defined by the last matching character in the needle.
    size_t suffix_len;
    size_t match_count; // the number of needles found in the haystack

public:
    inline knuth_morris_pratt(const char* needle, size_t len) :
        needle(needle), needle_len(len),
        lps( build_lps_array() ), suffix_len(0),
        match_count(len == 0 ? 1 : 0) { }

    inline ~knuth_morris_pratt() { free((void*)lps); }

    // processes a given part of the haystack stream
    void search_part(const char* haystack_part, size_t hp_len);

    inline size_t get_match_count() { return match_count; }

private:
    const int* build_lps_array();
};
// Worst case complexity: linear space, linear time
// see: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
// see article: KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings
void knuth_morris_pratt::search_part(const char* haystack_part, size_t hp_len) {
    if (needle_len == 0) {
        match_count += hp_len;
        return;
    }

    const char* hs = haystack_part;
    size_t i = 0;          // index into haystack_part
    size_t j = suffix_len; // index into needle

    while (i < hp_len) {
        if (needle[j] == hs[i]) {
            j++;
            i++;
        }
        if (j == needle_len) {
            // a needle found
            match_count++;
            j = lps[j - 1];
        }
        else if (i < hp_len && needle[j] != hs[i]) {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }

    suffix_len = j;
}
const int* knuth_morris_pratt::build_lps_array() {
    // allocate one int per needle character
    int* const new_lps = (int*)malloc(needle_len * sizeof(int));
    // check_cond_fatal(new_lps != NULL, "Unable to allocate memory in knuth_morris_pratt(..)");

    // length of the previous longest prefix suffix
    size_t len = 0;
    new_lps[0] = 0; // lps[0] is always 0

    // the loop calculates lps[i] for i = 1 to M-1
    size_t i = 1;
    while (i < needle_len) {
        if (needle[i] == needle[len]) {
            len++;
            new_lps[i] = len;
            i++;
        }
        else // (needle[i] != needle[len])
        {
            // This is tricky. Consider the example
            // AAACAAAA and i = 7. The idea is similar
            // to the search step.
            if (len != 0) {
                len = new_lps[len - 1];
                // Also, note that we do not increment i here
            }
            else // if (len == 0)
            {
                new_lps[i] = 0;
                i++;
            }
        }
    }

    return new_lps;
}
int main()
{
    const char* needle = "lorem";
    const char* p1 = "sit voluptatem accusantium doloremque laudantium qui dolo";
    const char* p2 = "rem ipsum quia dolor sit amet";
    const char* p3 = "dolorem eum fugiat quo voluptas nulla pariatur?";

    knuth_morris_pratt searcher(needle, strlen(needle));
    searcher.search_part(p1, strlen(p1));
    searcher.search_part(p2, strlen(p2));
    searcher.search_part(p3, strlen(p3));

    printf("%d \n", (int)searcher.get_match_count());

    return 0;
}
You can have a look at BNDM, which has the same performance as KMP:
O(m) for preprocessing
O(n) for matching.
It is used by nrgrep, whose sources can be found here and contain a C implementation.
C sources for the BNDM algorithm are here.
See here for more information.
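For illustration only, here is a minimal self-contained sketch of the bit-parallel BNDM scan (this is not the nrgrep code; it assumes the needle fits in a single 64-bit word, i.e. at most 64 bytes, and it scans a contiguous buffer, so on its own it does not address searching the haystack in parts):
#include <cstdint>
#include <cstdio>
#include <cstring>

// Count occurrences of needle (1 <= m <= 64) in haystack (length n) with BNDM.
size_t bndm_count(const unsigned char* haystack, size_t n, const unsigned char* needle, size_t m)
{
    if (m == 0 || m > 64 || n < m) return 0;

    // B[c]: bit (m - 1 - i) is set iff needle[i] == c.
    uint64_t B[256] = {0};
    for (size_t i = 0; i < m; ++i)
        B[needle[i]] |= uint64_t(1) << (m - 1 - i);

    size_t count = 0;
    size_t pos = 0;
    while (pos + m <= n)
    {
        size_t j = m, last = m;
        uint64_t D = (m == 64) ? ~uint64_t(0) : ((uint64_t(1) << m) - 1);
        while (D != 0)
        {
            D &= B[haystack[pos + j - 1]]; // read the window backwards
            --j;
            if (D & (uint64_t(1) << (m - 1)))
            {
                if (j > 0)
                    last = j;   // a needle prefix starts at pos + j: candidate shift
                else
                {
                    ++count;    // the whole needle matched at pos
                    break;
                }
            }
            if (j == 0) break;
            D <<= 1;
        }
        pos += last;
    }
    return count;
}

int main()
{
    const char* text = "needless to say, a needle was found";
    const char* pat  = "needle";
    printf("%zu\n", bndm_count((const unsigned char*)text, strlen(text),
                               (const unsigned char*)pat, strlen(pat)));
    return 0;
}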
If I have understood your problem correctly, you want to check whether a large std::string, received part by part, contains a substring.
If that is the case, I think you can store, at each iteration, the overlapping section between two contiguously received packets. Then you just have to check, at each iteration, whether either the overlap or the packet contains the desired pattern.
In the example below, I use the following contains() function to search for a pattern in a std::string:
bool contains(const std::string & str, const std::string & pattern)
{
    bool found(false);

    if(!pattern.empty() && (pattern.length() <= str.length()))
    {
        for(size_t i = 0; !found && (i <= str.length()-pattern.length()); ++i)
        {
            if((str[i] == pattern[0]) && (str.substr(i, pattern.length()) == pattern))
            {
                found = true;
            }
        }
    }

    return found;
}
Example:
std::string pattern("something");             // The pattern we want to find
std::string end_of_previous_packet("");      // The first part of the overlapping section
std::string beginning_of_current_packet(""); // The second part of the overlapping section
std::string overlap;                          // The string to store the overlap at each iteration
bool found(false);

while(!found && !all_data_received()) // stop condition
{
    // Get the current packet
    std::string packet = receive_part();

    // Set the beginning of the current packet
    beginning_of_current_packet = packet.substr(0, pattern.length());

    // Build the overlap
    overlap = end_of_previous_packet + beginning_of_current_packet;

    // If the overlap or the packet contains the pattern, we found a match
    if(contains(overlap, pattern) || contains(packet, pattern))
        found = true;

    // Set the end of previous packet for the next iteration
    end_of_previous_packet = packet.substr(packet.length()-pattern.length());
}
Of course, in this example I assume that the method receive_part() already exists, and the same goes for the all_data_received() function. It is just an example to illustrate the idea.
I hope it helps you find a solution.
I am creating a file with some data objects inside. The data objects have different sizes and look something like this (very simplified):
struct Data{
    uint64_t size;
    char blob[MAX_SIZE];
    // ... methods here
};
At some later step, the file will be mmap()-ed into memory,
so I want the beginning of every data object to start at a memory address aligned to 8 bytes, which is where the uint64_t size will be stored (let's ignore endianness).
The code looks more or less like this (currently hardcoded to 8 bytes):
size_t calcAlign(size_t const value, size_t const align_size){
    return align_size - value % align_size;
}

template<class ITERATOR>
void process(std::ofstream &file_data, ITERATOR begin, ITERATOR end){
    for(auto it = begin; it != end; ++it){
        const auto &data = *it;

        size_t bytesWriten = data.writeToFile(file_data);
        size_t const alignToBeAdded = calcAlign(bytesWriten, 8);

        if (alignToBeAdded != 8){
            uint64_t const placeholder = 0;
            file_data.write( (const char *) & placeholder, (std::streamsize) alignToBeAdded);
        }
    }
}
Is this the best way to achieve alignment inside a file?
You don't need to rely on writeToFile to return the size; you can use ofstream::tellp:
const auto beginPos = file_data.tellp();
// write stuff to file
const auto alignSize = (file_data.tellp() - beginPos) % 8;
if (alignSize)
    file_data.write("\0\0\0\0\0\0\0\0", 8 - alignSize);
EDIT after the OP's comment:
Tested on a minimal example, and it works.
#include <iostream>
#include <fstream>

int main(){
    using namespace std;

    ofstream file_data;
    file_data.open("tempfile.dat", ios::out | ios::binary);

    const auto beginPos = file_data.tellp();
    file_data.write("something", 9);

    const auto alignSize = (file_data.tellp() - beginPos) % 8;
    if (alignSize)
        file_data.write("\0\0\0\0\0\0\0\0", 8 - alignSize);

    file_data.close();
    return 0;
}
You can optimize the process by manipulating the input buffer instead of the file handling. Modify your Data struct so the code that fills the buffer takes care of the alignment.
struct Data{
    uint64_t size;
    char blob[MAX_SIZE];
    // ... other methods here

    // Ensure buffer alignment
    static_assert(MAX_SIZE % 8 == 0, "blob size must be aligned to 8 bytes to avoid buffer overflow.");

    uint64_t Fill(const char* data, uint64_t dataLength) {
        // Validations...
        memcpy(this->blob, data, dataLength);
        this->size = dataLength;

        const auto paddingLen = calcAlign(dataLength, 8) % 8;
        if (paddingLen > 0) {
            memset(this->blob + dataLength, 0, paddingLen);
        }

        // Return the aligned size
        return dataLength + paddingLen;
    }
};
Now when you pass the data to the process function, simply use the size returned by Fill, which ensures 8-byte alignment.
This way you still take care of the alignment manually, but you don't have to write to the file twice.
Note: this code assumes you also use Data as the input buffer. Apply the same principle if your code uses some other object to hold the buffer before it is written to the file.
If you can use POSIX, see also pwrite.
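For what it's worth, a minimal sketch of that idea (illustrative only; it mirrors the question's 8-byte requirement and reduces error handling to a bool):
#include <unistd.h>   // pwrite (POSIX)
#include <cstddef>

// Write 'size' bytes at 'offset', then advance 'offset' to the next 8-byte boundary.
bool writeRecordAligned(int fd, const void* data, size_t size, off_t& offset)
{
    if (pwrite(fd, data, size, offset) != static_cast<ssize_t>(size))
        return false;

    // Round the next write position up to a multiple of 8; the gap bytes are
    // left as holes and read back as zeros.
    offset += static_cast<off_t>((size + 7) & ~static_cast<size_t>(7));
    return true;
}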
I am currently trying to decompress targa (RGB24_RLE) image data.
My algorithm looks like this:
static constexpr size_t kPacketHeaderSize = sizeof(char);

//http://paulbourke.net/dataformats/tga/
inline void DecompressRLE(unsigned int a_BytePerPixel, std::vector<CrByte>& a_In, std::vector<CrByte>& a_Out)
{
    for (auto it = a_In.begin(); it != a_In.end();)
    {
        //Read packet header
        int header = *it & 0xFF;
        int count = (header & 0x7F) + 1;

        if ((header & 0x80) != 0) //packet type
        {
            //For the run-length packet, the header is followed by
            //a single color value, which is assumed to be repeated
            //the number of times specified in the header.
            auto paStart = it + kPacketHeaderSize;
            auto paEnd = paStart + a_BytePerPixel;

            //Insert packets into output buffer
            for (size_t pk = 0; pk < count; ++pk)
            {
                a_Out.insert(a_Out.end(), paStart, paEnd);
            }

            //Jump to next header
            std::advance(it, kPacketHeaderSize + a_BytePerPixel);
        }
        else
        {
            //For the raw packet, the header is followed by
            //the number of color values specified in the header.
            auto paStart = it + kPacketHeaderSize;
            auto paEnd = paStart + count * a_BytePerPixel;

            //Insert packets into output buffer
            a_Out.insert(a_Out.end(), paStart, paEnd);

            //Jump to next header
            std::advance(it, kPacketHeaderSize + count * a_BytePerPixel);
        }
    }
}
It is called here:
//Read compressed data
std::vector<CrByte> compressed(imageSize);
ifs.seekg(sizeof(Header), std::ifstream::beg);
ifs.read(reinterpret_cast<char*>(compressed.data()), imageSize);
//Decompress
std::vector<CrByte> decompressed(imageSize);
DecompressRLE(bytePerPixel, compressed, decompressed);
imageSize is defined like this:
size_t imageSize = hd.width * hd.height * bytePerPixel;
However, after DecompressRLE() finishes (which takes a very long time with 2048x2048 textures), decompressed is still empty/only contains zeros. Maybe I am missing something.
count seems to be unreasonably high sometimes, which I think is abnormal.
compressedSize should be less than imageSize, otherwise it's not compressed. However, using ifstream::tellg() gives me wrong results.
Any help?
If you look carefully at your variables in your debugger, you would see that std::vector<CrByte> decompressed(imageSize); declares a vector with imageSize elements in it. In DecompressRLE you then insert at the end of that vector, causing it to grow. This is why your decompressed image is full of zeros, and also why it takes so long (because the vector will be resized periodically).
What you want to do is reserve the space:
std::vector<CrByte> decompressed;
decompressed.reserve(imageSize);
Your compressed buffer looks like it is larger than the file content, so you're still decompressing past the end of the file. The compressed file size should be in Header. Use it.
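For instance, something along these lines might be closer to what you want (a sketch only, reusing the names from your code; it assumes the compressed data runs from the end of the header to the end of the file, i.e. the file has no TGA footer or developer area):
//Determine the actual compressed size from the file length instead of imageSize
ifs.seekg(0, std::ifstream::end);
const std::streamoff fileSize = ifs.tellg();
const std::streamoff compressedSize = fileSize - static_cast<std::streamoff>(sizeof(Header));

//Read exactly the compressed bytes
std::vector<CrByte> compressed(static_cast<size_t>(compressedSize));
ifs.seekg(sizeof(Header), std::ifstream::beg);
ifs.read(reinterpret_cast<char*>(compressed.data()), compressedSize);

//Reserve (not resize) the output so inserts append from the start
std::vector<CrByte> decompressed;
decompressed.reserve(imageSize);
DecompressRLE(bytePerPixel, compressed, decompressed);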
I am currently experimenting with a very simple Boyer-Moore variant.
In general my implementation works, but if I try to use it in a loop, the character pointer containing the haystack gets messed up. By that I mean that characters in it are altered or mixed up.
The result is consistent, i.e. running the same test multiple times yields the same screw up.
This is the looping code:
string src("This haystack contains a needle! needless to say that only 2 matches need to be found!");
string pat("needle");
const char* res = src.c_str();
while((res = boyerMoore(res, pat)))
++res;
This is my implementation of the string search algorithm (the above code calls a convenience wrapper which pulls the character pointer and length of the string):
unsigned char*
boyerMoore(const unsigned char* src, size_t srcLgth, const unsigned char* pat, size_t patLgth)
{
    if(srcLgth < patLgth || !src || !pat)
        return nullptr;

    size_t skip[UCHAR_MAX]; //this is the skip table
    for(int i = 0; i < UCHAR_MAX; ++i)
        skip[i] = patLgth; //initialize it with default value

    for(size_t i = 0; i < patLgth; ++i)
        skip[(int)pat[i]] = patLgth - i - 1; //set skip value of chars in pattern

    std::cout<<src<<"\n"; //just to see what's going on here!

    size_t srcI = patLgth - 1; //our first character to check
    while(srcI < srcLgth)
    {
        size_t j = 0; //char match ct
        while(j < patLgth)
        {
            if(src[srcI - j] == pat[patLgth - j - 1])
                ++j;
            else
            {
                //since the number of characters to skip may be negative, I just increment in that case
                size_t t = skip[(int)src[srcI - j]];
                if(t > j)
                    srcI = srcI + t - j;
                else
                    ++srcI;
                break;
            }
        }
        if(j == patLgth)
            return (unsigned char*)&src[srcI + 1 - j];
    }
    return nullptr;
}
The loop produced this output (i.e. these are the haystacks the algorithm received):
This haystack contains a needle! needless to say that only 2 matches need to be found!
eedle! needless to say that only 2 matches need to be found!
eedless to say that eed 2 meed to beed to be found!
As you can see the input is completely messed up after the second run. What am I missing? I thought the contents could not be modified, since I'm passing const pointers.
Is the way of setting the pointer in the loop wrong, or is my string search screwing up?
Btw: This is the complete code, except for includes and the main function around the looping code.
EDIT:
The missing nullptr in the first return was due to a copy/paste error; in the actual source it is there.
For clarification, this is my wrapper function:
inline const char* boyerMoore(const string &src, const string &pat)
{
    return (const char*) boyerMoore((const unsigned char*) src.c_str(), src.size(),
                                    (const unsigned char*) pat.c_str(), pat.size());
}
In your boyerMoore() function, the first return isn't returning a value (you have just return; rather than return nullptr;). GCC doesn't always warn about missing return values, and not returning anything is undefined behavior. That means that when you store the return value in res and call the function again, there's no telling what will print out. You can see a related discussion here.
Also, you have omitted your convenience function that calculates the length of the strings that you are passing in. I would recommend double checking that logic to make sure the sizes are correct - I'm assuming you are using strlen or similar.
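For example, one way to double-check the size handling is to skip the std::string wrapper at the call site and pass the remaining length explicitly on each iteration; a purely illustrative sketch reusing the unsigned char overload from the question:
const std::string src("This haystack contains a needle! needless to say that only 2 matches need to be found!");
const std::string pat("needle");

const unsigned char* hay = reinterpret_cast<const unsigned char*>(src.c_str());
size_t remaining = src.size();

while (const unsigned char* hit = boyerMoore(hay, remaining,
                                             reinterpret_cast<const unsigned char*>(pat.c_str()),
                                             pat.size()))
{
    // Step just past the start of the match and shrink the remaining length to match.
    remaining -= static_cast<size_t>(hit - hay) + 1;
    hay = hit + 1;
}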
I need to compress a string. I can assume that each character in the string doesn't appear more than 255 times. I need to return the compressed string and its length.
For the last 2 years I have worked with C# and my C++ is rusty. I will be glad to hear your comments about the code, the algorithm, and C++ programming practices.
// StringCompressor.h
#include <string>

using std::string;

class StringCompressor
{
public:
    StringCompressor();
    ~StringCompressor();

    unsigned long Compress(string str, string* strCompressedPtr);
    string DeCompress(string strCompressed);

private:
    string m_StrCompressed;
    static const char c_MaxLen;
};
// StringCompressor.cpp
#include "StringCompressor.h"

const char StringCompressor::c_MaxLen = 255;

StringCompressor::StringCompressor()
{
}

StringCompressor::~StringCompressor()
{
}

unsigned long StringCompressor::Compress(string str, string* strCompressedPtr)
{
    if (str.empty())
    {
        return 0;
    }

    char currentChar = str[0];
    char count = 1;

    for (string::iterator it = str.begin() + 1; it != str.end(); ++it)
    {
        if (*it == currentChar)
        {
            count++;
            if (count == c_MaxLen)
            {
                return -1;
            }
        }
        else
        {
            m_StrCompressed += currentChar;
            m_StrCompressed += count;
            currentChar = *it;
            count = 1;
        }
    }

    m_StrCompressed += currentChar;
    m_StrCompressed += count;

    *strCompressedPtr = m_StrCompressed;
    return m_StrCompressed.length();
}

string StringCompressor::DeCompress(string strCompressed)
{
    string res;

    if (strCompressed.length() % 2 != 0)
    {
        return res;
    }

    for (string::iterator it = strCompressed.begin(); it != strCompressed.end(); it += 2)
    {
        char dup = *(it + 1);
        res += string(dup, *it);
    }

    return res;
}
There can be many improvements:
Do not return -1 from a function that returns unsigned long.
Consider using size_t or ssize_t to represent sizes.
Learn const.
m_StrCompressed is left in a bogus state if Compress is called repeatedly. Since that member cannot be reused, you may as well make the function static.
Compressed data generally should not be treated as a string, but as a byte buffer. Redesign your interface.
Comments! Nobody knows you are doing RLE here.
Bonus: add a fallback mechanism in case your compression yields a larger result, e.g. a flag to denote an uncompressed buffer, or just return failure.
I assume efficiency is not a major concern here.
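To illustrate a few of these points (free/static functions, a byte buffer instead of a string, and no -1 from an unsigned function), here is a rough sketch of what such an interface could look like; it assumes C++17 for std::optional and is only one of many reasonable designs:
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Run-length encode 'input' as (character, count) pairs.
// Returns std::nullopt if any run is longer than 255, per the stated assumption.
std::optional<Bytes> rleCompress(const std::string& input)
{
    Bytes out;

    size_t i = 0;
    while (i < input.size())
    {
        size_t run = 1;
        while (i + run < input.size() && input[i + run] == input[i])
            ++run;

        if (run > 255)
            return std::nullopt; // cannot be represented in a single count byte

        out.push_back(static_cast<std::uint8_t>(input[i]));
        out.push_back(static_cast<std::uint8_t>(run));
        i += run;
    }

    return out;
}

std::string rleDecompress(const Bytes& compressed)
{
    std::string out;

    for (size_t i = 0; i + 1 < compressed.size(); i += 2)
        out.append(compressed[i + 1], static_cast<char>(compressed[i]));

    return out;
}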
A few things:
I'm all for using classes, and perhaps you could do that here in a way that makes more sense. But given the scope of what you are trying to do, this would be better off as two functions: one for compression, one for decompression. For instance, why are you storing the compressed string in the class as a member and never using it afterwards? How does grouping this into a class actually enhance the functionality or make it more reusable?
You should return the compressed string through a reference instead of a pointer.
It looks like you are trying to count the number of times characters are repeated in a row and save that. For most common strings this will make your compressed string larger than the uncompressed one, since it takes two bytes to store each non-repeated character.
There are many possible characters, but only two kinds of bits. If you applied this method to runs of repeated bits instead, you would be more successful (and that is actually one simple method of lossless compression).
If you are allowed to, just use a library like zlib to compress arbitrary data.
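If zlib is available, its one-shot API is enough for this kind of use; a minimal sketch (error handling kept to a minimum, names are illustrative):
#include <string>
#include <vector>
#include <zlib.h>

// Compress 'input' with zlib's one-shot compress(); returns an empty buffer on failure.
std::vector<unsigned char> zlibCompress(const std::string& input)
{
    uLongf destLen = compressBound(static_cast<uLong>(input.size()));
    std::vector<unsigned char> out(destLen);

    int rc = compress(out.data(), &destLen,
                      reinterpret_cast<const Bytef*>(input.data()),
                      static_cast<uLong>(input.size()));
    if (rc != Z_OK)
        return {};

    out.resize(destLen); // shrink to the actual compressed size
    return out;
}

// Decompress back into a string; the caller must know the original size.
std::string zlibDecompress(const std::vector<unsigned char>& compressed, size_t originalSize)
{
    std::string out(originalSize, '\0');
    uLongf destLen = static_cast<uLongf>(originalSize);

    int rc = uncompress(reinterpret_cast<Bytef*>(&out[0]), &destLen,
                        compressed.data(), static_cast<uLong>(compressed.size()));
    if (rc != Z_OK)
        return {};

    out.resize(destLen);
    return out;
}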