How to read a particular string from a buffer in C++

I have a buffer
char buffer[size];
which I am using to store the contents of a stream (call it pStream):
HRESULT hr = pStream->Read(buffer, size, &cbRead);
Now all the contents of the stream are in buffer, which holds size bytes. I know that two strings,
"<!doctortype html" and ".html>"
are present somewhere inside the buffer (we don't know their locations), and I want to copy just the part of the buffer that runs from
"<!doctortype html" to ".html>"
into another buffer, buffer2[SizeWeDontKnow], whose size is not known yet.
How can I do that? (The content between these two locations is an HTML file, and I want to store only that HTML file out of everything in the buffer.)

You can use strstr to find the right position in your buffer, provided the buffer is null-terminated; strnstr, which takes a length limit, is a BSD extension rather than standard C. After you've found the starting and ending tags, you can extract the text in between using strncpy, or use it in place if performance is an issue.
You can calculate the needed size from the positions of the tags and the length of the first tag: nLength = nPosEnd - nPosStart - nStartTagLength
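A buffer filled by a stream read is not necessarily null-terminated, so here is a minimal sketch of the same idea that works on a pointer plus a byte count and uses std::search instead of strstr. The helper name extractBetween is hypothetical, and the result excludes both tags, matching the length formula above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes strictly between the first startTag
// and the next endTag in buffer[0, length), or an empty string if either tag
// is missing. It never reads past buffer + length.
std::string extractBetween(const char* buffer, std::size_t length,
                           const std::string& startTag,
                           const std::string& endTag)
{
    const char* bufEnd = buffer + length;

    // Locate the start tag without relying on a terminating '\0'.
    const char* start = std::search(buffer, bufEnd,
                                    startTag.begin(), startTag.end());
    if (start == bufEnd)
        return std::string();
    start += startTag.size();

    // Look for the end tag only after the start tag.
    const char* stop = std::search(start, bufEnd,
                                   endTag.begin(), endTag.end());
    if (stop == bufEnd)
        return std::string();

    // Length is nPosEnd - nPosStart - nStartTagLength, as in the formula above.
    return std::string(start, stop);
}
```

With the question's variables this would be called as extractBetween(buffer, cbRead, "<!doctortype html", ".html>"), passing the number of bytes actually read rather than the buffer capacity.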

Look for HTML parsers for C/C++.
Another way is to walk a char pointer from the start of the buffer and check each character in turn, testing whether the text at that position matches the tag you are looking for.

If that's the only operation which operates on HTML code in your app, then you could use the solution I provided below (you can also test it online - here). However, if you are going to do some more complicated parsing, then I suggest using some external library.
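#include <cstdio>
#include <cstring>
#include <iostream>
using namespace std;
int main()
{
    const char* beforePrefix = "asdfasdfasdfasdf";
    const char* prefix = "<!doctortype html";
    const char* suffix = ".html>";
    const char* postSuffix = "asdasdasd";
    // Array sizes must be compile-time constants in standard C++.
    const unsigned size = 1024;
    char buf[size];
    // snprintf never writes more than size bytes, including the terminator.
    snprintf(buf, size, "%s%sTHE STRING YOU WANT TO GET%s%s", beforePrefix, prefix, suffix, postSuffix);
    cout << "Before: " << buf << endl;
    const char* firstOccurenceOfPrefixPtr = strstr(buf, prefix);
    // Search for the suffix only after the prefix, so an earlier ".html>" is ignored.
    const char* firstOccurenceOfSuffixPtr = firstOccurenceOfPrefixPtr
        ? strstr(firstOccurenceOfPrefixPtr + strlen(prefix), suffix)
        : nullptr;
    if (firstOccurenceOfPrefixPtr && firstOccurenceOfSuffixPtr)
    {
        // Length of the text between the end of the prefix and the start of the suffix.
        unsigned textLen = (unsigned)(firstOccurenceOfSuffixPtr - firstOccurenceOfPrefixPtr - strlen(prefix));
        char newBuf[size];
        strncpy(newBuf, firstOccurenceOfPrefixPtr + strlen(prefix), textLen);
        newBuf[textLen] = 0; // strncpy does not terminate the copy here
        cout << "After: " << newBuf << endl;
    }
    return 0;
}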
EDIT
I get it now :). You should use strstr to find the first occurrence of the prefix then. I edited the code above and updated the link.

Are you limited to C, or can you use C++?
In the C library reference there are plenty of useful ways of tokenising strings and comparing for matches (string.h):
http://www.cplusplus.com/reference/cstring/
Using C++ I would do the following (using buffer and size variables from your code):
// copy char array to std::string
std::string text(buffer, buffer + size);
// define what we're looking for
std::string begin_text("<!doctortype html");
std::string end_text(".html>");
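// find the start and end of the text we need to extract
// (find returns std::string::npos when the text is not present)
size_t begin_pos = text.find(begin_text);
size_t end_pos = std::string::npos;
if (begin_pos != std::string::npos)
{
    begin_pos += begin_text.length();
    end_pos = text.find(end_text, begin_pos);
}
// create a substring; the second argument of substr is a length, not an end position
std::string extract;
if (end_pos != std::string::npos)
    extract = text.substr(begin_pos, end_pos - begin_pos);
// test that we got the extract
std::cout << extract << std::endl;
If you need C string compatibility you can use:
const char* tmp = extract.c_str();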

Related

Convert from vector<unsigned char> to char* includes garbage data

I'm trying to base64 decode a string, then convert that value to a char array for later use. The decode works fine, but then I get garbage data when converting.
Here's the code I have so far:
std::string encodedData = "VGVzdFN0cmluZw=="; //"TestString"
std::vector<BYTE> decodedData = base64_decode(encodedData);
char* decodedChar;
decodedChar = new char[decodedData.size() +1]; // +1 for the final 0
decodedChar[decodedData.size() + 1] = 0; // terminate the string
for (size_t i = 0; i < decodedData.size(); ++i) {
decodedChar[i] = decodedData[i];
}
BYTE is a typedef of unsigned char, as taken from this SO answer. The base64 code is also from this answer (the most upvoted answer, not the accepted answer).
When I run this code, I get the following value in the Visual Studio Text Visualizer:
TestStringÍ
I've also tried other conversion methods, such as:
char* decodedChar = reinterpret_cast< char *>(&decodedData[0]);
Which gives the following:
TestStringÍÍÍýýýýÝÝÝÝÝÝÝ*b4d“
Why am I getting the garbage data at the end of the string? What am I doing wrong?
EDIT: clarified which answer in the linked question I'm using
char* decodedChar;
decodedChar = new char[decodedData.size() +1]; // +1 for the final 0
Why would you manually allocate a buffer and then copy to it when you have std::string available that does this for you?
Just do:
std::string encodedData = "VGVzdFN0cmluZw=="; //"TestString"
std::vector<BYTE> decodedData = base64_decode(encodedData);
std::string decodedString { decodedData.begin(), decodedData.end() };
std::cout << decodedString << '\n';
If you need a char * out of this, just use .c_str()
const char* cstr = decodedString.c_str();
If you need to pass this on to a function that takes char* as input, for example:
void someFunc(char* data);
//...
//call site
someFunc( &decodedString[0] );
C++ has a large set of functions, abstractions and containers that were made to improve on C, so that programmers don't have to write things by hand and repeat the same mistakes. It is best to use them wherever possible instead of raw loops and manual buffer handling for simple conversions like this one.
You are writing beyond the last element of your allocated array, which is undefined behavior, so literally anything can happen (according to the C++ standard). The array has decodedData.size() + 1 elements, so the last valid index is decodedData.size(). You need decodedChar[decodedData.size()] = 0;
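Both answers can be put side by side in a short sketch. The function names toString and toCString are made up for illustration, and std::vector<unsigned char> stands in for the question's std::vector<BYTE>. It also shows why the reinterpret_cast attempt printed garbage: the vector's storage has no terminating zero, so anything that treats it as a C string reads past the end.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Copies the bytes into a std::string. The string tracks its own length,
// and c_str() is always null-terminated.
std::string toString(const std::vector<unsigned char>& bytes)
{
    return std::string(bytes.begin(), bytes.end());
}

// Manual variant: allocates n + 1 chars, so valid indices are 0..n
// and the terminator belongs at index n.
std::unique_ptr<char[]> toCString(const std::vector<unsigned char>& bytes)
{
    const std::size_t n = bytes.size();
    std::unique_ptr<char[]> out(new char[n + 1]);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<char>(bytes[i]);
    out[n] = '\0'; // not out[n + 1], which is one past the allocation
    return out;
}
```

The std::unique_ptr<char[]> is only there so the sketch does not leak; with a raw new[] the caller would have to delete[] the buffer.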

Write C++ array to file, avoiding creating a std::string

I have a struct representing ASCII data:
struct LineData
{
char _asciiData[256];
uint8_t _asciiDataLength;
};
created using:
std::string s = "some data here";
memcpy(obj._asciiData, s.data(), s.length());
obj._asciiDataLength = s.length();
How do I write the char array to file as ASCII, in the lowest latency? I want to avoid the intermediary stage of creating a temporary std::string.
I tried:
file.write((char *)obj._asciiData, sizeof(obj._asciiDataLength));
file << std::endl;
but my file just contains '0' each line.
That's because sizeof(obj._asciiDataLength) is the size of the uint8_t member itself, which is 1, so only one character is written. You want the actual length stored in it, not the size of its type:
file.write(obj._asciiData, obj._asciiDataLength);

How to read custom string with C++ from binary recursively

I've recently been getting in to IO with C++. I am trying to read a string from a binary file stream.
The custom type is saved like this:
The string is prefixed with its length, which counts the terminating null. So Hello would be stored like this: 6Hello\0.
I am basically reading text from a table (in this case a name table) in a binary file. The file header tells me the offset of this table (112 bytes in this case) and the number of names (318).
Using this information I can read the first byte at this offset. This tells me the length of the string (e.g. 6). So I start at the next byte and read 5 more to get the full string "Hello". This seems to work fine for the first name at the offset, but trying to read the rest recursively produces a lot of garbage. I've tried using loops and recursive functions, but it's not working out so well. I'm not sure what the problem is, so I reverted to the original method that retrieves one name. Here's the code:
int printName(fstream& fileObj, __int8 buff, DWORD offset, int& iteration){
fileObj.seekg(offset);
fileObj.read((char*)&buff, sizeof(char));
int nameSize = (int)buff;
char* szName = new char[nameSize];
for(int i=1; i <= nameSize; i++){
fileObj.seekg(offset+i);
fileObj.read((char*)&szName[i-1], sizeof(char));
}
cout << szName << endl;
return 0;
}
Any idea how to iterate through all 318 names without creating dodgy output?
Thanks for taking the time to look through this, your help is greatly appreciated.
You're overcomplicating a bit - there's no need to seek to the next sequential read.
Removing unused and pointless parameters, I would write this function something like this:
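void printName(fstream& fileObj, DWORD offset) {
    // unsigned, so that lengths above 127 do not become negative
    unsigned char size = 0;
    if (fileObj.seekg(offset) && fileObj.read(reinterpret_cast<char*>(&size), sizeof(char)))
    {
        // one extra byte so the name is terminated even if the file omits the '\0'
        char* name = new char[size + 1];
        if (fileObj.read(name, size))
        {
            name[size] = '\0';
            cout << name << endl;
        }
        delete [] name;
    }
}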

C/C++ HDF5 Read string attribute

A colleague of mine used LabVIEW to write an ASCII string as an attribute in an HDF5 file. I can see that the attribute exists, and I can read it, but I can't print it.
The attribute is, as shown in HDF Viewer:
Date = 2015\07\09
So "Date" is its name.
I'm trying to read the attribute with this code
hsize_t sz = H5Aget_storage_size(dateAttribHandler);
std::cout<<sz<<std::endl; //prints 16
hid_t atype = H5Aget_type(dateAttribHandler);
std::cout<<atype<<std::endl; //prints 50331867
std::cout<<H5Aread(dateAttribHandler,atype,(void*)date)<<std::endl; //prints 0
std::cout<<date<<std::endl; //prints messy characters!
//even with an std::string
std::string s(date);
std::cout<<s<<std::endl; //also prints a mess
Why is this happening? How can I get this string as a const char* or std::string?
I also tried using the type atype = H5Tcopy(H5T_C_S1);, and that didn't work either...
EDIT:
Here is a full, self-contained program, as requested:
#include <string>
#include <iostream>
#include <fstream>
#include <hdf5/serial/hdf5.h>
#include <hdf5/serial/hdf5_hl.h>
std::size_t GetFileSize(const std::string &filename)
{
std::ifstream file(filename.c_str(), std::ios::binary | std::ios::ate);
return file.tellg();
}
int ReadBinFileToString(const std::string &filename, std::string &data)
{
std::fstream fileObject(filename.c_str(),std::ios::in | std::ios::binary);
if(!fileObject.good())
{
return 1;
}
size_t filesize = GetFileSize(filename);
data.resize(filesize);
fileObject.read(&data.front(),filesize);
fileObject.close();
return 0;
}
int main(int argc, char *argv[])
{
std::string filename("../Example.hdf5");
std::string fileData;
std::cout<<"Success read file into memory: "<<
ReadBinFileToString(filename.c_str(),fileData)<<std::endl;
hid_t handle;
hid_t magFieldsDSHandle;
hid_t dateAttribHandler;
htri_t dateAtribExists;
handle = H5LTopen_file_image((void*)fileData.c_str(),fileData.size(),H5LT_FILE_IMAGE_DONT_COPY | H5LT_FILE_IMAGE_DONT_RELEASE);
magFieldsDSHandle = H5Dopen(handle,"MagneticFields",H5P_DEFAULT);
dateAtribExists = H5Aexists(magFieldsDSHandle,"Date");
if(dateAtribExists)
{
dateAttribHandler = H5Aopen(magFieldsDSHandle,"Date",H5P_DEFAULT);
}
std::cout<<"Reading file done."<<std::endl;
std::cout<<"Open handler: "<<handle<<std::endl;
std::cout<<"DS handler: "<<magFieldsDSHandle<<std::endl;
std::cout<<"Attributes exists: "<<dateAtribExists<<std::endl;
hsize_t sz = H5Aget_storage_size(dateAttribHandler);
std::cout<<sz<<std::endl;
char* date = new char[sz+1];
std::cout<<"mem bef: "<<date<<std::endl;
hid_t atype = H5Aget_type(dateAttribHandler);
std::cout<<atype<<std::endl;
std::cout<<H5Aread(dateAttribHandler,atype,(void*)date)<<std::endl;
fprintf(stderr, "Attribute string read was '%s'\n", date);
date[sz] = '\0';
std::string s(date);
std::cout<<"mem aft: "<<date<<std::endl;
std::cout<<s<<std::endl;
H5Dclose(magFieldsDSHandle);
H5Fclose(handle);
return 0;
}
Printed output of this program:
Success read file into memory: 0
Reading file done.
Open handler: 16777216
DS handler: 83886080
Attributes exists: 1
16
mem bef:
50331867
0
Attribute string read was '�P7'
mem aft: �P7
�P7
Press <RETURN> to close this window...
Thanks.
It turned out that H5Aread has to be called with the address of the char pointer, that is, a pointer to a pointer:
H5Aread(dateAttribHandler,atype,&date);
Keep in mind that you don't have to reserve memory for that. The library allocates the memory, and you can free it afterwards with H5free_memory(date).
This worked fine.
EDIT:
I learned that this is the case only when the string to be read has variable length. If the string has a fixed length, then you have to reserve memory of size length+1 yourself and set the last char to null manually (to get a null-terminated string). The HDF5 library has a function, H5Tis_variable_str, that checks whether a string type is variable-length.
I discovered that if you do not allocate date and pass &date to H5Aread, then it works. (I use the C++ and Python APIs, so I do not know the C API very well.) Specifically, change:
char* date = 0;
// std::cout<<"mem bef: "<<date<<std::endl;
std::cout << H5Aread(dateAttribHandler, atype, &date) << std::endl;
And you should see 2015\07\09 printed.
You may want to consider using the C++ API. Using the C++ API, your example becomes:
std::string filename("c:/temp/Example.hdf5");
H5::H5File file(filename, H5F_ACC_RDONLY);
H5::DataSet ds_mag = file.openDataSet("MagneticFields");
if (ds_mag.attrExists("Date"))
{
H5::Attribute attr_date = ds_mag.openAttribute("Date");
H5::StrType stype = attr_date.getStrType();
std::string date_str;
attr_date.read(stype, date_str);
std::cout << "date_str= <" << date_str << ">" << std::endl;
}
As a simpler alternative to existing APIs, your use-case could be solved as follows in C using HDFql:
// declare variable 'value'
char *value;
// register variable 'value' for subsequent use (by HDFql)
hdfql_variable_register(&value);
// read 'Date' (from 'MagneticFields') and populate variable 'value' with it
hdfql_execute("SELECT FROM Example.hdf5 MagneticFields/Date INTO MEMORY 0");
// display value stored in variable 'value'
printf("Date=%s\n", value);
FYI, besides C, the code above can be used in C++, Python, Java, C#, Fortran or R with minimal changes.

C++ use regex on char array with \0's and get result

I want to use regex on a binary file, which contains 0 bytes, which renders me unable to use a string. I'm using a char array, and I'm able to use regex on the char array.
Buffer is a copy of the file mapped into memory, and read is the total size. This code works, but now I want to get the result back from the function. How do I do this?
if(std::regex_search(buffer, buffer + read, *params->pattern))
{
std::cout << "Found.";
}
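I did not test this, but it should work:
auto it = std::cregex_iterator(&buffer[0], &buffer[read], *params->pattern);
// a default-constructed iterator marks "no match"
if (it != std::cregex_iterator())
{
    for (size_t i = 0; i < it->size(); ++i)
    {
        // first points at the start of the sub-match inside buffer
        // (str() would return a std::string, not a const char*)
        const char* str = (*it)[i].first;
        size_t size = (*it)[i].length();
        std::cout.write(str, size);
    }
}
I've tested this only on regular strings, not on strings containing null chars. I don't see why it shouldn't work, though, because each sub-match gives you a pointer to a sequence of chars and the length of that sequence.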
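If only the first match is needed, a std::cmatch passed to std::regex_search returns the same information without an iterator. This is a minimal sketch; the helper name findFirst and its out-parameters are made up for illustration. The search works on a pointer range, so '\0' bytes in the buffer are treated like any other character.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Searches buffer[0, length) for the first match of `pattern`.
// On success, `matched` receives the matched bytes (which may themselves
// contain '\0') and `offset` receives the match position within the buffer.
bool findFirst(const char* buffer, std::size_t length,
               const std::regex& pattern,
               std::string& matched, std::size_t& offset)
{
    std::cmatch m;
    if (!std::regex_search(buffer, buffer + length, m, pattern))
        return false;

    // m[0] is the whole match; first and second delimit it inside buffer.
    matched.assign(m[0].first, m[0].second);
    offset = static_cast<std::size_t>(m.position(0));
    return true;
}
```

In the question's setting this would be called as findFirst(buffer, read, *params->pattern, matched, offset).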