How to identify, i.e. distinguish, textual content from other serialized representations? - c++

Are there any algorithms that could be used to distinguish or identify textual content from other serialized representations?
In my test case, textual information such as (file) names is stored at various locations in a file, intermixed with other information such as serialized floating-point values. Each piece of text has a leading 7-bit encoded length (similar to how BinaryWriter in .NET serializes strings). Ideally, I'd like to be able to list all candidates that can be considered 'text'.
I have implemented a naive, rough, semi-working algorithm. It goes like this: at every byte offset, decode a 7-bit encoded integer, then check whether that many following bytes are all human-readable characters. However, this approach gives a lot of false positives and duplicated entries. So my question is: are there any algorithms that I could explore, or alternatively, how could I strengthen the conditions so that it matches only the content I am looking for?
bool read_string( char* buffer )
{
unsigned char num3;
int num = 0;
int num2 = 0;
do
{
if (num2 == 0x23)
{
return false;
}
//stream.read(reinterpret_cast<char*>( &num3), sizeof(num3));
num3 = *buffer;
num |= (num3 & 0x7f) << num2;
num2 += 7;
buffer++;
}
while ((num3 & 0x80) != 0);
if( num > 0 && num < 2048 )
{
bool is_false = false;
for( int i = 0; i < num; ++i )
{
bool ischaracter = ( buffer[i] >= 'a' && buffer[i] <= 'z' ) || ( buffer[i] >= 'A' && buffer[i] <= 'Z' ) || ( buffer[i] >= '0' && buffer[i] <= '9' ) || ( buffer[i] == '/' || buffer[i] == '\\' || buffer[i] == '.' || buffer[i] == ' ' || buffer[i] == '{' || buffer[i] == '}' || buffer[i] == ':' || buffer[i] == '_' );
if( ischaracter == false ) {
is_false = true;
}
}
if( !is_false )
{
std::string v( buffer, buffer + num ); // copy the candidate text
printf("%s\r\n", v.c_str() );
return true; // every byte was an accepted character
}
}
return false; // length out of range or a non-text byte was found
}
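One way the conditions could be strengthened, sketched here under assumptions rather than as a definitive answer: check that the decoded length actually fits in the remaining bytes, require a minimum length so that short accidental matches are dropped, and jump past each accepted string so that overlapping candidates are not reported twice. The names findStrings, decode7BitLength, isTextByte and the default minimum length of 4 are hypothetical, and "text" is assumed to mean printable ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::size_t, std::string> > Matches;

// Decodes a 7-bit encoded length starting at data[pos] (at most 5 prefix bytes).
static bool decode7BitLength(const unsigned char* data, std::size_t size, std::size_t pos,
                             std::size_t& length, std::size_t& prefixSize)
{
    length = 0;
    prefixSize = 0;
    for (int shift = 0; shift < 35; shift += 7) {
        if (pos + prefixSize >= size) return false;   // ran off the buffer
        const unsigned char b = data[pos + prefixSize++];
        length |= static_cast<std::size_t>(b & 0x7f) << shift;
        if ((b & 0x80) == 0) return true;             // last prefix byte
    }
    return false;                                     // prefix too long
}

// Assumption: "text" means printable ASCII (space through '~').
static bool isTextByte(unsigned char b) { return b >= 0x20 && b <= 0x7e; }

// Returns (offset, text) pairs for every accepted candidate.
Matches findStrings(const unsigned char* data, std::size_t size, std::size_t minLength = 4)
{
    assert(data != 0 || size == 0);
    Matches found;
    std::size_t pos = 0;
    while (pos < size) {
        std::size_t length = 0, prefixSize = 0;
        bool ok = decode7BitLength(data, size, pos, length, prefixSize)
                  && length >= minLength && length < 2048
                  && length <= size - pos - prefixSize;   // must fit in the buffer
        for (std::size_t i = 0; ok && i < length; ++i)
            ok = isTextByte(data[pos + prefixSize + i]);
        if (ok) {
            const char* text = reinterpret_cast<const char*>(data) + pos + prefixSize;
            found.push_back(std::make_pair(pos, std::string(text, length)));
            pos += prefixSize + length;   // skip the match to avoid overlapping duplicates
        } else {
            ++pos;
        }
    }
    return found;
}

// Usage: three non-text bytes, then length 8 followed by "test.txt".
void checkFindStrings()
{
    const unsigned char data[] = { 0x00, 0x3f, 0x80, 0x08,
                                   't', 'e', 's', 't', '.', 't', 'x', 't', 0xff };
    const Matches found = findStrings(data, sizeof(data));
    assert(found.size() == 1 && found[0].first == 3 && found[0].second == "test.txt");
}
```

Skipping past a match trades recall for fewer duplicates: a false positive can hide a real string that starts inside it, so a larger minimum length or a stricter character set may be needed for noisy data.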

Related

run-length encoding is not working with big numbers

I have an assignment where I need to encode and decode txt files. For example, hello how are you? has to be encoded as hel2o how are you? and aaaaaaaaaajkle as a10jkle.
while ( ! invoer.eof ( ) ) {
if (kar >= '0' && kar <= '9') {
counter = kar-48;
while (counter > 1){
uitvoer.put(vorigeKar);
counter--;
}
}else if (kar == '/'){
kar = invoer.get();
uitvoer.put(kar);
}else{
uitvoer.put(kar);
}
vorigeKar = kar;
kar = invoer.get ( );
}
The problem is that when I need to decode a12bhr, the answer should be aaaaaaaaaaaabhr, but I can't seem to read the 12 as a single number without problems. I also can't use any strings.
Try emitting the repeated character only when the next character is not numeric or the end of the string has been reached.
To prepare for this, you need to build up the number while parsing the string.
For that, I recommend looking up how to convert a run of digit characters to an integer one character at a time in C++.
bool isNumeric(char ch) {
return '0' <= ch && ch <= '9';
}
string decode(const string& s) {
int counter = 0;
string result;
char prevCh = '\0';
for (size_t i = 0; i < s.length(); i++) {
if (isNumeric(s[i])) { // update counter
counter = counter * 10 + (s[i] - '0');
if (i + 1 == s.length() || !isNumeric(s[i + 1])) { // check the bound before looking ahead
// now, put previous character stacked
while (counter-- > 1) {
result.push_back(prevCh);
}
counter = 0;
}
}
else {
result.push_back(s[i]);
prevCh = s[i];
}
}
return result;
}
Now decode("a12bhr3") returns aaaaaaaaaaaabhrrr, so it works well.
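Since the question rules out strings, the same counter idea can also be applied directly to the file streams. The following is only a sketch under assumptions: decodeRunLength is a hypothetical name, the '/' escape from the question is left out, and a count with no preceding character is ignored. The string streams in the check function merely stand in for the input and output files.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Expands run-length encoded text from `in` into `out`.
// A run of digits after a character is read as one decimal count,
// so "a12" becomes twelve 'a's.
void decodeRunLength(std::istream& in, std::ostream& out)
{
    char previous = '\0';
    bool havePrevious = false;
    char current;
    while (in.get(current)) {
        if (std::isdigit(static_cast<unsigned char>(current))) {
            int count = current - '0';
            // Keep consuming digits while the next character is one.
            while (std::isdigit(in.peek())) {
                in.get(current);
                count = count * 10 + (current - '0');
            }
            // The character itself was already written once.
            for (int i = 1; havePrevious && i < count; ++i)
                out.put(previous);
        } else {
            out.put(current);
            previous = current;
            havePrevious = true;
        }
    }
}

// Usage: in-memory streams replace the files from the question.
void checkDecodeRunLength()
{
    std::istringstream in("a12bhr3");
    std::ostringstream out;
    decodeRunLength(in, out);
    assert(out.str() == "aaaaaaaaaaaabhrrr");
}
```

Looking ahead with peek() means the count is complete before anything is written, which is what makes multi-digit counts such as 12 work.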

How to obtain whole integers from a file that has strings and integers and store them in an array in C++?

I want to obtain integers from a file that contains strings too, and store them in an array to do some operations on them. The integers can be 1 or 12 or 234, so up to 3 digits. I am trying to do that, but the output stops when I run the code.
void GetNumFromFile (ifstream &file1, char & contents)
{
int digits[20];
file1.get(contents);
while(!file1.eof())
{
for (int n = 0; n < 10; n++)
{
if(('0' <= contents && contents <= '9') && ('0' >= contents+1 && contents+1 > '9'));
digits[n]=contents;
if(('0' <= contents && contents <= '9') && ('0' <= contents+1 && contents+1 < '9'));
digits[n]=contents;
if(('0' <= contents && contents <= '9') && ('0' <= contents+1 && contents+1 <= '9') && ('0' <= contents+2 && contents+2 < '9'));
digits[n]=contents;
}
continue;
}
for (int i = 0; i <= 20; i++)
{
cout << *(digits + i) << endl;
}
}
You have to deal with the number of digits of the number found:
int digits[20];
int i = 0;
short int aux[3]; // to format each digit of the numbers
ifstream file1("filepath");
char contents;
file1.get(contents); //first char
if (!file1.eof()) //test if you have only one char in the file
{
while (!file1.eof() && i < 20) // limit added to read only 20 numbers
{
if (contents <= '9' && contents >= '0') // if character is in number range
{
aux[0] = contents - '0'; // converting the char to the right integer
file1.get(contents);
if (contents <= '9' && contents >= '0') // if contents is number, continue on
{
aux[1] = contents - '0';
if (!file1.eof()) // if there are more chars to read, continue on
{
file1.get(contents);
if (contents <= '9' && contents >= '0') // if is integer, continue on
{
aux[2] = contents - '0';
file1.get(contents); // will read same of last char if eof, but will have no effect at all
//aux[0] *= 100; // define hundred
//aux[1] *= 10; // define ten
digits[i++] = (aux[0] * 100) + (aux[1] * 10) + aux[2];
}
else
{
//aux[0] *= 10; // define ten
digits[i++] = (aux[0] * 10) + aux[1];
}
}
else
{
digits[i++] = (aux[0] * 10) + aux[1];
}
}
else
{
digits[i++] = aux[0];
}
}
else
{
file1.get(contents); // not a digit: read the next char so the loop advances
}
}
}
else if (contents <= '9' && contents >= '0' && i < 20) // check if the only one char is number
{
digits[i++] = contents - '0';
}
If you want to read a number of undefined size, you will have to allocate memory to hold each digit of the number with new (C++) or malloc (C/C++).
First observation: you iterate out of bounds of the array:
int digits[20];
for (int i = 0; i <= 20; i++)
The array has 20 elements but the loop runs 21 iterations. That is undefined behavior, so anything can happen here (if your program ever gets here).
Next, you read from the file once, and then you have an infinite loop because the expression !file1.eof() keeps the same value for the rest of the program run. Isn't that the reason the "output stops"?
The third finding: your if statements have no effect because of the semicolon after the condition:
if(('0' <= contents && contents <= '9') && ('0' >= contents+1 && contents+1 > '9'));
digits[n]=contents;
You just assign digits[n]=contents; without any check.
I also don't see any reason to pass a reference to char into the function. Why not make it a local variable?
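Putting these observations together, a corrected version could look roughly like the sketch below. It is not the asker's function: readNumbers is a hypothetical name, it takes any std::istream so that it can be tried without a file, and it returns a std::vector<int> instead of filling a fixed array, which removes the out-of-bounds loop. Overflow is not handled, on the assumption that the numbers stay short, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Collects every run of decimal digits in `in` as one integer.
std::vector<int> readNumbers(std::istream& in)
{
    std::vector<int> numbers;
    char contents;               // local variable, no reference parameter needed
    bool inNumber = false;
    int value = 0;
    while (in.get(contents)) {   // the read happens inside the loop and ends it at EOF
        if (contents >= '0' && contents <= '9') {
            value = value * 10 + (contents - '0');   // append the digit
            inNumber = true;
        } else if (inNumber) {
            numbers.push_back(value);                // a non-digit ends the number
            value = 0;
            inNumber = false;
        }
    }
    if (inNumber)
        numbers.push_back(value);                    // number that ends at EOF
    return numbers;
}

// Usage: a string stream stands in for the ifstream from the question.
void checkReadNumbers()
{
    std::istringstream in("abc 1 x12y234");
    const std::vector<int> numbers = readNumbers(in);
    assert(numbers.size() == 3);
    for (std::size_t i = 0; i < numbers.size(); ++i)   // i < size(), never <=
        assert(numbers[i] > 0);
    assert(numbers[0] == 1 && numbers[1] == 12 && numbers[2] == 234);
}
```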
First, you will need to call get() inside the loop as well in order to reach the end of the file.
Furthermore, once a char is found to be a digit, try adding a while loop that keeps asking for the next character.
e.g.
int digits[20];
int i = 0;
ifstream file1("filepath");
char contents;
while (file1.get(contents)) // get the next character; the loop ends at end of file
{
if (contents <= '9' && contents >= '0' && i < 20) // if character is in number range
{
int value = contents - '0'; // converting the char to the right integer
while (file1.get(contents) && contents <= '9' && contents >= '0') // while the next char is a digit, continue on
{
value = value * 10 + (contents - '0'); // append the digit to the current number
}
digits[i++] = value; // store the whole number
}
}
// do other stuff here

how do you change arrays?

Help! I'm trying to replace 'a' and 'e' with ' ' in my array, but it keeps replacing all of the array instead.
for(int x = 0; x < array_length); x++)
{
if(city_name[x] == 'a' || 'e')
city_name[x] = " ";
}
if(city_name[x] == 'a' || 'e')
should be
if(city_name[x] == 'a' || city_name[x] == 'e')
Your code is equivalent to
if( ( city_name[x] == 'a' ) || 'e')
which evaluates city_name[x] == 'a' and then ORs the result with 'e'. Since 'e' is a nonzero value, the whole condition is always true.
First of all, the loop is wrong. It contains a typo
for(int x = 0; x < array_length); x++)
^^^
Remove the redundant parenthesis.
Also this condition
city_name[x] == 'a' || 'e'
is always equal to true because it is equivalent to
( city_name[x] == 'a' ) || 'e'
And instead of the string literal " " you have to use the character literal ' '
The correct loop can look like
for ( int i = 0; i < array_length; i++ )
{
if ( city_name[i] == 'a' || city_name[i] == 'e' ) city_name[i] = ' ';
}
Note that the standard algorithm std::replace_if, declared in the header <algorithm>, can be used instead of the loop. For example
std::replace_if( city_name, city_name + array_length,
[]( char c ) { return c == 'a' || c == 'e'; },
' ' );

Simplifying a function based on matching the pattern of a string

Question: I'm new to C++, and after writing the following code it seems like there should be a way to shorten it. Maybe by somehow matching the string? How would this be done?
The function takes a string message received via Serial port and sets the value of a particular element of the pinValues[] array depending on the message. The value that will be set is determined by the last character H or L just before the \n.
String pattern: (a number)(H or L)\n
Eg: message == "4H\n" will set the 5th element pinValues[4] to HIGH. The number at the start of the string can be 1 to 2 digits.
void setPinValues(String message) {
if( message == "1H\n" ) {
pinValues[1] = HIGH;
}
if( message == "1L\n" ) {
pinValues[1] = LOW;
}
if( message == "2H\n" ) {
pinValues[2] = HIGH;
}
if( message == "2L\n" ) {
pinValues[2] = LOW;
}
if( message == "3H\n" ) {
pinValues[3] = HIGH;
}
if( message == "3L\n" ) {
pinValues[3] = LOW;
}
if( message == "4H\n" ) {
pinValues[4] = HIGH;
}
if( message == "4L\n" ) {
pinValues[4] = LOW;
}
if( message == "5H\n" ) {
pinValues[5] = HIGH;
}
if( message == "5L\n" ) {
pinValues[5] = LOW;
}
if( message == "6H\n" ) {
pinValues[6] = HIGH;
}
if( message == "6L\n" ) {
pinValues[6] = LOW;
}
}
This is probably not the official "C++"-approved way of doing it, but you could do:
// body of setPinValues(String message)
unsigned int pinNo = 0;
char level = 0;
int result = sscanf(message.c_str(), "%u%c", &pinNo, &level);
if (result < 2)
return; // parsing failed
if (pinNo > 6 || (level != 'H' && level != 'L'))
return; // bad data
pinValues[pinNo] = (level == 'H') ? HIGH : LOW;
I'd do some sanity checking on the string while extracting the key and value from the first two chars. If you don't need to sanity check the message, it could be as short as
void setPinValues(String message) {
pinValues[ message[0] - '0' ] = (message[1] == 'H') ? HIGH:LOW;
}
Although you may want to make that a little longer, i.e. check the string length and that the 2 chars you're checking are in the right range, e.g.
void setPinValues(string message) {
if (
message.size() >= 2
and
message[0] >= '1' and message[0] <= '6'
and (message[1]=='H' or message[1]=='L')
) {
pinValues[ message[0] - '0' ] = (message[1] == 'H') ? HIGH:LOW;
}
}
EDIT: you could also extend that to checking two leading digits, i.e.
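int n = 0, off = 0;
if ( message[off] <= '9' and message[off] >= '0')
{
n = message[off++] - '0';
}
if ( off > 0 and message[off] <= '9' and message[off] >= '0')
{
n = 10*n + message[off++] - '0';
}
// the level character follows the digits, so index with off rather than 1
if (off > 0 and n >= 1 and n <= 6 and (message[off]=='H' or message[off]=='L')) {
pinValues[ n ] = (message[off] == 'H') ? HIGH:LOW;
}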
Assuming String is actually a std::string or has an identical interface, and also assuming an ASCII-compatible character set...
void setPinValues(String message) {
const size_t sz = message.size();
// input validation, ignore the message if it doesn't fit the pattern
// you can remove this "if" block if the message has already been validated
if ( (sz < 3) || (sz > 4)
// note how message[0] will be checked twice if sz == 3
// once as message[0] and once as message[sz -3]
// but if sz == 4 we check message[0] and message[1]
|| (message[0] < '0') || (message[0] > '9')
|| (message[sz - 3] < '0') || (message[sz - 3] > '9')
|| ((message[sz - 2] != 'H') && (message[sz - 2] != 'L'))
|| (message[sz - 1] != '\n'))
return;
// convert the first one or two characters to a number
int pinNumber = message[0] - '0';
if (sz == 4)
pinNumber = (pinNumber * 10) + (message[1] - '0');
// additional check to verify the pin number is in the correct range
if ((pinNumber < 1) || (pinNumber > 6))
return;
// apply
pinValues[pinNumber] = (message[sz - 2] == 'H' ? HIGH : LOW);
}
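The validation and the side effect can also be separated, which makes the pattern check easy to try on its own. This is a minimal sketch assuming std::string (an Arduino String has a different interface) and an ASCII-compatible character set; parsePinMessage is a hypothetical helper, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Parses "<1-2 digits><H|L>\n". On success stores the pin number and
// whether the level is H, and returns true; otherwise leaves both untouched.
bool parsePinMessage(const std::string& message, int& pin, bool& high)
{
    std::size_t pos = 0;
    int number = 0;
    while (pos < message.size() && pos < 2 &&
           message[pos] >= '0' && message[pos] <= '9') {
        number = number * 10 + (message[pos] - '0');
        ++pos;
    }
    if (pos == 0) return false;                    // no leading digit
    if (message.size() != pos + 2) return false;   // exactly level + '\n' must follow
    const char level = message[pos];
    if (level != 'H' && level != 'L') return false;
    if (message[pos + 1] != '\n') return false;
    pin = number;
    high = (level == 'H');
    return true;
}

// Usage: the caller still decides which pin numbers are valid.
void checkParsePinMessage()
{
    int pin = 0;
    bool high = false;
    assert(parsePinMessage("4H\n", pin, high) && pin == 4 && high);
    assert(!parsePinMessage("4X\n", pin, high));
}
```

A caller could then write something like if (parsePinMessage(message, pin, high) && pin >= 1 && pin <= 6) pinValues[pin] = high ? HIGH : LOW; so the range check stays next to the array it protects.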

How to convert char* to array of utf characters and url encode utf (cannot use c++11)?

I am getting text (a couple of words) from a server that contains UTF-8 characters (like Ž, ć), and another library (which I cannot change) stores it in char[], not wchar_t. I need to pass those words as parameters in a URL (to open a webview on a mobile client). I tried urlencode (a function I have that works fine on ASCII), but it encodes as ASCII and JavaScript cannot decode that back for use. How can I convert char* to an array of UTF characters and URL-encode it?
std::string urlencode(const std::string &c)
{
std::string escaped="";
int max = c.length();
for(int i=0; i<max; i++)
{
if ( (48 <= c[i] && c[i] <= 57) ||//0-9
(65 <= c[i] && c[i] <= 90) ||//ABC...XYZ
(97 <= c[i] && c[i] <= 122) || //abc...xyz
(c[i]=='~' || c[i]=='!' || c[i]=='*' || c[i]=='(' || c[i]==')' || c[i]=='\'')
)
{
escaped.append( &c[i], 1);
}
else
{
escaped.append("%");
escaped.append( char2hex(c[i]) );//converts char 255 to string "ff"
}
}
return escaped;
}
std::string char2hex( char dec )
{
char dig1 = (dec&0xF0)>>4;
char dig2 = (dec&0x0F);
if ( 0<= dig1 && dig1<= 9) dig1+=48; //0,48inascii
if (10<= dig1 && dig1<=15) dig1+=97-10; //a,97inascii
if ( 0<= dig2 && dig2<= 9) dig2+=48;
if (10<= dig2 && dig2<=15) dig2+=97-10;
std::string r;
r.append( &dig1, 1);
r.append( &dig2, 1);
return r;
}
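If the char buffer already holds UTF-8 bytes, no conversion to wide characters should be needed: JavaScript's decodeURIComponent expects exactly the percent-encoded UTF-8 bytes. The sketch below makes that assumption, avoids C++11 features, treats each byte as unsigned char so that bytes above 127 are handled without sign surprises, and escapes everything outside the RFC 3986 unreserved set. percentEncodeUtf8 is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Percent-encodes every byte except the RFC 3986 "unreserved" characters
// (letters, digits, '-', '.', '_', '~'). Input is assumed to be UTF-8.
std::string percentEncodeUtf8(const std::string& utf8)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string escaped;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        // Work on the raw byte value (0..255), not on a possibly signed char.
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        const bool unreserved =
            (byte >= '0' && byte <= '9') || (byte >= 'A' && byte <= 'Z') ||
            (byte >= 'a' && byte <= 'z') ||
            byte == '-' || byte == '.' || byte == '_' || byte == '~';
        if (unreserved) {
            escaped += static_cast<char>(byte);
        } else {
            escaped += '%';
            escaped += hex[byte >> 4];      // high nibble
            escaped += hex[byte & 0x0F];    // low nibble
        }
    }
    return escaped;
}

// Usage: "Ž" is the two UTF-8 bytes 0xC5 0xBD.
void checkPercentEncodeUtf8()
{
    assert(percentEncodeUtf8("\xC5\xBD") == "%C5%BD");
    assert(percentEncodeUtf8("a b") == "a%20b");
}
```

If the result still fails to decode on the JavaScript side, it is worth checking whether the server text is really UTF-8 and whether the URL is being encoded a second time somewhere along the way.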