String encoding issue? - c++

I have developed a convertBase function that is able to convert a value into different bases and back.
string convertBase(string value, int fBase, int tBase) {
    string charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/",
           fromRange = charset.substr(0, fBase),
           toRange = charset.substr(0, tBase),
           cc(value.rbegin(), value.rend()),
           res = "";
    unsigned long int dec = 0;
    int index = 0;
    for (char& digit : cc) {
        if (charset.find(digit) == std::string::npos) return "";
        dec += fromRange.find(digit) * pow(fBase, index);
        index++;
    }
    while (dec > 0) {
        res = toRange[dec % tBase] + res;
        dec = (dec - (dec % tBase)) / tBase;
    }
    return res;
}
The code works when encoding a simple string like "Test" and decoding it back again, but it runs into problems with longer strings like "Test1234567", which gets encoded as "33333333333333333333333333333333", and that seems absolutely wrong!
Why is this happening, and how can I fix it?

A long int is typically 32 or 64 bits in size, depending on which CPU architecture you are on, but other sizes are possible too. You keep adding bigger and bigger numbers to dec. At some point the numbers become larger than a long int can hold, and then your program breaks down.
If you need to handle arbitrarily large inputs, you need to use a different approach. If you can, use a "bignum" or "bigint" library like GMP.
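To see the failure mode concretely, here is a minimal sketch (the function name is mine, not from the question) showing that unsigned overflow silently wraps around instead of failing:

```cpp
#include <cstdint>
#include <limits>

// Unsigned arithmetic in C++ is modular: adding 1 to the maximum value
// wraps around to 0 with no warning and no exception. This is exactly the
// kind of silent corruption that produces garbage output instead of an error.
uint64_t wrapDemo() {
    uint64_t dec = std::numeric_limits<uint64_t>::max();
    dec += 1;   // wraps to 0
    return dec;
}
```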

Once you start debugging, the big issues become obvious. Let's start with a call of convertBase("Test", 3, 4).
string charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/",
       fromRange = charset.substr(0, fBase),
       toRange = charset.substr(0, tBase),
The substr copies xBase characters starting from the beginning (index 0). In this example that gives fromRange = "012" and toRange = "0123".
for (char& digit : cc) {
    if (charset.find(digit) == std::string::npos) return "";
    dec += fromRange.find(digit) * pow(fBase, index);
The test of digit against charset succeeds, so the early return is never taken.
But already in the first iteration, when digit == 't', fromRange.find(digit) does not find it and returns std::string::npos, which is then multiplied with something. And not only 't': every character that is in charset but not in fromRange is mapped to that same value. The mapping is not invertible; it is no bijection!
......
This leads to the conclusion that the algorithm cannot work, independent of any integer size limits.
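Putting both observations together, here is a sketch of a repaired convertBase: digits are validated against fromRange instead of the full charset, and the floating-point pow is replaced by incremental multiply-and-add. The 64-bit overflow limit from the first answer still applies; this is an illustration, not the questioner's exact code:

```cpp
#include <string>
using std::string;

// Sketch: validate each digit against the *source* range, so characters
// outside the source base are rejected instead of silently mapped to npos.
string convertBase(const string& value, int fBase, int tBase) {
    const string charset =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
    const string fromRange = charset.substr(0, fBase);
    const string toRange = charset.substr(0, tBase);
    unsigned long long dec = 0;
    for (char digit : value) {
        const size_t pos = fromRange.find(digit); // npos if not a valid source digit
        if (pos == string::npos) return "";       // reject instead of corrupting
        dec = dec * fBase + pos;                  // still overflows for long inputs
    }
    string res;
    while (dec > 0) {
        res = toRange[dec % tBase] + res;
        dec /= tBase;
    }
    return res.empty() ? "0" : res;               // input "0" now round-trips
}
```

With this version, convertBase("Test", 3, 4) correctly returns "" because 'T' is not a base-3 digit; the 64-bit limit remains, so a bignum library is still the answer for arbitrarily long inputs.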

Related

How to convert large number strings into integer in c++?

Suppose I have a long number as a string input in C++, and we have to do numeric operations on it. How can we convert it into an integer, or in what other way can we do those operations?
string s="12131313123123213213123213213211312321321321312321213123213213";
Looks like the numbers you want to handle are way too big for any standard integer type, so just "converting" them won't get you far. You have two options:
(Highly recommended!) Use a big integer library like e.g. gmp. Such libraries typically also provide functions for parsing and formatting the big numbers.
Implement big numbers yourself; you could e.g. use an array of uintmax_t to store them. You will have to implement all the arithmetic you could possibly need yourself, and this isn't exactly an easy task. For parsing the number, you can use a reversed double dabble implementation. As an example, here's some code I wrote a while ago in C. You can probably use it as-is, but you need to provide some helper functions, and you might want to rewrite it using C++ facilities like std::string, replacing the struct used here with a std::vector -- it's just here to document the concept:
typedef struct hugeint
{
    size_t s;       // number of used elements in array e
    size_t n;       // number of total elements in array e
    uintmax_t e[];
} hugeint;

hugeint *hugeint_parse(const char *str)
{
    char *buf;

    // allocate and initialize:
    hugeint *result = hugeint_create();

    // this is just a helper function copying all numeric characters
    // to a freshly allocated buffer:
    size_t bcdsize = copyNum(&buf, str);
    if (!bcdsize) return result;

    size_t scanstart = 0;
    size_t n = 0;
    size_t i;
    uintmax_t mask = 1;

    for (i = 0; i < bcdsize; ++i) buf[i] -= '0';

    while (scanstart < bcdsize)
    {
        if (buf[bcdsize - 1] & 1) result->e[n] |= mask;
        mask <<= 1;
        if (!mask)
        {
            mask = 1;
            // this function increases the storage size of the flexible array member:
            if (++n == result->n) result = hugeint_scale(result, result->n + 1);
        }
        for (i = bcdsize - 1; i > scanstart; --i)
        {
            buf[i] >>= 1;
            if (buf[i-1] & 1) buf[i] |= 8;
        }
        buf[scanstart] >>= 1;
        while (scanstart < bcdsize && !buf[scanstart]) ++scanstart;
        for (i = scanstart; i < bcdsize; ++i)
        {
            if (buf[i] > 7) buf[i] -= 3;
        }
    }

    free(buf);
    return result;
}
Your best bet would be to use a large-number computation library.
One of the best out there is the GNU Multiple Precision Arithmetic Library (GMP).
Example of a useful function to solve your problem:
Function: int mpz_set_str (mpz_t rop, const char *str, int base)
Set the value of rop from str, a null-terminated C string in base
base. White space is allowed in the string, and is simply ignored.
The base may vary from 2 to 62, or if base is 0, then the leading
characters are used: 0x and 0X for hexadecimal, 0b and 0B for binary,
0 for octal, or decimal otherwise.
For bases up to 36, case is ignored; upper-case and lower-case letters
have the same value. For bases 37 to 62, upper-case letters represent
the usual 10..35 while lower-case letters represent 36..61.
This function returns 0 if the entire string is a valid number in base
base. Otherwise it returns -1.
Documentation: https://gmplib.org/manual/Assigning-Integers.html#Assigning-Integers
If the string contains a number which is less than std::numeric_limits<uint64_t>::max(), then std::stoull() is the best option.
unsigned long long n = std::stoull(s);
C++11 and later.
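A sketch of how that could look in practice (fitsInUint64 is a made-up helper, not a standard function): std::stoull throws std::out_of_range when the value does not fit in 64 bits, so the limit is at least detected rather than silently wrapped.

```cpp
#include <stdexcept>
#include <string>

// Parses s as an unsigned 64-bit integer. Returns true and stores the value
// in out when it fits; returns false when the number exceeds
// std::numeric_limits<uint64_t>::max() (std::stoull throws out_of_range).
bool fitsInUint64(const std::string& s, unsigned long long& out) {
    try {
        out = std::stoull(s);
        return true;
    } catch (const std::out_of_range&) {
        return false;   // too big for uint64_t: a bignum library is needed
    }
}
```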

efficiency of using stringstream to convert string to int?

Is the code below less (or more, or equally) efficient than:
make substring from cursor
make stringstream from substring
extract integer using stream operator
? (question edit) or is it less (or more, or equally) efficient than:
std::stoi
? and why?
Could this function be made more efficient?
(The class brings these into scope:)
std::string expression // has some numbers and other stuff in it
int cursor // points somewhere in the string
The code:
int Foo_Class::read_int()
{
    /** reads an integer out of the expression from the cursor */
    // make stack of digits
    std::stack<char> digits;
    while (isdigit(expression[cursor])) // this is safe, returns false for the end of the string (ISO/IEC 14882:2011 21.4.5)
    {
        digits.push(expression[cursor] - 48); // convert from ascii
        ++cursor;
    }
    // add up the stack of digits
    int total = 0;
    int exponent = 0; // 10 ^ exponent
    int this_digit;
    while (!digits.empty())
    {
        this_digit = digits.top();
        for (int i = exponent; i > 0; --i)
            this_digit *= 10;
        total += this_digit;
        ++exponent;
        digits.pop();
    }
    return total;
}
(I know it doesn't handle overflow.)
(I know someone will probably say something about the magic numbers.)
(I tried pow(10, exponent) and got incorrect results. I'm guessing because of floating point arithmetic, but not sure why because all the numbers are integers.)
I find using std::stringstream to convert numbers is really quite slow.
Better to use the many dedicated number conversion functions like std::stoi, std::stol, std::stoll. Or std::strtol, std::strtoll.
I found lots of information on this page:
http://www.kumobius.com/2013/08/c-string-to-int/
As Galik said, std::stringstream is very slow compared to everything else.
std::stoi is much faster than std::stringstream
The manual code can be faster still, but as has been pointed out, it doesn't do all the error checking and could have problems.
This website also has an improvement over the code above: multiply the running total by 10 before each digit is added, processing the digits in sequential order rather than in reverse with the stack. This results in less multiplying by 10.
int Foo_Class::read_int()
{
    /** reads an integer out of the expression from the cursor */
    int to_return = 0;
    while (isdigit(expression[cursor])) // this is safe, returns false for the end of the string (ISO/IEC 14882:2011 21.4.5)
    {
        to_return *= 10;
        to_return += (expression[cursor] - '0'); // convert from ascii
        ++cursor;
    }
    return to_return;
}
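As an aside, for the cursor-based parsing in the question, std::stoi's pos out-parameter can advance the cursor without any manual digit loop. A sketch (read_int_stoi is an illustrative name; the substr copy costs a little performance, and stoi throws std::invalid_argument if no digits are present at the cursor):

```cpp
#include <string>

// Reads an integer starting at cursor and advances cursor past it,
// using std::stoi's pos out-parameter instead of a hand-written loop.
int read_int_stoi(const std::string& expression, size_t& cursor) {
    size_t consumed = 0;
    int value = std::stoi(expression.substr(cursor), &consumed);
    cursor += consumed;   // cursor now points at the first non-numeric char
    return value;
}
```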

String compression (Interview prepare)

I need to compress a string. You can assume that each character in the string doesn't appear more than 255 times. I need to return the compressed string and its length.
For the last 2 years I have worked with C# and my C++ has gotten rusty. I would be glad to hear your comments about the code, the algorithm, and C++ programming practices.
// StringCompressor.h
class StringCompressor
{
public:
    StringCompressor();
    ~StringCompressor();
    unsigned long Compress(string str, string* strCompressedPtr);
    string DeCompress(string strCompressed);
private:
    string m_StrCompressed;
    static const char c_MaxLen;
};

// StringCompressor.cpp
#include "StringCompressor.h"

const char StringCompressor::c_MaxLen = 255;

StringCompressor::StringCompressor()
{
}

StringCompressor::~StringCompressor()
{
}

unsigned long StringCompressor::Compress(string str, string* strCompressedPtr)
{
    if (str.empty())
    {
        return 0;
    }
    char currentChar = str[0];
    char count = 1;
    for (string::iterator it = str.begin() + 1; it != str.end(); ++it)
    {
        if (*it == currentChar)
        {
            count++;
            if (count == c_MaxLen)
            {
                return -1;
            }
        }
        else
        {
            m_StrCompressed += currentChar;
            m_StrCompressed += count;
            currentChar = *it;
            count = 1;
        }
    }
    m_StrCompressed += currentChar;
    m_StrCompressed += count;
    *strCompressedPtr = m_StrCompressed;
    return m_StrCompressed.length();
}

string StringCompressor::DeCompress(string strCompressed)
{
    string res;
    if (strCompressed.length() % 2 != 0)
    {
        return res;
    }
    for (string::iterator it = strCompressed.begin(); it != strCompressed.end(); it += 2)
    {
        char dup = *(it + 1);
        res += string(dup, *it);
    }
    return res;
}
There can be many improvements:
Do not return -1 from a function returning unsigned long.
Consider using size_t or ssize_t to represent sizes.
Learn const.
m_StrCompressed is left in a bogus state if Compress is called repeatedly. Since those members cannot be reused, you may as well make the functions static.
Compressed data generally should not be treated as a string, but as a byte buffer. Redesign your interface.
Comments! Nobody knows you are doing RLE here.
Bonus: add a fallback mechanism for when your compression yields a larger result, e.g. a flag to denote an uncompressed buffer, or just return failure.
I assume efficiency is not a major concern here.
A few things:
I'm all for using classes, and perhaps you could do that here in a way that makes more sense. But given the scope of what you are trying to do, this here would be better off as two functions. One for compression, one for decompression. For instance, why are you storing the string in the class as an object and never using it? How does grouping this as a class actually enhance the functionality or make it more reusable?
You should pass your compressed string return as a reference instead of a pointer.
It looks like you are trying to count the number of times characters are repeated in a row and save that. For most common strings this will make the size of your compressed string larger than uncompressed as it takes two bytes to store each non-repeated character.
There are many possible characters, but only two kinds of bits. If you applied this method to runs of repeated bits instead, you'd be more successful (and that's actually one simple method of lossless compression).
If you are allowed, just use a library like zlib to do compression of arbitrary data types.
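A minimal sketch of the "two free functions" suggestion above, keeping the question's run-length encoding as (char, count) pairs with counts capped at 255. The names rleCompress/rleDecompress are illustrative, not the questioner's interface:

```cpp
#include <string>

// Run-length encode into (char, count) pairs. Counts are stored in a single
// byte, so runs are capped at 255, matching the question's assumption.
std::string rleCompress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out += in[i];
        out += static_cast<char>(run);
        i += run;
    }
    return out;
}

// Inverse: expand each (char, count) pair back into a run.
std::string rleDecompress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.append(static_cast<unsigned char>(in[i + 1]), in[i]);
    return out;
}
```

Note the trade-off the third point above describes: "aaabbc" compresses to 6 bytes, no smaller than the input, and strings without repeats get twice as long.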

char to system.string in windowsforms

I wrote a program to write numbers in different bases (base 10, binary, base 53, whatever...)
I initially wrote it as a Win32 console application in Visual C++ 2010, and then converted it to a Windows Forms Application (I know, I know...)
In the original form it worked perfectly, but after the conversion it stopped working. I narrowed the problem down to this:
The program uses a function that receives a digit and returns a char:
char corresponding_digit(int digit)
{
    char corr_digit[62] = {'0','1','2','3','4','5','6','7','8','9',
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};
    return corr_digit[digit];
}
This function takes numbers from 0 to 61 and returns the corresponding character: 0-9 -> '0'-'9'; 10-35 -> 'A'-'Z'; 36-61 -> 'a'-'z'.
The program uses the function like this:
// on a button click
// base is an integer got from the user
String^ number_base_n = "";
if (base == 10)
    number_base_n = value.ToString();
else if (base == 0)
    number_base_n = "0";
else
{
    int digit, power, copy_value;
    bool number_started = false;
    copy_value = value;
    if (copy_value > (int)pow(float(base), MAX_DIGITS)) // cmath was included
        number_base_n = "Number too big";
    else
    {
        for (int i = MAX_DIGITS; i >= 0; i--)
        {
            power = (int)pow(float(base), i);
            if (copy_value >= power)
            {
                digit = copy_value / power;
                copy_value -= digit * power;
                number_started = true;
                number_base_n += corresponding_digit(digit);
            }
            else if (number_started || i == 0)
            {
                number_base_n += "0";
            }
        }
    }
}
textBox6->Text = number_base_n;
After debugging a bit, I realized the problem happens when function corresponding_digit is called with digit value "1", which should return '1', in the expression
//number base_n equals ""
number_base_n += String(corresponding_digit(digit));
//number_base_n equals "49"
number_base_n, starting as "", ends up as "49", which is actually the ASCII value of '1'. I looked online, and all I found was converting the result with String(value) or value.ToString(), but apparently I can't do
number_base_n += corresponding_digit(digit).ToString();
I tried using an auxiliar variable:
aux = corresponding_digit(digit);
number_base_n += aux.ToString();
but I got the exact same (wrong) result... (Same thing with String(value) )
I fumbled around a bit more, but not anything worth mentioning, I believe.
So... any help?
Also: base 10 and base 0 are working perfectly
Edit: If the downvoter would care to comment and explain why he downvoted... Constructive criticism, I believe is the term.
In C++/CLI, char is the same thing as it is in C++: a single byte, representing a single character. In C#, char (or System.Char) is a two byte Unicode codepoint. The C++ and C++/CLI equivalent to C#'s char is wchar_t. C++'s char is equivalent to System::Byte in C#.
As you have it now, attempting to do things with managed strings results in the managed APIs treating your C++ char as a C# byte, which is a number, not a character. That's why you're getting the ASCII value of the character, because it's being treated as a number, not a character.
To be explicit about things, I'd recommend you switch the return type of your corresponding_digit method to be System::Char. This way, when you operate with managed strings, the managed APIs will know that the data in question are characters, and you'll get your expected results.
System::Char corresponding_digit(int digit)
{
    System::Char corr_digit[62] = {'0','1','2','3','4','5','6','7','8','9',
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};
    return corr_digit[digit];
}
Other possible changes you could make:
Use a StringBuilder instead of appending strings.
Switch corr_digit to a managed array (array<System::Char>^), and store it somewhere reusable. As the code is written now, the corresponding_digit method has to re-create this array from scratch every time the method is called.

How to convert large integers to base 2^32?

First off, I'm doing this for myself so please don't suggest "use GMP / xint / bignum" (if it even applies).
I'm looking for a way to convert large integers (say, OVER 9000 digits) into an int32 array of base-2^32 digits. The numbers will start out as base 10 strings.
For example, if I wanted to convert string a = "4294967300" (in base 10), which is just over UINT_MAX, to the new base-2^32 array, it would be int32_t b[] = {1,4}. If int32_t b[] = {3,2485738}, the base 10 number would be 3 * 2^32 + 2485738. Obviously the numbers I'll be working with are beyond the range of even int64, so I can't exactly turn the string into an integer and mod my way to success.
I have a function that does subtraction in base 10. Right now I'm thinking I'll just do subtraction(char* number, "2^32") and count how many times before I get a negative number, but that will probably take a long time for larger numbers.
Can someone suggest a different method of conversion? Thanks.
EDIT
Sorry in case you didn't see the tag, I'm working in C++
Assuming your bignum class already has multiplication and addition, it's fairly simple:
bignum str_to_big(char* str) {
    bignum result(0);
    while (*str) {
        result *= 10;
        result += (*str - '0');
        str = str + 1;
    }
    return result;
}
Converting the other way is the same concept, but requires division and modulo
std::string big_to_str(bignum num) {
    std::string result;
    do {
        // '0' + ... turns the numeric remainder into its digit character;
        // assumes bignum % int yields something convertible to int
        result.push_back('0' + (int)(num % 10));
        num /= 10;
    } while (num > 0);
    std::reverse(result.begin(), result.end());
    return result;
}
Both of these are for unsigned only.
To convert from base 10 strings to your numbering system: starting with zero, for each base 10 digit, multiply your number by 10 and add the digit. Every time you have a carry, add a new digit to your base-2^32 array.
The simplest (not the most efficient) way to do this is to write two functions, one to multiply a large number by an int, and one to add an int to a large number. If you ignore the complexities introduced by signed numbers, the code looks something like this:
(EDITED to use vector for clarity and to add code for actual question)
void mulbig(vector<uint32_t> &bignum, uint16_t multiplicand)
{
    uint32_t carry = 0;
    for (unsigned i = 0; i < bignum.size(); i++) {
        uint64_t r = ((uint64_t)bignum[i] * multiplicand) + carry;
        bignum[i] = (uint32_t)(r & 0xffffffff);
        carry = (uint32_t)(r >> 32);
    }
    if (carry)
        bignum.push_back(carry);
}

void addbig(vector<uint32_t> &bignum, uint16_t addend)
{
    uint32_t carry = addend;
    for (unsigned i = 0; carry && i < bignum.size(); i++) {
        uint64_t r = (uint64_t)bignum[i] + carry;
        bignum[i] = (uint32_t)(r & 0xffffffff);
        carry = (uint32_t)(r >> 32);
    }
    if (carry)
        bignum.push_back(carry);
}
Then, implementing atobignum() using those functions is trivial:
void atobignum(const char *str, vector<uint32_t> &bignum)
{
    bignum.clear();
    bignum.push_back(0);
    while (*str) {
        mulbig(bignum, 10);
        addbig(bignum, *str - '0');
        ++str;
    }
}
I think Docjar: gnu/java/math/MPN.java might contain what you're looking for, specifically the code for public static int set_str (int dest[], byte[] str, int str_len, int base).
Start by converting the number to binary. Starting from the right, each group of 32 bits is a single base-2^32 digit.
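The multiply-by-10-and-add approach described in the answers above can be condensed into one self-contained sketch (toBase2_32 is a name I made up; limbs[0] holds the least significant base-2^32 digit, the reverse of the ordering in the question's example):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Converts a base-10 string into base-2^32 limbs, least significant first.
// For each decimal digit d: value = value * 10 + d, done limb by limb
// with a 64-bit intermediate so the carry into the next limb is exact.
std::vector<uint32_t> toBase2_32(const std::string& dec) {
    std::vector<uint32_t> limbs{0};
    for (char c : dec) {
        uint64_t carry = (uint64_t)(c - '0');   // the digit enters as the initial carry
        for (std::size_t i = 0; i < limbs.size(); ++i) {
            uint64_t r = (uint64_t)limbs[i] * 10 + carry;
            limbs[i] = (uint32_t)r;             // low 32 bits stay in this limb
            carry = r >> 32;                    // high bits carry into the next limb
        }
        if (carry) limbs.push_back((uint32_t)carry);
    }
    return limbs;
}
```

For the question's example, "4294967300" = 1 * 2^32 + 4, so the limbs come out as {4, 1} in this least-significant-first ordering.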