C/C++ URL decode library - c++
I am developing a c/c++ program on linux. Can you please tell me if there is any c/c++ library which decodes url?
I am looking for libraries which
convert
"http%3A%2F%2F"
to:
"http://"
or
"a+t+%26+t" to "a t & t"
Thank you.
I used Saul's function in an analysis program I was writing (analyzing millions of URL-encoded strings). It works, but at that scale it was slowing my program down horribly, so I decided to write a faster version. This one is thousands of times faster when compiled with GCC and the -O2 option. It can also decode in place: urldecode2(buf, buf) works if the original string is in buf and is to be overwritten by its decoded counterpart.
Edit: It doesn't take the buffer size as an input because the buffer is assumed to be large enough. This is safe because the output is never longer than the input. Either use the same buffer for the output, or allocate one that is at least the size of the input plus 1 for the null terminator, e.g.:
char *output = malloc(strlen(input)+1);
urldecode2(output, input);
printf("Decoded string: %s\n", output);
Edit 2: An anonymous user attempted to edit this answer to translate the '+' character to ' ', which I think it probably should do. It wasn't something I needed for my application, but I've added it below.
Here's the routine:
#include <stdlib.h>
#include <ctype.h>
void urldecode2(char *dst, const char *src)
{
    char a, b;
    while (*src) {
        /* Cast to unsigned char: passing a negative char to isxdigit() is undefined */
        if ((*src == '%') &&
            ((a = src[1]) && (b = src[2])) &&
            (isxdigit((unsigned char)a) && isxdigit((unsigned char)b))) {
            /* Fold lowercase hex digits to uppercase, then convert to 0..15 */
            if (a >= 'a')
                a -= 'a'-'A';
            if (a >= 'A')
                a -= ('A' - 10);
            else
                a -= '0';
            if (b >= 'a')
                b -= 'a'-'A';
            if (b >= 'A')
                b -= ('A' - 10);
            else
                b -= '0';
            *dst++ = 16*a+b;
            src+=3;
        } else if (*src == '+') {
            *dst++ = ' ';
            src++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst++ = '\0';
}
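If you would rather not rely on the caller sizing the buffer correctly, the same loop can take the destination size and refuse to write past it. This is only a sketch under my own assumptions: urldecode_bounded and hex_digit_value are hypothetical names, not part of the routine above, and the return convention (decoded length, or -1 when the buffer is too small) is my choice.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: value of one hex digit. The caller must already
   have checked the character with isxdigit(). */
static int hex_digit_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical bounded variant: decodes src into dst, writing at most
   dst_size bytes including the terminator. Returns the decoded length,
   or -1 if dst is too small. */
long urldecode_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t n = 0;

    if (dst_size == 0)
        return -1;
    while (*src) {
        char out;
        if (src[0] == '%' &&
            isxdigit((unsigned char)src[1]) &&
            isxdigit((unsigned char)src[2])) {
            out = (char)(16 * hex_digit_value((unsigned char)src[1]) +
                         hex_digit_value((unsigned char)src[2]));
            src += 3;
        } else if (*src == '+') {
            out = ' ';
            src++;
        } else {
            out = *src++;
        }
        if (n + 1 >= dst_size) {   /* keep one byte for '\0' */
            dst[0] = '\0';
            return -1;
        }
        dst[n++] = out;
    }
    dst[n] = '\0';
    return (long)n;
}
```

Called as urldecode_bounded(buf, sizeof buf, input), it leaves an empty string in buf when the result would not fit, so a truncated decode is never mistaken for a complete one.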
Here is a C decoder for a percent encoded string. It returns -1 if the encoding is invalid and 0 otherwise. The decoded string is stored in out. I'm quite sure this is the fastest code of the answers given so far.
#include <stddef.h>
int percent_decode(char* out, const char* in)
{
    /* signed char: plain char may be unsigned on some platforms, which
       would turn -1 into 255 and defeat the "< 0" checks below */
    static const signed char tbl[256] = {
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
         0, 1, 2, 3, 4, 5, 6, 7,  8, 9,-1,-1,-1,-1,-1,-1,
        -1,10,11,12,13,14,15,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,10,11,12,13,14,15,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1
    };
    char c, *beg=out;
    signed char v1, v2;
    if(in != NULL) {
        while((c=*in++) != '\0') {
            if(c == '%') {
                /* Look up both hex digits; the terminating '\0' maps to -1,
                   so a truncated escape is rejected without reading past the end */
                if((v1=tbl[(unsigned char)*in++])<0 ||
                   (v2=tbl[(unsigned char)*in++])<0) {
                    *beg = '\0';
                    return -1;
                }
                c = (char)((v1<<4)|v2);
            }
            *out++ = c;
        }
    }
    *out = '\0';
    return 0;
}
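If typing out a 256-entry table by hand feels error-prone, the same lookup table can be filled once at startup. A small sketch, with names of my own invention (hex_tbl, init_hex_tbl and decode_hex_pair are hypothetical, not from the answer above):

```c
#include <string.h>

/* Hypothetical lookup table: hex digits map to 0..15, everything else to -1.
   signed char keeps the -1 sentinel negative even where plain char is unsigned. */
static signed char hex_tbl[256];

/* Fills hex_tbl; call once before decode_hex_pair(). */
void init_hex_tbl(void)
{
    int i;

    memset(hex_tbl, -1, sizeof hex_tbl);
    for (i = 0; i < 10; i++)
        hex_tbl['0' + i] = (signed char)i;
    for (i = 0; i < 6; i++) {
        hex_tbl['a' + i] = (signed char)(10 + i);
        hex_tbl['A' + i] = (signed char)(10 + i);
    }
}

/* Returns the byte value 0..255 encoded by the two hex digits at p,
   or -1 if either character is not a hex digit. p[1] is only read
   when p[0] is valid, so a string ending after '%' is handled safely. */
int decode_hex_pair(const char *p)
{
    int hi = hex_tbl[(unsigned char)p[0]];
    int lo;

    if (hi < 0)
        return -1;
    lo = hex_tbl[(unsigned char)p[1]];
    if (lo < 0)
        return -1;
    return (hi << 4) | lo;
}
```

The trade-off is one extra initialization call in exchange for a table that cannot contain a typo; the per-character cost in the decode loop is the same as with the static table.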
The uriparser library is small and lightweight.
This function I've just whipped up is very lightweight and should do what you want. Note that I haven't written it to strict URI standards (I used what I know off the top of my head). It's buffer-safe and doesn't overflow as far as I can see; adapt as you see fit:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
void urldecode(char *pszDecodedOut, size_t nBufferSize, const char *pszEncodedIn)
{
memset(pszDecodedOut, 0, nBufferSize);
enum DecodeState_e
{
STATE_SEARCH = 0, ///< searching for a percent sign to convert
STATE_CONVERTING, ///< convert the two following characters from hex
};
DecodeState_e state = STATE_SEARCH;
for(unsigned int i = 0; i < strlen(pszEncodedIn); ++i) // visit every character, including the last one
{
switch(state)
{
case STATE_SEARCH:
{
if(pszEncodedIn[i] != '%')
{
// Ensure there is room for this character and the terminator
assert(strlen(pszDecodedOut) + 1 < nBufferSize);
strncat(pszDecodedOut, &pszEncodedIn[i], 1);
break;
}
// We are now converting
state = STATE_CONVERTING;
}
break;
case STATE_CONVERTING:
{
// Conversion complete (i.e. don't convert again next iter)
state = STATE_SEARCH;
// Create a buffer to hold the hex. For example, if %20, this
// buffer would hold 20 (in ASCII)
char pszTempNumBuf[3] = {0};
strncpy(pszTempNumBuf, &pszEncodedIn[i], 2);
// Ensure both characters are hexadecimal
bool bBothDigits = true;
for(int j = 0; j < 2; ++j)
{
if(!isxdigit(pszTempNumBuf[j]))
bBothDigits = false;
}
if(!bBothDigits)
break;
// Convert two hexadecimal characters into one character
int nAsciiCharacter;
sscanf(pszTempNumBuf, "%x", &nAsciiCharacter);
// Ensure we aren't going to overflow
assert(strlen(pszDecodedOut) < nBufferSize);
// Concatenate this character onto the output
strncat(pszDecodedOut, (char*)&nAsciiCharacter, 1);
// Skip the next character
i++;
}
break;
}
}
}
The ever-excellent glib has some URI functions, including scheme-extraction, escaping and un-escaping.
I'd suggest curl and libcurl. It's widely used and should do the trick for you. Just check their website.
Thanks to @ThomasH for his answer. I'd like to propose a better formatting here…
And… since a decoded URI component is never longer than the same component in encoded form, it is always possible to decode it in place, within the same array of characters (a.k.a. "string"). So I'll propose two possibilities here:
#include <stdio.h>
#include <ctype.h>
#include <limits.h>
int decodeURIComponent (char *sSource, char *sDest) {
int nLength;
for (nLength = 0; *sSource; nLength++) {
if (*sSource == '%' && sSource[1] && sSource[2] && isxdigit(sSource[1]) && isxdigit(sSource[2])) {
sSource[1] -= sSource[1] <= '9' ? '0' : (sSource[1] <= 'F' ? 'A' : 'a')-10;
sSource[2] -= sSource[2] <= '9' ? '0' : (sSource[2] <= 'F' ? 'A' : 'a')-10;
sDest[nLength] = 16 * sSource[1] + sSource[2];
sSource += 3;
continue;
}
sDest[nLength] = *sSource++;
}
sDest[nLength] = '\0';
return nLength;
}
#define implodeURIComponent(url) decodeURIComponent(url, url)
And, finally…:
int main () {
char sMyUrl[] = "http%3a%2F%2ffoo+bar%2fabcd";
int nNewLength = implodeURIComponent(sMyUrl);
/* Let's print: "http://foo+bar/abcd\nLength: 19" */
printf("%s\nLength: %d\n", sMyUrl, nNewLength);
return 0;
}
Ste*
Try urlcpp https://github.com/larroy/urlcpp
It's a C++ module that you can easily integrate in your project, depends on boost::regex
Came across this 8 year old question as I was looking for the same. Based on previous answers, I also wrote my own version which is independent from libs, easy to understand and probably fast (no benchmark). Tested code with gcc, it should decode until end or invalid character (not tested). Just remember to free allocated space.
#include <stdlib.h>
#include <string.h>
const char ascii_hex_4bit[23] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 10, 11, 12, 13, 14, 15};
static inline char to_upper(char c)
{
if ((c >= 'a') && (c <= 'z')) return c ^ 0x20;
return c;
}
char *url_decode(const char *str)
{
size_t i, j, len = strlen(str);
char c, d, url_hex;
char *decoded = malloc(len + 1);
if (decoded == NULL) return NULL;
i = 0;
j = 0;
do
{
c = str[i];
d = 0;
if (c == '%')
{
url_hex = to_upper(str[++i]);
if (((url_hex >= '0') && (url_hex <= '9')) || ((url_hex >= 'A') && (url_hex <= 'F')))
{
d = ascii_hex_4bit[url_hex - 48] << 4;
url_hex = to_upper(str[++i]);
if (((url_hex >= '0') && (url_hex <= '9')) || ((url_hex >= 'A') && (url_hex <= 'F')))
{
d |= ascii_hex_4bit[url_hex - 48];
}
else
{
d = 0;
}
}
}
else if (c == '+')
{
d = ' ';
}
else if ((c == '*') || (c == '-') || (c == '.') || ((c >= '0') && (c <= '9')) ||
((c >= 'A') && (c <= 'Z')) || (c == '_') || ((c >= 'a') && (c <= 'z')))
{
d = c;
}
decoded[j++] = d;
++i;
} while ((i < len) && (d != 0));
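if (d != 0) decoded[j] = 0; /* if d == 0 the terminator is already stored, and for an empty input index j would be past the buffer */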
return decoded;
}
/**
* Locale-independent conversion of ASCII characters to lowercase.
*/
int av_tolower(int c)
{
if (c >= 'A' && c <= 'Z')
c ^= 0x20;
return c;
}
/**
* Decodes an URL from its percent-encoded form back into normal
* representation. This function returns the decoded URL in a string.
* The URL to be decoded does not necessarily have to be encoded but
* in that case the original string is duplicated.
*
* @param url a string to be decoded.
* @return new string with the URL decoded or NULL if decoding failed.
* Note that the returned string should be explicitly freed when not
* used anymore.
*/
char *urldecode(const char *url)
{
int s = 0, d = 0, url_len = 0;
char c;
char *dest = NULL;
if (!url)
return NULL;
url_len = strlen(url) + 1;
dest = av_malloc(url_len);
if (!dest)
return NULL;
while (s < url_len) {
c = url[s++];
if (c == '%' && s + 2 < url_len) {
char c2 = url[s++];
char c3 = url[s++];
if (isxdigit(c2) && isxdigit(c3)) {
c2 = av_tolower(c2);
c3 = av_tolower(c3);
if (c2 <= '9')
c2 = c2 - '0';
else
c2 = c2 - 'a' + 10;
if (c3 <= '9')
c3 = c3 - '0';
else
c3 = c3 - 'a' + 10;
dest[d++] = 16 * c2 + c3;
} else { /* %zz or something other invalid */
dest[d++] = c;
dest[d++] = c2;
dest[d++] = c3;
}
} else if (c == '+') {
dest[d++] = ' ';
} else {
dest[d++] = c;
}
}
return dest;
}
by
www.elesos.com
Related
How do I turn an ip address into an array?
I have std::array<unsigned char, 4> ipv4 = {}; and the IP address 90.100.160.101. How do I store it so that peer.ipv4[0] = 90; peer.ipv4[1] = 100; peer.ipv4[2] = 160; peer.ipv4[3] = 101;?
This is the answer to the question: convert an IP address, given as a std::string, to a std::array.
I want to add this answer because I think that Ihor Drachuk's answer can be modified to use more modern C++ elements. All those lines of code can be replaced by one statement, using std::transform and a very simple regex (\d+). Please see the example below:
#include <iostream>
#include <regex>
#include <array>
#include <algorithm>
#include <iterator>
#include <string>
// For easier reading and writing
using IPv4 = std::array<unsigned char, 4>;
// a regex for one or more digits
std::regex re{R"((\d+))"};
// Some test string. Can be anything
const std::string testIpString{"127.128.129.1"};
int main()
{
    // Here we will store the IP address as an array of bytes
    IPv4 ipV4{};
    // Convert IP-String to the array, using one statement. One liner
    std::transform(std::sregex_token_iterator(testIpString.begin(), testIpString.end(), re), {}, ipV4.begin(), [](const std::string& s){ return static_cast<unsigned char>(std::stoi(s));});
    // Some Debug output. Show result on screen
    std::copy(ipV4.begin(), ipV4.end(), std::ostream_iterator<unsigned int>(std::cout,"\n"));
    return 0;
}
You have std::array<unsigned char, 4> ipv4 = {}; Supposing ipv4 does in fact contain your address (otherwise why mention it?), so that ipv4[0] is 90 and so on, just do peer.ipv4[0] = ipv4[0]; peer.ipv4[1] = ipv4[1]; peer.ipv4[2] = ipv4[2]; peer.ipv4[3] = ipv4[3]; or use a loop.
If you need to parse a string into an IPv4 address and validate that string, you can try the function below. I assume this code is faster than a regex or similar C++ features, but if you don't need speed optimization, it is probably better to choose an answer with less and clearer code.
#include <array>
#include <cstddef>
#include <optional>
using IPv4 = std::array<unsigned char, 4>;
// Speed-optimized variant
std::optional<IPv4> convertIPv4(const char* str)
{
    IPv4 result;
    size_t resultIndex = 0;
    const char* str2 = str;
    const char* ssEnd = nullptr;
    const char* ssStart = nullptr;
    int cnt = 0;
    while (true) {
        if (*str2 == '.' || *str2 == 0) {
            cnt++;
            if (!ssStart) return {};
            if (cnt == 5) return {};
            // Offset of the last character of the field (0..2 for 1..3 digits)
            const std::ptrdiff_t diff = ssEnd - ssStart;
            if ((diff < 0) || (diff > 2)) return {};
            if (diff == 2) {
                char c1 = *ssStart;
                char c2 = *(ssStart+1);
                char c3 = *(ssStart+2);
                if ((c1 < '0') || (c1 > '2')) return {};
                if ((c2 < '0') || (c3 < '0')) return {};
                if ((c2 > '9') || (c3 > '9')) return {};
                const int value = (c1 - 48) * 100 + (c2 - 48) * 10 + (c3 - 48);
                // Reject 256..299, which would not fit in an unsigned char
                if (value > 255) return {};
                result[resultIndex++] = static_cast<unsigned char>(value);
            } else if (diff == 1) {
                char c1 = *ssStart;
                char c2 = *(ssStart+1);
                if ((c1 < '0') || (c2 < '0')) return {};
                if ((c1 > '9') || (c2 > '9')) return {};
                result[resultIndex++] = (c1 - 48) * 10 + (c2 - 48);
            } else {
                char c1 = *ssStart;
                if ((c1 < '0') || (c1 > '9')) return {};
                result[resultIndex++] = c1 - 48;
            }
            // Return if all's done
            if (cnt == 4 && *str2 == 0) {
                return result;
            }
            ssEnd = nullptr;
            ssStart = nullptr;
        } else {
            if (!ssStart) ssStart = str2;
            ssEnd = str2;
        }
        str2++;
    }
}
Usage:
IPv4 value = convertIPv4("127.0.0.1").value(); // value() throws std::bad_optional_access if parsing failed
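For comparison, if plain C is acceptable and speed is not critical, sscanf can do most of the validation. This is a sketch rather than a drop-in replacement: parse_ipv4 is a hypothetical name, and sscanf is more lenient than the hand-written parser (for example it tolerates leading whitespace before a number).

```c
#include <stdio.h>

/* Hypothetical helper: parses "a.b.c.d" into out[0..3].
   Returns 1 on success, 0 on malformed input or a field above 255. */
int parse_ipv4(const char *s, unsigned char out[4])
{
    unsigned int a, b, c, d;
    char extra;

    /* The trailing %c only matches if something follows the fourth field,
       in which case sscanf returns 5 and the input is rejected. */
    if (sscanf(s, "%3u.%3u.%3u.%3u%c", &a, &b, &c, &d, &extra) != 4)
        return 0;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return 0;
    out[0] = (unsigned char)a;
    out[1] = (unsigned char)b;
    out[2] = (unsigned char)c;
    out[3] = (unsigned char)d;
    return 1;
}
```

The %3u width limits each field to three digits, and the explicit range check covers values from 256 to 999 that three digits can still express.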
atoi() for int128_t type
How can I use argv values with int128_t support? I know about atoi() and the family of functions exposed by <cstdlib>, but I cannot find one for the int128_t fixed-width integer. This might be because this type isn't backed by either the C or the C++ standard, but is there any way for me to make this code work?
#include <iostream>
int main(int argc, char **argv)
{
    __int128_t value = atoint128_t(argv[1]);
}
Almost all the answers posted are good enough for me, but I'm selecting the one that is a drop-in solution for my current code, so do look at the other ones too.
Here's a simple way of implementing this: __int128_t atoint128_t(const char *s) { const char *p = s; __int128_t val = 0; if (*p == '-' || *p == '+') { p++; } while (*p >= '0' && *p <= '9') { val = (10 * val) + (*p - '0'); p++; } if (*s == '-') val = val * -1; return val; } This code checks each character to see if it's a digit (with an optional leading + or -), and if so it multiplies the current result by 10 and adds the value associated with that digit. It then inverts the sign if need be. Note that this implementation does not check for overflow, which is consistent with the behavior of atoi. EDIT: Revised implementation that covers int128_MIN case by either adding or subtracting the value of each digit based on the sign, and skipping leading whitespace. int myatoi(const char *s) { const char *p = s; int neg = 0, val = 0; while ((*p == '\n') || (*p == '\t') || (*p == ' ') || (*p == '\f') || (*p == '\r') || (*p == '\v')) { p++; } if ((*p == '-') || (*p == '+')) { if (*p == '-') { neg = 1; } p++; } while (*p >= '0' && *p <= '9') { if (neg) { val = (10 * val) - (*p - '0'); } else { val = (10 * val) + (*p - '0'); } p++; } return val; }
Here is a C++ implementation:
#include <cctype>
#include <string>
#include <stdexcept>
__int128_t atoint128_t(std::string const & in)
{
    __int128_t res = 0;
    size_t i = 0;
    bool sign = false;
    if (in[i] == '-') {
        ++i;
        sign = true;
    }
    if (in[i] == '+') {
        ++i;
    }
    for (; i < in.size(); ++i) {
        const char c = in[i];
        if (not std::isdigit(static_cast<unsigned char>(c)))
            throw std::runtime_error(std::string("Non-numeric character: ") + c);
        res *= 10;
        res += c - '0';
    }
    if (sign) {
        res *= -1;
    }
    return res;
}
int main()
{
    __int128_t a = atoint128_t("170141183460469231731687303715884105727");
}
If you want to test it then there is a stream operator here.
Performance
I ran a few performance tests. I generated 100,000 random numbers uniformly distributed in the entire support of __int128_t. Then I converted each of them 2000 times. All of these (200,000,000) conversions were completed within ~12 seconds, using this code:
#include <iostream>
#include <string>
#include <random>
#include <vector>
#include <chrono>
int main()
{
    std::mt19937 gen(0);
    std::uniform_int_distribution<> num(0, 9);
    std::uniform_int_distribution<> len(1, 38);
    std::uniform_int_distribution<> sign(0, 1);
    std::vector<std::string> str;
    for (int i = 0; i < 100000; ++i) {
        std::string s;
        int l = len(gen);
        if (sign(gen)) s += '-';
        for (int u = 0; u < l; ++u)
            s += std::to_string(num(gen));
        str.emplace_back(s);
    }
    namespace sc = std::chrono;
    auto start = sc::duration_cast<sc::microseconds>(sc::high_resolution_clock::now().time_since_epoch()).count();
    __int128_t b = 0;
    for (int u = 0; u < 200; ++u) {
        for (int i = 0; i < str.size(); ++i) {
            __int128_t a = atoint128_t(str[i]);
            b += a;
        }
    }
    auto time = sc::duration_cast<sc::microseconds>(sc::high_resolution_clock::now().time_since_epoch()).count() - start;
    std::cout << time / 1000000. << 's' << std::endl;
}
Adding here a "not-so-naive" implementation in pure C, it's still kind of simple: #include <stdio.h> #include <inttypes.h> __int128 atoi128(const char *s) { while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '+') ++s; int sign = 1; if (*s == '-') { ++s; sign = -1; } size_t digits = 0; while (s[digits] >= '0' && s[digits] <= '9') ++digits; char scratch[digits]; for (size_t i = 0; i < digits; ++i) scratch[i] = s[i] - '0'; size_t scanstart = 0; __int128 result = 0; __int128 mask = 1; while (scanstart < digits) { if (scratch[digits-1] & 1) result |= mask; mask <<= 1; for (size_t i = digits-1; i > scanstart; --i) { scratch[i] >>= 1; if (scratch[i-1] & 1) scratch[i] |= 8; } scratch[scanstart] >>= 1; while (scanstart < digits && !scratch[scanstart]) ++scanstart; for (size_t i = scanstart; i < digits; ++i) { if (scratch[i] > 7) scratch[i] -= 3; } } return result * sign; } int main(int argc, char **argv) { if (argc > 1) { __int128 x = atoi128(argv[1]); printf("%" PRIi64 "\n", (int64_t)x); // just for demo with smaller numbers } } It reads the number bit by bit, using a shifted BCD scratch space, see Double dabble for the algorithm (it's reversed here). This is a lot more efficient than doing many multiplications by 10 in general. *) This relies on VLAs, without them, you can replace char scratch[digits]; with char *scratch = malloc(digits); if (!scratch) return 0; and add a free(scratch); at the end of the function. Of course, the code above has the same limitations as the original atoi() (e.g. it will produce "random" garbage on overflow and has no way to check for that) .. if you need strtol()-style guarantees and error checking, extend it yourself (not a big problem, just work to do). *) Of course, implementing double dabble in C always suffers from the fact you can't use "hardware carries", so there are extra bit masking and testing operations necessary. 
On the other hand, "naively" multiplying by 10 can be very efficient, as long as the platform provides multiplication instructions with a width "close" to your target type. Therefore, on your typical x86_64 platform (which has instructions for multiplying 64bit integers), this code is probably a lot slower than the naive decimal method. But it scales much better to really huge integers (which you would implement e.g. using arrays of uintmax_t).
is there any way for me to make this code work?
"What about implementing your own atoint128_t ?" @Marian
It is not too hard to roll your own atoint128_t().
Points to consider.
There is 0 or 1 more representable negative value than positive values. Accumulating the value using negative numbers provides more range.
Overflow is not defined for atoi(). Perhaps provide a capped value and set errno? Detecting potential OF prevents UB.
__int128_t constants need careful code to form correctly.
How to handle unusual input? atoi() is fairly loose and made sense years ago for speed/size, yet less UB is usually desired these days. Candidate cases: "", " ", "-", "z", "+123", "999..many...999", "the min int128", "locale_specific_space" + " 123" or even non-string NULL.
Code to do atoi() and atoint128_t() need only vary on the type, range, and names. The algorithm is the same.
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#if 1
#define int_t __int128_t
#define int_MAX (((__int128_t)0x7FFFFFFFFFFFFFFF << 64) + 0xFFFFFFFFFFFFFFFF)
#define int_MIN (-1 - int_MAX)
#define int_atoi atoint128_t
#else
#define int_t int
#define int_MAX INT_MAX
#define int_MIN INT_MIN
#define int_atoi int_atoi
#endif
Sample code: Tailor as needed. Relies on C99 or later negative/positive and % functionality.
int_t int_atoi(const char *s) {
  if (s == NULL) { // could omit this test
    errno = EINVAL;
    return 0;
  }
  while (isspace((unsigned char ) *s)) { // skip same leading white space like atoi()
    s++;
  }
  char sign = *s; // remember if the sign was `-` for later
  if (sign == '-' || sign == '+') {
    s++;
  }
  int_t sum = 0;
  while (isdigit((unsigned char)*s)) {
    int digit = *s - '0';
    if ((sum > int_MIN/10) || (sum == int_MIN/10 && digit <= -(int_MIN%10))) {
      sum = sum * 10 - digit; // accumulate on the - side
    } else {
      sum = int_MIN;
      errno = ERANGE;
      break; // overflow
    }
    s++;
  }
  if (sign != '-') {
    if (sum < -int_MAX) {
      sum = int_MAX;
      errno = ERANGE;
    } else {
      sum = -sum; // Make positive
    }
  }
  return sum;
}
As @Lundin commented about the lack of overflow detection, etc., modeling the string-->int128 conversion after strtol() is a better idea.
For simplicity, consider __int128_t strto__128_base10(const char *s, char **endptr);
This answer already handles overflow and flags errno like strtol(). It just needs a few changes:
bool digit_found = false;
while (isdigit((unsigned char)*s)) {
  digit_found = true;
  // delete the `break`
  // On overflow, continue looping to get to the end of the digits.
  // break;
// after the `while()` loop:
if (!digit_found) { // optional test
  errno = EINVAL;
}
if (endptr) {
  *endptr = (char *)(digit_found ? s : original_s);
}
A full long int strtol(const char *nptr, char **endptr, int base); like functionality would also handle other bases with special code when base is 0 or 16. @chqrlie
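Putting those changes together, a complete base-10 version might look like the sketch below. It assumes a compiler with the __int128 extension (GCC or Clang); strtoi128_base10, I128_MAX and I128_MIN are names I made up for illustration, and unlike the optional test above it leaves errno untouched when no digits are found.

```c
#include <ctype.h>
#include <errno.h>

typedef __int128 i128;   /* compiler extension, not standard C */

#define I128_MAX ((i128)(((unsigned __int128)1 << 127) - 1))
#define I128_MIN (-I128_MAX - 1)

/* Hypothetical strtol()-style parser, base 10 only. On overflow it sets
   errno to ERANGE and returns the nearest limit. *endptr (if not NULL)
   receives the first unparsed character, or s itself if no digits were found. */
i128 strtoi128_base10(const char *s, char **endptr)
{
    const char *p = s;
    int negative = 0, any_digit = 0, overflow = 0;
    i128 sum = 0;                       /* accumulated on the negative side */

    while (isspace((unsigned char)*p))
        p++;
    if (*p == '-' || *p == '+') {
        negative = (*p == '-');
        p++;
    }
    while (isdigit((unsigned char)*p)) {
        int digit = *p - '0';
        any_digit = 1;
        if (!overflow) {
            if (sum > I128_MIN / 10 ||
                (sum == I128_MIN / 10 && digit <= -(int)(I128_MIN % 10)))
                sum = sum * 10 - digit;
            else
                overflow = 1;           /* keep scanning to find the end */
        }
        p++;
    }
    if (endptr)
        *endptr = (char *)(any_digit ? p : s);
    if (!any_digit)
        return 0;
    if (overflow) {
        errno = ERANGE;
        return negative ? I128_MIN : I128_MAX;
    }
    if (!negative) {
        if (sum < -I128_MAX) {          /* magnitude is exactly 2^127 */
            errno = ERANGE;
            return I128_MAX;
        }
        sum = -sum;
    }
    return sum;
}
```

As with strtol(), the caller should set errno to 0 before the call and compare the end pointer with the input to distinguish "no digits" from a parsed zero.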
The C Standard does not mandate support for 128-bit integers. Yet they are commonly supported by modern compilers: both gcc and clang support the types __int128_t and __uint128_t, but surprisingly still keep intmax_t and uintmax_t limited to 64 bits. Beyond the basic arithmetic operators, there is not much support for these large integers, especially in the C library: no scanf() or printf() conversion specifiers, etc. Here is an implementation of strtoi128(), strtou128() and atoi128() that is consistent with the C Standard's atoi(), strtol() and strtoul() specifications. #include <ctype.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <string.h> /* Change these typedefs for your local flavor of 128-bit integer types */ typedef __int128_t i128; typedef __uint128_t u128; static int strdigit__(char c) { /* This is ASCII / UTF-8 specific, would not work for EBCDIC */ return (c >= '0' && c <= '9') ? c - '0' : (c >= 'a' && c <= 'z') ? c - 'a' + 10 : (c >= 'A' && c <= 'Z') ? 
c - 'A' + 10 : 255; } static u128 strtou128__(const char *p, char **endp, int base) { u128 v = 0; int digit; if (base == 0) { /* handle octal and hexadecimal syntax */ base = 10; if (*p == '0') { base = 8; if ((p[1] == 'x' || p[1] == 'X') && strdigit__(p[2]) < 16) { p += 2; base = 16; } } } if (base < 2 || base > 36) { errno = EINVAL; } else if ((digit = strdigit__(*p)) < base) { v = digit; /* convert to unsigned 128 bit with overflow control */ while ((digit = strdigit__(*++p)) < base) { u128 v0 = v; v = v * base + digit; if (v < v0) { v = ~(u128)0; errno = ERANGE; } } if (endp) { *endp = (char *)p; } } return v; } u128 strtou128(const char *p, char **endp, int base) { if (endp) { *endp = (char *)p; } while (isspace((unsigned char)*p)) { p++; } if (*p == '-') { p++; return -strtou128__(p, endp, base); } else { if (*p == '+') p++; return strtou128__(p, endp, base); } } i128 strtoi128(const char *p, char **endp, int base) { u128 v; if (endp) { *endp = (char *)p; } while (isspace((unsigned char)*p)) { p++; } if (*p == '-') { p++; v = strtou128__(p, endp, base); if (v >= (u128)1 << 127) { if (v > (u128)1 << 127) errno = ERANGE; return -(i128)(((u128)1 << 127) - 1) - 1; } return -(i128)v; } else { if (*p == '+') p++; v = strtou128__(p, endp, base); if (v >= (u128)1 << 127) { errno = ERANGE; return (i128)(((u128)1 << 127) - 1); } return (i128)v; } } i128 atoi128(const char *p) { return strtoi128(p, (char**)NULL, 10); } char *utoa128(char *dest, u128 v, int base) { char buf[129]; char *p = buf + 128; const char *digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"; *p = '\0'; if (base >= 2 && base <= 36) { while (v > (unsigned)base - 1) { *--p = digits[v % base]; v /= base; } *--p = digits[v]; } return strcpy(dest, p); } char *itoa128(char *buf, i128 v, int base) { char *p = buf; u128 uv = (u128)v; if (v < 0) { *p++ = '-'; uv = -uv; } if (base == 10) utoa128(p, uv, 10); else if (base == 16) utoa128(p, uv, 16); else utoa128(p, uv, base); return buf; } static char *perrno(char 
*buf, int err) { switch (err) { case EINVAL: return strcpy(buf, "EINVAL"); case ERANGE: return strcpy(buf, "ERANGE"); default: sprintf(buf, "%d", err); return buf; } } int main(int argc, char *argv[]) { char buf[130]; char xbuf[130]; char ebuf[20]; char *p1, *p2; i128 v, v1; u128 v2; int i; for (i = 1; i < argc; i++) { printf("%s:\n", argv[i]); errno = 0; v = atoi128(argv[i]); perrno(ebuf, errno); printf(" atoi128(): %s 0x%s errno=%s\n", itoa128(buf, v, 10), utoa128(xbuf, v, 16), ebuf); errno = 0; v1 = strtoi128(argv[i], &p1, 0); perrno(ebuf, errno); printf(" strtoi128(): %s 0x%s endptr:\"%s\" errno=%s\n", itoa128(buf, v1, 10), utoa128(xbuf, v1, 16), p1, ebuf); errno = 0; v2 = strtou128(argv[i], &p2, 0); perrno(ebuf, errno); printf(" strtou128(): %s 0x%s endptr:\"%s\" errno=%s\n", utoa128(buf, v2, 10), utoa128(xbuf, v2, 16), p2, ebuf); } return 0; }
Convert string of hexadecimal to decimal in c
I am writing an operating system in C and assembly, and in implementing the EXT2 file system I have encountered a problem. I need to convert FOUR bytes of hexadecimal to decimal in c. An example would be to convert 00 00 01(10000) to 65536.I need to convert to decimal,because parsing the super block requires all values to be in decimal. Most specifically the ext2 fs I'm working on is here: #include "ext2.h" #include <stdlib.h> long hex2dec(unsigned const char *hex){ long ret = 0; int i = 0; while(hex[i] != 0){ //if(hex[i] >= 0x00 && hex[i] <= 0x09) // ret+=(10 * i) * hex[i]; } //kprintf("\n"); return ret; } char *strsep(char *buf,int offset,int num){ char *ret = malloc(1024); int j = 0; int i = offset; int end = (offset + num); int i1 = 0; while(i1 < num){ ///kstrcat(ret,&buf[i]); ret[i1] = buf[i]; i++; i1++; } return ret; } int get_partition(partnum){ if(partnum > 4) return -1; //int i = (12 * partnum); int i = 0; if(partnum == 1) i = 190; else if(partnum == 2) i = 206; else if(partnum == 3) i = 222; else i = 190; int ret = 0; char *buf = malloc(1024); ata_read_master(buf,1,0x00); ret = buf[(i + 2)]; return ret; } int _intlen(int i){ int ret = 0; while(i){ ret++; i/=10; } return ret; } int _hex2int(char c){ if(c == '0') return 0; else if(c == '1') return 1; else if(c == '2') return 2; else if(c == '3') return 3; else if(c == '4') return 4; else if(c == '5') return 5; else if(c == '6') return 6; else if(c == '7') return 7; else if(c == '8') return 8; else if(c == '9') return 9; else if(c == 'A') return 10; else if(c == 'B') return 11; else if(c == 'C') return 12; else if(c == 'D') return 13; else if(c == 'E') return 14; else if(c == 'F') return 15; } int hex2int(char c){ int i = c; } int comb(const char *str,int n){ int i = 0; int ret = 0; while(i < n){ //if(str[i] == 0x01) // kprintf("(:"); /*int j = str[i]; int k = 0; int m = 0; if(j < 10) j*=10; else while(j > 0){ k+=(10 ^ (_intlen(j) - m)) * j % 10; m++; j/=10; } //kprintf("%d",j); //if(j == 1) // 
kprintf("(:");*/ i++; } //ret = (char)ret; ret = (char)str int ret = 0; int i = 0; char *s = malloc(1024); /*while(i < n){ //kstrcat(s,&((char*)buf[i])); n++; }*/ return ret; //kprintf("\n"); //return ret; } struct ext2_superblock *parse_sblk(int partnum){ int i = get_partition(partnum); if(i > 0) kprintf("[EXT2_SUPERBLOCK]Found partition!\n"); else i = 0; struct ext2_superblock *ret; struct ext2_superblock retnp; char *buf = malloc(1024); int i1 = 0; //char *tmpbuf = malloc(4); /*if(i != 0) ata_read_master(buf,((i * 4)/256),0x00); else{ kprintf("[WRN]: Looking for superblock at offset 1024\n"); ata_read_master(buf,4,0x00); }*/ ata_read_master(buf,2,0x00); const char *cmp = strsep(buf,0,4); retnp.ninode = comb(strsep(buf,0,4),4); retnp.nblock = comb(strsep(buf,4,4),4); retnp.nsblock = comb(strsep(buf,8,4),4); retnp.nunallocb = comb(strsep(buf,12,4),4); retnp.nunalloci = comb(strsep(buf,16,4),4); retnp.supernum = comb(strsep(buf,20,4),4); retnp.leftshiftbs = comb(strsep(buf,24,4),4); retnp.leftshiftfs = comb(strsep(buf,28,4),4); retnp.numofblockpg= comb(strsep(buf,32,4),4); // retnp.numofffpbg= comb(strsep(buf,36,4)); retnp.numoffpbg = comb(strsep(buf,36,4),4); retnp.numofinpbg = comb(strsep(buf,40,4),4); retnp.lastmount = comb(strsep(buf,44,4),4); retnp.lastwrite = comb(strsep(buf,48,4),4); retnp.fsckpass = comb(strsep(buf,52,2),2); retnp.fsckallow = comb(strsep(buf,54,2),2); retnp.sig = comb(strsep(buf,56,2),2); retnp.state = comb(strsep(buf,58,2),2); retnp.erroropp = comb(strsep(buf,60,2),2); retnp.minorpor = comb(strsep(buf,52,2),2); retnp.ptimefsck = comb(strsep(buf,64,4),4); retnp.inter = comb(strsep(buf,68,4),4); retnp.osid = comb(strsep(buf,72,4),4); retnp.mpv = comb(strsep(buf,76,4),4); retnp.uid = comb(strsep(buf,80,2),2); retnp.gid = comb(strsep(buf,82,2),2); ret = &retnp; return ret; i1 = 0; } If there is anyway of avoiding conversion and successfully implementing ext2 I would be glad to hear it. I would prefer it to be in c,but assembly is also okay.
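If you have this:
const uint8_t bytes[] = { 0, 0, 1 };
and you want to treat these as the bytes of a (24-bit) unsigned integer in little-endian order, you can convert them to the actual integer using:
const uint32_t value = ((uint32_t) bytes[2] << 16) | ((uint32_t) bytes[1] << 8) | bytes[0];
This will set value equal to 65536. There is no separate "decimal" form to convert to: once the bytes are combined, the integer is just a number, and decimal or hexadecimal only matters when you print it.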
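Since the ext2 on-disk fields are stored little-endian, the same shift-and-or idea extends to the 4-byte and 2-byte superblock fields. A minimal sketch; read_le32 and read_le16 are hypothetical helper names, not from any library:

```c
#include <stdint.h>

/* Hypothetical helper: combines 4 little-endian bytes into a 32-bit value.
   Each byte is widened before shifting so the shifts cannot overflow an int. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: combines 2 little-endian bytes into a 16-bit value. */
uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)((uint32_t)p[0] | ((uint32_t)p[1] << 8));
}
```

With the superblock read into a byte buffer buf, the inode count would then be read_le32(buf + 0) and the signature read_le16(buf + 56), matching the offsets used in the question. This works regardless of the endianness of the machine running the code.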
You can use std::istringstream or sscanf instead of writing your own.
#include <cctype>
#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>
char const hex_text[] = "0x100";
const std::string hex_str(hex_text);
std::istringstream text_stream(hex_str);
unsigned int value;
text_stream >> std::hex >> value;
std::cout << "Decimal value of 0x100: " << value << "\n";
Or using sscanf:
sscanf(hex_text, "0x%X", &value);
std::cout << "Decimal value of 0x100: " << value << "\n";
A good idea is to search your C++ reference for existing functions, or search the internet, before writing your own.
To roll your own:
unsigned int hex2dec(const std::string& hex_text)
{
    unsigned int value = 0U;
    const std::size_t length = hex_text.length();
    for (std::size_t i = 0; i < length; ++i) {
        char c = hex_text[i];
        if ((c >= '0') && (c <= '9')) {
            value = value * 16 + (c - '0');
        } else {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            // Only A..F are hex digits; anything else (such as the x in 0x) is skipped
            if ((c >= 'A') && (c <= 'F')) {
                value = value * 16 + (c - 'A') + 10;
            }
        }
    }
    return value;
}
To convert this to use C-style character strings, change the parameter type and use strlen for the length.
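In plain C, the standard strtoul already does this conversion, so a hand-rolled loop is rarely needed. A small sketch with error checking; parse_hex is a hypothetical wrapper name:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper: parses a hex string such as "0x100" or "ff".
   Returns 1 and stores the result in *value on success, 0 on failure. */
int parse_hex(const char *text, unsigned long *value)
{
    char *end;
    unsigned long v;

    errno = 0;
    v = strtoul(text, &end, 16);   /* base 16 accepts an optional 0x prefix */
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;                  /* no digits, trailing junk, or overflow */
    *value = v;
    return 1;
}
```

Note that strtoul skips leading whitespace and accepts a sign, so add your own checks if the input must be strictly hex digits.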
C++ ShiftJIS to UTF8 conversion
I need to convert double-byte characters, in my case Shift-JIS, into something easier to handle, preferably with standard C++. The following question ended up without a workaround: Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized. So does anyone have a suggestion or a reference on how to handle this conversion with standard C++?
Normally I would recommend using the ICU library, but for this alone it is far too much overhead.
First, a conversion function that takes a std::string with Shift-JIS data and returns a std::string with UTF-8 (note 2019: no idea anymore if it works :)).
It uses a uint8_t array of 25088 elements (25088 bytes), which appears as convTable in the code. The function does not fill this array; you have to load it first, e.g. from a file. The second code part below is a program that can generate the file.
The conversion function doesn't check whether the input is valid Shift-JIS data.
#include <cstdint>
#include <string>

// convTable (25088 bytes) must be loaded before calling this function
std::string sj2utf8(const std::string &input)
{
    std::string output(3 * input.length(), ' '); //ShiftJis won't give 4byte UTF8, so max. 3 byte per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while(indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if(arraySection == 0x8) arrayOffset = 0x100; //these are two-byte shiftjis
        else if(arraySection == 0x9) arrayOffset = 0x1100;
        else if(arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0; //this is one byte shiftjis

        //determining real array offset
        if(arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if(indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        //unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        //converting to UTF8
        if(unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if(unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); //remove the unnecessary bytes
    return output;
}
About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:
First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste this here because of the length, so we have to hope at least unicode.org stays online.
Then run this program, piping or redirecting the above text file in and redirecting the binary output to a new file. (Needs a binary-safe shell, no idea if it works on Windows.)
#include<iostream>
#include<string>
#include<cstdio>
#include<cstdint>

using namespace std;

// pipe SHIFTJIS.txt in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; //same bigendian array as in converting function
    mapping = new uint8_t[2*(256 + 3*256*16)];

    //initializing with space for invalid value, and then ASCII control chars
    for(size_t i = 32; i < 256 + 3*256*16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for(size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while(getline(cin, s)) //pipe the file SHIFTJIS to stdin
    {
        if(s.substr(0, 2) != "0x") continue; //comment lines

        uint16_t shiftJisValue, unicodeValue;
        if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) //getting hex values
        {
            puts("Error hex reading");
            continue;
        }

        size_t offset; //array offset
        if((shiftJisValue >> 8) == 0) offset = 0;
        else if((shiftJisValue >> 12) == 0x8) offset = 256;
        else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
        else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
        else
        {
            puts("Error input values");
            continue;
        }

        offset = 2 * (offset + (shiftJisValue & 0xfff));
        if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            puts("Error mapping not 1:1");
            continue;
        }

        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
    delete[] mapping;
    return 0;
}
Notes:
Two-byte big-endian raw Unicode values (more than two bytes are not necessary here).
The first 256 chars (512 bytes) are for the single-byte ShiftJIS chars, with value 0x20 for invalid ones.
Then 3 * 256*16 chars for the groups 0x8???, 0x9??? and 0xE???
= 25088 bytes
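The answer leaves loading convTable to the reader. One possible way to do it is sketched below, assuming the generator's output was redirected into a binary file. The names loadConvTable and kConvTableSize are invented for this sketch, and the file path is whatever you chose when running the generator.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Size of the table described above: 2 * (256 + 3 * 256 * 16) bytes.
constexpr std::size_t kConvTableSize = 25088;

// Hypothetical loader (not part of the original answer): reads the whole
// binary file into memory and checks that it has the expected size.
std::vector<std::uint8_t> loadConvTable(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
    {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<std::uint8_t> table((std::istreambuf_iterator<char>(file)),
                                    std::istreambuf_iterator<char>());
    if (table.size() != kConvTableSize)
    {
        throw std::runtime_error("unexpected table size in " + path);
    }
    return table;
}
```

With this, convTable could be defined once at startup, for example as a std::vector<std::uint8_t> returned by loadConvTable, since sj2utf8 only indexes into it.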
For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here: https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h
Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:
#include <stdint.h>
#include <string.h>

const char SJIS_REPLACEMENT_TABLE[] =
    " ,.,..:;?!\"*'`*^"
    "-_????????*---/\\"
    "~||--''\"\"()()[]{"
    "}<><>[][][]+-+X?"
    "-==<><>????*'\"CY"
    "$c&%#&*#S*******"
    "*******T><^_'='";

//Convert Shift-JIS characters to ASCII equivalent
//(the buffer is read two bytes at a time, so it should hold only two-byte characters)
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);

    for (i = 0; i < len; i += 2)
    {
        //cast through unsigned char so bytes >= 0x80 are not sign-extended
        ch = ((unsigned char)bData[i] << 8) | (unsigned char)bData[i+1];

        // 'A' .. 'Z'
        // '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }

        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }

        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }

        if (ch == 0x0000)
        {
            //End of the string
            bData[j] = 0;
            return;
        }

        // Character not found
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}
CString Hex value conversion to Byte Array
I have been trying to convert a CString that contains a hex string into a byte array and have been unsuccessful so far. I have looked on forums and none of them seem to help. Is there a function with just a few lines of code to do this conversion?
My code:
BYTE abyData[8]; // BYTE = unsigned char
CString sByte = "0E00000000000400";
Expecting:
abyData[0] = 0x0E;
abyData[6] = 0x04; // etc.
You can simply gobble up two characters at a time:
int value(char c)
{
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    return -1; // Error! (not checked by the loop below)
}

for (unsigned int i = 0; i != 8; ++i)
{
    abyData[i] = value(sByte[2 * i]) * 16 + value(sByte[2 * i + 1]);
}
Of course, 8 should be the size of your array, and you should ensure that the string is precisely twice as long. A checking version of this would make sure that each character is a valid hex digit and signal some type of error if that isn't the case.
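The checking version mentioned above could look roughly like the following sketch. It takes a std::string rather than a CString to stay self-contained (an assumption; with MFC you would pass the characters of the CString instead), and the names hexDigitValue and hexToBytes are invented for illustration.

```cpp
#include <cstddef>
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
int hexDigitValue(char c)
{
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    return -1;
}

// Hypothetical checking variant: fills out[0..outSize) from hex.
// Returns false if the string is not exactly twice as long as the array
// or contains a non-hex character; out may be partially written then.
bool hexToBytes(const std::string& hex, unsigned char* out, std::size_t outSize)
{
    if (hex.size() != 2 * outSize)
    {
        return false;
    }
    for (std::size_t i = 0; i != outSize; ++i)
    {
        const int hi = hexDigitValue(hex[2 * i]);
        const int lo = hexDigitValue(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
        {
            return false;
        }
        out[i] = static_cast<unsigned char>(hi * 16 + lo);
    }
    return true;
}
```

Returning a bool keeps the call site short: hexToBytes("0E00000000000400", abyData, sizeof(abyData)) either fills the array or reports that the input was malformed.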
How about something like this:
for (int i = 0; i < sizeof(abyData) && (i * 2) < sByte.GetLength(); i++)
{
    char ch1 = sByte[i * 2];
    char ch2 = sByte[i * 2 + 1];
    int value = 0;

    if (std::isdigit(ch1))
        value += ch1 - '0';
    else
        value += (std::tolower(ch1) - 'a') + 10;

    // That was the four high bits, so shift them into place
    value <<= 4;

    // Low four bits come from the second character
    if (std::isdigit(ch2))
        value += ch2 - '0';
    else
        value += (std::tolower(ch2) - 'a') + 10;

    abyData[i] = value;
}
Note: The code above is not tested.
You could:
#include <stdint.h>
#include <sstream>
#include <iostream>

int main()
{
    unsigned char result[8];
    std::stringstream ss;
    ss << std::hex << "0E00000000000400";
    ss >> *( reinterpret_cast<uint64_t *>( result ) );
    std::cout << static_cast<int>( result[1] ) << std::endl;
}
However, take care with memory issues: the cast assumes that result is suitably aligned for a uint64_t and it sidesteps the strict aliasing rules. Also, on a little-endian machine the result is in the reverse order of what you would expect, so:
result[0] = 0x00
result[1] = 0x04
...
result[7] = 0x0E
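If the reversed byte order is a problem, one way around it is to parse into a uint64_t variable and pull the bytes out by shifting, which gives the same order on any machine and avoids the pointer cast. This is only a sketch; the function name hexToBytesBigEndian is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: parses up to 16 hex digits and stores the most
// significant byte in out[0]. Shorter input is right-aligned, i.e. padded
// with leading zero bytes. Returns false if nothing could be parsed.
bool hexToBytesBigEndian(const std::string& hex, unsigned char (&out)[8])
{
    std::istringstream ss(hex);
    std::uint64_t value = 0;
    if (!(ss >> std::hex >> value))
    {
        return false;
    }
    for (std::size_t i = 0; i < 8; ++i)
    {
        // Shift so that byte i (counted from the most significant end)
        // lands in the low 8 bits, then truncate.
        out[i] = static_cast<unsigned char>(value >> (8 * (7 - i)));
    }
    return true;
}
```

For the string from the question this leaves 0x0E in out[0] and 0x04 in out[6], matching the order the asker expected.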