Convert ASCII to Unsigned Int and Vice Versa - c++

I am reading a large file with several different fields on each line. By analogy, you can think of each line of the file as representing an employee and one of the fields contains the department name that they work in. However, the department name can be made up of any set of 4-5 ASCII characters (e.g. "1234", "ABCD", "P+0$i").
Currently, I am (naively) storing the characters as an std::string but I've noticed that I've been performing a lot of time-consuming string comparisons. Thus, I would like to read the field from the file, convert the string into a number (maybe an unsigned int?), and then later do many numerical comparisons (and avoid the string comparison). Of course, I will need a way to convert the number back into a string for output.
Most of my online searches bring up "convert string to number", which discusses using stringstream to convert a numeric string to an int of some sort. This isn't particularly helpful and I can't seem to come up with a proper search query to find a solution. Can anybody please point me to a relevant source or provide a way to perform this conversion?

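Well, if you have up to 5 ASCII characters, then the simplest approach is to pad it out to 8 with whatever character you like, then read those 8 bytes as a uint64_t, e.g. *reinterpret_cast<uint64_t*>(the_id.data()). Strictly speaking that cast breaks the aliasing rules and may be misaligned; copying the bytes into a uint64_t with std::memcpy is the portable way to do the same thing.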
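If you want to fit the character representation into a 32-bit int, you need to do considerably more work: simply discarding the high-order bit (possible because ASCII codes are 0-127) still leaves 7*5 = 35 bits - too many for a 32-bit type. Assuming the ids don't contain any control codes (i.e. ASCII codes 0-31), you can shrink this with base-96 encoding. Be aware that 96^5 = 8,153,726,976 is still larger than 2^32 = 4,294,967,296, so the code below only stays within 32 bits for ids of up to 4 characters (or a smaller alphabet); for five full base-96 digits, make idnum a 64-bit type: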
unsigned base = 128 - 32;
// pad c out to 5 characters if necessary.
unsigned idnum = ((((c[0] - 32) * base + (c[1] - 32)) * base + (c[2] - 32)) * base + (c[3] - 32)) * base + (c[4] - 32);
You may find it easier to read with a loop:
unsigned idnum = 0;
for (size_t i = 0; i < 5; ++i)
{
idnum *= base;
idnum += c[i] - ' ';
}
Unpacking the number back to the string value is done with % base to get the last digit, then / base to prepare for getting the next one, repeated once per character.
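To make the round trip concrete, here is a minimal sketch of packing and unpacking. The names pack_id, unpack_id, kBase and kWidth are made up for illustration, and it assumes every character is in the range 32..127. It uses a 64-bit accumulator because 96^5 does not fit in 32 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names; a sketch, not the original answer's code.
constexpr std::uint64_t kBase = 128 - 32;  // 96 non-control ASCII codes
constexpr std::size_t kWidth = 5;          // ids are padded to 5 characters

// Packs up to 5 characters into a number, one base-96 digit per character.
// Assumes every character is in the range 32..127.
std::uint64_t pack_id(std::string id)
{
    id.resize(kWidth, ' ');                // pad with spaces (digit value 0)
    std::uint64_t idnum = 0;
    for (char ch : id)
    {
        idnum *= kBase;
        idnum += static_cast<unsigned char>(ch) - 32;
    }
    return idnum;
}

// Reverses pack_id: % kBase yields the last digit, / kBase drops it.
std::string unpack_id(std::uint64_t idnum)
{
    std::string id(kWidth, ' ');
    for (std::size_t i = kWidth; i-- > 0; )
    {
        id[i] = static_cast<char>(idnum % kBase + 32);
        idnum /= kBase;
    }
    return id;
}
```

Note that unpack_id always returns 5 characters, so a 4-character id such as "1234" comes back as "1234 " and the padding has to be trimmed if it matters for output.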

Create a map with department name -> integer mappings somewhere. Depending on how fixed the department names are, either read them from a configuration file or define the map statically. Create the inverse mapping afterwards to map integer -> string when needed.
While reading your file, lookup the associated key in the mapping and store it in your data structure.
When needed, lookup the department name in the inverse map
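A rough sketch of such a two-way mapping follows. The class name DepartmentTable and its members are hypothetical, and instead of loading the names from a configuration file it assigns the next free integer the first time a name is seen, which is one possible variation on the idea.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class; a sketch of the name <-> integer mapping.
class DepartmentTable
{
public:
    // Returns the id for a name, assigning the next free id on first sight.
    std::size_t id_of(const std::string& name)
    {
        auto it = ids_.find(name);
        if (it != ids_.end())
            return it->second;
        const std::size_t id = names_.size();
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }

    // Inverse lookup: id -> name. Throws std::out_of_range for unknown ids.
    const std::string& name_of(std::size_t id) const { return names_.at(id); }

private:
    std::unordered_map<std::string, std::size_t> ids_;  // name -> id
    std::vector<std::string> names_;                    // id -> name
};
```

Each record then stores only the integer, comparisons are plain integer comparisons, and name_of is called when the name is needed for output.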

Given that std::string equality operations are actually efficient for data of arbitrary lengths and you need to retain the actual string value, I feel that you might want to look elsewhere for performance improvements. Looking over your requirements, it looks like you just need better search efficiency for your objects. The std::unordered_map container is likely a good candidate. It is an associative container with average constant-time lookups via a key. You can store other collections of your data as the value type for the unordered_map and set the associated key to whatever you want to look it up by. Here is an example of some types that enable looking up a matching subset of your data via a string.
struct Employee;
typedef std::vector<std::shared_ptr<Employee>> Employees;
typedef std::unordered_map<std::string, Employees> EmployeeByLookup;
You could then lookup all employees that match a given key value like this.
static EmployeeByLookup byDepartment;
Employees& GetDepartmentList(const std::string& department)
{
return byDepartment[department];
}
If you need to look up your object by value instead of via an associative key then I would suggest that you look at std::unordered_set. It also has an average complexity of constant time for lookups. You can create your own hash function for your object if you need to optimize the internal hash performance.
I created a simple application using the unordered_map example. Take a look if you're interested.

C++11 has std::stoi. Other than that, you can use a stringstream for conversions:
std::istringstream iss("1234");
int x;
if ( iss >> x )
std::cout << "Got " << x << "\n";
else
std::cout << "String did not contain a number\n";
You can do other stream extractions from iss just like you are used to doing with cin.
I'm not sure why you consider the stringstream advice to be "not very useful"?
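For completeness, a small sketch of the std::stoi route. The wrapper parse_int is a made-up name; std::stoi itself reports failure by throwing std::invalid_argument (no digits at the start) or std::out_of_range (value does not fit in an int).

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and stores the value if text starts
// with something std::stoi can parse as an int.
bool parse_int(const std::string& text, int& out)
{
    try
    {
        out = std::stoi(text);
        return true;
    }
    catch (const std::invalid_argument&) { return false; }  // no leading digits
    catch (const std::out_of_range&) { return false; }      // too large for int
}
```

This only helps for fields that really are decimal numbers; an id like "ABCD" is rejected rather than converted.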

Related

Assigning a unique index to each character (not character position!) within a string

I am trying to assign a unique index between 0 and N-1 (where N is the number of unique characters within the string) to the characters in a UTF32 string.
For example, if I had the string "hello", the output of the function would be:
'h' = 0
'e' = 1
'l' = 2
'o' = 3
There are 4 unique characters in the string "hello", so the output would need to be between 0 and 3.
I know this can be done using a hash table quite easily, or even minimal perfect hashing. What I'm curious about is if there's a more efficient way of handling this task, since I only ever need to map a single character to a single output value (I don't need to hash entire strings, for example). Because of this, using something like std::map seems a bit overkill, however I've not been able to find mention of any alternative that would be any faster to initialize or evaluate (though I suppose you could just shove the characters in a sorted array and look them up using a binary search).
I would probably use a hash-table (in the form of std::unordered_set) to store the unique letters, then just use a simple counter when the output is needed.
Something like
std::string str = "hello";
std::unordered_set<char> chars(begin(str), end(str));
std::size_t counter = 0;
for (char c : chars)
std::cout << '\'' << c << "' = " << counter++ << '\n';
"any alternative that would be any faster to initialize or evaluate"
You're not going to get faster than a std::unordered_map<char, size_t> as you have to check if you've already seen a char before you know if you need to store a new char --> size_t map for it.
Unless, of course, you write a better unordered map. As @MaxLanghof points out: this can be done with something like a std::array<unsigned char, 256> initialised to a "not found" value.
If you work with 8-bit chars, you can use a std::array<unsigned char, 256> to map from char to unique index (which obviously fits into a char too):
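As a sketch of the std::unordered_map variant for UTF-32 input (index_chars is a made-up name), each character gets the next free index the first time it is seen:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical function. Assigns 0, 1, 2, ... in order of first appearance.
std::unordered_map<char32_t, std::size_t> index_chars(const std::u32string& text)
{
    std::unordered_map<char32_t, std::size_t> index;
    for (char32_t c : text)
        index.emplace(c, index.size());  // does nothing if c is already a key
    return index;
}
```

For U"hello" this yields 'h' = 0, 'e' = 1, 'l' = 2, 'o' = 3, because emplace leaves existing entries untouched and index.size() is evaluated before the insertion happens.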
constexpr unsigned char UNASSIGNED = 255; // Could be another character but then the loop logic gets harder.
std::array<unsigned char, 256> indices;
std::fill(indices.begin(), indices.end(), UNASSIGNED);
std::string input = ...;
unsigned char nextUniqueIndex = 0;
for (unsigned char c : input)
if (indices[c] == UNASSIGNED)
{
indices[c] = nextUniqueIndex;
++nextUniqueIndex;
}
// indices now contains a mapping of each char in the input to a unique index.
This of course requires that your input string doesn't use the entire value range of char (or rather that there are not 256 distinct characters in the input).
Now, you said that you are working with UTF32 which doesn't make this solution immediately viable. Indeed, for 32-bit characters the map would require 16 GB of memory (which would not perform well in any case). But if you actually receive 2^32 different UTF32 characters in random order then you're already at 16 GB input data, so at this point the question is "what assumptions can you make about your input data that can be exploited to improve the lookup" (presumably in the form of a good hashing function) and what kind of hash table gives you the best performance. I would wager that std::unordered_map with its separate allocations per key-value-pair and linked list traversal on lookup will not result in peak performance.
The sorting approach you mentioned is one such option, but if e.g. the entire input is a mix of two characters this will not be "efficient" either compared to other approaches. I will also drop the keyword Bloom Filter here as, for large data volumes, it might be a good way to handle frequently seen characters quickly (i.e. having a separate data structure for frequent keys vs infrequent keys).
As you are using UTF32 strings, I'm assuming this is for a good reason, namely that you want to support a huge amount of different characters and symbols from all over the world. If you cannot make any assumption at all about which characters you will be most likely dealing with, I think the answer of Some programmer dude is your best bet.
However, std::unordered_set is known to be much slower than a simple array lookup, as proposed by Max Langhof. So, if you can make some assumptions, you may be able to combine these two ideas.
For example, if you can reasonably assume that most of your input will be ASCII characters, you can use something like this:
constexpr char32_t ExpectedStart = U' '; // space == 32
constexpr char32_t ExpectedEnd = 128;
int main()
{
std::basic_string<char32_t> input = U"Hello €";
std::array<bool, ExpectedEnd - ExpectedStart> fastLookup;
std::fill(fastLookup.begin(), fastLookup.end(), false);
std::unordered_set<char32_t> slowLookup;
for (auto c : input)
{
if (ExpectedStart <= c && c < ExpectedEnd)
fastLookup[c - ExpectedStart] = true;
else
slowLookup.insert(c);
}
size_t unique_id = 0;
for (char32_t c = ExpectedStart; c < ExpectedEnd; ++c)
if (fastLookup[c - ExpectedStart])
std::wcout << '\'' << (wchar_t)c << "' = " << unique_id++ << '\n';
for (auto c : slowLookup)
std::wcout << '\'' << (wchar_t)c << "' = " << unique_id++ << '\n';
}
Live demo.
Note that for printing purposes I cast the chars to wchar_t, as it is apparently quite difficult to properly print char32_t. But I'm assuming your final goal is not printing anyway, so I hope this doesn't matter.

How do I convert a Char to Int?

So I have a String of integers that looks like "82389235", but I wanted to iterate through it to add each number individually to a MutableList. However, when I go about it the way I think it would be handled:
var text = "82389235"
for (num in text) numbers.add(num.toInt())
This adds numbers completely unrelated to the string to the list. Yet, if I use println to output it to the console it iterates through the string perfectly fine.
How do I properly convert a Char to an Int?
That's because num is a Char, i.e. the resulting values are the ASCII values of those chars.
This will do the trick:
val txt = "82389235"
val numbers = txt.map { it.toString().toInt() }
The map could be further simplified:
map(Character::getNumericValue)
The variable num is of type Char. Calling toInt() on this returns its ASCII code, and that's what you're appending to the list.
If you want to append the numerical value, you can just subtract the ASCII code of 0 from each digit:
numbers.add(num.toInt() - '0'.toInt())
Which is a bit nicer like this:
val zeroAscii = '0'.toInt()
for(num in text) {
numbers.add(num.toInt() - zeroAscii)
}
This works with a map operation too, so that you don't have to create a MutableList at all:
val zeroAscii = '0'.toInt()
val numbers = text.map { it.toInt() - zeroAscii }
Alternatively, you could convert each character individually to a String, since String.toInt() actually parses the number - this seems a bit wasteful in terms of the objects created though:
numbers.add(num.toString().toInt())
On JVM there is efficient java.lang.Character.getNumericValue() available:
val numbers: List<Int> = "82389235".map(Character::getNumericValue)
Since Kotlin 1.5, there's a built-in function Char.digitToInt(): Int:
println('5'.digitToInt()) // 5 (int)
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/digit-to-int.html
For clarity, the zeroAscii answer can be simplified to
val numbers = txt.map { it - '0' }
as Char - Char -> Int. If you are looking to minimize the number of characters typed, that is the shortest answer I know. The
val numbers = txt.map(Character::getNumericValue)
may be the clearest answer, though, as it does not require the reader to know anything about the low-level details of ASCII codes. The toString().toInt() option requires the least knowledge of ASCII or Kotlin but is a bit weird and may be most puzzling to the readers of your code (though it was the thing I used to solve a bug before investigating if there really wasn't a better way!)

How to check if number from stdin is less than the numeric limit of a given type?

I have received a task that does not actually specify what range of input one of my functions should expect (only that it is always going to be a positive integer), and the input is decided at runtime. Can I somehow test if the type I selected can actually hold the value it was fed?
An illustration of what I am hoping to do:
char test;
std::cin >> test;
if(MAGIC)
{
std::cout << "Error." << std::endl;
}
With the magical part (or even the preceding line) being the test I'm looking for.
It should work like this:
stdin: 100 -> no output
stdin: 1000000 -> Error.
For most data types you can check the error state of the input stream as described here. However, this would not work for some other types, such as char from your post, because std::cin >> test would read data as a single character, i.e. '1' which is not what you need to achieve.
One approach is to read the number into std::string, and compare it to the max() obtained through limits:
std::string max = std::to_string(std::numeric_limits<char>::max());
std::string s;
std::cin >> s;
bool cmp = (s.size() == max.size()) ? (s <= max) : (s.size() < max.size());
std::cout << s << " " << cmp << std::endl;
Demo.
Note: The above code makes an assumption that the data entered is a number, which may be arbitrarily large, but must not contain characters other than digits. If you cannot make this assumption, use a solution from this Q&A to test if the input is numeric.
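The same string comparison can be wrapped into a reusable check. This is only a sketch: fits_in is a made-up name, it handles non-negative decimal input only, and unlike the snippet above it rejects non-digit input and strips leading zeros before comparing lengths.

```cpp
#include <algorithm>
#include <cctype>
#include <limits>
#include <string>

// Hypothetical helper: true if the decimal digit string fits in type T.
// Assumes a non-negative number; anything other than digits is rejected.
template <typename T>
bool fits_in(std::string digits)
{
    if (digits.empty() ||
        !std::all_of(digits.begin(), digits.end(),
                     [](unsigned char ch) { return std::isdigit(ch) != 0; }))
        return false;
    // Drop leading zeros (keeping one digit) so the length comparison is valid.
    digits.erase(0, std::min(digits.find_first_not_of('0'), digits.size() - 1));
    // Unary + promotes char types to int, so to_string yields e.g. "127".
    const std::string max = std::to_string(+std::numeric_limits<T>::max());
    return digits.size() == max.size() ? digits <= max
                                       : digits.size() < max.size();
}
```

With this, the MAGIC condition could be written as !fits_in<char>(s) after reading s with std::cin >> s.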
I think the answer may be in https://stackoverflow.com/a/1855465/7071399 to know where to get the limits, then refer to https://stackoverflow.com/a/9574912/7071399 for inspiration for your code (given you can use a long int or a char array to store the number and then check if it fits in the integer value).

C++ - Overloading operator>> and processing input using C-style strings

I'm working on an assignment where we have to create a "MyInt" class that can handle larger numbers than regular ints. We only have to handle non-negative numbers. I need to overload the >> operator for this class, but I'm struggling to do that.
I'm not allowed to #include <string>.
Is there a way to:
a. Accept input as a C-style string
b. Parse through it and check for white space and non-numbers (i.e. if the prompt is cin >> x >> y >> ch, and the user enters 1000 934H, to accept that input as two MyInts and then a char).
I'm assuming it has something to do with peek() and get(), but I'm having trouble figuring out where they come in.
I'd rather not know exactly how to do it! Just point me in the right direction.
Here's my constructor, so you can get an idea of what the class is (I also have a conversion constructor for const char *).
MyInt::MyInt (int n)
{
maxsize = 1;
for (int i = n; i > 9; i /= 10) {
// Divides the number by 10 to find the number of times it is divisible; that is the length
maxsize++;
}
intstring = new int[maxsize];
for (int j = (maxsize - 1); j >= 0; j--) {
// Copies the integer into an integer array by use of the modulus operator
intstring[j] = n % 10;
n = n / 10;
}
}
Thanks! Sorry if this question is vague, I'm still new to this. Let me know if I can provide any more info to make the question clearer.
So what you basically want is to parse a const char* to retrieve an integer number inside it, and ignore all whitespace (and possibly other) characters.
Remember that characters like '1' or 'M' or even ' ' are just integers, mapped to the ASCII table. So you can easily convert a character from its human-readable notation ('a') to its value in memory. There are plenty of sources on the ASCII table and chars in C/C++ so I'll let you find them, but you should get the idea. In C/C++, characters are numbers (of type char).
With this, you then know you can perform operations on them, like addition, or comparison.
One last thing when dealing with C-strings: they are null-terminated, meaning that the character '\0' is placed right after their last used character.

How to determine user input commands

I am starting to write a command line converter and the only concern I have is user input (the rest won't be hard). The program will have a few commands like (convert 2 m to km), so when the user enters that, the program will output the converted value. My question is, what is the best way to parse user input and determine the command that the user entered? Should I divide the user input into an array of words and then pass it to a function, so it can do something, or is there another way?
I have written a few types of "simple parsers" (and several more advanced ones). From what you describe, if the commands are "convert 2 m to km", then you would simply need to split things on spaces.
Of course, if you allow "convert2mtokm" and "convert 2m to km" it gets a bit more difficult to deal with. Sticking to a "strict rule of there has to be a(t least one) space between words" makes life a lot easier.
At that point, you will have a vector<string> cmd that can be dealt with. So for example:
if (cmd[0] == "convert")
{
convert(cmd);
}
...
void convert(vector<string> cmd)
{
double dist = stod(cmd[1]);
string unit_from = cmd[2];
string unit_to = cmd[4];
if(cmd[3] != "to")
{
cout << "error: expected \"to\" between the units" << endl;
return;
}
double factor = unit_conversion(unit_from, unit_to);
cout << "that becomes " << dist * factor << " in " << unit_to << endl;
}
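The splitting step that produces cmd is not shown above; one simple way to do it is with a string stream. This is a sketch, and split_words is a made-up name:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a line into words on runs of whitespace.
std::vector<std::string> split_words(const std::string& line)
{
    std::istringstream in(line);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)  // operator>> skips whitespace between words
        words.push_back(word);
    return words;
}
```

Before indexing into the result, check its size (the convert command above needs exactly 5 words), otherwise short input reads past the end of the vector.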
If you only have a few commands, it will be best to just call strtok(input, " ") repeatedly, which hands back the words of the command one token at a time (assuming your command words are all separated by spaces; note that the delimiter argument is a string, not a single char). Then you can do some simple if/switch checks to see which command the user entered. For a larger number of commands (where some may be similar), you will probably need to implement or at least write out a DFA (deterministic finite automaton).
An array of structures will be fine. The structure may be like this:
struct cmd
{
char **usrcmd;
void (*fc)();
};
Then you just have to iterate the array and compare the user input and the usrcmd[0] field (I assume the command is the first word).
However, this solution is not the best way to go if you have a lot of user commands to handle, since every lookup walks the whole array.
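A minimal sketch of such a table and the lookup loop. It simplifies the struct above to a single name per command, has the handlers return an int so the result is easy to check, and all names (Command, kCommands, dispatch, do_convert, do_help) are made up for illustration.

```cpp
#include <cstring>

// Hypothetical command handlers, standing in for real work.
int do_convert() { return 1; }
int do_help() { return 2; }

struct Command
{
    const char* name;  // first word of the user input
    int (*handler)();  // function to call for that word
};

const Command kCommands[] = {
    {"convert", do_convert},
    {"help", do_help},
};

// Returns the handler's result, or -1 if no command matches.
int dispatch(const char* first_word)
{
    for (const Command& c : kCommands)
        if (std::strcmp(c.name, first_word) == 0)
            return c.handler();
    return -1;
}
```

Adding a command then means adding one row to kCommands rather than another branch in an if chain.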