So I have a String of integers that looks like "82389235", but I wanted to iterate through it to add each number individually to a MutableList. However, when I go about it the way I think it would be handled:
var text = "82389235"
for (num in text) numbers.add(num.toInt())
This adds numbers completely unrelated to the string to the list. Yet, if I use println to output it to the console it iterates through the string perfectly fine.
How do I properly convert a Char to an Int?
That's because num is a Char, so toInt() returns the ASCII code of that character rather than the digit it represents.
This will do the trick:
val txt = "82389235"
val numbers = txt.map { it.toString().toInt() }
The map could be further simplified:
map(Character::getNumericValue)
The variable num is of type Char. Calling toInt() on this returns its ASCII code, and that's what you're appending to the list.
If you want to append the numerical value, you can just subtract the ASCII code of 0 from each digit:
numbers.add(num.toInt() - '0'.toInt())
Which is a bit nicer like this:
val zeroAscii = '0'.toInt()
for(num in text) {
numbers.add(num.toInt() - zeroAscii)
}
This works with a map operation too, so that you don't have to create a MutableList at all:
val zeroAscii = '0'.toInt()
val numbers = text.map { it.toInt() - zeroAscii }
Alternatively, you could convert each character individually to a String, since String.toInt() actually parses the number - this seems a bit wasteful in terms of the objects created though:
numbers.add(num.toString().toInt())
On the JVM, the efficient java.lang.Character.getNumericValue() is available:
val numbers: List<Int> = "82389235".map(Character::getNumericValue)
Since Kotlin 1.5, there's a built-in function Char.digitToInt(): Int:
println('5'.digitToInt()) // 5 (int)
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/digit-to-int.html
For clarity, the zeroAscii answer can be simplified to
val numbers = txt.map { it - '0' }
as Char - Char -> Int. If you are looking to minimize the number of characters typed, that is the shortest answer I know. The
val numbers = txt.map(Character::getNumericValue)
may be the clearest answer, though, as it does not require the reader to know anything about the low-level details of ASCII codes. The toString().toInt() option requires the least knowledge of ASCII or Kotlin but is a bit weird and may be most puzzling to the readers of your code (though it was the thing I used to solve a bug before investigating if there really wasn't a better way!)
I am trying to assign a unique index between 0 and N-1 (where N is the number of unique characters within the string) to the characters in a UTF32 string.
For example, if I had the string "hello", the output of the function would be:
'h' = 0
'e' = 1
'l' = 2
'o' = 3
There are 4 unique characters in the string "hello", so the output would need to be between 0 and 3.
I know this can be done using a hash table quite easily, or even minimal perfect hashing. What I'm curious about is if there's a more efficient way of handling this task, since I only ever need to map a single character to a single output value (I don't need to hash entire strings, for example). Because of this, using something like std::map seems a bit overkill, however I've not been able to find mention of any alternative that would be any faster to initialize or evaluate (though I suppose you could just shove the characters in a sorted array and look them up using a binary search).
I would probably use a hash-table (in the form of std::unordered_set) to store the unique letters, then just use a simple counter when the output is needed.
Something like
std::string str = "hello";
std::unordered_set<char> chars(begin(str), end(str));
std::size_t counter = 0;
for (char c : chars)
std::cout << '\'' << c << "' = " << counter++ << '\n';
any alternative that would be any faster to initialize or evaluate
You're not going to get faster than a std::unordered_map<char, size_t> as you have to check if you've already seen a char before you know if you need to store a new char --> size_t map for it.
Unless, of course, you write a better unordered map. As @MaxLanghof points out, this can be done with something like a std::array<char, 256> initialised to a not-found value.
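As a minimal sketch of the std::unordered_map approach described above, assuming C++17 and a std::u32string input (the helper name assignIndices is hypothetical), each character gets the next free index the first time it is seen:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Assigns each distinct character an index in order of first appearance.
// assignIndices is a hypothetical helper name used only for illustration.
std::unordered_map<char32_t, std::size_t> assignIndices(const std::u32string& text)
{
    std::unordered_map<char32_t, std::size_t> indices;
    for (char32_t c : text)
    {
        // try_emplace only inserts when c has not been seen yet (C++17),
        // so indices.size() is always the next unused index.
        indices.try_emplace(c, indices.size());
    }
    return indices;
}
```

For U"hello" this should map 'h', 'e', 'l', 'o' to 0, 1, 2, 3.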
If you work with 8-bit chars, you can use a std::array<unsigned char, 256> to map from char to unique index (which obviously fits into a char too):
constexpr unsigned char UNASSIGNED = 255; // Could be another character but then the loop logic gets harder.
std::array<unsigned char, 256> indices;
std::fill(indices.begin(), indices.end(), UNASSIGNED);
std::string input = ...;
unsigned char nextUniqueIndex = 0;
for (unsigned char c : input)
    if (indices[c] == UNASSIGNED)
    {
        indices[c] = nextUniqueIndex;
        ++nextUniqueIndex;
    }
// indices now contains a mapping of each char in the input to a unique index.
This of course requires that your input string doesn't use the entire value range of char (or rather that there are not 256 distinct characters in the input).
Now, you said that you are working with UTF32 which doesn't make this solution immediately viable. Indeed, for 32-bit characters the map would require 16 GB of memory (which would not perform well in any case). But if you actually receive 2^32 different UTF32 characters in random order then you're already at 16 GB input data, so at this point the question is "what assumptions can you make about your input data that can be exploited to improve the lookup" (presumably in the form of a good hashing function) and what kind of hash table gives you the best performance. I would wager that std::unordered_map with its separate allocations per key-value-pair and linked list traversal on lookup will not result in peak performance.
The sorting approach you mentioned is one such option, but if e.g. the entire input is a mix of two characters this will not be "efficient" either compared to other approaches. I will also drop the keyword Bloom Filter here as, for large data volumes, it might be a good way to handle frequently seen characters quickly (i.e. having a separate data structure for frequent keys vs infrequent keys).
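For comparison, the sorted-array idea could look roughly like the sketch below. The helper names buildAlphabet and indexOf are hypothetical, and note that indices come out in code point order rather than in order of first appearance:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Builds a sorted list of the distinct characters in text.
std::vector<char32_t> buildAlphabet(const std::u32string& text)
{
    std::vector<char32_t> alphabet(text.begin(), text.end());
    std::sort(alphabet.begin(), alphabet.end());
    alphabet.erase(std::unique(alphabet.begin(), alphabet.end()), alphabet.end());
    return alphabet;
}

// Returns the index of c in the alphabet, or alphabet.size() if c is absent.
std::size_t indexOf(const std::vector<char32_t>& alphabet, char32_t c)
{
    auto it = std::lower_bound(alphabet.begin(), alphabet.end(), c);
    if (it == alphabet.end() || *it != c)
        return alphabet.size();
    return static_cast<std::size_t>(it - alphabet.begin());
}
```

Building costs one sort over the whole input and each lookup is a binary search, so this mainly pays off when the input is small or the alphabet is built once and queried many times.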
As you are using UTF32 strings, I'm assuming this is for a good reason, namely that you want to support a huge amount of different characters and symbols from all over the world. If you cannot make any assumption at all about which characters you will be most likely dealing with, I think the answer of Some programmer dude is your best bet.
However, std::unordered_set is known to be much slower than a simple array lookup, as proposed by Max Langhof. So, if you can make some assumptions, you may be able to combine these two ideas.
For example, if you can reasonably assume that most of your input will be ASCII characters, you can use something like this:
constexpr char32_t ExpectedStart = U' '; // space == 32
constexpr char32_t ExpectedEnd = 128;
int main()
{
std::basic_string<char32_t> input = U"Hello €";
std::array<bool, ExpectedEnd - ExpectedStart> fastLookup;
std::fill(fastLookup.begin(), fastLookup.end(), false);
std::unordered_set<char32_t> slowLookup;
for (auto c : input)
{
if (ExpectedStart <= c && c < ExpectedEnd)
fastLookup[c - ExpectedStart] = true;
else
slowLookup.insert(c);
}
size_t unique_id = 0;
for (char32_t c = ExpectedStart; c < ExpectedEnd; ++c)
if (fastLookup[c - ExpectedStart])
std::wcout << '\'' << (wchar_t)c << "' = " << unique_id++ << '\n';
for (auto c : slowLookup)
std::wcout << '\'' << (wchar_t)c << "' = " << unique_id++ << '\n';
}
Note that for printing purposes I cast the chars to wchar_t as it is apparently quite difficult to properly print char32_t. But I'm assuming your final goal is not printing anyway, so I hope this doesn't matter.
I have a program set up already to read in a file and split each line into words, storing them into a double vector of strings. That is,
std::vector < std::vector <std::string> > words
So, the idea is to use an array from alphabet a-z and using the ASCII values of the letters to get the index and swapping the characters in the strings with the appropriate shifted character. How would I get the value of each character so that I can look it up as an index?
I also want to keep numbers intact, as a shift cipher, I believe, doesn't do anything with numbers in the text to be deciphered. How would I check if the character is an int so I can leave it alone?
If you want the ASCII value, you simply have to cast the value to an int:
int ascii_value = (int)words[i][j][k];
If you want to have a value starting from A or a you can do this:
int letter_value_from_A = (int)(words[i][j][k] - 'A');
int letter_value_from_a = (int)(words[i][j][k] - 'a');
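Putting the letter offsets to work, a shift cipher for one word might be sketched as follows. This assumes ASCII input and the default "C" locale; shiftWord is a hypothetical helper name. Digits and punctuation fall through to the last branch untouched, and std::isdigit could be used if digits need to be detected explicitly:

```cpp
#include <cctype>
#include <string>

// Shifts ASCII letters by `shift` positions, wrapping within a-z / A-Z.
// Digits and every other character are copied unchanged.
std::string shiftWord(const std::string& word, int shift)
{
    std::string result;
    result.reserve(word.size());
    const int s = ((shift % 26) + 26) % 26; // normalise negative shifts to 0..25
    for (unsigned char ch : word)
    {
        if (std::islower(ch))
            result.push_back(static_cast<char>('a' + (ch - 'a' + s) % 26));
        else if (std::isupper(ch))
            result.push_back(static_cast<char>('A' + (ch - 'A' + s) % 26));
        else
            result.push_back(static_cast<char>(ch));
    }
    return result;
}
```

Applying it to every words[i][j] in the nested vector and calling it again with the negated shift should recover the original text.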
Your char is nothing else than a value. Take this code as an example (I am used to programming in C++11, so this will be a little ugly):
char shiftarray[256] = {0}; // Fill in your substitution map here.
std::string output;
for (std::size_t l = 0; l < words.size(); l++)
{
    for (std::size_t w = 0; w < words[l].size(); w++)
    {
        // Use each character's value as an index into the map.
        for (unsigned char ch : words[l][w])
            output.push_back(shiftarray[ch]);
        output.push_back(' ');
    }
}
I do not know how to do it in anything other than BASIC, but very simply: get the ASCII value of each letter in the string using a loop. As the loop continues, add a value to, or subtract a value from, the ASCII value you just obtained, then convert it back to a letter and append it to a string. This will give you a different character than you had originally. By doing this, you can load and save data that will look like gibberish if anyone tries to view it other than in the program it was written in. The data then becomes a special proprietary document format.
I am reading a large file with several different fields on each line. By analogy, you can think of each line of the file as representing an employee and one of the fields contains the department name that they work in. However, the department name can be made up of any set of 4-5 ASCII characters (e.g. "1234", "ABCD", "P+0$i").
Currently, I am (naively) storing the characters as an std::string but I've noticed that I've been performing a lot of time-consuming string comparisons. Thus, I would like to read the field from the file, convert the string into a number (maybe an unsigned int?), and then later do many numerical comparisons (and avoid the string comparison). Of course, I will need a way to convert the number back into a string for output.
Most of my online searches bring up "convert string to number" which discusses the use of using stringstream to convert a number string to an int of some sort. This isn't particularly helpful and I can't seem to come up with a proper search query to find a solution. Can anybody please point me to a relevant source or provide a way to perform this conversion?
Well, if you have up to 5 ASCII characters, then the simplest approach is to pad it out to 8 with whatever character you like, then *reinterpret_cast<uint64_t*>(the_id.data()).
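A hedged sketch of the padding idea follows. It uses std::memcpy instead of the reinterpret_cast, which sidesteps alignment and aliasing concerns, and pads with '\0' (any pad character would do). The names packId and unpackId are hypothetical:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Packs an id of up to 8 bytes into a 64-bit integer, padding with '\0'.
std::uint64_t packId(const std::string& id)
{
    char buffer[8] = {};
    std::memcpy(buffer, id.data(), id.size() < 8 ? id.size() : 8);
    std::uint64_t value = 0;
    std::memcpy(&value, buffer, sizeof value);
    return value;
}

// Recovers the original string, dropping the '\0' padding.
// Assumes the id itself contains no embedded '\0' characters.
std::string unpackId(std::uint64_t value)
{
    char buffer[9] = {}; // one extra byte keeps the string terminated
    std::memcpy(buffer, &value, sizeof value);
    return std::string(buffer);
}
```

The resulting integers are only meaningful for equality comparisons; their numeric order depends on the machine's byte order.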
If you want to fit the character representation into a 32-bit int, you need to do considerably more work: simply discarding the high-order bit (possible because ASCII codes are 0-127) still leaves 7*5 = 35 bits, which is too many for a 32-bit type. Assuming the ids don't contain any control codes (i.e. ASCII codes 0-31), you can shrink the value with base-96 encoding. Note that 96^5 = 8,153,726,976 is still larger than 2^32 = 4,294,967,296, so the result only fits in 32 bits if the ids are at most 4 characters long or use at most 84 distinct characters (84^5 < 2^32); otherwise keep it in a 64-bit type, as here:
std::uint64_t base = 128 - 32;
// pad c out to 5 characters if necessary.
std::uint64_t idnum = ((((c[0] - 32) * base + (c[1] - 32)) * base + (c[2] - 32)) * base + (c[3] - 32)) * base + (c[4] - 32);
You may find it easier to read with a loop:
std::uint64_t idnum = 0;
for (size_t i = 0; i < 5; ++i)
{
    idnum *= base;
    idnum += c[i] - ' ';
}
Unpacking the number back to the string value is done with % base to get the last digit, then / base to prepare for getting the next....
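A round trip through the base-96 encoding might be sketched like this. The names encodeId, decodeId and kBase are hypothetical, shorter ids are padded with spaces, and a 64-bit accumulator is used because 96^5 exceeds 2^32:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::uint64_t kBase = 128 - 32; // 96 codes, 32..127

// Encodes up to 5 printable ASCII characters; shorter ids are padded with spaces.
std::uint64_t encodeId(std::string id)
{
    id.resize(5, ' ');
    std::uint64_t idnum = 0;
    for (char ch : id)
        idnum = idnum * kBase + static_cast<std::uint64_t>(ch - ' ');
    return idnum;
}

// Decodes back to a 5-character string (including any space padding).
std::string decodeId(std::uint64_t idnum)
{
    std::string id(5, ' ');
    for (std::size_t i = 5; i-- > 0;)
    {
        id[i] = static_cast<char>(' ' + idnum % kBase); // last digit first
        idnum /= kBase;
    }
    return id;
}
```

Because the most significant digit is the first character, comparing two encoded values gives the same order as comparing the padded strings.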
Create a map with department name -> integer mappings somewhere. Depending on how fixed the department names are, either read them from a configuration file or define the map statically. Create the inverse mapping afterwards to map integer -> string when needed.
While reading your file, lookup the associated key in the mapping and store it in your data structure.
When needed, lookup the department name in the inverse map
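The three steps above could be sketched as a small table that assigns ids on first sight and keeps the inverse mapping in a vector. The class name DepartmentTable and its members are hypothetical, and C++17 is assumed for try_emplace:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hands out a small integer per distinct department name and
// remembers the inverse mapping for output.
class DepartmentTable
{
public:
    // Returns the existing id for name, or assigns the next free one.
    std::size_t idFor(const std::string& name)
    {
        auto result = ids_.try_emplace(name, names_.size());
        if (result.second) // name was new
            names_.push_back(name);
        return result.first->second;
    }

    // Inverse lookup; id must have been returned by idFor.
    const std::string& nameFor(std::size_t id) const { return names_.at(id); }

private:
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<std::string> names_;
};
```

Each record then stores only the integer, so later comparisons are plain integer comparisons and the string is hashed once per line read.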
Given that std::string equality operations are actually efficient for data of arbitrary lengths and you need to retain the actual string value, I feel that you might want to look elsewhere for performance improvements. Looking over your requirements it looks like you just need to have a better search efficiency for your objects. The std::unordered_map container is likely a good candidate for use. It is an associative container that has a constant time for lookups via a key. You can store other collections of your data as the value type for the unordered_map and set the associated key to what ever you want to look it up by. Here is an example of some types that enable looking up a matching subset of your data via a string.
struct Employee;
typedef std::vector<std::shared_ptr<Employee>> Employees;
typedef std::unordered_map<std::string, Employees> EmployeeByLookup;
You could then lookup all employees that match a given key value like this.
static EmployeeByLookup byDepartment;
Employees& GetDepartmentList(const std::string& department)
{
return byDepartment[department];
}
If you need to look up your object by value instead of via an associative key then I would suggest that you look at std::unordered_set. It also has an average complexity of constant time for lookups. You can create your own hash function for your object if you need to optimize the internal hash performance.
I created a simple application using the unordered_map example. Take a look if you're interested.
C++11 has std::stoi. Other than that, you can use a stringstream for conversions:
std::istringstream iss("1234");
int x;
if ( iss >> x )
std::cout << "Got " << x << "\n";
else
std::cout << "String did not contain a number\n";
You can do other stream extractions from iss just like you are used to doing with cin.
I'm not sure why you consider the stringstream advice to be "not very useful"?
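For completeness, the std::stoi route could be wrapped as in this sketch, assuming C++17 for std::optional (parseInt is a hypothetical helper name). Note that this only handles strings that really are numbers, such as "1234", not ids like "ABCD":

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Returns the parsed value, or std::nullopt if text is not a valid int.
// Leading whitespace is accepted because std::stoi skips it.
std::optional<int> parseInt(const std::string& text)
{
    try
    {
        std::size_t consumed = 0;
        const int value = std::stoi(text, &consumed);
        if (consumed != text.size())
            return std::nullopt; // trailing junk such as "12ab"
        return value;
    }
    catch (const std::invalid_argument&)
    {
        return std::nullopt; // no digits at all
    }
    catch (const std::out_of_range&)
    {
        return std::nullopt; // does not fit in an int
    }
}
```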
I have an arbitrary Unicode string that represents a number, such as "2", "٢" (U+0662, ARABIC-INDIC DIGIT TWO) or "Ⅱ" (U+2161, ROMAN NUMERAL TWO). I want to convert that string into an int. I don't care about specific locales (the input might not be in the current locale); if it's a valid number then it should get converted.
I tried QString.toInt and QLocale.toInt, but they don't seem to get the job done. Example:
bool ok;
int n;
QString s = QChar(0x0662); // ARABIC-INDIC DIGIT TWO
n = s.toInt(&ok); // n == 0; ok == false
QLocale anyLocale(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
n = anyLocale.toInt(s, &ok); // n == 0; ok == false
QLocale cLocale = QLocale::C;
n = cLocale.toInt(s, &ok); // n == 0; ok == false
QLocale arabicLocale = QLocale::Arabic; // Specific locale. I don't want that.
n = arabicLocale.toInt(s, &ok); // n == 2; ok == true
Is there a function I am missing?
I could try all locales:
QList<QLocale> allLocales = QLocale::matchingLocales(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
for(int i = 0; i < allLocales.size(); i++)
{
n = allLocales[i].toInt(s, &ok);
if(ok)
break;
}
But that feels slightly hackish. Also, it does not work for all strings (e.g. Roman numerals, but that's an acceptable limitation). Are there any pitfalls when doing it that way, such as conflicting rules in different locales (cf. Turkish vs. non-Turkish letter case rules)?
I'm not aware of any ready to use package which does this (but
maybe ICU supports it), but it isn't hard to do if you really
want to. First, you should download the UnicodeData.txt file
from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
This is an easy to parse ASCII file; the exact syntax is
described in http://www.unicode.org/reports/tr44/tr44-10.html,
but for your purposes, all you need to know is that each line in
the file consists of semi-colon separated fields. The first
field contains the character code in hex, the third field the
"general category", and if the third field is "Nd" (numeric,
decimal), the seventh field contains the decimal value.
This file can easily be parsed using Python or a number of other
scripting languages, to build a mapping table. You'll want some
sort of sparse representation, since there are over a million
Unicode characters, of which very few (a couple of hundred) are
decimal digits. The following Python script will give you a C++
table which can be used to initialize an
std::map<int, int>. If the character is
in the map, the mapped element is its value.
Whether this is sufficient or not depends on your application.
It has several weaknesses:
It requires extra logic to recognize when two successive
digits are in different alphabets. Presumably a sequence "1١"
should be treated as two numbers (1 and 1), rather than as one
(11). (Because all of the sets of decimal digits are in 10
successive codes, it would be fairly easy, once you know the
digit, to check whether the preceding digit character was in the
same set.)
It ignores non-decimal digits, like ௰ or ൱ (Tamil ten and
Malayalam one hundred). There aren't that many of them, and they are
also in the UnicodeData.txt file, so it might be possible to
find them manually and add them to the table. I don't know
myself, however, how they combine with other digits when numbers
have been composed.
If you're converting numbers, you might have to worry about
the direction. I'm not sure how this is handled (but there is
documentation at the Unicode site); in general, text will appear
in its natural order. In the case of Arabic and related
languages, when reading in the natural order, the low order
digits appear first: something like "١٢" (literally "12",
but because the writing is from right to left, the digits will
appear in the order "21") should be interpreted as 12, and not 21. Except that I'm not sure whether a change direction mark is
present or not. (The exact rules are described in the
documentation at the Unicode site; in the UnicodeData.txt file,
the fifth field—index 4—gives this information. I
think if it's anything but "AN", you can assume the big-endian
standard used in Europe, but I'm not sure.)
Just to show how simple this is, here's the Python script to
parse the UnicodeData.txt file for the digit values:
print('std::pair<int, int> initUnicodeMap[] = {')
for line in open("UnicodeData.txt"):
    fields = line.split(';')
    if fields[2] == 'Nd':
        print(' {{{:d}, {:d}}},'.format(int(fields[0], 16), int(fields[7])))
print('};')
If you're doing any work with Unicode, this file is a gold mine
for generating all sorts of useful tables.
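On the C++ side, using such a table might look like the sketch below. The map here is a tiny hand-written excerpt (ASCII digits and ARABIC-INDIC digits U+0660 to U+0669) standing in for the generated one, and makeDigitMap and toInt are hypothetical names. It does not check for mixed scripts or overflow, two of the weaknesses listed above:

```cpp
#include <map>
#include <optional>
#include <string>

// Hand-written excerpt standing in for the generated table:
// ASCII digits and ARABIC-INDIC digits (U+0660..U+0669).
std::map<int, int> makeDigitMap()
{
    std::map<int, int> digits;
    for (int i = 0; i < 10; ++i)
    {
        digits[0x0030 + i] = i;
        digits[0x0660 + i] = i;
    }
    return digits;
}

// Converts a run of decimal digits (most significant first in memory order)
// to an int. Returns std::nullopt for an empty string or a character that
// is missing from the table.
std::optional<int> toInt(const std::u32string& text, const std::map<int, int>& digits)
{
    if (text.empty())
        return std::nullopt;
    int value = 0;
    for (char32_t c : text)
    {
        auto it = digits.find(static_cast<int>(c));
        if (it == digits.end())
            return std::nullopt;
        value = value * 10 + it->second; // no overflow check in this sketch
    }
    return value;
}
```

With the full generated table substituted for makeDigitMap, the same loop would cover every Nd character in UnicodeData.txt.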
You can get the numeric equivalent of a Unicode character with the method QChar::digitValue:
int value = QChar::digitValue((uint)0x0662);
It will return -1 if the character does not have a digit value.
See the documentation if you need more help; I don't really know much about C++/Qt.
Chinese numerals mentioned in that wikipedia article belong to 0x4E00-0x9FCC. There is no useful metadata about individual characters in this range:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
So if you wish to map chinese numerals to integers, you must do that mapping yourself, simple as that.
Here's simple mapping of the symbols in the wikipedia article where a single symbol maps to some single number:
0x96f6,0x3007 = 0
0x58f9,0x4e00,0x5f0c = 1
0x8cb3,0x8d30,0x4e8c,0x5f0d,0x5169,0x4e24 = 2
0x53c3,0x53c1,0x4e09,0x5f0e,0x53c3,0x53c2,0x53c4,0x53c1 = 3
0x8086,0x56db,0x4989 = 4
0x4f0d,0x4e94 = 5
0x9678,0x9646,0x516d = 6
0x67d2,0x4e03 = 7
0x634c,0x516b = 8
0x7396,0x4e5d = 9
0x62fe,0x5341,0x4ec0 = 10
0x4f70,0x767e = 100
0x4edf,0x5343 = 1000
0x842c,0x842c,0x4e07 = 10000
0x5104,0x5104,0x4ebf = 100000000
0x5e7a = 1
0x5169,0x4e24 = 2
0x5440 = 10
0x5ff5,0x5eff = 20
0x5345 = 30
0x534c = 40
0x7695 = 200
0x6d1e = 0
0x5e7a = 1
0x4e24 = 2
0x5200 = 4
0x62d0 = 7
0x52fe = 9
Currently I'm working on an assignment and using C++ for the first time.
I'm trying to append certain "message types" to the beginning of strings so when sent to the server/client it will deal with the strings depending on the message type. I was wondering if I would be able to put any two-digit integer into an element of the message buffer.... see below.
I've left a section of the code below:
char messageBuffer[32];
messageBuffer[0] = '10'; // << I get an overflow here
messageBuffer[1] = '0';
for (int i = 2; i < (userName.size() + 2); i++)
{
messageBuffer[i] = userName[(i - 2)];
}
Thanks =)
'10' is a multi-character literal, not a single char; its value does not fit into a char, hence the overflow.
Either write 10, as in messageBuffer[0]=10, if ten is the value you want to store, or do as Lars wrote.
The message buffer is an array of char. Index 0 contains one char, so you cannot put 2 chars into one char. That would violate the rule that one bit contains one binary digit :-)
The correct solution is to do this:
messageBuffer[0]='0';
messageBuffer[1]='1';
or:
messageBuffer[1]='0';
messageBuffer[0]='1';
or
messageBuffer[0]=10;
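If the message type is meant to travel as a single raw byte (the last variant above), a sketch using std::string instead of a fixed char array could look like this. The name makeMessage is hypothetical, and the receiver is assumed to read the first byte back as a number:

```cpp
#include <string>

// Builds "<type byte><user name>" where the type is stored as one raw byte.
std::string makeMessage(unsigned char messageType, const std::string& userName)
{
    std::string message;
    message.reserve(userName.size() + 1);
    message.push_back(static_cast<char>(messageType)); // the value 10, not the text "10"
    message += userName;
    return message;
}
```

The receiving side can then branch on message[0] and treat the remainder of the string as the payload.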