Convert any Unicode string to int - c++

I have an arbitrary Unicode string that represents a number, such as "2", "٢" (U+0662, ARABIC-INDIC DIGIT TWO) or "Ⅱ" (U+2161, ROMAN NUMERAL TWO). I want to convert that string into an int. I don't care about specific locales (the input might not be in the current locale); if it's a valid number then it should get converted.
I tried QString.toInt and QLocale.toInt, but they don't seem to get the job done. Example:
bool ok;
int n;
QString s = QChar(0x0662); // ARABIC-INDIC DIGIT TWO
n = s.toInt(&ok); // n == 0; ok == false
QLocale anyLocale(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
n = anyLocale.toInt(s, &ok); // n == 0; ok == false
QLocale cLocale = QLocale::C;
n = cLocale.toInt(s, &ok); // n == 0; ok == false
QLocale arabicLocale = QLocale::Arabic; // Specific locale. I don't want that.
n = arabicLocale.toInt(s, &ok); // n == 2; ok == true
Is there a function I am missing?
I could try all locales:
QList<QLocale> allLocales = QLocale::matchingLocales(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
for(int i = 0; i < allLocales.size(); i++)
{
n = allLocales[i].toInt(s, &ok);
if(ok)
break;
}
But that feels slightly hackish. Also, it does not work for all strings (e.g. Roman numerals, but that's an acceptable limitation). Are there any pitfalls when doing it that way, such as conflicting rules in different locales (cf. Turkish vs. non-Turkish letter case rules)?

I'm not aware of any ready-to-use package which does this (but
maybe ICU supports it), but it isn't hard to do if you really
want to. First, you should download the UnicodeData.txt file
from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
This is an easy to parse ASCII file; the exact syntax is
described in http://www.unicode.org/reports/tr44/tr44-10.html,
but for your purposes, all you need to know is that each line in
the file consists of semi-colon separated fields. The first
field contains the character code in hex, the third field the
"general category", and if the third field is "Nd" (numeric,
decimal), the seventh field contains the decimal value.
This file can easily be parsed using Python or a number of other
scripting languages, to build a mapping table. You'll want some
sort of sparse representation, since there are over a million
Unicode code points, of which very few (a few hundred) are
decimal digits. The Python script below will give you a C++
table which can be used to initialize an
std::map<int, int>. If the character is
in the map, the mapped element is its value.
Whether this is sufficient or not depends on your application.
It has several weaknesses:
It requires extra logic to recognize when two successive
digits are in different alphabets. Presumably a sequence "1١"
should be treated as two numbers (1 and 1), rather than as one
(11). (Because all of the sets of decimal digits are in 10
successive codes, it would be fairly easy, once you know the
digit, to check whether the preceding digit character was in the
same set.)
It ignores non-decimal digits, like ௰ or ൱ (Tamil ten and
Malayalam one hundred). There aren't that many of them, and they are
also in the UnicodeData.txt file, so it might be possible to
find them manually and add them to the table. I don't know
myself, however, how they combine with other digits when numbers
have been composed.
If you're converting numbers, you might have to worry about
the direction. I'm not sure how this is handled (but there is
documentation at the Unicode site); in general, text will appear
in its natural order. In the case of Arabic and related
languages, when reading in the natural order, the low order
digits appear first: something like "١٢" (literally "12",
but because the writing is from right to left, the digits will
appear in the order "21") should be interpreted as 12, and not 21. Except that I'm not sure whether a change direction mark is
present or not. (The exact rules are described in the
documentation at the Unicode site; in the UnicodeData.txt file,
the fifth field—index 4—gives this information. I
think if it's anything but "AN", you can assume the big-endian
standard used in Europe, but I'm not sure.)
Just to show how simple this is, here's the Python script to
parse the UnicodeData.txt file for the digit values:
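print('std::pair<int, int> initUnicodeMap[] = {')
with open("UnicodeData.txt") as data:
    for line in data:
        fields = line.split(';')
        if fields[2] == 'Nd':  # general category: decimal digit
            # fields[0] is the code point in hex, fields[6] the decimal value
            print('    {{{:d}, {:d}}},'.format(int(fields[0], 16), int(fields[6])))
print('};')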
If you're doing any work with Unicode, this file is a gold mine
for generating all sorts of useful tables.
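As a rough sketch of how such a table might be used, the code below converts a UTF-32 string to an int and rejects input that mixes digit sets. It relies on the observation above that each set of decimal digits occupies 10 successive codes, so it only stores the code point of each zero. The names kDigitZeros, digitBlock and parseUnicodeInt are hypothetical, the table is a tiny hand-written excerpt rather than the generated one, and the code assumes the digits are stored most significant first.

```cpp
#include <cstddef>
#include <string>

// Hypothetical excerpt: code point of digit zero for a few "Nd" blocks.
// A real table would be generated from UnicodeData.txt.
constexpr char32_t kDigitZeros[] = {
    0x0030,  // DIGIT ZERO (ASCII)
    0x0660,  // ARABIC-INDIC DIGIT ZERO
    0x06F0,  // EXTENDED ARABIC-INDIC DIGIT ZERO
    0x0966,  // DEVANAGARI DIGIT ZERO
};

// Returns the index of the digit block containing c, or -1 if c is not a
// decimal digit known to this excerpt.
int digitBlock(char32_t c) {
    const std::size_t count = sizeof(kDigitZeros) / sizeof(kDigitZeros[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (c >= kDigitZeros[i] && c <= kDigitZeros[i] + 9) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Parses a string of decimal digits that all come from the same block.
// Returns false for empty input, unknown characters, or mixed blocks.
// Assumption: most significant digit first; no overflow check.
bool parseUnicodeInt(const std::u32string& s, int& out) {
    if (s.empty()) return false;
    const int block = digitBlock(s[0]);
    if (block < 0) return false;
    int value = 0;
    for (char32_t c : s) {
        if (digitBlock(c) != block) return false;
        value = value * 10 + static_cast<int>(c - kDigitZeros[block]);
    }
    out = value;
    return true;
}
```

With this excerpt, a string of two Arabic-Indic digits parses as one number, while "1١" is rejected because its digits come from different sets.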

You can get the numeric equivalent of a Unicode character with the method QChar::digitValue:
int value = QChar::digitValue((uint)0x0662);
It will return -1 if the character does not have numeric value.
See the documentation if you need more help; I don't really know much about C++/Qt.
Chinese numerals mentioned in that Wikipedia article belong to the range 0x4E00-0x9FCC. There is no useful metadata about individual characters in this range:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
So if you wish to map Chinese numerals to integers, you must do that mapping yourself, simple as that.
Here's a simple mapping of the symbols in the Wikipedia article where a single symbol maps to a single number:
0x96f6,0x3007 = 0
0x58f9,0x4e00,0x5f0c = 1
0x8cb3,0x8d30,0x4e8c,0x5f0d,0x5169,0x4e24 = 2
0x53c3,0x53c1,0x4e09,0x5f0e,0x53c2,0x53c4 = 3
0x8086,0x56db,0x4989 = 4
0x4f0d,0x4e94 = 5
0x9678,0x9646,0x516d = 6
0x67d2,0x4e03 = 7
0x634c,0x516b = 8
0x7396,0x4e5d = 9
0x62fe,0x5341,0x4ec0 = 10
0x4f70,0x767e = 100
0x4edf,0x5343 = 1000
0x842c,0x4e07 = 10000
0x5104,0x4ebf = 100000000
0x5e7a = 1
0x5169,0x4e24 = 2
0x5440 = 10
0x5ff5,0x5eff = 20
0x5345 = 30
0x534c = 40
0x7695 = 200
0x6d1e = 0
0x5e7a = 1
0x4e24 = 2
0x5200 = 4
0x62d0 = 7
0x52fe = 9
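One way to hold such a hand-made mapping in C++ is a small lookup table. The sketch below copies only a handful of entries from the list above; the function names cjkNumeralValues and cjkNumeralValue are hypothetical, and it only looks up single characters. Combining several numerals into one number is not attempted.

```cpp
#include <map>

// A few entries copied from the list above; extend as needed.
const std::map<char32_t, int>& cjkNumeralValues() {
    static const std::map<char32_t, int> table = {
        {0x96F6, 0}, {0x3007, 0},
        {0x4E00, 1}, {0x58F9, 1},
        {0x4E8C, 2}, {0x8CB3, 2},
        {0x4E09, 3},
        {0x5341, 10},
        {0x767E, 100},
        {0x5343, 1000},
        {0x4E07, 10000},
    };
    return table;
}

// Returns the value of a single numeral character,
// or -1 if the character is not in the table.
int cjkNumeralValue(char32_t c) {
    const std::map<char32_t, int>& table = cjkNumeralValues();
    const auto it = table.find(c);
    return it == table.end() ? -1 : it->second;
}
```

Returning -1 for unknown characters mirrors what QChar::digitValue does for characters without a digit value.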

Related

Extracting numbers using Regex in Matlab

I would like to extract integers from strings from a cell array in Matlab. Each string contains 1 or 2 integers formatted as shown below. Each number can be one or two digits. I would like to convert each string to a 1x2 array. If there is only one number in the string, the second column should be -1. If there are two numbers then the first entry should be the first number, and the second entry should be the second number.
'[1, 2]'
'[3]'
'[10, 3]'
'[1, 12]'
'[11, 12]'
Thank you very much!
I have tried a few different methods that did not work out. I think that I need to use regex and am having difficulty finding the proper expression.
You can use str2num to convert well formatted chars (which you appear to have) to the correct arrays/scalars. Then simply pad from the end+1 element to the 2nd element (note this is nothing in the case there's already two elements) with the value -1.
This is most clearly done in a small loop, see the comments for details:
% Set up the input
c = { ...
'[1, 2]'
'[3]'
'[10, 3]'
'[1, 12]'
'[11, 12]'
};
n = cell(size(c)); % Initialise output
for ii = 1:numel(n) % Loop over chars in 'c'
n{ii} = str2num(c{ii}); % convert char to numeric array
n{ii}(end+1:2) = -1; % Extend (if needed) to 2 elements = -1
end
% (Optional) Convert from a cell to an Nx2 array
n = cell2mat(n);
If you really wanted to use regex, you could replace the loop part with something similar:
n = regexp( c, '\d{1,2}', 'match' ); % Match between one and two digits
for ii = 1:numel(n)
n{ii} = str2double(n{ii}); % Convert cellstr of chars to arrays
n{ii}(end+1:2) = -1; % Pad to be at least 2 elements
end
But there are lots of ways to do this without touching regex, for example you could erase the square brackets, split on a comma, and pad with -1 according to whether or not there's a comma in each row. Wrap it all in a much harder to read (vs a loop) cellfun and ta-dah you get a one-liner:
n = cellfun( @(x) [str2double( strsplit( erase(x,{'[',']'}), ',' ) ), -1*ones(1,1-nnz(x==','))], c, 'uni', 0 );
I'd recommend one of the loops for ease of reading and debugging.

I'm trying to encrypt a message for my homework assignment

The gist of it is that every letter from a to z needs to be encrypted into a number.
For example, a turns into "1", b into "2", all the way to z = "26". Then I have to find the number of possible outcomes for every encryption. For example, 25114 can be 6 different things: BEAN, BEAAD, YAAD, YAN, YKD, BEKD.
My question is: how do I do this?
I've tried using if, but it keeps printing "1" as the output every time.
#include <iostream>
using namespace std;
int main()
{
int a1,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z;
cout<<"vnesi kod"<<endl;
cin>>a1;
if (a)
{
cout<<"1"<<endl;
}
else if (b)
{
cout<<"2"<<endl;
}
return 0;
}
Since this is a homework problem, I'll just give you some pseudo-code showing how to solve it. You will still have to implement it yourself.
Let us assume you get a number as input consisting of n digits: a1a2a3 ... an
Since the alphabet contains 26 letters, we want to split this number into groups of 1 or 2 digits, and for a group of two digits we have to check that the number is smaller than 27 (and at least 10). The quickest way to write this is a recursive function; it is not the cleanest approach, but it is the quickest to implement. Let us assume the recursive function is called decode.
It is easy to see why a recursive function is needed. If we want to decode the number 25114, there are two paths we need to take, a group of 1 and a group of 2:
group of 1: translate the last digit 4 into "D", and decode the remaining number 2511
group of 2: check that the last two digits form a number smaller than 27, translate the last two digits 14 into "N", and decode the remaining number 251
In pseudo-code this looks like this:
# function decode
# input: the number n to decode
#        a postfix string p representing the decoded part
function decode(n, p) {
    # end condition: if the number is ZERO, the full number has been decoded,
    # so only print and return
    if (n == 0) { print p; return }
    # group of 1: use integer division to extract the
    #   last digit as n%10 and
    #   the remainder to decode as n/10
    # A digit 0 has no letter, so skip this branch if n%10 == 0
    if (n%10 != 0) { decode(n/10, concat(translate(n%10), p)) }
    # group of 2: use integer division to extract the
    #   last two digits as n%100 and
    #   the remainder to decode as n/100
    # This only runs if the last two digits form a number from 10 to 26
    if (n%100 >= 10 && n%100 <= 26) { decode(n/100, concat(translate(n%100), p)) }
}
The function concat concatenates two strings: concat("AA","BB") returns "AABB"
The function translate(n) converts a number n into its corresponding alphabetic character. This can be written as sprintf("%c",64+n)
As is mentioned in the comments, this is not a very efficient method. This is because we do the same work over and over. If the input reads 25114, we will do the following steps in order:
step 1: translate(4), decode _2511_
step 1.1: translate(1), decode _251_
step 1.1.1: ...
step 1.2: translate(11), decode _25_
step 1.2.1: ...
step 2: translate(14), decode _251_
As you can see, we have to decode 251 twice (in step 1.1 and step 2). This is very inefficient, as we do the same work more than once.
To improve this, you can keep track of what you have done so far in a lookup table.
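A minimal sketch of the lookup-table idea, assuming you only need the number of possible decodings and that the input is given as a string of digits. Instead of memoizing the recursion, it fills the table from the front of the string, which avoids the repeated work in the same way. The name countDecodings is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts how many letter sequences (a=1 ... z=26) encode to the digit string.
// ways[i] holds the number of decodings of the first i digits.
// Assumption: digits contains only the characters '0'..'9'.
long long countDecodings(const std::string& digits) {
    const std::size_t n = digits.size();
    std::vector<long long> ways(n + 1, 0);
    ways[0] = 1;  // the empty prefix has exactly one decoding
    for (std::size_t i = 1; i <= n; ++i) {
        const int last = digits[i - 1] - '0';
        // group of 1: the last digit must be 1..9 (0 has no letter)
        if (last >= 1) {
            ways[i] += ways[i - 1];
        }
        // group of 2: the last two digits must form a number from 10 to 26
        if (i >= 2) {
            const int lastTwo = (digits[i - 2] - '0') * 10 + last;
            if (lastTwo >= 10 && lastTwo <= 26) {
                ways[i] += ways[i - 2];
            }
        }
    }
    return ways[n];
}
```

Each prefix is handled exactly once, so the work grows linearly with the number of digits rather than with the number of decodings.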
Check out the ASCII table at http://www.asciitable.com/. I had something similar for my homework as well. Since 'a' = 97 and 'z' = 122, you can subtract 96 from the character's integer value to get the number you want.
For example:
int firstLetterNum {(int)'a' - 97 + 1}; // 1
int lastLetterNum {(int)'z' - 97 + 1}; // 26

How do I convert a Char to Int?

So I have a String of integers that looks like "82389235", but I wanted to iterate through it to add each number individually to a MutableList. However, when I go about it the way I think it would be handled:
var text = "82389235"
for (num in text) numbers.add(num.toInt())
This adds numbers completely unrelated to the string to the list. Yet, if I use println to output it to the console it iterates through the string perfectly fine.
How do I properly convert a Char to an Int?
That's because num is a Char, i.e. the resulting values are the ASCII values of those chars.
This will do the trick:
val txt = "82389235"
val numbers = txt.map { it.toString().toInt() }
The map could be further simplified:
map(Character::getNumericValue)
The variable num is of type Char. Calling toInt() on this returns its ASCII code, and that's what you're appending to the list.
If you want to append the numerical value, you can just subtract the ASCII code of 0 from each digit:
numbers.add(num.toInt() - '0'.toInt())
Which is a bit nicer like this:
val zeroAscii = '0'.toInt()
for(num in text) {
numbers.add(num.toInt() - zeroAscii)
}
This works with a map operation too, so that you don't have to create a MutableList at all:
val zeroAscii = '0'.toInt()
val numbers = text.map { it.toInt() - zeroAscii }
Alternatively, you could convert each character individually to a String, since String.toInt() actually parses the number - this seems a bit wasteful in terms of the objects created though:
numbers.add(num.toString().toInt())
On JVM there is efficient java.lang.Character.getNumericValue() available:
val numbers: List<Int> = "82389235".map(Character::getNumericValue)
Since Kotlin 1.5, there's a built-in function Char.digitToInt(): Int:
println('5'.digitToInt()) // 5 (int)
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/digit-to-int.html
For clarity, the zeroAscii answer can be simplified to
val numbers = txt.map { it - '0' }
as Char - Char -> Int. If you are looking to minimize the number of characters typed, that is the shortest answer I know. The
val numbers = txt.map(Character::getNumericValue)
may be the clearest answer, though, as it does not require the reader to know anything about the low-level details of ASCII codes. The toString().toInt() option requires the least knowledge of ASCII or Kotlin but is a bit weird and may be most puzzling to the readers of your code (though it was the thing I used to solve a bug before investigating if there really wasn't a better way!)

Can't understand the proof behind the use of this map<> method

Drazil is playing a math game with Varda.
Let's define f(x) for positive integer x as a product of factorials of its
digits. For example, f(135) = 1! * 3! * 5! = 720.
First, they choose a decimal number a consisting of n digits that
contains at least one digit larger than 1. This number may possibly
start with leading zeroes. Then they should find maximum positive
number x satisfying following two conditions:
1. x contains neither digit 0 nor digit 1.
2. f(x) = f(a)
Help friends find such number.
Input The first line contains an integer n (1 ≤ n ≤ 15) — the number
of digits in a.
The second line contains n digits of a. There is at least one digit in
a that is larger than 1. Number a may possibly contain leading zeroes.
Output Output a maximum possible integer satisfying the conditions
above. There should be no zeroes and ones in this number decimal
representation.
Examples
input
4
1234
output
33222
input
3
555
output
555
Here is the solution,
#include <bits/stdc++.h>
#include <algorithm>
using namespace std;
int main()
{
map<char, string> mp;
mp['0'] = mp['1'] = "";
mp['2'] = "2";
mp['3'] = "3";
mp['4'] = "223";
mp['5'] = "5";
mp['6'] = "35";
mp['7'] = "7";
mp['8'] = "2227";
mp['9'] = "2337";
int n;
string str;
cin>>n>>str;
string res;
for(int i = 0; i < str.size(); ++i)
res += mp[str[i]];
sort(res.rbegin(), res.rend());
cout<<res;
return 0;
}
I'd like someone to explain why the digits were transformed into other digits rather than computing the number directly. Sadly, brute force would give a TLE (Time Limit Exceeded) on this question, because a can have up to 15 digits and that is too large to brute force. So I kindly hope that someone can explain the "proof" below, because I don't know what theory says that these digits can be transformed this way, for example 4 into 223.
Thanks in advance.
Picture: What the proof says
The theory behind these transformations is the following (I'll use 4 as an example):
4! = 3! * 2! * 2!
A longer sequence of digits will always produce a larger number than a shorter sequence (at least for positive integers). Thus this code produces a longer sequence where possible. With the above example we get:
4! = 3! * 4
We can't reduce the 3! any further, since 3 is a prime. 4 on the other hand is simply 2²:
4 = 2² = 2! * 2!
Thus we have found the optimal replacement for 4 in the number sequence as "322" (the code stores it as "223"; the final sort puts all digits in descending order). This can be done for all digits, but prime numbers aren't factorisable and will thus always be the best replacement available for themselves.
And thanks to the fact that we're using prime factorization we also know that we have the only (and longest possible) string of digits that can replace a certain digit.
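To check that every replacement in the table preserves the product of factorials, a small sketch like the following can be used. The helper names factorial and digitFactorialProduct are hypothetical; digitFactorialProduct plays the role of f from the problem statement, taking the digits as a string.

```cpp
#include <string>

// Factorial of a single digit (0..9); 9! = 362880 fits easily in long long.
long long factorial(int d) {
    long long result = 1;
    for (int k = 2; k <= d; ++k) {
        result *= k;
    }
    return result;
}

// Product of the factorials of the digits in s.
// Assumption: s contains only the characters '0'..'9'.
long long digitFactorialProduct(const std::string& s) {
    long long product = 1;
    for (char c : s) {
        product *= factorial(c - '0');
    }
    return product;
}
```

For instance, digitFactorialProduct("223") equals factorial(4), and digitFactorialProduct("1234") equals digitFactorialProduct("33222"), which matches the first sample.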

Reading integers from a file with mixed integers, letters, and spaces C++

This is a sort of self-imposed extra credit problem I'm adding to my current programming assignment which I finished a week early. The assignment involved reading in integers from a file with multiple integers per line, each separated by a space. This was achieved easily using while(inFile >> val) .
The challenge I set myself was to try and read integers from a file of mixed numbers and letters, pulling out all contiguous digits as separate integers composed of those digits. For example, if I was reading in the following line from a text file:
12f 356 48 r56 fs6879 57g 132e efw ddf312 323f
The values that would be read in (and stored) would be the runs of digits in that line:
12, 356, 48, 56, 6879, 57, 132, 312, and 323
I've spent all afternoon digging through cplusplus.com and reading cover to cover the specifics of get, getline, cin etc. and I am unable to find an elegant solution for this. Every method I can deduce involves exhaustive reading in and storing of each character from the entire file into a container of some sort and then going through one element at a time and pulling out each digit.
My question is whether there is a way to do this during the process of reading them in from a file; i.e. does the functionality of get, getline, cin and company support that complex an operation?
Read one character at a time and inspect it. Have a variable that maintains the number currently being read, and a flag telling you if you are in the middle of processing a number.
If the current character is a digit then multiply the current number by 10 and add the digit to the number (and set the "processing a number" flag).
If the current character isn't a digit and you were in the middle of processing a number, you have reached the end of the number and should add it to your output.
Here is a simple such implementation:
#include <istream>
#include <vector>

std::vector<int> read_integers(std::istream & input)
{
    std::vector<int> numbers;
    int number = 0;
    bool have_number = false;
    char c;
    // Loop until reading fails.
    while (input.get(c)) {
        if (c >= '0' && c <= '9') {
            // We have a digit.
            have_number = true;
            // Add the digit to the right of our number. (No overflow check here!)
            number = number * 10 + (c - '0');
        } else if (have_number) {
            // It wasn't a digit and we started on a number, so we hit the end of it.
            numbers.push_back(number);
            have_number = false;
            number = 0;
        }
    }
    // Make sure if we ended with a number that we return it, too.
    if (have_number) { numbers.push_back(number); }
    return numbers;
}
Now you can do something like this to read all integers from standard input:
std::vector<int> numbers = read_integers(std::cin);
This will work equally well with an std::ifstream.
You might consider making the function a template where the argument specifies the numeric type to use -- this will allow you to (for example) switch to long long int without altering the function, if you know the file is going to contain large numbers that don't fit inside of an int.
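A sketch of what that template might look like, assuming the same character-by-character logic as above. The name read_integers_as is hypothetical, chosen so it does not clash with the non-template function, and there is still no overflow check.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Same algorithm as read_integers, but the caller picks the numeric type.
template <typename Int>
std::vector<Int> read_integers_as(std::istream& input)
{
    std::vector<Int> numbers;
    Int number = 0;
    bool have_number = false;
    char c;
    while (input.get(c)) {
        if (c >= '0' && c <= '9') {
            have_number = true;
            // Append the digit on the right. (Still no overflow check!)
            number = number * 10 + static_cast<Int>(c - '0');
        } else if (have_number) {
            // A non-digit ends the number currently being read.
            numbers.push_back(number);
            have_number = false;
            number = 0;
        }
    }
    // Return a number that runs up to the end of the input, too.
    if (have_number) { numbers.push_back(number); }
    return numbers;
}
```

It would be called as read_integers_as<long long>(std::cin), or with a std::istringstream when reading from a string in memory.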