Reversed offset tokenizer


I have a string to tokenize. Its form is HHmmssff where H, m, s, f are digits.
It's supposed to be tokenized into four 2-digit numbers, but I need it to also accept short-hand forms, like sff so it interprets it as 00000sff.
I wanted to use boost::tokenizer's offset_separator but it seems to work only with positive offsets and I'd like to have it work sort of backwards.
Ok, one idea is to pad the string with zeroes from the left, but maybe the community comes up with something uber-smart. ;)
Edit: Additional requirements have just come into play.
The basic need for a smarter solution was to handle all cases, like f, ssff, mssff, etc. but also accept a more complete time notation, like HH:mm:ss:ff with its short-hand forms, e.g. s:ff or even s: (this one's supposed to be interpreted as s:00).
In the case where the string ends with : I can obviously pad it with two zeroes as well, then strip out all separators leaving just the digits and parse the resulting string with spirit.
But it seems like it would be a bit simpler if there was a way to make the offset tokenizer go backwards from the end of the string (offsets -2, -4, -6, -8) and lexically cast the numbers to ints.

I keep preaching BNF notation. If you can write down the grammar that defines your problem, you can easily convert it into a Boost.Spirit parser, which will do it for you.
TimeString := LongNotation | ShortNotation
LongNotation := Hours Minutes Seconds Fraction
Hours := digit digit
Minutes := digit digit
Seconds := digit digit
Fraction := digit digit
ShortNotation := ShortSeconds Fraction
ShortSeconds := digit
Edit: additional constraint
VerboseNotation := [ [ [ Hours ':' ] Minutes ':' ] Seconds ':' ] Fraction
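The VerboseNotation rule above nests its optional parts from the right, so one way to honour it without a parser generator is to split on ':' starting from the end of the string. The following is only a sketch of that idea, not a Boost.Spirit grammar: the function name parse_verbose_time is made up for illustration, an empty field is assumed to mean 0 (so "s:" reads as s:00), and each field is assumed to hold at most two digits.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns {HH, mm, ss, ff} for the colon-separated form.
// Fields are consumed from the right, so "s:ff" fills seconds and fraction.
std::array<int, 4> parse_verbose_time(const std::string& s) {
    std::array<int, 4> fields{{0, 0, 0, 0}};
    std::size_t end = s.size();              // one past the current field
    for (int field = 3; field >= 0; --field) {
        // Search backwards for the colon that precedes this field.
        std::size_t colon =
            (end == 0) ? std::string::npos : s.rfind(':', end - 1);
        std::size_t begin = (colon == std::string::npos) ? 0 : colon + 1;
        if (end - begin > 2)
            throw std::invalid_argument("field has more than two digits");
        int value = 0;                       // an empty field stays 0
        for (std::size_t i = begin; i < end; ++i) {
            if (s[i] < '0' || s[i] > '9')
                throw std::invalid_argument("non-digit character");
            value = value * 10 + (s[i] - '0');
        }
        fields[field] = value;
        if (colon == std::string::npos) return fields;  // no fields left
        end = colon;
    }
    throw std::invalid_argument("more than four fields");
}
```

With these assumptions, parse_verbose_time("1:23") yields 0, 0, 1, 23 and parse_verbose_time("7:") yields 0, 0, 7, 0.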

In response to the comment "Don't mean to be a performance freak, but this solution involves some string copying (input is a const & std::string)".
If you really care about performance so much that you can't use a big old library like regex, won't risk a BNF parser, don't want to assume that std::string::substr will avoid a copy with allocation (and hence can't use STL string functions), and can't even copy the string chars into a buffer and left-pad with '0' characters:
void parse(const string &s) {
    string::const_iterator current = s.begin();
    int HH = 0;
    int mm = 0;
    int ss = 0;
    int ff = 0;
    // Every case deliberately falls through to the next one, so a shorter
    // string simply enters the chain of digit reads further down.
    switch(s.size()) {
    case 8:
        HH = (*(current++) - '0') * 10;
    case 7:
        HH += (*(current++) - '0');
    case 6:
        mm = (*(current++) - '0') * 10;
    // ... you get the idea.
    case 1:
        ff += (*current - '0');
    case 0: break;
    default: throw logic_error("invalid date");
    // except that this code goes so badly wrong if the input isn't
    // valid that there's not much point objecting to the length...
    }
}
But fundamentally, just 0-initialising those int variables is almost as much work as copying the string into a char buffer with padding, so I wouldn't expect to see any significant performance difference. I therefore don't actually recommend this solution in real life, just as an exercise in premature optimisation.
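For comparison, the "offsets -2, -4, -6, -8" idea from the question can also be written as a loop that walks the string from its end and takes up to two digits per field, without copying or padding. This is a minimal sketch rather than a replacement for the switch above: the name parse_time_digits is hypothetical, and it assumes a plain digit string of at most eight characters with no separators.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns {HH, mm, ss, ff} for a string of 0..8 digits.
// Fields are filled from the right, so "sff" is read as 00 00 0s ff.
std::array<int, 4> parse_time_digits(const std::string& s) {
    if (s.size() > 8) throw std::invalid_argument("too many digits");
    std::array<int, 4> fields{{0, 0, 0, 0}};
    std::size_t end = s.size();              // one past the current field
    for (int field = 3; field >= 0 && end > 0; --field) {
        // Step back two characters, or one if only one is left.
        std::size_t begin = (end >= 2) ? end - 2 : 0;
        int value = 0;
        for (std::size_t i = begin; i < end; ++i) {
            if (s[i] < '0' || s[i] > '9')
                throw std::invalid_argument("non-digit character");
            value = value * 10 + (s[i] - '0');
        }
        fields[field] = value;
        end = begin;
    }
    return fields;
}
```

Under these assumptions, parse_time_digits("123") gives 0, 0, 1, 23 and parse_time_digits("12345") gives 0, 1, 23, 45.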

Regular expressions come to mind. Something like "^(\\d??\\d??)(\\d??\\d??)(\\d??\\d??)(\\d?\\d?)$" with boost::regex. The first three groups are lazy so that a short input such as sff fills the rightmost fields first; with greedy groups the digits would be assigned from the left. Submatches will provide you with the digit values. It shouldn't be difficult to adapt to your other format with colons between numbers (see sep61.myopenid.com's answer). boost::regex is among the fastest regex parsers out there.

Related

How to convert one string to another by successive substitutions of characters?

I'm currently trying to design an algorithm that does the following:
I got two strings A and B which consist of lowercase characters 'a'-'z'
and I can modify string A using the following operations:
1. Select two characters 'c1' and 'c2' from the character set ['a'-'z'].
2. Replace all characters 'c1' in string A with character 'c2'.
I need to find the minimum number of operations needed to convert string A to string B when possible.
I have 2 ideas that didn't work
1. Simple range-based for cycle that changes string B and compares it with A.
2. Idea with map<char, int> that does the same.
Right now I'm stuck on unit tests for cases like these: 'ab' can be transformed into 'ba' in 3 operations and 'abc' into 'bca' in 4 operations.
My algorithm is wrong and I need some fresh ideas or working solution.
Can anyone help with this?
Here is some code that shows a minimal reproducible example:
int Transform(string& A, string& B)
{
    int count = 0;
    if(A.size() != B.size()){
        return -1;
    }
    for(int i = A.size() - 1; i >= 0; i--){
        if(A[i]!=B[i]){
            char rep_elem = A[i];
            ++count;
            replace(A.begin(),A.end(),rep_elem,B[i]);
        }
    }
    if(A != B){
        return -1;
    }
    return count;
}
How can I improve this, or should I look for another idea?

First of all, don't worry about string operations. Your problem is algorithmic, not textual. You should somehow analyze your data, and only afterwards print your solution.
Start with building a data structure which tells, for each letter, which letter it should be replaced with. Use an array (or std::map<char, char> — it should conceptually be similar, but have different syntax).
If you discover that you should convert a letter to two different letters — error, conversion impossible. Otherwise, count the number of non-trivial cycles in the conversion graph.
The length of your solution will be the number of letters which shouldn't be replaced by themselves plus the number of cycles.
I think the code to implement this would be too long to be helpful.
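That said, the counting step itself fits in a short function. The sketch below follows the recipe above with two additions that are my own assumptions rather than part of the answer: a cycle is only counted when no letter outside the cycle feeds into it (otherwise one of the merges frees a letter and no extra step is needed), and the conversion is treated as impossible when B already uses all 26 letters, since breaking a cycle needs a spare letter. The name min_replacements is made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Returns the minimum number of "replace every c1 with c2" operations that
// turn a into b, or -1 if it cannot be done. Assumes only 'a'..'z' occur.
int min_replacements(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;
    std::array<int, 26> target;
    target.fill(-1);                         // -1: letter does not occur in a
    for (std::size_t i = 0; i < a.size(); ++i) {
        int from = a[i] - 'a', to = b[i] - 'a';
        if (target[from] != -1 && target[from] != to) return -1;  // two targets
        target[from] = to;
    }
    std::array<int, 26> indegree{};          // incoming edges, self-loops excluded
    std::array<bool, 26> used_in_b{};
    int edges = 0;
    for (int c = 0; c < 26; ++c) {
        if (target[c] == -1) continue;
        used_in_b[target[c]] = true;
        if (target[c] != c) { ++edges; ++indegree[target[c]]; }
    }
    if (edges == 0) return 0;                // a already equals b
    int distinct_b = 0;
    for (bool used : used_in_b) distinct_b += used ? 1 : 0;
    if (distinct_b == 26) return -1;         // assumption: no spare letter left
    // Count cycles in which every letter has exactly one incoming edge.
    std::array<bool, 26> counted{};
    int cycles = 0;
    for (int start = 0; start < 26; ++start) {
        if (counted[start]) continue;
        int c = start;
        bool closed = false;
        for (int steps = 0; steps < 26; ++steps) {
            if (target[c] == -1 || target[c] == c || indegree[c] != 1) break;
            c = target[c];
            if (c == start) { closed = true; break; }
        }
        if (!closed) continue;
        ++cycles;
        do { counted[c] = true; c = target[c]; } while (c != start);
    }
    return edges + cycles;
}
```

On the two cases from the question this returns 3 for 'ab' to 'ba' and 4 for 'abc' to 'bca'.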

How do I convert a Char to Int?

So I have a String of integers that looks like "82389235", but I wanted to iterate through it to add each number individually to a MutableList. However, when I go about it the way I think it would be handled:
var text = "82389235"
for (num in text) numbers.add(num.toInt())
This adds numbers completely unrelated to the string to the list. Yet, if I use println to output it to the console it iterates through the string perfectly fine.
How do I properly convert a Char to an Int?

That's because num is a Char, i.e. the resulting values are the ASCII values of those chars.
This will do the trick:
val txt = "82389235"
val numbers = txt.map { it.toString().toInt() }
The map could be further simplified:
map(Character::getNumericValue)

The variable num is of type Char. Calling toInt() on this returns its ASCII code, and that's what you're appending to the list.
If you want to append the numerical value, you can just subtract the ASCII code of 0 from each digit:
numbers.add(num.toInt() - '0'.toInt())
Which is a bit nicer like this:
val zeroAscii = '0'.toInt()
for (num in text) {
    numbers.add(num.toInt() - zeroAscii)
}
This works with a map operation too, so that you don't have to create a MutableList at all:
val zeroAscii = '0'.toInt()
val numbers = text.map { it.toInt() - zeroAscii }
Alternatively, you could convert each character individually to a String, since String.toInt() actually parses the number - this seems a bit wasteful in terms of the objects created though:
numbers.add(num.toString().toInt())

On JVM there is efficient java.lang.Character.getNumericValue() available:
val numbers: List<Int> = "82389235".map(Character::getNumericValue)

Since Kotlin 1.5, there's a built-in function Char.digitToInt(): Int:
println('5'.digitToInt()) // 5 (int)
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/digit-to-int.html

For clarity, the zeroAscii answer can be simplified to
val numbers = txt.map { it - '0' }
as Char - Char -> Int. If you are looking to minimize the number of characters typed, that is the shortest answer I know. The
val numbers = txt.map(Character::getNumericValue)
may be the clearest answer, though, as it does not require the reader to know anything about the low-level details of ASCII codes. The toString().toInt() option requires the least knowledge of ASCII or Kotlin but is a bit weird and may be most puzzling to the readers of your code (though it was the thing I used to solve a bug before investigating if there really wasn't a better way!)

luhn algorithm and 'digit manipulation' in C++

In generating the 'check digit' in Luhn's algorithm
The check digit (x) is obtained by computing the sum of digits then computing 9 times that value modulo 10 (in equation form, (67 * 9 mod 10)). In algorithm form: Compute the sum of the digits (67). Multiply by 9 (603). The last digit, 3, is the check digit.
Natural instincts point towards taking an id as a string to make individual digit operations easier. But there seems to be no way to extract a digit at a time through stringstream since there's no delimiter (as far as I can tell). So the process turns into a cumbersome conversion of individual characters to ints...
There's modulus for each digit approach as well, which also takes a bit of work.
I guess what I'm getting at is that I feel maybe I'm overlooking a more elegant way taking an input and operating on the input as if they were single digit inputs.

Use modular arithmetic to simplify the equation as follows:
checkdigit = (sum_digits*9)%10
= ((sum_digits)%10*9)%10
Now sum_digits%10 is very simple to evaluate using strings.
C++ implementation :-
#include <iostream>
#include <string>
using namespace std;
int main() {
    string str;  // grows as needed, so long input cannot overflow a fixed buffer
    cout << "Enter the String: ";
    cin >> str;
    int val = 0;
    for (char c : str) {
        val = (val + (c - '0')) % 10;  // running digit sum modulo 10
    }
    val = (val * 9) % 10;
    cout << "Checkdigit(" << str << ") = " << val;
    return 0;
}

Stringstream has to calculate all digits and then convert each digit to a char by adding '0'. You'd have to subtract '0' again to get the digit values back. You'd be better off using the modulo approach directly.

C++ string parsing

All:
I have a question about string parsing:
Say I have a string like "+12+400-500+2:+13-50-510+20-66+20:"
How can I calculate the total sum of each segment (a : marks the end of a segment)? So far, all I can come up with is a for loop that walks the string and checks for the +/- signs, but I don't think that is a good general method for this kind of problem :(
For example, the first segment, +12+400-500+2 = -86, and the second segment is
+13-50-510+20-66+20 = -573
1) The number of operands varies (but they are always integers)
2) The number of segments varies
3) I need to do it in C++ or C.
I don't think this is a very simple question for most newbies, and I'd like to state that this is not homework. :)
best,

Since the string ends in a colon, it is easy to use find and substr to separate out parts of the string partitioned by ':', like this:
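string all("+12+400-500+2:+13-50-510+20-66+20:");
string::size_type pos = 0;
for (;;) {
    // find returns string::npos (an unsigned value) when there is no match,
    // so the result must not be stored in an int
    string::size_type next = all.find(':', pos);
    if (next == string::npos) break;
    string expr(all.substr(pos, (next-pos)+1));
    cout << expr << endl;
    pos = next+1;
}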
This splits the original string into parts
+12+400-500+2:
and
+13-50-510+20-66+20:
Since istreams take leading plus as well as leading minus, you can parse out the numbers using >> operator:
istringstream iss(expr);
int n;
// Test the extraction itself, so that the failed read at the ':' is not printed
while (iss >> n) {
    cout << n << endl;
}
With these two parts in hand, you can easily total up the individual numbers, and produce the desired output.
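Putting the two parts together might look like the sketch below. The function name segment_sums is made up for illustration, and it assumes every segment is terminated by a ':' as in the question; anything after the last colon is ignored.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: one sum per ':'-terminated segment of signed integers.
std::vector<int> segment_sums(const std::string& all) {
    std::vector<int> sums;
    std::string::size_type pos = 0;
    for (;;) {
        std::string::size_type next = all.find(':', pos);
        if (next == std::string::npos) break;
        // The segment without its trailing colon, e.g. "+12+400-500+2".
        std::istringstream iss(all.substr(pos, next - pos));
        int total = 0;
        int n;
        while (iss >> n) total += n;   // operator>> accepts a leading + or -
        sums.push_back(total);
        pos = next + 1;
    }
    return sums;
}
```

For the string from the question this produces -86 and -573, matching the sums worked out by hand there.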

You need to separate operands and operators. To do this you can use two queues, one for operands and one for operators.

split by :, then by +, then by -. translate into int and there you are.

Your expression language seems regular: you could use a regex library - like boost::regex - to match the numbers, the signs, and the segments in groups directly, with something like
((([+-])([0-9]+))+:)+

Convert any Unicode string to int

I have an arbitrary Unicode string that represents a number, such as "2", "٢" (U+0662, ARABIC-INDIC DIGIT TWO) or "Ⅱ" (U+2161, ROMAN NUMERAL TWO). I want to convert that string into an int. I don't care about specific locales (the input might not be in the current locale); if it's a valid number then it should get converted.
I tried QString.toInt and QLocale.toInt, but they don't seem to get the job done. Example:
bool ok;
int n;
QString s = QChar(0x0662); // ARABIC-INDIC DIGIT TWO
n = s.toInt(&ok); // n == 0; ok == false
QLocale anyLocale(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
n = anyLocale.toInt(s, &ok); // n == 0; ok == false
QLocale cLocale = QLocale::C;
n = cLocale.toInt(s, &ok); // n == 0; ok == false
QLocale arabicLocale = QLocale::Arabic; // Specific locale. I don't want that.
n = arabicLocale.toInt(s, &ok); // n == 2; ok == true
Is there a function I am missing?
I could try all locales:
QList<QLocale> allLocales = QLocale::matchingLocales(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
for(int i = 0; i < allLocales.size(); i++)
{
    n = allLocales[i].toInt(s, &ok);
    if(ok)
        break;
}
But that feels slightly hackish. Also, it does not work for all strings (e.g. Roman numerals, but that's an acceptable limitation). Are there any pitfalls when doing it that way, such as conflicting rules in different locales (cf. Turkish vs. non-Turkish letter case rules)?

I'm not aware of any ready to use package which does this (but
maybe ICU supports it), but it isn't hard to do if you really
want to. First, you should download the UnicodeData.txt file
from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
This is an easy to parse ASCII file; the exact syntax is
described in http://www.unicode.org/reports/tr44/tr44-10.html,
but for your purposes, all you need to know is that each line in
the file consists of semi-colon separated fields. The first
field contains the character code in hex, the third field the
"general category", and if the third field is "Nd" (numeric,
decimal), the seventh field contains the decimal value.
This file can easily be parsed using Python or a number of other
scripting languages, to build a mapping table. You'll want some
sort of sparse representation, since there are over a million
Unicode characters, of which very few (a couple of hundred) are
decimal digits. The Python script at the end of this answer will
give you a C++ table which can be used to initialize a
std::map<int, int>. If the character is
in the map, the mapped element is its value.
Whether this is sufficient or not depends on your application.
It has several weaknesses:
It requires extra logic to recognize when two successive
digits are in different alphabets. Presumably a sequence "1١"
should be treated as two numbers (1 and 1), rather than as one
(11). (Because all of the sets of decimal digits are in 10
successive codes, it would be fairly easy, once you know the
digit, to check whether the preceding digit character was in the
same set.)
It ignores non-decimal digits, like ௰ or ൱ (Tamil ten and
Malayalam one hundred). There aren't that many of them, and they are
also in the UnicodeData.txt file, so it might be possible to
find them manually and add them to the table. I don't know
myself, however, how they combine with other digits when numbers
have been composed.
If you're converting numbers, you might have to worry about
the direction. I'm not sure how this is handled (but there is
documentation at the Unicode site); in general, text will appear
in its natural order. In the case of Arabic and related
languages, when reading in the natural order, the low order
digits appear first: something like "١٢" (literally "12",
but because the writing is from right to left, the digits will
appear in the order "21") should be interpreted as 12, and not 21. Except that I'm not sure whether a change direction mark is
present or not. (The exact rules are described in the
documentation at the Unicode site; in the UnicodeData.txt file,
the fifth field—index 4—gives this information. I
think if it's anything but "AN", you can assume the big-endian
standard used in Europe, but I'm not sure.)
Just to show how simple this is, here's the Python script to
parse the UnicodeData.txt file for the digit values:
print('std::pair<int, int> initUnicodeMap[] = {')
with open("UnicodeData.txt") as data:
    for line in data:
        fields = line.split(';')
        if fields[2] == 'Nd':
            # fields[6] is the seventh field: the decimal digit value
            print('    {{{:d}, {:d}}},'.format(int(fields[0], 16), int(fields[6])))
print('};')
If you're doing any work with Unicode, this file is a gold mine
for generating all sorts of useful tables.
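To illustrate how such a table could be used, including the check that all digits of one number come from the same set, here is a small C++ sketch. It is not generated from UnicodeData.txt: it hard-codes the zero code points of only four digit blocks (ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari) and relies on each set occupying ten successive codes. The name parse_unicode_decimal is hypothetical, the input is assumed to be UTF-32 with the most significant digit first in memory order, and overflow is not handled.

```cpp
#include <string>

// Zero code points of a few decimal-digit sets. A full table would be
// generated from UnicodeData.txt; these four are only examples.
constexpr char32_t kDigitZeros[] = {0x0030, 0x0660, 0x06F0, 0x0966};

// Hypothetical helper: returns the value of s, or -1 if s is empty,
// contains a character outside the table, or mixes digit sets.
long parse_unicode_decimal(const std::u32string& s) {
    if (s.empty()) return -1;
    char32_t zero = 0;        // zero of the digit set seen so far
    bool have_zero = false;
    long value = 0;
    for (char32_t c : s) {
        bool matched = false;
        for (char32_t z : kDigitZeros) {
            if (c >= z && c < z + 10) {
                if (have_zero && z != zero) return -1;  // e.g. "1" then U+0661
                zero = z;
                have_zero = true;
                value = value * 10 + static_cast<long>(c - z);
                matched = true;
                break;
            }
        }
        if (!matched) return -1;
    }
    return value;
}
```

With this table, a string holding only U+0662 parses to 2, while a string mixing ASCII "1" with U+0661 is rejected.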

You can get the numeric equivalent of an unicode character with the method QChar::digitValue:
int value = QChar::digitValue((uint)0x0662);
It will return -1 if the character does not have numeric value.
See the documentation if you need more help, I don't really know much about c++/qt
Chinese numerals mentioned in that wikipedia article belong to 0x4E00-0x9FCC. There is no useful metadata about individual characters in this range:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
So if you wish to map chinese numerals to integers, you must do that mapping yourself, simple as that.
Here's simple mapping of the symbols in the wikipedia article where a single symbol maps to some single number:
0x96f6,0x3007 = 0
0x58f9,0x4e00,0x5f0c = 1
0x8cb3,0x8d30,0x4e8c,0x5f0d,0x5169,0x4e24 = 2
0x53c3,0x53c1,0x4e09,0x5f0e,0x53c3,0x53c2,0x53c4,0x53c1 = 3
0x8086,0x56db,0x4989 = 4
0x4f0d,0x4e94 = 5
0x9678,0x9646,0x516d = 6
0x67d2,0x4e03 = 7
0x634c,0x516b = 8
0x7396,0x4e5d = 9
0x62fe,0x5341,0x4ec0 = 10
0x4f70,0x767e = 100
0x4edf,0x5343 = 1000
0x842c,0x842c,0x4e07 = 10000
0x5104,0x5104,0x4ebf = 100000000
0x5e7a = 1
0x5169,0x4e24 = 2
0x5440 = 10
0x5ff5,0x5eff = 20
0x5345 = 30
0x534c = 40
0x7695 = 200
0x6d1e = 0
0x5e7a = 1
0x4e24 = 2
0x5200 = 4
0x62d0 = 7
0x52fe = 9
