Parsing log files containing NMEA sentences C++ - c++

I have multiple log files of NMEA sentences that contain geographical positions captured by a camera.
Example of one of the sentences: $GPRMC,100101.000,A,3723.1741,N,00559.5624,W,0.000,0.00,150914,,A*63
My question is, how do you reckon I should start on parsing these? I just need someone to push me in the right direction, thanks.

I use this checksum function in my GPSReverse driver.
// Requires <string> and <cstdio>
string chk(const char* data)
{
    // Assumes data contains an NMEA sentence starting with '$' (check it first)
    // Skip the leading '$' and start with an empty checksum
    const char *datapointer = &data[1];
    unsigned char checksum = 0;
    // XOR every character, stopping at '*' or at the end of the string
    while (*datapointer != '\0' && *datapointer != '*')
    {
        checksum ^= static_cast<unsigned char>(*datapointer);
        datapointer++;
    }
    // Format the checksum as two ASCII hex nybbles
    char x[3] = {0};
    snprintf(x, sizeof x, "%02X", checksum);
    return x;
}
After that, append the checksum to the NMEA string (say, GGA):
string re = chk(gga.c_str());
gga += "*";
gga += re;
gga += "\r\n";
So you can read up to the *, calculate the checksum, and see if it matches the string after the *.
Read more here.
Each sentence begins with a '$' and ends with a carriage return/line
feed sequence and can be no longer than 80 characters of visible text
(plus the line terminators). The data is contained within this single
line with data items separated by commas. The data itself is just
ascii text and may extend over multiple sentences in certain
specialized instances but is normally fully contained in one variable
length sentence. The data may vary in the amount of precision
contained in the message. For example time might be indicated to
decimal parts of a second or location may be shown with 3 or even 4
digits after the decimal point. Programs that read the data should
only use the commas to determine the field boundaries and not depend
on column positions. There is a provision for a checksum at the end of
each sentence which may or may not be checked by the unit that reads
the data. The checksum field consists of a '*' and two hex digits
representing an 8 bit exclusive OR of all characters between, but not
including, the '$' and '*'. A checksum is required on some sentences.
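To tie the description above to the original question, here is a rough Python sketch of reading one sentence: verify the checksum, split on commas, and convert a latitude or longitude field. The function names (nmea_checksum, parse_sentence, to_degrees) are made up for this example, and it assumes every sentence carries a checksum and that coordinates use the usual ddmm.mmmm / dddmm.mmmm layout.

```python
def nmea_checksum(body):
    """XOR of all characters between '$' and '*', as two uppercase hex digits."""
    value = 0
    for ch in body:
        value ^= ord(ch)
    return "%02X" % value


def parse_sentence(line):
    """Return the comma-separated fields, or None if the line fails the checks.

    Assumption: sentences without a '*' checksum are rejected here, although
    the format only requires a checksum on some sentences.
    """
    line = line.strip()  # drop the trailing carriage return / line feed
    if not line.startswith("$") or "*" not in line:
        return None
    body, _, received = line[1:].rpartition("*")
    if nmea_checksum(body) != received.upper():
        return None
    # Only the commas decide field boundaries, never column positions
    return body.split(",")


def to_degrees(value, hemisphere):
    """Convert ddmm.mmmm (or dddmm.mmmm) to signed decimal degrees."""
    dot = value.index(".")
    degrees = int(value[:dot - 2])      # everything before the two minute digits
    minutes = float(value[dot - 2:])    # mm.mmmm
    result = degrees + minutes / 60.0
    return -result if hemisphere in ("S", "W") else result
```

For a log file you would call parse_sentence on each line, skip the ones that return None, and for RMC sentences pass fields[3], fields[4] (latitude) and fields[5], fields[6] (longitude) to to_degrees.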

Related

Reading from a file, multiple delimiters

So I have some code which reads from a file and separates fields by the commas in the file. However, some entries in the file have spaces before or after the commas, which causes problems when the code runs.
This is the code that reads in the data from the file. Keeping the same kind of format, I was wondering if there is a way to handle these spaces.
while(getline(inFile, line)){
    stringstream linestream(line);
    // each field of the inventory file
    string type;
    string code;
    string countRaw;
    int count;
    string priceRaw;
    int price;
    string other;
    //
    if(getline(linestream,type,',') && getline(linestream,code,',')
       && getline(linestream,countRaw,',')&& getline(linestream,priceRaw,',')){
        // optional other
        getline(linestream,other,',');
        count = atoi(countRaw.c_str());
        price = atoi(priceRaw.c_str());
        StockItem *t = factoryFunction(code, count, price, other, type);
        list.tailAppend(t);
    }
}
A better approach for this kind of problem is a state machine, where each character you read triggers one simple action. You don't say whether spaces inside a field (between words not separated by commas) must be kept, so I'll assume they must. I also don't know what you want to do with double spaces, so I'll assume they should be kept as they are. Read one character at a time and keep two variables: the start position and the limit position. You begin in state 1, looking for the start of a field. When you find any character other than a space, set the start position to that character (and the limit position to the position just after it) and switch to state 2. In state 2, whenever you find a non-space character, set the limit position to the position just after that character. When you find a comma, take the string that runs from start to limit and switch back to state 1.
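The state machine described above could be sketched like this in Python (the question is C++, but the logic carries over directly). The name split_trimmed is invented for the example; a start of None plays the role of state 1, and an empty field is assumed to come out as an empty string.

```python
def split_trimmed(line, delimiter=","):
    """Split on the delimiter, dropping spaces around each field.

    Spaces inside a field, including double spaces, are kept as they are.
    """
    fields = []
    start = None   # None means state 1: still looking for the field start
    limit = None   # one past the last non-space character seen so far
    for pos, ch in enumerate(line):
        if ch == delimiter:
            # End of field: emit start..limit and go back to state 1
            fields.append("" if start is None else line[start:limit])
            start = limit = None
        elif ch != " ":
            if start is None:
                start = pos      # state 1 -> state 2
            limit = pos + 1      # state 2: extend the field
    # The last field has no trailing delimiter
    fields.append("" if start is None else line[start:limit])
    return fields


print(split_trimmed("book , B12,  3 ,10 , red  cover"))
```

Each resulting field can then go through the same atoi-style conversion as before, since the surrounding spaces are already gone.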

How can I parse a char array with octal values in Python?

EDIT: I should note that I want a general case for any hex array, not just the google one I provided.
EDIT BACKGROUND: Background is networking: I'm parsing a DNS packet and trying to get its QNAME. I'm taking in the whole packet as a string, and every character represents a byte. Apparently this problem looks like a Pascal string problem, and using the struct module seems like the way to go.
I have a char array in Python 2.7 which includes octal values. For example, let's say I have an array
DNS = "\03www\06google\03com\0"
I want to get:
www.google.com
What's an efficient way to do this? My first thought would be iterating through the DNS char array and adding chars to my new array answer. Every time I see a '\' char, I would ignore the '\' and the two chars after it. Is there a way to get the resulting www.google.com without using a new array?
My disgusting implementation (my answer is an array of chars, which is not what I want; I want just the string www.google.com):
DNS = "\\03www\\06google\\03com\\0"
answer = []
i = 0
while i < len(DNS):
    if DNS[i] == '\\' and DNS[i+1] != 0:
        i += 3
    elif DNS[i] == '\\' and DNS[i+1] == 0:
        break
    else:
        answer.append(DNS[i])
        i += 1
Now that you've explained your real problem, none of the answers you've gotten so far will work. Why? Because they're all ways to remove sequences like \03 from a string. But you don't have sequences like \03, you have single control characters.
You could, of course, do something similar, just replacing any control character with a dot.
But what you're really trying to do is not replace control characters with dots, but parse DNS packets.
DNS is defined by RFC 1035. The QNAME in a DNS packet is:
a domain name represented as a sequence of labels, where each label consists of a length octet followed by that number of octets. The domain name terminates with the zero length octet for the null label of the root. Note that this field may be an odd number of octets; no padding is used.
So, let's parse that. If you understand how labels consisting of "a length octet followed by that number of octets" relate to "Pascal strings", there's a quicker way. Also, you could write this more cleanly and less verbosely as a generator. But let's do it the dead-simple way:
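import struct

def parse_qname(packet):
    components = []
    offset = 0
    while True:
        # Each label starts with a one-byte length
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        if not length:
            # A zero length is the root label, which ends the name
            break
        # unpack_from returns a tuple, so unpack its single value
        component, = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        components.append(component)
    return components, offset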
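For what it's worth, the generator variant mentioned above might look something like the following in Python 3, where the packet is a bytes object. The name iter_labels and the sample packet are only for illustration, and the sketch assumes an uncompressed name; it does not follow the compression pointers that RFC 1035 also allows.

```python
import struct


def iter_labels(packet, offset=0):
    """Yield each label of a length-prefixed DNS name as bytes."""
    while True:
        (length,) = struct.unpack_from("B", packet, offset)
        offset += 1
        if length == 0:
            return  # zero-length root label terminates the name
        (label,) = struct.unpack_from("{}s".format(length), packet, offset)
        offset += length
        yield label


packet = b"\x03www\x06google\x03com\x00"
print(b".".join(iter_labels(packet)).decode("ascii"))  # www.google.com
```

Joining the labels with dots gives the plain string rather than an array of chars.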
import re
DNS = "\\03www\\06google\\03com\\0"
m = re.sub("\\\\([0-9,a-f]){2}", "", DNS)
print(m)
Maybe something like this?
#!/usr/bin/python3
import re
def convert(adorned_hostname):
    result1 = re.sub(r'^\\03', '', adorned_hostname )
    result2 = re.sub(r'\\0[36]', '.', result1)
    result3 = re.sub(r'\\0$', '', result2)
    return result3

def main():
    adorned_hostname = r"\03www\06google\03com\0"
    expected_result = 'www.google.com'
    actual_result = convert(adorned_hostname)
    print(actual_result, expected_result)
    assert actual_result == expected_result

main()
For the question as originally asked, replacing the backslash-hex sequences in strings like "\\03www\\06google\\03com\\0" with dots…
If you want to do this with a regular expression:
\\ matches a backslash.
[0-9A-Fa-f] matches any hex digit.
[0-9A-Fa-f]+ matches one or more hex digits.
\\[0-9A-Fa-f]+ matches a backslash followed by one or more hex digits.
You want to find each such sequence, and replace it with a dot, right? If you look through the re docs, you'll find a function called sub which is used for replacing a pattern with a replacement string:
re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)
I suspect these may actually be octal, not hex, in which case you want [0-7] rather than [0-9A-Fa-f], but nothing else would change.
A different way to do this is to recognize that these are valid Python escape sequences. And, if we unescape them back to where they came from (e.g., with DNS.decode('string_escape')), this turns into a sequence of length-prefixed (aka "Pascal") strings, a standard format that you can parse in any number of ways, including the stdlib struct module. This has the advantage of validating the data as you read it, and not being thrown off by any false positives that could show up if one of the string components, say, had a backslash in the middle of it.
Of course that's presuming more about the data. It seems likely that the real meaning of this is "a sequence of length-prefixed strings, concatenated, then backslash-escaped", in which case you should parse it as such. But it could be just a coincidence that it looks like that, in which case it would be a very bad idea to parse it as such.
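If the data really is "length-prefixed strings, then backslash-escaped", the unescape-then-parse route could be sketched as below. This is Python 3, where the 'unicode_escape' codec stands in for Python 2's 'string_escape'; unescape_and_join is a made-up name, and the sketch assumes pure ASCII input in which no label starts with an octal digit right after its length escape (otherwise the escape would swallow that digit).

```python
def unescape_and_join(escaped):
    """Turn text like r'\\03www\\06google\\03com\\0' into 'www.google.com'."""
    # Undo the backslash escapes: r'\03' becomes the single character '\x03'
    raw = escaped.encode("ascii").decode("unicode_escape")
    labels = []
    offset = 0
    while True:
        length = ord(raw[offset])   # length prefix of the next label
        offset += 1
        if length == 0:
            break                   # zero length ends the name
        labels.append(raw[offset:offset + length])
        offset += length
    return ".".join(labels)


print(unescape_and_join(r"\03www\06google\03com\0"))  # www.google.com
```

Unlike the regex, this uses the length bytes themselves, so a backslash in the middle of a label would not be mistaken for a separator.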

Decrypt an encrypted string

Let's say there is a certain way of encrypting strings:
Append the character $, which is the first character in the alphabet, at the end of the string.
Form all the strings we get by continuously moving the first character to the end of the string.
Sort all the strings we have gotten into alphabetical order.
Form a new string by appending last character of each string to it.
For example, the word FRUIT is encrypted in the following manner:
We append the character $ at the end of the word:
FRUIT$
We then form all the strings by moving the first character at the end:
FRUIT$
RUIT$F
UIT$FR
IT$FRU
T$FRUI
$FRUIT
Then we sort the new strings into alphabetical order:
$FRUIT
FRUIT$
IT$FRU
RUIT$F
T$FRUI
UIT$FR
The encrypted string:
T$UFIR
Now my problem is obvious: how do I decrypt a given string back into its original form?
I've been pounding my head for half a week now and I've finally run out of paper.
How should I get on with this?
What I have discovered:
if we have the last step of the encryption:
$FRUIT
FRUIT$
IT$FRU
RUIT$F
T$FRUI
UIT$FR
We can know the first and last character of the original string, since the rightmost column is the encrypted string itself, and the leftmost column is always in alphabetical order. The last character is the first character of the encrypted string, because $ is always first in the alphabet, and it only exists once in a string. Then, if we find the $ character from the rightmost column, and look up the character on the same row in the leftmost column, we get the first character.
So what we can know about the encrypted string T$UFIR is that the original string is F***T$, where * is an unknown character.
That is where my ideas end. Now I have to turn to the world wide web and ask another human being: how?
You could say this is homework and, being familiar with my tutor, I'd bet on this being a dynamic programming problem.
This is the Burrows-Wheeler transform.
It's an algorithm typically used for aiding compression algorithms, as it tends to group together common repeating phrases, and is reversible.
To decode your string:
Number each character:
T$UFIR
012345
Now sort, retaining the numbering. If characters repeat, you use the indices as a secondary sort-key, such that the indices for the repeated characters are kept in increasing order, or otherwise use a sorting algorithm that guarantees this.
$FIRTU
134502
Now we can decode. Start at the '$', and use the associated index as the next character to output ('$' = 1, so the next char is 'F'. 'F' is 3, so the next char is 'R', etc...)
The result:
$FRUIT
So just remove the marker character, and you're done.
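Here is a small Python sketch of both directions, following the numbering-and-sorting steps above. The names bwt_encode and bwt_decode are my own, and it assumes the marker '$' sorts before every other character in the text, which holds for letters and digits in ASCII.

```python
def bwt_encode(text, marker="$"):
    """Append the marker, sort all rotations, and take the last column."""
    s = text + marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_decode(encoded, marker="$"):
    """Invert the transform by numbering, sorting, and following indices."""
    # Sort positions by character, using the position itself as a
    # secondary key so repeated characters keep increasing indices
    order = sorted(range(len(encoded)), key=lambda i: (encoded[i], i))
    sorted_chars = [encoded[i] for i in order]
    pos = sorted_chars.index(marker)   # start at the marker
    out = []
    for _ in range(len(encoded)):
        out.append(sorted_chars[pos])
        pos = order[pos]               # the associated index is the next row
    # The output starts with the marker, so drop it
    return "".join(out[1:])


print(bwt_encode("FRUIT"))   # T$UFIR
print(bwt_decode("T$UFIR"))  # FRUIT
```

No dynamic programming is needed; the decode is one sort plus one pass over the string.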

REGEX - Insert space after every 4 characters, and a line break after every 40 characters

I have a huge string (22000+ characters) of encoded text. The code consists of digits [0-9] and lower case letters [a-z]. I need a regular expression to insert a space after every 4 characters, and one to insert a line break [\n] after every forty characters. Any ideas?
Many people would prefer to do this with a for loop and string concatenation, but I hate those substring calls. I am really against using regexes when they aren't the right tool for the job (such as parsing HTML), but I think it'd be pretty easy to work with in this case.
JSFiddle Example
Let's say you have the string
var str = "aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooo";
And you want to insert a space after every four characters, and a newline after 40 characters, you could use the following code
str.replace(/.{4}/g, function (value, index) {
    return value + (index % 40 == 36 ? '\n' : ' ');
});
Note that this wouldn't work if the newline(40) index wasn't a multiple of the space index(4)
I abstracted this in a project, here's a simple way to do it
/**
 * Adds padding and newlines into a string without whitespace
 * @param {string} str The string to be modified (any whitespace will be stripped)
 * @param {number} spaceEvery number of characters before inserting a space
 * @param {number} wrapEvery number of spaces before using a newline instead
 * @return {string} The replaced string
 */
function addPadding(str, spaceEvery, wrapEvery) {
    var regex = new RegExp(".{" + spaceEvery + "}", "g");
    // Add space every {spaceEvery} chars, newline after {wrapEvery} spaces
    return str.replace(/[\n\s]/g, '').replace(regex, function(value, index) {
        // index is the offset at which the matched group starts
        var newlineIndex = spaceEvery * (wrapEvery - 1);
        return value + ((index % (spaceEvery * wrapEvery) === newlineIndex) ? '\n' : ' ');
    });
}
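The same idea carries over to other regex engines that accept a replacement callback. As a rough illustration, here is how it might look in Python with re.sub; add_padding is just a name chosen for this sketch, and like the JavaScript version it leaves a trailing group shorter than space_every untouched.

```python
import re


def add_padding(text, space_every=4, wrap_every=10):
    """Insert a space after each group, and a newline after every wrap_every-th group."""
    compact = re.sub(r"\s", "", text)  # strip any existing whitespace first

    def separator(match):
        # Which group is this, counting from zero?
        group_number = match.start() // space_every
        last_in_row = group_number % wrap_every == wrap_every - 1
        return match.group(0) + ("\n" if last_in_row else " ")

    return re.sub(".{%d}" % space_every, separator, compact)


print(add_padding("aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooo"))
```

Counting groups instead of comparing raw offsets keeps the newline logic independent of the group size.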
Well, a regexp in itself doesn't insert a space, so I'll assume you have some command in whatever language you're using that inserts based on finding a regexp.
So, finding 4 characters and finding 40 characters: that's not pretty in general regular expressions (unless your particular implementation has nice ways to express numbers). For finding 4 characters, use
....
Because typical regexp finders use maximal munch, then from the end of one regexp, search forward and maximally munch again, that'll chunk your string into 4 character pieces. The ugly part is that in standard regular expressions, you'll have to use
........................................
to find chunks of 40 characters, although I'll note that if you run your 4 character one first, you'll have to run
..................................................
or
.... .... .... .... .... .... .... .... .... ....
to account for the spaces you've already put in.
The period matches any character, but given that you're only using [0-9a-z], you could use that character class in place of each period if you need to ensure nothing else slipped in; I was just avoiding making it even more gross.
As you may be noting, regexp have some limitations. Take a look at the Chomsky hierarchy to really get into their theoretical limitations.

input, output and \n's

I'm trying to solve a problem that asks me to look for palindromes in strings. It seems like I've got everything right except for the output.
Here's the expected output and my output:
http://pastebin.com/c6Gh8kB9
Here's what's been said about the input and output of the problem:
Input format :
A file with no more than 20,000
characters. The file has one or more
lines. No line is longer than 80
characters (not counting the newline
at the end).
Output format :
The first line of the output should be the length of the longest
palindrome found. The next line or
lines should be the actual text of the
palindrome (without any surrounding
white space or punctuation but with
all other characters) printed on a
line (or more than one line if
newlines are included in the
palindromic text). If there are
multiple palindromes of longest
length, output the one that appears
first.
Here's how I read the input :
string test;
string original;
while (getline(fin,test))
    original += test;
And here's how I output it:
int len = answer.length();
answer = cleanUp(answer);
while (len > 0){
    string s3 = answer.substr(0,80);
    answer.erase(0,80);
    fout << s3 << endl;
    len -= 80;
}
cleanUp() is a function to remove the illegal characters from the beginning and the end. I'm guessing that the problem is with \n's and the way I read the input. How can I fix this ?
No line is longer than 80 characters (not counting the newline at the end)
does not imply that every line is 80 characters except for the last, while your output code does assume this by taking 80 characters off answer in every iteration.
You may want to keep the newlines in the string until the output phase. Alternatively, you might store newline positions in a separate std::vector. The first option complicates your palindrome search routine; the second your output code.
(If I were you, I'd also index into answer instead of taking chunks off with substr/erase; your output code is now O(n^2) while it could be O(n).)
After rereading, it appears that I misunderstood the question. I was thinking in terms of each line representing a single word, and the intent is to test whether that "word" is palindromic.
I now think the question is really more like: "Given a sequence of up to 20,000 characters, find the longest palindromic substring. Oh, incidentally, the input is broken up into lines of no more than 80 characters."
If that's correct, I'd ignore the line-length completely. I'd read the entire file into a single buffer, then search for palindromes in that buffer.
To find the palindromes, I'd simply walk through each position in the array, and find the longest possible palindrome with that as its center point:
for (int i = 1; i < total_chars; i++)
    for (int n = 1; n < min(i, total_chars - i); n++)
        if (array[i + n] != array[i - n]) {
            // Candidate palindrome is from array[i-n+1] to array[i+n-1]
            break; // stop expanding around this center
        }
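To make the center-expansion idea concrete, here is a short Python sketch (not the original C++). It also tries the gap between two characters as a center, so even-length palindromes are found too. The name longest_palindrome is invented for the example, and the sketch compares raw characters only; skipping punctuation or ignoring case, if the problem requires that, would need an extra mapping step.

```python
def longest_palindrome(text):
    """Return the first longest palindromic substring of text."""
    best_start, best_len = 0, 0
    for center in range(len(text)):
        # Try an odd-length center (a character) and an even-length one (a gap)
        for left, right in ((center, center), (center, center + 1)):
            while left >= 0 and right < len(text) and text[left] == text[right]:
                left -= 1
                right += 1
            length = right - left - 1   # the loop overshoots by one on each side
            if length > best_len:       # strict '>' keeps the earliest match
                best_start, best_len = left + 1, length
    return text[best_start:best_start + best_len]


print(longest_palindrome("xabbay"))  # abba
```

This is O(n^2) in the worst case, which should be fine for 20,000 characters unless the input is one long run of the same letter.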