Python C Header File Parsing and Reverse Initialization - c++

I am interested in parsing C header files (only structures and variable declarations) using Python in a recursive manner.
Here is an example of what I am looking for. Suppose the following:
typedef struct
{
double value[3];
} vector3;
typedef struct
{
unsigned int variable_a[4][2];
vector3 variable_b[5];
} my_example;
Also, suppose there is a file that contains initialization values such as:
ANCHOR_STRUCT(my_example) =
{
// variable_a
{ {1,2}, {3, 4}, {5,6} ,{7,8} },
// variable_b
{ {1.0,2.0,3.0}, {4.0,5.0,6.0}, {7.0,8.0,9.0}, {10.0,11.0,12.0}, {13.0,14.0,15.0} }
}
I would like to be able to parse both of these files and be able to generate a report such as:
OUTPUT:
my_example.variable_a[0][0] = 1
my_example.variable_a[0][1] = 2
my_example.variable_a[1][0] = 3
my_example.variable_a[1][1] = 4
my_example.variable_a[2][0] = 5
my_example.variable_a[2][1] = 6
my_example.variable_a[3][0] = 7
my_example.variable_a[3][1] = 8
my_example.variable_b[0].value[0] = 1
my_example.variable_b[0].value[1] = 2
my_example.variable_b[0].value[2] = 3
my_example.variable_b[1].value[0] = 4
my_example.variable_b[1].value[1] = 5
my_example.variable_b[1].value[2] = 6
my_example.variable_b[2].value[0] = 7
my_example.variable_b[2].value[1] = 8
my_example.variable_b[2].value[2] = 9
my_example.variable_b[3].value[0] = 10
my_example.variable_b[3].value[1] = 11
my_example.variable_b[3].value[2] = 12
my_example.variable_b[4].value[0] = 13
my_example.variable_b[4].value[1] = 14
my_example.variable_b[4].value[2] = 15
I would like to be able to report this without running the code (only through parsing). Is there an existing Python tool that would parse and print this information? I'd also like to print out the data type.
It seems a bit complicated to parse the "{", "," and "}" in the initialization file and match them with the structure's variables and children. Matching the values with the correct code name seems difficult because the order is very important. I also assume recursion is needed for parent/children/grandchildren variables.
Thanks,
Ned

Unless you restrict yourself to simple data types, this is going to get very complicated. For example, do you want to handle arbitrary data types such as nested classes?
You say you don't want to run the c-sources, but what you are trying to do here is build your own c-interpreter! Are you sure you want to reinvent the wheel? If yes...
The first thing you need to be able to do is parse the file. You can use a lexer and parser generator such as PLY. Once you have the parse tree, you can analyze what your variables are and what their intended values are.
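To give a feel for the recursive part, here is a minimal sketch in C++ of a hand-written recursive-descent reader for the brace initializer only. It is not PLY and it does not parse the header: the names Node, parseValue and flatten are made up for illustration, and the output paths use positional indices only. Mapping those positions to member names such as variable_a or .value would still need the struct declarations from the header parse.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A parsed initializer: either a scalar token or a braced list of children.
struct Node {
    std::string scalar;
    std::vector<Node> children;
};

// Skips whitespace and // line comments.
void skipBlank(const std::string& s, std::size_t& pos) {
    while (pos < s.size()) {
        if (std::isspace(static_cast<unsigned char>(s[pos]))) {
            ++pos;
        } else if (s.compare(pos, 2, "//") == 0) {
            while (pos < s.size() && s[pos] != '\n') ++pos;
        } else {
            break;
        }
    }
}

// Parses one value starting at pos: "{ a, b, ... }" or a bare token.
Node parseValue(const std::string& s, std::size_t& pos) {
    Node node;
    skipBlank(s, pos);
    if (pos < s.size() && s[pos] == '{') {
        ++pos;  // consume '{'
        skipBlank(s, pos);
        while (pos < s.size() && s[pos] != '}') {
            node.children.push_back(parseValue(s, pos));  // recurse into the element
            skipBlank(s, pos);
            if (pos < s.size() && s[pos] == ',') ++pos;
            skipBlank(s, pos);
        }
        if (pos < s.size()) ++pos;  // consume '}'
    } else {
        while (pos < s.size() && s[pos] != ',' && s[pos] != '}' &&
               !std::isspace(static_cast<unsigned char>(s[pos]))) {
            node.scalar += s[pos++];
        }
    }
    return node;
}

// Emits one "path[i][j]... = value" line per scalar, in declaration order.
void flatten(const Node& node, const std::string& path,
             std::vector<std::string>& out) {
    if (node.children.empty()) {
        out.push_back(path + " = " + node.scalar);
        return;
    }
    for (std::size_t i = 0; i < node.children.size(); ++i) {
        flatten(node.children[i], path + "[" + std::to_string(i) + "]", out);
    }
}
```

Because C initializers are positional, the i-th child of the top-level list corresponds to the i-th member of the struct, which is where the header information would plug in.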

Related

Binary files: write with C++, read with MATLAB

I could use your support on this. Here is my issue:
I've got a 2D buffer of floats (in a data object) in a C++ code, that I write in a binary file using:
ptrToFile.write(reinterpret_cast<char *>(&data->array[0][0]), nbOfEltsInArray * sizeof(float));
The data contains 8192 floats, and I (correctly ?) get a 32 kbytes (8192 * 4 bytes) file out of this line of code.
Now I want to read that binary file using MATLAB. The code is:
hdr_binaryfile = fopen(str_binaryfile_path,'r');
res2_raw = fread(hdr_binaryfile, 'float');
res2 = reshape(res2_raw, int_sizel, int_sizec);
But it's not happening as I expect it to happen. If I print the array of data in the C++ code using std::cout, I get:
pCarte_bin->m_size = 8192
pCarte_bin->m_sizel = 64
pCarte_bin->m_sizec = 128
pCarte_bin->m_p[0][0] = 1014.97
pCarte_bin->m_p[0][1] = 566946
pCarte_bin->m_p[0][2] = 423177
pCarte_bin->m_p[0][3] = 497375
pCarte_bin->m_p[0][4] = 624860
pCarte_bin->m_p[0][5] = 478834
pCarte_bin->m_p[1][0] = 2652.25
pCarte_bin->m_p[2][0] = 642077
pCarte_bin->m_p[3][0] = 5.33649e+006
pCarte_bin->m_p[4][0] = 3.80922e+006
pCarte_bin->m_p[5][0] = 568725
And on the MATLAB side, after I read the file using the little block of code above:
size(res2) = 64 128
res2(1,1) = 1014.9659
res2(1,2) = 323288.4063
res2(1,3) = 2652.2515
res2(1,4) = 457593.375
res2(1,5) = 642076.6875
res2(1,6) = 581674.625
res2(2,1) = 566946.1875
res2(3,1) = 423177.1563
res2(4,1) = 497374.6563
res2(5,1) = 624860.0625
res2(6,1) = 478833.7188
The size (lines, columns) is OK, as well as the very first item ([0][0] in C++ == [1][1] in MATLAB). But:
I'm reading the C++ line elements along the columns: [0][1] in C++ == [1][2] in MATLAB (remember that indexing starts at 1 in MATLAB), etc.
I'm reading one correct element out of two along the other dimension: [1][0] in C++ == [1][3] in MATLAB, [2][0] == [1][5], etc.
Any idea about this ?
Thanks!
bye
Leaving aside the apparent precision difference (most likely just the display settings in MATLAB), the issue here is probably the difference between row-major and column-major ordering of data, although without more details it is hard to be certain. C and C++ store 2D arrays in row-major order, whereas MATLAB is column-major: it interprets contiguous data as sequential elements of a column rather than of a row.
The likely solution is to reverse the two sizes in your reshape and access the elements with the indices reversed. That is, swap int_sizel and int_sizec in the reshape call, and then read elements expecting
pCarte_bin->m_p[0][0] = res2(1,1)
pCarte_bin->m_p[0][1] = res2(2,1)
pCarte_bin->m_p[0][2] = res2(3,1)
pCarte_bin->m_p[0][3] = res2(4,1)
pCarte_bin->m_p[1][0] = res2(1,2)
etc.
You could also transpose the array in MATLAB after the read, but for a large array that could be costly in itself.
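The index mapping can be checked with a small C++ sketch that reads the same flat buffer both ways. The helper names rowMajorAt and columnMajorAt are made up for illustration; columnMajorAt mimics how a column-major reshape would address the data, with zero-based indices.

```cpp
#include <cstddef>
#include <vector>

// Element [r][c] of a row-major array with nCols columns (the C++ layout).
float rowMajorAt(const std::vector<float>& flat, std::size_t nCols,
                 std::size_t r, std::size_t c) {
    return flat[r * nCols + c];
}

// Element (i, j), zero-based, of a column-major matrix with nRows rows
// (the layout MATLAB assumes when it reshapes a flat vector).
float columnMajorAt(const std::vector<float>& flat, std::size_t nRows,
                    std::size_t i, std::size_t j) {
    return flat[i + j * nRows];
}
```

For a C++ array with R rows and C columns, columnMajorAt(flat, C, c, r) returns the same element as rowMajorAt(flat, C, r, c), which is the "swap the sizes, swap the indices" rule described above.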

C equivalent of C++ enum data for binary reading and manipulating

I'm trying to convert some C++ code into C. I'm working with binary data and need a C equivalent of this:
enum GssipFlags : uint16_t
{
SPARE0 = 1,
SPARE1 = 2 * SPARE0,
SPARE2 = 2 * SPARE1,
SPARE3 = 2 * SPARE2,
REQ_MSG = 2 * SPARE3,
DISCONNECT = 2 * REQ_MSG,
CONNECT = 2 * DISCONNECT,
INVALID_DATA = 2 * CONNECT,
CMD_REJECT = 2 * INVALID_DATA,
HANDSHAKE = 2 * CMD_REJECT,
NAK_MSG = 2 * HANDSHAKE,
ACK_MSG = 2 * NAK_MSG,
ACK_REQ = 2 * ACK_MSG,
RESYNC = 2 * ACK_REQ,
MODE = 2 * RESYNC,
READY = 2 * MODE
};
enum GssipMessageIDs : uint16_t
{
CCCCCCCC = 1,
RECEIVER_ID_MSG = 2,
BUFFER_BOX_STATUS_REQUEST_MSG = 3,
SETUP_DATA_5031 = 4,
WARNING_MSG = 5,
TIME_TRANSFER = 6
};
enum GssipWarningMsgIDs : uint16_t
{
EXTERNAL_POWER_DISCONNECT = 17,
SELF_TEST_OK = 8,
AAAAA = 9,
BBBBB = 10
};
Everything I've tried hasn't worked. The main thing I need is for everything to be uint16_t.
You have one standard option and two potential options depending on your compiler and what "I'm working with binary data and need to use a C" means (memory usage?, speed?, etc):
1. As already mentioned in the comments, use a struct with members of the type you are looking for:
typedef struct {
    uint16_t SPARE0;
    ...
} GssipFlags_t;
GssipFlags_t a = {
    .SPARE0 = 1,
    ...
};
2. If you are trying to reduce the size of enums, take advantage of the compiler (if available) and use -fshort-enums. From the GCC documentation:
Allocate to an enum type only as many bytes as it needs for the declared range of possible values. Specifically, the enum type is equivalent to the smallest integer type that has enough room.
3. Use __attribute__((packed)) to remove the padding added between members (which may make things slower due to the cost of accessing unaligned data).
4. If you don't mind the size and your only concern is to compile C++11 code with a C compiler (which might produce the same output as adding -fshort-enums), just do:
enum GssipFlags {
    SPARE0 = 1,
    ...
};
Items 2, 3 and 4 don't explicitly create members with uint16_t type, but if this is a XY problem, they provide different solutions depending on your real issue.
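Another common pattern, not listed above and offered here only as a sketch, is to keep the enum purely as a source of named constants and store the values in a uint16_t typedef. The code below is written so that it compiles as both C and C++; the helper hasFlag is a made-up name, while the constant names and values are those from the question.

```cpp
#include <stdint.h>

/* Named constants only; the enum type itself is never used for storage. */
enum {
    SPARE0       = 1 << 0,
    SPARE1       = 1 << 1,
    SPARE2       = 1 << 2,
    SPARE3       = 1 << 3,
    REQ_MSG      = 1 << 4,
    DISCONNECT   = 1 << 5,
    CONNECT      = 1 << 6,
    INVALID_DATA = 1 << 7,
    CMD_REJECT   = 1 << 8,
    HANDSHAKE    = 1 << 9,
    NAK_MSG      = 1 << 10,
    ACK_MSG      = 1 << 11,
    ACK_REQ      = 1 << 12,
    RESYNC       = 1 << 13,
    MODE         = 1 << 14,
    READY        = 1 << 15
};

/* Storage type: always exactly 16 bits, whatever the compiler picks for the enum. */
typedef uint16_t GssipFlags;

/* Returns nonzero if every bit of mask is set in flags. */
static inline int hasFlag(GssipFlags flags, GssipFlags mask) {
    return (flags & mask) == mask;
}
```

Fields read from the binary stream are then declared as GssipFlags, so their size does not depend on compiler flags.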

Storing file data variables in a two-dimensional array

I have a .txt file that I'm trying to read data from, so that the values can be stored in variables and passed to other functions in my code.
Here's an example of my text file:
0 10 a namez 1 0
0 11 b namea 1 1
1 12 c nameb 1 1
2 13 d namec 0 1
3 14 e named 1 1
So my file will not always be the same number of lines, but always the same number of variables per line.
I currently have this, to firstly get the length of the file and then change the amount of rows within the array:
int FileLength()
{
int linecount = 0;
string line;
ifstream WorkingFile("file.txt");
while(getline(WorkingFile, line))
{
++linecount;
}
return linecount;
}
int main()
{
string FileTable [FileLength()][6];
}
Firstly I don't know if the above code is correct or how I can add the values from my file into my FileTable array.
Once I have my FileTable array with all the file data in it, I then want to be able to use this in other functions.
I've been able to do:
if(FileTable[2][0] = 1)
{
cout << "The third name is: " << FileTable[2][3] << endl;
}
I understand my code may not make sense here but I hope it demonstrates what I'm attempting to do.
I have to do this for a larger text file and all the 6 variables per line relate to be input to a function.
Hold each line in its own object, this is much clearer:
struct Entry
{
std::array<std::string, 6> items; // or a vector
};
In main:
std::vector<Entry> file_table( FileLength() );
Note that it is a waste of time to read the whole file first in order to find the number of entries. You could just start with an empty vector, and push in each entry as you read it.
Your access code:
if( file_table.size() > 2 && file_table[2].items[0] == "1" )
{
cout << "The third name is: " << file_table[2].items[3] << endl;
}
I would actually recommend giving the members of Entry names, instead of just having an array of 6 of them. That would make your code more readable. (Unless you really need to iterate over them, in which case you can use an enum for the indices).
You could define an operator[] overload for Entry if you don't like the .items bit.
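Putting these suggestions together (named members, an initially empty vector, no separate line-counting pass), a reader might look roughly like the sketch below. The field names in Entry are guesses at what the six columns mean and are purely illustrative, as is the function name readEntries.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One line of the file. The member names are hypothetical.
struct Entry {
    int group;
    int id;
    char code;
    std::string name;
    int flagA;
    int flagB;
};

// Reads whitespace-separated records, one per line, until end of input.
// Lines that do not contain all six fields are skipped.
std::vector<Entry> readEntries(std::istream& in) {
    std::vector<Entry> entries;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Entry e;
        if (fields >> e.group >> e.id >> e.code >> e.name >> e.flagA >> e.flagB) {
            entries.push_back(e);
        }
    }
    return entries;
}
```

With a file, this would be called as readEntries(file) on an std::ifstream opened on "file.txt"; the third name is then entries[2].name, provided entries.size() > 2.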
Since the number of lines is dynamic, I suggest using a vector instead of an array. You can push your data back into the vector line by line until you reach EOF.
Also try to study OOP a little; it would make your code more understandable.
Take a look at these:
http://www.cplusplus.com/reference/vector/vector/
http://www.geeksforgeeks.org/eof-and-feof-in-c/

Convert any Unicode string to int

I have an arbitrary Unicode string that represents a number, such as "2", "٢" (U+0662, ARABIC-INDIC DIGIT TWO) or "Ⅱ" (U+2161, ROMAN NUMERAL TWO). I want to convert that string into an int. I don't care about specific locales (the input might not be in the current locale); if it's a valid number then it should get converted.
I tried QString.toInt and QLocale.toInt, but they don't seem to get the job done. Example:
bool ok;
int n;
QString s = QChar(0x0662); // ARABIC-INDIC DIGIT TWO
n = s.toInt(&ok); // n == 0; ok == false
QLocale anyLocale(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
n = anyLocale.toInt(s, &ok); // n == 0; ok == false
QLocale cLocale = QLocale::C;
n = cLocale.toInt(s, &ok); // n == 0; ok == false
QLocale arabicLocale = QLocale::Arabic; // Specific locale. I don't want that.
n = arabicLocale.toInt(s, &ok); // n == 2; ok == true
Is there a function I am missing?
I could try all locales:
QList<QLocale> allLocales = QLocale::matchingLocales(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
for(int i = 0; i < allLocales.size(); i++)
{
n = allLocales[i].toInt(s, &ok);
if(ok)
break;
}
But that feels slightly hackish. Also, it does not work for all strings (e.g. Roman numerals, but that's an acceptable limitation). Are there any pitfalls when doing it that way, such as conflicting rules in different locales (cf. Turkish vs. non-Turkish letter case rules)?
I'm not aware of any ready to use package which does this (but
maybe ICU supports it), but it isn't hard to do if you really
want to. First, you should download the UnicodeData.txt file
from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
This is an easy to parse ASCII file; the exact syntax is
described in http://www.unicode.org/reports/tr44/tr44-10.html,
but for your purposes, all you need to know is that each line in
the file consists of semi-colon separated fields. The first
field contains the character code in hex, the third field the
"general category", and if the third field is "Nd" (numeric,
decimal), the seventh field contains the decimal value.
This file can easily be parsed using Python or a number of other
scripting languages, to build a mapping table. You'll want some
sort of sparse representation, since there are over a million
Unicode characters, of which very few (a couple of hundred) are
decimal digits. The following Python script will give you a C++
table which can be used to initialize an
std::map<int, int>. If the character is
in the map, the mapped element is its value.
Whether this is sufficient or not depends on your application.
It has several weaknesses:
It requires extra logic to recognize when two successive
digits are in different alphabets. Presumably a sequence "1١"
should be treated as two numbers (1 and 1), rather than as one
(11). (Because all of the sets of decimal digits are in 10
successive codes, it would be fairly easy, once you know the
digit, to check whether the preceding digit character was in the
same set.)
It ignores non-decimal digits, like ௰ or ൱ (Tamil ten and
Malayalam one hundred). There aren't that many of them, and they are
also in the UnicodeData.txt file, so it might be possible to
find them manually and add them to the table. I don't know
myself, however, how they combine with other digits when numbers
have been composed.
If you're converting numbers, you might have to worry about
the direction. I'm not sure how this is handled (but there is
documentation at the Unicode site); in general, text will appear
in its natural order. In the case of Arabic and related
languages, when reading in the natural order, the low order
digits appear first: something like "١٢" (literally "12",
but because the writing is from right to left, the digits will
appear in the order "21") should be interpreted as 12, and not 21. Except that I'm not sure whether a change direction mark is
present or not. (The exact rules are described in the
documentation at the Unicode site; in the UnicodeData.txt file,
the fifth field—index 4—gives this information. I
think if it's anything but "AN", you can assume the big-endian
standard used in Europe, but I'm not sure.)
Just to show how simple this is, here's the Python script to
parse the UnicodeData.txt file for the digit values:
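print('std::pair<int, int> initUnicodeMap[] = {')
for line in open("UnicodeData.txt"):
    fields = line.split(';')
    # fields[2] is the general category; fields[6] (the seventh field)
    # holds the decimal digit value for 'Nd' characters
    if fields[2] == 'Nd':
        print('    {{{:d}, {:d}}},'.format(int(fields[0], 16), int(fields[6])))
print('};')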
If you're doing any work with Unicode, this file is a gold mine
for generating all sorts of useful tables.
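On the C++ side, a conversion routine built on such a table might look like the sketch below. It uses the property mentioned earlier that each set of decimal digits occupies 10 successive codes, so only the zero of each set is stored. The table here lists just four sets by hand for illustration; the names kZeroes, digitSetZero and parseDecimal are made up, the code assumes the most significant digit comes first in memory, and it does not check for overflow.

```cpp
#include <string>

// Zero code points of a few decimal-digit sets: ASCII, Arabic-Indic,
// Extended Arabic-Indic, Devanagari. A full table would be generated
// from UnicodeData.txt.
static const char32_t kZeroes[] = {0x0030, 0x0660, 0x06F0, 0x0966};

// Returns the zero code point of the digit set containing c,
// or 0 if c is not a digit in the table.
char32_t digitSetZero(char32_t c) {
    for (char32_t zero : kZeroes) {
        if (c >= zero && c <= zero + 9) return zero;
    }
    return 0;
}

// Converts s to an int if it consists only of digits from one single set.
// Returns false for empty input, non-digits, or digits from mixed sets.
bool parseDecimal(const std::u32string& s, int& out) {
    if (s.empty()) return false;
    const char32_t zero = digitSetZero(s[0]);
    if (zero == 0) return false;
    int value = 0;
    for (char32_t c : s) {
        if (digitSetZero(c) != zero) return false;
        value = value * 10 + static_cast<int>(c - zero);
    }
    out = value;
    return true;
}
```

Rejecting mixed sets is one way to handle the "1١" case; splitting the input into two numbers instead would be an equally valid choice.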
You can get the numeric equivalent of a Unicode character with the method QChar::digitValue:
int value = QChar::digitValue((uint)0x0662);
It will return -1 if the character is not a digit.
See the documentation if you need more help, I don't really know much about c++/qt
Chinese numerals mentioned in that wikipedia article belong to 0x4E00-0x9FCC. There is no useful metadata about individual characters in this range:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
So if you wish to map chinese numerals to integers, you must do that mapping yourself, simple as that.
Here's simple mapping of the symbols in the wikipedia article where a single symbol maps to some single number:
0x96f6,0x3007 = 0
0x58f9,0x4e00,0x5f0c = 1
0x8cb3,0x8d30,0x4e8c,0x5f0d,0x5169,0x4e24 = 2
0x53c3,0x53c1,0x4e09,0x5f0e,0x53c3,0x53c2,0x53c4,0x53c1 = 3
0x8086,0x56db,0x4989 = 4
0x4f0d,0x4e94 = 5
0x9678,0x9646,0x516d = 6
0x67d2,0x4e03 = 7
0x634c,0x516b = 8
0x7396,0x4e5d = 9
0x62fe,0x5341,0x4ec0 = 10
0x4f70,0x767e = 100
0x4edf,0x5343 = 1000
0x842c,0x842c,0x4e07 = 10000
0x5104,0x5104,0x4ebf = 100000000
0x5e7a = 1
0x5169,0x4e24 = 2
0x5440 = 10
0x5ff5,0x5eff = 20
0x5345 = 30
0x534c = 40
0x7695 = 200
0x6d1e = 0
0x5e7a = 1
0x4e24 = 2
0x5200 = 4
0x62d0 = 7
0x52fe = 9

C++: Design: should I use enum here?

What is the preferred and best way in C++ to do this: Split the letters of the alphabet into 7 groups so I can later ask if a char is in group 1, 3 or 4 etc... ? I can of course think of several ways of doing this myself but I want to know the standard and stick with it when doing this kinda stuff.
0: AEIOUHWY
1: BFPV
2: CGJKQSXZ
3: DT
4: MN
5: L
6: R
best way in C++ to do this: Split the letters of the alphabet into 7 groups so I can later ask if a char is in group 1, 3 or 4 etc... ?
The most efficient way to do the "split" itself is to have an array from letter/char to number.
// A B C D E F G H...
const char lookup[] = { 0, 1, 2, 3, 0, 1, 2, 0...
A switch/case statement's another reasonable choice - the compiler can decide itself whether to create an array implementation or some other approach.
It's unclear what use of those 0-6 values you plan to make, but an enum appears a reasonable encoding choice. That has the advantage of still supporting any use you might have for those specific numeric values (e.g. in < comparisons, streaming...) while being more human-readable and compiler-checked than "magic" numeric constants scattered throughout the code. Constant ints of any width are also likely to work fine, but won't have a unifying type.
Create a lookup table.
int lookup[26] = { 0, 1, 2, 3, 0, 1, 2, 0 .... whatever };
inline int getgroup(char c)
{
return lookup[tolower(c) - 'a'];
}
Call it this way:
char myc = 'M';
int grp = getgroup(myc);
Error checks omitted for brevity.
Of course, depending on what the 7 groups represent , you can make enums instead of using 0, 1, 2 etc.
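For reference, here is the lookup table filled in completely from the groups listed in the question, with a range check added. This is only a sketch: the names kGroup and getGroup are illustrative, and the code assumes an ASCII-compatible character set where A..Z are contiguous.

```cpp
#include <cctype>

// Group of each letter A..Z, taken from the table in the question.
static const int kGroup[26] = {
 // A  B  C  D  E  F  G  H  I  J  K  L  M
    0, 1, 2, 3, 0, 1, 2, 0, 0, 2, 2, 5, 4,
 // N  O  P  Q  R  S  T  U  V  W  X  Y  Z
    4, 0, 1, 2, 6, 2, 3, 0, 1, 0, 2, 0, 2
};

// Returns the group of c (case-insensitive), or -1 if c is not a letter A..Z.
int getGroup(char c) {
    const int upper = std::toupper(static_cast<unsigned char>(c));
    if (upper < 'A' || upper > 'Z') return -1;
    return kGroup[upper - 'A'];
}
```

Asking whether a char is in group 1, 3 or 4 then becomes a comparison of getGroup(c) against those values.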
Given the small amount of data involved, I'd probably do it as a bit-wise lookup -- i.e., set up values:
cat1 = 1;
cat2 = 2;
cat3 = 4;
cat4 = 8;
cat5 = 16;
cat6 = 32;
cat7 = 64;
Then just create an array of 26 values, one for each letter in the alphabet, with each containing the value of the category for that letter. When you want to classify a letter, you just index categories[ch-'A'] to find it.
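The benefit of the bit-wise encoding is that "is this char in group 1, 3 or 4?" becomes a single AND against a combined mask. A minimal sketch follows, with made-up names (kGroupDigits, groupBit, categoryOf, inAnyGroup) and the group assignments copied from the question; here group 0 maps to bit value 1, group 1 to 2, and so on.

```cpp
#include <cctype>

// Group digit for each letter A..Z, copied from the table in the question.
static const char kGroupDigits[] = "01230120022544012623010202";

// Bit for group g (0..6): group 0 -> 1, group 1 -> 2, group 2 -> 4, ...
unsigned groupBit(int g) { return 1u << g; }

// Bit of the group c belongs to, or 0 if c is not a letter A..Z or a..z.
unsigned categoryOf(char c) {
    const int upper = std::toupper(static_cast<unsigned char>(c));
    if (upper < 'A' || upper > 'Z') return 0;
    return groupBit(kGroupDigits[upper - 'A'] - '0');
}

// True if c belongs to any of the groups whose bits are set in mask.
bool inAnyGroup(char c, unsigned mask) {
    return (categoryOf(c) & mask) != 0;
}
```

For example, inAnyGroup(ch, groupBit(1) | groupBit(3) | groupBit(4)) answers the question from the original post in one test.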