struct.unpack for network byte order binary encoded numbers - python-2.7

I am totally new to Python. I have to parse a .txt file that contains network byte order binary encoded numbers (see here for the details on the data). I know that I have to use the struct.unpack function from Python's struct module. My questions are the following:
(1) Since I don't really understand how struct.unpack works, is it straightforward to parse the data? Looking at the data structure, it seems I would have to write code for each message type, but the online documentation for struct.unpack makes it look more straightforward; I'm just not sure how to write the code. A short sample would be appreciated.
(2) What's the best practice once I've parsed the data? I would like to save the parsed result so that I don't have to re-parse the file for every query. What format would be most efficient for the parsed data?

This should be relatively straightforward. I can't comment on how you're actually supposed to get the byte-encoded packets of information, but I can help you parse them.
First, here's a list of some of the packet types you'll be dealing with that I gathered from section 4 of the documentation:
TimeStamp
System Event Message
Stock Related Messages
Stock Directory
Stock Trading Action
Reg SHO Short Sale Price Test Restricted Indicator
Market Participant Position
Add Order Message
This continues on. But as an example, let's see how to decode one or two of these:
System Event Message
A System Event Message packet has 3 portions and is 6 bytes long:
A Message Type, which starts at byte 0, is 1 byte long, with a Value of S (a Single Character)
A TimeStamp, which starts at byte 1, is 4 bytes long, and should be interpreted as an Integer.
An Event Code, which starts at byte 5, is 1 byte long and is a String (Alpha).
Looking up each type in the struct format character table, we'll need to build a string to represent this sequence. First, we have a Character, then a 4-byte Unsigned Integer, then another Character. Because the data is in network byte order, we also prefix the format string with '!', which selects big-endian byte order and disables native padding (without it, most platforms would pad the struct past 6 bytes). This corresponds to the encoding and decoding string of "!cIc".
*NOTE: The unsigned portion of the Integer is documented in Section 3: Data Types of their documentation
Construct a fake packet
This could probably be done better, but it's functional:
>>> import struct
>>> from datetime import datetime
>>> import time
>>> data = struct.pack('!cIc', 'S', int(time.mktime(datetime.now().timetuple())), 'O')
>>> print repr(data) # What does the bytestring look like?
'SR\x8dn\xa6O' # Yep, that's bytes alright! Exactly 6 of them.
Unpack the data
In this example, we'll use the fake packet above, but in the real world we'd use a real data response:
>>> response_tuple = struct.unpack('!cIc', data)
>>> print repr(response_tuple)
('S', 1385000614, 'O')
In this case, the 3rd item in the tuple (the 'O') is a key, to be looked up in two other tables called System Event Codes - Daily and System Event Codes - As Needed.
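For instance, a minimal sketch of that lookup in Python (the codes shown are abridged from the Daily table; consult the spec for the complete set):
SYSTEM_EVENT_CODES = {
    'O': 'Start of Messages',
    'S': 'Start of System hours',
    'Q': 'Start of Market hours',
    'M': 'End of Market hours',
    'E': 'End of System hours',
    'C': 'End of Messages',
}

msg_type, timestamp, event_code = struct.unpack('!cIc', data)
print SYSTEM_EVENT_CODES.get(event_code, 'Unknown event code')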
If you need additional examples, feel free to ask, but that's the gist of it.
Recommendations on how to store this data: well, I suppose that depends on what you'd like to do with it long term. Probably, a database makes sense here. However, without further information, I cannot say.
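If a database does fit, a minimal sketch with Python's built-in sqlite3 module might look like this (the table layout is just an illustration, not something from the spec):
import sqlite3

conn = sqlite3.connect('messages.db')
conn.execute("""CREATE TABLE IF NOT EXISTS system_events
                (msg_type TEXT, timestamp INTEGER, event_code TEXT)""")
# Insert one parsed message; in practice you'd loop over the whole file
# and use executemany() for speed.
conn.execute("INSERT INTO system_events VALUES (?, ?, ?)",
             struct.unpack('!cIc', data))
conn.commit()
Each later query then reads the database instead of re-parsing the raw file.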
Hope that helps!

Related

How to extract data of an unknown length from a string with random data

This question has been stumping me for quite a while now. I have a binary file, within which are the contents of an SNMP trap sent by a server. I am able to read the file and output its data as a UTF-8 encoded string:
'0‚Œcommunity¤‚} +…"Õ#À¨V ÁC8o[0‚W0+…"ÕCPU00020A+…"Õ/Message Detailing The Event Triggering The SNMP Trap (BIST).0+…"Õ0+…"Õ7K882L30+…"Õ server-name0+…"Õ0+…"ÕN/A0+…"Õ"1"0+…"Õ 7K882L30%+…"Õ Main System Chassis0&+…"ÕiDRAC-server.example.com'
The issue is, I'm trying to extract certain elements of this data, such as community and server-name, and store them in variables. However, the program needs to work with lots of different SNMP traps whose values may differ in length and content, and since I don't know where the data will be located within the string, I can't come up with code that reliably sorts the data into the specific variables. The only thing I have to go on is that I know the sequential order of the data.
I'm not sure where to begin with approaching this problem; many thanks for any help.
To clarify, the data is currently stored in a variable: char buffer[1025] = DATA;

Most efficient way to use AWS SQS (with Golang)

When using AWS SQS (Simple Queue Service) you pay for each request you make to the service (push, pull, ...). There is a maximum of 256 KB for each message you can send to a queue.
To save money, I'd like to buffer messages sent to my Go application before I send them out to SQS, until I have enough data to make efficient use of the 256 KB limit.
Since my Go application is a webserver, my current idea is to use a mutex-guarded string buffer and append messages until the next one would exceed the 256 KB limit, then issue the SQS push. To save even more space I could gzip every single message before appending it to the buffer.
I wonder if there is some kind of gzip stream that I could use for this. My assumption is that gzipping all concatenated messages together will result in a smaller size than gzipping each message before appending it. One way would be to gzip the buffer after every append to validate its size, but that might be very slow.
Is there a better way? Or is there an altogether better approach involving channels? I'm still new to Go, I have to admit.
I'd take the following approach:
Use a channel to accept incoming "internal" messages to a goroutine
In that goroutine, keep the messages in a "raw" format, so 10 messages is 10 raw, uncompressed items
Each time a new raw item arrives, compress all the raw messages into one. If the size including the new message exceeds 256 KB, compress all messages EXCEPT the last one and push to SQS
This is computationally expensive: each individual message causes a full compression of all pending messages. However, it is efficient for SQS use.
You could guesstimate the size of the gzipped messages and calculate whether you've reached the maximum size threshold: keep a message-size counter and, for every new message, increment the counter by its expected compressed size. Do the actual compression and send to SQS only if the counter would exceed 256 KB. That way you avoid compressing every time a new message comes in.
For a use case like this, running a few tests on a sample set of messages should give the rough percentage of compression to expect.
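A rough sketch of that bookkeeping, shown in Python for brevity since the idea is language-agnostic (COMPRESSION_RATIO is a value you would measure on your own sample messages, and push_to_sqs stands in for the real SQS call):
import zlib

MAX_BYTES = 256 * 1024
COMPRESSION_RATIO = 0.35  # assumed here; measure this on your own data

pending = []
estimated = 0

def add_message(msg, push_to_sqs):
    global estimated
    if estimated + len(msg) * COMPRESSION_RATIO > MAX_BYTES:
        # Compress once, only when the batch is (probably) full. The real
        # code would also base64 the payload, since SQS restricts which
        # characters a message may contain (see the next answer).
        push_to_sqs(zlib.compress("".join(pending)))
        del pending[:]
        estimated = 0
    pending.append(msg)
    estimated += len(msg) * COMPRESSION_RATIO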
Before you get focused on compression, eliminate redundant data that is known on both sides. This is what encodings like msgpack, protobuf, AVRO, and so on do.
Let's say all of your messages are a struct like this:
type Foo struct {
bar string
qux int
}
and you were thinking of encoding it as JSON. Then the most efficient you could do is:
{"bar":"whatever","qux",123}
If you wanted to just append all of those together in memory, you might get something like this:
{"bar":"whatever","qux",123}{"bar":"hello world","qux",0}{"bar":"answer to life, the universe, and everything","qux",42}{"bar":"nice","qux",69}
A really good compression algorithm might look at hundreds of those messages and identify the repetitiveness of {"bar":" and ","qux":.
But compression has to do work to figure that out from your data each time.
If the receiving code already knows what "schema" (the {"bar": some_string, "qux": some_int} "shape" of your data) each message has, then you can just serialize the messages like this:
"whatever"123"hello world"0"answer to life, the universe, and everything"42"nice"69
Note that in this example encoding, you can't just start in the middle of the data and unambiguously find your place. If you have a bunch of messages such as {"bar":"1","qux":2}, {"bar":"2","qux":3}, {"bar":"3","qux":4}, then the encoding will produce: "1"2"2"3"3"4, and you can't just start in the middle and know for sure if you're looking at a number or a string - you have to count from the ends. Whether or not this matters will depend on your use case.
You can come up with other simple schemes that are less ambiguous, or that make the code for writing or reading messages easier or simpler, such as using a field-separator or message-separator character which is escaped in your encoding of the other data (just like how \ and " are escaped in quoted JSON strings).
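As a concrete illustration, here is a minimal sketch of such a hardcoded-schema encoding in Python (the (bar, qux) layout is invented for this example):
import json

# Both sides know the schema: every message is (bar: string, qux: int).
# JSON-encoding just the string keeps it quoted and escaped, so a string
# can never be confused with the number that follows it.
def encode(messages):
    return "".join(json.dumps(bar) + str(qux) for bar, qux in messages)

def decode(data):
    decoder = json.JSONDecoder()
    messages, pos = [], 0
    while pos < len(data):
        bar, pos = decoder.raw_decode(data, pos)  # read the quoted string
        end = pos
        while end < len(data) and data[end] != '"':
            end += 1  # the digits run until the next opening quote
        messages.append((bar, int(data[pos:end])))
        pos = end
    return messages
Here encode([("whatever", 123), ("hello world", 0)]) produces "whatever"123"hello world"0, and decode reverses it by always reading from the front.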
If you can't have the receiver just know/hardcode the expected message schema - if you need the full flexibility of something like JSON and you always unmarshal into something like a map[string]interface{} or whatever - then you should consider using something like BSON.
Of course, you can't use msgpack, protobuf, AVRO, or BSON directly - they need a medium that allows arbitrary bytes like 0x0. And according to the AWS SQS FAQ:
Q: What kind of data can I include in a message?
Amazon SQS messages can contain up to 256 KB of text data, including XML, JSON and unformatted text. The following Unicode characters are accepted:
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
So if you want maximum space efficiency for your exact use case, you'd have to write your own code, which uses the techniques from those encoding schemes but only emits bytes that are allowed in SQS messages.
Relatedly, if you have a lot of integers, and you know most of them are small (or clump around a certain spot of the number line, so that by adding a constant offset to all of them you can make most of them small), you can use one of the variable-length quantity techniques to encode all of those integers. In fact, several of the common encoding schemes mentioned above use variable-length quantities in their encoding of integers. If you use a "piece size" of six bits (instead of the standard implicitly assumed piece size of eight bits to match a byte) then you can use base64: not full base64 encoding, because the padding would completely defeat the purpose, just a mapping from the 64 possible values that fit in six bits to the 64 distinct ASCII characters that base64 uses.
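A minimal sketch of that six-bit variable-length scheme in Python (the exact bit layout is an assumption for illustration: one of the six bits says "more pieces follow", leaving five data bits per character):
import string

# The 64 six-bit values map onto the standard base64 alphabet, so the
# output is plain ASCII and safe inside an SQS message body.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_uint(n):
    pieces = []
    while True:
        piece = n & 0x1F          # low five data bits
        n >>= 5
        if n:
            piece |= 0x20         # continuation bit: more pieces follow
        pieces.append(ALPHABET[piece])
        if not n:
            return "".join(pieces)

def decode_uint(s):
    # Returns (value, number of characters consumed).
    n, shift = 0, 0
    for i, ch in enumerate(s):
        piece = ALPHABET.index(ch)
        n |= (piece & 0x1F) << shift
        shift += 5
        if not piece & 0x20:
            return n, i + 1
    raise ValueError("truncated varint")
Small numbers come out as a single ASCII character instead of several decimal digits, which is where the savings come from.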
Anyway, unless you know your data has a lot of repetition (but not the kind you can simply avoid sending, like the same field names in every message), I would start with all of that, and only then would I look at compression.
Even so, if you want minimal size I would aim for LZMA, and if you want minimal computing overhead I would use LZ4. Gzip is not bad per se (if it's much easier to use gzip, then just use it), but if you're optimizing for either size or speed, there are better options. I don't know if gzip is even a good middle ground of speed, output size, and working-memory size: it's pretty old, and by now there may be compression algorithms that are strictly superior on all three. Gzip, depending on the implementation, also includes headers and framing information (version metadata, sizes, checksums, and so on), which you probably don't want if you really need to minimize size, and probably don't need in the context of SQS messages.

How can I send a list using MQTT

import random
from time import strftime, gmtime

d = random.randint(1, 30)
data = [d, strftime("%Y%m%d %H%M%S", gmtime())] # random num, system time
client.publish("gas", str(data))
This is part of my Python code, which is version 2.
I'm trying to send a list using MQTT.
However, if I write bytearray instead of str on the third line,
it says "ValueError: string must be of size 1".
So I wrote str to make it a string.
Can I send just a list, one which is NOT a string?
MQTT message payloads are just byte arrays; there is no inherent format to them. Strings tend to work as long as both ends of the transaction are using the same character encoding.
If you want to send structured data (such as the list) then you need to decide on a way to encode that structure so the code receiving the message will know how to reconstruct it.
The current usual solution to this problem is to encode structures as JSON, but XML or something like protocol buffers are also good candidates.
The following question has some examples of converting Python lists to JSON objects
Serializing list to JSON
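For example, a minimal sketch (assuming a paho-mqtt client object like the one in the question):
import json

data = [d, strftime("%Y%m%d %H%M%S", gmtime())]
client.publish("gas", json.dumps(data))  # the list travels as a JSON string

# On the receiving side, inside the on_message callback:
#     data = json.loads(msg.payload)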

File Binary vs Text

Are there situations where I should prefer a binary file to a text file? I'm using C++ as the programming language.
For example, if I have to store a large amount of text, is it better to use a text file or a binary file?
Edit
For the moment the file has no requirement to be readable by humans. Are there performance differences, security differences, and so on?
Edit
Sorry for omitting the other requirements (thanks to Carey Gregory):
The records to save are in ASCII encoding
The file must be encrypted (AES)
The machine can power off at any time, so I have to try to prevent errors
I have to know if the file changes outside the program; I think I'll use a SHA1 digest of the file
As a general rule, define a text format, and use it. It's much easier to develop and debug, and it's much easier to see what is going wrong if it doesn't work.
If you find that the files are becoming too big, or taking too much time to transfer over the wire, consider compressing them. A compressed text file is often smaller than what you can do with binary. Or consider a less verbose text format; it's possible to reliably transmit a text representation of your data with far fewer characters than XML uses.
And finally, if you do end up having to use binary, try to choose an existing format (e.g. Google's protocol buffers), or base your format on an existing format. Just remember that:
Binary is a lot more work than text, since you practically have to write all of the << operators again, including those in the standard library.
Binary is a lot more difficult to debug, because you can't easily see what you've actually done.
Concerning your last edit:
Once you've encrypted, the results will be binary. You can use a text representation of the binary (base64 or some such), but the results won't be any more readable than the binary, so it's not worth the bother. If you're encrypting in process, before writing to disk, you automatically lose all of the advantages of text.
The issues concerning powering off mean that you cannot use ofstream directly. You must open or create the file with the necessary options for full transactional integrity (O_SYNC as a flag to open under Unix). You must write each record as a single write request to the system.
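The idea looks like this in Python, which exposes the same Unix flag (only a sketch; in C++ you would pass O_SYNC to the open() system call yourself, and record_bytes stands in for one fully serialized record):
import os

# With O_SYNC, each write() returns only after the data has reached the disk.
fd = os.open("records.dat", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
os.write(fd, record_bytes)  # one complete record per write() call
os.close(fd)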
It's always a good idea to have a checksum, just in case. If you're worried about security, SHA1 is a good choice. But keep in mind that if someone has access to the file and wants to intentionally change it, they can recalculate the SHA1 and insert the new value as well.
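Computing such a digest takes only a few lines, e.g. with Python's hashlib (shown for illustration; any SHA1 library in C++ works the same way):
import hashlib

def file_sha1(path):
    # Hash the file in chunks so large files need not fit in memory.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()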
All files are binary; the data within them is a binary representation of some information. If you have to store a large amount of text then the file will contain the binary representation of that text. The difference between a "binary file" and a "text file" is that creating the latter involves converting data to a text form before saving it. This is typically done so humans can read it.
The distinction between binary and text is usually made when storing data that is for computer consumption. Typically this data would not be text - it might be a list of numerical configuration values, for example: 1, 2, 3.
If you stored this in text format, your file could contain a list of human-readable numbers, and if you opened the file in Notepad you might see one number per line. But what you're actually saving here is not the binary values 1, 2, 3; you're saving the string "1\n2\n3\n". Note that this string is 6 characters long, and the bytes on disk (assuming ASCII) would actually be 49, 10, 50, 10, 51, 10!
If the same data were stored in binary format, you would store the numbers in the smallest useful space, and write the file as individual bytes that can often only be read by the code that created them. Opening this file in Notepad would likely display junk characters, because the data makes no sense as text. In this case you would be saving a byte array with actual values { 1, 2, 3 } - or even a single byte with the three values embedded. This could be much smaller than the human-readable equivalent.
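You can see the difference directly; a quick sketch in Python (the question is about C++, but the byte counts are the same):
import struct

values = [1, 2, 3]
text_form = "".join("%d\n" % v for v in values)    # the string "1\n2\n3\n"
binary_form = struct.pack("bbb", *values)          # three raw bytes

print len(text_form), [ord(c) for c in text_form]  # 6 [49, 10, 50, 10, 51, 10]
print len(binary_form)                             # 3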
Binary files store a sequence of bytes, like all other files. You can store numeric values such as integers as 4 bytes each, characters as single bytes, or even serialized class objects and anything you want.
When you know how to read a binary file (i.e. you know what is stored in it), you can extract all the information from it. Text files, on the other hand, use text encodings like UTF-8 or ANSI and are intended to encode text characters for processing by text editors.
Binary files are for machines only to interpret, whereas a text file, a human can also open and interpret its content.
So it depends whether you want your file to be readable by a human or not.
It depends on a lot of factors. I can think of two right now:
Do you require the file to be readable by humans?
Is compression a factor? A 10-digit number will take at least 10 bytes as text, but might take as few as four or even two bytes in binary.
All data is binary, and you always need a machine to interpret it for you. Even if the data is encoded with something like protocol buffers, Avro, or Thrift, it is binary, and if it is not, it is still binary. If you want to read protocol buffers in Notepad, there is a two-step process: decode, then read. In the case of text, the decoding step is not needed. The same goes for encrypted data: first decrypt, then read. Humans cannot read raw binary (as some commenters are mentioning); we still need a program like Notepad to interpret and display it as so-called text.
All data stored in a text file consists of human-readable graphic characters, and each line of data ends with a newline character.
In the case of a binary file, data is stored in the same format as it is held in memory. There are no lines or newline characters; there is an end-of-file marker.
Moreover, binary files are more space-efficient, since values are stored in their raw binary form.

How can I obfuscate/de-obfuscate integer properties?

My users will in some cases be able to view a web version of a database table that stores data they've entered. For various reasons I need to include all the stored data, including a number of integer flags for each record that encapsulate adjacencies and so forth within the data (this is for speed and convenience at runtime). But rather than exposing them one-for-one in the webview, I'd like to have an obfuscated field that's just called "reserved" and contains a single unintelligible string representing those flags that I can easily encode and decode.
How can I do this efficiently in C++/Objective C?
Thanks!
Is it necessary that this field is exposed to the user visually, or just that it's losslessly captured in the HTML content of the webview? If possible, can you include the flags as a hidden input element with each row, i.e., <input type="hidden" …?
Why not convert each of the fields to hex, and append them as a string and save that value?
As long as you always append the strings in the same order, breaking them back apart and converting them back to numbers should be trivial.
Use symmetric encryption (example) to encode and decode the values. Of course, only you should know the key.
Alternatively, asymmetric RSA is more powerful encryption but is less efficient and more complex to use.
Note: I am curious about the "various reasons" that require this design...
Multiply your flag integer by 7, add 3, and convert to base-36. To check whether the resulting string was modified, convert back to an integer and check that the result modulo 7 is still 3. If so, subtract 3 and divide by 7 to get the flags. Note that this is subject to replay attacks: users can copy in any valid string.
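A sketch of that scheme (in Python for brevity; the arithmetic carries over directly to C++):
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def obfuscate(flags):
    # flags is a non-negative integer bit-field
    n = flags * 7 + 3
    s = ""
    while n:
        s = DIGITS[n % 36] + s
        n //= 36
    return s

def deobfuscate(s):
    n = int(s, 36)  # int() parses base-36 directly
    if n % 7 != 3:
        raise ValueError("reserved field fails the mod-7 check")
    return (n - 3) // 7
Tampering with a single character almost always breaks the mod-7 check, but as noted above, copying an entire valid string will pass it.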
Just calculate a CRC-32 (or similar) and append it to your value. That will tell you, with a very high probability, if your value has been corrupted.
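A minimal sketch of that approach (Python's zlib shown for illustration; any CRC library in C++ does the same job):
import zlib

def protect(value):
    text = str(value)
    crc = zlib.crc32(text) & 0xffffffff
    return "%s-%08x" % (text, crc)

def recover(s):
    text, crc = s.rsplit("-", 1)
    if zlib.crc32(text) & 0xffffffff != int(crc, 16):
        raise ValueError("value corrupted")
    return int(text)
Like the base-36 scheme above, this detects accidental corruption but not deliberate modification, since anyone can recompute the CRC.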