Ruby code sample:
"\u0000\u0000\u0000\u0002".unpack('N')[0] #=> 2
How can I do this in Crystal?
You can use the IO#read_bytes method to read integers from any IO. For example:
io = IO::Memory.new("\u0000\u0000\u0000\u0002")
io.read_bytes(UInt32, format: IO::ByteFormat::NetworkEndian) # => 2
I would advise against using strings to store binary data, though; reading directly from an IO, or storing the data as Bytes, is much more idiomatic Crystal.
I've recently started a project on sequence alignment, and have been using the standalone software program 'MAFFT' to align some BRCA1 sequences (via the command line).
I need to create a function which will produce all the pairwise combinations from the aln files that I have produced using MAFFT. However, I've been told that hard-coding the alignments is not good practice, e.g. first_alignment=align[0], second_alignment=align[1].
I've parsed the aln files using this code:
from Bio import AlignIO
align = AlignIO.read(aln_1, "clustal")
print(align)
align = AlignIO.read(aln_2, "clustal")
print(align)
Now I need a function to produce all the possible pairwise combinations, so that I can compare sequences and look for transitions/transversions, indels/substitutions, SNPs, or any biological significance.
Would appreciate the help!
Thanks.
You could loop through each of the alignments as shown on the official Biopython site:
for record in alignment:
    print(record.seq + " " + record.id)
To get records from both alignments you can add an inner loop, like:
align1 = AlignIO.read(aln_1, "clustal")
align2 = AlignIO.read(aln_2, "clustal")
for record1 in align1:
    for record2 in align2:
        # do comparison here
For comparing the sequences themselves, I think the pairwise2 module has what you are looking for.
An example from the page is:
from Bio import pairwise2
alignments = pairwise2.align.globalxx("ACCGT", "ACG")
You can adapt that and, inside the double loop, pass the two records' sequences as arguments to globalxx, like:
alignments = pairwise2.align.globalxx(record1.seq, record2.seq)
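Putting it all together, a minimal sketch (assuming the two alignments were read into align1 and align2 as above) could look like this:
from Bio import AlignIO, pairwise2

align1 = AlignIO.read(aln_1, "clustal")
align2 = AlignIO.read(aln_2, "clustal")

for record1 in align1:
    for record2 in align2:
        # globalxx scores matches as 1 with no mismatch or gap penalties
        alignments = pairwise2.align.globalxx(record1.seq, record2.seq)
        # print the best-scoring alignment in a readable form
        print(pairwise2.format_alignment(*alignments[0]))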
Hope that is what you are looking for! :)
When I use
df.repartition(100).write.mode('overwrite').json(output_path)
Spark will write 100 json files under the same path specified by 'output_path'. Is it possible to write partitions into different sub-directories? For example, the first 10 partitions written into 'output_path/01/', and the second 10 partitions written into 'output_path/02', and so on?
It is not restricted to this scheme. I just need to avoid writing all output data into the same path; I need to partition the dataframe and write them into different subfolders.
The motivation for this question is that I am using AWS S3, and whenever I write all data under the same path, it gives me a 'SLOW DOWN' error. I am informed that the write rate limit is "prefix based", that is, if I write all data into
s3://someurl/
Then I will get a SLOW DOWN error. Instead, I need to write some data into s3://someurl/01/, some into s3://someurl/02/, some into s3://someurl/03/, ... I need help with how to achieve this.
Of course, one way to solve this is to use where to manually separate the data, but I hope there is some built-in mechanism that solves this more elegantly. Thanks!
You could add a dummy partition column like
from pyspark.sql import functions as F
df = df.withColumn("dummy", F.floor(F.rand() * 10))
df.write.partitionBy("dummy").mode('overwrite').json(output_path)
This will generate the following paths:
s3://someurl/dummy=0/
s3://someurl/dummy=1/
s3://someurl/dummy=2/
...
On the downside, you'll have an extra dummy column when you read the data back.
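If that extra column is a problem, a small workaround (a sketch, assuming you read the output back with the same SparkSession) is to drop it right after reading:
df_read = spark.read.json(output_path).drop("dummy")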
I have some data to feed to a C/C++ program, and I could easily convert it to CSV format. However, I would need a couple of extensions to the CSV standard, or at least to the parts of it I know about.
The data is heterogeneous: there are different parameters of different sizes. They could be single values, vectors, or multidimensional arrays. My ideal format would be like this one:
--+ Size1
2
--+ Size2
4
--+Table1
1;2;3;4
5;6;7;8
--+Table2
1;2
"--+" is some sort of separator. I have two 1-valued parameters named symbolically Size1 and Size2 and two other multidimensional parameters Table1 and Table2. In this case the dimensions of Table1 and Table2 are given by the other two parameters.
Also rows and columns could be named, i.e. there could be a table like
--+Table3
A;B
X;1;2
Y;4;5
Where element ("A","X") is 1 and ("B","X") is 2 and so forth.
In other words, it's like a series of concatenated CSV files with names for the tables, rows, and columns.
The parser should be able to exploit the structure of the file, allowing me to write code like this:
parse(my_parser,"Size1",&foo->S1); // read the Size1 value and store it in foo->S1
parse(my_parser,"Size2",&foo->S2); // read the Size2 value and store it in foo->S2
foo->T2=malloc(sizeof(int)*(foo->S1));
parse(my_parser,"Table2",foo->T2); // read Table2
If it were able to store row and column names, that would be a bonus.
I don't think it would take much to write such a library, but I have more important things to do ATM.
Is there an already defined format like this one? With open-source libraries for C++? Do you have other suggestions for my problem?
Thanks in advance.
A.
I would use JSON, which Boost will readily handle. A scalar is a simple case of an array
[ 2 ]
The array is easy
[ 1, 2]
Multidimensional
[ [1,2,3,4], [5,6,7,8] ]
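For example, the whole file from your question could become a single JSON object keyed by parameter name (just a sketch of one possible layout):
{
  "Size1": 2,
  "Size2": 4,
  "Table1": [[1, 2, 3, 4], [5, 6, 7, 8]],
  "Table2": [1, 2]
}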
It's been a while since I've done this sort of thing, so I'm not sure exactly how the code will break down for you, but by expanding on this you could definitely add row/column names. The code will be very nice, perhaps not quite as brainless as in Python, but it should be simple.
Here's a link for the JSON format: http://json.org
Here's a stackoverflow link for reading JSON with boost: Reading json file with boost
A good option could be YAML.
It's a well-known, human-friendly data serialization standard for programming languages.
It fits your needs quite well: YAML syntax is designed to map easily to the data types common to most high-level languages: vectors, associative arrays, and scalars:
Size1: 123
---
Table1: [[1.0,2.0,3.0,4.0], [5.0,6.0,7.0,8.0]]
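The named-rows/columns case (Table3 in your question) could also be expressed with an explicit mapping, for instance (just a sketch of one possible layout):
Table3:
  columns: [A, B]
  rows: [X, Y]
  data: [[1, 2], [4, 5]]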
There are good libraries for C, C++ and many other languages.
To get a feel for how it can be used see the C++ tutorial.
For interoperability you could also consider the way OpenCV uses YAML format:
%YAML:1.0
frameCount: 5
calibrationDate: "Fri Jun 17 14:09:29 2011\n"
cameraMatrix: !!opencv-matrix
   rows: 3
   cols: 3
   dt: d
   data: [ 1000., 0., 320., 0., 1000., 240., 0., 0., 1. ]
Since JSON and YAML have many similarities, you could also take a look at: What is the difference between YAML and JSON? When to prefer one over the other
Thanks everyone for the suggestions.
The data is primarily numeric, with lots of dimensions, and given its size it could be slow to parse with those text formats. I found that the quickest and cleanest way, for now, is to use a database.
I still think it may be overkill, but there are no clearly better alternatives right now, IMHO.
I am totally new to Python. I have to parse a .txt file that contains numbers binary-encoded in network byte order (see here for the details on the data). I know that I have to use struct.unpack in Python. My questions are the following:
(1) Since I don't really understand how struct.unpack works, is it straightforward to parse the data? By that, I mean that if you look at the data structure it seems that I have to write code for each type of message. But the online documentation for struct.unpack makes it look more straightforward, and I am not sure how to write the code. A short sample would be appreciated.
(2) What's the best practice once I parse the data? I would like to save the parsed data in order to avoid parsing the file each time I need to make a query. In what format should I keep the parsed data so that queries are most efficient?
This should be relatively straightforward. I can't comment on how you're actually supposed to get the byte-encoded packets of information, but I can help you parse them.
First, here's a list of some of the packet types you'll be dealing with that I gathered from section 4 of the documentation:
TimeStamp
System Event Message
Stock Related Messages
Stock Directory
Stock Trading Action
Reg SHO Short Sale Price Test Restricted Indicator
Market Participant Position
Add Order Message
This continues on. But as an example, let's see how to decode one or two of these:
System Event Message
A System Event Message packet has 3 portions and is 6 bytes long in total:
A Message Type, which starts at byte 0, is 1 byte long, with a Value of S (a Single Character)
A TimeStamp, which starts at byte 1, is 4 bytes long, and should be interpreted as an Integer.
An Event Code, which starts at byte 5, is 1 byte long and is a String (Alpha).
Looking up each type in the struct format character table, we'll need to build a string to represent this sequence. First, we have a Character, then a 4-Byte Unsigned Integer, then another Character. Since the data is in network byte order (big-endian, with no padding), we prefix the format with '!', which gives the encoding and decoding string "!cIc".
*NOTE: The unsigned portion of the Integer is documented in Section 3: Data Types of their documentation
Construct a fake packet
This could probably be done better, but it's functional:
>>> import struct
>>> from datetime import datetime
>>> import time
>>> data = struct.pack('!cIc', 'S', int(time.mktime(datetime.now().timetuple())), 'O')
>>> print repr(data) # What does the bytestring look like?
'SR\x8dn\xa6O' # Yep, that's bytes alright!
Unpack the data
In this example, we'll use the fake packet above, but in the real world we'd use a real data response:
>>> response_tuple = struct.unpack('!cIc', data)
>>> print(repr(response_tuple))
('S', 1385000614, 'O')
In this case, the 3rd item in the tuple (the 'O') is a key, to be looked up in other tables called System Event Codes - Daily and System Event Codes - As Needed.
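Since every packet starts with its one-byte Message Type, one common pattern is to peek at that byte and dispatch to the matching format string. Here's a sketch (the table entries other than 'S' are placeholders you'd fill in from the spec):
import struct

# Hypothetical table mapping message type -> struct format for the rest of the packet
MESSAGE_FORMATS = {
    b'S': '!Ic',    # System Event Message: timestamp + event code
    # b'R': '!...', # Stock Directory, etc. -- fill in from the spec
}

def parse_message(data):
    msg_type = data[:1]   # the first byte selects the message type
    fields = struct.unpack(MESSAGE_FORMATS[msg_type], data[1:])
    return msg_type, fields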
If you need additional examples, feel free to ask, but that's the gist of it.
As for recommendations on how to store this data: well, I suppose that depends on what you'd like to do with it long term. Probably, a database makes sense here. However, without further information, I cannot say.
Hope that helps!
I would like to use Apriori to carry out affinity analysis on transaction data. I have a table with a list of orders and their information. I mainly need to use the OrderID and ProductID attributes, which are in the following format:
OrderID ProductID
1 A
1 B
1 C
2 A
2 C
3 A
Weka requires you to create a nominal attribute for every product ID and to specify whether the item is present in the order using a TRUE or FALSE value, like this:
1, TRUE, TRUE, TRUE
2, TRUE, FALSE, TRUE
3, TRUE, FALSE, FALSE
My dataset contains about 10k records and about 3k different products. Can anyone suggest a way to create the dataset in this format? (Besides a manual, time-consuming way...)
How about writing a script to convert it?
Should be less than 10 lines in a good scripting language such as Python.
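For instance, here's a rough sketch in Python (assuming the input is a headerless CSV file with OrderID,ProductID rows; adjust the file names and parsing to your actual export):
import csv
from collections import defaultdict

orders = defaultdict(set)   # OrderID -> set of ProductIDs in that order
products = set()

with open("orders.csv") as f:             # hypothetical input file
    for order_id, product_id in csv.reader(f):
        orders[order_id].add(product_id)
        products.add(product_id)

products = sorted(products)
with open("basket.csv", "w") as out:      # one TRUE/FALSE column per product
    out.write("OrderID," + ",".join(products) + "\n")
    for order_id, items in sorted(orders.items()):
        flags = ["TRUE" if p in items else "FALSE" for p in products]
        out.write(order_id + "," + ",".join(flags) + "\n")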
Or you may look into options for pivoting the relation as desired.
Either way, it is a straightforward programming task, so I don't really see a question here.
You obviously need to convert your data. Easiest way: write a small program, in the programming language you are most familiar with, that reads the file and then writes it back out in the appropriate format. Since these are text files, it should not be too complicated.
By the way, if you want more algorithms for pattern mining and association mining than just Apriori in Weka, you could check my software SPMF ( http://www.philippe-fournier-viger.com/spmf/ ), which is also in Java, can read ARFF files too, and offers about 50 algorithms specialized in pattern mining (Apriori, FPGrowth, and many others).
Your data is formatted correctly as-is for use in R with the arules package (and its apriori function). You might consider checking it out, especially if you're not able to get into script coding.