Is Word2Vec only for words in a sentence, or for features as well?

I would like to ask more about Word2Vec:
I am currently trying to build a program that computes the embedding vectors for a sentence. At the same time, I am also building a feature-extraction step using scikit-learn to extract lemma 0, lemma 1, and lemma 2 from the sentence.
From my understanding:
1) Feature extractions : Lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded for each character (this can be achieved by using gensim's word2vec; I have tried it)
More explanation:
Sentence = "I have a pen".
Word = token of the sentence, for example, "have"
1) Feature extraction
"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:
[[0,0,1],
[1,0,0],
[0,1,0]]
2) Word embedding(Word2vec)
"I have a pen" ---> "I", "have", "a", "pen"(tokenized) then word2vec from gensim will produced matrices for example if using window_size = 2 produced:
[[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345]
]
The floating-point and integer numbers are for explanation purposes only; the original data will vary depending on the sentence. These are just dummy values to explain the idea.
Questions:
1) Is my understanding of Word2Vec correct? If yes, what is the difference between feature extraction and word2vec?
2) I am curious whether I can use word2vec to get the feature-extraction embedding too, since from my understanding word2vec only finds an embedding for each word, not for the features.
Hopefully someone can help me with this.

It's not completely clear what you're asking, as you seem to have several concepts mixed up together. (Word2Vec gives vectors per word, not per character; word embeddings are a kind of feature extraction on words, rather than an alternative to 'feature extraction'; etc. So: I doubt your understanding is correct yet.)
"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.
One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...
['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']
...then you have 7 unique case-flattened words...
['a', 'pen', 'will', 'need', 'ink', 'i', 'have']
...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:
[1, 1, 1, 1, 1, 0, 0] # A pen will need ink
[1, 1, 0, 0, 0, 1, 1] # I have a pen
Even with this simple encoding, you can now compare sentences mathematically: a euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', and those with many shared words will have a small 'distance'.
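For concreteness, here is a minimal sketch of that one-hot encoding and distance comparison, using numpy and scipy (the vocabulary order is the one above; the function and variable names are just for illustration):

import numpy as np
from scipy.spatial.distance import cosine, euclidean

# The 7-word vocabulary in the order given above
vocab = ['a', 'pen', 'will', 'need', 'ink', 'i', 'have']

def one_hot(sentence):
    # 1 if the (case-flattened) word appears in the sentence, else 0
    words = set(w.lower() for w in sentence.split())
    return np.array([1 if w in words else 0 for w in vocab])

s1 = one_hot("A pen will need ink")   # [1, 1, 1, 1, 1, 0, 0]
s2 = one_hot("I have a pen")          # [1, 1, 0, 0, 0, 1, 1]

print(euclidean(s1, s2))  # summary distance between the two sentences
print(cosine(s1, s2))     # 0.0 means same direction; larger means less similar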
Other very similar feature encodings of these sentences might involve counts of each word (if a word appeared more than once, a number higher than 1 could appear), or weighted counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, so values can range anywhere from 0.0 to well above 1.0).
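If you want those count-based or TF/IDF-weighted encodings off the shelf, scikit-learn's CountVectorizer and TfidfVectorizer implement exactly this; a small sketch (the custom token_pattern is needed because the default tokenizer drops one-letter words like "a" and "i"):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["A pen will need ink", "I have a pen"]

# Keep one-letter tokens too; the default pattern requires 2+ characters
pattern = r"(?u)\b\w+\b"

counts = CountVectorizer(token_pattern=pattern).fit_transform(sentences)
tfidf = TfidfVectorizer(token_pattern=pattern).fit_transform(sentences)

print(counts.toarray())  # per-sentence word counts (can exceed 1)
print(tfidf.toarray())   # weighted counts; rarer words get larger values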
Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector, because that isn't comparable to any other sentence. All sentences need to be converted to vectors of the same dimensionality, and in "one hot" (or other simple "bag of words") encodings, that dimensionality equals the total vocabulary known among all sentences.
Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:
[0, 1, 0, 0, 0, 0, 0] # 'pen'
If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:
[0.236, -0.711] # 'pen'
All the 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):
[-0.101, 0.271] # 'a'
[0.236, -0.711] # 'pen'
[0.302, 0.293] # 'will'
[0.672, -0.026] # 'need'
[-0.198, -0.203] # 'ink'
[0.734, -0.345] # 'i'
[0.288, -0.549] # 'have'
If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:
[1, 1, 0, 0, 0, 1, 1] # I have a pen
...you'd get a single 2-dimensional dense vector like:
[ 0.28925, -0.3335 ] # I have a pen
And again different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.
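A minimal sketch of that train-then-average pipeline, assuming Gensim 4.x (where the dimensionality parameter is called vector_size); note that two sentences are far too little data for Word2Vec to learn meaningful vectors, so the numbers you get will be essentially arbitrary:

import numpy as np
from gensim.models import Word2Vec

corpus = [['a', 'pen', 'will', 'need', 'ink'],
          ['i', 'have', 'a', 'pen']]

# Tiny 2-dimensional model purely for illustration; real models use
# hundreds of dimensions and much larger corpora, and exact values
# will vary from run to run.
model = Word2Vec(sentences=corpus, vector_size=2, window=2, min_count=1)

def sentence_vector(words):
    # Average the dense word vectors of all words in the sentence
    return np.mean([model.wv[w] for w in words], axis=0)

print(model.wv['pen'])                             # dense 2-d vector for one word
print(sentence_vector(['i', 'have', 'a', 'pen']))  # dense 2-d sentence vector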
So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".
Which works best for your needs will depend on your data and ultimate goals. Often the most-simple techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.


Map Reduce Removing Duplicates

I have been given a large text file and want to find the number of different words that start with each letter. I am trying to understand input and output values for map and reduce functions.
I understand a simpler problem which does not need to deal with duplicate words: determine the frequency with which each letter of the alphabet starts a word in the text using map reduce.
Map input: <0, “everyday i am city in tomorrow easy over school i iterate tomorrow city community”>
Map output: [<e,1>,<i,1>,<a,1>,<c,1>,<i,1>,<t,1>,<e,1>,<o,1>,<s,1>,<i,1>,<i,1>,<t,1>,<c,1>,<c,1>]
Reduce input: <a,[1]>,<c,[1,1,1]>,<e,[1,1]>,<i,[1,1,1,1]>,<o,[1]>,<s,[1]>,<t,[1,1]>
Reduce output: [<a,1>,<c,3>,<e,2>,<i,4>,<o,1>,<s,1>,<t,2>]
For the above problem, the words 'i', 'city' and 'tomorrow' appear more than once, so my final output should be:
Reduce output: [<a,1>,<c,2>,<e,2>,<i,3>,<o,1>,<s,1>,<t,1>]
I am unsure how I would ensure duplicate words are removed in the above problem (would it be done in a pre-processing phase, or could it be implemented in either the map or the reduce function?). If I could get help understanding the map and reduce outputs of the new problem, I would appreciate it.
You can do it in two map-reduce passes:
1) find all the distinct words, by using the word as the map output and, in reduce, outputting each word once;
2) the problem you already solved: find the frequency of each initial letter over the unique words.
Alternatively, since there are not many unique words, you could cache them in the mapper and output each one (or its first letter) only once, and the reduce would be identical to your simpler problem. Actually, no, that won't work, because the same word can appear in different mappers. But you can still cache the words in the mapper in the first solution and output each word only once per mapper, giving a little less traffic between map and reduce.
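Here is a small Python simulation of the two passes, just to show the data flow; this is only the grouping logic, not real Hadoop code:

from collections import defaultdict

text = ("everyday i am city in tomorrow easy over school "
        "i iterate tomorrow city community")

# Pass 1: map emits <word, 1>; reduce outputs each distinct word once.
pass1_groups = defaultdict(list)
for word in text.split():
    pass1_groups[word].append(1)
unique_words = list(pass1_groups)  # each word exactly once

# Pass 2: map emits <first letter, 1> per unique word; reduce sums.
pass2_groups = defaultdict(list)
for word in unique_words:
    pass2_groups[word[0]].append(1)
counts = {letter: sum(ones) for letter, ones in sorted(pass2_groups.items())}

print(counts)  # {'a': 1, 'c': 2, 'e': 2, 'i': 3, 'o': 1, 's': 1, 't': 1}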
Maybe something like this would help,
let str = "everyday i am city in tomorrow easy over school i iterate tomorrow city community"
let duplicatesRemoved = Set(str.split(separator: " "))
Output:
["city", "community", "tomorrow", "easy", "everyday", "over", "in", "iterate", "i", "am", "school"]
And maybe you don't need those map statements and can achieve something like this,
Code
var varCount = [Character: Int]()
for subStr in duplicatesRemoved {
    if let firstChar = subStr.first {
        // Count how many unique words start with this character
        varCount[firstChar] = (varCount[firstChar] ?? 0) + 1
    }
}
Output
["i": 3, "t": 1, "e": 2, "c": 2, "s": 1, "a": 1, "o": 1]

How does the minizinc pentominoes regular constraint example work?

The minizinc benchmarks repository contains several pentomino examples.
Here is the data for the first example:
width = 5;
height = 4;
filled = 1;
ntiles = 5;
size = 864;
tiles = [|63,6,1,2,0,
|9,6,1,2,378,
|54,6,1,2,432,
|4,6,1,2,756,
|14,6,1,2,780,
|];
dfa = [7,5,5,5,5,3,0,2,2,2,2,2,7,5,5,5,5,3,19,4,4,4,4,3,30,4,4,4,4,3,0,10,10,10,10,10,46,8,8,8,8,0,0,12,12,12,12,13,0,15,15,15,15,14,0,16,16,16,16,16,0,18,18,18,18,17,0,20,20,20,20,20,0,21,21,21,21,21,0,22,22,22,22,22,0,23,23,23,23,23,0,28,28,28,28,0,47,22,22,22,22,22,47,23,23,23,23,23,46,11,11,11,11,24,0,26,26,26,26,26,0,25,25,25,25,25,0,27,27,27,27,25,0,29,29,29,29,26,0,31,31,31,31,31,32,0,0,0,0,0,33,0,0,0,0,0,34,0,0,0,0,0,35,0,0,0,0,0,36,0,0,0,0,0,46,9,9,9,9,6,47,16,16,16,16,16,0,35,35,35,35,0,60,35,35,35,35,0,0,37,37,37,37,39,0,39,39,39,39,39,60,37,37,37,37,39,0,40,40,40,40,40,0,41,41,41,41,41,0,42,42,42,42,42,0,43,43,43,43,43,0,45,45,45,45,45,0,47,47,47,47,47,60,47,47,47,47,47,48,0,0,0,0,0,49,44,44,44,44,0,53,38,38,38,38,38,60,0,0,0,0,0,0,50,50,50,50,50,0,51,51,51,51,0,0,52,52,52,52,52,0,54,54,54,54,54,0,55,55,55,55,55,0,56,56,56,56,56,0,57,57,57,57,57,0,60,60,60,60,0,0,58,58,58,58,58,0,59,59,59,59,59,61,55,55,55,55,0,62,0,0,0,0,0,63,0,0,0,0,0,0,62,62,62,62,0,0,63,63,63,63,0,0,2,2,2,2,2,3,4,3,3,3,3,2,0,2,2,2,2,3,4,3,3,3,3,5,9,5,5,5,5,6,0,6,6,6,6,7,0,7,7,7,7,8,0,8,8,8,8,0,9,0,0,0,0,2,0,2,2,2,2,4,4,14,4,4,5,2,2,0,2,2,2,3,3,10,3,3,5,3,3,12,3,3,5,4,4,14,4,4,5,8,8,0,8,8,0,9,9,0,9,9,13,11,11,0,11,11,11,11,11,22,11,11,11,7,7,15,7,7,11,13,13,0,13,13,13,6,6,15,6,6,0,0,0,22,0,0,0,6,6,25,6,6,0,17,17,29,17,17,16,19,19,0,19,19,19,20,20,0,20,20,20,21,21,0,21,21,21,22,22,0,22,22,0,23,23,0,23,23,24,24,24,0,24,24,24,26,26,0,26,26,0,26,26,27,26,26,0,0,0,27,0,0,0,18,18,29,18,18,0,0,0,30,0,0,0,28,28,0,28,28,0,30,30,0,30,30,0,32,32,0,32,32,32,33,33,0,33,33,33,34,34,0,34,34,0,35,35,0,35,35,35,36,36,0,36,36,36,0,0,37,0,0,0,31,31,40,31,31,0,0,0,45,0,0,0,39,39,0,39,39,39,41,41,0,41,41,41,42,42,0,42,42,42,43,43,0,43,43,0,44,44,0,44,44,44,45,45,0,45,45,0,38,38,46,38,38,0,0,0,50,0,0,0,0,0,51,0,0,0,47,47,0,47,47,47,49,49,0,49,49,49,51,51,0,51,51,0,48,48,52,48,48,0,0,0,53,0,0,0,0,0,54,0,0,0,53,53,0,53,53,0,54,54,0,54,54,0,2,2,0,2,2,2,3,3,3,4,3,3,2,2,2,0,2,2,3,3,3,4,3,3,2,2,2,0,2,2,3,3,3,3,8,3,2,2,2,2,0,2,3,3,3,3,8,3,5,5,5,5,0,5,6,6,6,6,0,6,7,7,7,7,0,7,0,0,0,0,9,0,4,4,4,4,13,4,10,10,10,10,0,10,11,11,11,11,0,11,12,12,12,12,0,12,13,13,13,13,0,13,0,0,0,0,14,0,2,2,2,2,0,2,]
As far as I understand it, the goal is to fill a 5 x 4 board with 5 pentominoes. Some overlap and/or exclusion of tiles seems to be required, which is not usual.
Here is the minizinc solution:
include "globals.mzn";
int: Q = 1;
int: S = 2;
int: Fstart = 3;
int: Fend = 4;
int: Dstart = 5;
int: width;
int: height;
int: filled;
int: ntiles;
int: size;
array[1..ntiles,1..Dstart] of int: tiles;
array[1..size] of int: dfa;
array[1..width*height] of var filled..ntiles+1: board;
constraint forall (h in 1..height, w in 1..width-1) (
    board[(h-1)*width+w] != ntiles+1);

constraint forall (h in 1..height) (
    board[(h-1)*width+width] = ntiles+1);

constraint
    forall (t in 1..ntiles)(
        let {
            int: q = tiles[t,Q],
            int: s = tiles[t,S],
            set of int: f = tiles[t,Fstart]..tiles[t,Fend],
            array[1..q,1..s] of int: d =
                array2d(1..q,1..s,
                    [ dfa[i] | i in tiles[t,Dstart]+1..tiles[t,Dstart]+q*s ])
        } in
            regular(board, q, s, d, 1, f)
    );

solve :: int_search(board, input_order, indomain_min, complete) satisfy;

output [show(board)];
I've not been able to find much documentation on the minizinc benchmarks. They were part of the minizinc challenge for a few years but not anymore.
Chapter 3 of Mikael Lagerkvist's thesis is perhaps partially relevant. It describes placing pentominoes using the regular constraint in the gecode toolkit.
Section 3.2 illustrates a string representation for placing the L pentomino using a regular expression string of 0s and 1s: 1s where each square of the board overlaps a square of the L pentomino. Piece rotations are handled in section 3.3 using disjunctions of regular expressions. In general, there are 8 symmetries to consider for each pentomino (2 mirrorings and 4 rotations).
The minizinc data above does not use disjunctions of 8 binary strings to represent pentomino tiles but the minizinc code does use the regular constraint.
I realise gecode and minizinc work differently, and in this case the minizinc model has chosen an alternative to the difficult-to-read-and-write disjunctions of binary-string regular expressions. The 864-number-long dfa variable is probably the core part of the minizinc solution I'm missing. The rest of the solution (removing board symmetries) I can probably figure out after that.
I don't see how to fill a 5 x 4 board with 5 pentominoes without overlaps and/or exclusions. What is the goal of this example?
How does the minizinc pentomino tile and dfa representation work?
How does pentomino rotation and mirroring work in this minizinc representation?
Here is the only board solution from the above code:
[1, 1, 1, 2, 6, 3, 3, 1, 2, 6, 3, 5, 5, 5, 6, 3, 3, 3, 4, 6]
Here is the solution reformatted into a 5 x 4 board:
[1, 1, 1, 2, 6,
3, 3, 1, 2, 6,
3, 5, 5, 5, 6,
3, 3, 3, 4, 6]
Note the 6s. What do they represent?
See this web page for the complete set of 50, 5 x 4 pentomino tilings without overlaps, exclusions or holes.
There are alternative approaches for solving this sort of problem with minizinc.
The geost predicate is one possibility. None of these alternatives are relevant for this question.
Alternative software suggestions or discussion, beyond minizinc and gecode, are again not relevant.
Both the original MiniZinc model and the one in the repository in the comment are ones I wrote. While my licentiate thesis and the linked repository use regular expressions to express the constraints, the original MiniZinc challenge model was written when MiniZinc only had support for DFA inputs, as this is what the regular constraint inside solvers actually uses (§). The DFAs were in fact generated by taking the Gecode model and writing a small program (lost to time) that printed out the DFA for the regular expressions in the Gecode example file, using the Gecode regular-expression-to-DFA translation. A nice thing about the translation is that for a piece that has some symmetry, the DFA minimization will remove the symmetries. The instance generator linked was written this year, and uses the modern MiniZinc feature that accepts regular expressions; it was just easier to write that way.
So, in order to understand the long list of numbers, you have to view it as a DFA. The list represents a matrix, where the indexes are the current state and the next input symbol, and the values are the next state to go to. The other arguments to the regular constraint indicate the number of states, the number of symbols in the alphabet, and the starting and accepting states of the DFA.
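To make that concrete, here is a small Python sketch of how such a flattened transition table is read, assuming the usual MiniZinc regular convention: states are numbered 1..q, symbols 1..s, state 0 is a dead (failure) state, and entry dfa[(state-1)*s + (symbol-1)] is the next state. The toy DFA below is made up and far smaller than the one in the instance data:

def dfa_accepts(dfa, q, s, start, finals, word):
    # Simulate the flattened q-by-s transition table on a word
    # (a sequence of symbols in 1..s); state 0 is a dead state.
    assert len(dfa) == q * s
    state = start
    for symbol in word:
        if state == 0:
            return False
        state = dfa[(state - 1) * s + (symbol - 1)]  # row = state, column = symbol
    return state in finals

# Hypothetical toy DFA with 2 states and 2 symbols that accepts words
# ending in symbol 2: from either state, symbol 1 -> state 1, symbol 2 -> state 2.
toy_dfa = [1, 2,
           1, 2]
print(dfa_accepts(toy_dfa, q=2, s=2, start=1, finals={2}, word=[1, 2, 2]))  # True
print(dfa_accepts(toy_dfa, q=2, s=2, start=1, finals={2}, word=[2, 1]))     # False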
As for the 6's at the end of each board row, these are end-of-line markers. They are there to make sure that a piece is not split apart. Consider the simple piece XXX on a 4-by-4 board with no other pieces (so X is 1, and empty is 0). With the expression 0*1110*, all placements of the piece are modeled, but so are placements like
_ _ _ _
_ _ X X
X X _ _
_ _ _ _
In order to avoid that, an additional end-of-line column is added to the board. In this particular case, the end-of-line marker is its own unique value, which means that the model could even handle disjoint pieces. If all pieces are connected and the board is not full, then the end-of-line markers could be the same as the empty square.
I have a couple of other papers that use a similar construction for placing parts, if you find this interesting, as well as the original publication.
Footnote: (§) Technically, most direct implementations of Pesant's algorithm can handle NFAs as well (disregarding epsilon transitions), which could be used to optimize the representation. However, DFA minimization is a known and fast method, while NFA minimization is much, much harder. The fewer the states in the FA, the faster the propagation will be.

Understanding; for i in range, x,y = [int(i) in i.... Python3

I am stuck trying to understand the mechanics behind this combination of input(), a loop and a list comprehension, from Codegaming's "MarsRover" puzzle. The sequence creates a 2D line, representing a cut-out of the topology in an area 6999 units wide (x-axis).
Understandably, my original question was put on hold for being too broad. I am trying to shorten and narrow the question: I have a basic understanding of list comprehensions and reasonable experience with for-loops.
Like list comp:
land_y = [int(j) for j in range(k)]
if k = 5, then land_y = [0, 1, 2, 3, 4]
For-loops:
for i in range(4):
    a = 2*i          # a takes the values 0, 2, 4, 6
    ab.append(a)     # ab ends up as [0, 2, 4, 6]
But here, it just doesn't add up (in my head):
6999 points are created along the x-axis, from 6 points(x,y).
surface_n = int(input())
for i in range(surface_n):
    land_x, land_y = [int(j) for j in input().split()]
I do not understand where "i" makes a difference.
I do not understand how the data is "packaged" inside the input. I have split strings of integers on another task with almost exactly the same code, and I could easily create new lists and work with them, as I understood the structure I was unpacking (it was pretty simple, being one datatype with one purpose).
The fact that this line appears inside the game's while-loop confuses me even more, as it updates dynamically as the state of the game changes.
x, y, h_speed, v_speed, fuel, rotate, power = [int(i) for i in input().split()]
Maybe someone could give an example of how this could be written in javascript, haskell or c#? No need to be syntax-correct, I'm just struggling with the concept here.
input() takes a line from the standard input. So it’s essentially reading some value into your program.
The way that code works, it makes very strong assumptions about the format of the input strings, to the point that it gets confusing (and difficult to verify).
Let’s take a look at this line first:
land_x, land_y = [int(j) for j in input().split()]
You said you already understand list comprehensions, so this is essentially equivalent to:
inputs = input().split()
results = []
for j in inputs:
    results.append(int(j))
land_x, land_y = results
This is a combination of multiple things that happen here. input() reads a line of text into the program, split() separates that string into multiple parts, splitting it whenever a white space character appears. So a string 'foo bar' is split into ['foo', 'bar'].
Then, the list comprehension happens, which essentially just iterates over every item of that split-up input string and converts each item into an integer using int(j). So an input of '2 3' is first converted into ['2', '3'] (a list of strings), and then into [2, 3] (a list of ints).
Finally, the line land_x, land_y = results is evaluated. This is called iterable unpacking and essentially assumes that the iterable on the right has exactly as many items as there are variables on the left. If that’s the case then it’s just a nice way to write the following:
land_x = results[0]
land_y = results[1]
So basically, the whole list comprehension assumes that there is an input of two numbers separated by whitespace, it then splits those into separate strings, converts those into numbers and then assigns each number to a separate variable land_x and land_y.
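Putting the three steps together on a literal string, instead of input(), purely for illustration (the numbers are made up):

line = "3 7"                       # what input() might have returned
parts = line.split()               # ['3', '7']   (split on whitespace)
numbers = [int(j) for j in parts]  # [3, 7]       (strings -> ints)
land_x, land_y = numbers           # land_x = 3, land_y = 7

# If the line had more or fewer than two numbers, the unpacking on the
# last line would raise a ValueError instead.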
Exactly the same thing happens again later with the following line:
x, y, h_speed, v_speed, fuel, rotate, power = [int(i) for i in input().split()]
It’s just that this time, it expects the input to have seven numbers instead of just two. But then it’s exactly the same.

How do I fuzzy match items in a column of an array in python?

I have an array of team names from NCAA, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of the name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to be able to take all variations of a team name by fuzzy matching them and rename all variants to one name. I'm working in python 2.7 and I have a numpy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before.
I have considered fuzzy matching through a for-loop, which would (despite being unbelievably slow) compare each element in the column of the array to every other element, but I'm not really sure how to build it.
Currently, my array looks like this:
{Names, info1, info2, info3}
The array is a few thousand rows long, so I'm trying to make the program as efficient as possible.
The Levenshtein edit distance is the most common way to perform fuzzy matching of strings. It is available in the python-Levenshtein package. Another popular measure is the Jaro-Winkler distance, also available in the same package.
Assuming a simple numpy array:
import numpy as np
import Levenshtein as lv
ar = np.array([
    'string',
    'stum',
    'Such',
    'Say',
    'nay',
    'powder',
    'hiden',
    'parrot',
    'ming',
])
We define helpers that compare a given string against every string in the array and return a boolean mask marking which entries fall below a distance threshold:
def levenshtein(dist, string):
    return map(lambda x: x < dist, map(lambda x: lv.distance(string, x), ar))

def jaro(dist, string):
    return map(lambda x: x < dist, map(lambda x: lv.jaro_winkler(string, x), ar))
Now, note that the Levenshtein distance is an integer value counted in number of characters, whilst jaro_winkler returns a floating-point similarity that varies between 0 and 1, where 1.0 means identical; so the x < dist test in the jaro helper actually keeps the least similar strings. Let's test this using np.where:
print ar[np.where(levenshtein(3, 'str'))]
print ar[np.where(levenshtein(5, 'str'))]
print ar[np.where(jaro(0.00000001, 'str'))]
print ar[np.where(jaro(0.9, 'str'))]
And we get:
['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
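Building on those helpers, one possible way to collapse name variants onto a single canonical name; this is only a sketch, and the canonical list, the example names and the distance threshold are all made up and would need tuning for real NCAA data:

import numpy as np
import Levenshtein as lv

# Hypothetical canonical names; in practice you would curate this list.
canonical = ['Alabama Crimson Tide', 'Georgia Bulldogs']

def to_canonical(name, max_dist=8):
    # Pick the canonical name with the smallest edit distance,
    # but keep the original if nothing is close enough.
    distances = [lv.distance(name, c) for c in canonical]
    best = int(np.argmin(distances))
    return canonical[best] if distances[best] <= max_dist else name

names = np.array(['Crimson Tide', 'Alabama Crimson Tide', 'Bulldogs', 'Duke'])
print(np.array([to_canonical(n) for n in names]))
# expected: ['Alabama Crimson Tide' 'Alabama Crimson Tide' 'Georgia Bulldogs' 'Duke']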

Save list of table of numbers from Python into format easily readable by Mathematica?

I am running a simulation in Python. The simulation's results are summarized in a list of numeric matrices. Is there a nice export format I can use to write this list, so that later I can read the file into Mathematica easily, and Mathematica will recognize it as a list of matrices automatically?
Well, it depends on how large your matrices are and whether speed or memory is a concern for you. The simplest solution is to create a plain-text Mathematica expression yourself. Just iterate through your matrices and write out a list of them in Mathematica format. This boils down to writing braces and numbers to a file:
{mat1, mat2, ...}
where mat1, etc. are themselves lists of lists of numbers.
Update 1
If you want a standardized format, then you could look at what you can easily import into Mathematica. One thing that hits the eye (after it was hit by MTX, which obviously doesn't work) is the MAT format. A quick search seems to indicate that you can write those files with Python.
Update 2
Regarding your comment
Pythonica looks nice. Regrettably, I am running the Python simulations on a cluster that does not have Mathematica installed. I am using Mathematica in my personal PC for post-processing.
OK, but the package is not even 500 lines of code. Why don't you skim over it and just take out what you need: the code that transforms arbitrary Python lists to Mathematica code (note that this is Python 2 code: long and xrange do not exist in Python 3):
_id_to_mathematica = lambda x: str(x)

def _float_to_mathematica(x):
    return ("%e" % x).replace('e', '*10^')

def _complex_to_mathematica(z):
    return 'Complex' + ('[%e,%e]' % (z.real, z.imag)).replace('e', '*10^')

def _str_to_mathematica(s):
    return '"%s"' % s

def _iter_to_mathematica(xs):
    s = '{'
    for x in xs:
        s += _python_mathematica[type(x)](x)
        s += ','
    s = s[:-1]
    s += '}'
    return s

_python_mathematica = {bool: _id_to_mathematica,
                       type(None): _id_to_mathematica,
                       int: _id_to_mathematica,
                       float: _float_to_mathematica,
                       long: _id_to_mathematica,
                       complex: _complex_to_mathematica,
                       iter: _iter_to_mathematica,
                       list: _iter_to_mathematica,
                       set: _iter_to_mathematica,
                       xrange: _iter_to_mathematica,
                       str: _str_to_mathematica,
                       tuple: _iter_to_mathematica,
                       frozenset: _iter_to_mathematica}

l = [[1, 2, 3], 1, [1, 5, [7, 3, 7, 8]]]
print(_iter_to_mathematica(l))
The output is a string
{{1,2,3},1,{1,5,{7,3,7,8}}}
that you can directly save into a file and load it into Mathematica using Get.
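For completeness, writing the generated expression to a file might look like this (the file name is arbitrary; the Mathematica side is shown as a comment):

# Save the generated Mathematica expression to an (arbitrarily named) file
with open('results.m', 'w') as f:
    f.write(_iter_to_mathematica(l))

# Then, in Mathematica:  data = Get["results.m"]   (or:  data = << results.m)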
How big are the matrices?
If they are not too large, the JSON format will work well. I have used this, it is easy to work with both in Python and Mathematica.
If they are large, I would try HDF5. I have no experience with writing this from Python, but I know that it can store multiple datasets, thus it can store multiple matrices of different sizes.
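For the JSON route, a minimal sketch (the file name is made up); JSON arrays come back as nested lists when imported into Mathematica:

import json
import numpy as np

# A list of matrices of different sizes, as a simulation might produce
matrices = [np.eye(2), np.ones((3, 4))]

# Write plain nested lists; json cannot serialize numpy arrays directly
with open('results.json', 'w') as f:
    json.dump([m.tolist() for m in matrices], f)

# Then, in Mathematica:  data = Import["results.json"]
# gives the list of matrices back as nested lists.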