I have to identify and isolate a number sequence from the file names in a folder of files, and optionally identify non-continuous sequences. The files are .dpx files. There is almost no file-naming structure, except that somewhere in the filename is a sequence number and an extension of '.dpx'. There is a wonderful module called PySeq that can do all of the hard work, except it just bombs with a directory of thousands, and sometimes hundreds of thousands, of files: "Argument list too long". Has anyone had experience working with sequence-number isolation and dpx files in particular? Each file can be up to 100 MB in size. I am working on a CentOS box using Python 2.7.
File names might be something like:

test00_take1_00001.dpx
test00_take1_00002.dpx
another_take_ver1-0001_3.dpx
another_take_ver1-0002_3.dpx

(Two continuous sequences)
This should do exactly what you're looking for. It builds a dict of dicts keyed on the text before and after the sequence number, appending each full filename to a list.
It then joins all of those lists into a single list. (You could skip this last step and turn it into a generator of lists for better memory efficiency.)
import re
from collections import defaultdict

input_list = [
    "test00_take1_00001.dpx",
    "test00_take1_00002.dpx",
    "another_take_ver1-0001_3.dpx",
    "another_take_ver1-0002_3.dpx"]

# Outer key: the text before the sequence number; inner key: the text after it.
results_dict = defaultdict(lambda: defaultdict(list))

# Capture the prefix and suffix around each run of digits.
matches = (re.match(r"(.*?[\W_])\d+([\W_].*)", item) for item in input_list)
for match in matches:
    if match is None:  # skip names with no sequence number
        continue
    results_dict[match.group(1)][match.group(2)].append(match.group(0))

# Flatten the dict of dicts into a list of sequences.
results_list = [d2 for d1 in results_dict.values() for d2 in d1.values()]
>>> results_list
[['another_take_ver1-0001_3.dpx', 'another_take_ver1-0002_3.dpx'], ['test00_take1_00001.dpx', 'test00_take1_00002.dpx']]
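Incidentally, the "Argument list too long" error usually comes from the shell expanding *.dpx into one enormous command line, not from Python itself. If that is what is happening here, gathering the names with os.listdir sidesteps the limit entirely. A minimal sketch (the directory path is hypothetical):

import os

directory = "/path/to/dpx/files"  # hypothetical path
# os.listdir never goes through the shell, so the number of files
# is not an issue; this works fine on Python 2.7.
input_list = sorted(name for name in os.listdir(directory)
                    if name.endswith(".dpx"))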
Related
I am currently working with Python 2.7 on a standalone system. My preferred method for solving this problem would be to use pandas DataFrames to compare, but I do not have access to install the library on the system I'm working with. So my question is: how else could I take a text file and look for matches of its strings in a CSV?
I have a main CSV file with many fields (for relevance, the first one is timestamps) and several other text files that each contain a list of timestamps. How can I compare each of the txt files with the main CSV and, whenever a match is found on that field, grab the entire row from the CSV and output it to another CSV?
Example:
example.csv
timestamp,otherfield,otherfield2
1.2345,,
2.3456,,
3.4567,,
5.7867,,
8.3654,,
12.3434,,
32.4355,,
example1.txt
2.3456
3.4565
example2.txt
12.3434
32.4355
If there are any questions I'm happy to answer them.
You can load all the files into lists, then search the lists, using something like:
# Read the main CSV and one timestamp list into memory.
with open('example.csv', 'r') as file_handle:
    example_file_content = file_handle.read().split("\n")

with open("example1.txt", "r") as file_handle:
    example1_file_content = file_handle.read().split("\n")

# Compare the first CSV field of every row against the timestamp list.
for index, line in enumerate(example_file_content):
    if line.split(",")[0] in example1_file_content:
        print('match found; line is', line)
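To also write the matching rows out to another CSV, here is a minimal sketch using the stdlib csv module (matches.csv is a hypothetical output name; each additional txt file can be read the same way as example1.txt and added to the set):

import csv

timestamps = set(example1_file_content)  # union in the other txt files' contents too

# 'wb' because the csv module on Python 2.7 expects binary mode for output files.
with open("example.csv", "r") as src, open("matches.csv", "wb") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if row and row[0] in timestamps:
            writer.writerow(row)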
I think part of my issue has to do with spaCy, and part has to do with not understanding the most elegant way to work within Python itself.
I am loading a txt file in Python, tokenizing it into sentences and then tokenizing those sentences into words with NLTK:
sent_text = nltk.sent_tokenize(text)
tokenized_text = [nltk.word_tokenize(x) for x in sent_text]
That gives me a list of lists, where each list within the main list is a sentence of tokenized words. So far so good.
I then run it through spaCy:
text = nlp(unicode(tokenized_text))
Still a list of lists, same thing, but now with all the spaCy info.
This is where I'm hitting a block. Basically what I want to do is, for each sentence, retain only the nouns, verbs, and adjectives, and within those, also get rid of auxiliaries and conjunctions. I was able to do this earlier by creating a new empty list and appending only what I want:
sent11 = []
for token in sent1:
    if (token.pos_ == 'NOUN' or token.pos_ == 'VERB' or token.pos_ == 'ADJ') and (token.dep_ != 'aux') and (token.dep_ != 'conj'):
        sent11.append(token)
This works fine for a single sentence, but I don't want to be doing it for every single sentence in a book-length text.
Then, once I have these new lists (or whatever the best structure is) containing only the pieces I want, I want to use spaCy's similarity function to determine which sentence is closest semantically to some other, much shorter text that I've stripped down the same way, to nouns, adjectives, verbs, etc.
I've got it working when comparing one single sentence to another by using:
sent1.similarity(sent2)
So I guess my questions are
1) What is the best way to turn a list of lists into a list of lists that only contain the pieces I want?
and
2) How do I cycle through this new list of lists, compare each one to a separate sentence, and return the sentence that is most semantically similar (using the vectors that spaCy comes with)?
You're asking a bunch of questions here, so I'm going to try to break them down:
1. Is nearly duplicating a book-length amount of text by appending each word to a list bad?
2. How can one eliminate or remove elements of a list efficiently?
3. How can one compare a sentence to each sentence in the book, where each sentence is a list and the book is a list of sentences?
Answers:
Generally yes, but on a modern system it isn't a big deal. Books are text, which in English is probably just UTF-8 encoded; ASCII characters in UTF-8 take one byte each, and even a long book such as War and Peace comes out to under 3.3 MB. If you are using Chrome, Firefox, or IE to view this page, your computer has more than enough memory to fit a few copies of it into RAM.
In Python you can't, really.
You can do removal using:
l = [1, 2, 3, 4]
del l[-2]
print(l)
[1, 2, 4]
but in the background Python shifts every element after the deleted one over by one, so this is not recommended for large lists. A collections.deque, which is implemented as a doubly linked list of blocks, has a bit of extra overhead but allows efficient removal of elements at either end.
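A quick sketch of the difference (the values are just illustrative):

from collections import deque

d = deque([1, 2, 3, 4])
d.popleft()       # O(1) removal at the left end
d.pop()           # O(1) removal at the right end
print(list(d))    # [2, 3]

l = [1, 2, 3, 4]
l.pop(0)          # O(n): every remaining element shifts left by one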
If memory is an issue then you can also use generators wherever possible. For example you could probably change:
tokenized_text = [nltk.word_tokenize(x) for x in sent_text]
which creates a list that contains tokens of the entire book, with
tokenized_text = (nltk.word_tokenize(x) for x in sent_text)
which creates a generator that yields tokens of the entire book. Generators have almost no memory overhead and instead compute the next element as they go.
I'm not familiar with spaCy, and while the question fits on SO, you're unlikely to get good answers about specific libraries here.
From the looks of it you can just do something like:
best_match = None
best_similarity_value = 0
for token in parsed_tokenized_text:
    # Compare each candidate against the target sentence and keep the best.
    similarity = token.similarity(sent2)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = token
And if you wanted to check against multiple sentences (non-consecutive) then you could put an outer loop that goes through those:
for sent2 in other_token_list:
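As for actually building the filtered list of lists the question asks about, a nested list comprehension is probably the cleanest route. A minimal sketch, assuming nlp is the loaded spaCy model from the question and tokenized_text is the NLTK list of lists:

# Re-join each sentence's tokens and parse it with spaCy so the
# tokens expose .pos_ and .dep_.
parsed_sentences = [nlp(u" ".join(words)) for words in tokenized_text]

KEEP_POS = {'NOUN', 'VERB', 'ADJ'}
DROP_DEP = {'aux', 'conj'}

# One inner list per sentence, with the unwanted tokens filtered out.
filtered = [[tok for tok in sent
             if tok.pos_ in KEEP_POS and tok.dep_ not in DROP_DEP]
            for sent in parsed_sentences]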
I have the file names of four files stored in a cell array called F2000. These files are named:
L14N_2009_2000MHZ.txt
L8N_2009_2000MHZ.txt
L14N_2010_2000MHZ.txt
L8N_2010_2000MHZ.txt
Each file consists of an mxn matrix where m is the same but n varies from file to file. I'd like to store each of the L14N files and each of the L8N files in two separate cell arrays so I can use dlmread in a for loop to store each text file as a matrix in an element of the cell array. To do this, I wrote the following code:
idx2009 = ~cellfun('isempty', regexp(F2000, 'L\d{1,2}N_2009_2000MHZ\.txt'));
F2000_2009 = F2000(idx2009);
idx2010 = ~idx2009;
F2000_2010 = F2000(idx2010);
cell2009 = cell(size(F2000_2009));
cell2010 = cell(size(F2000_2010));
for k = 1:numel(F2000_2009)
    cell2009{k} = dlmread(F2000_2009{k});
end
and repeated a similar for loop to use on F2000_2010. So far, so good. However:
My real data set is much larger than just four files. The total number of files will vary, although I know there will be five years of data for each L\d{1,2}N (so, for instance, L8N_2009, L8N_2010, L8N_2011, L8N_2012, L8N_2013). I won't know what the number of files is ahead of time (although I do know it will range between 50 and 100), and I won't know what the file names are, but they will always be in the same L\d{1,2}N format.
In addition to what's already working, I want to count the number of files that have unique combinations of numbers in the portion of the filename that says L\d{1,2}N so I can further break down F2000_2010 and F2000_2009 in the above example to F2000_2010_L8N and F2000_2009_L8N before I start the dlmread loop.
Can I use regexp to build a list of all of my unique L\d{1,2}N occurrences? Next, can I easily change these list elements to strings to parse the original file names and create a new file name to the effect of L14N_2009, where 14 comes from \d{1,2}? I am sure this is a beginner question, but I discovered regexp yesterday! Any help is much appreciated!
Here is some code which might help:
% Find all the files in your directory
files = dir('*2000MHZ.txt');
files = {files.name};
% Match the identifiers (e.g. L14N, L8N)
ids = unique(cellfun(@(x)x{1}, regexp(files, 'L\d{1,2}N', 'match'), ...
    'UniformOutput', false));
% Find all years
years = unique(cellfun(@(x)x{1}, regexp(files, '(?<=L\d{1,2}N_)\d{4,}', 'match'), ...
    'UniformOutput', false));
% Find the years for each identifier
for id_ix = 1:length(ids)
    % There is probably a better way to do this
    list = regexp(files, ['(?<=' ids{id_ix} '_)\d{4,}'], 'match');
    ids_years{id_ix} = cellfun(@(x)x{1}, list(cellfun(...
        @(x)~isempty(x), list)), 'UniformOutput', false);
end
% If you need dynamic naming, I would suggest dynamic struct names:
for ix_id = 1:length(ids)
    for ix_year = 1:length(ids_years{ix_id})
        % The 'Y' is in the dynamic name because all struct field names must start with a letter
        data.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) = ...
            'read in my data here for each one';
    end
end
Also, if anyone is interested in mapping keys to values, try looking into the containers.Map class.
So I'm trying to write a small program that can search for words within the files of a specific folder.
It should be able to search for one word as well as a combination of two or more words, and then give as a result a list of the file names that include those words.
It's important that I use dictionaries and that the files are given a number within the dictionary.
I've tried several things but I'm still stuck. Basically I've been thinking that each word within a file has to be split into strings, and then by searching for a string you get a result.
It's important to note that I'm using Unicode UTF-8.
Anyone willing to help me?
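A minimal sketch of the approach the question describes: number the files, index each file's words into a set, and keep only the files that contain every search term. The folder name is hypothetical.

# -*- coding: utf-8 -*-
import io
import os

folder = "docs"  # hypothetical folder of UTF-8 text files
names = {}       # {number: filename}
words = {}       # {number: set of the words in that file}

for number, name in enumerate(sorted(os.listdir(folder))):
    names[number] = name
    with io.open(os.path.join(folder, name), encoding="utf-8") as fh:
        words[number] = set(fh.read().split())

def search(*terms):
    # Return the names of the files that contain every requested word.
    return [names[n] for n in names if all(t in words[n] for t in terms)]

print(search(u"apple"))             # one word
print(search(u"apple", u"banana"))  # a combination of words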
Intro
I work in a facility where we have microscopes. These guys can be asked to generate 4D movies of a sample: they take e.g. 10 pictures at different Z positions, then wait a certain amount of time (next time point) and take 10 slices again.
They can be asked to save a file for each slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point.
Question
Once you have all these file names, you can use a regex pattern to extract the Z and T values, if you know the backbone pattern of the file name. This I know how to do.
The question I have is: do you know a way to automatically generate a regex pattern from the file name list? For instance, there is an awesome tool on the net that does a similar thing: txt2re.
What algorithm would you use to parse the whole file name list and generate the most likely regex pattern?
There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is
use String::Diff;

my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";
outputs:
this\ is\ (?:Perl|Ruby)
Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, it wouldn't give you capturing of numbers etc., so it wouldn't be completely automatic. After getting the diff you would have to hand-edit it, or do some kind of substitution, to get a working final regex.
First of all, you are trying to do this the hard way. I suspect it is not impossible, but you would have to apply some artificial-intelligence techniques, and it would be far more complicated than it is worth. Either a neural network or a genetic-algorithm system could be trained to recognize the Z-numbers and T-numbers, assuming that the formats Z[0-9]+ and T[0-9]+ are always used somewhere in the filename.
What I would do with this problem is write a Python script to process all of the filenames. In this script, I would match twice against each filename, once looking for Z[0-9]+ and once looking for T[0-9]+, counting the matches for Z-numbers and T-numbers each time.
I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would hold the count of filenames with exactly one match and the count with multiple matches. And I would count the total number of filenames processed.
At the end, I would report as follows:
nnnnnnnnnn filenames processed
Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.
T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.
If you are lucky, there will be no multiple matches at all, and you can use the regexes above to extract your numbers. However, if there is a significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. That will tell you whether or not a simple adjustment to the regex might work.
For instance, if you have 23,768 multiple matches on T-numbers, make the script print every 500th such filename, which gives you 47 samples to examine.
Probably something like [ -/.=]T[0-9]+[ -/.=] would be enough to get the multiple matches down to zero, while also giving a one-time match for every filename. Or at worst, [0-9][ -/.=]T[0-9]+[ -/.=].
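A minimal sketch of that counting script (the directory name is hypothetical, and the simple patterns are easy to swap for the refined ones above):

import os
import re

z_pat = re.compile(r'Z[0-9]+')
t_pat = re.compile(r'T[0-9]+')

total = single_z = multi_z = single_t = multi_t = 0
for name in os.listdir('slices'):  # hypothetical directory of .tif files
    total += 1
    nz = len(z_pat.findall(name))
    nt = len(t_pat.findall(name))
    # Tally exactly-one vs. multiple matches for each kind of number.
    if nz == 1:
        single_z += 1
    elif nz > 1:
        multi_z += 1
    if nt == 1:
        single_t += 1
    elif nt > 1:
        multi_t += 1

print('%d filenames processed' % total)
print('Z-numbers matched only once in %d filenames.' % single_z)
print('Z-numbers matched multiple times in %d filenames.' % multi_z)
print('T-numbers matched only once in %d filenames.' % single_t)
print('T-numbers matched multiple times in %d filenames.' % multi_t)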
For Python, see this question about TemplateMaker.