I'm working on a program that reads through a FASTQ file and reports the number of N's per sequence in the file. I managed to get the number of N's per line and I put these in a list.
The problem is that I need all the numbers in one list so I can sum up the total number of N's in the file, but each number gets printed in its own list.
C:\Users\Zokids\Desktop>N_counting.py test.fastq
[4]
4
[3]
3
[5]
5
This is my output: each count in its own list, followed by the total of that list. I've seen ways to manually combine lists, but a file can have hundreds of sequences, so that's a no-go.
def Count_N(line):
    '''
    This function takes a line and counts the amount of N's in the line
    '''
    List = []
    Count = line.count("N")  # count the N's in the line returned by import_fastq_file
    List.append(int(Count))
    Total = sum(List)
    print(List)
    print(Total)
This is what I have as code; another function selects the lines.
I hope someone can help me with this.
Thank you in advance.
The List you're defining in your function never gets more than one item, so it's not very useful. Instead, you should probably return the count from the function and let the calling code (which is presumably running in some kind of loop) append the value to its own list. Of course, since there's not much to the function, you might just move its contents out to the loop too!
For example:
list_of_counts = []
for line in my_file:
    count = line.count("N")
    list_of_counts.append(count)
total = sum(list_of_counts)
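To make that concrete, here is a small self-contained sketch of the return-the-count approach (the `sequences` list is a hypothetical stand-in for the lines your other function selects from test.fastq):

```python
def count_n(line):
    """Return the number of 'N' characters in one line of the file."""
    return line.count("N")

# Hypothetical input standing in for the lines read from test.fastq
sequences = ["ANNT", "NNNCG", "ACGT"]

counts = [count_n(seq) for seq in sequences]  # one count per line
total = sum(counts)                           # total N's in the file
```

Because every count lands in the same `counts` list, `sum()` gives the file-wide total in one call, no matter how many sequences there are.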
It looks from your code like you send one line each time you call Count_N(). The List you declared is local to the function and gets reinitialized on every call. You can declare the list global inside the function using:
global List
You will also need to define the list (List = []) outside the function in order to access it globally.
It would also be better if you totaled the list outside the function. Right now you are summing up the list inside the function; to move that out, the sum will need to match the indentation of the function declaration.
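If you do want the global-list approach, a minimal sketch looks like this (the input lines here are hypothetical placeholders for lines read from the file):

```python
List = []  # defined at module level so it persists between calls

def count_n(line):
    global List                    # use the module-level list
    List.append(line.count("N"))

# Hypothetical lines standing in for lines read from the file
for line in ["ANNT", "NNNCG"]:
    count_n(line)

total = sum(List)  # summed outside the function, over all calls
```

Strictly, `global` is only required when a function rebinds the name; appending mutates the existing list and would work without it, but the declaration makes the intent explicit.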
I have two lists of values in two variables, which contain ZIP codes as strings, since they have both numbers and letters. My first list contains 33,000 ZIP codes, the second list 1,400. Now I want to check whether the ZIP codes from the second variable are also in the first variable, and if so, give a third variable the code 1. If a code is not in both lists, give it the code 0. I've tried comparing datasets, but that only compares values at the same position. Writing a loop hasn't worked so far.
Hopefully anyone can help! Thanks in advance.
Assuming you have two datasets:
dataset activate list2.
compute InBothLists=1.
sort cases by zipcode.
dataset activate list1.
sort cases by zipcode.
match files /file=* /table=list2 /by zipcode.
execute.
In the code above use your own dataset names and variable names - make sure you have the same variable name for the zipcode in both lists.
Once you run this you will have a new variable in the dataset list1 which has the value 1 for zipcodes that also appear in list2.
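For comparison outside SPSS, the same membership-flag idea can be sketched in Python (the ZIP codes below are made up purely for illustration):

```python
list1 = ["1234AB", "5678CD", "9012EF"]  # hypothetical large list
list2 = {"5678CD", "9999ZZ"}            # hypothetical small list; a set gives O(1) lookups

# 1 if the code also appears in list2, else 0
in_both = [1 if zipcode in list2 else 0 for zipcode in list1]
```

The key point in either tool is the same: match on the code itself rather than on position, so the two lists do not need to be sorted identically or have equal length.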
(Using Python 3)
Given this list named numList: [1,1,2,2,3,3,3,4].
I want to remove exactly one instance of “1” and “3” from numList.
In other words, I want a function that will turn numList into: [1,2,2,3,3,4].
What function will let me remove an X number of elements from a Python list once per element I want to remove?
(The elements I want to remove are guaranteed to exist in the list)
For the sake of clarity, I will give more examples:
[1,2,3,3,4]
Remove 2 and 3
[1,3,4]
[3,3,3]
Remove 3
[3,3]
[1,1,2,2,3,4,4,4,4]
Remove 2, 3 and 4
[1,1,2,4,4,4]
I’ve tried doing this:
numList=[1,2,2,3,3,4,4,4]
remList = [2,3,4]
for x in remList:
    numList.remove(x)
This turns numList into [1,2,3,4,4], which is what I want. However, this has a complexity of:
O(len(numList) * len(remList))
This is a problem because remList and numList can each have a length of 10^5. The program will take a long time to run. Is there a built-in function that does what I want faster?
Also, I would prefer the optimum function which can do this job in terms of space and time because the program needs to run in less than a second and the size of the list is large.
Your approach:
for x in rem_list:
    num_list.remove(x)
is intuitive, and unless the lists are going to be very large I might do that because it is easy to read.
One alternative would be:
result = []
for x in num_list:
    if x in rem_list:
        rem_list.remove(x)
    else:
        result.append(x)
This would be O(len(num_list) * len(rem_list)) in the worst case, and faster than the first solution if len(rem_list) < len(num_list).
If rem_list was guaranteed to not contain any duplicates (as per your examples) you could use a set instead and the complexity would be O(len(num_list)).
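The set idea also generalizes to a rem_list with repeats if you use a multiset (collections.Counter) instead of a plain set; a sketch of that variant:

```python
from collections import Counter

def remove_once(num_list, rem_list):
    """Remove one occurrence of each element of rem_list from num_list,
    in O(len(num_list) + len(rem_list)) time."""
    to_remove = Counter(rem_list)   # multiset of pending removals
    result = []
    for x in num_list:
        if to_remove[x] > 0:
            to_remove[x] -= 1       # consume this one occurrence
        else:
            result.append(x)
    return result
```

For example, `remove_once([1,1,2,2,3,3,3,4], [1,3])` gives `[1,2,2,3,3,4]`, matching the first example in the question, and duplicates in rem_list are each honored once.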
I am using the deepdiff library to find the difference between two dictionaries, which gives output like: A = {'dictionary_item_added': set(["root['mismatched_element']"])}. How do I print just 'mismatched_element'?
Try this:
set_item = A['dictionary_item_added'].pop()
print(set_item[set_item.find("['")+2 : set_item.find("']")])
The first line gets the element from the set; the second slices out everything between the [' and '] and prints it.
This code does the specific task you asked for, but it's hard to generalize the solution without a more generalized question.
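An equivalent sketch using a regular expression instead of find(), with the value of A taken from the question (Python 3):

```python
import re

# The example value from the question
A = {'dictionary_item_added': set(["root['mismatched_element']"])}

set_item = A['dictionary_item_added'].pop()
# Capture whatever sits between the first ['...'] pair
key = re.search(r"\['(.+?)'\]", set_item).group(1)
print(key)
```

The non-greedy `(.+?)` stops at the first `']`, which matters if deepdiff ever reports a nested path containing several bracketed keys.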
Within a large data frame, I have a column containing character strings e.g. "1&27&32" representing a combination of codes. I'd like to split each element in the column, search for a particular code (e.g. "1"), and return the row number if that element does in fact contain the code of interest. I was thinking something along the lines of:
apply(df["MEDS"],2,function(x){x.split<-strsplit(x,"&")if(grep(1,x.split)){return(row(x))}})
But I can't figure out where to go from there since that gives me the error:
Error in apply(df["MEDS"], 2, function(x) { :
dim(X) must have a positive length
Any corrections or suggestions would be greatly appreciated, thanks!
I see a couple of problems here (in addition to the missing semicolon in the function).
df["MEDS"] is more correctly written df[,"MEDS"]. It is a single column. apply() is meant to operate on each column/row of a matrix as if they were vectors. If you want to operate on a single column, you don't need apply().
strsplit() returns a list of vectors. Since you are applying it to a row at a time, the list will have one element (which is a character vector). So you should extract that vector by indexing the list element strsplit(x,"&")[[1]].
You are returning row(x) as if the input to your function were a matrix that knows what row it came from. It does not. apply() will pull each row and pass it to your function as a vector, so row(x) will fail.
There might be other issues as well. I didn't get it fully running.
As I mentioned, you don't need apply() at all. You really only need to look at the 1 column. You don't even need to split it.
OneRows <- which(grepl('(^|&)1(&|$)', df$MEDS))
as Matthew suggested. Or if your intention is to subset the dataframe,
newdf <- df[grepl('(^|&)1(&|$)', df$MEDS),]
I have the file names of four files stored in a cell array called F2000. These files are named:
L14N_2009_2000MHZ.txt
L8N_2009_2000MHZ.txt
L14N_2010_2000MHZ.txt
L8N_2010_2000MHZ.txt
Each file consists of an mxn matrix where m is the same but n varies from file to file. I'd like to store each of the L14N files and each of the L8N files in two separate cell arrays so I can use dlmread in a for loop to store each text file as a matrix in an element of the cell array. To do this, I wrote the following code:
idx2009=~cellfun('isempty',regexp(F2000,'L\d{1,2}N_2009_2000MHZ.txt'));
F2000_2009=F2000(idx2009);
idx2010=~idx2009;
F2000_2010=F2000(idx2010);
cell2009=cell(size(F2000_2009));
cell2010=cell(size(F2000_2010));
for k = 1:numel(F2000_2009)
    cell2009{k}=dlmread(F2000_2009{k});
end
and repeated a similar "for" loop to use on F2000_2010. So far so good. However.
My real data set is much larger than just four files. The total number of files will vary, although I know there will be five years of data for each L\d{1,2}N (so, for instance, L8N_2009, L8N_2010, L8N_2011, L8N_2012, L8N_2013). I won't know what the number of files is ahead of time (although I do know it will range between 50 and 100), and I won't know what the file names are, but they will always be in the same L\d{1,2}N format.
In addition to what's already working, I want to count the number of files that have unique combinations of numbers in the portion of the filename that says L\d{1,2}N so I can further break down F2000_2010 and F2000_2009 in the above example to F2000_2010_L8N and F2000_2009_L8N before I start the dlmread loop.
Can I use regexp to build a list of all of my unique L\d{1,2}N occurrences? Next, can I easily change these list elements to strings to parse the original file names and create a new file name to the effect of L14N_2009, where 14 comes from \d{1,2}? I am sure this is a beginner question, but I discovered regexp yesterday! Any help is much appreciated!
Here is some code which might help:
% Find all the files in your directory
files = dir('*2000MHZ.txt');
files = {files.name};
% match identifiers
ids = unique(cellfun(@(x)x{1},regexp(files,'L\d{1,2}N','match'),...
    'UniformOutput',false));
% find all years
years = unique(cellfun(@(x)x{1},regexp(files,'(?<=L\d{1,2}N_)\d{4,}','match'),...
    'UniformOutput',false));
% find the years for each identifier
for id_ix = 1:length(ids)
    % There is probably a better way to do this
    list = regexp(files,['(?<=' ids{id_ix} '_)\d{4,}'],'match');
    ids_years{id_ix} = cellfun(@(x)x{1},list(cellfun(...
        @(x)~isempty(x),list)),'uniformoutput',false);
end
% If you need dynamic naming, I would suggest dynamic struct names:
for ix_id = 1:length(ids)
    for ix_year = 1:length(ids_years{ix_id})
        % the 'Y' is in the dynamic name because all struct field names must start with a letter
        data.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) =...
            'read in my data here for each one';
    end
end
Also, if anyone is interested in mapping keys to values, try looking into the containers.Map class.