I have written a piece of code that reads in one .fasta file, analyzes a single genetic sequence, performs calculations based on that sequence, and then organizes the calculation results into a single pandas dataframe, which is subsequently exported as a .csv file.
I recently updated the code so that it can parse a .fasta file containing multiple sequences. Although I figured out how to do that, the code in its current form exports one .csv file per sequence. When the .fasta file contains many sequences (over 100, for example), having to sort through so many .csv files becomes laborious.
So instead I am trying to have all of the pandas dataframes exported into a single .csv file. However, I am not sure how to set up the code to make this happen. Right now, the code is built around a for loop that iterates over the values of a dict (where the sequences from the .fasta file are stored). In each iteration, one function is called that creates a dict of the pertinent calculation results, and another function is called that creates a pandas dataframe, fills it with the information from that dict, and exports it as a .csv file.
import pandas as pd
from os import path

for seq in seq_dict.keys():
    result_dict = calculator_func(seq_dict[seq])
    results_df = data_assembler(result_dict)
    # writes one .csv per sequence; the filename varies with seq so the files are not overwritten
    results_df.to_csv(path.join(output_dir, "{}_{}_dataframe.csv".format(project_name, seq)))
It should also be noted that the indices of the dataframes are all based on the numerical positions within the relevant sequence.
In any case, I am having a hard time figuring out exactly how I should combine all the dataframes into one .csv file such that the indices make it possible for the user to tell (a) which sequence a row comes from and (b) which position within that sequence the row corresponds to. Can anybody recommend an approach?
You can set your index as whatever you want, including a string. Try this example:
import pandas as pd
test_frame = pd.DataFrame({"Sequence":[1,2],"Position":[3,4]})
test_frame.index = "Sequence:" + test_frame['Sequence'].astype(str) + "_" + "Position:" + test_frame['Position'].astype(str)
test_frame
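Applied to the loop from the question, a minimal sketch of this idea might look like the following (it reuses seq_dict, calculator_func, data_assembler, output_dir and project_name from the question; the "pos" label in the index is just an illustrative choice):

import pandas as pd
from os import path

all_frames = []
for seq, sequence in seq_dict.items():
    result_dict = calculator_func(sequence)
    results_df = data_assembler(result_dict)
    # prefix the positional index with the sequence name so each row
    # records both the sequence and the position it came from
    results_df.index = ["{}_pos{}".format(seq, pos) for pos in results_df.index]
    all_frames.append(results_df)

# a single dataframe, written out once
pd.concat(all_frames).to_csv(path.join(output_dir, "{}_dataframe.csv".format(project_name)))

Alternatively, skipping the index prefixing and calling pd.concat(all_frames, keys=list(seq_dict)) would keep the original positional indices and add the sequence names as the outer level of a MultiIndex instead.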
I have been searching for days but I can't find an answer to my question.
I need to change a single cell currently named "29" into "Si29" in hundreds of csv files.
The position of the cell is the same in every file [3,7].
Then I need to save the files again (can be under the same name).
For one file I would do:
read_data[3,7] <- "Si29"
However, I have no clue how I apply this to multiple files.
Cheers
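The snippet above looks like R, but since the rest of this compilation is pandas-based, here is a minimal sketch of one way to do the same thing in Python/pandas. The folder name, the glob pattern, and the header handling are assumptions, and note that pandas indexing is 0-based where R's is 1-based:

import glob
import pandas as pd

for csv_path in glob.glob("data/*.csv"):
    # read purely positionally, keeping every value as text
    df = pd.read_csv(csv_path, header=None, dtype=str)
    df.iat[2, 6] = "Si29"                           # R's [3,7] in 0-based indexing
    df.to_csv(csv_path, header=False, index=False)  # save under the same name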
I am new to Python; I have been using Matlab for a long time. Most of the features that Python offers outperform those of Matlab, but I still miss some features of Matlab structures!
Is there a similar way of grouping independent pandas dataframes into a single object? This would be convenient for me since I sometimes have to read data from different locations, and I would like to obtain as many independent dataframes as there are locations, ideally collected in a single object.
Thanks!
I am not sure that I fully understand your question, but this is where I think you are going.
You can use many of the different python data structures to organize pandas dataframes into a single group (List, Dictionary, Tuple). The list is most common, but a dictionary would also work well if you need to call them by name later on rather than position.
Note: this example uses csv files, but the same pattern works with any IO that pandas supports (csv, Excel, txt, or even a call to a database).
import pandas as pd

files = ['File1.csv', 'File2.csv', 'File3.csv']
frames = [pd.read_csv(file) for file in files]  # one dataframe per file
single_df = pd.concat(frames)                   # optional: combine them into a single dataframe
You can use each frame independently by calling it from the list. The following would return the File1.csv dataframe
frames[0]
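If calling them by name is more convenient than by position, a dictionary comprehension works the same way (keying on the file names here is just one option):

frames_by_name = {file: pd.read_csv(file) for file in files}
frames_by_name['File1.csv']  # the dataframe read from File1.csv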
I wrote this list comprehension to export pandas Data Frames to CSV files (each data frame is written to a different file):
[v.to_csv(str(k)+'.csv') for k,v in df_dict.items()]
The pandas Data Frames are the values of a dictionary whose keys become part of the CSV file names. So in the code above, v are the Data Frames, and k are the strings to which the Data Frames are mapped.
A colleague said that using list comprehensions is not a good idea for writing to output files. Why would that be? Moreover, he said that using a for loop for this would be more reliable. If true, why is that so?
A colleague said that using list comprehensions is not a good idea for writing to output files. Why would that be?
List comprehensions are usually more performant and readable than for loops when you are actually building a list (i.e., replacing a for loop that appends to a list).
In other cases like yours, where you only want the "side effect" of each iteration, a for loop is preferred.
Moreover, he said that using a for loop for this would be more reliable. If true, why is that so?
A for loop is more readable and relevant for this use case, IMHO, and should therefore be preferred:
for k, v in df_dict.items():
    v.to_csv(str(k) + '.csv')
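One concrete way to see the difference: to_csv returns None when it is given a file path, so the comprehension in the question builds (and then discards) a list of None values:

result = [v.to_csv(str(k) + '.csv') for k, v in df_dict.items()]
print(result)  # [None, None, ...] -- the list itself carries no information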
I have been reading through some Stack Overflow questions and could not find what I was looking for, or at least I didn't think so when I read the various posts.
I have some training data set up as described here.
So, I am using sklearn.datasets.load_files to read it, since that seemed like a perfect match for the setup.
BUT my files are already bag-of-words tsv files (i.e., each line is a word and its frequency count, separated by a tab).
To be honest, I am not sure how to proceed. The data pulled in by load_files is set up as a list where each element is the contents of one file, including the newline characters. I am not even 100% sure how the Bunch data type tracks which files belong to which classifier folder.
I have worked with scikit-learn and tsvs before, but it was a single tsv file that held all the data, so I used pandas to read it in and then numpy.array to fetch what I needed from it, which is one of the things I attempted here. But I am not sure how to do that with multiple files where the classifier is the folder name, since in that single tsv file each line of training data was an individual example.
Some help on getting the data to a format that is useable for training classifiers would be appreciated.
You could loop over the files and read them to create a list of dictionaries, where each dictionary contains the features and frequencies of one document. Assume a file 1.txt:
import codecs
from sklearn.feature_extraction import DictVectorizer

corpus = []
# loop over your files here and repeat the following for each one
for file_name in ["1.txt"]:
    lines = codecs.open(file_name, encoding='utf8').read().splitlines()
    # each line is "word<TAB>count"; build one dict per document
    corpus.append({line.split("\t")[0]: int(line.split("\t")[1]) for line in lines})

vec = DictVectorizer()
X = vec.fit_transform(corpus)
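To inspect the result, the fitted vectorizer exposes the learned vocabulary (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

print(vec.get_feature_names_out())  # the words seen across all documents
print(X.toarray())                  # one row per document, one column per word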
You can find more details on DictVectorizer in the scikit-learn documentation.
I know that I can create a .dta file if I have a .dat file and a dictionary (.dct) file. However, I want to know whether the reverse is also possible. In particular, if I have a .dta file, is it possible to generate a .dct file along with a .dat file? (Stata has an export command that allows exporting as an ASCII file, but I haven't found a way to generate a .dct file.) StatTransfer does generate .dct and .dat files, but I was wondering whether it is possible without using StatTransfer.
Yes. outfile will create dictionaries as well as export data in ASCII (text) form.
If you want dictionaries and dictionaries alone, you would need to delete the data part.
If you really want two separate files, you would need to split each file produced by outfile.
Either is programmable in Stata, or you could just use your favourite text editor or scripting language.
Dictionaries are in some ways a very good idea, but they are not as important to Stata as they were in early versions.