I've been reading through some Stack Overflow questions and could not find what I was looking for, or at least I didn't think so from the posts I read.
I have some training data set up as described here.
So I am using sklearn.datasets.load_files to read those in, since it looked like a perfect match for that setup.
BUT my files are already bag-of-words TSVs (i.e. each line is a word and its frequency count, separated by a tab).
To be honest, I am not sure how to proceed. The data pulled in by load_files is set up as a list where each element is the contents of each file, including the new line characters. I am not even 100% sure how the Bunch data type is tracking which files belong to which classifier folder.
I have worked with scikit-learn and TSVs before, but that was a single TSV file that held all the data, so I used pandas to read it in and then numpy.array to fetch what I needed from it. That is one of the things I attempted here, but I am not sure how to do it with multiple files where the classifier is the folder name, since in that single TSV file each line of training data stood on its own.
Some help on getting the data into a format that is usable for training classifiers would be appreciated.
You could loop over the files and read them to create a list of dictionaries, where each dictionary contains the features and their frequencies for one document. Assume the file 1.txt:
import codecs
from sklearn.feature_extraction import DictVectorizer

corpus = []
# make a loop over the files here and repeat the following
f = codecs.open("1.txt", encoding='utf8').read().splitlines()
# cast the count to a number so DictVectorizer treats it as a frequency, not a category
corpus.append({line.split("\t")[0]: float(line.split("\t")[1]) for line in f})
# exit the loop here

vec = DictVectorizer()
X = vec.fit_transform(corpus)
You can find more on DictVectorizer here.
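If the class label is the folder a file sits in (as with load_files), one way is to walk the per-class folders yourself and collect a label per document alongside the dictionaries. A minimal sketch, assuming a hypothetical layout like train/<class_name>/<file>.tsv:

import os
import codecs
from sklearn.feature_extraction import DictVectorizer

train_dir = "train"  # hypothetical top-level folder, one sub-folder per class
corpus, labels = [], []
for class_name in sorted(os.listdir(train_dir)):
    class_dir = os.path.join(train_dir, class_name)
    if not os.path.isdir(class_dir):
        continue
    for file_name in sorted(os.listdir(class_dir)):
        lines = codecs.open(os.path.join(class_dir, file_name), encoding='utf8').read().splitlines()
        corpus.append({line.split("\t")[0]: float(line.split("\t")[1]) for line in lines})
        labels.append(class_name)

vec = DictVectorizer()
X = vec.fit_transform(corpus)  # feature matrix, one row per file
y = labels                     # class label per row, taken from the folder name

X and y can then be passed straight to any scikit-learn classifier's fit(X, y).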
I have been searching for days but I can't find an answer to my question.
I need to change a single cell, currently "29", into "Si29" in hundreds of csv files.
The position of the cell is the same in every file: [3,7].
Then I need to save the files again (can be under the same name).
For one file I would do:
read_data[3,7]<-"Si29
However, I have no clue how to apply this to multiple files.
Cheers
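If a non-R route is acceptable, here is a rough pandas sketch of the same edit looped over every file in a folder (the folder pattern and the assumption that the files have no header row are mine, so adjust them to your data):

import glob
import pandas as pd

# hypothetical location of the csv files; adjust the pattern to your folder
for file_name in glob.glob("data/*.csv"):
    df = pd.read_csv(file_name, header=None)
    # R's read_data[3,7] is 1-based, so the same cell is row 2, column 6 in 0-based pandas indexing
    df.iloc[2, 6] = "Si29"
    df.to_csv(file_name, header=False, index=False)  # overwrite the file in place

Note that this rewrites each file in place, so keep a backup of the originals.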
I downloaded daily MODIS DATA LEVEL 3 data for a few months from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form MCD06COSP_M3_MODIS.A2006001.061.2020181145945 but the files do not contain any time dimension. Hence when I use ncecat to concatenate various files, the date information is missing in the resulting file. I want to know how to add the time information in the combined dataset.
Your commands look correct. Good job crafting them. Not sure why it's not working. Possibly the input files are HDF4 format (do they have a .hdf suffix?) and your NCO is not HDF4-enabled. Try to download the files in netCDF3 or netCDF4 format and your commands above should work. If that's not what's wrong, then examine the output files in each step of your procedure and identify which step produces the unintended results and then narrow your question. Good luck.
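If the NCO route keeps failing, one alternative is to concatenate in Python with xarray and attach a time coordinate parsed from the MODIS filenames (AYYYYDDD is year plus day of year). This is only a sketch: it assumes the files can be opened as netCDF and that the *.nc pattern below actually matches your filenames.

import glob
import re
import pandas as pd
import xarray as xr

files = sorted(glob.glob("MCD06COSP_M3_MODIS.A*.nc"))  # assumed suffix; adjust to your files
datasets, times = [], []
for f in files:
    match = re.search(r"\.A(\d{4})(\d{3})\.", f)  # pull year and day-of-year out of the name
    times.append(pd.to_datetime(match.group(1) + match.group(2), format="%Y%j"))
    datasets.append(xr.open_dataset(f))

# stack the datasets along a new "time" dimension carrying the parsed dates
combined = xr.concat(datasets, dim=pd.Index(times, name="time"))
combined.to_netcdf("combined_with_time.nc")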
I'm trying to prepare SEO data from Screaming Frog, Majestic and Ahrefs, and join it before importing said data into BigQuery for analysis.
The Majestic and Ahrefs csv files import after some pruning down to the 100MB limit.
The Screaming Frog CSV file, however, doesn't fully load, only displaying approx 37,000 rows of 193,000. By further pruning less important columns in Excel and reducing the file size (from 44MB to 39MB), the number of rows loaded increases slightly. This would indicate to me that it's not an errant character or cell.
I've made sure (resaved via a text editor) that the CSV file is saved in UTF-8, and checked the limitations of Dataprep to see if there is a limit on the number of cells per Flow/Wrangle, but can find nothing.
The Majestic and AHREFS files are larger and load completely with no issue. There is no data corruption in the Screaming Frog file. Is there something common I'm missing?
Is the total limit for all files 100MB?
Any advice or insight would be appreciated.
To get the full transformation of your files, you should run the recipe.
What you see in the Dataprep Transformer Page is a head sample.
You can take a look at how the sampling works here.
I have written a piece of code that would read in one .fasta file, analyze a single genetic sequence, make calculations based on said sequence, and then organize the calculation results into a single pandas dataframe, which would subsequently be exported as a .csv file.
I have updated the code recently in order for it to parse a .fasta file that contains multiple sequences, and although I figured out how to do it, the code in its current form exports one .csv file per sequence. When the .fasta file contains many sequences (over 100, for example), having to sort through so many .csv files might be somewhat laborious.
So instead I am trying to have all of the pandas dataframes exported into a single .csv file. However, I am not sure how to set up the code to have this occur. Right now, the code is based around a for loop that iterates over the values of a dict (where the sequences from the .fasta file are stored). In each iteration, a function is called that creates a dict full of the pertinent calculation results, and another function is called that creates a pandas dataframe and fills it with the information from the dict, which is then exported as a .csv file.
import pandas as pd
from os import path
for seq in seq_dict.keys():
    result_dict = calculator_func(seq_dict[seq])
    results_df = data_assembler(result_dict)
    results_df.to_csv(path.join(output_dir, "{}_dataframe.csv".format(project_name)))
It should also be noted that the indices of the dataframes are all based on the numerical positions within the relevant sequence.
In any case, I am having a hard time figuring out exactly how I should conglomerate all the dataframes into one .csv file such that the indices make it possible for the user to tell (a) which sequence a row comes from and (b) which position within that sequence it corresponds to. Can anybody recommend some kind of approach?
You can set your index as whatever you want, including a string. Try this example:
import pandas as pd
test_frame = pd.DataFrame({"Sequence":[1,2],"Position":[3,4]})
test_frame.index = "Sequence:" + test_frame['Sequence'].astype(str) + "_" + "Position:" + test_frame['Position'].astype(str)
test_frame
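Building on that, here is a rough sketch of how the loop from the question could collect everything and write a single file, assuming seq_dict, calculator_func, data_assembler, output_dir and project_name as defined there; pd.concat with keys= turns the per-sequence positional indices into a two-level index of sequence name and position:

import pandas as pd
from os import path

frames = []
for seq_name in seq_dict.keys():
    result_dict = calculator_func(seq_dict[seq_name])
    frames.append(data_assembler(result_dict))

# keys= adds the sequence name as an outer index level; each dataframe's existing
# positional index becomes the inner level
combined = pd.concat(frames, keys=list(seq_dict.keys()), names=["sequence", "position"])
combined.to_csv(path.join(output_dir, "{}_dataframe.csv".format(project_name)))

The resulting CSV then starts with a sequence column and a position column in front of the calculated values.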
I know that I can create a dta file if I have a dat file and a dictionary dct file. However, I want to know whether the reverse is also possible. In particular, if I have a dta file, is it possible to generate a dct file along with a dat file? (Stata has an export command that allows export as an ASCII file, but I haven't found a way to generate a dct file.) StatTransfer does generate dct and dat files, but I was wondering if it is possible without using StatTransfer.
Yes. outfile will create dictionaries as well as export data in ASCII (text) form.
If you want dictionaries and dictionaries alone, you would need to delete the data part.
If you really want two separate files, you would need to split each file produced by outfile.
Either is programmable in Stata, or you could just use your favourite text editor or scripting language.
Dictionaries are in some ways a very good idea, but they are not as important to Stata as they were in early versions.