How to load specific columns with varying location from a text file in python? - python-2.7

I'm trying to read the discharge data of 346 US rivers stored online in textfiles. The files are more or less in this format:
Measurement_number Date Gage_height Discharge_value
1 2017-01-01 10 1000
2 2017-01-20 15 2000
# etc.
I only want to read the gage height and discharge value columns.
The problem is that in most files additional columns with metadata are added in front of the 'Gage height' column, so i can not just simply read the 3rd and 4th column because their index varies.
I'm trying to find a way to say 'read the columns with the name 'Gage_height' and 'Discharge_value'', but I haven't succeeded yet.
I hope anyone can help. I'm currently trying to load the text files with numpy.genfromtxt so it would be great to find a solution with that package but other solutions are also more than welcome.
This is my code so far
data_url=urllib2.urlopen(#the url of this specific site)
data=np.genfromtxt(data_url,skip_header=1,comments='#',usecols=2,3])

You can use the names=True option to genfromtxt, and then use the column names to select which columns you want to read with usecols.
For example, to read 'Gage_height' and 'Discharge_value' from your data file:
data = np.genfromtxt(filename, names=True, usecols=['Gage_height', 'Discharge_value'])
Note that you don't need to set skip_header=1 if you use names=True.
You can then access the columns using their names:
gage_height = data['Gage_height'] # == array([ 10., 15.])
discharge_value = data['Discharge_value'] # == array([ 1000., 2000.])
See the docs here for more information.

Related

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field length error) because there are 2 problems with the output .dbf files, as follows:
Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
db.add_fields("TRAJNUMc C(4)") #create new fields
db.add_fields("YYYYMMDDc C(8)")
db.add_fields("TIMEc C(4)")
for record in db: #extract data from fields
dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
db.delete_fields('Trajnum') # delete the incorrect fields
db.delete_fields('Yyyymmdd')
db.delete_fields('Time')
db.delete_fields('1000 201') #delete the unwanted field
db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem that there has been with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information that I need (frustrating because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data to a text from the bad files and then 2. write to a new dbf. (See Ethan Furman's comments/answer below).
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf
table = dbf.Table('/path/to/scramble/table.dbf')
with table:
fixed_data = []
for record in table:
# convert to str/bytes while skipping delete flag
data = record._data[1:].tostring()
trajnum = data[:4]
ymd = data[4:12]
time = data [12:16]
level = data[16:].strip()
fixed_data.extend([trajnum, ymd, time, level])
new_file = open('repaired_data.txt', 'w')
for line in fixed_data:
new_file.write(','.join(line) + '\n')
Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
'dest_table.dbf',
'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
)
with final_table:
for line in raw_data:
fields = line.split(',')
final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date, and number fields if you want to:
# dbf string becomes
'trajnum N; yyyymmdd D; time C(4), level N'
#appending data loop becomes
for line in raw_data:
trajnum, ymd, time, level = line.split(',')
trajnum = int(trajnum)
ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
level = int(level)
final_table.append((trajnum, ymd, time, level))

reading in certain rows from a csv file using python

Say I have a csv file that looks like the following with the first column containing frequencies and the second column containing the power level (dBm).
Frequency | dBm
1 -11.43
2.3 -51.32
2.5 -12.11
2.8 -11.21
3.1 -73.22
3.2 -21.13
I only want to read in the data sets of this file that have a (dBm) value between -13 and -10. Therefore, in this example I only want the data sets (1, -11.43)(2.5, -12.11)(2.8, -11.21) to be read into my program variables x1 and y1. Could someone give me some help in how I could do this?
You can just use the csv library and check if each meets your criteria.
Something like this should work on your file:
with open('file.csv') as csvfile:
reader = csv.reader(csvfile,delimiter=' ')
reader.next()
reader.next()
for row in reader:
a = [float(i) for i in row if i!='']
if a[1]>=-13 and a[1]<=-1:
print (a[0],a[1])
Edit: If you're working with table data I would suggest trying out Pandas, it's really helpful in these situations.

How do I iterate a loop over several data frames in a list in python

I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
df = pd.read_csv(filename, header = None, skiprows = 7)
dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
#concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
column_names = df.apply(lambda x: str(x[1]) + '-'+str(x[2])+' - '+str(x[0]),axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).

How to rename a column of a data frame with part of the data frame identifier in R?

I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (ends with "expression.txt"). I would like to load all of these files into R as data frames, for each data frame rename the column "RPKM" with the identifier of the original file and then join the data frames by "Gene_symbol" into one large data frame with one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {var_name = paste("Data", strsplit(i, "_")[[1]][1], sep = "_"); assign(var_name, read.table(i, header=TRUE)[,c("Gene_symbol", "RPKM")])}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit. (That is in a fully automated way of course, looping through all the data frames I will generate in the real scenario.)
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R. You should always try keeping all your data frames in a list and operate over them using function such as lapply etc. Thus, instead of using assign, just create an empty list of length of your files list and fill it with the for loop
For your current situation, we can fixed it using ls and mget combination in order to pull this data frames from the global environment into a list and then change the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env in order to get them back to the global environment, but you should use with caution
thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like and works fine!!!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {temp[[i]]=read.table(files[i], header=TRUE)[,c("Gene_symbol", "RPKM")]}
for (i in 1:length(temp)) {temp[[i]]=rename(temp[[i]], c("RPKM"=strsplit(files[i], "_")[[1]][1]))}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")

VBA Excel : Read file, and stock images in a column

I am new to VBA (I mean, REALLY new) and I would like you to give me some tips.
I have an Excel file with 2 columns: SKU and media_gallery
I also have images stocked in a folder (lets name it /imageFolder)
I need to parse the imageFolder and look for ALL images sarting by SKU.jpg , and put them into the media_gallery column separated by a semicolon ( ; )
Example: My SKU is "1001", I need to parse the image folder for all images starting by 1001 (all image have this pattern: 1001-2.jpg , 1001-3.jpg etc...)
I can do that in Java or C# but I want to give a chance to VBA. :)
How can I do that?
EDIT: I only need file names yes! And I should of said that I have 20 000 images in my folder, and 8000 SKUs , so I don't know how we can handle looping on 20 000 images names.
EDIT2: If SKU contains a dash ( - ), I don't need to treat it, so I can pass to the next SKU. And each SKU has a maximum of 5 images (....;SKU-5.jpg)
Thanks all.
How to insert images given you have one image name per cell in a column: How to get images to appear in Excel given image url
Take the above and introduce an inner loop for the file name:
if instr(url_column.Cells(i).Value, "-") = 0 then
dim cur_file_name as string
cur_file_name = dir("imageFolder\" & url_column.Cells(i).Value & "*.jpg")
do until len(cur_file_name) = 0
image_column.Cells(i).Value = image_column.Cells(i).Value & cur_file_name & ";"
cur_file_name = Dir
loop
end if