here is a link to my data https://docs.google.com/document/d/1oIiwiucRkXBkxkdbrgFyPt6fwWtX4DJG4nbRM309M20/edit?usp=sharing
My problem is that when I run this in a Jupyter Notebook, I get just the USA map with the colour bar and the lakes in blue. No data appears on the map: neither the labels nor the actual z data.
Here is my header:
import plotly.graph_objs as go
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
init_notebook_mode(connected=True) # For Plotly For Notebooks
cf.go_offline() # For Cufflinks For offline use
Here is my data and layout:
data = dict(type = 'choropleth',
            locations = gb_state['state'],
            locationmode = 'USA-states',
            colorscale = 'Portland',
            text = gb_state['state'],
            z = gb_state['beer'],
            colorbar = {'title': "Styles of beer"}
            )
data
layout = dict(title = 'Styles of beer by state',
geo = dict(scope='usa',
showlakes = True,
lakecolor = 'rgb(85,173,240)')
)
layout
and here is how I fire off the command:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)
Any help, guidelines or pointers would be appreciated
Here is a minimal working example which will give you the desired output.
import pandas as pd
import io
import plotly.graph_objs as go
from plotly.offline import plot
txt = """ state abv ibu id beer style ounces brewery city
0 AK 25 17 25 25.0 25.0 25 25 25
1 AL 10 9 10 10.0 10.0 10 10 10
2 AR 5 1 5 5.0 5.0 5 5 5
3 AZ 44 24 47 47.0 46.0 47 47 47
4 CA 182 135 183 183.0 183.0 183 183 183
5 CO 250 146 265 265.0 263.0 265 265 265
6 CT 27 6 27 27.0 27.0 27 27 27
7 DC 8 4 8 8.0 8.0 8 8 8
8 DE 1 1 2 2.0 2.0 2 2 2
9 FL 56 37 58 58.0 58.0 58 58 58
10 GA 16 7 16 16.0 16.0 16 16 16
"""
gb_state = pd.read_csv(io.StringIO(txt), delim_whitespace=True)
data = dict(type='choropleth',
locations=gb_state['state'],
locationmode='USA-states',
text=gb_state['state'],
z=gb_state['beer'],
)
layout = dict(geo = dict(scope='usa',
showlakes= False)
)
choromap = go.Figure(data=[data], layout=layout)
plot(choromap)
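Note that plot() writes the figure to an HTML file and opens it in your browser. Since you are working in a Jupyter Notebook, the same figure should render inline with iplot() after initialising notebook mode, exactly as in your header (a minimal sketch reusing the offline API imported above):
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)   # load plotly.js into the notebook
iplot(choromap)                      # render the choropleth inline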
I have a data frame like the following:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have an ordered dictionary like the following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to transform the data frame as the OrderedDict specifies: each value holds the 1-based start and end positions of the slice to take from the corresponding column.
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think the logic for this is quite complex in Python pandas. How can I solve it? Thanks.
First, your OrderedDict reuses the key 1, so the second entry overwrites the first; you need to use different keys.
from collections import OrderedDict

d = OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k, v in d.items():
    for k1, v1 in v.items():
        # v1 holds 1-based [start, end] positions, hence the -1 for Python slicing
        if k == 1:
            df[k1] = df.value1.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
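If you do not want the original value1 and value2 columns alongside the new ones, you can drop them at the end (the next answer does the same):
df = df.drop(['value1', 'value2'], axis=1)   # keep only the sliced columns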
I think this would point you in the right direction.
Converting the value1 and value2 columns to string type:
import numpy as np
from collections import OrderedDict

df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1 = OrderedDict([('value1_1', [1, 2]), ('value1_2', [3, 4]), ('value1_3', [5, 7])])
dct_2 = OrderedDict([('value2_1', [1, 1]), ('value2_2', [2, 5]), ('value2_3', [6, 7])])
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting 1 from the even-indexed entries of each list, as string indices start from 0 and not 1:
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete unnecessary records / rows and strings based on multiple conditions / criteria using a Python or R script and save the records into a new .csv file named resultFile.csv.
What I want to do is as follows:
Delete the first column.
Split column BB into two columns named a_id and b_id. Separate the value on _ (underscore): the left side goes to a_id and the right side to b_id.
Keep only records that have the .csv file extension in the BB (files) column, but that do not contain No Bi in the CC (cut) column.
Assign a new name to each of the columns.
Delete the records that contain strings like less in the CC column.
Trim all other unnecessary strings from the records.
Delete the remaining fields of each row after "Mi" is found in that row.
My fileOne.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My first expected results file would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57 Mi
1 3 0 10 27 Mi 0.5
1 6 0 10 Mi 53 cnt
7 9 0 26 46 Mi 121
My final expected results file would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
This can be achieved with the following Python script:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("", "")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)    # Keep digits

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    input_header = next(f_input)
    csv_output.writerow(output_header)

    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        if bb and row[2] not in ['No Bi', 'less']:
            # Remove all columns after 'Mi' if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
To simply remove Mi columns from an existing file the following can be used:
import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass
        csv_output.writerow(row)
Tested using Python 2.7.9
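string.maketrans() with two string arguments only exists in Python 2, so the script above will not run under Python 3. As a rough, hedged sketch (untested against the original data), the digit-keeping sanitiser and the file opening could be ported like this:
import csv

def sanitise_cell(cell):
    # Python 3 replacement for the maketrans/translate trick: keep digit characters only
    return ''.join(ch for ch in cell if ch.isdigit())

# csv output files are opened in text mode with newline='' in Python 3, not 'wb':
with open('fileOne.csv') as f_input, open('resultFile.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    # ... the rest of the loop is unchanged from the Python 2 script above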
I have a text file formatted as such:
# title "Secondary Structure"
# xaxis label "Time (ns)"
# yaxis label "Number of Residues"
#TYPE xy
# subtitle "Structure = A-Helix + B-Sheet + B-Bridge + Turn"
# view 0.15, 0.15, 0.75, 0.85
# legend on
# legend box on
# legend loctype view
# legend 0.78, 0.8
# legend length 2
# s0 legend "Structure"
# s1 legend "Coil"
# s2 legend "B-Sheet"
# s3 legend "B-Bridge"
# s4 legend "Bend"
# s5 legend "Turn"
# s6 legend "A-Helix"
# s7 legend "5-Helix"
# s8 legend "3-Helix"
# s9 legend "Chain_Separator"
0 637 180 201 7 94 129 300 0 47 1
1 617 189 191 11 99 121 294 5 48 1
2 625 183 198 7 97 130 290 0 53 1
3 625 180 195 5 102 125 300 0 51 1
4 622 185 196 5 99 117 304 0 52 1
5 615 192 190 5 106 121 299 0 45 1
6 629 187 196 7 102 122 304 0 40 1
I'm trying to match the lines starting with "s+number" (s0, s1, s2, ..., s9) and save the values between "" in a list, so I can then use this list for naming the columns.
list <- c("Structure", "Coil", "B-Sheet", ..., "Chain_Separator")
names(data) <- list
The problem is that I can't match the single words but only the entire lines.
grep('s\\d\\s[a-z]{6}\\s\"([A-z-9]+)\"',readLines("file.xvg"),perl=T,value=T)
[1] "# s0 legend \"Structure\"" "# s1 legend \"Coil\""
[3] "# s2 legend \"B-Sheet\"" "# s3 legend \"B-Bridge\""
[5] "# s4 legend \"Bend\"" "# s5 legend \"Turn\""
[7] "# s6 legend \"A-Helix\"" "# s9 legend \"Chain_Separator\""
I tried several regexes, like '# s[0-9] [a-z]+ "([A-z-9]+)"', all working in Perl, but in R I'm always matching the entire line and not the word.
Isn't the () used to capture the value? What am I doing wrong?
You can do this:
conn = file(fileName,open="r")
lines=readLines(conn)
lst = Filter(function(u) grepl('^# s[0-9]+', u), lines)
result = gsub('.*\"(.*)\".*','\\1',lst)
close(conn)
#> result
#[1] "Structure" "Coil" "B-Sheet" "B-Bridge" "Bend" "Turn" "A-Helix" "5-Helix"
#[9] "3-Helix" "Chain_Separator"
You can use a system command in fread(). For example, on a file named "file.txt" you can do
library(data.table)
fread("grep '^# s[0-9]\\+' file.txt", header = FALSE, select = 4)[[1]]
# [1] "Structure" "Coil" "B-Sheet"
# [4] "B-Bridge" "Bend" "Turn"
# [7] "A-Helix" "5-Helix" "3-Helix"
# [10] "Chain_Separator"
Note: This uses data.table dev version 1.9.5
Basically the area you're looking for in the text has four columns. ^# s[0-9]\\+ looks for lines that begin with # and then a space, then s, then any number of digits. select = 4 takes the last column, and [[1]] drops it down from a single column data table into a character vector.
Thanks to #BrodieG for help with the regex.
If you use Linux, awk commands can be combined with read.table using pipe():
read.table(pipe("awk 'BEGIN {FS=\" \"}/# s[0-9]/ { print$4 }' fra.txt"),
stringsAsFactors=FALSE)$V1
# [1] "Structure" "Coil" "B-Sheet" "B-Bridge"
# [5] "Bend" "Turn" "A-Helix" "5-Helix"
# [9] "3-Helix" "Chain_Separator"
The above command also works with fread
fread("awk 'BEGIN {FS=\" \"}/# s[0-9]/ { print$4 }' fra.txt",
header=FALSE)$V1
I would like to split strings on the first and last comma. Each string has at least two commas. Below is an example data set and the desired result.
A similar question here asked how to split on the first comma: Split on first comma in string
Here I asked how to split strings on the first two colons: Split string on first two colons
Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
desired.result <- read.table(text='
my.string1 my.string2 my.string3 some.data
123 34,56,78 90 10
87 65,43 21 20
a4 b6 c8888 30
11 bbbb ccccc 40
uu vv,ww xx 50
j k,l,m,n,o p 60', header = TRUE, stringsAsFactors=FALSE)
You can use the \K operator, which keeps text already matched out of the result, together with a look-ahead assertion to do this (well, almost: there is an annoying comma at the start of the middle portion, which the tweak after the explanation below gets rid of). But I enjoyed this as an exercise in constructing a regex...
x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123" ",34,56,78" "90"
Explanation:
^[^,]+ : from the start of the string match one or more characters that are not a ,
\\K : but don't include those matched characters in the match
So the first alternative matches at a zero-width position just after the first field, which is why the comma survives at the start of the middle piece...
| : or you can match...
,(?=[^,]+$) : a , so long as it is followed (the (?=...) look-ahead) by one or more characters that are not a , through to the end of the string ($)...
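One way to get rid of that stray comma is to let the first alternative consume the first comma as well, rather than splitting at a zero-width position (a small tweak on the pattern above, only checked against this example):
strsplit( x , "^[^,]+\\K,|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123"      "34,56,78" "90"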
Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons, producing s. Then we read s using sep=";" and finally cbind the rest of my.data to it:
s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])
giving:
mystring1 mystring2 mystring3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Here is code to split on the first and last comma. This code draws heavily from an answer by #bdemarest here: Split string on first two colons. The gsub pattern below, which is the meat of the answer, contains important differences. The code for creating the new data frame after the strings are split is the same as that of #bdemarest.
# Replace first and last commas with colons.
new.string <- gsub(pattern="(^[^,]+),(.+),([^,]+$)",
replacement="\\1:\\2:\\3", x=my.data$my.string)
new.string
# Split on colons
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 my.string3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60
# Here is code for splitting strings on the first comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace first comma with colon
new.string <- gsub(pattern="(^[^,]+),(.+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123 34,56,78,90 10
# 2 87 65,43,21 20
# 3 a4 b6,c8888 30
# 4 11 bbbb,ccccc 40
# 5 uu vv,ww,xx 50
# 6 j k,l,m,n,o,p 60
# Here is code for splitting strings on the last comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace last comma with colon
new.string <- gsub(pattern="^(.+),([^,]+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create new data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123,34,56,78 90 10
# 2 87,65,43 21 20
# 3 a4,b6 c8888 30
# 4 11,bbbb ccccc 40
# 5 uu,vv,ww xx 50
# 6 j,k,l,m,n,o p 60
You can do a simple strsplit here on that column:
popshift <- sapply(strsplit(my.data$my.string, ","), function(x)
  c(x[1], paste(x[2:(length(x)-1)], collapse=","), x[length(x)]))
desired.result <- cbind(data.frame(my.string=t(popshift)), my.data[-1])
I just split up all the values and make a new vector with the first, last, and middle strings. Then I cbind that with the rest of the data. The result is:
my.string.1 my.string.2 my.string.3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Using str_match() from package stringr, and a little help from one of your links,
> library(stringr)
> data.frame(str_match(my.data$my.string, "(.+?),(.*),(.+?)$")[,-1],
some.data = my.data$some.data)
# X1 X2 X3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60