Creating A Dataframe From A Text Dataset

Creating A Dataframe From A Text Dataset - regex

I have a dataset that has hundreds of thousands of fields. The following is a simplified dataset
dataSet <- c("Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb DFStorLocLevel",
"0231 0002 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD X A A A 18 136 30 29 50 43 24.88 51.000 EA",
"0231 0002 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD X B B A 16 17 3 3 5 4 483.87 1.000 EA X",
"0231 0002 WH.920569 SPINDLE MOTOR MINI O 22 PD X A A A 69 85 15 9 25 13 680.91 21.000 EA",
"0231 0002 GB.C150583-00001 VALVE-AIR MDI 64 PD X A A A 16 113 50 35 80 52 19.96 116.000 EA",
"0231 0002 FG.124-0140 BEARING 32 PD X A A A 36 205 35 32 50 48 21.16 55.000 EA",
"0231 0002 WP.254997 BEARING,BALL .9843 X 2.04 52 PD X A A A 18 155 50 39 100 58 2.69 181.000 EA"
)
I would like to create a dataframe out of this dataSet for further calculation. The approach I am following is as follows:
I split the dataSet by space and then recombine it.
dataSetSplit <- strsplit(dataSet, "\\s+")
The header (which is the first line) splits correctly and produces 25 characters. This can be seen by the str() function.
str(dataSetSplit)
I will then intend to combine all the rows together using the folloing script
combinedData <- data.frame(do.call(rbind, dataSetSplit))
Please note that the above script "combinedData " errors because the split did not produce equal number of fields.
For this approach to work all the fields must split correctly into 25 fields.
If you think this is a sound approach please let me know how to split the fileds into 25 fields.
It is worth mentioning that I do not like the approach of splitting the data set with the function strsplit(). It is an extremely time consuming step if used with a large data set. Can you please recommend an alternate approach to create a data frame out of the supplied data?

By the looks of it, you have a header row that is actually helpful. You can easily use gregexpr to calculate your "widths" to use with read.fwf.
Here's how:
## Use gregexpr to find the position of consecutive runs of spaces
## This will tell you the starting position of each column
Widths <- gregexpr("\\s+", dataSet[1])[[1]]
## `read.fwf` doesn't need the starting position, but the width of
## each column. We can use `diff` to calculate this.
Widths <- c(Widths[1], diff(Widths))
## Since there are no spaces after the last column, we need to calculate
## a reasonable width for that column too. We can do this with `nchar`
## to find the widest row in the data. From this, subtract the `sum`
## of all the previous values.
Widths <- c(Widths, max(nchar(dataSet)) - sum(Widths))
Let's also extract the column names. We could do this in read.fwf, but it would require us to substitute the spaces in the first line with a "sep" character.
Names <- scan(what = "", text = dataSet[1])
Now, read in everything except the first line. You would use the actual file instead of textConnection, I would suppose.
read.fwf(textConnection(dataSet), widths=Widths, strip.white = TRUE,
skip = 1, col.names = Names)
# Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty
# 1 231 2 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD NA X A A A 18 136
# 2 231 2 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD NA X B B A 16 17
# 3 231 2 WH.920569 SPINDLE MOTOR MINI O 22 PD NA X A A A 69 85
# 4 231 2 GB.C150583-00001 VALVE-AIR MDI 64 PD NA X A A A 16 113
# 5 231 2 FG.124-0140 BEARING 32 PD NA X A A A 36 205
# 6 231 2 WP.254997 BEARING,BALL .9843 X 2.04 52 PD NA X A A A 18 155
# CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb
# 1 NA NA 30 29 50 43 NA 24.88 51 EA <NA>
# 2 NA NA 3 3 5 4 NA 483.87 1 EA X
# 3 NA NA 15 9 25 13 NA 680.91 21 EA <NA>
# 4 NA NA 50 35 80 52 NA 19.96 116 EA <NA>
# 5 NA NA 35 32 50 48 NA 21.16 55 EA <NA>
# 6 NA NA 50 39 100 58 NA 2.69 181 EA <NA>
# DFStorLocLevel
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA

Many thanks to Ananda Mahto, he provided many pieces to this answer.
widthMinusFirst <- diff(gregexpr('(\\s[A-Z])+', dataSet[1])[[1]])
widthFirst <- gregexpr('\\s+', dataSet[1])[[1]][1]
Width <- c(widthFirst, widthMinusFirst)
Widths <- c(Width, max(nchar(dataSet)) - sum(Width))
columnNames <- scan(what = "", text = dataSet[1])
read.fwf(textConnection(dataSet[-1]), widths = Widths, strip.white = FALSE,
skip = 0, col.names = columnNames)

Related

plotly choropleth not plotting data

here is a link to my data https://docs.google.com/document/d/1oIiwiucRkXBkxkdbrgFyPt6fwWtX4DJG4nbRM309M20/edit?usp=sharing
My problem is that when I run this in a Jupyter Notebook. I get just the USA map with the colour bar and the lakes in blue. No data is on the map, not the labels nor the actual z data.
Here is my header:
import plotly.graph_objs as go
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
init_notebook_mode(connected=True) # For Plotly For Notebooks
cf.go_offline() # For Cufflinks For offline use
%matplotlib inline
init_notebook_mode(connected=True) # For Plotly For Notebooks
cf.go_offline() # For Cufflinks For offline use
Here is my data and layout:
data = dict(type='choropleth',
locations = gb_state['state'],
locationmode = 'USA-states',
colorscale = 'Portland',
text =gb_state['state'],
z = gb_state['beer'],
colorbar = {'title':"Styles of beer"}
)
data
layout = dict(title = 'Styles of beer by state',
geo = dict(scope='usa',
showlakes = True,
lakecolor = 'rgb(85,173,240)')
)
layout
and here is how I fire off the command:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)
Any help, guidelines or pointers would be appreciated

Here is a minified working example which will give you the desired output.
import pandas as pd
import io
import plotly.graph_objs as go
from plotly.offline import plot
txt = """ state abv ibu id beer style ounces brewery city
0 AK 25 17 25 25.0 25.0 25 25 25
1 AL 10 9 10 10.0 10.0 10 10 10
2 AR 5 1 5 5.0 5.0 5 5 5
3 AZ 44 24 47 47.0 46.0 47 47 47
4 CA 182 135 183 183.0 183.0 183 183 183
5 CO 250 146 265 265.0 263.0 265 265 265
6 CT 27 6 27 27.0 27.0 27 27 27
7 DC 8 4 8 8.0 8.0 8 8 8
8 DE 1 1 2 2.0 2.0 2 2 2
9 FL 56 37 58 58.0 58.0 58 58 58
10 GA 16 7 16 16.0 16.0 16 16 16
"""
gb_state = pd.read_csv(io.StringIO(txt), delim_whitespace=True)
data = dict(type='choropleth',
locations=gb_state['state'],
locationmode='USA-states',
text=gb_state['state'],
z=gb_state['beer'],
)
layout = dict(geo = dict(scope='usa',
showlakes= False)
)
choromap = go.Figure(data=[data], layout=layout)
plot(choromap)

How to get cut needed value in to different col in python dataframe？

I have a data frame like follow:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have a ordered dictionary like following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to changed the data frame as the OrderedDict needed.
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think it is really a complex logic in python pandas. How can I solve it? Thanks.

First, your OrderedDict overwrites the same key, you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k,v in d.items():
for k1,v1 in v.items():
if k == 1:
df[k1] = df.value1.apply(lambda x : int(str(x)[v1[0]-1:v1[1]]))
else:
df[k1] = df.value2.apply(lambda x : int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0

I think this would point you in the right direction.
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1,dct_2 = OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])]),
OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])])
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting the even slices of the list by 1 as the string indices start from 0 and not 1:
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00

Delete or remove unexpected records and strings based on multiple criteria by python or R script

I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete unnecessary records / rows and strings based on multiple condition / criteria using a Python or R script and save the records into a new .csv file named resultFile.csv.
What I want to do is as follows:
Delete the first column.
Split column BB into two column named as a_id, and c_id. Separate the value by _ (underscore) and left side will go to a_id, and right side will go to c_id.
Keep only records that have the .csv file extension in the files column, but do not contain No Bi in cut column.
Assign new name to each of the columns.
Delete the records that contain strings like less in the CC column.
Trim all other unnecessary string from the records.
Delete the reamining filds of each rows after I find the "Mi" in each rows.
My fileOne.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My 1st expected results files would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57 Mi
1 3 0 10 27 Mi 0.5
1 6 0 10 Mi 53 cnt
7 9 0 26 46 Mi 121
My final expected results files would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46

This can be achieved with the following Python script:
import csv
import re
import string
output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']
sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)
def sanitise_cell(cell):
return cell.translate(sanitise_table, nodigits_table) # Keep digits
with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)
input_header = next(f_input)
csv_output.writerow(output_header)
for row in csv_input:
bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
if bb and row[2] not in ['No Bi', 'less']:
# Remove all columns after 'Mi' if present
try:
mi = row.index('Mi')
row[:] = row[:mi] + [''] * (len(row) - mi)
except ValueError:
pass
row[:] = [sanitise_cell(col) for col in row]
row[0] = bb.group(1)
row[1] = bb.group(2)
csv_output.writerow(row)
To simply remove Mi columns from an existing file the following can be used:
import csv
with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)
for row in csv_input:
try:
mi = row.index('Mi')
row[:] = row[:mi] + [''] * (len(row) - mi)
except ValueError:
pass
csv_output.writerow(row)
Tested using Python 2.7.9

match a word an not the entire line in R

I have a text file formatted as such:
# title "Secondary Structure"
# xaxis label "Time (ns)"
# yaxis label "Number of Residues"
#TYPE xy
# subtitle "Structure = A-Helix + B-Sheet + B-Bridge + Turn"
# view 0.15, 0.15, 0.75, 0.85
# legend on
# legend box on
# legend loctype view
# legend 0.78, 0.8
# legend length 2
# s0 legend "Structure"
# s1 legend "Coil"
# s2 legend "B-Sheet"
# s3 legend "B-Bridge"
# s4 legend "Bend"
# s5 legend "Turn"
# s6 legend "A-Helix"
# s7 legend "5-Helix"
# s8 legend "3-Helix"
# s9 legend "Chain_Separator"
0 637 180 201 7 94 129 300 0 47 1
1 617 189 191 11 99 121 294 5 48 1
2 625 183 198 7 97 130 290 0 53 1
3 625 180 195 5 102 125 300 0 51 1
4 622 185 196 5 99 117 304 0 52 1
5 615 192 190 5 106 121 299 0 45 1
6 629 187 196 7 102 122 304 0 40 1
I'm trying to to match the lines starting with "s+number" (s0,s1,s2,...s9) and save the values between "" in a list so I can then use this list for naming the columns.
list <- c("Structure", "Coil","B-Sheet", ..., "Chain_Separato")
names(data) <- list
The problem is that I can't match the single words but only the entire lines.
grep('s\\d\\s[a-z]{6}\\s\"([A-z-9]+)\"',readLines("file.xvg"),perl=T,value=T)
[1] "# s0 legend \"Structure\"" "# s1 legend \"Coil\""
[3] "# s2 legend \"B-Sheet\"" "# s3 legend \"B-Bridge\""
[5] "# s4 legend \"Bend\"" "# s5 legend \"Turn\""
[7] "# s6 legend \"A-Helix\"" "# s9 legend \"Chain_Separator\""
I tried several regex, like '# s[0-9] [a-z]+ "([A-z-9]+)"', all working in perl but in R I'm always matching the entire line and not the word.
Isn't the () used to capture the value? What am I doing wrong?

You can do this:
conn = file(fileName,open="r")
lines=readLines(conn)
lst = Filter(function(u) grepl('^# s[0-9]+', u), lines)
result = gsub('.*\"(.*)\".*','\\1',lst)
close(conn)
#> result
#[1] "Structure" "Coil" "B-Sheet" "B-Bridge" "Bend" "Turn" "A-Helix" "5-Helix"
#[9] "3-Helix" "Chain_Separator"

You can use a system command in fread(). For example, on a file named "file.txt" you can do
library(data.table)
fread("grep '^# s[0-9]\\+' file.txt", header = FALSE, select = 4)[[1]]
# [1] "Structure" "Coil" "B-Sheet"
# [4] "B-Bridge" "Bend" "Turn"
# [7] "A-Helix" "5-Helix" "3-Helix"
# [10] "Chain_Separator"
Note: This uses data.table dev version 1.9.5
Basically the area you're looking for in the text has four columns. ^# s[0-9]\\+ looks for lines that begin with # and then a space, then s, then any number of digits. select = 4 takes the last column, and [[1]] drops it down from a single column data table into a character vector.
Thanks to #BrodieG for help with the regex.

If you use linux, awk commands can be combined with read.table using pipe
read.table(pipe("awk 'BEGIN {FS=\" \"}/# s[0-9]/ { print$4 }' fra.txt"),
stringsAsFactors=FALSE)$V1
# [1] "Structure" "Coil" "B-Sheet" "B-Bridge"
# [5] "Bend" "Turn" "A-Helix" "5-Helix"
# [9] "3-Helix" "Chain_Separator"
The above command also works with fread
fread("awk 'BEGIN {FS=\" \"}/# s[0-9]/ { print$4 }' fra.txt",
header=FALSE)$V1

split strings on first and last commas

I would like to split strings on the first and last comma. Each string has at least two
commas. Below is an example data set and the desired result.
A similar question here asked how to split on the first comma: Split on first comma in string
Here I asked how to split strings on the first two colons: Split string on first two colons
Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
desired.result <- read.table(text='
my.string1 my.string2 my.string3 some.data
123 34,56,78 90 10
87 65,43 21 20
a4 b6 c8888 30
11 bbbb ccccc 40
uu vv,ww xx 50
j k,l,m,n,o p 60', header = TRUE, stringsAsFactors=FALSE)

You can use the \K operator which keeps text already matched out of the result and a negative look ahead assertion to do this (well almost, there is an annoying comma at the start of the middle portion which I am yet to get rid of in the strsplit). But I enjoyed this as an exercise in constructing a regex...
x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123" ",34,56,78" "90"
Explantion:
^[^,]+ : from the start of the string match one or more characters that are not a ,
\\K : but don't include those matched characters in the match
So the first match is the first comma...
| : or you can match...
,(?=[^,]+$) : a , so long as it is followed by [(?=...)] one or more characters that are not a , until the end of the string ($)...

Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons producing s. Then we read s using sep=";" and finally cbind the rest of my.data to it:
s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])
giving:
mystring1 mystring2 mystring3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60

Here is code to split on the first and last comma. This code draws heavily from an answer by #bdemarest here: Split string on first two colons The gsub pattern below, which is the meat of the answer, contains important differences. The code for creating the new data frame after strings are split is the same as that of #bdemarest
# Replace first and last commas with colons.
new.string <- gsub(pattern="(^[^,]+),(.+),([^,]+$)",
replacement="\\1:\\2:\\3", x=my.data$my.string)
new.string
# Split on colons
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 my.string3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60
# Here is code for splitting strings on the first comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace first comma with colon
new.string <- gsub(pattern="(^[^,]+),(.+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123 34,56,78,90 10
# 2 87 65,43,21 20
# 3 a4 b6,c8888 30
# 4 11 bbbb,ccccc 40
# 5 uu vv,ww,xx 50
# 6 j k,l,m,n,o,p 60
# Here is code for splitting strings on the last comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace last comma with colon
new.string <- gsub(pattern="^(.+),([^,]+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create new data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123,34,56,78 90 10
# 2 87,65,43 21 20
# 3 a4,b6 c8888 30
# 4 11,bbbb ccccc 40
# 5 uu,vv,ww xx 50
# 6 j,k,l,m,n,o p 60

You can do a simple strsplit here on that column
popshift<-sapply(strsplit(my.data$my.string,","), function(x)
c(x[1], paste(x[2:(length(x)-1)],collapse=","), x[length(x)]))
desired.result <- cbind(data.frame(my.string=t(popshift)), my.data[-1])
I just split up all the values and make a new vector with the first, last and middle strings. Then i cbind that with the rest of the data. The result is
my.string.1 my.string.2 my.string.3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60

Using str_match() from package stringr, and a little help from one of your links,
> library(stringr)
> data.frame(str_match(my.data$my.string, "(.+?),(.*),(.+?)$")[,-1],
some.data = my.data$some.data)
# X1 X2 X3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Creating A Dataframe From A Text Dataset - regex

Related

plotly choropleth not plotting data

How to get cut needed value in to different col in python dataframe？

Delete or remove unexpected records and strings based on multiple criteria by python or R script

match a word an not the entire line in R

split strings on first and last commas

Categories

Resources