How to move a word position in a sentence in pyspark

I have the following street addresses:
- KR 71D 6 94 SUR LC 1709
- KR 24B 15 20 SUR AP 301
- KR 72F 39 42 SUR
- KR 72F SUR 39 42
- KR 72 SUR 39 42
What I need is to detect the word SUR only when it comes after the address plate (the two numbers), remove it from there, and place it right after the main address instead. For example:
- KR 71D 6 94 SUR LC 1709 <-- Change it to: KR 71D SUR 6 94 LC 1709
- KR 24B 15 20 SUR AP 301 <-- Change it to: KR 24B SUR 15 20 AP 301
- KR 72F 39 42 SUR <-- Change it to: KR 72F SUR 39 42
- KR 72F SUR 39 42 <-- It is ok, leave it this way
- KR 72 SUR 39 42 <-- It is ok, leave it this way
Thanks a lot, and I hope somebody could help me.

You can try this:
import re
lyst = ["KR 71D 6 94 SUR LC 1709","KR 24B 15 20 SUR AP 301","KR 72F 39 42 SUR","KR 72F SUR 39 42","KR 72 SUR 39 42"]
comp = re.compile(r'([a-zA-Z]+)(\s)(\w+)\s(\d+)\s(\d+)\s([a-zA-Z]+)(.*)$')
Logic:
Each pair of parentheses captures one piece of the address: group 1 is the street type, group 2 is a single space, group 3 is the street number, and groups 4 and 5 are the two plate numbers. The word we want to move (SUR) is the fifth word, and it lands in \6 rather than \5 because the space is captured as a group too. Group 7, (.*), picks up everything after it. re.sub then rebuilds the string with \6 moved to the third position, reusing the captured space \2 as the separator. Note that group 6 matches any alphabetic word, not only SUR. For the last two strings the pattern does not match, so nothing is replaced and the strings stay as they are.
newlyst = []
for items in lyst:
    newlyst.append(re.sub(comp, r'\1\2\3\2\6\2\4\2\5\7', items))
You can print the newlyst to see the output:
Output:
['KR 71D SUR 6 94 LC 1709', 'KR 24B SUR 15 20 AP 301', 'KR 72F SUR 39 42', 'KR 72F SUR 39 42', 'KR 72 SUR 39 42']
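
A slightly stricter variant of the same idea, as a minimal sketch: it matches the literal word SUR instead of any alphabetic word, so other suffixes are left untouched. The function name move_sur and the assumption that the plate is always two all-digit tokens are illustrative and not taken from the question. In PySpark, a function like this could probably be wrapped in a UDF and applied to the address column.

```python
import re

# Assumption: addresses look like "<type> <street> <num> <num> SUR [rest]".
# Group 1 = street type and street, group 2 = the two plate numbers,
# group 3 = whatever follows SUR (may be empty).
SUR_AFTER_PLATE = re.compile(r'^(\w+\s+\w+)\s+(\d+\s+\d+)\s+SUR\b(.*)$')


def move_sur(address):
    """Move SUR from after the plate to right after the street.

    Addresses that do not match the pattern are returned unchanged.
    """
    return SUR_AFTER_PLATE.sub(r'\1 SUR \2\3', address)


addresses = [
    "KR 71D 6 94 SUR LC 1709",
    "KR 24B 15 20 SUR AP 301",
    "KR 72F 39 42 SUR",
    "KR 72F SUR 39 42",
    "KR 72 SUR 39 42",
]
fixed = [move_sur(a) for a in addresses]
```

Anchoring the pattern with ^ and $ means an address is either rewritten as a whole or not touched at all.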

Related

Using factor inputs in a caret-driven shiny app

I am trying to develop a simple shiny app which takes in patient data and predicts the probability of a disease condition using caret.
Diagnosis  Age  Creatine
Chronic    70   765
Chronic    80   784
Chronic    72   692
Chronic    88   965
Chronic    68   1065
Chronic    75   1005
Acute      56   445
Acute      67   378
Acute      78   501
Acute      45   678
Acute      37   776
Acute      39   644
The following code works and returns a probability value.
library(shiny)
library(caret)
library(readxl)

hepC_lite <- read_excel("hepC_lite.xlsx")
model_mars <- train(Diagnosis ~ ., data = hepC_lite, method = "earth")

ui <- fluidPage(
  numericInput("age", label = "Age", value = 50, min = 1, max = 99),
  numericInput("creatine", label = "Creatine", value = 100, min = 1, max = 2000),
  actionButton("submitButton", "Submit"),
  tableOutput("userDefinedTable"),
  textOutput('probability')
)

server <- function(input, output) {
  values <- reactiveValues()
  observeEvent(input$submitButton, {
    values$new_row <- data.frame(Age = input$age, Creatine = input$creatine)
    values$predicted_mars <- predict(model_mars, values$new_row, type="prob")[,2]
  })
  output$userDefinedTable <- renderTable(values$new_row)
  output$probability <- renderText(values$predicted_mars)
}

shinyApp(ui = ui, server = server)
To include a factor variable (Gender) in the prediction, I am first one-hot encoding the dataset.
Diagnosis  Age  Creatine  Gender
Chronic    70   765       M
Chronic    80   784       M
Chronic    72   692       F
Chronic    88   965       M
Chronic    68   1065      M
Chronic    75   1005      F
Acute      56   445       F
Acute      67   378       F
Acute      78   501       F
Acute      45   678       M
Acute      37   776       M
Acute      39   644       F
#one-hot encoding
x = hepC_lite[, 2:4]
y = hepC_lite$Diagnosis
dummy_model <- dummyVars(Diagnosis ~ ., data = hepC_lite)
trainData <- predict(dummy_model, newdata = hepC_lite)
hepC_lite <- data.frame(trainData)
hepC_lite$Diagnosis <- y
How should I edit the following line to include the Gender variable?
values$new_row <- data.frame(Age = input$age, Creatine = input$creatine)
Running this line with Gender = input$gender causes an error - Warning: Error in eval: object 'Gender.F' not found

How to extract only the weight from the title using regular expressions?

I need to extract the weight from the titles below. Since they don't all follow the same format, I couldn't get the desired results.
Pommes d'Aquitaine et mangue (dès 4 mois) - 2 x 130 g (Babybio)
Céréales vanille avec quinoa - à partir de 6 mois - 220 g (Babybio)
Ratatouille riz (dès 12 mois) 2 x 200 g (Babybio)
Pomme de terre, petits pois et jambon (dès 8 mois) 2 x 200 g
Fondue de carotte et maïs doux au quinoa (dès 12 mois) 230 g (Babybio)
Gourdes de fruits: pomme d'Aquitaine, poire et pêche - dès 6 mois - 4 x 90 g (Babybio)
Douceur de panais du Val de Loire, carotte et riz (dès 12 mois) 230 g (Babybio)
Expected results:
2 x 130 g
220 g
2 x 200 g
230 g
4 x 90 g
230 g
I tried this pattern:
[0-9]+ x \d+ g

Try this:
\d+( ?x ?\d+)? g
This allows:
220 g
220x10 g
10 x23 g
45x 78 g
...

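This will work. Your regex was only missing the cases where there is no x:
[0-9]+( x )?\d+ g
The (...) groups everything enclosed in it, and the ? makes that group optional (zero or one occurrence). One caveat: when there is no x, [0-9]+ and \d+ each need at least one digit, so a single-digit weight such as "5 g" would not match.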
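
As a quick check of the optional-group idea, here is a minimal Python sketch that runs the first pattern against the titles from the question. The helper name extract_weight, the non-capturing group, and the trailing \b are my additions for illustration and not part of either answer.

```python
import re

# \d+            the weight, or the pack count when " x " follows
# (?: ?x ?\d+)?  optional "x <weight>" part, with optional spaces around x
# " g\b"         the unit, as a whole word
WEIGHT = re.compile(r'\d+(?: ?x ?\d+)? g\b')


def extract_weight(title):
    """Return the first weight found in the title, or None if there is none."""
    match = WEIGHT.search(title)
    return match.group(0) if match else None


titles = [
    "Pommes d'Aquitaine et mangue (dès 4 mois) - 2 x 130 g (Babybio)",
    "Céréales vanille avec quinoa - à partir de 6 mois - 220 g (Babybio)",
    "Ratatouille riz (dès 12 mois) 2 x 200 g (Babybio)",
    "Pomme de terre, petits pois et jambon (dès 8 mois) 2 x 200 g",
    "Fondue de carotte et maïs doux au quinoa (dès 12 mois) 230 g (Babybio)",
]
weights = [extract_weight(t) for t in titles]
```

The ages ("4 mois", "12 mois") are skipped because the digits there are not followed by " g".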

Create date variable from time (Using SAS 9.3)

Using SAS 9.3
I have files with two variables (time and pulse), one file for each person.
For each person I know the date on which the measurement started.
Now I want to create a date variable that moves on to the next day at midnight. How can I do that?
Example from text files:
23:58:02 106
23:58:07 105
23:58:12 103
23:58:17 98
23:58:22 100
23:58:27 97
23:58:32 99
23:58:37 100
23:58:42 99
23:58:47 104
23:58:52 95
23:58:57 96
23:59:02 98
23:59:07 96
23:59:12 104
23:59:17 109
23:59:22 105
23:59:27 111
23:59:32 111
23:59:37 104
23:59:42 110
23:59:47 100
23:59:52 106
23:59:57 114
00:00:02 123
00:00:07 130
00:00:12 130
00:00:17 125
00:00:22 119
00:00:27 116
00:00:32 122
00:00:37 116
00:00:42 119
00:00:47 117
00:00:52 114
00:00:57 114
00:01:02 110
00:01:07 103
00:01:12 98
00:01:17 98
00:01:22 102
00:01:27 97
00:01:32 99
00:01:37 93
00:01:42 97
00:01:47 103
00:01:52 96
00:01:57 93
00:02:02 93
00:02:07 95
00:02:12 106
00:02:17 99
00:02:22 102
00:02:27 96
00:02:32 93
00:02:37 97
00:02:42 102
00:02:47 101
00:02:52 95
00:02:57 92
00:03:02 100
00:03:07 95
00:03:12 102
00:03:17 102
00:03:22 109
00:03:27 109
00:03:32 107
00:03:37 111
00:03:42 112
00:03:47 113
00:03:52 115

Regex:
\d{2}:\d{2}:\d{2} \d*
See here for an example and play around with regex:
https://regex101.com/r/xF1fQ5/1
EDIT: and have a look at the SAS regex tip sheet: http://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

Something like this:
Date lastDate = startDate;
List<NData> ListData = new ArrayList<NData>();
for (FileData fdat : ListFileData) {
    // Derive the full date of this reading from the date of the previous one
    Date nDate = this.getDate(lastDate, fdat.gettime());
    NData nData = new NData(nDate, fdat.getMeasuring());
    ListData.add(nData);
    lastDate = nDate;
}
...
// getHour and getMinute are helpers (not shown) that parse the time string
private Date getDate(Date ld, String time) {
    Calendar cal = Calendar.getInstance();
    cal.setTime(ld);
    int year = cal.get(Calendar.YEAR);
    int month = cal.get(Calendar.MONTH) + 1; // Calendar months are 0-based, Joda-Time months are 1-based
    int day = cal.get(Calendar.DAY_OF_MONTH);
    int hourOfDay = this.getHour(time);
    int minuteOfHour = this.getMinute(time);
    org.joda.time.LocalDateTime lastDate = new org.joda.time.LocalDateTime(ld);
    org.joda.time.LocalDateTime newDate = new org.joda.time.LocalDateTime(year, month, day, hourOfDay, minuteOfHour);
    // A time earlier than the previous reading means midnight was crossed
    if (newDate.isBefore(lastDate)) {
        newDate = newDate.plusDays(1);
    }
    return newDate.toDate();
}

It's hard to provide a complete answer without sample code, but the SAS lag() function might be enough to do what you need. Your data step would include lines like the following, assuming your time variable is called time and your date variable is called date:
retain date;
if time < lag(time) then date = date + 1;
This assumes you never have any 24 hour gaps (but it appears you'd have to assume that anyway).
This answer also assumes that the time field is already in a SAS time format.
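
The rollover rule behind the lag() approach (move to the next day whenever the time is earlier than the previous one) can be sketched outside SAS as well. The Python below is only an illustration of that logic: the function name assign_dates and the start date are made up, and, like the answer above, it assumes readings are in order with no gap of 24 hours or more.

```python
from datetime import date, datetime, time, timedelta


def assign_dates(start_date, times):
    """Attach a calendar date to each time of day.

    The date advances by one day whenever a time is earlier than the
    previous one, which is taken to mean that midnight was crossed.
    """
    current = start_date
    previous = None
    stamped = []
    for t in times:
        if previous is not None and t < previous:
            current += timedelta(days=1)
        stamped.append(datetime.combine(current, t))
        previous = t
    return stamped


# A few readings around midnight, as in the sample file (start date is hypothetical)
readings = [time(23, 59, 52), time(23, 59, 57), time(0, 0, 2), time(0, 0, 7)]
stamped = assign_dates(date(2014, 1, 1), readings)
```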

R remove only "[" "]" from string

I have something like:
test[1]
"[0 30.5 4.5 10.5 2 35 22.999999999999996 29 5.500000000000001 23.5 18 23.5 44.5 3 44.5 44.00000000000001 43 27 42 35.5 19.5 44.00000000000001 1 0 31 34 18 1.5 26 6 45.99999999999999 10.5 9.5 24 20 42.5 14.5 45.5 20.499999999999996 150 45.5 0 4.5 22.5 4 9 8 0 0 15.5 30.5 7 5.500000000000001 12.5 33.5 15 500 22.5 18 43 4.5 26 23.5 16 4.5 7.5 32 0 0 18.5 33 31 14.5 21.5 0 40 0 0 43.49999999999999 22.999999999999996]"
I would like to remove [ and ] (the first and last characters) from each element (test[1], test[2], ...) but keep the decimal points (22.9999).
I have tried some stringr functions, but I'm not so good with regex...
Can you help me?

There's no need for packages for this. Just use something like the following:
gsub("\\[|\\]", "", test)
This basically says: "Look in test for "[" or (|) "]", and if you find it, replace it with nothing ("")."
Since [ and ] are special characters in regular expressions, they would need to be escaped.
If you're just removing the first and last character, you can also probably do something like:
substring(test, 2, nchar(test)-1)
This basically says, "Extract the part of the string starting from the second position and ending in the second-to-last position."

One easy way to remove [ and ] from a string is
x <- "[12345]"
gsub("[][]", "", x)
# [1] "12345"
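Here, the outer [] defines a character class, meaning "any one of the characters inside". The inner ][ are the two characters to be removed; putting ] first lets it be read as a literal character rather than as the end of the class.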

Creating A Dataframe From A Text Dataset

I have a dataset that has hundreds of thousands of fields. The following is a simplified dataset
dataSet <- c("Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb DFStorLocLevel",
"0231 0002 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD X A A A 18 136 30 29 50 43 24.88 51.000 EA",
"0231 0002 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD X B B A 16 17 3 3 5 4 483.87 1.000 EA X",
"0231 0002 WH.920569 SPINDLE MOTOR MINI O 22 PD X A A A 69 85 15 9 25 13 680.91 21.000 EA",
"0231 0002 GB.C150583-00001 VALVE-AIR MDI 64 PD X A A A 16 113 50 35 80 52 19.96 116.000 EA",
"0231 0002 FG.124-0140 BEARING 32 PD X A A A 36 205 35 32 50 48 21.16 55.000 EA",
"0231 0002 WP.254997 BEARING,BALL .9843 X 2.04 52 PD X A A A 18 155 50 39 100 58 2.69 181.000 EA"
)
I would like to create a dataframe out of this dataSet for further calculation. The approach I am following is as follows:
I split the dataSet by space and then recombine it.
dataSetSplit <- strsplit(dataSet, "\\s+")
The header (the first line) splits correctly and produces 25 elements. This can be seen with the str() function.
str(dataSetSplit)
I then intend to combine all the rows using the following script:
combinedData <- data.frame(do.call(rbind, dataSetSplit))
Please note that the "combinedData" line above gives an error because the split did not produce the same number of fields for every row.
For this approach to work, every row must split into exactly 25 fields.
If you think this is a sound approach, please let me know how to split the rows into 25 fields.
It is worth mentioning that I do not like the approach of splitting the data set with the function strsplit(). It is an extremely time consuming step if used with a large data set. Can you please recommend an alternate approach to create a data frame out of the supplied data?

By the looks of it, you have a header row that is actually helpful. You can easily use gregexpr to calculate your "widths" to use with read.fwf.
Here's how:
## Use gregexpr to find the position of consecutive runs of spaces
## This will tell you the starting position of each column
Widths <- gregexpr("\\s+", dataSet[1])[[1]]
## `read.fwf` doesn't need the starting position, but the width of
## each column. We can use `diff` to calculate this.
Widths <- c(Widths[1], diff(Widths))
## Since there are no spaces after the last column, we need to calculate
## a reasonable width for that column too. We can do this with `nchar`
## to find the widest row in the data. From this, subtract the `sum`
## of all the previous values.
Widths <- c(Widths, max(nchar(dataSet)) - sum(Widths))
Let's also extract the column names. We could do this in read.fwf, but it would require us to substitute the spaces in the first line with a "sep" character.
Names <- scan(what = "", text = dataSet[1])
Now, read in everything except the first line. You would use the actual file instead of textConnection, I would suppose.
read.fwf(textConnection(dataSet), widths=Widths, strip.white = TRUE,
skip = 1, col.names = Names)
# Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty
# 1 231 2 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD NA X A A A 18 136
# 2 231 2 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD NA X B B A 16 17
# 3 231 2 WH.920569 SPINDLE MOTOR MINI O 22 PD NA X A A A 69 85
# 4 231 2 GB.C150583-00001 VALVE-AIR MDI 64 PD NA X A A A 16 113
# 5 231 2 FG.124-0140 BEARING 32 PD NA X A A A 36 205
# 6 231 2 WP.254997 BEARING,BALL .9843 X 2.04 52 PD NA X A A A 18 155
# CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb
# 1 NA NA 30 29 50 43 NA 24.88 51 EA <NA>
# 2 NA NA 3 3 5 4 NA 483.87 1 EA X
# 3 NA NA 15 9 25 13 NA 680.91 21 EA <NA>
# 4 NA NA 50 35 80 52 NA 19.96 116 EA <NA>
# 5 NA NA 35 32 50 48 NA 21.16 55 EA <NA>
# 6 NA NA 50 39 100 58 NA 2.69 181 EA <NA>
# DFStorLocLevel
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA

Many thanks to Ananda Mahto, he provided many pieces to this answer.
widthMinusFirst <- diff(gregexpr('(\\s[A-Z])+', dataSet[1])[[1]])
widthFirst <- gregexpr('\\s+', dataSet[1])[[1]][1]
Width <- c(widthFirst, widthMinusFirst)
Widths <- c(Width, max(nchar(dataSet)) - sum(Width))
columnNames <- scan(what = "", text = dataSet[1])
read.fwf(textConnection(dataSet[-1]), widths = Widths, strip.white = FALSE,
skip = 0, col.names = columnNames)
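
The header-driven idea used in both answers (take column positions from where each header token starts, then cut every row at those positions) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name parse_fixed_width and the small sample are made up, and the slicing assumes each value starts at or after its header's start column and does not run into the next column.

```python
import re


def parse_fixed_width(lines):
    """Split fixed-width rows using the start offsets of the header tokens."""
    header, *rows = lines
    # Start offset of every run of non-space characters in the header
    starts = [m.start() for m in re.finditer(r'\S+', header)]
    names = header.split()
    # Each column ends where the next one starts; the last runs to end of line
    ends = starts[1:] + [None]
    return [
        {name: row[s:e].strip() for name, s, e in zip(names, starts, ends)}
        for row in rows
    ]


# Hypothetical sample: a description containing a space and an empty Flag
sample = [
    "Plnt Material   Description      Qty Flag",
    "0231 WH.112734  MOTOR REDUCER    17  X",
    "0231 FG.124     BEARING          205",
]
records = parse_fixed_width(sample)
```

Splitting on whitespace would break "MOTOR REDUCER" into two fields and drop the empty Flag, which is why the positions are taken from the header instead.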
