Sentence detection and extraction into same data frame - regex

I have the following data frame:
reviews <- data.frame(value = c("Product was received in excellent condition. Made with high quality materials. Very Good product",
"Inexpensive. An improvement over integrated graphics.",
"I love that product so excite. I will order again if I need more .",
"Excellent card, great graphics."),
user = c(1,2,3,4),
Review_Id = c("101968","101968","210546","112546"),
stringsAsFactors = FALSE)
and I need the following output:
user review_Id sentence
1 101968 Made with high quality materials.
1 101968 Very Good product
2 101968 Inexpensive.
2 101968 An improvement over integrated graphics.
3 210546 I love that product so excite.
3 210546 I will order again if I need more .
4 112546 Excellent card, great graphics.
I was wondering about something like sent_detect(reviews$value), but how can I combine that function with the data frame to get the desired output?

If your data really are so tidy, you can just use cSplit from my "splitstackshape" package.
library(splitstackshape)
cSplit(reviews, "value", ".", direction = "long")
# value user Review_Id
# 1: Product was received in excellent condition 1 101968
# 2: Made with high quality materials 1 101968
# 3: Very Good product 1 101968
# 4: Inexpensive 2 101968
# 5: An improvement over integrated graphics 2 101968
# 6: I love that product so excite 3 210546
# 7: I will order again if I need more 3 210546
# 8: Excellent card, great graphics 4 112546
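
Note that splitting on "." drops the period from each sentence. If the punctuation should be kept, one option is to split on the whitespace that follows a period instead. The following is a minimal Python sketch of that idea (not part of the splitstackshape answer); the `split_sentences` helper and the list-of-dicts layout are assumptions made for illustration, and the regex only handles sentences that end in a period.

```python
import re

# Sample rows mirroring the `reviews` data frame from the question.
reviews = [
    {"value": "Product was received in excellent condition. Made with high quality materials. Very Good product",
     "user": 1, "Review_Id": "101968"},
    {"value": "Inexpensive. An improvement over integrated graphics.",
     "user": 2, "Review_Id": "101968"},
    {"value": "I love that product so excite. I will order again if I need more .",
     "user": 3, "Review_Id": "210546"},
    {"value": "Excellent card, great graphics.",
     "user": 4, "Review_Id": "112546"},
]


def split_sentences(text):
    """Split on whitespace that directly follows a period, keeping the period."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


# One output row per sentence, carrying user and Review_Id along (long format).
rows = [
    {"user": r["user"], "Review_Id": r["Review_Id"], "sentence": s}
    for r in reviews
    for s in split_sentences(r["value"])
]

for row in rows:
    print(row["user"], row["Review_Id"], row["sentence"])
```

Abbreviations such as "e.g." or "Co., Ltd." would be split incorrectly by this pattern, so a proper sentence detector is the safer choice for messier text.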

Trajectory Analysis (SAS): Incorrect number of start values

I am attempting a trajectory analysis in SAS (proc traj).
Following instructions found online, I began by testing quadratic models with an increasing number of groups (i.e., order 2 2, order 2 2 2, order 2 2 2 2, order 2 2 2 2 2).
I determined that a three-group linear model is the best fit (order 1 1 1;).
I then wish to add time-stable covariates with the risk command. As found online, I did this by adding the start parameters provided in the Log.
At this point, I receive a notice: "Incorrect number of start values. There should be 10 start values based on the model specifications."
I understand that it's possible to delete some of the 12 parameter estimates provided, but how do I select which ones to remove?
Thank you.
Code:
proc traj data=followupyes outplot=op outstat=os out=of outest=oe itdetail;
id youthid;
title3 'linear 3-gp model ';
var pronoun_allpar1-pronoun_allpar3;
indep time1-time3;
model logit;
ngroups 3;
order 1 1 1;
weight wgt_00;
start 0.031547 0.499724 1.969017 0.859566 -1.236747 0.007471
0.771878 0.495458 0.000000 0.000000 0.000000 0.000000;
risk P00_45_1;
run;
%trajplot (OP, OS, "linear 3-gp model ", "Traj of Pronoun Support", "Pron Support", "Time");
Because you are estimating a model with 3 linear trajectories, you will need 2 start values for each of your 3 groups.
See here for more info: https://www.andrew.cmu.edu/user/bjones/example.htm
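
The count of 10 in the error message can be reproduced with a small arithmetic sketch. It rests on an assumed counting rule, which should be checked against the proc traj documentation: each group contributes (order + 1) trajectory parameters, and with a risk statement every group except the reference group gets a constant plus one coefficient per risk covariate. The function name `n_start_values` is made up for illustration.

```python
def n_start_values(orders, n_risk):
    """Count start values under the assumed rule described above."""
    # Assumption: intercept plus one coefficient per polynomial degree, per group.
    trajectory = sum(order + 1 for order in orders)
    # Assumption: groups 2..k each get a constant and one coefficient per risk variable.
    membership = (len(orders) - 1) * (1 + n_risk)
    return trajectory + membership


# order 1 1 1 with a single risk variable (P00_45_1): 6 + 4
print(n_start_values([1, 1, 1], n_risk=1))  # 10
```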

Regex to extract numbers length >4, from a dataframe column

To extract numbers longer than 4 digits from a dataframe column using regex, I have these lines:
import pandas as pd
data = {'Company': ["0652369- INTER SUPPORT LLP, 202011",
"CIRCLE TRADING LTD 1-593616, 2020-06, 0201",
"Area Food Service Co., Ltd.-6958047, 2020-07"]}
df = pd.DataFrame(data)
df['co'] = df['Company'].str.extract('(\d+).{5,}')
print (df['co'])
Output:
0 0652369
1 1
2 6958047
It doesn't get the second row right, which should return '593616'.
What's the right way to write it? Thank you.
Try extracting (\d{5,}):
df['co'] = df['Company'].str.extract(r'(\d{5,})')  # raw string avoids the invalid-escape warning for \d
# Company co
# 0 0652369- INTER SUPPORT LLP, 202011 0652369
# 1 CIRCLE TRADING LTD 1-593616, 2020-06, 0201 593616
# 2 Area Food Service Co., Ltd.-6958047, 2020-07 6958047
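
To see why the original pattern fails, it helps to compare the two patterns side by side. The sketch below uses only the standard `re` module rather than pandas; the `first_group` helper is hypothetical. `(\d+).{5,}` captures the first run of digits that is followed by at least five more characters, so the lone "1" in the second string qualifies, whereas `(\d{5,})` requires the captured run itself to be five or more digits long.

```python
import re

companies = [
    "0652369- INTER SUPPORT LLP, 202011",
    "CIRCLE TRADING LTD 1-593616, 2020-06, 0201",
    "Area Food Service Co., Ltd.-6958047, 2020-07",
]


def first_group(pattern, text):
    """Return the first capture group of the leftmost match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


original = [first_group(r"(\d+).{5,}", c) for c in companies]
fixed = [first_group(r"(\d{5,})", c) for c in companies]

print(original)  # ['0652369', '1', '6958047']
print(fixed)     # ['0652369', '593616', '6958047']
```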

Parsing text file into a Data Frame

I have a text file which has information, like so:
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/profileName: Carleen M. Amadio "Lady Dragonfly"
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
review/text: I really enjoy these scissors for my inspiration books that I am making (like collage, but in books) and using these different textures these give is just wonderful, makes a great statement with the pictures and sayings. Want more, perfect for any need you have even for gifts as well. Pretty cool!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: ALCX2ELNHLQA7
review/profileName: Barbara
review/helpfulness: 0/0
review/score: 5.0
review/time: 1328659200
review/summary: Making the cut!
review/text: Looked all over in art supply and other stores for "crazy cutting" scissors for my 4-year old grandson. These are exactly what I was looking for - fun, very well made, metal rather than plastic blades (so they actually do a good job of cutting paper), safe ("blunt") ends, etc. (These really are for age 4 and up, not younger.) Very high quality. Very pleased with the product.
I want to parse this into a data frame with productId, title, price, etc. as columns and the data as rows. How can I do this in R?
A quick and dirty approach:
mytable <- read.table(text=mytxt, sep = ":")
mytable$id <- rep(1:2, each = 10)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")
There will be issues if there are other colons in the data. This also assumes that there is an equal number (10) of variables for each case.
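
Both caveats can be sidestepped by splitting each line only at its first colon and starting a new record whenever the first field name reappears. The following is a rough Python sketch of that idea rather than an R solution; `parse_records` and its `first_key` parameter are names invented here, and the sample text is an abbreviated version of the file shown in the question.

```python
def parse_records(text, first_key="product/productId"):
    """Turn 'key: value' lines into a list of dicts, one dict per record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key = key.strip()
        # Assumption: every record starts with `first_key`.
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records


sample = """\
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/score: 5.0
review/summary: Fun for adults too!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/score: 5.0
review/summary: Making the cut!
"""

records = parse_records(sample)
print(len(records))
print(records[1]["review/summary"])
```

The resulting list of dicts can be passed to `pandas.DataFrame(records)` to get one column per field; records with missing fields would simply get missing values in those columns.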

Searching words in sentences in R

I'd like to ask for advice on the following. I have a data frame:
reviews <- data.frame(value = c("Product was received in excellent condition. Made with high quality materials. Very Good product",
"Inexpensive. An improvement over integrated graphics.",
"I love that product so excite. I will order again if I need more .",
"Excellent card, great graphics."),
user = c(1,2,3,4),
Review_Id = c("101968","101968","210546","112546"))
Then I have topics extracted from each of the sentences above:
topics <- data.frame(topic = c("product","condition","materials","product","integrated graphics","product","card","graphics"),
user = c(1,1,1,1,2,3,4,4), Review_Id = c("101968","101968","101968","101968","101968","210546","112546","112546"))
I need to find the original sentence in which a particular topic appears, given that I know user and Review_Id for both the sentences and the topics, and then write that sentence into a column named review.
The desired output should look like the following:
topic user Review_Id review
product 1 101968 Product was received in excellent condition.
condition 1 101968 Product was received in excellent condition.
materials 1 101968 Made with high quality materials.
product 1 101968 Very Good product
integrated graphics 2 101968 An improvement over integrated graphics.
product 3 210546 I love that product so excite.
card 4 112546 Excellent card, great graphics.
graphics 4 112546 Excellent card, great graphics.
Any advice or approach would be much appreciated. Thanks a lot in advance.
You can try:
merge.data.frame(x = topics, y = reviews, by = c("Review_Id"), all.x = TRUE, all.y = FALSE)
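
Note that a merge on Review_Id attaches the whole review text rather than a single sentence, and since users 1 and 2 share Review_Id 101968 it would also pair topics with the other user's review. A sentence-level lookup might look like the Python sketch below. It is an illustration under stated assumptions, not a drop-in R answer: sentences are split after periods, matching is a case-insensitive substring test, and when the same topic is listed more than once for a review, the k-th listing is assumed to refer to the k-th matching sentence.

```python
import re
from collections import defaultdict

reviews = [
    (1, "101968", "Product was received in excellent condition. Made with high quality materials. Very Good product"),
    (2, "101968", "Inexpensive. An improvement over integrated graphics."),
    (3, "210546", "I love that product so excite. I will order again if I need more ."),
    (4, "112546", "Excellent card, great graphics."),
]
topics = [
    ("product", 1, "101968"), ("condition", 1, "101968"),
    ("materials", 1, "101968"), ("product", 1, "101968"),
    ("integrated graphics", 2, "101968"), ("product", 3, "210546"),
    ("card", 4, "112546"), ("graphics", 4, "112546"),
]


def split_sentences(text):
    """Split on whitespace that directly follows a period, keeping the period."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


# Key on (user, Review_Id): Review_Id alone is not unique across users.
sentences = {(user, rid): split_sentences(text) for user, rid, text in reviews}

seen = defaultdict(int)  # how often each (topic, user, Review_Id) has occurred so far
result = []
for topic, user, rid in topics:
    hits = [s for s in sentences.get((user, rid), []) if topic.lower() in s.lower()]
    k = seen[(topic, user, rid)]
    seen[(topic, user, rid)] += 1
    # Assumption: the k-th repeat of a topic maps to the k-th matching sentence.
    review = hits[min(k, len(hits) - 1)] if hits else None
    result.append((topic, user, rid, review))

for row in result:
    print(*row, sep=" | ")
```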

Help: Extracting data tuples from text... Regex or Machine learning?

I would really appreciate your thoughts on the best approach to the following problem. I am using a Car Classified listing example which is similar in nature to give an idea.
Problem: Extract a data tuple from the given text.
Here are some characteristics of the data.
The vocabulary (words) in the text is limited to a specific domain. Let's assume 100-200 words at the most.
Text that needs to be parsed is a headline like a Car Ad data shown below. So each record corresponds to one tuple (row).
In some cases some of the attributes may be missing. So for example, in raw data row #5 below the year is missing.
Some words go together (bigrams). Like "Low miles".
Historical data available = 10,000 records
Incoming New Data volume = 1000-1500 records / week
The expected output should be in the form of (Year,Make,Model, feature). So the output should look like
1 -> (2009, Ford, Fusion, SE)
2 -> (1997, Ford, Taurus, Wagon)
3 -> (2000, Mitsubishi, Mirage, DE)
4 -> (2007, Ford, Expedition, EL Limited)
5 -> ( , Honda, Accord, EX)
....
....
Raw Headline Data:
1 -> 2009 Ford Fusion SE - $7000
2 -> 1997 Ford Taurus Wagon - $800 (san jose east)
3 -> '00 Mitsubishi Mirage DE - $2499 (saratoga) pic
4 -> 2007 Ford Expedition EL Limited - $7800 (x)
5 -> Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic
6 -> 2004 HONDA ODASSEY LX 68K MILES - $10800 (danville / san ramon)
7 -> 93 LINCOLN MARK - $2000 (oakland east) pic
8 -> #######2006 LEXUS GS 430 BLACK ON BLACK 114KMI ####### - $19700 (san rafael) pic
9 -> 2004 Audi A4 1.8T FWD - $8900 (Sacramento) pic
10 -> #######2003 GMC C2500 HD EX-CAB 6.0 V8 EFI WHITE 4X4 ####### - $10575 (san rafael) pic
11 -> 1990 Toyota Corolla RUNS GOOD! GAS SAVER! 5SPEED CLEAN! REG 2011 O.B.O - $1600 (hayward / castro valley) pic img
12 -> HONDA ACCORD EX 2000 - $4900 (dublin / pleasanton / livermore) pic
13 -> 2009 Chevy Silverado LT Crew Cab - $23900 (dublin / pleasanton / livermore) pic
14 -> 2010 Acura TSX - V6 - TECH - $29900 (dublin / pleasanton / livermore) pic
15 -> 2003 Nissan Altima - $1830 (SF) pic
Possible choices:
A machine learning Text Classifier (Naive Bayes etc)
Regex
What I am trying to figure out is whether regex is too complicated for the job and whether a text classifier is overkill.
If the choice is to go with a text classifier then what would you consider to be the easiest to implement.
Thanks in advance for your kind help.
This is a well-studied problem called information extraction. It is not straightforward to do what you want, and it is not as simple as you make it sound (i.e., machine learning is not overkill). There are several techniques; you should read an overview of the research area.
Check this IE library for writing extraction rules; I think it will work best for your problem.
There is also an example of how to create fast dictionary matching.
I think that the ARX or Phoebus systems may suit your needs if you already have annotated data and a list of words associated to each field. Their approach is a mix of information extraction and information integration.
There are a few good entity recognition libraries. Have you taken a look at Apache opennlp?
As a user looking for a specific model of car the task is easier. I'm pretty sure I could classify, say, most Ford Rangers since I know what to look for with regexp.
I think your best bet is to write a function for each car model with type String -> Maybe Tuple. Then run all these on each input and throw away those inputs resulting in zero or too many tuples.
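
As a rough illustration of the String -> Maybe Tuple idea, here is a small dictionary-plus-regex sketch in Python, where `Optional` plays the role of Maybe. Everything in it is an assumption made for the example: the `MODELS` and `FEATURES` vocabularies are tiny hand-written stand-ins for lists that would be built from the historical records, the two-digit-year pivot is arbitrary, and a single generic function is used instead of one function per model.

```python
import re
from typing import Optional, Tuple

# Hypothetical vocabularies; real ones would come from the 10,000 historical records.
MODELS = {
    "ford": {"fusion", "taurus", "expedition"},
    "honda": {"accord"},
    "mitsubishi": {"mirage"},
}
FEATURES = {"se", "wagon", "de", "el", "limited", "ex", "lx"}

Parsed = Tuple[Optional[int], str, str, str]


def parse_headline(headline: str) -> Optional[Parsed]:
    """Return (year, make, model, feature), or None if no known make/model is found."""
    text = headline.split(" - $")[0]  # drop the price and everything after it
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    year, rest = None, []
    for tok in tokens:
        if year is None and re.fullmatch(r"(19|20)\d{2}", tok):
            year = int(tok)
        elif year is None and re.fullmatch(r"'\d{2}", tok):
            yy = int(tok[1:])
            year = 2000 + yy if yy <= 30 else 1900 + yy  # assumed pivot year
        else:
            rest.append(tok)
    lowered = [t.lower() for t in rest]
    for i in range(len(lowered) - 1):
        if lowered[i] in MODELS and lowered[i + 1] in MODELS[lowered[i]]:
            # Feature = leading run of known feature words right after the model.
            feature = []
            for tok in rest[i + 2:]:
                if tok.lower() not in FEATURES:
                    break
                feature.append(tok)
            return (year, rest[i].title(), rest[i + 1].title(), " ".join(feature))
    return None


for h in ["2009 Ford Fusion SE - $7000",
          "'00 Mitsubishi Mirage DE - $2499 (saratoga) pic",
          "Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic",
          "93 LINCOLN MARK - $2000 (oakland east) pic"]:
    print(parse_headline(h))
```

Headlines that return None (unknown make, misspellings such as "ODASSEY", bare two-digit years) are the ones that would need either a larger dictionary, fuzzy matching, or a learned model.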
You should use a tool like Amazon Mechanical Turk for this: human microtasking. Another alternative is to use a data entry freelancer; Upwork is a great place to look. You can get excellent quality results, and the cost is very reasonable for each.