Imputing Missing Value - regex

I am trying to filter the talentpool_subset DataFrame to capture only the city and state from the location column (it currently contains strings like "Software Developer in London, United Kingdom"). I've replaced the NaN values with 0 and confirmed this by subsetting the DataFrame for NaN values, which returned an empty DataFrame as expected. But every time I run the final statement, I get this error: "ValueError: cannot mask with array containing NA / NaN values"
Why is this happening?
talentpool_subset = talentpool_df[['name', 'profile', 'location','skills']]
talentpool_subset
# Replace NaN values in 'location' with 0
talentpool_subset['location'].fillna(0, inplace=True)
# Confirm no NaN values remain (this returns an empty DataFrame)
location = talentpool_subset['location'].isna()
talentpool_subset[location]
# This statement raises the ValueError
talentpool_subset[talentpool_subset['location'].str.contains(r'(?<=in).*')]
name profile url source github location skills tags_strong tags_expert is_available description
0 Hugo L. Samayoa DevOps Developer https://www.toptal.com/resume/hugo-l-samayoa toptal NaN DevOps Developer in Long Beach, CA, United States {"Paradigms":["Agile Software Development","Sc... NaN ["Linux System Administration","VMware ESXi","... available "DevOps before DevOps" is a term mostly associ...
1 Stepan Yakovenko Software Developer https://www.toptal.com/resume/stepan-yakovenko toptal stiv-yakovenko Software Developer in Novosibirsk, Novosibirsk... {"Platforms":["Debian Linux","Windows","Linux"... ["Linux","C++","AngularJS"] ["Java","HTML5","CSS","JavaScript","MySQL","Hi... available Stepan is an experienced software developer wi...
2 Slobodan Gajic Software Developer https://www.toptal.com/resume/slobodan-gajic toptal bobangajicsm Software Developer in Sremska Mitrovica, Vojvo... {"Platforms":["Firebase","XAMPP"],"Storage":["... ["Firebase","Karma"] ["jQuery","HTML5","CSS3","Git","JavaScript","S... available Slobodan is a front-end developer with a Bache...
4 Jennifer Aquino Query Optimization Developer https://www.toptal.com/resume/jennifer-aquino toptal BlueCamelArt Query Optimization Developer in West Ryde, New... {"Paradigms":["Automation","ETL Implementation... ["Data Warehouse","Unix","Oracle 10g","Automat... ["SQL","SQL Server Integration Services (SSIS)... available Jennifer has five years of professional experi...

First, the reason for the ValueError: fillna(0) puts the number 0, not a string, into the missing locations. When .str.contains() then runs over the column, it returns NaN for every element that is not a string, so the boolean mask itself contains NaN values, and pandas refuses to mask with it. Passing na=False to .str.contains(), or filling with '' instead of 0, avoids the error.
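A minimal sketch of that fix (it works whether the gaps hold NaN or the 0 fill, since .str methods treat both as missing):
# na=False turns the NaN results into False so the mask stays boolean
mask = talentpool_subset['location'].str.contains(r'(?<=in).*', na=False)
talentpool_subset[mask]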
That said, if the objective is just to get the city and state, a mask is not required at all. The code below uses .str.extract() to keep only the text after 'in' in the location column. For example: Long Beach, CA, United States from DevOps Developer in Long Beach, CA, United States.
# Import libraries
import pandas as pd
import numpy as np
# Create list using text from question
name = ['Hugo L. Samayoa','Stepan Yakovenko','Slobodan Gajic','Bruno Furtado Montes Oliveira','Jennifer Aquino']
profile = ['DevOps Developer','Software Developer','Software Developer','Visual Studio Team Services (VSTS) Developer','Query Optimization Developer']
url = ['https://www.toptal.com/resume/hugo-l-samayoa','https://www.toptal.com/resume/stepan-yakovenko','https://www.toptal.com/resume/slobodan-gajic','https://www.toptal.com/resume/bruno-furtado-mo...','https://www.toptal.com/resume/jennifer-aquino']
source = ['toptal','toptal','toptal','toptal','toptal']
github = [np.nan, 'stiv-yakovenko','bobangajicsm','brunofurmon','BlueCamelArt']
location = ['DevOps Developer in Long Beach, CA, United States', 'Software Developer in Novosibirsk, Novosibirsk','Software Developer in Sremska Mitrovica, Vojvo','Visual Studio Team Services (VSTS) Developer in New York','Query Optimization Developer in West Ryde, New York']
skills = ['{"Paradigms":["Agile Software Development","Sc...', '{"Platforms":["Debian Linux","Windows","Linux"...','{"Platforms":["Firebase","XAMPP"],"Storage":["...','{"Paradigms":["Agile","CQRS","Azure DevOps"],"...','{"Paradigms":["Automation","ETL Implementation...']
# Create DataFrame using list above
talentpool_df = pd.DataFrame({
'name':name,
'profile':profile,
'url':url,
'source':source,
'github':github,
'location':location,
'skills':skills
})
# Add NaN row to DataFrame
talentpool_df.loc[6,:] = np.nan
# Subset DataFrame to get columns of interest; .copy() avoids a
# SettingWithCopyWarning when 'location' is reassigned below
talentpool_subset = talentpool_df[['name', 'profile', 'location','skills']].copy()
# Use .extract() to keep only text after 'in' in the 'location' column
talentpool_subset['location'] = talentpool_subset['location'].str.extract(r'((?<=in).*)')
Output
talentpool_subset
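Two caveats about the lookbehind pattern used above: the captured text keeps the leading space after "in", and (?<=in) fires after any occurrence of "in", even inside a word (for a hypothetical profile like "Engineer in London" it matches inside "Engineer" and extracts "eer in London"). A variant that avoids both, assuming the same column:
# \bin\s+ requires "in" to be a standalone word and consumes the
# following whitespace, so only the city/state is captured
talentpool_subset['location'] = talentpool_subset['location'].str.extract(r'\bin\s+(.*)')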

Related

Excel coding: how to categorise data and reach 1 of 3 possible outputs

Working in Microsoft Excel, I have a column (column L) that lists different countries, referring to a variety of project locations. From these, I want to write a formula that, in a new column, categorises each project's location as Region "Sahel" or Region "HoA", or leaves it blank if neither condition is met.
For example:
I want the countries Burkina Faso, Mali and Nigeria to be categorised as Sahel in the new column, just South Sudan to be categorised as Horn of Africa (HoA), and projects in other countries to be left blank in this column.
What I have tried so far:
=IF(OR(L2="Burkina Faso";L2="Mali";L2="Nigeria"); "Sahel";"HoA")
-> This works for the countries I want to categorise as "Sahel"; however, everything else, and not just South Sudan, is then categorised as HoA, which is not what I want. But I cannot simply add "" at the end, as I then have too many arguments in the formula.
I then tried:
=IF(OR(L3="Burkina Faso";L3="Mali";L3="Nigeria"); "Sahel";""); IF((L3="South Sudan"); "HoA";"")
-> But this didn't work out at all...
Thus the end output that I want is:
Countries: Mali; South Sudan; Mali; Burkina Faso; Somalia; Uganda; South Sudan
Regions: Sahel; HoA; Sahel; Sahel; -; -; HoA
I hope that someone can help me move on from here.
Use this, which nests the second IF inside the first one's "else" argument, so anything that is neither a Sahel country nor South Sudan falls through to the empty string:
=IF(OR(L2="Burkina Faso",L2="Mali",L2="Nigeria"),"Sahel",IF(L2="South Sudan","HoA",""))
(Depending on your locale, you may need ; instead of , as the argument separator, as in your own formulas.)

Use Python to extract three sentences based on word finding

I'm working on a text-mining use case in python. These are the sentences of interest:
As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased. Stores are primarily located in shopping malls and other shopping centers.
How can I extract the sentence with the keyword "China"? I also need the sentences around it - at least two sentences before and after.
I've tried the below, as was answered here:
import nltk
from nltk.tokenize import word_tokenize
sents = nltk.sent_tokenize(text)
my_sentences = [sent for sent in sents if 'China' in word_tokenize(sent)]
Please help!
TL;DR
Use sent_tokenize, keep track of the indices of the sentences that contain the focus word, then take a window of sentences around each of those indices to get the desired result.
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

text = """As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased global economic and political uncertainty and caused volatility in foreign currency exchange rates. Stores are primarily located in shopping malls and other shopping centers, certain of which have been experiencing declines in customer traffic."""

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

# Indices of the sentences that contain the focus word
sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text)
                       if 'China' in sent or 'china' in sent]

window = 2  # if you want 2 sentences before and after

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window + 1, len(tokenized_text))  # +1: slice ends are exclusive
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
Another example, pip install wikipedia first:
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
import wikipedia

word_detokenize = TreebankWordDetokenizer().detokenize

text = wikipedia.page("Winnie The Pooh").content
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

# Indices of the sentences that contain the focus word
sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text)
                       if 'China' in sent or 'china' in sent]

window = 2  # if you want 2 sentences before and after

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window + 1, len(tokenized_text))  # +1: slice ends are exclusive
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
    print()
[out]:
Ashdown Forest in England where the Pooh stories are set is a popular
tourist attraction, and includes the wooden Pooh Bridge where Pooh and
Piglet invented Poohsticks. The Oxford University Winnie the Pooh
Society was founded by undergraduates in 1982. == Censorship in China
== In the People's Republic of China, images of Pooh were censored in mid-2017 from social media websites, when internet memes comparing
Chinese president Xi Jinping to Pooh became popular. The 2018 film
Christopher Robin was also denied a Chinese release.

Extract texts from a large character string based on a pattern

I have a large character string and would like to extract certain information from it by matching a pattern:
str(input)
chr [1:109094] "{'asin': '0981850006', 'description': 'Steven Raichlen\'s Best of Barbecue Primal Grill DVD. The first three volumes of the si"| truncated ...
I get the following content for input[1] - the metadata description of a product:
[1] ("{'asin': '144072007X', 'related': {'also_viewed': ['B008WC0X0A', 'B000CPMOVG', 'B0046641AE', 'B00J150GAO', 'B00005AMCG', 'B005WGX97I'],
'bought_together': ['B000H85WSA']},
'title': 'Sand Shark Margare Maron Audio CD',
'price': 577.15,
'salesRank': {'Patio, Lawn & Garden': 188289},
'imUrl': 'http://ecx.images-amazon.com/images/I/31B9X0S6dqL._SX300_.jpg',
'brand': 'Tesoro',
'categories': [['Patio, Lawn & Garden', 'Lawn Mowers & Outdoor Power Tools', 'Metal Detectors']],
'description': \"The Tesoro Sand Shark metal combines time-proven PI circuits with the latest digital technology creating the first.\"}")
Now I would like to iterate over each element of the large string and extract asin, title, price, salesRank, brand and categories, saving them in a data.frame for easier handling.
As you might notice, the data originally comes from a JSON file. I tried to import it using the stream_in command, but it didn't help, so I just imported it using readLines. Please help! I'm getting a bit desperate... any hint is appreciated!
The jsonlite package shows the following problem:
lexical error: invalid char in json text.
{'asin': '0981850006', 'descript
(right here) ------^
closing file input connection.
Any new ideas on that?
Given lots of unanswered questions on that issue, must be very relevant for newbies ;)
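For what it's worth, the lexical error is jsonlite objecting to the single quotes: 'asin' is not valid JSON, which requires double-quoted strings. The records look like Python dict literals, so one workaround (a sketch under that assumption, with hypothetical file names and one record per line) is to re-serialize them as real JSON in Python, then load the result into R with jsonlite::stream_in():
import ast
import json

# Hypothetical file names; assumes each line holds one Python-style record
with open('meta.json', encoding='utf-8') as src, \
     open('meta_fixed.json', 'w', encoding='utf-8') as out:
    for line in src:
        record = ast.literal_eval(line)       # parses the single-quoted dict literal
        out.write(json.dumps(record) + '\n')  # writes valid JSON, one object per line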

Searching words in sentences in R

I'd like to ask you for advice on the following. I have a data frame:
reviews <- data.frame(value = c("Product was received in excellent condition. Made with high quality materials. Very Good product",
"Inexpensive. An improvement over integrated graphics.",
"I love that product so excite. I will order again if I need more .",
"Excellent card, great graphics."),
user = c(1,2,3,4),
Review_Id = c("101968","101968","210546","112546"))
Then I have the topics from each of the sentences mentioned above:
topics <- data.frame(topic = c("product","condition","materials","product","integrated graphics","product","card","graphics"),
user = c(1,1,1,1,2,3,4,4), Review_Id = c("101968","101968","101968","101968","101968","210546","112546","112546"))
I need to find the original sentence in which each particular topic appears, given that I know the user and Review_Id for the sentences as well as the topics, and then write that sentence into a review column.
The desired output should look like the following:
topic user Review_Id review
product 1 101968 Product was received in excellent condition.
condition 1 101968 Product was received in excellent condition.
materials 1 101968 Made with high quality materials.
product 1 101968 Very Good product
integrated graphics 2 101968 An improvement over integrated graphics.
product 3 210546 I love that product so excite.
card 4 112546 Excellent card, great graphics.
graphics 4 112546 Excellent card, great graphics.
Any advice or approach will be much appreciated. Thanks a lot in advance.
You can try:
merge.data.frame(x = topics, y = reviews, by = c("Review_Id", "user"), all.x = TRUE, all.y = FALSE)
Merging on both keys matters here, because Review_Id 101968 belongs to two different users. This attaches the full review text to each topic row; to reach the desired output you still need to split each review into sentences and keep the one that mentions the topic, as in the sketch below.
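A sketch of that remaining step, in Python/pandas purely to illustrate the logic (the same split-and-match approach works in R, e.g. with strsplit); note it keeps the first sentence mentioning each topic, so a topic that occurs twice in one review, like product in review 101968, would need extra bookkeeping:
import re
import pandas as pd

reviews = pd.DataFrame({
    'value': ["Product was received in excellent condition. Made with high quality materials. Very Good product",
              "Inexpensive. An improvement over integrated graphics.",
              "I love that product so excite. I will order again if I need more .",
              "Excellent card, great graphics."],
    'user': [1, 2, 3, 4],
    'Review_Id': ["101968", "101968", "210546", "112546"]})

topics = pd.DataFrame({
    'topic': ["product", "condition", "materials", "product",
              "integrated graphics", "product", "card", "graphics"],
    'user': [1, 1, 1, 1, 2, 3, 4, 4],
    'Review_Id': ["101968", "101968", "101968", "101968",
                  "101968", "210546", "112546", "112546"]})

# Attach the full review to each topic row (join on both keys)
merged = topics.merge(reviews, on=['user', 'Review_Id'], how='left')

def matching_sentence(row):
    # Split the review into sentences at ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', row['value'])
    for sent in sentences:
        if row['topic'].lower() in sent.lower():
            return sent
    return None

merged['review'] = merged.apply(matching_sentence, axis=1)
print(merged[['topic', 'user', 'Review_Id', 'review']])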

R - does failed RegEx pattern matching originate in file conversion or use of tm package?

As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First, some background. I am trying to perform some preprocessing on a corpus of texts using R before processing them further on the NLP platform GATE. I convert the original PDF files to text as follows (the text files, unfortunately, end up in the same folder):
dest <- "./MyFolderWithPDFfiles"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
Then, having loaded the tm package and physically(!) moved the text files to another folder, I create a corpus:
TextFiles <- "./MyFolderWithTXTfiles"
EU <- Corpus(DirSource(TextFiles))
I then want to perform a series of custom transformations to clean the texts. I succeeded in replacing a simple string as follows:
ReplaceText <- content_transformer(function(x, from, to) gsub(from, to, x, perl=T))
EU2 <- tm_map(EU, ReplaceText, "Table of contents", "TOC")
However, a pattern that is a 1-3 digit page number followed by two line breaks and a page break is causing me problems. I want to replace it with a blank space:
EU2 <- tm_map(EU, ReplaceText, "[0-9]{1,3}\n\n\f", " ")
The ([0-9]{1,3}) and \f alone match. The line breaks don't. If I copy text from one of the original .txt files into the RegExr online tool and test the expression "[0-9]{1,3}\n\n\f", it matches. So the line breaks do exist in the original .txt file.
But when I view one of the .txt files as read into the EU corpus in R, there appear to be no line breaks even though the lines are obviously breaking before the margin, e.g.
[3] "PROGRESS TOWARDS ACCESSION"
[4] "1"
[5] ""
[6] "\fTable of contents"
Seeing this, I tried other patterns, e.g. to detect one or more blank spaces ("[0-9]{1,3}\s*\f"), but no pattern worked.
So my questions are:
Am I converting and reading the files into R correctly? If so, what has happened to the line breaks?
If having no line breaks is normal, how can I pattern-match the character on line 5? Is that not a blank space?
(A tangential concern:) When converting the pdf files, is there code that will put them directly in a new folder?
Apologies for extending this, but how can one print or inspect only a few lines of the text object? The tm commands and head(EU) print the entire object, each a very long text.
I know my problem(s) must appear simple and perhaps stupid, but one has to start somewhere and extensive searching has not revealed a source that explains comprehensively how to use RegExes to modify text objects in R. I am so frustrated and hope someone here will take pity and can help me.
Thanks for any advice you can offer.
Brigitte
p.s. I think it's not possible to upload attachments in this forum, therefore, here is a link to one of the original PDF documents: http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf
Because the doc is long, I created a snippet of the first 3 pages of the TXT doc, read it into the R corpus ('EU'), and printed it to the console; this is it:
dput(EU[[2]])
structure(list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON",
"CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION ***********************",
"1", "", "\fTable of contents", "A. Introduction", "a) Preface The Context of the Progress Report",
"b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations",
"B. Criteria for membership", "1. Political criteria", "1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures",
"1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities",
"1.3. General evaluation", "2. Economic criteria", "2.1. Introduction 2.2. Economic developments since the Commission published its Opinion",
"Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation",
"3. Ability to assume the obligations of Membership", "3.1. Internal Market without frontiers General framework The Four Freedoms Competition",
"3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual",
"3.3. Economic and Fiscal Affairs Economic and Monetary Union",
"2", "", "\fTaxation Statistics "), meta = structure(list(author = character(0),
datetimestamp = structure(list(sec = 50.1142621040344, min = 33L,
hour = 15L, mday = 3L, mon = 10L, year = 114L, wday = 1L,
yday = 306L, isdst = 0L), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt",
"POSIXt"), tzone = "GMT"), description = character(0), heading = character(0),
id = "CZ1998ProgressSnippet.txt", language = "en", origin = character(0)), .Names = c("author",
"datetimestamp", "description", "heading", "id", "language",
"origin"), class = "TextDocumentMeta")), .Names = c("content",
"meta"), class = c("PlainTextDocument", "TextDocument"))
Yes, working with text in R is not always a smooth experience! But you can get a lot done quickly with some effort (maybe too much effort!)
If you could share one of your PDF files or the output of dput(EU), that might help to identify exactly how to capture your page numbers with regex. That would also add a reproducible example to your question, which is an important thing to have in questions here so that people can test their answers and make sure they work for your specific problem.
No need to put PDF and text files in separate folders; instead you can use a pattern like so:
EU <- Corpus(DirSource(pattern = ".txt"))
This will only read the text files and ignore the PDF files
There is no 'snippet view' method in tm, which is annoying. I often use just names(EU) and EU[[1]] for quick looks
UPDATE
With the data you've just added, I'd suggest a slightly tangential approach. Do the regex work before passing the data to the tm package formats, like so:
# get the PDF
download.file("http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf", "my_pdf.pdf", method = "wget")
# get the file name of the PDF
myfiles <- list.files(path = getwd(), pattern = "pdf", full.names = TRUE)
# convert to text (note: my pdftotext is in a different location to yours)
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
# read the plain text into R
x1 <- readLines("my_pdf.txt")
# make into a single string
x2 <- paste(x1, collapse = " ")
# do some regex...
x3 <- gsub("Table of contents", "TOC", x2)
x4 <- gsub("[0-9]{1,3}\\s*\f", "", x3)  # \\s* covers the double space a blank line leaves after paste()
# convert to corpus for text mining operations
x5 <- Corpus(VectorSource(x4))
With the snippet of data you provided using dput, the output from this method is:
inspect(x5)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
REGULAR REPORT FROM THE COMMISSION ON CZECH REPUBLIC'S PROGRESS TOWARDS ACCESSION *********************** TOC A. Introduction a) Preface The Context of the Progress Report b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations B. Criteria for membership 1. Political criteria 1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures 1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities 1.3. General evaluation 2. Economic criteria 2.1. Introduction 2.2. Economic developments since the Commission published its Opinion Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation 3. Ability to assume the obligations of Membership 3.1. Internal Market without frontiers General framework The Four Freedoms Competition 3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual 3.3. Economic and Fiscal Affairs Economic and Monetary Union Taxation Statistics