Remove a single quote in a string with PySpark using regex

I need to remove a single quote from a string. The column name is Keywords and it contains an array hidden in a string, so I need to use a regex on the Spark DataFrame to remove the single quote from the beginning of the string and from the end. The string looks like this:
Keywords = '["shade perennials"," shade loving perennials"," perennial plants"," perennials"," perennial flowers"," perennial plants for shade"," full shade perennials"]'
I have tried the following:
from pyspark.sql.functions import udf

remove_single_quote = udf(lambda x: x.replace(u"'", ""))
cleaned_df = spark_df.withColumn('Keywords', remove_single_quote('Keywords'))

But the single quote is still there. I have also tried (u"\'", "").

from pyspark.sql.functions import regexp_replace
new_df = data.withColumn('Keywords', regexp_replace('Keywords', "\'", ""))

Try regexp_replace:
from pyspark.sql.functions import regexp_replace
cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords', "'", ""))
OR
from pyspark.sql import functions as f
cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords', "'", ""))
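Note that the question only asks to strip the quote at the very beginning and very end of the string. If interior quotes should survive, you can anchor the pattern; a minimal sketch:
from pyspark.sql import functions as f

# ^' matches a quote at the start of the string, '$ one at the end,
# so only the outermost single quotes are removed
cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords', "^'|'$", ""))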
I have not tested it, but the following should work:
import ast
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# literal_eval must run per row, so wrap it in a UDF instead of
# calling it on the column name directly
parse_keywords = udf(ast.literal_eval, ArrayType(StringType()))
cleaned_df = spark_df.withColumn('Keywords', parse_keywords('Keywords'))
Please refer to the ast.literal_eval documentation for details.


Extract the second instance of the website pattern in a string using pandas str.contains

I am trying to extract the 2nd instance of the www website from the string below. This is in a pandas DataFrame.
https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
So I want to extract the following string and store it in a separate column.
https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
Final DataFrame:
sr.no  link_orig           link_extracted
1      <the above string>  <the extracted string starting from https://www.accenture.com>
Below is the code snippet:
df['link_extracted'] = df['link_orig'].str.contains('www.accenture.com', regex=False, na=np.nan)
I am getting the following error:
ValueError: Cannot mask with non-boolean array containing NA / NaN values
What am I missing here? If I have to use regex, then what should the approach be?
The error message means you probably have NaNs in the link_orig column. That can be fixed by adding a fillna('') to your code.
Something like:
df['link_extracted'] = df['link_orig'].fillna('').str.contains ...
That said, I'm not sure the rest of your code will do what you want. That will just return True if www.accenture.com is anywhere in the link_orig string.
If the link you are trying to extract always contains www.accenture.com then you can do this:
df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)')
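For example, a minimal sketch on a one-row frame built from a truncated version of the URL in the question:
import pandas as pd

df = pd.DataFrame({'link_orig': ['https://google.com/url?q=https://www.accenture.com/in-en/insights']})
# expand=False returns a Series rather than a one-column DataFrame
df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)', expand=False)
print(df['link_extracted'][0])  # www.accenture.com/in-en/insights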
Personally, I'd use Series.str.extract() for this. E.g.:
df['link_extracted'] = df['link_orig'].str.extract(r'http.*(http.*)')
This matches http, followed by anything (greedily), then captures the last http followed by anything.
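A quick sketch of how the greedy match behaves (the leading http.* consumes as much as it can, so the group captures the last occurrence):
import pandas as pd

s = pd.Series(['https://google.com/url?q=https://www.accenture.com/in-en'])
print(s.str.extract(r'http.*(http.*)', expand=False)[0])
# https://www.accenture.com/in-en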
An alternate approach would be to use urlparse.
You can use the urllib.parse module:
import pandas as pd
from urllib.parse import urlparse, parse_qs
url = 'https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD'
df = pd.DataFrame({'sr.no':[1], 'link_orig':[url]})
extract_q = lambda url: parse_qs(urlparse(url).query)['q'][0]
df['link_extracted'] = df['link_orig'].apply(extract_q)
Output:
>>> df
sr.no link_orig link_extracted
0 1 https://google.com/url?q=https://www.accenture... https://www.accenture.com/in-en/insights/softw...

Select row with regex instead of unique value

Hello everyone. I'm doing a really simple lookup in a pandas DataFrame; what I need is for the input I'm typing to be matched as a regex instead of an exact == myvar comparison.
So far what I have is very inefficient, because there are a lot of names in my DataFrame that a single pattern should be able to match, for example:
Name      LastName
NAME 1    Some Awesome
Name 2    Last Names
Nam e 3   I can keep going
Bane      Writing this is awesome
BANE 114  Lets continue
However, this is what I've got:
import pandas as pd
contacts = pd.read_csv("contacts.csv")
print("regex contacts")
nameLookUp = input("Type the name you are looking for: ")
print(nameLookUp)
desiredRegexVar = contacts.loc[contacts['Name'] == nameLookUp]
print(desiredRegexVar)
I have to type 'NAME 1' or 'Nam e 3' exactly in order to get results, otherwise I won't get any at all. I tried using this, but it didn't work:
#regexVar = "^" + contacts.filter(regex = nameLookUp)
Thanks for the answer @Code Different.
The code looks like this:
import pandas as pd
import re

contactos = pd.read_csv("contacts.csv")  # load the contacts as before
namelookup = input("Type the name you are looking for: ")
pattern = '^' + re.escape(namelookup)
match = contactos['Cliente'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contactos[match])
Use Series.str.contains. Tweak the pattern as appropriate:
import re
pattern = '^' + re.escape(namelookup)
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE)
contacts[match]
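With re.IGNORECASE the lookup matches regardless of case; a minimal sketch using the sample data from the question:
import re
import pandas as pd

contacts = pd.DataFrame({'Name': ['NAME 1', 'Name 2', 'Nam e 3', 'Bane', 'BANE 114']})
pattern = '^' + re.escape('bane')
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contacts[match])  # matches both 'Bane' and 'BANE 114'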

How to extract files with date pattern using python

I have n files in a folder, like:
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Let's consider a single pattern for now: 'abc'
(but I get this pattern randomly from the database, so I need double filtering: one filter for the pattern and one for the last day).
And I want to extract the file for the last day, i.e. '2017-07-02'.
With the code below I can get all files matching the pattern, but not just the last day's file.
Code
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print(m_files)
Output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files related to abc pattern, but how can filter out only last day file of that pattern
Expected:
[ 'abc_2017-07-02.tar' ]
Thanks
Just a minor tweak in your code can get you the desired result:
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with the latest date among the filenames in m_files.
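This works because the slice x[-14:-4] picks out the 'YYYY-MM-DD' part of the filename, and ISO dates compare correctly as plain strings:
m_files = ['abc_2017-07-01.tar', 'abc_2017-07-02.tar', 'abc_2017-07-03.tar']
# keys are '2017-07-01', '2017-07-02', '2017-07-03'
latest = max(m_files, key=lambda x: x[-14:-4])
print(latest)  # abc_2017-07-03.tar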
Use Python's re module, like:
import re
import os

files = os.listdir(source_dir)
for file in files:
    match = re.search(r'abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
and then you can work with day inside the loop to do whatever you want, e.g. create that list:
import re
import os

def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    day = match.group(1)
    return day

files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable you can substitute '07' with r'\d\d' or r'\d{2}'. Be careful if you have files that don't match the pattern at all: re.search then returns None, and match.group() will raise an error. In that case use:
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    try:
        day = match.group(1)
    except AttributeError:  # match is None when the filename doesn't match
        day = None
    return day
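Putting both filters together, a sketch assuming source_dir is defined as in the question and that, as in the first answer, "last day" means yesterday's date:
import os
import re
from datetime import datetime, timedelta

pattern = 'abc'
last_day = str((datetime.now() - timedelta(days=1)).date())  # e.g. '2017-07-02'
regex = re.compile(re.escape(pattern) + r'_(\d{4}-\d{2}-\d{2})\.tar$')

m_files = []
for f in os.listdir(source_dir):
    match = regex.match(f)
    if match and match.group(1) == last_day:
        m_files.append(f)
print(m_files)  # ['abc_2017-07-02.tar'] if run on 2017-07-03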

Deleting colons in a column of a CSV in Python

I have a column of different times and I want to find the values between 2 different times, but I can't work out how. For example: 09:04:00 through 09:25:00, using only the values between those two times.
I was going to just delete the colons separating hours:minutes:seconds and do it that way, but I don't know how to do that either. I do know how to find a value in a column, so I figured that way would be easier.
Here is the csv I'm working with.
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/03/1997,09:31:00,3045.00,3045.50,3045.00,3045.50,75
02/03/1997,09:32:00,3045.50,3045.50,3044.00,3044.00,54
02/03/1997,09:33:00,3043.50,3044.50,3043.50,3044.00,96
02/03/1997,09:34:00,3044.00,3044.50,3044.00,3044.50,27
02/03/1997,09:35:00,3044.50,3044.50,3043.50,3044.50,44
02/03/1997,09:36:00,3044.00,3044.00,3043.00,3043.00,61
02/03/1997,09:37:00,3043.50,3043.50,3043.50,3043.50,18
Thanks for your time.
If you just want to replace the colons with commas you can use the built-in string replace function:
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
line = line.replace(':', ',')
print(line)
Output:
02/03/1997,09,04,00,3046.00,3048.50,3046.00,3047.50,505
Then split on commas to separate the data.
line.split(',')
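For example, continuing from the line above:
fields = line.split(',')
print(fields[1:4])  # ['09', '04', '00'], i.e. the hour, minute and second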
If you only want the numerical values you could also do the following (using a regular expression):
import re
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
values = [float(x) for x in re.sub(r'[^\w.]+', ',', line).split(',')]
print(values)
Which gives you a list of numerical values that you can process.
[2.0, 3.0, 1997.0, 9.0, 4.0, 0.0, 3046.0, 3048.5, 3046.0, 3047.5, 505.0]
Use the csv module! :)
>>> import csv
>>> with open('myFile.csv', newline='') as csvfile:
...     myCsvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
...     for row in myCsvreader:
...         for item in row:
...             item.split(':')  # splits 'hh:mm:ss' into ['hh', 'mm', 'ss']
Once you have extracted the different timestamps, you can use the datetime module, e.g.:
from datetime import datetime, date, time

x = time(hour=9, minute=30, second=30)
y = time(hour=9, minute=30, second=42)
diff = datetime.combine(date.today(), y) - datetime.combine(date.today(), x)
print(diff.total_seconds())
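Putting it together, a minimal sketch (assuming the data above is saved as myFile.csv with its header row) that keeps only the rows between 09:04:00 and 09:25:00:
import csv
from datetime import time

start, end = time(9, 4), time(9, 25)
with open('myFile.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        # parse 'hh:mm:ss' into a time object for a proper comparison
        h, m, s = (int(part) for part in row['TIME'].split(':'))
        if start <= time(h, m, s) <= end:
            print(row['TIME'], row['CLOSE'])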

How to Pass Arguments from xlwings to VBA Excel Macro?

I was looking at How do I call an Excel macro from Python using xlwings?, and I understand it's not fully supported, but I would like to know if there is a way to do this. Something like:
from xlwings import Workbook, Application

wb = Workbook(...)
Application(wb).xl_app.Run("your_macro('%Args%')")
This can be done by doing what you propose. However, please keep in mind that this solution will not be cross-platform (Win/Mac). I'm on Windows, so the code below has to be adjusted to appscript on a Mac: http://docs.xlwings.org/en/stable/missing_features.html
The VBA script can be called as follows:
linked_wb.xl_workbook.Application.Run("vba_script", variable_to_pass)
Example:
Let's say you have a list of strings that should be used in a data validation list in Excel.
Python:
from xlwings import Workbook

linked_wb = Workbook.caller()
animals = ['cat', 'dog', 'snake', 'bird']
# join with '|' (appending 'animal + "|"' in a loop would leave a trailing
# separator and hence an empty entry after the VBA Split)
animal_list = "|".join(animals)
linked_wb.xl_workbook.Application.Run("create_validation", animal_list)
Excel VBA:
Public Sub create_validation(validation_list)
    Dim validation_split() As String
    validation_split = Split(validation_list, "|")
    'The drop-down validation can only be created with a 1-dimensional array.
    'We get 1-D from the Split above.
    With Sheet1.Range("A1").Validation
        .Delete
        .Add Type:=xlValidateList, AlertStyle:=xlValidAlertStop, _
             Operator:=xlBetween, Formula1:=Join(validation_split, ",")
    End With
End Sub
Python Example:
import xlwings as xw

# wb_path = r"set_wb_path"
wb = xw.Book(wb_path)
app = wb.app

variable_to_pass = 'test'
macro = wb.macro('moduleName.macroName')  # the macro name must be a string
macro(variable_to_pass)

wb.app.quit()
# or
wb.close()
As long as your VBA function accepts a variable and you pass it the same type of variable (str, int, list), this will work.
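For instance, a hypothetical macro call with several argument types (the workbook, module and macro names below are placeholders):
import xlwings as xw

wb = xw.Book('report.xlsm')          # hypothetical workbook
macro = wb.macro('Module1.process')  # hypothetical 'module.macro' name
macro('north', 2023, ['a', 'b', 'c'])  # str, int and list arguments
wb.save()
wb.close()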