Remove a single quote in a string with PySpark using regex
I need to remove a single quote in a string. The column name is Keywords, and it contains an array hidden in a string, so I need to use a regex within a Spark DataFrame to remove a single quote from the beginning of the string and one at the end. The string looks like this:
Keywords =
'["shade perennials"," shade loving perennials"," perennial plants"," perennials"," perennial flowers"," perennial plants for shade"," full shade perennials"]'
I have tried the following:
from pyspark.sql.functions import udf

remove_single_quote = udf(lambda x: x.replace(u"'", ""))
cleaned_df = spark_df.withColumn('Keywords', remove_single_quote('Keywords'))
But the single quote is still there. I have also tried (u"\'", ""), as well as regexp_replace:
from pyspark.sql.functions import regexp_replace
new_df = data.withColumn('Keywords', regexp_replace('Keywords', "\'", ""))
Try regexp_replace
from pyspark.sql.functions import regexp_replace,col
cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords',"\'",""))
OR
from pyspark.sql import functions as f
cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords',"\'",""))
I have not tested it, but this should work:
import ast
from pyspark.sql.functions import udf

strip_quotes = udf(lambda s: ast.literal_eval(s))  # literal_eval needs a UDF wrapper; it strips the outer quotes
cleaned_df = spark_df.withColumn('Keywords', strip_quotes('Keywords'))
Please refer to the ast.literal_eval documentation.
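For illustration, here is roughly what ast.literal_eval does to such a value (a plain-Python sketch, using a shortened version of the string from the question):

import ast

s = "'[\"shade perennials\",\" shade loving perennials\"]'"
print(ast.literal_eval(s))  # -> ["shade perennials"," shade loving perennials"], outer quotes gone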
Related
Extract the second instance of the website pattern in a string using pandas str.contains
I am trying to extract the 2nd instance of a www website from the below string, which is in a pandas DataFrame:
https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
So I want to extract the following string and store it in a separate column:
https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
Final DataFrame:
sr.no  link_orig            link_extracted
1      <the above string>   <the extracted string that starts from https://www.accenture.com>
Below is the code snippet:
df['link_extracted'] = df['link_orig'].str.contains('www.accenture.com', regex=False, na=np.NaN)
I am getting the following error:
ValueError: Cannot mask with non-boolean array containing NA / NaN values
What am I missing here? If I have to use regex, then what should the approach be?
The error message means you probably have NaNs in the link_orig column. That can be fixed by adding a fillna('') to your code, something like:
df['link_extracted'] = df['link_orig'].fillna('').str.contains(...)
That said, I'm not sure the rest of your code will do what you want: it will just return True if www.accenture.com is anywhere in the link_orig string. If the link you are trying to extract always contains www.accenture.com, then you can do this:
df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)')
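A quick check of that approach (a sketch with made-up data; expand=False makes str.extract return a Series):

import numpy as np
import pandas as pd

df = pd.DataFrame({'link_orig': ['https://google.com/url?q=https://www.accenture.com/in-en', np.nan]})
df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)', expand=False)
print(df)  # the NaN row yields NaN in link_extracted, with no ValueError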
Personally, I'd use Series.str.extract() for this. E.g.:
df['link_extracted'] = df['link_orig'].str.extract('http.*(http.*)')
This matches http, followed by anything, then captures http followed by anything. An alternate approach would be to use urlparse.
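To see why this captures the second URL, here is a small standalone demonstration with the plain re module:

import re

s = 'https://google.com/url?q=https://www.accenture.com/in-en'
m = re.search('http.*(http.*)', s)
# the first greedy .* backtracks only as far as needed, so the group starts at the last 'http'
print(m.group(1))  # -> https://www.accenture.com/in-en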
You can use the urllib.parse module:
import pandas as pd
from urllib.parse import urlparse, parse_qs

url = 'https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD'
df = pd.DataFrame({'sr.no': [1], 'link_orig': [url]})

extract_q = lambda url: parse_qs(urlparse(url).query)['q'][0]
df['link_extracted'] = df['link_orig'].apply(extract_q)
Output:
>>> df
   sr.no                                          link_orig                                     link_extracted
0      1  https://google.com/url?q=https://www.accenture...  https://www.accenture.com/in-en/insights/softw...
Select row with regex instead of unique value
Hello everyone. I'm making a really simple lookup in a pandas DataFrame. What I need to do is to look up the input I'm typing as a regex instead of == myvar. So far what I have is very inefficient, because there are a lot of names in my DataFrame, such as:
Name      LastName
NAME 1    Some Awesome
Name 2    Last Names
Nam e 3   I can keep going
Bane      Writing this is awesome
BANE 114  Lets continue
However, this is what I have:
import pandas as pd

contacts = pd.read_csv("contacts.csv")
print("regex contacts")
nameLookUp = input("Type the name you are looking for: ")
print(nameLookUp)
desiredRegexVar = contacts.loc[contacts['Name'] == nameLookUp]
print(desiredRegexVar)
I have to type 'NAME 1' or 'Nam e 3' in order to get results, or I won't get any at all. I tried using this, but it didn't work:
#regexVar = "^" + contacts.filter(regex = nameLookUp)
Edit - Thanks for the answer @Code Different. The code looks like this:
import pandas as pd
import re

namelookup = input("Type the name you are looking for: ")
pattern = '^' + re.escape(namelookup)
match = contactos['Cliente'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contactos[match])
Use Series.str.contains. Tweak the pattern as appropriate:
import re

pattern = '^' + re.escape(namelookup)
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE)
contacts[match]
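A small demonstration with made-up rows (na=False guards against missing names, as in the question's edit):

import re
import pandas as pd

contacts = pd.DataFrame({'Name': ['NAME 1', 'Nam e 3', 'BANE 114']})
pattern = '^' + re.escape('nam')
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contacts[match])  # matches 'NAME 1' and 'Nam e 3' but not 'BANE 114'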
How to extract files with a date pattern using Python
I have n files in a folder, like:
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Let's consider a single pattern for now, 'abc' (but I get this pattern randomly from a database, so I need double filtering: one filter for the pattern and one for the last day), and I want to extract the file of the last day, i.e. '2017-07-02'. Here I can get the common files but not the exact last-day files.
Code:
pattern = 'abc'
allfiles = os.listdir(source_dir)
m_files = [f for f in allfiles if str(f).startswith(pattern)]
print m_files
Output:
['abc_2017-07-01.tar', 'abc_2017-07-02.tar', 'abc_2017-07-03.tar']
This gives me all files related to the abc pattern, but how can I filter out only the last day's file of that pattern?
Expected: ['abc_2017-07-02.tar']
Thanks
Just a minor tweak in your code can get you the desired result:
import os
from datetime import datetime, timedelta

allfiles = os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' + str(file_date.date())
m_files = [f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with the latest date among the filenames in m_files.
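For example, applied to the list from the question (the slice x[-14:-4] picks out the YYYY-MM-DD part, which sorts correctly as a plain string):

m_files = ['abc_2017-07-01.tar', 'abc_2017-07-02.tar', 'abc_2017-07-03.tar']
latest = max(m_files, key=lambda x: x[-14:-4])
print(latest)  # -> abc_2017-07-03.tar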
Use the Python re package, like:
import re
import os

files = os.listdir(source_dir)
for file in files:
    match = re.search(r'abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
and then you can work with day in the loop to do whatever you want, for example building that list:
import re
import os

def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    return match.group(1)

files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable, you can substitute '07' with r'\d\d' or r'\d{2}'. Be careful if you have files that don't match the pattern at all: then match.group() will raise an error, since match is None. In that case use:
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    try:
        return match.group(1)
    except AttributeError:
        return None
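Putting the pieces together, a hedged end-to-end sketch (my own assembly, not from either answer) that filters on the database pattern and keeps only yesterday's file:

import os
import re
from datetime import datetime, timedelta

source_dir = '.'   # adjust to the folder from the question
pattern = 'abc'    # in the question this comes from the database
yesterday = (datetime.now() - timedelta(days=1)).date().isoformat()
regex = re.escape(pattern) + r'_\d{4}-\d{2}-\d{2}\.tar$'
m_files = [f for f in os.listdir(source_dir)
           if re.match(regex, f) and yesterday in f]
print(m_files)  # e.g. ['abc_2017-07-02.tar'] when run on 2017-07-03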
Deleting colons in a column of a CSV in Python
I have a column of different times and I want to find the values in between two different times, but I can't figure out how. For example: 09:04:00 through 09:25:00, and then just use the values in between those times. I was going to just delete the colons separating hours:minutes:seconds and do it that way, but I don't really know how to do that either. I do know how to find a value in a column, so I figured that way would be easier. Here is the CSV I'm working with:
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/03/1997,09:31:00,3045.00,3045.50,3045.00,3045.50,75
02/03/1997,09:32:00,3045.50,3045.50,3044.00,3044.00,54
02/03/1997,09:33:00,3043.50,3044.50,3043.50,3044.00,96
02/03/1997,09:34:00,3044.00,3044.50,3044.00,3044.50,27
02/03/1997,09:35:00,3044.50,3044.50,3043.50,3044.50,44
02/03/1997,09:36:00,3044.00,3044.00,3043.00,3043.00,61
02/03/1997,09:37:00,3043.50,3043.50,3043.50,3043.50,18
Thanks for the time
If you just want to replace the colons with commas, you can use the built-in string replace function:
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
line = line.replace(':', ',')
print(line)
Output:
02/03/1997,09,04,00,3046.00,3048.50,3046.00,3047.50,505
Then split on commas to separate the data:
line.split(',')
If you only want the numerical values, you could also do the following (using a regular expression):
import re

line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
values = [float(x) for x in re.sub(r'[^\w.]+', ',', line).split(',')]
print(values)
This gives you a list of numerical values that you can process:
[2.0, 3.0, 1997.0, 9.0, 4.0, 0.0, 3046.0, 3048.5, 3046.0, 3047.5, 505.0]
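Since the original goal was to keep rows between two times, here is a short follow-on sketch (my own extension of the replace idea, not part of the answer): after the replace, fields 1-3 of each row are hour, minute and second, and can be compared as integers:

line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
parts = line.replace(':', ',').split(',')
hour, minute = int(parts[1]), int(parts[2])
in_range = (9, 4) <= (hour, minute) <= (9, 25)  # between 09:04 and 09:25 inclusive
print(in_range)  # -> True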
Use the csv module! :)
import csv

with open('myFile.csv', newline='') as csvfile:
    myCsvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in myCsvreader:
        for item in row:
            item.split(':')  # splits e.g. '09:04:00' on the colons into hours, minutes, seconds
Once you have extracted the different time stamps, you can use the datetime module, such as:
from datetime import datetime, date, time

x = time(hour=9, minute=30, second=30)
y = time(hour=9, minute=30, second=42)
diff = datetime.combine(date.today(), y) - datetime.combine(date.today(), x)
print(diff.total_seconds())
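Combining both ideas into the actual filtering task (a sketch, assuming a file myFile.csv with the header shown in the question):

import csv
from datetime import datetime, time

start, end = time(9, 4), time(9, 25)
with open('myFile.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    rows = [row for row in reader
            if start <= datetime.strptime(row['TIME'], '%H:%M:%S').time() <= end]
print(len(rows))  # number of rows between 09:04:00 and 09:25:00 inclusive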
How to Pass Arguments from xlwings to VBA Excel Macro?
I was looking at How do I call an Excel macro from Python using xlwings?, and I understand it's not fully supported, but I would like to know if there is a way to do this. Something like:
from xlwings import Workbook, Application

wb = Workbook(...)
Application(wb).xl_app.Run("your_macro("%Args%")")
This can be done by doing what you propose. However, please keep in mind that this solution will not be cross-platform (Win/Mac). I'm on Windows, so the below has to be adjusted to appscript on a Mac: http://docs.xlwings.org/en/stable/missing_features.html
The VBA script can be called with the following:
linked_wb.xl_workbook.Application.Run("vba_script", variable_to_pass)
Example: let's say you have a list of strings that should be used in a data-validation list in Excel.
Python:
from xlwings import Workbook

linked_wb = Workbook.caller()
animals = ['cat', 'dog', 'snake', 'bird']
animal_list = ""
for animal in animals:
    animal_list += animal + "|"
linked_wb.xl_workbook.Application.Run("create_validation", animal_list)
Excel VBA:
Public Sub create_validation(validation_list)
    Dim validation_split() As String
    validation_split = Split(validation_list, "|")
    'The drop-down validation can only be created with a 1-dimension array.
    'We get 1-D from the Split above.
    With Sheet1.Range("A1").Validation
        .Delete
        .Add Type:=xlValidateList, AlertStyle:=xlValidAlertStop, _
             Operator:=xlBetween, Formula1:=Join(validation_split, ",")
    End With
End Sub
Python example:
import xlwings as xw

wb_path = r"set_wb_path"  # path to your workbook
wb = xw.Book(wb_path)
app = wb.app

variable_to_pass = 'test'
macro = wb.macro('moduleName.macroName')  # the macro name must be passed as a string
macro(variable_to_pass)

wb.app.quit()  # or wb.close()
As long as your VBA function accepts a variable and you pass it the same type of variable (str, int, list), this will work.