Select row with regex instead of unique value

Hello everyone. I'm making a really simple lookup in a pandas DataFrame; what I need is to look up the input I'm typing as a regex instead of comparing with == myvar.
What I have so far is very inefficient, because there are a lot of names in my DataFrame to match against, such as:
Name      LastName
NAME 1    Some Awesome
Name 2    Last Names
Nam e 3   I can keep going
Bane      Writing this is awesome
BANE 114  Lets continue
Here is what I have so far:
import pandas as pd
contacts = pd.read_csv("contacts.csv")
print("regex contacts")
nameLookUp = input("Type the name you are looking for: ")
print(nameLookUp)
desiredRegexVar = contacts.loc[contacts['Name'] == nameLookUp]
print(desiredRegexVar)
I have to type 'NAME 1' or 'Nam e 3' exactly in order to get results, or I won't get any at all. I tried using this, but it didn't work:
#regexVar = "^" + contacts.filter(regex = nameLookUp)
Thanks for the answer, @Code Different. The final code looks like this:
import pandas as pd
import re

contactos = pd.read_csv("contacts.csv")  # load the contacts DataFrame first
namelookup = input("Type the name you are looking for: ")
pattern = '^' + re.escape(namelookup)
match = contactos['Cliente'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contactos[match])

Use Series.str.contains. Tweak the pattern as appropriate:
import re
pattern = '^' + re.escape(namelookup)
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE)
contacts[match]
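As a self-contained illustration, here is a minimal sketch, assuming a small hand-built DataFrame in place of the CSV:
import pandas as pd
import re

contacts = pd.DataFrame({'Name': ['NAME 1', 'Name 2', 'Nam e 3', 'Bane', 'BANE 114']})
pattern = '^' + re.escape('name')  # user input, escaped so it is treated literally
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE, na=False)  # na=False guards against missing names
print(contacts[match])  # matches 'NAME 1' and 'Name 2': a case-insensitive prefix match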

Related

remove a single quote in a string with pySpark using regex

I need to remove a single quote in a string. The column name is Keywords. I have an array hidden in a string, so I need to use a regex within a Spark DataFrame to remove a single quote from the beginning of the string and one at the end. The string looks like this:
Keywords='["shade perennials"," shade loving perennials"," perennial plants"," perennials"," perennial flowers"," perennial plants for shade"," full shade perennials"]'
I have tried the following:
remove_single_quote = udf(lambda x: x.replace(u"'",""))
cleaned_df = spark_df.withColumn('Keywords', remove_single_quote('Keywords'))
But the single quote is still there. I have also tried (u"\'", "") and:
from pyspark.sql.functions import regexp_replace
new_df = data.withColumn('Keywords', regexp_replace('Keywords', "\'", ""))
Try regexp_replace
from pyspark.sql.functions import regexp_replace,col
cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords',"\'",""))
OR
from pyspark.sql import functions as f
cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords',"\'",""))
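If the goal is to strip only the quote at the very beginning and the one at the very end (rather than every single quote in the value), an anchored pattern is a reasonable sketch:
from pyspark.sql.functions import regexp_replace

cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords', "^'|'$", ""))  # quote at start or end only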
I have not tested it, but something like this should work. Note that ast.literal_eval has to run inside a udf, since withColumn expects a Column:
import ast
from pyspark.sql.functions import udf

parse_keywords = udf(ast.literal_eval, 'array<string>')
cleaned_df = spark_df.withColumn('Keywords', parse_keywords('Keywords'))

How to extract files with date pattern using python

I have n files in a folder, like:
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Let's consider a single pattern for now, 'abc'
(but I get this pattern randomly from the database, so I need double filtering: one for the pattern and one for the last day).
I want to extract the file for the last day, i.e. '2017-07-02'.
So far I can get the files sharing the pattern, but not the exact last-day file.
Code
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print(m_files)
output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files matching the abc pattern, but how can I filter down to only the last-day file of that pattern?
Expected:
[ 'abc_2017-07-02.tar' ]
Thanks
Just a minor tweak in your code can get you the desired result:
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with the latest date among the filenames in m_files (the slice x[-14:-4] picks out the YYYY-MM-DD part of names like 'abc_2017-07-01.tar').
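Putting the two filters together (the pattern coming from the database, then yesterday's date), a minimal sketch; the source_dir path and the pattern value are placeholders:
import os
from datetime import datetime, timedelta

source_dir = '/path/to/source_dir'  # hypothetical path
pattern = 'abc'                     # pattern fetched from the database
last_day = str((datetime.now() - timedelta(days=1)).date())

m_files = [f for f in os.listdir(source_dir)
           if f.startswith(pattern) and last_day in f]
print(m_files)  # e.g. ['abc_2017-07-02.tar'] when run on 2017-07-03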
Use Python's re package, like:
import re
import os

files = os.listdir(source_dir)
for file in files:
    match = re.search(r'abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
and then you can work with day inside the loop to do whatever you want, like creating that list:
import re
import os

def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)  # search the parameter, not the loop variable
    day = match.group(1)
    return day

files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable, you can substitute '07' with '\d\d' or '\d{2}'. Be careful if you have files that don't match the pattern at all: match.group() will then raise an error, since match is None. In that case use:
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    try:
        day = match.group(1)
    except AttributeError:  # match is None when the name does not fit the pattern
        day = None
    return day
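A quick usage sketch that drops the files which did not match at all:
files = os.listdir(source_dir)
days = [d for d in (extract_day(f) for f in files) if d is not None]
print(days)  # e.g. ['01', '02', '03'] for the abc_* files above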

Splitting the name when a word matches with one in array?

As a part of my learning, after I successfully split with help, in my next step I wanted to know if I can split the names of files when a month name in the file name matches one of the month names given in this list:
Months = ['January','February','March','April','May','June','July','August','September','October','November','December']
My file names look like this:
1. Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt
2. Denied_Calls_SMS_Sent_December_14_2016_05_33_41 PM.txt
Please note that the file names are not all the same, which is why I need to split them like this:
Non IVR Entries Transactions as one part and December_16_2016_07_49_22 PM as another.
import os
import os.path
import csv

path = 'C:\\Users\\akhilpriyatam.k\\Desktop\\tes'
text_files = [os.path.splitext(f)[0] for f in os.listdir(path)]
for v in text_files:
    print(v[0:9])
    print(v[10:])

os.chdir('C:\\Users\\akhilpriyatam.k\\Desktop\\tes')
with open('file.csv', 'wb') as csvfile:
    thedatawriter = csv.writer(csvfile, delimiter=',')
    for v in text_files:
        s = (v[0:9])
        t = (v[10:])
        thedatawriter.writerow([s, t])
Assuming that you want the filename and timestamp as the splits and the month occurs only once in the string, I hope the following code solves your problem:
import re
import calendar

fullname = 'Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt'
months = list(calendar.month_name[1:])
regex = re.compile('|'.join(months))
matches = list(re.finditer(regex, fullname))  # materialize so an empty result is falsy
if matches:
    idx = matches[0].start()
    filename, timestamp = fullname[:idx], fullname[idx:-4]
    print(filename, timestamp)
else:
    print("Month not found")
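A short usage sketch applying the same idea to both sample names from the question:
import re
import calendar

names = ['Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt',
         'Denied_Calls_SMS_Sent_December_14_2016_05_33_41 PM.txt']
regex = re.compile('|'.join(calendar.month_name[1:]))
for fullname in names:
    m = regex.search(fullname)
    if m:
        # everything before the month is the name; the rest (minus '.txt') is the timestamp
        print(fullname[:m.start()], '|', fullname[m.start():-4])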

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re
List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
    for i in List:
        try:  # allows elements to be called and, on an error, moves to the next step
            Data = json.dumps(getQuotes(i.lower()), indent=1)  # retrieves data from Google Finance
            regex = ('"LastTradePrice": "(.+?)",')  # sets parse
            pattern = re.compile(regex)  # compiles parse
            price = re.findall(pattern, Data)  # retrieves parse
            print(i)
            print(price)
        except:  # sets error coding
            Error = (i + ' Failed to load on: ' + Date2)
            print(Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try using the type() function to find out the datatype; in your case, type(price).
If the datatype is a list, use print(price[0]) and you will get the output (the number). As for the brackets, you need to check the Google data and your regex.
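If you want the value as an actual number rather than a one-element list of strings, a small sketch:
price = ['42.14']        # what re.findall returns for ticker A
value = float(price[0])  # index into the list, then convert the string
print(value)             # 42.14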

python replace string function throws asterisk wildcard error

When I use * I receive the error:
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
Edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*', 'agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object').
Edit:
I am currently using hierarchical columns and would like to replace agri only for a specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns. That author says it will get easier in 0.15.0, so I am hoping there are more recent, updated solutions.
You need to put the asterisk * at the end so that it has a preceding character to repeat (0 or more times); a leading * has nothing to repeat, which is exactly what the error says. See the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
EDIT
Based on your new, actual requirements, you can use str.contains to find the matching columns, build a dict mapping the old names to the new ones, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
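For the hierarchical (MultiIndex) case from the edit, one sketch is to rewrite only the level-2 labels with set_levels; the tuples below are taken from the question:
import pandas as pd

cols = pd.MultiIndex.from_tuples([('grand total', '2005', 'agriculture'),
                                  ('grand total', '2005', 'other')])
df = pd.DataFrame(columns=cols)
# replace only in level 2, leaving the other levels untouched
df.columns = df.columns.set_levels(
    df.columns.levels[2].str.replace('agriculture', 'agri'), level=2)
print(df.columns[0])  # ('grand total', '2005', 'agri')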