XML data handling: regex find/replace with conditional values

I have an XML file which looks like this:
<DocumentElement>
<Table1>
<Date>2013-08-24</Date>
<Time>00:07:23</Time>
<Type>in</Type>
<Number>393483419761</Number>
<Name>Marc</Name>
<Message>Lorem ipsum</Message>
</Table1>
<Table1>
<Date>2013-08-24</Date>
<Time>00:09:09</Time>
<Type>out</Type>
<Number>1215468498561</Number>
<Name>Marc</Name>
<Message>Lorem ipsum</Message>
</Table1>
</DocumentElement>
What I want to do is check the Date value: if the month is 01, add <Month>january</Month> after </Date>; if the month is 02, add <Month>february</Month>; and so on.
So what I have so far is either:
<Date>(\d{4})-01-(\d{2})</Date>
<Date>$1-01-$2</Date>
<Month>january</Month>
or I'd like to do something like:
<Date>(\d{4})-(\d{2})-(\d{2})</Date>
if ($2 = 01) {
    <Date>$1-$2-$3</Date>
    <Month>january</Month>
}
elseif ($2 = 02) {
    <Date>$1-$2-$3</Date>
    <Month>february</Month>
}
What's the usual way to handle and manipulate data like this?

Normally, if you are parsing XML, you would use a real parser instead of regular expressions. But in your particular case it is a very simple operation you want to do: go over each line, print it, and if the current line is a date, extract the month and print an additional line.
Here's an example Python script that does that.
import re
months = ["January", "February", "March", "April", "May", "June", "July",
"August", "September", "October", "November", "December"]
with open(your_xml_file) as f:
for line in f:
print line
match = re.search(r'<Date>\d{4}-(?P<month>\d{2})-\d{2}</Date>', line)
if match is not None:
print months[int(match.group('month')) - 1]
Note, however, that this will fail as soon as you insert whitespace or add something else like attributes to Date. That's why it's better to use a real parser. But if your format is exactly as you stated, then it is faster to just write a small throwaway script like this.
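If the file does grow whitespace or attributes, a parser-based version is not much longer. Here is a minimal sketch using the standard-library xml.etree.ElementTree; the filenames are placeholders, and it assumes the file is well-formed (i.e. the closing </DocumentElement> tag is present):

import xml.etree.ElementTree as ET

months = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

tree = ET.parse("input.xml")          # placeholder filename
root = tree.getroot()

for table in root.findall("Table1"):
    date = table.find("Date")
    month_num = int(date.text.split("-")[1])      # "2013-08-24" -> 8
    month_elem = ET.Element("Month")
    month_elem.text = months[month_num - 1]
    # insert <Month> directly after <Date>
    table.insert(list(table).index(date) + 1, month_elem)

tree.write("output.xml")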

So, just for the record, this is my final code, which adds another regex substitution and outputs everything to a new file:
x = 'marco_2013_24_08'  # filename without extension

import re
import sys

months = ["<Month>gennaio</Month>", "<Month>febbraio</Month>", "<Month>marzo</Month>",
          "<Month>aprile</Month>", "<Month>maggio</Month>", "<Month>giugno</Month>",
          "<Month>luglio</Month>", "<Month>agosto</Month>", "<Month>settembre</Month>",
          "<Month>ottobre</Month>", "<Month>novembre</Month>", "<Month>dicembre</Month>"]

sys.stdout = open('_' + x + 'regexed.xml', 'w')

with open(x + '.xml') as f:
    for line in f:
        im = re.sub(r'<Message>Image: .+/(IMG.+\.jpg)</Message>', r'<Image href="Bilder/\1"></Image>', line)
        print im
        mm = re.search(r'<Date>\d{4}-(?P<month>\d{2})-\d{2}</Date>', line)
        if mm is not None:
            print months[int(mm.group('month')) - 1]

Related

Select row with regex instead of unique value

Hello everyone, I'm making a really simple lookup in a pandas DataFrame. What I need to do is look up the input I'm typing as a regex instead of comparing it with == myvar.
What I have so far is very inefficient, because there are a lot of names in my DataFrame; instead of matching one exact value I'd like to match any of them, which could look like:
Name LastName
NAME 1 Some Awesome
Name 2 Last Names
Nam e 3 I can keep going
Bane Writing this is awesome
BANE 114 Lets continue
However, this is what I have:
import pandas as pd
contacts = pd.read_csv("contacts.csv")
print("regex contacts")
nameLookUp = input("Type the name you are looking for: ")
print(nameLookUp)
desiredRegexVar = contacts.loc[contacts['Name'] == nameLookUp]
print(desiredRegexVar)
I have to type 'NAME 1' or 'Nam e 3' exactly in order to get results, or I won't get any at all. I tried using this, but it didn't work:
#regexVar = "^" + contacts.filter(regex = nameLookUp)
Thanks for the answer, @Code Different.
The code looks like this:
import pandas as pd
import re

contactos = pd.read_csv("contacts.csv")  # the DataFrame loaded as in the question above
namelookup = input("Type the name you are looking for: ")
pattern = '^' + re.escape(namelookup)
match = contactos['Cliente'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contactos[match])
Use Series.str.contains. Tweak the pattern as appropriate:
import re
pattern = '^' + re.escape(namelookup)
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE)
contacts[match]
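For illustration, a self-contained sketch built from the sample rows in the question (the DataFrame below is just a stand-in for contacts.csv):

import re
import pandas as pd

# stand-in for the contacts.csv data shown in the question
contacts = pd.DataFrame({
    "Name": ["NAME 1", "Name 2", "Nam e 3", "Bane", "BANE 114"],
    "LastName": ["Some Awesome", "Last Names", "I can keep going",
                 "Writing this is awesome", "Lets continue"],
})

pattern = "^" + re.escape("name")   # as if the user typed "name"
match = contacts["Name"].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contacts[match])              # matches "NAME 1" and "Name 2"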

RegEx for matching the month, day and year

I'm trying to find a regular expression to extract the month, day and year from a datetime stamp in this format:
01/20/2019 12:34:54
It should return a list:
['01', '20', '2019']
I know this can be solved using:
dt.split(' ')[0].split('/')
But, I'm trying to find a regex to do it:
[^\/\s]+
But, I need it to exclude everything after the space.
As you are expecting the date, month and year to be returned as a list, you can use this Python code:
import re
s = '01/20/2019 12:34:54'
print(re.findall(r'\d+(?=[ /])', s))
Prints,
['01', '20', '2019']
Otherwise, you can write your regex as:
(\d{2})/(\d{2})/(\d{4})
and get the date, month and year from group 1, group 2 and group 3.
Regex Demo
The Python code in that case would be:
import re
s = '01/20/2019 12:34:54'
m = re.search(r'(\d{2})/(\d{2})/(\d{4})', s)
if m:
    print([m.group(1), m.group(2), m.group(3)])
Prints,
['01', '20', '2019']
You should absolutely be taking advantage of Python's date/time API here. Use strptime to parse your input datetime string to a bona fide Python datetime. Then, just build a list, accessing the various components you need.
dt = "01/20/2019 12:34:54"
dto = datetime.strptime(dt, '%m/%d/%Y %H:%M:%S')
list = [dto.month, dto.day, dto.year]
print(list)
[1, 20, 2019]
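If the zero-padded strings from the question are needed rather than integers, strftime can format the components back out of the parsed datetime; a small sketch:

from datetime import datetime

dt = "01/20/2019 12:34:54"
dto = datetime.strptime(dt, "%m/%d/%Y %H:%M:%S")

# strftime returns zero-padded strings, matching the requested output
parts = [dto.strftime("%m"), dto.strftime("%d"), dto.strftime("%Y")]
print(parts)   # ['01', '20', '2019']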
If you really want/need to work with the original datetime string, then split provides an option, without even formally using a regex:
dt = "01/20/2019 12:34:54"
dt = dt.split()[0].split('/')
print(dt)
['01', '20', '2019']
This RegEx might help you to do so.
([0-9]+)\/([0-9]+)\/([0-9]+)\s[0-9]+:[0-9]+:[0-9]+
Code:
import re
string = '01/20/2019 12:34:54'
matches = re.search(r'([0-9]+)/([0-9]+)/([0-9]+)', string)
if matches:
    print([matches.group(1), matches.group(2), matches.group(3)])
else:
    print('Sorry! No matches! Something is not right! Call 911')
Output
['01', '20', '2019']

python Find the most reported month

I am trying to find October (mentioned 2 times). I had the idea to use a dictionary to solve this problem. However, I struggled a lot to figure out how to find/separate the months, and I was not able to use my solution for the first str value, where there are some spaces. Can someone please suggest how I can modify that split section to cover '-', ',' and whitespace?
import re

#str="May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"
str="May-29-1990,Oct-18-1980,Sept-1-1980,Oct-2-1990"
val=re.split(',',str)
monthList=[]
myDictionary={}

#put the months in a list
def sep_month():
    for item in val:
        if not item.isdigit():
            month,day,year=item.split("-")
            monthList.append(month)

#process the month list from above
def count_month():
    for item in monthList:
        if item not in myDictionary.keys():
            myDictionary[item]=1
        else:
            myDictionary[item]=myDictionary.get(item)+1
    for k,v in myDictionary.items():
        if v==2:
            print(k)

sep_month()
count_month()
from datetime import datetime
import calendar
from collections import Counter
datesString = "May-29-1990,Oct-18-1980,Sep-1-1980,Oct-2-1990"
datesListString = datesString.split(",")
datesList = []
for dateStr in datesListString:
    datesList.append(datetime.strptime(dateStr, '%b-%d-%Y'))
monthsOccurrencies = Counter((calendar.month_name[date.month] for date in datesList))
print(monthsOccurrencies)
# Counter({'October': 2, 'May': 1, 'September': 1})
Something to be aware of in my solution: with %b for the month, "Sept" had to be changed to "Sep" to work (%b is the month as the locale's abbreviated name). You can use either full month names (%B) or abbreviated names (%b). If you cannot get the big string with correctly formatted month names, just replace the wrong ones ("Sept" with "Sep", for example) and always work with date objects.
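As a sketch of that replacement idea, applied to the commented-out string in the question (the one with "Sept" and stray spaces around the commas):

from datetime import datetime
from collections import Counter
import calendar

datesString = "May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"

# strip whitespace and normalise "Sept" to "Sep" before parsing with %b
cleaned = [d.strip().replace("Sept-", "Sep-") for d in datesString.split(",")]
dates = [datetime.strptime(d, "%b-%d-%Y") for d in cleaned]
print(Counter(calendar.month_name[d.month] for d in dates))
# Counter({'October': 2, 'May': 1, 'September': 1})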
I'm not sure that regex is the best tool for this job; I would just use strip() along with split() to handle your whitespace issues and get a list of just the month abbreviations. Then you could create a dict with counts by month using the list method count(). For example:
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = {m: months.count(m) for m in set(months)}
print(month_counts)
# {'May': 1, 'Oct': 2, 'Sept': 1}
Or even better with collections.Counter:
from collections import Counter
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = Counter(months)
print(month_counts)
# Counter({'Oct': 2, 'May': 1, 'Sept': 1})
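If you do want to stay with re.split, as the question asks, one way (again just a sketch) is to split on commas with optional surrounding whitespace and then take the part before the first hyphen:

import re
from collections import Counter

dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'

# split on "," with any surrounding whitespace, then keep the month part
months = [d.split('-')[0] for d in re.split(r'\s*,\s*', dates.strip())]
print(Counter(months))
# Counter({'Oct': 2, 'May': 1, 'Sept': 1})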

Identifying dates in a string in python using dateutil returning no output

I am trying to identify dates from a column containing text entries and output the dates to a text file. However, my code is not returning any output. I can't seem to figure out what I did wrong in my code. I'd appreciate some help on this.
My Code:
import csv
from dateutil.parser import parse
with open('file1.txt', 'r') as f_input, open('file2.txt', 'w') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    for row in csv_input:
        x = str(row[3])
        def is_date(x):
            try:
                parse(x)
                csv_output.writerow([row[0], row[1], row[2], row[3], row[4]])
                # no return value in case of success
            except ValueError:
                return False
        is_date(x)
Guessing somewhat at your input, e.g.:
1,2,3, This is me on march first of 2018 at 2:15 PM, 2015
3,4,5, She was born at 12pm on 9/11/1980, 2015
a version of what you want could be
from dateutil.parser import parse
with open("input.txt", 'r') as inFilePntr, open("output.txt", 'w') as outFilePntr:
for line in inFilePntr:
clmns = line.split(',')
clmns[3] = parse( clmns[3], fuzzy_with_tokens=True )[0].strftime("%Y-%m-%d %H:%M:%S")
outFilePntr.write( ', '.join(clmns) )
Note, as you do not touch the other columns, I just leave them as text; hence, no need for csv. You never did anything with the return value of parse. I use the fuzzy token, as my column three has the date somewhat hidden in other text. The returned datetime object is transformed into a string of my liking (see here) and inserted in column three, replacing the old value.
I recombine the strings with comma separation again and write them into output.txt, which looks like:
1, 2, 3, 2018-03-01 14:15:00, 2015
3, 4, 5, 1980-09-11 12:00:00, 2015
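For clarity, a small sketch of what fuzzy_with_tokens actually returns, using one of the guessed input sentences above; the [0] in the code above picks the datetime out of the returned tuple:

from dateutil.parser import parse

# fuzzy_with_tokens=True returns a (datetime, skipped_tokens) tuple
result = parse("She was born at 12pm on 9/11/1980", fuzzy_with_tokens=True)
print(result[0])   # 1980-09-11 12:00:00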

python replace string function throws asterisk wildcard error

When I use * I receive the error:
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
Edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])

In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []

In [46]: df.columns.str.replace('/*agriculture*','agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object').
Edit:
I am currently using hierarchical columns and would like to replace 'agriculture' with 'agri' only at a specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns,
and that author says it will get easier in 0.15.0, so I am hoping there are more recent, updated solutions.
You need the asterisk * at the end: it quantifies the preceding character, matching it zero or more times; see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
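To see where the "nothing to repeat" error comes from: a leading * has no preceding token to quantify, so the regex engine rejects the pattern. A small sketch (the exact error wording can vary between Python versions):

import re

try:
    re.compile('*agriculture')        # '*' has nothing before it to repeat
except re.error as exc:
    print(exc)                        # e.g. "nothing to repeat at position 0"

re.compile(r'\*agriculture')          # escaping the asterisk compiles fine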
EDIT
Based on your new and actual requirements, you can use str.contains to find matches, use this to build a dict mapping the old names to the new ones, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
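For the hierarchical-columns requirement mentioned in the question's edit, DataFrame.rename also accepts a level argument, so the replacement can be restricted to one level of the MultiIndex. A sketch, assuming the columns look like the tuples shown above:

import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ('grand total', '2005', 'agriculture'),
    ('grand total', '2005', 'other'),
])
df = pd.DataFrame(columns=cols)

# only level 2 of the column MultiIndex is renamed
df = df.rename(columns={'agriculture': 'agri'}, level=2)
print(df.columns.tolist())
# [('grand total', '2005', 'agri'), ('grand total', '2005', 'other')]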