Possible to combine two lines of code into one - regex

I'm searching through a database looking for matches. I need to log the matches as well as the entries that don't match, so that I have the full database, but for the entries that match I specifically need to know the part that matches.
serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']
if self.serv[data] == 'Today':
    clotime.append('')
elif self.serv[data] == 'Tomorrow':
    clotime.append('')
elif self.serv[data] == 'Yesterday':
    clotime.append('')
else:
    clo = re.findall(r'-(.*?):', self.serv[data])
    clotime.append(clo[0])
The vast majority of the data ends up running through re.findall, but some is still handled by the initial if/elif checks.
Is there a way to condense this code and do it all with re.findall, maybe even in a single line? I need everything (the entire database) gone through and logged so I can process the database correctly when I go to display the data on a map.

Using anchors, you can match a whole string as one of the alternatives:
clo = re.search('^(?:To(?:day|morrow)|Yesterday)$|-(.*?):', self.serv[data])
if clo is not None:
    # group(1) is None when the Today/Tomorrow/Yesterday alternative matched,
    # so fall back to an empty string as in the original code
    clotime.append(clo.group(1) or '')
With your example list:
serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']
clotime = []
for data in serv:
    clo = re.search('^(?:To(?:day|morrow)|Yesterday)$|-(.*?):', data)
    if clo is not None:
        clotime.append(clo.group(1) or '')
print(clotime)

I would try something like this:
clo = re.findall(r'-(\d+):', self.serv[data])
clotime.append(clo[0] if clo else '')
If I understood your existing code correctly, you want to append an empty string whenever a closing hour can't be found in the string? This example extracts the closing hour but falls back to an empty string whenever the regex doesn't match anything.
Also, if you're only matching digits, it's better to be explicit about that.
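For instance, applied to the sample list from the question, a standalone sketch of this approach would look like the following (expected output shown as a comment):
import re

serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']
clotime = []
for data in serv:
    clo = re.findall(r'-(\d+):', data)     # closing hour, e.g. '9' from '-9:'
    clotime.append(clo[0] if clo else '')  # empty string when nothing matches
print(clotime)  # ['9', '', '10', '9', '']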

Related

How to extract parts of logs based on identification numbers?

I am trying to extract and preprocess log data for a use case.
For instance, the log consists of problem numbers with information for each ID underneath. Each element starts with:
#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change
I extracted the identification numbers and saved them as a .csv file:
import re

with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.read()  # read the whole file as one string so findall can scan it
number = re.findall('#!#!#identification_number###(.+?)#!#!#change_log###', change_log)
Now what I am trying to achieve is, that for every ID in the .csv file I can append the corresponding log content, which is:
action
change
#!#!#attribute###
Since I am rather new to Python and only started working with regex a few days ago, I was hoping for some help.
Each log for an ID starts with "#!#!#identification_number###" and ends with "#!#!#attribute5###<entry>".
I have tried the following code, but the result is empty:
In:
x = re.findall("\[^#!#!#identification_number###((.|\n)*)#!#!#attribute5###((.|\n)*)$]", str(change_log))
In:
print(x)
Out:
[]
Try this:
pattern = 'entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
re.findall(pattern, string + '#!#!#id', re.DOTALL)  # string holds the whole log text
The DOTALL flag makes the dot match newlines as well, so the second capturing group should contain the log body.
If you want the attributes for each identification number, you can parse each ID's log (obtained from the search above) with the following:
pattern = '#!#!#attribute(.*?)###(.*?)#!#'
re.findall(pattern, string_for_each_log_match + '#!#', re.DOTALL)
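As a quick self-contained check, the first pattern run against a shortened copy of the question's sample log yields one (id, log body) tuple per record; note how appending the '#!#!#id' sentinel lets the last record match too:
import re

log = '''#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
'''

pattern = 'entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
for id_no, body in re.findall(pattern, log + '#!#!#id', re.DOTALL):
    print(id_no, body.strip().splitlines())
# 96245 ['action', 'action1', 'change', '#!#!#attribute###value_change', ...]
# 96246 ['action', 'change', '#!#!#attribute###value_change', ...]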
If you put each id into the regex when you search using string.format() you can grab the lines that contain the correct changelog.
with open(r'path\to\csv.csv', 'r') as f:
    ids = [line.strip() for line in f]  # strip trailing newlines so they don't break the regex
with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.readlines()

matches = {}
for id_no in ids:
    for i in range(len(change_log)):
        reg = '#!#!#identification_number###({})#!#!#change_log###'.format(id_no)
        if re.search(reg, change_log[i]):
            matches[id_no] = i
            break
This will create a dictionary with the structure {id_no:line_no,...}.
So once you have all of the lines that tell you where each log starts, you can grab the lines you want that come after these lines.
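A minimal sketch of that last step, reusing the matches dict and change_log list from above; it assumes each ID's log runs from the line after its header up to the next header (or the end of the file):
starts = sorted(matches.values())
logs = {}
for id_no, line_no in matches.items():
    later = [s for s in starts if s > line_no]
    end = later[0] if later else len(change_log)
    logs[id_no] = change_log[line_no + 1:end]  # the log lines for this ID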

How do we fetch multiple occurrences of a regex in a single string in Groovy?

I have a string that I need to fetch the ID field from -
{"jobs":[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}
I need all the matched IDs stored in an array.
I tried the following regexes, but couldn't fetch the IDs -
/"id"\ *:\ *"(.*?)"/
/"id"\ *:\ *"(?<id>.*?)"/
I'm not sure if it matches and I'm not sure how to fetch the matched data.
Try this:
def str = '{"jobs":[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}'
def pattern = /(?<="id":")\w+(?=")/
def matcher = str =~ /$pattern/
assert matcher.collect() == ["6369c112a2ee5ca08adaa1d01b7e5c74", "bbfd87f15334c8e27b40bc46896e95c7", "90c5a32e8300da7d43ce351f7f72f0d2"]
It would surely be more appropriate to process your input with a JSON parser, since it is JSON:
def s = '''{"jobs":
[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},
{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},
{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}'''
def ids = new groovy.json.JsonSlurper().parse(s.bytes).jobs.collect{it.id}
And that sets ids to [6369c112a2ee5ca08adaa1d01b7e5c74, bbfd87f15334c8e27b40bc46896e95c7, 90c5a32e8300da7d43ce351f7f72f0d2]

regex year format authentication

I have a program where the user is asked for the session year, which needs to be in the form 20XX-20XX. The constraint is that it needs to be a year followed by the next year, e.g. 2019-2020.
For example,
Valid formats:
2019-2020
2018-2019
2000-2001
Invalid formats:
2019-2021
2000-2000
2019-2018
I am trying to validate this input using regular expressions.
My work:
import re

def add_pages(matchObject):
    return "{0:0=3d}".format(int(matchObject) + 1)

try:
    a = input("Enter Session")
    p = r'2([0-9]{3})-2'
    p1 = re.compile(p)
    x = add_pages(p1.findall(a)[0])
    p2 = r'2([0-9]{3})-2' + x
    p3 = re.compile(p2)
    l = p3.findall(a)
    if not l:
        raise Exception
    else:
        print("Authenticated")
except Exception as e:
    print("Enter session. Eg. 2019-2020")
Question:
So far I have not been able to come up with a single regex that validates this input. I did have a look at backreferencing in regex, but it only solved half my query. I am looking for ways to improve this validation process. Is there a single regex statement that will check this constraint? Let me know if you need any more information.
Do you really need to get the session year in one input?
I think it's better to have two inputs (or to just automatically set the second year to the first year + 1).
I don't know if you're aiming for something bigger and this is just an example, but using regex just doesn't seem appropriate for this task to me.
For example you could do this:
print("Enter session year")
first_year = int(input("First year: "))
second_year = int(input("Second year: "))
if second_year != (first_year + 1):
# some validation
else:
# program continues
First of all, why regex? Regex is terrible at math. It would be easier to do something like:
def check_years(string):
    # e.g. string = "2011-2012"
    years = string.split("-")
    return int(years[0]) == (int(years[1]) - 1)
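If a single check is still wanted, a sketch that combines both ideas, a regex for the 20XX-20XX format plus ordinary arithmetic for the +1 constraint (which a regex alone cannot express), could look like this:
import re

def validate_session(session):
    # format check via regex, consecutive-year check via arithmetic
    m = re.fullmatch(r'(20\d{2})-(20\d{2})', session)
    return bool(m) and int(m.group(2)) == int(m.group(1)) + 1

print(validate_session('2019-2020'))  # True
print(validate_session('2019-2021'))  # False
print(validate_session('2019-2018'))  # False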

Python Data Scraping (using Xpath) - Returning empty lists and stripping characters

I am attempting to scrape information from the website:
http://www.forexfactory.com/#tradesPositions
Now, I used to have one up and running which this forum helped me get going, but I think something has changed on the website and the script I had no longer works.
What do I need?
I would like to scrape the number of 'short' and 'long' positions for AUDUSD, EURUSD, GBPUSD, USDJPY, USDCAD, NZDUSD and USDCHF.
NOT the percentages, the actual number of traders.
What have I done?
This is for EURUSD
import lxml.html
from selenium import webdriver

driver = webdriver.Chrome(r"C:\Users\MY NAME\Downloads\Chrome Driver\chromedriver.exe")
url = 'http://www.forexfactory.com/#tradesPositions'
driver.get(url)
tree = lxml.html.fromstring(driver.page_source)
results_short = tree.xpath('//*[@id="flexBox_flex_trades/positions_tradesPositionsCopy1"]/div[1]/table/tbody/tr/td[2]/div[1]/ul[1]/li[2]/span/text()')
results_long = tree.xpath('//*[@id="flexBox_flex_trades/positions_tradesPositionsCopy1"]/div[1]/table/tbody/tr/td[2]/div[1]/ul[1]/li[1]/span/text()')
print "Forex Factory"
print "Traders Short EURUSD:", results_short
print "Traders Long EURUSD:", results_long
driver.quit()
This returns
Forex Factory
Traders Short EURUSD: ['337 Traders ', ' ']
Traders Long EURUSD: [' 259 Traders']
I would like to strip everything away from the result except for the numbers. I've tried .strip() and .replace(), but neither works on a list, which will come as no surprise to you!
Empty List
When I apply the same technique to AUDUSD I get an empty list.
import lxml.html
from selenium import webdriver

driver = webdriver.Chrome(r"C:\Users\Andrew G\Downloads\Chrome Driver\chromedriver.exe")
url = 'http://www.forexfactory.com/#tradesPositions'
driver.get(url)
tree = lxml.html.fromstring(driver.page_source)
results_short = tree.xpath('//*[@id="flexBox_flex_trades/positions_tradesPositionsCopy1"]/div[6]/table/tbody/tr/td[2]/div[1]/ul[1]/li[2]/span/text()')
results_long = tree.xpath('//*[@id="flexBox_flex_trades/positions_tradesPositionsCopy1"]/div[6]/table/tbody/tr/td[2]/div[1]/ul[1]/li[1]/span/text()')
s2 = results_short
l2 = results_long
print "Traders Short AUDUSD:", s2
print "Traders Long AUDUSD:", l2
This returns
Traders Short AUDUSD: []
Traders Long AUDUSD: []
What gives? Is the XPath not working? I just used Chrome's 'inspect element' feature, navigated to the desired number, and copied the XPath. Same method as for EURUSD.
Ideally, it would be nice to set up a list of div numbers that can be inserted into the tree.xpath calls instead of repeating the lines of code for all the different currencies, to keep things neater. So, in the XPath where it has:
/div[number]/
it would be nice to have a list, i.e. [1,2,3,4,5,6], that can be inserted, because the rest of the XPath is the same for all the currencies. Anyway, that's an optional bonus; the priority is to get a return for all the currencies listed.
THANKS
You can remove all the whitespace from your results with the strip method you mentioned; here is my sample code:
# strip each entry and drop the ones that are empty afterwards
results_short = [s.strip() for s in results_short if s.strip()]
results_long = [s.strip() for s in results_long if s.strip()]
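If, as the question asks, you want only the numbers rather than strings like '337 Traders ', a small regex pull (a sketch, applied to the cleaned lists above) gets you the digits alone:
import re

def numbers_only(entries):
    # keep the first run of digits in each entry, e.g. '337 Traders ' -> '337'
    return [re.search(r'\d+', s).group() for s in entries if re.search(r'\d+', s)]

results_short = numbers_only(results_short)  # e.g. ['337']
results_long = numbers_only(results_long)    # e.g. ['259']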
For the problem where you cannot get the result for AUDUSD: the values are not loaded into the page until you click the "expand" button. But I have found you can get the result from the following page: http://www.forexfactory.com/trades.php
So you can change the value of url as:
url = ('http://www.forexfactory.com/trades.php')
For this page, since the element's id has changed, you need to update your XPaths to:
results_short = tree.xpath('//*[@id="flexBox_flex_trades/positions_tradesPositions"]/div[6]/table/tbody/tr/td[2]/div[1]/ul[1]/li[2]/span/text()')
results_long = tree.xpath('//*[@id="flexBox_flex_trades/positions_tradesPositions"]/div[6]/table/tbody/tr/td[2]/div[1]/ul[1]/li[1]/span/text()')
Then apply the strip logic mentioned above, and you should get the correct results.
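As for the optional bonus: a minimal sketch of the list-of-div-numbers idea, reusing the tree object built earlier. The pairing of each div index to a currency is an assumption here and would need checking against the live page:
# assumed order of the currency tables; verify against the live page
currencies = ['AUDUSD', 'EURUSD', 'GBPUSD', 'USDJPY', 'USDCAD', 'NZDUSD', 'USDCHF']
xpath_tmpl = ('//*[@id="flexBox_flex_trades/positions_tradesPositions"]'
              '/div[{0}]/table/tbody/tr/td[2]/div[1]/ul[1]/li[{1}]/span/text()')
positions = {}
for i, pair in enumerate(currencies, start=1):
    positions[pair] = {
        'long': tree.xpath(xpath_tmpl.format(i, 1)),   # li[1] holds the long side
        'short': tree.xpath(xpath_tmpl.format(i, 2)),  # li[2] holds the short side
    }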

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string to two separate variables, and then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing you can use islice(), which avoids building intermediate copies and can be more efficient when taking slices of large sequences.
Code would look like this:
from itertools import islice

with open('input.txt') as f:
    data = f.readlines()

first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))

first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, you should make sure the data source you read from is long enough. If it is not, indexing a too-short list raises an IndexError, and consuming an exhausted islice iterator with next() raises StopIteration.
Using regex
The OP additionally asked in the comments for an approach that avoids lists.
Since reading a file yields either one string (via read()) or a list of lines (via readlines()), I suggested running a regex over the whole string instead.
I cannot say anything about the performance comparison between list/string handling and regex operations, but this should do the job:
import re

# named groups: first line, second line, then everything else
regex = r'(?P<first>[^\n]+)\n(?P<second>[^\n]+)\n(?P<rest>.+)'
preg = re.compile(regex, re.DOTALL)

with open('input.txt') as f:
    data = f.read()

match = preg.search(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines:
lines = input_string.splitlines()
After that, you want to assign the first and second lines to variables and the rest to another variable:
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2)  # this only splits off the first two lines
title = lines[0]
description = lines[1]
rest = lines[2]
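As a quick check, a sketch of the split approach applied to the sample text from the question:
text = """Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!"""

title, description, rest = text.split('\n', 2)
print(title)        # Title of a string
print(description)  # Description of the string
print(rest)         # the remaining lines, still joined by '\n'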