I have data like this in the database:
61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28
where a part like 10/16 (without #) is invalid and should not be used in the calculation,
but every other part has the format min_hr + "/" + min_hrv + "#" + max_hr + "/" + max_hrv
The task is to get the AVG value by this pseudo-formula: [ sum(all(min_hrv)) + sum(all(max_hrv)) ] / [ count(all(min_hrv)) + count(all(max_hrv)) ]. For the example string the result is ((10 + 12 + 28 + 23) + (12 + 33 + 34 + 28)) / 8 = 22.5 (22 with integer division)
What I tried is:
SELECT regexp_replace(
'61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28',
',\d+/\d+,', ',',
'g'
);
to remove the invalid data, but 10/16 is still in the string. The result is:
regexp_replace
--------------------------------------------------
61/10#61/12,10/16,0/12#61/33,0/28#0/34,0/23#0/28
If the string gets cleaned properly, my plan is to split it into an array like this for max (not a full solution, since it yields an empty string); I have no solution for min:
SELECT
regexp_split_to_array(
regexp_replace(
'61/10#61/12,0/12,0/12#61/33,0/28#0/34,0/23#0/28',
',\d+/\d+,', ',',
'g'
)
,',?\d+/\d+#\d+/'
);
result is:
regexp_split_to_array
-----------------------
{"",12,33,34,28}
and then calculate the data, something like this:
SELECT ((
SELECT sum(tmin.unnest)
FROM
(SELECT unnest('{10,12,28,23}'::int[])) as tmin
)
+
(
SELECT sum(tmax.unnest)
FROM
(SELECT unnest('{12,33,34,28}'::int[])) as tmax
))
/
(SELECT array_length('{12,33,34,28}'::int[], 1) * 2)
Maybe someone knows a simpler and more correct way to do this?
Use regexp_matches():
select (regexp_matches(
'61/10#61/12,0/12,0/12#61/33,0/28#0/34,0/23#0/28',
'\d+#\d+/(\d+)',
'g'))[1]
regexp_matches
----------------
12
33
34
28
(4 rows)
The whole calculation may look like this:
with my_data(str) as (
values
('61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28')
),
min_max as (
select
(regexp_matches(str, '(\d+)#\d+', 'g'))[1] as min_hrv,
(regexp_matches(str, '\d+#\d+/(\d+)', 'g'))[1] as max_hrv
from my_data
)
select avg(min_hrv::int+ max_hrv::int) / 2 as result
from min_max;
result
---------------------
22.5000000000000000
(1 row)
The pattern you are looking for should match a # char, a run of digits and a / char, and then capture the digits that follow. With regexp_matches, you can extract just a part of the pattern if you wrap that part in a pair of parentheses.
The solution is
regexp_matches(your_col, '#\d+/(\d+)', 'g')
Note that g stands for global, meaning that all occurrences found in the string will be returned.
Pattern details
# - a # char
\d+ - 1 or more (+) digits
/ - a / char
(\d+) - Capturing group 1: 1 or more digits
You may extract specific bits from your data if you use a single pair of parentheses in different parts of the '(\d+)/(\d+)#(\d+)/(\d+)' regex. To extract min_hr, you'd use '(\d+)/\d+#\d+/\d+'.
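For comparison, the same extraction can be sketched outside the database. This is only an illustration in Python, not part of either answer: the function name average_hrv and the choice to return None when nothing matches are assumptions. It relies on the point made above that only parts containing # match the four-number pattern, so invalid parts such as 10/16 are skipped without a separate cleaning step.

```python
import re

# min_hr/min_hrv#max_hr/max_hrv ; parts without '#' never match.
PAIR = re.compile(r"(\d+)/(\d+)#(\d+)/(\d+)")


def average_hrv(record):
    """Average of all min_hrv and max_hrv values in one record string.

    Returns None when the record contains no valid pair (an assumption;
    the question does not say what should happen in that case).
    """
    values = []
    for _min_hr, min_hrv, _max_hr, max_hrv in PAIR.findall(record):
        values.append(int(min_hrv))
        values.append(int(max_hrv))
    if not values:
        return None
    return sum(values) / len(values)


sample = "61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28"
print(average_hrv(sample))  # 22.5
```

The result agrees with the 22.5 returned by the SQL query above.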
I have to paste 3000 URLs a day that are unformatted.
Can I set up code to convert the raw pasted data to a string?
(Example raw data) - 13 Michael Way Cottees NSW 2017
(Example changed data) - "13 Michael Way Cottees NSW 2017"
I have tried
RAW_URL = 13 Michael Way Cottees NSW 2017 + " "
RAW_URL = str(13 HOADLEY ST MAWSON ACT 2607)
RAW_DATA = ' " ' + (13 HOADLEY ST MAWSON ACT 2607) + ' " '
I keep getting "invalid syntax" error and not having much luck with google.
Once that works, it will be folded into the code below, replacing the single PASTED_CRM_DATA input with a list:
import requests
import csv
from lxml import html
import time
import sys
text2search = '''RECENTLY SOLD'''
PASTED_CRM_DATA = "13 HOADLEY ST MAWSON ACT 2607"
URL_LIST = 'https://www.realestate.com.au/property/' + str(PASTED_CRM_DATA.replace(' ', '-').lower()),
with open('REA.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(URL_LIST):
        page = requests.get(url)
        print '\r' 'Scraping URL ' + str(index+1) + ' of ' + str(len(URL_LIST))+ ' ' + url,
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
Any input is appreciated
First of all, you should understand what strings are in Python.
In the examples you tried:
RAW_URL = 13 Michael Way Cottees NSW 2017 + " "
RAW_URL = str(13 HOADLEY ST MAWSON ACT 2607)
RAW_DATA = ' " ' + (13 HOADLEY ST MAWSON ACT 2607) + ' " '
Here the characters you want to use as a string are interpreted as actual code. To make your intention clear to the interpreter, put single quotes ' (or double quotes) around them.
RAW_URL = '13 Michael Way Cottees NSW 2017'
RAW_DATA = '13 HOADLEY ST MAWSON ACT 2607'
To add literal quote characters, use string concatenation:
RAW_URL = '"' + '13 Michael Way Cottees NSW 2017' + '"'
Though I'm not sure what you mean by raw paste data. Where is the data copied from? Is it pasted by hand or produced by the program?
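If the raw data is pasted as one address per line, a minimal sketch for turning it into the URL_LIST from the question's script could look like this. The names RAW_PASTE and build_urls are made up for illustration, and it assumes every non-empty line is exactly one address. A triple-quoted string is what lets you paste many lines without quoting each one by hand.

```python
# Paste the raw lines between the triple quotes (hypothetical sample data).
RAW_PASTE = """
13 Michael Way Cottees NSW 2017
13 HOADLEY ST MAWSON ACT 2607
"""

BASE_URL = "https://www.realestate.com.au/property/"


def build_urls(raw_paste):
    """Turn pasted text (one address per line) into a list of URLs."""
    urls = []
    for line in raw_paste.splitlines():
        address = line.strip()
        if not address:
            continue  # skip blank lines left over from pasting
        # Lowercase and join the words with '-', as the question's code does.
        urls.append(BASE_URL + "-".join(address.lower().split()))
    return urls


URL_LIST = build_urls(RAW_PASTE)
for url in URL_LIST:
    print(url)
```

Using split() without arguments also collapses repeated spaces, so an address pasted with double spaces does not end up with "--" in the URL.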
For the past few days (weeks, months, years maybe if you count my on-again off-again search and attempts) I've been trying to make or find a RegEx filter to help me remove all redundant parentheses found in my code.
A worst case scenario of what the regex filter will have to deal with is attached. As is a best case scenario
return ((((((((((((((((((((((((((getHumanReadableLine("avHardwareDisable") + getHumanReadableLine("hasAccessibility")) + getHumanReadableLine("hasAudio")) + getHumanReadableLine("hasAudioEncoder")) + getHumanReadableLine("hasEmbeddedVideo")) + getHumanReadableLine("hasIME")) + getHumanReadableLine("hasMP3")) + getHumanReadableLine("hasPrinting")) + getHumanReadableLine("hasScreenBroadcast")) + getHumanReadableLine("hasScreenPlayback")) + getHumanReadableLine("hasStreamingAudio")) + getHumanReadableLine("hasStreamingVideo")) + getHumanReadableLine("hasTLS")) + getHumanReadableLine("hasVideoEncoder")) + getHumanReadableLine("isDebugger")) + getHumanReadableLine("language")) + getHumanReadableLine("localFileReadDisable")) + getHumanReadableLine("manufacturer")) + getHumanReadableLine("os")) + getHumanReadableLine("pixelAspectRatio")) + getHumanReadableLine("playerType")) + getHumanReadableLine("screenColor")) + getHumanReadableLine("screenDPI")) + getHumanReadableLine("screenResolutionX")) + getHumanReadableLine("screenResolutionY")) + getHumanReadableLine("version")));
return ((((name + ": ") + Capabilities[name]) + "\n"));
As you can see there's... a few... redundant parentheses in my code. Been working actively with these for a very long time but have always tried to clean up what I come across and been trying to find a faster way to do it.
So one example of how the "clean" code would look, I'm hoping at least!
return (name + ": " + Capabilities[name] + "\n");
return name + ": " + Capabilities[name] + "\n";
Either one is acceptable, to be completely honest, as long as the code itself doesn't get messed up and change how it works.
I greatly appreciate any answers anyone can give me. Please don't Mock what I do or am trying to achieve. I haven't worked much with regex or similar things before...
And just to humour you... Here's my "RegExp" for my "clean" example
(return) ({1,}((.[^)]{1,}))(.{1,}))(.{1,})){1,}
$1 $2 $3 $4 // output
oh... Forgot to mention
(!(testCrossZ()))
Might appear at times as well but those aren't as big of an issue to clean up manually if needed.
P.S... There are a "LOT" of occurrences of the redundant parentheses... Like... Maybe thousands... Most likely thousands.
Not sure if it applies for actionscript, but for Java you can do: Main Menu | Analyze | Run Inspection by Name | type "parentheses" | select "Unnecessary parentheses" | run in the whole project and fix all problems
Result:
return getHumanReadableLine("avHardwareDisable") + getHumanReadableLine("hasAccessibility")
+ getHumanReadableLine("hasAudio") + getHumanReadableLine("hasAudioEncoder")
+ getHumanReadableLine("hasEmbeddedVideo") + getHumanReadableLine("hasIME")
+ getHumanReadableLine("hasMP3") + getHumanReadableLine("hasPrinting")
+ getHumanReadableLine("hasScreenBroadcast") + getHumanReadableLine("hasScreenPlayback")
+ getHumanReadableLine("hasStreamingAudio") + getHumanReadableLine("hasStreamingVideo")
+ getHumanReadableLine("hasTLS") + getHumanReadableLine("hasVideoEncoder")
+ getHumanReadableLine("isDebugger") + getHumanReadableLine("language")
+ getHumanReadableLine("localFileReadDisable") + getHumanReadableLine("manufacturer")
+ getHumanReadableLine("os") + getHumanReadableLine("pixelAspectRatio")
+ getHumanReadableLine("playerType") + getHumanReadableLine("screenColor")
+ getHumanReadableLine("screenDPI") + getHumanReadableLine("screenResolutionX")
+ getHumanReadableLine("screenResolutionY") + getHumanReadableLine("version");
I honestly haven't understood the exact output format that you wanted, but for starters, clearing out the unnecessary parentheses can be done with pure JavaScript as follows.
var text = 'return ((((((((((((((((((((((((((getHumanReadableLine("avHardwareDisable") + getHumanReadableLine("hasAccessibility")) + getHumanReadableLine("hasAudio")) + getHumanReadableLine("hasAudioEncoder")) + getHumanReadableLine("hasEmbeddedVideo")) + getHumanReadableLine("hasIME")) + getHumanReadableLine("hasMP3")) + getHumanReadableLine("hasPrinting")) + getHumanReadableLine("hasScreenBroadcast")) + getHumanReadableLine("hasScreenPlayback")) + getHumanReadableLine("hasStreamingAudio")) + getHumanReadableLine("hasStreamingVideo")) + getHumanReadableLine("hasTLS")) + getHumanReadableLine("hasVideoEncoder")) + getHumanReadableLine("isDebugger")) + getHumanReadableLine("language")) + getHumanReadableLine("localFileReadDisable")) + getHumanReadableLine("manufacturer")) + getHumanReadableLine("os")) + getHumanReadableLine("pixelAspectRatio")) + getHumanReadableLine("playerType")) + getHumanReadableLine("screenColor")) + getHumanReadableLine("screenDPI")) + getHumanReadableLine("screenResolutionX")) + getHumanReadableLine("screenResolutionY")) + getHumanReadableLine("version")));',
r = /\(((getHumanReadableLine\("\w+"\)[\s\+]*)+)\)/g,
temp = "";
while (text != temp) {
temp = text;
text = text.replace(r,"$1");
}
document.write('<pre>' + text + '</pre>');
From this point on, it shouldn't be a big deal to convert the reduced text into the desired output format.
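The same repeat-until-nothing-changes idea can be sketched in Python. This is an illustration under the same narrow assumption as the JavaScript above: it only unwraps parentheses that enclose nothing but calls of one function with a single quoted word argument, joined by +. The function name strip_call_parens and the shortened sample line are made up for the example.

```python
import re


def strip_call_parens(text, call_name="getHumanReadableLine"):
    """Repeatedly unwrap "(call + call + ...)" groups until none are left."""
    # One pair of parentheses around one or more calls separated by '+'.
    pattern = re.compile(
        r'\(((?:' + re.escape(call_name) + r'\("\w+"\)[\s+]*)+)\)'
    )
    previous = None
    while text != previous:  # stop once a pass changes nothing
        previous = text
        text = pattern.sub(r"\1", text)
    return text


sample = ('return (((getHumanReadableLine("hasAudio") + '
          'getHumanReadableLine("hasMP3")) + getHumanReadableLine("os")));')
cleaned = strip_call_parens(sample)
print(cleaned)
```

Each pass removes only the innermost matching pair, which is why the loop is needed. Dropping the parentheses is safe here only because the expression is a plain left-to-right chain of +; it is not a general-purpose parenthesis remover.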
I need to replace some text inside a file with the python re module.
Here is the input value :
<li><span class="PCap CharOverride-4">Contrôles</span> <span class="PCap CharOverride-4">Testes</span></li>
and the expected output is this :
<li><span class="PCap CharOverride-4">C<span style="font-size:83%">ONTRôLES</span></span>
<span class="PCap CharOverride-4">T<span style="font-size:83%">ESTES</span></span></li>
but instead, I get this as the result :
<li><span class="PCap CharOverride-4">C<span style="font-size:83%">ONTRôLES</span></span> <span class="PCap CharOverride-4">C<span style="font-size:83%">ONTRôLES</span></span></li>
Is there something that I missed ?
Here is what I've done so far :
for line in file_data.readlines():
    #print(line)
    reg = re.compile(r'(?P<b1>(<'+balise_name+' class="(([a-zA-Z0-9_\-]*?) |)'+class_value+')(| ([a-zA-Z0-9_\-]*?))">)(?P<maj>([A-ZÀÁÂÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]))(?P<min>([a-zàáâãäåæçèéëìíîïðòóôõöøùúûüýÿµœš]*?))(?P<b2>(<\/'+balise_name+'>))')
    #print(reg)
    search = reg.findall(line)
    print(search)
    if (search != None):
        for matchObj in search:
            print(matchObj)
            #print(matchObj[8])
            print(line)
            balise1 = matchObj[0] #search.group('b1')
            print(balise1)
            balise2 = matchObj[10] #matchObj.group('b2')
            print(balise2)
            maj = matchObj[6] #matchObj.group('maj')
            print(maj)
            min = matchObj[8] #matchObj.group('min')
            print(min)
            sub_str = balise1+""+maj+"<span style=\"font-size:83%\">"+min.upper()+"</span>"+balise2
            line = re.sub(reg, sub_str, line)
    # open the file to append the line
    filename = file_name.split(".")
    #file_result = open(filename[0]+"-OK."+filename[1], "a")
    #file_result.writelines(line)
    #file_data.writelines(line)
    #file_result.close()
    print(line)
NB : I don't know how to use Python's BeautifulSoup module, which is why I'm doing it manually.
Pardon my poor English.
Thanks for your answer !!
So, I totally forgot about this question but here is the solution I came up with after fixing the code I wrote long time ago :
for line in file_data.readlines():
    reg = re.compile(r'(?P<b1>(\<' + balise_name + ' class=\"(([a-zA-Z0-9_\-]*?) |)' + class_value +
                     ')(| ([a-zA-Z0-9_\-]*?))\"\>)(?P<maj>([A-ZÀÁÂÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]))(?P<min>([a-zàáâãäåæçèéëìíîïðòóôõöøùúûüýÿµœš]*?))(?P<b2>(\<\/' + balise_name + '\>))')
    print(line)
    while reg.search(line):
        search = reg.search(line)
        if search:
            print(search)
            while search:
                balise1 = search[0]  # the whole matched text
                print('b1 : ' + str(balise1))
                balise2 = search[11]  # search.group('b2')
                print('b2 : ' + str(balise2))
                maj = search[7]  # search.group('maj')
                print('maj : ' + str(maj))
                min = search[9]  # search.group('min')
                print('min : ' + str(min))
                # search[1] is the opening tag, i.e. search.group('b1')
                sub_str = search[1] + "" + maj + "<span style=\"font-size:83%\">" + min.upper() + \
                          "</span>" + balise2
                print(sub_str)
                # plain string replacement, so the matched text is not re-read as a regex
                line = line.replace(search[0], sub_str)
                print(line)
                search = None
Here is what I changed in the code :
Fixed some unescaped characters inside the pattern
Iterate over the matches one by one
Fixed the group numbers used to build the replacement
Hope it will help someone who faced the same problem as me.
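Another way to get one replacement per match, sketched here as an alternative rather than as the code above, is to pass a function to re.sub: it is called once for each match, so every span gets its own text. The function name small_caps, the simplified class matching and the letter classes are assumptions made for this sketch. Note also that Python 3's str.upper() turns ô into Ô, unlike the expected output shown in the question.

```python
import re


def small_caps(html, tag="span", class_value="PCap"):
    """Wrap all but the first letter of each matching element in a small span."""
    pattern = re.compile(
        r'(?P<open><' + re.escape(tag) + r' class="(?:[^"]* )?'
        + re.escape(class_value) + r'(?: [^"]*)?">)'
        r'(?P<first>[^\W\d_])(?P<rest>[^\W\d_]*)'  # letters only, any alphabet
        r'(?P<close></' + re.escape(tag) + r'>)'
    )

    def replace_one(match):
        # Called once per match, so each element keeps its own text.
        if not match.group("first").isupper():
            return match.group(0)  # leave words without a leading capital alone
        return (match.group("open") + match.group("first")
                + '<span style="font-size:83%">'
                + match.group("rest").upper() + '</span>'
                + match.group("close"))

    return pattern.sub(replace_one, html)


line = ('<li><span class="PCap CharOverride-4">Contrôles</span> '
        '<span class="PCap CharOverride-4">Testes</span></li>')
result = small_caps(line)
print(result)
```

Because the replacement is computed from each match object, the second span can no longer receive the text of the first one, which was the symptom in the question.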
I have been searching for this in forums and on stackoverflow; it must be here somewhere but I couldn't find it.
I'm on a Mac, using the terminal to run a shell script to rename some pdf files based on file content.
I have a directory full of pdfs that I'm exporting to text files using the opensource pdfbox. The resulting files have the same name as the pdf file but end in .txt. I created the text files so that I could find a string inside the file with the format Page xx Question xx; for example Page 43 Question 2. Given this example, I would like to rename the pdf file as pg43_q2.pdf
I think the regular expression I want is this:
/Page\s+(\d+)Question\s+(\d+)
but I'm not sure how to read the two captured numbers and save them into a string that I can use as a filename.
The script I have so far is:
#!/bin/sh
PDF_FILE_PATH=$1
echo "Converting pdfs at $PDF_FILE_PATH"
find "$PDF_FILE_PATH" -name '*.pdf' -print0 | while IFS= read -r -d '' filename; do
echo $filename
java -jar pdfbox-app-1.6.0.jar ExtractText "$filename" "$filename.txt"
NEWNAME=$(sed -n -e '/Page/s/Page\s+\(\d+\)\s+Question\s+\(\d+\).*$/pg\1_q\2/p' "$filename.txt")
echo "Renaming pdf $filename to $NEWNAME"
# I would do this next, but $NEWNAME is empty
# mv "$filename" "$PDF_FILE_PATH/$NEWNAME"
done
... but the sed command is not putting anything into the NEWNAME variable.
I'm not particularly attached to sed, any suggestions would be appreciated
Latest edit to script uses the following sed command:
newname=$(sed -nE -e '/Page/s/^.*Page[[:blank:]]+([0-9]+)[[:blank:]]+Question[[:blank:]]+([0-9]+).*$/pg\1_q\2.pdf/p' "$filename.txt")
This works about 50% of the time, but the rest of the time the newname variable is empty when I go to rename the file.
The third line of a converted file that does work:
Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)
The third line of a converted file that doesn't work:
Unit 2 Review Page 258 Question 16 a) (a – 4)(a + 7) = a(a + 7) – 4(a + 7) = a2 + 7a – 4a – 28 = a2 + 3a – 28 b) (2x + 3)(5x + 2) = 2x(5x + 2) + 3(5x + 2) = 10x2 + 4x + 15x + 6 = 10x2 + 19x + 6 c) (–x + 5)(x + 5) = –x(x + 5) + 5(x + 5) = –x2 – 5x + 5x + 25 = –x2 + 25 d) (3y + 4)2 = (3y + 4)(3y + 4) = 3y(3y + 4) + 4(3y + 4) = 9y2 + 12y + 12y + 16 = 9y2 + 24y + 16 e) (a – 3b)(4a – b) = a(4a – b) – 3b(4a – b) = 4a2 – ab – 12ab + 3b2 = 4a2 – 13ab + 3b2 f) (v – 1)(2v2 – 4v – 9) = v(2v2 – 4v – 9) – 1(2v2 – 4v – 9) = 2v3 – 4v2 – 9v – 2v2 + 4v + 9 = 2v3 – 6v2 – 5v + 9
Removed unhelpful original answer
echo 'Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)'\
| sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'
output
pg257_q9
echo 'Unit 2 Review Page 258 Question 16 a) (a 4)(a + 7) = a(a + 7) 4(a + 7)'\
| sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'
output
pg258_q16
Otherwise, you had it right!
(Note that the sed processing is the same for both cases).
I've included a trailing ;p;q}, and an initial { so the sed script will just process the line with 'Page' and then quit.
I've expanded the POSIX char classes to the basic terms, i.e. [[:digit:]] = [0-9], and replaced the + with a repetition of the initial char class followed by the 'zero-or-more' char '*', making [0-9][0-9]*. My personal experience, having learned sed on Sun 3 from O'Reilly's 2nd edition Sed and Awk (with the comb-binding!), is that all the POSIX stuff is a distraction and a further source of errors. I'm clearly in the minority on this here on S.O ;-), but I'm willing to admit that newer seds have some great features and in any case .....
I hope this helps.
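Since the question says it is not attached to sed, the extraction step can also be sketched in Python. This is only an illustration: the function name new_pdf_name is made up, and it assumes the converted text has already been read into a string with the right encoding. The second sample line keeps the non-ASCII dash from the question's failing file; whether that character is what trips up sed there is a guess, but a Unicode string in Python is not affected by it.

```python
import re

# "Page 43 Question 2" with any amount of whitespace between the parts.
PAGE_QUESTION = re.compile(r"Page\s+(\d+)\s+Question\s+(\d+)")


def new_pdf_name(text):
    """Return e.g. 'pg43_q2.pdf' for the first 'Page N Question M' in text.

    Returns None when the text has no such marker, so the caller can
    skip the rename instead of moving a file to an empty name.
    """
    match = PAGE_QUESTION.search(text)
    if match is None:
        return None
    return "pg{}_q{}.pdf".format(match.group(1), match.group(2))


works = "Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)"
fails_in_sed = "Unit 2 Review Page 258 Question 16 a) (a – 4)(a + 7) = a(a + 7) – 4(a + 7)"

print(new_pdf_name(works))
print(new_pdf_name(fails_in_sed))
```

Checking for None before renaming covers the case described in the question where the new name comes back empty.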