With openpyxl, I am testing whether a certain cell in an Excel sheet has a value, using
if sheet.cell(j, i).value
The problem: if there is a "0" in that cell, the check gives the output "no".
Question: how do I change it so that it gives "Yes" as output when there is a "0" in that cell?
If you use only if sheet.cell(j, i).value as the logical check, it will be False for both 0 and a blank cell. So you need to use if sheet.cell(j, i).value is None, which returns True only when the cell is blank. A working example:
import openpyxl

wb = openpyxl.load_workbook('Sample.xlsx')
sheet = wb['Sheet2']
j = 10
for i in range(1, 9):
    print("Data in cell is : ", sheet.cell(j, i).value)
    if sheet.cell(j, i).value is None:
        print("This is blank")
    else:
        print("This is not blank")
Output
Data in cell is : 1
This is not blank
Data in cell is : 2
This is not blank
Data in cell is : 3
This is not blank
Data in cell is : None
This is blank
Data in cell is : 4
This is not blank
Data in cell is : 5
This is not blank
Data in cell is : 0
This is not blank
Data in cell is : 8
This is not blank
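The behaviour driving this is plain Python truthiness: 0 is falsy, but only a blank cell yields None. A minimal sketch of the distinction, without openpyxl:

```python
# Values a cell might hold: a number, zero, or blank (None in openpyxl)
values = [1, 0, None]

truthy_check = [bool(v) for v in values]  # 0 and None both look "empty"
none_check = [v is None for v in values]  # only the blank cell is None

print(truthy_check)  # [True, False, False]
print(none_check)    # [False, False, True]
```

This is why the `is None` test distinguishes a stored zero from a genuinely empty cell.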
Related
I am working on event log data; as we all know, it is unstructured, and I need to extract important pieces of information from these logs for better visualization. The data is tab-separated. I have created a data frame from these event logs, along with the expected output. The column Event_Message holds the raw event log message, and CtrJb_ID, Prcs_ID, LotID and Wafer_ID are the columns whose values I would like to extract from these logs. If the condition is not met, the row should be None or empty; for example, if a lot ID exists in an event, extract it, otherwise leave None.
data = {'Timestamp':['2009/8/22 08:02:29.862', '2009/8/22 08:02:30.706','2008/08/22 08:02:33.207','2008/08/22 08:02:37.551'],
'Event_Message':["2009/8/22 08:02:29.862 2009/8/22 08:02:29.862 123456 ControlJobStateTransition1 CWControlJobManager 'ControlJob named XYZ12345-20090822-0005 was created and is in the QUEUED state.' [] ['EventVariable ControlJobID 0 true XYZ12345-20090822-0005'' ControlJobID' 'EventVariable DataCollectionPlan 0 true ",
"2009/8/22 08:02:30.706 2009/8/22 08:02:30.315 123456 PRJobStateChange XYZ12ProcessJobManager 'Process Job 200908221102-2R34567.000-01 has changed state to PRJOBACTIVE/SETUP.' [] ['EventVariable ProcessJobID 0 true ''200908221102-2R34567.000-01'' ProcessJobID' 'EventVariable ProcessJobState 0 true ''1''",
"2008/08/22 08:02:33.207 2008/08/22 08:02:33.175 123456789 DAExtendPerResourceDAWaferCenterOffsetB TransferChamberSlotValvePM4 'DAPerResource EXTEND' [] ['StatusVariable Source 0 true ''TransferChamber-EndEffector2'' Source' 'StatusVariable Destination 0 true ''PM4'' Destination' 'StatusVariable WaferID 0 true ''1A234568ABC2'' WaferID' 'StatusVariable LotID 0 true ''200908221036-2R34567.000-01'' LotID'",
"2008/08/22 08:02:37.551 2008/08/22 08:02:37.404 12345678 RecipeStarted PM4 'Started processing recipe AB0-Z-65XYZ-ABCDE12XYZ1-2R34567000 on material 1A234568ABC2. ' [] ['StatusVariable RecipeName 0 true ''PM4-P-14LPP-PEBNS31JFA1-8R91721000'' RecipeName' 'StatusVariable MaterialID 0 true ''1A234568ABC2'' MaterialID' 'StatusVariable JobID 0 true ''201910021036-2R34567.000-01'' JobID' 'EventVariable WacID 0 true '''' WacID' 'StatusVariable LotID 0 true ''2R34567.000'' LotID' 'StatusVariable SlotID 0 true ''11"],
'CtrJb_ID': ['XYZ12345-20090822-0005', None, None, None],
'Prcs_ID': [None, '200908221102-2R34567.000', None, None],
'LotID': [None, None, '200908221036-2R34567.000-01', '2R34567.000'],
'Wafer_ID': [None, None, '1A234568ABC2', None ]}
df= pd.DataFrame(data)
I have read this event log message line by line and then tried to extract the fields using regex patterns, but have not been successful. Below is the code I have tried so far.
import pandas as pd
import re

f = open(r"C:\ABCD\XYZ\egfh_ijk_lmn\log2009082212.txt")
lines = f.readlines()
for line in lines:
    print(line)

lot = re.compile(r'LotID\s+\d\s+\w+\s+(.*)\s+LotID')
for line in lines:
    if lot.search(str(line)):
        print(lot)
    else:
        print(None)
Output:
None
None
None
None
None
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
None
None
None
None
None
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
None
None
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
Since you ask for a working snippet:
import re
test = "2008/08/22 08:02:33.207 2008/08/22 08:02:33.175 123456789 DAExtendPerResourceDAWaferCenterOffsetB TransferChamberSlotValvePM4 'DAPerResource EXTEND' [] ['StatusVariable Source 0 true ''TransferChamber-EndEffector2'' Source' 'StatusVariable Destination 0 true ''PM4'' Destination' 'StatusVariable WaferID 0 true ''1A234568ABC2'' WaferID' 'StatusVariable LotID 0 true ''200908221036-2R34567.000-01'' LotID'"
lot = re.compile(r"LotID[^']+''([\d\-.A-Z]*)''[^']+LotID")
match = lot.search(test)
if match:
    print(match.group(1))
else:
    print("None")
Output:
200908221036-2R34567.000-01
There's no need to use look-ahead and look-behind for this sort of task, and not using them allows you to use quantifiers. In your version, \w will not match 'true'. Also, your sample contains spaces instead of tab characters; I don't know if that is the formatting on this site or in your actual data.
This will match your sample:
LotID\s+\d\s+\w+\s+(.*)\s+LotID
Though if you know what characters you expect in the Lot ID it might be better expressed as
LotID[^']+(['\d\-R.]*)[^']+LotID
I'm trying to get PyPDF2 to extract specific text throughout a document, per the code below. It is pulling exactly what I need and eliminating the duplicates, but it is not giving me a list from each page; it seems to only show the text from the last page. What am I doing wrong?
#import PyPDF2 and set extracted text as the page_content variable
import PyPDF2

pdf_file = open('enme2.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

#for loop to get number of pages and extract text from each page
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content = page.extractText()

#initialize the user_input variable
user_input = ""

#function to get the AFE numbers from the pdf document
def get_afenumbers(Y):
    #initialize the afe and afelist variables
    afe = "A"
    afelist = ""
    x = ""
    #while loop to get only 6 digits after the "A"
    while True:
        if user_input.upper().startswith("Y"):
            #Return a list of AFE's
            import re
            afe = re.findall('[A][0-9]{6}', page_content)
            print(set(afe))
            break
        else:
            afe = "No AFE numbers found..."
        if user_input.upper().startswith("N"):
            print("HAVE A GREAT DAY - GOODBYE!!!")
            break

#Build a while loop for initial question prompt (when Y or N is not True):
while user_input != "Y" and user_input != "N":
    user_input = input('List AFE numbers? Y or N: ').upper()
    if user_input not in ["Y", "N"]:
        print('"', user_input, '"', 'is an invalid input')

get_afenumbers(user_input)
#FIGURE OUT HOW TO EXTRACT FROM ALL PAGES AND NOT JUST ONE
I'm quite new to this; I just learned about regex from a response to my question earlier today. Thanks for any help.
If you change it a little, it seems to work fine.
page_content = ""  # define variable for use in the loop
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()  # concatenate the pages' text
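The difference between `=` and `+=` in the loop can be seen with plain strings, independent of PyPDF2:

```python
pages = ["page one ", "page two ", "page three "]

# Reassignment keeps only the last page's text
content = ""
for text in pages:
    content = text
print(content)  # page three

# Concatenation accumulates every page's text
content = ""
for text in pages:
    content += text
print(content)  # page one page two page three
```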
I want to write a Python script that reads an xlsx file and, based on the value of column X, writes/appends a file with the value of column Z.
Sample data:
Column A Column X Column Y Column Z
123 abc test value 1
124 xyz test value 2
125 xyz test value 3
126 abc test value 4
If the value in Column X is "abc", the script should create a file (if it does not already exist) in some path named abc.txt and insert the value of Column Z into it; likewise, if Column X is "xyz", it should create xyz.txt in the same path and insert the value of Column Z there.
from openpyxl import load_workbook

wb = load_workbook('filename.xlsm')
ws = wb.active
for cell in ws.columns[9]:  # column 9 is Column X of my example, whose value I am testing
    if cell.value == "abc":
        print ws.cell(column=12).value  # this is not working, and I don't know how to read the corresponding value of another column
Please suggest what could be done.
Thank you.
Change
print ws.cell(column=12).value
with:
print ws.columns[col][row].value
in your case:
print ws.columns[12-1][cell.row-1].value
Note that with this indexing method, columns and rows start at index 0; that is why I use cell.row-1. Take this into account when addressing your column: if your 12 counts from 1, you will have to use 11.
Alternatively, you can access the cell like this: ws.cell(row=cell.row, column=12).value. Note that in this case columns and rows start at 1.
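For the write/append part of the question, opening the file in mode "a" does what's needed: it creates the file if it doesn't exist and appends otherwise. A minimal sketch, assuming the (Column X, Column Z) pairs have already been read from the sheet (the rows list and the temporary directory here are stand-ins, not part of the original code):

```python
import os
import tempfile

# Stand-in for the (Column X, Column Z) pairs read from the sheet
rows = [("abc", "value 1"), ("xyz", "value 2"),
        ("xyz", "value 3"), ("abc", "value 4")]

outdir = tempfile.mkdtemp()
for x_value, z_value in rows:
    # mode "a" creates the file if missing and appends otherwise
    with open(os.path.join(outdir, x_value + ".txt"), "a") as f:
        f.write(z_value + "\n")

print(open(os.path.join(outdir, "abc.txt")).read())  # value 1, then value 4
```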
I have a page with an HTML table with 16 rows and 5 columns.
I have a method to loop through the table and print out the cell values.
I get the following error:
raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: Element is no longer valid
The error happens on this line:
col_name = row.find_elements(By.TAG_NAME, "td")[1] # This is the Name column
My method code is:
def get_variables_col_values(self):
    try:
        table_id = self.driver.find_element(By.ID, 'data_configuration_variables_ct_fields_body1')
        #time.sleep(10)
        rows = table_id.find_elements(By.TAG_NAME, "tr")
        print "Rows length"
        print len(rows)
        for row in rows:
            # Get the columns
            print "cols length"
            print len(row.find_elements(By.TAG_NAME, "td"))
            col_name = row.find_elements(By.TAG_NAME, "td")[1]  # This is the Name column
            print "col_name.text = "
            print col_name.text
    except NoSuchElementException, e:
        return False
Am I getting "element is no longer valid" because the DOM has updated or changed?
Has the table not finished loading?
How can I solve this, please?
Do I need to wait for the page to be fully loaded?
Should I use the following WebDriverWait code to wait until the page load has completed?
WebDriverWait(self.driver, 10).until(lambda d: d.execute_script('return document.readyState') == 'complete')
Whereabouts in my code should I put this line, if it is required?
I ran my code again, and the 2nd time it worked. The output was:
Rows length
16
cols length
6
col_name.text =
Name
cols length
6
col_name.text =
Address
cols length
6
col_name.text =
DOB
...
So I need to make my code better so that it works every time I run my test case.
What is the best solution?
Thanks,
Riaz
StaleElementReferenceException: Message: Element is no longer valid can mean that the page wasn't completely loaded or a script that changes the page elements was not finished running, so the elements are still changing or not all present after you start interacting with them.
You're on the right track! Using explicit waits is good practice to avoid StaleElementReferenceException and NoSuchElementException, since your script will often execute commands much faster than a web page can load or JavaScript can finish running.
Use WebDriverWait before you use WebDriver commands.
Here's a list of different "Expected Conditions" you can use to detect that page is loaded completely or at least loaded enough: http://selenium-python.readthedocs.org/en/latest/waits.html
An example of where to place the wait in your code, waiting up to 10 seconds for all 'td' elements to finish loading (you may need a different condition, timeout, or element to wait for, depending on the web page as a whole):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_variables_col_values(self):
    try:
        WebDriverWait(self.driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'td')))
        table_id = self.driver.find_element(By.ID, 'data_configuration_variables_ct_fields_body1')
        #time.sleep(10)
        rows = table_id.find_elements(By.TAG_NAME, "tr")
        print "Rows length"
        print len(rows)
        for row in rows:
            # Get the columns
            print "cols length"
            print len(row.find_elements(By.TAG_NAME, "td"))
            col_name = row.find_elements(By.TAG_NAME, "td")[1]  # This is the Name column
            print "col_name.text = "
            print col_name.text
    except NoSuchElementException, e:
        return False
I am trying to automate 100 Google searches (one per individual string in a row, returning the URLs for each query) on a specific column in a CSV (via Python 2.7); however, I am unable to get pandas to pass the row contents to the Google Search automator.
*GoogleSearch source = https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print Urls successfully for a query when I utilize the following code:
from google import search
query = "apples"
for url in search(query, stop=5, pause=2.0):
print(url)
However, when I add pandas (to read each "query"), the rows are not read and queried as intended; i.e. the literal string "data.irow(n)" is queried instead of the row contents, one at a time.
from google import search
import pandas as pd
from pandas import DataFrame

query_performed = 0
querying = True
query = 'data.irow(n)'

#read the csv file; column 2 is "Fruit"
df = pd.read_csv(r'C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col='Fruit')

# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
    if query_performed <= 100:
        print("query")
        query_performed += 1
    else:
        querying = False
        print("Asked all 100 query's")

#prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
    print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:
B
1 **Fruit**
2 apples
3 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. Like I mentioned in my comment, I couldn't get the stop parameter to work like I thought it should; maybe I'm misunderstanding how it's used. I'm assuming you only want the first 5 urls per search.
a sample df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C", "D", "E", "F", "G"]

# Here I'm using an apply() to call the google search for each 'row',
# and a list is built from the urls returned by search()
df[urlcols] = df["B"].apply(lambda fruit: pd.Series([url for url in
    search(fruit, stop=stop, pause=2.0)][:stop]))  # get 5 by slicing
which gives you the following (the formatting is a bit rough here):
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
If you'd rather not specify the columns (i.e. ["C", "D", ...]), you could do the following:
df.join(df["B"].apply(lambda fruit: pd.Series([url for url in
    search(fruit, stop=stop, pause=2.0)][:stop])))
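The expand-to-columns trick works independently of the google module; the fake_search generator below is a stand-in (an assumption, not part of the google package) so the example runs offline:

```python
import pandas as pd

def fake_search(fruit, stop=5, pause=0.0):
    # Stand-in for google.search(): yields predictable urls
    for i in range(stop):
        yield "http://example.com/%s/%d" % (fruit, i)

df = pd.DataFrame({"B": ["mangos", "oranges"]})
stop = 2
urlcols = ["C", "D"]

# apply() builds one row of urls per fruit; join() attaches them as columns
urls = df["B"].apply(lambda fruit: pd.Series(list(fake_search(fruit, stop=stop))[:stop]))
urls.columns = urlcols
df = df.join(urls)
print(df)
```

Each call to apply's lambda returns a pd.Series, so the result is a DataFrame with one url per column, ready to join back onto the original frame.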