Use regex recursively taking indentation level into account

I am trying to parse a custom input file for a simulation code I am writing. It consists of nested "objects" with properties and values (see the link).
Here is an example file and the regex I am using currently.
([^:#\n]*):?([^#\n]*)#?.*\n
It is made such that each match is a line, with two capture groups: one for the property and one for its value. It also excludes "#" and ":" from the character set, as they correspond to a comment delimiter and a property:value delimiter respectively.
How can I modify my regex so as to match the structure recursively? That is, if line n+1 has an indentation level higher than line n, it should be matched as a subgroup of line n's match.
I am working in Octave, which uses the PCRE regex format.

I asked whether you have control over the data format because, as it is, the data is very easy to parse as YAML instead of with regex.
The only problem is that the object is not well formed:
1) Take the regions object, for example: it has many attributes, all of them called layer. I think your intention is to build a list of layers instead of a lot of properties with the same name.
2) Now consider each layer property, which has a corresponding value. Following each layer are orphan attributes that I presume belong to that layer.
With these ideas in mind, if you form your object following YAML rules, it is a breeze to parse.
I know that you are working in Octave, but consider the modifications I made to your data, and how easy it is to parse it, in this case with Python.
DATA AS YOU HAVE IT NOW
case :
    name : tandem solar cell
    options :
        verbose : true
        t_stamp : system
    units :
        energy : eV
        length : nm
        time : s
        tension : V
        temperature: K
        mqty : mole
        light : cd
    regions :
        layer : Glass
        geometry:
            thick : 80 nm
            npoints : 10
        optical :
            nk_file : vacuum.txt
        layer : FTO
        geometry:
            thick : 10 nm
            npoints : 10
        optical :
            nk_file : vacuum.txt
MODIFIED DATA TO COMPLY WITH YAML SYNTAX
case :
    name : tandem solar cell
    options :
        verbose : true
        t_stamp : system # a sample comment
    units :
        energy : eV
        length : nm
        time : s
        tension : V
        temperature: K
        mqty : mole
        light : cd
    regions :
        - layer : Glass # ADDED THE - TO MAKE IT A LIST OF LAYERS
          geometry : # AND KEEP INDENTATION PROPERLY
            thick : 80 nm
            npoints : 10
          optical :
            nk_file : vacuum.txt
        - layer : FTO
          geometry:
            thick : 10 nm
            npoints : 10
          optical :
            nk_file : vacuum.txt
With only these instructions you get your object parsed:
import yaml
data = yaml.safe_load(text)  # safe_load: current PyYAML warns if yaml.load is called without a Loader
""" your data would be parsed as:
{'case': {'name': 'tandem solar cell',
'options': {'t_stamp': 'system', 'verbose': True},
'regions': [{'geometry': {'npoints': 10, 'thick': '80 nm'},
'layer': 'Glass',
'optical': {'nk_file': 'vacuum.txt'}},
{'geometry': {'npoints': 10, 'thick': '10 nm'},
'layer': 'FTO',
'optical': {'nk_file': 'vacuum.txt'}}],
'units': {'energy': 'eV',
'length': 'nm',
'light': 'cd',
'mqty': 'mole',
'temperature': 'K',
'tension': 'V',
'time': 's'}}}
"""


How to read in table that depends on two sets previously defined

I am optimizing the choice of letters, given the surface each requires on the laser cutter, to maximize the total frequency of the words they can form. I wrote this program for GLPK:
set unicodes;
param surfaces{u in unicodes};
table data IN "CSV" "surfaces.csv": unicodes <- [u], surfaces~s;
set words;
param frequency{w in words}, integer;
table data IN "CSV" "words.csv": words <- [word], frequency~frequency;
Then I want to read in a table giving, for each word, the count of each character by its unicode. The sets words and unicodes are already defined. According to page 42 of the manual, I can omit the set and the delimiter:
table name alias IN driver arg . . . arg : set <- [fld, ..., fld], par~fld, ..., par~fld;
...
set is the name of an optional simple set called control set. It can be omitted along with the
delimiter <-;
So I write this:
param spectrum{w in words, u in unicodes} >= 0;
table data IN "CSV" "spectrum.csv": words~word, unicodes~unicode, spectrum~spectrum;
I get the error:
Reading model section from lp...
lp:19: delimiter <- missing where expected
Context: ..., u in unicodes } >= 0 ; table data IN '...' '...' : words ~
If I write:
table data IN "CSV" "spectrum.csv": [words, unicodes] <- [word, unicode], spectrum~spectrum;
I get the error:
Reading model section from lp...
lp:19: syntax error in table statement
Context: ...} >= 0 ; table data IN '...' '...' : [ words , unicodes ] <-
How can I read in a table with data on two sets already defined?
Notes: the CSV files are similar to this:
surfaces.csv:
u,s
41,1
42,1.5
43,1.2
words.csv:
word,frequency
abc,10
spectrum.csv:
word,unicode,spectrum
abc,1,41
abc,2,42
abc,3,43
I found the answer with AMPL, A Mathematical Programming Language, which is a superset of GNU MathProg. I needed to define a set with the links between words and unicodes, and use that set as the control set when reading the table:
set links within {words, unicodes};
param spectrum{links} >= 0;
table data IN "CSV" "spectrum.csv": links <- [word, unicode], spectrum~spectrum;
And now I get:
...
INTEGER OPTIMAL SOLUTION FOUND
Time used: 0.0 secs
Memory used: 0.1 Mb (156430 bytes)
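(For reference, a MathProg model wired to CSV tables like this can be run with GLPK's stand-alone solver, e.g. glpsol --math model.mod with the three CSV files in the working directory; the model file name here is illustrative.)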
The "optional set" in the documentation is still misleading and I filed a bug report. For reference, the AMPL book is free to download and I used the transportation model scattered in page 47 in Section 3.2, page 173 in section 10.1, and page 179 in section 10.2.

Regex to find everything in between

I have the following regex, which works when there is no leading \d,"There is 1 interface on the system:
and no trailing ",2017-01-...
Here is the regex:
(?m)(?<_KEY_1>\w+[^:]+?):\s(?<_VAL_1>[^\r\n]+)$
Here is a sample of what I am trying to parse:
1,"There is 1 interface on the system:
Name : Mobile Broadband Connection
Description : Qualcomm Gobi 2000 HS-USB Mobile Broadband Device 250F
GUID : {1234567-12CD-1BC1-A012-C1A1234CBE12}
Physical Address : 00:a0:c6:00:00:00
State : Connected
Device type : Mobile Broadband device is embedded in the system
Cellular class : CDMA
Device Id : A1000001234f67
Manufacturer : Qualcomm Incorporated
Model : Qualcomm Gobi 2000
Firmware Version : 09010091
Provider Name : Verizon Wireless
Roaming : Not roaming
Signal : 67%",2017-01-20T16:00:07.000-0700
I am trying to extract field names and values, so that for example Cellular class would equal CDMA, for all fields beginning after:
1,"There is 1 interface on the system: (where the leading 1 increments: 1, 2, 3, 4 and so on)
and before the trailing ",2017-01....
Any help is much appreciated!
You could use look-ahead to ensure that the strings you match come before a ",\d sequence, and do not include a ". The latter would ensure you will only match between double quotes, of which the second has the pattern ",\d:
/^\h*(?<_KEY_1>[\w\h]+?)\h*:\h*(?<_VAL_1>[^\r\n"]+)(?="|$)(?=[^"]*",\d)/gm
See it on regex101
NB: I put the g and m modifiers at the end, but if your environment requires them at the start with (?m) notation, that will work too of course.
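If your environment happens to be Python, a rough equivalent of this pattern looks like the sketch below; note that Python's re module has no \h, so [ \t] stands in for horizontal whitespace, and record is a small excerpt of the sample above for illustration:
import re

record = '''1,"There is 1 interface on the system:
Name : Mobile Broadband Connection
Cellular class : CDMA
Signal : 67%",2017-01-20T16:00:07.000-0700'''

# [ \t] replaces PCRE's \h, which Python's re does not support.
pattern = re.compile(
    r'^[ \t]*(?P<key>[\w \t]+?)[ \t]*:[ \t]*(?P<val>[^\r\n"]+)(?="|$)(?=[^"]*",\d)',
    re.MULTILINE)

for m in pattern.finditer(record):
    print(m.group('key'), '->', m.group('val'))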
Your example string seems to be a record from a CSV file. This is how I would accomplish the task with Python (2.7 or 3.x):
import csv

with open('file.csv', 'r') as fh:
    reader = csv.reader(fh)
    results = []
    for fields in reader:
        lines = fields[1].splitlines()
        keyvals = [list(map(str.strip, line.split(':', 1))) for line in lines[1:]]
        results.append(keyvals)

print(results)
It can be done in a similar way with other languages.
You haven't responded to my comments or any of the answers, but here is my answer - try
^\s*(?<_KEY_1>[\w\s]+?)\s*:\s*(?<_VAL_1>[^\r\n"]+).*$
See it here at regex101.

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm-package (0.6-2) and would like to subset my corpus according to partial string matches contained with the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another, more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context. However, I don't manage to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).

R : Text Analysis - tm Package - stemComplete error

Machine: Windows 7 - 64 bit
R Version : R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
I am working on stemming some text for an analysis. I am able to do everything all the way up until stemCompletion. For more context, please see below:
Packages:
tm
SnowballC
rJava
RWeka
RWekajars
NLP
Sample list of words
test <- as.vector(c('win', 'winner', 'wins', 'wins', 'winning'))
Convert to Corpus
Test_Corpus <- Corpus(VectorSource(test))
Text manipulations:
Test_Corpus <- tm_map(Test_Corpus, content_transformer(tolower))
Test_Corpus <- tm_map(Test_Corpus, removePunctuation)
Test_Corpus <- tm_map(Test_Corpus, removeNumbers)
Stemming using tm_map under the tm package
>Test_stem <- tm_map(Test_Corpus, stemDocument, language = 'english' )
Below is the result from stemming above, which is all correct so far:
win
winner
win
win
win
Now comes the issue! When I try to use Test_Corpus as a dictionary to transform the words back to an appropriate format using the following code:
>Test_complete <- tm_map(Test_stem, stemCompletion, Test_Corpus)
Below is the error message that I am getting:
Warning messages:
1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
I have tried several things listed on previous posts and seen that other people with the same problem have tried with no luck. Below is a list of those things:
Update Java
used content_transformation
used PlainTextDocument
I think you need to save your Test_Corpus as a dictionary before the stemming process. You could try something like corpus <- Test_Corpus before the text manipulations; then start the stemming, and use corpus later on as the dictionary in Test_complete <- tm_map(Test_stem, stemCompletion, corpus).

Get find_elements_by_xpath to return 'None' or an empty string when an element is not found: signal missing elements in sequence with Selenium

I am trying to extract some attributes from this webpage.
url='http://m.search.allheart.com/?q=stethoscope'
I wrote the following XPaths for this:
XPATH,ATTRIBUTE='XPATH','ATTRIBUTE'
NUM_RESULTS='NUM_RESULTS'
URL='URL'
TITLE='TITLE'
PROD_ID='PROD_ID'
IS_SALE='IS_SALE'
CURRENCY='CURRENCY'
REGULAR_PRICE='REGULAR_PRICE'
SALE_PRICE='SALE_PRICE'
conf_key = {
    NUM_RESULTS : {XPATH: '//div[@id="sort-page"]//div[@id="options" and @class="narrowed"]//hgroup[@id="sort-info" and @class="clearfix"]/h2', ATTRIBUTE: ''},
    URL : {XPATH: '//span[@class="info"]//span[@class="swatches clearfix product-colors"]//span[@class="price"]', ATTRIBUTE: 'href'},
    TITLE : {XPATH: '//div[@id="sort-results"]//li[@class="item product-box"]//span[@class="info"]//span[@class="title"]', ATTRIBUTE: ''},
    PROD_ID : {XPATH: '//div[@id="sort-results"]//li[@class="item product-box"]//span[@class="info"]//span[@class="swatches clearfix product-colors"]', ATTRIBUTE: 'id'},
    IS_SALE : {XPATH: '//div[@id="sort-results"]//li[@class="item product-box sale"]', ATTRIBUTE: ''},
    REGULAR_PRICE : {XPATH: '//div[@id="sort-results"]//li[@class="item product-box"]//span[@class="info"]//span[@class="price"]', ATTRIBUTE: ''},
    SALE_PRICE : {XPATH: '//div[@id="sort-results"]//li[@class="item product-box sale"]//span[@class="info"]//span[@class="price"]', ATTRIBUTE: ''},
}
chromedriver = "/usr/local/CHROMEDRIVER"
desired_capabilities=DesiredCapabilities.CHROME
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver,desired_capabilities=desired_capabilities)
driver.get(url)
The idea is to extract the attributes from the first search page to get the name, URL, title, regular price and sale price.
Skipping the rest of the code; the text is extracted later through a for loop.
When I try to get the items on sale, I use:
driver.find_elements_by_xpath(conf_key[SALE_PRICE][XPATH])
driver.find_elements_by_xpath(conf_key[REGULAR_PRICE][XPATH])
However, this gives me regular_price, sale_price, is_sale as
['$5.98', '$5.98', '$24.98', '$3.98', '$6.98', '$13.98', '$24.98', '$19.98', '$18.98', '$3.98', '$5.98', '$24.98', '$12.98', '$24.98'] ['$49.99', '$96.99'] [1, 1]
while I would like:
['$5.98', '$5.98', '$24.98','$49.99', '$3.98', '$6.98', '$13.98', '$24.98', '$19.98', '$18.98', '$3.98', '$5.98', '$96.99', '$24.98', '$12.98', '$24.98']
['','', '24.98', '' , '' ....]
[0, 0, 1, 0 , 0 ...]
Question:
I would like to force the driver to return '' (or any placeholder), so that I have a signal that the product was not on sale.
The webpage will have either the class "item product-box" or "item product-box sale".
Also, I do not want to hard-code this, since I need to repeat this logic for a set of web pages. How can I do this better without looping through li[0], li[1], and so on?
Is there a method that signals a class was not present when the items are scanned in order?
Using the XPaths defined above, I do get the rest of the container correctly:
SEARCH_PAGE
244 Items ['ah426010', 'ahdst0100', 'ahdst0500blk', 'ahd000090', 'ahdst0600', 'pms1125', 'ahdst0400bke', 'ahdst0400blk', 'adc609', 'ma10448', 'ma10428', 'pm121', 'pm108', 'pm122'] ['allheart Discount Dual Head Stethoscope', 'allheart Discount Single Head Stethoscope', 'allheart Cardiology Stethoscope', 'allheart Disposable Stethoscope', 'allheart Discount Pediatric / Infant Stethoscope With Interchangeable Heads Stethoscope', 'Prestige Medical Ultra-Sensitive Dualhead Latex Free Stethoscope', 'allheart Smoke Black Edition Clinical Stainless Steel Stethoscope', 'allheart Clinical Stainless Steel Stethoscope', 'ADC Adscope-Lite 609 Lightweight Double-Sided Stethoscope', 'Mabis Dispos-A-Scope Nurse Stethoscope', 'Mabis Spectrum Nurse Stethoscope', 'Prestige Medical Clinical Lite Stethoscope', 'Prestige Medical Dual Head Stethoscope', 'Prestige Medical Sprague Rappaport Stethoscope']
And I need to get lists of the same length, corresponding to each of these, for regular and sale price (and the is_sale flag).
find_elements_by_X returns a list of WebElements, each of which can itself call find_elements_by_X.
Use find_elements_by_X to get a list of all the products within the page.
Iterate through them all.
Use find_elements_by_X (on the current product) to get a specific element like cur_price or is_on_sale.
Don't forget to initialize a default value.
Store the information in a structure (map, class, tuple); note it is easy to specify a default value in a class using __init__(). A sketch of these steps follows below.
I find CSS selectors easier to read than XPath, IMO. Try them using the Google Chrome console (F12) + right click + Copy CSS path. https://selenium-python.readthedocs.org/locating-elements.html#locating-elements-by-css-selectors
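Here is a minimal sketch of the per-product loop described above, assuming the container and price markup from the question (the relative XPaths inside each box, and the order of prices in a sale box, are guesses based on the posted selectors):
from selenium import webdriver

driver = webdriver.Chrome()  # configure path/capabilities as in the question
driver.get('http://m.search.allheart.com/?q=stethoscope')

products = []
# One pass over the product containers; 'contains' matches both the plain
# and the sale variant of the class attribute.
for box in driver.find_elements_by_xpath('//div[@id="sort-results"]//li[contains(@class, "item product-box")]'):
    is_sale = 'sale' in box.get_attribute('class').split()
    prices = box.find_elements_by_xpath('.//span[@class="info"]//span[@class="price"]')
    regular = prices[0].text if prices else ''  # default placeholder when missing
    # Assumption: on sale items a second price element holds the sale price.
    sale = prices[1].text if is_sale and len(prices) > 1 else ''
    products.append({'regular': regular, 'sale': sale, 'is_sale': int(is_sale)})

# Every list derived from 'products' now has one entry per product, in page order.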