IFC is a variation of STEP files used for construction projects. An IFC file contains information about the building being constructed. The file is text based and is easy to read. I am trying to parse this information into a Python dictionary.
The general format of each line is similar to the following:
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
Ideally this should be parsed into #2334, IFCMATERIALLAYERSETUSAGE, #2333, .AXIS2., .POSITIVE., -180.
I found a solution to part of the problem in the question "Regex includes two matches in first match" (https://regex101.com/r/RHIu0r/10).
However, in some cases the data contains arrays instead of single values, as in the example below:
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
This case needs to be parsed as #2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5, $, $, [#40,#221,#268,#281], #2334
where [#40,#221,#268,#281] is stored in a single variable as an array.
The array can appear in the middle or as the last value.
Would you be able to help me create a regular expression that produces the desired results?
I have created https://regex101.com/r/mqrGka/1 with test cases.
Here's a solution that continues from the point you reached with the regular expression in the test cases:
file = """\
#1=IFCOWNERHISTORY(#89024,#44585,$,.NOCHANGE.,$,$,$,1190720890);
#2=IFCSPACE(';;);',#1,$);some text);
#2=IFCSPACE(';;);',#1,$);
#2885=IFCRELAGGREGATES('1gtpBVmrDD_xsEb7NuFKc8',#5,$,$,#2813,(#2840,#2846,#2852,#2858,#2879));
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
""".splitlines()
import re
d = dict()
for line in file:
    m = re.match(r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);", line, re.I|re.M)
    attr = m.group(3)      # attribute list string
    values = [m.group(2)]  # first value is the entity type name
    while attr:
        start = 1
        if attr[0] == "'": start += attr.find("'", 1)  # don't split at comma within string
        if attr[0] == "(": start += attr.find(")", 1)  # don't split item within parentheses
        end = attr.find(",", start)                    # search for a comma / end of item
        if end < 0: end = len(attr)
        value = attr[1:end-1].split(",") if attr[0] == "(" else attr[:end]
        if value[0] == "'": value = value[1:-1]        # remove quotes
        values.append(value)
        attr = attr[end+1:]  # remove current attribute item
    d[m.group(1)] = values   # store into dictionary
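As an alternative to scanning for commas by position, the attribute string can be split into tokens first and the parenthesised groups collected with a small stack. The following is only a sketch under assumptions: the names LINE_RE, TOKEN_RE, parse_attributes and parse_line are made up for illustration, the line grammar is inferred from the sample lines above, parentheses are assumed to be balanced, and quotes escaped inside strings are not handled.

```python
import re

# Same line pattern as in the answer above: id, entity name, attribute string.
LINE_RE = re.compile(r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);")
# A token is a quoted string, a single parenthesis or comma, or a bare value.
TOKEN_RE = re.compile(r"'[^']*'|[(),]|[^,()']+")

def parse_attributes(text):
    """Split an attribute string; parenthesised groups become nested lists."""
    stack = [[]]                        # stack[-1] is the list being filled
    for token in TOKEN_RE.findall(text):
        if token == "(":
            stack.append([])            # open a nested list
        elif token == ")":
            finished = stack.pop()      # close it and attach it to its parent
            stack[-1].append(finished)
        elif token != ",":
            token = token.strip()
            if token.startswith("'"):
                stack[-1].append(token[1:-1])   # drop the surrounding quotes
            elif token:
                stack[-1].append(token)
    return stack[0]

def parse_line(line):
    """Return (id, [entity name, attributes...]) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return "#" + m.group(1), [m.group(2)] + parse_attributes(m.group(3))
```

Calling parse_line on each line and storing the second element under the first gives a dictionary of the same shape as d above, with the difference that lines that do not match are skipped instead of raising an error.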
I have an Excel sheet where I use the following formula to get numbers from a cell that contains form text:
=MID(D2;SEARCH("number";D2)+6;13)
It searches for the string "number" and takes the next 13 characters after it. But sometimes the result contains more than the number, because the text in the cells does not follow a pattern, as in the examples below:
62999999990
21999999990
11999999990
6299999993) (
17999999999)
21914714753)
58741236714 P
18888888820
How do I avoid taking anything but numbers, or how do I remove everything but numbers from what I get?
You can use this User Defined Function (UDF), which returns only the digits inside a specific cell.
Code:
Function only_numbers(strSearch As String) As String
    Dim i As Integer, tempVal As String
    For i = 1 To Len(strSearch)
        If IsNumeric(Mid(strSearch, i, 1)) Then
            tempVal = tempVal + Mid(strSearch, i, 1)
        End If
    Next
    only_numbers = tempVal
End Function
To use it, you must:
Press ALT + F11
Insert new Module
Paste code inside Module window
Now you can use the formula =only_numbers(A1) in your spreadsheet, changing A1 to your data location.
[Images: inserting the code in the module window; executing the function]
P.S.: if you want to limit the number of digits to 13, you can change the last line of code from:
only_numbers = tempVal
to
only_numbers = Left(tempVal, 13)
Alternatively, you can take a look at this topic to understand how to achieve this using formulas.
If you are going to go to a User Defined Function (aka UDF) then perform all of the actions; don't rely on the preliminary worksheet formula to pass a stripped number and possible suffix text to the UDF.
In a standard code module:
Function udfJustNumber(str As String, _
                       Optional delim As String = "number", _
                       Optional startat As Long = 1, _
                       Optional digits As Long = 13, _
                       Optional bCaseSensitive As Boolean = False, _
                       Optional bNumericReturn As Boolean = True)
    Dim c As Long
    udfJustNumber = vbNullString
    str = Trim(Mid(str, InStr(startat, str, delim, IIf(bCaseSensitive, vbBinaryCompare, vbTextCompare)) + Len(delim), digits))
    For c = 1 To Len(str)
        Select Case Asc(Mid(str, c, 1))
            Case 32
                'do nothing- skip over
            Case 48 To 57
                If bNumericReturn Then
                    udfJustNumber = Val(udfJustNumber & Mid(str, c, 1))
                Else
                    udfJustNumber = udfJustNumber & Mid(str, c, 1)
                End If
            Case Else
                Exit For
        End Select
    Next c
End Function
I've used your narrative to add several optional parameters. You can change these if your circumstances change. Most notable is whether to return a true number or text-that-looks-like-a-number with the bNumericReturn option. Note that the returned values are right-aligned as true numbers should be in the following supplied image.
By supplying FALSE to the sixth parameter, the returned content is text-that-looks-like-a-number and is now left-aligned in the worksheet cell.
If you don't want VBA and would like to use Excel Formulas only, try this one:
=SUMPRODUCT(MID(0&MID(D2,SEARCH("number",D2)+6,13),LARGE(INDEX(ISNUMBER(--MID(MID(D2,SEARCH("number",D2)+6,13),ROW($1:$13),1))* ROW($1:$13),0),ROW($1:$13))+1,1)*10^ROW($1:$13)/10)
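For readers who want to check the logic outside Excel, the same idea (take the 13 characters after the marker, then keep only the digits) can be sketched in Python. The function name digits_after and its parameters are hypothetical and only mirror the MID/SEARCH formula and the only_numbers UDF above; this is an illustration, not a replacement for the formulas.

```python
import re

def digits_after(text, marker="number", width=13):
    """Keep only the digits found in the `width` characters that follow `marker`."""
    pos = text.lower().find(marker.lower())   # case-insensitive, like SEARCH
    if pos < 0:
        return ""                             # marker not present
    begin = pos + len(marker)
    window = text[begin:begin + width]        # same slice as MID(..., 13)
    return re.sub(r"\D", "", window)          # drop every non-digit character
```

Unlike udfJustNumber, this sketch does not stop at the first non-digit; it removes all non-digits from the window, which is the behaviour of only_numbers.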
I have an IF statement that returns the number if there is a colon symbol in the string. Sometimes the string does not contain a colon. I'm looking for an else statement that would select the only number, "45061", if there is no colon in the string. A works when the string has a colon, but I need some assistance with B, where the string does not have a colon.
A.
String/Text = OM_Account_Master_Slave~Account CP~3712011:Shared-001
B.
String/Text = OM_Account_Master_Slave~Account CP~45061Shared-001
A.
if(contains,":",Substring(Abbrev(),1,Subtract(Length(Abbrev()),11)))
Result = 3712011:Shared-001
B.
if(contains,":",Substring(Abbrev(),1,Subtract(Length(Abbrev()),11)))
else
Consider the following User Defined Function:
Public Function GetNumber(r As Range) As Variant
    Dim v As String, capture As Boolean
    Dim i As Long, t As String, m As String
    v = r.Value
    GetNumber = ""
    If v = "" Then Exit Function
    t = ""
    capture = False
    ' collect the first run of consecutive digits
    For i = 1 To Len(v)
        m = Mid(v, i, 1)
        If IsNumeric(m) Then
            t = t & m
            capture = True
        Else
            If capture Then Exit For
        End If
    Next i
    If Len(t) > 0 Then
        GetNumber = CLng(t)
    End If
End Function
User Defined Functions (UDFs) are very easy to install and use:
ALT-F11 brings up the VBE window
ALT-I
ALT-M opens a fresh module
paste the stuff in and close the VBE window
If you save the workbook, the UDF will be saved with it.
If you are using a version of Excel later than 2003, you must save
the file as .xlsm rather than .xlsx
To remove the UDF:
bring up the VBE window as above
clear the code out
close the VBE window
To use the UDF from Excel:
=GetNumber(A1)
To learn more about macros in general, see:
http://www.mvps.org/dmcritchie/excel/getstarted.htm
and
http://msdn.microsoft.com/en-us/library/ee814735(v=office.14).aspx
and for specifics on UDFs, see:
http://www.cpearson.com/excel/WritingFunctionsInVBA.aspx
Macros must be enabled for this to work!
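The scan that GetNumber performs (take the first run of consecutive digits, whether or not a colon follows it) can also be sketched in Python for quick checking outside Excel. The function name first_number is hypothetical, and returning None when there are no digits is a choice made here; the UDF above returns an empty string in that case.

```python
import re

def first_number(text):
    """Return the first run of consecutive digits in `text` as an int, or None."""
    m = re.search(r"\d+", text)   # the first digit run ends at the first non-digit
    return int(m.group()) if m else None
```

With the two sample strings from the question, first_number returns 3712011 for string A and 45061 for string B, so no separate else branch is needed for the colon-free case.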
The content of the file is here: http://pastebin.com/nAe9q9Kt (as I cannot have multiple blank lines in a question)
Below is a screenshot from my sublime-text.
SPACED INPUT EXAMPLE START
a
b
c
SPACED INPUT EXAMPLE END
You can see that most of the lines begin with 0 (zero), except the words ENGINEERS and DOESNT, and that they are separated by a single blank line and sometimes by double blank lines.
Basically what I want is this:
List(
List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
List("0IF IT", "0AINT BROKE", "0DONT FIX IT"),
List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE"),
List("0IT"),
List("0HAVE", "0ENOUGH", "0FEATURES YET.")
)
I tried to write tail-recursive code and it worked well in the end :) But it takes too long (a couple of minutes) to run on a huge file (more than 10K lines).
I thought of using a regex approach, or executing Unix commands like sed or awk from Scala code to generate a temp file. My guess is that it would run faster than my current approach.
Can somebody please help me with the regex?
Here is my tail-recursive Scala code:
@scala.annotation.tailrec
def inner(remainingLines: List[String], previousLineIsBlank: Boolean, frames: List[List[String]], frame: List[String]): List[List[String]] = {
  remainingLines match {
    case Nil => frame :: frames
    case line :: Nil if !previousLineIsBlank =>
      inner(
        remainingLines = Nil,
        previousLineIsBlank = false,
        frames = frame :: frames,
        frame = line :: frame)
    case line :: tail => {
      line match {
        case "" if previousLineIsBlank => // Current line is blank, previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = true,
            frames = frame :: frames,
            frame = List.empty[String])
        case "" if !previousLineIsBlank => // Current line is blank, previous line is not blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = true,
            frames = frames,
            frame = frame)
        case line if !line.startsWith("0") && previousLineIsBlank => // Current line is not blank and does not start with 0 (ENGINEERS, DOESNT), previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = frame)
        case line if previousLineIsBlank => // Current line is not blank and starts with 0, previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = line :: frame)
        case line if !previousLineIsBlank => // Current line is not blank, previous line is not blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = line :: frame)
        case line => sys.error("Unmatched case = " + line)
      }
    }
  }
}
val source = """0MOST PEOPLE
0BELIEVE
0THAT

0IF IT
0AINT BROKE,
0DONT FIX IT.

ENGINEERS

0BELIEVE
0THAT
0IF
0IT AINT BROKE,

0IT

DOESNT

0HAVE
0ENOUGH
0FEATURES YET."""
val output = (for (s <- source.split("\n\n").toList) yield { // split on empty lines
  s.split("\n").toList // split on new lines
    .filter(_.headOption.getOrElse("")=='0')} // get rid of entries not starting with '0'
  ).filter(!_.isEmpty) // get rid of possible empty blocks
//output formatted for readability
scala> output: List[List[String]] = List(List(0MOST PEOPLE, 0BELIEVE, 0THAT),
List(0IF IT, 0AINT BROKE,, 0DONT FIX IT.),
List(0BELIEVE, 0THAT, 0IF, 0IT AINT BROKE,),
List(0IT),
List(0HAVE, 0ENOUGH, 0FEATURES YET.))
UPDATE:
If you are reading the lines from a file, then the old imperative approach might work quite well, especially if the source file is large:
import scala.collection.mutable.ListBuffer
import scala.io.Source
val lb = ListBuffer[List[String]]()
val ml = ListBuffer[String]()
for (ll <- Source.fromFile(<yourfile>).getLines()) { // iterate over lines, not characters
  if (ll.isEmpty) {
    if (ml.nonEmpty) lb += ml.toList
    ml.clear()
  } else if (ll(0) == '0') ml += ll
}
if (ml.nonEmpty) lb += ml.toList // flush the last block if the file does not end with a blank line
val output = lb.toList
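The single-pass grouping used above (a blank line closes the current block, and only lines starting with 0 are kept) is independent of Scala. Below is a minimal Python sketch of the same loop. The function name frames is hypothetical, and the input is assumed to be any iterable of lines, for example an open file object.

```python
def frames(lines):
    """Group lines starting with '0' into blocks separated by blank lines."""
    result, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                    # a blank line ends the current block
            if current:
                result.append(current)
            current = []
        elif line.startswith("0"):      # skip lines such as ENGINEERS or DOESNT
            current.append(line)
    if current:                         # flush the block at end of input
        result.append(current)
    return result
```

Each line is visited once and nothing is prepended or reversed, so the running time grows linearly with the number of lines.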
Here is a way with awk. You'll probably have to figure out a way to incorporate this in your scala code:
awk '
BEGIN { print "List(" }
/^0/ {
    printf " %s", "List("
    for(i = 1; i <= NF; i++) {
        printf "%s%s" ,q $i q,(i==NF?"":", ")
    }
    print "),"
}
END { print ")" }' RS= FS='\n' q='"' file
Output with your sample data (from pastebin):
List(
List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
List("0IF IT", "0AINT BROKE,", "0DONT FIX IT."),
List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE,"),
List("0IT"),
List("0HAVE", "0ENOUGH", "0FEATURES YET."),
)
Using awk
awk 'BEGIN{print "List(" }
     { s=/^[0-9]/?1:0;i=s?i:i+1}
     s{a[i]=a[i]==""?$0:a[i] OFS $0}
     END{ for (j=1;j<=i;j++)
            if (a[j]!="")
            { gsub(/\|/,"\",\"",a[j])
              printf " list(\"%s\")\n", a[j]
            }
          print ")"
     }' OFS="|" file
List(
list("0MOST PEOPLE","0BELIEVE","0THAT")
list("0IF IT","0AINT BROKE,","0DONT FIX IT.")
list("0BELIEVE","0THAT","0IF","0IT AINT BROKE,")
list("0IT")
list("0HAVE","0ENOUGH","0FEATURES YET.")
)
Explanation
s=/^[0-9]/?1:0;i=s?i:i+1: the markers s and i are used to detect whether a new record starts.
s{a[i]=a[i]==""?$0:a[i] OFS $0}: saves each record (records are separated by lines that do not start with a digit) into the array a.
The rest, in END, prints the result in the expected format.
OFS="|": this assumes the character | does not occur in your input file; if it does, change it to another character, such as @ or #.
I'm not too familiar with Scala, but I think this is the regex you're looking for:
([A-Z]+[A-Z ]*)
See it in action: http://regex101.com/r/gY8lX6
Edit: In that case, all you need to do is add a zero to the beginning of the capture group:
(0[A-Z]+[A-Z ]*)
I am trying to retrieve particular parts of each line in a text file such as the one below, and I would like to save them to a text file in MATLAB.
Original text file
D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74583
D 1m8eb_ 1m8e B: d.174.1.1 74584 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74584
D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158052
D 3e7ib1 3e7i B:77-496 d.174.1.1 158053 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158053
D 2bhja1 2bhj A:77-497 d.174.1.1 128533 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=128533
So basically, I would like to retrieve the PDB code (e.g. "1m8e"), the chain ID (e.g. "A"), the start value (e.g. "77") and the stop value (e.g. "496"), and pass all of these values to an fprintf statement.
Is there some way to use regexp to state which index each part starts at, and retrieve those strings based on their position in each line of the text file?
In the end, all I want to have in the fprintf statement is 1m8e, A, 77, 496.
So far I have two fopen calls, one that reads a file and one that writes to a new file, a loop that reads the file line by line, and an fprintf statement:
pdbcode = '';
chainid = '';
start = '';
stop = '';
fin = fopen('dir.cla.scop.txt_1.75.txt', 'r');
fout = fopen('output_scop.txt', 'w');
% TODO: Add error check!
while true
    line = fgetl(fin); % Get the next line from the file
    if ~ischar(line)
        % End of file
        break;
    end
    % Print result into output_scop.txt file
    fprintf(fout, 'INSERT INTO cath_domains (scop_pdbcode, scop_chainid, scopbegin, scopend) VALUES("%s", %s, %s, %s);\n', pdbcode, chainid, start, stop);
end
fclose(fin);
fclose(fout);
Thank you.
You should be able to strsplit on whitespace, get the third ("1m8e") and fourth elements ("A:77-496"), then repeat the process on the fourth element using ":" as the split character, and then again on the second of those two arguments using "-" as the split character. That's one approach. For example, you could do:
% split on whitespace; strsplit collapses consecutive delimiters by default
tokens = strsplit(strtrim(line));
pdbcode = tokens{3};
% split fourth token from previous split on colon
tokens = strsplit(tokens{4}, ':');
chainid = tokens{1};
% split second token from previous split on dash
% note: a line with no range (e.g. "A:") yields a single empty token here, so tokens{2} would fail
tokens = strsplit(tokens{2}, '-');
start = tokens{1};
stop = tokens{2};
If you really wanted to use regular expressions, you could try the following:
pattern = '\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):([0-9]+)-([0-9]+)';
tok = regexp(line, pattern, 'tokens', 'once'); % cell array holding the four captured strings
if ~isempty(tok) % lines without a start-stop range (e.g. "A:") do not match
    pdbcode = tok{1};
    chainid = tok{2};
    start = tok{3};
    stop = tok{4};
end
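Some of the sample lines have a chain with no residue range ("A:"), which the pattern above does not match. One way to cover both cases is to make the range part of the pattern optional. The sketch below shows this in Python rather than MATLAB, purely to illustrate the pattern; the names SCOP_RE and parse_scop are hypothetical, and the column layout is inferred from the sample lines in the question.

```python
import re

# Third column is the PDB code; the fourth is "chain:" with an optional "start-stop".
SCOP_RE = re.compile(r"^\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):(?:(\d+)-(\d+))?")

def parse_scop(line):
    """Return (pdbcode, chainid, start, stop); start and stop are None without a range."""
    m = SCOP_RE.match(line)
    if m is None:
        return None
    return m.groups()
```

The same optional group, (?:([0-9]+)-([0-9]+))?, could be tried in the MATLAB pattern; how MATLAB reports the tokens of an unmatched optional group should be checked before relying on it.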