python, pyparsing, stopOn and repeating structures - python-2.7

The time has come to brush up on my pyparsing skills.
given a file containing repetitive structures
space_missions
Main Objects:
/Projects/antares_III
/Projects/apollo
ground_missions
Main Objects:
/Projects/Barbarossa
/Projects/Desert_Eagle
and my chopped-down 2.7 script
def last_occurance_of( expr):
return expr + ~pp.FollowedBy( expr)
ppKeyName = pp.Word( pp.alphanums)
ppObjectLabel = pp.Literal("Main Objects") + pp.FollowedBy(':')
ppObjectRegex = pp.Regex(r'\/Projects\/\w+')
ppTag = pp.Group( ppKeyName.setResultName('keyy') + pp.Suppress( ppObjectLabel) + pp.ZeroOrMore( ppObjectRegex, stopOn=last_occurance_of( ppObjectRegex)).setResultName('objects') )
ppTags = pp.OneOrMore( ppTag)
with open( fn) as fp:
slurp = fp.read()
results = ppTags.parseString( slurp)
I'd like to get results to return
[['space_missions',['/Projects/antares_III','/Projects/apollo']
,['ground_missions',['/Projects/Barbarossa','/Projects/Desert_Eagle']]
So what am I missing here? I realize I'm lucky in that the strings that make up the lists all have the same beginning which gives last_occurance_of() something to lock on to, but what does one do in the more general case where the strings have nothing to differentiate them from tag-strings
Still-Searching Steve

Three things to fix in your parser:
Your given ppKeyNames include '_'s, but you don't include them in the definition of ppKeyName
ppObjectLabel will parse "Main Objects" followed by a ':', but the ':' does not actually get parsed anywhere. Easiest to just add it to ppObjectLabel instead of using pp.FollowedBy.
last_occurance_of is unnecessary, the repetition of ppObjectRegex will not be confused by the next tag's ppKeyName

Related

Extracting data using regular expressions: Python

The basic outline of this problem is to read the file, look for integers using the re.findall(), looking for a regular expression of [0-9]+ and then converting the extracted strings to integers and summing up the integers.
I am finding trouble in appending the list. From my below code, it is just appending the first(0) index of the line. Please help me. Thank you.
import re
hand = open ('a.txt')
lst = list()
for line in hand:
line = line.rstrip()
stuff = re.findall('[0-9]+', line)
if len(stuff)!= 1 : continue
num = int (stuff[0])
lst.append(num)
print sum(lst)
import re
ls=[];
text=open('C:/Users/pvkpu/Desktop/py4e/file1.txt');
for line in text:
line=line.rstrip();
l=re.findall('[0-9]+',line);
if len(l)==0:
continue
ls+=l
for i in range(len(ls)):
ls[i]=int(ls[i]);
print(sum(ls));
Great, thank you for including the whole txt file! Your main problem was in the if len(stuff)... line which was skipping if stuff had zero things in it and when it had 2,3 and so on. You were only keeping stuff lists of length 1. I put comments in the code but please ask any questions if something is unclear.
import re
hand = open ('a.txt')
str_num_lst = list()
for line in hand:
line = line.rstrip()
stuff = re.findall('[0-9]+', line)
#If we didn't find anything on this line then continue
if len(stuff) == 0: continue
#if len(stuff)!= 1: continue #<-- This line was wrong as it skip lists with more than 1 element
#If we did find something, stuff will be a list of string:
#(i.e. stuff = ['9607', '4292', '4498'] or stuff = ['4563'])
#For now lets just add this list onto our str_num_list
#without worrying about converting to int.
#We use '+=' instead of 'append' since both stuff and str_num_lst are lists
str_num_lst += stuff
#Print out the str_num_list to check if everything's ok
print str_num_lst
#Get an overall sum by looping over the string numbers in the str_num_lst
#Can convert to int inside the loop
overall_sum = 0
for str_num in str_num_lst:
overall_sum += int(str_num)
#Print sum
print 'Overall sum is:'
print overall_sum
EDIT:
You are right, reading in the entire file as one line is a good solution, and it's not difficult to do. Check out this post. Here is what the code could look like.
import re
hand = open('a.txt')
all_lines = hand.read() #Reads in all lines as one long string
all_str_nums_as_one_line = re.findall('[0-9]+',all_lines)
hand.close() #<-- can close the file now since we've read it in
#Go through all the matches to get a total
tot = 0
for str_num in all_str_nums_as_one_line:
tot += int(str_num)
print('Overall sum is:',tot) #editing to add ()

How can I extract a file name based on number string?

I have a list of filenames in a struct array, example:
4x1 struct array with fields:
name
date
bytes
isdir
datenum
where files.name
ans =
ts.01094000.crest.csv
ans =
ts.01100600.crest.csv
etc.
I have another list of numbers (say, 1094000). And I want to find the corresponding file name from the struct.
Please note, that 1094000 doesn't have preceding 0. Often there might be other numbers. So I want to search for '1094000' and find that name.
I know I can do it using Regex. But I have never used that before. And finding it difficult to write for numbers instead of text using strfind. Any suggestion or another method is welcome.
What I have tried:
regexp(files.name,'ts.(\d*)1094000.crest.csv','match');
I think the regular expression you'd want is more like
filenames = {'ts.01100600.crest.csv','ts.01094000.crest.csv'};
matches = regexp(filenames, ['ts\.0*' num2str(1094000) '\.crest\.csv']);
matches = ~cellfun('isempty', matches);
filenames(matches)
For a solution with strfind...
Pre-16b:
match = ~cellfun('isempty', strfind({files.name}, num2str(1094000)),'UniformOutput',true)
files(match)
16b+:
match = contains({files.name}, string(1094000))
files(match)
However, the strfind way might have issues if the number you are looking for exists in unexpected places such as looking for 10 in ["01000" "00101"].
If your filenames match the pattern ts.NUMBER.crest.csv, then in 16b+ you could do:
str = {files.name};
str = extractBetween(str,4,'.');
str = strip(str,'left','0');
matches = str == string(1094000);
files(matches)

QString - parsing QString with geoordinates

I work in Qt Creator (Community) 5.5.1. For example, I have
string="44° 36' 14.2\" N, 33° 30' 58.6\" E, 0m"
of QString. I know, that I must parse it, but i don't know how, because I have never faced with the problem like it. From our string I want to get some other smaller strings:
cgt = "44"; cmt = "36"; cst = "14.2"
cgg = "33"; cmg = "30"; csg = "58.6"
What must I do for working my programm how I said?
I need real code. Thanks.
The simplest way to start would be string.split(' ') - that would yield the list of the string components that were separated by the space character (' '). If you're sure the string will always be formatted exactly like this, you can first remove all the special characters (° and so on).
Then analyze the resulting QStringList. Again, if the format is fixed, you can check that the number of list items matches the expected number, and then get degrees as list[0], minutes as ``list[1]` and so on.
Another alternative would be to use QRegExp for parsing the string (splitting it into substrings based on regex), but I find it too complicated for use cases where split works just as well.
"I need code" is not the kind of question you should be asking, SO is about "gimme knowledge" not about "do my work" questions. A good question should demonstrate your effort to solve the problem, so people can tell you what you are doing wrong. Not only does your question lack any such effort, but you didn't expend any even when Devopia did half of the work for you. Keep that in mind for your future questions.
struct G {
double cgt, cmt, cst, cgg, cmg, csg;
};
G parse(QString s) {
QStringList list = s.split(QRegExp("[^0-9.]"), QString::SkipEmptyParts);
G g;
g.cgt = list.at(0).toDouble();
g.cmt = list.at(1).toDouble();
g.cst = list.at(2).toDouble();
g.cgg = list.at(3).toDouble();
g.cmg = list.at(4).toDouble();
g.csg = list.at(5).toDouble();
return g;
}

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB. I am able to do it for a fixed length string, but for variable length, I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
Not a regexp but if you stick to Matlab you can make use of the cellfun function to avoid loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( #(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %'// Horchler simplification of the previous solution below
%// this does the same than above but more convoluted
%// strout = cellfun( #(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in comment, although the cellfun instruction allows to compact your code, it is just a disguised loop. Moreover, since it runs on cell, it is notoriously slow. You won't see the difference on such simple inputs, but keep this use when super performances are not a major concern.
Now if you like regex you must like black magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse of the cellfun capabilities to obscure your code do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can have all your substring with:
strout = cellfun( #(s)cellfun(#(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(#(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array. One cell for each group of substring. If you want to concatenate them all then simply: strall = cat(2,strout{:});
I haven't had much experience w/ Matlab; but your problem can be solved by a simple string split function.
[parts, m] = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true )
Now, store the last part of parts; and iterate over parts until that:
len = length( parts )
for i = 1:len-1
print( strcat(parts(i), ' = ', parts(len)) )
end
I do not know what exactly is the print function in matlab. You can update that accordingly.
There isn't a single Regex that you can write that will cover all the cases. As posted on this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
You can get all the values in the line with regexp, pick the last value, then use a for loop iterating throughout the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
If you want to use regexprep at any means, you should write a pattern generator and a replace expression generator, based on number of '=' in the input string, and pass these as parameters of your regexprep func.
You can forget about Regex and Split the input to generate the output looping throughout the values (similarly to alternative #1) .

VB.Net Matching and replacing the contents of multiple overlapping sets of brackets in a string

I am using vb.net to parse my own basic scripting language, sample below. I am a bit stuck trying to deal with the 2 separate types of nested brackets.
Assuming name = Sam
Assuming timeFormat = hh:mm:ss
Assuming time() is a function that takes a format string but
has a default value and returns a string.
Hello [[name]], the time is [[time(hh:mm:ss)]].
Result: Hello Sam, the time is 19:54:32.
The full time is [[time()]].
Result: The full time is 05/06/2011 19:54:32.
The time in the format of your choice is [[time([[timeFormat]])]].
Result: The time in the format of your choice is 19:54:32.
I could in theory change the syntax of the script completely but I would rather not. It is designed like this to enable strings without quotes because it will be included in an XML file and quotes in that context were getting messy and very prone to errors and readability issues. If this fails I could redesign using something other than quotes to mark out strings but I would rather use this method.
Preferably, unless there is some other way I am not aware of, I would like to do this using regex. I am aware that the standard regex is not really capable of this but I believe this is possible using MatchEvaluators in vb.net and some form of recursion based replacing. However I have not been able to get my head around it for the last day or so, possibly because it is hugely difficult, possibly because I am ill, or possibly because I am plain thick.
I do have the following regex for parts of it.
Detecting the parentheses: (\w*?)\((.*?)\)(?=[^\(+\)]*(\(|$))
Detecting the square brackets: \[\[(.*?)\]\](?=[^\[+\]]*(\[\[|$))
I would really appreciate some help with this as it is holding the rest of my project back at the moment. And sorry if I have babbled on too much or not put enough detail, this is my first question on here.
Here's a little sample which might help you iterate through several matches/groups/captures. I realize that I am posting C# code, but it would be easy for you to convert that into VB.Net
//these two may be passed in as parameters:
string tosearch;//the string you are searching through
string regex;//your pattern to match
//...
Match m;
CaptureCollection cc;
GroupCollection gc;
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
m = r.Match(tosearch);
gc = m.Groups;
Debug.WriteLine("Number of groups found = " + gc.Count.ToString());
// Loop through each group.
for (int i = 0; i < gc.Count; i++)
{
cc = gc[i].Captures;
counter = cc.Count;
int grpnum = i + 1;
Debug.WriteLine("Scanning group: " + grpnum.ToString() );
// Print number of captures in this group.
Debug.WriteLine(" Captures count = " + counter.ToString());
if (cc.Count >= 1)
{
foreach (Capture cap in cc)
{
Debug.WriteLine(string.format(" Capture found: {0}", cap.ToString()));
}
}
}
Here is a slightly simplified version of the code I wrote for this. Thanks for the help everyone and sorry I forgot to post this before. If you have any questions or anything feel free to ask.
Function processString(ByVal scriptString As String)
' Functions
Dim pattern As String = "\[\[((\w+?)\((.*?)\))(?=[^\(+\)]*(\(|$))\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processFunction(match)))
' Variables
pattern = "\[\[([A-Za-z0-9+_]+)\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processVariable(match)))
Return scriptString
End Function
Function processFunction(ByVal match As Match)
Dim nameString As String = match.Groups(2).Value
Dim paramString As String = match.Groups(3).Value
paramString = processString(paramString)
Select Case nameString
Case "time"
Return getLocalValueTime(paramString)
Case "math"
Return getLocalValueMath(paramString)
End Select
Return ""
End Function
Function processVariable(ByVal match As Match)
Try
Return moduleDictionary("properties")("vars")(match.Groups(1).Value)
Catch ex As Exception
End Try
End Function