Parsing JavaScript with Python - regex

In one of my scripts I use urllib2 and BeautifulSoup to parse an HTML page and read a <script> tag.
This is what I get:
<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"
...
</script>
My goal is to read the JSON in the x_data variable, and I do not know how to do it properly.
I thought of:
Converting to a string and removing the leading characters up to the first { and the trailing characters after the last }
Using a regular expression with something like '{.*}' and taking the first group
Something else?
I don't know whether these are efficient, or whether there are other, nicer ways to do it.
Do you think one method is preferable to the other? Is there any method I may not be aware of?
Thank you in advance for any advice.
EDIT:
Following the advice, I went with the regexp solution, but I can't search across multiple lines despite using re.MULTILINE:
import re
# A triple-quoted literal is needed for a string that spans several lines.
string1 = '''<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};
</script>'''
p = re.compile(r'\{.*\};', re.MULTILINE)
m = p.search(string1)
if m:
    print(m.group(0))
else:
    print("Error !")
I always get "Error !".
EDIT2:
It works well with re.DOTALL.
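For anyone wondering why the flag matters, here is a minimal sketch using the string from the question. re.MULTILINE only changes where ^ and $ match, whereas re.DOTALL lets . match newline characters, which is what a multi-line object literal needs. The variable names are just for illustration.

```python
import re

# Sample string from the question.
string1 = '''<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};
</script>'''

pattern = r'\{.*\};'

# re.MULTILINE only affects ^ and $, so '.' still stops at every newline.
multiline_match = re.search(pattern, string1, re.MULTILINE)

# re.DOTALL makes '.' match newlines too, so the greedy '.*' can span lines.
dotall_match = re.search(pattern, string1, re.DOTALL)

print(multiline_match)        # no match is found with re.MULTILINE alone
print(dotall_match.group(0))  # the object literal, from the first '{' to '};'
```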

I think these methods are essentially the same in terms of elegance and performance (using {.*} may be slightly better because .* is greedy, i.e. there will be almost no backtracking, and because it seems to me more "forgiving" for different JS code formatting nuances). What you may be more interested in is this: https://docs.python.org/3.6/library/json.html.
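One caveat with the json module: the x_data literal as a whole is JavaScript, not JSON (unquoted keys, and logged: logged refers to a variable), so json.loads would reject it. A rough sketch, assuming only the products array is needed and that it is itself valid JSON with no nested arrays, could pull out just that part:

```python
import json
import re

# Script body from the question's EDIT.
script_text = '''var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};'''

# Assumption: the array contains no nested '[' ... ']', so a non-greedy match
# that stops at the first ']' captures the whole array.
match = re.search(r'products\s*:\s*(\[.*?\])', script_text, re.DOTALL)

# json.loads only succeeds because this fragment happens to be valid JSON.
products = json.loads(match.group(1)) if match else []
print(products)
```

If the page ever nests arrays inside products, the non-greedy match will cut the data short, which is exactly the kind of fragility discussed in the next answer.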

If it always looks exactly like this, then you can hack a solution like the one you proposed, based on it looking exactly like this.
Because programmers do everything in code, I suspect that in practice it will not always look exactly like this, and then any hacky solution will be fragile and will fail at unexpected (read "impossibly inconvenient") moments. (Regex is known to be hacky when it comes to parsing code.)
If you want to do this right, you will need to get a real JavaScript parser, apply it to the code fragment defined by the script tag content, to produce an AST, then search the AST for JavaScript nested structures that happen to look like JSON, and take the content of that tree, prettyprinted.
Even this will be fragile in the face of a programmer who assembles the JSON fragment using JavaScript assignment statements. You can handle this by computing data flow, and discovering sets of code that happen to assemble JSON code. This is rather a lot of work.
So you get to decide what the limits on your solution will be, and then accept the consequences when somebody you don't control does something random.

Related

How to get the first digit on the left side of a string with python and regex?

I want to get a specific digit based on the right string.
This stretch of string is in body2.txt
string = "<li>3 <span class='text-info'>quartos</span></li><li>1 <span class='text-info'>suíte</span></li><li>96<span class='text-info'>Área Útil (m²)</span></li>"
import re
with open("body2.txt", 'r') as f:
    area = re.compile(r'</span></li><li>(\d+)<span class="text-info">Área Útil')
    area = area.findall(f.read())
print(area)
output: []
expected output: 96
You have a quote mismatch. Note carefully the difference between 'text-info' and "text-info" in your example string and in your compiled regex. IIRC escaping quotes in raw strings is a bit of a pain in Python (if it's even possible?), but string concatenation sidesteps the issue handily.
area = re.compile(r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil')
Focusing on the quotes, this concatenates the strings '...class=', "'", 'text-info', "'", and '>...'. The rule there is that if you want a single quote ' in a single-quoted raw string you instead write '"'"' and try to ignore Turing turning in his grave. Python joins adjacent string literals at compile time, so there is no runtime concatenation cost to worry about. The real drawback is readability: you'd be better off with nearly any other strategy, such as delimiting the raw string with double quotes or using a triple-quoted raw string r'''...'''.
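As a quick sketch of the alternative quoting styles, here are two ways to write the same pattern, run against the sample string from the question instead of body2.txt. The names area_dq and area_tq are made up for this example.

```python
import re

# Sample string from the question (attribute values use single quotes).
string = ("<li>3 <span class='text-info'>quartos</span></li>"
          "<li>1 <span class='text-info'>suíte</span></li>"
          "<li>96<span class='text-info'>Área Útil (m²)</span></li>")

# Option 1: delimit the raw string with double quotes, so single quotes
# can appear inside it unescaped.
area_dq = re.compile(r"</span></li><li>(\d+)<span class='text-info'>Área Útil")

# Option 2: a triple-quoted raw string may contain both quote characters.
area_tq = re.compile(r'''</span></li><li>(\d+)<span class='text-info'>Área Útil''')

print(area_dq.findall(string))
print(area_tq.findall(string))
```

Both patterns are character-for-character the same regex; only the Python string delimiters differ.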
As one of the comments mentioned, you probably want to be parsing the HTML with something more powerful than regex. Regex cannot properly parse arbitrary HTML since it can't parse arbitrarily nested structures. There are plenty of libraries to make the job easier though and handle all of the bracket matching and string munging for you so that you can focus on a high-level description of exactly the data you want. I'm a fan of lxml. Without putting a ton of time into it, something like the following would be roughly equivalent to what you're doing.
from lxml import html
with open("body2.txt", 'r') as f:
    tree = html.fromstring(f.read())
area = tree.xpath("//li[contains(span/text(), 'Área Útil')]/text()")
print(area)
The html.fromstring() method parses your data as html. The tree.xpath method uses xpath syntax to query that parsed tree. Roughly speaking it means the following:
// : arbitrarily far down in the tree
li : a list node
[*] : satisfying whatever property is in the square brackets
contains(span/text(), 'Área Útil') : the li node needs to have a span/text() node containing the text 'Área Útil'
/text() : we want any text that is an immediate child of the li we're describing
I'm working on a pretty small amount of text here and don't know what your document structure is in the general case. You could add or change any of those properties to better describe the exact document you're parsing. When you inspect an element, any modern browser is able to generate a decent xpath expression to pick out exactly the element you're inspecting. Supposing this snippet came from a larger document I would imagine that functionality would be a time saver for you.
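If installing lxml is not an option, the same idea can be roughly sketched with the standard library's xml.etree.ElementTree. This is only a sketch under a strong assumption: the fragment must be well-formed XML once wrapped in a single root element (the <ul> wrapper here is my addition), which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question.
fragment = ("<li>3 <span class='text-info'>quartos</span></li>"
            "<li>1 <span class='text-info'>suíte</span></li>"
            "<li>96<span class='text-info'>Área Útil (m²)</span></li>")

# Assumption: wrapping the fragment in one root element yields well-formed XML.
root = ET.fromstring("<ul>" + fragment + "</ul>")

# Keep the leading text of every <li> that has a <span> child mentioning 'Área Útil'.
area = [
    (li.text or "").strip()
    for li in root.iter("li")
    if any("Área Útil" in (span.text or "") for span in li.findall("span"))
]
print(area)
```

For anything scraped from a live site, a forgiving HTML parser such as lxml remains the safer choice.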
This will get the right digits no matter how / what form the target is in.
Capture group 1 contains the digits.
r"(\d*)\s*<span(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\sclass\s*=\s*(?:(['\"])\s*text-info\s*\2))\s+(?=((?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+>))\3\s*Área\s+Útil"
https://regex101.com/r/pMATkj/1

Coldfusion JSON Breaking with DataTables

While working on a task, I am using the JSStringFormat function to handle JSON data that may contain special characters, but it does not seem to handle every case.
My JSON still breaks.
I am using it like this:
"<a href='edit.cfm?id=#jsStringFormat(qFiltered.randomnumber)#' style='color:##066D99'>#trim(jsStringFormat(qFiltered[thisColumn][qFiltered.currentRow]))#</a>"
I am lost as to what else I can use, such as a regex or REReplace, so that it does not break.
Thanks
You're doing multiple things here.
You're putting the string into a URL: use UrlEncodedFormat.
You're also putting it in an HTML tag: use HtmlEditFormat.
The whole thing is going into a JavaScript variable, so I would use JSStringFormat to wrap the whole thing.
Try building your string before assigning it.
<cfsavecontent variable="htmlLink"><cfoutput>
#HtmlEditFormat(Trim(qFiltered[thisColumn][qFiltered.currentRow]))#
</cfoutput></cfsavecontent>
myJsVar = "#JsStringFormat(Trim(htmlLink))#";

Using Regex to validate the number of words in a text area

I am attempting to write an MVC model validation that verifies that there are 10 or more words in a string. The string is being populated correctly, so I did not include the HTML. I have done a fair bit of research, and it seems that something along the lines of what I have tried should work, but, for whatever reason, my attempts always fail. Any ideas as to what I am doing wrong here?
(using System.ComponentModel.DataAnnotations, in an MVC 4 VB.NET environment)
I have tried ([\w]+){10,}, ((\\S+)\s?){10,}, [\b]{20,}, [\w+\w?]{10,}, (\b(\w+?)\b){10,}, ([\w]+?\s){10}, ([\w]+?\s){9}[\w], ([\S]+\s){9}[\S], ([a-zA-Z0-9,.'":;$-]+\s+){10,} and several more variations on the same basic idea.
<Required(ErrorMessage:="The Description of Operations field is required"), RegularExpression("([\w]+){20,}", ErrorMessage:="ERROZ")>
Public Property DescOfOperations As String = String.Empty
Correct Solution was ([\S]+\s+){9}[\S\s]+
EDIT: Moved the accepted version to the top and removed the unused versions. If the whole input needs to match, then use something like this (it also accounts for double spaces):
([\S]+\s+){9}[\S\s]+
Or:
([\w]+?\s+){9}[\w]+
Give this a try:
([a-zA-Z0-9,.'":;$-]+\s){10,}
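To see how the accepted pattern behaves when the whole input must match, here is a small Python sketch (not VB.NET, just a quick way to experiment). It assumes, as the answer suspects, that the validator requires the entire value to match; re.fullmatch stands in for that behaviour, and has_ten_words is a hypothetical helper name.

```python
import re

# Pattern from the accepted solution: nine "word + whitespace" groups,
# followed by at least one more character.
AT_LEAST_TEN_WORDS = re.compile(r'([\S]+\s+){9}[\S\s]+')

def has_ten_words(text):
    # fullmatch mimics a validator that needs the whole input to match.
    return AT_LEAST_TEN_WORDS.fullmatch(text) is not None

ten = "one two three four five six seven eight nine ten"
nine = "one two three four five six seven eight nine"
double_spaced = "one  two three four five six seven eight nine ten"

print(has_ten_words(ten))
print(has_ten_words(nine))
print(has_ten_words(double_spaced))
```

The ten-word inputs match and the nine-word input does not, because the pattern needs nine separate runs of whitespace before the final part.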

Creating a simple parser in (V)C++ (2010) similar to PEG

For a school project, I need to parse a text/source file containing a simplified "fake" programming language to build an AST. I've looked at boost::spirit; however, since this is a group project and most of the group seems reluctant to learn extra libraries, and the lecturer/TA recommended learning to create a simple parser in C++, I thought of going that route. Are there some examples out there, or ideas on how to start? I have made a few attempts, but none has really been successful yet ...
Parse line by line
Test each line against a bunch of regexes (one for procedure/function declarations, one for assignments, one for while, etc.)
But I will need to assume there are no multiple statements on one line, e.g. a=b;x=1;
When I reach a container statement (procedure, while, etc.), I will increase the indent, so all nested statements go under it
When I reach a } I will decrement the indent
Any better ideas or suggestions? Example code I need to parse (very simplified here ...)
procedure Hello {
    a = 1;
    while a {
        b = a + 1 + z;
    }
}
Another idea was to read whole file into a string, and go top down. Match all procedures, then capture everything in { ... } then start matching statements (end with ;) or containers while { ... }. This is similar to how PEG does things? But I will need to read entire file
Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down to simple chunks, and maybe have a little struct or enum that describes what the token is (ie, whitespace, a string literal, an identifier type thing, etc).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out pretty-printed version in html or something (where you color whitespace to have a pink background, identifiers as light blue, operations as light green, numbers as light orange), and see if your tokenizer is making sense.
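The tokenizing pass and the round-trip check described above might look roughly like this. It is sketched in Python for brevity rather than C++; the Token tuple, the token kinds, and the operator set are assumptions based on the sample code, not a complete definition of the language.

```python
import re
from collections import namedtuple

# Each token keeps its kind, original text, and position (for error messages).
Token = namedtuple("Token", "kind text line col")

# Assumption: these four token kinds are enough for the sample language.
TOKEN_RE = re.compile(r"""
    (?P<whitespace>\s+)
  | (?P<number>\d+)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<operation>[{}=;+])
""", re.VERBOSE)

def tokenize(source):
    tokens, pos, line, col = [], 0, 1, 1
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError("unexpected %r at line %d, column %d"
                              % (source[pos], line, col))
        text = m.group(0)
        # m.lastgroup is the name of the alternative that matched.
        tokens.append(Token(m.lastgroup, text, line, col))
        # Advance the line/column counters past this token.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
        pos = m.end()
    return tokens

source = "procedure Hello {\n\ta = 1;\n\twhile a {\n\t\tb = a + 1 + z;\n\t}\n}\n"
tokens = tokenize(source)

# Non-destructive: joining the token texts must reproduce the input exactly.
assert "".join(t.text for t in tokens) == source
print(tokens[:7])
```

Because whitespace is kept as its own token kind, a later pass can simply filter it out if the language turns out to be whitespace-insensitive.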
Now, your language may be whitespace insensitive. So discard the whitespace if that is the case! (C++ isn't, because you need newlines to learn when // comments end)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad-hoc. Look for operator({) and find the matching operator(}) by counting up and down. Look for language keywords (like procedure), which then expects a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.
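To make the ad-hoc approach a little more concrete, here is a rough sketch (in Python rather than C++, to keep it short) of a hand-written parser for the sample language. The grammar it accepts (procedure, while with a single-name condition, and assignments ending in ;) and the tuple-based AST are assumptions inferred from the example code, not a full specification.

```python
import re

def tokenize(source):
    # Whitespace is discarded; characters outside these patterns are silently skipped.
    return re.findall(r"[A-Za-z_]\w*|\d+|[{}=;+]", source)

def peek(tokens, pos):
    # Return the token at pos, or a marker once the input is exhausted.
    return tokens[pos] if pos < len(tokens) else "<eof>"

def expect(tokens, pos, wanted):
    if peek(tokens, pos) != wanted:
        raise SyntaxError("expected %r but found %r" % (wanted, peek(tokens, pos)))
    return pos + 1

def parse_block(tokens, pos):
    """Parse '{' statement* '}' and return (statements, next_pos)."""
    pos = expect(tokens, pos, "{")
    statements = []
    while peek(tokens, pos) != "}":
        if peek(tokens, pos) == "<eof>":
            raise SyntaxError("missing '}'")
        node, pos = parse_statement(tokens, pos)
        statements.append(node)
    return statements, pos + 1

def parse_statement(tokens, pos):
    if peek(tokens, pos) == "while":
        # Assumption: the loop condition is a single name, as in the sample.
        condition = peek(tokens, pos + 1)
        body, pos = parse_block(tokens, pos + 2)
        return ("while", condition, body), pos
    # Otherwise treat it as: name '=' expression-tokens ';'
    target = tokens[pos]
    pos = expect(tokens, pos + 1, "=")
    expression = []
    while peek(tokens, pos) not in (";", "<eof>"):
        expression.append(tokens[pos])
        pos += 1
    pos = expect(tokens, pos, ";")
    return ("assign", target, expression), pos

def parse_procedure(tokens, pos=0):
    pos = expect(tokens, pos, "procedure")
    name = peek(tokens, pos)
    body, pos = parse_block(tokens, pos + 1)
    return ("procedure", name, body), pos

source = """
procedure Hello {
    a = 1;
    while a {
        b = a + 1 + z;
    }
}
"""
ast, _ = parse_procedure(tokenize(source))
print(ast)
```

Each parse function consumes tokens from a position and returns the node plus the next position, so nesting falls out of the recursion instead of an explicit indent counter.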
Since you mentioned PEG, I'd like to throw in my open source project: https://github.com/leblancmeneses/NPEG/tree/master/Languages/npeg_c++
Here is a visual tool that can export C++ version: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench
Documentation for rule grammar: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-dsl-documentation
If I were writing my own language, I would probably look at the terminals/non-terminals found in System.Linq.Expressions, as these would be a great start for your grammar rules.
http://msdn.microsoft.com/en-us/library/system.linq.expressions.aspx
System.Linq.Expressions.Expression
System.Linq.Expressions.BinaryExpression
System.Linq.Expressions.BlockExpression
System.Linq.Expressions.ConditionalExpression
System.Linq.Expressions.ConstantExpression
System.Linq.Expressions.DebugInfoExpression
System.Linq.Expressions.DefaultExpression
System.Linq.Expressions.DynamicExpression
System.Linq.Expressions.GotoExpression
System.Linq.Expressions.IndexExpression
System.Linq.Expressions.InvocationExpression
System.Linq.Expressions.LabelExpression
System.Linq.Expressions.LambdaExpression
System.Linq.Expressions.ListInitExpression
System.Linq.Expressions.LoopExpression
System.Linq.Expressions.MemberExpression
System.Linq.Expressions.MemberInitExpression
System.Linq.Expressions.MethodCallExpression
System.Linq.Expressions.NewArrayExpression
System.Linq.Expressions.NewExpression
System.Linq.Expressions.ParameterExpression
System.Linq.Expressions.RuntimeVariablesExpression
System.Linq.Expressions.SwitchExpression
System.Linq.Expressions.TryExpression
System.Linq.Expressions.TypeBinaryExpression
System.Linq.Expressions.UnaryExpression

Regular Expression - how to find text within particular if blocks?

I'm new to regular expressions and would like to use one to search through our source control to find text within a block of code that follows a particular enum value. I.e.:
/(\/{2}\#debug)(.|\s)*?(\/{2}\#end-debug).*/
var junk = dontWantThis if (junk) {dont want this} if ( **myEnumValue** ) **{ var yes = iWantToFindThis if (true) { var yes2 = iWantThisToo } }**
var junk2 = dontWantThis if (junk) {dont want this}
var stuff = dontWantThis if (junk) {dont want this} if ( enumValue ) { wantToFindThis }
var stuff = iDontWantThis if (junk) {iDontWantThisEither}
I know I can use (\{(/?[^\>]+)\}) to find if blocks, but I only want the first encompassing block of code that follows the enum value I'm looking for. I've also noticed that using (\{(/?[^\>]+)\}) gives me the first { and the last }; it doesn't group the subsequent {}.
Thank you!
Tim
Regexps simply can't handle this kind of stuff. For this you'll need a parser and scanner.
As others hint, it's mathematically impossible to do this with regular expressions (at least in general; you might be able to get it to work if you have highly specialized cases). Try using a combination of lex and awk to get the desired results if you want to stick with standard Unix tools, or just go to Perl, Python, Ruby, etc. and build up the lexical parsing you need.
While nesting is a problem, you could use backtracking and lookahead to effectively count your matching braces or quotes. This is not strictly part of a regular expression but has been added to many regex libraries, such as the ones in .NET, Perl, and Java; probably more. I wouldn't recommend that you go this route, as you should find it easier to lexically parse this. But if you do try this as a quick fix, absolutely collect a few test cases and run them through RegexBuddy or Expresso.
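As a sketch of the "build up the lexical parsing you need" route, the following Python uses a regex only to locate the if ( myEnumValue ) { header and then counts braces by hand to find the matching }. The function name block_after is hypothetical, and the sketch assumes braces never appear inside string literals or comments.

```python
import re

def block_after(source, enum_name):
    """Return the first brace-delimited block following `if ( enum_name )`, or None."""
    header = re.search(r"\bif\s*\(\s*" + re.escape(enum_name) + r"\s*\)\s*\{", source)
    if header is None:
        return None
    start = header.end() - 1  # index of the opening '{'
    depth = 0
    for i in range(start, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace for the block we started in.
                return source[start:i + 1]
    return None  # braces were unbalanced

line = ("var junk = dontWantThis if (junk) {dont want this} "
        "if ( myEnumValue ) { var yes = iWantToFindThis "
        "if (true) { var yes2 = iWantThisToo } }")

print(block_after(line, "myEnumValue"))
```

The depth counter is what a plain regular expression lacks: it lets the inner { ... } be skipped over so that the outer block is returned whole.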