Here is the code:
<div>23 Anywhere Ave<br />Someplace<br />Somewhere 1234</div>
I want to scrape the resulting three lines
23 Anywhere Ave<br />Someplace<br />Somewhere 1234
into separate columns. I can scrape the first string (23 Anywhere Ave) by setting <div> as the front marker and <br /> as the back marker.
I get stuck after that. I've tried setting the front marker to <div>(?=)<br />), /<div>(?=)<br />)/ (Outwit apparently requires / / around a regex statement), and <div>/(?=)/<br />) to get the second value, but no luck.
I realise that regex is not suitable for parsing HTML, but this post indicates that it's OK in certain contexts within the Outwit architecture.
In automators/scrapers, set the Separator to: br
Then set the List of Labels to: Street,City,ZipCode
Br,
Eusebio.
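Outside Outwit, the same idea (treat br as the separator, then label the pieces) can be sketched in plain JavaScript. This is only an illustration under assumptions: `splitAddress` is a made-up helper, not an Outwit function, and it expects a single simple <div> like the one in the question.

```javascript
// Hypothetical helper: returns the <br>-separated pieces inside the first <div>.
function splitAddress(html) {
  // Capture everything between <div> and </div> (non-greedy).
  const match = /<div>([\s\S]*?)<\/div>/i.exec(html);
  if (!match) return [];
  // Split on <br>, <br/> or <br /> and trim each piece.
  return match[1].split(/<br\s*\/?>/i).map((part) => part.trim());
}

// Destructure the pieces into the three labels from the answer above.
const [street, city, zipCode] = splitAddress(
  "<div>23 Anywhere Ave<br />Someplace<br />Somewhere 1234</div>"
);
```

Here `street`, `city` and `zipCode` play the role of the three columns.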
I have a template for an MkDocs site, which uses Jinja2. I am trying to add a link to a PDF version of each page. The PDF always has the same name as the markdown file. So I am trying to add a link in the template that will automatically target the correct PDF for each page. This feels cleaner than having the writers add a manual link to every page.
Download
The above is almost correct, but there is a '/' at the end of all the URLs. Meaning the result is:
page/url/slug/.pdf
Neither MkDocs nor Jinja seems to provide a filter to remove trailing slashes, so I am wondering if it's possible to use regex to remove it. I believe that would be as simple as \/$? However, I can't see from the docs how to apply a regex filter in Jinja.
You can do something like this:
{{ "string/".rstrip("/") }}
Worked for me.
So I found a workaround for my specific case, but it's nasty:
<a href='{{ config.site_url }}{{ page.url | reverse | replace("/", "", 1) | reverse }}.pdf'>Download</a>
1. Prepend the site URL
2. Get the current page URL, reverse it, use replace with the optional count parameter to remove the FIRST '/', then reverse it again to get it back in the right order
3. Append '.pdf'
According to one of the answers to the question linked by Jan above, you can't simply use regex in Jinja2 without getting into custom filters.
Download
where $ is the end of the line / end of the string.
Therefore, /$ means the / at the end.
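The pattern itself is not specific to Jinja. As a rough sketch of what /$ does, here it is applied in JavaScript; `pdfLink` is a hypothetical helper, not a Jinja2 or MkDocs function, and the URL is the example slug from the question.

```javascript
// Hypothetical helper: strips one trailing "/" and appends ".pdf".
function pdfLink(pageUrl) {
  // \/$ matches a "/" only when it is the last character of the string.
  return pageUrl.replace(/\/$/, "") + ".pdf";
}

const withSlash = pdfLink("page/url/slug/");   // "page/url/slug.pdf"
const withoutSlash = pdfLink("page/url/slug"); // unchanged apart from ".pdf"
```

Slashes in the middle of the URL are untouched because the $ anchor only allows a match at the very end.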
I am using a third-party indexing service (Swiftype) to search through my database. The returned records contain a property called highlight. This simply adds <em> tags around matching strings.
I then bind this highlight property in Ember.JS Handlebars as such:
<p> Title: {{highlight.title}} </p>
Which results in the following output:
Title: Example <em>matching</em> text
The browser actually displays the <em> tags instead of formatting them, i.e. Handlebars is not identifying the HTML tags and simply prints them as a string.
Is there a way around this?
Thanks!
By default Handlebars escapes HTML. To prevent escaping, use triple braces:
<p> Title: {{{highlight.title}}} </p>
See http://handlebarsjs.com/#html-escaping
Ember escapes HTML because it could contain potentially malicious code that would be executed. To avoid that, use
new Ember.Handlebars.SafeString("<em>MyString</em>");
Here are the docs:
http://emberjs.com/guides/templates/writing-helpers/
Once you've done that, you can use {{highlight.title}} as you wanted.
HTH
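To see why the tags show up as text, here is a simplified stand-in for the kind of escaping a template engine applies to double-brace expressions. It is not Handlebars' or Ember's actual implementation; `escapeHtml` and `highlightTitle` are hypothetical names used only for this sketch.

```javascript
// Simplified stand-in for template escaping (NOT the real Handlebars code).
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  // Replace each special character with its HTML entity.
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

const highlightTitle = "Example <em>matching</em> text";
// Roughly what a double-brace expression would emit:
const escaped = escapeHtml(highlightTitle);
// escaped === "Example &lt;em&gt;matching&lt;/em&gt; text"
```

The browser renders `&lt;em&gt;` as the literal characters `<em>`, which is the behaviour described in the question. Skipping the escaping step is only safe when the string comes from a source you trust.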
I have suffered a number of XSS attacks against my site. The following HTML fragment is the XSS vector that has been injected by the attacker:
<a href="mailto:">
<a href=\"http://www.google.com onmouseover=alert(/hacked/); \" target=\"_blank\">
<img src="http://www.google.com onmouseover=alert(/hacked/);" alt="" /> </a></a>
It looks like script shouldn't execute, but using IE9's development tool, I was able to see that the browser translates the HTML to the following:
<a href="mailto:"/>
<a onmouseover="alert(/hacked/);" href="\"http://www.google.com" target="\"_blank\"" \?="">
</a/>
After some testing, it turns out that the \" makes the "onmouseover" attribute "live", but I don't know why. Does anyone know why this vector succeeds?
So to summarize the comments:
Sticking a character in front of the quote turns the quote into part of the attribute value instead of marking the beginning and end of the value.
This works just as well:
href=a"http://www.google.com onmouseover=alert(/hacked/); \"
HTML allows quoteless attributes, so it becomes two attributes with the given values.
I'm trying to have all image links outside of my site blocked. How can I do this?
ex. I want to accept
http://www.mysite.com/notnecessary/notnecessary/possible.jpg
http://mysite.com/notnecessary/notnecessary/possible.jpg
http://www.mysite.com/possible.gif
but not
http://www.google.com/notnecessary/notnecessary/possible.jpg
http://www.othersite.net/notnecessary/notnecessary/possible.jpg
I'm doing this to try to prevent hacking :) But I still want to be able to include my site's images, using
`<img src=""></img>`
Edit:
If I have a comment that says:
' Hello, these are images that contain a car
<img src="http://mysite.com/possiblepath/car.jpg"></img>
<img src="http://www.othersite.com/w/e/car.gif"></img>
<img src='http://othersite.com/car.jpg'></img>
Which car is your favorite?'
So from this comment, I need:
' Hello, these are images that contain a car
<img src="http://mysite.com/possiblepath/car.jpg"></img>
http://www.othersite.com/w/e/car.gif
http://othersite.com/car.jpg
Which car is your favorite?'
My site img code should stay, others should turn into URLs/Links.
Thank you! Greatly appreciated.
^http://(.*)mysite.com(.*)$
That should work for you. You may need to add \'s before the parentheses though, depending on what you are parsing it with. It will match if the given URL belongs to mysite.com or any of its subdomains.
^https?://(?:www\.)?mysite\.com/
^ Start of line
http
s? Maybe you have SSL??
://
(?:www\.)? With or without www. Similar to
mysite
\. Prevents "mysites"
com/
Try something like this:
^https?://(\w+\.)*mysite\.com($|[/?#&])
notes:
(\w+\.) is a simplistic idea of what your subdomain may hold. If you're only interested in www, you can change that.
($|[/?#&]) - check for the end of the string or one of /, ?, # or & directly after mysite.com. You want to avoid http://mysite.com.example.com, http://example.com/mysite.com, http://example.com?source=mysite.com, etc.
Don't check file extension unless you're going to white-list it, but it is rather useless anyway. Any URL may hide an image - a server may return any file for any request.
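A minimal JavaScript sketch of how the last pattern could be applied to a comment like the one in the question. The names `OWN_SITE`, `isOwnImage` and `stripForeignImages` are hypothetical, and the <img> matching assumes tags written exactly as in the question (a quoted src as the only attribute); regex-based HTML rewriting is fragile, so treat this as an illustration rather than a complete sanitizer.

```javascript
// The host check from the answer above, with "/" escaped for a JS regex literal.
const OWN_SITE = /^https?:\/\/(\w+\.)*mysite\.com($|[\/?#&])/i;

// Hypothetical helper: true when the URL points at mysite.com or a subdomain.
function isOwnImage(url) {
  return OWN_SITE.test(url);
}

// Hypothetical helper: keep own <img> tags, turn foreign ones into their bare URL.
function stripForeignImages(comment) {
  return comment.replace(
    /<img\s+src=(["'])(.*?)\1\s*>(?:<\/img>)?/gi,
    (tag, quote, src) => (isOwnImage(src) ? tag : src)
  );
}

const cleaned = stripForeignImages(
  '<img src="http://mysite.com/possiblepath/car.jpg"></img>\n' +
  "<img src='http://othersite.com/car.jpg'></img>"
);
```

In `cleaned`, the first tag is kept as is and the second is reduced to http://othersite.com/car.jpg.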
Hi, I just wrote a validation function in JavaScript to validate the filename in a file upload control [input type file]. The function seems to work fine in FF and sometimes in IE, but never in Chrome. Basically, the function tests that the filename is at least 1 and at most 25 characters long, contains only valid characters and numbers [no spaces], and has one of the file types in the list. Could you throw some light on this?
function validate(Uploadelem) {
    var objRgx = new RegExp(/^[\w]{1,25}\.*\.(jpg|gif|png|jpeg|doc|docx|pdf|txt|rtf)$/);
    objRgx.ignoreCase = true;
    if (objRgx.test(Uploadelem.value)) {
        document.getElementById('moreUploadsLink').style.display = 'block';
    } else {
        document.getElementById('moreUploadsLink').style.display = 'none';
    }
}
EDIT:
Nope, it still does not seem to work. I am using IE 8 (tried all the compatibility modes), Chrome v8.0, and FF v3.6.
Here is an HTML snippet in which I wired up the validate function:
<div>
<input type="file" name="attachment" id="attachment" onchange="validate(this)" />
<span class="none">Filename should be within (1-25) letters long. Can Contain only letters
& numbers</span>
<div id="moreUploads">
</div>
<div id="moreUploadsLink" style="display: none;">
Attach another File</div>
</div>
It works perfectly for me. How do you call the validate function? – M42
Did you try this on Google Chrome and IE 8? I added an HTML snippet showing where I wired it up, and I tried all of the recommended regexes. No clue as to why it doesn't work!
Mike, I am unable to comment on your post, so this is for you.
The validation fails for whichever file I choose in the HTML input. I also wired the validation to the onblur event, with the same result. The validate function is meant to mimic the ASP.NET regular expression validator, which displays a validation error message when the regular expression is not met.
Try simplifying your code.
function validate(Uploadelem) {
    var objRgx = /^[\w]{1,25}\.+(jpg|gif|png|jpeg|doc|docx|pdf|txt|rtf)$/i;
    if (objRgx.test(Uploadelem.value)) {
        document.getElementById('moreUploadsLink').style.display = 'block';
    } else {
        document.getElementById('moreUploadsLink').style.display = 'none';
    }
}
Your specification is hazy, but it appears that you want to allow dots within filenames (in addition to the dot that separates filename and extension).
In that case, try
var objRgx = /^[\w.]{1,25}\.(jpg|gif|png|jpeg|doc|docx|pdf|txt|rtf)$/i;
This allows filenames that consist only of the characters a-z, A-Z, 0-9, _ and ., followed by a required dot and one of the specified extensions.
As far as I know, Chrome adds a path in front of the filename entered, so you just have to change your regex from:
/^[\w]{1,25}\.*\.(jpg|gif|png|jpeg|doc|docx|pdf|txt|rtf)$/
to:
/\b[\w]{1,25}\.+(jpg|gif|png|jpeg|doc|docx|pdf|txt|rtf)$/
SOLVED
The primary reason that none of the [CORRECT regex patterns] worked is that different browsers return different values for the HTML file input control.
Firefox: Returns the file upload control's file name {as expected}
Internet Explorer: Returns the full path to the file, from drive to file [pain in the ass]
Chrome: Returns a fake path as [C:\FakePath\Filename.extension]
I got a solution for Chrome and FF but not IE.
Chrome and Firefox:
use FileUploadControlID.files[0].fileName or FileUploadControlID.files[0].name
IE
Again the biggest pain in the ass [someone suggest a solution]
A valid regex to validate both file name and extension would be:
/\b([a-zA-Z0-9._/s]{3,50})(?=(\.((jpg)|(gif)|(jpeg)|(png))$))/i
1. File name should be between 3 and 50 characters
2. Only jpg, gif, jpeg, png files are allowed
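One browser-independent way around the differing values is to strip any leading path before testing, so a bare file name, a full path and a C:\FakePath\ value all reduce to the same thing. The sketch below assumes the rules from the original question (1 to 25 word characters, then one of the listed extensions); `baseName` and `isValidUpload` are hypothetical helper names, not from any of the posts.

```javascript
// Hypothetical helper: drop everything up to the last "/" or "\".
function baseName(value) {
  // Greedy .* makes the character class match the LAST separator.
  return value.replace(/^.*[\\\/]/, "");
}

// Hypothetical helper: apply the question's rules to the bare file name.
function isValidUpload(value) {
  const pattern = /^\w{1,25}\.(jpg|gif|png|jpeg|doc|docx|pdf|txt|rtf)$/i;
  return pattern.test(baseName(value));
}

const fromChrome = isValidUpload("C:\\fakepath\\photo.JPG"); // path is stripped first
const fromFirefox = isValidUpload("report_2010.pdf");         // bare file name
```

In the question's handler, this would be called as `isValidUpload(Uploadelem.value)`.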