CURLOPT_URL with {} characters - libcurl

In an example I saw for making a curl connection from PHP, it had this line:
curl_setopt($ch, CURLOPT_URL, "{$url}");
It seems to actually work, but I don't understand why they wrapped the URL in the squiggly braces. Does that do something magical for CURL?

"{$url}" is completely equivalent to (string)$url. Please see: PHP: Strings - Manual
If value type of $url is not string, it is converted into string.
If value type of $url is always string, their codes are verbose or nonsense.
You should replace "{$url}" into $url.
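A quick way to convince yourself of the equivalence (a throwaway sketch; the URL is just an example value):
$url = 'http://www.example.com/';
var_dump("{$url}" === (string) $url); // bool(true)
var_dump("{$url}" === $url);          // bool(true) whenever $url is already a string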

All you need is
curl_setopt($ch, CURLOPT_URL, "$url");
You have {$url}, but PHP treats {$url} as its interpolation syntax and substitutes the value of $url, braces and all, so no braces end up in the URL.
Good luck!
Update:
From the curl docs and multiple SO questions:
-g/--globoff
This option switches off the "URL globbing parser". When you set this option, you can
specify URLs that contain the letters {}[] without having them being interpreted by curl
itself. Note that these letters are not normal legal URL contents but they should be
encoded according to the URI standard.

libcurl doesn't interpret the curly braces at all, that's done by PHP itself before the string is passed to libcurl.
You can easily see that for yourself if you try to use such on the command line using curl:
$ curl -g '{http://example.com/}'
curl: (1) Protocol "{http" not supported or disabled in libcurl
I used -g to switch off globbing, since the globbing done by the command line tool would otherwise transform the test and make it work. Again: the globbing mentioned in the other answer here is implemented and provided by the command line tool only, so PHP/CURL has no such support. That's not what you see in use from a PHP program.
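To see the same thing from PHP, a minimal sketch (the brace-containing URL is a made-up example; error handling omitted):
// PHP/cURL passes the braces through verbatim; there is no globbing step.
$url = 'http://example.com/path{with}braces';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // plain $url; "{$url}" would add nothing
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);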


How do I escape double quote and other troublesome characters in Powershell regex

I've hit a snag in a script I'm putting together to download the latest installation packages without needing to use Chocolatey or Ketarin. Unfortunately, a few utilities aren't offered at a direct download link; they're hidden behind redirecting URLs, and the download URL expires after 15 minutes. To complicate things a bit further, I'm doing this in PowerShell 2, as we have a few Vista machines in our office.
After researching other similar scenarios, it seems as though I can invoke the .NET WebClient to handle the download, though there isn't a progress bar. As I haven't found a code sample that handles downloading files behind expiring redirects with a .NET WebClient, I decided to use a WebClient request to load the page, extract the current direct download URL from the page using the following regex, and then download the file from that URL. I've checked with regexr.com to verify that the regex catches the sample URL below.
Sample URL
CF DL here
Regex
<a(?: [^>]*?)? href=(["'])([^\1]*?ProgramName*?)\1(?: .*?)?>.*?<\/a>
Unfortunately, PowerShell red-flags this, as it seems to think the double quotes need to be terminated. After attempting to escape any red-flagged characters using backticks, I've wound up with the following, which throws an error saying that '?:' is not recognized as a term, cmdlet, etc.
$downloadLinkRegex = New-Object System.Text.RegularExpressions.Regex (<a(?: [^>]*?)? href=(`[`"`'])(`[`^\1]*?ProgramName.exe*?)\1(?: .*?)?>.*?</a>)
if ("https://www.example.com/randomstring003ejdjd38/dl/ProgramName.exe" -match $downloadLinkRegex){
write-host "yay"
} else{
write-host "nope"}
Attempts to escape the ? using backticks fail as well. Regexes are incredibly difficult for me, so at this point I'm out of ideas on how to make the ISE recognize this as a valid regex and store it in a variable to be called later on the contents of a web request.
If anyone could point out where I've gone wrong, or how to resolve the issue, I would be immensely grateful.
The easiest way I can think of is by using the @" bla "@ block in PowerShell (it's called a here-string).
For example:
$regex = @"
Insert regex here
"@
Everything between the @" and "@ markers will be treated as a string value.
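For instance, a sketch using the single-quoted variant (@' ... '@), which takes its contents completely literally, so none of the quotes or regex metacharacters need backtick-escaping. The pattern is a simplified stand-in for the one in the question, and the HTML is a made-up example:
$pattern = @'
<a[^>]* href=["']([^"']*ProgramName\.exe)["'][^>]*>
'@
$html = '<a class="dl" href="https://example.com/dl/ProgramName.exe">CF DL here</a>'
if ($html -match $pattern) { $Matches[1] } # prints the captured URL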
I just removed the items PowerShell flags. I had to test several different ways to make sure this was the only way PowerShell would let me print to HTML. Even ConvertTo-Html won't bypass PowerShell's issues; it is like a hybrid of HTML. I also noticed that PowerShell doesn't pay attention to blank space when you type, so my real code has lots of spaces and empty lines to differentiate my script.
$My_HTML_table = "<!DOCTYPE html>
<head><title> My Excellent Page </title></head>
<H2> Table 1 </H2>
<text></text>
<table border=1;border-style:solid>
<tr>
<td colspan=1 style=color:blue;background-color:#CCCCCC;font-size:18;padding:5px> Cute Header </td>
</tr>"
$My_HTML_table > C:\File_Path\My_Excellent_HTML.html
But it doesn't match on regexr.com...? It fails because it thinks the </a> is the end of the regex. It also fails because it's trying to match ProgramNam followed by zero or more 'e' characters, ignoring the .exe bit. (And "must not match octal number 1"? That's probably not what you want in there. No, I didn't know that; I just saw it while scratching my head trying to decipher this on regex101.com.)
Anyway, to your question: PowerShell doesn't have regex literals, so you can't just write <a(?: [^>]*?... into the shell and have it work. They have to be strings.
But they don't have to be run through New-Object System.Text.RegularExpressions.Regex.
e.g.
$url = 'CF DL here'
$pattern = "<a.*?href=[`"'](.*?)[`"'][^>]*>.*?</a>"
$url -match $pattern
$Matches[1]
I've quoted the string in double quotes around the outside. And then I've used a backtick to escape the double quotes inside the pattern.
The regex pattern is explained much more helpfully here.
I actually reworked the regex into something simpler to resolve the issue. While the URL continually changes, the file name doesn't, so I focused on the filename rather than the whole URL, and was able to grab the URL I needed.
Looks good
$a='CF DL here'
$a -match '(?<=ef=")[^"]+?(\w+)\.(exe|pdf)'
Iwr $matches[0] -outfile "$($matches[1]).$($matches[2])"

Perl 5: How to improve regex for URL parsing

I'm trying to parse a text file of tweets, removing the URLs and putting them into a urls.txt file. At the moment, I have this regex:
($line =~ /((?:https?|ftp|telnet|gopher|file|imap):\/\/[\w\-\.\~\!\*\'\(\)\;\:\#\&\=\+\$\,\/\\\?\%\#\[\]]*)/)
But as I want to build on it further, and it's quite unwieldy even now, I'm wondering if there's any way I can check for valid URL characters (the [\w\-\.\~\!\*\'\(\)\;\:\#\&\=\+\$\,\/\\\?\%\#\[\]]* part) using something like an array or a hash, or anything that makes it less unnecessarily verbose.
The rest of my code can be provided if needed for whatever reason.
If you want to validate a URL, why not use a module from CPAN to do the hard work for you?
use URI;
my $uri = URI->new("http://www.perl.com");
See the details of the URI module here.
As recommended by Sobrique, you could also use:
use Data::Validate::URI qw(is_uri);
if (is_uri("http://www.perl.com")) {
...
}
See the details of the Data::Validate::URI module here.
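For the original goal of pulling URLs out of tweets (rather than validating one you already have), the pre-built URI patterns in the Regexp::Common module can stand in for the hand-rolled character class. A sketch, with tweets.txt and urls.txt as assumed file names:
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common qw(URI);

open my $in,  '<', 'tweets.txt' or die "tweets.txt: $!";
open my $out, '>', 'urls.txt'   or die "urls.txt: $!";

while (my $line = <$in>) {
    # $RE{URI}{HTTP} matches well-formed http URLs; -scheme widens it to https too.
    while ($line =~ /($RE{URI}{HTTP}{-scheme => 'https?'})/g) {
        print {$out} "$1\n";
    }
}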

Regex don't match if URI contains extension

I'm getting stuck on a bit of regex needed in an htaccess file on an old project I've taken on. I want to match the following URIs:
/page?id=12
/admin/users-view?id=3242
/subscribe
There may or may not be a query string, and there may or may not be multiple path segments.
I need to insert a .php extension before the query string, so the first example becomes
/page.php?id=12
I also must not match any URI with a file extension, so that images, JS, or CSS files do not get rewritten.
I came up with this:
^([/\w-]+)?/?
which does what I need apart from the last point. My regex skills are poor, so any help is appreciated.
Don't parse URIs with a regex; PHP has built-in functions for that:
http://php.net/manual/en/function.parse-url.php
Note that there is also a reverse function which builds a URL:
http://php.net/manual/en/function.http-build-url.php
You should use them instead of a regex because they will (or at least should) handle URL encoding correctly.
You might want to think about disassembling the URL with parse_url and putting it back together after manipulation.
However, for a pure regex solution, I think I would try to find a substring that starts at a slash (or at the beginning of the string), ends just before a question mark (or the end of the string), and does not contain periods:
$url = preg_replace('~(^|/)[^.?]*(?=[?]|$)~', '$0.php', $url);
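Checking the one-liner against the URIs from the question (plus an assumed /style.css standing in for a static asset):
foreach (['/page?id=12', '/admin/users-view?id=3242', '/subscribe', '/style.css'] as $uri) {
    echo preg_replace('~(^|/)[^.?]*(?=[?]|$)~', '$0.php', $uri), "\n";
}
// /page.php?id=12
// /admin/users-view.php?id=3242
// /subscribe.php
// /style.css  (left alone because of the period)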
The parse_url solution would rather look like:
$urlParts = parse_url($url);
if (!pathinfo($urlParts['path'], PATHINFO_EXTENSION))
    $urlParts['path'] .= '.php';
$url = $urlParts['path'] . (isset($urlParts['query']) ? '?' . $urlParts['query'] : '');
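For completeness, since the rewrite ultimately lives in .htaccess: an untested sketch of the same idea as a mod_rewrite rule (the file-exists guard is an assumption about the setup):
RewriteEngine On
# Skip requests that already carry a file extension (images, js, css, ...).
RewriteCond %{REQUEST_URI} !\.[a-zA-Z0-9]+$
# Only rewrite when the corresponding .php file actually exists.
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.php -f
RewriteRule ^(.*)$ $1.php [L,QSA]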

Remove the first comment from a collection of Java source files using Perl

I'm trying to remove the first C-style comment (only the first) from a collection of Java source files. At first I tried a multi-line sed, but that didn't work properly, so after some Googling it seemed Perl was the way to go. I used to like Perl; it was the first language I ever used to make a web program. But I've run into a wall trying to get this script to work:
#!/usr/bin/perl -i.bak
$s=join("",<>);
$s=~ s/("(\\\\|\\"|[^"])*")|(\/\*([^*]|\*(?=[^\/]))*\*\/)|(\/\/.*)/$1 /;
print $s;
I call it with the filename(s) of the files to be processed, e.g. ./com.pl test.java. According to everything on the Internet, -i (in-place edit) should redirect output from print statements to the file instead of printing to stdout. Now here's the thing: it doesn't. No matter what I try, I can't seem to get it to replace the file with the printed output. I've tried $^I too, but that doesn't work either.
I don't know if it's relevant but I'm on Ubuntu 11.04.
P.S. I'm aware of the pitfalls of regexing source code :)
Does the following not work from the command line?
$ perl -pi.bak -e 's|your_regex|here|' *.java
Inside a script
The script equivalent of the above is:
#!/usr/bin/perl -pi.bak
s|your_regex|here|;
The original post was missing the p flag, as pointed out by triplee in his comment.
See perldoc perlrun for more.
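One caveat worth adding: -p processes the input line by line, while the first comment in a Java file usually spans several lines. Slurping each file whole with -0777 lets a multi-line substitution work. A sketch with a deliberately simplified comment regex (unlike the pattern in the question, it does not guard against /* inside string literals):
$ perl -0777 -pi.bak -e 's{/\*.*?\*/}{}s' *.java
Without the /g flag, s/// replaces only the first match per file, which is exactly the "only the first comment" requirement.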

Removing Different URLs with Regex

I am looking to remove a ton of bad spam URL links from my forums using regex in either grep or vim, and subsequently using find/replace commands. I need a way to select just the bad URLs to do that.
All of the URLs are different and are preceded by \n________\n (that's 8 underscores).
Here is an example of one of the URLs:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
So basically I was trying to use the \n... and the [/URL] as boundaries to select that and everything in between. What I came up with is this:
[\\]n[_][_][_][_][_][_][_][_][\\]n.*\[\/URL\]]
Using that does not correctly close the search and selects pretty much everything. I am very new at this and appreciate any insight. Thanks.
Assuming GNU ERE, this should work:
\\n_{8}\\n\s\[URL=(.*)].*\[/URL]
RegexBuddy seems to agree with me.
That said,
> grep -E \\n_{8}\\n\s\[URL=(.*)].*\[/URL] test.txt
doesn't work on my system (Cygwin with GNU grep 2.6.3; test.txt contains the sample line from the question).
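One likely culprit is the shell rather than grep: left unquoted, the parentheses and backslashes are mangled (or outright rejected) by the shell before grep ever sees them. Quoting the pattern at least lets the command run; note too that the sample line has no whitespace after the second \n, so the \s may need to go. A guess, untested:
> grep -E '\\n_{8}\\n\[URL=(.*)].*\[/URL]' test.txt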
If you want to give sed a chance, the following will do the job:
sed 's/^.*\(\[URL.*\)$/\1/' file.txt
PS: You can run the same :s/^.*\(\[URL.*\)$/\1/ in your vi session as well.
OUTPUT
For the file.txt that contains:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
It produces:
[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
In Vim this should remove all lines that match the pattern:
:g/\\n_\{8}\\n\[URL=.\{-}\/URL\]/d
That pattern matches the sample text taken literally, all in one line.
I was actually able to do this in Microsoft Word using the following:
[\\]n_{8}[\\]n?*/URL\]
Thank you for all the input, couldn't have done it without the help!