PHP preg_replace is not matching entire pattern - regex

I'm stuck trying to get the PHP preg_replace to work properly. I want to find all matches of a pattern and replace them with a string. But, for some reason, it's finding only partial matches and replacing all of them. I'm trying to remove the "password" from every line of a text file. The password is always at the end of each line, contains 4 to 8 alpha-numeric characters, and always follows two pipe characters.
Example:
$data = 'A00000001|A00000001|FirstName|LastName|email#address|Role||password'.PHP_EOL;
$data .= 'B00000002|B00000002|FirstName|LastName|email#address|Role||password'.PHP_EOL;
$delim = '|';
$newData = preg_replace("/".$delim.$delim."[a-zA-Z0-9]{4,8}/", $delim.$delim, $data);
echo $newData;
Output:
||||||1|||||||||1|||||||||e||||||||||||e||m||a||i||l||#||a||d||d||r||e||s||s|||||R||o||l||e||||||||||||
||||||2|||||||||2|||||||||e||||||||||||e||m||a||i||l||#||a||d||d||r||e||s||s|||||R||o||l||e||||||||||||
||
I've tried many variations with different groupings using parentheses, and putting back-to-back [a-zA-Z0-9] patterns instead of {#}. I've tried adding line start ^ and end $ to my pattern. I'm stuck. I know this will end up being something simple that I'm just overlooking. That's why I need some fresh eyes on this.

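You should use this regex:
/(?<=\|\|)[a-zA-Z0-9]{4,8}$/m
You need to escape | since it represents OR in regex.
$ marks the end of the string; with the m modifier it also matches at the end of each line, which you need here because $data holds several lines.
(?<=\|\|) is a zero-width lookbehind: the two pipes must be there, but they are not part of the match.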

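Looks like you can just escape your delimiters. Note that the backslash itself has to be doubled inside a single-quoted PHP string, otherwise it escapes the closing quote:
$newData = preg_replace('/\\'.$delim.'\\'.$delim.'[a-zA-Z0-9]{4,8}/', $delim.$delim, $data);
preg_quote($delim, '/') does the same job without hand-written backslashes.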
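The same idea can be sketched outside PHP. The snippet below is a minimal Python illustration, not the asker's code: it uses re.escape as a stand-in for PHP's preg_quote and re.MULTILINE so that $ matches at the end of every line. The name strip_passwords is made up for the example, and the sample data assumes "\n" line endings.

```python
import re

data = (
    "A00000001|A00000001|FirstName|LastName|email#address|Role||password\n"
    "B00000002|B00000002|FirstName|LastName|email#address|Role||password\n"
)
delim = "|"

# re.escape turns "||" into "\|\|", so the pipes are matched literally
# instead of being read as regex alternation.
pattern = re.compile(re.escape(delim * 2) + r"[a-zA-Z0-9]{4,8}$", re.MULTILINE)


def strip_passwords(text):
    # Keep the two delimiters, drop the 4-8 character password after them.
    return pattern.sub(delim * 2, text)


print(strip_passwords(data), end="")
```

Without the escaping step the pattern would start with two empty alternatives, which match at every position; that is what produces the flood of pipes in the question's output.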

I'm trying to remove the "password" from every line of a text file.
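In this case, anchor the regex after properly escaping your delimiter. Assuming the last delim shouldn't be kept either, you could use:
preg_replace('/\|[^|\r\n]*$/m', '', $data);
If it should, use a look-behind or:
preg_replace('/\|[^|\r\n]*$/m', '|', $data);
Here [^|\r\n]* restricts the match to the last field of a line, and the m modifier lets $ match at the end of every line rather than only at the end of the string.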
On a separate note: this looks like an SQL dump or a CSV file. If so, consider using whichever variation of COPY ... DELIMITER ... your RDBMS offers instead:
http://www.postgresql.org/docs/current/static/sql-copy.html
You could then create a temporary table, import, drop the column, do whatever else you need to do, and populate the final tables as needed once you're done.

Related

How to substitute a string even if it contains regex metacharacters, using Shell or Perl?

I want to substitute a word that may contain regex metacharacters with another word; for example, substitute .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change below line
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted, so I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution disables the interpretation of regex metacharacters, so it does not have the side effect.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
Can anybody help with this? Many thanks.
Please note that the word I want to substitute is NOT hard-coded; it comes from an input file, so you can consider it a variable.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and @:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
sed 's/\.Precilla123/.Precill/g'
Escape the metacharacter with \.
Be careful: the metacharacters are not the same on both sides. In the search pattern they are mostly the regex ones, []{}()\.*+^$, whereas in the replacement only & and \ are special (plus the separator, which is whichever character follows the s).
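For comparison, here is a small Python sketch of the same "treat the word as a literal" idea. It is only an illustration in a different language from the thread: re.escape plays the role of Perl's \Q...\E, and replace_literal is a hypothetical helper name.

```python
import re


def replace_literal(text, word, replacement):
    # re.escape neutralises every regex metacharacter in `word`.
    # A function is passed as the replacement so that backslashes or
    # group references in `replacement` are taken literally as well.
    return re.sub(re.escape(word), lambda m: replacement, text)


# The leading dot no longer matches the opening quote.
print(replace_literal('"Precilla123";"aaaa aaa"', ".Precilla123", ".Precill"))
# Characters such as $ are handled like any other character.
print(replace_literal("pre$foo to .$foobar", ".$foo", ".to"))
```

Because the word arrives as ordinary string data, it never passes through the language's own variable interpolation, which is the point the answer above makes about Perl variables.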

Regex to find text between second and third slashes

I would like to capture the text that occurs after the second slash and before the third slash in a string. Example:
/ipaddress/databasename/
I need to capture only the database name. The database name might have letters, numbers, and underscores. Thanks.
How you access it depends on your language, but you'll basically just want a capture group for whatever falls between your second and third "/". Assuming your string is always in the same form as your example, this will be:
/.*/(.*)/
If multiple slashes can exist, but a slash can never exist in the database name, you'd want:
/.*/(.*?)/
/.*?/(.*?)/
In the event that your lines always have / at the end of the line:
([^/]*)/$
Alternate split method:
split("/")[2]
The regex would be:
/[^/]*/([^/]*)/
so in Perl, the regex capture statement would be something like:
($database) = $text =~ m!/[^/]*/([^/]*)/!;
Normally the / character is used to delimit regexes but since they're used as part of the match, another character can be used. Alternatively, the / character can be escaped:
($database) = $text =~ /\/[^\/]*\/([^\/]*)\//;
You can shorten the pattern even further by going this way:
[^/]+/(\w+)
Here \w includes characters like A-Z, a-z, 0-9 and _
I would suggest giving a split function priority, since in my experience it performs better than regex functions wherever it can be used.
You can use the explode function in PHP, or split in other languages, for this kind of operation.
Anyway, here is a regex pattern:
/[\/]*[^\/]+[\/]([^\/]+)/
I know you specifically asked for regex, but you don't really need regex for this. You simply need to split the string by delimiters (in this case a forward slash), then choose the part you need (in this case, the 3rd field - the first field is empty).
cut example:
cut -d '/' -f 3 <<< "$string"
awk example:
awk -F '/' '{print $3}' <<< "$string"
perl expression, using split function:
(split '/', $string)[2]
etc.
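The two approaches from this thread, a capture group and a plain split, can be put side by side. This is a minimal Python sketch; the function names are invented for the example and the input is assumed to start with a slash, as in the question.

```python
import re


def database_name_regex(path):
    # Skip "/<first segment>/", then capture everything up to the next slash.
    match = re.match(r"/[^/]*/([^/]*)/", path)
    return match.group(1) if match else None


def database_name_split(path):
    # "/a/b/".split("/") gives ["", "a", "b", ""], so index 2 is the target.
    parts = path.split("/")
    return parts[2] if len(parts) > 2 else None


print(database_name_regex("/ipaddress/databasename/"))
print(database_name_split("/ipaddress/databasename/"))
```

Using [^/]* instead of .* keeps each part of the pattern inside a single segment, so extra slashes later in the string do not change the result.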

Perl Regex: How to remove quotes inside quotes from CSV line

I've got a line from a CSV file, as a string, with " as the field encloser and , as the field separator. Sometimes there are " in the data that break the field enclosers. I'm looking for a regex to remove these ".
My string looks like this:
my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;
I've looked at this but I don't understand how to tell it to only remove quotes that are
1. not at the beginning of the string
2. not at the end of the string
3. not preceded by a ,
4. not followed by a ,
I managed to tell it to remove 3 and 4 at the same time with this line of code:
$csv =~ s/(?<!,)"(?!,)//g;
However, I cannot fit the ^ and $ in there since the lookahead and lookbehind both do not like being written as (?<!(^|,)).
Is there a way to achieve this only with a regex besides splitting the string up and removing the quote from each element?
For manipulating CSV data I'd recommend using Text::CSV. There's a lot of potential complexity within CSV data, and while it's possible to construct code to handle it yourself, it isn't worth the effort when there's a tried and tested CPAN module to do it for you.
Don't use a regex for parsing a CSV file; CPAN provides a lot of good modules. As nickifat suggests, use Text::CSV, or you can use Text::ParseWords like this:
use Text::ParseWords;
while (<DATA>) {
chomp;
my @f = quotewords ',', 0, $_;
print join "|" => @f;
}
__DATA__
"123456","024003","Stuff","",""28" stuff with more stuff","2"," 1.99 ","",""
Output:
123456|024003|Stuff||28 stuff with more stuff|2| 1.99 ||
This should work:
$csv =~ s/(?<=[^,])"(?=[^,])//g;
Conditions 1 and 2 imply that there must be at least one character before and after the quote, hence the positive lookarounds. Conditions 3 and 4 imply that these characters can be anything but a comma.
Thanks for the help here. I was having issues with badly formatted CSV with embedded double-quotes. I would make one slight addition to the lookahead portion of the regex otherwise null values at the end of the line will be corrupted:
(?<=[^,])\"(?=[^,\n])
Adding the \n will eliminate a match against the last double-quote at end-of-line.
the suggested
$csv =~ s/(?<=[^,])"(?=[^,])//g;
is probably the best answer. Without these advanced regex features, you could also do the same with
$csv =~ s/([^,])"([^,])/$1$2/g;
or
$csv = join (',', map {s/"//g;"\"$_\""} split (',', $csv));
I think you should be aware that your string is not well-formed CSV. In a CSV file, double quotes inside values must be doubled (http://en.wikipedia.org/wiki/Comma-separated_values). With your format, values cannot contain quotes next to commas.
CSV is not such a simple format. If you decide to use "real" CSV, you should use a module.
Otherwise, you should probably remove all the double quotes in order to simplify your code and make it clear that you are not doing CSV.
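To see what the lookaround substitution discussed in this thread actually does to the sample line, here is a small Python sketch of it (Python's re module supports the same lookbehind and lookahead syntax). The names STRAY_QUOTE and drop_stray_quotes are made up for the illustration, and, as noted above, a stray quote that sits directly next to a comma cannot be detected this way.

```python
import re

# A quote that has a non-comma character on both sides. Requiring a
# character on each side also excludes the first and last quote of the line.
STRAY_QUOTE = re.compile(r'(?<=[^,])"(?=[^,\n])')


def drop_stray_quotes(line):
    return STRAY_QUOTE.sub("", line)


csv_line = '"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""'
print(drop_stray_quotes(csv_line))
```

The empty fields ("") survive because their closing quote is always followed by a comma or by the end of the line.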

Matching everything except a specified regex

I have a huge file, and I want to blow away everything in the file except for what matches my regex. I know I can get matches and just extract those, but I want to keep my file and get rid of everything else.
Here's my regex:
"Id":\d+
How do I say "Match everything except "Id":\d+". Something along the lines of
!("Id":\d+) (pseudo regex) ?
I want to use it with a Regex Replace function. In English, I want to say:
Get all text that isn't "Id":\d+ and replace it with an empty string.
Try this:
string path = @"c:\temp.txt"; // your file here
string pattern = @".*?(Id:\d+\s?).*?|.+";
Regex rx = new Regex(pattern);
var lines = File.ReadAllLines(path);
using (var writer = File.CreateText(path))
{
foreach (string line in lines)
{
string result = rx.Replace(line, "$1");
if (result == "")
continue;
writer.WriteLine(result);
}
}
The pattern will preserve spaces between multiple Id:Number occurrences on the same line. If you only have one Id per line you can remove the \s? from the pattern. File.CreateText will open and overwrite your existing file. If a replacement results in an empty string it will be skipped over. Otherwise the result will be written to the file.
The first part of the pattern matches Id:Number occurrences. It includes an alternation for .+ to match lines where Id:Number does not appear. The replacement uses $1 to replace the match with the contents of the first group, which is the actual Id part: (Id:\d+\s?).
well, the opposite of \d is \D in perl-ish regexes. Does .net have something similar?
Sorry, but I totally don't get what your problem is. Shouldn't it be easy to grep the matches into a new file?
Yoo wrote:
Get all text that isn't "Id":\d+ and replace it with an empty string.
A logical equivalent would be:
Get all text that matches "Id":\d+ and place it in a new file. Replace the old file with the new one.
I haven't used .NET before, but the following works in Java:
System.out.println("abcd Id:12351abcdf".replaceAll(".*(Id:\\d+).*","$1"));
produces output
Id:12351
Strictly speaking, this doesn't match everything except Id:\d+, but it does the job.
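The "logical equivalent" mentioned above, collecting the matches instead of deleting the non-matches, is short to write. This is a hedged Python sketch rather than .NET code; keep_only_ids and the sample string are assumptions made for the example.

```python
import re

ID_PATTERN = re.compile(r'"Id":\d+')


def keep_only_ids(text):
    # Collect every match and throw the rest away, one match per line.
    return "\n".join(ID_PATTERN.findall(text))


sample = '{"Id":12,"Name":"a"},{"Id":345,"Name":"b"}'
print(keep_only_ids(sample))
```

The result can then be written back over the original file, which has the same effect as replacing everything that does not match with an empty string.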

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
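The same loop can be sketched in Python, as an illustration in a different language from the answer. re.finditer plays the role of the while loop with /g, and re.findall corresponds to the list assignment; the greedy variant is included only to show why the ? matters.

```python
import re

text = "text ###token1### text text ###token2### text text"

# Visit each delimited token in turn; group(0) includes the ### delimiters,
# group(1) is the text between them.
for match in re.finditer(r"###(.+?)###", text):
    print(match.group(0), match.group(1))

# Collect the captured parts in one go.
tokens = re.findall(r"###(.+?)###", text)

# Greedy version for contrast: it runs from the first ### to the last one.
greedy = re.findall(r"###(.+)###", text)
```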
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 (\1 for the first set, \2 for the second) in the replacement expression (assuming you're doing a search/replace in an editor). For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
Well, when you are using delimiters such as this, basically you just grab the first one, then anything that does not match the ending delimiter, followed by the ending delimiter. A special caution: in cases like the example above, [^#] would not work as a check that the end delimiter is not there, since a single # would cause the regex to fail (i.e. "###foo#bar###"). In the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
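A quick way to check this pattern is to run it against the tricky inputs. The Python sketch below is only an illustration: the group is made non-capturing, (?:...), so that findall returns whole matches, and TOKEN and find_tokens are names invented for the example.

```python
import re

# Between the delimiters, allow: a non-# character, a single # followed by a
# non-#, or two # followed by a non-#. Three # in a row end the token.
TOKEN = re.compile(r"###(?:[^#]|#[^#]|##[^#])*###")


def find_tokens(text):
    return TOKEN.findall(text)


print(find_tokens("text ###token1### text text ###token2### text text"))
print(find_tokens("###foo#bar###"))
```

With * the empty token ###### is accepted as well; switching to + rejects it, as the answer notes.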