I am working with text, and wish to replace a # character with an incrementally increasing number. The text looks like this:
Chap.2.#
Chap.2.#
Chap.2.#
Chap.2.#
And I am trying to get it to read:
Chap.2.1
Chap.2.2
Chap.2.3
Chap.2.4
and so on, up to triple digits.
My source document is Mellel, but I also have Nisus Writer Pro, and a host of other text editors such as TextMate, Atom, TextWrangler, Brackets, CotEditor, jEdit, etc. I have tried using Regular Expressions in apps that indicate availability of using that, but to no avail.
I have tried searching for #
then replacing with \i, or \1, or \1\i, or \1\i.
Can someone help, please? I've read many similar questions on SO and other sites, but I cannot seem to get the syntax correct (and the other examples are not close enough to mine to help me figure this out). Thanks.
One solution is to use perl:
perl -pe 'BEGIN { our $i = 1; } s/Chap\.2\.#/"Chap.2.".($i++)/ge;' <chapters.txt;
Extension to handle multiple top-level numbers, incrementing the second-level number from 1 for each top-level number independently:
perl -pe 'BEGIN { our $h = {}; } s/Chap\.(\d+)\.#/"Chap.$1.".(exists($h->{$1}) ? $h->{$1}++ : ($h->{$1} = 1)++)/ge;' <chapters.txt;
To only find and incrementally replace the # character you can do:
perl -pe 'BEGIN { our $i = 1; } s/#/$i++/ge;' <chapters.txt;
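For readers without Perl at hand, the same idea (a substitution whose replacement is computed per match) can be sketched in Python. This is only an illustration, not part of the original answer: the function name number_placeholders and the sample text are made up, and it assumes the placeholders always look like Chap.N.#.

```python
import itertools
import re
from collections import defaultdict


def number_placeholders(text):
    """Replace each 'Chap.N.#' with 'Chap.N.k', counting k from 1 for every N."""
    # One independent counter per top-level chapter number.
    counters = defaultdict(lambda: itertools.count(1))

    def repl(match):
        chapter = match.group(1)
        return f"Chap.{chapter}.{next(counters[chapter])}"

    # re.sub calls repl once per match, in order of appearance.
    return re.sub(r"Chap\.(\d+)\.#", repl, text)


sample = "Chap.2.#\nChap.2.#\nChap.3.#\nChap.2.#\n"
print(number_placeholders(sample), end="")
```

As in the second Perl one-liner, each top-level number gets its own sequence, so the Chap.3 line restarts at 1 while Chap.2 keeps counting.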
Here's a way to do this with PHP at the command line, using a file called convert.php that takes a filename as an argument. Note: no for-loop:
<?php
if ( count($argv) < 2 ) {
echo "Correct syntax is: convert.php filename.ext\n";
exit;
}
$file = $argv[1];
$contents = file_get_contents( $file );
$nu = preg_replace_callback('/\#/',function($matches){
static $i=1;
return $i++;
},$contents);
echo '<pre>',"\n";
echo $nu;
file_put_contents("bestNumberedChapters.txt",$nu);
So, it is feasible with PHP to replace all chapter headings containing a '#' with an incremental, numerical value.
Friends,
need some help with substitution regex.
I have a string
;;;;;;;;;;;;;
and I need to replace it by
;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;
I tried
s/;;/;\\N;/g
but it gives me
;\N;;\N;;\N;;\N;;\N;;\N;;
tried to fiddle with lookahead and lookbehind, but can't get it solved.
I wouldn't use a regex for this, and instead make use of split:
#!/usr/bin/env perl
use strict;
use warnings;
my $str = ';;;;;;;;;;;;;';
print join ( '\N', split ( //, $str ) );
This splits on the empty pattern to get each character, and relies on the fact that join puts the delimiter only between elements (so not before the first, and not after the last).
This gives:
;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;
Which I think matches your desired output?
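As a one-liner, this would be (the -l strips the trailing newline before the split and restores it on print, so no \N ends up in front of it):
perl -lne 'print join ( q{\N}, split // )'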
Note - we need single quotes ' rather than double around the \N so it doesn't get interpolated.
If you need to handle variable content (e.g. not just ; ) you can add grep or map into the mix - I'd need some sample data to give you a useful answer there though.
I use this for infile edit, the regexp suits me better
Following on from that - perl is quite clever. It allows you to do in place editing (if that's what you're referring to) without needing to stick with regular expressions.
Traditionally you might do
perl -i.bak -p -e 's/something/somethingelse/g' somefile
What this is doing is expanding that out into a loop:
LINE: while (defined($_ = <ARGV>)) {
s/something/somethingelse/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
I.e. what it's actually doing is:
opening the file
iterating it by lines
transforming the line
printing the new line
And with -i that print is redirected to the new file name.
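The open / iterate / transform / print cycle described above is not specific to Perl. As a rough sketch, Python's standard fileinput module offers the same behaviour, including the backup copy; the helper name edit_in_place and the transform argument are invented for this example.

```python
import fileinput


def edit_in_place(path, transform, backup=".bak"):
    """Rewrite the file at path line by line, keeping the original as path + backup."""
    with fileinput.input(files=[path], inplace=True, backup=backup) as lines:
        for line in lines:
            # With inplace=True, standard output is redirected into the file,
            # so every line must be printed, changed or not, or it is lost.
            print(transform(line), end="")
```

For example, edit_in_place("somefile", lambda line: line.replace("something", "somethingelse")) would behave roughly like the perl -i.bak -p command shown earlier.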
You don't have to restrict yourself to -p though - anything that generates output will work in this way - although bear in mind if it doesn't 'pass through' any lines that it doesn't modify (as a regular expression transform does) it'll lose data.
But you can definitely do:
perl -i.bak -lne 'print join ( q{\N}, split // )' somefile
And inplace edit - but it'll trip over on lines that aren't just ;;;;; as your example. (The -l strips the newline before the split and puts it back on print, so the newline doesn't get a \N in front of it.)
So to avoid those:
perl -i.bak -lne 'if (m/;;;;/) { print join ( q{\N}, split // ) } else { print }' somefile
Or perhaps more succinctly:
perl -i.bak -lpe '$_ = join ( q{\N}, split // ) if m/;;;/' somefile
Since the same character can't be matched twice, your approach doesn't work. To solve the problem, check for the following ; with a lookahead instead, so that the second ; isn't part of the match:
s/;(?=;)/;\\N/g
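The lookahead behaves the same way in other regex engines. A small Python check, added here purely as an illustration (insert_nulls is a made-up name):

```python
import re


def insert_nulls(line):
    """Put a literal \\N after every ';' that is directly followed by another ';'."""
    # (?=;) only asserts that a ';' follows; it is not consumed, so the next
    # ';' is still available as the start of the following match.
    return re.sub(r";(?=;)", lambda m: ";\\N", line)


print(insert_nulls(";" * 13))
```

Using a function as the replacement avoids having to think about how the backslash in \N is treated in a replacement template.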
I'm trying to replace 600 different strings in a very large text file (30 MB+). I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times, and there are well over 150,000 lines in the text file, this means there's a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
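The three steps above can be sketched in a language-neutral way. The Python below is only an illustration of the method, not the answer's own code: the dictionary holds a few of the sample pairs, replace_all is a made-up name, and sorting the keys longest-first is an extra precaution that the description above does not mention.

```python
import re

# Step 1: map each search string to its replacement.
replacements = {
    "something0": "somethingelse0",
    "something1": "somethingelse1",
    r"X:\Group_14\DACU": r"\\DACU$",
    ".*[^xyz]": "oO{xyz}",
}

# Step 2: join the escaped keys with | into a single alternation.
# Longest keys first, so a key that is a prefix of another cannot shadow it.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(replacements, key=len, reverse=True))
)


def replace_all(line):
    # Step 3: look up the matched text in the mapping for each match.
    return pattern.sub(lambda m: replacements[m.group(0)], line)


print(replace_all("something1 and something0"))
```

Each line is scanned once, however many search strings there are, and replaced text is not scanned again.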
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push @strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in PowerShell, but I do know how to solve it in Bash, and that is by using a tool called sed. Luckily, there is also sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere, then this command will do the trick for you:
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
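By default sed uses basic regular expressions, in which the grouping parentheses and the + have to be escaped with backslashes (GNU sed also accepts -E to keep the unescaped form). I'm not sure how the Windows port behaves, so if the first command complains you can try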
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the powershell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"
I'm having to replace FQDNs inside a SQL dump for website migration purposes. I've written a Perl filter that's supposed to take STDIN, find the serialized strings containing the domain name to be replaced, replace it with whatever argument is passed into the script, and output to STDOUT.
This is what I have so far:
my $search = $ARGV[0];
my $replace = $ARGV[1];
my $offset_s = length($search);
my $offset_r = length($replace);
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
while (<STDIN>) {
my @fs = split(';', $_);
foreach (@fs) {
chomp;
if (m#$regex#g) {
my ( $len, $extra, $str ) = ( $1, $2, $3 );
my $new_len = $len - $offset_s + $offset_r;
$str =~ eval { s/$search/$replace/ };
print 's:' . $new_len . ':' . $extra . $str . '\"'."\n";
}
}
}
The filter gets passed data that may look like this (this is taken from a WordPress dump, but we're also supposed to accommodate Drupal dumps):
INSERT INTO `wp_2_options` VALUES (1,'siteurl','http://to.be.replaced.com/wordpress/','yes'),(125,'dashboard_widget_options','
a:2:{
s:25:\"dashboard_recent_comments\";a:1:{
s:5:\"items\";i:5;
}
s:24:\"dashboard_incoming_links\";a:2:{
s:4:\"home\";s:31:\"http://to.be.replaced.com/wordpress\";
s:4:\"link\";s:107:\"http://blogsearch.google.com/blogsearch?scoring=d&partner=wordpress&q=link:http://to.be.replaced.com/wordpress/\";
}
}
','yes'),(148,'theme_175','
a:1:{
s:13:\"courses_image\";s:37:\"http://to.be.replaced.com/files/image.png\";
}
','yes')
The regex works if I don't have any periods in my $search. I've tried escaping the periods, i.e. domain\.to\.be\.replaced, but that didn't work. I'm probably doing this either in a very roundabout way or missing something obvious. Any help would be greatly appreciated.
There is no need to eval your regular expression just because it includes variables. Also, to keep metacharacters in variables like $search from having their special meaning, escape them with the quotemeta() function or by wrapping the variable in \Q and \E inside the regexp. So instead of:
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
Use:
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)(\Q$search\E.*)\\\"};
or
my $quoted_search = quotemeta $search;
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)($quoted_search.*)\\\"};
And the same advice for this line:
$str =~ eval { s/$search/$replace/ };
You have to double the escape character \ in your $search variable so that the interpolated string contains the escaped periods.
i.e. domain\.to\.be\.replaced -> domain.to.be.replaced (not wanted)
while domain\\.to\\.be\\.replaced -> domain\.to\.be\.replaced (correct).
I'm not sure your Perl regex would handle a serialized string that contains the old DNS name several times.
I made a gist with a script using bash, sed and one big Perl regex for this same problem. You may give it a try.
The regex I use is something like this (split across lines for readability, and with -7 as the known difference between the domain name lengths):
perl -n -p -i -e '1 while s#
([;|{]s:)
([0-9]+)
:\\"
(((?!\\";).)*?)
(domain\.to\.be\.replaced)
(.*?)
\\";#"$1".($2-7).":\\\"$3new.domain.tld$6\\\";"#ge;' file
It may not be the best one, but at least it seems to do the job. The g option handles lines containing several serialized strings to clean up, and the while loop repeats the whole job until no replacement occurs (for serialized strings containing several occurrences of the DNS name). I'm not enough of a regex fan to try a recursive one.
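The underlying task (swap the domain inside each s:N:\"...\"; string and correct N) can also be sketched without one large regex. The Python below is an illustration under stated assumptions, not anyone's posted solution: the function name and new.example.org are hypothetical, the quotes are assumed to be backslash-escaped as in the dump above, and a string body is assumed not to contain \"; itself or span several lines.

```python
import re

# s:<declared length>:\"<body>\";  with backslash-escaped quotes, as in the dump.
SERIALIZED = re.compile(r's:(\d+):\\"(.*?)\\";')


def replace_in_serialized(dump, old, new):
    """Replace old with new inside serialized strings and adjust their lengths."""

    def repl(m):
        body = m.group(2)
        if old not in body:
            return m.group(0)  # leave unrelated strings untouched
        # A plain substring replace: no regex metacharacters to escape,
        # and every occurrence in the body is handled in one go.
        new_body = body.replace(old, new)
        # PHP stores a byte count, so adjust the declared length by the
        # change in encoded size rather than recounting the escaped text.
        delta = len(new_body.encode("utf-8")) - len(body.encode("utf-8"))
        return f's:{int(m.group(1)) + delta}:\\"{new_body}\\";'

    return SERIALIZED.sub(repl, dump)
```

Adjusting the existing length by the size difference, rather than measuring the new body, keeps the count right even when the body contains other SQL escapes that take more characters in the dump than in the stored value.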
I've been able to find similar, but not identical questions to this one. How do I match one regex pattern multiple times in the same line delimited by unknown characters?
For example, say I want to match the pattern HEY. I'd want to recognize all of the following:
HEY
HEY HEY
HEYxjfkdsjfkajHEY
So I'd count 5 HEYs there. So here's my program, which works for everything but the last one:
open ( FH, $ARGV[0]);
while(<FH>)
{
foreach $w ( split )
{
if ($w =~ m/HEY/g)
{
$count++;
}
}
}
So my question is how do I replace that foreach loop so that I can recognize patterns delimited by weird characters in unknown configurations (like shown in the example above)?
EDIT:
Thanks for the great responses thus far. I just realized I need one other thing though, which I put in a comment below.
One question though: is there any way to save the matched term as well? So like in my case, is there any way to reference $w (say if the regex was more complicated, and I wanted to store it in a hash with the number of occurrences)
So if I was matching a real regex (say a sequence of alphanumeric characters) and wanted to save that in a hash.
One way is to capture all matches of the string and see how many you got. Like so:
open (FH, $ARGV[0]);
while(my $w = <FH>) {
my @matches = $w =~ m/(HEY)/g;
my $count = scalar(@matches);
print "$count\t$w\n";
}
EDIT:
Yes, there is! Just loop over all the matches, and use the capture variables to increment the count in a hash:
my %hash;
open (FH, $ARGV[0]);
while (my $w = <FH>) {
foreach ($w =~ /(HEY)/g) {
$hash{$1}++;
}
}
The problem is you really don't want to call split(). It splits things into words, and you'll note that your last line only has a single "word" (though you won't find it in the dictionary). A word is bounded by white-space and thus is just "everything but whitespace".
What you really want is to keep looking through each line, counting every HEY and starting where you left off each time. That requires the /g modifier at the end of the match, used in a while loop so that it keeps looking:
while(<>)
{
while (/HEY/g)
{
$count++;
}
}
print "$count\n";
There is, of course, more than one way to do it but this sticks close to your example. Other people will post other wonderful examples too. Learn from them all!
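For comparison, here is a short Python sketch of the same two tasks: counting every occurrence regardless of what separates them, and tallying each distinct matched string. It is only an illustration; count_matches is a made-up helper and the sample text is the one from the question.

```python
import re
from collections import Counter

text = "HEY\nHEY HEY\nHEYxjfkdsjfkajHEY\n"

# findall scans the whole string and resumes after each match,
# so whitespace or other separators do not matter.
total = len(re.findall(r"HEY", text))
print(total)


def count_matches(pattern, text):
    """Tally how often each distinct matched string occurs."""
    return Counter(m.group(0) for m in re.finditer(pattern, text))


print(count_matches(r"[A-Za-z0-9]+", "HEY you HEY YOU"))
```

The Counter plays the role of the hash keyed by the matched term.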
None of the above answers worked for my similar problem. $1 does not seem to change (perl 5.16.3) so $hash{$1}++ will just count the first match n times.
To get each match, the foreach needs a local variable assigned, which will then contain the match variable. Here's a little script that will match and print each occurrence of (number).
#!/usr/bin/perl -w
use strict;
use warnings FATAL=>'all';
my (%procs);
while (<>) {
foreach my $proc ($_ =~ m/\((\d+)\)/g) {
$procs{$proc}++;
}
}
print join("\n",keys %procs) . "\n";
I'm using it like this:
pstree -p | perl extract_numbers.pl | xargs -n 1 echo
(except with some relevant filters in that pipeline). Any pattern capture ought to work as well.
I'm trying to write a bash script that would modify all occurrences of a certain string in a file.
I have a file with a bunch of text, in which URLs occur. All URLs are in the following format: http://goo.gl/abc23 (that's goo.gl/, followed by 4 OR 5 alphanumeric characters).
What I'd like to do is append a string to all URLs. I managed (with the help of user Dan Fego) to get this done with sed, but it only works by appending a static string.
What I'm looking for is a way to append a different string to each occurrence. Let's say I have a function generatestring that echoes a different string every time. I'd like to append a different generated string to each url. http://goo.gl/abc23 would become http://goo.gl/abc23?GeneratedString1, http://goo.gl/JB007 would become http://goo.gl/JB007?GeneratedString2 and so on.
Does anyone know if this can be done? I've been told that perl is the way to go, but I have zero experience with perl. That's why I'm asking here.
Thanks in advance for any help.
ETA: Assuming the URLs are embedded in other text:
$ perl -lnwe 's#http://goo.gl/\w{5}\K\b# "?" . rand(100) #ge; print' googl.txt
For example:
$ cat googl
random text here, and perhaps some html <a href="http://goo.gl/abc23">
more stuff http://goo.gl/abc23 foo fake link http://foo.bar/abc12
longer http://goo.gl/abc23123123 foo fake link http://foo.bar/abc12
$ perl -lnwe 's#http://goo.gl/\w{5}\K\b# "?" . rand(100) #ge; print' googl
random text here, and perhaps some html <a href="http://goo.gl/abc23?69.998515">
more stuff http://goo.gl/abc23?26.186867532985 foo fake link http://foo.bar/abc12
longer http://goo.gl/abc23123123 foo fake link http://foo.bar/abc12
-l chomps the file and adds newline to print. -n adds a while(<>) loop around the script, which basically means it reads either from argument file names or from STDIN. \K means "keep the matching text", \b is word boundary, so that you do not match partial strings.
Do note that it will still match http://goo.gl/abc12/foo, but since I do not know what your data looks like, you will have to determine what boundaries are acceptable.
Of course, rand(100) is just there as a placeholder for whatever function you intend to use.
If you needed the script version, here's the deparsed code:
use strict;
use warnings;
BEGIN { $/ = "\n"; $\ = "\n"; }
while (<>) {
chomp;
s[http://goo.gl/\w{5}\K\b]['?' . rand(100);]eg;
print;
}
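The same approach (a substitution whose replacement is computed for every match) looks like this as a Python sketch. It is an illustration only: tag_urls and its generate parameter are made-up names, the default generator simply numbers the strings, and the pattern assumes 4 or 5 ASCII alphanumerics as described in the question.

```python
import itertools
import re

# 4 or 5 alphanumerics, then a word boundary so longer codes are not matched.
URL = re.compile(r"http://goo\.gl/[0-9A-Za-z]{4,5}\b")


def tag_urls(text, generate=None):
    """Append '?' plus a freshly generated string to every matching URL."""
    if generate is None:
        counter = itertools.count(1)

        def generate():
            # Placeholder generator; swap in whatever function you need.
            return f"GeneratedString{next(counter)}"

    return URL.sub(lambda m: f"{m.group(0)}?{generate()}", text)


print(tag_urls("see http://goo.gl/abc23 and http://goo.gl/JB007"))
```

Because generate is called once per match, each URL gets a different suffix even when several URLs share a line.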
If the URLs aren't alone in each line, you can do:
#!/usr/bin/perl
use strict;
use warnings;
sub generate {
my $i = shift;
return "GeneratedString$i";
}
my $i = 1;
while(my $line = <>) {
$line =~ s~(http://\S+)~$1 . "?" . &generate($i++)~eg;
print $line;
}
usage:
test.pl file_to__modify
output:
http://goo.gl/abc23?GeneratedString1
http://goo.gl/JB007?GeneratedString2
You can do it in a lot of languages, but in Perl it's pretty straightforward:
#!/usr/bin/perl
use strict;
use constant MAX_RANDOM_STRING_LENGTH => 5;
my $regex_url = '(http://goo.gl/\w{5})';
my @alphanumeric = ("A".."Z", "0".."9");
my $random_cap = $#alphanumeric + 1;
sub generate_string
{
my $string = "?";
for (my $i = 0; $i < MAX_RANDOM_STRING_LENGTH; $i++)
{
$string .= $alphanumeric[int(rand($random_cap))];
}
return $string;
}
my @input = <>;
for(@input)
{
my $cur = $_;
while ($cur =~ /$regex_url/)
{
$cur = $';
my $new_url = $1 . generate_string();
s/$1/$new_url/g;
}
}
print(@input);
Usage:
script_name.pl < input.txt > output.txt
This might work for you:
gs(){ echo $(tr -cd '[:alnum:]' </dev/urandom | head -c5); }
export -f gs
cat <<\! >file
> http://goo.gl/abc23
> http://goo.gl/JB007
> bunch of text http://goo.gl/qwert another bunch of text
> another bot http://goo.gl/qwert another bot http://goo.gl/qaza
!
sed '\|http://goo\.gl/[0-9a-zA-Z]\{4,5\}\>|{s//&?'\''$(gs)'\''/g;s/^/echo '\''/;s/$/'\''/}' file |
sh
http://goo.gl/abc23?0Az23
http://goo.gl/JB007?ugczB
bunch of text http://goo.gl/qwert?LDW27 another bunch of text
another bot http://goo.gl/qwert?U9my2 another bot http://goo.gl/qaza?Ybtlp