Substitute first string match in 1-line 2GB file on Linux - regex

I'm trying to substitute only the first match of one string in a huge single-line file (2.1 GB); the substitution will run as part of a shell script job. The big problem is that the machine that will run this script has only 1 GB of memory (approximately
300 MB free), so I need a buffered strategy that doesn't overflow my memory. I already tried sed, perl, and a python approach, but all of them returned out-of-memory errors.
Here are my attempts (discovered in other questions):
# With perl
perl -pi -e '!$x && s/FROM_STRING/TO_STRING/ && ($x=1)' file.txt
# With sed
sed '0,/FROM_STRING/s//TO_STRING/' file.txt > file.txt.bak
# With python (in a custom script.py file)
import fileinput
for line in fileinput.input('file.txt', inplace=True):
    print line.replace(FROM_STRING, TO_STRING, 1)
    break
One good point is that the FROM_STRING I'm searching for is always at the beginning of this huge one-line file, within the first 100 characters. Another good thing is that execution time is not a problem; it can take as long as it needs.
EDIT (SOLUTION):
I tested three of the answers' solutions and all of them solved the problem, thanks to all of you. I measured the performance with Linux time and they all take about the same time as well, up to approximately 10 seconds... But I chose @Miller's solution because it's simpler (it just uses perl).

Since you know that your string is always in the first chunk of the file, you should use dd for this.
You'll also need a temporary file to work with, as in tmpfile="$(mktemp)"
First, copy the first block of the file to a new, temporary location:
dd bs=32k count=1 if=file.txt of="$tmpfile"
Then, do your substitution on that block:
sed -i 's/FROM_STRING/TO_STRING/' "$tmpfile"
Next, concatenate the new first block with the rest of the old file, again using dd:
dd bs=32k if=file.txt of="$tmpfile" seek=1 skip=1
EDIT: As per Mark Setchell's suggestion, I have added a specification of bs=32k to these commands to speed up the dd operations. This is tunable to your needs, but if you tune the commands separately, you may need to be careful about the changes in semantics between different input and output block sizes.
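Put together, a minimal sketch of the whole procedure might look like the following. Note the final mv is an assumption about how you would put the result back in place, and the approach relies on the substitution not changing the length of the first block (otherwise the seek=1 boundary would no longer line up):
#!/bin/sh
# Sketch only: patch the first 32 KiB of a huge one-line file without
# loading the whole file into memory. Assumes FROM_STRING and TO_STRING
# have the same length, so the 32 KiB block boundary stays aligned.
tmpfile="$(mktemp)"
dd bs=32k count=1 if=file.txt of="$tmpfile"         # 1. copy the first block
sed -i 's/FROM_STRING/TO_STRING/' "$tmpfile"        # 2. substitute inside it
dd bs=32k if=file.txt of="$tmpfile" seek=1 skip=1   # 3. append the rest
mv "$tmpfile" file.txt                              # 4. replace the original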

If you're certain the string you're trying to replace is just in the first 100 characters, then the following perl one-liner should work:
perl -i -pe 'BEGIN {$/ = \1024} s/FROM_STRING/TO_STRING/ .. undef' file.txt
Explanation:
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on the command line.
Code:
BEGIN {$/ = \1024}: Set the $INPUT_RECORD_SEPARATOR to the number of characters to read for each “line”
s/FROM/TO/ .. undef: Use a flip-flop to perform the regex only once. Could also have used if $. == 1.
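As noted, the flip-flop can be swapped for a test on the record counter; a minimal equivalent sketch ($. counts 1024-byte records here, since $/ is set to a reference to an integer):
perl -i -pe 'BEGIN {$/ = \1024} s/FROM_STRING/TO_STRING/ if $. == 1' file.txt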

Given that the string to replace is in the first 100 bytes,
Given that Perl IO is slow unless you start using sysread to read large blocks,
Assuming that the substitution changes the size of the file[1], and
Assuming that binmode isn't needed[2],
I'd use
( head -c 100 | perl -0777pe's/.../.../' && cat ) <file.old >file.new
[1] A faster solution exists for the case where the substitution doesn't change the size of the file.
[2] Though it's easy to add if needed.
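For completeness, a hedged sketch of the faster approach footnote [1] alludes to, assuming FROM_STRING and TO_STRING have exactly the same length so the file can be patched in place:
perl -e '
    # Sketch only: rewrite the first 100 bytes of file.txt in place.
    open my $fh, "+<", "file.txt" or die $!;
    read $fh, my $buf, 100;
    if ($buf =~ s/FROM_STRING/TO_STRING/) {
        seek $fh, 0, 0;    # rewind to the start
        print $fh $buf;    # overwrite the patched prefix
    }
    close $fh or die $!;
'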

Untested, but I would do:
perl -pi -we 'BEGIN{$/=\65536} s/FROM_STRING/TO_STRING/ if 1..1' file.txt
to read in 64k chunks.
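Since it's marked untested, here's a hedged sanity check on a tiny synthetic file; the expected output is xxTO_STRINGyyFROM_STRINGzz, i.e. only the first match gets replaced:
printf 'xxFROM_STRINGyyFROM_STRINGzz' > test.txt
perl -pi -we 'BEGIN{$/=\65536} s/FROM_STRING/TO_STRING/ if 1..1' test.txt
cat test.txt; echo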

A practical (not very compact, but efficient) approach would be to split the file, do the search-and-replace, and join the pieces, e.g.:
head -c 100 myfile | sed 's/FROM/TO/' > output.1
tail -c +101 myfile > output.2
cat output.1 output.2 > output && /bin/rm output.1 output.2
Or, in one line:
( ( head -c 100 myfile | sed 's/FROM/TO/' ) && (tail -c +101 myfile ) ) > output
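And a hedged wrapper that puts the result back in place (the trailing mv is an assumption about how you would apply it; keep a backup if the data matters):
( head -c 100 myfile | sed 's/FROM/TO/' && tail -c +101 myfile ) > output && mv output myfile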


Add double quotes to the first line of a csv file via command line

I have this csv file and I have noticed that, during the export, the starting quote wasn't added. In fact, on Ubuntu, if I type:
head -n 1 file.csv
I get this output:
801","40116","Hazelnut MT -L","Thursday Promo","Large","","5.9000","","801","1.0000","","3.6500","2.2500",".0000","default","","","","","Chatime","02/06/2014","09125a9cfffd4143a00e73e3b62f15f2","CB01","",".0000","5.9000","6.9000",".0000",".0000",".0000",".0000",".0000",".0000","0","","0","0","0","","","","","","","","","Modern Milk Tea","","","0","","","1","0","","","","","","","","0","Hau Chan","","","","","","","","","","0","","","","","","","-1","","","","","","","","","","","","0","00000000420714AA","2014-06-02","1900-01-01","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""
Is there some command that can help me add the missing starting quote?
This should work in every POSIX shell:
printf \" | cat - file.csv > repaired-file.csv
If you are happy with the result you can overwrite the original
mv repaired-file.csv file.csv
Since your file is 70GB big you might want to avoid creating a second file; however, this is harder than it seems. Sure, there are things like sed's in-place option (-i) and the sponge utility from moreutils, but they do not work as in-place as you might expect: sed -i and sponge both use temporary files or hold the whole file in memory (which no longer works at 70GB). Great research on true in-place editing can be found in this blog post. The conclusion: there are no standard tools supporting true in-place editing. But the following perl one-liner should work (already adapted to your needs).
perl <<'EOF'
use Tie::File;
my @a;
tie @a, 'Tie::File', 'path/to/your/file' or die 'Cannot tie file';
$a[0] = '"' . $a[0];
EOF
Benchmarks
Out of interest I ran the commands discussed here and measured their running times.
The 9.3 GiB input file f was generated using seq 1000000000 > f. Before timing a single command I always re-generated f and emptied the system cache using sync && echo 3 | sudo tee /proc/sys/vm/drop_caches. My system had enough memory to hold the whole file, but I monitored memory usage manually – all commands only used a few KB of memory.
printf \" | cat - f > f2; mv f2 f     1m 05s
perl …  (the script from above)       1m 32s
sed -i '1s/^/"/' f                   25m 57s  (also used 100% CPU the whole time)
I'm a bit surprised myself that the cat command was faster than the perl script. However, it makes sense since the perl script does a lot of seeks (can be seen using strace) whereas cat just copies.
Summary: Use the cat command if you have enough disk space left. If the file is bigger than the remaining free disk space on your system then use the perl script.

What is the fastest way to remove a number from the beginning of so many files?

I have 1000 files each having one million lines. Each line has the following form:
a number,a text
I want to remove all of the numbers from the beginning of every line of every file, including the comma.
Example:
14671823,aboasdyflj -> aboasdyflj
What I'm doing is:
os.system("sed -i -- 's/^.*,//g' data/*")
and it works fine but it's taking a huge amount of time.
What is the fastest way to do this?
I'm coding in python.
This is much faster:
cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt
On a file with 11 million rows it took less than one second.
To use this on several files in a directory, use:
TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done
A thing worth mentioning is that it often takes much longer to do stuff in place than to use a separate file. I tried your sed command but switched from in-place editing to a temporary file: total time went down from 26s to 9s.
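For reference, a minimal sketch of that temp-file variant of the asker's sed command (same expression, just without -i):
for file in data/*; do
    sed 's/^.*,//g' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done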
I would use GNU awk (to leverage its -i inplace file editing) with , as the field separator, and no expensive regex manipulation:
awk -F, -i inplace '{print $2}' file.txt
For example, if the filenames have a common prefix like file, you can use shell globbing:
awk -F, -i inplace '{print $2}' file*
awk will treat each file as different argument while applying the in-place modifications.
As a side note, you could simply run the shell command in the shell directly instead of wrapping it in os.system(), which is insecure and discouraged in favor of subprocess anyway.
Here's a probably pretty fast, native-Python approach, with reduced looping and using csv.reader & csv.writer, which are compiled in most implementations:
import csv, os, glob

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w", newline="") as fw:
        # keep everything after the first field (writerows expects rows, i.e. lists)
        csv.writer(fw).writerows(x[1:] for x in csv.reader(fr))
    os.remove(f1)
    os.rename(f2, f1)  # move the new file back over the old one
maybe the writerows part could be even faster by using map & operator.itemgetter to remove the inner loop (itemgetter needs a slice here so that each row stays a list, and this requires import operator):
csv.writer(fw).writerows(map(operator.itemgetter(slice(1, None)), csv.reader(fr)))
Also:
it's portable to all systems, including Windows without MSYS installed
it stops with an exception if something goes wrong, avoiding destroying the input
the temporary file is created in the same filesystem on purpose, so deleting + renaming is super fast (as opposed to moving the temp file to the input across filesystems, which would require shutil.move and would copy the data)
You can take advantage of your multicore system, along with the tips of other users on handling a specific file faster.
import multiprocessing
import os
import queue

FILES = ['a', 'b', 'c', 'd']
CORES = 4

q = multiprocessing.Queue(len(FILES))
for f in FILES:
    q.put(f)

def handler(q, i):
    while True:
        try:
            f = q.get(block=False)
        except queue.Empty:
            return
        # each worker uses its own temp file (tmp0, tmp1, ...)
        os.system("cut -f2 -d ',' {f} > tmp{i} && mv tmp{i} {f}".format(**locals()))

processes = [multiprocessing.Process(target=handler, args=(q, i)) for i in range(CORES)]
[p.start() for p in processes]
[p.join() for p in processes]
print("Done!")
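A hedged shell-only alternative to the worker pool above, assuming GNU xargs is available (-P runs up to 4 jobs at once, one per file; the data/* glob is a placeholder):
printf '%s\0' data/* | xargs -0 -P 4 -I{} sh -c 'cut -f2 -d, "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {}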

Sed replacing only part of a longer match with a shorter replacement

So I'm measuring the total elapsed time of a C program. To do so, I have been running a shell script that uses sed to replace the value of a constant (below: N) defined somewhere in the middle of a line in my C program.
#define N 10 // This constant will be incremented by shell program
Before you tell me that I should be using a variable and timing the function that uses it, I have to time the whole execution of the program externally on a single run (meaning no reassignment of N).
I've been using the following in a shell script to help out:
tmp=$(sed "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprogram.c); printf "%s" "$tmp" > myprogram.c
That replaces a three-digit number with whatever my INCREMENTINGVAR (the replacement) is. However, this doesn't seem to work properly when the replacement is two digits long: sed replaces only the first two digits and leaves the third digit from the previous run in place, without deleting it.
TESTS=0
while [ $TESTS -lt 3 ]
do
    echo "This is test: $TESTS"
    INCREMENTINGVAR=10
    while [ "$INCREMENTINGVAR" -lt 10 ]
    do
        tmp=$(sed "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprogram.c); printf "%s" "$tmp" > myprogram.c
        rm -f myprog.c.bak
        echo "$INCREMENTINGVAR"
        gcc myprogram.c -o myprogram.out; ./myprogram.out
        INCREMENTINGVAR=$((INCREMENTINGVAR+5))
    done
    TESTS=$((TESTS+1))
done
Is there something I should do instead?
edit: Added whole shell script; Changed pattern for sed.
Do you simply want to replace whatever digit string is on line 11 with the new value? If so, you'd write:
sed -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/"
That looks for a sequence of one or more digits, and replaces it by the current value in $INCREMENTINGVAR. This will rollover from 9 to 10, and from 99 to 100, and from 999 to 1000, etc. Indeed, there's nothing to stop you jumping from 1 to 987,654 if that's what you want to do.
With the GNU and BSD (Mac OS X) versions of sed, you could overwrite the file automatically. The portable way (meaning, works the same with both GNU and BSD variants of sed), is:
sed -i.bak -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
This creates a backup file (and removes it). The problem is that GNU sed requires just -i and BSD sed requires -i '' (two arguments) to do an in situ change without a backup. You can decide that portability is not relevant.
Note that using line number to identify what must be changed is delicate; trivial changes (a new header, more commentary) could change the line number. It would probably be better to use a context search:
sed -i.bak -e "/^#define N [0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
This assumes spaces between define and N and the number. If you might have blanks or tabs in it, then you might write:
sed -i.bak -e "/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
That looks for optional leading white space before the #, optional white space between the # and the define, mandatory white space (at least one, possibly many) between define and N, and mandatory white space again between N and the first digit of the number. But probably your input isn't that sloppy and a simpler search pattern (like the first option) is sufficient to meet your needs. You could also write code to normalize eccentrically formatted #define lines into a canonical representation — but again, you most probably don't need to.
If you have somewhere else in the same file that contains something like this:
#undef N
#define N 100000
you would have to worry about the pattern matching this line too. However, few files do that; it isn't likely to be a problem in practice (and if it is, the code in general probably has more problems than can be dealt with here). One possibility would be to limit the range to the first 30 lines, assuming the first #define N 123 is somewhere in that range and the second is not.
sed -i.bak -e "1,30 { /^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/; }" myprog.c
rm -f myprog.c.bak
There are multiple other tricks that could be pulled to limit the damage, with varying degrees of verbosity. For example:
sed -i.bak -e "1,/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]\{1,\}/ { \
    s/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]\{1,\}/#define N $INCREMENTINGVAR/; }" myprog.c
rm -f myprog.c.bak
Working with regexes is generally a judgement call between specificity and verbosity — you can make things incredibly safe but incredibly difficult to read, or you can run a small risk that your more readable code will match something unintended.

How to use sed in a shell script to replace a string in a txt file in one directory with values from another file?

I am very new to shell scripting and trying to learn the "sed" command functionality.
I have a file called configurations.txt with some variables defined in it, each initialised to a string value.
I am trying to replace strings in another file, values.txt, which lives in a different directory, with the values of those variables.
Data present in configurations.txt:-
mem="cpu.memory=4G"
proc="cpu.processor=Intel"
Data present in the values.txt (present in /home/cpu/script):-
cpu.memory=1G
cpu.processor=Dell
I am trying to make a shell script called repl.sh and I don't have a lot of code in it for now, but here is what I've got:
#!/bin/bash
source /home/configurations.txt
sed <need some help here>
The expected output, after an appropriate regex is applied and I run sh repl.sh, is that values.txt contains the following data:
cpu.memory=4G
cpu.processor=Intel
(originally 1G and Dell).
Would highly appreciate some quick help. Thanks
This question lacks any attempt at a general routine and reads like "help me do something concrete, please", so it's very unlikely that anyone will provide a full solution to the problem.
What you should do is try to split this task into a number of small pieces.
1) Iterate over configurations.txt and get the values from each line. To do that you need to get X and Y from a value="X=Y" string.
This regex could be helpful here: ([^=]+)=\"([^=]+)=([^=]+)\". It contains three capture groups: the variable name, the key, and the value. For example,
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\1/' configurations.txt
mem
proc
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\2/' configurations.txt
cpu.memory
cpu.processor
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\3/' configurations.txt
4G
Intel
2) For each X and Y, find X=Z in values.txt and substitute it with X=Y.
For example, let's change cpu.memory value in values.txt with 4G:
>> X=cpu.memory; Y=4G; sed -r "s/(${X}=).*/\1${Y}/" values.txt
cpu.memory=4G
cpu.processor=Dell
Use -i flag to do changes in place.
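Putting both steps together, a minimal sketch using the paths from the question (this assumes every line of configurations.txt matches the var="key=value" pattern):
#!/bin/bash
# Sketch: apply each key=value pair from configurations.txt to values.txt in place.
while IFS= read -r line; do
    key=$(printf '%s\n' "$line" | sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\2/')
    val=$(printf '%s\n' "$line" | sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\3/')
    sed -r -i "s/(${key}=).*/\1${val}/" /home/cpu/script/values.txt
done < /home/configurations.txt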
Here is an awk-based answer:
$ cat config.txt
cpu.memory=4G
cpu.processor=Intel
$ cat values.txt
cpu.memory=1G
cpu.processor=Dell
cpu.speed=4GHz
$ awk -F= 'FNR==NR{a[$1]=$2; next;}; {if($1 in a){$2=a[$1]}}1' OFS== config.txt values.txt
cpu.memory=4G
cpu.processor=Intel
cpu.speed=4GHz
Explanation: first read config.txt and save it in memory. Then read values.txt; if a particular key was defined in config.txt, use the saved value from memory (config.txt).
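awk has no portable in-place flag (GNU awk's -i inplace aside), so a hedged way to persist the result is the usual temp-file dance:
awk -F= 'FNR==NR{a[$1]=$2; next;}; {if($1 in a){$2=a[$1]}}1' OFS== config.txt values.txt > values.tmp && mv values.tmp values.txt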

What Vim command to use to delete all text after a certain character on every line of a file?

Scenario:
I have a text file that has pipe (as in the | character) delimited data.
Each field of data in the pipe delimited fields can be of variable length, so counting characters won't work (or using some sort of substring function... if that even exists in Vim).
Is it possible, using Vim to delete all data from the second pipe to the end of the line for the entire file? There are approx 150,000 lines, so doing this manually would only be appealing to a masochist...
For example, change the following lines from:
1111|random sized text 12345|more random data la la la|1111|abcde
2222|random sized text abcdefghijk|la la la la|2222|defgh
3333|random sized text|more random data|33333|ijklmnop
to:
1111|random sized text 12345
2222|random sized text abcdefghijk
3333|random sized text
I'm sure this can be done somehow... I hope.
UPDATE: I should have mentioned that I'm running this on Windows XP, so I don't have access to some of the mentioned *nix commands (cut is not recognized on Windows).
:%s/^\v([^|]+\|[^|]+)\|.*$/\1/
You can also record a macro:
qq02f|Djq
and then you will be able to play it with 100@q to run the macro on the next 100 lines.
Macro explanation:
qq: starts macro recording;
0: goes to the first character of the line;
2f|: finds the second occurrence of the | character on the line;
D: deletes the text after the current position to the end of the line;
j: goes to the next line;
q: ends macro recording.
If you don't have to use Vim, another alternative would be the unix cut command:
cut -d '|' -f 1-2 file > out.file
Instead of substitution, one can use the :normal command to repeat a sequence of two Normal mode commands on each line: 2f|, jumping to the second | character on the line, and then D, deleting everything up to the end of the line.
:%norm!2f|D
Just another Vim way to do the same thing:
%s/^\(.\{-}|\)\{2}\zs.*//
%s/^\(.\{-}\zs|\)\{2}.*// " If you want to remove the second pipe as well.
This time, the regex matches as few characters as possible (\{-}) followed by |, twice (\{2}); \zs marks where the match proper starts, so everything before it is kept and everything after it is replaced by nothing (//).
You can use :command to make a user command to run the substitution:
:command -range=% YourNameHere <line1>,<line2>s/^\v([^|]+\|[^|]+)\|.*$/\1/
You can also do:
:%s/^\([^\|]\+|[^\|]\+\)\|.*$/\1/g
Use Awk:
awk -F"|" '{$0=$1"|"$2}1' file
I've found that vim isn't great at handling very large files. I'm not sure how large your file is. Maybe cat and sed together would work better.
Here is a sed solution:
sed -e 's/^\([^|]*|[^|]*\).*$/\1/'
This will filter all lines in the buffer (1,$) through cut to do the job:
:1,$!cut -d '|' -f 1-2
To do it only on the current line, try:
:.!cut -d '|' -f 1-2
Why use Vim? Why not just run
cat my_pipe_file | cut -d'|' -f1-2
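The cat isn't strictly needed; the same thing with one fewer process:
cut -d'|' -f1-2 my_pipe_file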