Replace White Spaces with sed

I have a large file (100M rows) in the following format:
Week |ID |Product |Count |Price
---------- ------------- -------- ---------- -----
2016-01-01|00056001 |172 |23 |3.50
2016-01-01|1 |125 |15 |2.75
I am trying to use sed to add Xs to the missing digits on the second customer ID, but maintain the number of spaces after the full ID. So, the table would look like:
Week |ID |Product |Count |Price
---------- ------------- -------- ---------- -----
2016-01-01|00056001 |172 |23 |3.50
2016-01-01|1XXXXXXX |125 |15 |2.75
I have tried
sed -i "s/\s\{29,\}/XXXXXXX /g" *.csv
and
sed -i -- "s/1 /1XXXXXXX /g" *.csv
Neither made any change to the file. What am I missing?
Thanks.
EDIT for clarification: There are 29 spaces after the 1 in the actual data; I used fewer in the example table for readability's sake. I assume whatever solution works will apply no matter the number of spaces.

This works for me (using a literal space instead of \s, and dropping the g flag because the substitution is needed only once per line):
sed -i "s/[ ]\{29,\}/XXXXXXX /" *.csv
Although for safety I would rather use a more restrictive script which performs the substitution only when |1 is encountered. Note that | is literal in a basic regular expression and must not be escaped inside the group, since GNU sed treats \| as alternation:
sed -i "s/\(|1\)[ ]\{29,\}/\1XXXXXXX /" *.csv

How to use regex in bash while loop to modify csv data per line?

I have a csv file and need to extract number values from 1 column for all rows. How do I go about doing this?
I have the regex to extract the values but unsure how to apply it for all rows in csv files with 1000+ records.
1. CREATE_DATE | NAME | PROD_ID
2. 12/01/2018 | starburst 25g | 2323
3. 01/23/2018 | 43g hersheys | 4353
expected result:
1. 25
2. 43
Using a combination of awk and sed you can strip out everything from the csv except for the digits.
awk -F lets you change the delimiter from the default whitespace to another character, in your case a pipe "|". On its own, that command would print just "starburst 25g"; the sed command then keeps only the digits from that line, so 25.
EDIT: I just saw from the title that this will run in a loop, so you can read each line in the loop and modify it:
while read -r line ; do
    # /1/ keeps only lines containing a 1 (the date rows here do, the header does not);
    # $2 is the NAME field, and sed strips everything except the digits
    echo "$line" | awk -F "|" '/1/ {print $2}' | sed 's/[^0-9]*//g'
done < filename.csv
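For 1000+ records the loop pays for a new awk and sed process on every line, which gets slow. The whole job can also be done in one awk pass; a sketch, assuming the value you want is always in the second |-delimited field and the first line is the header:
awk -F'|' 'NR > 1 { gsub(/[^0-9]/, "", $2); print $2 }' filename.csv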

remove sequence of line feeds and spaces in file with sed

I have a file which contains an undesired sequence of line feeds and spaces that I want to remove. The actual file is about 1 million rows; this is just a reproducible example.
I can grep the offending lines like this:
grep -ciP "\n\n {6,}" problem.rpt
And it correctly returns
3
So I tried with sed to replace the string:
sed "s/\n\n {6,}//g" problem.rpt > prob2.rpt
but instead of deleting the sequence "\n\n {6,}" I now have "\r\n\r\n {6,}": it introduced a CR before each LF, without removing the line feeds or the 6+ spaces.
I'm working with GNU sed and grep in a Windows 8.1 cmd.
What am I doing wrong, and what's the right way to approach this job?
Does one of the following help you? Most likely the second one is what you are looking for (the 7 is just any true condition, which makes awk print each record):
awk -v RS="\n\n {6,}" '7' problem.rpt
awk -v RS="\n\n {6,}" -v ORS="" '7' problem.rpt
I think you have gawk too, right? (A multi-character, regex-valued RS is a gawk extension.)
I don't have Windows to test for you....
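If you want to try it without Windows at hand, here is a quick sketch that builds a two-line test input containing the offending sequence and runs the second command on it:
printf 'line one\n\n      line two\n' > problem.rpt
awk -v RS='\n\n {6,}' -v ORS='' '7' problem.rpt
The expected output is "line oneline two": the separator sequence is gone and the records are joined.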
From a list of sed one-liners I found one command that solved my problem:
sed -e :a -e "$!N; s/\n //;ta" -e "P;D" problem.rpt > prob2.rpt
Then, trying to decipher the command, this is what I found (copied verbatim):
sed ':a; $!N; s/\n/string/; ta'
     |    |        |         |
     |    |        |         +--> go back (`t`) to `a`
     |    |        +------------> substitute the newline with `string`
     |    +---------------------> if this is not the last line (`$!`), append
     |                            the next line to the pattern space
     +--------------------------> create the label `a`
As for the P;D part: P prints the pattern space up to the first embedded newline, and D deletes the pattern space up to (and including) the first newline and restarts the cycle with what remains, instead of reading a new line. Together they flush the accumulated lines one at a time.
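Since GNU sed is in use anyway, another option is its -z flag (a GNU extension, available since sed 4.2.2): it reads NUL-separated records, so a text file without NUL bytes becomes one big pattern space and the newlines can be matched directly. A sketch; note it slurps the whole file into memory:
sed -z "s/\n\n \{6,\}//g" problem.rpt > prob2.rpt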

Match last name using awk

Say there's a file like this
1 | John Smith | 70000
2 | Al McSmith | 60000
If I use
awk -F"|" '$2~/Smith/' file
both rows are matched.
Is there a way to only match John Smith? (USING AWK ONLY)
EDIT: I'm trying to match the people that have Smith as their last name, without matching McSmith, or O'Smith, etc.
This may work for you:
awk -F'|' '$2~/ Smith\s*$/' file
It won't match:
fooSmith
Smithfoo
foo Smith bar (where Smith is only a middle name)
Just stick a space before Smith:
awk -F'|' '$2~/ Smith/' testfile
If there is a name like John Smitherton in there, then stick a space after as well (since it looks like you have <space><delim><space> between each field), as sketched below. Otherwise you can get a little fancier with the regex, but your space padding is pretty useful here.
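A sketch of that two-sided match, assuming the <space><delim><space> padding shown in the sample (so the second field is " John Smith "):
awk -F'|' '$2 ~ / Smith $/' testfile
This matches John Smith but rejects both Al McSmith (no space before Smith) and a hypothetical John Smitherton (no space after Smith).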
Another solution using grep
grep -E "[^|]*\|[^|]*\<Smith\>"
Explanation:
[^|]    match any character except |
\|      match a literal | (escaped because -E otherwise treats | as alternation)
\< \>   match the start and end of a word
I made a test: I created a file test.in with your content:
1 | John Smith | 70000
2 | Al McSmith | 60000
Then I tried another expression:
awk -F'|' '{print $2~/\sSmith\s/}' test.in
It prints:
1
0
So, 1 for Smith, 0 for McSmith.
[UPD] \s is not standard awk; it is a GNU extension specific to gawk that matches any whitespace character.
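If the script also has to run on a non-GNU awk, [[:space:]] is the portable POSIX equivalent of \s; a sketch of the same test:
awk -F'|' '{print $2 ~ /[[:space:]]Smith[[:space:]]/}' test.in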

Sed: print all lines after match

This is the result of my research so far, using sed:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match?
I'm using zcat so I cannot use awk.
Thanks.
Edited:
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern") and then print the whole lines to get all the information about each spam. The problem is that there may be matches in INFO or the id fields, so I use sed to get the text part first.
Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
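A toy demonstration of the difference between the two forms (the first prints the matching line itself, the second starts just after it):
$ printf 'a\nb\nc\n' | sed -ne '/b/,$ p'
b
c
$ printf 'a\nb\nc\n' | sed -e '1,/b/ d'
c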
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
You can use awk
awk '/pattern/,EOF'
N.B. don't be fooled: EOF is just an uninitialized variable, which defaults to 0 (false), so the end condition of the range is never satisfied and the range runs to the end of the file.
This could be combined with the awk approaches in the other answers as well.
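The same idea written with an explicit flag instead of a range, which some find easier to read (a sketch):
awk '/pattern/ { found = 1 } found' file
The flag is set on the matching line itself, so this also prints the match and everything after it.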
Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
Your question title says "all lines after a match", so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace everything up through text= with nothing, and keep the rest. In your pipeline that becomes zcat file* | sed '/pattern/s/.*text=//'.
The seldom-used branch command will do this for you. Until you match, use n for next and branch back to the beginning; after the match, use n once to skip the matching line, then loop, printing the remaining lines. Putting each label in its own -e expression keeps the label names unambiguous:
sed -n -e ':start' -e '/pattern/b match' -e 'n' -e 'b start' -e ':match' -e 'n' -e ':copy' -e 'p' -e 'n' -e 'b copy' file
Your pipeline currently ends with cut -f 1 - | grep "pattern":
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
Instead, change the last two segments of the pipeline to a single awk filter:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

Replacing text and duplicates

I have a log file with lines filled with things like this:
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.html
I want to extract only the username portion, and then remove duplicates, so that I am only left with this:
biaxib
hihi
hoho
ihatespam
The ruleset is:
Extract the text between "/home/Users/" and "/....." at the end
Remove duplicate lines after the above rule is applied
Do this inside Linux
Can someone help me with how to create such a script, or statement to do this?
Assuming that the username always appears as the 4th component of the path:
$ cat test.txt
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.
$ cat test.txt | cut -d/ -f 5 | sort | uniq
biaxib
hihi
hoho
ihatespam
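If the order of first appearance matters (sort | uniq reorders alphabetically), a single awk pass can deduplicate instead; a sketch using the same field position as the cut command:
awk -F/ '!seen[$5]++ { print $5 }' test.txt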
cat /path/to/your/log/file.txt | python3 -c '
import sys
for line in sys.stdin:
    # split("/") yields ["", "home", "Users", "b", "<username>", ...],
    # so the username is at index 4
    print(line.split("/")[4])
' | sort | uniq
This could probably be done more concisely in perl or with other built-in tools (see the other answer), but I personally shy away from the standard Linux text-manipulation tools (edit: cut is a useful one, though).