Parse target file based on source file contents - regex

I am trying to search for lines in FileB (which is comma-separated) that contain content from lines in FileA. I originally tried grep, but it does not seem to cope with some of the characters in FileA. I don't think the CSV formatting itself matters much, at least not to grep.
$ grep -f FileA FileB
grep: Unmatched [ or [^
I am open to using any generally available Linux command, Perl or Python. There is no single expression that matches every case, which is why I want to use the content of FileA as the patterns. Below are some example lines from FileA that we want to match in FileB.
page=--&id='`([{^~
page=&rows_select=%' and '%'='
l=admin&x=&id=&pagex=http://.../search/cache?ei=utf-&p=change&fr=mailc&u=http://sub.domain.com/cache.aspx?q=change&d=&mkt=en-us&setlang=en-us&w=afe,dbfcd&icp=&.intl=us&sit=dbajdy.alt
The lines in FileB that contain the above strings will contain additional characters, i.e. the strings in the two files will not be a one-for-one match:
if FileA contains abc and FileB contains 012abc*(), then 012abc*() should be printed

A simple Python solution would be:
with open('filea', 'r') as fa:
    patterns = [p.rstrip('\n') for p in fa]

with open('fileb', 'r') as fb:
    for line in fb:
        # print the FileB line if any FileA line occurs inside it
        if any(p in line for p in patterns):
            print(line, end='')
which stores the whole pattern file in memory and checks each line of the other file for any pattern occurring as a substring.
But why wouldn't you just use diff? I'd have to look at the man page, but I'm pretty sure there's a way to make it report the similarities between two files. After googling:
Using diff to find the portions of many files that are the same? (bizzaro-diff, or inverse-diff)
https://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff
they give this solution:
diff --unchanged-group-format='## %dn,%df
%<' --old-group-format='' --new-group-format='' \
--changed-group-format='' a.txt b.txt
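Note, though, that diff only reports lines that are identical in both files, so this will not handle the substring case in the question, where the FileB lines contain extra characters around the FileA content.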

Untested solution:
Logic:
Store each line from FileA in a lines array
For each line of FileB, check whether any stored line appears as part of it
If index(...) returns > 0, print that line from FileB
awk 'NR==FNR{lines[$0]++;next}{for (line in lines) if (index($0,line)) {print; next}}' FILEA FILEB
(The next stops at the first matching pattern, so a FileB line is never printed twice.)

Use fgrep (or equivalently grep -F). That interprets the patterns (the contents of FileA) as literal strings to search for instead of regular expressions.
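For example, mirroring the failing command from the question:
$ grep -F -f FileA FileB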

Related

How to output multiple regex matches through comma on the same line

I want to use grep/awk/sed to extract matched strings from each line of a log file, then place them into a CSV file.
The strings to extract from the first line, for example, are 1434, 53 and http://www.espn.com/.
If the input is:
2018-10-31
18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850,
pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45,
PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0,
fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31
18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0,
loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0,
fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
My solution so far:
For now I used grep to get all lines containing the keyword 'FETCH DONE' (these lines contain the strings I am looking for).
I did come up with a regular expression that matches the data I need, but when I grep with it and redirect into the file, each match is printed on its own line, which is not quite what I am looking for.
The grep and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'URL,LoadTime,Objects\n'>test1.csv #add header
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))' input >>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I tried using awk to match multiple regexes and print a comma in between. I couldn't get it to work at all for some reason, even though my regex matches the correct strings.
Another idea I have is to use sed to replace some of the '\n' with ',':
for(i=1;i<=n;i++)
    if(i % 3 != 0){
        sed REPLACE "\n" with "," on i-th line
    }
I'm pretty sure there is a more efficient way of doing it.
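There is; one generic trick (a sketch, assuming the log is in a file called input and the matches always arrive in complete groups of three: URL, load time, objects, as in the actual output above) is to let paste join every three lines of the grep output with commas:
# paste reads three lines at a time from stdin and joins them with commas
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))' input | paste -d, - - - >> test1.csv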
Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'

Grep invert on string matched, not line matched

I'll keep the explanation of why I need help to a minimum. One of my directories got hacked through XSS, and a long string was placed at the beginning of all PHP files. I've tried to use sed to replace the string with nothing, but it won't work because the pattern to match includes many, many characters that would need to be escaped.
I found out that I can use fgrep to match a fixed string saved in a pattern file, but I'd like to remove the matched string (NOT THE LINE) from each file; grep's -v inverts the result on the whole line rather than just the matched string.
This is the command I'm using on an example file that contains the hacked string:
fgrep -v -f ~/hacked-string.txt example.php
I need the output to keep the <?php that comes after the hacked string on that line (sometimes it's a <style> tag), but the -v option drops the whole matched line, so the <?php is lost too.
NOTE
I've tried the -o (--only-matching) flag combined with -v, which outputs nothing instead:
fgrep -f ~/hacked-string.txt example.php --only-matching -v
Is there another option in grep that I can use to invert on the end of the matched pattern, rather than the line where the pattern was matched? Or alternatively, is there an easier option to replace the hacked string in all .php files?
Here is a small snippet of what's in hacked-string.txt (line breaks added for readability):
]55Ld]55#*<%x5c%x7825bG9}:}.}-}!#*<%x55c%x7825)
dfyfR%x5c%x7827tfs%x5c%x7c%x785c%x5c%x7825j:^<!
%x5c%x7825w%x5c%x7860%x5c%x785c^>Ew:25tww**WYsb
oepn)%x5c%x7825bss-%x5c%x7825r%x5c%x7878B%x5c%x
7825h>#]y3860msvd},;uqpuft%x5c%x7860msvd}+;!>!}
%x5c%x7827;!%x5c%x7825V%x5c%x7827{ftmfV%x5e56+9
9386c6f+9f5d816:+946:ce44#)zbssb!>!ssbnpe_GMFT%
x5c5c%x782f#00#W~!%x5c%x7825t2w)##Qtjw)#]82#-#!
#-%x5c%x7825tmw)%x5c%x78w6*%x5c%x787f_*#fubfsdX
k5%x5c%xf2!>!bssbz)%x5c%x7824]25%x5c%x7824-8257
-K)fujs%x5c%x7878X6<#o]o]Y%x5c%x78257;utpI#7>-1
-bubE{h%x5c%x7825)sutcvt)!gj!|!*bubEpqsut>j%x5c
%x7825!*72!%x5c%x7827!hmg%x5c%x78225>2q%x5c%x7
Thanks in advance!
I think what you are asking is this:
"Is it possible to use the grep utility to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
In that case, the answer is "No".
What I think you wanted to ask was:
"What is the easiest way to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
Here's one reasonably simple solution:
delete_string() {
    # remove every occurrence of the fixed string given as $1 from stdin
    awk -v s="$1" '{while(i=index($0,s))$0=substr($0,1,i-1)substr($0,i+length(s))}1'
}
delete_string 'some_hideous_string_with*!"_inside' < original_file > new_file
The shell syntax is slightly fragile; it will break if the string contains an apostrophe ('). However, you can read a raw string from stdin into a variable with:
$ IFS= read -r the_string
absolutely anything here
which will work with any string which doesn't contain a newline or a NUL character. Once you have the string in a variable, you can use the above function:
delete_string "$the_string" < original_file > new_file
Here's another possible one-liner, using Python:
delete_string() {
python -c 'import sys;[sys.stdout.write(l.replace(r"""'"$1"'""","")) for l in sys.stdin]'
}
This won't handle strings which have three consecutive quotes (""").
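If that is a concern, a variant (an untested sketch; PATTERN is just an arbitrary variable name) that passes the string through the environment avoids splicing it into the Python source at all:
delete_string() {
    # the string travels via the environment, so no shell or Python quoting can break it
    PATTERN="$1" python -c '
import sys, os
s = os.environ["PATTERN"]
for line in sys.stdin:
    sys.stdout.write(line.replace(s, ""))
'
}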
Is the hacked string the same in every file?
If the length of the hacked string was, say, 1234 characters, then you can use
tail -c +1235 file.php > fixed-file.php
for each infected file.
Note that tail -c +1235 tells tail to start output at the 1235th character of the input file.
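If you'd rather not count the characters by hand, and assuming hacked-string.txt holds the string followed by a single trailing newline, the offset can be computed from the file size:
# wc -c counts the string plus its trailing newline, which is exactly the 1-based offset of the first byte to keep
tail -c +$(wc -c < hacked-string.txt) file.php > fixed-file.php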
With perl:
perl -i.hacked -pe "s/\Q$(<hacked-string.txt)\E//g" example.php
Notes:
The $(<file) bit is a bash shortcut to read the contents of a file.
The \Q and \E bits are Perl escapes that treat everything in between as plain characters, ignoring regex metacharacters.
The -i.hacked option will edit the file in place, creating a backup "example.php.hacked".

Comment out file paths in a file matching lines in another file with sed and bash

I have a file (names.txt) with the following content:
/bin/pgawk
/bin/zsh
/dev/cua0
/dev/initctl
/root/.Xresources
/root/.esd_auth
... and so on. I want to read this file line by line and use sed to comment out the matching lines in another file. I have the code below, but it does nothing:
#!/bin/bash
while read line
do
    name=$line
    sed -e '/\<$name\>/s/^/#/' config.conf
done < names.txt
The lines from names.txt need to be commented out in the config.conf file, as follows:
config {
#/bin/pgawk
#/bin/zsh
#/dev/cua0
#/dev/initctl
#/root/.Xresources
#/root/.esd_auth
}
I don't want to do this by hand, because the file contains more than 300 file paths. Can someone help me figure this out?
You need to use double quotes around your sed command, otherwise shell variables will not be expanded. Try this:
sed "/\<$name\>/s/^/#/" config.conf
However, I would recommend that you skip the bash loop entirely and do the whole thing in one go, using awk:
awk 'NR==FNR{a[$0];next}{for(i=1;i<=NF;++i)if($i in a)$i="#"$i}1' names.txt config.conf
The awk command stores all of the file names as keys in the array a and then loops through every word in each line of the config file, adding a "#" before the word if it is in the array. The 1 at the end means that every line is printed.
It is better not to use regular expression matching here, as some of the characters in your file names (such as .) will be interpreted by the regular expression engine. This approach does a simple string match, which avoids the problem.
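Note that awk does not edit files in place by default (GNU awk 4.1+ has -i inplace), so to actually update config.conf you would write to a temporary file and move it back, e.g. (config.conf.tmp is just a scratch name):
awk 'NR==FNR{a[$0];next}{for(i=1;i<=NF;++i)if($i in a)$i="#"$i}1' names.txt config.conf > config.conf.tmp &&
    mv config.conf.tmp config.conf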

How to parse a text file with a particular compound-expression filter in shell scripting

I want to parse a text file for particular words: every row that contains the words "Cluster", "WEEK" and "8.2" should be written to the output file.
Sample text in the file:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~monthly~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
2013032308470272~800000102507~Cluster-Mode~yearly~8.1.2~V6240
Desired output (in another text file) after applying the above filters:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I have written code using the awk command; however, the output file contains rows that are outside the scope of the filters.
Code used to extract the text:
awk '/Cluster/ && /WEEK/ && /8.2/ { print $NF > "/u/nbsvc/Data/Lookup/derived_asup_2010404_201409_2.txt" }' /u/nbsvc/Data/Lookup/cmode_asup_lookup.txt
Obtained output:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
Note: the first line of obtained output is not needed in the desired output. How can I change my script to only get the line that I want?
To remove any ambiguity and false matches on partial fields or the wrong field, THIS is the command you need to run:
$ awk -F'~' '$3~/^Cluster/ && $4=="WEEK" && $5~/^8\.2/' file
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
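To unpack it: -F'~' makes ~ the field separator, so $3, $4 and $5 are the mode, period and version fields; $4=="WEEK" is an exact string comparison rather than a substring match; and /^8\.2/ anchors the version test, with the dot escaped so it matches a literal dot instead of any character, meaning only versions beginning with 8.2 can match. The missing action defaults to printing the line.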
I don't think that awk is needed at all here. Just use grep to match the line that you're interested in:
grep 'Cluster.*WEEK.*8\.2' file > output_file
The .* matches zero or more of any character and > is used to redirect the output to a new file. I have escaped the . in between "8.2" so that it is interpreted literally, rather than matching any character (although it would work either way).
There is actually a little more to my requirement: I need to read this text file, split each line, push the values into an array, and then check whether the values match my pattern; if they match, write the line to an output text file, otherwise simply ignore it. I did it as below:
awk '{IGNORECASE=1; split($0, a, "~"); if (a[1] ~ /201404/ && a[3] ~ /Cluster/ && a[4] ~ /WEEK/ && a[5] ~ /8.2/) print $0}' /inputfolder_path/lookup_filename.txt > /outputfolder_path/derived_output_filename.txt
This is working exactly for my requirement. Just thought I'd post this update for everyone, as it may help someone.
Thanks,
Siva

Counting number of lines which contain a pattern

I have data in the following form:
<id_mytextadded1829>
<text1> <text2> <text3>.
<id_m_abcdef829>
<text4> <text5> <text6>.
<id_mytextadded1829>
<text7> <text2> <text8>.
<id_mytextadded1829>
<text2> <text1> <text9>.
<id_m_abcdef829>
<text11> <text12> <text2>.
Now I want to count the number of lines in which <text2> is present. I know I can do this using Python's regex, but a regex only tells me whether a pattern is present in a line or not. My requirement, on the other hand, is to find a string that is present exactly in the middle of a line. I know sed is good for replacing content in a line, but if I only want the number of lines instead of replacing anything, is that possible with sed?
EDIT:
Sorry, I forgot to mention: I want lines where <text2> occurs in the middle of the line. I don't want lines where <text2> occurs at the beginning or at the end of the line.
E.g. in the data shown above, the number of lines which have <text2> in the middle is 2 (rather than 4).
Is there some way to get the desired count of lines that have <text2> in the middle, using Linux tools or Python?
I want lines where <text2> occurs in the middle of the line.
You could say:
grep -P '.+<text2>.+' filename
to list the lines containing <text2> not at the beginning or the end of a line.
In order to get only the count of matches, you could say:
grep -cP '.+<text2>.+' filename
You can use grep for this. For example, this will count the number of lines in the file that match the ^123[a-z]+$ pattern:
egrep -c '^123[a-z]+$' file.txt
P.S. The regex is quoted so that the shell does not interpret the $ or expand the brackets.
Edit: the question is a bit tricky, since we don't know exactly what your data is and what you're trying to count in it, but it all comes down to correctly formulating a regular expression.
If we assume that <text2> is an exact sequence of characters that should be present in the middle of the line and not at the beginning or the end, then this should be the regex you're looking for: ^<text[^2]>.*<text2>.*<text[^2]>\.$
Using awk you can do this:
awk '$2~/text2/ {a++} END {print a+0}' file
2
It will count all lines with text2 in the middle (second) field; the a+0 makes it print 0 rather than an empty line when nothing matches.
I want lines where <text2> occurs in the middle of the line. I don't
want lines where <text2> occurs at the beginning or at the end of the
line.
Try using grep with -c
grep -c '>.*<text2>.*<' file
Output:
2
Where it occurs (anywhere in the line):
sed -n "/<text2>/ =" filename
If you want it in the middle (as you wrote later in a comment):
sed -n "/[^ ] \{1,\}<text2> \{1,\}[^ ]/ =" filename