Why don't `csplit` and `grep` agree on whether there are matches?

I am trying to use csplit in BASH to separate a file by years in the 1500-1600's as delimiters.
When I do the command
csplit Shakespeare.txt '/1[56]../' '{36}'
it almost works, except for at least two issues:
This outputs 38 files, not 36, numbered xx00 through xx37. (Also, xx00 is completely blank.) I don't understand how this is possible.
One of the files doesn't begin with 15XX or 16XX; it begins with "ACT 4 SCENE 15\n" (where \n denotes a newline). This seems to be why csplit returns 37 non-empty files instead of the 36 non-empty files I expected. I don't understand how csplit can match a newline with a number.
When I do the command (which is what I want)
csplit Shakespeare.txt '/1[56][0-9][0-9]/' '{36}'
the terminal prints the error csplit: 1[56][0-9][0-9]: no match, along with the same list of numbers it prints when the first command is run.
This especially doesn't make sense to me, since grep says otherwise:
grep -c "1[56][0-9][0-9]" Shakespeare.txt
36
grep -c "1[56].." Shakespeare.txt
36
Note: man csplit indicates that I have the BSD version from January 26, 2005. man grep indicates that I have the BSD version from July 28, 2010.

Based on the answer given here by user 'DRL' on 06-20-2008, I decided to try adding the -k option to csplit.
csplit -k Shakespeare.txt '/^1[56][0-9][0-9]/' '{36}'
This returned an error: csplit: ^1[56][0-9][0-9]: no match
However, it still gave (more or less) the desired output: files xx00.txt through xx36.txt (not xx37.txt), and each of the non-empty files, xx01.txt-xx36.txt, had the expected/desired content. (In particular, no file began with "ACT 4 SCENE 15".)
The man page for csplit says the following about the -k flag:
-k Do not remove output files if an error occurs or a HUP, INT or TERM signal is received.
Honestly I don't quite understand what this means, but I still have the following conjecture about why this solution worked/works:
Conjecture: csplit expects the beginning of the file to match the regex. Thus, since the beginning line of the file did not match ^1[56][0-9][0-9], it threw a tantrum and quit without the -k flag.
Nevertheless, I still don't understand why 1[56][0-9][0-9] did not work, maybe the same reason. And I definitely don't understand why 1[56].. did not work (i.e. why csplit produced a 37th file not beginning with the pattern).
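One detail that may help when reading the outputs above: POSIX describes the {num} operand as repeating the previous operand that many more times, so '/re/' '{36}' asks for 37 matches in total rather than 36. The sketch below checks this on a tiny made-up file (sample.txt, its contents and the ok/keep prefixes are assumptions for illustration, not the Shakespeare file). It does not attempt to explain the "ACT 4 SCENE 15" match.

```shell
# Build a small sample: a preamble line followed by three "year" sections.
workdir=$(mktemp -d)
printf '%s\n' 'preamble' '1599' 'play one' '1601' 'play two' '1605' 'play three' \
  > "$workdir/sample.txt"

# Three matching lines: the pattern itself plus two repeats uses them all.
csplit -s -f "$workdir/ok" "$workdir/sample.txt" '/^1[56][0-9][0-9]/' '{2}'
ok_count=$(ls "$workdir"/ok* | wc -l | tr -d ' ')

# Three repeats would need a fourth matching line, so csplit reports an error.
# -k keeps whatever pieces were written before the failure.
csplit -s -k -f "$workdir/keep" "$workdir/sample.txt" '/^1[56][0-9][0-9]/' '{3}' 2>/dev/null
keep_status=$?

echo "repeat 2: $ok_count files written; repeat 3: exit status $keep_status"
```

If this reading applies to the question, a file with 36 matching lines would need '{35}' rather than '{36}' to finish without the "no match" error.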

How do I grep a string using the previous output for my next argument?

There is a string located within a file that starts with 4bceb and is 32 characters long.
To find it I tried the following
Input:
find / -type f 2>/dev/null | xargs grep "4bceb\w{27}" 2>/dev/null
after entering the command it seems like the script is awaiting some additional command.
Your command seems alright in principle, i.e. it should correctly execute the grep command for each file find returns. However, I don't believe your regular expression (respectively the way you call grep) is correct for what you want to achieve.
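First, grep's default basic regular expression syntax treats { and } as literal characters, so \w{27} does not mean "27 word characters" there. To get your expression to work as written, you need to tell grep that you are using Perl syntax by specifying the -P flag.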
Second, your regexp will return the full lines that contain sequences starting with "4bceb" that are at least 32 characters long, but may be longer as well. If, for example your ./test.txt file contents were
4bcebUUUUUUUUUUUUUUUUUUUUUUUU31
4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
4bcebWWWWWWWWWWWWWWWWWWWWWWWWWW33
sometext4bcebYYYYYYYYYYYYYYYYYYYYYYYYY32somemoretext
othertext 4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32 evenmoretext
your output would include all lines except the first one (in which the sequence is shorter than 32 characters). If you actually want to limit your results to lines that just contain sequences that are exactly 32 characters long, you can use the -w flag (for word-regexp) with grep, which would only return lines 2 and 5 in the above example.
Third, if you only want the match but not the surrounding text, the grep flag -o will do exactly this.
And finally, you don't need to pipe the find output into xargs, as grep can directly do what you want:
grep -rnPow / -e "4bceb\w{27}"
will recursively (-r) scan all files starting from / and return just the ones that contain matching words, along with the matches (as well as the line numbers they were found in, as result of the flag -n):
./test.txt:2:4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
./test.txt:5:4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32
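As a rough check of the -o and -w behaviour described above, here is a minimal sketch that rebuilds the five sample lines with known lengths and runs the search on them. It uses -E with [[:alnum:]_] instead of -P with \w, on the assumption that a PCRE-enabled grep may not be available everywhere; the pad helper and the temporary file are made up for the demonstration.

```shell
# Hypothetical helper: print $1 copies of the character $2.
pad() { printf "%${1}s" '' | tr ' ' "$2"; }

sample=$(mktemp)
{
  printf '4bceb%s31\n' "$(pad 24 U)"                      # 31 characters: too short
  printf '4bceb%s32\n' "$(pad 25 V)"                      # exactly 32 characters
  printf '4bceb%s33\n' "$(pad 26 W)"                      # 33 characters: too long
  printf 'sometext4bceb%s32somemoretext\n' "$(pad 25 Y)"  # embedded in a longer word
  printf 'othertext 4bceb%s32 evenmoretext\n' "$(pad 25 Z)"
} > "$sample"

# -o prints only the matched text, -w requires the match to be a whole word.
matches=$(grep -Eow '4bceb[[:alnum:]_]{27}' "$sample")
printf '%s\n' "$matches"
```

Only the V and Z sequences should be printed, which corresponds to lines 2 and 5 of the example file.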

Convert paged columns to rows with a regular expressions

So first a sample of the actual data mangled (data is originally a mix of text and numbers, there's no significance to any of the data at this point and some of the patterns are just because I replaced most of the characters with 0s, 1s and Zs because the random number generator in my brain is broken):
011.0ZN1ZZ 001.F5ZS1Z 001.ZO5ZY0
014.5ZZZ1Z 001.1SZZOZ 001.ZLMZY0
016.01NM1SU54 001.EX0Z1Z 001.LIZZOZ
018.01NM1SS41 001.F83Z1Z 001.0011M1SU54
014.ZZ1YZZ 001.ZZZ1IZ 001.0011M1SS41
013.2EBSIZ 001.ZZZ11Z 001.0011SE4
01N.ZINSIZ 001.ZZZZ1Z P01.ZZZZ1Z
01N.01NSE4 001.LSZZHG N01.ZZZZ1Z
001.01ON5O 001.5Z21OL F01.ZZZZ1Z
001.NE5ZO1 001.ZOM05O D01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.011ZOZ 001.01NZ0Y
Some additional comments.. I can clean up whitespace and deal with record length with no issues, so I'd like to simplify the question to this, I'm just including the above in case there's a solution to the simplified version that can't be easily extended to a more complex version.
1 7 13
2 8 14
3 9 15
4 10 16
5 11 17
6 12 18
19 25
20 26
21 27
22 28
23 29
24
So there will be a variable number of pages, but the same number of columns and rows on each page (although, in case it matters significantly, it's actually 12x3 instead of 6x3 but I wanted to keep it simple if possible), although the last page may be some empty rows/columns.
I'm using notepad++ but I have access to various gnutilities so if there's a solution that's way, way better than a regular expression I don't mind, although since I'll be using this a lot and use notepad++ a lot I'd appreciate a regex solution if it isn't too insane.
If you've got Git installed on your Windows machine, you may use Perl bundled with it from Git bash. Provided your input file is named data, try the following command (caution: it will overwrite the input file):
echo >>data ; \
perl -i.bak -lane'
$i=0;
push @{$c[$i++]}, $_ foreach @F;
if (/^\s*$/) {
    push @l, @{$_} foreach @c;
    print "@l\015";
    @l=@c=();
}' data
The Perl command treats each line of input as space-delimited fields and accumulates the fields in the @c matrix. When it encounters an empty line (if (/^\s*$/) ...), it prints the matrix columns concatenated into a list.
The input file is changed in-place. A backup copy data.bak is created.
The input file may not end with an empty line so I add one with echo >>data. This makes the Perl script shorter and easier.
Another trick is the trailing \015 in print "@l\015";. This allows us to get Windows CRLF line endings in the Unix-flavoured Git bash environment.
A demo can be found here: https://ideone.com/vnYoOd. But since Ideone forbids file read/write, the original command has been modified to make the code run there.
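For readers who prefer awk from the same Git bash prompt, the column-by-column idea described above can be sketched as follows. This is an alternative illustration, not the answerer's command: the function name paged_to_rows is made up, pages are assumed to be separated by blank lines, and only trailing cells of a page are assumed to be missing (a gap in a middle column would shift the fields).

```shell
# Hypothetical helper: reads paged columns on stdin (or from files given as
# arguments) and prints one line per page, reading each page column by column.
paged_to_rows() {
  awk '
    # Print the buffered page in column order, then reset the buffers.
    function emit_page(   r, c, line) {
      line = ""
      for (c = 1; c <= ncols; c++)
        for (r = 1; r <= nrows; r++)
          if ((r, c) in cell)
            line = (line == "" ? "" : line " ") cell[r, c]
      if (line != "") print line
      split("", cell)          # portable way to empty the array
      nrows = 0; ncols = 0
    }
    NF == 0 { emit_page(); next }   # a blank line ends the current page
    {
      nrows++
      if (NF > ncols) ncols = NF
      for (c = 1; c <= NF; c++) cell[nrows, c] = $c
    }
    END { emit_page() }             # the last page may lack a trailing blank line
  ' "$@"
}

printf '1 7 13\n2 8 14\n3 9 15\n\n19 25\n20 26\n21\n' | paged_to_rows
```

Unlike the in-place Perl command, this writes to standard output, so the input file is left untouched and no extra empty line has to be appended first.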

What is the difference between the two sed commands below?

Information about the environment I am working in:
$ uname -a
AIX prd231 1 6 00C6B1F74C00
$ oslevel -s
6100-03-10-1119
Code Block A
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \
/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleA.txt
Code Block B
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \n/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleB.txt
I have two code blocks (shown above) that use sed to trim the input down to 6 digits, but one of them behaves differently than I expected.
Sample of input for the two code blocks
Mar 25 14:06:16 prd231 ajbtux[33423660]: 20160325140616:~schd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT:~schdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341~ SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210~
I get the following output when the sample input above goes through the two code blocks.
cycleA.txt content
389210
cycleB.txt content
25140616231334236602016032514061610200705008341389210
I understand that my last piped sed command (sed 's/[^0-9]*//g') deletes all characters other than digits, so I omitted it from both code blocks and wrote the output to two additional files. I get the following output.
cycleA1.txt content
SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210
cycleB1.txt content
Mar 25 15:27:58 prd231 ajbtux[33423660]: 20160325152758: nschd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT: nschdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341 n SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210 n
I can see that the first code block removes everything other than (SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210) by splitting on the tilde, whereas the second code block just replaces the tildes with the character n. I can also see that the first code block needs a line break after this (sed 's/[~]/ ), which is why I thought \n would simulate a line break, but that is not the case. I think the different results come from the way regular expressions are being used. I have tried to look into regular expressions and searched for them on Stack Overflow, but did not find what I was looking for. Could someone explain how I can get the same result from code block B as from code block A without having part of my code on a second line?
Thank you in advance
This is an example of the XY problem (http://xyproblem.info/). You're asking for help to implement something that is the wrong solution to your problem. Why are you changing ~s to newlines, etc when all you need given your posted sample input and expected output is:
$ sed -n 's/.*schdCycCleanup.* \([0-9]*\).*/\1/p' file
389210
or:
$ awk -F'[ ~]' '/schdCycCleanup/{print $(NF-1)}' file
389210
If that's not all you need then please edit your question to clarify your requirements for WHAT you are trying to do (as opposed to HOW you are trying to do it) as your current approach is just wrong.
Etan Reisner's helpful answer explains the problem and offers a single-line solution based on an ANSI C-quoted string ($'...'), which is appropriate, given that you originally tagged your question bash.
(Ed Morton's helpful answer shows you how to bypass your problem altogether with a different approach that is both simpler and more efficient.)
However, it sounds like your shell is actually something different - presumably ksh88, an older version of the Korn shell that is the default sh on AIX 6.1 - in which such strings are not supported[1]
(ANSI C-quoted strings were introduced in ksh93, and are also supported not only in bash, but in zsh as well).
Thus, you have the following options:
With your current shell, you must stick with a two-line solution that contains an (\-escaped) actual newline, as in your code block A.
Note that $(printf '\n') to create a newline does not work, because command substitutions invariably trim all trailing newlines, resulting in the empty string in this case.
Use a more modern shell that supports ANSI C-quoted strings, and use Etan's answer. http://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.cmds3/ksh.htm tells me that ksh93 is available as an alternative shell on AIX 6.1, as /usr/bin/ksh93.
If feasible: install GNU sed, which natively understands escape sequences such as \n in replacement strings.
[1] As for what actually happens when you try echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g' in a POSIX-like shell that does not support $'...': the $ is left as-is, because what follows is not a valid variable name, and sed ends up seeing literal $s/[~]/\\\n/g, where the $ is interpreted as an address selecting the last input line - which doesn't make a difference here, because there is only 1 line. \\ is interpreted as plain \, and \n as plain n, effectively replacing ~ instances with literal \n sequences.
GNU sed handles \n in the replacement the way you expect.
OS X (and presumably BSD) sed does not. It treats it as a normal escaped character and just unescapes it to n. (Though I don't see this in the manual anywhere at the moment.)
You can use $'' quoting to use \n as a literal newline if you want though.
echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g'
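A further single-line possibility, not mentioned in the answers above, is to let tr do the splitting, since POSIX tr understands \n on its own and no shell quoting feature is involved. The sketch below is a minimal illustration under that assumption; extract_cycle is a made-up name, and the log line is the sample from the question.

```shell
# Hypothetical helper: prints the cycle number found in each matching log line.
# tr turns every ~ into a newline, so sed never has to emit one.
extract_cycle() {
  grep schdCycCleanup "$1" |
    tr '~' '\n' |
    grep 'Move(s) Exist for cycle' |
    sed 's/[^0-9]*//g'
}

# Sample line taken from the question.
log=$(mktemp)
printf '%s\n' 'Mar 25 14:06:16 prd231 ajbtux[33423660]: 20160325140616:~schd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT:~schdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341~ SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210~' > "$log"

cycle=$(extract_cycle "$log")
echo "$cycle"
```

Whether the tr on a particular AIX installation behaves this way has not been checked here, so treat it as something to try rather than a guaranteed fix.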

Bash: Regex for SVN Conflicts

So I'm trying to write a regex to use for a grep command on an SVN status command. I want only files with conflicts to be displayed, and if it's a tree conflict, the extra information SVN provides about it (which is on a line with a > character).
So, here's my description of how SVN outputs lines with conflicts, and then I'll show my regex:
[Single Char Code][Spaces][Letter "C"][Space]Filename
[Spaces][Letter "C"][Space]Filename
[Letter "C"][Space]Filename
This is what I have so far to try and get the proper regex. The second part, after the OR condition, works fine to get the tree conflict extra line. It's the first part, where I'm trying to get lines with the letter C under very specific conditions.
Anyway, I'm not exactly the greatest with Regex, so some help here (plus an explanation of what I'm doing wrong, so I can learn from this) would be great.
CONFLICTS=($(svn status | grep "^(.)*C\s\|>"))
Thanks.
This regex should match your lines :
CONFLICTS=$(svn status | grep '^[ADMRCXI?!~ ]\? *C')
^[ADMRCXI?!~ ]\? : the line starts with zero or one (\?) status character from the set [ADMRCXI?!~ ]
 * : zero or more spaces
C : the character C
I removed the extra parentheses surrounding the command substitution.
You need to read the description of the svn st output more closely and try to produce at least one tree conflict.
I'll start it for you:
> The first seven columns in the output are each one character wide:
>...
> Seventh column: Whether the item is the victim of a tree conflict
>...
> 'C' tree-Conflicted
and note: theoretically any of these 7 columns can be non-empty
status for tree-conflict
M       wc/bar.c
!     C wc/qaz.c
      >   local missing, incoming edit upon update
D       wc/qax.c
Dirty lazy draft of regexp
^[enumerate_all_chars_here]{6}C\s
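The column-based idea can be sketched as a filter that looks for C in fixed positions instead of enumerating every status character. This is a minimal sketch, assuming the layout given by svn help status (column 1 for content conflicts, column 2 for property conflicts, column 7 for tree conflicts) and assuming the tree-conflict detail line starts with spaces followed by ">". The name filter_conflicts and the wc/text.c line are made up for the demonstration.

```shell
# Hypothetical filter: keep lines with C in column 1, 2 or 7, plus the
# indented "> ..." detail line that follows a tree conflict.
filter_conflicts() {
  grep -E '^(C|.C|.{6}C)|^ +> '
}

# Sample status output; the last line is an invented content conflict.
status_sample='M       wc/bar.c
!     C wc/qaz.c
      >   local missing, incoming edit upon update
D       wc/qax.c
C       wc/text.c'

conflicts=$(printf '%s\n' "$status_sample" | filter_conflicts)
printf '%s\n' "$conflicts"
```

In real use the input would come from svn status, for example svn status | filter_conflicts.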

Sed command find and replace in even lines of a file

Hi, I am new to this forum. I want to use sed to replace an expression on the even lines of a file. My problem is that I cannot think of how to save the changes in the original file (i.e., how to write the changes back to the file). I have tried with:
sed -n 'n;p;' filename | sed 's/aaa/bbb/'
but this does not save the changes. I appreciate your help on this.
Try :
sed -i '2~2 s/aaa/bbb/' filename
The -i option tells sed to work in place: instead of writing the edited version to stdout and leaving the original file untouched, it applies the changes to the file itself. The 2~2 portion is the address, that is, the set of lines the command applies to. An address of the form first~step selects line first and every step-th line after it, so 2~2 means edit only even lines, 1~2 would edit only odd lines, and 5~6 would edit every sixth line starting at line 5 (lines 5, 11, 17, and so on).
@Mithrandir's answer is an excellent, correct and complete one.
I will just add that the m~n addressing method is a GNU sed extension that may not work everywhere. For example, not all Macs have GNU sed, and *BSD systems may not have it either.
So, if you have a file like the following one:
$ cat f
1 ab
2 ad
3 ab
4 ac
5 aa
6 da
7 aa
8 ad
9 aa
...here is a more universal solution:
$ sed '2,${s/a/#A#/g;n;}' f
1 ab
2 #A#d
3 ab
4 #A#c
5 aa
6 d#A#
7 aa
8 #A#d
9 aa
What does it do? The address of the command is 2,$, which means it will be applied to all lines between the second one (2) and the last one ($). The command is in fact two commands, treated as one because they are grouped by braces ({ and }). The first command is the replacement s/a/#A#/g. The second one is the n command, which prints the current pattern space and then replaces it with the next line of input. So the current iteration prints the substituted line, reads the following line, and prints that one unchanged at the end of the cycle; the next iteration then starts on the line after that. Since I started at the 2nd line, the substitution is applied to each even line.
Of course, since you want to update the original file, you should call it with the -i flag. I would note that some of those non-GNU seds require you to give a parameter to the -i flag, which is an extension to be appended to the file name. The resulting name is that of a generated backup file with the old content. (So, if you call, for example, sed -i.bkp s/a/b/ myfile.txt, the file myfile.txt will be altered, but another file, called myfile.txt.bkp, will be created with the old content of myfile.txt.) Since a) it is required in some places, b) it is accepted by GNU sed, and c) it is good practice nonetheless (if something goes wrong, you can restore the backup), I recommend using it:
$ ls
f
$ sed -i.bkp '2,${s/a/#A#/g;n;}' f
$ ls
f f.bkp
Anyway, my answer is just a complement for some specific scenarios. I would use @Mithrandir's solution, not least because I am a Linux user :)
This might work for you:
sed -i 'n;s/aaa/bbb/' file
Use sed -i to edit the file in place.
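To see the n;s/aaa/bbb/ approach together with an in-place edit, here is a minimal sketch on a throwaway file. The temporary directory and the four sample lines are made up, and -i.bak (with the suffix attached) is used on the assumption that this spelling is accepted by both GNU and BSD sed.

```shell
# Create a scratch file with four lines that all contain the search text.
work=$(mktemp -d)
printf '%s\n' 'aaa 1' 'aaa 2' 'aaa 3' 'aaa 4' > "$work/filename"

# n prints the odd line and loads the next (even) line, which s then edits.
# -i.bak rewrites the file in place and keeps the old content in filename.bak.
sed -i.bak 'n;s/aaa/bbb/' "$work/filename"

result=$(cat "$work/filename")
backup=$(cat "$work/filename.bak")
printf '%s\n' "$result"
```

The edited file should have bbb on lines 2 and 4 only, while filename.bak still holds the original four lines.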