Is there an awk-like or sed-like command line hack I can issue to generate a list of all keyboard characters (such as a-zA-Z0-9!-*, etc)? I'm writing a simple Caesar cipher program in my intro programming class where we do the rotation not through ASCII values, but by indexing into an alphabet string, something like this:
String alphabet = "abcdefghijklmnopqrstuvwxyz";
for (int pos = 0; pos < message.length(); pos++) {
char ch = message.charAt(pos);
int chPos = alphabet.indexOf(ch);
char cipherCh = alphabet.charAt((chPos + rotation) % alphabet.length());
System.out.print(cipherCh);
}
Clearly I can write a loop in some other language and print all ASCII values, but I'd love something closer to the command line as a flashier example.
Is this what you're looking for:
awk 'END {for (i=33; i<=126; i++) printf("%c",i); print ""}' /dev/null
This generates:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
I chose the range from 33 to 126 as the printable chars. See the ascii man page.
This is pure shell, no externals:
$ for i in {32..126}; do printf \\$(($i/64*100+$i%64/8*10+$i%8)); done
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
It converts decimals to octals and prints the corresponding character.
It works in Bash and ksh, dash and ash (if you use $(seq 32 126) instead of {32..126}) and zsh (if you use print -n instead of printf).
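The decimal-to-octal arithmetic in that loop can be checked with a short Python sketch. The names octal_digits and printable are made up for this illustration, and the formula only covers values below 512 (three octal digits):

```python
def octal_digits(i):
    """Decimal number whose digits spell i in octal (valid for 0 <= i < 512)."""
    # Same arithmetic as the shell loop: $i/64*100 + $i%64/8*10 + $i%8
    return i // 64 * 100 + i % 64 // 8 * 10 + i % 8

# The shell hands these digits to printf as an octal escape (e.g. \101 for 'A');
# here the digits are parsed back as base 8 to recover the same character.
printable = "".join(chr(int(str(octal_digits(i)), 8)) for i in range(32, 127))
print(printable)
```

For example, 65 becomes 101, and printf '\101' prints A.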
The following code gives a very strange result:
#include <iostream>
#include <fstream>
using namespace std;
ifstream f("f1.in");
ofstream g("f1.out");
char sir[255];
int i;
char strlwr(char sir[]) //if void nothing changes
{
int i = 0;
for (i = 0; sir[i] != NULL; i++) {
sir[i] = tolower(sir[i]);
}
return 0; //if instead of 0 is 1 it will kinda work , but strlwr(sir) still needs to be displayed
}
int main()
{
f.get(sir, 255);
g << sir << '\n'; // without '\n' strlwr will no more maters
g << strlwr(sir);
g << sir;
return 0;
}
f1.in:
JHON HAS A COW
f1.out:
䡊乏䠠十䄠䌠坏
桪湯栠獡愠挠睯
It shows this only when I am using just CAPS.
I am using Code::Blocks 13.12 on Ubuntu 14, European version.
I would be very interested in knowing why it shows this.
I am interested in knowing if it gives you the same thing.
Congratulations! You've discovered mojibake! Your output text is 100% correct, but whatever you're viewing it with is interpreting it as UTF-16-encoded Unicode.
If you convert the characters of the garbled output into their hex code points, the issue will become clear. (Code borrowed from this StackOverflow answer.)
$ cat unicode.txt
䡊乏䠠十䄠䌠坏
桪湯栠獡愠挠睯
$ cat unicode.txt | while IFS= read -r -d '' -n1 c; do printf "%02X\n" "'$c"; done
484A
4E4F
4820
5341
4120
4320
574F
0A
686A
6E6F
6820
7361
6120
6320
776F
0A
The second command reads the file character by character and prints each character's code point in hex. Each code point packs two bytes of your output, low byte first (little endian), because the file is being interpreted as UTF-16, which stores each of these characters in two bytes.
If you reinterpret the hex output as single byte ASCII instead (and correct for endianness) you can see that your program did work:
$ cat unicode.txt | while IFS= read -r -d '' -n1 c; do printf "%02X\n" "'$c"; done
484A ; JH
4E4F ; ON
4820 ; H
5341 ; AS
4120 ; A
4320 ; C
574F ; OW
0A ; \n
686A ; jh
6E6F ; on
6820 ; h
7361 ; as
6120 ; a
6320 ; c
776F ; ow
0A ; \n
To determine whether the issue is your C++ program or your viewing program, try running xxd f1.out. If the dump looks like ASCII, then it's your viewing program's fault. Otherwise, it's your program's fault and you should look into setlocale and/or opening your output file in binary mode.
Either way, you should probably change g << strlwr(sir); to just strlwr(sir);. Currently it's adding a NUL byte to your output, which is probably unintended.
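The reinterpretation described above can be reproduced in a few lines of Python. This is only a sketch: it assumes the viewer guessed UTF-16 little-endian, which is what the hex dump suggests, and the variable names are invented for the illustration.

```python
# Bytes the program writes: the original text, '\n', the stray NUL byte that
# `g << strlwr(sir)` emits (strlwr returns the char 0), then the lowercased text.
written = b"JHON HAS A COW\n" + b"\x00" + b"jhon has a cow"

# Assumption: the viewer decodes the file as UTF-16 little-endian.
garbled = written.decode("utf-16-le")
print(garbled)

# Encoding the garbled text back yields exactly the bytes that were written,
# so nothing was lost; only the interpretation was wrong.
assert garbled.encode("utf-16-le") == written
```

Note that the newline plus the NUL byte together form one valid UTF-16 unit (U+000A), which keeps the whole file an even number of bytes. That may be part of why the viewer settled on UTF-16 in the first place.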
I am working on a command parser that is supposed to accept a command line terminating with \r\n and extract its parameters
The command line structure is as follows:
All the parameters inside () are mandatory and the arguments inside [] are optional. spc
stands for a blank space, and \t stands for a tab.
AP is a decimal integer in the range 1...4
RT and WL are unsigned decimal integers
= is the equals symbol
% is the percentage symbol
The following is an acceptable command structure:
[spc] MYCMD [spc] (\t) [spc] (AP) [spc] (:) (WL)(=)(RT)spcspc(\n)
As an example, the following commands are correct (the whole command is case insensitive):
MYCMD \t 1 : 540 = 21% \r\n
MYCMD \t 2 : 712= 25 % \r\n
MYCMD\t 3 : 200 =17%\r\n
and ...
Following commands are incorrect:
MYCMD \t 5: 540 = 21% \r\n ---> 5 is not in range 1..4
MYCMD \t 2 : 712 25% \r\n ---> There is no equal symbol
MYCMD 3 200 =17\r\n --->there is no : between 3 and 200, no percentage symbol
MYCMD 3 100 =1 ,,.\n ----> there are extra symbols after 1 and \r does not exist
MYCMD 2: 130 =17.1\r\n ----> the sscanf parser must not translate the float 17.1 to the integer 17
I have written an sscanf format string, but it does not parse correctly:
int n_parsed=sscanf(cmd_str,"%*sMYCMD[*^\t]%*s%[1234]:%u%*s%[=]%u\r\n",&int_ap,&uint_wl,&uint_rt);
But this does not work for the correct commands (n_parsed never gets 3).
Any hint or comments on fixing the parsing issue will be appreciated
Thanks
Cannot be done solely with sscanf().
A key problem is that " " as well as "\r" as well as "\n" in the format string (aside from inside "[ ]") will optionally scan any number 0+ white-spaces and OP has very specific requirements. Optional spaces ' ', but not other white-spaces, is difficult to do in sscanf().
Another problem is that %d and the other numeric specifiers optionally consume leading whitespace, and we need to either prevent that or accept it.
There is a discrepancy between the format and the examples in the location of the "%". I assume the example is correct.
There is a discrepancy between the format and the examples in the end-of-line \r\n versus \n. I assume any number of trailing spaces is allowed before a final \r\n.
There is a discrepancy between the format and the examples in that spaces are allowed before the numbers. I assume spaces are OK.
The more I look at it I see lots of discrepancies between the stated format and the correct examples. I'll go for whatever is easiest to pass the examples in those cases.
int sep[4] = { 0 };
int int_ap;
unsigned uint_wl, uint_rt;
// [spc] MYCMD [spc] (\t) [spc] (AP) [spc] (:) (WL)(=)(RT)spcspc(\n)
const char *format = " MYCMD%n %n%1d :%u =%u%n %n";
int n_parsed = sscanf(cmd_str, format,
&sep[0], &sep[1], &int_ap, &uint_wl, &uint_rt, &sep[2], &sep[3]);
if (sep[3] == 0) DidNotReadEnd();
if ((int_ap < 1) || (int_ap > 4)) RangeError();
unsigned TabCount = 0;
int n;
for (n = sep[0]; n < sep[1]; n++) {
if (cmd_str[n] == '\t') TabCount++;
}
if (TabCount != 1) WrongTabCount();
for (n = sep[2]; n < sep[3]; n++) {
if (cmd_str[n] != ' ') break;
}
if (strcmp(&cmd_str[n], "\r\n") != 0) EOLError();
Note: int_ap could be scanned with %1[1-4] into a string and then converted to an int.
I fully expect a claim that this can all be done with only a sscanf() format. I am confident such an approach can be broken.
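For comparison, the grammar this answer settles on can be written down as one regular expression. This is a different technique from sscanf(), shown in Python only to pin down what is being accepted. The name parse_mycmd is hypothetical, and the rules are assumptions taken from the "correct" examples rather than the stated format: a mandatory trailing %, plain spaces only (apart from the single tab), and a final \r\n.

```python
import re

# Assumed grammar: optional spaces, MYCMD, exactly one tab (spaces allowed on
# both sides), AP in 1..4, ':', WL, '=', RT, '%', optional spaces, "\r\n".
MYCMD = re.compile(
    r" *MYCMD *\t *([1-4]) *: *(\d+) *= *(\d+) *% *\r\n",
    re.IGNORECASE,  # the whole command is case insensitive
)

def parse_mycmd(line):
    """Return (ap, wl, rt) as ints, or None if the line is not a valid command."""
    m = MYCMD.fullmatch(line)
    return tuple(int(g) for g in m.groups()) if m else None

print(parse_mycmd("MYCMD \t 1 : 540 = 21% \r\n"))
print(parse_mycmd("MYCMD \t 5: 540 = 21% \r\n"))
```

fullmatch is used so that anything left over after the command, such as the stray ",,." in one of the incorrect examples, makes the whole line fail.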
Can split(string, array, separator) in awk use sequence of whitespaces as the separator (or more generally any regexp as the separator)?
Obviously, one could use the internal autosplit (that runs on each line of the input with the value of the FS variable as the separator) and do the trick with a simple for loop and some $0 magic. However, I was just wondering if there's a more straightforward way using split itself.
The GNU Awk User's Guide states:
split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the
pieces in array. The first piece is stored in array[1], the second
piece in array[2], and so forth. The string value of the third
argument, fieldsep, is a regexp describing where to split string (much
as FS can be a regexp describing where to split input records). If
the fieldsep is omitted, the value of FS is used. split returns the
number of elements created. The split function, then, splits strings
into pieces in a manner similar to the way input lines are split into
fields
Here is a short (somewhat silly) example that uses a simple regular expression ".s " that will match any single character followed by a lower-case s and a space. The result of the split is put into array a. Note that the parts that match are not placed into the array.
BEGIN {
s = "this isn't a string yes isodore?"
count = split(s, a, ".s ")
printf("number of splits: %d\n", count)
print "Contents of array:"
for (i = 1; i <= count; i++)
printf "a[%d]: %s\n", i, a[i]
}
The output:
$ awk -f so.awk
number of splits: 3
Contents of array:
a[1]: th
a[2]: isn't a string y
a[3]: isodore?
The article Advanced Awk for Sysadmins shows an example of parsing a line using split(). This page contains an example of using a regular expression to split data into an array.
From the GNU awk(1) manual page:
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and returns the number of fields. If r is omitted, FS is used instead.
The point here is that you can use any regular expression to perform field splitting--at least you can with gawk. If you're using something else, you'll need to check your documentation.
I have a requirement to grep a string or pattern (with, say, around 200 characters before and after it) from a file consisting of one extremely long line. The file contains streams of data (market trading data) coming from a remote server and being appended to this single line.
I know that I can match lines containing a specific pattern using grep (or other tools), but once I have such lines, how can I extract a portion of the line? I want to grab the part of the line with the pattern plus roughly 200 characters before and after the pattern. I would be especially interested in answers using...(supply tools or languages you're comfortable with here).
If what you need is the 200 characters before and after the expression plus the expression itself, then you are looking at:
/.{200}aaa.{200}/
If you need captures for each (allowing you to extract each part as a unit), then you use this regexp:
/(.{200})(aaa)(.{200})/
If your grep has -o then that will output only the matched part.
echo "abc def ghi jkl mno pqr" | egrep -o ".{4}ghi.{4}"
produces:
def ghi jkl
(.{0,200}(pattern).{0,200}), or something?
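The difference between {200} and {0,200} matters when the match sits close to either end of the line: with a fixed count, the expression fails if fewer than 200 characters are available. A minimal Python sketch of the bounded form follows; the helper name context is made up here, and the needle is treated as a literal string rather than a regex.

```python
import re

def context(line, needle, padding=200):
    """Return needle plus up to `padding` characters on each side, or None."""
    # re.escape makes the needle literal; drop it to allow a regex needle.
    pattern = ".{0,%d}%s.{0,%d}" % (padding, re.escape(needle), padding)
    m = re.search(pattern, line)
    return m.group(0) if m else None

print(context("abc def ghi jkl mno pqr", "ghi", 4))  # same result as the egrep example
print(context("ghi jkl", "ghi", 4))                  # still matches at the start of the line
```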
Is this what you want (in C)?
If it is, feel free to adapt to your specific needs.
#include <stdio.h>
#include <string.h>
void prt_grep(const char *haystack, const char *needle, int padding) {
char *ptr, *start, *finish;
ptr = strstr(haystack, needle);
if (!ptr) return;
start = (ptr - padding);
if (start < haystack) start = haystack;
finish = ptr + strlen(needle) + padding;
if (finish > haystack + strlen(haystack)) finish = haystack + strlen(haystack);
for (ptr = start; ptr < finish; ptr++) putchar(*ptr);
}
int main(void) {
const char *longline = "123456789 ASDF 123456789";
const char *pattern = "ASDF";
prt_grep(longline, pattern, 5); /* you want 200 */
return 0;
}
I think I might approach the problem by matching the part of the string I need, then using the match position as the starting point for the substring extraction. In Perl, once your regex succeeds (with the /g flag, in scalar context), the pos built-in tells you where you left off:
if( $long_string =~ m/$regex/g ) {
$substring = substr( $long_string, pos( $long_string ), 200 );
}
I tend to write my programs in Perl instead of doing everything in the regular expression. There's nothing particularly special about Perl in this case.
I think this may be more basic than everybody is thinking; correct me if I'm wrong...
Do you want to print before and after the string excluding the string?
awk -F "ASDF" '{print "Before ASDF" $1 "\n" "After ASDF" $2}' $FILE
This will print something like:
Before ASDF blablabla
After ASDF blablablabla
Change it to match your needs: remove the "\n" and/or the "Before..." and "After..." labels.
Do you want to suppress the string from the file?
This will replace the string with a blank space, again, change it to whatever you need.
sed -i 's/ASDF/\ /' longstring.txt
HTH
I've got a bunch of strings like:
"Hello, here&#39;s a test colon&#58;. Here&#39;s a test semi-colon&#59;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom
The big advantage of using a regex is to deal with the tricky cases like &#38;#38;. Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32;-&#126;. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.
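The steps above can be sketched in Python, with re.finditer standing in for boost's regex_iterator. The name decode_ncrs is hypothetical. The range check is an addition: &#1?\d\d; also matches values such as 12 or 131, which fall outside printable ASCII, so this sketch leaves those references untouched.

```python
import re

NCR = re.compile(r"&#(1?\d\d);")

def decode_ncrs(text):
    """Replace decimal character references for printable ASCII (32..126)."""
    out = []
    last = 0
    for m in NCR.finditer(text):
        out.append(text[last:m.start()])   # prefix of the match
        code = int(m.group(1))             # digits between "&#" and ";"
        if 32 <= code <= 126:
            out.append(chr(code))          # printable ASCII: substitute it
        else:
            out.append(m.group(0))         # out of range: keep the text as is
        last = m.end()
    out.append(text[last:])                # suffix of the last match
    return "".join(out)

print(decode_ncrs("Hello, here&#39;s a test colon&#58;."))
```

Because each match is handled exactly once, an input such as &#38;#38; becomes &#38; and is not decoded a second time.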
This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END
* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end
I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
import re
s = "Hello, here&#39;s a test colon&#58;. Here&#39;s a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.
The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello, here&#39;s a test colon&#58;. Here&#39;s a test semi-colon&#59; Further test &#65;. abc.&#126;.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
my $x=$_[0];
if ( ($x >= 32) && ($x <= 126) )
{
return sprintf( "%c", $x );
}
else
{
return "&#" . $x . ";" ;
}
}
while (<>)
{
s/&#(1?\d\d);/&translate($1)/ge;
print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
Here's another Perl one-liner (see @mrree's answer):
a test file:
$ cat ent.txt
Hello, here&#39;s a test colon&#58;.
Here&#39;s a test semi-colon&#59; ''
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello, here's a test colon:.
Here's a test semi-colon; ''
The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.
// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
using namespace BOOST_SPIRIT_CLASSIC_NS;
std::string line;
while (std::getline(std::cin, line)) {
assert(parse(line.begin(), line.end(),
// match "&#(\d+);" where 32 <= $1 <= 126 or any char
*(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
| anychar_p[&putchar])).full);
putchar('\n');
}
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello, here&#39;s a test colon&#58;." | spirit_ncr2a
output:
"Hello, here's a test colon:."
I did think I was pretty good at regex, but I have never seen lambdas being used in a regex; please enlighten me!
I'm currently using Python and would have solved it with this one-liner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split(r'&#(\d+);',THESTRING)])
Does that make any sense?
Here's an NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello, here&#39;s a test colon&#58;.
> Here&#39;s a test semi-colon&#59; ''
> &#38;#59; <-- may be recursive" \
> | ncr2a
The non-recursive version prints:
Hello, here's a test colon:.
Here's a test semi-colon; ''
&#59; <-- may be recursive
And the recursive one produces:
Hello, here's a test colon:.
Here's a test semi-colon; ''
; <-- may be recursive
This is one of those cases where the original problem statement apparently isn't complete, but if you really want to trigger only on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case as well (e.g. &#131;&#58; produces "&#131;:"):
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end
Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
const int subs[] = {-1, 1}; // non-match & subexpr
boost::sregex_token_iterator end;
std::string line;
while (std::getline(std::cin, line)) {
boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
if (isncr) { // convert NCR e.g., '&#58;' -> ':'
const int d = boost::lexical_cast<int>(*tok);
assert(32 <= d && d < 127);
std::cout << static_cast<char>(d);
}
else
std::cout << *tok; // output as is
}
std::cout << '\n';
}
}