search document for non-ascii - regex

An application on my computer needs to read in a text file. I have several, and one doesn't work; the program fails to read it and tells me that there is a bad character in it somewhere. My first guess is that there's a non-ascii character in there somewhere, but I have no idea how to find it. Perl or any generic regex would be nice. Any ideas?

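You can use [^\x20-\x7E] to match any character outside the printable ASCII range (space through tilde). Besides non-ASCII characters, it also flags control characters such as tabs.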
e.g. grep -P '[^\x20-\x7E]' suspicious_file

perl -wne 'printf "byte %02X in line $.\n", ord $& while s/[^\t\n\x20-\x7E]//;'
will find every character that is not a printable ASCII character, tab, space, or newline.
If it reports 0Ds (carriage-returns) in files that are O.K., then change \t\n to \t\n\r.
If it only reports 0Ds in files that are bad, then you can probably fix those files by running dos2unix on them.
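If Perl is not at hand, the same check can be sketched in Python. This is only an illustration of the approach above, not a replacement for it; the function name find_suspect_bytes and the sample data are made up for the example.

```python
import re

# Any byte that is not a tab, a newline, or printable ASCII (space through tilde).
SUSPECT = re.compile(rb"[^\t\n\x20-\x7E]")

def find_suspect_bytes(data: bytes):
    """Yield (line_number, column, byte_value) for every suspect byte."""
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        for match in SUSPECT.finditer(line):
            yield line_number, match.start() + 1, match.group()[0]

# Hypothetical sample: a UTF-8 encoded i-diaeresis and a DOS line ending on line 2.
SAMPLE = b"ok\nna\xc3\xafve\r\n"

for line_number, column, value in find_suspect_bytes(SAMPLE):
    print(f"byte {value:02X} in line {line_number}, column {column}")
```

To scan a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the function.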

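If your source code also contains tab characters, try this pattern:
[^\x09-\x7E]
Tab is \x09 (\x08 is backspace), so the range starts there. Note that it also lets the other control characters from \x0A to \x1F through.
It also works in Notepad++.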

Related

sed add text around regex

I would like to be able to go:
sed "s/^\(\w+\)$/leftside\1rightside/"
and have the group matched by \(\w+\) appear in between 'leftside' and 'rightside'.
But it seems like I have to pipe it twice, one for the left of the text, another time for the right. If anyone knows a way to do it in one pass, I'd appreciate it.
The reason it's not working is that you probably specify the wrong regex. In your case, text will be added in the end and beginning of the line only if it consists only of word characters (given that your version of sed supports the \w notation). Also you didn't escape the + which you should do if not using the -r option.
Try starting with sed "s/^\(.*\)$/leftside\1rightside/" or just sed "s/.*/leftside&rightside/" and working from that.
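For comparison, here is a rough Python sketch of the same two substitutions. It is only meant to show why the original pattern leaves some lines untouched; the function names are invented for the example, and \g<0> plays the role of sed's &.

```python
import re

def wrap_any(line: str) -> str:
    # Like sed "s/.*/leftside&rightside/": wraps whatever is on the line.
    return re.sub(r"^.*$", r"leftside\g<0>rightside", line)

def wrap_word_only(line: str) -> str:
    # Like the pattern in the question: only matches when the whole line
    # consists of word characters, so other lines come back unchanged.
    return re.sub(r"^(\w+)$", r"leftside\1rightside", line)

print(wrap_any("hello world"))
print(wrap_word_only("hello"))
print(wrap_word_only("hello world"))
```

The last call returns its input unchanged, which is the behaviour described above for lines that contain anything other than word characters.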

Simple way of converting slashes in a Makefile?

I need to convert all paths with '\' in them to '/'. The makefile is quite long and doing this manually is impossible.
Is there some way to quickly convert them? Keep in mind that a global replace is not possible because '\' is also used to denote that a command is continued on the following line.
It looks like you could do this with a sed command:
sed -e 's/\\\(.\)/\/\1/g'
This converts any backslash followed by some other character (which doesn't include newline) into a forward slash followed by that same character.
This command line has a bit of a "leaning toothpick" problem, sorry about that.
I think that Greg's solution was nearly correct, but I would do
sed -e 's/\\\(.\)/\/\1/g'
so that the g flag replaces every such backslash on a line, not only the first. Sorry for not posting this as a comment, but I don't have the privilege yet.
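The same idea as a small Python sketch, in case sed quoting gets in the way. The function name and the sample Makefile text are hypothetical. Like the sed command, it relies on "." not matching a newline, so a backslash at the end of a line is left alone.

```python
import re

def convert_path_separators(makefile_text: str) -> str:
    # A backslash followed by any character except newline becomes a
    # forward slash followed by that character. Line continuations
    # (backslash directly before the newline) do not match.
    return re.sub(r"\\(.)", r"/\1", makefile_text)

sample = "SRC = dir\\sub\\file.c \\\n      other\\x.c\n"
print(convert_path_separators(sample))
```

As with the sed version, any other backslash escapes in the Makefile would be converted too, so it is worth diffing the result before committing it.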

Regex (grep) for multi-line search needed [duplicate]

I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? find . in non-greedy mode, that is, stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This is an attempt to match the closing brace at the same indentation as the start of the method.
As you can imagine, this search prints the main method in a C (*.c) source file.
I am not very good with grep, but your problem can be solved with awk.
Just see
awk '/select/,/from/' *.sql
The command above prints each block of lines from a line containing select through the next line containing from. You then need to check whether the returned statements contain customerName. For this you can pipe the result into awk or grep again.
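To make the two steps concrete, here is a hedged Python sketch of what the awk range does, followed by the customerName check. The helper names are invented, and the matching is deliberately as naive as the awk one-liner (plain substring-style regexes, case-sensitive).

```python
import re

def line_ranges(lines, start=r"select", end=r"from"):
    """Yield blocks running from a line matching start through the next line matching end."""
    block = None
    for line in lines:
        if block is None:
            if not re.search(start, line):
                continue
            block = [line]          # a range opens on this line
        else:
            block.append(line)
        if re.search(end, line):    # the range may close on the line that opened it
            yield block
            block = None
    if block:                       # like awk, an unterminated range runs to end of input
        yield block

def blocks_with(lines, needle="customerName"):
    return [b for b in line_ranges(lines) if any(needle in line for line in b)]

sample = ["select id,", "  customerName", "from customers;", "select x from y;"]
print(blocks_with(sample))
```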
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
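$/ = "\n\n";    # Read a paragraph at a time
while (<>)
{
    # /s lets . match newlines; /i ignores case
    if ($_ =~ m/SELECT.*customerName.*FROM/si)
    {
        print "$ARGV\n";    # name of the file being read
        close ARGV;         # skip to the next file
    }
}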
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.
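A comparable sketch in Python, which reads each file whole instead of by paragraph and does the recursive walk with pathlib. The names PATTERN, contains_query and matching_sql_files are made up for the example, and skipping .svn directories is an assumption carried over from the grep command in the question.

```python
import re
from pathlib import Path

# DOTALL lets . cross line breaks; the lazy .*? keeps each gap as short as possible.
PATTERN = re.compile(r"\bselect\b.*?\bcustomerName\b.*?\bfrom\b",
                     re.IGNORECASE | re.DOTALL)

def contains_query(sql_text: str) -> bool:
    return PATTERN.search(sql_text) is not None

def matching_sql_files(root: str):
    """Yield every *.sql file under root whose text matches PATTERN."""
    for path in sorted(Path(root).rglob("*.sql")):
        if ".svn" in path.parts:
            continue
        if contains_query(path.read_text(errors="replace")):
            yield path

print(contains_query("SELECT id,\n\tcustomerName\nFROM customers"))
```

Because the whole file is searched at once, the pattern can also match across two separate statements, so treat the results as candidates to review rather than a precise answer.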

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX-style bracket classes, which Perl and PCRE support (plain POSIX tools may not recognise [:ascii:]):
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
No, [^\x20-\x7E] is not ASCII.
This is real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
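A quick way to see the difference between the two ranges is to run both against a string that contains a tab, a newline and one accented letter. This is just a Python illustration; the variable names and the sample string are made up.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")       # outside the full ASCII table
NON_PRINTABLE = re.compile(r"[^\x20-\x7E]")   # outside space..tilde

sample = "caf\u00e9\tbar\n"

print(NON_ASCII.findall(sample))
print(NON_PRINTABLE.findall(sample))
```

The first pattern reports only the accented letter, while the second also reports the tab and the newline, which are perfectly valid ASCII.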
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
If asked for options, use Multiline.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so strings can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
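Since the question was about counting the affected files, here is a small Python sketch that does the count without any regex. The function names are invented; str.isascii needs Python 3.7 or later. Unlike the glob above, it does not flag ASCII control characters.

```python
import os

def non_ascii(names):
    """Return the names that contain at least one non-ASCII character."""
    return [name for name in names if not name.isascii()]

def count_non_ascii_filenames(directory: str) -> int:
    # os.listdir returns str names when given a str path.
    return len(non_ascii(os.listdir(directory)))

print(non_ascii(["plain.txt", "caf\u00e9.txt", "r\u00e9sum\u00e9.doc"]))
```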
This turned out to be very flexible and extensible.
$field =~ s/[^\x00-\x7F]//g ; # thus all non ASCII or specific items in question could be cleaned. Very nice either in selection or pre-processing of items that will eventually become hash keys.

How to match a decimal letter and blank in vim?

I need to change
1 A
2 B
3 C
4 D
to
A
B
C
D
which means the decimal digit at the beginning of every line and the one or more blanks that follow it should be deleted.
I'm only familiar with regex in Perl, so I tried :%s/^\d\s+// to solve my problem, but it does not work. Can anyone tell me how to get this done in vim?
Thanks.
Vim needs a backslash for +, so try
:%s/^\d\s\+//
One way is to use the global command with the search-and-replace command:
:g/^[0-9]  */s//
It searches for the sequence:
start of line ^
a digit [0-9]
a space <space>
zero or more spaces <space>*
and then substitutes it for nothing (s//).
While you can do a similar thing with just the search-and-replace command on its own, it's useful to learn the global command since you can do all sorts of wonderful things with the lines selected (not just search and replace).
Use the following
:%s/^[0-9] *//
You should use instead
:%s/^\d\s\+//
Being a text editor, vim tends to treat more characters literally‒as text‒when matching a pattern than perl. In the default mode, + matches literal +.
Of course, this is configurable. Try
:%s/\v^\d\s+//
and read the help file.
:help magic
You can also use the Visual Block mode (Ctrl+V), then move down and to the right to highlight a block of characters and use 'x' to remove them. Depending on the layout of the line, that may in fact be quicker (and easier to remember).
If you still want to use Perl for this, you can:
:%!perl -pe 's/^\d\s+//'
Vim will write the file to a temporary file, run the given Perl script on it, and reload the file into the edit buffer.
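The same kind of external filter could be written in Python instead of Perl. This is only a sketch with an invented function name; it uses [ \t]+ rather than \s+ so that the match can never run past the end of a line when the whole text is processed at once.

```python
import re

def strip_leading_digit(text: str) -> str:
    # (?m) makes ^ match at the start of every line, not just the first.
    return re.sub(r"(?m)^\d[ \t]+", "", text)

print(strip_leading_digit("1 A\n2 B\n3 C\n4 D"))
```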
Escape the plus sign:
:%s/^\d\s\+//
If it's in a column like that you could go into the column visual mode by pressing:
esc ctrl+q
then you can highlight what you want to delete