What does "$\=$/;" mean in perl? - regex

I came across a perl program which counts the number of vowels in a string. But I'm not able to infer a single line how it is working. Anyone who can decode this program line by line?
$\=$/;map{
$_=<>;print 0+s/[aeiou]//gi
}1..<>

What does $\=$/; mean in perl?
Sets $\ to the value of $/.
$/ defines the line ending ending for readline (<>). It's default is a line feed (U+000A).
$\ is appended to the output of each print. It's default is the empty string.
So, assuming $/ hadn't been changed, it sets $\ to line feed, which makes print act like say.
Anyone who can decode this program line by line?
Globally make print act like say.
Read a line from ARGV.
For a number of times equal to the number read,
Read a line from ARGV.
Use s/[aeiou]//gi to count the number of vowels.
Print the result.
In scalar context, s///g returns the number of matches/replacements. 0+ forces scalar context.
By the way, tr/aeiouAEIOU// would be faster than 0+s/[aeiou]//gi, and no longer. It's also non-destructive.

Related

Remove newline after incorrect field splitting in csv file

I use linux and I'm trying to use sed for this. I download a CSV from an institutional site providing some data to be analyzed. There are several thousand lines per CSV, and many columns per row (I haven't counted them, but I think the number is useless). The fields are separated by semicolons and quoted, so the format per line is:
"Field 1";"Field 2";"Field 3"; .... ;"Field X";
Each correct line ends with semicolon and '\n'. The problem is that, from time to time, there's some field that incorrectly has a newline, and the solution is to delete the newline character, so the two lines go back to be together into only one. Example of an incorrect line:
"Field 1";"Field 2";"Fi
eld 3";"Field X";
I've found that there can be a \n right after the opening quote or somewhere in the between the quotes.
I've found a way to manage this last case, where the newline is right after the quote:
sed ':a;N;$!ba;s/";"\n/";"/g' file.csv
but not for "any number of alphabet characters after the quote not ending in semicolon". I have a pattern file (to be used with -f) with these lines:
:a;N;$!ba;s/";"\n/";"/g
:a;N;$!ba;s/\([A-z]\)\n/\1/g
:a;N;$!ba;s/\([:alpha:]\)\n/\1/g
The first line of the pattern file works, but I've tried combinations of the second and third and I always get an empty file.
If current line doesn't end with a semicolon, read and append next line to pattern space and remove line break.
sed '/[^;]$/{N;s/\n//}' file

Find and remove specific string from a line

I am hoping to receive some feedback on some code I have written in Python 3 - I am attempting to write a program that reads an input file which has page numbers in it. The page numbers are formatted as: "[13]" (this means you are on page 13). My code right now is:
pattern='\[\d\]'
for line in f:
if pattern in line:
re.sub('\[\d\]',' ')
re.compile(line)
output.write(line.replace('\[\d\]', ''))
I have also tried:
for line in f:
if pattern in line:
re.replace('\[\d\]','')
re.compile(line)
output_file.write(line)
When I run these programs, a blank file is created, rather than a file containing the original text minus the page numbers. Thank you in advance for any advice!
Your if statement won't work because not doing a regex match, it's looking for the literal string \[\d\] in line.
for line in f:
# determine if the pattern is found in the line
if re.match(r'\[\d\]', line):
subbed_line = re.sub(r'\[\d\]',' ')
output_file.writeline(subbed_line)
Additionally, you're using the re.compile() incorrectly. The purpose of it is to pre-compile your pattern into a function. This improves performance if you use the pattern a lot because you only evaluate the expression once, rather than re-evaluating each time you loop.
pattern = re.compile(r'\[\d\]')
if pattern.match(line):
# ...
Lastly, you're getting a blank file because you're using output_file.write() which writes a string as the entire file. Instead, you want to use output_file.writeline() to write lines to the file.
You don't write unmodified lines to your output.
Try something like this
if pattern in line:
#remove page number stuff
output_file.write(line) # note that it's not part of the if block above
That's why your output file is empty.

Applying a regular expression to a text file Python 3

#returns same result i.e. only the first line as many times as 'draws'
infile = open("results_from_url.txt",'r')
file =infile.read() # essential to get correct formatting
for line in islice(file, 0, draws): # allows you to limit number of draws
for line in re.split(r"Wins",file)[1].split('\n'):
mains.append(line[23:38]) # slices first five numbers from line
stars.append(line[39:44]) # slices last two numbers from line
infile.close()
I am trying to use the above code to iterate through a list of numbers to extract the bits of interest. In this attempt to learn how to use regular expressions in Python 3, I am using lottery results opened from the internet. All this does is to read one line and return it as many times as I instruct in the value of 'draws'. Could someone tell me what I have done incorrectly, please. Does re 'terminate' somehow? The strange thing is if I copy the file into a string and run this routine, it works. I am at a loss - problem 'reading' a file or in my use of the regular expression?
I can't tell you why your code doesn't work, because I cannot reproduce the result you're getting. I'm also not sure what the purpose of
for line in islice(file, 0, draws):
is, because you never use the line variable after that, you immediately overwrite it with
for line in re.split(r"Wins",file)[1].split('\n'):
Plus, you could have used file.split('Wins') instead of re.split(r"Wins",file), so you aren't really using regex at all.
Regex is a tool to find data of a certain format. Why do you use it to split the input text, when you could use it to find the data you're looking for?
What is it you're looking for? A sequence of seven numbers, separated by commas. Translated into regex:
(?:\d+,){7}
However, we want to group the first 5 numbers - the "mains" - and the last 2 numbers - the "stars". So we'll add two named capture groups, named "mains" and "stars":
(?P<mains>(?:\d+,){5})(?P<stars>(?:\d+,){2})
This pattern will find all numbers you're looking for.
import re
data= open("infile.txt",'r').read()
mains= []
stars= []
pattern= r'(?P<mains>(?:\d+,){5})(?P<stars>(?:\d+,){2})'
iterator= re.finditer(pattern, data)
for count in range(int(input('Enter number of draws to examine: '))):
try:
match= next(iterator)
except StopIteration:
print('no more matches')
break
mains.append(match.group('mains'))
stars.append(match.group('stars'))
print(mains,stars)
This will print something like ['01,03,31,42,46,'] ['04,11,']. You may want to remove the commas and convert the numbers to ints, but in essence, this is how you would use regex.

Stata: removing line feed control characters

I have a dataset which I export with command outsheet into a csv-file. There are some rows which breaks line at a certain place. Using a hexadecimal editor I could recognize the control character for line feed "0a" in the record. The value of the variable producing the line break shows visually (in Stata) only 5 characters. But if I count the number of characters:
gen xlen = length(x)
I get 6. I could write a Perl programm to get rid of this problem but I prefer to remove the control characters in Stata before exporting (for example using regexr()). Does anyone have an idea how to remove the control characters?
The char() function calls up particular ASCII characters. So, you can delete such characters by replacing them with empty strings.
replace x = subinstr(x, char(10), "", .)

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se, it's just that it uses commands most people aren't familiar with and have certain side effects, like delimiting the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,//g;p;}' inputfile
The ? needs to be escaped and . doesn't match newlines.
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
$ # for the last input line
{
p; # print
q # and quit
};
N; # otherwise, append the next line
/\n,/ # if it starts with a comma
{
s/"\?\n//p; # delete an optional comma and the newline and print the result
b # branch to the end to read the next line
};
P; # it doesn't start with a comma so print it
D # delete the first line of the pair (it's just been printed) and loop to the top
' inputfile