Aspell only dumps a small number of English words

I am trying to dump aspell's word list to a file on Windows 11.
I have installed the English dictionary from http://aspell.net/win32/
I then run the following command:
aspell -d en dump master | aspell -l en expand > words3.en.txt
The result is a list of only 4272 lines, which is clearly not all the words.
How do I get it to dump all words?

Related

Sublime Text 3 files not opening in command line/PowerShell

I’ve just finished Codecademy and am setting up Python (2.7) and Sublime Text 3 on my own computer using the Codecademy guide page. I can successfully perform simple operations (e.g. print “Hello world”) a) in Python and b) by pressing Ctrl+B on text in my Sublime Text 3 editor page, which gives the output in the box at the bottom of the page.
I’m stuck when it comes to running the Sublime Text 3 file through the command line, and confused about the PowerShell vs. Command Prompt issue (I’m running Windows 10).
My command line does not display the $ sign shown in the Codecademy page example (https://www.codecademy.com/articles/setup-python), so should I be entering the commands below through PowerShell rather than the command line? If so, I don’t get the $ in PowerShell either.
If I carry on regardless and try to change directories etc. through Command Prompt or PowerShell, I only get error messages and can’t seem to run the Sublime Text 3 file.
I’m also not clear on whether this is an issue about which directory I’m in. Running dir in both Command Prompt and PowerShell returns the result: after I’ve opened Python, but lists all subfolders of C:\Users\my_name if I run it before opening Python. Does this mean that I need to save my Sublime Text files in some sort of subfolder of Python in order to be able to run them as above?
Or does it not matter that I’m not able to run the Sublime Text 3 files through Python directly and I should just stick to doing so through Sublime Text 3 itself? Will this limit me later on?
Thanks for your help
BC89

Ubuntu grep on Windows does not find all entries

I've exported a bunch of MS-Outlook mails to a text file. Now I'm trying to find some particular lines within that text file, but this seems not to work:
Prompt>/C/Temp_Folder$ egrep "Found crash|process disappearance " testtttt.txt | wc -l
13
Prompt>/C/Temp_Folder$ grep "Found crash" testtttt.txt | wc -l
11
Prompt>/C/Temp_Folder$ grep "process disappearance " testtttt.txt | wc -l
3
Opening this file in Notepad++, I have these results:
Found crash : 921 matches
process disappearance : 4975 matches
This may be relevant:
When I run grep without the wc -l, I see the following result at the end:
Binary file testtttt.txt matches
This probably means that the file is treated as a binary file, although it's just a "regular" textfile.
When I ask what kind of file I'm dealing with, I get the following result:
file testtttt.txt
testtttt.txt: news or mail, ISO-8859 text, with very long lines, with CRLF line terminators
What's going on here: is it the news or mail, the ISO-8859, the very long lines, ..., and how can I solve this?
For your understanding, I'm working on a Linux subsystem on a Windows-10 machine (the Ubuntu app from Canonical Group Limited).
I managed to solve my issue using the Notepad++ feature Find all in current document.

Capture group from regex in bash script

When building an R package, the build command writes its process steps to stdout. From that output I would like to capture the final name of the package.
In the simulated script below I show the output of the build command. The part that needs to be captured is on the last line, the one starting with * building.
How do I get the regex to match with these quotes, and then capture the package name into a variable?
#!/usr/bin/env bash
var=$(cat <<"EOF"
Warning message:
* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘analysis’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory ‘analysis/.idea/inspectionProfiles’
Removed empty directory ‘analysis/.idea/snapshots’
* creating default NAMESPACE file
* building ‘analysis_0.1.tar.gz’
EOF
)
regex="building [\u2018](.*?)?[\u2019]"
if [[ "${var}" =~ $regex ]]; then
pkgname="${BASH_REMATCH[1]}"
echo "${pkgname}"
else
echo "sad face"
fi
This should work on both macOS and CentOS.
Support for the \u and \U unicode escapes was introduced in Bash 4.2. CentOS 7 has Bash 4.2, so this should work on that platform:
regex=$'.*building[[:space:]]+\u2018(.*)\u2019'
Unfortunately, earlier versions of CentOS had older versions of Bash, and I believe the default version of Bash on macOS is still 3.2. For those, assuming that the quotes are encoded as UTF-8, this should work:
regex=$'.*building[[:space:]]+\xe2\x80\x98(.*)\xe2\x80\x99'
If the quotes are encoded in different ways on different platforms, then you could use alternation (e.g. (\xe2\x80\x98|...) instead of \xe2\x80\x98) to match all of the possibilities, and adjust the index used for BASH_REMATCH accordingly.
See How do you echo a 4-digit Unicode character in Bash? for more information about Unicode in Bash.
I've used $'...' to set the regular expression because it supports \x and (from Bash 4.2) \u escapes for characters, and Bash regular expressions don't.
With regard to the regular expression:
The leading .* is greedy, so it consumes as much text as possible and the match uses the last occurrence of building in the text.
I've dropped the ?s because they aren't compatible with Bash's built-in regular expressions. See mkelement0's excellent answer to How do I use a regex in a shell script? for information about Bash regular expressions.
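As a self-contained check of the Bash 3.2-compatible pattern above, here is a minimal sketch. The build_output sample is made up for illustration, and its quotes are written as UTF-8 byte escapes on the assumption that the real build output uses UTF-8.

```shell
#!/usr/bin/env bash
# Made-up sample of the build output; the curly quotes are given as UTF-8
# bytes (assumption: the real output is UTF-8 encoded).
build_output=$'* creating default NAMESPACE file\n* building \xe2\x80\x98analysis_0.1.tar.gz\xe2\x80\x99'

# $'...' expands the \x escapes before the pattern reaches the regex engine.
regex=$'.*building[[:space:]]+\xe2\x80\x98(.*)\xe2\x80\x99'

if [[ "$build_output" =~ $regex ]]; then
    # Group 1 holds the text between the two quotes.
    pkgname="${BASH_REMATCH[1]}"
else
    pkgname=""
fi
echo "$pkgname"
```

With this sample the script prints analysis_0.1.tar.gz. Keeping $regex unquoted on the right-hand side of =~ matters: quoting it would make Bash treat the pattern as a literal string.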
There are many ways to do it; this is one:
file=$(echo "$var" | grep '^\* building' | grep -o '‘.*’' | head -c -4 | tail -c +4)
echo "$file"
Find the line starting with * building (first grep)
Find the text between ‘’ (second grep)
Discard the quotes: head -c -4 drops the last 4 bytes (the 3-byte closing quote plus the newline), and tail -c +4 starts output at byte 4, skipping the 3-byte opening quote
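Negative byte counts for head -c are a GNU extension, so the pipeline above may not run with the BSD head shipped on macOS. One possible alternative, which is a swapped-in technique rather than the answerer's method, is a single sed substitution. The sample text below is made up, and the sketch assumes the script and the build output are both UTF-8.

```shell
#!/usr/bin/env bash
# Made-up sample of the last lines of the build output.
var='* creating default NAMESPACE file
* building ‘analysis_0.1.tar.gz’'

# -n suppresses the default output; the trailing p prints only lines where
# the substitution succeeded, reduced to the text captured between the quotes.
pkg=$(printf '%s\n' "$var" | sed -n 's/^\* building ‘\(.*\)’$/\1/p')
echo "$pkg"
```

This avoids byte arithmetic entirely, so it does not depend on how many bytes each quote occupies.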

How do I copy regex matches from a file? Need to get all MAC addresses from log file

I have a Linux dhcpd log from which I need to extract a list of only the MAC addresses. The MAC addresses are formatted like 00:ab:27:d8:dd:dd
Using Linux command line tools, parse the INPUT file for MAC addresses and send them to an OUTPUT file, so that the OUTPUT file is just a list of the MAC addresses from which duplicates can then be removed.
I suspect this might be a multi-step, complex command. I've searched the site and could not find a match for copying the results of a regex search. I've had mixed results getting a regular expression that even finds the MAC addresses in the file, let alone copying all of the right matches out to a file.
You can use the following command to extract unique MAC addresses:
grep -o -E '([[:xdigit:]]{1,2}:){5}[[:xdigit:]]{1,2}' /var/log/dhcpd.log | sort | uniq > unique_MAC.txt
Explanation:
This will retrieve the MAC addresses from the log:
grep -o -E '([[:xdigit:]]{1,2}:){5}[[:xdigit:]]{1,2}'
uniq only removes adjacent duplicate lines, so sort first groups identical MAC addresses together and uniq then drops the repeats:
sort | uniq
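Because uniq only collapses adjacent duplicate lines, a MAC address that reappears later in the log survives unless the list is sorted first. A small sketch of the difference, using made-up log lines (their format is an assumption, not real dhcpd output):

```shell
#!/usr/bin/env bash
# Hypothetical log lines; the first MAC address appears twice, not adjacently.
log=$'DHCPACK on 10.0.0.5 to 00:ab:27:d8:dd:dd via eth0\nDHCPACK on 10.0.0.6 to 00:ab:27:d8:ee:01 via eth0\nDHCPREQUEST for 10.0.0.5 from 00:ab:27:d8:dd:dd via eth0'

mac_regex='([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}'

# uniq alone: the repeated address is not adjacent, so it is kept.
uniq_only_count=$(printf '%s\n' "$log" | grep -o -E "$mac_regex" | uniq | wc -l | tr -d ' ')

# sort first: identical addresses become adjacent and uniq removes the repeat.
sorted_count=$(printf '%s\n' "$log" | grep -o -E "$mac_regex" | sort | uniq | wc -l | tr -d ' ')

echo "uniq only: $uniq_only_count, sort | uniq: $sorted_count"
```

With these three sample lines the first count is 3 and the second is 2.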
References:
grep
uniq

Why does aspell suggest the very word that it fails to check?

Here is the command I run:
> echo "civilization" | aspell -a
#(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6.1)
& civilization 3 0: civilization, civilizations, civilization's
Why does aspell suggest the very word ("civilization") that it flags as misspelled? In contrast, hunspell seems to get this right
> echo "civilization" | hunspell
Hunspell 1.3.2
*
but that is probably because the two spell checkers use different dictionaries.
EDIT: Running this on a different machine and different/older aspell version seems to work though:
> echo civilization | aspell -a
#(#) International Ispell Version 3.1.20 (but really Aspell 0.60.3)
*
Uppercase and lowercase
What do you get if you try it with Civilization?
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
T:\msys\1.0\src\aspell-0.60.6\.libs>echo "zivilisation" | aspell -a
#(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6)
& zivilisation 3 1: Zivilisation, Zivilisationen, Sterilisation
T:\msys\1.0\src\aspell-0.60.6\.libs>echo "Zivilisation" | aspell -a
#(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6)
*
T:\msys\1.0\src\aspell-0.60.6\.libs>
According to Kevin Atkinson (the aspell maintainer), that's a bug. He wasn't sure whether a report was open for it, or if and when it would be fixed.