Why wordstats?
Sometimes you want to know how long your amazing new novel is or what word appears most often in the Pulp Fiction screenplay. That’s where this simple little tool comes in. It replaces a really ugly shell pipeline that had multiple one-liner scripts in it (caveman lambda functions!). I think this is much friendlier to use! You get:
- No web upload. Your writing never leaves your terminal: privacy by default.
- Blazing fast, works everywhere. Use it with pipes, files, or straight from the clipboard.
- Human-friendly output. Prints a markdown-style table by default, or CSV for automation.
- Zero dependencies (besides Python 3.7+)
Usage
# Count word frequency in a file (top 10)
./wordstats.py my_notes.txt --top 10
# Pipe input
cat todo.txt | ./wordstats.py --top 20
# Output CSV for Excel or further scripting
./wordstats.py moby-dick.txt --top 50 --csv
# Show *all* words (don’t filter stopwords)
./wordstats.py speech.txt --no-stopwords --top 100
# Read from clipboard (requires paste alias; see below)
paste | ./wordstats.py
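If you're curious how flags like these can be wired up, here's a minimal sketch using argparse. It is not the script's actual parse_args: the default of 10 for --top, the help text, and the stdin fallback are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the usage examples (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Word frequency stats")
    # Optional file; when omitted, the caller would read from stdin.
    parser.add_argument("file", nargs="?", default=None,
                        help="input file (defaults to stdin)")
    # The default of 10 is an assumption, not taken from the real script.
    parser.add_argument("--top", type=int, default=10,
                        help="how many words to show")
    parser.add_argument("--csv", action="store_true",
                        help="emit CSV instead of a table")
    parser.add_argument("--no-stopwords", action="store_true",
                        help="do not filter stopwords")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["moby-dick.txt", "--top", "50", "--csv"])
    print(args.file, args.top, args.csv, args.no_stopwords)
```

Making the file argument optional is what lets the same command work with both a filename and piped input.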
Sample Output
Default report output
┌──[ grumble@shinobi ]:~/codelab/adminjitsu (main*)
└─$ cat ../dotfiles/README.md | wordstats --top 10
word         | count
-------------+------
bash         |    10
dotfiles     |     8
shell        |     8
system       |     6
arsenal      |     6
setup        |     5
youre        |     5
hostspecific |     5
bootstrapsh  |     5
sourcing     |     5
Total words: 685
Unique words: 378
Filtered words: 515
Character count: 4480
Avg word length: 6.54
Longest word: stringusersyournamedotfilesbinarsenalupdatestring (49)
Shortest word: a (1)
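The sample output suggests the text is lowercased and stripped of punctuation before counting (note youre and bootstrapsh). Here's a rough sketch of how stats like these could be computed. The tokenizing rule, the tiny STOPWORDS set, and the summarize helper are assumptions inferred from the output, not the real script's code.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the real script ships its own list.
STOPWORDS = {"a", "the", "and", "of", "to", "is"}


def tokenize(text):
    """Lowercase, drop punctuation, split on whitespace (assumed behavior)."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()


def summarize(text, top=3):
    """Return the top words (stopwords removed) and a few summary stats."""
    words = tokenize(text)
    kept = [w for w in words if w not in STOPWORDS]
    stats = {
        "total": len(words),
        "unique": len(set(words)),
        "avg_len": round(sum(map(len, words)) / len(words), 2) if words else 0.0,
        "longest": max(words, key=len, default=""),
        "shortest": min(words, key=len, default=""),
    }
    return Counter(kept).most_common(top), stats


if __name__ == "__main__":
    sample = "You're sourcing the dotfiles. The dotfiles bootstrap.sh is host-specific."
    top_words, stats = summarize(sample)
    print(top_words)
    print(stats)
```

Deleting punctuation rather than replacing it with spaces is what glues host-specific into hostspecific, which matches the report above.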
CSV output
┌──[ grumble@shinobi ]:~/codelab/adminjitsu (main*)
└─$ cat ../dotfiles/README.md | wordstats --csv --top 8
word,count
bash,10
dotfiles,8
shell,8
system,6
arsenal,6
setup,5
youre,5
hostspecific,5
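For anyone adapting the CSV mode, output in this shape can be produced with Python's built-in csv module. This is a minimal sketch, and write_csv is a hypothetical helper name rather than a function from the script.

```python
import csv
import io
import sys


def write_csv(rows, out=sys.stdout):
    """Write (word, count) pairs as CSV with a header row."""
    # lineterminator="\n" avoids the module's default "\r\n" line endings.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(rows)


if __name__ == "__main__":
    write_csv([("bash", 10), ("dotfiles", 8), ("shell", 8)])
```

Using csv.writer instead of joining strings by hand means any word containing a comma or quote would still be escaped correctly.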
Script Source
You can download the script here, or fetch it from the command line:
- with curl
curl -O https://adminjitsu.com/scripts/wordstats.py
chmod +x wordstats.py
- with wget
wget https://adminjitsu.com/scripts/wordstats.py
chmod +x wordstats.py
Customization
This script would be easy to extend or customize if needed. A few things that spring to mind:
- edit the stopwords list to control what words the script will ignore
- for other languages, you can paste in a different language set from NLTK on GitHub. You could use the files directly by changing STOPWORDS:
STOPWORDS = set(open("french_stopwords.txt").read().split())
- you can edit the default value of --top in the parse_args section to show more/fewer words.
- change column widths, table formatting, or add new columns (like relative frequency)
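Two of those ideas can be sketched quickly: loading stopwords from a file, and adding a relative-frequency column. The helper names load_stopwords and with_relative_frequency are hypothetical, and the sketch assumes word counts are held in a collections.Counter, which may not match how the script stores them.

```python
from collections import Counter


def load_stopwords(path):
    """Read a whitespace-separated stopword file into a lowercase set."""
    with open(path, encoding="utf-8") as fh:
        return {w.lower() for w in fh.read().split()}


def with_relative_frequency(counts, top=10):
    """Return (word, count, percent of counted words) rows for the top words."""
    total = sum(counts.values())
    return [
        (word, n, round(100 * n / total, 1))
        for word, n in counts.most_common(top)
    ]


if __name__ == "__main__":
    counts = Counter({"bash": 10, "dotfiles": 8, "shell": 2})
    for word, n, pct in with_relative_frequency(counts, top=2):
        print(f"{word:<12} | {n:>5} | {pct:>5}%")
```

Opening the file in a with block closes it promptly, which the one-liner above leaves to the garbage collector.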
See also
You may want to define a clip and a paste alias. This makes it simple to use the clipboard with wordstats (and tons of other programs).
- Linux
# Linux clipboard (assumes xclip is installed)
alias clip='xclip -selection clipboard'
alias paste='xclip -selection clipboard -o'
- macOS
# macOS clipboard
# for consistency with linux environment
alias clip='pbcopy'
alias paste='pbpaste'
Add those aliases to your shell startup file (e.g., .bashrc or .zshrc) and reload.
Now you can interact with the clipboard like so:
# save stats to system clipboard
wordstats.py --top 20 README.md | clip
# get stats on text IN the system clipboard
paste | wordstats.py
# use in a pipeline
cat somefile.txt | grep -i error | wordstats.py --top 10 | clip
Conclusion
That’s it: quick word stats from the CLI. Just feed it your text and get a useful response! If you have any feedback or questions, please feel free to email me: feedback@adminjitsu.com