Why wordstats?

Sometimes you want to know how long your amazing new novel is, or which word appears most often in the Pulp Fiction screenplay. That’s where this simple little tool comes in. It replaces a genuinely ugly shell pipeline built from several one-liner scripts (caveman lambda functions!), and it’s much friendlier to use. You get:

  • No web upload.
    Your writing never leaves your terminal—privacy by default.

  • Blazing fast, works everywhere.
    Use it with pipes, files, or straight from the clipboard.

  • Human-friendly output.
    Prints a markdown-style table by default, or CSV for automation.

  • Zero dependencies.
    Nothing beyond Python 3.7+ and the standard library.


Usage

# Count word frequency in a file (top 10)
./wordstats.py my_notes.txt --top 10

# Pipe input
cat todo.txt | ./wordstats.py --top 20

# Output CSV for Excel or further scripting
./wordstats.py moby-dick.txt --top 50 --csv

# Show *all* words (don’t filter stopwords)
./wordstats.py speech.txt --no-stopwords --top 100

# Read from clipboard (requires paste alias; see below)
paste | ./wordstats.py

Sample Output

Default report output

┌──[ grumble@shinobi ]:~/codelab/adminjitsu  (main*)
└─$ cat ../dotfiles/README.md | wordstats --top 10
word         | count
-------------+------
bash         | 10
dotfiles     | 8
shell        | 8
system       | 6
arsenal      | 6
setup        | 5
youre        | 5
hostspecific | 5
bootstrapsh  | 5
sourcing     | 5

Total words:      685
Unique words:     378
Filtered words:   515
Character count:  4480
Avg word length:  6.54
Longest word:     stringusersyournamedotfilesbinarsenalupdatestring (49)
Shortest word:    a (1)

CSV output

┌──[ grumble@shinobi ]:~/codelab/adminjitsu  (main*)
└─$ cat ../dotfiles/README.md | wordstats --csv --top 8
word,count
bash,10
dotfiles,8
shell,8
system,6
arsenal,6
setup,5
youre,5
hostspecific,5

Script Source

You can download the script here, or fetch it from the command line:

  • with curl

    curl -O https://adminjitsu.com/scripts/wordstats.py
    chmod +x wordstats.py
    
  • with wget

    wget https://adminjitsu.com/scripts/wordstats.py
    chmod +x wordstats.py
    

wordstats.py

  #!/usr/bin/env python3
  """
  wordstats.py — Analyze and print word frequency and text stats.

  Usage:
    wordstats.py [input.txt] [--top N] [--no-stopwords] [--csv]

  Reads from stdin if no input file is given.

  Options:
    --top N           Show top N words (default: 100)
    --no-stopwords    Do not filter out stopwords
    --csv             Output as CSV instead of ASCII table
    -h, --help        Show this help and exit
    -v, --version     Show version and exit

  Outputs:
    - Table (or CSV) of N most common words and counts
    - Total words, unique words, filtered word count
    - Character count, average word length, longest/shortest word
  """

  import sys
  import os
  import re
  import argparse
  from collections import Counter

  __version__ = "1.0.1"

  # ─── BUILT-IN STOPWORDS ──────────────────────────────────────────────────────
  # These are the standard English stopwords from wordcloud's STOPWORDS set,
  # hardcoded here for zero dependencies. Expand/edit as you like.
  STOPWORDS = set([
      "a", "about", "above", "after", "again", "against", "all", "am", "an",
      "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before",
      "being", "below", "between", "both", "but", "by", "can't", "cannot", "could",
      "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down",
      "during", "each", "few", "for", "from", "further", "had", "hadn't", "has",
      "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her",
      "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's",
      "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it",
      "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself",
      "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought",
      "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she",
      "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than",
      "that", "that's", "the", "their", "theirs", "them", "themselves", "then",
      "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've",
      "this", "those", "through", "to", "too", "under", "until", "up", "very", "was",
      "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what",
      "what's", "when", "when's", "where", "where's", "which", "while", "who",
      "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you",
      "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"
  ])

  # ─── PARSE ARGUMENTS ─────────────────────────────────────────────────────────
  def parse_args():
      """Parse command line arguments and handle help/version for help2man."""
      p = argparse.ArgumentParser(
          description="Show word frequency and text stats from text input."
      )
      p.add_argument('input', nargs='?', help='Input file (or stdin)')
      p.add_argument('--top', type=int, default=100, help='Show top N words (default: 100)')
      p.add_argument('--no-stopwords', action='store_true', help='Do not filter out stopwords')
      p.add_argument('--csv', action='store_true', help='Output CSV instead of table')
      p.add_argument('-v', '--version', action='store_true', help='Show version and exit')
      return p.parse_args()

  # ─── LOAD AND CLEAN TEXT ─────────────────────────────────────────────────────
  def load_text(path):
      """
      Load text from a file or stdin.
      If no path is given and stdin is not a TTY, read stdin.
      """
      if not sys.stdin.isatty() and not path:
          return sys.stdin.read()
      elif path:
          with open(path, encoding="utf-8") as f:
              return f.read()
      else:
          print(__doc__)
          sys.exit(1)

  def clean_words(text):
      """
      Lowercase, strip punctuation/digits, split into words.
      Returns list of words.
      """
      text = text.lower()
      text = re.sub(r"[^\w\s]", "", text)
      text = re.sub(r"\d+", "", text)
      return text.split()

  # ─── OUTPUT FORMATTING ───────────────────────────────────────────────────────
  def print_table(rows, headers=None):
      """
      Print an ASCII table (like markdown) for terminal output.
      """
      all_rows = [headers] + rows if headers else rows
      col_widths = [max(len(str(x)) for x in col) for col in zip(*all_rows)]
      if headers:
          print(" | ".join(str(h).ljust(w) for h, w in zip(headers, col_widths)))
          print("-+-".join("-" * w for w in col_widths))
      for row in rows:
          print(" | ".join(str(v).ljust(w) for v, w in zip(row, col_widths)))

  # ─── MAIN LOGIC ──────────────────────────────────────────────────────────────
  def main():
      args = parse_args()

      # --help and --version for help2man compatibility
      if args.version:
          print(f"wordstats.py {__version__}")
          sys.exit(0)

      text = load_text(args.input)
      words = clean_words(text)

      total_words = len(words)
      unique_words = len(set(words))
      char_count = sum(len(w) for w in words)
      avg_wordlen = (char_count / total_words) if total_words else 0
      longest = max(words, key=len) if words else ""
      shortest = min(words, key=len) if words else ""

      # ─── STOPWORDS FILTERING (NOW ZERO DEPENDENCIES) ─────────────────────────
      # If the user didn't pass --no-stopwords, filter using the built-in set.
      # clean_words() strips apostrophes ("you're" -> "youre"), so normalize the
      # stopword set the same way, or contractions would never be filtered.
      stops = {re.sub(r"[^\w]", "", s) for s in STOPWORDS} if not args.no_stopwords else set()
      filtered = [w for w in words if w not in stops]

      freq = Counter(filtered)
      most_common = freq.most_common(args.top)

      # ── Output ──
      if args.csv:
          print("word,count")
          for word, count in most_common:
              print(f"{word},{count}")
      else:
          print_table([(w, c) for w, c in most_common], headers=["word", "count"])
          print()
          # Extra stats for the curious
          print(f"Total words:      {total_words}")
          print(f"Unique words:     {unique_words}")
          print(f"Filtered words:   {len(filtered)}")
          print(f"Character count:  {char_count}")
          print(f"Avg word length:  {avg_wordlen:.2f}")
          print(f"Longest word:     {longest} ({len(longest)})")
          print(f"Shortest word:    {shortest} ({len(shortest)})")

  # ─── ENTRYPOINT ──────────────────────────────────────────────────────────────
  if __name__ == "__main__":
      main()
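
If you want to poke at the tokenization logic interactively, the core pipeline boils down to a few lines (a minimal sketch using a tiny stand-in stopword set, not the full built-in one):

```python
import re
from collections import Counter

STOPWORDS = {"the", "on", "a"}  # tiny subset, for illustration only

def clean_words(text):
    # Same cleaning steps as the script: lowercase, drop punctuation and digits.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)
    return text.split()

words = clean_words("The cat sat on the mat; the cat purred.")
filtered = [w for w in words if w not in STOPWORDS]
print(Counter(filtered).most_common(2))  # [('cat', 2), ('sat', 1)]
```

Note that Counter.most_common breaks ties by first-seen order, which is why "sat" edges out "mat" here.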

Customization

This script would be easy to extend or customize if needed. A few things that spring to mind:

  • edit the stopwords list to control what words the script will ignore
  • for other languages, paste in a stopword list for that language (e.g. one of the files from NLTK's stopwords corpus on GitHub). You can load such a file directly by reassigning STOPWORDS :
    STOPWORDS = set(open("french_stopwords.txt", encoding="utf-8").read().split())
    
  • edit the default value of --top in parse_args() to show more or fewer words by default.
  • change column widths, table formatting, or add new columns (like relative frequency)
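
For instance, a relative-frequency column could be added with a couple of lines. This is a hypothetical sketch: `filtered` stands in for the post-stopword word list from main(), and the resulting rows would be passed to print_table with a third header:

```python
from collections import Counter

# Stand-in for the post-stopword word list built in main().
filtered = ["cat", "cat", "sat", "mat"]

freq = Counter(filtered)
total = len(filtered)
# Each row gains a percentage of the filtered total as a third column.
rows = [(w, c, f"{100 * c / total:.1f}%") for w, c in freq.most_common(3)]
print(rows)  # [('cat', 2, '50.0%'), ('sat', 1, '25.0%'), ('mat', 1, '25.0%')]
```

You would then call print_table(rows, headers=["word", "count", "freq %"]) in place of the existing two-column call.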

See also

You may want to define clip and paste aliases. They make it simple to use the clipboard with this tool (and plenty of others):

  • Linux

    # Linux clipboard (assumes xclip is installed)
    alias clip='xclip -selection clipboard'
    alias paste='xclip -selection clipboard -o'
    
  • macOS

    # macOS clipboard
    # named for consistency with the Linux aliases above
    alias clip='pbcopy'
    alias paste='pbpaste'
    

Add those aliases to your shell startup (e.g., .bashrc or .zshrc) and reload.

Now you can interact with the clipboard like so:

# save stats to system clipboard
wordstats.py --top 20 README.md | clip

# get stats on text IN the system clipboard
paste | wordstats.py

# use in a pipeline
cat somefile.txt | grep -i error | wordstats.py --top 10 | clip

Conclusion

That’s it: quick word stats from the CLI. Feed it your text and get a useful report. For feedback or questions, please feel free to email me: feedback@adminjitsu.com