How To Count Instances Of A Particular Word In A Website


I wanted to count how many times a particular word was used on a website. Here's how I did it.

This is a solution for a Linux-based computer, working on the command line. (It might work on a Mac too; I need to test that.)

First grab a copy of the website

wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng https://example.com

(replace https://example.com with the address of the site you want to search)

wget will grab all the files from the site except those listed and put them in a folder named after the site.

-m turns on options suitable for mirroring i.e. copying the whole website.

-q quiet mode, don't show any feedback while the site is being downloaded

-E adjusts the extension, so files not ending in .html are given that extension (not strictly necessary, but useful if we want to browse the files locally).

-R tells wget to reject files with the listed extensions, so images, archives and other media are skipped. (The recursive download itself is turned on by -m.)
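Once the mirror finishes, it's worth confirming what was actually fetched. A quick sketch, assuming wget created a folder named after the site's hostname (the folder and files below are made up here purely for illustration):

```shell
# After mirroring, wget leaves a folder named after the site's hostname.
# Hypothetical layout, created here just so the command has something to count:
mkdir -p example.com/about
touch example.com/index.html example.com/about/team.html

# Count the HTML pages in the local copy.
find example.com -type f -name '*.html' | wc -l    # prints 2
```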

So then when you have a copy of the site locally, change into that directory, and run (if you are searching for the word Leeds):

grep -rohi Leeds . | wc -w

So grep searches the files for the pattern; by default it returns the whole lines containing a match, but the flags below change that:

-r is for recursive, i.e. search all the folders too as well as files

-o show only the part of each line that matches the pattern, i.e. just the word Leeds

-h suppress the prefixing of file names on output when multiple files are searched; we aren't interested in the filenames

-i ignore case; we want every occurrence, not just matches in the exact case we typed

The | wc -w pipes all the matched words from grep to wc (word count), which counts them.
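Before running it against the full mirror, the pipeline can be sanity-checked on a couple of throwaway files (the filenames and contents below are made up for illustration). One caveat worth knowing: without grep's -w flag, the count also includes the word appearing inside longer words:

```shell
# Throwaway test files (hypothetical content, for illustration only).
mkdir -p demo/sub
printf 'Leeds is in Yorkshire. leeds again.\n' > demo/a.html
printf 'Visiting LEEDS and Leedsville.\n' > demo/sub/b.html

# The pipeline from above: case-insensitive, and it also
# counts "Leeds" inside longer words such as "Leedsville".
grep -rohi Leeds demo | wc -w    # prints 4

# Adding -w restricts grep to whole-word matches only.
grep -rohiw Leeds demo | wc -w   # prints 3
```

Whether you want -w depends on the question: for brand-name counting, whole-word matches are usually what you mean.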


Author: Mike Nuttall

Mike has been web designing, programming and building web applications in Leeds for many years. He founded Onsitenow in 2009 and has been helping clients turn business ideas into on-line reality ever since. Mike can be followed on Twitter and has a profile on Google+.