How To Count Instances Of A Particular Word In A Website
Posted by Mike Nuttall
I wanted to count how many times a particular word was used on a website, so how did I do it?
This is a solution for a Linux-based computer, working on the command line. (It might also work on a Mac, but I need to test that.)
First grab a copy of the website
wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng http://onsitenow.co.uk
wget will grab all the files from the site except those listed and put them in a folder named after the site.
-m turns on options suitable for mirroring i.e. copying the whole website.
-q quiet mode, don't show any feedback while the site is being downloaded
-E adjusts the extension, so downloaded HTML pages whose names don't end in .html are given that extension (not entirely necessary, but useful if we want to browse the files locally).
-R tells wget to reject, i.e. not download, files with the file extensions listed. (The recursive download through all the folders comes from -m.)
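If the short flags are hard to remember, the same command can be spelled with wget's long option names. This is only a sketch: the variable names reject and site are my own, and the command is printed rather than run so you can check it first. Note that the reject list must not contain any spaces, otherwise wget treats everything after the space as another address to fetch.

```shell
# Reject list from the post, written with no spaces so wget reads it as one argument.
reject="jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng"
site="http://onsitenow.co.uk"

# Long-option spelling of: wget -m -q -E -R "$reject" "$site"
# Printed rather than run here; remove the leading "echo" to start the download.
echo wget --mirror --quiet --adjust-extension --reject "$reject" "$site"
```

Keeping the list in a variable also makes it easy to add or remove an extension without retyping the whole line.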
So then when you have a copy of the site locally, change into that directory, and run (if you are searching for the word Leeds):
grep -rohi Leeds . | wc -w
grep searches the files for a pattern and, by default, prints each line that contains it.
-r is for recursive, i.e. search inside the folders as well as the files
-o show only the part of the line that matches the pattern, so just the word Leeds, with each match on its own output line
-h suppress the prefixing of file names on output when multiple files are searched, we aren't interested in the filenames
-i ignore case, we want every occurrence, not just exact matches
The | wc -w sends all the words from grep to wc (word count) and counts them.
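To try the counting step without downloading anything, here is a small sketch that builds a throwaway folder with two made-up pages and runs the same pipeline on it. The folder, file names and page text are invented for the demonstration and are not from the real site.

```shell
# Build a tiny throwaway "site" so the pipeline can be tried safely.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/about"
printf 'Welcome to Leeds. We love leeds!\n' > "$demo_dir/index.html"
printf 'Based in LEEDS, West Yorkshire.\n' > "$demo_dir/about/index.html"

# Same pipeline as above, pointed at the demo folder instead of "."
count=$(grep -rohi Leeds "$demo_dir" | wc -w)
echo "Leeds appears $count times"

# Tidy up the demo folder.
rm -rf "$demo_dir"
```

The two pages contain Leeds, leeds and LEEDS, so the count comes out as 3. Bear in mind that the pattern also matches inside longer words; adding grep's -w option restricts it to whole words if that matters for your count.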