Spell checking an entire site

This is a Perl script I wrote (with some help from Stray Taoist) to spell check a website for Nelly. Steve was asking to have a look at it for something in work, so I thought I would put it here.


#!/usr/bin/perl

use warnings;
use strict;

use WWW::Mechanize;
use Lingua::Ispell qw( :all );
$Lingua::Ispell::path = '/usr/bin/ispell';
Lingua::Ispell::allow_compounds(1);
Lingua::Ispell::use_dictionary('/path/to/ispell/lib/english.hash');
Lingua::Ispell::use_personal_dictionary('/path/to/.ispell_custom');
use HTML::Element;
no warnings 'redefine';

my $target = 'http://www.site.com/'; # the site you want to spellcheck

our $nillio = [];
# Replace HTML::Element::as_text with a version that appends a space after
# each text fragment, so words from adjacent elements are not run together.
local *HTML::Element::as_text = sub {
  my ($this, %options) = @_;
  my $skip_dels = $options{'skip_dels'} || 0;
  my (@pile) = ($this);
  my $tag;
  my $text = '';
  while (@pile) {
    if (!defined($pile[0])) { # undef! drop it, or the loop never ends
      shift @pile;
    } elsif (!ref($pile[0])) { # text bit!  save it!
      $text .= shift(@pile) . ' ';
    } else { # it's a ref -- traverse under it
      unshift @pile, @{$this->{'_content'} || $nillio} unless
        ($tag = ($this = shift @pile)->{'_tag'}) eq 'style'
        or $tag eq 'script'
        or ($skip_dels and $tag eq 'del');
    }
  }
  return $text;
};

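sub spellCheck {
  my ($s, $mech) = @_;
  # fetch the page; skip it if the request fails or it is not HTML
  eval { $mech->get($s); 1 } or return;
  return unless $mech->success && $mech->is_html;
  my $text = $mech->content( format => 'text' );
  $text =~ s/\s+/ /g; # ispell reads one line at a time, so flatten newlines
  # one line per result: result type, the word, the page URL
  for my $r ( spellcheck( $text ) ) {
    print "$r->{'type'}\t$r->{'term'}\t$s\n";
  }
}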

my $mech = WWW::Mechanize->new();

$mech->get( $target );

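my %seen;
for my $url ($mech->links) {
  my $abs = $url->url_abs->as_string;
  next if $abs !~ /^\Q$target\E/; # only follow links that stay on the target site
  next if $seen{$abs}++;          # skip pages we have already checked
  spellCheck($abs, $mech);
  # sleep 10;
}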


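The script fetches the target page, follows every link on it that points back into the same site and spell checks each of those pages, outputting a tab-separated list (result type, word, page URL) for you to use in Excel or whatever. As written it only goes one level deep, so pages that aren't linked from the target page won't be checked.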
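
If the report gets long, a quick tally of the flagged words can be handy. This is only a sketch: summarise_report is a name I've made up, and it assumes the first column holds Lingua::Ispell result types where miss, guess and none mark a problem word and anything else (ok, for instance) is fine.

```shell
# Tally the flagged words in a spell-check report read from stdin.
# Input:  type<TAB>term<TAB>url, one result per line.
# Output: count<TAB>term, most frequent first.
# Assumption: only the types miss, guess and none mark a problem word.
summarise_report() {
  awk -F '\t' '
    $1 == "miss" || $1 == "guess" || $1 == "none" { count[$2]++ }
    END { for (word in count) print count[word] "\t" word }
  ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Hypothetical usage, assuming the Perl script is saved as spellcheck.pl:
#   perl spellcheck.pl > report.tsv
#   summarise_report < report.tsv
```

Words that turn up on lots of pages are usually names or jargon, and are good candidates for the personal dictionary rather than for fixing.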


Sunset Over Down

Sunset over the Mourne Mountains in County Down from Scrabo Hill, Newtownards


Salutations in Emails

From Bobulate, Anatomy of a Salutation:

“Just as you wouldn't ignore body language that indicates whether someone is intending to shake your hand or high-five you, nor should you ignore email-greeting intentions—no matter how well you know someone.”

I am forever procrastinating over how to start an email to persons unknown to me, and even those known to me.


On firing a tin of ravioli at a star destroyer:

“At around .998 c, the impacting ravioli begins to behave less like ravioli and more like an extremely intense radiation beam. Protons in the water of the ravioli begin to successfully penetrate the nuclei of the hull metal. Thermonuclear interactions, such as hydrogen fusion, may take place in the tomato sauce.”


Table Display Table Not Block

I've just spent half an hour reducing a test case for a browser bug in Camino related to table layout, only to find it's not a bug at all but my own ignorance. Today's lesson for me is that a table is only a table when it is set to 'display:table'.

So, if you're styling a table in HTML, be careful not to use tr {display:block}, as the browser will no longer treat the table as a table (even though the rule is only on the tr). The consequences are subtle layout differences from what you'd expect of a table (test case here, view with Camino: the width of the last cell in the last row is so ... odd).


Rigginzilla

Riggins, Black Labrador


Riggins is a 3-month-old Black Labrador pup. He moved in with us at the start of December, and while he pays no rent he is probably the most popular guy in the house. He is pictured above having just destroyed a roll of kitchen towels he found on the sofa.


HTML5: Article vs Section

Two new elements have been defined in the draft HTML5 spec: section and article. I'll be honest and say that when I first sat down to use these in a document I wasn't entirely clear what the difference was. Lachlan Hunt's article 'Preview of HTML5' on A List Apart is useful in understanding the difference: an article is a specialised type of section that is independent and self-contained.

From messing around I think that a clue to the difference between them is that an article will almost certainly have a byline whereas a section is less likely to.

For now I'm going to work on the rule of thumb that unless I can clearly identify something as an article I'll use a section in the first instance.


Styling Paragraphs in CSS

Recently I've been paying closer attention to the styling of paragraphs in CSS. The typographer's bible, Robert Bringhurst's "The Elements of Typographic Style", recommends setting the first paragraph flush left and any subsequent paragraphs in continuous text with an indent of at least one en (there is no en unit in CSS; an en is simply 0.5em), with one em or one lead being the most common variants.


We could write up these guidelines in CSS as:

p { text-indent:0; margin:1em 0 0 0 }
p+p { text-indent:1em; margin:0; padding:0 }

In CSS the + is the adjacent sibling combinator: it matches any element on the right side of the + that shares a parent with, and is immediately preceded by, an element on the left side of the +. So p+p matches every paragraph that directly follows another paragraph.

Where a body of text is not continuous and you wish to mark a break in the flow without resorting to DIVs, the HR element is your friend. In the current draft of the HTML5 spec the HR element is defined as:

The hr element represents a paragraph-level thematic break, e.g. a scene change in a story, or a transition to another topic within a section of a reference book.

Simply style your HR to have no appearance (no border and no height, say) and the above CSS will take care of the rest. As the paragraph following the HR is no longer adjacent to another paragraph element but to an HR element, the p+p rule does not apply to it, so this new paragraph will be styled flush left with a margin above it.


You may have noticed that my writing style of CSS above involves placing all declarations on one line. I find this easier to read and edit, as I can see more selectors at once and without interruption; traversing these lengthier lines is trivial in any capable text editor.


Gitweb on Mac OS X

This is a very short HOWTO on installing Gitweb on Mac OS X. Gitweb is a functional web interface to Git, the version control system, which allows you to browse your Git repositories using a web browser.

In this HOWTO I've made some assumptions about your set-up: 1) Git is already installed on your machine, 2) Apache is your web server and 3) you haven't altered the default Apache vhost conf or the hosts file entry for localhost.


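In your home dir (or wherever you build stuff from) try the following. $PATH_TO_PROJECTROOT is wherever you keep your Git repositories; it's OK if some repositories are actually a few directories beneath that root, Gitweb will find them. $PATH_TO_GIT_BINARY is, as far as I can tell, expected to be the directory that contains your git executable rather than the executable itself: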

git clone git://git.kernel.org/pub/scm/git/git.git
cd git
make GITWEB_PROJECTROOT=$PATH_TO_PROJECTROOT \
  GITWEB_CSS="/gitweb/gitweb.css" \
  GITWEB_LOGO="/gitweb/git-logo.png"  \
  GITWEB_FAVICON="/gitweb/git-favicon.png"  \
  bindir=$PATH_TO_GIT_BINARY

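# create both target directories first (they are root-owned, hence sudo)
sudo mkdir -p /Library/WebServer/Documents/gitweb /Library/WebServer/CGI-Executables/gitweb
sudo cp gitweb/gitweb.css gitweb/git-logo.png gitweb/git-favicon.png /Library/WebServer/Documents/gitweb/
sudo cp gitweb/gitweb.cgi /Library/WebServer/CGI-Executables/gitweb/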

Visit http://localhost/cgi-bin/gitweb/gitweb.cgi and you should have a web interface to your git repositories.
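
If you're not sure what to use for $PATH_TO_GIT_BINARY in the make command above, something like the following should find it. It's a small sketch, assuming git is on your PATH and that bindir wants the directory holding the git executable; bindir_of is a made-up helper name, not part of Git.

```shell
# Print the directory that holds an executable found on PATH.
# Returns non-zero (and prints nothing) if the command cannot be found.
bindir_of() {
  exe_path=$(command -v "$1") || return 1
  dirname "$exe_path"
}

# Hypothetical usage before running make:
#   PATH_TO_GIT_BINARY=$(bindir_of git)
```

If git turns out to be a shell alias or function on your machine this will not give a useful answer, in which case just fill the path in by hand.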