drupal

Python script to tidy up ugly MS FrontPage HTML

Posted on May 27th, 2010 by roderik

drupal
tech

After writing this script, I believe I can now safely say I know Python (at least beginner level)...

Basic need/use

I needed it to clean up HTML documents created by (a very old version of?) Microsoft FrontPage, as I was working on converting a website containing 3000+ existing FrontPage pages. When I tried to import them using Drupal's import_html module (which runs HTMLTidy on the source pages), it turned out that the documents were 'tidied' in the wrong manner. The cause is totally illegal constructs like the following:

<body bgcolor="#FFFFFF" text="#000000" link="#993300" vlink="#666600" alink="#CC3300">
<!--mstheme--><font face="Book Antiqua, Times New Roman, Times">
<p>( ...navigation buttons and more stuff... )</p>
<h3><!--mstheme--><font color="#660033">some subtitle<!--mstheme--></font></h3>
 
<div align="center">
  <center>
  <!--mstheme--></font><table border="0" width="600">
( ...table contents... )

<!--mstheme--><font face="Book Antiqua, Times New Roman, Times">
<b>
<p ALIGN="left"><font color="#000000">Schulte-Stracke, Peter, </font><font FACE="Garamond" SIZE="4" COLOR="#000000">
<a href="remarks_on_rethinking.htm">Remarks on `Rethinking Puberty'</a> by McClintock and Herdt; </font></b>
<font COLOR="#000000">About</font><i><font COLOR="#000000">Rethinking Puberty : The Development of blah.--</font></i>
<font COLOR="#000000"> Current Developments, by blah<br></font>The view that puberty is blah....</p>
        <!--mstheme--></font>

I mean, come on... Nested font tags?
A font start tag outside a 'div' and its end tag inside it!?
Switched 'b' and 'p' start tags?
It's a miracle that browsers actually render this stuff correctly! Unfortunately, HTMLTidy gets confused: for example, it converts the b/p mess to <b></b><p ALIGN="left">, so the text inside the paragraph is no longer bold.

So I had to write code that performs pure string operations on the document (most importantly, stripping out useless font tags) before the string can be read into a correct parse tree.
But while I was at it, I decided to clean up the parsed HTML too: convert ancient tags, delete unnecessary ones, convert tables into ul/li constructs... I kept discovering new things to do, and after a few weeks I had a pretty big Python script.
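To illustrate the string-level preprocessing step described above, here is a minimal sketch. It is not taken from the actual script: the names FONT_TAG, MSTHEME and strip_font_tags are made up for this example, removing the <!--mstheme--> marker comments as well is my assumption, and it is written for a current Python 3 rather than the Python 2 the script was written in.

```python
import re

# Matches any opening or closing <font> tag, regardless of attributes or case.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)
# Matches the <!--mstheme--> marker comments (assumption: these can go too).
MSTHEME = re.compile(r"<!--\s*mstheme\s*-->", re.IGNORECASE)


def strip_font_tags(html):
    """Remove font tags and mstheme markers using plain string operations,
    so that no parser ever sees the badly nested tags."""
    html = MSTHEME.sub("", html)
    return FONT_TAG.sub("", html)


sample = '<h3><!--mstheme--><font color="#660033">some subtitle<!--mstheme--></font></h3>'
print(strip_font_tags(sample))  # prints: <h3>some subtitle</h3>
```

Because this only deletes tags and never tries to pair them up, it does not matter that a font start tag sits outside a 'div' while its end tag sits inside it.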

Usage/details

Basic usage is simple: you call it with a filename (the HTML document) as the first argument, and it prints the cleaned-up HTML on stdout.

If you use this script, you should really check whether the end result is what you expect. The script will run fine, but you will probably not be able to use it as-is. It may perform too many or too few 'cleanup' actions, and these may need to be modified for each set of HTML documents.
So it's just a blob of top-down functional code. Although I did create some functions, the code has not been neatly separated. Some variables (regex objects) are just global, even though they're referenced in functions. I say good luck with it, and if anyone needs to clean up ancient MS FrontPage HTML like I did, maybe you will find this a useful reference.

This script uses the BeautifulSoup Python library for manipulating the HTML. You must use version 3.0.8 or higher; earlier versions had a nasty bug affecting, among others, the extract() function, which makes the script choke on any reasonably sized HTML document (either yielding weird errors inside a seemingly unrelated library function, or just wasting CPU forever in an internal loop). Version 3.1 (at least <= v3.1.0.1) will not work, since it still has the bug. I created a patch for v3.1 which fixes this (see below), but I did not submit the patch and switched back to using 3.0.8 myself, since the BeautifulSoup author says v3.1 will be discontinued.
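The version constraints above could be checked up front with something like the following sketch. The helper names parse_version and is_usable_version are hypothetical and not part of the script or of BeautifulSoup; the version string is passed in as a plain argument, and nothing is claimed here about 3.1 releases later than 3.1.0.1.

```python
def parse_version(version):
    """Turn a dotted version string such as '3.0.8' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_usable_version(version):
    """Return True for versions the text above describes as working:
    3.0.8 or higher, but not 3.1 up to and including 3.1.0.1."""
    v = parse_version(version)
    if v < (3, 0, 8):
        return False  # too old: has the bug affecting extract()
    if (3, 1) <= v <= (3, 1, 0, 1):
        return False  # 3.1 series known to still have the bug
    return True


for candidate in ("3.0.7", "3.0.8", "3.1.0.1"):
    print(candidate, is_usable_version(candidate))
```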

In order to integrate with the Drupal import_html module, that module needed a 'hook' of sorts, to run this script before running HTMLTidy. I created a patch for this (which hasn't been included in the module so far).

#!/usr/bin/python
 
# Read old MS FrontPage HTML document and tidy it up.
# Contains site specific functions, so the script will need to be changed somewhat
# for every site.
# Version: 20100414/ploog+ipce.info
 
from optparse import OptionParser
import os
import re
from BeautifulSoup import BeautifulSoup, Tag, NavigableString, Comment
 
# We have no options as yet, but still this is a convenient way of printing usage
#usage = "usage: %prog [options] arg1 arg2"
a = OptionParser(usage = "usage: %prog htmlfile",
                 description = "filename should be a HTML file")
(optio

Casually inspecting/editing PHP files? Install XDebug with your Vim!

Posted on November 21st, 2009 by roderik

drupal
tech

Note: this is not a tutorial on what XDebug is and exactly how it works, nor an install howto. If you want that, then please read this post on box.net. This post is more like a 'declaration of awesomeness'. I just wanted to blog about it - and I think this thing needs more 'advertising'. Maybe this illustration will give someone who's googling for 'vim' and 'xdebug' an even better idea of what this is about.

My story:

Finally I can tweet in style... (using my own domain)

Posted on November 21st, 2009 by roderik

drupal
tech

I have my own 'ShortURL service' now!

But, very much in the Roderik spirit, when investigating this one task, I discovered that it could be done better if I tackled another task first. But I wanted that other thing done right, so I had to do another thing at the same time. Etc.

Drupal Bazaar repository

Posted on October 14th, 2009 by roderik

drupal
tech

NOTE: do not use the 6-patched tree for now, unless you are sure you'll stick with it. At the moment, the update sequence # for system.module is one higher than in the original sources! This means that if you switch back to the original sources, you must run "UPDATE system SET schema_version=schema_version-1 WHERE name='system'" on your database, or you will miss the next update in Drupal Core stable that comes out. I'll fix this at some point.
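To show what that UPDATE statement does, here is a small self-contained sketch that runs it against a throwaway in-memory SQLite database. Everything around the statement is an assumption for illustration only: a real Drupal site would use its own database server, the table here has only the two columns the statement touches, and the schema_version numbers are invented.

```python
import sqlite3

# Throwaway stand-in for the Drupal 'system' table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE system (name TEXT PRIMARY KEY, schema_version INTEGER)")
conn.executemany(
    "INSERT INTO system VALUES (?, ?)",
    [("system", 6056), ("node", 6000)],  # hypothetical version numbers
)


def roll_back_system_schema(connection):
    """Lower the recorded schema version of system.module by one,
    leaving every other module's row untouched."""
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute(
            "UPDATE system SET schema_version = schema_version - 1 "
            "WHERE name = 'system'"
        )
    return cursor.rowcount


changed = roll_back_system_schema(conn)
versions = dict(conn.execute("SELECT name, schema_version FROM system"))
print(changed, versions["system"], versions["node"])  # prints: 1 6055 6000
```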

Roderik Muit's webcorner