
Python script to tidy up ugly MS Frontpage HTML

Posted on May 27th, 2010 by roderik

in

drupal
tech

After writing this script, I believe I can now safely say I know Python (at least beginner level)...

Basic need/use

I needed it for cleaning up HTML documents created by (a very old version of?) Microsoft Frontpage, as I was working on converting a website containing 3000+ existing MS Frontpage pages. When I tried to import them using Drupal's import_html module (which runs HTMLTidy on the source pages), it turned out that the documents were 'tidied' in the wrong manner. The cause is totally illegal constructs like the following:

<body bgcolor="#FFFFFF" text="#000000" link="#993300" vlink="#666600" alink="#CC3300">
<!--mstheme--><font face="Book Antiqua, Times New Roman, Times">
<p>( ...navigation buttons and more stuff... )</p>
<h3><!--mstheme--><font color="#660033">some subtitle<!--mstheme--></font></h3>
 
<div align="center">
  <center>
  <!--mstheme--></font><table border="0" width="600">
( ...table contents... )

<!--mstheme--><font face="Book Antiqua, Times New Roman, Times">
<b>
<p ALIGN="left"><font color="#000000">Schulte-Stracke, Peter, </font><font FACE="Garamond" SIZE="4" COLOR="#000000">
<a href="remarks_on_rethinking.htm">Remarks on `Rethinking Puberty'</a> by McClintock and Herdt; </font></b>
<font COLOR="#000000">About</font><i><font COLOR="#000000">Rethinking Puberty : The Development of blah.--</font></i>
<font COLOR="#000000"> Current Developments, by blah<br></font>The view that puberty is blah....</p>
        <!--mstheme--></font>

I mean, come on... Nested font tags?
Font start tag outside a 'div' & end tag inside it!??
Switched 'b' and 'p' start tags?
It's a miracle that browsers actually render this stuff correctly! Unfortunately, HTMLTidy gets confused (and e.g. converts the b/p mess to <b></b><p ALIGN="left">, so the outcome will be non-bold text inside the paragraph.)

So I had to write code which does pure string operations (especially: just strips out useless font tags) on the document, before the string can be read into a correct parse tree.
But while I was at it, I decided to clean up the parsed HTML too, convert ancient tags, delete unnecessary ones, convert tables into ul/li constructs... Every time I discovered something new to do, and after a few weeks I had a pretty big Python script.
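
As a rough illustration of the 'pure string operations' step described above, here is a minimal sketch using only Python's re module. The names FONT_TAG, MSTHEME and strip_font_tags are made up for this example and are not taken from the actual script; the sketch also assumes that no attribute value contains a '>' character.

```python
import re

# Hypothetical names for illustration; not the regexes from the real script.
# FONT_TAG matches <font ...> and </font> in any letter case.
FONT_TAG = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)
# MSTHEME matches the marker comments that FrontPage sprinkles around.
MSTHEME = re.compile(r'<!--mstheme-->')

def strip_font_tags(html):
    """Remove font tags and mstheme comments with plain string substitution."""
    html = MSTHEME.sub('', html)
    return FONT_TAG.sub('', html)

# A shortened version of the broken markup shown earlier.
sample = ('<!--mstheme--><font face="Book Antiqua">'
          '<h3><!--mstheme--><font color="#660033">some subtitle'
          '<!--mstheme--></font></h3><!--mstheme--></font>')
cleaned = strip_font_tags(sample)
```

Because this works on the raw string, it does not care whether the font tags are properly nested, which is exactly why it can run before any parser sees the document.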

Usage/details

Basic usage is simple: you call it with a filename (the HTML document) as the first argument, and it prints the cleaned-up HTML on stdout.

If you use this script, you should really check whether the end result is what you expect. The script will run fine, but you will probably not be able to use it as-is. It may perform too many or too few 'cleanup' actions, which may need to be modified for each set of HTML documents.
So it's just a blob of top-down code that does the job. Although I did create some functions, the code has not been neatly separated. Some variables (regex objects) are just global, even though they're referenced in functions. I say good luck with it, and if anyone needs to clean up ancient MS Frontpage HTML like I did, maybe you will find this a useful reference.

This script uses the BeautifulSoup Python library for manipulating the HTML. You must use version 3.0.8 or higher; earlier versions had a nasty bug affecting, among other things, the extract() function, which makes the script choke on any reasonably sized HTML document (either yielding weird errors inside a seemingly unrelated library function, or just wasting CPU forever in an internal loop). Version 3.1 (at least up to and including v3.1.0.1) will not work, since it still has the bug. I created a patch for v3.1 which fixes this (see below), but I did not submit the patch and switched back to using 3.0.8 myself, since the BeautifulSoup author says v3.1 will be discontinued.
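
To make the version constraints above concrete, here is a small sketch that encodes only what the paragraph states: older than 3.0.8 is out, and 3.1 up to 3.1.0.1 is out. The helper names parse_version and is_usable_version are hypothetical. The version string is passed in as an argument; reading it from the installed module (for instance from a `__version__` attribute) is an assumption left to the caller. Versions not mentioned in the text are untested, so a True result for them means nothing more than 'not known to be broken'.

```python
def parse_version(text):
    """Turn a string like '3.0.8' into (3, 0, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.split('.'))

def is_usable_version(text):
    """Apply the constraints from the text above; other versions are untested."""
    version = parse_version(text)
    if version < (3, 0, 8):
        # Older releases have the extract() bug.
        return False
    if (3, 1) <= version <= (3, 1, 0, 1):
        # The 3.1 series, at least up to 3.1.0.1, still has the bug.
        return False
    return True
```

Comparing tuples of integers avoids the classic string-comparison trap where '3.0.10' would sort before '3.0.8'.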

In order to integrate with the Drupal import_html module, that module needed a 'hook' of sorts, to run this script before running HTMLTidy. I created a patch for this (which hasn't been included in the module so far).

#!/usr/bin/python
 
# Read old MS FrontPage HTML document and tidy it up.
# Contains site specific functions, so the script will need to be changed somewhat
# for every site.
# Version: 20100414/ploog+ipce.info
 
from optparse import OptionParser
import os
import re
from BeautifulSoup import BeautifulSoup, Tag, NavigableString, Comment
 
# We have no options as yet, but still this is a convenient way of printing usage
#usage = "usage: %prog [options] arg1 arg2"
a = OptionParser(usage = "usage: %prog htmlfile",
                 description = "filename should be a HTML file")
(optio

Casually inspecting/editing PHP files? Install XDebug with your Vim!

Posted on November 21st, 2009 by roderik

in

drupal
tech

Note: this is not a tutorial on what XDebug is and exactly how it works, or an install howto. If you want that, then please read this post on box.net. This post is more like a 'declaration of awesomeness'. I just wanted to blog about it, and I think this thing needs more 'advertising'. Maybe this illustration will give someone who's googling for 'vim' and 'xdebug' an even better idea of what this is about.

My story:

Finally I can tweet in style... (using my own domain)

Posted on November 21st, 2009 by roderik

in

drupal
tech

I have my own 'ShortURL service' now!

But, very much in the Roderik spirit, when investigating this one task, I discovered that it could be done better if I tackled another task first. But I wanted that other thing done right, so I had to do another thing at the same time. Etc.

Struggle with technology continued: Bluetooth & iPhone

Posted on November 19th, 2009 by roderik

in

tech

Continuing on the 'battle with technology':

First of all: I've stumbled over a recent much-commented blog post from Jason Kasper titled "I think I'm tired of Desktop Linux", which airs the same sentiments I uttered in my last post here. He also knows he won't ditch Linux, but is really tired of stuff breaking all the time.

The constant battle with technology

Posted on November 5th, 2009 by roderik

in

linux
tech

[ Dear diary.... Yeah, I know. This is another way-too-long pointless blog. But I can't stop writing when I start. And I might find it funny to look back in a few years, at how my 'battles' have evolved... ]

Technology is wonderful.

Trying to get technology working exactly like I want it to is a whole lot more frustrating.

Drupal Bazaar repository

Posted on October 14th, 2009 by roderik

in

drupal
tech

NOTE: do not use the 6-patched tree for now, unless you are sure you'll stick with it. At the moment, the update sequence # for system.module is one higher than in the original sources! This means that if you switch back to the original sources, you must run "UPDATE system SET schema_version=schema_version-1 WHERE name='system'" on your database, or you will miss the next update in Drupal Core stable that comes out. I'll fix this at some point.

VCS comparison: Git / Mercurial / Bazaar

Posted on October 5th, 2009 by roderik

in

tech

...mainly from the perspective of a Drupal developer.

In November 2008, I was going to compare version control systems, to give input on which system my employer was going to use for its toolchain. We never really got to a proper decision about this, but it led to me settling on one for my personal business. I'll document my reasons for choosing my current one, so I can look at them later and see if anything's changed. (This may grow into a number of evolving blog posts about the same subject.)

Installing OpenSUSE 11 Xen DomU on Debian Lenny

Posted on October 3rd, 2009 by roderik

in

linux
tech

I've been a happy Debian user since 1997 and my friends are too, so I've never really looked at other distributions.

Debian Xen comes with good instructions including a config file, for installing a DomU/guest system. So until now, I just followed the instructions and never found out the nitty gritty of how an image is actually booted.

Stuff

Posted on January 22nd, 2000 by roderik

in

tech

This page is growing slowly, very slowly...

For now, the 'stuff' is 'stuff I programmed'.

MSX Stuff

The most important part is some MSX stuff I wrote a long long time ago, in a life far far away, which some people asked for. I'm in the process of going through my old files and seeing what I can put online here. No promises on a timeline...


MSX Stuff

Posted on January 22nd, 2000 by roderik

in

tech

Hi, MSX lovers around the world! Look, I have an MSX page too! Sooooo retro! Ain't I cool?

Eh, no, I'm not an active MSX user anymore. I sold mine in 1998 and hardly touched an MSX emulator since. But 3 years later, several people asked me for sources of stuff I'd written, so I dug them up and put them here. (What I could recover, which isn't much -- because a few months before that, I seem to have lost the backup of my MSX harddisk! AARGH!)


Roderik Muit's webcorner
