Clean up HTML docs & convert Word to clean HTML
HTML Washer cleans up messy HTML and converts Word documents (.docx) into simple, clean HTML.
Convert a Word document to clean HTML
Drop a Word file (.docx) — get clean, ready-to-use HTML in seconds. No signup required.
What exactly does it do?
- Fixes malformed HTML (unclosed tags, invalid nesting)
- Reduces the markup to:
  <html lang>, <head>, <meta charset name content>, <title>, <body>, <p>, <blockquote cite>, <hr>, <figure>, <figcaption>, <a href title target>, <strong>, <em>, <b>, <i>, <s>, <u>, <br>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <img src alt width height loading>, <picture>, <source srcset sizes media type src>, <video src width height poster controls preload>, <audio src controls preload>, <table>, <caption>, <thead>, <tbody>, <tfoot>, <tr>, <th colspan rowspan>, <td colspan rowspan>, <col span>, <colgroup span>, <code>, <pre>, <ul>, <ol start type reversed>, <li>, <dl>, <dt>, <dd>, <abbr title>, <cite>, <dfn>, <kbd>, <samp>, <var>, <mark>, <small>, <q>, <wbr>, <del datetime cite>, <ins datetime cite>, <sub>, <sup>, <time datetime>
- Replaces:
  <strike> to <del>, <tt> to <code>, <acronym> to <abbr>, <dir> to <ul>, <listing> to <pre>, <xmp> to <pre>, <plaintext> to <pre>
- Reformats the HTML (line breaks, indents)
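
To make the allowlist and replacement steps above concrete, here is a minimal Python sketch built on the standard library's html.parser. It is not HTML Washer's actual implementation. The names `ALLOWED`, `REPLACEMENTS`, `Washer`, and `wash` are hypothetical, and only a small subset of the allowlist is included. It assumes that disallowed tags are dropped while their text is kept, and that the contents of <script> and <style> are discarded. It does not repair unclosed tags or reformat the output.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subset of the allowlist: tag -> attributes that may be kept.
ALLOWED = {
    "p": set(),
    "a": {"href", "title", "target"},
    "strong": set(),
    "em": set(),
    "del": {"datetime", "cite"},
    "code": set(),
    "abbr": {"title"},
    "ul": set(),
    "li": set(),
    "pre": set(),
    "br": set(),
    "img": {"src", "alt", "width", "height", "loading"},
}

# Legacy tags and their replacements, taken from the "Replaces" list.
REPLACEMENTS = {
    "strike": "del",
    "tt": "code",
    "acronym": "abbr",
    "dir": "ul",
    "listing": "pre",
    "xmp": "pre",
    "plaintext": "pre",
}

VOID = {"br", "img"}  # allowed tags that never get a closing tag
SKIP_CONTENT = {"script", "style"}  # assumption: their text is discarded


class Washer(HTMLParser):
    """Keeps allowlisted tags and attributes, renames legacy tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
            return
        tag = REPLACEMENTS.get(tag, tag)
        if tag not in ALLOWED:
            return  # drop the tag itself; its text still reaches handle_data
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip = max(0, self._skip - 1)
            return
        tag = REPLACEMENTS.get(tag, tag)
        if tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def wash(html):
    """Return html reduced to the allowlisted tags and attributes."""
    parser = Washer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(wash('<p class="x">Hi <strike>old</strike> <tt>code</tt></p>'))
```

Using an allowlist rather than a blocklist means that any tag or attribute not explicitly named is removed by default, which is what keeps the output predictable.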
Check out our Apify actor for web scraping.
Powered by Trafilatura, a battle-tested Python library that accurately extracts main content from web pages while filtering out boilerplate like navigation, ads, and sidebars.
Ideal for building RAG pipelines, training datasets, or content analysis at scale.
Did you know? Apify offers a free tier that gives you $5 to use each month.