When you manage a blog with over 200 posts, human review inevitably misses things. Markdown remnants (such as ** bold markers showing up as literal asterisks), emoji whitelist violations, missing sources, empty tables, and leftover box styles are common culprits. That is why we created a separate step that automatically checks and fixes posts right before they are sent to the blog API.
This post explains the intent behind building this automated QC system, how it works, the actual results we achieved, and how we validated it. We have distilled the core concepts so that any blog operator facing similar issues can implement it with just a single page of code.
Why We Built It
During the first year, we frequently encountered two types of issues.
First, model output remnants. When generating body text with an LLM, Markdown tokens like **bold**, ## Subheading, or --- often remained unconverted to HTML. Asterisks were visible directly on the live site.
Second, posts that looked fine right after writing but were broken by some hook just before publishing. For example, a function might open an extra tag.
How It Works
The checkpoint consists of two stages.
Stage 1: Sanitize (unconditional fixes)
It takes the HTML and applies the following across the board:
- Remove layout-breaking inline styles (width:800px, margin-left:-30px, position:absolute)
- Remove width/height attributes from tags -> preserve responsiveness
- Convert Markdown remnants: **X** -> <strong>X</strong>, arbitrary --- -> <hr>
- Remove leftover box styles (border, box-shadow, or padding>20px)
- Apply overflow safeguards (max-width:100%, overflow-wrap:anywhere)
This stage is a mechanical process that requires no human judgment. It is designed to produce consistent results for any post.
Stage 2: Quality Gate (block publishing on failure)
It automatically checks for omissions that a human would have noticed. If a post fails, publishing is rejected. One of the checks:
- Fewer than 3 <h2> tags -> fail (for guide/comparison posts)
Actual Results
Results over the 6 months since adoption: the 38 posts that were blocked were not lost. Their authors saw the reported issues, refined the content, and retried, and the posts were then published successfully. The distribution of blocking reasons was: missing sources (41%), insufficient character count (26%), 0 images (21%), and others (12%).
Validation Methods
Here is how we validated the checkpoint after building it:
Golden Set Regression Testing: We collected the original drafts of 41 posts that had issues in the past to create a "golden set." We automatically verified whether the issue patterns disappeared when running them through the sanitize + quality gate process. Initially, 39/41 passed. After analyzing the 2 failures and reinforcing our regular expressions, we achieved a 41/41 pass rate.
Live Spot-Checks: In the first week of applying the new sanitizer, we randomly selected 8 out of 18 published posts and fetched their live pages. At two widths, desktop (1280px) and mobile (360px), we checked whether horizontal scrolling occurred, whether text overflowed the container, and whether images broke. 8/8 were normal.
Double-Pass Idempotency: We verified that running the sanitizer a second time on already sanitized output produced exactly the same result. This keeps posts safe if the publish hook chain runs twice. 100/100 were identical.
How to Build It Yourself
Rather than copying the entire code, you can adapt just one or two core elements to fit your environment. You only need to call the two functions below at a single point right before publishing.
In short, it boils down to one line: "Prevent all errors automatically at a single checkpoint before publishing." The time humans used to spend reviewing posts is completely eliminated.
import re

def sanitize_pre_publish(html: str) -> tuple[str, list[str]]:
    """Apply unconditional fixes; return the cleaned HTML and the names of the fixes applied."""
    fixes = []
    # Remove dangerous inline widths (400px and up). The lookbehind keeps
    # max-width / min-width declarations from being cut in half.
    html, n = re.subn(r'(?<![-\w])width\s*:\s*(?:[4-9]\d{2}|[1-9]\d{3,})px\s*;?', '', html)
    if n:
        fixes.append('strip_wide_width')
    # Markdown remnants -> HTML
    html, n = re.subn(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', html)
    if n:
        fixes.append('md_bold')
    # Strip emojis (if necessary)
    html, n = re.subn(r'[\U0001F300-\U0001FAFF]', '', html)
    if n:
        fixes.append('strip_emoji')
    return html, fixes

def quality_gate(html: str, post_type: str) -> tuple[bool, list[str]]:
    """Return (passed, failure reasons) for a post's HTML."""
    fails = []
    # Visible text only: drop tags, then ignore spaces when counting characters
    text = re.sub(r'<[^>]+>', '', html)
    if len(text.replace(' ', '')) < 600:
        fails.append('too_short')
    if html.count('<h2') < 3 and post_type in ('howto', 'compare'):
        fails.append('few_h2')
    if '<img' not in html:
        fails.append('no_image')
    if 'TODO' in html or 'REDACTED' in html:
        fails.append('placeholder')
    return (len(fails) == 0), fails
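One way to wire the two functions into a single pre-publish step is sketched below. It is a minimal sketch, not the actual hook used on this blog: run_checkpoint and PublishBlocked are hypothetical names, and running the gate on the sanitized HTML rather than the raw draft is an assumption. The sanitizer and gate are passed in as arguments so the block stands alone; in practice you would pass sanitize_pre_publish and quality_gate.

```python
from typing import Callable

# Shapes of the two functions, matching the signatures used in this post.
Sanitizer = Callable[[str], tuple[str, list[str]]]
QualityGate = Callable[[str, str], tuple[bool, list[str]]]


class PublishBlocked(Exception):
    """Hypothetical exception that carries the quality-gate failure reasons."""

    def __init__(self, reasons: list[str]) -> None:
        super().__init__("publishing blocked: " + ", ".join(reasons))
        self.reasons = reasons


def run_checkpoint(html: str, post_type: str,
                   sanitize: Sanitizer, gate: QualityGate) -> tuple[str, list[str]]:
    """Sanitize first, then gate the sanitized HTML; raise if the gate fails."""
    clean_html, fixes = sanitize(html)
    ok, fails = gate(clean_html, post_type)  # assumption: gate sees the cleaned HTML
    if not ok:
        raise PublishBlocked(fails)
    return clean_html, fixes
```

Called as run_checkpoint(html, 'howto', sanitize_pre_publish, quality_gate), this either returns the cleaned HTML to hand to the publishing API or raises with the reasons to show the author.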
If quality_gate returns a failure, block the publishing process and return the reasons to the user. For sanitize_pre_publish, simply take the returned HTML and pass it directly to the publishing API.
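The double-pass idempotency check from Validation Methods can be scripted in a few lines. The helper below is a sketch under assumptions: find_non_idempotent is a hypothetical name, and treating any fix reported on the second pass as a failure is a slightly stricter reading than "produced exactly the same result." The sanitizer is passed in, so you can hand it sanitize_pre_publish and a list of real drafts.

```python
from typing import Callable, Iterable


def find_non_idempotent(
    samples: Iterable[str],
    sanitize: Callable[[str], tuple[str, list[str]]],
) -> list[str]:
    """Return the inputs whose second sanitize pass still changes something."""
    offenders = []
    for raw in samples:
        once, _ = sanitize(raw)
        twice, second_fixes = sanitize(once)
        # A stable sanitizer leaves its own output untouched and reports no fixes.
        if twice != once or second_fixes:
            offenders.append(raw)
    return offenders
```

An empty result means every sample was stable on the second pass. It only covers the samples you feed it, so unusual inputs (for example, runs of four or more asterisks) are worth adding to the sample list.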