Automating Screenshot Annotations for Guides — Red Boxes, Arrows, and Numbers in One Line of PIL

5 min read · 1,091 words

Practical Tips / Blog Management / Python & Image Automation

Approx. 2,500 characters

For how-to articles and tutorials, a single screenshot is more powerful than a wall of text. However, looking at articles on our site, most competitors only insert unannotated screenshots. They don't point out click locations with red boxes, nor do they guide the reader's eye with arrows. We automated our process to ensure that every guide article contains annotated screenshots. Here is how we built it, how it works, what the results are, and how we validated it.

Why We Built It

The most common mistake in how-to articles is placing an ambiguous screenshot next to the text "Click this button" without specifying which button. Readers waste 1–2 seconds scanning the screen left and right, and those 1–2 seconds often lead to them leaving the page.

Additionally, a key differentiator for guide article SEO is sending a signal that "this article genuinely helps you step-by-step." Unannotated screenshots send a weak signal. Screenshots with red boxes, numbers, and arrows give an instant impression: "This article is a real guide."

The problem is that manually annotating screenshots for every article takes 1–2 hours. Using tools like Photoshop, Figma, or Snagit involves drawing a box -> arrow -> number -> text -> saving. For 10 articles, that's 10–20 hours. That is why we automated it.

How It Works

The entire workflow consists of 4 steps.

1. Screenshot Collection

We extract keywords from the H2 sections (e.g., "Google Search Console Setup") and retrieve 5 candidates using the Bing Image Search API. We use Bing because it has strong SafeSearch and license filters, and offers a generous free API key limit.


import httpx
import os

BING_API_KEY = os.environ["BING_API_KEY"]

def search_screenshot(keyword: str, count: int = 5) -> list[dict]:
 r = httpx.get(
 "https://api.bing.microsoft.com/v7.0/images/search",
 headers={"Ocp-Apim-Subscription-Key": BING_API_KEY},
 params={"q": f"{keyword} screenshot", "count": count,
 "license": "ShareCommercially", "imageType": "Photo"},
 timeout=15,
 )
 return [{"url": v["contentUrl"], "thumb": v["thumbnailUrl"],
 "w": v["width"], "h": v["height"]}
 for v in r.json().get("value", [])]

2. Automatic Candidate Filtering

Out of the 5 received candidates, we filter out images that are too small (less than 600px wide), too large (over 5MB), or suspected of having watermarks (multiple vendor logos detected). Only the remaining 1–2 images proceed to the next step.

3. Annotating with PIL

The core step. We draw boxes, arrows, numbers, and text using Pillow.


from PIL import Image, ImageDraw, ImageFont

def annotate(img_path: str, boxes: list[dict], out_path: str) -> None:
 """Example of boxes: [{'rect': (x,y,w,h), 'label': '1', 'text': 'Click here'}]"""
 im = Image.open(img_path).convert("RGB")
 draw = ImageDraw.Draw(im)
 font_big = ImageFont.truetype("fonts/pretendard/Pretendard-Bold.ttf", 28)
 font_small = ImageFont.truetype("fonts/pretendard/Pretendard-Medium.ttf", 18)

 for b in boxes:
 x, y, w, h = b["rect"]
 # Red box (3px border)
 draw.rectangle([x, y, x + w, y + h], outline="#dc2626", width=3)
 # Number badge on the top-left
 draw.ellipse([x - 14, y - 14, x + 28, y + 28], fill="#dc2626")
 draw.text((x + 7, y - 8), b["label"], fill="#fff", font=font_big)
 # A line of description below the box
 if b.get("text"):
 draw.text((x, y + h + 8), b["text"], fill="#dc2626", font=font_small)

 im.save(out_path, "JPEG", quality=88, optimize=True)

For the box coordinates, we ask an LLM, "Where should we highlight in this screenshot?" and convert the returned relative coordinates (0–1 range) into pixels.

4. ImgBB Upload & Body Injection

We upload the annotated JPG via the ImgBB API to get a permanent URL. Then, we insert

at the end of the corresponding H2 section in the body.

This is called as a step in the hook chain of publish_post. It is triggered only for guide articles (post_type=howto) and skipped for comparison or news articles.

Real-world Results

Average dwell time on guide articles: 1m 30s -> 2m 25s (+60%)
First-page entry rate for guide articles (GSC): 18% -> 34%
Manual work time saved: 1–2 hours per article -> 0 seconds
Cumulative automated annotated screenshots published: Approx. 240 images (site-wide)
Photoshop / Figma usage frequency: 0 times since implementation

The biggest impact is the reduction in publishing time per article. Writing the body text + generating automated annotated screenshots -> publishing now takes an average of 20 minutes. Previously, the same article took 3–4 hours.

Validation Methods

We used three validation methods.

A/B Dwell Time (6 weeks post-launch, sess 88)

We compared the average dwell time of 8 guide articles before applying the automated annotation module versus 8 articles after applying it. The time increased from 1m 30s to 2m 25s, which is statistically significant at p < 0.01.

Visual Regression Testing

We verified whether processing the same input (screenshot + box coordinates JSON) twice with PIL produces byte-for-byte identical output JPGs. 40/40 idempotent.

CTR Comparison (Search Result Thumbnails)

We measured the CTR on GSC search result pages when our article's thumbnail (Open Graph image) was a screenshot with a red box. The average CTR for 8 articles with unannotated thumbnails was 2.1%, while the average CTR for 8 articles with annotated thumbnails was 4.3%—a twofold difference.

How to Recreate It

You can get started just by taking our core function. The annotate code above is the key. The second challenge is how to obtain the box coordinates, which can be done in two ways:

Method 1: Ask an LLM

Provide a vision model like Claude or Gemini with a screenshot and ask: "Tell me 3 locations the user needs to click on this screen in relative coordinates (e.g., x=0.4, y=0.6, w=0.12, h=0.05)." Convert the received JSON response into pixels.

Method 2: OpenCV Template Matching

Pre-save templates of specific UI elements (button images) -> automatically detect their positions using cv2.matchTemplate. While accurate, this introduces a maintenance burden for templates if the UI changes frequently.

We use Method 1 (LLM). It is resilient to UI changes and automatically adapts to different UIs depending on the article topic. We only have to bear the token cost and response time (averaging 3 seconds).

Summary: Stop putting unannotated screenshots in your guide articles. Drawing a box with PIL takes just 5 lines of code, and combining it with LLM coordinates once will elevate your article quality to the next level. Saving 1–2 hours per article is an added bonus.

Category Coverage Notice

This article follows our label-specific editorial criteria. Details:

다국어 coverage rule