Estimating an article's average reading time (Python)

Showing an estimated reading time next to each article on your site can be a real benefit to your readers. First, it lets them set aside the time they need to read an article in full. Second, it helps them choose the right article for the amount of time they have available. Lastly, it opens up a whole range of features, sorting options and filters you can offer (like filtering articles by reading time).

In this post, I will walk you through estimating the reading time of any public article URL by crawling the page and making a few simple calculations (written in Python). By the way, this post was estimated to be 6 minutes.

 

Estimating words per minute

Words per minute, commonly abbreviated WPM, is a measure of words processed in a minute, often used to measure typing or reading speed. Using WPM to estimate reading time comes with a few complications. First, average reading speed is subjective. Second, the length of words is clearly variable: some words can be read very quickly (like 'dog') while others take much longer (like 'rhinoceros'). For that reason, a word is often standardized to be five characters long. Other parameters also affect reading time, such as font type and size, your age, whether you're reading on a monitor or on paper, and even the number of paragraphs, images and buttons on the article's page.

Based on research done in this field, people are able to read English at 200 WPM on paper, and 180 WPM on a monitor (the current record is 290 WPM).  

For the sake of simplicity, we'll define a word as five characters (including spaces and punctuation), and WPM = 200. Feel free to add additional parameters to your calculation. Note that if all you're looking for is a broad estimation, what we've defined will suffice.
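
To make the arithmetic concrete, here is a minimal sketch of the calculation on a plain string. The helper name minutes_to_read and its parameters are my own, chosen for illustration; the constants are the ones defined above, and the 180 WPM call uses the monitor figure mentioned earlier.

```python
WPM = 200          # assumed reading speed, in words per minute
WORD_LENGTH = 5    # one "word" = five characters, spaces and punctuation included

def minutes_to_read(text, wpm=WPM, word_length=WORD_LENGTH):
    """Return the estimated reading time of `text` in minutes (hypothetical helper)."""
    # Standardized word count: total characters divided by the word length
    standardized_words = len(text) / word_length
    return standardized_words / wpm

sample = "x" * 6000  # 6,000 characters, i.e. 1,200 standardized words
print(minutes_to_read(sample))           # 6.0
print(minutes_to_read(sample, wpm=180))  # roughly 6.67 (monitor reading speed)
```

Passing the speed as a parameter keeps the door open for the extra factors mentioned above without changing the core calculation.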

 

From URL to estimated reading time

Let's design a simple algorithm:

  1. Extract visible webpage text (title, subtitle, body, page buttons, etc.) from the given URL.
  2. Filter unnecessary content from text.
  3. Estimate filtered text reading time.

1. Extracting visible webpage text

In order to extract a webpage's text content, we'll use two Python libraries, BeautifulSoup and urllib:

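import urllib.request

import bs4

def extract_text(url):
    # Download the raw HTML (Python 3: urlopen lives in urllib.request)
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # Collect every text node in the document, visible or not
    texts = soup.find_all(string=True)
    return texts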

2. Filter unnecessary page content

Once we've extracted the text content, we need to filter out everything the reader never sees, such as styles (CSS), scripts (JS), HTML headers, comments, etc.:

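def is_visible(element):
    # Text inside these tags is never rendered as article content
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # HTML comments are not shown to the reader
    elif isinstance(element, bs4.element.Comment):
        return False
    # Skip nodes that contain nothing but whitespace
    elif not element.strip():
        return False
    return True

def filter_visible_text(page_texts):
    # Return a list so the result can be iterated more than once
    return list(filter(is_visible, page_texts))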
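
If you want to try these filtering rules on an inline HTML string, without a network call or BeautifulSoup installed, the same idea can be sketched with the standard library's html.parser. This is a stand-in for illustration, not the code used in this post: the class VisibleTextParser and the function visible_text_from_html are hypothetical names, and the BeautifulSoup-specific '[document]' entry has no equivalent here.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to the reader
HIDDEN_TAGS = {'style', 'script', 'head', 'title'}

class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside a hidden tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # number of hidden tags we are currently inside
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; HTMLParser sends them to handle_comment
        if self.hidden_depth == 0 and data.strip():
            self.texts.append(data)

def visible_text_from_html(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts

sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><!-- a comment --><h1>Hello</h1><p>Reading time</p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text_from_html(sample_html))  # ['Hello', 'Reading time']
```

The title, the CSS rule, the comment and the script body are all dropped, which is the behaviour we want from is_visible as well.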

3. Estimate reading time

To estimate the article's reading time, we need to count the number of words (as defined above) and divide by the defined WPM (200):

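WPM = 200          # reading speed, in words per minute
WORD_LENGTH = 5    # characters per standardized word

def count_words_in_text(text_list, word_length):
    # Sum the standardized word count of every text fragment
    total_words = 0
    for current_text in text_list:
        # float() keeps the division fractional on Python 2 as well
        total_words += len(current_text) / float(word_length)
    return total_words

def estimate_reading_time(url):
    # Returns the estimated reading time in minutes
    texts = extract_text(url)
    filtered_text = filter_visible_text(texts)
    total_words = count_words_in_text(filtered_text, WORD_LENGTH)
    return total_words / WPM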
    

That's it! Feel free to test it out by calling estimate_reading_time with any URL string.
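
The estimate comes back as a fractional number of minutes, which is rarely what you want to display. Here is a minimal sketch of turning it into a label, rounding up so a very short article never shows as zero minutes. The helper name format_reading_time and the label wording are hypothetical, so adapt them to your site.

```python
import math

def format_reading_time(minutes):
    """Turn a fractional minute estimate into a short display label (hypothetical helper)."""
    # Round up, and never display less than one minute
    whole_minutes = max(1, math.ceil(minutes))
    return "{} min read".format(whole_minutes)

print(format_reading_time(0.3))  # 1 min read
print(format_reading_time(5.2))  # 6 min read
```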

To view the source code, please visit my GitHub page. If you have any questions, feel free to drop me a line.