Mass-customize descriptions in templates, HTML documents or context JSON files



The goal is to find pages without a description. Maybe you work in a file system with plain text files and no content management system, or even with a CMS, but you have to add a description to 1000 articles because the writers thought it was not important... Then you can try automatic summary generation and add the description to the HTML, to a context file, or to the CMS database.

So we are going to generate a summary with an extractive summarization technique.


We first count how many files there are without a description. In my case these are just text files whose context has an empty description field.


grep -r '"description": "",' context/



I tried to do it manually: that would take at least a week for sure. In 1 or 2 hours I got through 50 or so... So I need this to be processed automatically. But how?


Read the context file; if there is no description, open the HTML file via the file system or a request. Strip all tags to plain text with BeautifulSoup, split the text into sentences with split('.'), pick a random sentence from there, and truncate it. Meta descriptions can technically be any length, but Google generally truncates snippets to ~155-160 characters.
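The truncation step deserves a little care: a naive slice can cut a word in half. A minimal sketch with a hypothetical `truncate_description` helper (this name is mine, not part of the script below):

```python
def truncate_description(text, limit=155):
    """Clip text to `limit` characters without cutting the last word in half."""
    text = " ".join(text.split())      # collapse newlines and extra spaces
    if len(text) <= limit:
        return text
    clipped = text[:limit]
    # drop the trailing partial word, if any
    return clipped.rsplit(" ", 1)[0]

print(truncate_description("A very long sentence about window sizes " * 10))
```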



Step by step, it goes like this.


Read the grep output from a file or stream in Python:


grep -r '"description": "",' context* > /tmp/files.txt



Open a file


context/0571-ramen-standaardmaten.json


Read the JSON


Request the URL


example.be/page.html



Find the div with class "content content-width" using BeautifulSoup


Remove the HTML tags



from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'lxml')
# note the quotes around "div", and extract text from content, not the whole soup
content = soup.find("div", {"class": "content content-width"})
text = ' '.join(content.findAll(text=True))
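For reference, newer BeautifulSoup code usually calls `get_text()` instead of joining `findAll(text=True)`; a self-contained sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<div class="content content-width"><p>First paragraph.</p><p>Second one.</p></div>'
soup = BeautifulSoup(html, "html.parser")  # html.parser needs no extra install
content = soup.find("div", {"class": "content content-width"})
# strip each string and join with a single space
text = content.get_text(" ", strip=True)
print(text)  # First paragraph. Second one.
```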



We can of course now use an extractive summarization technique from spaCy to fill in the description with the most relevant sentence of this text... But we could just as well select a random sentence from the text and cut it to 160 characters. So we have to choose. Let's first try it with spaCy.




import spacy

from spacy.lang.en.stop_words import STOP_WORDS

from string import punctuation

from collections import Counter

from heapq import nlargest
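The imports above point at the classic frequency-based extractive approach. A minimal sketch of how that usually looks (the `most_relevant_sentence` function is my illustration, and it uses a blank English pipeline with a sentencizer so no trained model download is assumed):

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

def most_relevant_sentence(text):
    """Score each sentence by summed normalized word frequency; return the top one."""
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    doc = nlp(text)
    words = [t.text.lower() for t in doc
             if t.text.lower() not in STOP_WORDS and t.text not in punctuation]
    freq = Counter(words)
    if not freq:
        return ""
    top = freq.most_common(1)[0][1]
    # a sentence scores the sum of its words' normalized frequencies
    scores = {sent: sum(freq.get(t.text.lower(), 0) / top for t in sent)
              for sent in doc.sents}
    return nlargest(1, scores, key=scores.get)[0].text
```

The alternative the rest of this post settles on is simply `random.choice` over `text.split('.')`.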




In this case the generated description is not as clickable and not as readable as a sentence selected at random from the text, so we stick with simple random sentence selection; it reads more human than the generated one. A copywriter will check all pages afterwards anyway and rewrite the descriptions, which is a quick fix for a previous mistake by other people. We will see to what extent this method is useful.

Script example:


#!/usr/bin/python


from bs4 import BeautifulSoup

import sys

import traceback

import random

import json

import requests


def get_filename(line):

    """

    Parse a line of grep output such as:

    context/0202-pvc-deuren-te-koop.json:"description": "",

    """

    return line.split(":")[0]


def get_context(filename):

    with open(filename, "r") as f:

        context = json.load(f)

    if len(context["description"]) == 0:

        return context

    return None


def get_content(url):

    r = requests.get(url)

    soup = BeautifulSoup(r.text, "lxml")

    content = soup.find("div", {"class": "content content-width"})

    text = ' '.join(content.findAll(text=True))

    return text


def select_random(text):

    candidates = []

    for sent in text.split("."):

        # collapse newlines and extra whitespace before measuring length

        sent = " ".join(sent.split())

        if len(sent) > 60:

            candidates.append(sent)

    return random.choice(candidates)[:155]


def update_description(filename, sentence, context):

    context["description"] = sentence

    text = json.dumps(context)

    print(text)

    with open(filename, "w") as f:

        f.write(text)



def main(argv):

    fname = "/tmp/files.txt"

    lines = open(fname, "r").readlines()

    for line in lines:

        try:

            filename = get_filename(line)

            context = get_context(filename)

            if context is not None:

                text = get_content(context['url'])

                sentence = select_random(text)

                update_description(filename, sentence, context)

        except Exception:

            traceback.print_exc(file=sys.stdout)


if __name__ == "__main__":

    main(sys.argv[1:])




