Clean / Remove HTML tags from raw text: clean script, style, or iframe tags

When working with content, you often need to reformat HTML by removing certain tags, and sometimes you need plain text with no tags at all. Here is how to do both in Python.

https://github.com/sergejdergatsjev/PageAdmin/blob/master/bulkupdate.py
  
  
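from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand
# PermanentJob is the project's Django model; import it from the app's models module.


class Command(BaseCommand):
    def handle(self, *args, **options):
        jobs = PermanentJob.objects.all()  # or .filter(raw_text__icontains="<script")

        for job in jobs:
            soup = BeautifulSoup(job.raw_text, "html.parser")
            soup = self.clean_script(soup)
            job.raw_text = str(soup)  # HTML without script/iframe tags
            job.description = self.clean_tags(soup)  # plain text only
            job.save()

    def clean_tags(self, soup):
        # Join all text nodes, stripped of surrounding whitespace, with single spaces.
        return ' '.join(soup.stripped_strings)

    def clean_script(self, soup):
        for data in soup(['iframe', 'script']):
            # Remove the tag together with its contents.
            data.decompose()
        # Return only after every matching tag has been removed.
        return soup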
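
To try the same idea outside Django, here is a minimal, self-contained sketch. The function names strip_active_tags and visible_text and the sample markup are assumptions made for illustration, not part of the project linked above. Unlike the command above, it also removes style tags.

```python
from bs4 import BeautifulSoup


def strip_active_tags(html, tag_names=("script", "iframe", "style")):
    """Parse html and delete the named tags together with their contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for soup.find_all(...).
    for tag in soup(list(tag_names)):
        tag.decompose()
    return soup


def visible_text(soup):
    """Return the remaining text nodes joined by single spaces."""
    return " ".join(soup.stripped_strings)


sample = (
    '<div><p>Hello <b>world</b></p>'
    '<script>alert(1)</script><iframe src="x"></iframe></div>'
)
cleaned = strip_active_tags(sample)
print(str(cleaned))           # <div><p>Hello <b>world</b></p></div>
print(visible_text(cleaned))  # Hello world
```

The first value is what you would store as cleaned HTML, the second is the plain-text description.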

 

 

Or you can use the Bleach library. See the example below.

Bleach is an allowlist-based HTML sanitization library that escapes or strips markup and attributes.

Bleach can also linkify text safely, applying filters that Django's urlize filter cannot, and can optionally set rel attributes, even on links that are already in the text.

Bleach is designed to sanitize text from untrusted sources. If you find yourself jumping through hoops to let your site administrators do a lot of things, you are probably outside Bleach's intended use cases. Either trust those users or don't.

Because it is based on html5lib, Bleach is as good as modern browsers at handling weird, quirky HTML fragments. Any of Bleach's methods will also fix unbalanced or incorrectly nested tags.


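import re

import bleach
from bs4 import BeautifulSoup
from django.utils.text import slugify  # assumed source of slugify

# The lines below run inside a model method, where self is the job instance.
attrs = {'*': ['style']}
styles = ['color', 'font-weight', 'background-color', 'text-decoration', 'font-style', 'font-size', 'text-align']
tags = ['p', 'em', 'strong', 'b', 'ul', 'ol', 'li', 'br', 'a', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'span', 'pre']
# Note: the styles argument was removed in Bleach 5.0, which uses css_sanitizer instead.
self.raw_text = bleach.clean(self.raw_text, tags=tags, attributes=attrs, styles=styles, strip=True)
self.description = BeautifulSoup(self.raw_text, 'html5lib').get_text()
self.description = re.sub(r"\s+", ' ', self.description)  # collapse runs of whitespace
self.slug_description = slugify(self.description)[:2500]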
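
The fragment above runs inside a model, so it cannot be executed on its own. Here is a minimal standalone sketch of the same allowlist idea. The sample strings and variable names are assumptions made for illustration, and the sketch only uses the tags, attributes, and strip arguments of bleach.clean, plus bleach.linkify. Exact output can differ between Bleach versions, so none is shown.

```python
import bleach

# Allowlist: only these tags survive, and no attributes are allowed at all.
ALLOWED_TAGS = {"p", "b", "a"}

dirty = '<p onclick="steal()">Hi <b>there</b></p><script>alert(1)</script>'

# strip=True drops disallowed tags instead of escaping them.
# The text inside a stripped tag is kept; only the markup goes away.
stripped = bleach.clean(dirty, tags=ALLOWED_TAGS, attributes={}, strip=True)

# With the default strip=False, disallowed markup is escaped instead.
escaped = bleach.clean("<script>alert(1)</script>", tags=ALLOWED_TAGS)

# Unbalanced tags are closed automatically.
balanced = bleach.clean("<b>bold", tags=ALLOWED_TAGS)

# linkify wraps bare URLs in anchor tags.
linked = bleach.linkify("See https://example.com for details")

for result in (stripped, escaped, balanced, linked):
    print(result)
```

Note that stripping a script tag leaves its text content behind. If that text must disappear as well, remove the element with BeautifulSoup before calling Bleach.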


Or you can use something like this, combining lxml with BeautifulSoup:

from bs4 import BeautifulSoup

# remove_tags, remove_attrs, and clean_html are helper functions that are not shown here.
def clean_job_html(text):
    try:
        soup = BeautifulSoup(text, 'lxml')
        soup = remove_tags(soup, 'nav')
        soup = remove_tags(soup, 'a')
        soup = remove_tags(soup, 'button')
        soup = remove_tags(soup, 'input')
        soup = remove_tags(soup, 'select')
        soup = remove_tags(soup, 'style')
        soup = remove_tags(soup, 'script')
        soup = remove_tags(soup, 'div', {'id': 'otherjob-links'})
        soup = remove_tags(soup, 'div', {'id': 'company-links'})
        soup = remove_tags(soup, 'div', {'id': 'headerList'})
        soup = remove_tags(soup, 'div', {'id': 'job-links'})
        soup = remove_tags(soup, 'div', {'class': 'jobLinks'})
        soup = remove_tags(soup, 'ul', {'class': 'actions'})
        soup = remove_tags(soup, 'div', {'class': 'btn btn-green'})
        soup = remove_tags(soup, 'div', {'id': 'mostPopularJobs'})
        soup = remove_tags(soup, 'div', {'id': 'LatestJobs'})
        soup = remove_tags(soup, 'div', {'class': 'also-searched content-module'})
        soup = remove_tags(soup, 'ul', {'id': 'headerList'})
        soup = remove_tags(soup, 'div', {'class': 'acties ver forMobile htmljob'})
        soup = remove_tags(soup, 'div', {'id': 'socialmedia'})
        soup = remove_tags(soup, 'div', {'id': 'footer'})
        soup = remove_tags(soup, 'ul', {'class': 'socialmedia'})
        soup = remove_tags(soup, 'div', {'class': 'genericViewApplyBar clearfix'})
        soup = remove_tags(soup, 'div', {'id': 'sidebar'})
        soup = remove_attrs(soup)
        text = clean_html(soup.prettify())
    except Exception:
        # If parsing or cleaning fails, fall back to an empty string.
        text = ""
    return text
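
The function above calls remove_tags and remove_attrs without showing them. Here is one possible sketch of such helpers. Their bodies, the RULES list, and clean_fragment are hypothetical and written for illustration, not taken from the original project. The sketch defaults to the built-in "html.parser" so that it runs without extra dependencies; pass "lxml" if that parser is installed, keeping in mind that lxml wraps fragments in html and body tags.

```python
from bs4 import BeautifulSoup


def remove_tags(soup, name, attrs=None):
    """Hypothetical helper: delete every matching element and its contents."""
    for element in soup.find_all(name, attrs=attrs or {}):
        element.decompose()
    return soup


def remove_attrs(soup):
    """Hypothetical helper: strip all attributes from every remaining tag."""
    for element in soup.find_all(True):
        element.attrs = {}
    return soup


# (tag name, attribute filter) pairs; a data-driven form of the call list.
RULES = [
    ("nav", None),
    ("script", None),
    ("div", {"id": "footer"}),
    ("ul", {"class": "actions"}),
]


def clean_fragment(html, parser="html.parser"):
    soup = BeautifulSoup(html, parser)
    for name, attrs in RULES:
        remove_tags(soup, name, attrs)
    remove_attrs(soup)
    return str(soup)


sample = (
    '<nav>Menu</nav><div id="footer">Footer</div>'
    '<ul class="actions"><li>Apply</li></ul>'
    '<p class="lead" style="color:red">We are hiring.</p>'
)
print(clean_fragment(sample))  # <p>We are hiring.</p>
```

Keeping the rules in a list makes it easier to add or remove site-specific selectors without touching the cleaning logic.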
