Finding duplicate content in your own text fields

Find and remove duplicate texts on your site.

I tried an external tool, but it doesn't do much: there are too many errors, and I still have to pay for each check. The check itself is of low quality, and it's mostly useful for very small sites of fewer than 1000 pages.

I won't go into detail about why I don't like it and what its mistakes are. Instead, I'll do a little research, build the check into my own code, and fix the duplicated texts myself.

So, a first test:

with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        # Lines that appear in both files.
        same = set(file1).intersection(file2)

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)

Rewrite as a Django command

😅😆 I received something very far from what I expected..

Ah, OK, I forgot to split: job.description.split('\n')
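To see why the split matters, here is a minimal sketch with two made-up descriptions: without splitting, set() treats each whole string as a single element, so nothing intersects.

```python
# Hypothetical sample descriptions; split into lines before intersecting.
desc1 = "Great team\nFree coffee\nRemote work"
desc2 = "Competitive salary\nFree coffee\nRemote work"

same = set(desc1.split('\n')).intersection(desc2.split('\n'))
print(sorted(same))  # ['Free coffee', 'Remote work']
```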

Now, create a Django command to find duplicate content in the descriptions.


from django.core.management.base import BaseCommand

# Assumption: adjust the import path to the app that defines PermanentJob.
from jobs.models import PermanentJob


class Command(BaseCommand):

    def handle(self, *args, **options):
        jobs = PermanentJob.objects.all()
        self.used = {}
        self.same = {}
        for job in jobs:
            self.find_same(job, jobs)
        print("Count: " + str(len(self.same)))
        # Call self.create_txt_report() here to print the full report.

    def create_txt_report(self):
        for k, v in self.same.items():
            print("---------------------- " + str(k))
            print("pages: " + str(len(v)))
            for same_text in v:
                print(same_text)
            print("--------------------- ////")

    def is_not_used(self, job_id):
        if job_id in self.used:
            return False
        self.used[job_id] = ""
        return True

    def find_same(self, job, jobs):
        for current_job in jobs:
            if current_job != job:
                same = set(job.description.split('\n')).intersection(
                    current_job.description.split('\n'))
                # Reconstructed condition: only record substantial overlaps
                # for jobs we have not recorded yet.
                if len(str(same)) > 200 and self.is_not_used(
                    self.add_same(job, current_job, same)

    def add_same(self, job, current_job, same):
        # Reconstructed: group the overlaps under the id of the first job.
        job_key = self.same.get(
        if job_key:
            self.same[].append((, same))
            self.same[] = [(, same)]
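The same pairwise logic can be exercised outside Django on plain strings. This is only a sketch with hypothetical sample data standing in for PermanentJob descriptions; the overlap threshold is dropped so the small example produces a match.

```python
# Hypothetical sample data standing in for PermanentJob descriptions.
descriptions = {
    1: "Great team\nFree coffee\nRemote work",
    2: "Competitive pay\nFree coffee\nRemote work",
    3: "Completely unique text",
}

same = {}
used = set()
for job_id, desc in descriptions.items():
    for other_id, other_desc in descriptions.items():
        if other_id != job_id and other_id not in used:
            shared = set(desc.split('\n')).intersection(other_desc.split('\n'))
            if shared:
                same.setdefault(job_id, []).append((other_id, shared))
    # Mark this job as handled so the reverse pair is not recorded again.
    used.add(job_id)

print(same)  # only jobs 1 and 2 share lines
```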



This approach already works well, so I will keep using it. The difflib and filecmp examples below are just extra information.


import difflib

with open("sample1.txt") as f1, open("sample2.txt") as f2:
    text1 = f1.readlines()
    text2 = f2.readlines()

for line in difflib.unified_diff(text1, text2):
    print(line, end="")




@@ -1 +1 @@

-Sample file 1

+Sample file 2





from difflib import Differ

with open('cfg1.txt') as f1, open('cfg2.txt') as f2:
    differ = Differ()
    for line in, f2.readlines()):
        # Lines starting with two spaces are common to both files.
        if line.startswith("  "):
            print(line[2:], end="")

The filecmp example that can be found on the internet is unusable in this case, because I need to see what is different, and I need to see what percentage is the same and how much differs.
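For the percentage, difflib itself can help: SequenceMatcher.ratio() returns a similarity score between 0 and 1. A small sketch with made-up strings:

```python
from difflib import SequenceMatcher

# Hypothetical snippets; most of the text is shared.
text1 = "We offer a competitive salary and free coffee."
text2 = "We offer a competitive salary and remote work."

ratio = SequenceMatcher(None, text1, text2).ratio()
print(f"{ratio:.0%} of the characters match")
```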

So this example is informational only

import filecmp

f1 = "C:/Users/user/Documents/intro.txt"
f2 = "C:/Users/user/Desktop/intro1.txt"

# shallow comparison (os.stat signatures only)
result = filecmp.cmp(f1, f2)
print(result)

# deep comparison (file contents)
result = filecmp.cmp(f1, f2, shallow=False)
print(result)
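A self-contained demo of why the shallow mode can mislead: two temporary files with the same size and timestamp but different content pass the shallow check and fail the deep one.

```python
import filecmp
import os
import tempfile

tmp = tempfile.mkdtemp()
f1 = os.path.join(tmp, "a.txt")
f2 = os.path.join(tmp, "b.txt")
with open(f1, "w") as f:
    f.write("hello world\n")
with open(f2, "w") as f:
    f.write("hello earth\n")  # same length, different content

# Give both files the same timestamp so their stat signatures match.
st = os.stat(f1)
os.utime(f2, (st.st_atime, st.st_mtime))
filecmp.clear_cache()

print(filecmp.cmp(f1, f2))                 # shallow: True (size + mtime match)
print(filecmp.cmp(f1, f2, shallow=False))  # deep: False (contents differ)
```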


Links on roughly the same topic:

spaCy similarities