Finding duplicate content in your own text fields

Find and remove duplicate text on your site.

I tried an external tool, but it doesn't do much: there are too many errors and I still have to pay for each check. The checking itself is of low quality. It is mostly useful for very small sites with fewer than 1000 pages.


I won't go into too much detail about why I don't like it and what mistakes it makes. Instead, I will do a little research and try to build the check into my own code, so I can find the duplicate texts and fix them.


So, a first test:


with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        # Lines that occur in both files
        same = set(file1).intersection(file2)

# Ignore empty lines
same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)



Next, rewrite it as a Django command.






😅😆 The result was very far from what I expected.


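Ah, OK, I forgot to split the description into lines: job.description.split('\n'). Without the split, set() works on the single characters of the string instead of on its lines.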
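To see why the split matters: iterating over a Python string yields single characters, so a set built from a description is a set of characters, and intersecting two of those says almost nothing about shared text. A minimal sketch with two made-up descriptions (the strings are illustrative, not real data):

```python
# Two made-up job descriptions (illustrative only).
desc_a = "We are hiring\nApply now\nGood salary"
desc_b = "Join our team\nApply now\nGood salary"

# Without split: sets of single characters.
chars_in_common = set(desc_a).intersection(desc_b)

# With split: sets of whole lines, which is what we want to compare.
lines_in_common = set(desc_a.split('\n')).intersection(desc_b.split('\n'))

print(sorted(lines_in_common))  # ['Apply now', 'Good salary']
```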

https://github.com/sergejdergatsjev/PageAdmin/blob/master/diffpermanentjobs.py

Now, create a Django command to see duplicate content in the descriptions.

 

from django.core.management.base import BaseCommand

# PermanentJob is the project's job model. Import it from wherever it lives
# in your project (the module path is not given here), for example:
# from yourapp.models import PermanentJob


class Command(BaseCommand):

    def handle(self, *args, **options):
        jobs = PermanentJob.objects.all()
        self.used = set()  # ids of jobs already reported as a duplicate
        self.same = {}     # job id -> list of (other job id, shared lines)
        for job in jobs:
            self.find_same(job, jobs)
        print("Count:" + str(len(self.same)))
        self.create_txt_report()

    def create_txt_report(self):
        for k, v in self.same.items():
            print("---------------------- " + str(k) + " ----------------")
            print("pages: " + str(len(v)))
            for same_text in v:
                print(same_text[0])  # id of the other job
                print(same_text[1])  # the shared lines
            print("--------------------- //// -------------------------")

    def is_not_used(self, job_id):
        # Returns True only the first time an id is seen, and remembers it
        if job_id in self.used:
            return False
        self.used.add(job_id)
        return True

    def find_same(self, job, jobs):
        for current_job in jobs:
            if current_job != job:
                same = set(job.description.split('\n')).intersection(
                    current_job.description.split('\n'))
                # Rough threshold: the printed set must be longer than 200 characters
                if len(str(same)) > 200 and self.is_not_used(current_job.id):
                    self.add_same(job, current_job, same)

    def add_same(self, job, current_job, same):
        self.same.setdefault(job.id, []).append((current_job.id, same))
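The same pairwise comparison can be tried outside Django. What follows is only a sketch under my own assumptions, not the code from the repository: the descriptions are assumed to sit in a plain dict keyed by id, the name find_shared_lines and the min_chars parameter are hypothetical, and the threshold is measured on the total length of the shared lines rather than on len(str(same)). Each pair is reported once, which is not the same bookkeeping as the used check in the command.

```python
from itertools import combinations


def find_shared_lines(descriptions, min_chars=200):
    """Return {(id_a, id_b): shared_lines} for pairs that share enough text.

    descriptions: dict mapping an id to a description string (assumed shape).
    min_chars: hypothetical threshold on the total length of shared lines.
    """
    # Split each description once; ignore empty and whitespace-only lines.
    line_sets = {
        key: {line.strip() for line in text.split('\n') if line.strip()}
        for key, text in descriptions.items()
    }
    result = {}
    # combinations() yields every unordered pair exactly once.
    for id_a, id_b in combinations(sorted(line_sets), 2):
        shared = line_sets[id_a] & line_sets[id_b]
        if sum(len(line) for line in shared) >= min_chars:
            result[(id_a, id_b)] = shared
    return result


if __name__ == "__main__":
    boilerplate = "We offer a competitive salary and a pleasant team."
    sample = {
        1: "Python developer\n" + boilerplate,
        2: "Java developer\n" + boilerplate,
        3: "Completely different text",
    }
    for pair, lines in find_shared_lines(sample, min_chars=20).items():
        print(pair, sorted(lines))
```

Splitting every description once up front avoids repeating the split for each pair, which may matter once there are thousands of jobs.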

 

 





This approach already works well, so I will keep using it. The difflib and filecmp examples below are just extra information.





————————


import difflib

with open("sample1.txt") as f1, open("sample2.txt") as f2:
    text1 = f1.readlines()
    text2 = f2.readlines()

# Print a unified diff of the two files
for line in difflib.unified_diff(text1, text2):
    print(line, end="")

OUTPUT


---

+++

@@ -1 +1 @@

-Sample file 1

+Sample file 2

INPUT FILES

sample1.txt

sample2.txt





——————————


from difflib import Differ

with open('cfg1.txt') as f1, open('cfg2.txt') as f2:
    differ = Differ()
    for line in differ.compare(f1.readlines(), f2.readlines()):
        # Lines that start with a space are common to both files
        if line.startswith(" "):
            print(line[2:], end="")



The filecmp example that can be found on the internet is unusable in this case, because filecmp only tells me whether two files are equal. I need to see what is different, and what percentage is the same and what percentage differs.


So this example is for information only:


import filecmp


f1 = "C:/Users/user/Documents/intro.txt"

f2 = "C:/Users/user/Desktop/intro1.txt"




# shallow comparison

result = filecmp.cmp(f1, f2)


print(result)

# deep comparison

result = filecmp.cmp(f1, f2, shallow=False)

print(result)
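For the percentage mentioned above, difflib's SequenceMatcher has a ratio() method that returns a similarity score between 0 and 1. A minimal sketch, assuming the texts are compared line by line. The helper name similarity_percent and the sample strings are my own illustration:

```python
from difflib import SequenceMatcher


def similarity_percent(text_a, text_b):
    """Return how similar two texts are, as a percentage from 0 to 100."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # ratio() is 2*M/T: M = matching elements, T = total elements in both.
    return 100 * SequenceMatcher(None, lines_a, lines_b).ratio()


a = "Intro\nShared line one\nShared line two\nOnly in A"
b = "Intro\nShared line one\nShared line two\nOnly in B"
print(round(similarity_percent(a, b)))  # 75
```

Here three of the four lines match in each text, so the ratio is 2*3/8 = 0.75.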

Links on the same topic



https://stackoverflow.com/questions/55061542/how-to-check-for-differences-between-two-spacy-doc-objects


spaCy similarities


https://stackoverflow.com/questions/11008519/detecting-and-printing-the-difference-between-two-text-files-using-python-3-2


https://docs.python.org/3/library/difflib.html
