Finding duplicate content in your own text fields

Find and remove duplicate texts on your site.

I tried an external tool, but it doesn't do much: it produces too many errors, and I still have to pay for each check. The quality of the check itself is also lower. It is more useful for very small sites with fewer than 1000 pages.

I won't go into too much detail about why I don't like it or what mistakes it makes. So I have to do a little research, build my own code, and use it to correct the texts.

So, a first test:

with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)

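Next, rewrite it as a Django command.

😅😆 The result was very far from what I expected.

Ah, OK, I forgot to split: job.description.split('\n'). Without the split, set() works on the single characters of the description instead of its lines.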
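To see why the split matters, here is a small sketch with two made-up descriptions (the strings are only an illustration, not real job data). Iterating over a string yields single characters, so only after split('\n') does the intersection compare whole lines.

```python
# Two made-up descriptions that share one line (illustrative data only).
desc_a = "We are hiring\nPython developer\nApply today"
desc_b = "Join our team\nPython developer\nSend your CV"

# Without split: a string iterates character by character,
# so the "intersection" is just the characters both texts have in common.
shared_chars = set(desc_a).intersection(desc_b)

# With split: each element is a full line, so only identical lines match.
shared_lines = set(desc_a.split('\n')).intersection(desc_b.split('\n'))

print(sorted(shared_chars))
print(shared_lines)  # {'Python developer'}
```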

https://github.com/sergejdergatsjev/PageAdmin/blob/master/diffpermanentjobs.py

Now, create a Django command to find duplicate content in the descriptions.

from django.core.management.base import BaseCommand

# PermanentJob is the project's own model; import it from your app's
# models module (the path is not shown in this post).


class Command(BaseCommand):

    def handle(self, *args, **options):
        jobs = PermanentJob.objects.all()
        self.used = {}  # ids of jobs already reported as a duplicate
        self.same = {}  # job id -> list of (other job id, shared lines)
        for job in jobs:
            self.find_same(job, jobs)
        print("Count:" + str(len(self.same)))
        self.create_txt_report()

    def create_txt_report(self):
        for k, v in self.same.items():
            print("---------------------- " + str(k) + " ----------------")
            print("pages: " + str(len(v)))
            for same_text in v:
                print(same_text[0])
                print(same_text[1])
            print("--------------------- //// -------------------------")

    def is_not_used(self, job_id):
        # True the first time an id is seen; the id is then marked as used.
        if job_id in self.used:
            return False
        else:
            self.used[job_id] = ""
            return True

    def find_same(self, job, jobs):
        for current_job in jobs:
            if current_job != job:
                same = set(job.description.split('\n')).intersection(
                    current_job.description.split('\n'))
                # len(str(same)) is a rough measure of how much text is shared.
                if len(str(same)) > 200 and self.is_not_used(current_job.id):
                    self.add_same(job, current_job, same)

    def add_same(self, job, current_job, same):
        job_key = self.same.get(job.id)
        if job_key:
            self.same[job.id].append((current_job.id, same))
        else:
            self.same[job.id] = [(current_job.id, same)]

This approach already works well, so I will keep using it. The other examples that follow, with difflib and filecmp, are just extra information.
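For experimenting outside Django, the same idea can be sketched on a plain dictionary. This is a minimal sketch under assumptions, not the command above: find_shared_lines, min_chars, and the sample data are hypothetical names. It also differs in two ways: it uses itertools.combinations so each pair is compared only once, and it measures the total length of the shared lines instead of len(str(same)).

```python
from itertools import combinations


def find_shared_lines(descriptions, min_chars=200):
    """Return {(id_a, id_b): shared_lines} for pairs with enough overlap.

    descriptions: dict mapping an id to its description text.
    min_chars: minimum total length of the shared lines (an assumed
    threshold, loosely mirroring the 200 used in the command).
    """
    # Split each text once and drop empty lines, which would match everywhere.
    line_sets = {
        key: {line.strip() for line in text.split('\n') if line.strip()}
        for key, text in descriptions.items()
    }
    result = {}
    # combinations() yields every unordered pair exactly once.
    for (id_a, lines_a), (id_b, lines_b) in combinations(line_sets.items(), 2):
        shared = lines_a & lines_b
        if sum(len(line) for line in shared) >= min_chars:
            result[(id_a, id_b)] = shared
    return result


if __name__ == "__main__":
    # Made-up data: jobs 1 and 2 share a long boilerplate paragraph.
    boilerplate = "About us: " + "we value teamwork and growth. " * 8
    jobs = {
        1: "Python developer\n" + boilerplate,
        2: "Java developer\n" + boilerplate,
        3: "Truck driver\nNo shared text here",
    }
    for pair, shared in find_shared_lines(jobs).items():
        print(pair, len(shared), "shared line(s)")
```

Comparing each pair once roughly halves the work on a large table, at the cost of reporting pairs rather than one list per job.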

————————

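import difflib

with open("sample1.txt") as f1, open("sample2.txt") as f2:
    text1 = f1.readlines()
    text2 = f2.readlines()

# Each line already ends with a newline, so print() must not add another.
for line in difflib.unified_diff(text1, text2):
    print(line, end="")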

OUTPUT

---
+++
@@ -1 +1 @@
-Sample file 1
+Sample file 2

INPUT FILES

sample1.txt
sample2.txt

——————————

from difflib import Differ

with open('cfg1.txt') as f1, open('cfg2.txt') as f2:
    differ = Differ()
    for line in differ.compare(f1.readlines(), f2.readlines()):
        # Differ marks lines common to both files with a leading space.
        if line.startswith(" "):
            print(line[2:], end="")

The filecmp example that can be found on the internet is unusable in this case, because I need to see what is different, what percentage is the same, and how much differs. filecmp only tells you whether two files are equal.

So this example is informational only:

import filecmp

f1 = "C:/Users/user/Documents/intro.txt"
f2 = "C:/Users/user/Desktop/intro1.txt"

# shallow comparison
result = filecmp.cmp(f1, f2)
print(result)

# deep comparison
result = filecmp.cmp(f1, f2, shallow=False)
print(result)
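Since filecmp only answers True or False, the percentage mentioned above needs another tool. One option, as a rough sketch, is difflib.SequenceMatcher, whose ratio() returns a similarity between 0 and 1. The helper name similarity_percent and the sample texts are made up, and comparing lists of lines rather than raw characters is a choice made for this sketch.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a, text_b):
    """Return how similar two texts are, as a percentage based on lines."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # ratio() is 2 * M / T: M matching lines, T total lines in both texts.
    return 100 * SequenceMatcher(None, lines_a, lines_b).ratio()


if __name__ == "__main__":
    # Made-up texts: two of the three lines are identical.
    first = "Intro\nShared paragraph\nShared footer"
    second = "Different intro\nShared paragraph\nShared footer"
    print(f"{similarity_percent(first, second):.1f}% similar")
```

Working on lines keeps the result in step with the line-based set intersection used earlier; passing the raw strings instead would give a character-level percentage.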

Links on the same topic

https://stackoverflow.com/questions/55061542/how-to-check-for-differences-between-two-spacy-doc-objects

spaCy similarities

https://stackoverflow.com/questions/11008519/detecting-and-printing-the-difference-between-two-text-files-using-python-3-2

https://docs.python.org/3/library/difflib.html
