Method: XML parser or crawler with Job Postings JSON or raw text / HTML


Internal note about crawlers

Summary of XML method

  • Copy a crontab command, for example the jobboost one, and update the URL and methods.
  • Copy the jobboost commands and name them after the source, for example Jooble or Talentus.
  • Check the structure of the cleaner and the parser.
  • Normally checkduplicatesbysource and cleanjob are generic commands which you can reuse with parameters (see the sketch after this list).
  • The parser command you do have to check: it contains XML parsing functions that depend on the source's XML structure and sometimes change at the source, so it does not make much sense to make them generic.
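
The generic commands can be reused for a new source simply by passing the source as a parameter. A minimal sketch, assuming the commands accept a source option (check add_arguments() in the real commands for the actual flag names):

from django.core.management import call_command

def run_generic_cleanup(source_name):
    # reuse the generic commands for any source by parameterising them
    call_command("checkduplicatesbysource", source=source_name)
    call_command("cleanjob", source=source_name)

run_generic_cleanup("jooble")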

Summary of crawler method


  • Use rich data for indexing via JSON
  • If there is no structured markup in the HTML or JSON, you can parse the HTML or the raw string directly


The XML parser in detail

When we welcome a new customer, we create a separate process for them that takes care of several things. Here we are talking about large clients with thousands of vacancies. Automation happens mostly through an XML mechanism and sometimes through a crawler. The crawler is less desirable, because it has to parse the HTML itself or rely on the job's structured data.


First I will explain the XML method. We do not use external frameworks; we only use our own input mechanisms, because they are optimized for the tasks we need.


In most cases the XML looks completely different on different websites and needs a mapping of categories. We do the mapping with natural language processing. The rest we just parse with a parser like etree. So with a Django command we can build a connector and XML parser this way:

        self.session = session = jobutils.create_session()

        # self.start_url is a list of feed definitions (url, premium flag)
        for item in self.start_url:
            try:
                response = session.get(item.get("url"))
            except Exception:
                # network error: log it, wait a bit and skip this feed
                traceback.print_exc(file=sys.stdout)
                sleep(2)
                continue

            # premium feeds are redirected and marked as paid
            if item.get("premium"):
                self.redirect = True
                self.payed = True
            else:
                self.redirect = False
                self.payed = False

            # parse the XML feed and walk over all <job> elements
            root = etree.fromstring(response.content)
            jobs = root.xpath("//job")

            for job in jobs:
                # source: the Source object this connector command runs for
                self.parse_job(job, source)

        print("Done")



Then you can go through all the job elements in the XML and parse them, for example:

        # job_id, title and description were extracted from the <job> element above
        jobs = Job.objects.filter(source_unique=job_id)
        if not jobs.exists():
            o_job = Job()
            o_job.source_unique = job_id
            o_job.url = job.findtext('url')
            o_job.email = utils.email_validation(self.get_email(description))
            o_job.sol_url = job.findtext('url')
            # o_job.user = user
            o_job.source = source
            o_job.weight = source.weight
            o_job.title = title[:240]
            o_job.slug = slugify(o_job.title)





This way you avoid the overhead of external frameworks and can, in principle, work with all customers without additional customer requirements.

Sometimes a client does not use XML yet. About 10 years ago hardly anyone had XML interfaces, and we simply had to set up crawlers for client sites. Often these sites have not only an HTML structure but also JavaScript navigation, and that remains a challenge when creating bots for such clients.

Then you need a few mechanisms:
  • One that searches for vacancies on the site and quickly moves through the navigation.
  • One that parses the jobs found by the first process.
  • One that runs through a source once a day and checks whether each vacancy is still online at the website of origin (a sketch follows below).
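
A minimal sketch of the third mechanism, the daily online check. It assumes the Job fields used elsewhere in this note (url, source, for_index) and a per-source offline marker string like the self.offline check in parse_job further below; the helper name and model import path are hypothetical:

import jobutils
from jobs.models import Job  # hypothetical import path

def check_source_online(source, offline_marker):
    session = jobutils.create_session()
    for job in Job.objects.filter(source=source, for_index=True):
        try:
            response = session.get(job.url)
        except Exception:
            continue  # network hiccup: try again on the next daily run
        if response.status_code == 404 or offline_marker in response.text:
            # the vacancy is gone at the origin, so take it out of the index
            job.for_index = False
            job.save()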

For this method we also use our own framework, which is based on abstract base classes; the methods of these classes can be overridden for any specific solution.

There is no better method than reading the examples:

vim crawler/management/commands/abstractjob.py 


This command is based on Django's BaseCommand and does only one thing: retrieve a page where a vacancy is found, parse it into one object and store it in the database via the Django ORM.
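
A minimal sketch of what such an abstract command can look like (the real abstractjob.py may differ; get_page and parse_job are the hooks a concrete crawler overrides):

from django.core.management.base import BaseCommand

class AbstractJobCommand(BaseCommand):  # hypothetical name
    help = "Fetch one vacancy page, parse it into a Job and store it"

    def add_arguments(self, parser):
        parser.add_argument("url")

    def handle(self, *args, **options):
        text = self.get_page(options["url"])
        job = self.parse_job(text, options["url"])
        if job:
            job.save()  # stored via the Django ORM

    def get_page(self, url):
        raise NotImplementedError  # overridden per site (requests, Selenium, ...)

    def parse_job(self, text, url):
        raise NotImplementedError  # overridden per site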

One of the problems after parsing is duplicates.

Many interim and selection agencies post the same vacancy several times, and the only difference is that they add a different state to each copy. I can't say whether it helps them get better results, but it can definitely cause problems, so we need to remove the duplicates and add the extra cities only in the case of such vacancies.



These people think they are helping the job receive more traffic, but in fact they are adding extra work and slowing down the indexing process.

We should also have a separate command so that we do not show duplicates in our index, because such results can become very frustrating to see.

The problem is that the city is a text field. It is therefore easier to customise the titles: a separate command appends the state to the title with a separator (sketched below). That way the results look more unique and the employer will indeed receive more responses.
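
A minimal sketch of that title customisation, assuming the state is available as a plain string for the duplicate vacancy:

from django.utils.text import slugify

def add_state_to_title(job, state):
    # append the state with a separator so the title (and slug) become unique
    job.title = "%s - %s" % (job.title[:200], state)
    job.slug = slugify(job.title)
    job.save()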

You can see an example of the duplicate-check command here:

vim crawler/management/commands/checkduplicatesbysource.py 


This works mostly per source, because we usually don't do a cross-source check. There are hundreds of ways to find exactly the same text, the same images, or texts that are, say, 70% alike. A common method is to generate a slug of the title and filter objects with the same slug. That is one of the simplest methods, unlike natural language processing, and it is the fastest; it works reasonably effectively if you need to filter thousands or even millions of records.

Example Python function:


    def check_by_slug(self, job, source):
        if self.verbose:
            print("ID %d" % job.id)
            print(job.slug)
            print(datetime.datetime.now())
        # same slug + same company within the same source = duplicate
        jobs = Job.objects.filter(slug=job.slug, for_index=True, company_name=job.company_name, source=source).exclude(id=job.id)
        count = jobs.count()
        if self.verbose:
            print("count %d" % count)
        if count > 0:
            try:
                # jobs.delete()
                jobs.update(has_checked_duplicate=True)
                jobs.update(for_index=False)
            except Exception:
                traceback.print_exc(file=sys.stdout)
                transaction.rollback()
        if self.verbose:
            print(datetime.datetime.now())




The crawler configuration in detail

First try browsing the site manually to study how the navigation is built and how to set up a crawler that browses through the categories down to the job descriptions. Don't forget to also test in a separate incognito window, because some sites let you browse and display jobs in one session but not in other sessions. Those URLs live for one session and are made unique for each visitor. That makes crawling difficult and sometimes impossible, because you can no longer tell where you stopped and which pages you still need to fetch.


Make a copy of a similar crawler so you can automate this site.

 


cp crawler/management/commands/jobbsitecat.py crawler/management/commands/newjobsitecat.py   



Put a screenshot path in, so you can monitor via the browser what is happening with the crawler and see what the bot actually sees, because some sites apply cloaking principles.

SCREENSHOT_FILE = "/var/www/vindazo_de/static/newjobsite.png" 


Then start debugging right away, without hesitation, and stop with a breakpoint where you need to browse, so you can run everything manually and note down the correct actions.

import pdb;pdb.set_trace() 


Then you can check the screenshots and see what the crawler does over HTTPS. Just make sure your screenshot is saved in a publicly served folder, like /static/ for example.

self.driver.find_element_by_xpath('//span[@class="typauswahl start"]/a').click()


For example:


element = self.driver.find_element_by_xpath('//a[@class="position-link"]')


You can find elements and click to see what appears.


self.driver.save_screenshot(SCREENSHOT_FILE)


Then, in a real browser, go to roughly the same point and see what you have to do to perform the next action. That is done, for example, with:

element = self.driver.find_element_by_xpath('//a[@class="jobview-paging-control jobview-paging-next"]')


Also make sure you always get things like consent banners out of the way.

* ElementClickInterceptedException: Message: Element <a class="jobview-paging-control jobview-paging-next" href="#next"> is not clickable at point (577,951) because another element <div class="consent-banner"> obscures it



element = self.driver.find_element_by_xpath('//button[@id="accept-sta-consent"]')

element.click()


Thus, with the already tested pieces of code, you can continue step by step in the debugger:


(Pdb) element = self.driver.find_element_by_xpath('//a[@class="jobview-paging-control jobview-paging-next"]')
(Pdb) element.click()
(Pdb) self.driver.save_screenshot(SCREENSHOT_FILE)

(Pdb) self.driver.current_url


Structured data


When we start parsing content, we check whether there is already structured data on the page.

https://search.google.com/test/rich-results 






If we see the JobPosting structure directly and can parse it from JSON or XML, we don't need an extra HTML parser, and this JSON format is the industry standard.


So you can parse the JSON that sits between the JSON-LD script tags:


<script type="application/ld+json">


import json

script = soup.find('script', {"type": "application/ld+json"})
job_data = json.loads(script.text)


job_data.keys()


[u'description', u'title', u'employmentType', u'datePosted', u'validThrough', u'directApply', u'jobLocation', u'@context', u'baseSalary', u'hiringOrganization', u'@type']


job_data["title"]


job_data["jobLocation"]

[{u'geo': {u'latitude': 52.5099338311689, u'@type': u'GeoCoordinates', u'longitude': 13.3867898863636}, u'@type': u'Place', u'address': {u'addressCountry': u'DE', u'addressLocality': u'Berlin', u'addressRegion': u'berlin', u'streetAddress': u'', u'postalCode': u'10115', u'@type': u'PostalAddress'}}]


job_data["jobLocation"][0].keys()

[u'geo', u'@type', u'address']

So the city, for example, you can get with:

job_data['jobLocation'][0]['address']['addressLocality']
job_data['jobLocation'][0]['address']['postalCode']




job.title = job_data["title"]
job.company_name = job_data['hiringOrganization']['name']
job.email = self.get_email(page_source)
job.phone = self.get_phone(page_source)
job.city = job_data['jobLocation'][0]['address']['addressLocality']
job.zip_code = job_data['jobLocation'][0]['address']['postalCode']
job.address = job_data['jobLocation'][0]['address']['streetAddress']
job.country = "Deutschland"
job.slug = slugify(job.title)
text = utils.parse_all_text(job_data['description'])
job.description = text
job.raw_text = utils.clean_job_html(job_data['description'])



The quality is actually better via JSON and, more importantly, this is something we can turn into a generic command.
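
A sketch of such a generic helper, assuming a page can contain several ld+json blocks: scan them all and return the one whose @type is JobPosting.

import json

from bs4 import BeautifulSoup

def extract_job_posting(page_source):
    soup = BeautifulSoup(page_source, "lxml")
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
        except ValueError:
            continue  # empty or broken JSON block, skip it
        # some sites wrap the structured data in a list
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                return item
    return None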





Direct HTML or raw string parsing when there is no structured rich data


In principle we should parse the content right away, because we already see it directly while we still have to move from one job to the next.


        while next_page:
            try:
                # click the "next page" arrow and wait for the page to load
                self.driver.find_element_by_xpath('//img[contains(@src, "paginierung_rechts_aktiv")]').click()
                sleep(3)
                soup = BeautifulSoup(self.driver.page_source, 'lxml')
                urls = self.get_job_urls(soup)
                self.create_pages(urls)
            except Exception:
                # no next-page arrow any more (or a click failed): stop paging
                next_page = False
                traceback.print_exc(file=sys.stdout)




    def parse_job(self, text, source, url, session):
        """
        Parse one job page into a Job object.
        """
        job = Job()
        job.status = 0
        job.source_unique = url
        job.url = url
        job.source = source
        job.weight = source.weight
        job.online_since = datetime.now()
        job.online_since_refreshed = datetime.now()
        job.online_to = datetime.now() + timedelta(days=60)
        if self.test:
            import pdb; pdb.set_trace()
        # the page itself tells us the vacancy is offline
        if text.find(self.offline) != -1:
            return False
        soup = BeautifulSoup(text, 'lxml')
        job.title = self.get_title(soup).strip()
        if not job.title:
            return False
        else:
            job.title = job.title[0:120]
        job.slug = slugify(job.title)
        try:
            job = self.get_contact_information(soup, job, session)
        except Exception:
            traceback.print_exc(file=sys.stdout)

        if not self.has_correct_location(soup):
            return False
        soup = self.get_content_zone(soup)
        if soup is None:
            return False
        # ... (rest of the method is omitted in this note)





    def get_phone(self, soup):
        tmp = ''
        company_phone = soup.find(text=re.compile('Telefonnummer'))
        if company_phone:
            tmp = company_phone.split(':')[-1].strip()
        return tmp

    def get_city(self, soup):
        tmp = ''
        company_city = soup.find("span", id=re.compile(".*Ort.*"))
        if company_city:
            tmp = company_city.text
        return tmp

    def get_zip_code(self, soup):
        tmp = ''
        company_zip = soup.find("span", id=re.compile(".*Plz.*"))
        if company_zip:
            tmp = company_zip.text
        return tmp

    def get_address(self, soup):
        tmp = ''
        company_address = soup.find("span", id=re.compile(".*Strasse.*"))
        if company_address:
            tmp = company_address.text
        return tmp


    def get_contact_information(self, soup, job, session):
        # fill in the contact fields on the job and return it
        job.company_name = self.get_company_name(soup)
        job.email = self.get_email(soup)
        job.phone = self.get_phone(soup)
        job.city = self.get_city(soup)
        job.zip_code = self.get_zip_code(soup)
        job.address = self.get_address(soup)
        job.country = "Deutschland"
        return job


    def get_company_name(self, soup):

        tmp = ''

        company_name = soup.find('a', id=re.compile(".*arbeitgeber"))

        if company_name:

            tmp = company_name.find('span').text

        return tmp


    def get_title(self, soup):
        # tmp = soup.find('div', {"id":"containerInhaltKopf"}).find('h3')
        tmp = soup.select("div#containerInhaltKopf > h3")
        try:
            tmp = tmp[0].text.replace("\n", "").replace("\t", "").replace("Stellenangebot -", "").strip()
        except Exception:
            tmp = ""
        if not tmp:
            # fall back to the <title> tag of the page
            tmp = soup.title.text.replace("JOBBÖRSE - Stellenangebot -", "").strip()
        return tmp




Raw string parsing with regular expressions


For things such as telephone numbers or email addresses, use regular expressions.


For example, a telephone pattern:

https://regex101.com/r/McD0KW/2/ 


You can, for example, divide your function into two or more stages, as here:


    def get_phone(self, page_source):
        # stage 1: find everything that looks roughly like a phone number
        pattern = re.compile(r"[0\+\(][\d\-\.\+ \)\(\/]{10,22}")
        phones = re.findall(pattern, page_source)
        if len(phones) > 0:
            return self.filter_phone(phones)
        return None

    def filter_phone(self, phones):
        # stage 2: keep the first candidate with enough separator symbols
        pattern = re.compile(r"[\-\.\+ \)\(\/]")
        for phone in phones:
            symbols = re.findall(pattern, phone)
            if len(symbols) > 3:
                return phone
        return None


A good regular expression is like a work of art. You can write the same thing with much worse performance, or something that doesn't work at all, or something that only works in some cases and then takes several hours to debug.

An e-mail pattern that covers most cases:


https://regex101.com/r/xKUnaN/1 


Then the email parser could look like this:

def parse_emails(content):
    pattern = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    return re.findall(pattern, content)



    def get_email(self, page_source):
        email = None
        # look for addresses in the raw text
        emails = parse_emails(page_source)
        # mailto links could be checked here as well
        if len(emails) > 0:
            try:
                email = normalization_email(emails[0])
                validate_email(emails[0])
            except Exception:
                email = None
        return email



So if you have to parse HTML, it is occasionally worthwhile to abstract away the tags, do rough string parsing and find the necessary information with patterns.
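
A minimal sketch of that idea: flatten the document to plain text with BeautifulSoup's get_text() and run the same kind of patterns over it (the phone pattern is the one from above):

import re

from bs4 import BeautifulSoup

PHONE_PATTERN = re.compile(r"[0\+\(][\d\-\.\+ \)\(\/]{10,22}")

def rough_parse_phones(page_source):
    # drop the tag structure and search the raw text directly
    text = BeautifulSoup(page_source, "lxml").get_text(" ", strip=True)
    return PHONE_PATTERN.findall(text)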
