A crash course in scraping websites and indexing in Elasticsearch

"I (…) am rarely happier than when spending an entire day programming my computer to perform automatically a task that would otherwise take me a good ten seconds to do by hand." — Douglas Adams, Last Chance to See

My girlfriend and I plan to move in together. We have not been in the queue long enough to get an apartment through the biggest player, Stångåstaden. Luckily, the city website has contact information for a good number of local, smaller landlords without queueing systems. Unfortunately, the website is a pain to work with.

We are interested in two or three areas of town, close to the center. The website only lets us list landlords active in one area of town at a time. Further, the landlord listings do not include detailed information; you have to scan each detail page for an email address, phone number, or whatever else is buried in semi-structured HTML to find out how to contact each landlord. This all adds up to tedious cross-referencing work, just to find the landlords we should contact.

This article describes how to scrape the Linköping website for landlords, index them in Elasticsearch, and present only the relevant ones, along with a list of email addresses ready to copy-paste into an email client.

Scrape

We use requests for HTTP and pyquery to work with HTML. Figuring out how to scrape linkoping.se is not as straightforward as one would hope. I wonder if EPiSERVER is to blame.

In [1]:
import re

from pyquery import PyQuery
import requests

LANDLORDS_URL = "http://www.linkoping.se/sv/Bygga-bo/Hitta-bostad/Hyresvardar-i-Linkoping/"
PLACE_SELECT_NAME = "ctl00$FullContentRegion$FullRightContentArea$MainAndSideBodyRegion$SelectionListPage2$DropDownList1"
SHOW_SELECT_NAME = "ctl00$FullContentRegion$FullRightContentArea$MainAndSideBodyRegion$SelectionListPage2$btnShow"
SHOW_SELECT_VALUE = "Visa"


def get_place_options():
    r = requests.get(LANDLORDS_URL)
    r.raise_for_status()
    pq = PyQuery(r.text)
    place_options = pq("select[name='{}']".format(PLACE_SELECT_NAME)).children()
    
    def mapper(i, el):
        place = PyQuery(el)
        return place.text(), place.attr("value")
    
    return place_options.map(mapper)


def get_landlord_links(place_value):
    # Simulate the ASP.NET form postback triggered by the area drop-down.
    r = requests.post(LANDLORDS_URL, data={
        PLACE_SELECT_NAME: place_value, 
        SHOW_SELECT_NAME: SHOW_SELECT_VALUE,
        "__EVENTTARGET": PLACE_SELECT_NAME})
    r.raise_for_status()
    pq = PyQuery(r.text)
    pq.make_links_absolute(base_url=LANDLORDS_URL)
    landlord_links = pq("table#PageListTable tr a.PageListItemHeading")
    
    def mapper(i, el):
        landlord_link = PyQuery(el)
        return landlord_link.text(), landlord_link.attr("href")
    
    return landlord_links.map(mapper)


def get_landlord_info(link):
    r = requests.get(link)
    r.raise_for_status()
    pq = PyQuery(r.text)
    pq.make_links_absolute(base_url=link)
    
    info = Info()
    # The detail page body is a soup of text nodes, <br> tags, links, and
    # <span>/<strong> labels; walk it and pair each label with the text after it.
    main = pq(".mainBody > p")
    for content in main.contents():
        is_text = isinstance(content, basestring)
        if is_text:
            info.current_text.append(content.strip(" :"))
        elif content.text:
            tag = content.tag
            if tag == "br":
                info.current_text.append(u"\n")
            elif tag == "a":
                href = PyQuery(content).attr("href")
                info.current_text.append(href.replace("mailto:", ""))
            elif tag in ["span", "strong"]:
                info.add()
                info.current_key = content.text.strip(" :")
            else:
                raise Exception("unexpected tag: {}".format(tag))
    info.add()
    info.parsed = info.parsed[1:]  # drop the first entry: always the address, keyed by the landlord name
    info.parsed.append(("text", main.text()))
    info.parsed.append(("html", main.html()))
    
    meta_description = pq("meta[name='EPi.Description']").attr("content")
    if meta_description:
        emails = re.findall(r"\b[\w.]+@[\w.]+\.\w+\b", meta_description)
        if emails:
            assert len(emails) == 1
            info.parsed.append(("meta_email", emails[0]))
            
    return info.parsed
    

class Info(object):
    
    def __init__(self):
        self.current_key = ""
        self.current_text = []
        self.parsed = []
    
    def add(self):
        if self.current_key and self.current_text:
            fixed_key = "_" + self.current_key.strip().encode("ascii", "ignore").lower() \
                            .replace(" ", "_").replace("-", "")
            self.parsed.append((fixed_key, u"".join(self.current_text)))
            self.current_key = ""
            self.current_text = []

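To see what the Info helper actually produces, here is a tiny, made-up example (not scraped data). A label such as "E-post" is normalised to the key _epost, which is exactly the field we look up later when collecting email addresses.

info = Info()
info.current_key = u"E-post"
info.current_text = [u"someone@example.com"]  # made-up address, for illustration only
info.add()
print info.parsed  # [('_epost', u'someone@example.com')]
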
Now we are ready to perform the actual scraping and put all the information in the landlords list.

In [2]:
import collections

landlords = []
place_options = get_place_options()

landlord_places = collections.defaultdict(set)
for place_name, place_option in place_options[1:]:  # skip the first option, which covers all areas
    for landlord, link in get_landlord_links(place_option):
        landlord_places[landlord].add(place_name)
        
all_places = place_options[0][1]  # the first drop-down option lists every area at once
for landlord, link in get_landlord_links(all_places):
    info = {k: v for k, v in get_landlord_info(link)}
    info[u"name"] = landlord
    info[u"link"] = link
    info[u"areas"] = list(landlord_places.get(landlord, []))
    landlords.append(info)
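
Before indexing, a quick sanity check of what came back does not hurt. Assuming the scrape found at least one landlord, this prints how many we got and which fields the first one has:

print len(landlords)
print sorted(landlords[0].keys())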

Index

We could do without a search engine, but I want to get some experience with Elasticsearch. Let us start by preparing an index and a mapping containing all the fields we found when scraping. We declare every field as a string with equal importance. Elasticsearch handles arrays on its own and detects that the areas field is an array of strings.

In [3]:
import itertools
import pyes

es = pyes.ES("127.0.0.1:9200")
es.indices.delete_index_if_exists("landlords")
es.indices.create_index("landlords")

flatten = itertools.chain.from_iterable
fields = set(flatten({f for f in landlord} for landlord in landlords))
mapping_value = {
    "store": "yes", 
    "index": "analyzed",
    "type": "string",
    "boost": 1.0,
    "term_vector": "with_positions_offsets",
    }
mapping = {field.encode("utf-8"): mapping_value for field in fields}
try:
    es.delete_mapping("landlord")
except Exception:
    pass
es.put_mapping("landlord", {'properties': mapping}, ["landlords"])

for landlord in landlords:
    es.index(landlord, "landlords", "landlord")

es.indices.refresh("landlords")
Out[3]:
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 10}, u'ok': True}
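
To double-check that all documents actually made it into the index, we can ask Elasticsearch directly over HTTP. A minimal check against the _count endpoint (assuming the default localhost setup used throughout) could look like this:

import requests

r = requests.get("http://127.0.0.1:9200/landlords/landlord/_count")
print r.json()["count"], "landlords indexed"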

Query

Search for landlords active in the areas we are interested in.

In [4]:
import pyes.queryset

model = pyes.queryset.generate_model("landlords", "landlord")
areas_of_interest = ["vasastaden", "innerstaden"]
results = list(model.objects.filter(areas__in=areas_of_interest))
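
For reference, the areas__in filter boils down to a terms query in Elasticsearch. A rough equivalent sent straight to the REST API (a sketch, not necessarily the exact query pyes generates) would be:

import json
import requests

query = {"query": {"terms": {"areas": ["vasastaden", "innerstaden"]}}}
r = requests.post("http://127.0.0.1:9200/landlords/landlord/_search",
                  data=json.dumps(query))
print r.json()["hits"]["total"]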

Present

Dump the results in something we can open in a web browser: a list of email addresses, plus links to the detail pages of landlords for which we could not find an email address.

The output is not included here because it would make it too easy for spambots to pick up the email addresses (the Linköping website obviously exposes them anyway). Contact me if you are interested in it, or run the notebook yourself.

In [5]:
import os

import jinja2

emails = []
missing_emails = []
for r in results:
    email = r.get("meta_email", r.get("_epost", ""))
    if email:
        emails.append(email)
    else:
        missing_emails.append(r)

template = jinja2.Template(u""" \
<html>
<head>
<style type="text/css">
.bold { font-weight: bold; }
</style>
</head>
<body>
<h1>Hyresvärdar</h1>
{% for landlord in results %}
<h2>{{landlord["name"]}}</h2>
<p>{{landlord["html"]}}</p>
{% endfor %}

<h1>Epost</h1>
<pre>
{{", \n".join(emails)}}
</pre>

<h1>Saknar epost</h1>
<ul>
{% for landlord in missing_emails %}
<li>
<a href="{{landlord["link"]}}">{{landlord["name"]}}</a>
</li>
{% endfor %}
</ul>
</body>
</html>
""")

with open(os.path.expanduser("~/bo.html"), "wb") as f:
    f.write(template.render(results=results, emails=emails, missing_emails=missing_emails).encode("utf-8"))

Required packages and software

$ cat requirements.txt
requests==1.2.0
pyquery==1.2.4
pyes==0.20.0
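
With the requirements file in place, something like

$ pip install -r requirements.txt

should pull everything in.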

Elasticsearch should be running on localhost. I did

$ brew install elasticsearch

and

$ elasticsearch -f -D es.config=/usr/local/opt/elasticsearch/config/elasticsearch.yml

Follow

If you enjoyed this, you should follow me on Twitter, GitHub, Coderwall, or Geeklist.