More details to Web Scraping with Python and Selenium

This article is a part of an Educational Data Science Project — Airbnb Analytics

Igor' Smirnov
Analytics Vidhya
6 min read · Mar 5, 2021

The idea is to implement an example Data Science project using the Airbnb website as a source of data. Please read the stories here:

Part 0 - Intro to the project
Part 1 - Scrape the data from Airbnb website
Part 2 - More details to Web Scraping (this article)
Part 3 - Explore and clean the data set
Part 4 - Build a machine learning model for listing price prediction
Part 5 - Explore the results and apply the model

Intro

After writing the first version of the scraper, I devoted some time to testing the script and exploring how well it does the job. Several issues came up, and today I’m gonna write about how we can solve them.

  • Problem #1 — Sometimes nothing is scraped
  • Problem #2 — Sudden Timeout
  • Problem #3 — A button, not a button
  • Problem #4 — Element is not in a viewport
  • Problem #5 — Too heavy detail pages
  • Problem #6 — Damn cookie popups
  • Problem #7 — Airbnb is playing with me

Expectation management

What could be easier than scraping a webpage? We’re smart and can wrap our thoughts in a script:

Step 1 — Get the link

Step 2 — Send a GET request

Step 3 — Process the answer with BeautifulSoup

Step 4 — PROFIT

Well.

Right. But what if…

Our results are inconsistent?

Needed page elements are hidden beneath buttons and popups?

The GET request times out?

At this point, one realizes that scraping is not such an easy task anymore.

That was me some weeks ago —😭. But I did not surrender — 🐱‍👤. And started a long journey of continuous improvements.

By the way, don’t be like me. Stop chasing perfection. Remember the Pareto 80/20 principle.

Problem #1. Sometimes nothing is scraped

I am not very confident about my programming skills. Because of that, I have a habit of writing code in small chunks. I write 3–4 lines, check how they work, and continue if everything’s fine. Following this principle, I wrote a function that extracts listings from an Airbnb search page.

It always performed as expected while testing, meaning that a single page returned 20 separate listings. But when I was about to scrape a couple of locations at once, it turned out that some pages were either empty or contained only 4–5 listings.

I spent a couple of days exploring and trying to understand what exactly was happening. That was quite a frustrating time, honestly.

Solution

In the end, I never found the reason, so I gave up and decided to solve the issue in a very straightforward way. The updated function tries to scrape the page multiple times and returns the best result.
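Here is a minimal sketch of that retry idea, not the exact function from the project. The name scrape_with_retries, the scrape_once callable, and the default of 5 attempts are assumptions for illustration; the expected 20 listings per page comes from the tests described above.

```python
from typing import Callable, List


def scrape_with_retries(scrape_once: Callable[[], List],
                        max_attempts: int = 5,
                        expected: int = 20) -> List:
    """Call scrape_once up to max_attempts times and return the longest result."""
    best: List = []
    for _ in range(max_attempts):
        try:
            listings = scrape_once()
        except Exception:
            # A failed attempt simply counts as an empty result
            listings = []
        if len(listings) > len(best):
            best = listings
        if len(best) >= expected:
            # A full page was scraped, no need to try again
            break
    return best
```

Here "best" simply means "the attempt that returned the most listings", which is the only quality signal we have when the cause of the empty pages is unknown.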

Problem #2. Sudden Timeout

By default, the get method of the Python requests library has no timeout, so a request can hang indefinitely. That is risky when we’re trying to access multiple pages in a row, as the website definitely won’t like it. We probably won’t be banned right away, but response times to our requests are likely to increase.

At some point we would have to wait tens of seconds for a single response, so to avoid that we’d better introduce some limits.

Solution

First of all, we limit the waiting time with a timeout argument:

answer = requests.get(page_url, timeout=5)

And we use my beloved try/except statement so that a “Connection timed out” error doesn’t stop us in the middle of the scraping run.

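import requests

try:
    answer = requests.get(page_url, timeout=5)
except requests.exceptions.RequestException:
    # Timeouts and other connection errors: skip this page instead of crashing
    answer = None
    print(f"Request failed for URL: {page_url}")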

Problem #3. A button, not a button

When dealing with dynamic pages we use Selenium, which is capable of mimicking real user behavior: scrolling, clicking, typing, etc. I was ready for an easy walk here, but no such luck.

Life hasn’t prepared me to deal with page elements that are not what they seem to be. At first glance I thought that the buttons “show amenities” and “show price details” were similar: you find them, you click on them, you get the hidden data.

However, it turned out that the former is more like a link and the latter is an actual HTML button tag.

Solution

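That’s not rocket science though. To deal with the link we use the Selenium click command:

from selenium.webdriver.common.by import By
# `webdriver` is the running driver instance; 'some_class' is a placeholder
element = webdriver.find_element(By.CLASS_NAME, 'some_class')
element.click()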

And for the button we have Action Chains:

from selenium.webdriver.common.action_chains import ActionChains
ActionChains(webdriver).click(element).perform()

Problem #4. Element is not in a viewport

One cannot click on a button that is outside the visible area of a web page, the so-called viewport. As Selenium mimics real user behavior, we first have to scroll the page until the button is visible, and only then click it.

Sounds quite easy, but it was a real challenge to scroll properly. Somehow methods like move_to_element do not work right away and we need several attempts.

Solution

Yes, the same technique here. We scroll, try to click, and scroll again if the click failed. I capped the number of attempts at 10, but in most cases it succeeded by the 4th.
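The loop can be sketched roughly like this. It is an illustration under assumptions, not the project’s code: the function name scroll_and_click and the pause between attempts are made up, and scrolling is done with JavaScript’s scrollIntoView through execute_script rather than with move_to_element.

```python
import time


def scroll_and_click(driver, element, max_attempts: int = 10, pause: float = 0.5) -> bool:
    """Scroll the element into view and click it, retrying on failure."""
    for _ in range(max_attempts):
        try:
            # Ask the browser to bring the element to the middle of the viewport
            driver.execute_script(
                "arguments[0].scrollIntoView({block: 'center'});", element)
            element.click()
            return True
        except Exception:
            # Broad on purpose for this sketch; real code could catch
            # selenium.common.exceptions.WebDriverException instead
            time.sleep(pause)
    return False
```

The function returns False instead of raising when all attempts fail, so the scraping run can simply skip the hidden data for that listing and move on.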

Problem #5. Too heavy detail pages

When dealing with detail pages we have to download tons of images before starting the scraping. That leads to longer running times and to less scraped data, as sometimes the page is still not fully loaded after our hard-coded limit (currently it’s 20 seconds).

Solution

Well, here we know what to do 😎. Let’s tell the browser not to load the images. Problem solved!
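For Chrome, one common way to do this is a browser preference. The sketch below assumes Chrome with a matching chromedriver; the article does not say which browser the project uses, and the names NO_IMAGES_PREFS and make_driver_without_images are made up for illustration.

```python
# Chrome content setting: 2 means "block", 1 means "allow"
NO_IMAGES_PREFS = {"profile.managed_default_content_settings.images": 2}


def make_driver_without_images():
    """Start Chrome with image loading disabled (assumes Chrome and chromedriver)."""
    # Imported here so the prefs can be inspected without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", NO_IMAGES_PREFS)
    return webdriver.Chrome(options=options)
```

Other browsers have their own switches for this, so the preference name above should be treated as Chrome-specific.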

Problem #6. Damn cookie popups

Nowadays browsing through the web is getting harder and harder. We have to read a bunch of different cookie agreements and explicitly give our consent to be tracked. Airbnb follows this obligation, so we might stumble upon an unexpected pop-up window.

This pop-up can get in the way when we try to click on the “amenities” or “price details” buttons.

Solution

Just clicking OK will be enough here. We simply have to be sure there’s something to click on. The “Try and Except” statement will save us again.
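A minimal sketch of that idea follows. The XPath for the OK button is a hypothetical placeholder (the real markup of the Airbnb pop-up is not shown in this article), and accept_cookies is an illustrative name.

```python
def accept_cookies(driver, xpath: str = "//button[text()='OK']") -> bool:
    """Click the cookie consent button if it is present; return True on success."""
    try:
        # "xpath" is the value of Selenium's By.XPATH constant
        driver.find_element("xpath", xpath).click()
        return True
    except Exception:
        # No popup on this page (or the click failed): carry on scraping
        return False
```

Calling it once right after the page loads, before looking for the other buttons, should be enough, since a missing pop-up just returns False.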

🎉🎉🎉

Special thanks to Oğuz Erdoğan for identifying Problems #5 and #6 and coming up with solutions to them.

🎉🎉🎉

Problem #7. Airbnb is playing with me

All previous problems could be solved in one manner or another. The worst case is just brute-forcing until we succeed, like in problems #1 and #4. But there is one aspect of scraping that makes us extremely vulnerable to changes.

If Airbnb alters its CSS styles, we’re doomed. And it’s more correct to say “when”, not “if”. Since I started this project ~2 months ago, Airbnb has already introduced some new styles and got rid of old ones.

For example, the listing price:

old_class = '_1p7iugi'
new_class = '_olc9rf0'

Very often 2 (or maybe more) versions of the styles exist in parallel, served to different versions of the pages during A/B testing. And we don’t know in advance which version we landed on.

(Re)solution

Relying on exact CSS classes is not a long-term solution. Unfortunately, I haven’t found an easier one. But I hope that for our educational project it will be sufficient.
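While two versions coexist, one stopgap is to try every known class name in turn. The sketch below assumes a BeautifulSoup-style object with a find(class_=...) method; the helper name find_first_by_class is made up, and the two class names are the ones from the listing price example. It still breaks as soon as Airbnb introduces a third class.

```python
# Known class names for the listing price, newest first
PRICE_CLASSES = ['_olc9rf0', '_1p7iugi']


def find_first_by_class(soup, class_names):
    """Return the first element matching any of the candidate classes, else None."""
    for name in class_names:
        element = soup.find(class_=name)
        if element is not None:
            return element
    return None
```

A None result is a useful signal in itself: if it shows up for every listing, the styles have most likely changed again.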

Finally

All scripts are available on Github:

Feel free to watch our webinar where we discussed the first part of this article — Scraping Airbnb website with Python and Selenium:

https://youtu.be/L8ooiuBnZ8M

We’ve just started, feel free to join us on:
