More details to Web Scraping with Python and Selenium

This article is a part of an Educational Data Science Project — Airbnb Analytics

Published in Analytics Vidhya, Mar 5, 2021 (6 min read)

The idea is to implement an example Data Science project using the Airbnb website as a source of data. Please read the stories here:

Part 0 - Intro to the project
Part 1 - Scrape the data from Airbnb website
Part 2 - More details to Web Scraping (this article)
Part 3 - Explore and clean the data set
Part 4 - Build a machine learning model for listing price prediction
Part 5 - Explore the results and apply the model

Intro

After writing the first version of a scraper, I devoted some time to testing the script and exploring how well it does the job. Several issues were identified and today I’m gonna write about how we can solve them.

Problem #1 — Sometimes nothing is scraped
Problem #2 — Sudden Timeout
Problem #3 — A button, not a button
Problem #4 — Element is not in a viewport
Problem #5 — Too heavy detail pages
Problem #6 — Damn cookie popups
Problem #7 — Airbnb is playing with me

Expectation management

What could be easier than scraping a webpage? We’re smart and can wrap our thoughts in a script:

Step 1 — Get the link
Step 2 — Send a GET request
Step 3 — Process the answer with BeautifulSoup
Step 4 — PROFIT

Well.

Right. But what if…

Our results are inconsistent?

Needed page elements are hidden beneath buttons and popups?

Get request times out?

At this point, one realizes scraping is not such an easy task anymore.

That was me some weeks ago —😭. But I did not surrender — 🐱‍👤. And started a long journey of continuous improvements.

By the way, don’t be like me. Stop chasing perfection. Remember the Pareto 80-20 principle.

Problem #1. Sometimes nothing is scraped

I am not very confident about my programming skills. Because of that, I have a habit of writing code in small chunks. I write 3–4 lines, check how they work, and continue if everything’s fine. Following this principle, I wrote a function that extracts listings from an Airbnb search page.

It always performed as expected while testing, meaning that a single page returned 20 separate listings. But when I was about to scrape a couple of locations at once, it turned out that some pages were either empty or contained only 4–5 listings.

I spent a couple of days exploring and trying to understand what exactly was happening. That was quite a frustrating time, honestly.

Solution

Eventually, I didn’t find a reason, gave up, and decided to solve this issue in a very straightforward way. The updated function tries to scrape the page multiple times and returns the best result.
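Here is a minimal sketch of that idea, not the actual function from the repository. It assumes "best" means "the attempt with the most listings" and that a full page has 20 of them, as observed above. The names scrape_with_retries, scrape_once and max_attempts are made up for this illustration.

```python
from typing import Callable, List


def scrape_with_retries(
    scrape_once: Callable[[], List[str]],
    expected: int = 20,
    max_attempts: int = 5,
) -> List[str]:
    """Call scrape_once up to max_attempts times and keep the longest result."""
    best: List[str] = []
    for _ in range(max_attempts):
        try:
            listings = scrape_once()
        except Exception:
            # A failed attempt simply counts as an empty page
            listings = []
        if len(listings) > len(best):
            best = listings
        if len(best) >= expected:
            # A full page was found, no need to try again
            break
    return best


# Hypothetical usage: extract_listings stands in for the real scraping function
# listings = scrape_with_retries(lambda: extract_listings(page_url))
```

Stopping early once a full page arrives keeps the extra requests to a minimum, so the retries only cost time on the pages that actually misbehave.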

Problem #2. Sudden Timeout

By default, the get method from the requests Python library doesn’t have any timeout. That is risky when we’re trying to access multiple pages in a row, as the website definitely won’t like it. We probably won’t be banned right away, but response times to our requests are likely to increase.

At some point we would have to wait for tens of seconds per request, so to avoid that we had better introduce some limits.

Solution

First of all, we limit the waiting time with a timeout argument:

answer = requests.get(page_url, timeout=5)

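Then we wrap the call in my beloved try/except statement, so that a “Connection timed out” error doesn’t stop the scraping run halfway through.

import requests

try:
    answer = requests.get(page_url, timeout=5)
except requests.exceptions.Timeout:
    # Skip this page instead of aborting the whole run
    answer = None
    print(f"Connection timed out for URL: {page_url}")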

Problem #3. A button, not a button

When dealing with dynamic pages we use Selenium, which is capable of mimicking real user behavior: scrolling, clicking, typing, etc. I was ready for an easy walk here, but no such luck.

Life hasn’t prepared me to deal with page elements that are not the ones they seem to be. At first glance I thought that the buttons “show amenities” and “show price details” are similar: you find them, you click on them, you get the hidden data.

However, it turned out that the former is more like a link, while the latter is a real HTML button tag.

Solution

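That’s not rocket science though. To deal with the link we use the Selenium click command:

from selenium.webdriver.common.by import By
# driver is the running Selenium WebDriver instance
element = driver.find_element(By.CLASS_NAME, 'some_class')
element.click()

And for the button we have Action Chains:

from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).click(element).perform()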

Problem #4. Element is not in a viewport

One cannot click on a button that is outside the visible area of a web page, the so-called viewport. As Selenium mimics real user behavior, we first have to scroll the page to see the button before clicking it.

Sounds quite easy, but it was a real challenge to scroll properly. Somehow methods like move_to_element do not work right away and we need several attempts.

Solution

Yes, the same technique here. We scroll, try to click, and scroll again if the click failed. I capped the number of attempts at 10, but in most cases it succeeded by the fourth.
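The loop could look roughly like the sketch below. It is an illustration under a few assumptions: it scrolls with JavaScript's scrollIntoView through execute_script rather than with move_to_element, it uses a plain element.click(), and the names click_with_scroll, pause and retry_on are invented for this example. In a real run it would make sense to pass Selenium's own exception classes as retry_on instead of the catch-all default.

```python
import time


def click_with_scroll(driver, element, max_attempts=10, pause=0.5,
                      retry_on=(Exception,)):
    """Scroll the element into view and click it, retrying on failure.

    Returns the attempt number that succeeded, or 0 if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        # Ask the browser to bring the element to the middle of the viewport
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", element
        )
        try:
            element.click()
            return attempt
        except retry_on:
            # Give the page a moment to settle before scrolling again
            time.sleep(pause)
    return 0
```

Returning the attempt number makes it easy to log how many tries each click needed, which is how a cap like 10 can be sanity-checked.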

Problem #5. Too heavy detail pages

When dealing with detail pages we have to download tons of images before starting the scraping. That leads to longer running times and to less scraped data, as sometimes the page is still not fully loaded after our hard-coded limit (currently it’s 20 seconds).

Solution

Well, here we know what to do 😎. Let’s tell the browser not to load the images. Problem solved!
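A possible way to do that, assuming the scraper drives Chrome (the browser is my assumption here), is to set a Chrome preference before the driver starts. The helper name disable_images is made up for this sketch.

```python
def disable_images(options):
    """Tell Chrome not to download images (2 means "block" for this setting)."""
    prefs = {"profile.managed_default_content_settings.images": 2}
    options.add_experimental_option("prefs", prefs)
    return options


# Hypothetical usage with Selenium and Chrome (requires selenium to be installed):
# from selenium import webdriver
# options = disable_images(webdriver.ChromeOptions())
# driver = webdriver.Chrome(options=options)
```

Other browsers need their own settings, so treat this as a Chrome-only example.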

Problem #6. Damn cookie popups

Nowadays browsing through the web is getting harder and harder. We have to read a bunch of different cookie agreements and explicitly give our consent to be tracked. Airbnb follows this obligation, so we might stumble upon an unexpected pop-up window.

This unexpected cookie policy pop-up can get in the way of clicking on the “amenities” or “price details” buttons.

Solution

Just clicking OK will be enough here. We simply have to be sure there’s something to click on. The “Try and Except” statement will save us again.
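As a rough sketch, the check can be wrapped in a small helper. The name dismiss_cookie_popup, the missing parameter and the selector in the usage comment are hypothetical; the real selector has to be looked up on the page. In a real run, Selenium's NoSuchElementException would be a tighter choice for missing than the catch-all default.

```python
def dismiss_cookie_popup(driver, selector, missing=(Exception,)):
    """Click the consent button if it is present; return True if clicked."""
    try:
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
        button = driver.find_element("css selector", selector)
        button.click()
        return True
    except missing:
        # No pop-up (or it could not be clicked): carry on with scraping
        return False


# Hypothetical usage, the selector below is made up:
# dismiss_cookie_popup(driver, "button.accept-cookies")
```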

🎉🎉🎉
Special thanks to Oğuz Erdoğan for identifying Problems #5 and #6 and coming up with solutions to them.
🎉🎉🎉

Problem #7. Airbnb is playing with me

All previous problems could be solved one way or another. The worst case is just brute-forcing until we succeed, like in problems #1 and #4. But there is one aspect of scraping that makes us extremely vulnerable to changes.

If Airbnb alters CSS styles, we’re doomed. And it’s more accurate to say “when”, not “if”. Since I started this project ~2 months ago, Airbnb has already introduced some new styles and got rid of the old ones.

For example, the listing price:

old_class = '_1p7iugi'
new_class = '_olc9rf0'

Very often 2 (or maybe more) versions of styles exist in parallel, for different versions of the pages during some A/B testing. And we don’t know in advance which version we landed on.

(Re)solution

Relying on exact CSS classes is not a long-term solution. Unfortunately, I haven’t found an easier one. But I hope that for our educational project it will be sufficient.
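One stopgap for the parallel versions is to try every known class in turn and take the first hit. This is only a sketch of that fallback, still tied to exact class names. The helper find_first and the list PRICE_CLASSES are invented for this example; soup can be any object with a BeautifulSoup-style find(class_=...) method.

```python
PRICE_CLASSES = ["_olc9rf0", "_1p7iugi"]  # new and old class from the example above


def find_first(soup, class_names):
    """Return the first element matching any of the class names, else None."""
    for class_name in class_names:
        element = soup.find(class_=class_name)
        if element is not None:
            return element
    return None


# Hypothetical usage with a parsed page:
# price_tag = find_first(soup, PRICE_CLASSES)
```

A None result is a useful signal on its own: it usually means a new style has appeared and the list needs another entry.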

Finally

All scripts are available on Github:

https://github.com/x-technology/airbnb-analytics

Feel free to watch our webinar where we discussed the first part of this article — Scraping Airbnb website with Python and Selenium:

https://youtu.be/L8ooiuBnZ8M

We’ve just started, feel free to join us on:


Written by Igor' Smirnov
