More Details on Web Scraping with Python and Selenium
This article is a part of an Educational Data Science Project — Airbnb Analytics
The idea is to implement an example Data Science project using the Airbnb website as a source of data. Please read the stories here:
Part 0 - Intro to the project
Part 1 - Scrape the data from Airbnb website
Part 2 - More details to Web Scraping (this article)
Part 3 - Explore and clean the data set
Part 4 - Build a machine learning model for listing price prediction
Part 5 - Explore the results and apply the model
Intro
After writing the first version of the scraper, I devoted some time to testing the script and exploring how well it does the job. Several issues were identified, and today I’m going to write about how we can solve them.
- Problem #1 — Sometimes nothing is scraped
- Problem #2 — Sudden Timeout
- Problem #3 — A button, not a button
- Problem #4 — Element is not in a viewport
- Problem #5 — Too heavy detail pages
- Problem #6 — Damn cookie popups
- Problem #7 — Airbnb is playing with me
Expectation management
What could be easier than scraping a webpage? We’re smart and can wrap our thoughts in a script:
Step 1 — Get the link
Step 2 — Send a GET request
Step 3 — Process the answer with BeautifulSoup
Step 4 — PROFIT
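In code, those four steps boil down to just a few lines. Here is a minimal sketch of the naive approach; the search URL and the CSS class are placeholders, not Airbnb's real ones:
import requests
from bs4 import BeautifulSoup

page_url = 'https://www.airbnb.com/s/London/homes'  # placeholder search URL
answer = requests.get(page_url)                     # send a GET request
soup = BeautifulSoup(answer.text, 'html.parser')    # parse the response
listings = soup.find_all('div', class_='listing_class')  # placeholder class name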
Well.
Right. But what if…
Our results are inconsistent?
The page elements we need are hidden behind buttons and popups?
The GET request times out?
At this point, one realizes scraping is not such an easy task after all.
That was me a few weeks ago 😭. But I did not surrender 🐱👤 and started a long journey of continuous improvements.
By the way, don’t be like me. Stop chasing perfection. Remember the Pareto 80/20 principle.
Problem #1. Sometimes nothing is scraped
I am not very confident about my programming skills. Because of that, I have a habit of writing code in small chunks. I write 3–4 lines, check how they work, and continue if everything’s fine. Following this principle, I wrote a function that extracts listings from an Airbnb search page.
It always performed as expected while testing, meaning that a single page returned 20 separate listings. But when I was about to scrape a couple of locations at once, it turned out that some pages were either empty or contained only 4–5 listings.
I spent a couple of days exploring and trying to understand what exactly was happening. That was quite a frustrating time, honestly.

Solution
Eventually, I couldn’t find the reason, gave up, and decided to solve the issue in a very straightforward way: the updated function tries to scrape the page multiple times and returns the best result.
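Here is a minimal sketch of that retry logic; extract_listings() stands in for the function that parses one search page and returns its listings:
def scrape_with_retries(page_url, attempts=5):
    # Scrape the same page several times and keep the best attempt
    best_result = []
    for _ in range(attempts):
        listings = extract_listings(page_url)  # placeholder parsing function
        if len(listings) > len(best_result):
            best_result = listings
        if len(best_result) >= 20:  # a full search page holds 20 listings
            break
    return best_result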
Problem #2. Sudden Timeout
By default, the GET method from the requests Python library doesn’t have any timeout restriction. That is risky when we’re trying to access multiple pages in a row, as the website definitely won’t like it. We probably won’t be banned right away, but the response times to our requests are likely to increase.
At some point we would have to wait tens of seconds for a single response, so to avoid that we’d better introduce some limitations.
Solution
First of all, we limit the waiting time with a timeout argument:
answer = requests.get(page_url, timeout=5)
And we wrap it in my beloved try/except statement so that a "Connection timed out" error doesn’t crash the run in the middle of the scraping:
import requests

try:
    answer = requests.get(page_url, timeout=5)
except requests.exceptions.Timeout:
    answer = None
    print(f"Connection timed out for URL: {page_url}")
Problem #3. A button, not a button
When dealing with dynamic pages we use Selenium, which is capable of mimicking real user behavior: scrolling, clicking, typing, etc. I was ready for an easy walk here, but no such luck.
Life hadn’t prepared me to deal with page elements that are not what they seem to be. At first glance, I thought the "show amenities" and "show price details" buttons were similar: you find them, you click them, you get the hidden data.

However, it turned out that the former behaves more like a link, while the latter is an actual HTML button tag.
Solution
That’s not rocket science, though. To deal with the link, we use Selenium’s click() method:
element = webdriver.find_element_by_class_name('some_id')
element.click()
And for the button we have Action Chains:
ActionChains(webdriver).click(element).perform()
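Put together, a self-contained sketch looks roughly like this; the listing URL, the class names, and the Chrome driver setup are placeholders, not the exact ones from the project:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get('https://www.airbnb.com/rooms/12345')  # placeholder listing URL

# The link-like "show amenities" element: a plain click is enough
amenities = driver.find_element_by_class_name('amenities_class')  # placeholder class
amenities.click()

# The real <button> "show price details": drive it through ActionChains
price_button = driver.find_element_by_class_name('price_class')  # placeholder class
ActionChains(driver).click(price_button).perform()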
Problem #4. Element is not in a viewport
One cannot click a button that is outside the visible area of a web page, the so-called viewport. Since Selenium mimics real user behavior, we first have to scroll the page to bring the button into view before clicking it.
Sounds quite easy, but it was a real challenge to scroll properly. Somehow, methods like move_to_element do not work right away, and we need several attempts.

Solution
Yes, it’s the same technique here: we scroll, try to click, and scroll again if it failed. I capped the number of attempts at 10, but in most cases it succeeded by the fourth try.
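A rough sketch of that loop is below; driver and element are assumed to exist already, and the scroll step is an arbitrary choice rather than the exact original value:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import WebDriverException

def click_with_scroll(driver, element, max_attempts=10):
    # Try to bring the element into the viewport and click it,
    # scrolling a bit further after every failed attempt
    for _ in range(max_attempts):
        try:
            ActionChains(driver).move_to_element(element).perform()
            element.click()
            return True
        except WebDriverException:
            driver.execute_script("window.scrollBy(0, 400);")
    return False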
Problem #5. Too heavy detail pages
When dealing with detail pages, the browser has to download tons of images before we can start scraping. That leads to longer running times and less scraped data, as sometimes the page is still not fully loaded when our hard-coded waiting limit (currently 20 seconds) runs out.
Solution
Well, here we know what to do 😎. Let’s tell the browser not to load the images. Problem solved!
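For Chrome, a minimal sketch of that setting looks like this; the preference key is Chrome-specific, and other browsers need different options:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# The value 2 means "block": the browser skips downloading images entirely
chrome_options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(options=chrome_options)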
Problem #6. Damn cookie popups
Nowadays, browsing the web is getting harder and harder. We have to read a bunch of different cookie agreements and explicitly give our consent to be tracked. Airbnb follows this obligation, so we might stumble upon an unexpected pop-up window.

It might happen that this unexpected cookie pop-up interferes with clicking the "amenities" or "price details" buttons.
Solution
Just clicking OK is enough here; we simply have to be sure there’s something to click on. The try/except statement saves us again.
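A minimal sketch of that check; the class name of the OK button is a placeholder and has to be looked up in the actual page source:
from selenium.common.exceptions import NoSuchElementException

try:
    ok_button = driver.find_element_by_class_name('cookie_ok_class')  # placeholder class
    ok_button.click()
except NoSuchElementException:
    # No cookie pop-up on this page, nothing to dismiss
    pass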
🎉🎉🎉
Special thanks to Oğuz Erdoğan for identifying Problems #5 and #6 and coming up with solutions to them.
🎉🎉🎉
Problem #7. Airbnb is playing with me
All the previous problems could be solved in one manner or another; the worst case is just brute-forcing until we succeed, as in problems #1 and #4. But there is one aspect of scraping that makes us extremely vulnerable to changes on the website.
If Airbnb alters its CSS styles, we’re doomed. And it’s more correct to say "when" rather than "if": since I started this project ~2 months ago, Airbnb has already introduced some new styles and gotten rid of old ones.
For example, the listing price:
old_class = '_1p7iugi'
new_class = '_olc9rf0'
Very often, two (or maybe more) versions of the styles exist in parallel, serving different versions of the pages during A/B testing. And we don’t know in advance which version we landed on.
(Re)solution
Relying on exact CSS classes is not a long-term solution. Unfortunately, I haven’t found an easier one. But I hope that for our educational project it will be sufficient.
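One pragmatic workaround is to check every class name we know about and use whichever one matches. A sketch of that idea, assuming the price sits in a <span> tag (the tag may differ on the live site):
from bs4 import BeautifulSoup

PRICE_CLASSES = ['_1p7iugi', '_olc9rf0']  # old and new price classes

def extract_price(soup: BeautifulSoup):
    # Try each known class in turn and return the first match
    for css_class in PRICE_CLASSES:
        element = soup.find('span', class_=css_class)
        if element is not None:
            return element.get_text()
    return None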
Finally
All scripts are available on Github:
Feel free to watch our webinar where we discussed the first part of this article — Scraping Airbnb website with Python and Selenium:
We’ve just started, feel free to join us on: