Advanced web scraping in Python

The essential tips for advanced web scraping in Python.

I recently created and open-sourced PyMedium, an unofficial Medium API that gives developers an easy way to access Medium.

One of the APIs in PyMedium parses post content. At first I tried to do this with a simple web scraping technique. Following the usual web scraping process, I used “Inspect Element” in Chrome to find the tag pattern of the post content (right-click on the title element of a Medium post page and select Inspect Element):

[Screenshot: Right-click on the title element and select Inspect Element]



The post content lives in <div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228"..., so I wrote some simple Python code to get the tag:

# -*- coding: utf8 -*-
import requests
from bs4 import BeautifulSoup
__author__ = "Engine Bai"
url = "..."  # URL of the Medium post to scrape
req = requests.get(url)
html = BeautifulSoup(req.text, "html.parser")
# Look for the post content <div> seen in Inspect Element
content_tag = html.find("div", attrs={"class": "postArticle-content",
                                      "data-source": "post_page"})
print(content_tag)

Then I executed it to print the tag:

(env)$ python
None

The output is None. What happened? 🤔🤔🤔

I went back to Chrome to double-check that the tag I searched for was correct. Then I tried another way to find the tag pattern: “View source” in Chrome (right-click on any page element and select View source):

[Screenshot: Right-click on any page element and select View source]
[Screenshot: The source code of the web page]

I found that there is a small difference between the results of Inspect Element and View source: the page source contains no tag like <div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228".... So I concluded that the post content is generated by JavaScript.

You may wonder: how can I tell the difference? How can I know whether a web page is generated dynamically by JavaScript or is just a simple static page?

It’s not too difficult. The easy way is to let Chrome tell us the difference, using the two ways of finding the post content tag pattern that we used above:

  1. “View source” (right-click on any page element and select View source): shows the actual source code of the web page, without executing any scripts. This is what a simple web scraper gets.
  2. “Inspect element” (right-click on the title element and select Inspect Element): shows the HTML after all of the page’s source code, including JavaScript, has been executed, so it includes the dynamic content. A simple web scraper can’t get dynamic content; it needs another technique for that.

If you find tags in Inspect Element that you can’t find in the source code, those tags are generated by JavaScript, and you need a particular technique to get them. If you can find the tags in the source code, simple web scraping is enough to get them.
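The same check can be roughed out in code. The sketch below is only an illustration, not part of PyMedium: it uses the standard library’s html.parser in place of BeautifulSoup so it runs without extra installs, and the helper name in_static_source and the two sample pages are made up for the example. It answers one question: does a tag with the given class appear in the raw, unexecuted HTML?

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.found = True


def in_static_source(raw_html, class_name):
    """Return True if class_name appears on a tag in the raw, unexecuted HTML."""
    finder = ClassFinder(class_name)
    finder.feed(raw_html)
    finder.close()
    return finder.found


# Hypothetical page sources, made up for illustration.
static_page = (
    '<html><body>'
    '<div class="postArticle-content js-postField">Hello</div>'
    '</body></html>'
)
dynamic_page = (
    '<html><body><div id="root"></div>'
    '<script>render("postArticle-content")</script>'
    '</body></html>'
)

print(in_static_source(static_page, "postArticle-content"))   # True
print(in_static_source(dynamic_page, "postArticle-content"))  # False
```

If the function returns False for the HTML that requests downloads, yet the tag is visible in Inspect Element, the tag is most likely being generated by JavaScript.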

OK, back to our program. In this situation, the problem is:

>> How can I get the tags generated by JavaScript?

All our program has to do is simulate a browser: execute the whole page source, including all the JavaScript, and then get the tag from the generated page.

Selenium, or another browser automation tool, can help. Here I use Selenium, a popular choice, to drive Chrome. You have to download and install it, along with chromedriver, first. The following code uses Selenium to get the Medium post content tags:

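# -*- coding: utf8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
__author__ = "Engine Bai"
url = "..."  # URL of the Medium post to scrape
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("./driver/chromedriver"))
driver.get(url)  # load the page so that its JavaScript runs
content_element = driver.find_element(By.CLASS_NAME, "postArticle-content")
content_html = content_element.get_attribute("innerHTML")
driver.quit()  # close the browser once we have the HTML
soup = BeautifulSoup(content_html, "html.parser")
p_tags = soup.find_all("p")
for p in p_tags:
    print(p.get_text())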

A useful technique for getting the HTML out of Selenium once you have located an element is content_element.get_attribute("innerHTML"). When you execute the code, it opens Chrome, loads the URL you specify, and gets the post content tags to parse.

[Screenshot: Result of executing the Selenium code]
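To try the parsing step on its own, without opening a browser, here is a minimal sketch that pulls paragraph text out of an innerHTML string. It is an illustration under assumptions: it swaps BeautifulSoup for the standard library’s html.parser so it has no third-party dependencies, and the names ParagraphCollector, extract_paragraphs and sample_inner_html are hypothetical, with made-up sample markup standing in for what Selenium would return.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element, including nested inline tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._chunks).strip())
            self._in_p = False

    def handle_data(self, data):
        # Only keep text that sits inside a <p> element.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(inner_html):
    """Return the text of every <p> in an HTML fragment, in document order."""
    collector = ParagraphCollector()
    collector.feed(inner_html)
    collector.close()
    return collector.paragraphs


# Made-up fragment standing in for content_element.get_attribute("innerHTML").
sample_inner_html = (
    "<h1>Title</h1>"
    "<p>First <em>paragraph</em>.</p>"
    "<p>Second paragraph.</p>"
)

for text in extract_paragraphs(sample_inner_html):
    print(text)
```

Keeping the parsing in a function that takes a plain string means it can be checked against saved HTML, and only the fetching step needs a real browser.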

OK, it’s done! Now I can continue with the rest of the parsing flow to get what I want from the Medium post. This is my repository:

Feel free to star my repository and like this post. ❤️

Happy coding!!!

11 thoughts on “Advanced web scraping in Python”


  1. Thanks so much for sharing this tip! I’ve used web scraping several times and never knew that there was a difference between the output of inspect element and view source. This is definitely going to come in handy!

  2. It’s the best way to scrape JS content, but it’s a bit slow. You can try Scrapy + Selenium.

    1. Yes, it will start a new Chrome session without any cookies (like incognito mode) and request the page so that it gets rendered.

    1. Yes, however, I found that Medium keeps changing the way it displays content, maybe because of A/B testing. It’s hard to maintain the parsing method.

  3. Hello! This is my first comment here so I just wanted to give a quick shout out and tell you I truly enjoy reading your blog posts. Can you suggest any other blogs/websites/forums that go over the same topics? Thanks for your time!

  4. Superb post, but I was wanting to know if you could write a little more on this subject? I’d be very thankful if you could elaborate a little bit further. Many thanks!

  5. Greetings! I know this is kind of off topic but I was wondering which blog platform
    are you using for this website? I’m getting tired of WordPress because I’ve had
    problems with hackers and I’m looking at alternatives for another platform.
    It would be awesome if you could point me in the direction of a good
