Essential tips for advanced web scraping in Python.
I recently created and open-sourced an unofficial Medium API called PyMedium, which gives developers an easy way to access Medium.
One of the APIs in PyMedium parses post content, and at first I tried to implement it with a simple web scraping technique. Following the usual web scraping process, I started with "Inspect Element" in Chrome to find the tag pattern of the post content (right-click on the title element of a Medium post page and select Inspect Element):


Obviously, the post content lives under the tag <div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228"..., so I wrote some simple Python code to get it:
#!/usr/bin/python3
# -*- coding: utf8 -*-
import requests
from bs4 import BeautifulSoup

__author__ = "Engine Bai"

url = "https://medium.com/dualcores-studio/make-an-android-custom-view-publish-and-open-source-99a3d86df228"

# Download the page and parse it with BeautifulSoup.
req = requests.get(url)
html = BeautifulSoup(req.text, "html.parser")

# Look for the post content tag we found via "Inspect Element".
content_tag = html.find("div", attrs={"class": "postArticle-content",
                                      "data-source": "post_page"})
print(content_tag)
Then execute it to print the tag:
(env)$ python medium_scraper.py
None
The output is None. What happened? 🤔🤔🤔
I went back to Chrome to double-check that the tag I had searched for was correct. Then I tried another way to find the tag pattern: "View Source" in Chrome (right-click anywhere on the page and select View Page Source):


And I found that there is a slight difference between the results of Inspect Element and View Source: the page source contains no tag like <div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228".... So I concluded that the post content is generated by JavaScript.
You may wonder: how can you tell the difference? How do you know whether a web page is generated dynamically by JavaScript or is just a simple static page?
It's not too difficult. The easy way is to let Chrome tell the difference, using the same two ways of finding the post content tag pattern that we used above:
- "View Source" (right-click anywhere on the page and select View Page Source): shows the actual source code of the web page, without executing any scripts. This is what a simple web scraper gets.
- "Inspect Element" (right-click on any element and select Inspect Element): shows the HTML after all of the page's source code, including JavaScript, has executed, so it contains the dynamic content. A simple web scraper can't get dynamic content; it needs a particular technique to do the job.
If tags appear in Inspect Element but you can't find them in the page source, they are generated by JavaScript, and you need a particular technique to get them. If you can find the tags in the page source, simple web scraping is enough. You can even automate this check, as in the sketch below.
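Here is a minimal sketch of that check (using the same URL and class name as the code above; substitute your own). It fetches only the raw HTML that the server returns, which is exactly what "View Source" shows:
#!/usr/bin/python3
# -*- coding: utf8 -*-
import requests
from bs4 import BeautifulSoup

url = "https://medium.com/dualcores-studio/make-an-android-custom-view-publish-and-open-source-99a3d86df228"

# Fetch the raw page source: no JavaScript is executed here.
raw_html = requests.get(url).text
soup = BeautifulSoup(raw_html, "html.parser")

# If the tag is visible in "Inspect Element" but missing here,
# it is generated by JavaScript.
if soup.find("div", attrs={"class": "postArticle-content"}) is None:
    print("Tag not in page source: likely generated by JavaScript")
else:
    print("Tag in page source: simple web scraping will work")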
OK, back to our program. In this situation, here is the problem:
>> How can I get the tags generated by JavaScript?
All our program has to do is simulate a browser executing the whole page, including all of the JavaScript, and then grab the tag from the generated page.
Selenium or other web drivers can help. Here I use the popular one, Selenium, as the web driver; you have to download and install it (along with the matching ChromeDriver executable) first. The following code uses Selenium to get the Medium post content tags:
#!/usr/bin/python3
# -*- coding: utf8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup

__author__ = "Engine Bai"

url = "https://medium.com/dualcores-studio/make-an-android-custom-view-publish-and-open-source-99a3d86df228"

# Launch Chrome via the ChromeDriver executable and load the page,
# letting the browser execute all of the JavaScript.
driver = webdriver.Chrome(executable_path="./driver/chromedriver")
driver.get(url)

# Grab the rendered content element and extract its inner HTML.
content_element = driver.find_element_by_class_name("postArticle-content")
content_html = content_element.get_attribute("innerHTML")

# Hand the rendered HTML back to BeautifulSoup for the usual parsing flow.
soup = BeautifulSoup(content_html, "html.parser")
p_tags = soup.find_all("p")
for p in p_tags:
    print(p.prettify())

driver.close()
There is a useful technique for getting HTML tags out of Selenium after you find an element by some specification: content_element.get_attribute("innerHTML").
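A related property is outerHTML, which works the same way through get_attribute (this is standard DOM behavior, sketched here with the content_element from the code above):
# innerHTML returns only the element's children.
inner_html = content_element.get_attribute("innerHTML")
# outerHTML includes the element's own opening and closing tags as well.
outer_html = content_element.get_attribute("outerHTML")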
Execute the code: it will open Chrome, load the URL you specify, and get the post content tags to parse.
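If you would rather not have a Chrome window pop up every time, Chrome can also run headless. The following is a minimal sketch under a few assumptions: the same chromedriver path as above, and a Chrome/Selenium version that supports the --headless flag (newer Selenium versions take options= instead of chrome_options=). It also adds an explicit wait so the JavaScript-generated content has time to appear before we look for it:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://medium.com/dualcores-studio/make-an-android-custom-view-publish-and-open-source-99a3d86df228"

# Run Chrome without opening a visible window.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(executable_path="./driver/chromedriver",
                          chrome_options=options)
driver.get(url)

# Wait up to 10 seconds for the JavaScript-generated element to show up.
content_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "postArticle-content")))
print(content_element.get_attribute("innerHTML"))
driver.close()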

OK, it's done! Now I can continue the rest of the parsing flow to get what I want from the Medium post. This is my repository: https://github.com/enginebai/PyMedium
Feel free to star my repository and like this post. ❤️
Happy coding!!!
Thanks so much for sharing this tip! I’ve used web scraping several times and never knew that there was a difference between the output of inspect element and view source. This is definitely going to come in handy!
It's the best way to scrape JS content, but it's a bit slow. You can try Scrapy + Selenium.
And what happens if you run the script? Does Chrome start?
Yes, it will start a new Chrome session without any cookies (like incognito mode) and request the web page so it renders.
Do you think it is possible to use PyMedium to retrieve a list of all titles, author names, author IDs, and URLs from the "has-recommended" list of a given user, as we can see on the https://medium.com/@username/has-recommended page?
Yes. However, I found that Medium keeps changing the way it displays content, maybe A/B testing. It's hard to maintain the parsing method.
Definitely been running into issues with this. Thanks for the solution!
You can try ScrapeStorm, which extracts the data you want very quickly.