Advanced web scraping in Python

The essential tips for advanced web scraping in Python.

I recently created and open-sourced an unofficial Medium API called PyMedium, which gives developers an easy way to access Medium.

One of the APIs in PyMedium parses post content, and at first I tried to do this with plain web scraping. Following the normal scraping process, I used “Inspect element” in Chrome to find the tag pattern of the post content (right-click on the title element on a Medium post page and select Inspect Element):

[Screenshot: right-click on the title element and select Inspect Element]

[Screenshot: the inspected Medium post]
Source: https://medium.com/dualcores-studio/make-an-android-custom-view-publish-and-open-source-99a3d86df228

The post content clearly lives in <div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228"..., so I wrote some simple Python code to get the tag:
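The original listing did not survive in this copy of the post, so here is a minimal sketch of what such a first attempt might look like, not the author's actual script. It uses only the standard library (`urllib.request` and `html.parser`). The helper names `DivClassFinder`, `find_div_by_class`, `fetch_html`, and `main`, and the `User-Agent` header, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

POST_URL = (
    "https://medium.com/dualcores-studio/"
    "make-an-android-custom-view-publish-and-open-source-99a3d86df228"
)


class DivClassFinder(HTMLParser):
    """Remember the attributes of the first <div> carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.match = None  # stays None when no such <div> is seen

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.match is not None:
            return
        attributes = dict(attrs)
        # The class attribute may be missing or empty.
        classes = (attributes.get("class") or "").split()
        if self.class_name in classes:
            self.match = attributes


def find_div_by_class(html, class_name):
    """Return the matching <div>'s attributes as a dict, or None."""
    finder = DivClassFinder(class_name)
    finder.feed(html)
    return finder.match


def fetch_html(url):
    """Download the raw page source; no JavaScript is executed."""
    # Assumption: a browser-like User-Agent header, purely illustrative.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def main():
    # Prints the <div>'s attributes, or None when the raw source lacks it.
    print(find_div_by_class(fetch_html(POST_URL), "postArticle-content"))
```

Calling `main()` from a script prints the attributes of the matching `<div>`, or `None` when the downloaded source does not contain it.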

Then I ran it to print the tag:

(env)$ python medium_scraper.py
None

The output is None. What happened? 🤔🤔🤔

I went back to Chrome to double-check that the tag I searched for was correct, then tried another way to find the tag pattern: “View source” in Chrome (right-click on any page element and select View source):

[Screenshot: right-click on any page element and select View source]

[Screenshot: the source code of the web page]

I found a small difference between the results of Inspect element and View source: the page source contains no tag like <div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228".... So I concluded that the post content is generated by JavaScript.

You may wonder how to tell the difference: how can you know whether a web page is generated dynamically by JavaScript or is just a simple static page?

It’s not too difficult. The easy way is to let Chrome tell us, using the same two ways of finding the post content tag pattern that we used above:

  1. “View source” (right-click on any page element and select View source): shows the actual source code of the web page, without executing any scripts. This is what a simple web scraper gets.
  2. “Inspect element” (right-click on the title element and select Inspect Element): shows the HTML after all of the page's code, including JavaScript, has been executed, so it includes the dynamic content. A simple web scraper can’t get dynamic content; it needs an extra technique to do that job.

If a tag appears in Inspect element but not in the page source, it is generated by JavaScript, and you need a particular technique to get it. If you can find the tag in the page source, simple web scraping is enough.
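The same comparison can be scripted. Here is a rough sketch, assuming you have two strings: the raw page source (what View source shows) and the rendered DOM (for example, copied from the Elements panel). The helper names and the two tiny documents are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name used by any tag in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def collect_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


def classes_added_by_javascript(raw_source, rendered_dom):
    """Classes present in the rendered DOM but absent from the raw source."""
    return collect_classes(rendered_dom) - collect_classes(raw_source)


# Tiny made-up documents standing in for a real page.
raw_source = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)
rendered_dom = (
    '<html><body><div id="root">'
    '<div class="postArticle-content js-postField">Hello</div>'
    '</div></body></html>'
)

print(sorted(classes_added_by_javascript(raw_source, rendered_dom)))
# prints ['js-postField', 'postArticle-content']
```

Any class the function reports belongs to an element that exists only after the scripts have run, so a scraper that reads just the raw source will not see it.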


OK, back to our program. In this situation, the problem is:

>> How can I get the tags generated by JavaScript?

Our program has to simulate a browser: execute the whole page source, including all the JavaScript, and then read the tag from the generated page.

Selenium or another web driver can help. Here I use a popular one, Selenium, to drive the browser; you have to download and install it first. The following code uses Selenium to get the Medium post content tags:
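The original listing is missing from this copy of the post, so the following is only a sketch of how it might be done, not the author's code. It assumes the `selenium` package is installed and that Chrome can be launched on your machine (older Selenium versions also need a ChromeDriver binary on the PATH). The function names, the CSS selector shown afterwards, and the 10-second implicit wait are assumptions.

```python
def read_inner_html(driver, url, by, selector):
    """Load url in an already started WebDriver and return the element's inner HTML."""
    driver.get(url)  # the browser downloads the page and runs its JavaScript
    content_element = driver.find_element(by, selector)
    return content_element.get_attribute("innerHTML")


def scrape_with_chrome(url, selector):
    """Start Chrome through Selenium, read the element, then close the browser."""
    # Imported here so read_inner_html can be reused without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Assumption: 10 seconds is long enough for the content to render.
        driver.implicitly_wait(10)
        return read_inner_html(driver, url, By.CSS_SELECTOR, selector)
    finally:
        driver.quit()  # always close the browser, even on errors
```

A call such as `scrape_with_chrome(post_url, "div.postArticle-content")`, where `post_url` is the address of the Medium post, would return the HTML inside the content `<div>` as a string. If the element never appears, `find_element` raises `NoSuchElementException` instead of returning `None`.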

A useful Selenium technique: once you have located an element, content_element.get_attribute("innerHTML") returns the HTML inside it. When you run the code, it opens Chrome, loads the URL you specify, and gets the post content tags for parsing.

[Screenshot: result of executing the Selenium code]

OK, it’s done! Now I can continue with the rest of the parsing flow to get what I want from a Medium post. This is my repository: https://github.com/enginebai/PyMedium

Feel free to star my repository and like this post. ❤️

Happy coding!!!
