As more sites lean on JavaScript to load dynamic data, web scraping is hitting new hurdles. Using Python's urllib2 library alone returns the page source without the dynamic data when a JavaScript library populates the site's information.
Hitting this wall, I decided to take a deep dive into scraping roster data for a local paper I freelance for, to create code replacement files to use in Photo Mechanic this upcoming season. This makes captioning faster and easier, getting photos to the paper quicker.
There are many options, but I went with Selenium. Selenium drives a real browser: it opens a window, renders the page and returns the HTML. That just touches the surface of its power, but it's all that's needed for this exercise.
This is a resource intensive and slow process. There are seemingly faster ways to achieve this, but Selenium looked interesting and seemed like a good skill to have since performance isn’t paramount in this instance.
Loading the site, generating content and returning the rendered HTML
from selenium import webdriver
from bs4 import BeautifulSoup
import time, re, string

path_to_chromedriver = '/Users/usracct/Development/Python/chromedriver'  # change path as needed; can use other browsers
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

url = 'http://urlhere.com'
browser.get(url)
time.sleep(5)  # extra time to let all JS on the site load

# return the rendered HTML and encode to ascii for BeautifulSoup
html = (browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")).encode('ascii', 'ignore')
Once the data is returned, I used BeautifulSoup to parse out the data into the format needed. (Note: I made this much more dynamic, taking the list of schools and using BeautifulSoup and regex to extract the appropriate URLs, replace to the correct page – since the list went to a main page for the school – and then loop through.)
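The loop described in that note can be sketched roughly like this. The link format, the `school-link` class and the `/index` → `/roster` rewrite are all assumptions for illustration; the real site's structure will differ:

```python
import re

# Hypothetical example: the schools list links to each school's main page,
# so rewrite those links to point at the roster page before scraping each one.
schools_html = """
<a class="school-link" href="http://example.com/schools/central/index">Central</a>
<a class="school-link" href="http://example.com/schools/west/index">West</a>
"""

# pull every school main-page URL out of the anchor tags
urls = re.findall(r'href="([^"]+)"', schools_html)

# swap the trailing page for the roster page (this URL pattern is an assumption)
roster_urls = [re.sub(r'/index$', '/roster', u) for u in urls]

for roster_url in roster_urls:
    # browser.get(roster_url)  # then sleep, grab the HTML and parse as above
    print(roster_url)
```

In the real version each `roster_url` would be fed back through the Selenium code above rather than printed.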
Example of finding the stat table using BeautifulSoup:
# parse the rendered HTML, then find the stat block tables
soup = BeautifulSoup(html, 'html.parser')
tables = soup.findAll("table", {"class": "className"})
From there it’s just extracting the information needed and formatting or writing to the desired location.
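As a rough sketch of that last step, here is one way to write Photo Mechanic's tab-delimited code replacement format (code, a tab, then the replacement text, one pair per line). The row data and the "central" code prefix are made-up stand-ins for whatever you parse out of the tables:

```python
# stand-in for (number, name) pairs parsed out of the roster table rows
rows = [('12', 'Jane Smith'), ('34', 'John Doe')]

# Photo Mechanic code replacement files are tab-delimited: code<TAB>replacement
with open('code_replacements.txt', 'w') as f:
    for number, name in rows:
        # typing "central12" in a caption then expands to the player's name
        f.write('central%s\t%s\n' % (number, name))
```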
This is a combination of many articles, forum posts (such as this one, for getting the rendered HTML returned correctly) and other sources, but I thought it might be useful to someone looking to get information from a site with JavaScript at its core. Have fun!