Introduction
In this part we’re going to discuss different data munging techniques.
Data munging is the process of transforming “raw” data into a readable format.
One of the most common cases is scraping data from a website.
HTML
The web is built on Hypertext Markup Language.
HTML forms a tree of nested elements, marked with tags.
<p class="foo">
bar
</p>
This creates a paragraph element.
<img src="url" alt="text">
This creates an image element.
Elements can have attributes that affect their behavior or appearance.
In the examples above, class, src, and alt are all attributes. The class attribute is the most important one for our purposes: it labels a group of elements so that they can be referred to, e.g. from a CSS file. The id attribute can also be used to uniquely identify an element.
In the paragraph element, we have some content in between the tags.
This can be pure text as in the above case, but we can also have new tags in between, or nothing, as in the img tag.
Web scraping
Due to this convenient tree structure that HTML is built upon, information and content can easily be extracted from web pages.
This is called web scraping. In the simplest cases this can be done by writing some code by hand. For more modern and complex websites, where the HTML is automatically generated and not structured for human readers, libraries can do the job for us.
Golden rule of web scraping:
If the user can read it, it can be scraped
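As a rough illustration of the "manual code" approach, the sketch below uses only Python's standard-library html.parser module to pull the text out of paragraphs with a given class. The page content and the ParagraphCollector class are made up for this example; a real scraper would first download the page.

```python
from html.parser import HTMLParser

# Hypothetical page content, reusing the elements shown earlier.
PAGE = """
<html><body>
  <p class="foo">bar</p>
  <img src="url" alt="text">
  <p class="other">baz</p>
</body></html>
"""


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element that has a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False  # True while we are inside a matching <p>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "p" and dict(attrs).get("class") == self.wanted_class:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching paragraph.
        if self.inside:
            self.texts[-1] += data


collector = ParagraphCollector("foo")
collector.feed(PAGE)
foo_texts = [t.strip() for t in collector.texts]
print(foo_texts)
```

Even for this tiny page the bookkeeping is noticeable, which is why dedicated libraries are usually preferred for anything larger.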
BeautifulSoup
BeautifulSoup is a Python library that parses HTML (and XML) documents and creates an abstract tree from the elements.
This enables us to easily navigate the tree and access any tag element along with all its sibling and child elements.
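A minimal sketch of what this can look like, assuming the third-party beautifulsoup4 package is installed. The HTML string is a made-up example, and Python's built-in "html.parser" backend is chosen here only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical document used only for illustration.
html = """
<div id="main">
  <p class="foo">bar</p>
  <p class="foo">baz</p>
  <img src="url" alt="text">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find every <p> with class "foo" and read its text content.
foo_texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="foo")]

# Attributes are available with dictionary-style access.
img_alt = soup.find("img")["alt"]

# Navigate the tree: from the first <p> up to its parent,
# and sideways to its next <p> sibling.
first_p = soup.find("p")
parent_id = first_p.parent["id"]
next_p_text = first_p.find_next_sibling("p").get_text(strip=True)

print(foo_texts, img_alt, parent_id, next_p_text)
```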
Selenium
However, many modern web pages are built with JavaScript and are rendered in the end user’s web browser.
The HTML for these websites isn’t necessarily a complete description of the data; instead, the data is loaded dynamically.
Selenium is a browser automation framework that is often used for testing web pages. It happens that Selenium is also very convenient for web scraping, since we can perform user actions.
So we can click “I accept cookies”, “load next page” etc.
Regular expressions
A regular expression or a regex is a sequence of characters that can match text.
We use regular expressions to:
- Determine if a string matches a pattern completely
- Find the first or all matches of a pattern
- Extract groups that have been matched within the pattern
- Replace the matched text with some other text or a new pattern composed of matched groups.
Matching characters
When matching characters, most characters simply match themselves, but some characters have special meaning:
- . matches any character (except a newline, by default).
- ^ matches the start of a line.
- $ matches the end of a line.
- [acf] matches any one of the characters a, c, f.
- [a-z] matches any lowercase letter.
- [A-Z] matches any uppercase letter.
- [0-9] matches any digit.
- \w matches alphanumeric characters (and the underscore).
- \W matches non-alphanumeric characters.
- \d matches digits.
- \D matches non-digits.
- \s matches whitespace.
- \S matches non-whitespace.
Any special character that should be matched literally needs to be escaped with a backslash (\). E.g. \. matches a period.
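The character classes above can be tried out with Python's re module. This is a small sketch on made-up strings; re.findall and re.search are used here simply to list what each pattern picks up.

```python
import re

text = "Room 4B costs 3.50 EUR"  # made-up sample string

digits = re.findall(r"\d", text)         # every single digit
uppercase = re.findall(r"[A-Z]", text)   # every uppercase letter
periods = re.findall(r"\.", text)        # escaped: only literal periods
anything = re.findall(r".", "a.b")       # unescaped: any character
spaces = re.findall(r"\s", text)         # whitespace characters
word_chars = re.findall(r"\w", "a_1!")   # letters, digits and underscore
chosen = re.findall(r"[acf]", "a fact")  # only the characters a, c, f

starts_with_r = bool(re.search(r"^R", text))  # anchored at the start
ends_with_r = bool(re.search(r"R$", text))    # anchored at the end

print(digits, uppercase, periods, anything, len(spaces), word_chars, chosen)
```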
Repetitions
By default, exactly one character is matched, but this behavior can be changed:
- * matches 0 or more occurrences of the preceding character.
- + matches 1 or more occurrences.
- ? matches 0 or 1 occurrence.
- {m,n} matches at least m but no more than n occurrences of the preceding character.
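To see how the quantifiers differ, the following sketch runs four patterns against the same made-up word list. The helper function matching is introduced only for this example.

```python
import re

words = ["ct", "cat", "caat", "caaat"]  # made-up test words


def matching(pattern):
    """Return the words that the pattern matches completely."""
    return [w for w in words if re.fullmatch(pattern, w)]


star = matching(r"ca*t")         # zero or more "a"
plus = matching(r"ca+t")         # one or more "a"
optional = matching(r"ca?t")     # zero or one "a"
bounded = matching(r"ca{2,3}t")  # two or three "a"

print(star, plus, optional, bounded)
```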
Groups
Regexes can include groups, which are subregexes within parentheses.
Groups can include alternatives, separated with the pipe |. Exactly one of these alternatives is matched.
When doing replacements, groups can be referenced with backreferences, e.g. \1 refers to the text matched by the first group.
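A short sketch of both ideas in Python, using made-up input strings: a group with alternatives, and a replacement that reorders two groups through backreferences.

```python
import re

# A group with two alternatives: "cat" or "dog", followed by "s".
# With exactly one group in the pattern, findall returns the group's text.
pets = re.findall(r"(cat|dog)s", "cats, dogs and birds")

# Backreferences in the replacement: \2 is the second group, \1 the first.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")

print(pets, swapped)
```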
Examples
Swedish social security numbers have the format yyyymmddxxxx.
We could match this simply with:
[12]\d{3}[01]\d[0-3]\d{5}
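The pattern above can be checked against a few strings with Python's re.fullmatch. The candidate numbers below are invented for illustration and are not real identities. The last one shows that this simple pattern is deliberately loose: it only constrains the first digit of the month and of the day, so an impossible date such as month 19, day 39 still passes.

```python
import re

SSN = r"[12]\d{3}[01]\d[0-3]\d{5}"

# Made-up candidates, not real numbers.
candidates = [
    "199001011234",  # well-formed
    "900101-1234",   # short form with a dash: rejected by this pattern
    "19900101123",   # one digit too short
    "199019391234",  # month 19, day 39: accepted, since the pattern is loose
]

results = {c: bool(re.fullmatch(SSN, c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```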
Regex in Python
The most important features of the Python regex module (re) are:
- re.match(regex, string) returns a Match object, which evaluates to True, if the beginning of the string matches the regex; otherwise it returns None.
- re.fullmatch(regex, string) returns a Match object, which evaluates to True, if the whole string matches the regex; otherwise it returns None.
- re.search(regex, string) returns a Match object for the first match of the regex anywhere in the string, or None if there is no match.
- re.findall(regex, string) returns all non-overlapping matches of the regex in the string (left to right).
- re.sub(regex, replacement, string) replaces all occurrences of the regex with the replacement; the replacement can contain backreferences to groups in the match.
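The five functions can be compared side by side on one input. This is a small sketch with a made-up string and a simple date pattern chosen only for illustration.

```python
import re

text = "2021-03-04 and 2022-11-30"  # made-up sample string
date = r"(\d{4})-(\d{2})-(\d{2})"   # groups: year, month, day

# match: succeeds because the string begins with a date.
at_start = re.match(date, text)

# fullmatch: None, because the whole string is more than a single date.
whole = re.fullmatch(date, text)

# search: finds the first date in November, wherever it is in the string.
november = re.search(r"\d{4}-11-\d{2}", text)

# findall: every date, as a list of strings (this pattern has no groups).
all_dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# sub: rewrite each date as day/month/year using backreferences.
reordered = re.sub(date, r"\3/\2/\1", text)

print(at_start.group(0), at_start.group(1))
print(whole)
print(november.group(0))
print(all_dates)
print(reordered)
```

Because a failed match returns None, the result of re.match, re.fullmatch, or re.search is usually checked with an if statement before calling group on it.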