Visualizing real estate prices with Altair
In this post, I played around with the Altair library, which not only makes plotting fancy charts easy-peasy but also makes it possible to show interactive charts on fastpages-powered websites. With Altair, we'll visualize a dataset of real estate prices.
Several months into my Python journey, I was already aware of visualization tools like Matplotlib, Seaborn, and Plotly, which are commonly discussed on Medium. But I'd never heard of Altair until I came across fastpages. Since I plan to keep writing on this fastpages-powered blog, I ran some experiments with Altair. For illustration purposes, I'll be using a dataset of real estate prices in Kaohsiung, TW, which I've cleaned and put together in my GitHub repo. For those of you who don't know Kaohsiung, the New York Times selected it as one of the 52 places to love in 2021. Maybe you'll consider buying an apartment in Kaohsiung after reading this post. Who knows?
Altair is already pre-installed on Colab, so there's no need to pip-install it if you're following along there.
import pandas as pd
import altair as alt
from altair import datum
The first thing to do is to git-clone the dataset into your environment.
!git clone -l -s https://github.com/howard-haowen/kh-real-estate cloned-repo
%cd cloned-repo
!ls
Let's take a look at 5 random observations.
df = pd.read_pickle('kh-house-prices.pkl')
df.sample(5)
The dataset includes 45717 observations and 21 columns.
df.shape
Most of the column names should be self-explanatory since I've translated them from the original Chinese to English.
columns = df.columns.tolist()
columns
Here are some basic stats.
df.describe()
MaxRowsError was the first trouble I ran into! It turns out that, by default, Altair only allows you to plot a dataset with a maximum of 5000 rows.
alt.Chart(df).mark_point().encode(
    x='trading_year',
    y='price_per_sqm',
    color='district',
).interactive()
The limitation can be lifted by calling this function.
alt.data_transformers.disable_max_rows()
According to the official documentation, this is not a good solution. But I did it anyway because I didn't know better. I was then able to make a plot, but it took only seconds for my Colab notebook to crash. So the lesson learned is this: don't disable the row limit when you're dealing with a large dataset.
A better way to deal with this is to pass data by URL, which works with json and csv files. So I converted my dataframe to csv and then uploaded it to my GitHub repo. After that, all Altair needs is the URL to that file.
with open('kh-house-prices.csv', 'w', encoding='utf-8') as file:
    df.to_csv(file, index=False)
Altair
to load your dataset properly, make sure the dataset is viewable by entering the URL in your browser. If your dataset is stored on GitHub, that means the URL has to start with https://raw.githubusercontent.com
rather than https://github.com
.
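If you only have the address of the file's page on GitHub, the raw address can be derived from it. Here is a minimal sketch using only the standard library. The function name to_raw_github_url is hypothetical, and the page address used below is an assumption reconstructed from the raw URL of this dataset.

```python
from urllib.parse import urlsplit


def to_raw_github_url(page_url: str) -> str:
    """Hypothetical helper: map a github.com file page to its raw counterpart."""
    parts = urlsplit(page_url)
    if parts.netloc != "github.com":
        # Already a raw URL, or not a GitHub page at all: leave it untouched.
        return page_url
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path to file>
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file page: {page_url}")
    user, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])


# Assumed page address of the dataset (not taken from the post itself).
page = "https://github.com/howard-haowen/kh-real-estate/blob/main/kh-house-prices.csv"
raw_url = to_raw_github_url(page)
print(raw_url)
```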
This URL is the data source from which we'll be making all the charts.
url = "https://raw.githubusercontent.com/howard-haowen/kh-real-estate/main/kh-house-prices.csv"
Now that the data loading and performance issues are taken care of, let's break down the syntax of Altair.
I'm a visual learner, so I personally think the easiest way to get started is to go to the Example Gallery and pick the kind of chart that you'd like to draw. Most of the time, all you need to do is copy-paste the code and change the data source as well as the column names.
All fancy charts start with something simple. In the case of Altair, it's alt.Chart(), which takes either a URL or a pandas DataFrame object (like df in our failed example above) as its argument.
Then you decide what kind of marks you'd like to draw on the chart by calling the .mark_X() function, where X could be circle if you want to represent an observation with a circle. Other types of marks used in this post include point, line, bar, and area.
Finally, you need to call the encode() function in order to map the properties of your dataset onto the chart you're making. In the example below, the function takes three arguments:
- x for the column to be mapped to the x axis
- y for the column to be mapped to the y axis
- color for the column that determines the color of each mark
Once you pass url to alt.Chart() and the column names in your dataset to encode(), you'll get this chart.
alt.Chart(url).mark_circle().encode(
    x='built_date:T',
    y='price_per_sqm:Q',
    color='district:N',
)
Notice the :X right after the column names. It tells Altair the data type of each column, where X can be one of these:
- Q for quantitative data
- O for ordinal data
- N for nominal data
- T for temporal data
- G for geographic data
One thing that I like about Altair is that there are lots of predefined aggregate functions that you can use on the fly. For instance, you can pass temporal data to the function yearmonth(), which aggregates data points by year and month. Or you can pass quantitative data to average(), which calculates the mean for you. This way, you won't have to create additional columns using pandas, and you can keep your raw data as minimal as possible.
alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',
)
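For comparison, this is roughly the work you'd otherwise do by hand in pandas: add a month column, group by district and month, and take the mean. It's a sketch on a made-up three-row frame (toy), not the real dataset, and the built_month column is introduced only for illustration.

```python
import pandas as pd

# Made-up rows (assumption): two sales built in March 2001, one in April 2001.
toy = pd.DataFrame({
    "built_date": pd.to_datetime(["2001-03-05", "2001-03-20", "2001-04-02"]),
    "district": ["A", "A", "A"],
    "price_per_sqm": [40000.0, 60000.0, 55000.0],
})

monthly = (
    toy.assign(built_month=toy["built_date"].dt.to_period("M"))  # like yearmonth()
       .groupby(["district", "built_month"], as_index=False)["price_per_sqm"]
       .mean()                                                   # like average()
)
print(monthly)
```

The two March rows collapse into one row with their mean, which is what each circle in the aggregated chart stands for.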
In pandas, we'd filter data using df[FILTER]. In Altair, this is done with .transform_filter(). In the chart above, we see that the majority of data points gather in the lower right corner. So one way to zoom in is to set a range for built_year, which represents the year a property was built. Suppose we want built_year to fall between 1950 and 2020. Then we use alt.FieldRangePredicate(field='built_year', range=[1950, 2020]).
alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',
).transform_filter(
    alt.FieldRangePredicate(field='built_year', range=[1950, 2020])
)
Similarly, if we want price_per_sqm, which represents property prices per square meter (in NT$ of course!), to be in the range of 10k to 300k, then we use alt.FieldRangePredicate(field='price_per_sqm', range=[10000, 300000]).
alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',
).transform_filter(
    alt.FieldRangePredicate(field='price_per_sqm', range=[10000, 300000])
)
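If you think in pandas terms, a range predicate behaves much like Series.between(): both ends of the range are kept. Here is a minimal sketch on made-up values (the toy frame is an assumption, chosen so that the boundary years are visible).

```python
import pandas as pd

# Made-up rows (assumption) with values on and just outside the boundaries.
toy = pd.DataFrame({
    "built_year": [1949, 1950, 1995, 2020, 2021],
    "price_per_sqm": [30000, 45000, 80000, 120000, 150000],
})

# Rough pandas counterpart of FieldRangePredicate(field='built_year', range=[1950, 2020])
kept = toy[toy["built_year"].between(1950, 2020)]
print(kept["built_year"].tolist())
```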
But what if we want to filter data on multiple columns? I found that an easy way to do that is to use datum.X, where X is a column name. Then the syntax is just like what you'd see in pandas. Suppose we want built_year to be greater than 1950 and price_per_sqm to be less than 300k. Then we write (datum.built_year > 1950) & (datum.price_per_sqm < 300000).
You may wonder what datum is. It turns out that Altair is smart enough to take care of everything for you as long as you import datum. So be sure to do this: from altair import datum.
alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',
).transform_filter(
    (datum.built_year > 1950) & (datum.price_per_sqm < 300000)
)
Finally, if you want to give viewers of your chart the liberty to zoom in and out, you can make an interactive chart simply by adding .interactive() to the end of your chart definition. To see the effect, click anywhere on the following chart and then scroll your mouse or move two fingers up and down on your trackpad.
alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',
).transform_filter(
    (datum.built_year > 1950) & (datum.price_per_sqm < 300000)
).interactive()
I think that's enough for the basics and for you to keep the ball rolling. Coming up are some of the numerous fancy charts that you can make with Altair.
Suppose we want to create a scatter plot where viewers can focus on data points from a particular district of their choice. In that case, the .add_selection() function can be quite handy. Let's first check out the unique districts in the dataset. (Btw, there are more districts in Kaohsiung. These are simply the more densely populated areas.)
districts = df.district.unique().tolist()
districts
We first create a variable selection, which we'll pass to .add_selection() later. The selection itself is created with a built-in function called alt.selection_single(), which takes the following arguments:
- name for the name you want to display in the selection area
- fields for a list of column names that viewers can choose from
- init for a dictionary specifying the default value for each selectable column
- bind for a dictionary specifying the way a column is to be selected (in this case, alt.binding_select() for a drop-down box) and its possible values (indicated by the argument options)
Additionally, if we want to display information about a data point upon mouseover, we can pass a list of column names to the argument tooltip of the .encode() function.
Importantly, for the interaction to work, we have to add .add_selection(selection) right before the .encode() function.
selection = alt.selection_single(
    name='Select',
    fields=['district', ],
    init={'district': '左營區', },
    bind={'district': alt.binding_select(options=districts), }
)
alt.Chart(url).mark_circle().add_selection(selection).encode(
    x='yearmonth(built_date):T',
    y='price_per_sqm:Q',
    color=alt.condition(selection, 'district:N', alt.value('lightgray')),
    tooltip=['property_type:N', 'property_area:Q', 'parking_area:Q',
             'built_date:T', 'trading_date:T', 'price_per_sqm:Q'],
).transform_filter(
    (datum.built_year > 1950) & (datum.price_per_sqm < 200000)
)  # add ".interactive()" at the end to make the chart interactive
We can also make two charts and then concatenate them vertically by calling the function alt.vconcat(), which takes chart objects and data as its arguments.
selection = alt.selection_multi(fields=['district'])
top = alt.Chart().mark_line().encode(
    x='yearmonth(built_date):T',
    y='mean(price_per_sqm):Q',
    color='district:N'
).properties(
    width=600, height=200
).transform_filter(
    selection
)
bottom = alt.Chart().mark_bar().encode(
    x='yearmonth(trading_date):T',
    y='mean(price_per_sqm):Q',
    color=alt.condition(selection, alt.value('steelblue'), alt.value('lightgray'))
).properties(
    width=600, height=100
).add_selection(
    selection
)
alt.vconcat(
    top, bottom,
    data=url
)
We can make one chart respond to a selection made on another chart. This can be useful when we want to have both a global and a detailed view of the same data. The key function we need is alt.Scale(). Watch the top chart change as you select different areas of the bottom chart.
brush = alt.selection(type='interval', encodings=['x'])
base = alt.Chart(url).mark_area().encode(
    x='yearmonth(built_date):T',
    y='price_per_sqm:Q'
).properties(
    width=600,
    height=200
)
upper = base.encode(
    alt.X('yearmonth(built_date):T', scale=alt.Scale(domain=brush))
)
lower = base.properties(
    height=60
).add_selection(brush)
upper & lower
Finally, you can also pick any three variables from your dataset and make a 3-by-3 grid of charts, with each one showing a different combination of variables on the x and y axes. To do that, we'll need to specify repetition in two places: once in the arguments for the x and y axes (i.e. alt.repeat() within alt.X and alt.Y) and once in the outermost layer of the chart definition (i.e. .repeat() at the very end).
alt.Chart(url).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='district:N'
).properties(
    width=150,
    height=150
).repeat(
    row=['property_area', 'price_per_sqm', 'built_year'],
    column=['built_year', 'price_per_sqm', 'property_area']
)
Altair is a Python library worth looking into if you want to show interactive charts on your websites and give your visitors some freedom to play with the outcome. This post only shows what I've tried. If you wish to dig deeper into this library, uwdata/visualization-curriculum seems like a great resource, aside from the official documentation. Now that you know the average price of real estate in Kaohsiung, TW, would you consider moving down here? 👨‍💻