Intro

Several months into my journey of Python programming, I was already aware of visualization tools like Matplotlib, Seaborn, and Plotly, which are commonly discussed on Medium. But I'd never heard of Altair until I came across fastpages. Since I plan to keep writing on this fastpages-powered blog, I did some experiments with Altair. For illustration purposes, I'll be using a dataset of real estate prices in Kaohsiung, TW, which I've cleaned and put together in my GitHub repo. For those of you who don't know Kaohsiung, it was selected by the New York Times as one of the 52 places to love in 2021. Maybe you'll consider buying an apartment in Kaohsiung after reading this post. Who knows?

Import dependencies

Altair is already pre-installed on Colab, so there's no need to pip-install it if you're following along there. If you're working locally instead, a quick pip install altair will do.

import pandas as pd
import altair as alt 
from altair import datum

Load the dataset

The first thing to do is to git-clone the dataset into your environment.

!git clone -l -s https://github.com/howard-haowen/kh-real-estate cloned-repo
%cd cloned-repo
!ls

Cloning into 'cloned-repo'...
warning: --local is ignored
remote: Enumerating objects: 100, done.
remote: Counting objects: 100% (100/100), done.
remote: Compressing objects: 100% (100/100), done.
remote: Total 100 (delta 46), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (100/100), 3.30 MiB | 1.03 MiB/s, done.
Resolving deltas: 100% (46/46), done.
/content/cloned-repo
catboost-model-feature-importance.png		  catboost-model-residuals.png
catboost-model-feature-importance-shap-value.png  compare-models.png
catboost-model-learning-curve.png		  kh-house-prices.csv
catboost-model-outliers.png			  kh-house-prices.pkl
catboost-model.png				  LICENSE
catboost-model-prediction-errors.png		  README.md

Let's take a look at 5 random observations.

df = pd.read_pickle('kh-house-prices.pkl')
df.sample(5)
purpose trading_target land_area property_type living_room bedroom bathroom partition property_area is_managed total_floor parking_area parking_price parking_type land_use district trading_date trading_year built_date built_year price_per_sqm
25204 住家用 房地(土地+建物)+車位 13.53 住宅大樓(11層含以上有電梯) 2 4 2 129.39 13 0.00 0 坡道平面 楠梓區 2017-01-20 2017 1995-01-26 1995 33233.0
19272 住家用 房地(土地+建物)+車位 18.24 住宅大樓(11層含以上有電梯) 0 0 0 360.51 36 61.10 0 坡道平面 鼓山區 2016-05-20 2016 2014-06-26 2014 62717.0
12575 住家用 房地(土地+建物)+車位 13.12 住宅大樓(11層含以上有電梯) 2 3 2 145.90 15 12.66 840000 坡道機械 鼓山區 2015-07-14 2015 2014-05-15 2014 73101.0
15299 住家用 房地(土地+建物)+車位 15.42 住宅大樓(11層含以上有電梯) 2 3 2 125.39 15 11.24 0 坡道機械 左營區 2015-11-08 2015 2007-01-12 2007 43066.0
31446 住家用 房地(土地+建物)+車位 13.91 住宅大樓(11層含以上有電梯) 2 3 2 177.61 13 0.00 0 坡道機械 鼓山區 2017-12-12 2017 1996-04-05 1996 44479.0

The dataset includes 45717 observations and 21 columns.

df.shape
(45717, 21)

Most of the column names should be self-explanatory since I've translated them from the original Chinese to English.

columns = df.columns.tolist()
columns
['purpose',
 'trading_target',
 'land_area',
 'property_type',
 'living_room',
 'bedroom',
 'bathroom',
 'partition',
 'property_area',
 'is_managed',
 'total_floor',
 'parking_area',
 'parking_price',
 'parking_type',
 'land_use',
 'district',
 'trading_date',
 'trading_year',
 'built_date',
 'built_year',
 'price_per_sqm']

Here are some basic stats.

df.describe()
land_area living_room bedroom bathroom property_area total_floor parking_area parking_price trading_year built_year price_per_sqm
count 45717.000000 45717.000000 45717.000000 45717.000000 45717.000000 45717.000000 45717.000000 4.571700e+04 45717.000000 45717.000000 4.571700e+04
mean 24.949719 1.739987 2.921058 1.907540 145.261129 13.729947 6.606456 9.966087e+04 2016.760702 1999.837938 5.222278e+04
std 32.301563 0.583373 1.299294 1.084739 89.910644 7.810174 81.029070 5.323162e+05 1.699207 11.445783 2.236209e+04
min 0.010000 0.000000 0.000000 0.000000 0.020000 1.000000 0.000000 0.000000e+00 2012.000000 1913.000000 0.000000e+00
25% 10.450000 2.000000 2.000000 1.000000 89.080000 8.000000 0.000000 0.000000e+00 2015.000000 1994.000000 3.849700e+04
50% 16.630000 2.000000 3.000000 2.000000 128.440000 14.000000 0.000000 0.000000e+00 2017.000000 1999.000000 4.829400e+04
75% 26.200000 2.000000 3.000000 2.000000 171.200000 15.000000 0.000000 0.000000e+00 2018.000000 2009.000000 6.233000e+04
max 2140.100000 22.000000 52.000000 50.000000 4119.900000 85.000000 17098.000000 1.000000e+07 2020.000000 2020.000000 1.048343e+06

MaxRowsError was the first trouble I ran into! It turns out that by default, Altair only allows you to plot a dataset with a maximum of 5,000 rows.

alt.Chart(df).mark_point().encode(
    x='trading_year',
    y='price_per_sqm',
    color='district',
).interactive()

---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/altair/vegalite/v4/api.py in to_dict(self, *args, **kwargs)
    361         copy = self.copy(deep=False)
    362         original_data = getattr(copy, "data", Undefined)
--> 363         copy.data = _prepare_data(original_data, context)
    364 
    365         if original_data is not Undefined:

/usr/local/lib/python3.6/dist-packages/altair/vegalite/v4/api.py in _prepare_data(data, context)
     82     # convert dataframes  or objects with __geo_interface__ to dict
     83     if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 84         data = _pipe(data, data_transformers.get())
     85 
     86     # convert string input to a URLData

/usr/local/lib/python3.6/dist-packages/toolz/functoolz.py in pipe(data, *funcs)
    625     """
    626     for func in funcs:
--> 627         data = func(data)
    628     return data
    629 

/usr/local/lib/python3.6/dist-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

/usr/local/lib/python3.6/dist-packages/altair/vegalite/data.py in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
     20 
     21 

/usr/local/lib/python3.6/dist-packages/toolz/functoolz.py in pipe(data, *funcs)
    625     """
    626     for func in funcs:
--> 627         data = func(data)
    628     return data
    629 

/usr/local/lib/python3.6/dist-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

/usr/local/lib/python3.6/dist-packages/altair/utils/data.py in limit_rows(data, max_rows)
     82             "than the maximum allowed ({}). "
     83             "For information on how to plot larger datasets "
---> 84             "in Altair, see the documentation".format(max_rows)
     85         )
     86     return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)

The limitation can be lifted by calling this function.

alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')

According to the official documentation, this is not a good solution. But I did it anyway because I didn't know better. I was then able to make a plot, but it only took seconds for my Colab notebook to crash. So the lesson learned is this:

Warning: Never disable the restriction for max rows if you’re dealing with a huge amount of data!

A better way to deal with this is to pass data by URL, which supports only JSON and CSV files. So I converted my dataframe to CSV and then uploaded it to my GitHub repo. After that, all you need to start using Altair is the URL to that file.

with open('kh-house-prices.csv', 'w', encoding='utf-8') as file:
    df.to_csv(file, index=False)

Tip: For Altair to load your dataset properly, make sure the dataset is viewable by entering the URL in your browser. If your dataset is stored on GitHub, that means the URL has to start with https://raw.githubusercontent.com rather than https://github.com.
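If you only have the regular github.com link to a file, turning it into the raw form is simple string surgery. Here's a tiny helper of my own (not part of any library) that illustrates the mapping:

```python
def to_raw_github_url(url: str) -> str:
    """Turn a github.com file link into its raw.githubusercontent.com form."""
    return (url
            .replace('https://github.com/', 'https://raw.githubusercontent.com/')
            .replace('/blob/', '/'))

# The regular file page...
page = 'https://github.com/howard-haowen/kh-real-estate/blob/main/kh-house-prices.csv'
# ...becomes the direct link that Altair can actually load.
print(to_raw_github_url(page))
# -> https://raw.githubusercontent.com/howard-haowen/kh-real-estate/main/kh-house-prices.csv
```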

This URL is the data source from which we'll be making all the charts.

url = "https://raw.githubusercontent.com/howard-haowen/kh-real-estate/main/kh-house-prices.csv"

Simple charts

Now that we've taken care of the data loading and performance issues, let's break down the syntax of Altair.

I'm a visual learner, so I personally think the easiest way to get started is to go to the Example Gallery and pick the kind of chart you'd like to draw. Most of the time, all you need to do is copy and paste the code and change the data source as well as the column names.

All fancy charts start with something simple. In the case of Altair, it's alt.Chart(), which takes either a URL or a pandas DataFrame object (like df in our failed example above) as its argument.

Then you decide what kinds of marks you'd like to draw on the chart by calling the .mark_X() function, where X could be circle if you want to represent an observation with a circle. Other types of marks used in this post include point, line, bar, and area.

Finally, you need to call the .encode() function to map the properties of your dataset onto the chart you're making. In the example below, the function takes three arguments:

  • x for the column mapped to the x axis
  • y for the column mapped to the y axis
  • color for the column used to color the chart

Once you pass url to alt.Chart() and the column names in your dataset to encode(), you'll get this chart.

alt.Chart(url).mark_circle().encode(
    x='built_date:T',
    y='price_per_sqm:Q',
    color='district:N',)

Note: If your data source is a dataframe, then column names are sufficient. But if your data source is a URL, as is the case here, you have to specify your data types with :X right after the column names, where X can be one of these:
  • Q for quantitative data
  • O for ordinal data
  • N for nominal data
  • T for temporal data
  • G for geographic data

And one thing I like about Altair is that there are lots of predefined aggregate functions you can use on the fly. For instance, you can pass temporal data to the function yearmonth(), which aggregates data points by year and month. Or you can pass quantitative data to average(), which calculates the mean for you. This way, you won't have to create additional columns with pandas, and you can keep your raw data as minimal as possible.

alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',)

In pandas, we'd filter data using df[FILTER]. In Altair, this is done with .transform_filter(). In the chart above, we see that most data points cluster in the lower right corner. So one way to zoom in is to set a range for built_year on the x axis, which represents the year a property was built. Suppose we want built_year to fall between 1950 and 2020; then we do alt.FieldRangePredicate(field='built_year', range=[1950, 2020]).

alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',).transform_filter(
        alt.FieldRangePredicate(field='built_year', range=[1950, 2020])
    )

Similarly, if we want price_per_sqm on the y axis, which represents property prices per square meter (in NT$, of course!), to fall between 10k and 300k, then we do alt.FieldRangePredicate(field='price_per_sqm', range=[10000, 300000]).

alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',).transform_filter(
        alt.FieldRangePredicate(field='price_per_sqm', range=[10000, 300000])
    )

But what if we want to filter data on multiple columns? I found that an easy way to do that is to use datum.X, where X is a column name. Then the syntax is just like what you'd see in pandas. Suppose we want built_year to be greater than 1950 and price_per_sqm to be less than 300k; then we do (datum.built_year > 1950) & (datum.price_per_sqm < 300000).

Important: It took me a while to figure out what kind of object datum is. It turns out that Altair is smart enough to take care of everything for you as long as you import datum. So be sure to do this: from altair import datum.
alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',).transform_filter(
        (datum.built_year > 1950) & (datum.price_per_sqm < 300000)
    )

Finally, if you want to give viewers of your chart the liberty to zoom in and out, you can make an interactive chart simply by adding .interactive() to the end of your syntax. To see the effect, click on any grid of the following chart and then scroll your mouse or move two of your fingers up and down on your Magic Trackpad.

Warning: Try not to make too many interactive charts if your dataset is huge because they can cause serious performance issues. That’s why I only made one interactive chart in this post.

alt.Chart(url).mark_circle().encode(
    x='yearmonth(built_date):T',
    y='average(price_per_sqm):Q',
    color='district:N',).transform_filter(
        (datum.built_year > 1950) & (datum.price_per_sqm < 300000)
    ).interactive()

I think that's enough for the basics and for you to keep the ball rolling. Coming up are some of the numerous fancy charts that you can make with Altair.

Complex charts

Suppose we want to create a scatter plot where viewers can focus on data points from a particular district of their choice. The .add_selection() function can be quite handy here. Let's first check out the unique districts in the dataset. (Btw, there are more districts in Kaohsiung than these; they're simply the more densely populated areas.)

districts = df.district.unique().tolist()
districts
['鼓山區', '前金區', '前鎮區', '三民區', '楠梓區', '左營區', '鳳山區', '新興區', '苓雅區']

We first create a variable selection, which we'll pass to .add_selection() later. The selection itself is a built-in function called alt.selection_single(), which takes the following arguments:

  • name for the name you want to display in the selection area
  • fields for a list of column names that viewers can choose from
  • init for a dictionary specifying the default value for each selectable column
  • bind for a dictionary specifying the way a column is to be selected (in this case, alt.binding_select() for a drop-down box) and its possible values (indicated by the options argument)

Additionally, if we want to display information about a data point upon mouseover, we can pass a list of column names to the argument tooltip of the .encode() function.

Importantly, for the interaction to work, we have to add .add_selection(selection) right before the .encode() function.

selection = alt.selection_single(
    name='Select',
    fields=['district', ],
    init={'district': '左營區', },
    bind={'district': alt.binding_select(options=districts), }
)

alt.Chart(url).mark_circle().add_selection(selection).encode(
    x='yearmonth(built_date):T',
    y='price_per_sqm:Q',
    color=alt.condition(selection, 'district:N', alt.value('lightgray')),
    tooltip=['property_type:N','property_area:Q','parking_area:Q',
             'built_date:T','trading_date:T','price_per_sqm:Q'],
    ).transform_filter(
        (datum.built_year > 1950) & (datum.price_per_sqm < 200000)
        ) # add ".interactive()" at the end to make the chart interactive

We can also make two charts and then concatenate them vertically by calling alt.vconcat(), which takes chart objects and data as its arguments.

selection = alt.selection_multi(fields=['district'])

top = alt.Chart().mark_line().encode(
    x='yearmonth(built_date):T',
    y='mean(price_per_sqm):Q',
    color='district:N'
).properties(
    width=600, height=200
).transform_filter(
    selection
)

bottom = alt.Chart().mark_bar().encode(
    x='yearmonth(trading_date):T',
    y='mean(price_per_sqm):Q',
    color=alt.condition(selection, alt.value('steelblue'), alt.value('lightgray'))
).properties(
    width=600, height=100
).add_selection(
    selection
)

alt.vconcat(
    top, bottom,
    data=url
)

We can also make one chart respond to selections made on another. This can be useful when we want both a global and a detailed view of the same data. The key function we need is alt.Scale(). Watch the top chart change as you select different areas of the bottom chart.

brush = alt.selection(type='interval', encodings=['x'])

base = alt.Chart(url).mark_area().encode(
    x = 'yearmonth(built_date):T',
    y = 'price_per_sqm:Q'
).properties(
    width=600,
    height=200
)

upper = base.encode(
    alt.X('yearmonth(built_date):T', scale=alt.Scale(domain=brush))
)

lower = base.properties(
    height=60
).add_selection(brush)

upper & lower

Finally, you can also pick three variables from your dataset and make a 3×3 grid of charts, each cell varying in its x and y axis combination. To do that, we need to specify the repetition in two places: once in the x and y encodings (i.e. alt.repeat() within alt.X and alt.Y) and once in the outermost layer of the syntax (i.e. .repeat() at the very end).

alt.Chart(url).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='district:N'
).properties(
    width=150,
    height=150
).repeat(
    row=['property_area', 'price_per_sqm', 'built_year'],
    column=['built_year', 'price_per_sqm', 'property_area']
)

Recap

Altair is a Python library worth looking into if you want to show interactive charts on your websites and give your visitors some freedom to play with the outcome. This post only shows what I've tried. If you wish to dig deeper into this library, uwdata/visualization-curriculum seems like a great resource, aside from the official documentation. Now that you know the average price of real estate in Kaohsiung, TW, would you consider moving down here? 👨‍💻