freeCodeCamp project Demographic Data Analyzer
This post goes through the steps for completing the freeCodeCamp project Demographic Data Analyzer with the pandas library.
- Intro
- Import
- Dataset
- Task instructions
- Steps
- Q1: Race counts
- Q2: Average age of men
- Q3: Percentage of people with a Bachelor degree
- Q4: Percentage of people with advanced education who make more than 50k
- Q5: Percentage of people without advanced education who make more than 50k
- Q6: Minimum hours per week
- Q7: Percentage of people working minimum hours per week who make more than 50k
- Q8: Country with the highest percentage of people who make more than 50k
- Q9: Most popular occupation for those who make more than 50k in India
- Put it together
- Recap
To earn the freeCodeCamp certification on Data Analysis with Python, one has to complete five projects. This post goes through the steps for completing the one named Demographic Data Analyzer
with the pandas
library. For more background info, read the first post of this series.
For this project, we need nothing but the pandas
๐ผ library, which is preinstalled on Colab. This following graph summarizes its major use cases.
import pandas as pd
Let's first clone the project repo and change the current directory to boilerplate-demographic-data-analyzer
. In it, you'll find the adult.data.csv
file, which is the dataset.
repo_url = "https://github.com/freeCodeCamp/boilerplate-demographic-data-analyzer"
!git clone {repo_url}
%cd boilerplate-demographic-data-analyzer
Since we're dealing with a csv file, we'll load the dataset with the read_csv
function and save it as the variable df
, which is pretty much the default name for DataFrame objects. A sample of 10 records is shown below.
fname = "adult.data.csv"
df = pd.read_csv(fname)
df.sample(10)
The dataset has in total 32561 rows and 15 columns, with each row representing a person and each column indicating an attribute of that person.
df.shape
As in my previous post, I use the rich
library to render the README.md
file for task instructions.
!pip install rich
from rich.console import Console
from rich.markdown import Markdown
def show_readme():
console = Console()
with open("README.md") as readme:
markdown = Markdown(readme.read())
console.print(markdown)
show_readme()
The assignment lists 9 questions, and I'll just follow the order in which they occur.
The pandas
syntax is quite powerful, and a lot of things could happen with just one line of code. When I first started learning pandas
, I wish there were some kind of visual aids that could show me the transformation of a dataset from one state to another. Luckily ๐, I came across the Pandas Tutor, which can be very helpful if you are a visual learner like me. With its help, I'll be providing links to various visualizations in due course. But bear in mind that for the purpose of visualization I only took 20 random records rather than the whole dataset. The point is to visualize a tranformation process, not to actually do data analysis on Pandas Tutor.
To get the counts of each race, we group the dataset by race
. Calling the groupby
method on df
returns a GroupBy object, which cannot be displayed.
df.groupby("race")
In order to show some results, we'll have to call an additional method on the Groupby object. For instance, the size
method returns the numbers we're looking for. Follow this link to see a visualization of this process.
race_count = df.groupby("race").size()
race_count
I thought race_count
would be the answer to the first question, but it didn't pass the unittest. If you look at the test_module.py
file, you'll see that the expected values of race_count
are in ascending order, which is not made clear in the instructions. So, we'll have to sort the race counts by calling the sort_values
method on race_count
and setting the value of ascending
to be False
.
race_count = df.groupby("race").size().sort_values(ascending=False)
race_count
To get the average age of men, we first group the dataset by sex
, then take the age
column, and finally call the mean
method. Many steps are going on here, follow this link to visualize each of them.
average_age = df.groupby("sex")["age"].mean()
average_age
At this point, average_age
is a Series object with two keys Female
and Male
. So we just need to specify the Male
key to get the average age of men. Then we round it to the tenth by calling the builtin function round
.
average_age_men = average_age["Male"]
print(f"Before rounding: {average_age_men}")
average_age_men = round(average_age_men, 1)
print(f"After rounding: {average_age_men}")
To answer this question, we'll first need the fraction of the number of people with a Bachelor degree over the total number of people. To get the numerator, let's first create a Boolean filter named edu_filt
, which will return True
if the value of the education
column for a given row is Bachelors
or False
if otherwise. Then we can just use the bracket notation df[edu_filt]
to create a view of the dataset that satisfies the education criterion. Follow this link to see a visualization.
While the original dataset has 32561 records, the filtered one has only 5355 records. Since each record represents a person, these two numbers are precisely what we're looking for.
edu_filt = df["education"] == "Bachelors"
df[edu_filt]
The number of records in a DataFrame object is accessible via the attribute shape
, which returns a tuple of (number of records, number of columns). So here's how we arrive at the percentage we're after.
percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
print(f"Before rounding: {percentage_bachelors}")
percentage_bachelors = round(percentage_bachelors, 1)
print(f"After rounding: {percentage_bachelors}")
As in the previous question, we need to filter the dataset by the education
column. But this time around, we're filtering three values (Bachelors
, Masters
, and Doctorate
for higher degrees) rather than just one. So, it'd be easier to create a Boolean filter by calling the isin
method on df
while providing a list of higher degrees as its argument. Let's save the filtered dataset as higher_education
. Follow this link to visualize the filtering process.
higher_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education_filt = df["education"].isin(higher_degrees)
higher_education = df[higher_education_filt]
higher_education
The number of records in higher_education
is then the denominator of the fraction we're looking for. For the numerator, it'll be the number of records in higher_education
that is additionally filtered by the salary
column. So my first attempted solution was as follows.
salary_filt = df["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")
Though I did get the correct percentage, a warning popped up saying that "Boolean Series key will be reindexed to match DataFrame index." I thought I could just ignore the warning, but the solution ended up failing the unittest ๐. I later figured out what might have caused the warning. It's probably because the keys of the salary_filt
Series don't match up with the index of the higher_education
DataFrame, as shown below.
salary_filt = df["salary"] == ">50K"
print("Keys of salary_filt >>>")
print(salary_filt.keys())
print("Index of higher_education >>>")
print(higher_education.index)
To fix the problem, I redefined salary_filt
based on the filtered higher_education
rather than the original df
. This way, all the keys of salary_filt
would be equal to the index of higher_education
.
salary_filt = higher_education["salary"] == ">50K"
print("Keys of salary_filt match index of higher_education?")
all(salary_filt.keys() == higher_education.index)
Once the salary filter is fixed, the warning is gone ๐!
salary_filt = higher_education["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")
The logic for this question is pretty much like that for the previous one, except that we'll first filter out people without higher degrees rather than those with them. To negate the Boolean filter higher_education_filt
, we just need to add the ~
symbol in front of it. To see a visualization of the filtering process, follow this link.
lower_education = df[~ higher_education_filt]
lower_education
Now that we've got lower_education
, we can use a salary filter as before to get a subset of lower_education
, which, when divided by the total set, gives us the percentage we're looking for.
salary_filt = lower_education["salary"] == ">50K"
lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
print(f"Before rounding: {lower_education_rich}")
lower_education_rich = round(lower_education_rich, 1)
print(f"After rounding: {lower_education_rich}")
This question is the easiest of all. We simply grab the hours-per-week
Series, and find out the minimal value in it by calling the methond min
on it. The value is saved as min_work_hours
for later use.
min_work_hours = df["hours-per-week"].min()
min_work_hours
Given min_work_hours
, we can use a filter, named hour_filt
, to filter out people who work minimum hours per week, a subset saved as min_hour_workers
.
This link shows how the filtered subset is created.
hour_filt = df["hours-per-week"] == min_work_hours
min_hour_workers = df[hour_filt]
min_hour_workers
Now, let's create a salary filter based on min_hour_workers
. The number of records in min_hour_workers
with the filter divided by the number without it times 100 produces the answer, which is saved as rich_percentage
.
salary_filt = min_hour_workers["salary"] == ">50K"
rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
print(f"Before rounding: {rich_percentage}")
rich_percentage = round(rich_percentage, 1)
print(f"After rounding: {rich_percentage}")
This question is perhaps the most challenging of all. My initial move was to create a GroupBy object first selected by by native-country
then by salary
, and call the size
method on it. This would produce a Series of counts for <=50K
and >50K
in each country, as follows. Go to this page to see this process in action.
df.groupby(["native-country", "salary"]).size()
But then it's cubersome to get the percentage of the counts for >50K
over total counts in each country. The solution I submited for the first time worked, but it was not neat enough. After tackling the problem on a different occasion, I came up with a cleaner solution, to be explained below.
Let's forget about the breakdown of percentages by countries for the time being, and work out a way to get the percentage for all the dataset. Calling the value_counts
method on the salary
Series creates another Series with unique values in it as keys and their counts as the corresponding values. Let's name it salary_counts
.
salary_counts = df["salary"].value_counts()
salary_counts
Now we can retrieve the number of people who make more than 50K by using the bracket notation.
salary_counts[">50K"]
For the total number of people, we can just call the sum
method on the salary_counts
Series.
salary_counts.sum()
With the two numbers in place, we're now ready to calculate the percentage, which is named rich_percentage
.
rich_percentage = (salary_counts[">50K"]/salary_counts.sum()) * 100
rich_percentage
Next, to replicate what we just did, let's create a function named get_rich_percentage
, which takes the salary
Series as its argument and returns the percentage of people who make more than 50K.
def get_rich_percentage(salary_series):
salary_counts = salary_series.value_counts()
res = (salary_counts[">50K"]/salary_counts.sum()) * 100
return res
A test on the function gives us the expected number.
get_rich_percentage(df["salary"])
Then, we apply the get_rich_percentage
function to each group of a GroupBy object named by_country
, which is selected by the native-country
column. The outcome, saved as rich_percentage_by_country
, would have been a Series of percentages across countries, were it not for the KeyError
I got ๐.
by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"]))
rich_percentage_by_country
Apparently, the key >50K
is missing from some countries, so let's revise the get_rich_percentage
function to handle the key error.
def get_rich_percentage(salary_series):
salary_counts = salary_series.value_counts()
if salary_counts.get(">50K"):
res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
else:
res = 0
return res
Having taken care of the key error, we now got the Series we're after ๐. The only thing modified this time is to additionally call the sort_values
method on the Series so that the numbers are in descending order. And I'm surprised to see Taiwan ๐น among the top five on the list!
by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
rich_percentage_by_country
Finally, we call the idxmax
method on rich_percentage_by_country
to get the country with the highest percentage, which turns out to be Iran
.
highest_earning_country = rich_percentage_by_country.idxmax()
highest_earning_country
And with the key Iran
, we can easily get the percentage for that country.
highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
print(f"Before rounding: {highest_earning_country_percentage}")
highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
print(f"After rounding: {highest_earning_country_percentage}")
This final question is specifically asking about people who make more than 50K in India, so we need two Boolean filters, one based on salary
and the other on native-country
. Adding the symbol &
between the two filters creates the and
condition, which filters out the subset we want, named rich_indians
๐ฐ๐ฎ .
salary_filt = df["salary"] == ">50K"
indian_filt = df["native-country"] == "India"
rich_indians = df[(indian_filt)&(salary_filt)]
rich_indians
Next, since we're trying to find out the most popular job among rich Indians, we need a count for each unique job. One of the neatest way to go about this is to call the groupby
method on rich_indians
and then the size
method on the resulting GroupBy object.
ind_occupations = rich_indians.groupby("occupation").size()
ind_occupations
Finally, we call the idxmax
method on the ind_occupations
Series to get the most popular occupation, which is Prof-specialty
.
top_IN_occupation = ind_occupations.idxmax()
top_IN_occupation
Here's the complete code for this project, including the main function calculate_demographic_data
and the helper function get_rich_percentage
for Q8.
import pandas as pd
fpath = "adult.data.csv"
def get_rich_percentage(salary_series):
salary_counts = salary_series.value_counts()
if salary_counts.get(">50K"):
res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
else:
res = 0
return res
def calculate_demographic_data(print_data=True):
# Read data from file
df = pd.read_csv(fpath)
# Q1 How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
race_count = df.groupby("race").size().sort_values(ascending=False)
# Q2 What is the average age of men?
average_age_men = df.groupby("sex")["age"].mean()["Male"]
average_age_men = round(average_age_men, 1)
# Q3 What is the percentage of people who have a Bachelor's degree?
edu_filt = df["education"] == "Bachelors"
percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
percentage_bachelors = round(percentage_bachelors, 1)
# Q4 What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
higher_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education_filt = df["education"].isin(higher_degrees)
higher_education = df[higher_education_filt]
salary_filt = higher_education["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
higher_education_rich = round(higher_education_rich, 1)
# Q5 What percentage of people without advanced education make more than 50K?
lower_education = df[~ higher_education_filt]
salary_filt = lower_education["salary"] == ">50K"
lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
lower_education_rich = round(lower_education_rich, 1)
# Q6 What is the minimum number of hours a person works per week (hours-per-week feature)?
min_work_hours = df["hours-per-week"].min()
# Q7 What percentage of the people who work the minimum number of hours per week have a salary of >50K?
hour_filt = df["hours-per-week"] == min_work_hours
min_hour_workers = df[hour_filt]
salary_filt = min_hour_workers["salary"] == ">50K"
rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
rich_percentage = round(rich_percentage, 1)
# Q8 What country has the highest percentage of people that earn >50K and what is that percentage?
by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
highest_earning_country = rich_percentage_by_country.idxmax()
highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
# Q9 Identify the most popular occupation for those who earn >50K in India.
salary_filt = df["salary"] == ">50K"
indian_filt = df["native-country"] == "India"
rich_indians = df[(indian_filt)&(salary_filt)]
ind_occupations = rich_indians.groupby("occupation").size()
top_IN_occupation = ind_occupations.idxmax()
# DO NOT MODIFY BELOW THIS LINE
if print_data:
print("Number of each race:\n", race_count)
print("Average age of men:", average_age_men)
print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
print(
f"Percentage with higher education that earn >50K: {higher_education_rich}%"
)
print(
f"Percentage without higher education that earn >50K: {lower_education_rich}%"
)
print(f"Min work time: {min_work_hours} hours/week")
print(
f"Percentage of rich among those who work fewest hours: {rich_percentage}%"
)
print("Country with highest percentage of rich:",
highest_earning_country)
print(
f"Highest percentage of rich people in country: {highest_earning_country_percentage}%"
)
print("Top occupations in India:", top_IN_occupation)
return {
'race_count': race_count,
'average_age_men': average_age_men,
'percentage_bachelors': percentage_bachelors,
'higher_education_rich': higher_education_rich,
'lower_education_rich': lower_education_rich,
'min_work_hours': min_work_hours,
'rich_percentage': rich_percentage,
'highest_earning_country': highest_earning_country,
'highest_earning_country_percentage':
highest_earning_country_percentage,
'top_IN_occupation': top_IN_occupation
}
Now it's time to harvest the fruits of our labor by calling the calculate_demographic_data
function! And it worked as expected! ๐ใ
calculate_demographic_data()
Once you finish the freeCodeCamp project Demographic Data Analyzer, you'll become comfortable doing data analysis with the pandas
library while manipulating pandas
objects like Series
, DataFrame
, and GroupBy
. Starting from the next project, we'll enter the realm of visualization, drawing figures with the help of libraries like matplotlib
and seaborn
. Until then ๐!