freeCodeCamp project Demographic Data Analyzer
This post goes through the steps for completing the freeCodeCamp project Demographic Data Analyzer with the pandas library.
- Intro
- Import
- Dataset
- Task instructions
- Steps
- Q1: Race counts
- Q2: Average age of men
- Q3: Percentage of people with a Bachelor degree
- Q4: Percentage of people with advanced education who make more than 50k
- Q5: Percentage of people without advanced education who make more than 50k
- Q6: Minimum hours per week
- Q7: Percentage of people working minimum hours per week who make more than 50k
- Q8: Country with the highest percentage of people who make more than 50k
- Q9: Most popular occupation for those who make more than 50k in India
- Put it together
- Recap

To earn the freeCodeCamp certification on Data Analysis with Python, one has to complete five projects. This post goes through the steps for completing the one named Demographic Data Analyzer with the pandas library. For more background info, read the first post of this series.
For this project, we need nothing but the pandas ๐ผ library, which is preinstalled on Colab. This following graph summarizes its major use cases.

import pandas as pd
Let's first clone the project repo and change the current directory to boilerplate-demographic-data-analyzer. In it, you'll find the adult.data.csv file, which is the dataset.
repo_url = "https://github.com/freeCodeCamp/boilerplate-demographic-data-analyzer"
!git clone {repo_url}
%cd boilerplate-demographic-data-analyzer
Since we're dealing with a csv file, we'll load the dataset with the read_csv function and save it as the variable df, which is pretty much the default name for DataFrame objects. A sample of 10 records is shown below.
fname = "adult.data.csv"
df = pd.read_csv(fname)
df.sample(10)
The dataset has in total 32561 rows and 15 columns, with each row representing a person and each column indicating an attribute of that person.
df.shape
As in my previous post, I use the rich library to render the README.md file for task instructions.
!pip install rich
from rich.console import Console
from rich.markdown import Markdown
def show_readme():
console = Console()
with open("README.md") as readme:
markdown = Markdown(readme.read())
console.print(markdown)
show_readme()
The assignment lists 9 questions, and I'll just follow the order in which they occur.
The pandas syntax is quite powerful, and a lot of things could happen with just one line of code. When I first started learning pandas, I wish there were some kind of visual aids that could show me the transformation of a dataset from one state to another. Luckily ๐, I came across the Pandas Tutor, which can be very helpful if you are a visual learner like me. With its help, I'll be providing links to various visualizations in due course. But bear in mind that for the purpose of visualization I only took 20 random records rather than the whole dataset. The point is to visualize a tranformation process, not to actually do data analysis on Pandas Tutor.
To get the counts of each race, we group the dataset by race. Calling the groupby method on df returns a GroupBy object, which cannot be displayed.
df.groupby("race")
In order to show some results, we'll have to call an additional method on the Groupby object. For instance, the size method returns the numbers we're looking for. Follow this link to see a visualization of this process.
race_count = df.groupby("race").size()
race_count
I thought race_count would be the answer to the first question, but it didn't pass the unittest. If you look at the test_module.py file, you'll see that the expected values of race_count are in ascending order, which is not made clear in the instructions. So, we'll have to sort the race counts by calling the sort_values method on race_count and setting the value of ascending to be False.
race_count = df.groupby("race").size().sort_values(ascending=False)
race_count
To get the average age of men, we first group the dataset by sex, then take the age column, and finally call the mean method. Many steps are going on here, follow this link to visualize each of them.
average_age = df.groupby("sex")["age"].mean()
average_age
At this point, average_age is a Series object with two keys Female and Male. So we just need to specify the Male key to get the average age of men. Then we round it to the tenth by calling the builtin function round.
average_age_men = average_age["Male"]
print(f"Before rounding: {average_age_men}")
average_age_men = round(average_age_men, 1)
print(f"After rounding: {average_age_men}")
To answer this question, we'll first need the fraction of the number of people with a Bachelor degree over the total number of people. To get the numerator, let's first create a Boolean filter named edu_filt, which will return True if the value of the education column for a given row is Bachelors or False if otherwise. Then we can just use the bracket notation df[edu_filt] to create a view of the dataset that satisfies the education criterion. Follow this link to see a visualization.
While the original dataset has 32561 records, the filtered one has only 5355 records. Since each record represents a person, these two numbers are precisely what we're looking for.
edu_filt = df["education"] == "Bachelors"
df[edu_filt]
The number of records in a DataFrame object is accessible via the attribute shape, which returns a tuple of (number of records, number of columns). So here's how we arrive at the percentage we're after.
percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
print(f"Before rounding: {percentage_bachelors}")
percentage_bachelors = round(percentage_bachelors, 1)
print(f"After rounding: {percentage_bachelors}")
As in the previous question, we need to filter the dataset by the education column. But this time around, we're filtering three values (Bachelors, Masters, and Doctorate for higher degrees) rather than just one. So, it'd be easier to create a Boolean filter by calling the isin method on df while providing a list of higher degrees as its argument. Let's save the filtered dataset as higher_education. Follow this link to visualize the filtering process.
higher_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education_filt = df["education"].isin(higher_degrees)
higher_education = df[higher_education_filt]
higher_education
The number of records in higher_education is then the denominator of the fraction we're looking for. For the numerator, it'll be the number of records in higher_education that is additionally filtered by the salary column. So my first attempted solution was as follows.
salary_filt = df["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")
Though I did get the correct percentage, a warning popped up saying that "Boolean Series key will be reindexed to match DataFrame index." I thought I could just ignore the warning, but the solution ended up failing the unittest ๐. I later figured out what might have caused the warning. It's probably because the keys of the salary_filt Series don't match up with the index of the higher_education DataFrame, as shown below.
salary_filt = df["salary"] == ">50K"
print("Keys of salary_filt >>>")
print(salary_filt.keys())
print("Index of higher_education >>>")
print(higher_education.index)
To fix the problem, I redefined salary_filt based on the filtered higher_education rather than the original df. This way, all the keys of salary_filt would be equal to the index of higher_education.
salary_filt = higher_education["salary"] == ">50K"
print("Keys of salary_filt match index of higher_education?")
all(salary_filt.keys() == higher_education.index)
Once the salary filter is fixed, the warning is gone ๐!
salary_filt = higher_education["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")
The logic for this question is pretty much like that for the previous one, except that we'll first filter out people without higher degrees rather than those with them. To negate the Boolean filter higher_education_filt, we just need to add the ~ symbol in front of it. To see a visualization of the filtering process, follow this link.
lower_education = df[~ higher_education_filt]
lower_education
Now that we've got lower_education, we can use a salary filter as before to get a subset of lower_education, which, when divided by the total set, gives us the percentage we're looking for.
salary_filt = lower_education["salary"] == ">50K"
lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
print(f"Before rounding: {lower_education_rich}")
lower_education_rich = round(lower_education_rich, 1)
print(f"After rounding: {lower_education_rich}")
This question is the easiest of all. We simply grab the hours-per-week Series, and find out the minimal value in it by calling the methond min on it. The value is saved as min_work_hours for later use.
min_work_hours = df["hours-per-week"].min()
min_work_hours
Given min_work_hours, we can use a filter, named hour_filt, to filter out people who work minimum hours per week, a subset saved as min_hour_workers.
This link shows how the filtered subset is created.
hour_filt = df["hours-per-week"] == min_work_hours
min_hour_workers = df[hour_filt]
min_hour_workers
Now, let's create a salary filter based on min_hour_workers. The number of records in min_hour_workers with the filter divided by the number without it times 100 produces the answer, which is saved as rich_percentage.
salary_filt = min_hour_workers["salary"] == ">50K"
rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
print(f"Before rounding: {rich_percentage}")
rich_percentage = round(rich_percentage, 1)
print(f"After rounding: {rich_percentage}")
This question is perhaps the most challenging of all. My initial move was to create a GroupBy object first selected by by native-country then by salary, and call the size method on it. This would produce a Series of counts for <=50K and >50K in each country, as follows. Go to this page to see this process in action.
df.groupby(["native-country", "salary"]).size()
But then it's cubersome to get the percentage of the counts for >50K over total counts in each country. The solution I submited for the first time worked, but it was not neat enough. After tackling the problem on a different occasion, I came up with a cleaner solution, to be explained below.
Let's forget about the breakdown of percentages by countries for the time being, and work out a way to get the percentage for all the dataset. Calling the value_counts method on the salary Series creates another Series with unique values in it as keys and their counts as the corresponding values. Let's name it salary_counts.
salary_counts = df["salary"].value_counts()
salary_counts
Now we can retrieve the number of people who make more than 50K by using the bracket notation.
salary_counts[">50K"]
For the total number of people, we can just call the sum method on the salary_counts Series.
salary_counts.sum()
With the two numbers in place, we're now ready to calculate the percentage, which is named rich_percentage.
rich_percentage = (salary_counts[">50K"]/salary_counts.sum()) * 100
rich_percentage
Next, to replicate what we just did, let's create a function named get_rich_percentage, which takes the salary Series as its argument and returns the percentage of people who make more than 50K.
def get_rich_percentage(salary_series):
salary_counts = salary_series.value_counts()
res = (salary_counts[">50K"]/salary_counts.sum()) * 100
return res
A test on the function gives us the expected number.
get_rich_percentage(df["salary"])
Then, we apply the get_rich_percentage function to each group of a GroupBy object named by_country, which is selected by the native-country column. The outcome, saved as rich_percentage_by_country, would have been a Series of percentages across countries, were it not for the KeyError I got ๐.
by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"]))
rich_percentage_by_country
Apparently, the key >50K is missing from some countries, so let's revise the get_rich_percentage function to handle the key error.
def get_rich_percentage(salary_series):
salary_counts = salary_series.value_counts()
if salary_counts.get(">50K"):
res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
else:
res = 0
return res
Having taken care of the key error, we now got the Series we're after ๐. The only thing modified this time is to additionally call the sort_values method on the Series so that the numbers are in descending order. And I'm surprised to see Taiwan ๐น among the top five on the list!
by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
rich_percentage_by_country
Finally, we call the idxmax method on rich_percentage_by_country to get the country with the highest percentage, which turns out to be Iran.
highest_earning_country = rich_percentage_by_country.idxmax()
highest_earning_country
And with the key Iran, we can easily get the percentage for that country.
highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
print(f"Before rounding: {highest_earning_country_percentage}")
highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
print(f"After rounding: {highest_earning_country_percentage}")
This final question is specifically asking about people who make more than 50K in India, so we need two Boolean filters, one based on salary and the other on native-country. Adding the symbol & between the two filters creates the and condition, which filters out the subset we want, named rich_indians ๐ฐ๐ฎ .
salary_filt = df["salary"] == ">50K"
indian_filt = df["native-country"] == "India"
rich_indians = df[(indian_filt)&(salary_filt)]
rich_indians
Next, since we're trying to find out the most popular job among rich Indians, we need a count for each unique job. One of the neatest way to go about this is to call the groupby method on rich_indians and then the size method on the resulting GroupBy object.
ind_occupations = rich_indians.groupby("occupation").size()
ind_occupations
Finally, we call the idxmax method on the ind_occupations Series to get the most popular occupation, which is Prof-specialty.
top_IN_occupation = ind_occupations.idxmax()
top_IN_occupation
Here's the complete code for this project, including the main function calculate_demographic_data and the helper function get_rich_percentage for Q8.
import pandas as pd
fpath = "adult.data.csv"
def get_rich_percentage(salary_series):
salary_counts = salary_series.value_counts()
if salary_counts.get(">50K"):
res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
else:
res = 0
return res
def calculate_demographic_data(print_data=True):
# Read data from file
df = pd.read_csv(fpath)
# Q1 How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
race_count = df.groupby("race").size().sort_values(ascending=False)
# Q2 What is the average age of men?
average_age_men = df.groupby("sex")["age"].mean()["Male"]
average_age_men = round(average_age_men, 1)
# Q3 What is the percentage of people who have a Bachelor's degree?
edu_filt = df["education"] == "Bachelors"
percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
percentage_bachelors = round(percentage_bachelors, 1)
# Q4 What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
higher_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education_filt = df["education"].isin(higher_degrees)
higher_education = df[higher_education_filt]
salary_filt = higher_education["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
higher_education_rich = round(higher_education_rich, 1)
# Q5 What percentage of people without advanced education make more than 50K?
lower_education = df[~ higher_education_filt]
salary_filt = lower_education["salary"] == ">50K"
lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
lower_education_rich = round(lower_education_rich, 1)
# Q6 What is the minimum number of hours a person works per week (hours-per-week feature)?
min_work_hours = df["hours-per-week"].min()
# Q7 What percentage of the people who work the minimum number of hours per week have a salary of >50K?
hour_filt = df["hours-per-week"] == min_work_hours
min_hour_workers = df[hour_filt]
salary_filt = min_hour_workers["salary"] == ">50K"
rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
rich_percentage = round(rich_percentage, 1)
# Q8 What country has the highest percentage of people that earn >50K and what is that percentage?
by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
highest_earning_country = rich_percentage_by_country.idxmax()
highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
# Q9 Identify the most popular occupation for those who earn >50K in India.
salary_filt = df["salary"] == ">50K"
indian_filt = df["native-country"] == "India"
rich_indians = df[(indian_filt)&(salary_filt)]
ind_occupations = rich_indians.groupby("occupation").size()
top_IN_occupation = ind_occupations.idxmax()
# DO NOT MODIFY BELOW THIS LINE
if print_data:
print("Number of each race:\n", race_count)
print("Average age of men:", average_age_men)
print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
print(
f"Percentage with higher education that earn >50K: {higher_education_rich}%"
)
print(
f"Percentage without higher education that earn >50K: {lower_education_rich}%"
)
print(f"Min work time: {min_work_hours} hours/week")
print(
f"Percentage of rich among those who work fewest hours: {rich_percentage}%"
)
print("Country with highest percentage of rich:",
highest_earning_country)
print(
f"Highest percentage of rich people in country: {highest_earning_country_percentage}%"
)
print("Top occupations in India:", top_IN_occupation)
return {
'race_count': race_count,
'average_age_men': average_age_men,
'percentage_bachelors': percentage_bachelors,
'higher_education_rich': higher_education_rich,
'lower_education_rich': lower_education_rich,
'min_work_hours': min_work_hours,
'rich_percentage': rich_percentage,
'highest_earning_country': highest_earning_country,
'highest_earning_country_percentage':
highest_earning_country_percentage,
'top_IN_occupation': top_IN_occupation
}
Now it's time to harvest the fruits of our labor by calling the calculate_demographic_data function! And it worked as expected! ๐ใ
calculate_demographic_data()
Once you finish the freeCodeCamp project Demographic Data Analyzer, you'll become comfortable doing data analysis with the pandas library while manipulating pandas objects like Series, DataFrame, and GroupBy. Starting from the next project, we'll enter the realm of visualization, drawing figures with the help of libraries like matplotlib and seaborn. Until then ๐!