Intro

To earn the freeCodeCamp certification on Data Analysis with Python, one has to complete five projects. This post goes through the steps for completing the one named Demographic Data Analyzer with the pandas library. For more background info, read the first post of this series.

Import

For this project, we need nothing but the pandas 🐼 library, which is preinstalled on Colab. This following graph summarizes its major use cases.

We'll be primarily dealing with the following three types of objects:

import pandas as pd

Dataset

Let's first clone the project repo and change the current directory to boilerplate-demographic-data-analyzer. In it, you'll find the adult.data.csv file, which is the dataset.

repo_url = "https://github.com/freeCodeCamp/boilerplate-demographic-data-analyzer"
!git clone {repo_url}
%cd boilerplate-demographic-data-analyzer

Cloning into 'boilerplate-demographic-data-analyzer'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 22 (delta 7), reused 12 (delta 2), pack-reused 0
Unpacking objects: 100% (22/22), done.
/content/boilerplate-demographic-data-analyzer

Since we're dealing with a csv file, we'll load the dataset with the read_csv function and save it as the variable df, which is pretty much the default name for DataFrame objects. A sample of 10 records is shown below.

fname = "adult.data.csv"
df = pd.read_csv(fname)
df.sample(10)

The dataset has in total 32561 rows and 15 columns, with each row representing a person and each column indicating an attribute of that person.

df.shape

(32561, 15)

Task instructions

As in my previous post, I use the rich library to render the README.md file for task instructions.

!pip install rich

from rich.console import Console
from rich.markdown import Markdown

def show_readme():
    console = Console()
    with open("README.md") as readme:
        markdown = Markdown(readme.read())
    console.print(markdown)

Collecting rich
  Downloading rich-11.2.0-py3-none-any.whl (217 kB)
     |████████████████████████████████| 217 kB 3.2 MB/s 
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 5.8 MB/s 
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich) (2.6.1)
Requirement already satisfied: typing-extensions<5.0,>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from rich) (3.10.0.2)
Installing collected packages: commonmark, colorama, rich
Successfully installed colorama-0.4.4 commonmark-0.9.1 rich-11.2.0

Assignment

show_readme()

                                         Assignment                                          

╔═══════════════════════════════════════════════════════════════════════════════════════════╗
║                                 Demographic Data Analyzer                                 ║
╚═══════════════════════════════════════════════════════════════════════════════════════════╝

In this challenge you must analyze demographic data using Pandas. You are given a dataset of 
demographic data that was extracted from the 1994 Census database. Here is a sample of what  
the data looks like:                                                                         

|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status  
| occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   
hours-per-week | native-country   | salary   | |---:|------:|:-----------------|---------:|:-
-----------|----------------:|:-------------------|:------------------|:---------------|:----
---|:-------|---------------:|---------------:|-----------------:|:-----------------|:-------
--| |  0 |    39 | State-gov        |    77516 | Bachelors   |              13 |             
Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 | 
0 |               40 | United-States    | <=50K    | |  1 |    50 | Self-emp-not-inc |       
83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband     
| White  | Male   |              0 |              0 |               13 | United-States    |  
<=50K    | |  2 |    38 | Private          |   215646 | HS-grad     |               9 |      
Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 | 
0 |               40 | United-States    | <=50K    | |  3 |    53 | Private          |       
234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband    
| Black  | Male   |              0 |              0 |               40 | United-States    |  
<=50K    | |  4 |    28 | Private          |   338409 | Bachelors   |              13 |      
Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 | 
0 |               40 | Cuba             | <=50K    |                                         

You must use Pandas to answer the following questions:                                       

 • How many people of each race are represented in this dataset? This should be a Pandas     
   series with race names as the index labels. (race column)                                 
 • What is the average age of men?                                                           
 • What is the percentage of people who have a Bachelor's degree?                            
 • What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make 
   more than 50K?                                                                            
 • What percentage of people without advanced education make more than 50K?                  
 • What is the minimum number of hours a person works per week?                              
 • What percentage of the people who work the minimum number of hours per week have a salary 
   of more than 50K?                                                                         
 • What country has the highest percentage of people that earn >50K and what is that         
   percentage?                                                                               
 • Identify the most popular occupation for those who earn >50K in India.                    

Use the starter code in the file demographic_data_analyzer. Update the code so all variables 
set to "None" are set to the appropriate calculation or code. Round all decimals to the      
nearest tenth.                                                                               

Unit tests are written for you under test_module.py.                                         

                                         Development                                         

For development, you can use main.py to test your functions. Click the "run" button and      
main.py will run.                                                                            

                                           Testing                                           

We imported the tests from test_module.py to main.py for your convenience. The tests will run
automatically whenever you hit the "run" button.                                             

                                         Submitting                                          

Copy your project's URL and submit it to freeCodeCamp.                                       

                                       Dataset Source                                        

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository                                
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information 
and Computer Science.

Steps

The assignment lists 9 questions, and I'll just follow the order in which they occur.

The pandas syntax is quite powerful, and a lot of things could happen with just one line of code. When I first started learning pandas, I wish there were some kind of visual aids that could show me the transformation of a dataset from one state to another. Luckily 😙, I came across the Pandas Tutor, which can be very helpful if you are a visual learner like me. With its help, I'll be providing links to various visualizations in due course. But bear in mind that for the purpose of visualization I only took 20 random records rather than the whole dataset. The point is to visualize a tranformation process, not to actually do data analysis on Pandas Tutor.

Q1: Race counts

To get the counts of each race, we group the dataset by race. Calling the groupby method on df returns a GroupBy object, which cannot be displayed.

df.groupby("race")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f528fe884d0>

In order to show some results, we'll have to call an additional method on the Groupby object. For instance, the size method returns the numbers we're looking for. Follow this link to see a visualization of this process.

race_count = df.groupby("race").size()
race_count

race
Amer-Indian-Eskimo      311
Asian-Pac-Islander     1039
Black                  3124
Other                   271
White                 27816
dtype: int64

I thought race_count would be the answer to the first question, but it didn't pass the unittest. If you look at the test_module.py file, you'll see that the expected values of race_count are in ascending order, which is not made clear in the instructions. So, we'll have to sort the race counts by calling the sort_values method on race_count and setting the value of ascending to be False.

race_count = df.groupby("race").size().sort_values(ascending=False)
race_count

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
dtype: int64

Q2: Average age of men

To get the average age of men, we first group the dataset by sex, then take the age column, and finally call the mean method. Many steps are going on here, follow this link to visualize each of them.

average_age = df.groupby("sex")["age"].mean()
average_age

sex
Female    36.858230
Male      39.433547
Name: age, dtype: float64

At this point, average_age is a Series object with two keys Female and Male. So we just need to specify the Male key to get the average age of men. Then we round it to the tenth by calling the builtin function round.

average_age_men = average_age["Male"]
print(f"Before rounding: {average_age_men}")
average_age_men = round(average_age_men, 1)
print(f"After rounding: {average_age_men}")

Before rounding: 39.43354749885268
After rounding: 39.4

Q3: Percentage of people with a Bachelor degree

To answer this question, we'll first need the fraction of the number of people with a Bachelor degree over the total number of people. To get the numerator, let's first create a Boolean filter named edu_filt, which will return True if the value of the education column for a given row is Bachelors or False if otherwise. Then we can just use the bracket notation df[edu_filt] to create a view of the dataset that satisfies the education criterion. Follow this link to see a visualization.

While the original dataset has 32561 records, the filtered one has only 5355 records. Since each record represents a person, these two numbers are precisely what we're looking for.

edu_filt = df["education"] == "Bachelors"
df[edu_filt]

The number of records in a DataFrame object is accessible via the attribute shape, which returns a tuple of (number of records, number of columns). So here's how we arrive at the percentage we're after.

percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
print(f"Before rounding: {percentage_bachelors}")
percentage_bachelors = round(percentage_bachelors, 1)
print(f"After rounding: {percentage_bachelors}")

Before rounding: 16.44605509658794
After rounding: 16.4

Q4: Percentage of people with advanced education who make more than 50k

As in the previous question, we need to filter the dataset by the education column. But this time around, we're filtering three values (Bachelors, Masters, and Doctorate for higher degrees) rather than just one. So, it'd be easier to create a Boolean filter by calling the isin method on df while providing a list of higher degrees as its argument. Let's save the filtered dataset as higher_education. Follow this link to visualize the filtering process.

higher_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education_filt = df["education"].isin(higher_degrees)
higher_education = df[higher_education_filt]
higher_education

The number of records in higher_education is then the denominator of the fraction we're looking for. For the numerator, it'll be the number of records in higher_education that is additionally filtered by the salary column. So my first attempted solution was as follows.

salary_filt = df["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")

Before rounding: 46.535843011613935
After rounding: 46.5

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

Though I did get the correct percentage, a warning popped up saying that "Boolean Series key will be reindexed to match DataFrame index." I thought I could just ignore the warning, but the solution ended up failing the unittest 😞. I later figured out what might have caused the warning. It's probably because the keys of the salary_filt Series don't match up with the index of the higher_education DataFrame, as shown below.

salary_filt = df["salary"] == ">50K"
print("Keys of salary_filt >>>")
print(salary_filt.keys())
print("Index of higher_education >>>")
print(higher_education.index)

Keys of salary_filt >>>
RangeIndex(start=0, stop=32561, step=1)
Index of higher_education >>>
Int64Index([    0,     1,     4,     5,     8,     9,    11,    12,    19,
               20,
            ...
            32530, 32531, 32532, 32533, 32536, 32538, 32539, 32544, 32553,
            32554],
           dtype='int64', length=7491)

To fix the problem, I redefined salary_filt based on the filtered higher_education rather than the original df. This way, all the keys of salary_filt would be equal to the index of higher_education.

salary_filt = higher_education["salary"] == ">50K"
print("Keys of salary_filt match index of higher_education?")
all(salary_filt.keys() == higher_education.index)

Keys of salary_filt match index of higher_education?

True

Once the salary filter is fixed, the warning is gone 😎!

salary_filt = higher_education["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")

Before rounding: 46.535843011613935
After rounding: 46.5

Q5: Percentage of people without advanced education who make more than 50k

The logic for this question is pretty much like that for the previous one, except that we'll first filter out people without higher degrees rather than those with them. To negate the Boolean filter higher_education_filt, we just need to add the ~ symbol in front of it. To see a visualization of the filtering process, follow this link.

lower_education = df[~ higher_education_filt]
lower_education

Now that we've got lower_education, we can use a salary filter as before to get a subset of lower_education, which, when divided by the total set, gives us the percentage we're looking for.

salary_filt = lower_education["salary"] == ">50K"
lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
print(f"Before rounding: {lower_education_rich}")
lower_education_rich = round(lower_education_rich, 1)
print(f"After rounding: {lower_education_rich}")

Before rounding: 17.3713601914639
After rounding: 17.4

Q6: Minimum hours per week

This question is the easiest of all. We simply grab the hours-per-week Series, and find out the minimal value in it by calling the methond min on it. The value is saved as min_work_hours for later use.

min_work_hours = df["hours-per-week"].min()
min_work_hours

1

Q7: Percentage of people working minimum hours per week who make more than 50k

Given min_work_hours, we can use a filter, named hour_filt, to filter out people who work minimum hours per week, a subset saved as min_hour_workers. This link shows how the filtered subset is created.

hour_filt = df["hours-per-week"] == min_work_hours
min_hour_workers = df[hour_filt]
min_hour_workers

Now, let's create a salary filter based on min_hour_workers. The number of records in min_hour_workers with the filter divided by the number without it times 100 produces the answer, which is saved as rich_percentage.

salary_filt = min_hour_workers["salary"] == ">50K"
rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
print(f"Before rounding: {rich_percentage}")
rich_percentage = round(rich_percentage, 1)
print(f"After rounding: {rich_percentage}")

Before rounding: 10.0
After rounding: 10.0

Q8: Country with the highest percentage of people who make more than 50k

This question is perhaps the most challenging of all. My initial move was to create a GroupBy object first selected by by native-country then by salary, and call the size method on it. This would produce a Series of counts for <=50K and >50K in each country, as follows. Go to this page to see this process in action.

df.groupby(["native-country", "salary"]).size()

native-country  salary
?               <=50K      437
                >50K       146
Cambodia        <=50K       12
                >50K         7
Canada          <=50K       82
                          ... 
United-States   >50K      7171
Vietnam         <=50K       62
                >50K         5
Yugoslavia      <=50K       10
                >50K         6
Length: 82, dtype: int64

But then it's cubersome to get the percentage of the counts for >50K over total counts in each country. The solution I submited for the first time worked, but it was not neat enough. After tackling the problem on a different occasion, I came up with a cleaner solution, to be explained below.

Let's forget about the breakdown of percentages by countries for the time being, and work out a way to get the percentage for all the dataset. Calling the value_counts method on the salary Series creates another Series with unique values in it as keys and their counts as the corresponding values. Let's name it salary_counts.

salary_counts = df["salary"].value_counts()
salary_counts

<=50K    24720
>50K      7841
Name: salary, dtype: int64

Now we can retrieve the number of people who make more than 50K by using the bracket notation.

salary_counts[">50K"]

7841

For the total number of people, we can just call the sum method on the salary_counts Series.

salary_counts.sum()

32561

With the two numbers in place, we're now ready to calculate the percentage, which is named rich_percentage.

rich_percentage = (salary_counts[">50K"]/salary_counts.sum()) * 100
rich_percentage

24.080955744602438

Next, to replicate what we just did, let's create a function named get_rich_percentage, which takes the salary Series as its argument and returns the percentage of people who make more than 50K.

def get_rich_percentage(salary_series):
    salary_counts = salary_series.value_counts()
    res = (salary_counts[">50K"]/salary_counts.sum()) * 100
    return res

A test on the function gives us the expected number.

get_rich_percentage(df["salary"])

24.080955744602438

Then, we apply the get_rich_percentage function to each group of a GroupBy object named by_country, which is selected by the native-country column. The outcome, saved as rich_percentage_by_country, would have been a Series of percentages across countries, were it not for the KeyError I got 😞.

by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"]))
rich_percentage_by_country

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/usr/local/lib/python3.7/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.7/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '>50K'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-44-34911dd4f740> in <module>()
----> 1 df.groupby("native-country").apply(lambda group: get_rich_percentage(group["salary"]))

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
   1273         with option_context("mode.chained_assignment", None):
   1274             try:
-> 1275                 result = self._python_apply_general(f, self._selected_obj)
   1276             except TypeError:
   1277                 # gh-20949

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1307             data after applying f
   1308         """
-> 1309         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1310 
   1311         return self._wrap_applied_output(

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    813             try:
    814                 sdata = splitter.sorted_data
--> 815                 result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
    816 
    817             except IndexError:

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/ops.py in fast_apply(self, f, sdata, names)
   1358         # must return keys::list, values::list, mutated::bool
   1359         starts, ends = lib.generate_slices(self.slabels, self.ngroups)
-> 1360         return libreduction.apply_frame_axis0(sdata, f, names, starts, ends)
   1361 
   1362     def _chop(self, sdata: DataFrame, slice_obj: slice) -> DataFrame:

/usr/local/lib/python3.7/dist-packages/pandas/_libs/reduction.pyx in pandas._libs.reduction.apply_frame_axis0()

<ipython-input-44-34911dd4f740> in <lambda>(group)
----> 1 df.groupby("native-country").apply(lambda group: get_rich_percentage(group["salary"]))

<ipython-input-42-95f3068a1aa3> in get_rich_percentage(salary_series)
      1 def get_rich_percentage(salary_series):
      2     salary_counts = salary_series.value_counts()
----> 3     res = (salary_counts[">50K"]/salary_counts.sum()) * 100
      4     return res

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
    940 
    941         elif key_is_scalar:
--> 942             return self._get_value(key)
    943 
    944         if is_hashable(key):

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
   1049 
   1050         # Similar to Index.get_value, but we do not fall back to positional
-> 1051         loc = self.index.get_loc(label)
   1052         return self.index._get_values_for_loc(self, loc, label)
   1053 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: '>50K'

Apparently, the key >50K is missing from some countries, so let's revise the get_rich_percentage function to handle the key error.

def get_rich_percentage(salary_series):
    salary_counts = salary_series.value_counts()
    if salary_counts.get(">50K"):
        res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
    else:
        res = 0
    return res

Having taken care of the key error, we now got the Series we're after 🙌. The only thing modified this time is to additionally call the sort_values method on the Series so that the numbers are in descending order. And I'm surprised to see Taiwan 🇹 among the top five on the list!

by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
rich_percentage_by_country

native-country
Iran                          41.860465
France                        41.379310
India                         40.000000
Taiwan                        39.215686
Japan                         38.709677
Yugoslavia                    37.500000
Cambodia                      36.842105
Italy                         34.246575
England                       33.333333
Canada                        32.231405
Germany                       32.116788
Philippines                   30.808081
Hong                          30.000000
Greece                        27.586207
China                         26.666667
Cuba                          26.315789
?                             25.042882
Scotland                      25.000000
United-States                 24.583476
Hungary                       23.076923
Ireland                       20.833333
South                         20.000000
Poland                        20.000000
Thailand                      16.666667
Ecuador                       14.285714
Jamaica                       12.345679
Laos                          11.111111
Portugal                      10.810811
Trinadad&Tobago               10.526316
Puerto-Rico                   10.526316
Haiti                          9.090909
El-Salvador                    8.490566
Honduras                       7.692308
Vietnam                        7.462687
Peru                           6.451613
Nicaragua                      5.882353
Mexico                         5.132193
Guatemala                      4.687500
Columbia                       3.389831
Dominican-Republic             2.857143
Outlying-US(Guam-USVI-etc)     0.000000
Holand-Netherlands             0.000000
dtype: float64

Finally, we call the idxmax method on rich_percentage_by_country to get the country with the highest percentage, which turns out to be Iran.

highest_earning_country = rich_percentage_by_country.idxmax()
highest_earning_country

'Iran'

And with the key Iran, we can easily get the percentage for that country.

highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
print(f"Before rounding: {highest_earning_country_percentage}")
highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
print(f"After rounding: {highest_earning_country_percentage}")

Before rounding: 41.86046511627907
After rounding: 41.9

Q9: Most popular occupation for those who make more than 50k in India

This final question is specifically asking about people who make more than 50K in India, so we need two Boolean filters, one based on salary and the other on native-country. Adding the symbol & between the two filters creates the and condition, which filters out the subset we want, named rich_indians 💰🇮 .

salary_filt = df["salary"] == ">50K"
indian_filt = df["native-country"] == "India"
rich_indians = df[(indian_filt)&(salary_filt)]
rich_indians

Next, since we're trying to find out the most popular job among rich Indians, we need a count for each unique job. One of the neatest way to go about this is to call the groupby method on rich_indians and then the size method on the resulting GroupBy object.

ind_occupations = rich_indians.groupby("occupation").size()
ind_occupations

occupation
Adm-clerical         1
Exec-managerial      8
Other-service        2
Prof-specialty      25
Sales                1
Tech-support         2
Transport-moving     1
dtype: int64

Finally, we call the idxmax method on the ind_occupations Series to get the most popular occupation, which is Prof-specialty.

top_IN_occupation = ind_occupations.idxmax()
top_IN_occupation

'Prof-specialty'

Put it together

Here's the complete code for this project, including the main function calculate_demographic_data and the helper function get_rich_percentage for Q8.

import pandas as pd

fpath = "adult.data.csv"

def get_rich_percentage(salary_series):
    salary_counts = salary_series.value_counts()
    if salary_counts.get(">50K"):
        res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
    else:
        res = 0
    return res

def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv(fpath)

    # Q1 How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby("race").size().sort_values(ascending=False)

    # Q2 What is the average age of men?
    average_age_men = df.groupby("sex")["age"].mean()["Male"]
    average_age_men = round(average_age_men, 1)

    # Q3 What is the percentage of people who have a Bachelor's degree?
    edu_filt = df["education"] == "Bachelors"
    percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
    percentage_bachelors = round(percentage_bachelors, 1)

    # Q4 What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    higher_degrees = ["Bachelors", "Masters", "Doctorate"]
    higher_education_filt = df["education"].isin(higher_degrees)
    higher_education = df[higher_education_filt]
    salary_filt = higher_education["salary"] == ">50K"
    higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
    higher_education_rich = round(higher_education_rich, 1)

    # Q5 What percentage of people without advanced education make more than 50K?
    lower_education = df[~ higher_education_filt]
    salary_filt = lower_education["salary"] == ">50K"
    lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
    lower_education_rich = round(lower_education_rich, 1)

    # Q6 What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df["hours-per-week"].min()

    # Q7 What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    hour_filt = df["hours-per-week"] == min_work_hours
    min_hour_workers = df[hour_filt]
    salary_filt = min_hour_workers["salary"] == ">50K"
    rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
    rich_percentage = round(rich_percentage, 1)

    # Q8 What country has the highest percentage of people that earn >50K and what is that percentage?
    by_country = df.groupby("native-country")
    rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
    highest_earning_country = rich_percentage_by_country.idxmax()
    highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
    highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
    
    # Q9 Identify the most popular occupation for those who earn >50K in India.
    salary_filt = df["salary"] == ">50K"
    indian_filt = df["native-country"] == "India"
    rich_indians = df[(indian_filt)&(salary_filt)]
    ind_occupations = rich_indians.groupby("occupation").size()
    top_IN_occupation = ind_occupations.idxmax()

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count)
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(
            f"Percentage with higher education that earn >50K: {higher_education_rich}%"
        )
        print(
            f"Percentage without higher education that earn >50K: {lower_education_rich}%"
        )
        print(f"Min work time: {min_work_hours} hours/week")
        print(
            f"Percentage of rich among those who work fewest hours: {rich_percentage}%"
        )
        print("Country with highest percentage of rich:",
              highest_earning_country)
        print(
            f"Highest percentage of rich people in country: {highest_earning_country_percentage}%"
        )
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

Now it's time to harvest the fruits of our labor by calling the calculate_demographic_data function! And it worked as expected! 🎊㊗

calculate_demographic_data()

Number of each race:
 race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty

{'average_age_men': 39.4,
 'higher_education_rich': 46.5,
 'highest_earning_country': 'Iran',
 'highest_earning_country_percentage': 41.9,
 'lower_education_rich': 17.4,
 'min_work_hours': 1,
 'percentage_bachelors': 16.4,
 'race_count': race
 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
 dtype: int64,
 'rich_percentage': 10.0,
 'top_IN_occupation': 'Prof-specialty'}

Recap

Once you finish the freeCodeCamp project Demographic Data Analyzer, you'll become comfortable doing data analysis with the pandas library while manipulating pandas objects like Series, DataFrame, and GroupBy. Starting from the next project, we'll enter the realm of visualization, drawing figures with the help of libraries like matplotlib and seaborn. Until then 👋!

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
21634	38	Private	179668	HS-grad	9	Married-civ-spouse	Protective-serv	Husband	White	Male	0	40	Scotland	<=50K
4268	36	Private	281021	HS-grad	9	Never-married	Transport-moving	Not-in-family	White	Male	0	45	United-States	<=50K
1863	25	Private	176520	HS-grad	9	Never-married	Craft-repair	Other-relative	White	Male	0	40	United-States	<=50K
10570	21	?	188535	Some-college	10	Never-married	?	Own-child	White	Male	0	40	United-States	<=50K
22607	19	Private	243373	12th	8	Never-married	Sales	Other-relative	White	Male	1055	40	United-States	<=50K
13134	41	State-gov	180272	Masters	14	Never-married	Prof-specialty	Own-child	White	Female	0	35	United-States	<=50K
14015	49	Self-emp-inc	119565	Masters	14	Married-civ-spouse	Sales	Husband	White	Male	0	40	United-States	>50K
32054	23	Private	335067	Bachelors	13	Never-married	Sales	Not-in-family	White	Male	0	50	United-States	<=50K
14950	51	Private	74660	10th	6	Married-civ-spouse	Craft-repair	Husband	White	Male	0	40	United-States	>50K
28137	49	Private	209146	HS-grad	9	Married-civ-spouse	Transport-moving	Husband	White	Male	0	40	United-States	<=50K

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
9	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	0	40	United-States	>50K
11	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	India	>50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32530	35	?	320084	Bachelors	13	Married-civ-spouse	?	Wife	White	Female	0	0	55	United-States	>50K
32531	30	?	33811	Bachelors	13	Never-married	?	Not-in-family	Asian-Pac-Islander	Female	0	0	99	United-States	<=50K
32533	54	Private	337992	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	Asian-Pac-Islander	Male	0	0	50	Japan	>50K
32536	34	Private	160216	Bachelors	13	Never-married	Exec-managerial	Not-in-family	White	Female	0	0	55	United-States	>50K
32538	38	Private	139180	Bachelors	13	Divorced	Prof-specialty	Unmarried	Black	Female	15020	0	45	United-States	>50K

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
6	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	0	16	Jamaica	<=50K
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	>50K
10	37	Private	280464	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	Black	Male	0	0	80	United-States	>50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States	<=50K
32557	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States	>50K
32558	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States	<=50K
32559	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States	<=50K
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States	>50K

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
189	58	State-gov	109567	Doctorate	16	Married-civ-spouse	Prof-specialty	Husband	White	Male	0	0	1	United-States	>50K
1036	66	Self-emp-inc	150726	9th	5	Married-civ-spouse	Exec-managerial	Husband	White	Male	1409	0	1	?	<=50K
1262	69	?	195779	Assoc-voc	11	Widowed	?	Not-in-family	White	Female	0	0	1	United-States	<=50K
5590	78	?	363134	HS-grad	9	Widowed	?	Not-in-family	White	Female	0	0	1	United-States	<=50K
5632	45	?	189564	Masters	14	Married-civ-spouse	?	Wife	White	Female	0	0	1	United-States	<=50K
5766	62	?	97231	Some-college	10	Married-civ-spouse	?	Wife	White	Female	0	0	1	United-States	<=50K
5808	76	?	211574	10th	6	Married-civ-spouse	?	Husband	White	Male	0	0	1	United-States	<=50K
8447	67	?	244122	Assoc-voc	11	Widowed	?	Not-in-family	White	Female	0	0	1	United-States	<=50K
9147	75	?	260543	10th	6	Widowed	?	Other-relative	Asian-Pac-Islander	Female	0	0	1	China	<=50K
11451	27	Private	147951	HS-grad	9	Never-married	Machine-op-inspct	Other-relative	White	Male	0	0	1	United-States	<=50K
19337	72	?	76860	HS-grad	9	Married-civ-spouse	?	Husband	Asian-Pac-Islander	Male	0	0	1	United-States	<=50K
19750	23	Private	72887	HS-grad	9	Never-married	Craft-repair	Own-child	Asian-Pac-Islander	Male	0	0	1	Vietnam	<=50K
20072	65	?	76043	HS-grad	9	Married-civ-spouse	?	Husband	White	Male	0	0	1	United-States	>50K
20909	77	Self-emp-not-inc	71676	Some-college	10	Widowed	Adm-clerical	Not-in-family	White	Female	0	1944	1	United-States	<=50K
22960	21	Private	184135	HS-grad	9	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	1	United-States	<=50K
23536	69	?	320280	Some-college	10	Never-married	?	Not-in-family	White	Male	1848	0	1	United-States	<=50K
24284	57	Self-emp-not-inc	56480	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	1	United-States	<=50K
25078	74	Private	260669	10th	6	Divorced	Other-service	Not-in-family	White	Female	0	0	1	United-States	<=50K
29752	69	?	117525	Assoc-acdm	12	Divorced	?	Unmarried	White	Female	0	0	1	United-States	<=50K
32525	81	?	120478	Assoc-voc	11	Divorced	?	Unmarried	White	Female	0	0	1	?	<=50K