Open In Colab

Intro

To earn the freeCodeCamp certification on Data Analysis with Python, one has to complete five projects. This post goes through the steps for completing the one named Demographic Data Analyzer with the pandas library. For more background info, read the first post of this series.

Import

For this project, we need nothing but the pandas ๐Ÿผ library, which is preinstalled on Colab. This following graph summarizes its major use cases.

We'll be primarily dealing with the following three types of objects:

import pandas as pd

Dataset

Let's first clone the project repo and change the current directory to boilerplate-demographic-data-analyzer. In it, you'll find the adult.data.csv file, which is the dataset.

repo_url = "https://github.com/freeCodeCamp/boilerplate-demographic-data-analyzer"
!git clone {repo_url}
%cd boilerplate-demographic-data-analyzer

Cloning into 'boilerplate-demographic-data-analyzer'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 22 (delta 7), reused 12 (delta 2), pack-reused 0
Unpacking objects: 100% (22/22), done.
/content/boilerplate-demographic-data-analyzer

Since we're dealing with a csv file, we'll load the dataset with the read_csv function and save it as the variable df, which is pretty much the default name for DataFrame objects. A sample of 10 records is shown below.

fname = "adult.data.csv"
df = pd.read_csv(fname)
df.sample(10)

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
21634 38 Private 179668 HS-grad 9 Married-civ-spouse Protective-serv Husband White Male 0 0 40 Scotland <=50K
4268 36 Private 281021 HS-grad 9 Never-married Transport-moving Not-in-family White Male 0 0 45 United-States <=50K
1863 25 Private 176520 HS-grad 9 Never-married Craft-repair Other-relative White Male 0 0 40 United-States <=50K
10570 21 ? 188535 Some-college 10 Never-married ? Own-child White Male 0 0 40 United-States <=50K
22607 19 Private 243373 12th 8 Never-married Sales Other-relative White Male 1055 0 40 United-States <=50K
13134 41 State-gov 180272 Masters 14 Never-married Prof-specialty Own-child White Female 0 0 35 United-States <=50K
14015 49 Self-emp-inc 119565 Masters 14 Married-civ-spouse Sales Husband White Male 0 0 40 United-States >50K
32054 23 Private 335067 Bachelors 13 Never-married Sales Not-in-family White Male 0 0 50 United-States <=50K
14950 51 Private 74660 10th 6 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States >50K
28137 49 Private 209146 HS-grad 9 Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States <=50K

The dataset has in total 32561 rows and 15 columns, with each row representing a person and each column indicating an attribute of that person.

df.shape
(32561, 15)

Task instructions

As in my previous post, I use the rich library to render the README.md file for task instructions.

!pip install rich

from rich.console import Console
from rich.markdown import Markdown

def show_readme():
    console = Console()
    with open("README.md") as readme:
        markdown = Markdown(readme.read())
    console.print(markdown)

Collecting rich
  Downloading rich-11.2.0-py3-none-any.whl (217 kB)
     |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 217 kB 3.2 MB/s 
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 51 kB 5.8 MB/s 
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich) (2.6.1)
Requirement already satisfied: typing-extensions<5.0,>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from rich) (3.10.0.2)
Installing collected packages: commonmark, colorama, rich
Successfully installed colorama-0.4.4 commonmark-0.9.1 rich-11.2.0

Assignment

show_readme()

                                         Assignment                                          

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘                                 Demographic Data Analyzer                                 โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

In this challenge you must analyze demographic data using Pandas. You are given a dataset of 
demographic data that was extracted from the 1994 Census database. Here is a sample of what  
the data looks like:                                                                         

|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status  
| occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   
hours-per-week | native-country   | salary   | |---:|------:|:-----------------|---------:|:-
-----------|----------------:|:-------------------|:------------------|:---------------|:----
---|:-------|---------------:|---------------:|-----------------:|:-----------------|:-------
--| |  0 |    39 | State-gov        |    77516 | Bachelors   |              13 |             
Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 | 
0 |               40 | United-States    | <=50K    | |  1 |    50 | Self-emp-not-inc |       
83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband     
| White  | Male   |              0 |              0 |               13 | United-States    |  
<=50K    | |  2 |    38 | Private          |   215646 | HS-grad     |               9 |      
Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 | 
0 |               40 | United-States    | <=50K    | |  3 |    53 | Private          |       
234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband    
| Black  | Male   |              0 |              0 |               40 | United-States    |  
<=50K    | |  4 |    28 | Private          |   338409 | Bachelors   |              13 |      
Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 | 
0 |               40 | Cuba             | <=50K    |                                         

You must use Pandas to answer the following questions:                                       

 โ€ข How many people of each race are represented in this dataset? This should be a Pandas     
   series with race names as the index labels. (race column)                                 
 โ€ข What is the average age of men?                                                           
 โ€ข What is the percentage of people who have a Bachelor's degree?                            
 โ€ข What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make 
   more than 50K?                                                                            
 โ€ข What percentage of people without advanced education make more than 50K?                  
 โ€ข What is the minimum number of hours a person works per week?                              
 โ€ข What percentage of the people who work the minimum number of hours per week have a salary 
   of more than 50K?                                                                         
 โ€ข What country has the highest percentage of people that earn >50K and what is that         
   percentage?                                                                               
 โ€ข Identify the most popular occupation for those who earn >50K in India.                    

Use the starter code in the file demographic_data_analyzer. Update the code so all variables 
set to "None" are set to the appropriate calculation or code. Round all decimals to the      
nearest tenth.                                                                               

Unit tests are written for you under test_module.py.                                         

                                         Development                                         

For development, you can use main.py to test your functions. Click the "run" button and      
main.py will run.                                                                            

                                           Testing                                           

We imported the tests from test_module.py to main.py for your convenience. The tests will run
automatically whenever you hit the "run" button.                                             

                                         Submitting                                          

Copy your project's URL and submit it to freeCodeCamp.                                       

                                       Dataset Source                                        

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository                                
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information 
and Computer Science.                                                                        

Steps

The assignment lists 9 questions, and I'll just follow the order in which they occur.

The pandas syntax is quite powerful, and a lot of things could happen with just one line of code. When I first started learning pandas, I wish there were some kind of visual aids that could show me the transformation of a dataset from one state to another. Luckily ๐Ÿ˜™, I came across the Pandas Tutor, which can be very helpful if you are a visual learner like me. With its help, I'll be providing links to various visualizations in due course. But bear in mind that for the purpose of visualization I only took 20 random records rather than the whole dataset. The point is to visualize a tranformation process, not to actually do data analysis on Pandas Tutor.

Q1: Race counts

To get the counts of each race, we group the dataset by race. Calling the groupby method on df returns a GroupBy object, which cannot be displayed.

df.groupby("race")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f528fe884d0>

In order to show some results, we'll have to call an additional method on the Groupby object. For instance, the size method returns the numbers we're looking for. Follow this link to see a visualization of this process.

race_count = df.groupby("race").size()
race_count
race
Amer-Indian-Eskimo      311
Asian-Pac-Islander     1039
Black                  3124
Other                   271
White                 27816
dtype: int64

I thought race_count would be the answer to the first question, but it didn't pass the unittest. If you look at the test_module.py file, you'll see that the expected values of race_count are in ascending order, which is not made clear in the instructions. So, we'll have to sort the race counts by calling the sort_values method on race_count and setting the value of ascending to be False.

race_count = df.groupby("race").size().sort_values(ascending=False)
race_count
race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
dtype: int64

Q2: Average age of men

To get the average age of men, we first group the dataset by sex, then take the age column, and finally call the mean method. Many steps are going on here, follow this link to visualize each of them.

average_age = df.groupby("sex")["age"].mean()
average_age
sex
Female    36.858230
Male      39.433547
Name: age, dtype: float64

At this point, average_age is a Series object with two keys Female and Male. So we just need to specify the Male key to get the average age of men. Then we round it to the tenth by calling the builtin function round.

average_age_men = average_age["Male"]
print(f"Before rounding: {average_age_men}")
average_age_men = round(average_age_men, 1)
print(f"After rounding: {average_age_men}")
Before rounding: 39.43354749885268
After rounding: 39.4

Q3: Percentage of people with a Bachelor degree

To answer this question, we'll first need the fraction of the number of people with a Bachelor degree over the total number of people. To get the numerator, let's first create a Boolean filter named edu_filt, which will return True if the value of the education column for a given row is Bachelors or False if otherwise. Then we can just use the bracket notation df[edu_filt] to create a view of the dataset that satisfies the education criterion. Follow this link to see a visualization.

While the original dataset has 32561 records, the filtered one has only 5355 records. Since each record represents a person, these two numbers are precisely what we're looking for.

edu_filt = df["education"] == "Bachelors"
df[edu_filt]

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
9 42 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 5178 0 40 United-States >50K
11 30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32530 35 ? 320084 Bachelors 13 Married-civ-spouse ? Wife White Female 0 0 55 United-States >50K
32531 30 ? 33811 Bachelors 13 Never-married ? Not-in-family Asian-Pac-Islander Female 0 0 99 United-States <=50K
32533 54 Private 337992 Bachelors 13 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 0 0 50 Japan >50K
32536 34 Private 160216 Bachelors 13 Never-married Exec-managerial Not-in-family White Female 0 0 55 United-States >50K
32538 38 Private 139180 Bachelors 13 Divorced Prof-specialty Unmarried Black Female 15020 0 45 United-States >50K

5355 rows ร— 15 columns

The number of records in a DataFrame object is accessible via the attribute shape, which returns a tuple of (number of records, number of columns). So here's how we arrive at the percentage we're after.

percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
print(f"Before rounding: {percentage_bachelors}")
percentage_bachelors = round(percentage_bachelors, 1)
print(f"After rounding: {percentage_bachelors}")
Before rounding: 16.44605509658794
After rounding: 16.4

Q4: Percentage of people with advanced education who make more than 50k

As in the previous question, we need to filter the dataset by the education column. But this time around, we're filtering three values (Bachelors, Masters, and Doctorate for higher degrees) rather than just one. So, it'd be easier to create a Boolean filter by calling the isin method on df while providing a list of higher degrees as its argument. Let's save the filtered dataset as higher_education. Follow this link to visualize the filtering process.

higher_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education_filt = df["education"].isin(higher_degrees)
higher_education = df[higher_education_filt]
higher_education

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
5 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
8 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32538 38 Private 139180 Bachelors 13 Divorced Prof-specialty Unmarried Black Female 15020 0 45 United-States >50K
32539 71 ? 287372 Doctorate 16 Married-civ-spouse ? Husband White Male 0 0 10 United-States >50K
32544 31 Private 199655 Masters 14 Divorced Other-service Not-in-family Other Female 0 0 30 United-States <=50K
32553 32 Private 116138 Masters 14 Never-married Tech-support Not-in-family Asian-Pac-Islander Male 0 0 11 Taiwan <=50K
32554 53 Private 321865 Masters 14 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States >50K

7491 rows ร— 15 columns

The number of records in higher_education is then the denominator of the fraction we're looking for. For the numerator, it'll be the number of records in higher_education that is additionally filtered by the salary column. So my first attempted solution was as follows.

salary_filt = df["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")
Before rounding: 46.535843011613935
After rounding: 46.5
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  

Though I did get the correct percentage, a warning popped up saying that "Boolean Series key will be reindexed to match DataFrame index." I thought I could just ignore the warning, but the solution ended up failing the unittest ๐Ÿ˜ž. I later figured out what might have caused the warning. It's probably because the keys of the salary_filt Series don't match up with the index of the higher_education DataFrame, as shown below.

salary_filt = df["salary"] == ">50K"
print("Keys of salary_filt >>>")
print(salary_filt.keys())
print("Index of higher_education >>>")
print(higher_education.index)
Keys of salary_filt >>>
RangeIndex(start=0, stop=32561, step=1)
Index of higher_education >>>
Int64Index([    0,     1,     4,     5,     8,     9,    11,    12,    19,
               20,
            ...
            32530, 32531, 32532, 32533, 32536, 32538, 32539, 32544, 32553,
            32554],
           dtype='int64', length=7491)

To fix the problem, I redefined salary_filt based on the filtered higher_education rather than the original df. This way, all the keys of salary_filt would be equal to the index of higher_education.

salary_filt = higher_education["salary"] == ">50K"
print("Keys of salary_filt match index of higher_education?")
all(salary_filt.keys() == higher_education.index)
Keys of salary_filt match index of higher_education?
True

Once the salary filter is fixed, the warning is gone ๐Ÿ˜Ž!

salary_filt = higher_education["salary"] == ">50K"
higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
print(f"Before rounding: {higher_education_rich}")
higher_education_rich = round(higher_education_rich, 1)
print(f"After rounding: {higher_education_rich}")
Before rounding: 46.535843011613935
After rounding: 46.5

Q5: Percentage of people without advanced education who make more than 50k

The logic for this question is pretty much like that for the previous one, except that we'll first filter out people without higher degrees rather than those with them. To negate the Boolean filter higher_education_filt, we just need to add the ~ symbol in front of it. To see a visualization of the filtering process, follow this link.

lower_education = df[~ higher_education_filt]
lower_education

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
6 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica <=50K
7 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
10 37 Private 280464 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 80 United-States >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States <=50K
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States >50K
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States <=50K
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K

25070 rows ร— 15 columns

Now that we've got lower_education, we can use a salary filter as before to get a subset of lower_education, which, when divided by the total set, gives us the percentage we're looking for.

salary_filt = lower_education["salary"] == ">50K"
lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
print(f"Before rounding: {lower_education_rich}")
lower_education_rich = round(lower_education_rich, 1)
print(f"After rounding: {lower_education_rich}")
Before rounding: 17.3713601914639
After rounding: 17.4

Q6: Minimum hours per week

This question is the easiest of all. We simply grab the hours-per-week Series, and find out the minimal value in it by calling the methond min on it. The value is saved as min_work_hours for later use.

min_work_hours = df["hours-per-week"].min()
min_work_hours
1

Q7: Percentage of people working minimum hours per week who make more than 50k

Given min_work_hours, we can use a filter, named hour_filt, to filter out people who work minimum hours per week, a subset saved as min_hour_workers. This link shows how the filtered subset is created.

hour_filt = df["hours-per-week"] == min_work_hours
min_hour_workers = df[hour_filt]
min_hour_workers

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
189 58 State-gov 109567 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 1 United-States >50K
1036 66 Self-emp-inc 150726 9th 5 Married-civ-spouse Exec-managerial Husband White Male 1409 0 1 ? <=50K
1262 69 ? 195779 Assoc-voc 11 Widowed ? Not-in-family White Female 0 0 1 United-States <=50K
5590 78 ? 363134 HS-grad 9 Widowed ? Not-in-family White Female 0 0 1 United-States <=50K
5632 45 ? 189564 Masters 14 Married-civ-spouse ? Wife White Female 0 0 1 United-States <=50K
5766 62 ? 97231 Some-college 10 Married-civ-spouse ? Wife White Female 0 0 1 United-States <=50K
5808 76 ? 211574 10th 6 Married-civ-spouse ? Husband White Male 0 0 1 United-States <=50K
8447 67 ? 244122 Assoc-voc 11 Widowed ? Not-in-family White Female 0 0 1 United-States <=50K
9147 75 ? 260543 10th 6 Widowed ? Other-relative Asian-Pac-Islander Female 0 0 1 China <=50K
11451 27 Private 147951 HS-grad 9 Never-married Machine-op-inspct Other-relative White Male 0 0 1 United-States <=50K
19337 72 ? 76860 HS-grad 9 Married-civ-spouse ? Husband Asian-Pac-Islander Male 0 0 1 United-States <=50K
19750 23 Private 72887 HS-grad 9 Never-married Craft-repair Own-child Asian-Pac-Islander Male 0 0 1 Vietnam <=50K
20072 65 ? 76043 HS-grad 9 Married-civ-spouse ? Husband White Male 0 0 1 United-States >50K
20909 77 Self-emp-not-inc 71676 Some-college 10 Widowed Adm-clerical Not-in-family White Female 0 1944 1 United-States <=50K
22960 21 Private 184135 HS-grad 9 Never-married Machine-op-inspct Own-child Black Male 0 0 1 United-States <=50K
23536 69 ? 320280 Some-college 10 Never-married ? Not-in-family White Male 1848 0 1 United-States <=50K
24284 57 Self-emp-not-inc 56480 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 1 United-States <=50K
25078 74 Private 260669 10th 6 Divorced Other-service Not-in-family White Female 0 0 1 United-States <=50K
29752 69 ? 117525 Assoc-acdm 12 Divorced ? Unmarried White Female 0 0 1 United-States <=50K
32525 81 ? 120478 Assoc-voc 11 Divorced ? Unmarried White Female 0 0 1 ? <=50K

Now, let's create a salary filter based on min_hour_workers. The number of records in min_hour_workers with the filter divided by the number without it times 100 produces the answer, which is saved as rich_percentage.

salary_filt = min_hour_workers["salary"] == ">50K"
rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
print(f"Before rounding: {rich_percentage}")
rich_percentage = round(rich_percentage, 1)
print(f"After rounding: {rich_percentage}")
Before rounding: 10.0
After rounding: 10.0

Q8: Country with the highest percentage of people who make more than 50k

This question is perhaps the most challenging of all. My initial move was to create a GroupBy object first selected by by native-country then by salary, and call the size method on it. This would produce a Series of counts for <=50K and >50K in each country, as follows. Go to this page to see this process in action.

df.groupby(["native-country", "salary"]).size()
native-country  salary
?               <=50K      437
                >50K       146
Cambodia        <=50K       12
                >50K         7
Canada          <=50K       82
                          ... 
United-States   >50K      7171
Vietnam         <=50K       62
                >50K         5
Yugoslavia      <=50K       10
                >50K         6
Length: 82, dtype: int64

But then it's cubersome to get the percentage of the counts for >50K over total counts in each country. The solution I submited for the first time worked, but it was not neat enough. After tackling the problem on a different occasion, I came up with a cleaner solution, to be explained below.

Let's forget about the breakdown of percentages by countries for the time being, and work out a way to get the percentage for all the dataset. Calling the value_counts method on the salary Series creates another Series with unique values in it as keys and their counts as the corresponding values. Let's name it salary_counts.

salary_counts = df["salary"].value_counts()
salary_counts
<=50K    24720
>50K      7841
Name: salary, dtype: int64

Now we can retrieve the number of people who make more than 50K by using the bracket notation.

salary_counts[">50K"]
7841

For the total number of people, we can just call the sum method on the salary_counts Series.

salary_counts.sum()
32561

With the two numbers in place, we're now ready to calculate the percentage, which is named rich_percentage.

rich_percentage = (salary_counts[">50K"]/salary_counts.sum()) * 100
rich_percentage
24.080955744602438

Next, to replicate what we just did, let's create a function named get_rich_percentage, which takes the salary Series as its argument and returns the percentage of people who make more than 50K.

def get_rich_percentage(salary_series):
    salary_counts = salary_series.value_counts()
    res = (salary_counts[">50K"]/salary_counts.sum()) * 100
    return res

A test on the function gives us the expected number.

get_rich_percentage(df["salary"])
24.080955744602438

Then, we apply the get_rich_percentage function to each group of a GroupBy object named by_country, which is selected by the native-country column. The outcome, saved as rich_percentage_by_country, would have been a Series of percentages across countries, were it not for the KeyError I got ๐Ÿ˜ž.

by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"]))
rich_percentage_by_country

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/usr/local/lib/python3.7/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.7/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '>50K'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-44-34911dd4f740> in <module>()
----> 1 df.groupby("native-country").apply(lambda group: get_rich_percentage(group["salary"]))

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
   1273         with option_context("mode.chained_assignment", None):
   1274             try:
-> 1275                 result = self._python_apply_general(f, self._selected_obj)
   1276             except TypeError:
   1277                 # gh-20949

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1307             data after applying f
   1308         """
-> 1309         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1310 
   1311         return self._wrap_applied_output(

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    813             try:
    814                 sdata = splitter.sorted_data
--> 815                 result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
    816 
    817             except IndexError:

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/ops.py in fast_apply(self, f, sdata, names)
   1358         # must return keys::list, values::list, mutated::bool
   1359         starts, ends = lib.generate_slices(self.slabels, self.ngroups)
-> 1360         return libreduction.apply_frame_axis0(sdata, f, names, starts, ends)
   1361 
   1362     def _chop(self, sdata: DataFrame, slice_obj: slice) -> DataFrame:

/usr/local/lib/python3.7/dist-packages/pandas/_libs/reduction.pyx in pandas._libs.reduction.apply_frame_axis0()

<ipython-input-44-34911dd4f740> in <lambda>(group)
----> 1 df.groupby("native-country").apply(lambda group: get_rich_percentage(group["salary"]))

<ipython-input-42-95f3068a1aa3> in get_rich_percentage(salary_series)
      1 def get_rich_percentage(salary_series):
      2     salary_counts = salary_series.value_counts()
----> 3     res = (salary_counts[">50K"]/salary_counts.sum()) * 100
      4     return res

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
    940 
    941         elif key_is_scalar:
--> 942             return self._get_value(key)
    943 
    944         if is_hashable(key):

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
   1049 
   1050         # Similar to Index.get_value, but we do not fall back to positional
-> 1051         loc = self.index.get_loc(label)
   1052         return self.index._get_values_for_loc(self, loc, label)
   1053 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: '>50K'

Apparently, the key >50K is missing from some countries, so let's revise the get_rich_percentage function to handle the key error.

def get_rich_percentage(salary_series):
    salary_counts = salary_series.value_counts()
    if salary_counts.get(">50K"):
        res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
    else:
        res = 0
    return res

Having taken care of the key error, we now got the Series we're after ๐Ÿ™Œ. The only thing modified this time is to additionally call the sort_values method on the Series so that the numbers are in descending order. And I'm surprised to see Taiwan ๐Ÿ‡น among the top five on the list!

by_country = df.groupby("native-country")
rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
rich_percentage_by_country
native-country
Iran                          41.860465
France                        41.379310
India                         40.000000
Taiwan                        39.215686
Japan                         38.709677
Yugoslavia                    37.500000
Cambodia                      36.842105
Italy                         34.246575
England                       33.333333
Canada                        32.231405
Germany                       32.116788
Philippines                   30.808081
Hong                          30.000000
Greece                        27.586207
China                         26.666667
Cuba                          26.315789
?                             25.042882
Scotland                      25.000000
United-States                 24.583476
Hungary                       23.076923
Ireland                       20.833333
South                         20.000000
Poland                        20.000000
Thailand                      16.666667
Ecuador                       14.285714
Jamaica                       12.345679
Laos                          11.111111
Portugal                      10.810811
Trinadad&Tobago               10.526316
Puerto-Rico                   10.526316
Haiti                          9.090909
El-Salvador                    8.490566
Honduras                       7.692308
Vietnam                        7.462687
Peru                           6.451613
Nicaragua                      5.882353
Mexico                         5.132193
Guatemala                      4.687500
Columbia                       3.389831
Dominican-Republic             2.857143
Outlying-US(Guam-USVI-etc)     0.000000
Holand-Netherlands             0.000000
dtype: float64

Finally, we call the idxmax method on rich_percentage_by_country to get the country with the highest percentage, which turns out to be Iran.

highest_earning_country = rich_percentage_by_country.idxmax()
highest_earning_country
'Iran'

And with the key Iran, we can easily get the percentage for that country.

highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
print(f"Before rounding: {highest_earning_country_percentage}")
highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
print(f"After rounding: {highest_earning_country_percentage}")
Before rounding: 41.86046511627907
After rounding: 41.9

This final question is specifically asking about people who make more than 50K in India, so we need two Boolean filters, one based on salary and the other on native-country. Adding the symbol & between the two filters creates the and condition, which filters out the subset we want, named rich_indians ๐Ÿ’ฐ๐Ÿ‡ฎ .

salary_filt = df["salary"] == ">50K"
indian_filt = df["native-country"] == "India"
rich_indians = df[(indian_filt)&(salary_filt)]
rich_indians

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
11 30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
968 48 Private 164966 Bachelors 13 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 0 0 40 India >50K
1327 52 Private 168381 HS-grad 9 Widowed Other-service Unmarried Asian-Pac-Islander Female 0 0 40 India >50K
7258 42 State-gov 102343 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 72 India >50K
7285 54 State-gov 93449 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
8124 36 Private 172104 Prof-school 15 Never-married Prof-specialty Not-in-family Other Male 0 0 40 India >50K
9939 43 Federal-gov 325706 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 50 India >50K
10590 35 Private 98283 Prof-school 15 Never-married Prof-specialty Not-in-family Asian-Pac-Islander Male 0 0 40 India >50K
10661 59 Private 122283 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 99999 0 40 India >50K
10736 30 Private 243190 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 20 India >50K
11260 54 Private 225599 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 7298 0 40 India >50K
11384 34 Private 98283 Prof-school 15 Never-married Tech-support Not-in-family Asian-Pac-Islander Male 0 1564 40 India >50K
13422 53 Private 366957 Bachelors 13 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 99999 0 50 India >50K
13551 40 Private 220977 Doctorate 16 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 3103 0 40 India >50K
13862 45 Private 209912 Bachelors 13 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 0 0 40 India >50K
16017 41 Private 207578 Assoc-acdm 12 Married-civ-spouse Exec-managerial Husband Black Male 0 0 50 India >50K
16778 43 Private 242968 Masters 14 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 0 0 40 India >50K
16923 41 Private 143003 Assoc-voc 11 Married-civ-spouse Other-service Husband Asian-Pac-Islander Male 7298 0 60 India >50K
17259 57 Self-emp-inc 123053 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 15024 0 50 India >50K
17834 29 Self-emp-not-inc 341672 HS-grad 9 Married-spouse-absent Transport-moving Other-relative Asian-Pac-Islander Male 0 1564 50 India >50K
20417 42 Self-emp-inc 23510 Masters 14 Divorced Exec-managerial Unmarried Asian-Pac-Islander Male 0 2201 60 India >50K
20465 39 Private 198654 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 2415 67 India >50K
21128 30 Private 122889 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 50 India >50K
23474 55 State-gov 120781 Doctorate 16 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
24154 46 Private 229737 Bachelors 13 Married-civ-spouse Sales Husband White Male 0 0 50 India >50K
25739 35 Self-emp-inc 79586 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
26305 27 Private 207352 Bachelors 13 Married-civ-spouse Tech-support Husband Asian-Pac-Islander Male 0 0 40 India >50K
26356 34 Private 99872 Masters 14 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 3103 0 40 India >50K
27670 61 Private 80896 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 45 India >50K
28264 51 Self-emp-not-inc 120781 Prof-school 15 Married-civ-spouse Prof-specialty Husband Other Male 99999 0 70 India >50K
28433 42 Private 198341 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 1902 55 India >50K
28452 53 Private 70387 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 4386 0 40 India >50K
28557 34 Private 165737 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 43 India >50K
28651 45 Self-emp-not-inc 216402 Prof-school 15 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 50 India >50K
28798 30 Self-emp-not-inc 116666 Masters 14 Divorced Prof-specialty Not-in-family Asian-Pac-Islander Male 0 0 50 India >50K
30111 41 Federal-gov 219155 Prof-school 15 Married-civ-spouse Prof-specialty Husband White Male 0 0 50 India >50K
30152 48 Private 119471 Doctorate 16 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
30833 25 Private 110978 Assoc-acdm 12 Married-civ-spouse Adm-clerical Wife Asian-Pac-Islander Female 0 0 37 India >50K
31327 38 State-gov 125499 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 7688 0 60 India >50K
31357 23 Private 143003 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 1887 50 India >50K

Next, since we're trying to find out the most popular job among rich Indians, we need a count for each unique job. One of the neatest way to go about this is to call the groupby method on rich_indians and then the size method on the resulting GroupBy object.

ind_occupations = rich_indians.groupby("occupation").size()
ind_occupations
occupation
Adm-clerical         1
Exec-managerial      8
Other-service        2
Prof-specialty      25
Sales                1
Tech-support         2
Transport-moving     1
dtype: int64

Finally, we call the idxmax method on the ind_occupations Series to get the most popular occupation, which is Prof-specialty.

top_IN_occupation = ind_occupations.idxmax()
top_IN_occupation
'Prof-specialty'

Put it together

Here's the complete code for this project, including the main function calculate_demographic_data and the helper function get_rich_percentage for Q8.

import pandas as pd

fpath = "adult.data.csv"

def get_rich_percentage(salary_series):
    salary_counts = salary_series.value_counts()
    if salary_counts.get(">50K"):
        res = (salary_counts.get(">50K")/salary_counts.sum()) * 100
    else:
        res = 0
    return res

def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv(fpath)

    # Q1 How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby("race").size().sort_values(ascending=False)

    # Q2 What is the average age of men?
    average_age_men = df.groupby("sex")["age"].mean()["Male"]
    average_age_men = round(average_age_men, 1)

    # Q3 What is the percentage of people who have a Bachelor's degree?
    edu_filt = df["education"] == "Bachelors"
    percentage_bachelors = (df[edu_filt].shape[0]/df.shape[0]) * 100
    percentage_bachelors = round(percentage_bachelors, 1)

    # Q4 What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    higher_degrees = ["Bachelors", "Masters", "Doctorate"]
    higher_education_filt = df["education"].isin(higher_degrees)
    higher_education = df[higher_education_filt]
    salary_filt = higher_education["salary"] == ">50K"
    higher_education_rich = (higher_education[salary_filt].shape[0]/higher_education.shape[0]) * 100
    higher_education_rich = round(higher_education_rich, 1)

    # Q5 What percentage of people without advanced education make more than 50K?
    lower_education = df[~ higher_education_filt]
    salary_filt = lower_education["salary"] == ">50K"
    lower_education_rich = (lower_education[salary_filt].shape[0]/lower_education.shape[0]) * 100
    lower_education_rich = round(lower_education_rich, 1)

    # Q6 What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df["hours-per-week"].min()

    # Q7 What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    hour_filt = df["hours-per-week"] == min_work_hours
    min_hour_workers = df[hour_filt]
    salary_filt = min_hour_workers["salary"] == ">50K"
    rich_percentage = (min_hour_workers[salary_filt].shape[0]/min_hour_workers.shape[0]) * 100
    rich_percentage = round(rich_percentage, 1)

    # Q8 What country has the highest percentage of people that earn >50K and what is that percentage?
    by_country = df.groupby("native-country")
    rich_percentage_by_country = by_country.apply(lambda group: get_rich_percentage(group["salary"])).sort_values(ascending=False)
    highest_earning_country = rich_percentage_by_country.idxmax()
    highest_earning_country_percentage = rich_percentage_by_country[highest_earning_country]
    highest_earning_country_percentage = round(highest_earning_country_percentage, 1)
    
    # Q9 Identify the most popular occupation for those who earn >50K in India.
    salary_filt = df["salary"] == ">50K"
    indian_filt = df["native-country"] == "India"
    rich_indians = df[(indian_filt)&(salary_filt)]
    ind_occupations = rich_indians.groupby("occupation").size()
    top_IN_occupation = ind_occupations.idxmax()

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count)
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(
            f"Percentage with higher education that earn >50K: {higher_education_rich}%"
        )
        print(
            f"Percentage without higher education that earn >50K: {lower_education_rich}%"
        )
        print(f"Min work time: {min_work_hours} hours/week")
        print(
            f"Percentage of rich among those who work fewest hours: {rich_percentage}%"
        )
        print("Country with highest percentage of rich:",
              highest_earning_country)
        print(
            f"Highest percentage of rich people in country: {highest_earning_country_percentage}%"
        )
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

Now it's time to harvest the fruits of our labor by calling the calculate_demographic_data function! And it worked as expected! ๐ŸŽŠใŠ—

calculate_demographic_data()
Number of each race:
 race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty
{'average_age_men': 39.4,
 'higher_education_rich': 46.5,
 'highest_earning_country': 'Iran',
 'highest_earning_country_percentage': 41.9,
 'lower_education_rich': 17.4,
 'min_work_hours': 1,
 'percentage_bachelors': 16.4,
 'race_count': race
 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
 dtype: int64,
 'rich_percentage': 10.0,
 'top_IN_occupation': 'Prof-specialty'}

Recap

Once you finish the freeCodeCamp project Demographic Data Analyzer, you'll become comfortable doing data analysis with the pandas library while manipulating pandas objects like Series, DataFrame, and GroupBy. Starting from the next project, we'll enter the realm of visualization, drawing figures with the help of libraries like matplotlib and seaborn. Until then ๐Ÿ‘‹!