Which Gender is associated with this Name? R to the R-escue!

When you address someone you don't know who has an uncommon name, e.g. in an email, you might not know whether this person is male or female. In this post, we make it a fun little project to let R help us with that, so read on!

Of course, R cannot figure out the gender just by looking at the names; we need some data! A very impressive dataset can be found here: Gender by Name Data Set.

In this dataset, we find nearly 150,000 first/given names of male and female babies. The source datasets come from government authorities:

US: Baby Names from Social Security Card Applications – National Data, 1880 to 2019
UK: Baby names in England and Wales Statistical bulletins, 2011 to 2018
Canada: British Columbia 100 Years of Popular Baby names, 1918 to 2018
Australia: Popular Baby Names, Attorney-General’s Department, 1944 to 2019

NB: Because of the origin of the data, the categories here are strictly binary (male/female) and not gender-diverse.

We can now write a simple R function which formats the output a little bit and provides us with percentage values in case the name is used for both genders:

name_gender_data <- read.csv("data/name_gender_dataset.csv") # change path accordingly

name_gender <- function(name) {
  data <- name_gender_data[name_gender_data$Name == name, 1:3]
  data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
  colnames(data) <- c("Name", "Gender", "Percent")
  rownames(data) <- NULL
  data
}
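
To see what the function computes without downloading the file, here is a minimal sketch on a made-up data frame. The counts are invented for illustration, the column names `Name`, `Gender`, and `Count` are assumed to correspond to the first three columns of the dataset, and `name_gender_toy` is a hypothetical variant that takes the data as an argument:

```r
# Toy stand-in for the real dataset; the counts are invented for illustration
toy_data <- data.frame(
  Name   = c("Charlie", "Charlie", "Elle"),
  Gender = c("M", "F", "F"),
  Count  = c(900, 100, 50),
  stringsAsFactors = FALSE
)

# Same logic as name_gender(), but with the data passed in as an argument
name_gender_toy <- function(name, data = toy_data) {
  hits <- data[data$Name == name, 1:3]                               # rows for this name
  hits <- cbind(hits[1:2], round(hits[3] / sum(hits[3]), 3) * 100)   # counts -> percent
  colnames(hits) <- c("Name", "Gender", "Percent")
  rownames(hits) <- NULL
  hits
}

name_gender_toy("Charlie")
```

With these invented counts, the call returns two rows for "Charlie": 90 percent male and 10 percent female.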

I, of course, start by trying it on my own name 😉

name_gender("Holger")
##     Name Gender Percent
## 1 Holger      M     100

Now, how about a name not everybody might know the gender of, “Emre”:

name_gender("Emre")
##   Name Gender Percent
## 1 Emre      M     100

Same with “Elle”:

name_gender("Elle")
##   Name Gender Percent
## 1 Elle      F     100

How about names that are given to both genders, like “Charlie”:

name_gender("Charlie")
##      Name Gender Percent
## 1 Charlie      M    86.9
## 2 Charlie      F    13.1

And, as the last example, what happens when the name is not included in the data:

name_gender("nobody")
## [1] Name    Gender  Percent
## <0 rows> (or 0-length row.names)
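
If you want a single answer instead of a table, one possible approach is to wrap the lookup in a helper that distinguishes unknown names from ambiguous ones. This is only a sketch: `likely_gender`, its `threshold` parameter, and the toy counts are assumptions made for illustration and are not part of the dataset or the function above:

```r
# Toy stand-in for the real dataset; the counts are invented for illustration
toy_data <- data.frame(
  Name   = c("Charlie", "Charlie", "Elle"),
  Gender = c("M", "F", "F"),
  Count  = c(900, 100, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns "M", "F", "ambiguous", or NA for an unknown name
likely_gender <- function(name, data = toy_data, threshold = 95) {
  hits <- data[data$Name == name, ]
  if (nrow(hits) == 0) return(NA_character_)     # name not in the data
  share <- hits$Count / sum(hits$Count) * 100    # percentage per gender
  best <- which.max(share)                       # most frequent gender
  if (share[best] >= threshold) hits$Gender[best] else "ambiguous"
}

likely_gender("nobody")   # unknown name, gives NA
likely_gender("Elle")     # only one gender in the toy data
likely_gender("Charlie")  # 90 percent male is below the 95 percent threshold
```

Returning `NA` for unknown names keeps them apart from names that are in the data but used for both genders, so the two cases can be handled differently downstream.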

I hope that you enjoyed this little project and that it will prove helpful. Do you have other ideas about what to do with this dataset? Leave them in the comments!

6 thoughts on “Which Gender is associated with this Name? R to the R-escue!”

Hsingti Wu says:

November 14, 2022 at 1:45 am

Edit distance might work for names that have no match in the database, provided that a small edit distance to a name of known gender turns out to be a reliable indicator of the same gender.

1. Learning Machines says:
  
  November 14, 2022 at 6:25 am
  
  Thank you for your comment, Hsingti!
  
  I think this might do more harm than good, just think of those combinations:
  
  Joseph/Josephina
  Robert/Roberta
  Kyle/Kyla
  Antonio/Antonia
  Louis/Louisa
  Stephan/Stephania/Stephanie
  Brian/Brianna
  George/Georgia
  Felix/Felicia
  Claude/Claudia
  Alexander/Alexandra
  Eric/Erica
  Simon/Simone/Simona
  Andrew/Andrea
  Carl/Carla
  Philip/Philippa
  
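
The point can be checked with a small sketch using `adist()` from base R's utils package, which computes the Levenshtein edit distance. The four pairs are taken from the list above; the variable names are chosen for illustration only:

```r
# Pairs taken from the list above
male   <- c("Eric", "Carl", "Robert", "Andrew")
female <- c("Erica", "Carla", "Roberta", "Andrea")

# adist() returns a matrix of edit distances between all combinations;
# the diagonal holds the distance within each male/female pair
pair_distance <- diag(adist(male, female))
names(pair_distance) <- paste(male, female, sep = "/")
pair_distance
```

Each of these pairs is only one edit apart, so the nearest known name by edit distance would often suggest the wrong gender.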
Mike Smith says:

November 14, 2022 at 6:48 pm
What a great dataset to point to! I manage a membership and events database and we regularly get accounts added with no gender. We like to report on gender and I had been thinking about how I could update these records and this is the base data to use! I now pull the data directly from the database, match the name caseless, then iterate through the rows and adjust the percentage match to increase the number of hits. I then use ifelse to match where there is only one result from your function (i.e. male or female), then pull the gender out and push it back into the database. Brilliant!

p.s. For the df dataframe below, make sure to use the correct column index for the relevant name and gender

Here’s the relevant code update:
```
library(stringr) # provides word()

# df is the membership data frame; column 2 holds the forename and
# column 4 the gender here, adjust the indices to your own data
df$forenames <- word(df$forenames, 1)   # use only the first name in a forename field
df <- subset(df, nchar(forenames) > 1)  # remove single character names

# read in the gender dataset
names <- read.csv("C:\\software\\name_gender_dataset.csv")

# gender function
gender <- function(name) {
  data <- names[tolower(names$Name) == tolower(name), 1:3] # compares lowercase names
  data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
  colnames(data) <- c("Name", "Gender", "Percent")
  rownames(data) <- NULL
  data
}

for (i in seq_len(nrow(df))) {
  result <- gender(df[i, 2])
  result <- subset(result, Percent >= 95)          # use matches at set probability
  if (nrow(result) == 1) df[i, 4] <- result[1, 2]  # update only when one gender remains
}
```
1. Learning Machines says:
  
  November 14, 2022 at 7:24 pm
  
  Dear Mike, thank you so much for your great feedback! I am really overwhelmed and happy that it is useful to you!
  
  Also, a big Thank You for sharing the code snippet.
  
Mike says:

November 15, 2022 at 10:41 am

No problem, hopefully it can help someone else as you helped me! Three corrections to the code snippet:

line 2: df <- subset(df, nchar(forenames) > 1) #remove single character names

line 3: delete this line

line 19: result <- subset(result, Percent >= 95) #use matches at set probability
