Which Gender is associated with this Name? R to the R-escue!


When you address someone you don't know who has an uncommon name, e.g. in an email, you might not know whether this person is male or female. In this post, we turn this into a fun little project and let R help us with that, so read on!

Of course, R cannot figure out the gender just by looking at the names; we need some data! A very impressive dataset can be found here: Gender by Name Data Set.

In this dataset, we find nearly one hundred fifty thousand instances of first/given names of male and female babies. The source datasets come from government authorities:

  • US: Baby Names from Social Security Card Applications – National Data, 1880 to 2019
  • UK: Baby names in England and Wales Statistical bulletins, 2011 to 2018
  • Canada: British Columbia 100 Years of Popular Baby names, 1918 to 2018
  • Australia: Popular Baby Names, Attorney-General’s Department, 1944 to 2019

NB: Because of the origin of the data, the categories here are strictly binary (male/female) and not gender-diverse.

We can now write a simple R function that formats the output a little and provides percentage values in case the name is used for both genders:

name_gender_data <- read.csv("data/name_gender_dataset.csv") # change path accordingly

name_gender <- function(name) {
  # keep the rows whose name matches exactly (case-sensitive), first three columns only
  data <- name_gender_data[name_gender_data$Name == name, 1:3]
  # turn the third column into each gender's share in percent
  data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
  colnames(data) <- c("Name", "Gender", "Percent")
  rownames(data) <- NULL
  data
}
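If you want to try the logic before downloading the file, here is a minimal sketch on a toy data frame. The column layout (name, gender, count) and the counts themselves are assumptions made up for illustration, and `name_gender_from()` is a hypothetical variant of the function above that takes the data as an argument:

```r
# Toy stand-in for the real dataset; the counts are invented for illustration.
toy_names <- data.frame(
  Name   = c("Charlie", "Charlie", "Elle"),
  Gender = c("M", "F", "F"),
  Count  = c(870, 130, 50)
)

# Same idea as name_gender(), but with the data passed in explicitly
# (hypothetical helper, not part of the original post).
name_gender_from <- function(name, data) {
  hits <- data[data$Name == name, 1:3]
  # share of each gender among all rows for this name, in percent
  hits$Percent <- round(hits[[3]] / sum(hits[[3]]), 3) * 100
  out <- hits[, c(1, 2, 4)]
  rownames(out) <- NULL
  out
}

toy_result <- name_gender_from("Charlie", toy_names)
print(toy_result)
```

With these made-up counts, "Charlie" comes out as 87 percent male and 13 percent female, because 870 of the 1000 toy records are male.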

I, of course, start by trying it on my own name 😉

name_gender("Holger")
##     Name Gender Percent
## 1 Holger      M     100

Now, how about a name not everybody might know the gender of, “Emre”:

name_gender("Emre")
##   Name Gender Percent
## 1 Emre      M     100

Same with “Elle”:

name_gender("Elle")
##   Name Gender Percent
## 1 Elle      F     100

How about names that are given to both genders, like “Charlie”:

name_gender("Charlie")
##      Name Gender Percent
## 1 Charlie      M    86.9
## 2 Charlie      F    13.1

And, as the last example, what happens when the name is not included in the data:

name_gender("nobody")
## [1] Name    Gender  Percent
## <0 rows> (or 0-length row.names)
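An empty data frame is fine for interactive use, but in a script you might prefer a single value. The following is only a sketch: `likely_gender()` and its `threshold` parameter are hypothetical names, and the function assumes its input has the same three columns that `name_gender()` returns:

```r
# Hypothetical helper: takes a data frame shaped like the output of
# name_gender() (columns Name, Gender, Percent) and returns one label or NA.
likely_gender <- function(result, threshold = 50) {
  # name not found in the data
  if (nrow(result) == 0) return(NA_character_)
  # row with the highest share
  top <- result[which.max(result$Percent), ]
  # not confident enough
  if (top$Percent < threshold) return(NA_character_)
  as.character(top$Gender)
}

# Hand-built inputs that mimic the outputs shown above
charlie <- data.frame(Name = "Charlie", Gender = c("M", "F"), Percent = c(86.9, 13.1))
nobody  <- charlie[0, ]

likely_gender(charlie)                  # "M"
likely_gender(nobody)                   # NA
likely_gender(charlie, threshold = 95)  # NA, because 86.9 is below 95
```

Returning `NA` for both "not found" and "too ambiguous" keeps the calling code simple. If you need to tell the two cases apart, return different values instead.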

I hope that you enjoyed this little project and that it will prove helpful. Do you have other ideas about what to do with this dataset? Leave them in the comments!

6 thoughts on “Which Gender is associated with this Name? R to the R-escue!”

  1. Edit distance might work for names that have no match in the database, if a small edit distance to a name of known gender proved to be a good indicator of the same gender.

    1. Thank you for your comment, Hsingti!

      I think this might do more harm than good, just think of those combinations:

      Joseph/Josephina
      Robert/Roberta
      Kyle/Kyla
      Antonio/Antonia
      Louis/Louisa
      Stephan/Stephania/Stephanie
      Brian/Brianna
      George/Georgia
      Felix/Felicia
      Claude/Claudia
      Alexander/Alexandra
      Eric/Erica
      Simon/Simone/Simona
      Andrew/Andrea
      Carl/Carla
      Philip/Philippa
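
      To put a number on it, here is a small sketch that uses `adist()` from base R's utils package to compute the edit (Levenshtein) distance for a few of those pairs. The data frame `name_pairs` is just a hand-picked illustration:

```r
# A few of the pairs from the list above
name_pairs <- data.frame(
  male   = c("Robert", "Eric", "Carl", "Louis"),
  female = c("Roberta", "Erica", "Carla", "Louisa")
)

# adist() returns a matrix of edit distances; for two single strings it is
# 1 x 1, so as.integer() reduces it to a plain number.
name_pairs$distance <- mapply(
  function(a, b) as.integer(utils::adist(a, b)),
  name_pairs$male, name_pairs$female,
  USE.NAMES = FALSE
)

print(name_pairs)
```

      Each of these pairs is just one edit apart, yet the usual gender differs, so a nearest-name lookup would often point to the wrong gender.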

  2. What a great dataset to point to! I manage a membership and events database and we regularly get accounts added with no gender. We like to report on gender and I had been thinking about how I could update these records and this is the base data to use! I now pull the data directly from the database, match the name caseless, then iterate through the rows and adjust the percentage match to increase the number of hits. I then use ifelse to match where there is only one result from your function (i.e. male or female), then pull the gender out and push it back into the database. Brilliant!

    p.s. For the df dataframe below, make sure to use the correct column index for the relevant name and gender

    Here’s the relevant code update:

    df$forenames <- stringr::word(df$forenames, 1) #use only the first name in a forename field (word() comes from stringr)
    df <- subset(df, nchar(forenames) > 1) #remove single character names
    names <- read.csv("C:\\software\\name_gender_dataset.csv")  #read in the gender dataset
    
    #read in name database
    names <- read.csv("C:\\software\\name_gender_dataset.csv") 
    
    #gender function
    gender <- function(name) {
      data <- names[tolower(names$Name) == tolower(name), 1:3] #compares lowercase names
      data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
      colnames(data) <- c("Name", "Gender", "Percent")
      rownames(data) <- NULL
      data
    }
    
    for (i in seq_len(nrow(df))) {
      result <- gender(df[i, 2])
      result <- subset(result, Percent >= 95) #use matches at set probability
      if (nrow(result) == 1) df[i, 4] <- result[1, 2] #only update when exactly one gender remains
    }
    
  3. No problems – hopefully it can help someone else as you helped me! Two errors on the code snippet:

    line 2: df <- subset(df, nchar(forenames) > 1) #remove single character names

    line 3: delete this line

    line 19: result <- subset(result, Percent >= 95) #use matches at set probability
