When you address somebody you don't know who has an uncommon name, e.g. in an email, you might not know whether this person is male or female. In this post, we make it a little fun project to let R help us with that, so read on!
Of course, R cannot figure out the gender just by looking at the names; we need some data! A very impressive dataset can be found here: Gender by Name Data Set.
In this dataset, we find nearly one hundred fifty thousand instances of first/given names of male and female babies. The source datasets are from government authorities:
- US: Baby Names from Social Security Card Applications – National Data, 1880 to 2019
- UK: Baby names in England and Wales Statistical bulletins, 2011 to 2018
- Canada: British Columbia 100 Years of Popular Baby names, 1918 to 2018
- Australia: Popular Baby Names, Attorney-General’s Department, 1944 to 2019
NB: Because of the origin of the data, the categories here are strictly binary (male/female) and not gender-diverse.
We can now write a simple R function which formats the output a little bit and provides us with percentage values in case the name is used for both genders:
name_gender_data <- read.csv("data/name_gender_dataset.csv") # change path accordingly

name_gender <- function(name) {
  data <- name_gender_data[name_gender_data$Name == name, 1:3]
  data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
  colnames(data) <- c("Name", "Gender", "Percent")
  rownames(data) <- NULL
  data
}
I, of course, start by trying it on my own name 😉
name_gender("Holger")
##     Name Gender Percent
## 1 Holger      M     100
Now, how about a name not everybody might know the gender of, “Emre”:
name_gender("Emre")
##   Name Gender Percent
## 1 Emre      M     100
Same with “Elle”:
name_gender("Elle")
##   Name Gender Percent
## 1 Elle      F     100
How about names that are given to both genders, like “Charlie”:
name_gender("Charlie")
##      Name Gender Percent
## 1 Charlie      M    86.9
## 2 Charlie      F    13.1
And, as the last example, what happens when the name is not included in the data:
name_gender("nobody")
## [1] Name    Gender  Percent
## <0 rows> (or 0-length row.names)
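If a single label per name is needed, the three cases above (one gender, both genders, no match) can be folded into one helper. The following is only a sketch under assumptions: `toy_names` is a made-up stand-in for the real dataset with invented counts, and `guess_gender` with its `threshold` argument is a hypothetical name, not part of the dataset or of any package. It matches names case-insensitively and returns `NA` when the name is missing or when no gender reaches the threshold.

```r
# Toy stand-in for the real dataset; the counts are invented for illustration
toy_names <- data.frame(
  Name   = c("Holger", "Elle", "Charlie", "Charlie"),
  Gender = c("M", "F", "M", "F"),
  Count  = c(50, 200, 869, 131),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns "M", "F" or NA for a single name
guess_gender <- function(name, data = toy_names, threshold = 0.95) {
  # case-insensitive match on the name column
  hits <- data[tolower(data$Name) == tolower(name), ]
  if (nrow(hits) == 0) return(NA_character_)  # name not in the data
  share <- hits$Count / sum(hits$Count)       # share of each gender
  best <- which.max(share)
  # only commit to a label if the dominant share is high enough
  if (share[best] >= threshold) hits$Gender[best] else NA_character_
}

guess_gender("holger")                    # "M"
guess_gender("Charlie")                   # NA, the dominant share is below 0.95
guess_gender("Charlie", threshold = 0.8)  # "M"
guess_gender("nobody")                    # NA
```

Returning `NA` instead of guessing keeps ambiguous and unknown names visible, so they can be handled manually. Where to set the threshold is a judgment call that depends on how costly a wrong label is.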
I hope that you enjoyed this little project and that it will prove helpful. Do you have other ideas about what to do with this dataset? Leave them in the comments!
Edit distance might work for names that are not matched in the database, if a small edit distance proves to be a good indicator of the same gender among names of known gender.
Thank you for your comment, Hsingti!
I think this might do more harm than good. Just think of these combinations:
Joseph/Josephina
Robert/Roberta
Kyle/Kyla
Antonio/Antonia
Louis/Louisa
Stephan/Stephania/Stephanie
Brian/Brianna
George/Georgia
Felix/Felicia
Claude/Claudia
Alexander/Alexandra
Eric/Erica
Simon/Simone/Simona
Andrew/Andrea
Carl/Carla
Philip/Philippa
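To make this concrete, here is a quick check with `adist()` from base R's `utils` package, which computes the generalized Levenshtein (edit) distance. The pairs are taken from the list above; the variable names are just for illustration.

```r
# A few male/female pairs from the list above
male   <- c("Robert", "Kyle", "Eric", "Carl", "Brian", "Felix")
female <- c("Roberta", "Kyla", "Erica", "Carla", "Brianna", "Felicia")

# adist() returns a matrix of all pairwise distances;
# the diagonal holds the distance within each male/female pair
distances <- diag(utils::adist(male, female))
names(distances) <- paste(male, female, sep = "/")
distances
```

Each of these pairs is at most three edits apart, and several differ by a single edit, yet the gender flips. A nearest-neighbour lookup by edit distance would therefore often land on a name of the opposite gender.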
What a great dataset to point to! I manage a membership and events database and we regularly get accounts added with no gender. We like to report on gender and I had been thinking about how I could update these records and this is the base data to use! I now pull the data directly from the database, match the name caseless, then iterate through the rows and adjust the percentage match to increase the number of hits. I then use ifelse to match where there is only one result from your function (i.e. male or female), then pull the gender out and push it back into the database. Brilliant!
p.s. For the df dataframe below, make sure to use the correct column index for the relevant name and gender.
Here’s the relevant code update:
Dear Mike, thank you so much for your great feedback! I am really overwhelmed and happy that it is useful to you!
Also, a big Thank You for sharing the code snippet.
No problems – hopefully it can help someone else as you helped me! Two errors on the code snippet:
line 2: df 1) #remove single character names
line 3: delete this line
line 19: result=95) #use matches at set probability