OneR in Medical Research: Finding Leading Symptoms, Main Predictors and Cut-Off Points

This blog already features many examples that use the OneR package (on CRAN); you can find them in the respective Category: OneR.

Here we will give you some concrete examples in the area of research on Type 2 Diabetes Mellitus (DM) to show that the package is especially well suited in the field of medical research, so read on!

One of the big advantages of the package is that the resulting models are often not only highly accurate but very easy to interpret:

  • the predictors are ordered from best to worst (based on accuracy), the best one is chosen,
  • the model is given in the form of simple if-then rules,
  • the rules contain exact cut-off points.
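To make these three properties concrete, here is a minimal base-R sketch of the one-rule idea on a small made-up data frame. It is not the OneR package's implementation, only an illustration of the principle; the toy data, the column names and the helper one_rule are assumptions introduced for this example.

```r
# Toy data, made up for illustration (not taken from any dataset in this post)
toy <- data.frame(
  polyuria = c("Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"),
  itching  = c("Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"),
  class    = c("Positive", "Positive", "Positive", "Negative",
               "Negative", "Negative", "Positive", "Positive"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: for one predictor, map each of its values to the
# most frequent class and measure how often that mapping is correct.
one_rule <- function(data, predictor, target) {
  tab <- table(data[[predictor]], data[[target]])
  rule <- colnames(tab)[apply(tab, 1, which.max)]
  names(rule) <- rownames(tab)
  accuracy <- sum(apply(tab, 1, max)) / nrow(data)
  list(predictor = predictor, rule = rule, accuracy = accuracy)
}

# Build one rule set per predictor and order them from best to worst
predictors <- setdiff(names(toy), "class")
candidates <- lapply(predictors, function(p) one_rule(toy, p, "class"))
accuracies <- sapply(candidates, function(m) m$accuracy)
candidates <- candidates[order(accuracies, decreasing = TRUE)]

# The best predictor and its if-then rules
best <- candidates[[1]]
best$predictor  # "polyuria"
best$accuracy   # 0.875
for (value in names(best$rule)) {
  cat("If", best$predictor, "=", value, "then class =", best$rule[[value]], "\n")
}
```

On this toy data the sketch ranks polyuria (7 of 8 instances correct) ahead of itching (5 of 8) and prints one if-then rule per value of the chosen predictor.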

An additional advantage, compared to other methods, is that the included optbin function finds as many cut-off points as are needed to separate all the classes, instead of just one (as e.g. with decision trees).

For more advantages, a quick introduction, and a real-world example in the area of histology (the study of the microscopic structure of tissues) for breast cancer detection have a look at the official vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

The first example is based on the early-stage diabetes risk prediction dataset from the Queen Mary University of London, which contains the sign and symptom data of newly diabetic or would-be diabetic patients (diabetes_data_upload.csv). We use this dataset to find the leading symptoms of diabetes:


# leading symptoms
library(OneR)
data1 <- read.csv("data/diabetes_data_upload.csv") # adjust path accordingly
OneR(data1, verbose = TRUE)
##     Attribute          Accuracy
## 1 * Polyuria           82.31%  
## 2   Polydipsia         80.19%  
## 3   partial.paresis    69.23%  
## 4   sudden.weight.loss 69.04%  
## 5   Gender             68.08%  
## 6   Alopecia           65.96%  
## 7   Polyphagia         65.58%  
## 8   Age                64.42%  
## 9   weakness           63.65%  
## 10  Genital.thrush     61.54%  
## 10  visual.blurring    61.54%  
## 10  Itching            61.54%  
## 10  Irritability       61.54%  
## 10  delayed.healing    61.54%  
## 10  muscle.stiffness   61.54%  
## 10  Obesity            61.54%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## Call:
## OneR.data.frame(x = data1, verbose = TRUE)
## Rules:
## If Polyuria = No  then class = Negative
## If Polyuria = Yes then class = Positive
## Accuracy:
## 428 of 520 instances classified correctly (82.31%)

As we can see in the table the leading symptoms are polyuria (excessive urination volume) and polydipsia (excessive thirst) with an accuracy of over 80 percent each. This result is corroborated by the medical literature.

The next dataset is the quite famous Pima Indians Diabetes Database which is often used as a benchmark for machine learning methods. It can be found in the mlbench package (on CRAN):

# glucose
library(mlbench)
data(PimaIndiansDiabetes)
data2 <- PimaIndiansDiabetes
OneR(optbin(data2))
## Call:
## OneR.data.frame(x = optbin(data2))
## Rules:
## If glucose = (-0.199,141] then diabetes = neg
## If glucose = (141,199]    then diabetes = pos
## Accuracy:
## 573 of 768 instances classified correctly (74.61%)

Glucose (blood sugar) with a cut-off value of 141 is identified as the main predictor of DM; the “official” cut-off point is 140 mg/dl.
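How a data-driven cut-off point can arise is easiest to see with a deliberately naive search. The following base-R sketch tries every midpoint between neighbouring values and keeps the one with the highest accuracy. This is not how optbin works internally (it has its own binning methods, which this sketch does not reproduce); the values, labels and the helper best_cutoff are made up for illustration.

```r
# Made-up glucose-like values with class labels, for illustration only
x <- c(85, 92, 101, 110, 118, 125, 133, 139, 146, 152, 160, 171, 183, 195)
y <- factor(c("neg", "neg", "neg", "neg", "neg", "pos", "neg",
              "neg", "pos", "pos", "neg", "pos", "pos", "pos"))

# Hypothetical helper: exhaustive search for a single cut-off point.
# Everything above the cut-off is predicted as the positive class.
best_cutoff <- function(x, y, positive = "pos") {
  v <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2  # midpoints between neighbours
  acc <- sapply(cuts, function(cut) mean((x > cut) == (y == positive)))
  list(cutoff = cuts[which.max(acc)], accuracy = max(acc))
}

res <- best_cutoff(x, y)
res$cutoff    # 142.5
res$accuracy  # 12 of 14 correct, i.e. about 0.857
```

A search like this yields exactly one cut-off and therefore only fits two-class problems; that limitation is what separates it from a method that returns as many cut-off points as there are classes to separate.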

The last dataset is from a National Health and Nutrition Examination Survey (NHANES): nhgh.rda (here you can find more info on the attributes of the dataset).

# HbA1c
load("data/nhgh.rda") # adjust path accordingly
data3 <- nhgh[ , !names(nhgh) %in% c("seqn", "tx")]
OneR(optbin(dx ~., data = data3, method = "infogain"))
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit):
## target is numeric
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit): 1452
## instance(s) removed due to missing values
## Call:
## OneR.data.frame(x = optbin(dx ~ ., data = data3, method = "infogain"))
## Rules:
## If gh = (3.99,6.4] then dx = 0
## If gh = (6.4,15.5] then dx = 1
## Accuracy:
## 4955 of 5343 instances classified correctly (92.74%)

Here HbA1c (glycated hemoglobin, measured primarily to determine the three-month average blood sugar level) with a cut-off value of 6.4 is identified as the main predictor for DM with an accuracy of nearly 93%; the “official” cut-off point lies at 6.5%.
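The call above uses method = "infogain". The general idea of an entropy-based cut-off, namely choosing the split that makes the resulting bins as pure as possible, can be sketched in a few lines of base R. This is a simplified illustration for a single binary split, not the package's actual code; the HbA1c-like values, the labels and the helper names entropy and info_gain are assumptions.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting x at a given cut-off:
# entropy before the split minus the weighted entropy of both bins
info_gain <- function(x, y, cut) {
  left <- y[x <= cut]
  right <- y[x > cut]
  entropy(y) -
    (length(left) / length(y)) * entropy(left) -
    (length(right) / length(y)) * entropy(right)
}

# Made-up HbA1c-like values and diagnoses (0 = no DM, 1 = DM)
gh <- c(4.8, 5.1, 5.3, 5.6, 5.9, 6.1, 6.3, 6.6, 7.0, 7.4, 8.2, 9.1)
dx <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1)

# Evaluate every midpoint between neighbouring values
v <- sort(unique(gh))
cuts <- (head(v, -1) + tail(v, -1)) / 2
gains <- sapply(cuts, function(cut) info_gain(gh, dx, cut))
best_cut <- cuts[which.max(gains)]
best_cut  # the midpoint between 6.1 and 6.3
```

On this toy data the split between 6.1 and 6.3 wins because it leaves the lower bin completely pure, even though one instance above it is still misclassified.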

In fact, several researchers around the world use the OneR package already. To give you just one publication: Computational prediction of diagnosis and feature selection on mesothelioma patient health records by D. Chicco and C. Rovelli, PLoS One, 2019.

I myself have a paper on COVID-19 under review which was submitted in cooperation with Dr. med. Anna Laura Herzog and Prof. Dr. med. Patrick Meybohm, both from the renowned University Hospital Würzburg, where we used the OneR package among other machine learning methods.

I hope that you can see that the OneR package is well worth a try (not only) in the field of medical research. If you have a project in mind where you are looking for a cooperation partner please leave a note in the comments or contact me directly: About.

2 thoughts on “OneR in Medical Research: Finding Leading Symptoms, Main Predictors and Cut-Off Points”

  1. Dear Prof. Jouanne-Diedrich,

    Many thanks for the amazing content you are sharing in this blog and for making the OneR package.

    I have only recently come across the topic of OneR myself and it is really hard to find anything about it on the Internet.
    So I am even happier to have finally found an expert.

    I would also like to say that I think it is very admirable that you are giving OneR the opportunity to be used in medicine.

    I don’t want to inconvenience you but I have three questions about the OneR package:

    Do you also give online training on OneR?
    How would you briefly describe the purpose of the OneR algorithm in the OneR package?
    In what specific areas do you see a future for OneR outside of medicine?

    Thank you in advance for your efforts
    Happy new year 2021 stay healthy!

    1. Dear Chloe,

      Thank you for your great feedback, I feel humbled.

      Concerning your questions:

      – I have created a video on using the OneR package, you can find it here: One Rule (OneR) Machine Learning Classification in under One Minute. If you are interested in interactive online trainings, please contact me directly: About.
      – I had several purposes for creating the OneR package, I list them in the accompanying vignette under “design principles” here: OneR – Establishing a New Baseline for Machine Learning Classification Models.
      – Actually, the areas where you can use it are basically unlimited. Since I created the package, many people have approached me who are active in areas as diverse as climate change and brain research. You can find many more posts on it in the category I created here: Category: OneR.

      If you don’t mind let us connect on LinkedIn, you can find me here: Holger K. von Jouanne-Diedrich.

      Thank you again and stay safe too!
      best h
