Building and evaluating a Naive Bayes classifier with WEKA

This is a follow-up to the previous post, where we calculated a Naive Bayes prediction by hand on a given data set. This time I want to demonstrate how all of this can be done using the WEKA application.

WEKA_GUI

For those who don’t know what WEKA is, I highly recommend visiting their website and getting the latest release. It is a powerful machine learning tool written in Java. You can find plenty of tutorials on YouTube on how to get started with WEKA, so I won’t go into details here; I’m sure you’ll be able to follow along anyway.

Preparing data for classification

We are going to use the same data set as in the previous post, with the weather attributes temperature and humidity and the class play (yes/no) indicating whether golf is played.

The data is stored in the ARFF file format, which is specific to WEKA, and looks like this:

@relation 'weather.symbolic-weka.filters.unsupervised.attribute.Remove-R1,4'
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute play {yes,no}
@data
hot,high,no
hot,high,no
hot,high,yes
mild,high,yes
cool,normal,yes
cool,normal,no
cool,normal,yes
mild,high,no
cool,normal,yes
mild,normal,yes
mild,normal,yes
mild,high,yes
hot,normal,yes
mild,high,no

Here we can see the attribute declarations (temperature, humidity, and play) followed by the data table. Using this data set, we are going to train a Naive Bayes model and then apply it to a new instance with temperature cool and humidity high to see which class it is assigned to.
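To make the numbers concrete before we touch the GUI, here is a small stand-alone Python sketch (independent of WEKA; the variable names are my own) that encodes the same 14 instances and computes the class priors:

```python
# The 14 training instances from the ARFF file above,
# encoded as (temperature, humidity, play) tuples.
data = [
    ("hot", "high", "no"),    ("hot", "high", "no"),
    ("hot", "high", "yes"),   ("mild", "high", "yes"),
    ("cool", "normal", "yes"), ("cool", "normal", "no"),
    ("cool", "normal", "yes"), ("mild", "high", "no"),
    ("cool", "normal", "yes"), ("mild", "normal", "yes"),
    ("mild", "normal", "yes"), ("mild", "high", "yes"),
    ("hot", "normal", "yes"),  ("mild", "high", "no"),
]

# Class counts and prior P(play=yes).
n_yes = sum(1 for *_, play in data if play == "yes")
n_no = len(data) - n_yes
print(n_yes, n_no)                   # 9 5
print(round(n_yes / len(data), 3))   # 0.643
```

So 9 of the 14 days are yes and 5 are no, giving a prior of about 64% for playing; these counts are what the classifier will build on.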

First of all, in the WEKA Explorer’s Preprocess tab, we need to open our ARFF data file:

WEKA_load_data

Here we can see basic statistics for the attributes. If you click the Edit button, a new Viewer window with the data table opens.

weather_data_table

In the Viewer you can edit the data as you like, and you can always save the modified data set with the Save button in the Explorer. We will do exactly that to create a test set with the values cool and high: delete all data rows except the first one and edit the remaining values to look like this:

weather_test_data_table

Leave the play attribute unset, because we don’t know it yet.

Click OK and then save the data as a separate file. The file should look like this:

@relation 'weather.symbolic-weka.filters.unsupervised.attribute.Remove-R1,4'
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute play {yes,no}
@data
cool,high,?

The question mark “?” is the standard way of representing a missing value in WEKA.

Building a Naive Bayes model

Now that we have the data prepared, we can proceed to build the model. Load the full weather data set again in the Explorer and go to the Classify tab.

Here you need to press the Choose button and select NaiveBayes from the classifier tree (under bayes). Make sure the play attribute is selected as the class attribute, then press the Start button to build the model.

WEKA_Naive_Bayes_model_build

The model outputs some information on how accurately it classifies, along with other statistics:

Correctly Classified Instances          9    64.2857 %
Incorrectly Classified Instances        5    35.7143 %

You can see that on the given data set the accuracy of the classifier is about 64%, so keep in mind that you shouldn’t take its results for granted. To get better results you might want to try different classifiers or preprocess the data further. We won’t go into that right now; our goal is to demonstrate using the model on new, unseen data.
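These percentages follow directly from the instance counts in the output; a quick stand-alone check (plain Python, independent of WEKA):

```python
# Counts reported by WEKA above: 9 of 14 instances classified correctly.
correct, total = 9, 14

accuracy = correct / total * 100
error_rate = (total - correct) / total * 100
print(round(accuracy, 4))    # 64.2857
print(round(error_rate, 4))  # 35.7143
```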

Evaluating classifier with the test set

Now that we have a model, we need to load the test data we created earlier. Under Test options, select Supplied test set and click the Set button.

WEKA_load_test_set

Click More options and in the new window choose PlainText under Output predictions:

WEKA_test_set_more_options

Then right-click the newly created model in the Result list and select Re-evaluate model on current test set.

weka_reevaluate_model

You should then see the prediction for the given instance (cool, high), like this:

=== Predictions on user test set ===

    inst#     actual  predicted error prediction
        1        1:?      1:yes       0.531

As you can see, the instance has been classified as yes; the prediction column shows the probability of 0.531 that WEKA assigns to that class (it is a confidence, not an error rate). In the previous analytical example we arrived at a probability of about 50% for the same prediction, so the two results agree closely.
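For nominal attributes, WEKA’s NaiveBayes applies a Laplace (add-one) correction to the frequency counts, and that correction is exactly where the 0.531 comes from. A minimal stand-alone Python sketch (my own variable and function names, not WEKA’s API) reproduces the number:

```python
def laplace(count, total, n_values):
    """Add-one (Laplace) smoothed probability estimate."""
    return (count + 1) / (total + n_values)

# The 14 training instances as (temperature, humidity, play) tuples.
data = [
    ("hot", "high", "no"),    ("hot", "high", "no"),
    ("hot", "high", "yes"),   ("mild", "high", "yes"),
    ("cool", "normal", "yes"), ("cool", "normal", "no"),
    ("cool", "normal", "yes"), ("mild", "high", "no"),
    ("cool", "normal", "yes"), ("mild", "normal", "yes"),
    ("mild", "normal", "yes"), ("mild", "high", "yes"),
    ("hot", "normal", "yes"),  ("mild", "high", "no"),
]

n = len(data)
n_yes = sum(1 for *_, c in data if c == "yes")  # 9
n_no = n - n_yes                                # 5

def score(cls, n_cls):
    # Prior P(class), smoothed over the 2 class values.
    p = laplace(n_cls, n, 2)
    # P(temperature=cool | class), smoothed over the 3 temperature values.
    p *= laplace(sum(1 for t, _, c in data if t == "cool" and c == cls), n_cls, 3)
    # P(humidity=high | class), smoothed over the 2 humidity values.
    p *= laplace(sum(1 for _, h, c in data if h == "high" and c == cls), n_cls, 2)
    return p

s_yes, s_no = score("yes", n_yes), score("no", n_no)
p_yes = s_yes / (s_yes + s_no)
print(round(p_yes, 3))  # 0.531 -- matches WEKA's prediction column
```

Without the add-one correction the same computation gives roughly 0.56 for yes, which is why a hand calculation and WEKA’s output can differ slightly while still agreeing on the predicted class.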
