Running remote host Weka experiments

Previously we tried running the Weka server to utilize all processor cores in classification tasks. But it appears the Weka server only works in the Explorer, for classification routines. For more advanced machine learning there is a more flexible tool – the Experimenter – and the Weka server doesn't support it. So what do you do if you want more performance, or simply want to utilize the multi-core processor of a local machine? There is a way out, but it is trickier. Weka can perform remote experiments, which spread the load across multiple host machines that have Weka set up. You can read the remote experiment documentation on the Weka wikispaces, but in some places it can be confusing. It took me a while to figure out some parts by trial and error. The trickiest part is setting everything up and preparing the necessary commands to run before performing a remote experiment. So let's…

Continue reading

Utilizing multi-core processor for classification in WEKA

Currently WEKA is one of the most popular machine learning tools. Without programming skills you can do serious classification, clustering and big data analysis. For some time I had been using its standard GUI features without thinking much about performance bottlenecks. But research tasks are becoming more complex, using ensemble, voting and other meta-algorithms that run multiple classifiers simultaneously, so the performance issues are becoming annoying. You have to wait for hours until a task is completed. The problem is that classification algorithms run from the WEKA GUI utilize a single core of your processor. An algorithm such as the Multilayer Perceptron running 10-fold cross-validation calculates one fold at a time on one core, taking a long time to finish. So I started looking for options to make it use all processor cores, with a separate thread for each fold of the operation. There are a couple of options…

Continue reading

Regularized Logistic regression

Previously we tried logistic regression without regularization and with a simple training data set. But as we all know, things in real life aren't as simple as we would like. There are many types of data that need to be classified. The number of features can grow into the hundreds or thousands while the number of instances may be limited. Also, in many cases we might need to classify into more than two classes. The first problem that may arise due to a large number of features is over-fitting. This is when the learned hypothesis hΘ(x) fits the training data too well (cost J(Θ) ≈ 0) but fails when classifying new data samples. In other words, the model tries to classify each training example correctly by drawing a very complicated decision boundary between the training data points. As you can see in the image above, over-fitting would be the green decision boundary. So how to deal with…
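To make the regularization idea concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from the post) of the logistic cost J(Θ) with an L2 penalty added; note that the intercept Θ₀ is conventionally left out of the penalty term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Logistic cost J(theta) plus an L2 penalty; theta[0] (intercept) is not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return cost + penalty
```

With λ = 0 this reduces to the unregularized cost from the earlier post; larger λ pushes the non-intercept parameters toward zero and smooths the decision boundary.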

Continue reading

Implementing logistic regression learner with python

Logistic regression is the next step up from linear regression. Most real-life data have non-linear relationships, so applying linear models can be ineffective. Logistic regression is capable of handling non-linear effects in prediction tasks. You can think of lots of different scenarios where logistic regression could be applied. There is financial, demographic, health, weather and other data where the model could be applied and used to predict upcoming events. For instance, you can classify emails into spam and non-spam, transactions as fraudulent or not, or tumors as malignant or benign. In order to understand logistic regression, let's cover some basics, do a simple classification on a data set with two features, and then test it on real-life data with multiple features.
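As a quick sketch of the core idea (plain Python, with names of my own choosing): logistic regression squashes a linear score through the sigmoid function into a probability in (0, 1), and predicts the positive class when that probability reaches 0.5:

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Classify as 1 when P(y=1|x) >= 0.5, i.e. when the linear score is >= 0."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(score) >= 0.5 else 0
```

The sigmoid is what lets the model express non-linear effects on the probability scale while the underlying score stays linear in the parameters.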

Continue reading

Building and evaluating Naive Bayes classifier with WEKA

This is a follow-up to the previous post, where we calculated Naive Bayes predictions on a given data set. This time I want to demonstrate how all this can be done using the WEKA application. For those who don't know what WEKA is, I highly recommend visiting its website and getting the latest release. It is really powerful machine learning software written in Java. You can find plenty of tutorials on YouTube on how to get started with WEKA, so I won't go into details. I'm sure you'll be able to follow along anyway.

Continue reading

Simple explanation of Naive Bayes classifier

You've probably heard about the Naive Bayes classifier, and likely used it in some GUI-based tool like the WEKA package. It is a number one algorithm for getting initial classification results, and sometimes it surprisingly outperforms other models in speed, accuracy and simplicity. Let's see how this algorithm looks and what it does. As you may know, the algorithm is based on Bayes' theorem of probability, which allows predicting the class of an unknown data sample. Hopefully you are comfortable with probability math – at least some basics.
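The core of the method can be sketched in a few lines of plain Python (the toy numbers and names below are my own, not from the post): by Bayes' theorem, P(class | features) is proportional to P(class) · Π P(feature | class), and since the denominator P(features) is the same for every class, we can simply pick the class with the largest product:

```python
def naive_bayes_predict(priors, likelihoods, features):
    """Return the class c maximizing P(c) * prod(P(f|c)) over the observed features."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[c][f]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# toy spam example: per-class priors and per-word conditional probabilities
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.5, "free": 0.6},
    "ham":  {"offer": 0.1, "free": 0.2},
}
```

Here an email containing "offer" and "free" scores 0.4·0.5·0.6 = 0.12 for spam versus 0.6·0.1·0.2 = 0.012 for ham, so it is classified as spam.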

Continue reading

Linear regression with multiple features

linear regression cost function

Single-feature linear regression is really simple. All you need is to find the function that fits the training data best. It is also easy to plot the data and the learned curve. But in reality, regression analysis is based on multiple features, so in most cases we cannot visualize the multidimensional space where the data could be plotted. We need to rely on the methods we use, so you have to feel comfortable with linear algebra, where matrices and vectors are used. If previously we had one feature (temperature), now we need to introduce more of them, so we must expand the hypothesis to accept more features. From now on, instead of the output y we are going to use the h(x) notation. As you can see, with more variables (features) we also end up with more parameters θ that have to be learned. Before we move on, let's find suitable data that we can use for building…
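Once a bias feature x₀ = 1 is prepended to every example, the expanded hypothesis is just the dot product h(x) = Θᵀx. A minimal NumPy sketch (toy numbers of my own, not from the post):

```python
import numpy as np

def hypothesis(theta, X):
    """h(x) = theta0*x0 + theta1*x1 + ... for every row of X (x0 is the bias, always 1)."""
    return X @ theta

theta = np.array([1.0, 2.0, 3.0])   # theta0 (intercept), theta1, theta2
X = np.array([[1.0, 4.0, 5.0]])     # one training example: x0 = 1, x1 = 4, x2 = 5
# hypothesis(theta, X) -> [24.0], since 1 + 2*4 + 3*5 = 24
```

The same one-line matrix product handles any number of features, which is why the vectorized notation pays off as the feature count grows.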

Continue reading