Implementing a logistic regression learner with Python

Logistic regression is the next step up from linear regression. Most real-life data have non-linear relationships, so applying linear models can be ineffective. Logistic regression can handle non-linear effects in prediction tasks. You can think of many different scenarios where logistic regression could be applied: financial, demographic, health, weather, and other data where the model could be used to predict subsequent events on future data. For instance, you can classify emails as spam or non-spam, transactions as fraudulent or not, and tumors as malignant or benign. In order to understand logistic regression, let's cover some basics, do a simple classification on a data set with two features, and then test it on real-life data with multiple features.
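To make the idea concrete before diving in, here is a minimal sketch of the core building block, the sigmoid function, assuming NumPy is available; the parameter vector and feature matrix below are illustrative placeholders, not learned results.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the (0, 1) range,
    # which we interpret as a class probability.
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # X is an (m, n) feature matrix with a leading column of ones,
    # theta an (n,) parameter vector; probability >= 0.5 means class 1.
    return sigmoid(X @ theta) >= 0.5

# Illustrative usage with placeholder values (not real data):
theta = np.array([-1.0, 0.5, 0.5])
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.2, 0.1]])
print(predict(theta, X))  # -> [ True False]
```

The actual learning step, finding θ from data, is what the rest of the post walks through.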

Continue reading

Building and evaluating a Naive Bayes classifier with WEKA

This is a follow-up to the previous post, where we calculated a Naive Bayes prediction on a given data set. This time I want to demonstrate how all of this can be done using the WEKA application. I highly recommend visiting their website and getting the latest release. WEKA is a widely used and highly regarded machine learning software package written in Java, offering a range of powerful data mining and modeling tools. It provides a user-friendly interface, making it accessible to experienced and novice users alike. WEKA offers a wide range of algorithms and data pre-processing techniques, making it a flexible and robust tool for machine learning applications such as classification, clustering, and association rule mining. You can find plenty of tutorials on YouTube on how to get started with WEKA, so I won't go into details; I'm sure you'll be able to follow along anyway.

Continue reading

Simple explanation of the Naive Bayes classifier

You have probably heard of the Naive Bayes classifier, and likely used it through GUI-based tools like the WEKA package. It is often the number one algorithm for getting initial classification results, and sometimes it surprisingly outperforms other models in speed, accuracy, and simplicity. Let's see what this algorithm looks like and what it does. As you may know, the algorithm is based on Bayes' theorem of probability, which lets us predict the class of unknown data. I hope you are comfortable with probability math – at least some basics.
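To ground the theorem before the classifier itself, here is Bayes' rule in a few lines of Python; the spam-word probabilities below are made-up illustrative numbers, not measurements.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Illustrative (made-up) numbers: probability an email is spam
# given that it contains the word "free".
p_spam = 0.3             # P(spam)
p_free_given_spam = 0.6  # P("free" | spam)
p_free = 0.25            # P("free") over all emails
print(bayes_posterior(p_free_given_spam, p_spam, p_free))  # -> 0.72
```

The "naive" part of the classifier is simply applying this rule while assuming the features are independent of each other.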

Continue reading

Linear regression with multiple features

[Figure: linear regression cost function]

Single-feature linear regression is simple: all you need is to find a function that fits the training data best. It is also easy to plot the data and learning curves. But in reality, regression analysis is based on multiple features, so in most cases we cannot imagine the multidimensional space where the data could be plotted; we need to rely on the methods we use. You should feel comfortable with linear algebra, where matrices and vectors are used. If previously we had one feature (temperature), now we need to introduce more of them, so we need to expand the hypothesis to accept more features. From now on, instead of the output y, we are going to use the h(x) notation:

h(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

As you can see, with more variables (features), we also end up with more parameters θ that have to be learned. Before we move on, let's find relevant data that we could use for building a machine learning algorithm.

The data set

Again we are going to use a data set from College Cengage. This time we select the health data set with several variables. The data (X1, X2, X3, X4, X5) are by city: X1 = death rate per 1000 residents, X2 = doctor availability per 100,000…
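A vectorized form keeps this manageable in code. Below is a minimal NumPy sketch of the multivariate hypothesis; the matrix values and shapes are illustrative placeholders, not the actual health data.

```python
import numpy as np

def hypothesis(theta, X):
    # h(x) = theta0 + theta1*x1 + ... + thetan*xn,
    # computed for all m training examples at once.
    # X is (m, n+1) with a leading column of ones for theta0.
    return X @ theta

# Illustrative shape check with placeholder values (not the real data):
X = np.array([[1.0, 9.0, 120.0, 300.0, 10.0, 100.0]])  # 1 city: bias + 5 features
theta = np.zeros(6)                                    # parameters to be learned
print(hypothesis(theta, X))  # -> [0.]
```

With the hypothesis in this form, the same gradient descent used in the univariate case carries over unchanged.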

Continue reading

Linear regression – learning algorithm with Python

In this post, we will demystify the learning algorithm of linear regression. We will analyze the simplest univariate case with a single feature X, where in the previous example the feature was temperature and the output was cricket chirps/sec. Let's use the same cricket data to build a learning algorithm and see if it produces a similar hypothesis to the one from Excel. As you may already know from this example, we need to find the linear equation parameters θ0 and θ1 that fit a line most optimally to the given data set:

y = θ0 + θ1x

Here x is the feature (temperature), and y is the output value (chirps/sec). So how are we going to find the parameters θ0 and θ1? The whole point of the learning algorithm is to do this iteratively. We need to find optimal θ0 and θ1 parameter values so that the approximation line's error on the plotted training set is minimal. By making successive corrections to randomly initialized parameters, we can find an optimal solution. From statistics, you probably know the Least Mean Square (LMS) algorithm; it uses the gradient-based method of steepest descent.
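As a preview of where this is heading, here is a compact NumPy sketch of those iterative corrections; the learning rate and iteration count are arbitrary choices, and x_data/y_data stand in for the cricket measurements.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.001, iterations=10000):
    # Iteratively corrects theta0, theta1 so that the squared error
    # of h(x) = theta0 + theta1 * x on the training set shrinks.
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y       # h(x) - y for every point
        theta0 -= alpha * error.sum() / m     # simultaneous LMS updates
        theta1 -= alpha * (error * x).sum() / m
    return theta0, theta1

# Usage: x holds temperatures, y chirps/sec from the data set.
# theta0, theta1 = gradient_descent(np.array(x_data), np.array(y_data))
```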

Continue reading

Simplest machine learning algorithm – linear regression with Excel

Some may say that linear regression is more of a statistics problem, and this is true at some level. But when the problem is approached from a machine learning perspective, things get more accessible, especially when moving towards more complex problems. First of all, let's understand a few essential terms, starting with regression. When speaking of linear regression, we try to find a best-fitting line through the given points; in other words, we need to find an optimal linear equation to fit the given data points. This is a supervised learning problem, where we have a set of data pairs that can be plotted on the x-y axis. I understand theory is a boring thing, even for me, so let's move to practical examples and learn by solving some problems. In order to work with examples we need sample data, and there are many data sources available on the internet. For instance, a great source is College Cengage, which has several sets of data pairs meant for linear regression problems. For our example, we are going to use the Cricket Chirps vs. Temperature data, where each data point consists of chirps/sec and the temperature in degrees Fahrenheit. You can get the data in three formats: Excel, MTP and…
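For reference, the line Excel fits with its SLOPE() and INTERCEPT() functions can be reproduced with the closed-form least-squares formulas; the sketch below assumes NumPy, and x_data/y_data stand in for the chirps/temperature pairs.

```python
import numpy as np

def least_squares_line(x, y):
    # Closed-form best-fit line: the same result Excel's
    # SLOPE() and INTERCEPT() functions produce.
    x_mean, y_mean = x.mean(), y.mean()
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    intercept = y_mean - slope * x_mean
    return intercept, slope  # theta0, theta1

# Usage: x = temperatures (F), y = chirps/sec from the Cengage data.
# theta0, theta1 = least_squares_line(np.array(x_data), np.array(y_data))
```

The closed form works nicely for one feature; the later posts replace it with iterative learning, which scales to the multivariate case.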

Continue reading