Feature extraction from retina vascular images for classification

Classifying medical images is a tedious and complex task, and machine learning algorithms can be a great help. Making those algorithms work reliably on image data poses several challenges. First, you need a fairly large image database with ground truth information (data labeled by experts with diagnosis information). Second, the images must be preprocessed, which includes merging modalities, unifying color maps, normalizing, and filtering. This part is essential and may affect the last part: feature extraction. That step is crucial because how well machine learning algorithms work depends on how well you can extract informative features.

Dataset

To demonstrate the classification procedure for medical images, the ophthalmology STARE (Structured Analysis of the Retina) image database was pulled from https://cecas.clemson.edu/~ahoover/stare/. The database consists of 400 images with 13 diagnostic cases, along with preprocessed images. For the classification problem, we have chosen only the vessel images and only two classes: Normal and Choroidal Neovascularization (CNV). This reduced the number of images to 99, of which 25 were used as test data and 74 as training data.


Running remote host Weka experiments

Previously, we tried running a Weka server to utilize all processor cores in classification tasks. It turns out, however, that the Weka server works only with the Explorer, for classification routines. For more advanced machine learning there is a more flexible tool, the Experimenter, and the Weka server doesn’t support it. So what can you do if you want more performance, or want to use the multi-core processor of your local machine? There is a way out, but it is quite tricky. Weka can perform remote experiments, which spread the load across multiple host machines that have Weka set up. You can read the documentation of remote experiments here, but it may be somewhat confusing; it took me some time to figure out parts of it by trial and error. The trickiest part is setting everything up and preparing the commands that must be run before performing a remote experiment. So let’s get to it.


Utilizing multi-core processor for classification in WEKA

Currently, WEKA is one of the most popular machine learning tools. Without programming skills, you can do serious classification, clustering, and extensive data analysis. For some time, I’ve been using its standard GUI features without thinking much about performance bottlenecks. But as research becomes more complex, relying on ensemble, voting, and other meta-algorithms that are generally based on multiple classifiers running simultaneously, the performance issues become annoying. You may need to wait for hours until a task is completed. The problem is that classification algorithms run from the WEKA GUI utilize a single core of your processor. An algorithm such as Multi-layer Perceptron running 10-fold cross-validation calculates one fold at a time on one core, which takes a long time to finish. So I started looking for options to make it use all cores of the processor, with a separate thread for each fold. There are a couple of options available. One is to use the WekaServer package, and another is remote host processing. This time we will focus on the WekaServer solution. The idea is to start a WEKA server as a distributed execution environment. When starting the server, you can indicate how many cores you…


4 Giant Industries – Where Data Science is Flourishing Well

In a fast-paced world where data is the primary language between processes, people who know how to read it are very much in demand. These people are called data scientists, and theirs is one of the fastest-rising professions in the world today. This is because their skill set can be used in many fields, ranging from retail and business to government organizations such as commissions and departments. The reason data scientists are in such high demand lies in the very nature of what they do. As mentioned above, data scientists are basically translators between people and computers. With the current state of technology, it is only logical for this field to rise to the top. Data is the product of studies and research, and nowadays studies are conducted not just by academics but also by business owners and people in other fields. With data-gathering technology continuously growing, more data is now up for the taking. Aside from the sheer volume of data, the data and their implications also vary, which is why data scientists are thriving in the following industries.


The Rise of the Machines: How Will AI Integrate With Our Work Lives?

Artificial intelligence is one of the biggest buzzwords in science right now. It’s hitting the headlines frequently, due to the massive leaps in progress being made. The story that grabbed the most attention was DeepMind’s AlphaGo defeating a three-time European Go champion in emphatic fashion, and then doing the same to the number 1 ranked player in the world. To put this into context: Go is one of the most difficult games in the world, and is significantly harder to master than chess. It has a vast range of decision trees and possible outcomes, which makes it extremely difficult to predict, so players have to think on their feet and strategise as much as possible. This has people excited because if AI utilizing neural networks can master one of the most complicated games on the planet, its application within the world of work could be massive.


Implementing logistic regression learner with python

Logistic regression is the next step after linear regression. Most real-life data have non-linear relationships, so applying linear models might be ineffective. Logistic regression is capable of handling non-linear effects in prediction tasks. You can think of many different scenarios where logistic regression could be applied. Financial, demographic, health, weather, and other data can all be modeled this way and used to predict subsequent events on future data. For instance, you can classify emails as spam or non-spam, transactions as fraudulent or legitimate, and tumors as malignant or benign. To understand logistic regression, let’s cover some basics, do a simple classification on a data set with two features, and then test it on real-life data with multiple features.
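To give a feel for the two-feature case, here is a minimal sketch of a logistic regression learner in plain Python. It is not the code from the full post: the function names (`sigmoid`, `predict_proba`, `fit`), the learning rate, the iteration count, and the tiny data set are all assumptions made up for illustration. The model squashes a linear combination of the features through the sigmoid function to get a probability, and the parameters are learned with batch gradient descent.

```python
import math


def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Probability of class 1 for one example x (theta[0] is the bias)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def fit(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    theta = [0.0] * (len(X[0]) + 1)
    m = len(X)
    for _ in range(iterations):
        # Errors are computed once per iteration, so all parameters
        # are updated from the same predictions.
        errors = [predict_proba(theta, row) - label for row, label in zip(X, y)]
        theta[0] -= alpha * sum(errors) / m
        for j in range(len(X[0])):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / m
            theta[j + 1] -= alpha * grad
    return theta


# Hypothetical, hand-made data: two features per example, two classes.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
y = [0, 0, 0, 1, 1, 1]

theta = fit(X, y)
for row, label in zip(X, y):
    print(row, label, round(predict_proba(theta, row), 3))
```

A prediction above 0.5 is read as class 1 and below 0.5 as class 0. On real data you would also hold out a test set and scale the features, which this sketch skips.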


Building and evaluating Naive Bayes classifier with WEKA

This is a follow-up to the previous post, where we calculated Naive Bayes predictions on a given data set. This time I want to demonstrate how all this can be implemented using the WEKA application. I highly recommend visiting their website and getting the latest release. WEKA is widely used and highly regarded machine learning software written in Java that offers a range of powerful data mining and modeling tools. It provides a user-friendly interface, making it accessible to both experienced and novice users. WEKA offers a wide range of algorithms and data pre-processing techniques, making it a flexible and robust tool for various machine learning applications, such as classification, clustering, and association rule mining. You can find plenty of tutorials on YouTube on how to get started with WEKA, so I won’t go into the details. I’m sure you’ll be able to follow anyway.


Simple explanation of Naive Bayes classifier

You have probably heard of the Naive Bayes classifier, and have likely used it in a GUI-based tool such as the WEKA package. It is often the first algorithm tried to get initial classification results. Sometimes, surprisingly, it outperforms other models in speed, accuracy, and simplicity. Let’s see what this algorithm looks like and what it does. As you may know, the algorithm is based on Bayes’ theorem of probability, which allows it to predict the class of unknown data. I hope you are comfortable with probability math, or at least some of the basics.
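As a rough illustration of the idea, the sketch below scores each class as the prior P(class) multiplied by the likelihood P(value | class) of every feature value, treating the features as independent, and then normalizes the scores so they sum to one. Everything here is an assumption made for the example: the function names `train` and `predict`, the add-one (Laplace) smoothing, and the tiny weather-style data set are not taken from the full post.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes and per-class feature values from categorical data."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    # Distinct values seen for each feature, needed for smoothing.
    vocab = [set(row[i] for row in rows) for i in range(len(rows[0]))]
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return normalized posterior probabilities for each class."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Add-one smoothed likelihood P(value | class); the smoothing
            # is an assumption that avoids zero probabilities.
            seen = value_counts[(i, label)][value]
            score *= (seen + 1) / (count + len(vocab[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical toy data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train(rows, labels)
probs = predict(model, ("sunny", "hot"))
print(probs)
```

The predicted class is simply the label with the highest probability, for example `max(probs, key=probs.get)`.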


Linear regression with multiple features

linear regression cost function

Single-feature linear regression is simple. All you need is to find the function that best fits the training data. It is also easy to plot the data and learning curves. But in reality, regression analysis is based on multiple features, so in most cases we cannot visualize the multidimensional space where the data could be plotted and have to rely on the methods we use. You must feel comfortable with linear algebra, where matrices and vectors are used. If previously we had one feature (temperature), now we need to introduce more of them, so we need to expand the hypothesis to accept more features. From now on, instead of output y, we are going to use the h(x) notation: h(x) = θ0 + θ1 x1 + θ2 x2 + … + θn xn. As you can see, with more variables (features), we also end up with more parameters θ that have to be learned. Before we move on, let’s find relevant data that we could use for building a machine learning algorithm.

The data set

Again we are going to use a data set from college cengage. This time we select a health data set with several variables. The data (X1, X2, X3, X4, X5) are by city.

X1 = death rate per 1000 residents
X2 = doctor availability per 100,000…
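To make the multi-feature hypothesis more concrete, here is a small sketch in plain Python. It assumes the usual linear form h(x) = θ0 + θ1 x1 + … + θn xn and a half mean squared error cost, which is a common choice but may differ from what the full post uses. The function names and the numbers are made up for illustration and have nothing to do with the health data set.

```python
def hypothesis(theta, features):
    """h(x) = theta0 + theta1*x1 + ... + thetan*xn for one example."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


def cost(theta, X, y):
    """Half of the mean squared error over all training examples."""
    m = len(X)
    squared_errors = (
        (hypothesis(theta, row) - target) ** 2 for row, target in zip(X, y)
    )
    return sum(squared_errors) / (2 * m)


# Hypothetical data with two features, generated from y = 1 + 2*x1 - x2.
X = [[1, 2], [2, 0], [3, 1], [0, 3]]
y = [1, 5, 6, -2]

print(cost([0, 0, 0], X, y))   # cost of a poor first guess
print(cost([1, 2, -1], X, y))  # cost of the parameters that generated the data
```

A learning algorithm such as gradient descent would search for the θ values that make this cost as small as possible.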


Linear regression – learning algorithm with Python

In this post, we will demystify the learning algorithm of linear regression. We will analyze the simplest univariate case with a single feature X, which in the previous example was temperature, with cricket chirps/sec as the output. Let’s use the same cricket data to build a learning algorithm and see if it produces a hypothesis similar to the one from Excel. As you may already know from this example, we need to find the linear equation parameters θ0 and θ1 that fit the line optimally to the given data set: y = θ0 + θ1 x. Here x is the feature (temperature) and y is the output value (chirps/sec). So how are we going to find the parameters θ0 and θ1? The whole point of the learning algorithm is to do this iteratively. We need to find the θ0 and θ1 values for which the error between the approximation line and the plotted training set is minimal. By making successive corrections to randomly selected initial parameters, we can find an optimal solution. From statistics, you probably know the Least Mean Square (LMS) algorithm. It uses the gradient-based method of steepest descent.
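The iterative idea can be sketched in a few lines of plain Python. This is a batch gradient descent on the mean squared error, which follows the same steepest-descent principle, and is not necessarily the exact procedure from the full post. The function name `fit_line`, the learning rate `alpha`, the iteration count, and the data are assumptions; the data is synthetic (generated from y = 2 + 3x) and is not the cricket data set.

```python
import random


def fit_line(xs, ys, alpha=0.05, iterations=2000, seed=0):
    """Find theta0 and theta1 for y = theta0 + theta1 * x by gradient descent."""
    rng = random.Random(seed)
    # Start from randomly selected parameters.
    theta0, theta1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Gradients of the half mean squared error.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Successive correction: step against the gradient.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical noise-free data generated from y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 + 3 * x for x in xs]

theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)
```

If the learning rate is too large the updates diverge, and if it is too small the loop needs many more iterations, so `alpha` usually has to be tuned for the data at hand.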
