Machine Learning

Homework Assignment 4
600.335/435 Artificial Intelligence
Machine Learning
In the past week, we have covered a variety of machine learning techniques for classification tasks. This assignment will focus on implementing three of these techniques. In order to assess the effectiveness of these methods, you will be training and testing your code on real datasets. We will provide some code scaffolding that will show you what methods our testing scripts will expect you to implement and some methods that can read training and testing datasets and store the information for you (see download link on website).
Datasets
The provided datasets all come from the UCI machine learning repository under classification datasets (link). We have selected three datasets that we believe will give a diversity of results from our selected methods.
Congressional Voting Records Dataset
This dataset (link) classifies congressmen into Democrats and Republicans based on their voting record. This is a somewhat easy problem to solve, and you should achieve fairly high accuracy with even simple methods of classification. This is a good dataset to use for debugging purposes.
MONKS Problems Dataset
This dataset (link) has arbitrary attributes and uses 0 and 1 as labels. This dataset was generated in order to be a difficult problem to solve, and was used for a learning algorithm competition. There are three problems included in this dataset; please be sure to test your algorithms on all three! We don’t expect great performance on this dataset because by its nature it is a difficult problem to solve. Note that some attributes here have different ranges of values than others.
Iris Dataset
This dataset (link) classifies flowers into different classes of iris according to the size of different features. This is a classic classification problem because it is a mix of linearly separable and inseparable classes.
Methods
This section outlines the methods you will need to implement and test against the provided datasets. Please be sure to not only report your accuracy for each dataset, but also precision and recall.
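As a rough illustration of the three metrics, here is a minimal sketch that assumes one class is designated as "positive". The function name `classification_metrics` and the zero-denominator convention are assumptions for illustration, not part of the provided scaffolding. For a multi-class dataset such as Iris, one option is to call it once per class (one-vs-rest).

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```

Precision answers "of the examples predicted positive, how many were right?", while recall answers "of the truly positive examples, how many were found?".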
Decision Tree
For this method you will implement a decision tree that splits on the attribute that offers the maximum information gain, as outlined in section 18.3 of the textbook. Please be sure to add an option for using pure information gain or an information gain ratio to split. The information gain ratio should be information gain divided by the range of values provided by the examined attribute. Please report the accuracy, precision, and recall for each dataset using IG and IG ratio. Also discuss the differences in these metrics between using pure information gain and an information gain ratio.
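The split criterion might be sketched as follows. This is only an illustration: the function names are hypothetical, rows are assumed to be tuples of categorical values, and "range of values" is read here as the number of distinct values the attribute takes in the current examples. Note that this assignment-specific ratio differs from the split-information-based gain ratio used in C4.5.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the attribute at index `attr`."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """Information gain divided by the number of distinct attribute values.

    Assumption: 'range of values' means the count of distinct values seen.
    """
    n_values = len({row[attr] for row in rows})
    return information_gain(rows, labels, attr) / n_values


def best_attribute(rows, labels, attrs, use_ratio=False):
    """Pick the attribute index with the highest score under the chosen criterion."""
    score = gain_ratio if use_ratio else information_gain
    return max(attrs, key=lambda a: score(rows, labels, a))
```

Exposing the criterion as a single `use_ratio` flag keeps the tree-building code identical for both runs, which makes the requested comparison easier to report.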
Grad Students Only: In addition to the above, please implement pruning as outlined in section 18.3.5 of the textbook. Please also report the differences in the above metrics for your datasets between using pruning and not using pruning.
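One way to approach pruning is a chi-squared significance test on each split, sketched below for two-class data. This assumes that is the pruning approach the textbook section describes; check the section before relying on it. The function names and the `(positives, negatives)` input format are hypothetical, and the critical value must be supplied by the caller (for example, 3.841 for one degree of freedom at the 5% level).

```python
def chi_squared_deviation(child_counts):
    """Total deviation of a split from the 'attribute is irrelevant' hypothesis.

    child_counts: list of (positives, negatives), one pair per child node.
    Assumes the split contains at least one example.
    """
    p = sum(pk for pk, _ in child_counts)
    n = sum(nk for _, nk in child_counts)
    delta = 0.0
    for pk, nk in child_counts:
        size = pk + nk
        # Counts expected in this child if the attribute carried no information.
        expected_p = p * size / (p + n)
        expected_n = n * size / (p + n)
        if expected_p > 0:
            delta += (pk - expected_p) ** 2 / expected_p
        if expected_n > 0:
            delta += (nk - expected_n) ** 2 / expected_n
    return delta


def should_prune(child_counts, critical_value):
    """Prune the split when its deviation is not statistically significant."""
    return chi_squared_deviation(child_counts) < critical_value
```

The degrees of freedom for the test are the number of child nodes minus one, so attributes with more values need a larger critical value.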
Naive Bayes
For this method you will implement a naive Bayes classifier as outlined in section 20.2 of the textbook. Please be sure to report the accuracy, precision, and recall of the method on all datasets.
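A minimal sketch of a categorical naive Bayes classifier follows. The class name, the `train`/`predict` method names, and the use of Laplace smoothing (`alpha`) are assumptions for illustration; the scaffolding may expect different method headers. Continuous attributes such as those in Iris would need discretization or a different likelihood model, which this sketch does not cover.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self, alpha=1.0):
        # Laplace smoothing strength (an assumption; keeps unseen values from zeroing a class).
        self.alpha = alpha

    def train(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(label, attr_index)][value] -> number of training rows
        self.value_counts = defaultdict(Counter)
        # values[attr_index] -> set of distinct values seen for that attribute
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for i, v in enumerate(row):
                self.value_counts[(label, i)][v] += 1
                self.values[i].add(v)

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Work in log space to avoid underflow from multiplying many small numbers.
            score = math.log(count / self.total)
            for i, v in enumerate(row):
                num = self.value_counts[(label, i)][v] + self.alpha
                den = count + self.alpha * len(self.values[i])
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With `alpha=0` an attribute value never seen with a class would produce `log(0)`, so the sketch assumes a positive `alpha`.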
Neural Network
For this method you will implement a neural network as outlined in section 18.7 of the textbook. We will give you a bit of flexibility in how you implement your neural network, but please be sure to use at least one hidden layer (otherwise this approach is trivial to implement). Also implement an alternate weight initialization scheme as shown in the class lecture on neural networks. Report the accuracy, precision, and recall of this method on all datasets with and without this alternate weight initialization and discuss the differences in these metrics between using the default weight initialization and one of your own.
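To make the structure concrete, here is a minimal sketch of a network with one hidden layer, sigmoid units, and a single back-propagation step on squared error. Everything named here (`TinyNetwork`, `init_weights`, the `"uniform"` and `"scaled"` schemes) is hypothetical. In particular, the fan-in-scaled initialization is only one possible alternative and is not necessarily the scheme shown in the class lecture. It uses plain Python lists so it runs without extra libraries; a NumPy version would follow the same steps.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_weights(n_in, n_out, scheme, rng):
    """Return an n_out x (n_in + 1) weight matrix; the last column is the bias."""
    if scheme == "uniform":
        limit = 0.5  # fixed small range
    else:
        # "scaled": range shrinks with the number of inputs (an assumed alternative)
        limit = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-limit, limit) for _ in range(n_in + 1)] for _ in range(n_out)]


class TinyNetwork:
    def __init__(self, n_in, n_hidden, n_out, scheme="uniform", seed=0):
        rng = random.Random(seed)
        self.w_hidden = init_weights(n_in, n_hidden, scheme, rng)
        self.w_out = init_weights(n_hidden, n_out, scheme, rng)

    @staticmethod
    def _layer(weights, inputs):
        extended = list(inputs) + [1.0]  # append the constant bias input
        return [sigmoid(sum(w * x for w, x in zip(row, extended))) for row in weights]

    def forward(self, x):
        hidden = self._layer(self.w_hidden, x)
        return hidden, self._layer(self.w_out, hidden)

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        # Output deltas for squared error with sigmoid units.
        d_out = [(t - o) * o * (1 - o) for t, o in zip(target, out)]
        # Hidden deltas: propagate output deltas back through w_out (bias column unused).
        d_hidden = [
            h * (1 - h) * sum(d_out[k] * self.w_out[k][j] for k in range(len(d_out)))
            for j, h in enumerate(hidden)
        ]
        for k, row in enumerate(self.w_out):
            for j, a in enumerate(list(hidden) + [1.0]):
                row[j] += lr * d_out[k] * a
        for j, row in enumerate(self.w_hidden):
            for i, a in enumerate(list(x) + [1.0]):
                row[i] += lr * d_hidden[j] * a
```

Passing the initialization scheme as a constructor argument lets the same training loop be run twice, once per scheme, for the requested comparison.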

Grad Students Only: Also implement a momentum term as described in the class lecture on neural networks. Please also report the above metrics with this momentum term included and discuss how these values change when this term is introduced.
Once you have implemented all three methods and collected data for each, discuss how the performance of the algorithms compares on each dataset. Which method would you consider the "best" for each dataset? Which metric(s) do you use to determine this, and how does the nature of the dataset itself affect your decision?
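For the momentum term in the grad-student portion of the neural network method, a common formulation keeps a running "velocity" per weight. The sketch below uses that standard form, which may differ in detail from the one given in the class lecture. The function name, the flat-list weight representation, and the default `lr` and `mu` values are assumptions for illustration.

```python
def momentum_update(weights, gradients, velocity, lr=0.1, mu=0.9):
    """Update `weights` in place using gradient descent with momentum.

    gradients: dLoss/dWeight for each weight (same length as weights).
    velocity:  running update per weight, carried between calls.
    """
    for i, g in enumerate(gradients):
        # Blend the previous update with the current downhill step.
        velocity[i] = mu * velocity[i] - lr * g
        weights[i] += velocity[i]
    return weights, velocity
```

Setting `mu=0` recovers plain gradient descent, which gives a convenient baseline for the comparison the assignment asks for.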
Getting Started
A good first step would be to examine the provided code scaffolding and start working with NumPy (link). Create a node class and examine what data you would need to store for each method. If you don’t have much experience programming data structures in Python or just need help understanding how these methods work,
- README that includes a brief description of each class, the breakdown of work between partners, and an explanation of any lingering errors. If there are any known bugs at the time of submission, please be sure to include these and a breakdown of debugging steps taken to ensure we can give you as much partial credit as possible (major bugs that aren’t explained will make you lose credit for that particular method).
- All code needed to call the functions provided by our scaffolding to train and test your methods on the provided datasets. You are allowed to modify the code we provide (as long as the method headers are unchanged), but in that case please be sure to include it!
- A PDF or other text file (PDF preferred) that includes all answers to discussion prompts in the assignment. These are found in the Methods section.