In this dimension I am seeing this kind of overlap. When I use the other dimension, I am also seeing an overlap. But when I put them together in that mathematical space, the two Gaussians become linearly separable. Because of that, linear separability is now possible among the classes. Look at this dimension, this one.
I don't know which one this is. This one is not going to help you separate the orange from the blue. It doesn't matter, as long as I have this other dimension with me which separates the orange from the blue. One dimension will compensate for the other, so overall these weak predictors, put together, become strong predictors, as you will notice.
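To make the "weak predictors become strong predictors" point concrete, here is a minimal sketch on synthetic data. The data, the class means, and the helper name `training_accuracy` are all assumptions made for illustration; they are not the data set from the lecture. Two classes overlap on each dimension taken alone, and the combined model should do noticeably better than either single dimension, although in this toy case the classes are still not perfectly separable.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data: two Gaussian classes ("orange" = 0, "blue" = 1)
# whose means differ by 2 standard deviations on each of two dimensions.
rng = np.random.default_rng(0)
n = 2000
X_orange = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
X_blue = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n, 2))
X = np.vstack([X_orange, X_blue])
y = np.array([0] * n + [1] * n)


def training_accuracy(columns):
    """Fit Gaussian Naive Bayes on the chosen columns and score it on the same data."""
    model = GaussianNB().fit(X[:, columns], y)
    return model.score(X[:, columns], y)


acc_dim0 = training_accuracy([0])      # first dimension alone: classes overlap
acc_dim1 = training_accuracy([1])      # second dimension alone: classes overlap
acc_both = training_accuracy([0, 1])   # both dimensions together

print(f"dimension 0 only: {acc_dim0:.3f}")
print(f"dimension 1 only: {acc_dim1:.3f}")
print(f"both dimensions:  {acc_both:.3f}")
```

Each dimension on its own is a weak predictor here; the gain comes from combining them, which is the effect described above.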
Going down the line, let's look at the data types. We don't have any non-numerical columns here; everything is numerical. Keep in mind that these algorithms need numerical columns only; a numerical column could hold categorical codes or real numbers. Coming down, I am separating the X variables, the independent variables; the target is the cultivar. I am splitting the data into a training set and a test set. Are all of you with me on this? Okay. You can download this data set from UCI if you wish, and run the code. Very good.
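The steps just described (check the data types, separate X from the target, split into training and test sets) can be sketched as follows. This is an illustration under assumptions: it uses the copy of the UCI wine data bundled with scikit-learn via `load_wine` rather than a downloaded file, and the `test_size=0.3` and `random_state=1` values are guesses, not the lecture's actual settings.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: the bundled scikit-learn copy of the UCI wine data stands in
# for the file used in the lecture.
wine = load_wine(as_frame=True)
X = wine.data      # independent variables (13 numeric columns)
y = wine.target    # target: the cultivar, encoded as 0, 1, 2

# Look at the data types: every column should be numerical.
print(X.dtypes)
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print("non-numerical columns:", non_numeric)

# Assumed split: 30% held out for testing, fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

A 30% split of the 178 rows gives 54 test records, which happens to match the 23 + 19 + 12 test records quoted later in the lecture, but the exact rows selected will differ.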
I instantiate the Gaussian Naive Bayes here. There are different variants of Naive Bayes. This Naive Bayes assumes that the distributions are Gaussian on each one of the dimensions, and we meet that requirement very well: on all the dimensions the classes are distributed in almost Gaussian ways. Gaussian means it looks like a normal distribution.
So this is a perfect case for Gaussian Naive Bayes, and I am calling Gaussian Naive Bayes. There are other variants, like multinomial Naive Bayes and other Naive Bayes, which you can make use of. I am instantiating the model here. I am going to do the model fit here; this is where all these Naive Bayes likelihood ratios will be calculated. Then I test on the training set itself; it's giving me 97% accuracy. Why am I doing this? I'll tell you in a minute. I can also run this model on the test data.
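A minimal sketch of the instantiate, fit, and score-on-training-data sequence described here. The split parameters and the use of scikit-learn's bundled `load_wine` data are assumptions, so the printed accuracy will not necessarily equal the 97% quoted above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed data source and split settings (not taken from the lecture).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Instantiate the Gaussian variant of Naive Bayes and fit it.
# Fitting estimates a per-class mean and variance for every column.
model = GaussianNB()
model.fit(X_train, y_train)

# Accuracy on the same data the model was trained on.
train_acc = model.score(X_train, y_train)
# Accuracy on the held-out test data.
test_acc = model.score(X_test, y_test)

print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```

Scoring on the training set and then on the test set lets you compare the two numbers; a large gap between them would suggest overfitting.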
Here I am running the model on the test data. In metrics, there is a function called classification report which gives me all these metrics, recall, precision, everything, in one shot. If you are interested, you can separately print the confusion matrix also. Now look at this confusion matrix; look at the class-level recall. Any problems? [inaudible] Any problems? From sklearn, import Gaussian... okay, all right. Now look at this: there are three classes here, one, two, three. Look at the class-level recall. What is class-level recall? It is accuracy at the class level, and it's almost 100%. One means a hundred percent, so here it is a hundred percent, ninety-five percent, a hundred percent.
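The test-set evaluation described here can be sketched with `classification_report` and `confusion_matrix` from `sklearn.metrics`. As before, the bundled `load_wine` data and the split settings are assumptions, so the numbers printed may differ from those read out in the lecture.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed data source and split settings (not taken from the lecture).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, f1-score and support for every class, in one shot.
report = classification_report(y_test, y_pred)
print(report)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Class-level recall: correctly classified records (the diagonal)
# divided by the actual number of records in each class (the row sums).
class_recall = cm.diagonal() / cm.sum(axis=1)
print(class_recall)
```

The recall column of the report and `class_recall` are the same quantity, computed two ways.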
Each one of the classes it is able to classify accurately, with a high degree of accuracy. This is what we wanted in Pima diabetes; it's giving it to us on the wine data set. Okay, I'll talk about this micro average and all these things down the line; we'll talk about this later. Right now just look at this recall metric; I have not told you what precision is. And look at the confusion matrix: the actual number of class A records in your test data is 23, and all of them have been classified as class A.
The actual number of records for class B in your test data is 19; it has correctly classified 18. For the third class the actual number is 12; it has correctly classified 12. You can achieve this kind of score in this model even though the classes are overlapping on all the dimensions, because some dimensions come together to make strong predictors. So you should understand the difference between Pima diabetes and this. In both cases the classes are overlapping, but there is one very important, critical difference, and it's that difference which is making this come out right.
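As a quick check of the arithmetic, the class-level recall can be recomputed from the counts read out above. The counts come from the lecture; the dictionary name `quoted_counts` and the label "C" for the third class are assumptions for illustration.

```python
# Counts quoted in the lecture: (actual records in test data, correctly classified).
# The label "C" for the third class is an assumption; the lecture names only A and B.
quoted_counts = {"A": (23, 23), "B": (19, 18), "C": (12, 12)}

# Class-level recall = correctly classified / actual records of that class.
recall = {
    label: correct / actual for label, (actual, correct) in quoted_counts.items()
}

total_actual = sum(actual for actual, _ in quoted_counts.values())
total_correct = sum(correct for _, correct in quoted_counts.values())
overall_accuracy = total_correct / total_actual

for label, value in recall.items():
    print(f"class {label}: recall = {value:.3f}")
print(f"overall accuracy = {total_correct}/{total_actual} = {overall_accuracy:.3f}")
```

Class B's 18 out of 19 is where the "ninety-five percent" figure comes from; the other two classes are at a hundred percent.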
You can run the same Naive Bayes algorithm on Pima diabetes and see what score you get there. Your job here is to select the right attributes. You will never get data on a plate when you are doing real-life projects. This is one big classroom, but when you sit down and do real-life projects, your customer says, to take a simple example: "Can you use data science to improve my customer CSAT ratings from the current 3.5 to 4.5 out of 5? Can you do something using data science to help me improve my customer rating in technical support?" These are the kind of requirements which will come to you. Data will never come to you.
Now your job is to see, for CSAT rating, the customer rating for technical support, what kind of data you need. You have to first guess what kind of data you need. Your domain expertise will help you; if you don't have it, you need to have domain experts. So for CSAT rating in customer tech support, what kind of data do you need? Can somebody give an answer? Go back to the past tickets. Okay, go back to the past tickets. What kind of data do you need from the past tickets? From the past tickets you collect what kind of ticket it was: P1, P2, P3 tickets.
I need to find the ticket classes, then whether the ticket belongs to hardware, software, or something else. I need those things and various others; you will decide what kind of data you need. The next challenge will be: where will you get the data from? Some data will be available within the organization, some data will be available outside the organization, and some data will be available with the customer. Getting these stakeholders to give the data to us will be such a challenge, so all your soft skills will come into play, right?
So once the data comes in, you have to first establish the reliability of the data. What if the tech support department has given you data only where the customer is very happy, and they have not shared the dirty data with you? Your model will go for a toss. Those are the challenges which you will face as a data scientist. Once the data comes to you, on the attributes that you have, you will do this analysis using pair plots and other techniques: is this column good or is that column good? Which columns should I use to predict customer satisfaction?
I come from the IT world, just like many of you do. We have done a lot of projects in Java and C and all these things. There, the project starts with customer requirements, which turn into technical requirements, which turn into design requirements, which turn into code specs, coding requirements, which turn into unit testing, integration testing, and finally submission. Then we go for acceptance testing. In acceptance testing, usually it bombs, right?
The reason why it bombs is that we look at data only in the last stage. Only when we come to acceptance testing do we ask the customer to give us some data. In data science projects, your project will start with data and the project will end with data. Data is the core; you will be revolving around the data. They say 80% of your estimated effort in a data science project, and you will see this when you do the capstone project, will go into getting the data.