A deep dive into the programming language of choice for statistics and data
With R All-in-One For Dummies, you get five mini-books in one, offering a complete and thorough resource on the R programming language and a road map for making sense of the sea of data we're all swimming in. Maybe you're pursuing a career in data science, maybe you're looking to infuse a little statistics know-how into your existing career, or maybe you're just R-curious. This book has your back. Along with providing an overview of coding in R and how to work with the language, this book delves into the types of projects and applications R programmers tend to tackle the most. You'll find coverage of statistical analysis, machine learning, and data management with R.
- Grasp the basics of the R programming language and write your first lines of code
- Understand how R programmers use code to analyze data and perform statistical analysis
- Use R to create data visualizations and machine learning programs
- Work through sample projects to hone your R coding skill

This is an excellent all-in-one resource for beginning coders who'd like to move into the data space by knowing more about R.
This book is an excellent resource for learning and excelling in R.
As someone who straddles the line between web dev, desktop programming, cybersecurity, and data analysis, I found this book a must-read for that last area.
I especially enjoyed Joseph's writing and his clear, easy-to-follow step-by-step instructions.
R is not that hard, but it isn't especially intuitive, so this book makes a great primer.
Definitely recommend it if you want to get into R.
This is actually a great book for reviewing the concepts; it covers almost all of the things I learned at graduate school. The reason I like to reread the concepts is that some of them sit inside my brain like a plastic plant that's there because it has to be. By reading them over and over again, my understanding deepens and gives them life; they begin to feel more organic inside my head. Statistics is not easy, nor is machine learning. Hopefully one day, I'll wake up feeling like a chef with ample fresh ingredients to use for cooking. Until I feel that confidence, I'll be reading and rereading :)
Favorite quotes:

1. Sometimes you're interested in part of a data frame. To isolate those columns into a data frame, use subset(). (Sketch below.)
2. read.xlsx() <- Excel files; read.csv(), read.table() <- text files. (Sketch below.)
3. gg in ggplot stands for "grammar of graphics". (Sketch below.)
4. Instead of using $, you can use with(). (Sketch below.)
5. When a histogram has fatter tails than the normal distribution, it is leptokurtic, with greater kurtosis; a platykurtic distribution has fewer extreme events than a normal distribution and negative kurtosis. (Sketch below.)
6. One type of error occurs when you believe the data show something important and you reject H0, but in reality the data are due just to chance. This is called a Type 1 error, with probability called alpha. The other type of error occurs when you don't reject H0 but the data really are due to something out of the ordinary. This is called a Type 2 error, with probability called beta.
7. A two-tailed test indicates that you're looking for a difference between the sample mean and the null-hypothesis mean, but you don't know in which direction. A one-tailed test shows that you have a pretty good idea of how the difference should come out. (Sketch below.)
8. Paired samples example: the same individual provides a score before and after a study. This is different from the assumption that choosing an individual for one sample has no bearing on the choice of an individual for the other. (Sketch below.)
9. Distributions: normal, t, chi-square, F.
10. Multiple pairwise t-tests don't work (a "thorny problem") because if each test has alpha = 0.05, the overall probability of a Type 1 error increases with the number of means. (Sketch below.)
11. When something jumps out at you that you didn't anticipate, you can make comparisons known as a posteriori tests, post hoc tests, or unplanned comparisons. (Sketch below.)
12. Epsilon represents "error" in the population. It's a catchall for "things you don't know or things you have no control over."
13. Analysis of variance and linear regression are the same thing. They're both part of what's called the general linear model (GLM). The third and final component of the general linear model is the analysis of covariance (ANCOVA). (Sketch below.)
14. Adjusted r-squared takes degrees of freedom into account. Every time you add an independent variable, you change the degrees of freedom, and r-squared is adjusted accordingly. (Sketch below.)
15. The optimal separation boundary is the one that maximizes the distance (margin) between the separation boundary and its nearest points. The lines from the two nearest points to the separation boundary are called support vectors. (Sketch below.)
16. The ratio (between sum of squares)/(within sum of squares) is a measure of how well the k-means clusters fit the data. A higher number is better. (Sketch below.)
17. Three activation functions are common. The hyperbolic tangent (tanh) takes a number and turns it into a number between -1 and 1. Sigmoid turns its input into a number between 0 and 1. Rectified linear unit (ReLU) replaces negative values with 0. By restricting the range of the output, activation functions set up a nonlinear relationship between the inputs and the outputs. Why is this important? In most real-world situations, you don't find a nice, neat linear relationship between what you try to predict and the data you use to predict it. (Sketch below.)
18. Bias is a constant that the network adds to each number coming out of the units in a layer. Bias is much like the intercept in a linear regression equation. Without the intercept, a regression line would pass through (0, 0) and might miss many of the points it's supposed to fit. (Sketch below.)
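Quotes 1 and 4 are easy to see in action. A minimal sketch using R's built-in mtcars data frame (my choice of example data, not the book's):

    # subset() pulls out rows and/or columns of a data frame;
    # the select argument isolates columns.
    small <- subset(mtcars, subset = mpg > 25, select = c(mpg, hp))
    small

    # with() evaluates an expression inside the data frame, so you can
    # write mpg and wt instead of mtcars$mpg and mtcars$wt.
    with(mtcars, mean(mpg / wt))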
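For quote 2: read.csv() and read.table() ship with base R, while read.xlsx() comes from an add-on package (openxlsx is one common choice, assumed here; the quote doesn't say which package the book uses). The file names are hypothetical placeholders:

    scores <- read.csv("scores.csv")                  # comma-separated text
    raw    <- read.table("scores.txt", header = TRUE) # whitespace-separated text
    library(openxlsx)                                 # install.packages("openxlsx")
    sheet1 <- read.xlsx("scores.xlsx", sheet = 1)     # Excel workbook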
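Quote 3's "grammar of graphics" shows up directly in ggplot2 code: a plot is data plus aesthetic mappings plus geometry layers, glued together with +. A small sketch, again on mtcars:

    library(ggplot2)                           # install.packages("ggplot2") if needed
    ggplot(mtcars, aes(x = wt, y = mpg)) +     # data + aesthetic mappings
      geom_point() +                           # a layer of points
      geom_smooth(method = "lm", se = FALSE) + # a fitted-line layer
      labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")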
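Quote 5's leptokurtic/platykurtic distinction can be checked numerically. A hand-rolled sample excess-kurtosis function (no bias correction; packages such as moments offer polished versions):

    excess_kurtosis <- function(x) {
      z <- x - mean(x)
      mean(z^4) / mean(z^2)^2 - 3    # 0 for a normal distribution
    }
    set.seed(1)
    excess_kurtosis(rnorm(1e5))      # ~0: normal
    excess_kurtosis(rt(1e5, df = 5)) # > 0: fat tails, leptokurtic
    excess_kurtosis(runif(1e5))      # < 0: thin tails, platykurtic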
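Quotes 7 and 8 map onto t.test()'s alternative and paired arguments. Simulated before/after scores (invented numbers, purely illustrative):

    set.seed(42)
    before <- rnorm(20, mean = 100, sd = 15)
    after  <- before + rnorm(20, mean = 5, sd = 10)

    t.test(before, mu = 95, alternative = "two.sided") # difference, either direction
    t.test(before, mu = 95, alternative = "greater")   # you predict the direction
    t.test(after, before, paired = TRUE)               # same individuals, before and after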
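Quote 10's thorny problem is simple arithmetic, and quote 11's unplanned comparisons have a standard R workflow: fit an ANOVA, then run a post hoc test such as Tukey's HSD. A sketch on the built-in PlantGrowth data:

    m <- choose(4, 2)   # 6 pairwise t-tests among 4 means
    1 - (1 - 0.05)^m    # ~0.26 chance of at least one Type 1 error

    fit <- aov(weight ~ group, data = PlantGrowth)
    summary(fit)        # overall F test first
    TukeyHSD(fit)       # post hoc pairwise comparisons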
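Quote 13 can be verified in a few lines: aov() and lm() on the same formula report the same F ratio, because both are the general linear model underneath:

    fit_lm  <- lm(weight ~ group, data = PlantGrowth)
    fit_aov <- aov(weight ~ group, data = PlantGrowth)
    anova(fit_lm)       # regression framing
    summary(fit_aov)    # ANOVA framing; identical F and p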
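Quote 14 in practice: plain R-squared never goes down when you add predictors, while the adjusted version charges you for the spent degrees of freedom:

    fit1 <- lm(mpg ~ wt, data = mtcars)
    fit2 <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
    summary(fit1)$r.squared; summary(fit1)$adj.r.squared
    summary(fit2)$r.squared; summary(fit2)$adj.r.squared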
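Quote 15's maximum-margin boundary is what a linear-kernel SVM fits. A sketch using the e1071 package (my choice of package) and the built-in iris data:

    library(e1071)   # install.packages("e1071")
    sv_fit <- svm(Species ~ ., data = iris, kernel = "linear")
    sv_fit$index                          # rows that serve as support vectors
    table(predict(sv_fit), iris$Species)  # confusion matrix on the training data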
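Quote 16's fit measure comes directly from kmeans() output:

    set.seed(3)
    km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
    km$betweenss / km$tot.withinss   # higher = tighter, better-separated clusters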
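Quote 17's three activation functions take one line each in R (tanh is built in; the other two are easy to define):

    sigmoid <- function(x) 1 / (1 + exp(-x))  # squashes into (0, 1)
    relu    <- function(x) pmax(0, x)         # negatives become 0
    x <- seq(-3, 3, by = 1.5)
    rbind(x, tanh = tanh(x), sigmoid = sigmoid(x), relu = relu(x))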
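And quote 18's analogy is easy to demonstrate: dropping the intercept from a regression formula forces the line through (0, 0), just as a network layer without bias terms would be constrained:

    with_intercept <- lm(mpg ~ wt, data = mtcars)
    no_intercept   <- lm(mpg ~ wt - 1, data = mtcars)  # "- 1" removes the intercept
    coef(with_intercept)   # intercept plus slope
    coef(no_intercept)     # slope only; line passes through the origin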