What must someone know in statistics and machine learning?
1












There seem to be two different worlds in statistics. On one hand, there are the practitioners who run the same tests again and again. On the other hand, there is this overwhelming and seemingly endless world of statistics and machine learning where one easily gets lost in specific questions, just like here on Cross Validated.

So my question is: What do you consider a person must know about statistics and machine learning? I know there will be comments that it depends on the area where you work. But still, there are things all statisticians (should) know, like multicollinearity, power analysis or linear regression. I really would love to have a profound foundation in statistics, but for me it is hard to tell where to go next. So if statistics and machine learning were a craft occupation, what knowledge and what tests / methods would you put in your toolbox?

Why I think this question should be open

I can imagine that experienced statisticians are going to hate this question since it seems pretty broad and therefore naive. But at the same time I can imagine that there are many people like me who are wondering which topics are basic and worth studying in depth.

I was already afraid this question was going to be closed, and that is why I anticipated the criticism in my question. I do understand the argument for putting this question on hold. On the other hand: where should I post this question if not on the best Q&A website for statistics? I am being serious about "the best" here. The argument that my question requires non-objective answers and thus doesn't belong on Cross Validated seems valid, but why are there posts like: What is your favorite "data analysis" cartoon? That is a pretty highly rated question, so you probably didn't simply miss it. But that question is perfectly subjective and I see no statistical insight in the answers at all. On the other hand, the answers to my question can give many people who are at the beginning of their career a feeling for what a statistician needs to know. The two answers so far are pretty helpful to me and I was looking forward to reading more, and thus I hope that this question gets reopened.








  • 5
    I am voting to reopen this question and convert it to a wiki.
    – Ferdi
    22 hours ago

  • 3
    @igoR87 if you want to open a discussion about the closure of this question, perhaps the CV Meta site is the better place?
    – mdewey
    21 hours ago

  • 3
    You are referring to the 6th question, asked when stats.stackexchange was still in its infancy. The standards have changed a lot over time. Stack Exchange is a Q&A website, not a discussion website. This means that questions should be clear enough to be able to see how and why a certain answer is acceptable. Sure, there may be more and less useful answers, for instance, because of differences in elegance or detail. In those aspects, answers may be rated in a subjective way. But that does not mean that a question can be so broad that it will be unclear whether an answer is correct or not.
    – Martijn Weterings
    16 hours ago

  • 3
    @Ferdi Community Wiki is not a solution to off-topic questions. We've used it that way in the past, but it's discouraged. meta.stackexchange.com/questions/258006/…
    – Sycorax
    16 hours ago

  • 3
    @igoR87 You can review what is and is not on-topic in the help center. Part of what makes stats.SE a good website is that it's oriented as a Q&A, not a freewheeling discussion. If you like open-ended discussions, maybe reddit is more for you. Importantly, just because a question is about statistics does not mean that it is well-suited for stats.SE -- we don't host all statistics questions, just the ones that are suitable according to the help center.
    – Sycorax
    16 hours ago




















self-study careers














edited 15 hours ago


























community wiki





10 revs, 3 users 78%
igoR87









3 Answers


















11












$begingroup$

The two worlds that you describe aren't really two different kinds of statistician, but rather:




  • "statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.

  • statistics proper, as understood by mathematicians, statisticians, data scientists, etc.


The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high-dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of undergraduate study for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug. Or, more commonly today, we give them a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and a step-by-step procedure, they will be able to accomplish what they need to.



The problem is that without a fairly in-depth understanding, they:




  1. are very likely to misuse statistics

  2. can't stray from the garden path


Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.



The following poem comes to mind:




A little learning is a dangerous thing;

Drink deep, or taste not the Pierian spring:

There shallow draughts intoxicate the brain,

And drinking largely sobers us again.



- Alexander Pope, A Little Learning




I would liken the difference between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in a WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."



What do you consider as must to know in statistics and machine learning?



The Kolmogorov axioms, the definition of a random variable (including random vectors, matrices, etc.), the algebra of random variables, the concept of a distribution, and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems, although if you want to know how to prove them (optional) you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.
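To make the law of large numbers, Chebyshev's inequality, and the central limit theorem a little more concrete, here is a minimal simulation sketch in Python. It is only an illustration: the helper name `sample_means`, the choice of Uniform(0, 1) draws, and the sample sizes are assumptions made for this example.

```python
import random
import statistics


def sample_means(n, trials, rng):
    """Return `trials` sample means, each computed from n Uniform(0, 1) draws."""
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(trials)]


rng = random.Random(0)  # fixed seed so the run is repeatable
mu, var = 0.5, 1.0 / 12.0  # exact mean and variance of Uniform(0, 1)
k = 2.0  # how many standard deviations away counts as "far"

for n in (10, 100, 1000):
    means = sample_means(n, trials=1000, rng=rng)
    sd_of_mean = (var / n) ** 0.5  # theoretical standard deviation of the mean
    # Law of large numbers: the sample means concentrate around mu as n grows.
    spread = statistics.pstdev(means)
    # Chebyshev: P(|mean - mu| >= k * sd) <= 1 / k**2 for any finite-variance distribution.
    tail = sum(abs(m - mu) >= k * sd_of_mean for m in means) / len(means)
    # Central limit theorem: for an exactly normal mean this tail would be about 0.0455,
    # far below the distribution-free Chebyshev bound of 0.25.
    print(f"n={n:5d} spread={spread:.4f} (theory {sd_of_mean:.4f}) "
          f"tail={tail:.3f} (Chebyshev bound {1 / k**2:.2f})")
```

The observed spread should shrink roughly like one over the square root of n, while the tail frequency stays well under the Chebyshev bound.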



This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.



There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!



At this point, you're ready to learn the basic structure of a hypothesis test; which is to say, what a "sample" is, and about null hypotheses, critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
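As a rough sketch of that structure (null hypothesis, test statistic, critical value, decision), here is a two-sided two-proportion z-test of the kind used for a simple A/B test. The function name `two_proportion_z_test` and the conversion counts are made up for illustration, and the normal approximation is assumed to be adequate for samples this large.

```python
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b, alpha=0.05):
    """Two-sided z-test of H0: both groups share one success probability."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Under H0 the best estimate of the common probability pools both groups.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se  # test statistic, approximately N(0, 1) under H0
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical


# Hypothetical data: 200 of 2000 conversions for A, 260 of 2000 for B.
z, critical, p_value, reject = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z={z:.2f}, critical={critical:.2f}, p={p_value:.4f}, reject H0: {reject}")
```

The same four pieces appear in essentially every classical test; only the statistic and its reference distribution change.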



But you're not really done; in fact, we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE (maximum likelihood estimation). For me personally, this is the only reason why we developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.
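To illustrate what "pose it as an optimization problem" can look like, here is a minimal sketch that fits a one-predictor logistic regression by maximum likelihood using plain gradient descent on the negative log-likelihood. Everything here is an assumption for illustration: the data are simulated, the names `fit_logistic` and `neg_log_likelihood` are hypothetical, and real software typically uses faster solvers such as iteratively reweighted least squares.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def neg_log_likelihood(b0, b1, xs, ys):
    """Negative Bernoulli log-likelihood of the model P(y=1 | x) = sigmoid(b0 + b1*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, lr=0.5, steps=1000):
    """Minimize the average negative log-likelihood by gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient contribution of one observation
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Simulated sample with known parameters, so the fit can be sanity-checked.
rng = random.Random(0)
true_b0, true_b1 = -0.5, 2.0
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [1 if rng.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"estimated intercept {b0_hat:.2f}, slope {b1_hat:.2f}")
```

The estimates should land near the parameters used to simulate the data, and the fitted parameters should give a negative log-likelihood no larger than the true ones do on this sample.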



If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
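For a flavor of what a power analysis involves, here is a small sketch of the textbook sample-size formula for comparing two group means with a two-sided z-test, n = 2 * ((z_(1 - alpha/2) + z_power) / d)^2 per group, where d is the standardized effect size. The function name `sample_size_per_group` is hypothetical, and the formula assumes equal group sizes, a known common variance, and the normal approximation; a t-based calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group to detect a standardized mean difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)  # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

The point to take away is the inverse-square relationship: halving the effect size you want to detect roughly quadruples the required sample.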



Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
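As a minimal sketch of cross validation used for model selection, the following compares a constant-only model with a straight-line fit by 5-fold cross-validated mean squared error. The data are simulated and all function names (`k_fold_mse`, `fit_line`, and so on) are hypothetical; in practice you would more likely reach for a library such as scikit-learn or caret.

```python
import random
import statistics


def fit_constant(xs, ys):
    """Baseline model: always predict the training mean of y."""
    return statistics.fmean(ys)


def predict_constant(model, x):
    return model


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def predict_line(model, x):
    a, b = model
    return a + b * x


def k_fold_mse(xs, ys, k, fit, predict, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # same seed gives the same folds for every model
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit only on the training part, then score on the part the model never saw.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((ys[i] - predict(model, xs[i])) ** 2 for i in fold)
    return statistics.fmean(errors)


# Simulated data: a linear signal plus unit-variance noise.
rng = random.Random(42)
xs = [rng.uniform(0, 5) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

mse_constant = k_fold_mse(xs, ys, 5, fit_constant, predict_constant)
mse_line = k_fold_mse(xs, ys, 5, fit_line, predict_line)
print(f"constant model: {mse_constant:.2f}, linear model: {mse_line:.2f}")
```

Because every prediction is scored on data held out from fitting, the comparison reflects out-of-sample error rather than how closely each model can chase the training set.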



At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.
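One hedged illustration of "adding a new term to a loss function": take least squares for a no-intercept line and add an L2 penalty on the coefficient, which gives ridge regression. The toy data, the penalty weight `lam`, and the helper `minimize_1d` are all assumptions for this sketch; the closed forms sum(x*y)/sum(x*x) and sum(x*y)/(sum(x*x) + lam) are printed alongside as a check.

```python
def squared_error(b, xs, ys):
    """Least-squares loss for the no-intercept model y = b * x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))


def ridge_loss(b, xs, ys, lam):
    """The same loss plus one extra term: an L2 penalty on the coefficient."""
    return squared_error(b, xs, ys) + lam * b ** 2


def minimize_1d(loss, lo=-100.0, hi=100.0, iters=200):
    """Ternary search; valid here because both losses are convex in b."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) < loss(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
lam = 10.0

b_ols = minimize_1d(lambda b: squared_error(b, xs, ys))
b_ridge = minimize_1d(lambda b: ridge_loss(b, xs, ys, lam))

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
print(f"least squares: numeric {b_ols:.4f}, closed form {sxy / sxx:.4f}")
print(f"ridge:         numeric {b_ridge:.4f}, closed form {sxy / (sxx + lam):.4f}")
```

The fitting procedure is untouched; only the loss changed, and the penalty shrinks the coefficient toward zero. Swapping the squared penalty for an absolute-value one would give the lasso in the same way.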



What tests/ methods would you put in your toolbox?



All right, laundry list time! A lot of these come from The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to include most of the most mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:




  • Ridge, Lasso, and ElasticNet Regression

  • Local Regression (LOESS)

  • Kernel Density Estimates

  • PCA

  • Factor Analysis

  • K-means

  • GMM (and other mixture models)

  • Decision Trees, Random Forest, and XGBoost

  • Time Series Analysis: ARIMA, possibly exponential smoothing

  • SVM (Support Vector Machines)

  • Hidden Markov Models

  • GAM (Generalized Additive Models)

  • Bayes Networks and Structural Equation Modeling

  • Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.

  • Bayesian inference a la Stan

  • Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)

  • Extreme value theory

  • Vapnik–Chervonenkis theory


  • Causality

  • Pairwise/Preference modeling, e.g. Bradley-Terry


  • IRT (item response theory, used for surveys and tests)


This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right into neural nets and ignore the rest. Some people (actuaries) spend their entire career focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.

















$endgroup$









  • 2
    Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap.
    – Frank Harrell
    17 hours ago



















8












$begingroup$

Speaking from a professional perspective (not an academic one), and having both interviewed several candidates and been interviewed many times myself, I would argue that deep or wide knowledge of stats is not a "must know". What is essential is a very solid grasp of the basics (linear regression, hypothesis testing, probability 101, etc.), as well as some basic knowledge of algorithms (merging/joining tables, dynamic programming, search methods, etc.). I would rather have someone who understands very well how to apply Bayes’ rule and who knows how to unit test a Python function than someone who can give me a fancy explanation of how Bayesian optimization works and has experience with TensorFlow, but doesn't seem to grasp the concept of conditional probability or how to sort an array.
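As a rough illustration of the level meant here, the sketch below applies Bayes' rule to a binary hypothesis and unit tests the result with the standard library's `unittest`. The function name `posterior`, its parameter names, and the screening-test numbers are assumptions chosen for the example, not anything taken from a real interview.

```python
import unittest


def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    for name, p in (("prior", prior),
                    ("likelihood", likelihood),
                    ("false_positive_rate", false_positive_rate)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {p}")
    # Law of total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H).
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("P(E) is zero, so P(H | E) is undefined")
    return likelihood * prior / evidence


class PosteriorTest(unittest.TestCase):
    def test_rare_condition(self):
        # 1% prevalence, 90% sensitivity, 5% false-positive rate:
        # 0.009 / (0.009 + 0.0495) = 2/13.
        self.assertAlmostEqual(posterior(0.01, 0.90, 0.05), 2 / 13)

    def test_uninformative_evidence_keeps_prior(self):
        # Evidence equally likely under H and not-H tells us nothing.
        self.assertAlmostEqual(posterior(0.3, 0.5, 0.5), 0.3)

    def test_rejects_invalid_probability(self):
        with self.assertRaises(ValueError):
            posterior(1.2, 0.9, 0.05)
```

Saved in a file, the tests can be run with `python -m unittest`. The first test is the classic point: even with a fairly accurate test, a positive result for a rare condition leaves the posterior at only about 15%.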



Beyond the basics, most good companies or teams will quiz you on what you claim you know, not what they think you should know. If you put SVM on your resume, make sure you truly understand SVM, and have some experience using it.



Good companies or teams will also test your hands-on experience more than the depth of your theoretical knowledge.

















$endgroup$









  • 1




    $begingroup$
    It is incredibly unlikely that someone could explain in a fancy manner how Bayesian optimization works yet doesn't understand conditional probability or sorting an array. The theoretical knowledge helps guide a lot of applied problems where only applied knowledge keeps blinders on the horse and can limit quality of the work.
    $endgroup$
    – LSC
    17 hours ago












  • $begingroup$
    @LSC you'd be surprised. I have run into a lot of candidates who fit exactly that description. I think it's one of the main reasons more and more companies put people through a hands-on technical screening as the first step in their interview process.
    $endgroup$
    – Skander H.
    15 hours ago








  • 1




    $begingroup$
    I would be surprised. Although, I have met many "data scientists", "ML", and "big data"/"six sigma" people who use the big words/fancy methodology names but have no real statistical background and therefore don't understand the words they use or the big picture (like people using certain kinds of mixed models or LASSO but mistakenly believe a p-value is an error probability). I agree with exploring the depth of someone's claimed knowledge and experience.
    $endgroup$
    – LSC
    15 hours ago



















5












$begingroup$

What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).



What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously




  1. Say that statistics is hard


  2. Admit that they have little training or expertise in it and


  3. Do it on their own anyway.



No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.



What I need to know is




  1. When I am out of my depth. No one knows all this stuff, certainly I don't.


  2. A whole lot about models, methods and such, when each can be applied, what each does, how it goes wrong, alternatives etc.


  3. Also, how to run these models in some statistical package and read the results, detect bugs etc. (I use SAS and R, but other choices are fine).


  4. How to ask questions. A good data analyst asks a lot of questions.


  5. Enough matrix algebra and calculus to at least read articles. But that's not all that much.



Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression; trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, a la Dilbert; it could be a committee chair, a journal editor, a colleague or an actual boss.)

















$endgroup$




















































    $begingroup$






    share|cite|improve this answer











    $endgroup$



    The two worlds that you describe aren't really two different kinds of statistician, but rather:




    • "statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.

    • statistics proper, as understood by mathematicians, statisticians, data scientists, etc.


    The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of an undergraduate degree for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug or, more commonly today, a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and a step-by-step procedure, they will be able to accomplish what they need to.



    The problem is that without a fairly in-depth understanding, they:




    1. are very likely to misuse statistics

    2. can't stray from the garden path


    Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.



    The following poem comes to mind:




    A little learning is a dangerous thing;

    Drink deep, or taste not the Pierian spring:

    There shallow draughts intoxicate the brain,

    And drinking largely sobers us again.



    - Alexander Pope, A Little Learning




    I would liken the difference between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in a WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."



    What do you consider as must to know in statistics and machine learning?



    The Kolmogorov axioms, the definition of a random variable (including random vectors, matrices, etc.), the algebra of random variables, the concept of a distribution, and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems. If you want to know how to prove them (optional), you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.
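
To make the law of large numbers and the central limit theorem concrete, here is a minimal simulation sketch. The choice of Uniform(0, 1) draws, the sample sizes, and the seed are illustrative assumptions, not anything prescribed above.

```python
import math
import random
import statistics

def sample_mean(n, rng):
    """Mean of n independent Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Law of large numbers: the sample mean settles near E[X] = 0.5 as n grows.
big_mean = sample_mean(100_000, rng)

# Central limit theorem: means of n = 30 draws are roughly normal with
# standard deviation sqrt(Var[X] / n), where Var[X] = 1/12 for Uniform(0, 1).
n = 30
means = [sample_mean(n, rng) for _ in range(5_000)]
observed_sd = statistics.stdev(means)
predicted_sd = math.sqrt((1 / 12) / n)

print(f"mean of 100,000 draws: {big_mean:.4f}")
print(f"sd of sample means: observed {observed_sd:.4f}, predicted {predicted_sd:.4f}")
```

The observed spread of the sample means should land close to the predicted value, which is the practical content of the theorem.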



    This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.



    There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!



    At this point, you're ready to learn the basic structure of a hypothesis test, which is to say, what a "sample" is, and about null hypotheses, critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
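
As one small worked instance of that structure, here is a sketch of a two-sided z-test for a single proportion, built from the normal approximation to the binomial. The function name and the coin-flip data are made up for illustration.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0, using the normal approximation."""
    p_hat = successes / n
    # Under H0 the sample proportion has standard error sqrt(p0 * (1 - p0) / n).
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 heads in 100 flips, testing whether the coin is fair.
z, p_value = one_proportion_z_test(successes=60, n=100, p0=0.5)
critical = NormalDist().inv_cdf(0.975)  # two-sided critical value at alpha = 0.05
print(f"z = {z:.2f}, critical value = {critical:.2f}, p-value = {p_value:.4f}")
```

Here the test statistic is 2.0 against a critical value of about 1.96, so the null hypothesis is rejected at the 5% level, though only barely.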



    But you're not really done; in fact, we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE (maximum likelihood estimation). For me personally, this is the only reason why we developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.
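
The "fitting is optimization" idea can be sketched in a few lines: the code below fits a one-feature logistic regression by maximizing the log-likelihood with plain gradient ascent. The simulated data, learning rate, and step count are assumptions chosen for illustration; real software typically uses faster optimizers such as iteratively reweighted least squares.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood under P(y = 1 | x) = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, learning_rate=0.1, steps=2_000):
    """Maximize the log-likelihood by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # The gradient is the sum of (y - p) times each input (1 for the intercept).
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

# Simulated sample from a known model (true intercept -1, true slope 2).
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(500)]
ys = [1 if rng.random() < sigmoid(-1 + 2 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"intercept = {b0_hat:.2f}, slope = {b1_hat:.2f}")
print(f"log-likelihood at the fit: {log_likelihood(b0_hat, b1_hat, xs, ys):.1f}")
```

The "best" parameters are simply the ones with the highest log-likelihood on the sample, so the fitted values should sit near, but not exactly on, the true ones.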



    If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
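
A typical power-analysis question is "how many subjects do I need?" The sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison of means, n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2 per group. The function name and defaults are illustrative, and an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect_size is the standardized difference (mean difference / common sd).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

Halving the effect size roughly quadruples the required sample, which is usually the most important thing to convey to whoever is planning the experiment.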



    Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
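
For a feel of what cross validation does, here is a minimal k-fold sketch in plain Python that compares a mean-only baseline with a simple least-squares line. The data are simulated and the helper names are hypothetical; in practice you would use something like caret or scikit-learn rather than rolling your own.

```python
import random

def fit_mean(train):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Simple least-squares line y = a + b * x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def cross_val_mse(data, fit, k=5):
    """Average held-out mean squared error over k folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        errors.append(sum((y - predict(x)) ** 2 for x, y in test) / len(test))
    return sum(errors) / k

# Simulated data: y depends linearly on x plus unit-variance noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 1.0 + 0.5 * x + rng.gauss(0, 1)) for x in xs]

mse_mean = cross_val_mse(data, fit_mean)
mse_line = cross_val_mse(data, fit_line)
print(f"baseline CV MSE = {mse_mean:.2f}, linear CV MSE = {mse_line:.2f}")
```

Because each model is scored only on data it did not see during fitting, the comparison reflects predictive performance rather than how well each model memorizes the sample.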



    At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.
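
To illustrate how little it takes to get a "new" model from an old one, the sketch below turns least squares into ridge regression by adding a single penalty term to the loss. It is deliberately reduced to one slope with no intercept, and the data are made up.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Minimize sum((y - b * x) ** 2) + penalty * b ** 2 over the slope b.

    Setting the derivative to zero gives b = sum(x * y) / (sum(x * x) + penalty).
    With penalty = 0 this is ordinary least squares through the origin.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Small made-up data set, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ols_slope = fit_slope(xs, ys)                  # original loss
ridge_slope = fit_slope(xs, ys, penalty=10.0)  # same loss plus one extra term
print(f"OLS slope = {ols_slope:.3f}, ridge slope = {ridge_slope:.3f}")
```

The extra term shrinks the slope from 1.990 to about 1.493. Swapping the squared penalty for an absolute-value penalty would give the lasso instead, which is the sense in which the space of models is effectively unbounded.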



    What tests/ methods would you put in your toolbox?



    All right, laundry list time! A lot of these come from The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to include most of the mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:




    • Ridge, Lasso, and ElasticNet Regression

    • Local Regression (LOESS)

    • Kernel Density Estimates

    • PCA

    • Factor Analysis

    • K-means

    • GMM (and other mixture models)

    • Decision Trees, Random Forest, and XGBoost

    • Time Series Analysis: ARIMA, possibly exponential smoothing

    • SVM (Support Vector Machines)

    • Hidden Markov Models

    • GAM (Generalized Additive Models)

    • Bayes Networks and Structural Equation Modeling

    • Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.

    • Bayesian inference a la Stan

    • Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)

    • Extreme value theory

    • Vapnik–Chervonenkis theory


    • Causality

    • Pairwise/Preference modeling, e.g. Bradley-Terry


    • IRT (item response theory, used for surveys and tests)


    This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right into neural nets and ignore the rest. Some people (actuaries) spend their entire careers focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.















    answered yesterday


























    community wiki





    olooney









    • 2




      Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap.
      – Frank Harrell
      17 hours ago




























    Speaking from a professional perspective (not an academic one), and based on having interviewed several candidates and having been interviewed myself many times as well, I would argue that deep or wide knowledge in stats is not considered a "must know", but having a very solid grasp of the basics (linear regression, hypothesis testing, probability 101, etc.) is essential, as well as some basic knowledge of algorithms (merging/joining tables, dynamic programming, search methods, etc.). I would rather have someone who understands very well how to apply Bayes’ rule and who knows how to unit test a Python function, than someone who can give me a fancy explanation of how Bayesian optimization works and has experience with TensorFlow, but doesn't seem to grasp the concept of conditional probability or how to sort an array.
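
As a rough idea of the level meant by "apply Bayes' rule and unit test a Python function", here is a small sketch. The function name, the screening-test numbers, and the plain-assert test style (the kind a runner such as pytest would collect) are illustrative assumptions.

```python
import math

def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) from Bayes' rule.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

def test_rare_condition():
    # 1% base rate, 90% sensitivity, 5% false positives:
    # 0.009 / (0.009 + 0.0495) = 2/13, so a positive result is still
    # far more likely to be a false alarm than a true case.
    assert math.isclose(posterior(0.01, 0.90, 0.05), 2 / 13)

def test_uninformative_evidence_leaves_prior_unchanged():
    # Evidence that is equally likely either way should not move the prior.
    assert math.isclose(posterior(0.3, 0.5, 0.5), 0.3)

# Run the checks directly; a test runner would discover them by name instead.
test_rare_condition()
test_uninformative_evidence_leaves_prior_unchanged()
print(f"P(condition | positive test) = {posterior(0.01, 0.90, 0.05):.3f}")
```

The second test is the more telling one: it checks a property the function must satisfy rather than a single memorized number.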



    Beyond the basics, most good companies or teams will quiz you on what you claim you know, not what they think you should know. If you put SVM on your resume, make sure you truly understand SVM, and have some experience using it.



    Good companies or teams will also test your hands-on experience more than the depth of your theoretical knowledge.














    edited 20 hours ago


























    community wiki





    Skander H.









    • 1




      It is incredibly unlikely that someone could explain in a fancy manner how Bayesian optimization works yet doesn't understand conditional probability or sorting an array. The theoretical knowledge helps guide a lot of applied problems where only applied knowledge keeps blinders on the horse and can limit quality of the work.
      – LSC
      17 hours ago












    • @LSC you'd be surprised. I have run into a lot of candidates who fit exactly that description. I think it's one of the main reasons more and more companies put people through a hands-on technical screening as the first step in their interview process.
      – Skander H.
      15 hours ago








    • 1




      I would be surprised. Although, I have met many "data scientists", "ML", and "big data"/"six sigma" people who use the big words/fancy methodology names but have no real statistical background and therefore don't understand the words they use or the big picture (like people who use certain kinds of mixed models or LASSO but mistakenly believe a p-value is an error probability). I agree with exploring the depth of someone's claimed knowledge and experience.
      – LSC
      15 hours ago


























    What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).



    What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously:




    1. Say that statistics is hard


    2. Admit that they have little training or expertise in it and


    3. Do it on their own anyway.



    No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.



    What I need to know is:




    1. When I am out of my depth. No one knows all this stuff, certainly I don't.


    2. A whole lot about models, methods and such, when each can be applied, what each does, how it goes wrong, alternatives etc.


    3. Also, how to run these models in some statistical package and read the results, detect bugs etc. (I use SAS and R, but other choices are fine).


    4. How to ask questions. A good data analyst asks a lot of questions.


    5. Enough matrix algebra and calculus to at least read articles. But that's not all that much.



    Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression. Trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, a la Dilbert; it could be a committee chair, a journal editor, a colleague or an actual boss.)






    share|cite|improve this answer











    $endgroup$


















      5












      $begingroup$

      What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).



      What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously




      1. Say that statistics is hard


      2. Admit that they have little training or expertise in it and


      3. Do it on their own anyway.



      No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.



      What I need to know is




      1. When I am out of my depth. No one knows all this stuff, certainly I don't.


      2. A whole lot about models, methods and such, when each can be applied, what each does, how it goes wrong, alternatives etc.


      3. Also, how to run these models in some statistical package and read the results, detect bugs etc. (I use SAS and R, but other choices are fine).


      4. How to ask questions. A good data analyst asks a lot of questions.


      5. Enough matrix algebra and calculus to at least read articles. But that's not all that much.



      Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods but 1) I have rarely felt the need - there are a huge variety of existant methods and 2) Most of my client have a hard enough time recognizing that you can't always use OLS regression, trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy haired boss, a la Dilbert and could be a committee chair, a journal editor, a colleague or an actual boss).






      share|cite|improve this answer











      $endgroup$
















        5












        5








        5





        $begingroup$

        What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).



        What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously




        1. Say that statistics is hard


        2. Admit that they have little training or expertise in it and


        3. Do it on their own anyway.



        No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.



        What I need to know is:

        1. When I am out of my depth. No one knows all this stuff; I certainly don't.
        2. A whole lot about models, methods and such: when each can be applied, what each does, how it goes wrong, what the alternatives are, etc.
        3. How to run these models in some statistical package, read the results, detect bugs, etc. (I use SAS and R, but other choices are fine.)
        4. How to ask questions. A good data analyst asks a lot of questions.
        5. Enough matrix algebra and calculus to at least read articles. But that's not all that much.



        Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression. Trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, à la Dilbert; it could be a committee chair, a journal editor, a colleague or an actual boss.)
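
To make the OLS remark concrete, here is a minimal sketch of one case where ordinary least squares is a poor fit: a yes/no outcome coded 0/1. The data, the variable names and the helper `ols_fit` are made up for illustration and are not from the answer above. The only point is that a straight line fitted to 0/1 data can produce fitted values below 0 or above 1, which cannot be read as probabilities.

```python
# Hypothetical toy data: x is some exposure level, y is a yes/no outcome coded 0/1.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def ols_fit(xs, ys):
    """Simple least-squares line y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_fit(x, y)
fitted = [intercept + slope * a for a in x]

# Fitted values outside [0, 1] make no sense as probabilities of "yes".
impossible = [(a, round(f, 3)) for a, f in zip(x, fitted) if f < 0 or f > 1]

print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("fitted values outside [0, 1]:", impossible)
```

A model built for binary outcomes, such as logistic regression, is the usual alternative in a case like this.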

















        answered 17 hours ago
        community wiki
        Peter Flom






























