The NSDC Data Science Flashcards series will teach you how the data pipeline is developed for data science projects. This flashcard was created by Emily Rothenberg, National Student Data Corps (NSDC) Program Manager. You can find the full NSDC Data Science Flashcards collection of videos on the NEBDHub YouTube channel.
Data Analytics allows us to take large datasets and find hidden patterns or trends that can answer otherwise-complex questions. So, first, you’ll need to come up with your question or problem statement. Are you interested in recommending methods that would increase company sales for the upcoming quarter? Are you interested in identifying forms of bias in social media data? Your question or motivation is just as important as your analysis throughout this process.
Once you have a research question or hypothesis that you’d like to test, you’ll want to begin your research project by following the Data Science Pipeline.
The first step in the Data Science Pipeline concerns data selection bias and data science ethics. A good data scientist needs to understand the ethical issues surrounding the data they collect and use. It’s important to select data for your project that is free of unintentional systematic errors, which generate distorted results and lead to unfair recommendations. Bias can be introduced through data collection, model training, algorithms, human error, and more.
Let’s say, for example, a data scientist is tasked with analyzing health risks, specifically the incidence of asthma, in a particular geographic area. Bias can arise in this study in several ways. For example, data from certain areas may be omitted to make the overall results look better (e.g., excluding lower-income areas exposed to heavy automotive exhaust). Bias might also be introduced if the results are averaged across the whole area, which can hide differences between neighborhoods and prevent the researcher from identifying risks by zip code.
As you can see, it is important that the data scientist does not ignore or discount any relevant data. Using all appropriate data gives a clearer understanding and should result in more complete and equitable conclusions and decisions.
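To make the asthma example concrete, here is a minimal Python sketch of how averaging or omitting data can hide a local risk. The zip code labels, populations, and case counts are invented purely for illustration and are not real data; the `rate` helper is a hypothetical name introduced for this example.

```python
# Hypothetical counts, invented for illustration only (not real data).
records = [
    {"zip": "zip_A", "population": 10_000, "asthma_cases": 500},
    {"zip": "zip_B", "population": 10_000, "asthma_cases": 600},
    # Assumed to be a lower-income area near heavy automotive exhaust.
    {"zip": "zip_C", "population": 5_000, "asthma_cases": 900},
]


def rate(rows):
    """Return asthma cases per 100 residents across the given rows."""
    cases = sum(r["asthma_cases"] for r in rows)
    population = sum(r["population"] for r in rows)
    return 100 * cases / population


# One averaged number for the whole area.
overall_rate = rate(records)

# The same data broken out by zip code.
rate_by_zip = {r["zip"]: rate([r]) for r in records}

# The overall number if the high-exposure area is left out.
rate_without_c = rate([r for r in records if r["zip"] != "zip_C"])

print(f"Overall rate: {overall_rate:.1f} per 100")        # 8.0
for zip_code, value in rate_by_zip.items():
    print(f"  {zip_code}: {value:.1f} per 100")           # 5.0, 6.0, 18.0
print(f"Rate without zip_C: {rate_without_c:.1f} per 100")  # 5.5
```

In this made-up data, the single averaged figure (8.0 per 100) says nothing about the one zip code whose rate is more than twice that, and dropping that zip code makes the whole area look healthier than it is. Breaking results out by zip code and keeping every area in the analysis is what surfaces the risk.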
Please follow along with the rest of the NSDC Data Science Flashcard series to learn more about the Data Pipeline.