The vast macrocosm of data science and engineering

& how skewed our perception might be!

Anukriti Ranjan
Jan 29, 2021

Data science is typically understood as the discipline that delivers insights from the vast ocean of data, a deluge that has captured the imagination of technologists like nothing else. The internet is flooded with tutorials on how to import and clean data using efficient libraries, run a machine learning model to inspect data for interesting patterns, or predict on new data using a trained model (again, using pre-existing libraries!). While all of this forms an integral part of data science, it is not the only, or even the most important, aspect of it.

Image by author: The universe of data science and engineering

Data science and engineering offer immense potential to create business and social impact. From elevating user experience to reducing cost, from detecting anomalies to improving quality, from forecasting future trends to correcting poor decisions, data is being used for a multitude of purposes. The discipline is here to stay and thrive, but what it needs is an equitable distribution of talent and resources across its various aspects. Currently, practitioners in this field are widely perceived as people who investigate data to build models. In reality, the field is far more diverse and offers immense opportunities to put data to the best use through an eclectic mix of talent and resources.

Let’s look at the lesser discussed aspects of this vast field.

What data to use, when and why?


Data acquisition, processing and storage cost time, money and effort, and the resources allocated to them must justify the purpose they serve. Deciding which data to acquire, and what target to use it for, is of the utmost importance. This usually requires domain expertise and the experience of having looked at enough data to understand what is relevant and what is not, an insight that develops over time with real-life exposure to solving problems with data.

Interesting case studies can be found on the internet and in books about how a particular target metric proved to be a game changer, or how capturing marginally more data helped account for confounding.

• Airbnb’s team had a hunch that better photos would increase rentals.

• They tested the idea with a Concierge MVP, putting the least effort possible into a test that would give them valid results.

• When the experiment showed good results, they built the necessary components and rolled it out to all customers.

Sometimes, growth comes from an aspect of your business you don’t expect. When you think you’ve found a worthwhile idea, decide how to test it quickly, with minimal investment. Define what success looks like beforehand, and know what you’re going to do if your hunch is right.

Lean, analytical thinking is about asking the right questions, and focusing on the one key metric that will produce the change you’re after.

(From: Lean Analytics: Use Data to Build A Better Startup Faster, by Alistair Croll and Benjamin Yoskovitz)
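The quick, minimal test the book describes can be as simple as comparing two conversion rates. Here is a minimal sketch in Python of how such an experiment might be evaluated; the numbers and the choice of statsmodels’ proportions_ztest are illustrative assumptions, not the actual Airbnb analysis.

```python
# Minimal sketch of evaluating a "better photos" style experiment.
# The counts below are made up for illustration; they are not Airbnb data.
from statsmodels.stats.proportion import proportions_ztest

# bookings and visitors for control (old photos) and treatment (pro photos)
conversions = [130, 171]   # control, treatment
visitors = [2000, 2000]

# two-sided z-test for a difference in conversion rates
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)

control_rate = conversions[0] / visitors[0]
treatment_rate = conversions[1] / visitors[1]
print(f"control: {control_rate:.1%}, treatment: {treatment_rate:.1%}, "
      f"p = {p_value:.3f}")

# As the book advises, decide the success criterion (e.g. p < 0.05 AND an
# uplift large enough to justify the cost) before running the test.
```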

Sometimes, it takes humility and understanding to accept that data alone will not provide all the answers.


Data engineering and hardware optimization

Building data infrastructure is a huge domain in itself. All the data your organization collects (from sources such as mobile devices, laptops, sensors, etc.) is typically stored in raw format in a ‘data lake’ and converted into usable formats by ETL pipelines. Your organization could use in-house servers or a cloud platform for this. Data storage and processing on cloud platforms are billed by usage and must be managed carefully to keep costs low while still deriving the expected value.
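To make the idea concrete, here is a minimal sketch of a single ETL step over a data lake in Python; the paths, field names and file formats are hypothetical, and a production pipeline would usually run under an orchestrator rather than as a lone script.

```python
# Minimal ETL sketch: extract raw JSON events from a (hypothetical) data
# lake path, transform them into tabular form, and load them as Parquet.
import pandas as pd

RAW_PATH = "datalake/raw/events/2021-01-29.jsonl"        # hypothetical
CURATED_PATH = "datalake/curated/events/2021-01-29.parquet"

# Extract: read newline-delimited JSON dumped by upstream devices/sensors
raw = pd.read_json(RAW_PATH, lines=True)

# Transform: parse timestamps, drop malformed rows, keep needed fields
raw["event_time"] = pd.to_datetime(raw["event_time"], errors="coerce")
clean = raw.dropna(subset=["event_time", "user_id"])[
    ["event_time", "user_id", "event_type"]
]

# Load: write a columnar, query-friendly format to the curated zone
clean.to_parquet(CURATED_PATH, index=False)
```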

Machine learning algorithms are computationally intensive and require repeated experimentation to reach acceptable results. This calls for state-of-the-art processing units. More often than not, a single data ingestion and processing script will not suffice; you will need distributed parallel processing, which in turn requires decisions about the choice and refinement of architecture. The models are then deployed to create real-world impact.
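As a toy illustration of why parallelism matters, the sketch below processes the same chunks serially and then in parallel using only Python’s standard library; process_chunk is a placeholder for real ingestion or feature-computation work.

```python
# Sketch: processing data chunks serially vs. in parallel.
# process_chunk stands in for real work (parsing, features, scoring, ...).
import time
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk_id: int) -> int:
    time.sleep(0.5)           # stand-in for heavy per-chunk work
    return chunk_id

if __name__ == "__main__":
    chunks = range(8)

    start = time.perf_counter()
    serial = [process_chunk(c) for c in chunks]
    print(f"serial:   {time.perf_counter() - start:.1f}s")

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(process_chunk, chunks))
    print(f"parallel: {time.perf_counter() - start:.1f}s")
```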

Data architecture is extremely integral to determining the latency, throughput, usability, scalability, security and effectiveness of your data solution.

Hypothetical difference in computation time between distributed parallel and serial processing on BigQuery, GCP

Explainability and Interpretability of models

Interpretability is the degree to which a human can consistently predict the model’s result. The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.

(From: Interpretable Machine Learning, by Christoph Molnar)

Increasingly, machine learning models are seen as black boxes that produce predictions when fed data. Yet it is often pertinent for us to know how those predictions were arrived at. Explainability matters for increasing trust in a model, for making sure that the biases we are fighting against have not percolated into the model only to be scaled further, for answering questions of ethics, and for improving human capability by leveraging what the machine has learned. It is a developing field and will grow in significance as AI and ML are increasingly adopted.
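One common, model-agnostic way to peek inside a ‘black box’ is permutation importance, which measures how much a model’s held-out performance degrades when each feature is shuffled. A minimal sketch with scikit-learn follows; the dataset and model are toy stand-ins for a real pipeline.

```python
# Sketch: model-agnostic permutation importance with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature and measure how much held-out accuracy drops:
# the features whose shuffling hurts most drive the model's predictions.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
ranked = result.importances_mean.argsort()[::-1]
for i in ranked[:5]:
    print(f"{X.columns[i]:<25} {result.importances_mean[i]:.3f}")
```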

Monitoring the impact of analysis and models in the real world

Models must be subjected to scrutiny to ensure that they are delivering the expected results and that their use has not given rise to unintended effects. This requires regular monitoring and accountable processes so that course correction can be undertaken as and when required.
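A simple form of such monitoring is to compare the distribution of live model scores against the training baseline, for instance with the population stability index (PSI). The sketch below is a minimal, assumption-laden version: the data is simulated, and the 0.1/0.25 thresholds are a common rule of thumb rather than a standard.

```python
# Sketch: population stability index (PSI) between a training baseline
# and live production scores, as a crude drift alarm.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the baseline distribution
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    curr_pct = np.histogram(current, edges)[0] / len(current)
    # Avoid division by zero in empty bins
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.5, 0.1, 10_000)   # simulated baseline scores
live_scores = rng.normal(0.56, 0.12, 10_000)  # simulated drifted scores

value = psi(train_scores, live_scores)
print(f"PSI = {value:.3f}")  # rule of thumb: <0.1 stable, >0.25 investigate
```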

Conclusion

This article was meant to highlight the different aspects of data science and engineering, not to weigh the importance of one aspect against another. There are multiple facets to this amazing field, and each of them requires insights and solutions that will further the efficacy and use of technology. Perhaps this is why the field attracts people from various disciplines and diverse experiences.
