Getting Started With Data Science — A starter guide for companies

Vinayak Mitty
9 min read · Oct 10, 2022

Data Science. Machine Learning. Artificial Intelligence. Deep Learning. Data Engineering. Data Analytics. Big Data Analytics. Oh, the endless list of buzzwords!

With so much buzz and hype around Data, the majority of small and mid-sized companies feel alienated. In my consulting, advisory, and leadership roles, I have come across executives and teams at smaller companies feeling Data Science isn’t for them and that using data to grow their business is very complex and out of reach. While it is true that leaders in the data space (FAANG and others) invest millions of dollars in Data Science, deriving actionable insights from data to improve your business need not be unattainable or be a monopoly just for the big players. Although the journey to data literacy and adoption varies for every company, I am attempting here to demystify the data landscape and provide some basic pathways to obtain intelligence from data.

Data Strategy — Start small

Having a good data strategy provides a blueprint and aligns all the stakeholders in the same direction. The best data strategies plan bandwidth and resource allocation so teams can maintain their existing reporting while building a scalable data infrastructure. Good strategies allow for a slow start and gain momentum as foundations are built. Analytics Vidhya, a reputed community for Data Science, captures this journey here:

The spectrum of Data Science from Analytics Vidhya

In this article, I want to dive into the mindset and approach needed to navigate each stage of this spectrum. I will publish follow-up articles about the technical details, tools, and other how-tos.

Reporting Automation

Most companies start off reporting in Excel or Google Sheets and have the basic analysis needed to manage the business. This works on a small scale. However, it is important to automate reporting if you want to move to the next phases of the Analytics and Data Science journey. Leveraging automation tools within the spreadsheet ecosystem is a good way to succeed in this phase: VBA, macros, SharePoint, PowerPivots, pivot tables, and so on. Excel is such a powerful tool that you can create practically anything you can dream of. The goal for this phase is to partially automate your reporting so you can win back time from running manual reports. The saved time can be allocated toward building the foundational data infrastructure (discussed in the next section). Here are a few articles on automating reports in Excel: Article 1 and Article 2.
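When the spreadsheet itself becomes the bottleneck, even a short script can take over a recurring report. Below is a minimal sketch in Python, assuming a hypothetical raw export named sales_export.csv with date, region, and revenue columns; adapt the names to whatever your own systems produce.

```python
# A minimal report-automation sketch. Assumes a hypothetical export
# file "sales_export.csv" with date, region, and revenue columns.
# Requires pandas and openpyxl (pip install pandas openpyxl).
import pandas as pd

def build_weekly_report(csv_path: str, out_path: str) -> None:
    df = pd.read_csv(csv_path, parse_dates=["date"])

    # Roll daily rows up to revenue per region per ISO week.
    df["week"] = df["date"].dt.isocalendar().week
    summary = (
        df.groupby(["week", "region"], as_index=False)["revenue"].sum()
          .pivot(index="week", columns="region", values="revenue")
    )

    # Write the summary to a fresh workbook. Schedule this script
    # (cron, Windows Task Scheduler) and the report runs itself.
    summary.to_excel(out_path, sheet_name="Weekly Revenue")

if __name__ == "__main__":
    build_weekly_report("sales_export.csv", "weekly_report.xlsx")
```

Even partial automation like this frees up the hours the next phases will need.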

Data Architecture

A Data Architecture is foundational to a successful data strategy. It is composed of technology, tools, models, policies, and rules that govern how the data is collected, stored, processed, transformed, presented, and analyzed. Setting up these building blocks is essential for growing to the next phase — Business Intelligence and Data Visualization.

It is important to note that the Data Architecture itself can be endlessly complex based on the business needs. For example, some complex business scenarios might need real-time streaming, whereas a simpler batch-processing mechanism might work just fine for some. Again, this article series focuses on a simple architecture to gain insights and intelligence from the data without getting too worried about complex technological infrastructure.

Data Science Architecture

Database as an Analytical Data Store

When developing a Data Architecture, investing in a database system is essential. The goal here is to bring data from your sources into this database and store and transform it so it can be consumed for reporting. This database will serve as the Analytical Data Store used to store and process data. This is also referred to as a data lake or a data warehouse (or a data lakehouse), depending on how it is used and set up.

With advancements in cloud technologies and the availability of cloud database services, it makes sense to skip on-premises databases and invest directly in cloud databases. Investing in the cloud will save you money, time, and resources when it comes to the initial setup, maintenance, and security of the database. There are plenty of options available, and one cannot go wrong choosing any of the top databases on the market. However, I would recommend going with Snowflake, as it is one of the most flexible, scalable, and easiest databases on the market to set up and manage.

At the risk of repeating myself: at the simplest level, the job of our database is to store and process data and make it available for reporting (Business Intelligence and Visualization). The database can be split into a Data Lake and a Data Warehouse. The first component is the Data Lake, where data from various sources lands in its raw form. The data is then massaged and transformed into a consumable format for reporting, a process called warehousing; the component that holds this refined data is the Data Warehouse.

Data Ingestion and Integration

Once the database is set up, the next step is to hydrate it with data from your various sources and platforms: ERPs, CRMs, digital analytics platforms like Google Analytics, or paid media platforms like Google Ads, Facebook, Instagram, and others. Ideally, the data lake holds data from all the touchpoints your customers have with your company: pre-purchase, post-purchase, and even anything you can find after the business relationship ends (cancellations, post-sales activity, etc.).

The process of bringing data into a database is called data ingestion or extraction. With data storage as cheap as it is, I am an advocate of the ELT (Extract, Load, Transform) approach: bring the data as-is from the different sources, load it into the database, and transform it later into a reportable structure. The Load happens in the Data Lake part of the database, and after the Transformation, the data is stored in a reportable format in the Data Warehouse part.
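To make the ELT flow concrete, here is a toy sketch in Python using SQLite as a stand-in for a cloud warehouse such as Snowflake; the lake_orders and wh_orders tables and their columns are invented for illustration. The raw text lands untouched in the "lake" table, and a later SQL step casts and reshapes it into a typed, reportable "warehouse" table.

```python
# Toy ELT illustration: SQLite stands in for a cloud warehouse,
# and all table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land source records as-is in the "lake", raw and untyped.
conn.execute(
    "CREATE TABLE lake_orders (raw_order_id TEXT, raw_amount TEXT, raw_placed_at TEXT)"
)
conn.executemany(
    "INSERT INTO lake_orders VALUES (?, ?, ?)",
    [("1001", "49.99", "2022-09-01"), ("1002", "15.50", "2022-09-02")],
)

# Transform: cast and clean into a typed, reportable "warehouse" table.
conn.execute("""
    CREATE TABLE wh_orders AS
    SELECT CAST(raw_order_id AS INTEGER) AS order_id,
           CAST(raw_amount AS REAL)      AS amount_usd,
           DATE(raw_placed_at)           AS placed_on
    FROM lake_orders
""")

for row in conn.execute("SELECT order_id, amount_usd, placed_on FROM wh_orders"):
    print(row)
```

The same pattern scales up: ingestion tools land raw data in the lake, and scheduled SQL transformations keep the warehouse reportable.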

Again, there are several on-premises and cloud-based tools for data ingestion, typically called ETL tools. I'd continue the theme and recommend the cloud-based ones to avoid setup, maintenance, admin, and security chores. Stitch, Fivetran, and Hevo Pipeline are some of the good ones I have come across; each of these is a no-code, easy-to-set-up tool.

Data Analytics and Business Intelligence

A fully automated reporting stack needs a robust platform and set of tools to turn data into actionable insights that inform business decisions. The Business Intelligence (BI) layer of the data architecture is the user interface, and ideally the one-stop shop where users find both analytic reporting and more advanced Data Science-driven analyses. BI tools such as Tableau and Power BI access and analyze data sets and present analytical findings in reports, summaries, dashboards, graphs, charts, and maps to provide users with detailed intelligence. Again, I'd suggest going with the cloud versions of these tools to avoid maintenance headaches.

A good data analytics infrastructure easily handles reporting for various aspects of the business, such as sales reporting, revenue calculations, and overall summaries of products, reviews, and other activities. For the majority of businesses, this is the phase with the best return on investment in terms of data. The goal for the company at this stage is to automate the repetitive periodic reports so the data teams can sit back and think about the data. Be data analysts instead of report runners: let the machines do the mundane report-crunching and spend your time digging into the hidden stories and insights in the data.

Data Science — Statistical Modeling to Machine Learning and AI

Data Analytics and Business Intelligence are all about looking at past events and deriving insights from them. The next step is to use historical data to predict what is likely to happen in the future. This is where companies leverage statistical modeling, predictive analytics, and other advanced data techniques to project the future results of their strategies and efforts.

The journey up to BI and Data Analytics can be fairly linear (it can get complex depending on business needs, but it can be kept straightforward). However, the next phases in the Data Science journey start getting complex: the learning curve becomes steeper, and it takes longer for leaders to see the results of their investments. There are tools available for simple forecasting and predictive modeling. For example, BI tools like Tableau have some of these models built in (simple forecasting in Tableau and Einstein Discovery in Tableau). However, in my experience, these are fairly basic and might not get the expected results.

Tools such as Python and RStudio are popular for statistical and predictive modeling. Both require some expertise and setup on your part. And if you are well versed in SQL, database platforms like Snowflake can in fact be leveraged for a lot of statistical and predictive modeling.
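To give a flavor of the simplest end of this work, here is a tiny Python sketch that fits a linear trend to twelve months of made-up revenue figures and projects the next quarter; real projects would reach for richer models (statsmodels, scikit-learn, Prophet) and proper validation.

```python
# Minimal trend forecast on made-up monthly revenue figures.
import numpy as np

revenue = np.array([110, 118, 125, 121, 134, 140,
                    138, 149, 155, 152, 160, 168], dtype=float)
months = np.arange(len(revenue))

# Fit revenue ~ slope * month + intercept (ordinary least squares).
slope, intercept = np.polyfit(months, revenue, deg=1)

# Project the next three months from the fitted trend line.
for m in range(len(revenue), len(revenue) + 3):
    print(f"Month {m + 1} forecast: {slope * m + intercept:.1f}")
```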

Moving further up the spectrum into Machine Learning will require specific tools such as Amazon SageMaker or Google BigQuery ML, both of which can be integrated with the Snowflake-based data architecture described above.
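For instance, BigQuery ML lets you train a model with plain SQL. The sketch below is a hedged example, assuming a hypothetical mydataset.customers table with a churned label column, and it needs a GCP project with credentials already configured.

```python
# Hypothetical BigQuery ML example: train a churn classifier in SQL.
# The dataset, table, and columns are placeholders you would replace
# with your own; requires configured GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()  # picks up your configured GCP project

sql = """
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, churned
FROM `mydataset.customers`
"""

client.query(sql).result()  # blocks until training completes
print("Model trained: mydataset.churn_model")
```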

Machine Learning is a step beyond predictive modeling such as forecasting: this is when we train machines to mimic the intelligence and analysis that a human would apply. Algorithms and mathematical models are used to recognize patterns, classify objects, detect anomalies, and predict future events. There are several steps to building an ML pipeline (a minimal sketch follows the list):

  • Data Collection
  • Data Preparation and Feature Engineering
  • Model Selection and Training
  • Evaluation and Tuning
  • Deployment
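To show how these steps fit together, here is a compact scikit-learn sketch using a bundled toy dataset as a stand-in for your own data; each comment maps to a step above.

```python
# Compact ML pipeline sketch; the bundled dataset is a hypothetical
# stand-in for your own business data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data Collection: here, a toy dataset shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# 2. Data Preparation / Feature Engineering: scaling inside a pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Model Selection and Training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)

# 4. Evaluation and Tuning: check held-out accuracy before iterating.
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")

# 5. Deployment: persist the fitted pipeline for serving, e.g.:
# import joblib; joblib.dump(model, "model.joblib")
```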

Artificial Intelligence is a step beyond even ML, where a computer system is capable of mimicking human cognitive functions like problem-solving. Understanding the use cases (noted in the next section) for Machine Learning and Artificial Intelligence will come in handy when deciding whether this is something you need to think about at all.

Building out a scalable infrastructure from the get-go is critical. Starting off with reporting automation and growing into data science has a lot of merit: it allows the teams to carry out the reporting needs of the business while planning ahead. Companies of any size can adopt this blueprint, but it is most suitable for small and mid-sized companies constrained by resources.

Additional Use Cases

So is this all worth it?

Tired of short, clickbaity blogs that say do-as-I-do or follow-five-steps-for-success, I am trying a philosophical and conceptual approach that anyone curious about the field can adopt. Data Science need not be complex or just for the big players; I am writing this as part of a blog series about the thought process behind building simple, cost-effective data science platforms.

Used as an umbrella term, Data Science and its subsets (Statistical Analysis, Predictive Analytics, Big Data, Machine Learning, and Artificial Intelligence) can help companies position themselves as market leaders in their respective industries. The better you get at building models and algorithms and at collecting and analyzing multi-dimensional data, the better insights you gain into your customers, turning the data architecture into a profit center instead of the must-have cost burden it is at many companies. As you grow in terms of data acquisition, the architecture will need to morph into a Big Data stack with the ability to carry out interval-based (batch) as well as streaming (real-time) data ingestion, processing, analysis, reporting, and visualization.

If your business goals justify the investment in resources, technology, and time to build out Data Science capabilities, it definitely pays off when done correctly. There is a reason for all the hype around the field; when it works well, it feels like magic! If you want to go all the way, I suggest you have realistic expectations — start slowly and build up the momentum. Even when you get to the modeling and machine learning phases, it’s not going to immediately give you results.

It takes fine-tuning, configuring the algorithms to suit your needs, and lots of iterations to start getting the results you hope for. All in all, it will be an interesting mission, should you choose to accept it :)


Vinayak Mitty

Director of Data Science and Engineering at LegalShield. PhD Candidate. Advisor. Open for consultations and part-time engagements: www.vmitty.com