I remember that when most of my friends were doing everything in their power to avoid maths, I always believed it was one of the most creative and interesting subjects to explore. My love of numbers led me first to accountancy, and ultimately into data science and machine learning. I like to think that my strong understanding of financial numbers and business gave me a competitive advantage when switching careers. The best thing about being a data scientist is that I work on a variety of projects, using different tools and facing new challenges every day, with the certainty that I will be crunching numbers at some point in the day. Data science is often used as a buzzword for analytics these days, but generally it is associated with machine learning, artificial intelligence and data visualisation.
My working day usually starts at 9:30am, when I check my meeting schedule for the day while drinking coffee! Most days we have a daily team planning call at around 10am. One of the things I learned very quickly in my role with OpenSky is that communication is one of the major success factors in any project. As a data scientist, I not only interact with my ‘data’ team members (who are incredible people by the way 😊), but I also collaborate on projects and tasks across multiple departments. Analytics is more than just coding and data modelling – it combines the human element of knowledge and science.
That’s why I usually spend 20% of my time analysing the project objectives and requirements from a business perspective, creating documentation, conducting a series of interviews with stakeholders, and finally transforming this knowledge into a plan. Armed with business knowledge, a plan and caffeine, I am ready for the next phase: data understanding. I begin by gathering the data, which usually comes from multiple sources and in various shapes. Early data discovery helps me to find key insights and links, but most importantly it tells me whether the data I am looking at is representative of the business problem I am trying to solve.
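To make the idea of early data discovery a little more concrete, here is a minimal sketch in plain Python of a first profiling pass: for each field, count how many values are missing and how many distinct values are present. The records, the field names and the `profile` helper are all hypothetical and invented for illustration; they are not taken from any real project.

```python
# Hypothetical records standing in for data gathered from several sources.
records = [
    {"customer_id": "C001", "region": "East", "revenue": 1200.0},
    {"customer_id": "C002", "region": None, "revenue": 850.5},
    {"customer_id": "C003", "region": "West", "revenue": None},
    {"customer_id": "C004", "region": "East", "revenue": 430.0},
]


def profile(rows):
    """Return, per field, the number of missing and distinct non-missing values."""
    # Collect every field name that appears in any row.
    fields = sorted({key for row in rows for key in row})
    summary = {}
    for field in fields:
        values = [row.get(field) for row in rows]
        present = [value for value in values if value is not None]
        summary[field] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


summary = profile(records)
for field, stats in summary.items():
    print(field, stats)
```

A summary like this will not answer the business question by itself, but it gives a quick view of which fields are sparsely populated or suspiciously uniform before any modelling starts.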
Shortly after this, one thing I am constantly faced with is that the data I am reviewing is not in the best shape and is usually far from perfect. Finding and cleansing dirty, noisy or missing data takes around 60-80% of my time each day. After working through the multiple steps of data preparation, I can finally start building and evaluating the models. I repeat the steps until I am happy with the results and the model can be deployed. This will keep me busy for the rest of the day.
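As a rough illustration of what cleansing dirty or missing data can involve, the sketch below uses plain Python to drop exact duplicate rows, parse text amounts into numbers, and fill the gaps with the median of the known values. Everything here is an assumption made for the example: the sample rows, the `parse_amount` and `cleanse` helpers, and the choice of median imputation, which is only one of many possible strategies and not necessarily the right one for a given dataset.

```python
from statistics import median

# Hypothetical raw rows showing a few typical problems.
raw = [
    {"id": 1, "amount": "120.50"},
    {"id": 2, "amount": ""},         # missing value
    {"id": 2, "amount": ""},         # exact duplicate of the previous row
    {"id": 3, "amount": " 99.50 "},  # stray whitespace
    {"id": 4, "amount": "n/a"},      # unparseable placeholder
]


def parse_amount(text):
    """Convert text to float, returning None when it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def cleanse(rows):
    """Drop exact duplicates, parse amounts, and impute gaps with the median."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["id"], row["amount"])
        if key in seen:
            continue  # skip rows already seen
        seen.add(key)
        cleaned.append({"id": row["id"], "amount": parse_amount(row["amount"])})

    # Assumption: median imputation is acceptable for this illustration.
    known = [row["amount"] for row in cleaned if row["amount"] is not None]
    fill = median(known) if known else None
    for row in cleaned:
        if row["amount"] is None:
            row["amount"] = fill
    return cleaned


cleaned = cleanse(raw)
for row in cleaned:
    print(row)
```

In practice each of these decisions (what counts as a duplicate, how to treat placeholders such as "n/a", whether to impute or discard) depends on the business problem, which is part of why this stage takes so long.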
On a daily basis, I use T-SQL to fetch and analyse the data, RStudio for data cleansing and profiling, PySpark to create data models, and SQL Server Integration Services (SSIS) for data extraction, transformation and loading. I develop SQL Server Reporting Services (SSRS) reports and create cubes from scratch using Analysis Services. I use Azure Cognitive Services to build intelligent algorithms into apps and bots, and Power BI for data visualisation, which brings the data to life.
At OpenSky Data Systems I am always challenged with interesting projects and encouraged to generate and develop ideas. By finding meaning and value in data and enabling users to interact with it, I can support a wide range of evidence-based decisions. Depending on the needs of our clients, my goal is always to make data more understandable, useful and accessible. I think that nowadays we are all suffering from information overload, and the same applies to businesses. By visualising data, I can find and concentrate on the information that needs attention. I can tell the story and draw everyone’s attention to the important patterns and connections between numbers that could otherwise be scattered across multiple reports.
If you are considering a career in data science, I believe Andrew Ng is one of the most influential people in the field, and his Machine Learning course on Coursera is a fantastic introduction to the topic (https://www.coursera.org/learn/machine-learning). If you are interested in data visualisation, I would recommend “Visualize This: The FlowingData Guide to Design, Visualization, and Statistics” by Nathan Yau, a practical guide built on real-world examples.