This week, I completed the first course of a specialisation cycle dedicated to Python for data analytics.
Throughout my career as a process engineer, I have constantly faced issues that had to be investigated and understood, most often using whatever data was available. Unfortunately, my experience is that only a tiny fraction of that data is actually uploaded to organised, well-structured databases ready to be queried. Raw data is generally not user-friendly and is often almost unreadable. Processes (especially those linked to metrology operations) generate a great deal of raw data, stored in multiple ways and formats. Hence, raw data treatment and preparation is a necessary step of data analysis and inference, but a tedious and low added-value one unless one has the right tools.
The first tool that comes to mind is the traditional Excel sheet, where data can be imported, filtered and analysed. It is very popular, widespread and versatile. Excel comes with a scripting language, VBA (Visual Basic for Applications), in which macros can be written to automate tasks. It is a very decent tool, which should be part of the data analytics survival kit when nothing else is available. For manipulating large datasets, however, a much better choice is JMP, a licensed statistical discovery software package from SAS with an intuitive, Excel-like interface. JMP provides features to transform and combine multiple datasets with hundreds of thousands of rows in the blink of an eye: summarising, concatenating, splitting, stacking and subsetting, among others. A number of advanced modules allow complex analysis. My favourite is the profiler, where multi-variable and multi-response trends are displayed immediately, together with the regression parameters of the underlying model, which is invaluable for experimental design. Repetitive tasks can be automated with an integrated scripting language, JSL, whose powerful macros are able to build fragments of code.
While JMP is a really great piece of software for data manipulation and on-the-fly analysis, its scripting language lacks portability: it is restricted to the JMP environment (which is licensed), and it essentially accepts only datasets as input.
A programming language like Python can address these shortcomings. It is a powerful high-level language that is easy to learn, general-purpose, open-source, free and portable. With Python, there is virtually no limitation other than hardware resources and programming skills. A very active community continuously develops advanced modules and libraries, allowing for high-productivity programming with endless potential applications. For all these reasons, I consider Python a smart choice as a natural extension of professional software specialised in data analytics, and my plan is to go through the full specialisation cycle.
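To give a flavour of the raw data preparation described above, here is a minimal sketch using only the Python standard library. Everything in it is assumed for illustration: the two semicolon-separated exports, the column names (`wafer`, `site`, `thickness_nm`), the helper `read_measurements` and the 100 nm threshold are invented, not taken from a real tool. It shows three of the table operations mentioned earlier (concatenating, subsetting and summarising) on a small scale.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Two hypothetical raw exports from a metrology tool (invented for illustration).
RAW_BATCH_A = """wafer;site;thickness_nm
W01;1;101.2
W01;2;99.8
W02;1;100.5
"""

RAW_BATCH_B = """wafer;site;thickness_nm
W03;1;98.9
W03;2;
W04;1;102.1
"""


def read_measurements(raw_text):
    """Parse one semicolon-separated export, skipping rows with no reading."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    rows = []
    for row in reader:
        if not row["thickness_nm"]:
            continue  # empty reading: drop the row
        rows.append({
            "wafer": row["wafer"],
            "site": int(row["site"]),
            "thickness_nm": float(row["thickness_nm"]),
        })
    return rows


# Concatenating: put both exports into a single table (a list of dicts).
table = read_measurements(RAW_BATCH_A) + read_measurements(RAW_BATCH_B)

# Subsetting: keep only the readings above an assumed 100 nm threshold.
thick = [r for r in table if r["thickness_nm"] > 100.0]

# Summarising: mean thickness per wafer.
by_wafer = defaultdict(list)
for r in table:
    by_wafer[r["wafer"]].append(r["thickness_nm"])
summary = {wafer: round(mean(values), 2) for wafer, values in by_wafer.items()}

print(summary)
```

For datasets with hundreds of thousands of rows, a dedicated library such as pandas would be the more usual choice. The standard library is used here only to keep the sketch self-contained.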