Python data structures

Yesterday, I completed another course of a specialisation cycle dedicated to Python for data analytics.

Python Data Structures by University of Michigan on Coursera. Certificate earned on October 1, 2016

While the first course dealt with the very basics (variables, conditionals, loops and functions), this course builds on them with data structures such as strings, files, lists, dictionaries and tuples. In general, there are multiple ways to perform a task on data, but only a few of them are simple and smart (“pythonic”). Selecting the right data structure is of utmost importance. Assimilating Python idioms takes a little time, but it is a fundamental step to build on, and it allows for very short and efficient code that can perform complex tasks.
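As an illustration of how the choice of data structure shapes the code, here is a minimal sketch of a word-frequency count that combines a dictionary (for counting) with a sorted list of tuples (for ranking). The function name `top_words` and the sample sentence are my own illustrative choices, not material from the course.

```python
def top_words(text, n=3):
    """Return the n most frequent words as (word, count) tuples."""
    counts = {}
    for word in text.lower().split():
        # dict.get supplies a default of 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical sample text, used only for illustration.
sample = "the quick fox and the lazy dog and the cat"
print(top_words(sample))
```

The standard library's `collections.Counter` offers a `most_common` method that does much the same job in one call, although it may order tied words differently.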

How to Create a Mind

Kurzweil's book, written in 2012

Our Universe exists because of its informational content. From pure physics to chemistry and biology, evolution proceeded from simple structures (atoms, carbon molecules…) to more complex ones (DNA, proteins…), and eventually to life! This evolution yielded nervous systems and finally the human brain, which is capable of hierarchical thinking. The neocortex is the central piece: it can work with patterns, associate symbols and link them together to give rise to knowledge as we know it. Technology is nothing but applied knowledge, made possible by humans' ability to manipulate objects and make tools. Reverse-engineering the brain to make thinking machines is probably the greatest project ever, one that can transcend humankind.

Even though this book is certainly not a genuine scientific work, it is full of optimistic insights and bewildering intuitions about the future. It is amazing to see how technological progress has transformed our societies over the last few decades, and we are possibly witnessing a major, unprecedented transition that is going to change humankind forever. According to Kurzweil, thinking machines able to compete with humans should appear by the 2030s. A natural consequence of LOAR (the Law of Accelerating Returns, a postulate stating that evolution accelerates as it grows in complexity and capability) is that humans and machines will meld together, and the limits of computing will probably be reached by the end of the century, giving rise to a deeply transformed society potentially able to colonise space and conquer new solar systems.

Getting started with Python

This week, I completed the first course of a specialisation cycle dedicated to Python for data analytics.

Programming for Everybody (Getting Started with Python) by University of Michigan on Coursera. Certificate earned on August 25, 2016

Throughout my career as a process engineer, I have constantly faced issues that had to be investigated and understood, most often using whatever data was available. Unfortunately, my experience is that only a tiny fraction of it is actually uploaded into organised, well-structured databases ready to be queried; raw data is in general not user-friendly at all and almost unreadable. Processes (especially those linked to metrology operations) generate a great deal of raw data, stored in multiple ways and formats. Hence, raw data treatment and preparation is a necessary step of data analysis and inference, but a tedious and low added-value one, unless one comes up with the right tools.

The first one that comes to mind is the traditional Excel sheet, where data can be imported, filtered and analysed. It is very popular, widespread and versatile. Excel comes with a scripting language, VBA (Visual Basic for Applications), in which macros can be written to automate tasks. It is a very decent tool, which should be part of the data analytics survival kit when nothing else is available. For manipulating huge datasets, however, a much better choice is JMP, a licensed statistical discovery software from SAS with an intuitive, Excel-like interface. JMP provides unique features to transform and combine multiple datasets with hundreds of thousands of rows in the blink of an eye, namely summarising, concatenating, splitting, stacking and subsetting… A number of advanced modules allow complex analysis, my favourite being the profiler, where multi-variable and multi-response trends can be displayed immediately together with the regression parameters of the underlying model, which is invaluable for experimental design. Repetitive tasks can be automated thanks to an integrated scripting language (namely JSL), with powerful macros able to build fragments of code.

While JMP is a really great piece of software for data manipulation and on-the-fly analysis, its scripting language lacks portability, is restricted to the JMP environment (which is licensed), and essentially accepts only datasets as input.

A programming language like Python can alleviate these shortcomings. It is a powerful high-level language that is easy to learn, universal, open-source, free and portable. With Python, there is virtually no limitation other than hardware resources and programming skills. A very active community has been continuously designing advanced modules and libraries, allowing for high-productivity programming with endless potential applications. For all these reasons, I consider Python a smart choice as a natural extension of professional software specialised in data analytics, and my plan is to go through the full specialisation cycle.

Course on cryptography

Today, I successfully completed a course on cryptography.

Cryptography I by Stanford University on Coursera. Certificate earned on August 21, 2016

Cryptography is the cornerstone of information security and modern communications, and I have used it on a daily basis throughout my life and career, most often without even being aware of it! Its applications are expanding at an unprecedented pace with the advent of Internet technologies. While encrypting data is anything but new (ciphers existed as far back as ancient times), the last few decades have transformed cryptography from an art into a genuine science. Formal definitions and assumptions are now rigorously established, from which ciphers can be constructed with mathematically proven security derived from algebra and number theory. Cryptography is a field of intense, active research focused on bullet-proofing existing protocols and creating new ones for new applications.

The exponential increase in computing performance has made many protocols obsolete. By 1999, the Data Encryption Standard (DES) had become notoriously insecure, its 56-bit key being within reach of brute-force attacks, and it had to be replaced by the Advanced Encryption Standard (AES). Some other encryption schemes were poorly designed, because the science of cryptography was not as advanced as it is today, or because the designers simply made mistakes. This was probably the case with Wired Equivalent Privacy (WEP), whose multiple weaknesses are now given to students as a good case study of what not to do. Besides design, implementation is equally important and can turn a provably secure cipher into a totally insecure protocol. Many examples exist in real life, such as padding oracle attacks on CBC-mode encryption.
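To give a sense of scale for the brute-force argument against the 56-bit DES key, here is a rough back-of-the-envelope sketch comparing the worst-case time to try every key for 56-bit and 128-bit key sizes. The search rate of 10^12 keys per second is a purely hypothetical figure chosen for illustration, not a measurement of any real machine, and the function names are my own.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keyspace(bits):
    """Number of distinct keys for a key of the given length in bits."""
    return 2 ** bits


def worst_case_years(bits, keys_per_second):
    """Years needed to try every key at a constant search rate."""
    return keyspace(bits) / keys_per_second / SECONDS_PER_YEAR


# Hypothetical attacker speed, assumed only for the sake of the comparison.
RATE = 1e12

for name, bits in [("DES", 56), ("AES-128", 128)]:
    years = worst_case_years(bits, RATE)
    print(f"{name}: 2^{bits} keys, about {years:.3g} years in the worst case")
```

Under this assumed rate, the 56-bit key space is exhausted in well under a day, whereas each additional key bit doubles the work, which puts a 128-bit key space far beyond exhaustive search at any comparable speed.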

In practice, the best advice for reliable encryption is to always use public, open-source and up-to-date crypto libraries from reliable and well-established providers. However, it is worth keeping in mind that the security of a cipher erodes over time, as computing performance and attacker skill both increase, which represents a real challenge for cryptographers. In fact, the right question when selecting an encryption scheme is not whether the cipher will be broken, but when, and whether that amount of time is acceptable for the application. Protecting a long-lived secret is more costly than protecting a short-lived one, and is not always needed. And the answer to the aforementioned question is only an estimate. The rise of a disruptive technology like quantum computing may expose existing secret documents much sooner than expected…