The tasks for next week will be to finally get around to studying three books: Machine Learning for Hackers (an introduction to R), Beginning CouchDB, and Think Bayes (a Python book).
First I need to set up a new virtual Linux application server to host CouchDB and MongoDB. Both are NoSQL document servers built around JSON documents. MongoDB's storage format is BSON, a binary encoding of JSON-like documents.
Having set up the server and tested that both NoSQL applications are running, I need to prepare some ETL for the data I will be using. This data is available on the web, but the format is mainly TXT (I could download Excel data, but where is the challenge in that?). The data is the historical results of the Italian Serie A from 1970 onwards. I also need to find the historical Totocalcio coupons (schedine) and ETL them as well.
Why this data? Well, when I lived in Italy, my flatmates and I in Via dei Vanni, next to the Ponte della Vittoria, used to enjoy Sunday afternoons at the bar on the corner of Via Pisana and Ponte Sospeso, or in the piazza opposite the bar. In the bar or the piazza, listening to a portable radio, we would check our Totocalcio coupon (la schedina) against the live score programme.
I am not a gambler. I played Totocalcio more for the social Sunday event than to actually win (and it only cost a thousand lire to play one line … meh, nothing). The reason I don’t gamble is that I understand probability. Back then, with a grasp of Bayesian inference (learnt from the Collins Dictionary of Statistics), I would approach it scientifically. The problem was that all I had was a calculator, very little data (I can’t remember the name of the Totocalcio weekly), and a pencil and paper. I never got more than ten right. A pity, otherwise I would still be there.
Now I have a 3rd Gen i7, 16 GB of RAM, various data mining tools and lots of data. I am doing this for fun, and nothing helps learning more than having fun. Besides, data mining, machine learning and analytics are all the rage in the market at the moment. I’ll get to financial analytics and non-intelligent and intelligent agents later (the financial ZIP algorithm and game theory … oh, I do like game theory).
The Goal: Simply put, it is to discover how predictable a Totocalcio coupon is and how many lines should be played, and along the way to learn new skills and reacquire ones neglected since 2000.
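To get a feel for the size of the problem, here is a minimal Python sketch. It assumes the classic coupon format of 13 matches, each marked 1, X or 2, and a naive baseline in which every sign is equally likely. Both are assumptions for illustration only (real matches are certainly not uniform, which is the whole point of the exercise), and the function name `p_at_least` is made up.

```python
from math import comb

N_MATCHES = 13   # assumption: classic 13-match coupon
P_GUESS = 1 / 3  # assumption: naive baseline, every sign equally likely

# Number of distinct single lines (one sign per match).
total_lines = 3 ** N_MATCHES


def p_at_least(k, n=N_MATCHES, p=P_GUESS):
    """Binomial probability of getting at least k of n signs right on one line."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(f"possible lines: {total_lines}")
print(f"P(all {N_MATCHES} right): {p_at_least(N_MATCHES):.2e}")
print(f"P(at least 10 right): {p_at_least(10):.4f}")
```

Any model fitted to the historical results would have to beat this uniform baseline before it could say anything useful about how many lines are worth playing.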
Having set everything up on the application server … I now need to do the ETL.
Basic Process of all ETL:
- Find the data source(s) (www.rsssf.com)
- Study the loosely structured TXT sources to extract patterns
- Using the extracted patterns, transform the TXT into an appropriate intermediate XML format (why XML … I might need it later)
- Add any additional semantic data to the intermediate format
- Convert the intermediate format for loading into CouchDB and MongoDB
- Load the data
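As a rough sketch, the steps above could look like this in Python using only the standard library. The line shape `Team A 2-1 Team B`, the regular expression, the XML element names and the `result_1x2` helper are all assumptions for illustration; the real TXT pages will need their own patterns, and the sample lines are made up, not real results. The final load step is left out because it depends on how the servers are set up.

```python
import json
import re
import xml.etree.ElementTree as ET

# Assumed line shape: "Home Team 2-1 Away Team" (hypothetical; real files vary).
RESULT_RE = re.compile(
    r"^\s*(?P<home>.+?)\s+(?P<hg>\d+)-(?P<ag>\d+)\s+(?P<away>.+?)\s*$"
)


def extract(lines):
    """Yield a dict for every line that matches the assumed result pattern."""
    for line in lines:
        m = RESULT_RE.match(line)
        if m:
            yield {
                "home": m.group("home"),
                "away": m.group("away"),
                "home_goals": int(m.group("hg")),
                "away_goals": int(m.group("ag")),
            }


def result_1x2(home_goals, away_goals):
    """Semantic enrichment: coupon sign (1 = home win, X = draw, 2 = away win)."""
    if home_goals > away_goals:
        return "1"
    if home_goals < away_goals:
        return "2"
    return "X"


def to_xml(matches, season):
    """Build the intermediate XML tree (element names are made up for this sketch)."""
    root = ET.Element("season", {"year": season})
    for m in matches:
        sign = result_1x2(m["home_goals"], m["away_goals"])
        el = ET.SubElement(root, "match", {"sign": sign})
        ET.SubElement(el, "home", {"goals": str(m["home_goals"])}).text = m["home"]
        ET.SubElement(el, "away", {"goals": str(m["away_goals"])}).text = m["away"]
    return root


def to_documents(root):
    """Convert the intermediate XML into JSON-ready dicts, one document per match."""
    docs = []
    for el in root.findall("match"):
        home, away = el.find("home"), el.find("away")
        docs.append({
            "season": root.get("year"),
            "home": home.text,
            "away": away.text,
            "home_goals": int(home.get("goals")),
            "away_goals": int(away.get("goals")),
            "sign": el.get("sign"),
        })
    return docs


# Made-up sample lines; the header line is skipped because it does not match.
sample = ["Round 1", "Team A 2-1 Team B", "Team C 0-0 Team D"]
docs = to_documents(to_xml(extract(sample), "1970-71"))
print(json.dumps(docs, indent=2))
```

Keeping one document per match means the same JSON can be sent to either database, and the XML stays around as a neutral format in case it is needed later.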
The data environment is ready.
The ability and the thirst to learn … why are they not recognised? On personality tests I rank hugely on intuition, 10/10 … experts are intuitive … what does that say? I rank just as high on risk taking … are they linked?