The challenge of becoming focused.
Over the years I’ve formed a habit of exploring music, MOOCs, math, economics, news, basic computing, data science and now blogging.
Many successful entrepreneurs have said that if you're aiming to become highly successful at something, you should focus entirely on that one talent until you've mastered it.
So my behavior appears to be in direct opposition to this law of success. But not having attained a reputation as a world expert in anything so far, I can at least say with some certainty that I validate the corollary to that bit of wisdom.
I would like to become more successful than I am at present, and if those words of wisdom really do contain life's success formula, then I'll need to commit to becoming an expert in a skill that will ideally be in demand once I've obtained it.
Based on my interests and recent experience, I think the field of data science would be a wise choice for many reasons, two of which I'll briefly mention.
The first is that I have enjoyed the exploration of computing and data science in the related courses I've enrolled in and completed so far.
The second is that, according to labor statistics (if these can be trusted), the demand for data science skills now exceeds their supply and is expected to continue to do so for the near future.
The challenge for me, though, is to maintain a consistent focus on developing data science skills until I eventually own the extraordinarily high level of expertise these successful entrepreneurs speak of.
I have at least begun the journey and so now the challenge will be to stay focused.
61115 or 15 Nov 2016
It's been two months since I began this post. Since then, I have reviewed a course I completed earlier, the most difficult one to understand, in order to achieve a better understanding of it. Following that, I previewed the lectures for the next course, Regression Models, and earlier today I enrolled in it, expecting to complete it within a month.
So far I have maintained my focus on learning the fundamentals of data science, bringing me a step closer to completing the specialization I began last November.
I have not yet lost focus on what I set out to do more than a year ago, and completing this goal by mid-2017 should easily be within reach if all goes according to plan. More updates to follow!
61127 or 26 Nov 2016
Making progress: I'm approaching 50% completion in the Regression Models course. I just need to complete the third quiz, begin the peer-graded course project, and then complete both the project and the fourth quiz by the course closing date on 61219. These are quite powerful statistical tools to have available, especially the multivariate regression that was just taught. I can see why these skills could be in demand by many different organizations, including my own.
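As a small illustration of the kind of regression with several predictors that the course covers, here is a minimal sketch in base R; the built-in mtcars dataset and my choice of predictors are illustrative, not from the course.

```r
# Multivariable linear regression: fuel economy (mpg) modeled on
# weight (wt, in 1000s of lbs) and horsepower (hp).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients: an intercept plus one slope per predictor.
coef(fit)

# Predicted fuel economy for a hypothetical 3,000 lb, 110 hp car.
predict(fit, newdata = data.frame(wt = 3.0, hp = 110))
```

The same `lm` interface extends to any number of predictors, which is what makes it such a broadly applicable tool.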
Just a quick update: somehow I managed to complete the Regression Models course and subsequently enrolled in, and also completed, the course in Practical Machine Learning.
Now I am progressing through the final course, Developing Data Products, before moving on to the specialization “Capstone Project”.
Today I am working on the course project, due toward the end of the fourth week of Developing Data Products.
This journey has at times been extremely frustrating, but for the most part it has been highly rewarding, as the skills I've been learning in this specialization can, I think, be applied to just about any type of company or process.
In the time since I last posted, I have been to the US and visited with my family in Texas, and I have been gaining other complementary new skills along the way.
I began the Capstone project a few days ago and so far have found it very interesting, providing lots of new knowledge about computing, linguistics, and text data analysis.
2017-04-22 Continuing the focus
Although the course does not officially begin until a week from today, I've already accomplished the project setup: downloading the data zip files, initial script writing (on the first half of the project), data pre-processing (sampling the data), and data cleaning.
I'll be applying NLP (natural language processing) methods to a corpus (a collection) of text data files drawn from Twitter, blogs (similar to this one), and news sources. The data were provided for this Capstone project by the online Data Science Specialization offered through Coursera.org and developed by the Biostatistics Department of the Johns Hopkins University Bloomberg School of Public Health together with the SwiftKey corporation.
Considerable preliminary research on Natural Language Processing was necessary in order to understand the overall goals of this project.
Additionally, new processes, R packages, and techniques, along with topics that had been briefly covered earlier in the specialization, have been combined, hopefully in a logical sequence, to effectively manage the significantly large text files (approximately 4 million lines of text, or roughly 0.5 GB) being used for this project.
Currently, I am working on an aspect of NLP called tokenization, a process for splitting a text document into smaller components (i.e., sentences, words, or characters). I am then using several methods for transforming those text documents into what is called a term-document matrix (TDM) or document-term matrix (DTM). These matrices can be further analyzed with TF-IDF (term frequency - inverse document frequency), which weights each term by how frequently it occurs within a document (term frequency), scaled by how rare it is across the corpus (inverse document frequency: the logarithm of the number of documents divided by the number of documents containing the term). Terms that appear in nearly every document therefore carry little weight, while distinctive terms stand out.
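To make these ideas concrete, here is a minimal base-R sketch (the project itself relies on dedicated R packages, but the arithmetic is the same) that tokenizes three tiny documents, builds a term-document matrix, and computes TF-IDF weights.

```r
docs <- c("the cat sat", "the dog sat", "the cat ran")

# Tokenize each document into words.
tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per term, one column per document.
tdm <- sapply(tokens, function(doc) table(factor(doc, levels = terms)))

# Term frequency: counts normalized by document length.
tf <- t(t(tdm) / colSums(tdm))

# Inverse document frequency: log(N / number of docs containing the term).
idf <- log(ncol(tdm) / rowSums(tdm > 0))

# TF-IDF: each row of tf scaled by that term's idf.
tfidf <- tf * idf
round(tfidf, 3)
```

Note that "the" appears in every document, so its IDF (and hence its TF-IDF weight) is exactly zero, which is the whole point of the weighting.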
The term frequencies, calculated using rowSums over the TDM, can be used to produce visual representations such as bar plots or a word cloud (a word-frequency visualization tool), which highlight the impact of Zipf's law on word frequencies (up to roughly 1,000 unique words).
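A sketch of that frequency step in base R, using a toy term-document matrix in place of the real one:

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(4, 2, 1,
                3, 1, 0,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("the", "cat", "ran"), paste0("doc", 1:3)))

# Overall term frequencies across the corpus, via rowSums as described above.
freq <- sort(rowSums(tdm), decreasing = TRUE)

# Bar plot of the most frequent terms; with real data, the steep drop-off
# from rank to rank is Zipf's law in action.
barplot(freq, las = 2, main = "Term frequencies")

# The wordcloud package offers wordcloud(names(freq), freq) for the same data.
```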
Statistical predictive modeling algorithms such as Katz's back-off model, Kneser-Ney smoothing, hidden Markov models, the Viterbi algorithm, the forward-backward algorithm, the bag-of-words model, and naive Bayes are all valuable to some degree in conducting text mining or analysis.
The naive Bayes algorithm turns out to be a good model for building a simple word prediction application, not only according to the course suggestions but also based on further investigation into the whole process of natural language processing, or text mining.
The next step I'll be working on is partitioning the roughly 0.5 million lines of sample text, which I've sampled from three corpora (Twitter, blogs, and news) of roughly 4 million total lines, into three subsets: training, validation, and test, each combining elements of the three text types. These new corpora will be cleaned and then have the machine learning algorithms applied to them to generate the best predictive model attainable with this particular data. Several different predictive models will be tested and compared against one another for accuracy and speed.
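The partitioning step can be sketched in base R as follows; the 60/20/20 split proportions here are my own assumption for illustration, not a course requirement.

```r
set.seed(1234)                       # reproducible sampling
lines <- sprintf("line %d", 1:1000)  # stand-in for the sampled text lines

# Assign each line to one of the three partitions.
part <- sample(c("training", "validation", "test"),
               length(lines), replace = TRUE,
               prob = c(0.6, 0.2, 0.2))

training   <- lines[part == "training"]
validation <- lines[part == "validation"]
test       <- lines[part == "test"]

table(part)   # rough sizes of the three partitions
```

Fitting on `training`, tuning on `validation`, and reporting accuracy only on `test` keeps the final comparison of models honest.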
Beyond this, the project requires the development of an interactive Shiny web application, built from the code I am writing locally on my MacBook Pro in the R language.
A Shiny application is a web application for which the developer (me, in this case) writes a two-part R program on a local machine: a user-interface script called ui.R and a corresponding server script called server.R. Copies of these scripts reside on RStudio's Shiny application server but are linked to the original code in my local RStudio IDE.
The idea is that a user provides input (a three- to four-word phrase) to the web application's user interface (a webpage) hosted on RStudio's Shiny application server. The ui.R code passes the input to the corresponding server.R script, which applies the predictive algorithms that calculate the next-word prediction and returns it to the user via the webpage interface.
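A minimal sketch of that two-part structure, assuming the shiny package is installed; the prediction logic here is a trivial placeholder (it just echoes the last word typed), standing in for the real algorithm.

```r
library(shiny)

# Placeholder predictor: returns the last word of the phrase, standing in
# for the real next-word prediction algorithm.
predictNext <- function(phrase) {
  words <- strsplit(trimws(phrase), "\\s+")[[1]]
  if (length(words) == 0) "" else tail(words, 1)
}

# ui.R side: the user enters a phrase and sees the prediction.
ui <- fluidPage(
  textInput("phrase", "Enter a three- to four-word phrase:"),
  textOutput("prediction")
)

# server.R side: reacts to the input and returns the predicted word.
server <- function(input, output) {
  output$prediction <- renderText(predictNext(input$phrase))
}

app <- shinyApp(ui = ui, server = server)  # runApp(app) launches it locally
```

The same two scripts, deployed to RStudio's Shiny server, give the hosted webpage described above.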
I am writing this project code in the R programming language and will be recording it step by step with the version control application Git.
The Git workflow has two parts: Git and GitHub. Git is the local (developer) portion, which operates on the developer's computer; GitHub, the other half, is a hosting service on the internet where copies of Git repositories are stored and shared.
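A sketch of what recording a step in that workflow looks like on the command line; the scratch directory, file name, and the commented-out remote URL are placeholders, not my actual repository.

```shell
repo=$(mktemp -d)                           # scratch directory for this demo
cd "$repo"
git init -q .                               # create a local Git repository
git config user.email "demo@example.com"    # identity needed to commit
git config user.name  "Demo"

echo 'sampleRate <- 0.05' > sample.R        # a small R script to track
git add sample.R
git commit -q -m "Add sampling parameter script"
git log --oneline                           # shows the recorded step

# Publishing to GitHub would then be, e.g.:
# git remote add origin https://github.com/<user>/<repo>.git
# git push -u origin master
```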
The final task of this project is to develop a pitch slide deck, a marketing presentation describing the project: what it is used for, how it is used, how it benefits the user, and what future improvements, if any, could be made to the word prediction application.
In summary, it’s a challenging project requiring considerable time and work.
2017-05-06 Continuing the focus
This is my latest update on the progress I have been making on this Capstone project. After going back and forth several times, I've arrived once again at the point where I'll next be creating the infamous n-gram tokens. These are perhaps the most important elements of this kind of project; without them, making any kind of next-word prediction would seem rather futile, or even impossible.
I am very thankful to have arrived at this point again, for about the fourth time. This time around, I think I am beginning to have a clue as to what it is I am doing.
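For the record, here is a minimal base-R sketch of what building n-gram tokens involves; the real pipeline works over the cleaned corpora, but the idea is the same.

```r
# Split a cleaned line of text into n-gram tokens: every run of n
# consecutive words becomes one token.
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams("The quick brown fox", 2)
# "the quick"  "quick brown"  "brown fox"
```

Counting how often each n-gram occurs across the corpus is what ultimately feeds the next-word prediction.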