Research Program Summary
In July and August, I had the honor of participating in a research-based program, where I had the opportunity to learn about data analysis and machine learning. The following summary is a detailed report of the program. I learned many things from it, and it will be an extremely valuable experience in my life.
At the start of the first week, we had a brief introduction to our fellow program participants: a high school senior from Los Angeles, a high school sophomore from Minnesota, and the mentor of our program, Dr. C, who holds a PhD from UC Berkeley. In the first meeting, Dr. C explained the basic definition of machine learning and the project we were going to do. The project was a Kaggle competition in which we needed to create a model that predicts the error in real estate price estimates. At first glance the project seemed like mission impossible, but Dr. C already had a plan to help us accomplish this goal.
Here is the link to our project and schedule:
Our first assignment was to read Machine Learning Basics (Chapter 5) and to set up the Python environment, since Python is the programming language we used for the project.
After the setup, Dr. C walked us through the procedure for building the prediction model. First we needed to import the necessary modules: numpy, scikit-learn, matplotlib, pandas, and csv (for working with CSV files). Importing them lets us use the functions that are pre-built in these modules.
The first step was to import the data. Once the CSV file was loaded, we could proceed to clean the data. Data cleaning matters because it ensures the accuracy of our data: we want the model to give valid predictions, which requires a good training set for the computer to learn the function from.
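To give an idea of what loading and cleaning can look like, here is a minimal sketch using pandas. The column names and the tiny inline CSV are made up for illustration, and the cleaning rules (dropping duplicates, dropping rows without a target, filling gaps with the median) are only common choices, not necessarily the ones we used on the competition data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a competition CSV; real column names differ.
raw_csv = io.StringIO(
    "parcel_id,sqft,bedrooms,log_error\n"
    "1,1500,3,0.02\n"
    "2,,2,-0.01\n"
    "3,2100,4,\n"
    "3,2100,4,\n"
    "4,1800,,0.05\n"
)

df = pd.read_csv(raw_csv)

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Rows without a target value cannot be used for training.
df = df.dropna(subset=["log_error"])

# Fill missing feature values with each column's median.
feature_cols = ["sqft", "bedrooms"]
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())

print(df)
```

With a real file, the `io.StringIO` object would be replaced by the path to the CSV.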
The process of data cleaning was not easy for us, because we did not have enough background knowledge or experience with Python and machine learning. With the help of Dr. C, we managed to overcome the obstacles and finish the data cleaning. Here is part of the data provided by the Zillow Prize competition:
After the data cleaning, we needed to import a model or algorithm to help the computer create a predictive function. The first one we applied was the linear regression model from scikit-learn. Linear regression uses the whole training set to fit a linear function that maps the x inputs to the y outputs. Here is what the code looks like:
from sklearn.linear_model import LinearRegression
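To show how that import is used, here is a small sketch of fitting a linear regression. The data is synthetic (generated with numpy) so the example runs on its own; it is not the competition data, and the variable names are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression()
model.fit(X, y)  # learn one coefficient per feature and an intercept
predictions = model.predict(X)

print(model.coef_, model.intercept_)
```

The fitted coefficients should come out close to the values used to generate the data (3, -2, and an intercept of 1), which is a quick way to check that the model learned the linear function.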
After the computer produced the results, we put them into CSV format and submitted them to the competition:
The ranking for my submission using linear regression was not that great: it placed me around 1250th, and that is where boosted trees come in. Note that many of these algorithms and functions are based on mathematical principles, and Dr. C taught us the math behind machine learning.
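Writing predictions to CSV can be sketched like this with pandas. The ids, the prediction values, and the column names `id` and `prediction` are hypothetical; a real submission has to follow whatever column layout the competition requires.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical ids and predictions, for illustration only.
ids = [101, 102, 103]
preds = np.array([0.0123, -0.0045, 0.0300])

submission = pd.DataFrame({"id": ids, "prediction": preds})

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False, float_format="%.4f")
csv_text = buffer.getvalue()
print(csv_text)
```

Setting `index=False` keeps pandas from adding its own row numbers as an extra column.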
Boosted trees is a machine learning technique for regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable cost function.
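The stage-wise idea can be sketched by hand for the simplest case, squared error, where each new small tree is fitted to the residuals left by the trees before it. This is only a teaching sketch on synthetic data using scikit-learn's `DecisionTreeRegressor`; it is not how xgboost is implemented, and the settings (50 stages, depth 2, learning rate 0.1) are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_stages = 50

# Stage 0: start from a constant prediction (the mean of y).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_stages):
    # For squared error, the negative gradient is just the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)
    # Add a damped copy of the new weak learner to the ensemble.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

Each tree alone is a weak model, but the training error drops as more stages are added, which is the point of the ensemble.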
The boosted tree algorithm that we first adopted was xgboost, short for “Extreme Gradient Boosting”. We believed that using xgboost instead of linear regression would give us a better model.
To use xgboost, there are a couple of hyperparameters that we needed to tune: learning rate (the size of the steps the computer takes when performing gradient descent), max depth (which controls overfitting), and subsample (the fraction of observations that are randomly sampled, which also helps control overfitting or underfitting). To find the best hyperparameters, another concept needs to be applied: grid search and random search.
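Here is a sketch of where these three hyperparameters go when a boosted tree model is created. To keep the example runnable with scikit-learn alone, it uses `GradientBoostingRegressor` as a stand-in for xgboost; xgboost's `XGBRegressor` takes parameters with the same names (`learning_rate`, `max_depth`, `subsample`). The data is synthetic and the values are just reasonable starting points, not the ones we ended up with.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    learning_rate=0.1,  # how much each new tree contributes
    max_depth=3,        # limits tree size, which limits overfitting
    subsample=0.8,      # fraction of rows sampled for each tree
    n_estimators=100,
    random_state=0,
)
model.fit(X_train, y_train)
val_score = model.score(X_val, y_val)  # R^2 on held-out data
print(val_score)
```

Scoring on held-out data rather than the training data is what makes overfitting visible when these values are changed.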
Put simply, grid search and random search are algorithms that help us identify the best hyperparameters to use. Here is what a grid search looks like:
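A grid search can be sketched with scikit-learn's `GridSearchCV`, which trains a model for every combination in the grid and keeps the one with the best cross-validated score. The grid values below are hypothetical, the data is synthetic, and `GradientBoostingRegressor` again stands in for xgboost so that the example needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)  # trains one model per combination per fold

best_params = search.best_params_
print(best_params)
```

Because every combination is tried, the cost grows quickly as the grid gets larger. scikit-learn's `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all.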
Using grid search, we were able to find the best parameters. When I applied xgboost as my model, my ranking went from around 1200 to around 700, a significant change that came from changing the model.
Currently the model I’m using is LightGBM, recommended by Dr. C, which is more efficient and accurate than xgboost. The process is not easy for us beginners, but I believe there are a lot of things we can learn from it that will be handy to us in the future.
Afterthoughts and Conclusion.
Before joining this program, I had basically zero idea about research-based computer science. Neither the atmosphere nor the learning materials were something I had ever had the opportunity to experience before. I truly enjoyed these several weeks of brainstorming and coding, and I believe I have become better at problem solving and analysis. Here I have to give special thanks to Dr. C, who guided us through the entire program and helped us step by step whenever we were in trouble. After this program, I have become more determined to major in computer science or even computer science and engineering. I hope I can continue to develop my programming skills and grow intellectually as I learn more about machine learning. Participating in this program definitely expanded my perspective on the computing and algorithms that modern technology is capable of, and it also deepened my understanding of STEM research and of carrying out advanced projects.
Finally, I want to thank my fellow participants, who helped me whenever I ran into confusion and problems, and Dr. C for teaching us the basics of machine learning. Pictures that commemorate the program: