An American High School Student's Research Journey at UC Berkeley

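  MA is an 11th-grade student at a high school in San Francisco, California, who dreams of studying computer science at a top U.S. university. To that end, MA took part in university research to build an academic background, broaden horizons, and gain real knowledge. Over the summer, MA got the opportunity to do research in computer science at the University of California, Berkeley.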

  Research Program Summary

  In August (July), I had the honor of participating in a research-based program, where I had the opportunity to learn about data analysis and machine learning. The summary below is a detailed report on the program. What I can say is that I learned many things from it, and it will be an extremely valuable experience in my life.

  STEP ONE

  At the start of the first week, we were briefly introduced to our fellow program participants: a high school senior from Los Angeles, a high school sophomore from Minnesota, and the mentor of our program, Dr. C, a PhD from UC Berkeley. In the first meeting, Dr. C explained the basic definition of machine learning and the project we were going to do. The project was a Kaggle competition in which we needed to create a model that predicts the error in real estate price estimates. At first glance, the project seemed like mission impossible, but Dr. C already had a plan to help us accomplish this goal.

  Here is the link to our project and schedule:

  Our assignment was to read up on machine learning basics (Chapter 5) and to set up a Python environment. (Python is the programming language we used for our project.)

  After the setup, Dr. C walked us through the procedure for building the prediction model. First, we needed to import the necessary modules: NumPy, scikit-learn, matplotlib, pandas, and csv. These let us use the functions that are pre-built in them.

  The first thing we needed to do was import the data. Once the CSV was imported, we could proceed to clean the data. Data cleaning matters because it ensures the accuracy of our data. When we load the data into our model, we want the model to give valid predictions, which requires a good training set for the computer to learn the function from.
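  As a rough illustration of what importing and cleaning a CSV can look like, here is a minimal sketch using pandas. It is not the code from the program: the column names and values are made up, and the cleaning rules (drop duplicates, drop rows without a target, fill missing features with the median) are just one common set of choices.

```python
import io

import pandas as pd

# Hypothetical stand-in for the competition CSV; column names and values are invented.
raw_csv = io.StringIO(
    "parcel_id,square_feet,bedrooms,log_error\n"
    "1,1500,3,0.02\n"
    "2,,2,-0.01\n"
    "2,,2,-0.01\n"
    "3,2100,4,\n"
    "4,1800,,0.05\n"
)

df = pd.read_csv(raw_csv)

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Rows without a target value cannot be used for training.
df = df.dropna(subset=["log_error"])

# Fill missing feature values with each feature column's median.
feature_cols = ["square_feet", "bedrooms"]
df = df.fillna(df[feature_cols].median())

print(df)
```

  In a real project the file would be read with `pd.read_csv("some_file.csv")` instead of an in-memory string, and the right cleaning rules depend on the data.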

  The process of data cleaning was not easy for us, because we did not have enough background knowledge or experience with Python and machine learning. With Dr. C's help, we managed to overcome the obstacles and finish the data cleaning. Here is part of the data provided by Zillow:

  STEP TWO

  After the data cleaning process, we needed to import the model or algorithm that helps the computer create a predictive function. The first one we applied was the linear regression model from scikit-learn. Given the whole training set, linear regression fits a linear function that maps the x inputs to the y outputs. Here is what the code looks like:

  from sklearn.linear_model import LinearRegression
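  The import above is only the first step. A minimal sketch of fitting and using the model might look like the following; the data here is synthetic (roughly y = 3x + 2), not the competition data, and the variable names are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: y is roughly 3*x + 2 plus a little noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 1))
y_train = 3.0 * X_train[:, 0] + 2.0 + rng.normal(0, 0.1, size=50)

# Fit a straight line to the training inputs and outputs.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict outputs for inputs the model has not seen.
X_new = np.array([[4.0], [7.5]])
predictions = model.predict(X_new)

print(model.coef_, model.intercept_, predictions)
```

  The fitted slope and intercept should come out close to 3 and 2, which is a quick way to check that the model learned the function behind the data.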

  After the computer produced the results, we put them into CSV format and submitted them to the competition:
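  For readers who have not written a CSV from Python before, here is a small sketch using the standard csv module. The IDs, the prediction values, and the column headers are hypothetical; a real competition defines its own submission format.

```python
import csv
import io

# Hypothetical IDs and model predictions, for illustration only.
parcel_ids = [101, 102, 103]
predictions = [0.0123, -0.0045, 0.0310]

# An in-memory buffer stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ParcelId", "Prediction"])  # assumed header names
for parcel_id, prediction in zip(parcel_ids, predictions):
    writer.writerow([parcel_id, f"{prediction:.4f}"])

submission_text = buffer.getvalue()
print(submission_text)
```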

  Well, the ranking for my submission using linear regression was not that great: it placed me around 1250th. That is where boosted trees come in. Note that many of these algorithms and functions are based on mathematical principles, and Dr. C taught us the math behind machine learning.

  Boosted trees are a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable cost function.
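  To make the stage-wise idea concrete, here is a toy sketch of gradient boosting for squared error, written from scratch with NumPy. It is a simplification and not how xgboost or any real library is implemented: each weak model is a one-split tree (a "stump") on a single feature, and every function name here is made up for the example.

```python
import numpy as np


def fit_stump(x, target):
    """Return (threshold, left_value, right_value) for the best single split."""
    best = None
    # Skip the largest value so both sides of the split are non-empty.
    for threshold in np.unique(x)[:-1]:
        left, right = target[x <= threshold], target[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def fit_boosted_stumps(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full(y.shape, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    prediction = np.full(x.shape, base, dtype=float)
    for stump in stumps:
        prediction = prediction + learning_rate * predict_stump(stump, x)
    return prediction


x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
model = fit_boosted_stumps(x, y)

baseline_mse = np.mean((y - y.mean()) ** 2)
boosted_mse = np.mean((y - predict_boosted(model, x)) ** 2)
print(baseline_mse, boosted_mse)
```

  Each stump on its own is a weak predictor, but adding many of them, each correcting what the previous ones got wrong, brings the training error well below that of simply predicting the mean.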

  The boosted tree algorithm that we first adopted was xgboost, short for “Extreme Gradient Boosting”. We believed that if we used xgboost instead of linear regression, we could get a better model.

  To use xgboost, there are a couple of hyperparameters that we need to tune: learning rate (the size of the steps the computer takes when performing gradient descent), max depth (which controls overfitting), and subsample (the fraction of observations that are randomly sampled, which also controls overfitting or underfitting). To find the best hyperparameters, another concept needs to be applied: grid search and random search.

  Put simply, grid search and random search are algorithms that help us identify the best hyperparameters to use. Here is what a grid search looks like:
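  As an illustration only, and not the code used in the program, the sketch below runs a small grid search with scikit-learn's GridSearchCV. To keep it dependent on scikit-learn alone, it swaps in GradientBoostingRegressor in place of xgboost; that estimator happens to expose the same three hyperparameter names. The data is synthetic and the grid values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=120)

# Every combination of these values is tried: 2 * 2 * 2 = 8 candidates.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so the error is negated
    cv=3,  # 3-fold cross-validation for each candidate
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

  Grid search tries every combination, so its cost grows quickly with the number of values per hyperparameter. Random search instead samples a fixed number of combinations, which is often cheaper when the grid is large.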

  By using grid search, we were able to find the best parameters. When I applied xgboost as my model, my ranking went from around 1200 to around 700. We can see the significant change that came from changing the model.

  Currently the model I’m using is LightGBM, recommended by Dr. C; it is more efficient and accurate than xgboost. The process is not easy for us beginners, but I believe there are a lot of things we can learn from it that will come in handy for us in the future.

  Afterthoughts and Conclusion.

  Before joining this program, I had basically no idea about research-based computer science. Neither the atmosphere nor the learning materials were anything I had had the opportunity to experience before. I truly enjoyed these several weeks of brainstorming and coding, and I believe I have become better at problem solving and analysis. Here I have to give special thanks to Dr. C, who guided us through the entire program and helped us step by step whenever we were in trouble. After this program, I am more determined to major in computer science, or even computer science and engineering. I hope I can continue to develop my programming skills and grow intellectually as I learn more about machine learning. Participating in this program definitely expanded my perspective on the computing and algorithms that modern technology is capable of, and it also deepened my understanding of STEM research and of carrying out advanced projects.

  To close this note, I want to thank my fellow participants, who helped me whenever I ran into confusion and problems, and Dr. C for teaching us the basics of machine learning. Pictures that commemorate the program.

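  Selection Process

  1. Submit the application form

  2. Merit-based interview with the research team

  3. After passing the interview, an admission confirmation letter is sent

  4. Arrange flights, airport pickup, and accommodation

  5. Travel to the U.S. and begin the research

  6. Receive a mentor's recommendation letter, a research certificate, and a richer CV and PS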
