Big Mountain Data Spring 2014 Sessions


Big Data Overview / Full Life Cycle of Big Data

Elements of Big Data Architecture

What does Big Data architecture look like? When thinking about a strategic data management roadmap, where would you start? How would you leverage existing data sources and combine them with so-called “Big Data” technologies? How do you derive value from a Big Data solution? These are some of the fundamental questions that we will answer as we present a Reference Architecture for Big Data. We will explore the key elements that any big (or small) data management effort should address when embarking on a Big Data solution. This session is intended to paint the bigger picture that can help you navigate the Big Data landscape and guide your focus in the deep-dive sessions during the conference.

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: Ryan Plant


Getting Value from Big Data: A Deep Dive into Analytic & Visualization Tools (Tableau, D3, Microsoft Power View)

A fast-paced survey of the analytic and visualization tools available for deriving meaning from Big Data sources.

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: Norman Warren


Hadoop, an elephant you can actually eat

Seba Jean-Baptiste and Craig Brown will co-present this topic. It's great that you can now process all of your data in a "Big Data" way, but then what do you do with it? In this session, we'll explore some of the ways that you can present your results to a consuming audience. The focus will be on what is termed the "Speed Layer" of the Lambda Architecture. The speed layer is responsible for combining incoming, real-time data with older data that has previously been processed. Put another way, it combines the "now" with the "then" to produce a final result.
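
For a concrete (if simplified) picture of that merge, here is a minimal sketch; it is our illustration rather than material from the talk, and every name and number in it is hypothetical:

    # Minimal serving-layer sketch: combine the "then" (batch view) with
    # the "now" (real-time view). All names and numbers are hypothetical.
    batch_view = {"page_a": 1000, "page_b": 250}  # precomputed by the batch layer
    realtime_view = {"page_a": 12, "page_c": 3}   # arrived since the last batch run

    def page_views(page):
        """Final result = batch result plus the real-time delta."""
        return batch_view.get(page, 0) + realtime_view.get(page, 0)

    print(page_views("page_a"))  # 1012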

Level 300 - (Intermediate): Basic knowledge of subject matter is suggested
Duration: Hour
Presenter: Craig Brown


Integrating a RDBMS with Hadoop and Big Data Technologies

Our company built a system combining Big Data technologies (Hadoop/Elasticsearch) with SQL Server/RDBMS to produce a system that is both highly scalable and cost effective. In this session I’ll walk you through the ETL process of pulling data in through Sqoop, transforming it in Hive, and presenting a denormalized table in Hive. If you are looking to understand how to get data from a relational database (RDBMS) into Hadoop while leveraging its parallel architecture, this is the session for you.
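
As a hedged sketch of what such a pipeline can look like in practice, a small Python driver might invoke the Sqoop and Hive command-line tools; the connection string, credentials, tables, and columns below are hypothetical placeholders, not details from the talk:

    # Hedged sketch of the flow described above: import an RDBMS table with
    # Sqoop, then build a denormalized table in Hive.
    import subprocess

    # 1) Pull a SQL Server table into Hive via Sqoop (Sqoop runs MapReduce
    #    under the hood, so the import itself is parallelized).
    subprocess.check_call([
        "sqoop", "import",
        "--connect", "jdbc:sqlserver://dbhost:1433;databaseName=sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",
        "--table", "orders",
        "--hive-import", "--hive-table", "staging_orders",
    ])

    # 2) Transform in Hive: join staged tables (customers imported the same
    #    way) into one wide, denormalized table for analysis.
    subprocess.check_call(["hive", "-e", """
        CREATE TABLE orders_denorm AS
        SELECT o.order_id, o.order_date, c.customer_name, c.region, o.total
        FROM staging_orders o
        JOIN staging_customers c ON o.customer_id = c.customer_id
    """])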

Level 100 - Introduction
Duration: Hour
Presenter: Pat Wright


Getting Started With R

R is a statistical programming language with a large and growing developer community. I'll walk through some core features of the language, the powerful RStudio IDE, and deploying web-based visualizations built with R.

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: Jowanza Joseph


Introduction to Hadoop

Hadoop is the most popular Big Data platform, and with good reason. Using a technique called MapReduce, Hadoop can process huge amounts of data in times that traditional databases can only dream of. In this presentation, we'll start with a high-level overview of Hadoop, and then take a look at what MapReduce is, how it works, and how it can solve a surprisingly wide range of problems in ways very different from traditional databases. We'll finish with an overview of the larger Hadoop ecosystem, covering several popular tools including Hive, HBase, Sqoop, and others. If you're trying to understand what all the hype is about, this is the place to be.
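
To give a feel for the programming model, here is the canonical word-count example written as a Hadoop Streaming-style mapper and reducer in Python (our illustration, not material from the talk):

    # Canonical MapReduce example: word count. Hadoop sorts the mapper
    # output by key before the reduce phase; sorted() stands in for that
    # shuffle step here so the script also runs locally on stdin.
    import sys
    from itertools import groupby

    def mapper(lines):
        """Map: emit a (word, 1) pair for every word seen."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        """Reduce: sum the counts for each word (input sorted by key)."""
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        for word, total in reducer(sorted(mapper(sys.stdin))):
            print(f"{word}\t{total}")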

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: Ian Robertson


Big Data Logistics with Apache Kafka

How do you get data, possibly a large amount of it, from a to b? Then b to c? And ultimately to n? Is it fast? Is it reliable and consistent? Does the downstream data consumers' ability to consume affect the upstream data providers' ability to provide? These are some of the questions that Apache Kafka can help you answer when designing a high-throughput, low-latency data distribution infrastructure. In this session, we will run a crash course through the basics of Apache Kafka and walk through some scenarios and architectural approaches to enabling a flexible and reliable data distribution bus.
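
As a minimal sketch of that decoupling, here is a producer/consumer pair written with the third-party kafka-python client; the client choice, topic name, and broker address are our assumptions, not the talk's:

    # Minimal producer/consumer pair using the kafka-python client (our
    # choice); the topic name and broker address are placeholders.
    from kafka import KafkaProducer, KafkaConsumer

    # Upstream provider: publishes at its own pace.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("clickstream", b'{"user": 42, "page": "/home"}')
    producer.flush()  # wait until the broker has the message

    # Downstream consumer: reads at its own pace; the broker buffers the
    # topic in between, so a slow consumer does not stall the producer.
    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for message in consumer:
        print(message.value)
        break  # demonstrate a single read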

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: Ryan Plant


Migrating an Enterprise Data Mart to Hive

Do you have data in an existing RDBMS and want to port it to Hadoop for analysis, but don't know where to start? Come join this interactive discussion as we share our experiences in porting a dataset from Sybase IQ over to Apache Hive. We'll cover the steps taken, challenges faced along the way, and lessons learned from doing this exercise in the field at Goldman Sachs. We'll cover topics such as Sqoop, the varying Hive data formats, partitioning and bucketing strategies, and ANSI SQL to HiveQL conversion. Dan Hoffman and Rob Mancuso are members of Goldman's Application Platform group within Technology Infrastructure.
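
To illustrate one of those design decisions, here is a hedged sketch of a partitioned, bucketed, ORC-backed Hive table; the table, host, and schema are hypothetical, and the PyHive client is our assumption, not something prescribed by the session:

    # Hedged sketch of a partitioning/bucketing choice for a migrated table.
    from pyhive import hive

    ddl = """
    CREATE TABLE trades (
        trade_id BIGINT,
        symbol   STRING,
        price    DOUBLE
    )
    PARTITIONED BY (trade_date STRING)     -- prunes scans to the dates queried
    CLUSTERED BY (symbol) INTO 32 BUCKETS  -- helps joins/sampling on symbol
    STORED AS ORC                          -- one of the columnar Hive formats
    """

    cursor = hive.connect(host="hive-server").cursor()
    cursor.execute(ddl)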

Level 100 - Introduction
Duration: Hour
Presenter: Dan Hoffman



Advanced Track 1

Pick your words carefully - interview text analytics

@bentaylordata, the principal data scientist at HireVue.com, will introduce the power of text analytics and demonstrate it on words spoken during interviews. Attendees with no background in text analytics will benefit from the introduction, while more advanced attendees with previous exposure should come away with new ideas on how to cluster and work with their text data to add value. The techniques demonstrated will include sentiment analysis, custom word weighting, word cloud clustering, predictive analytics using text, and Levenshtein distance.
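
Of the techniques listed, Levenshtein distance is the easiest to show in a few lines: it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is our illustration, not the presenter's code:

    # Levenshtein (edit) distance via dynamic programming over prefixes;
    # prev[j] holds the distance between a[:i-1] and b[:j].
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("interview", "interviews"))  # 1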

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: Ben Taylor


Predicting baseball outcomes with R and caret

Baseball is back in season! J. Michael (Mike) Boyle, professor of information systems at the University of Utah, will demonstrate how one might use various features of the R library caret to predict the outcomes of baseball games. This will be an introductory session; however, experienced R-using data scientists may still benefit from the introduction to the caret library if they have not already encountered it.

Level 100 - Introduction
Duration: Hour
Presenter: J. Michael (Mike) Boyle


Using Sports to Teach Analytics

As analytics practitioners we often find ourselves in the role of teacher. As teachers, we sometimes struggle to meet the needs of our varied audiences. In this presentation, Mike will share examples of how he uses sports as the universal context to teach concepts ranging from data warehousing to predictive analytics to information strategy to diverse audiences. Coming out of this presentation, the challenge for the audience is to consider what steps they might take to become better teachers of analytics, and then to act on them.

Level 300 - (Intermediate): Basic knowledge of subject matter is suggested
Duration: Hour
Presenter: J. Michael (Mike) Boyle



Advanced Track 2

Machine Learning on Big Data with Apache Spark MLlib

You are collecting big data; now what? This presentation will outline the functionality supported in Spark MLlib and also provide practical examples of invoking MLlib from Python on large datasets. It is intended for an advanced audience; experience with machine learning, cluster computing, and programming is strongly recommended. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming. MLlib is a Spark implementation of some common machine learning (ML) functionality, as well as associated tests and data generators. MLlib currently supports four common types of machine learning problem settings, namely binary classification, regression, clustering, and collaborative filtering, as well as an underlying gradient descent optimization primitive. Alton Alexander is the lead data scientist at One on One Marketing and holds a graduate degree in scientific computing from the University of Utah.
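
For a taste of what invoking MLlib from Python looks like, here is the binary-classification setting using MLlib's RDD-based API; the toy data and setup are our illustration, not the presenter's examples:

    # Illustrative sketch: train a binary classifier, one of the four
    # MLlib problem settings, from Python.
    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="mllib-demo")

    # A label followed by two features; a real job would load from HDFS.
    points = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [2.0, 0.5]),
        LabeledPoint(0.0, [0.5, 2.0]),
    ])

    model = LogisticRegressionWithSGD.train(points, iterations=100)
    print(model.predict([1.5, 0.0]))  # resembles the positive examples: 1
    sc.stop()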

Level 400 - (Advanced): Experience with subject matter is strongly recommended
Duration: Hour
Presenter: Alton Alexander


RavenDB: Poe Wouldn't Care But You Should

An introductory course on the document database RavenDB. We will review features, setup, management tools, and clients at a high level. The session actually has nothing to do with Edgar Allan Poe... Oren Eini, aka Ayende Rahien, on the other hand, is heavily involved in the product.

Level 200 - (Beginner): Introductory / fast moving
Duration: Hour
Presenter: William Munn


