Statistical Analysis

Our data analysis services include consulting across a wide variety of subject areas, such as mathematical modeling, machine learning, econometrics, and biostatistics. Our team includes experts in statistical sampling techniques, survey instrument development and testing, qualitative and quantitative experimental design, forecasting and predictive modeling, human factors research, and clinical trials management.


With the advent of modern data analysis software and powerful visualization libraries in R and JavaScript, the picture has become as critical to the analytics process as the words. Our goal in visualizing data is not just to provide stunning displays that will impress any client or investor, but to create innovative graphics that will allow your team to see new patterns and trends that will inform your future business and marketing decisions.


SupStat has teamed up with Transwarp Technologies to help enterprises migrate from traditional database systems to high-efficiency, Hadoop-Spark-based parallel data storage and computing. The Transwarp Data Hub (TDH) is not only faster than other data warehouse solutions, but also provides convenient query, analytics, and visualization interfaces through integration with SQL, R, and Tableau.


Applying cluster analysis to study patterns of physical activity in pregnant women in the Blossom Project

Helping researchers in animal behavior improve survey design by using random forest analyses and multivariate logistic regression to highlight significant questions

Building a scoring system for predicting incidence of animal disease using the machine learning techniques of group lasso regularization and ROC analysis


Cache Strategy for CDNs: R + Transwarp Hadoop/Spark Solution

Problem: Content delivery networks (CDNs) cache web pages for information retrieval tasks, but storage and speed limitations require that cached content be constrained such that only the most relevant data is retained.

Solution: R-generated heat maps were used to categorize web pages by hit rate, and pages above a specified hit-rate threshold were stored within the TDH-HDFS database structure. The system was deployed in 2013 and can manage up to 9 million records per second.
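The threshold step can be illustrated with a minimal Python sketch. It assumes that "hit rate" means a page's share of all requests in a log; the function name, the sample log, and the 0.25 threshold are hypothetical and are not taken from the deployed system.

```python
from collections import Counter


def select_pages_to_cache(request_log, threshold):
    """Return the pages whose share of all requests is at least `threshold`."""
    hits = Counter(request_log)
    total = sum(hits.values())
    if total == 0:
        return set()
    # Keep only pages whose hit rate meets the cutoff.
    return {page for page, count in hits.items() if count / total >= threshold}


# Hypothetical request log, for illustration only.
log = ["/home", "/home", "/news", "/home", "/about", "/news", "/home", "/home"]
cached = select_pages_to_cache(log, threshold=0.25)
```

In this toy log, "/about" accounts for one request in eight and falls below the cutoff, so it would not be retained in the cache.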

Cardiovascular Disease Prediction: Beijing Municipal Health Bureau

Problem: Cardiovascular disease is the number one killer among the elderly in Beijing. To improve preventive care, the BMHB keeps health profiles on every elderly citizen in Beijing, but needs to identify which of nearly 200 health characteristics are most strongly linked to cardiovascular disease.

Solution: Group Lasso and partial correlation methods were used to select the characteristics most strongly linked to the occurrence of cardiovascular disease. The model built on these features exhibited 98% classification accuracy and was used to develop an early warning system now in use by the BMHB.
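To show why the group lasso is suited to selecting characteristics, here is a minimal Python sketch of group soft-thresholding, the shrinkage step at the core of the group lasso penalty. It is an illustration under stated assumptions, not the model built for the BMHB: the coefficients, the grouping, and the penalty value `lam` are all hypothetical.

```python
import math


def group_soft_threshold(coefs, groups, lam):
    """Shrink each group's coefficients toward zero.

    A group whose Euclidean norm is at most `lam` is set to zero entirely,
    which is how the group lasso drops whole groups of features at once.
    """
    out = list(coefs)
    for idx in groups:
        norm = math.sqrt(sum(coefs[i] ** 2 for i in idx))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * coefs[i]
    return out


# Hypothetical coefficients: indices 0-1 form one group, indices 2-3 another.
coefs = [3.0, 4.0, 0.3, 0.4]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(coefs, groups, lam=1.0)
```

Here the first group has norm 5 and is only scaled down, while the second group has norm 0.5, below the penalty, and is removed as a whole. Related characteristics are therefore kept or discarded together.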

Tax Revenue Prediction: Financial and Economic Committee of China

Problem: The FEC must provide annual estimates of the next year’s tax revenue to inform present policy decisions; the Stamp Duty is notoriously difficult to predict, despite the involvement of several teams of analysts from different disciplines.

Solution: We used a variety of time series models to predict, with 93.7% accuracy, a 10.6% drop in Stamp Duty revenue in 2012. The estimates of other government departments were not only farther off in their figures, but also predicted revenue changes in the opposite direction. Our model also revealed that stock turnover, cash in circulation, and GDP are the most relevant factors affecting Stamp Duty revenue.

Movie Rating Prediction: Sundance Film Festival 2013

Problem: Critical reception has a major impact on a film's box-office gross, so some film production companies have begun to seek predictions of a new film's rating prior to release.

Solution: We used Item Response Theory to first select reliable movie raters, and then used collaborative filtering to handle missing data. We then built a Bayesian model to optimize the final rating prediction. Our forecasting system achieved a mean absolute error of 3.5% on movies’ Rotten Tomatoes ratings. The results were used to estimate the ratings of two independent films selected to participate in the 2013 Sundance Film Festival.
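For readers unfamiliar with the metric, mean absolute error is the average size of the gap between predicted and actual ratings. The Python sketch below uses made-up scores on a 0-100 scale, chosen so that the result matches the 3.5 figure; they are not the project's data.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical ratings on a 0-100 scale, for illustration only.
predicted = [82.0, 64.0, 91.0, 47.0]
actual = [85.0, 60.0, 90.0, 53.0]
mae = mean_absolute_error(predicted, actual)
```

With these numbers the individual errors are 3, 4, 1, and 6 points, so predictions are off by 3.5 points on average.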

Bill Query System: R + Transwarp Hadoop/Spark Solution

Problem: Telecommunications billing data has exploded in recent years due to the rapid development of mobile technology and changes in usage patterns. Storage and retrieval at these scales pose challenges to the speed and efficiency of data analysis and management.

Solution: A Transwarp platform on x86 clusters was used to provide a 30-fold increase in query performance over RISC platforms. The new system can handle 30 TB of users’ monthly billing data.

Transportation Pattern Analysis: CitiBike

Problem: CitiBike is New York City's bike sharing system. Riders currently find it difficult to locate available bikes to rent or available docks for returning bikes. Predicting bike and dock availability is important because, while CitiBike provides real-time information on present availability, those numbers are highly volatile and could change by the time a rider reaches the station. Projections based on transportation usage patterns can help bike riders in New York plan better.

Solution: We built a program to scrape CitiBike usage data and automatically populate our database with it. With this data, we used models from time series theory and machine learning to predict bike numbers at every station across Manhattan. Based on the models, we built a website allowing users to get estimates of bike and dock availability prior to starting their trip.
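As a rough illustration of the idea, the Python sketch below forecasts bike availability at a single station from its hour-of-day average. This is a simple seasonal baseline chosen for illustration; the source does not specify the project's actual models, and the function name and observations are hypothetical.

```python
from collections import defaultdict


def hourly_average_forecast(observations):
    """Map each hour of day to the mean number of bikes seen at that hour.

    `observations` is an iterable of (hour_of_day, bikes_available) pairs
    for a single station.
    """
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of bikes, count]
    for hour, bikes in observations:
        totals[hour][0] += bikes
        totals[hour][1] += 1
    return {hour: s / n for hour, (s, n) in totals.items()}


# Hypothetical scraped history for one station: (hour, bikes available).
history = [(8, 12), (8, 10), (8, 14), (18, 3), (18, 5)]
forecast = hourly_average_forecast(history)
```

A rider planning an 18:00 trip would see an estimate of 4 bikes from this baseline, rather than relying on a live count that may change before arrival.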

Real-time Video Surveillance: Transwarp Technologies

Problem: Video surveillance systems have become standard practice for coordinating emergency response, controlling traffic, and enhancing security in urban areas. The challenge in building these systems is to manage the speed and efficiency of data handling while keeping costs low.

Solution: We worked with Transwarp to help build a real-time video surveillance system in an urban center in China. Using Hadoop clusters with dual computation and storage nodes, our system took less than one fifth of the time required by the original Oracle database. With backup systems built into the task scheduler, our system was also more robust to primary scheduler failures.