Introducing Yelp's Machine Learning Platform

Jason Sleight, ML Platform Group Tech Lead

  • Jul 1, 2020

Understanding data is a vital part of Yelp’s success. To connect our consumers with great local businesses, we make millions of recommendations every day for a variety of tasks like:

  • Finding you immediate quotes for a plumber to fix your leaky sink
  • Helping you discover which restaurants are open for delivery right now
  • Identifying the most popular dishes for you to try at those restaurants
  • Inferring possible service offerings so business owners can confidently and accurately represent their business on Yelp

In the early days of Yelp circa 2004, engineers painstakingly designed heuristic rules to power recommendations like these, but turned to machine learning (ML) techniques as the product matured and our consumer base grew. Today there are hundreds of ML models powering Yelp in various forms, and ML adoption continues to accelerate.

As our ML adoption has grown, our ML infrastructure has grown with it. Today, we’re announcing our ML Platform, a robust, full-featured collection of systems for training and serving ML models built upon open source software. In this initial blog post, we will be focusing on the motivations and high-level design. We have a series of blog posts lined up to discuss the technical details of each component in greater depth, so check back regularly!

Yelp’s ML Journey

Yelp’s first ML models were concentrated within a few teams, each of whom created custom training and serving infrastructure. These systems were tailored towards the challenges of their own domains, and cross pollination of ideas was infrequent. Owning an ML model was a heavy investment both in terms of modeling, as well as infrastructure maintenance.

Over several years, each system was gradually extended by its team’s engineers to address increasingly complex scope and tighter service level objectives (SLOs). The operational burden of maintaining these systems took a heavy toll, and drew ML engineers’ focus away from modeling iterations or product applications.

A few years ago, Yelp created a Core ML team to consolidate our ML infrastructure under centrally supported tooling and best practices. We expected several benefits:

  • Centrally managed systems for ML workflows would enable ML developers to focus on the product and ML aspects of their project without getting bogged down by infrastructure.
  • By staffing our Core ML team with infrastructure engineers, we could provide new cutting edge capabilities that ML engineers might lack expertise to create or maintain.
  • By consolidating systems we could increase system efficiency to provide a more robust platform, with tighter SLOs and lower costs.

Consolidating systems for a topic as broad as ML is daunting, so we began by deconstructing ML systems into three main themes and developed solutions within each: interactive computing, data ETL, and model training/serving. The approach has worked well, and allowed teams to migrate portions of their workflows on to Core ML tooling while leaving other specialized aspects of their domain on legacy systems as needed.

In this blog post, I’ll discuss how we architected our model training and serving systems into a single, unified model platform.

Yelp’s ML Platform Goals

At a high level, we have a few primary goals for our ML Platform:

  • Opinionated APIs with pre-built implementations for the common cases.
  • Correctness and robustness by default.
  • Leverage open source software.

Opinionated APIs

Many of Yelp’s ML challenges fall into a limited set of common cases, and for these we want our ML Platform to enforce Yelp’s collective best practices. Considerations like meta data logging, model versioning, reproducibility, etc. are easy to overlook but invaluable for long term model maintenance. Instead of requiring developers to slog through all of these details, we want our ML Platform to abstract and apply best practices by default.

Beyond canonizing our ML workflows, opinionated APIs also enable us to streamline model deployment systems. By focusing developers into narrower approaches, we can support automated model serving systems that allow developers to productionize their model via a couple clicks on a web UI.

Correctness and robustness by default

One of the most common pain points of Yelp’s historical ML workflows was system verification. Ideally, the same exact code used to train a model should be used to make predictions with the model. Unfortunately, this is often easier said than done – especially in a diverse, large-scale, distributed production environment like Yelp’s. We usually train our models in Python but might deploy the models via Java, Scala, Python, inside databases, etc.

Even the tiniest inconsistencies can make huge differences for production models. For example, we encountered an issue where 64-bit floats were unintentionally used by an XGBoost booster for predictions (XGBoost only uses 32-bit floats). The slight floating point differences when numerically encoding an important categorical variable resulted in the model giving approximately random predictions for 35% of instances!
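To make the mechanism concrete, here is a minimal sketch in plain Python of how a 64-bit value and its 32-bit rounding can land on opposite sides of a split threshold, and how two distinct integer category codes can collapse into one. It is an illustration of the general precision problem, not a reconstruction of the actual bug; the threshold, feature value, and category codes are made up.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# A split threshold as a 32-bit library would store it (hypothetical value).
threshold = to_float32(0.1)

# The same nominal value arriving from a client as a 64-bit float.
feature = 0.1

# Comparing the raw 64-bit feature against the 32-bit threshold takes one branch...
goes_left_64 = feature < threshold
# ...while rounding the feature to 32 bits first takes the other.
goes_left_32 = to_float32(feature) < threshold

# Large integer category codes can also collide once rounded to 32 bits,
# because a 32-bit float only has 24 bits of significand.
code_a, code_b = 16777216.0, 16777217.0
codes_collide = to_float32(code_a) == to_float32(code_b)

print(goes_left_64, goes_left_32, codes_collide)
```

Either effect is enough to send an instance down the wrong branch of a tree, which is why the training and serving paths need to agree on numeric types.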

Tolerating sparse vector representations, missing values, nulls, and NaNs also requires special consideration, especially when different libraries and languages have differing expectations for client side pre-processing on these issues. E.g., some libraries treat zero as missing whereas others have a special designation. It is extremely complicated for developers to think through these implementation details, let alone recognize if a mistake has occurred.
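As a small illustration of why the missing-value convention matters, the sketch below expands the same sparse row under two conventions. The `densify` helper and the two counting functions are hypothetical names used only for this example; they do not correspond to any particular library's API.

```python
import math
from typing import Dict, List


def densify(sparse: Dict[int, float], size: int, missing: float) -> List[float]:
    """Expand a sparse {index: value} row, filling absent indices with `missing`."""
    return [sparse.get(i, missing) for i in range(size)]


def missing_if_zero(dense: List[float]) -> int:
    """Count missing entries under a 'zero means missing' convention."""
    return sum(1 for v in dense if v == 0.0)


def missing_if_nan(dense: List[float]) -> int:
    """Count missing entries under a 'NaN means missing' convention."""
    return sum(1 for v in dense if math.isnan(v))


# Index 1 is absent; index 2 holds an explicit, meaningful zero.
row = {0: 3.0, 2: 0.0}

zero_filled = densify(row, 3, 0.0)
nan_filled = densify(row, 3, math.nan)

# With zero-filling, the real zero and the absent value are indistinguishable.
print(missing_if_zero(zero_filled))  # both index 1 and index 2 look missing
print(missing_if_nan(nan_filled))    # only index 1 looks missing
```

If training used one convention and serving the other, the model would see different inputs for the same row without any error being raised.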

When designing our ML Platform, we’ve adopted a test-driven development mindset. All of our code has a full suite of end-to-end integration tests, and we run actual Yelp production models and datasets through our tests to ensure the models give exactly the same results across our entire ecosystem. Beyond ensuring correctness, this also ensures our ML Platform is robust enough to handle messy production data.

Leverage Open Source Solutions

ML is currently experiencing a renaissance of open source technology. Libraries like Scikit-learn, XGBoost, TensorFlow, and Spark have existed for years and continue to provide the foundational ML capabilities. But newer additions like Kubeflow, MLeap, MLflow, TensorFlow Extended, etc. have reinvented what an ML system should entail and provide ML systems with much needed software engineering best practices.

For Yelp’s ML Platform, we recognized that any in-house solution we might construct would be quickly surpassed by the ever-increasing capabilities of these open source projects. Instead we selected the open source libraries best aligned with our needs and constructed thin wrappers around them to allow easier integrations with our legacy code. In cases where open source tools lack capabilities we need, we’re contributing solutions back upstream.

ML Platform Technological Overview

In future blog posts, we’ll be discussing these systems in greater detail, so check back soon. For now, I’ll just give a brief overview of the key tech choices and a model’s life cycle within these systems.


MLflow and MLeap

After evaluating a variety of options, we decided on MLflow and MLeap as the skeleton of our platform.

MLflow’s goal is to make managing ML lifecycles simpler, and contains various subcomponents each aimed at different aspects of ML workflows. For our ML Platform, we especially focused on the MLflow Tracking capabilities. We automatically log parameters and metrics to our tracking server, and then developers use MLflow’s web UI to inspect their models’ performance, compare different model versions, etc.

MLeap is a serialization format and execution engine, and provides two advantages for our ML Platform. Firstly, MLeap comes out of the box with support for Yelp’s most commonly used ML libraries: Spark, XGBoost, Scikit-learn, and TensorFlow – and additionally can be extended for custom transformers to support edge cases. Secondly, MLeap is fully portable, and can run inside any JVM-based system including Spark, Flink, Elasticsearch, or microservices. Taken together, MLeap provides a single solution for our model serving needs like robustness/correctness guarantees and push-button deployment.

Typical Code Flow in our ML Platform

Offline Code Flow for Training a Model in our ML Platform


Developers begin by constructing a training dataset, and then define a pipeline for encoding and modeling their data. Since Yelp models typically utilize large datasets, Spark is our preferred computational engine. Developers specify a Spark ML Pipeline for preprocessing, encoding, modeling, and postprocessing their data. Developers then use our provided APIs to fit and serialize their pipeline. Behind the scenes, these functions automatically interact with the appropriate MLflow and MLeap APIs to log and bundle the pipeline and its metadata.
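The "fit and log behind the scenes" idea can be sketched in a few lines. This is a simplified, self-contained illustration and not our actual API: `fit_and_log`, `InMemoryTracker`, and `train_mean_model` are hypothetical names, the tracker is an in-memory stand-in so the example runs without a tracking server, and bundle serialization is left out. The stand-in deliberately mirrors the key/value shape of MLflow's `log_param` and `log_metric` functions.

```python
from typing import Any, Callable, Dict, List, Tuple


class InMemoryTracker:
    """Hypothetical stand-in for a tracking client; stores values in dicts."""

    def __init__(self) -> None:
        self.params: Dict[str, Any] = {}
        self.metrics: Dict[str, float] = {}

    def log_param(self, key: str, value: Any) -> None:
        self.params[key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


def fit_and_log(
    train_fn: Callable[..., Tuple[Any, Dict[str, float]]],
    params: Dict[str, Any],
    rows: List[float],
    tracker: InMemoryTracker,
) -> Any:
    """Fit a model and record its parameters and metrics as a side effect."""
    for key, value in params.items():
        tracker.log_param(key, value)
    model, metrics = train_fn(rows, **params)
    for key, value in metrics.items():
        tracker.log_metric(key, value)
    return model


def train_mean_model(rows: List[float], shrinkage: float) -> Tuple[float, Dict[str, float]]:
    """Toy 'model': a constant prediction equal to a shrunken mean of the labels."""
    prediction = (sum(rows) / len(rows)) * (1.0 - shrinkage)
    mse = sum((y - prediction) ** 2 for y in rows) / len(rows)
    return prediction, {"train_mse": mse}


tracker = InMemoryTracker()
model = fit_and_log(train_mean_model, {"shrinkage": 0.0}, [1.0, 2.0, 3.0], tracker)
print(model, tracker.params, tracker.metrics)
```

The point of the design is that the developer only calls one function; the logging cannot be forgotten because it is not the developer's job to remember it.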

Online Code Flow for Serving a Model in our ML Platform


To serve models, we constructed a thin wrapper around MLeap that is responsible for fetching bundles from MLflow, loading the bundle into MLeap, and mapping requests into MLeap’s APIs. We created several deployment options for this wrapper, which allows developers to execute their model as a REST microservice, Flink stream processing application, or hosted directly inside Elasticsearch for ranking applications. In each deployment option, developers simply configure the MLflow id for the models they want to host, and then can start sending requests!
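A rough sketch of the wrapper's shape follows: fetch each configured model once at startup, then route requests by id. Everything here is hypothetical and simplified for illustration. `ModelServer` and `REGISTRY` are invented names, the "bundles" are plain Python callables rather than MLeap bundles, and the registry lookup stands in for fetching from MLflow.

```python
from typing import Callable, Dict, Iterable, List

# In this sketch a loaded model is simply a function from features to a score.
Model = Callable[[List[float]], float]


class ModelServer:
    """Loads each configured model once, then routes requests by model id."""

    def __init__(self, fetch_bundle: Callable[[str], Model], model_ids: Iterable[str]) -> None:
        # Fetch and load at startup so that requests never pay the loading cost.
        self._models: Dict[str, Model] = {mid: fetch_bundle(mid) for mid in model_ids}

    def predict(self, model_id: str, features: List[float]) -> float:
        if model_id not in self._models:
            raise KeyError(f"model {model_id!r} is not hosted on this server")
        return self._models[model_id](features)


# Hypothetical registry standing in for the model store.
REGISTRY: Dict[str, Model] = {
    "run-a": lambda features: sum(features),
    "run-b": lambda features: max(features),
}

server = ModelServer(REGISTRY.__getitem__, ["run-a", "run-b"])
total = server.predict("run-a", [1.0, 2.0, 3.0])
largest = server.predict("run-b", [1.0, 2.0, 3.0])
print(total, largest)
```

Because the only per-deployment input is the list of model ids, the same wrapper can sit behind a REST endpoint, a stream processor, or a search plugin.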

What’s Next?

We’ve been rolling out our ML Platform incrementally, and observing enthusiastic adoption by our ML practitioners. The ML Platform is full featured, but there are some improvements we have on our roadmap.

First up is expanding the set of pre-built models and transformers. Both MLflow and MLeap are general purpose and allow full customization, but doing so is sometimes an involved process. Rather than requiring developers to learn the internals of MLflow and MLeap, we’re planning to extend our pre-built implementations to cover more of Yelp’s specialized use cases.

We’d also like to integrate our model serving systems with Yelp’s A/B experimentation tools. Hosting multiple model versions on a single server is available now, but currently relies on clients to specify which version they want to use in each request. However, we could further abstract this detail and have the serving infrastructure connect directly to the experimentation cohorting logic.
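One way this could look, as a sketch under stated assumptions: the server derives a cohort deterministically from the user and experiment, maps the cohort to a model version, and still honors an explicit version when a client supplies one. The hashing scheme and the names `assign_cohort` and `resolve_version` are illustrative assumptions, not a description of Yelp's experimentation tools.

```python
import hashlib
from typing import Dict, List, Optional


def assign_cohort(user_id: str, experiment: str, cohorts: List[str]) -> str:
    """Deterministically map a user to one of the experiment's cohorts."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return cohorts[int(digest, 16) % len(cohorts)]


def resolve_version(
    user_id: str,
    experiment: str,
    cohort_to_version: Dict[str, str],
    requested: Optional[str] = None,
) -> str:
    """Pick a model version: an explicit request wins, otherwise use the cohort."""
    if requested is not None:
        return requested
    # Sort the cohort names so the assignment does not depend on dict ordering.
    cohort = assign_cohort(user_id, experiment, sorted(cohort_to_version))
    return cohort_to_version[cohort]


versions = {"control": "v1", "treatment": "v2"}
chosen = resolve_version("user-1", "ranking-exp", versions)
pinned = resolve_version("user-1", "ranking-exp", versions, requested="v3")
print(chosen, pinned)
```

Keeping the assignment deterministic means the same user sees the same model version on every request without the server storing any per-user state.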

Building on the above, we would like to have the actual observed events feed back into the system via Yelp’s real-time streaming infrastructure. By joining the observed events with the predicted events, we can monitor ML performance (for different experiment cohorts) in real-time. This enables several exciting properties like automated alerts for model degradation, real-time model selection via reinforcement learning techniques, etc.
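The join described above can be sketched on in-memory data. This is a simplified batch version of what would be a streaming join in practice; the record layout, the `cohort_accuracy` helper, and the alert threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (request_id, cohort, predicted_label)
Prediction = Tuple[str, str, int]


def cohort_accuracy(predictions: List[Prediction], outcomes: Dict[str, int]) -> Dict[str, float]:
    """Join predictions to observed outcomes by request id; accuracy per cohort."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for request_id, cohort, predicted in predictions:
        if request_id not in outcomes:
            continue  # the outcome has not been observed yet
        totals[cohort] += 1
        hits[cohort] += int(predicted == outcomes[request_id])
    return {cohort: hits[cohort] / totals[cohort] for cohort in totals}


predictions = [
    ("r1", "control", 1),
    ("r2", "control", 0),
    ("r3", "treatment", 1),
    ("r4", "treatment", 1),  # no outcome observed for this request
]
outcomes = {"r1": 1, "r2": 1, "r3": 1}

accuracy = cohort_accuracy(predictions, outcomes)

# Hypothetical alert rule: flag any cohort whose accuracy falls below a floor.
ALERT_FLOOR = 0.75
degraded = sorted(c for c, acc in accuracy.items() if acc < ALERT_FLOOR)
print(accuracy, degraded)
```

A real-time version would compute the same per-cohort aggregates over a sliding window instead of over a fixed list.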


Technology and Operations Management

MBA student perspectives.


Machine Learning at Yelp

Machine learning has been integral to Yelp's business model over the last several years and can be leveraged to help improve their declining stock price.

Yelp’s website, Yelp.com, is a crowd-sourced local business review site. Their business model relies on relevant reviews (on a scale of 1-5 stars), which generate advertising revenue. 1 The content’s searchability is very important for businesses: an HBS study found that each “star” in a Yelp rating affected the business owner’s sales by 5-9%. 2 Machine learning has been integral to their business model over the last several years and should be leveraged to help improve their declining stock price. 6

Megatrend of Machine Learning and Process Improvement  

Machine learning is very important and useful to Yelp, both on the consumer side – finding relevant businesses through reviews and encouraging useful reviews – and on the advertising side – displaying relevant ads to users – as most of their revenue is generated through advertising.

Yelp’s foray into machine learning began in 2015 with deep learning-powered image analysis, which identifies the color, texture and shape of objects in user-submitted photographs with 83% accuracy and uses the identifying traits to sort them into categories. Once the reviewers’ photographs have been categorized (broad categories such as food, drinks, menu, interior), Yelp’s deep convolutional neural networks recognize the classes and sort the photographs that are then displayed to users (see example in Exhibit 1). Subsequently, Yelp expanded its machine learning to a custom ads platform whereby advertisers can opt to have a “two step” AI system recommend photos and review content to use in banner ads targeting users. This machine learning system increased the rate at which people click on ads by at least 15%. 1


Yelp’s Strategy in the Short and Medium Term

In 2018, Yelp introduced Yelp Collections, which uses a combination of machine learning, algorithmic sorting and manual curation to highlight top businesses in a particular area (see Exhibit 2). 7 They are comparing the effectiveness of the three methods of Collection curation (machine learning, algorithm, human curation) and assessing the potential impact on user interaction and experience. Additionally, weekly recommendations (entitled “Recommended for You”) are informed entirely by machine learning. Their AI engine bases the specific recommendations for each user on which businesses a user has viewed and reviews Yelp has received in the previous week. These are compared to the Collections formed through human-curated roundups and algorithmically generated lists. Lastly, Yelp’s algorithms automatically publish “top 10 list” collections for cities, determined by a composite of star ratings and volume of ratings for each respective business. Yelp should use the success of these lists (measured by metrics such as click rate and new customer acquisition) in its pitches to advertisers and continue to further develop and refine them. 1

In the short term, Yelp should further refine their machine learning to optimize content delivery to users. One of Yelp’s current machine learning projects is creating a “Popular Dish” list on each restaurant’s Yelp page based on customer reviews. The “Popular Dish” idea is a step in the right direction. An HBS study demonstrated that Yelp customers do not use all available information in each review and about each business; they are more responsive to quality changes that are more visible and respond more strongly when a rating contains more information. 3 Given the growing amounts of data, commonly referred to as the data deluge, it is important to have the framework and infrastructure to present the data to users and filter out the less-helpful data. 4

In the medium term, Yelp should use machine learning to ensure the validity and integrity of their reviews and to display higher-quality, thorough reviews more prominently. Furthermore, they should take steps toward identifying and removing fake reviews, as these can negatively (or unfairly positively) impact a business. It is difficult and time consuming to confirm a fake review, but ensuring the integrity of each review is critical to their business model and is an area where machine learning should be further developed. Ensuring high-quality content is of the utmost importance to their business model as they rely on advertising revenue. 2

Future Directions

Over the past few weeks, Yelp stock has decreased nearly 30%, with the company blaming internal issues that led to a paucity of advertiser acquisition. A potential initial step would be to leverage machine learning to screen for authentic reviews, as review authenticity is often cited as a reason for declining user engagement. 6

Further questions that can be considered with regards to Yelp’s use of machine learning and additional areas of use are:

How can Yelp continue to leverage machine learning to improve their advertising revenue and attract new advertisers?

Given the data deluge, how will they continue to improve existing algorithms and accelerate the development of other algorithms?

Exhibit 1: Sample classification system of pictures using machine learning 5


Exhibit 2: Yelp Collections interface  7


(Word Count 767)

Footnotes:   

1 Kyle Wiggers. VentureBeat.  https://venturebeat.com/2018/05/24/yelp-collections-uses-machine-learning-to-serve-up-recommendations/ . May 24, 2018. Accessed 11/12/18.

2  Tom Gara (September 24, 2013).  “Fake Reviews Are Everywhere. How Can We Catch Them?” . Wall Street Journal. Accessed 11/12/18.

3 Michael Luca. Reviews, Reputation, and Revenue: the Case of Yelp.com. Working Paper 12-016.

4 Katrina Lake. Stitch Fix’s CEO on Selling Personal Style to the Mass Market. Idea Watch: How I Did It. Harvard Business Review May-June 2018.

5 Yelp.com. “How We use Deep Learning to Classify Business Photos at Yelp.” Oct 19, 2015. https://engineeringblog.yelp.com/2015/10/how-we-use-deep-learning-to-classify-business-photos-at-yelp.html  Accessed 11/12/18.

6 Market Watch. Yelp’s stock plunge exposes a fragile business model yet again. Nov 10, 2018. https://www.marketwatch.com/story/yelps-post-earnings-stock-plunge-exposes-a-fragile-business-model-yet-again-2018-11-08  Accessed 11/12/18.

7 Hilary Grigonis. Yelp now uses AI to deliver personalized recommendations with Collections. Digital Trends. https://www.digitaltrends.com/social-media/yelp-collections-announced/  Accessed 11/12/18.

Student comments on Machine Learning at Yelp

In a business defined by its user-friendly algorithms, the key to the next steps is to maintain a user-friendly focus, which is why any “sponsored” results should be clearly identified as such. This is not to say that they shouldn’t take money from advertisers to boost revenues and have these results show up higher, but in order to maintain consumer confidence, they should be fully transparent about what they are doing. A great example to follow would be Google: the giant has already helped establish parameters for how search results should be posted, so there is no need to deviate from this model, and a huge risk in doing so.

Thanks for your essay, very interesting! Two things that struck me were the data deluge and the balance between advertiser/consumer interests. On data deluge, it’s not surprising to me that people tend to focus only on one important quality aspect rather than long lists of information. When I use Yelp, I often only look at the popular dish feature or the proximity relative to the highest “star” ratings. Rarely do I deep dive into this and I think Yelp should continue to focus on these types of visually appealing attributes. Yelp also needs to balance generating advertising revenue versus maintaining consumer confidence. Establishing clear parameters could help, but they could also establish short-term partnerships with restaurants and see how consumers rank them (assuming it drives more demand). If the restaurant sees high ratings, then a long-term partnership could form, but if not then Yelp could potentially end the partnership to maintain consumer trust. Lastly, I do agree with your point on investing internally to monitor fake reviews. While these reviews may not have the same detrimental effect that we have seen on Facebook and Twitter recently, they can still erode the brand long-term if not addressed.


  • DOI: 10.54364/aaiml.2023.1174
  • Corpus ID: 265712185

Sentiment Analysis: A Systematic Case Study with Yelp Scores

  • Wenping Wang, Jin Han, +3 authors, Jingxian Huang
  • Published in Advances in Artificial… 2023
  • Computer Science


Yelp Review Rating Prediction: Machine Learning and Deep Learning Models


We predict restaurant ratings from Yelp reviews based on Yelp Open Dataset. Data distribution is presented, and one balanced training dataset is built. Two vectorizers are experimented for feature engineering. Four machine learning models including Naive Bayes , Logistic Regression , Random Forest , and Linear Support Vector Machine are implemented. Four transformer-based models containing BERT, DistilBERT, RoBERTa, and XLNet are also applied. Accuracy, weighted F1 score, and confusion matrix are used for model evaluation. XLNet achieves 70 Regression with 64
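For readers unfamiliar with the evaluation metrics named above, the sketch below computes accuracy, a confusion matrix, and a support-weighted F1 score in plain Python. The star labels and predictions are made-up toy data, not results from the paper, and the function names are chosen only for this example.

```python
from collections import Counter
from typing import List


def confusion_matrix(y_true: List[int], y_pred: List[int], labels: List[int]) -> List[List[int]]:
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_f1(y_true: List[int], y_pred: List[int], labels: List[int]) -> float:
    """Per-class F1 averaged with weights proportional to each class's support."""
    score = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / len(y_true)
    return score


# Toy star ratings: true labels versus a hypothetical model's predictions.
stars_true = [1, 1, 5, 5, 3]
stars_pred = [1, 5, 5, 5, 3]
star_labels = [1, 3, 5]

matrix = confusion_matrix(stars_true, stars_pred, star_labels)
acc = accuracy(stars_true, stars_pred)
f1 = weighted_f1(stars_true, stars_pred, star_labels)
print(matrix, acc, round(f1, 4))
```

Weighting by support keeps a rare class from dominating the average, which matters less on a balanced training set like the one described but still matters on imbalanced test data.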


Python Geeks


16 Real World Case Studies of Machine Learning


A decade ago, few would have thought that the term “Machine Learning” would be hyped so much in the years to come. From our entertainment to our basic needs to complex data handling and statistics, Machine Learning now touches all of it, and its reach isn’t limited to basic necessities and entertainment.

The technology plays a pivotal role in domain areas such as data retrieval, database consistency, and spam detection along with many other vast ranges of applications. We do come across various articles that are ready to teach us about the basic concepts of Machine Learning, however, learning becomes more fun when we actually see it working in practicality.

Keeping this in mind, PythonGeeks brings to you, an article that will talk about the real-life case studies of Machine Learning stating its advancement in various fields. We will talk about the merits of Machine Learning in the field of technology as well as in Life Science and Biology. So, without further delay, let us look at these case studies and get to know a bit more about Machine Learning.

Machine Learning Case Studies in Technology

1. Machine Learning Case Study on Dell

We are all aware of the multinational technology leader, Dell. This tech giant empowers people and communities across the globe by providing software and hardware services at affordable prices. Data plays a pivotal role in Dell’s business, and its marketing team needed a data-driven solution that would boost response rates and show why certain words and phrases outperform others.

Dell partnered with Persado, one of the world’s leading providers of AI- and ML-generated marketing creative, in order to harness the power of words in its email channel and gather data-driven analytics for each of its key audiences for a better user experience.

As an evident outcome of this partnership, Dell experienced a 50% average increase in CTR and a 46% average increase in responses from its customer engagement. Apart from this, it also witnessed a 22% average increase in page visits and a 77% average increase in add-to-cart orders.

Overwhelmed by this success rate and learnings with email, Dell adamantly wanted to elevate their entire marketing platform with Persado for more profit and audience engagement. Dell now makes use of machine learning algorithms to enhance the marketing copy of their promotional and lifecycle emails. Apart from these, their management even deploys Machine Learning models for Facebook ads, display banners, direct mail, and even radio content for a farther reach for the target audience.

2. Machine Learning Case Study on Sky

Sky UK is a British telecommunications service that uses Adobe Sensei’s machine learning and artificial intelligence algorithms to transform customer experiences.

Due to the immense profit that the company gained due to the deployment of the Machine Learning model, the Head of Digital Decisioning and Analytics, Sky UK once stated that they have 22.5 million very diverse customers. Even attempting to divide people by their favorite television genre can result in pretty broad segments for their services.

This will result in the following outcomes:

  • Creating hyper-focused segments to engage customers.
  • Usage of machine learning to deliver actionable intelligence.
  • Improvement in the relationships with customers.
  • Applying AI learnings across channels to understand what matters to customers.

The company was competent in efficiently analyzing large volumes of customer information with the help of machine learning frameworks. With the deployment of Machine Learning models, the services were able to recommend their target audience with products and services that resonated the most with each of them.

McLaughlin once stated that people think of machine learning as a tool for delivering experiences that are strictly defined and very robotic in their approach, but it’s actually the other way round. With Adobe Sensei, the management of the Sky was drawing a line that connects customer intelligence and personalized experiences that are valuable and appropriate for their customers.

3. Machine Learning Case Study on Trendyol

Trendyol is amongst the leading e-commerce companies based in Turkey. It once faced threats from its global competitors like Adidas and ASOS, particularly for its sportswear sales and audience engagement.

In order to assist the company in gaining customer loyalty and to enhance its emailing system, Trendyol partnered with the vendor Liveclicker, which specializes in real-time personalization for a better user experience for its customers.

Trendyol made use of machine learning and artificial intelligence algorithms to create several highly personalized marketing campaigns based on the interests of a particular target audience. It was not only aimed at providing a personalized touch to the campaign, but it also helped to distinguish which messages would be most relevant or draw the attention of which set of customers. It also came up with an offer for a football jersey imposing the recipient’s name on the back of the jersey to ramp up the personalization level and grab the consumer’s attention.

With this one-to-one personalization, the retailer’s open rates, click-through rates, and conversions were not only high, but its sales also reached all-time highs. The campaign resulted in a 30% increase in click-through rates for Trendyol, a 62% growth in response rates, and a striking 130% increase in conversion rates.

4. Machine Learning Case Study on Harley Davidson

In the world we live in today, it is difficult to break through with traditional marketing. For an emerging business like Harley Davidson NYC, Albert (an artificial intelligence-powered robot) has a lot of appeal for the growth and popularity of the company. Powered by machine learning and artificial intelligence algorithms, robots are writing news stories, opening new dimensions, working in hotels, managing traffic, and even running McDonald’s outlets.

We can use Albert in various marketing channels including social media and email campaigns. The software accurately predicts and differentiates among the consumers who are most likely to convert and adjust personal creative copies on their own for the benefits of the campaign.

Harley Davidson is the only brand to date that uses Albert to its advantage. The company analyzed customer data to determine a strong pattern in the behavior of previous customers whose actions were positive in terms of purchasing and spending more than the average amount of time on browsing through the website giving way to the use of Albert. With this analyzed data, Albert bifurcates segments of customers and scales up the test campaigns according to the interests and engagement of customers.

Once Albert was deployed, Harley Davidson witnessed a 40% increase in sales. The brand also witnessed a 2,930% increase in leads, with 50% of those from high converting ‘lookalikes’ identified by artificial intelligence and machine learning using Albert.

5. Machine Learning Case Study on Yelp

Yelp may not be the first name that comes to mind as a tech company. However, it is effectively taking advantage of machine learning to improve its users’ experience to a great extent.

Yelp’s machine learning algorithms assist the company’s human staff in tasks like collecting, categorizing, and labeling images more efficiently and precisely. Since images are nearly as important to Yelp as user reviews themselves, the company is always trying to improve how it handles image processing to analyze customer feedback in a constructive way. Through this assistance, the company now serves millions of its users with accurate and satisfactory services.

For an entire generation nowadays, capturing photos of their food has become second nature. Owing to this, Yelp has such a huge database of photos for image processing. Its software makes use of techniques for analysis of the image to identify and classify the extracted features on the basis of color, texture, and shape. It implies that it can recognize the presence of, say, pizzas, or whether a restaurant has outdoor seating by merely analyzing the images that we provide as input data.

As a constructive outcome, the company is now capable of predicting attributes like ‘good for kids’ and ‘classy ambiance’ with a striking more than 80% accuracy.

6. Machine Learning Case Study on Tesla

Tesla is now a big name in the electric automobile industry, and it is likely to remain a trending topic for years to come. It is widely known for its advanced, futuristic cars. The company states that its cars have their own AI hardware. Tesla is also using AI to build self-driving cars.

At the current state of technology, cars are not yet completely autonomous and still need some human intervention. The company is working extensively on the decision-making algorithms that would help its cars become fully autonomous. It is currently working in partnership with NVIDIA on an unsupervised ML algorithm for this purpose.

This step by Tesla could be a game-changer for both automobiles and Machine Learning models. The cars feed data directly to Tesla’s cloud storage to avoid data leakage. Each car sends the driver’s seating position, local traffic conditions, and other valuable information to the cloud so that the car’s next move can be predicted precisely. The car is equipped with various internal and external sensors that collect this data for processing.

Machine Learning Case Studies in Life Science and Biology

7. Development of Microbiome Therapeutics

With advances in technology, we have studied and identified a vast number of microorganisms in our bodies, the so-called microbiota, such as bacteria, fungi, viruses, and other single-celled organisms. All the genes of the microbiota are collectively known as the microbiome. These genes number in the trillions; for example, the bacteria in the human body have more than 100 times as many unique genes as humans do.

The microbiota in the human body have a massive influence on human health, and imbalances in them can lead to many disorders, such as Parkinson’s disease or inflammatory bowel disease. There is also a presumption that such imbalances may cause several autoimmune diseases if left unaddressed. Microbiome research is therefore a very active research area, and Machine Learning models can help handle its data effectively.

To influence the microbiota and develop microbiome therapeutics that reverse the diseases associated with them, we need to understand the microbiota’s genes and their influence on our body. With today’s gene sequencing capabilities, terabytes of data are available; however, we cannot use this data yet because it has not been analyzed.

8. Predicting Heart Failure in Mobile Health

Heart failure typically leads to emergency or hospital admission and may even be fatal. As the population ages, the share of people with heart failure is expected to increase.

People who suffer from heart failure usually have pre-existing illnesses that go undiagnosed and lead to fatal ailments. It is therefore common to use telemedicine systems to monitor and consult patients, collecting and transmitting mobile health data such as blood pressure, body weight, or heart rate.

Most prediction and prevention systems are currently based on fixed rules: when specific vital-sign measurements exceed a predefined threshold, the patient is alerted, even before any ailment has been diagnosed. Such a system can produce a high number of false alerts, because vital readings fluctuate for reasons that are not serious.
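To make the fixed-rule idea concrete, here is a minimal Python sketch. The vital names and threshold values are hypothetical and chosen only for illustration; they are not clinical guidance and do not come from any real monitoring system.

```python
# Hypothetical thresholds chosen only for illustration; not clinical guidance.
THRESHOLDS = {
    "systolic_bp": (90, 160),       # mmHg, (low, high)
    "heart_rate": (50, 110),        # beats per minute
    "weight_gain_kg": (None, 2.0),  # gain since the previous reading
}

def rule_based_alerts(reading):
    """Return the names of vitals that fall outside their fixed range."""
    alerts = []
    for name, (low, high) in THRESHOLDS.items():
        value = reading.get(name)
        if value is None:
            continue  # vital not measured in this reading
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            alerts.append(name)
    return alerts

# A brief, harmless spike (for example, measured right after climbing stairs)
# still triggers an alert, which is how false alerts arise.
after_stairs = {"systolic_bp": 150, "heart_rate": 118, "weight_gain_kg": 0.2}
print(rule_based_alerts(after_stairs))  # ['heart_rate']
```

The rule has no notion of context: a single reading above the line raises an alert regardless of why it is high.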

Because of the way these systems are programmed, alerts mostly lead to hospital admission. Too many false alerts therefore increase health costs and erode the patient’s confidence in the predictions, defeating the purpose of the algorithms. Eventually, the patient will stop following the recommendation to seek medical help, even when the algorithm alerts them to a fatal ailment.

To reduce the chance of false positives, a classifier based on Naïve Bayes has been developed. It uses the patient’s baseline data (age, gender, smoker or not, pacemaker or not), blood measurements such as sodium, potassium, or hemoglobin concentrations, monitored characteristics such as heart rate, body weight, and systolic and diastolic blood pressure, and questionnaire answers about well-being and physical activity.
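As a rough sketch of how a Naïve Bayes classifier combines several measurements, the following Python code implements a small Gaussian variant from scratch. The features, the six patient rows, and the class names are synthetic and invented for this example; the actual classifier referred to above may use different features and a different Naïve Bayes variant.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        stats = []
        for column in zip(*members):
            mean = sum(column) / n
            # Small constant avoids a zero variance on tiny samples.
            var = sum((x - mean) ** 2 for x in column) / n + 1e-6
            stats.append((mean, var))
        model[label] = (n / len(rows), stats)
    return model

def predict(model, row):
    """Pick the class with the highest log posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Synthetic rows: (age, resting heart rate, systolic blood pressure).
rows = [(55, 70, 120), (60, 72, 125), (58, 68, 118),
        (75, 95, 160), (80, 100, 165), (78, 98, 158)]
labels = ["stable", "stable", "stable", "at_risk", "at_risk", "at_risk"]

model = fit_gaussian_nb(rows, labels)
print(predict(model, (57, 71, 121)))  # stable
print(predict(model, (79, 97, 162)))  # at_risk
```

Unlike a single threshold, the classifier weighs all features together, so one mildly elevated reading does not decide the outcome on its own.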

9. Mental Health Prediction, Diagnosis, and Treatment

An estimated 10% or more of the global population has a mental disorder, so it is high time to take preventive measures in this field. The economic losses caused by mental illness add up to nearly $10 trillion.

Mental disorders cover a wide variety of ailments, including anxiety, depression, and substance use disorder. Other prime examples include opioid dependence, bipolar disorder, schizophrenia, and eating disorders, all of which carry a high risk for those affected.

Detecting mental disorders and intervening as early as possible is therefore critical to reducing these losses. There are two main approaches to deploying Machine Learning models for detecting mental disorders: apps for consumers that detect mental diseases, and tools for psychiatrists that support the diagnosis of their patients.

The consumer apps are typically conversational chatbots enhanced with machine learning algorithms that help consumers reduce their anxiety or panic attacks. The app analyzes behavioral traits, such as the consumer’s spoken language, and recommends help accordingly. Because the recommendations must be strictly based on scientific evidence, the interaction, the response to proposals, and the individual language patterns of both the chatbot and the consumer must be predicted as precisely as possible.

10. Research Publication and Database Scanning for Bio-Markers for Stroke

Stroke is one of the major causes of disability and death among older people. An adult’s lifetime risk of having a stroke is about 25%. However, stroke is a very heterogeneous disorder, so individualized pre-stroke and post-stroke care is critical for successful treatment.

To determine this individualized care, the person’s phenotype, meaning their observable characteristics, must be assessed carefully. This is usually done with biomarkers. A biomarker is a measurable data point that allows patients to be stratified. Examples of such biomarkers are disease severity scores, lifestyle characteristics, or genomic properties.

Many recognized biomarkers have already been published or recorded in databases. In addition, hundreds of scientific publications on the detection of biomarkers for different diseases appear every day.

11. 3D Bioprinting

Bioprinting is another trending topic in biotechnology. Working from a digital blueprint, the printer uses cells and natural or synthetic biomaterials (also called bio-inks) to print living tissues such as skin, organs, blood vessels, or bones layer by layer, closely replicating the real tissues.

Instead of depending on organ donations, we can produce these tissues in printers more ethically and cost-effectively. We can also perform drug tests on the synthetically built tissue rather than on animals or humans. The technology is still emerging and immature because of its high complexity. Data science is one of the most crucial tools for coping with this complexity.

12. Supply Chain Optimization

The production of drugs takes time, especially for today’s high-tech cures, which rely on specific substances and production methods. In addition, the whole process is broken down into many different steps, several of which are outsourced to specialists.

We currently observe this with COVID-19 vaccine production as well. The vaccine inventors deliver the blueprint for the vaccine. Production then takes place in the plants of companies specialized in sterile production. The production unit delivers the vaccine in tanks to companies that fill it into small doses under clinical conditions, and finally another company handles the supply.

The complete planning is a highly complicated system: the right input substances must be available at the right time, production capacity must be adequate, and the right amount of drugs must be stored to meet demand. Moreover, all of this must be managed for hundreds or thousands of therapies, each with its specific conditions.

13. AES On Google Cloud AutoML Vision

The AES Corporation is a power generation and distribution company. It generates and sells power that consumers use for utilities and industrial work. AES relies on Google Cloud to make renewable energy more efficient, using Google AutoML Vision to review images of wind turbine blades and assess their maintenance needs in advance.

Outcomes of this case study:

  • It reduces image review time by approximately 50%
  • It helps in reducing the prices of renewable energy
  • This results in more time to invest in identifying wind turbine damage and mending it

14. Bayer AG on AWS SageMaker

Bayer AG is a multinational pharmaceutical and life sciences company based in Germany. One of its key business areas is the production of insecticides, fungicides, and herbicides for agricultural purposes.

To help farmers monitor their crops, it built the Digital Yellow Trap: an Internet of Things (IoT) device that uses image recognition to alert farmers to pests on their farmland.

Outcomes of this case study:

  • It reduced Bayer lab’s architecture costs by 94%
  • It scales to accommodate fluctuating demand
  • It can handle tens of thousands of requests per second
  • It enables community-based early warning

15. American Cancer Society on Google Cloud ML Engine

The American Cancer Society is a nonprofit organization dedicated to eradicating cancer. It operates more than 250 regional offices across America.

It uses the Google Cloud ML Engine to identify novel patterns in digital pathology images. The aim is to improve breast cancer detection accuracy, shorten the overall diagnosis timeline, and keep costs down.

Outcomes of this use case:

  • It helps in enhancing the speed and accuracy of image analysis by removing human limitations
  • It even aids in improving patients’ quality of life and life expectancy
  • This helps to protect tissue samples by backing up image data to the cloud

16. Road Safety Commission of Western Australia

The Road Safety Commission of Western Australia operates under the Western Australia Police Force. It is responsible for tracking road accidents and making the roads safer by taking adequate precautions.

To achieve its safety strategy “Towards Zero 2008-2020”, which aims to reduce road fatalities by 40%, the commission relies on machine learning, artificial intelligence, and advanced analytics for precise and reliable results.

Outcomes of this case study:

  • Data engineering and visualization time reduced by 80%
  • An estimated 25% reduction in vehicle crashes
  • Straightforward and efficient data sharing
  • Flexibility to work with data in various coding languages

With this, we have covered a range of case studies in the field of Machine Learning. PythonGeeks curated this list to help readers understand how Machine Learning models are deployed in the real world. You can study these cases to get to know Machine Learning a bit better, and even try to find improvements to the existing solutions.


DataFlair


5 Machine Learning Case Studies to explore the Power of Technology


Machine Learning Case Studies – Power that is beyond imagination!

Machine Learning is hyped as the “next big thing” and is being put into practice by most businesses. It has also taken on a prominent role in areas of computer science such as information retrieval, database consistency, and spam detection.


Here are a few real-world case studies on machine learning applications to solve real problems.


Machine Learning Case Studies

Here are the five best machine learning case studies explained:

1. Machine Learning Case Study on Dell

The multinational leader in technology, Dell, empowers people and communities from across the globe with superior software and hardware. Since data is a core part of Dell’s hard drive, their marketing team needed a data-driven solution that supercharges response rates and displays why certain words and phrases outperform others.

Dell partnered with Persado, the world’s leading technology in AI and ML generated marketing creative, to harness the power of words in their email channel and garner data-driven analytics for each of their key audiences.

As a result of this partnership, Dell noticed a 50% average increase in CTR and a 46% average increase in responses from customers. It also generated a 22% average increase in page visits and a 77% average increase in add-to-carts.

Excited by their success and learnings with email, Dell was eager to elevate their entire marketing platform with Persado. Dell now uses machine learning to improve the marketing copy of their promotional and lifecycle emails, Facebook ads, display banners, direct mail, and even radio content.


2. Machine Learning Case Study on Sky

Sky UK transforms customer experiences with the help of machine learning and artificial intelligence through Adobe Sensei.

“We have 22.5 million very diverse customers. Even attempting to divide people by their favorite television genre can result in pretty broad segments.” said the Head of Digital Decisioning and Analytics, Sky UK.

  • Create hyper-focused segments to engage customers.
  • Use machine learning to deliver actionable intelligence.
  • Improve relationships with customers.
  • Apply AI learnings across channels to understand what matters to customers.

With the help of machine learning frameworks, the company was able to make sense of its large volumes of customer information and recommend the products and services that resonated most with each customer.

“People think of machine learning as a tool for delivering experiences that are strictly defined and very robotic, but it’s actually the opposite. With Adobe Sensei, we’re drawing a line that connects customer intelligence and personalized experiences that are valuable and appropriate” says McLaughlin.

3. Machine Learning Case Study on Trendyol

Trendyol, a leading e-commerce company based in Turkey, faced a threat from global competitors like Adidas and ASOS, particularly for sportswear.

To help gain customer loyalty and enhance its emailing system, it partnered with vendor Liveclicker, which specializes in real-time personalization.

Trendyol used machine learning and artificial intelligence to create several highly personalized marketing campaigns and to determine which messages would be most relevant to which customers. It also created an offer for a football jersey with the recipient’s name superimposed on the back to ramp up personalization.

By creatively using one-to-one personalization, the retailer’s open rates, click-through rates, conversions, and sales reached all-time highs. It generated a 30% increase in click-through rates for Trendyol, a 62% growth in response rates, and an impressive 130% increase in conversion rates.

It has now also employed strong marketing functions like social media utilization, mobile app, SEO blogs, celebrity endorsement, etc to reach its customer base.


4. Machine Learning Case Study on Harley Davidson

Today it is difficult to break through with traditional marketing. For a business like Harley Davidson NYC, Albert (an artificial intelligence-powered robot) has a lot of appeal. Powered by machine learning and artificial intelligence, robots are writing news stories, working in hotels, managing traffic, and even running McDonald’s.

Albert can be applied to various marketing channels including social media and email. The software predicts which consumers are most likely to convert and adjusts personal creative copies on its own.

Harley Davidson is the only brand to make use of Albert. The company analyzed customer data to determine the behavior of previous customers whose actions were positive in terms of purchasing and spending more than the average amount of time on browsing through the website. With this information, Albert created segments of customers and scaled up the test campaigns accordingly.

Results show that Harley Davidson increased its sales by 40% with the use of Albert. The brand also had a 2,930% increase in leads, with 50% of those from high-converting ‘lookalikes’ identified by artificial intelligence and machine learning.

5. Machine Learning Case Study on Yelp

While Yelp might not seem to be a tech company at first glance, it is taking advantage of machine learning to improve users’ experience.

Yelp’s machine learning algorithms help the company’s human staff to collect, categorize, and label images more efficiently. Since images are almost as vital to Yelp as user reviews themselves, it is always trying to improve how it handles image processing. Through this, the company is serving millions of its users now.

For an entire generation today, taking photos of their food has become second nature, and it is thanks to these people that Yelp has such a huge database of photos. Its software uses image analysis techniques to identify color, texture, and shape. This means it can recognize the presence of, say, pizzas, or whether a restaurant has outdoor seating.

As a result, the company is now able to predict attributes like ‘good for kids’ and ‘classy ambiance’ with more than 80% accuracy. It is also planning to use this information to auto-caption images and improve search recommendations in the future.

These were all the machine learning case study examples.

The machine learning case studies listed above would have been almost impossible to even imagine as recently as a decade ago, and yet the pace at which scientists and researchers are advancing is nothing short of amazing. In the future, we may see machine learning systems that can recognize, alter, and improve upon their own internal architecture with minimal human intervention.


The Cloudflare Blog


Using machine learning to detect bot attacks that leverage residential proxies

Bob AminAzad

Bots using residential proxies are a major source of frustration for security engineers trying to fight online abuse. These engineers often see a similar pattern of abuse when well-funded, modern botnets target their applications. Advanced bots bypass country blocks, ASN blocks, and rate-limiting. Every time, the bot operator moves to a new IP address space until they blend in perfectly with the “good” traffic, mimicking real users’ behavior and request patterns. Our new Bot Management machine learning model (v8) identifies residential proxy abuse without resorting to IP blocking, which can cause false positives for legitimate users.  

One of the main sources of Cloudflare’s bot score is our bot detection machine learning model which analyzes, on average, over 46 million HTTP requests per second in real time. Since our first Bot Management ML model was released in 2019, we have continuously evolved and improved the model. Nowadays, our models leverage features based on request fingerprints, behavioral signals, and global statistics and trends that we see across our network.

Each iteration of the model focuses on certain areas of improvement. This process starts with a rigorous R&D phase to identify the emerging patterns of bot attacks by reviewing feedback from our customers and reports of missed attacks. In v8, we mainly focused on two areas of abuse. First, we analyzed the campaigns that leverage residential IP proxies, which are proxies on residential networks commonly used to launch widely distributed attacks against high profile targets. In addition to that, we improved model accuracy for detecting attacks that originate from cloud providers.

Residential IP proxies

Proxies allow attackers to hide their identity and distribute their attack. Moreover, IP address rotation allows attackers to directly bypass traditional defenses such as IP reputation and IP rate limiting. Knowing this, defenders use a plethora of signals to identify malicious use of proxies. In its simplest forms, IP reputation signals (e.g., data center IP addresses, known open proxies, etc.) can lead to the detection of such distributed attacks.

However, in the past few years, bot operators have started favoring proxies operating in residential network IP address space. By using residential IP proxies, attackers can masquerade as legitimate users by sending their traffic through residential networks. Nowadays, residential IP proxies are offered by companies that facilitate access to large pools of IP addresses for attackers. Residential proxy providers claim to offer 30-100 million IPs belonging to residential and mobile networks across the world. Most commonly, these IPs are sourced by partnering with free VPN providers, as well as including the proxy SDKs into popular browser extensions and mobile applications. This allows residential proxy providers to gain a foothold on victims’ devices and abuse their residential network connections.

[Figure 1: Architecture of a residential proxy]

Figure 1 depicts the architecture of a residential proxy. By subscribing to these services, attackers gain access to an authenticated proxy gateway address commonly using the HTTPS/SOCKS5 proxy protocol. Some residential proxy providers allow their users to select the country or region for the proxy exit nodes. Alternatively, users can choose to keep the same IP address throughout their session or rotate to a new one for each outgoing request. Residential proxy providers then identify active exit nodes on their network (on devices that they control within residential networks across the world) and route the proxied traffic through them.

The large pool of IP addresses and the diversity of networks poses a challenge to traditional bot defense mechanisms that rely on IP reputation and rate limiting. Moreover, the diversity of IPs enables the attackers to rotate through them indefinitely. This shrinks the window of opportunity for bot detection systems to effectively detect and stop the attacks. Effective defense against residential proxy attacks should be able to detect this type of bot traffic either based on single request features to stop the attack immediately, or identify unique fingerprints from the browsing agent to track and mitigate the bot traffic regardless of the IP source. Overly broad blocking actions, such as IP block-listing, by definition, would result in blocking legitimate traffic from residential networks where at least one device is acting as a residential proxy node.
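The effect of IP rotation on per-IP rate limiting can be sketched in a few lines of Python. This is an illustrative toy, not a description of any production rate limiter; the limit, the request volume, and the pool size are made-up numbers, and the addresses come from ranges reserved for documentation.

```python
from collections import Counter

def blocked_requests(source_ips, limit_per_ip):
    """Count requests rejected by a naive per-IP rate limiter in one window."""
    seen = Counter()
    blocked = 0
    for ip in source_ips:
        seen[ip] += 1
        if seen[ip] > limit_per_ip:
            blocked += 1
    return blocked

LIMIT = 10    # hypothetical: 10 requests per IP per window
TOTAL = 1000  # requests the bot sends in the window

# Bot sending everything from a single address.
single_ip = ["203.0.113.7"] * TOTAL

# Same bot rotating through a pool of 250 proxy exit addresses.
pool = [f"198.51.100.{i}" for i in range(1, 251)]
rotating = [pool[i % len(pool)] for i in range(TOTAL)]

print(blocked_requests(single_ip, LIMIT))  # 990
print(blocked_requests(rotating, LIMIT))   # 0
```

With rotation, each address sends only four requests, so the limiter never fires even though the total volume is unchanged.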

ML model training

At its heart, our model is built using a chain of modules that work together. Initially, we fetch and prepare training and validation datasets from our ClickHouse data storage. We use datasets with high confidence labels as part of our training. For model validation, we use datasets consisting of missed attacks reported by our customers, known sources of bot traffic (e.g., verified bots), and high confidence detections from other bot management modules (e.g., heuristics engine). We orchestrate these steps using Apache Airflow, which enables us to customize each stage of the ML model training and define the interdependencies of our training, validation, and reporting modules in the form of directed acyclic graphs (DAGs).
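To illustrate what expressing interdependencies as a DAG means, the sketch below uses Python's standard-library `graphlib` rather than Airflow itself. The step names are hypothetical and only loosely follow the stages described above.

```python
from graphlib import TopologicalSorter

# Each key is a step; the set lists the steps it depends on.
# Step names are hypothetical stand-ins for pipeline stages.
pipeline = {
    "fetch_training_data": set(),
    "fetch_validation_data": set(),
    "train_model": {"fetch_training_data"},
    "validate_model": {"train_model", "fetch_validation_data"},
    "build_report": {"validate_model"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this ordering, but the underlying constraint is the same: a step runs only after everything it depends on has finished.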

The first step of training a new model is fetching labeled training data from our data store. Under the hood, our dataset definitions are SQL queries that will materialize by fetching data from our ClickHouse cluster where we store feature values and calculate aggregates from the traffic on our network. Figure 2 depicts these steps as train and validation dataset fetch operations. Introducing new datasets can be as straightforward as writing the SQL queries to filter the desired subset of requests.
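The idea of a dataset definition being "just a SQL query" can be sketched as follows. This example uses an in-memory SQLite database as a stand-in for the real data store, and the table name, column names, and dataset names are all hypothetical.

```python
import sqlite3

# Hypothetical dataset definitions: dataset name -> SQL query.
DATASETS = {
    "heuristics_bots": "SELECT * FROM requests WHERE label_source = 'heuristics'",
    "verified_bots": "SELECT * FROM requests WHERE label_source = 'verified_bot'",
}

# Tiny stand-in table; a real store would hold feature values per request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (id INTEGER, label_source TEXT, is_bot INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [(1, "heuristics", 1), (2, "verified_bot", 1),
     (3, "customer_report", 1), (4, "heuristics", 1)],
)

def materialize(name):
    """Run the query behind a dataset definition and return its rows."""
    return conn.execute(DATASETS[name]).fetchall()

print(len(materialize("heuristics_bots")))  # 2
```

Under this scheme, adding a dataset amounts to adding one more entry to the mapping with a query that filters the desired subset of requests.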

[Figure 2: Model training steps, including the train and validation dataset fetch operations]

After fetching the datasets, we train our CatBoost model and tune its hyperparameters. During evaluation, we compare the performance of the newly trained model against the current default version running for our customers. To capture the intricate patterns in subsets of our data, we split certain validation datasets into smaller slivers called specializations. For instance, we use the detections made by our heuristics engine and managed rulesets as ground truth for bot traffic. To ensure that larger sources of traffic (large ASNs, different HTTP versions, etc.) do not mask our visibility into patterns for the rest of the traffic, we define specializations for these sources of traffic. As a result, improvements in accuracy of the new model can be evaluated for common patterns (e.g., HTTP/1.1 and HTTP/2) as well as less common ones. Our model training DAG will provide a breakdown report for the accuracy, score distribution, feature importance, and SHAP explainers for each validation dataset and its specializations.
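The following sketch shows why slicing a validation set matters: a large slice can hide poor accuracy on a small one. The records and the choice of HTTP version as the slicing key are synthetic and purely illustrative.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Group validation records by a field and compute accuracy per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        key = rec[slice_key]
        totals[key] += 1
        hits[key] += int(rec["predicted"] == rec["label"])
    return {key: hits[key] / totals[key] for key in totals}

# Synthetic validation records; values are illustrative only.
records = (
    [{"http_version": "HTTP/2", "label": 1, "predicted": 1}] * 98
    + [{"http_version": "HTTP/2", "label": 1, "predicted": 0}] * 2
    + [{"http_version": "HTTP/1.0", "label": 1, "predicted": 0}] * 3
    + [{"http_version": "HTTP/1.0", "label": 1, "predicted": 1}] * 1
)

overall = sum(r["predicted"] == r["label"] for r in records) / len(records)
print(round(overall, 3))                          # 0.952
print(accuracy_by_slice(records, "http_version"))  # {'HTTP/2': 0.98, 'HTTP/1.0': 0.25}
```

The overall figure looks healthy, while the per-slice view reveals that the rare slice is mostly misclassified.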

Once we are happy with the validation results and model accuracy, we evaluate our model against a checklist of steps to ensure the correctness and validity of our model. We start by ensuring that our results and observations are reproducible over multiple non-overlapping training and validation time ranges. Moreover, we check for the following factors:

  • Check for the distribution of feature values to identify irregularities such as missing or skewed values.
  • Check for overlaps between training and validation datasets and feature values.
  • Verify the diversity of training data and the balance between labels and datasets.
  • Evaluate performance changes in the accuracy of the model on validation datasets based on their order of importance.
  • Check for model overfitting by evaluating the feature importance and SHAP explainers.
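Two of the checks in this list lend themselves to a short sketch: measuring overlap between training and validation data, and measuring how often a feature value is missing. The function names, request identifiers, and the `latency_ms` feature are assumptions made for this example.

```python
def overlap_ratio(train_ids, validation_ids):
    """Fraction of validation rows that also appear in the training set."""
    validation = set(validation_ids)
    if not validation:
        return 0.0
    return len(validation & set(train_ids)) / len(validation)

def missing_rate(rows, feature):
    """Fraction of rows where a feature value is absent or None."""
    if not rows:
        return 0.0
    return sum(row.get(feature) is None for row in rows) / len(rows)

train_ids = ["r1", "r2", "r3", "r4"]
validation_ids = ["r4", "r5"]
rows = [{"latency_ms": 40}, {"latency_ms": None}, {"latency_ms": 75}, {}]

print(overlap_ratio(train_ids, validation_ids))  # 0.5
print(missing_rate(rows, "latency_ms"))          # 0.5
```

In practice each ratio would be compared against an agreed tolerance before the model is allowed to proceed.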

After the model passes the readiness checks, we deploy it in shadow mode. We can observe the behavior of the model on live traffic in log-only mode (i.e., without affecting the bot score). After gaining confidence in the model's performance on live traffic, we start onboarding beta customers, and gradually switch the model to active mode all while closely monitoring the real-world performance of our new model.
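Shadow mode can be sketched as a wrapper that serves the active model's result while only recording what the candidate would have returned. The two stand-in models, their scores, and the `via_proxy` field are invented for this example.

```python
def score_request(request, active_model, shadow_model, shadow_log):
    """Serve the active model's score; only log the shadow model's score."""
    active_score = active_model(request)
    shadow_log.append({
        "request_id": request["id"],
        "active": active_score,
        "shadow": shadow_model(request),
    })
    return active_score  # the shadow score never reaches the caller

# Stand-in models returning made-up scores.
def current_model(request):
    return 30

def candidate_model(request):
    return 5 if request["via_proxy"] else 30

log = []
traffic = [{"id": 1, "via_proxy": False}, {"id": 2, "via_proxy": True}]
served = [score_request(r, current_model, candidate_model, log) for r in traffic]

# Disagreements are what gets reviewed before switching to active mode.
disagreements = [entry for entry in log if entry["active"] != entry["shadow"]]
print(served)         # [30, 30]
print(disagreements)  # [{'request_id': 2, 'active': 30, 'shadow': 5}]
```

Because the caller only ever sees the active score, the candidate can be evaluated on live traffic without any effect on it.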

ML features for bot detection

Each of our models uses a set of features to make inferences about the incoming requests. We compute our features based on single request properties (single request features) and patterns from multiple requests (i.e., inter-request features). We can categorize these features into the following groups:

  • Global features: inter-request features that are computed based on global aggregates for different types of fingerprints and traffic sources (e.g., for an ASN) seen across our global network. Given the relatively lower cardinality of these features, we can scalably calculate global aggregates for each of them.
  • High cardinality features: inter-request features focused on fine-grained aggregate data from local traffic patterns and behaviors (e.g., for an individual IP address).
  • Single request features: features derived from each individual request (e.g., user agent).
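As a rough illustration of how the three groups differ, the following Python sketch assembles one feature of each kind for a request. The concrete features (`user_agent_length`, request counts per ASN and per IP) and the sample traffic are invented for the example and are not the features used by the models described here.

```python
from collections import Counter

def build_features(request, requests_per_asn, requests_per_ip):
    """Combine one feature from each group into a flat dict for a request."""
    return {
        # Single request feature: derived from this request alone.
        "user_agent_length": len(request["user_agent"]),
        # Global feature: low-cardinality aggregate, keyed by ASN.
        "asn_request_count": requests_per_asn[request["asn"]],
        # High cardinality feature: fine-grained aggregate, keyed by IP.
        "ip_request_count": requests_per_ip[request["ip"]],
    }

# Hypothetical traffic using documentation-reserved IPs and a private-use ASN.
traffic = [
    {"ip": "192.0.2.1", "asn": 64500, "user_agent": "curl/8.0"},
    {"ip": "192.0.2.1", "asn": 64500, "user_agent": "curl/8.0"},
    {"ip": "192.0.2.7", "asn": 64500, "user_agent": "Mozilla/5.0"},
]
requests_per_asn = Counter(r["asn"] for r in traffic)
requests_per_ip = Counter(r["ip"] for r in traffic)

features = build_features(traffic[0], requests_per_asn, requests_per_ip)
```

The per-ASN counter stays small because there are relatively few ASNs, while the per-IP counter grows with the number of distinct addresses, which is why the two kinds of aggregates are treated differently.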

Our Bot Management system (named BLISS) is responsible for fetching and computing these feature values and making them available on our servers for inference by active versions of our ML models.

Detecting residential proxies using network and behavioral signals

Attacks originating from residential IP addresses are commonly characterized by a spike in the overall traffic towards sensitive endpoints on the target websites from a large number of residential ASNs. Our approach for detecting residential IP proxies is twofold. First, we start by comparing direct vs proxied requests and looking for network level discrepancies. Revisiting Figure 1, we notice that a request routed through residential proxies (red dotted line) has to traverse through multiple hops before reaching the target, which affects the network latency of the request.

Based on this observation alone, we are able to characterize residential proxy traffic with a high true positive rate (i.e., all residential proxy requests have high network latency). While we were able to replicate this in our lab environment, we quickly realized that at the scale of the Internet, we run into numerous exceptions with false positive detections (i.e., non-residential proxy traffic with high latency). For instance, countries and regions that predominantly use satellite Internet would exhibit a high network latency for the majority of their requests due to the use of performance-enhancing proxies.
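The trade-off described above can be shown with a small worked example. The Python sketch below computes the true positive and false positive rates of a latency-only cutoff; the latencies, the 300 ms threshold, and the labels are made up for illustration.

```python
def rates(samples, threshold_ms):
    """Return (true_positive_rate, false_positive_rate) for a latency cutoff.

    samples -- list of (latency_ms, is_residential_proxy) pairs
    """
    tp = sum(1 for ms, proxy in samples if proxy and ms >= threshold_ms)
    fp = sum(1 for ms, proxy in samples if not proxy and ms >= threshold_ms)
    positives = sum(1 for _, proxy in samples if proxy)
    negatives = len(samples) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr

# Made-up latencies: proxied requests are slow, but so are satellite users.
samples = [
    (420, True), (510, True), (380, True),   # residential proxy
    (40, False), (65, False),                # direct, terrestrial link
    (600, False), (550, False),              # direct, satellite link
]
tpr, fpr = rates(samples, threshold_ms=300)
```

With these numbers the cutoff catches every proxied request, yet it also flags half of the direct requests, all of them from the high-latency satellite users.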

Realizing that relying solely on network characteristics of connections to detect residential proxies is inadequate given the diversity of the connections on the Internet, we switched our focus to the behavior of residential IPs. To that end, we observe that the IP addresses from residential proxies exhibit distinct behavior during periods of peak activity. While this observation singles out highly active IPs over their peak activity time, given the pool size of residential IPs, it is not uncommon to only observe a small number of requests from the majority of residential proxy IPs.

These periods of inactivity can be attributed to the temporary nature of residential proxy exit nodes. For instance, when the client software (i.e., browser or mobile application) that runs the exit nodes of these proxies is closed, the node leaves the residential proxy network. One way to filter out periods of inactivity is to increase the monitoring time and penalize each IP address that exhibits residential proxy behavior for a period of time. This block-listing approach, however, has certain limitations. Most importantly, by relying only on IP-based behavioral signals, we would block traffic from legitimate users who may unknowingly run mobile applications or browser extensions that turn their devices into proxies. This is even more detrimental for mobile networks, where many users share their IPs behind CGNATs. Figure 3 demonstrates this by comparing the share of direct vs proxied requests that we received from active residential proxy IPs over a 24-hour period. Overall, we see that 4 out of 5 requests from these networks belong to direct and benign connections from residential devices.

Figure 3: Percentage of direct vs proxied requests from residential proxy IPs.

Using this insight, we combined behavioral and latency-based features along with new datasets to train a new machine learning model that detects residential proxy traffic on a per-request basis. This scheme allows us to block residential proxy traffic while allowing benign residential users to visit Cloudflare-protected websites from the same residential network.
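The difference between IP-level block-listing and per-request detection can be sketched in Python. Everything in this example is hypothetical: the traffic mirrors the 4-in-5 ratio mentioned above, and `per_request_rule` is a hand-written stand-in for a trained model, not the v8 model itself.

```python
def blocked_benign_share(requests, should_block):
    """Share of direct (benign) requests that a policy would block."""
    benign = [r for r in requests if not r["proxied"]]
    if not benign:
        return 0.0
    return sum(1 for r in benign if should_block(r)) / len(benign)

# Hypothetical traffic from one residential IP: 4 direct requests, 1 proxied.
requests = [
    {"ip": "198.51.100.9", "proxied": False, "latency_ms": 35, "path": "/"},
    {"ip": "198.51.100.9", "proxied": False, "latency_ms": 48, "path": "/cart"},
    {"ip": "198.51.100.9", "proxied": False, "latency_ms": 41, "path": "/login"},
    {"ip": "198.51.100.9", "proxied": False, "latency_ms": 39, "path": "/"},
    {"ip": "198.51.100.9", "proxied": True, "latency_ms": 460, "path": "/login"},
]

flagged_ips = {"198.51.100.9"}

def ip_blocklist(request):
    # IP-level policy: every request from a flagged address is blocked.
    return request["ip"] in flagged_ips

def per_request_rule(request):
    # Stand-in for a trained per-request model: combines a latency signal
    # with a behavioral signal instead of looking at the IP alone.
    return request["latency_ms"] >= 300 and request["path"] == "/login"

ip_level = blocked_benign_share(requests, ip_blocklist)
request_level = blocked_benign_share(requests, per_request_rule)
```

In this toy setup the block-list stops every benign request from the shared address, while the per-request rule blocks only the proxied login attempt.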

Detection results and case studies

We started testing v8 in shadow mode in March 2024. Every hour, v8 classifies more than 17 million unique IPs that participate in residential proxy attacks. Figure 4 shows the geographic distribution of IPs with residential proxy activity belonging to more than 45 thousand ASNs in 237 countries/regions. Among the most commonly requested endpoints from residential proxies, we observe patterns of account takeover attempts, such as requests to /login, /auth/login, and /api/login.

Figure 4: Countries and regions with residential network activity. Marker size is proportional to the number of IPs with residential proxy activity.

Furthermore, we see significant improvements when evaluating our new machine learning model on previously missed attacks reported by our customers. In one case, v8 was able to correctly classify 95% of requests from distributed residential proxy attacks targeting the voucher redemption endpoint of the customer’s website. In another case, our new model successfully detected a previously missed content scraping attack, as evidenced by the increased detections during the traffic spikes depicted in Figure 5. We are continuing to monitor the behavior of residential proxy attacks in the wild and work with our customers to ensure that we can provide robust detection against these distributed attacks.


Improving detection for bots from cloud providers

In addition to residential IP proxies, bot operators commonly use cloud providers to host and run bot scripts that attack our customers. To combat these attacks, we improved our ground truth labels for cloud provider attacks in our latest ML training datasets. Early results show that v8 detects 20% more bots from cloud providers, with up to 70% more bots detected on zones that are marked as under attack. We further plan to expand the list of cloud providers that v8 detects as part of our ongoing updates.

Check out ML v8

For existing Bot Management customers, we recommend toggling “Auto-update machine learning model” to instantly gain the benefits of ML v8 and its residential proxy detection, and to stay up to date with our future ML model updates. If you’re not a Cloudflare Bot Management customer, contact our sales team to try out Bot Management.



Predicting the Helpfulness of Online Restaurant Reviews Using Different Machine Learning Algorithms: A Case Study of Yelp


  • 1. Introduction
  • 2. Literature Review
  • 2.1. The Role of eWOM in Promoting Business Sustainability
  • 2.2. Studies on Review Helpfulness Prediction
  • 2.3. Machine Learning for eWOM
  • 3. Methodology
  • 3.1. Data Collection
  • 3.2. Data Analysis Process
  • 3.2.1. Step 1: Data Preprocessing
  • 3.2.2. Step 2: Restaurant Aspect Extraction
  • 3.2.3. Step 3: Sentiment Detection
  • 3.2.4. Step 4: Classifier Set Up
  • 3.3. Measurements
  • 4.1. Descriptive Analysis of Online Reviews
  • 4.2. LDA Results
  • 4.3. Sentiment Results
  • 4.4. Model Comparison
  • 5. Concluding Remarks
  • 5.1. Summary of Results and Discussion
  • 5.2. Implications, Limitations, and Future Studies
  • Author Contributions
  • Conflicts of Interest

  • International Economic Development Council. Green Metrics: Common Measures of Sustainable Economic Development. 2017. Available online: https://www.iedconline.org/clientuploads/Downloads/edrp/IEDC_Greenmetrics.pdf (accessed on 29 August 2019).
  • National Restaurant Association. Restaurant Industry Factbook. 2019. Available online: https://www.restaurant.org/Downloads/PDFs/Research/SOI/restaurant_industry_fact_sheet_2019.pdf (accessed on 26 July 2019).
  • Lee, C.; Hallak, R.; Sardeshmukh, S.R. Innovation, entrepreneurship, and restaurant performance: A higher-order structural model. Tour. Manag. 2016, 53, 215–228.
  • Parsa, H.G.; Self, J.T.; Njite, D.; King, T. Why restaurants fail. Cornell Hosp. Q. 2005, 46, 304–322.
  • Forbes. Restaurants Don’t Fail, Lenders Do. 2013. Available online: https://www.forbes.com/sites/marccompeau/2013/12/03/restaurants-dont-fail-lenders-do/#4701cae121c6 (accessed on 22 July 2019).
  • Hua, N.; Lee, S. Benchmarking firm capabilities for sustained financial performance in the U.S. restaurant industry. Int. J. Hosp. Manag. 2014, 36, 137–144.
  • Guta, M. 94% Diners will Choose Restaurant Based on Online Reviews. Small Business Trends, June 2018. Available online: https://smallbiztrends.com/2018/06/how-diners-choose-restaurants.html (accessed on 20 August 2019).
  • Statista. Cumulative Number of Reviews Submitted to Yelp from 2009 to 2017 (in Millions). Available online: https://www.statista.com/statistics/278032/cumulative-number-of-reviews-submitted-to-yelp/ (accessed on 29 July 2019).
  • Malhotra, N.K. Reflections on the information overload paradigm in consumer decision making. J. Consum. Res. 1984, 10, 436–440.
  • Kwon, B.C.; Kim, S.H.; Duket, T.; Catalán, A.; Yi, J.S. Do people really experience information overload while reading online reviews? Int. J. Hum. Comput. Interact. 2015, 31, 959–973.
  • Mudambi, S.M.; Schuff, D. What makes a helpful online review? A study of customer reviews on amazon.com. Society for Information Management and The Management Information Systems Research Center. MIS Q. 2010, 34, 185–200.
  • Li, M.; Huang, L.; Tan, C.; Wei, K. Helpfulness of online product reviews as seen by consumers: Source and content features. Int. J. Electron. Commer. 2013, 17, 101–136.
  • Chua, A.Y.; Banerjee, S. Helpfulness of user-generated reviews as a function of review sentiment, product type and information quality. Comput. Hum. Behav. 2016, 54, 547–554.
  • Huang, A.H.; Chen, K.; Yen, D.C.; Tran, T.P. A study of factors that contribute to online review helpfulness. Comput. Hum. Behav. 2015, 48, 17–27.
  • Wang, X.; Tang, L.R.; Kim, E. More than words: Do emotional content and linguistic style matching matter on restaurant review helpfulness? Int. J. Hosp. Manag. 2019, 77, 438–447.
  • Samuel, A.L. Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 1959, 3, 210–229.
  • Hennig-Thurau, T.; Gwinner, K.P.; Walsh, G.; Gremler, D.D. Electronic word-of-mouth via consumer-opinion platforms: What motivates consumers to articulate themselves on the Internet? J. Interact. Mark. 2004, 18, 38–52.
  • Blal, I.; Sturman, M.C. The differential effects of the quality and quantity of online reviews on hotel room sales. Cornell Hosp. Q. 2014, 55, 365–375.
  • Ladhari, R.; Michaud, M. eWOM effects on hotel booking intentions, attitudes, trust, and website perceptions. Int. J. Hosp. Manag. 2015, 46, 36–45.
  • Nieto-García, M.; Muñoz-Gallego, P.A.; González-Benito, Ó. Tourists’ willingness to pay for an accommodation: The effect of eWOM and internal reference price. Int. J. Hosp. Manag. 2017, 62, 67–77.
  • Sparks, B.A.; Browning, V. The impact of online reviews on hotel booking intentions and perception of trust. Tour. Manag. 2011, 32, 1310–1323.
  • Rampton, J. How Online Reviews Can Help Grow Your Small Business. Forbes. 31 May 2017. Available online: https://www.forbes.com/sites/johnrampton/2017/05/31/how-online-reviews-can-help-grow-your-small-business/#7ecbc990737b (accessed on 13 July 2019).
  • Vlachos, G. Online Travel Statistics. Info Graphics Mania. 2012. Available online: http://infographicsmania.com/online-travel-statistics-2012/ (accessed on 6 August 2019).
  • Mauri, A.G.; Minazzi, R. Web reviews influence on expectations and purchasing intentions of hotel potential customers. Int. J. Hosp. Manag. 2013, 34, 99–107.
  • Zhang, Z.; Ye, Q.; Law, R.; Li, Y. The impact of e-word-of-mouth on the online popularity of restaurants: A comparison of consumer reviews and editor reviews. Int. J. Hosp. Manag. 2010, 29, 694–700.
  • Kim, W.G.; Li, J.J.; Brymer, R.A. The impact of social media reviews on restaurant performance: The moderating role of excellence certificate. Int. J. Hosp. Manag. 2016, 55, 41–51.
  • Guo, Y.; Wang, Y.; Wang, C. Exploring the Salient Attributes of Short-Term Rental Experience: An Analysis of Online Reviews from Chinese Guests. Sustainability 2019, 11, 4290.
  • Jia, S.S. Leisure Motivation and Satisfaction: A Text Mining of Yoga Centres, Yoga Consumers, and Their Interactions. Sustainability 2018, 10, 4458.
  • Nam, S.; Ha, C.; Lee, H. Redesigning In-Flight Service with Service Blueprint Based on Text Analysis. Sustainability 2018, 10, 4492.
  • Pantelidis, I.S. Electronic meal experience: A content analysis of online restaurant comments. Cornell Hosp. Q. 2010, 51, 483–491.
  • Yan, X.; Wang, J.; Chau, M. Customer revisit intention to restaurants: Evidence from online reviews. Inf. Syst. Front. 2015, 17, 645–657.
  • Danescu-Niculescu-Mizil, C.; Kossinets, G.; Kleinberg, J.; Lee, L. How opinions are received by online communities: A case study on amazon.com helpfulness votes. In Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, 20–24 April 2009; pp. 141–150.
  • Otterbacher, J. “Helpfulness” in online communities: A measure of message quality. In Proceedings of the Sigchi Conference on Human Factors in Computing Systems, Boston, MA, USA, 4–9 April 2009.
  • Kwok, L.; Xie, K.L. Factors contributing to the helpfulness of online hotel reviews: Does manager response play a role? Int. J. Hosp. Manag. 2016, 28, 2156–2177.
  • Pan, Y.; Zhang, J.Q. Born unequal: A study of the helpfulness of user-generated product reviews. J. Retail. 2011, 87, 598–612.
  • Baek, H.; Ahn, J.; Choi, Y. Helpfulness of online consumer reviews: Readers’ objectives and review cues. Int. J. Electron. Commer. 2012, 17, 99–126.
  • Racherla, P.; Friske, W. Perceived ‘usefulness’ of online consumer reviews: An exploratory investigation across three services categories. Electron. Commer. Res. Appl. 2012, 11, 548–559.
  • Liu, Z.; Park, S. What makes a useful online review? Implication for travel product websites. Tour. Manag. 2015, 47, 140–151.
  • Hong, H.; Xu, D.; Wang, G.A.; Fan, W. Understanding the determinants of online review helpfulness: A meta-analytic investigation. Decis. Support Syst. 2017, 102, 1–11.
  • Luo, Y.; Tang, R.L. Understanding hidden dimensions in textual reviews on Airbnb: An application of modified latent aspect rating analysis (LARA). Int. J. Hosp. Manag. 2019, 80, 144–154.
  • Ngo-Ye, T.L.; Sinha, A.P. The influence of reviewer engagement characteristics on online review helpfulness: A text regression model. Decis. Support Syst. 2014, 61, 47–58.
  • Zhou, S.; Guo, B. The order effect on online review helpfulness: A social influence perspective. Decis. Support Syst. 2017, 93, 77–87.
  • Li, H.; Wang, C.R.; Meng, F.; Zhang, Z. Making restaurant reviews useful and/or enjoyable? The impacts of temporal, explanatory, and sensory cues. Int. J. Hosp. Manag. 2018.
  • Barreda, A.; Bilgihan, A. An analysis of user-generated content for hotel experiences. J. Hosp. Tour. Technol. 2013, 4, 263–280.
  • Lewis, S.C.; Zamith, R.; Hermida, A. Content analysis in an era of big data: A hybrid approach to computational and manual methods. J. Broadcast. Electron. Media 2013, 57, 34–52.
  • Allahyari, M.; Pouriyeh, S.; Assefi, M.; Safaei, S.; Trippe, E.D.; Gutierrez, J.B.; Kochut, K. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. 2017. Available online: https://arxiv.org/pdf/1707.02919.pdf (accessed on 4 August 2019).
  • Ali, F.; Kwak, K.S.; Kim, Y.G. Opinion mining based on fuzzy domain ontology and Support Vector Machine: A proposal to automate online review classification. Appl. Soft Comput. 2016, 47, 235–250.
  • Aliandu, P. Sentiment analysis to determine accommodation, shopping and culinary location on foursquare in Kupang city. Procedia Comput. Sci. 2015, 72, 300–305.
  • Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: New York, NY, USA, 2013.
  • Zhang, Z.; Ye, Q.; Zhang, Z.; Li, Y. Sentiment classification of internet restaurant reviews written in Cantonese. Expert Syst. Appl. 2011, 38, 7674–7682.
  • Rafi, M.; Hassan, S.; Shaikh, M.S. Content-Based Text Categorization Using Wikitology. 2012. Available online: https://arxiv.org/pdf/1208.3623.pdf (accessed on 6 August 2019).
  • Lau, R.Y.K.; Li, C.; Liao, S.S.Y. Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis. Decis. Support Syst. 2014, 65, 80–94.
  • Qazi, A.; Syed KB, S.; Raj, R.G.; Cambria, E.; Tahir, M.; Alghazzawi, D. A concept-level approach to the analysis of online review helpfulness. Comput. Hum. Behav. 2016, 58, 75–81.
  • Hu, Y.H.; Chen, K. Predicting hotel review helpfulness: The impact of review visibility, and interaction between hotel stars and review ratings. Int. J. Inf. Manag. 2016, 36, 929–944.
  • Fang, B.; Ye, Q.; Kucukusta, D.; Law, R. Analysis of the perceived value of online tourism reviews: Influence of readability and reviewer characteristics. Tour. Manag. 2016, 52, 498–506.
  • Lee, M.; Jeong, M.; Lee, J. Roles of negative emotions in customers’ perceived helpfulness of hotel reviews on a user-generated review website: A text mining approach. Int. J. Contemp. Hosp. Manag. 2017, 29, 762–783.
  • Yang, S.B.; Shin, S.H.; Joun, Y.; Koo, C. Exploring the comparative importance of online hotel reviews’ heuristic attributes in review helpfulness: A conjoint analysis approach. J. Travel Tour. Mark. 2017, 34, 963–985.
  • Hu, Y.H.; Chen, K.; Lee, P.J. The effect of user-controllable filters on the prediction of online hotel reviews. Inf. Manag. 2017, 54, 728–744.
  • Gao, B.; Hu, N.; Bose, I. Follow the herd or be myself? An analysis of consistency in behavior of reviewers and helpfulness of their reviews. Decis. Support Syst. 2017, 95, 1–11.
  • Filieri, R.; Raguseo, E.; Vitari, C. When are extreme ratings more helpful? Empirical evidence on the moderating effects of review characteristics and product type. Comput. Hum. Behav. 2018, 88, 134–142.
  • Ma, Y.; Xiang, Z.; Du, Q.; Fan, W. Effects of user-provided photos on hotel review helpfulness: An analytical approach with deep leaning. Int. J. Hosp. Manag. 2018, 71, 120–131.
  • Lee, P.J.; Hu, Y.H.; Lu, K.T. Assessing the helpfulness of online hotel reviews: A classification-based approach. Telemat. Inform. 2018, 35, 436–445.
  • Liang, S.; Schuckert, M.; Law, R. How to improve the stated helpfulness of hotel reviews? A multilevel approach. Int. J. Contemp. Hosp. Manag. 2019, 31, 953–977.
  • Michaels, M. The 25 Best Places to Travel in the US This Year, According to TripAdvisor Reviews. Business Insider. March 2018. Available online: https://www.businessinsider.com/tripadvisor-best-places-to-travel-america-2018-3 (accessed on 1 July 2019).
  • Guo, Y.; Barnes, S.J.; Jia, Q. Mining meaning from online ratings and reviews: Tourist satisfaction analysis using latent dirichlet allocation. Tour. Manag. 2017, 59, 467–483.
  • Hong, Y.; Lu, J.; Yao, J.; Zhu, Q.; Zhou, G. What reviews are satisfactory: Novel features for automatic helpfulness voting. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12–16 August 2012; pp. 495–504.
  • Kim, S.M.; Pantel, P.; Chklovski, T.; Pennacchiotti, M. Automatically assessing review helpfulness. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, 22–23 July 2006; pp. 423–430.
  • Metsis, V.; Androutsopoulos, I.; Paliouras, G. Spam filtering with näive bayes-which näive bayes? In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA, 27–28 July 2006; Volume 17, pp. 28–69.
  • Ng, A.Y.; Jordan, M.I. On discriminative vs. generative classifiers: A comparison of logistic regression and näive Bayes. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers: San Mateo, CA, USA, 2002; pp. 841–848.
  • Gladence, L.M.; Karthi, M.; Anu, V.M. A statistical comparison of logistic regression and different Bayes classification methods for machine learning. ARPN J. Eng. Appl. Sci. 2015, 10, 5947–5953.
  • Suykens, J.A.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300.
  • Malik, M.; Hussain, A. Helpfulness of product reviews as a function of discrete positive and negative emotions. Comput. Hum. Behav. 2017, 73, 290–302.
  • Cuizon, J.C.; Lopez, J.; Jones, D.R. Text mining customer reviews for aspect-based restaurant rating. Int. J. Comput. Sci. Inf. Technol. 2019, 10, 43–51.
  • Liu, H.; He, J.; Wang, T.; Song, W.; Du, X. Combining user preferences and user opinions for accurate recommendation. Electron. Commer. Res. Appl. 2013, 12, 14–23.
  • Pronoza, E.; Yagunova, E.; Volskaya, S. Aspect-Based Restaurant Information Extraction for the Recommendation System. In Lecture Notes in Computer Science, Proceedings of the Human Language Technology. Challenges for Computer Science and Linguistics (LTC 2013), Poznań, Poland, 7–9 December 2013; Vetulani, Z., Uszkoreit, H., Kubis, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9561.
  • Gao, S.; Tang, O.; Wang, H.; Yin, P. Identifying competitors through comparative relation mining of online reviews in the restaurant industry. Int. J. Hosp. Manag. 2018, 71, 19–32.
  • Alonso, A.D.; O’neill, M.; Liu, Y.; O’shea, M. Factors driving consumer restaurant choice: An exploratory study from the Southeastern United States. J. Hosp. Mark. Manag. 2013, 22, 547–567.
  • Gan, Q.; Ferns, B.H.; Yu, Y.; Jin, L. A text mining and multidimensional sentiment analysis of online restaurant reviews. J. Qual. Assur. Hosp. Tour. 2017, 18, 465–492.
  • Bilgihan, A.; Seo, S.; Choi, J. Identifying restaurant satisfiers and dissatisfiers: Suggestions from online reviews. J. Hosp. Mark. Manag. 2018, 27, 601–625.
  • Park, Y.-J. Predicting the helpfulness of online customer reviews across different product types. Sustainability 2018, 10, 1735.
  • Cao, Q.; Duan, W.; Gan, Q. Exploring determinants of voting for the “helpfulness” of online user reviews: A text mining approach. Decis. Support Syst. 2011, 50, 511–521.
  • Ju, S.; Chang, H. Consumer perceptions on sustainable practices implemented in foodservice organizations in Korea. Nutr. Res. Pract. 2016, 10, 108–114.
  • Dewald, B.; Bruin, B.J.; Jang, Y.J. US consumer attitudes towards “green” restaurants. Anatolia 2014, 25, 171–180.
  • Namkung, Y.; Jang, S.C. Effects of restaurant green practices on brand equity formation: Do green practices really matter? Int. J. Hosp. Manag. 2013, 33, 85–95.
  • Yelp. What Is Yelp’s Elite Squad? Available online: https://www.yelp-support.com/article/What-is-Yelps-Elite-Squad?l=en_US (accessed on 30 July 2019).
  • CNBC. 10 Best Foodie Cities in America (No.1 May Surprise You). November 2018. Available online: https://www.cnbc.com/2018/11/05/wallethub-best-food-cities-in-america.html (accessed on 8 August 2019).


| Author (Year) | Antecedents of Review Helpfulness/Usefulness | Review Platform | Number of Reviews | Targeted Location | Review Category | Methods | Main Conclusion |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Liu and Park | Reviewer characteristics (identity disclosure; expertise; reputation); review content features (review star rating; review length; review readability; review sentiment) | Yelp | 5090 | New York City, London | Restaurant | Tobit regression model | A combination of both reviewer and review characteristics positively influences review helpfulness |
| Qazi et al. | Average number of concepts per sentence; number of concepts per review; review types | TripAdvisor | 1366 | NA | Hotel | Tobit regression model | The number of concepts contained in a review, the average number of concepts per sentence, and the review type contribute to the perceived helpfulness of online reviews |
| Hu and Chen | Review content; review sentiment; review author; review visibility | TripAdvisor | 349,582 | Las Vegas, Orlando | Hotel | Model tree | Review visibility and the interaction effect of hotel star class and review rating improve the prediction accuracy |
| Fang et al. | Review readability; review sentiment; reviewer mean rating; reviewer rating habit (skewness of rating distribution) | TripAdvisor | 19,674 | New Orleans | Attractions | Negative binomial regression and Tobit regression model | Text readability and reviewer characteristics affect perceived review helpfulness |
| Kwok and Xie | Number of words; number of sentences; reviewer gender; reviewer age; ratings; reviewer experience (status; membership; city visited) | TripAdvisor | 56,284 | Austin, Dallas, Fort Worth, Houston, San Antonio | Hotel | Linear regression | The helpfulness of online hotel reviews is positively affected by manager response and reviewer status |
| Lee et al. | Negative emotional expressions | TripAdvisor | 520,668 | New York City | Hotel | Negative binomial regression | Negative reviews are more influential than positive reviews when potential customers read online hotel reviews for their future stay |
| Yang et al. | Heuristic attributes (reviewer location, reviewer level, reviewer helpful vote, review rating, review length, and review photo) | TripAdvisor | 1158 | New York City | Hotel (a single case) | Conjoint analysis | Review rating and reviewer helpful vote attributes are the two most important factors in predicting review helpfulness |
| Hu et al. | Review quality; review sentiment; reviewer characteristics | TripAdvisor | 1,434,004 | New York City, Las Vegas, Chicago, Orlando, Miami | Hotel | Linear regression, reduced error-pruning tree, random forest | Review rating and number of words predict review helpfulness across different users’ travel regions, travel seasons, and travel types |
| Zhou and Guo | Review order | Yelp | 70,610 | Atlanta, Chicago, Los Angeles, New York, Washington, D.C. | Restaurant | Negative binomial regression | A review’s position in the sequence of reviews influences review helpfulness |
| Gao et al. | Reviewer characteristics (e.g., absolute rating bias; number of cities visited; total number of reviews); hotel rating | TripAdvisor | 8676 | New York City | Hotel | Ordinary Least Squares (OLS) and ordered logistic regression | A reviewer’s higher absolute rating bias in past reviews influences the helpfulness of their future reviews |
| Filieri et al. | Extreme rating | TripAdvisor | 11,358 | France | Hotel | Tobit regression analysis | Extreme reviews that are long and accompanied by the reviewers’ photos are perceived to be more helpful |
| Ma et al. | Textual content; visual content | TripAdvisor; Yelp | 37,392 | Orlando | Hotel | Decision tree, Support Vector Machine with linear kernel (SVM), logistic regression | Deep learning models combining both review texts and user-provided photos were more useful in predicting review helpfulness than other models |
| Lee et al. | Review quality; review sentiment; reviewer characteristics | TripAdvisor | 1,170,246 | New York City, Las Vegas, Chicago, Orlando, Miami | Hotel | Classification-based approach | Reviewer characteristics are good predictors of review helpfulness, whereas review quality and review sentiment are poor predictors of review helpfulness |
| Li et al. | Temporal cues (time-related words); explanatory cues (causation-related words); sensory cues (see, hear, feel) | Yelp | 186,714 | Las Vegas | Restaurant | Negative binomial regression | Temporal cues have the strongest impact on review usefulness |
| Liang et al. | Review content quality (review depth; review extremity; review readability); reviewer characteristics (expertise; reputation; identity disclosure; cultural background); hotel features (ratings; ranking; number of rooms and photos) | TripAdvisor | 246,963 | Beijing, Shanghai, Guangzhou, Hong Kong (China) | Hotel | Multilevel model | Informative and readable reviews accompanied by extreme ratings are perceived to be more helpful |
| Wang et al. | Emotional content; linguistic style | Yelp | 262,205 | San Diego, Philadelphia, Houston, Atlanta, Las Vegas, Miami, Anaheim, Chicago, New York City, and Orlando | Restaurant | Negative binomial regression | Joy, sadness, anger, fear, trust, disgust, and linguistic style matching impact review helpfulness |
City | Number of Reviews
Las Vegas | 85,558
Los Angeles | 105,513
New York | 102,963
Grand Total | 294,034
Star Rating | Review Count | %
1 | 22,802 | 7.76%
2 | 17,356 | 5.90%
3 | 29,240 | 9.94%
4 | 66,974 | 22.78%
5 | 157,662 | 53.62%
Grand Total | 294,034 | 100%
Model | F1 | Recall% | Precision%
NB (Naïve Bayes) | 67.68 | 64.34 | 71.39
NB+SVM (Support Vector Machine) | 71.20 | 72.96 | 69.52
SVM_FDO (Fuzzy Domain Ontology) | 79.59 | 77.65 | 81.62

Share and Cite

Luo, Y.; Xu, X. Predicting the Helpfulness of Online Restaurant Reviews Using Different Machine Learning Algorithms: A Case Study of Yelp. Sustainability 2019 , 11 , 5254. https://doi.org/10.3390/su11195254

Published on 2.7.2024 in Vol 10 (2024)

A Comprehensive Youth Diabetes Epidemiological Data Set and Web Portal: Resource Development and Case Studies

Authors of this article:

Original Paper

  • Catherine McDonough 1*, MS
  • Yan Chak Li 1*, MPhil
  • Nita Vangeepuram 2,3, MPH, MD
  • Bian Liu 3, PhD
  • Gaurav Pandey 1, PhD

1 Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States

2 Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, NY, United States

3 Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, United States

*these authors contributed equally

Corresponding Author:

Gaurav Pandey, PhD

Department of Genetics and Genomic Sciences

Icahn School of Medicine at Mount Sinai

1 Gustave L. Levy Pl

New York, NY, 10029

United States

Phone: 1 212 241 6500

Email: [email protected]

Background: The prevalence of type 2 diabetes mellitus (DM) and pre–diabetes mellitus (pre-DM) has been increasing among youth in recent decades in the United States, prompting an urgent need for understanding and identifying their associated risk factors. Such efforts, however, have been hindered by the lack of easily accessible youth pre-DM/DM data.

Objective: We aimed to first build a high-quality, comprehensive epidemiological data set focused on youth pre-DM/DM. Subsequently, we aimed to make these data accessible by creating a user-friendly web portal to share them and the corresponding code. Through this work, we hope to address this significant gap and facilitate youth pre-DM/DM research.

Methods: Building on data from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018, we cleaned and harmonized hundreds of variables relevant to pre-DM/DM (fasting plasma glucose level ≥100 mg/dL or glycated hemoglobin ≥5.7%) for youth aged 12-19 years (N=15,149). We identified individual factors associated with pre-DM/DM risk using bivariate statistical analyses and predicted pre-DM/DM status using our Ensemble Integration (EI) framework for multidomain machine learning. We then developed a user-friendly web portal named Prediabetes/diabetes in youth Online Dashboard (POND) to share the data and code.

Results: We extracted 95 variables potentially relevant to pre-DM/DM risk organized into 4 domains (sociodemographic, health status, diet, and other lifestyle behaviors). The bivariate analyses identified 27 significant correlates of pre-DM/DM (P<.001, Bonferroni adjusted), including race or ethnicity, health insurance, BMI, added sugar intake, and screen time. Among these factors, 16 were also identified based on the EI methodology (Fisher P of overlap=7.06×10⁻⁶). The EI approach also identified 11 additional predictive variables, including some known (eg, meat and fruit intake and family income) and less recognized factors (eg, number of rooms in homes). The factors identified in both analyses spanned all 4 of the domains mentioned. These data and results, as well as other exploratory tools, can be accessed on POND.

Conclusions: Using NHANES data, we built one of the largest public epidemiological data sets for studying youth pre-DM/DM and identified potential risk factors using complementary analytical approaches. Our results align with the multifactorial nature of pre-DM/DM with correlates across several domains. Also, our data-sharing platform, POND, facilitates a wide range of applications to inform future youth pre-DM/DM studies.

Introduction

Type 2 diabetes mellitus (DM) is a complex disease influenced by several biological and epidemiological factors [ 1 , 2 ], such as obesity [ 3 ], family history [ 4 ], diet [ 1 , 5 ], physical activity level [ 1 , 6 - 8 ], and socioeconomic status [ 9 - 11 ]. Prediabetes, characterized by elevated blood glucose levels below the diabetes threshold, is a precursor condition to DM [ 12 ]. There has been an alarming increase in the prevalence of pre–diabetes mellitus (pre-DM) and DM among youth both in the United States [ 13 - 19 ] and worldwide [ 20 , 21 ], and the number of newly diagnosed youth living with pre-DM/DM is also expected to increase [ 14 , 20 , 22 ]. The latest estimate based on nationally representative data showed that the prevalence of pre-DM among youth increased from 11.6% in 1999-2002 to 28.2% in 2015-2018 in the United States [ 13 ]. This growth is particularly concerning because pre-DM/DM disproportionately affects racial and ethnic minority groups and those with low socioeconomic status [ 9 - 11 , 22 - 24 ], leading to significant health disparities. Having pre-DM/DM at a younger age also confers a higher health and economic burden resulting from living with the condition for more years and a higher risk of developing other cardiometabolic diseases [ 25 - 30 ]. This serious challenge calls for increased translational research into factors associated with pre-DM/DM among youth and how they can collectively affect disease risk and inform prevention strategies.

In particular, the most critically needed research in this direction is exploring the collective impact of various risk factors across multiple health-related domains. While clinical factors, such as obesity, have been mechanistically linked to insulin resistance [ 31 ], it is important to consider the broader perspective. There is an increasing recognition that social determinants of health (SDoH) play a significant role in amplifying the risk of pre-DM/DM and their related disparities. For example, factors such as limited access to health care, food and housing insecurity, and the neighborhood-built environment have been identified as influential contributors [ 9 - 11 , 32 ]. However, to gain a comprehensive understanding, it is essential to delve into other less studied variables, such as screen time, acculturation, or frequency of eating out, and examine how they interact to increase the risk of pre-DM/DM among youth [ 2 ].

One of the major challenges that has limited translational research into youth pre-DM/DM risk factors is the lack of publicly available, easily accessible data that comprehensively profile interrelated epidemiological factors for young individuals [ 2 ]. Specifically, most available public diabetes data portals focus on providing aggregated descriptive trends, such as pre-DM/DM prevalence for the entire population or subgroups stratified by race and ethnicity [ 33 - 36 ], which does not allow in-depth examination of the relationships between multiple risk factors and pre-DM/DM risk using individual-level data. Although a few individual-level public diabetes data sets exist [ 37 - 41 ], they include mainly clinical measurements, and their coverage of other important risk factors, such as those related to diet, physical activity, and SDoH, is limited. In addition, these data sets focus exclusively on adult populations and are not available for youth [ 37 , 39 - 41 ]. Furthermore, these data sets are not accompanied by any user-friendly web-based portals that can help explore or analyze the data to reveal knowledge about youth pre-DM/DM. Overall, a comprehensive data set that includes multiple epidemiological variables for studying youth pre-DM/DM, together with easily usable functionalities to explore and analyze it, is lacking.

To directly address this data gap, we turned to the National Health and Nutrition Examination Survey (NHANES), which offers a promising path for examining pre-DM/DM among the US youth population by providing a rich source of individual- and household-level epidemiological factors. As a result, NHANES has been a prominent data source for studying youth pre-DM/DM trends and associated factors [ 18 , 42 - 45 ]. However, the use of NHANES data requires extensive data processing that is laborious and time-intensive [ 46 ]. This represents a major challenge for the widespread use of these high-quality and extensive data for studying youth pre-DM/DM.

In this work, we directly addressed the above challenges by processing NHANES data from 1999 to 2018 into a large-scale, youth diabetes–focused data set that covers a variety of relevant variable domains, namely, sociodemographic factors, health status indicators, diet, and other lifestyle behaviors. We also provided public access to this high-quality comprehensive youth pre-DM/DM data set, as well as functionalities to explore and analyze it, through the user-friendly Prediabetes/diabetes in youth Online Dashboard (POND) [ 47 ]. We demonstrated the data set’s use and potential through 2 case studies that used statistical analyses and machine learning (ML) approaches, respectively, to identify important epidemiological factors that are associated with youth pre-DM/DM.

Through this work, we aim to advance youth diabetes research by providing the most comprehensive epidemiological data set available through a public web portal and illustrating the value of these resources through our example case studies based on statistical analyses and ML. Our overarching goal is to enable researchers to investigate the multifactorial variables associated with youth pre-DM/DM, which may drive translational advances in prevention and management strategies.

Figure 1 [ 48 ] shows the overall study design and workflow. In the following subsections, we detail the components of the workflow.

Data Source and Study Population

We built the youth pre-DM/DM data set based on publicly available NHANES data [ 49 ] spanning the years from 1999 to 2018. Developed by the Centers for Disease Control and Prevention, NHANES is a serial cross-sectional survey that gathers comprehensive health-related information from nationally representative samples of the noninstitutionalized population in the United States. The survey uses a multistage probability sampling method and collects data through questionnaires, physical examinations, and biomarker analysis. Each year, approximately 5000 individuals are included in the survey, and the data are publicly released in 2-year cycles.

Figure 2 details the process used to define our study population. Briefly, of the total 101,316 participants in 1999-2018 NHANES, we excluded individuals who (1) were not within the 12-19 years age range, (2) did not have either of the biomarkers used to define pre-DM/DM status, and (3) answered “Yes” to “Have you ever been told by a doctor or health professional that you have diabetes?” The youth pre-DM/DM outcome of this work was derived as follows: youth were considered at risk of pre-DM/DM if their fasting plasma glucose (FPG) was at or greater than 100 mg/dL, or their glycated hemoglobin (HbA1c) was at or greater than 5.7%, according to the current American Diabetes Association (ADA) pediatric clinical guidelines [ 2 ].
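
The exclusion criteria and outcome definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' SAS/R pipeline: the argument names (`age`, `fpg`, `hba1c`, `told_diabetes`) are hypothetical and are not NHANES variable codes, and criterion (2) is read here as "neither biomarker is available", which is an assumption.

```python
from typing import Optional

FPG_CUTOFF = 100.0   # mg/dL, threshold stated in the text
HBA1C_CUTOFF = 5.7   # %, threshold stated in the text


def is_eligible(age: int, fpg: Optional[float], hba1c: Optional[float],
                told_diabetes: bool) -> bool:
    """Apply the three exclusion criteria described above."""
    if not 12 <= age <= 19:
        return False                      # (1) outside the 12-19 age range
    if fpg is None and hba1c is None:
        return False                      # (2) no biomarker (assumed reading)
    if told_diabetes:
        return False                      # (3) self-reported diagnosed diabetes
    return True


def at_risk(fpg: Optional[float], hba1c: Optional[float]) -> bool:
    """Outcome: FPG >= 100 mg/dL or HbA1c >= 5.7%, using whichever is measured."""
    high_fpg = fpg is not None and fpg >= FPG_CUTOFF
    high_hba1c = hba1c is not None and hba1c >= HBA1C_CUTOFF
    return high_fpg or high_hba1c
```

A participant with only one of the two biomarkers is classified on that biomarker alone in this sketch; whether the original pipeline handles partially missing biomarkers the same way is not stated in the text.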

Validation of the Study Population

We estimated pre-DM/DM prevalence across the 10 survey cycles (1999-2018) by incorporating the NHANES design elements in the analysis and compared the general trend with those reported in the literature [ 18 , 19 ]. We also specifically applied the analytical methods reported in a recent study [ 13 ] based on NHANES data to our study population to replicate the trends in pre-DM among youth in the United States from 1999 to 2018 reported in that analysis. Specifically, that study selected a youth population from 12-19 years of age with positive sampling weight from the fasting subsample (ie, nonzero and nonmissing Fasting Subsample 2 Year Mobile Examination Centers Weight [“WTSAF2YR”]; personal communication) without a self-reported physician-diagnosed DM. In addition, that study focused only on pre-DM, which was defined as an HbA1c level between 5.7% and 6.4% or an FPG level between 100 mg/dL and 125 mg/dL [ 13 ].
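
As a rough illustration of what a survey-weighted prevalence point estimate involves, the sketch below computes a weighted proportion within one survey cycle and drops records whose sampling weight is missing or zero, mirroring the positive-weight restriction described above. The function name and input layout are hypothetical. The 95% CIs reported in this paper additionally require design-based variance estimation using the NHANES strata and sampling units, which is not shown here.

```python
from typing import Iterable, Optional, Tuple


def weighted_prevalence(rows: Iterable[Tuple[Optional[float], bool]]) -> float:
    """Weighted proportion with the outcome.

    rows: (sampling_weight, has_outcome) pairs for one survey cycle,
    for example a fasting subsample weight such as WTSAF2YR.
    """
    numerator = 0.0
    denominator = 0.0
    for weight, has_outcome in rows:
        if weight is None or weight <= 0:
            continue                      # keep only positive, nonmissing weights
        denominator += weight
        if has_outcome:
            numerator += weight
    if denominator == 0:
        raise ValueError("no records with a positive sampling weight")
    return numerator / denominator


# Toy records: the last two are ignored because of their weights.
example = [(1000.0, True), (3000.0, False), (None, True), (0.0, True)]
prevalence = weighted_prevalence(example)
```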

Development of Youth Pre-DM/DM Data Set

Based on the most recent ADA standard of care recommendations including factors related to pre-DM/DM risk and management [ 2 ], we selected 27 potentially relevant NHANES questionnaires and grouped them into 4 domains: sociodemographic, health status, diet, and other lifestyle behaviors. For example, under the health status domain, BMI was included as a potential risk factor for youth pre-DM/DM [ 2 ]. Similarly, lifestyle and behavioral variables included factors, such as diet and physical activity, that have been shown to be critical for pre-DM/DM prevention in both observational studies and randomized clinical trials [ 50 - 52 ]. Our sociodemographic domain included demographic, socioeconomic, and SDoH variables (eg, age, gender, poverty status, and food security). Except for commonly available clinical measurements, such as blood pressure and total cholesterol, we did not include laboratory data (eg, triglycerides, transferrin, C-reactive protein, interleukin-6, and white blood cells), since these measurements were not collected for all NHANES participants and were not commonly accessible for the general population.

From the selected questionnaires, we identified a list of 95 variables based on the aforementioned methodology. The complete list of variables is provided in Table S1 in Section S1 of Multimedia Appendix 1 [ 13 , 49 , 53 - 62 ] and on our POND web portal [ 47 ]. All the code developed, processed data, and detailed description of variables are also available on the web portal [ 47 ]. The process of extracting these variables involved extensive examination of the questions that were asked, consultation of the literature, and discussions to reach consensus within the study team. The details of this process are provided in Figure S1 and Section S2 of Multimedia Appendix 1 . We used SAS (version 9.4; SAS Institute) and R (version 4.2.2; R Core Team, 2022) in R Studio (version 4.2.2; R Core Team, 2022) for data processing and data set development.

Building the POND

To facilitate other researchers’ use of our youth pre-DM/DM data set and make our methodology transparent and reproducible, we developed POND to share our processed data set and enable users to understand and explore the data on their own. The web portal was developed using R markdown and the flexdashboard package [ 63 ] and was published as a Shiny application [ 64 ]. Table S2 and Section S3 in Multimedia Appendix 1 provide details of all the R packages used to develop POND, and the related code is available on the portal’s download page.

Case Studies in Using the Data Set to Better Understand Youth Pre-DM/DM

To examine the validity and use of our data set for advancing translational research on youth pre-DM/DM, we conducted 2 complementary data analyses. We first conducted bivariate analyses to assess the statistical associations between each of the 95 variables and youth pre-DM/DM status. In the second analysis, we used ML methods to examine the ability to predict pre-DM/DM status of youth based on the 95 variables. The methodological details of these analyses are provided in the following subsections.

Bivariate Analyses to Identify Variables Associated With Pre-DM/DM Status

We examined associations between individual variables and youth pre-DM/DM status using chi-square and Wilcoxon rank sum tests for categorical and continuous variables, respectively. Cell sizes were checked for sufficient size (≥5) prior to chi-square tests. Independence and equal variance were assessed for continuous variables. Distribution normality was ensured through adequate sample size in accordance with Central Limit Theorem [ 65 ]. We applied Bonferroni correction for multiple hypothesis testing (n=95 tests) at an α level of .05 to determine the statistical significance of each association at the adjusted α level of .0005 (ie, approximately 0.05/95). We used Cramer V and Wilcoxon R values [ 66 ] as the effect size measures for categorical and continuous variables, respectively. To better compare with results from the ML approach, the main bivariate analyses did not account for NHANES survey design; thus, the results were applicable only to the study population included in the analytical sample and were not generalizable to the entire US youth population. For completeness, we provide the survey-weighted analyses using NHANES examination weights (“WTMEC2YR”) in Section S4 of Multimedia Appendix 1 .
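
For a binary variable, the chi-square test, Cramer V effect size, and Bonferroni threshold described above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: it uses the Pearson statistic without continuity correction, and the example counts are the unweighted female and male counts by pre-DM/DM status taken from Table 1.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(stat: float) -> float:
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def cramers_v(stat: float, n: int, n_rows: int = 2, n_cols: int = 2) -> float:
    """Cramer V effect size for an n_rows x n_cols contingency table."""
    return math.sqrt(stat / (n * (min(n_rows, n_cols) - 1)))


N_TESTS = 95
bonferroni_alpha = 0.05 / N_TESTS        # about 0.000526

# Rows: with / with no pre-DM/DM; columns: female / male (Table 1 counts).
with_female, with_male = 691, 2010 - 691
without_female, without_male = 6739, 13139 - 6739
stat = chi_square_2x2(with_female, with_male, without_female, without_male)
p = p_value_df1(stat)
v = cramers_v(stat, n=15149)
significant = p < bonferroni_alpha
```

Continuous variables would instead go through the Wilcoxon rank sum test, which is omitted from this sketch.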

Prediction of Pre-DM/DM Status Using ML Algorithms

Several ML algorithms have been used to predict adult pre-DM/DM status using NHANES data [ 67 - 69 ], and we have previously used these algorithms to predict pre-DM/DM status specifically among youth in a subsample of our current study population [ 42 ]. We expanded these existing analyses by taking into account the multidomain nature of our data set with the goal of building an effective and interpretable predictive model of youth pre-DM/DM. To that end, we leveraged our recently developed ML framework, Ensemble Integration (EI) [ 53 , 54 ], with all 4 domains and their variables in our data set. EI incorporates both consensus and complementarity in our data set by first inferring local predictive models from the individual domains, that is, sociodemographic, health status, diet, and other lifestyle behaviors, that are expected to capture information and interactions specific to the domains. These local models and information are then integrated into a global, comprehensive pre-DM/DM prediction model using heterogeneous ensemble algorithms [ 70 ] (Figure S2, Table S3, and Table S4 under Section S5 in Multimedia Appendix 1 ). These algorithms, such as stacking, allow the integration of an unrestricted number and variety of local models into the global predictive model, thus offering improved performance and robustness. EI also enables the identification of the most predictive variables in the final model, thus offering deeper insights into the outcome being predicted.
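
The two-level idea described above (local models per domain, then a stacked global model) can be illustrated with a small standard-library sketch. It is not the EI implementation: the local models here are plain gradient-descent logistic regressions chosen for brevity (an assumption; EI uses a variety of base learners), the stacker is trained on out-of-fold local predictions, and all function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def _sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 200) -> Model:
    """Logistic regression fitted by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model: Model, X: Sequence[Sequence[float]]) -> List[float]:
    w, b = model
    return [_sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def select(X, rows, cols):
    """Sub-matrix of X restricted to the given rows and domain columns."""
    return [[X[i][c] for c in cols] for i in rows]


def fit_stacked(X, y, domains: Dict[str, List[int]], k: int = 5, seed: int = 0):
    """Fit one local model per domain plus a stacker on out-of-fold predictions."""
    n = len(X)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    names = list(domains)
    oof = [[0.0] * len(names) for _ in range(n)]
    for held in folds:
        held_set = set(held)
        train = [i for i in order if i not in held_set]
        y_train = [y[i] for i in train]
        for d, name in enumerate(names):
            local = fit_logistic(select(X, train, domains[name]), y_train)
            preds = predict_proba(local, select(X, held, domains[name]))
            for i, prob in zip(held, preds):
                oof[i][d] = prob
    meta = fit_logistic(oof, y)  # global model over local predictions
    local_models = {name: fit_logistic(select(X, range(n), domains[name]), y)
                    for name in names}
    return local_models, meta


def predict_stacked(local_models, meta, X, domains):
    rows = range(len(X))
    base = [predict_proba(local_models[name], select(X, rows, domains[name]))
            for name in local_models]
    return predict_proba(meta, [list(col) for col in zip(*base)])
```

Training the stacker on out-of-fold predictions, rather than on predictions for the same rows the local models were fitted on, keeps the global model from simply rewarding whichever local model overfits the most.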

We used both the aforementioned capabilities of EI to build and interpret a predictive model of youth pre-DM/DM status based on our data set. We also compared the predictive performance of the model with three alternative approaches: (1) a modified form of the ADA screening guideline [ 55 ], which is based on BMI, total cholesterol level, hypertension, and race or ethnicity, to assess the use of data-driven screening for youth pre-DM/DM (Table S5 in Multimedia Appendix 1 ); (2) EI applied to individual variable domains, namely, sociodemographic, health status, diet, and other lifestyle behaviors, to assess the value of multidomain data for youth pre-DM/DM prediction; and (3) extreme gradient boosting (XGBoost) [ 71 ] applied to our combined multidomain data set as a representative alternate ML algorithm. This alternative was chosen as XGBoost is considered the most effective classification algorithm for tabular data [ 72 ], since it can potentially capture feature interactions across different domains [ 73 , 74 ]. The prediction performance of EI and all the alternative approaches was assessed in terms of the commonly used area under the receiver operating characteristic curve (AUROC) [ 75 ] and balanced accuracy (BA; average of specificity and sensitivity) [ 76 ] measures. The performance of the ML-based prediction approaches, namely, multi- and single-domain EI and XGBoost, was evaluated in a 5-fold cross-validation setting repeated 10 times [ 77 ]. These performance scores were statistically compared using the Wilcoxon rank sum test, and the resultant P values were corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure to yield false discovery rates (FDRs) [ 78 ]. More details of ML model building; the alternative approaches; and the evaluation methodology, including cross-validation, model selection, and comparison, are available in Section S5 in Multimedia Appendix 1 .
Finally, we used EI’s interpretation capabilities [ 53 , 54 ] to identify the variables in our data set that were the most predictive of youth pre-DM/DM status and compare them with the variables identified from the bivariate analyses described in the above subsection.
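
The two performance measures named above can be computed directly from their definitions. The following standard-library sketch uses the pairwise (Mann-Whitney) formulation of AUROC, which is quadratic in sample size and intended only for illustration; the function names and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average of sensitivity and specificity for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes are required")
    return (tp / (tp + fn) + tn / (tn + fp)) / 2.0


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]   # illustrative 0.5 cutoff
example_auroc = auroc(labels, scores)
example_ba = balanced_accuracy(labels, predicted)
```

Because BA weights sensitivity and specificity equally, it is less affected than plain accuracy by the class imbalance in this data set, where roughly 13% of the study population has the outcome.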

Ethical Considerations

This study used existing deidentified and anonymized data in the public domain directly downloadable from the NHANES website and thus, according to the Common Rule, was exempt from institutional review board review and the informed consent requirement. NHANES was conducted by the Centers for Disease Control and Prevention National Center for Health Statistics. NHANES survey procedures and protocol were approved by the National Center for Health Statistics ethics review board for each survey cycle [ 79 ].

Study Population Derived From NHANES

Our study population consisted of 15,149 youths aged 12-19 years who participated in the 1999-2018 NHANES cycles and met our selection criteria ( Figure 2 ). Approximately 13.3% (2010/15,149) of the youth in the study population were at risk of pre-DM/DM according to the clinically standard criteria for defining pre-DM/DM per ADA guidelines (FPG ≥100 mg/dL or HbA1c ≥5.7%; Table 1 ).

Variables | Overall (N=15,149) | With pre-DM/DM (n=2010; unweighted %=13.3) | With no pre-DM/DM (n=13,139)
Age (years), median (IQR) | 15 (13-17) | 15 (13-17) | 16 (14-17)
Female sex, n (%) | 7430 (49) | 691 (34.4) | 6739 (51.3)
Black, non-Hispanic | 4292 (28.3) | 676 (33.6) | 3616 (27.5)
Hispanic | 5565 (36.7) | 711 (35.4) | 4854 (36.9)
White, non-Hispanic | 4033 (26.6) | 431 (21.4) | 3602 (27.4)
Other | 1259 (8.3) | 192 (9.6) | 1067 (8.1)
Private | 6392 (43) | 744 (37.7) | 5648 (43.8)
Medicare, government, or single service | 2026 (13.6) | 268 (13.6) | 1758 (13.6)
Medicaid or CHIP | 3637 (24.4) | 564 (28.6) | 3073 (23.8)
No insurance | 2821 (19) | 395 (20) | 2426 (18.8)
Authorized for food stamps | 7833 (69.4) | 1037 (61.1) | 6796 (70.8)
BMI percentile, n (%) | | |
Underweight (BMI percentile < 5th), n (%) | 462 (3.1) | 40 (2.0) | 422 (3.2)
Normal weight (5th ≤ BMI percentile < 85th), n (%) | 8516 (56.8) | 933 (46.8) | 7583 (58.4)
Overweight (85th ≤ BMI percentile < 95th), n (%) | 2788 (18.6) | 356 (17.9) | 2432 (18.7)
Obese (95th ≤ BMI percentile), n (%) | 3214 (21.5) | 663 (33.3) | 2551 (19.6)
Hypertensive, n (%) | 2552 (17.4) | 502 (26.1) | 2050 (16.1)
High total cholesterol (≥170 mg/dL), n (%) | 4951 (33.2) | 707 (35.6) | 4244 (32.8)
Fasting plasma glucose (mg/dL), median (IQR) | 93 (88-98) | 102 (100-106) | 91 (86-95)
Hemoglobin A1c (%), median (IQR) | 5.2 (5.0-5.4) | 5.5 (5.2-5.7) | 5.2 (5.0-5.3)
Meals eaten out per week | 2 (1-3) | 2 (1-3) | 2 (1-3)
Total grain (oz eq) intake 24 hours prior | 6.55 (4.24-9.66) | 6.43 (4.19-9.58) | 6.57 (4.25-9.67)
Total fruits (cup eq) intake 24 hours prior | 0.38 (0.00-1.44) | 0.26 (0.00-1.37) | 0.40 (0.00-1.45)
Total vegetable (cup eq) intake 24 hours prior | 0.88 (0.39-1.58) | 0.84 (0.37-1.54) | 0.89 (0.39-1.59)
Total protein (oz eq) intake 24 hours prior | 5.29 (2.71-9.15) | 4.73 (2.46-8.37) | 5.38 (2.76-9.34)
Added sugar (tsp eq) intake 24 hours prior | 20.42 (11.49-32.49) | 20.09 (11.15-31.89) | 20.48 (11.57-32.59)
Physical activity minutes per week, median (IQR) | 209 (45-488) | 210 (49-476) | 209 (45-491)
Screen time hours per day, median (IQR) | 5 (3-8) | 5 (3-8) | 5 (2-7)
Exposed to secondhand smoke at home, n (%) | 3297 (21.9) | 469 (23.6) | 2828 (21.7)

a Unweighted statistics of some key variables describing the study population in the youth pre-DM/DM data set overall and by pre-DM/DM status. More detailed statistics for all the variables in our data set can be found in the Data Exploration section of POND.

b Pre-DM/DM: pre–diabetes mellitus and diabetes mellitus.

c CHIP: child health insurance program.

d Hypertensive was defined by blood pressure ≥90th percentile or ≥120/80 mm Hg for children 13 years of age and older [ 2 ].

e eq: equivalent.

We estimated that the survey-weighted prevalence of pre-DM/DM in our study population rose substantially from 4.1% (95% CI 2.8-5.4) in 1999 to 22% (95% CI 18.5-25.6) in 2018 (Figure S3 and section S6 in Multimedia Appendix 1 ). This increasing trend of pre-DM/DM prevalence was consistent with that reported in other NHANES-based studies, which had pre-DM/DM prevalence ranging from 17.7% to 18% [ 18 , 19 ]. We also applied the study population and pre-DM definition criteria reported in a recent study [ 13 ] to NHANES data and derived a similarly sized study population (n=6656 vs n=6598 in the current vs previous analysis [ 13 ]) and youth pre-DM prevalence, which ranged from 11.1% (95% CI 8.9-13.3) to 37.3% (95% CI 31.0-43.6) in our analysis compared with from 11.6% (95% CI 9.5-14.1) to 28.2% (95% CI 23.3-33.6) in the study by Liu et al [ 13 ] (Table S6 in Multimedia Appendix 1 ).

Youth Pre-DM/DM-Focused Data Set

We extracted 95 epidemiological variables from NHANES and organized them into 4 pre-DM/DM-related domains, namely, sociodemographic, health status, diet, and other lifestyle behaviors (Table S1 in Multimedia Appendix 1 ). Table 1 shows the unweighted statistics of some key study population characteristics. Among youth with pre-DM/DM (n=2010), the proportions of youth who were non-Hispanic Black, non-Hispanic White, Hispanic, and other race or ethnicity (including non-Hispanic persons who reported races other than Black or White and non-Hispanic Asian) were 33.6% (n=676), 21.4% (n=431), 35.4% (n=711), and 9.6% (n=192), respectively. Approximately half (7719/15,149, 51%) of the population were male, and they represented 65.6% (1319/2010) of those with pre-DM/DM. Approximately 32.4% (4528/15,149) of the youth had a family income below poverty level, and 69.4% (7833/15,149) were from households receiving food stamps. The proportion of youth covered by private insurance was higher among those with no pre-DM/DM than among those with pre-DM/DM (5648/13,139, 43.8% vs 744/2010, 37.7%). Overall, 21.5% (3214/15,149) of the youth were obese as defined by having a BMI at or above the 95th percentile based on age and gender, and the proportion was 33.3% (663/2010) among youth with pre-DM/DM. Youth with pre-DM/DM tended to have less fruit and vegetable intake and ate lower amounts of protein and total grains than those with no pre-DM/DM. Youth with and with no pre-DM/DM showed similar amounts of physical activity, with medians of 210 and 209 minutes per week, respectively ( Table 1 ).

Pre-DM/DM in Youth Online Dashboard

To facilitate other researchers’ use of our youth pre-DM/DM data set and make our methodology transparent and reproducible, we developed POND, which is available online [ 47 ]. Users can navigate POND through its built-in functionalities. For example, users are able to explore the details of the 95 individual variables ( Figure 3 A) and their distributions by pre-DM/DM status ( Figure 3 B), examine the risk factors of youth pre-DM/DM identified from the case studies described below ( Figure 3 C), as well as download the data for customized analysis and the analytical code to replicate our findings ( Figure 3 D). In addition, we make available all the code used to develop the data set, our case studies, and POND itself.

Case Studies Using Our Data Set to Better Understand Youth Pre-DM/DM

We examined the validity and use of our processed multidomain data set for translational studies on youth pre-DM/DM by the following 2 complementary types of data analyses.

Identifying Individual Variables Associated With Pre-DM/DM Status

In our bivariate analyses, we found 27 variables to be significantly ( P <.001, Bonferroni adjusted) associated with pre-DM/DM status ( Figure 4 [ 63 ] and Table S7 in Multimedia Appendix 1 ). These variables spanned all 4 domains and included gender, race or ethnicity, use of food stamps, health insurance status, BMI, total protein intake, and screen time. Similar results were found when repeating these bivariate association tests after accounting for NHANES survey design elements (Table S7 in Multimedia Appendix 1 ).

Predicting Youth Pre-DM/DM Status With ML

We used an ML framework, EI [ 53 , 54 ], to leverage the multidomain nature of our data set and predict youth pre-DM/DM status. We also compared EI’s performance with alternative prediction approaches, most prominently the widely used XGBoost algorithm [ 71 ].

The best-performing multidomain EI methodology, stacking [ 75 ] using logistic regression, predicted youth pre-DM/DM status (AUROC=0.67; BA=0.62) more accurately than all the alternative approaches ( Figure 5 ), namely, XGBoost (AUROC=0.64; BA=0.60; Wilcoxon rank sum FDR=1.7×10⁻⁴ and 1.8×10⁻⁴, respectively), the ADA pediatric screening guidelines (AUROC=0.57, BA=0.57; Wilcoxon rank sum FDR=1.7×10⁻⁴ and 1.8×10⁻⁴, respectively), and 4 single-domain EI (AUROC=0.63-0.54; BA=0.60-0.53; FDR <1.7×10⁻⁴ and 1.8×10⁻⁴, respectively).
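
The FDR values above come from applying the Benjamini-Hochberg procedure to the Wilcoxon rank sum P values. A minimal standard-library sketch of that adjustment follows; the function name and the input P values are hypothetical and are not the study's values.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values (FDRs) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]        # illustrative values only
fdr = benjamini_hochberg(raw_p)
```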

The multidomain EI also identified 27 variables (the same as the number of significant variables from bivariate analyses) that contributed the most to predicting youth pre-DM/DM status. Among these variables, 16 overlapped with those identified from the bivariate statistical analyses ( Figure 6 ; Fisher P of overlap=7.06×10⁻⁶). These variables identified by both approaches included some established pre-DM/DM risk factors such as BMI and high total cholesterol, as well as some less-recognized ones such as screen time and taking prescription drugs [ 2 ].
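
The significance of an overlap like this one can be assessed with a hypergeometric tail probability, which is what a one-sided Fisher exact test computes for a 2×2 table. The sketch below assumes a one-sided test over all 95 variables. The paper does not state the sidedness or exact setup, so this result need not match the reported value exactly.

```python
from math import comb


def overlap_p_value(total: int, n_a: int, n_b: int, overlap: int) -> float:
    """P(X >= overlap) when n_b items are drawn from `total`, n_a of them marked."""
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(total - n_a, n_b - k)
               for k in range(overlap, upper + 1))
    return tail / comb(total, n_b)


# 95 variables; 27 selected by each analysis; 16 selected by both.
p_overlap = overlap_p_value(total=95, n_a=27, n_b=27, overlap=16)
```

With 27 of 95 variables selected by each analysis, about 7.7 shared variables would be expected by chance (27 × 27 / 95), so an overlap of 16 is well above that.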

Principal Findings

Leveraging the rich information in NHANES spanning nearly 20 years, we built the most comprehensive epidemiological data set for studying youth pre-DM/DM. We accomplished this by selecting and harmonizing variables relevant to youth pre-DM/DM from sociodemographic, health status, diet, and other lifestyle behaviors domains. This youth pre-DM/DM data set, as well as several functionalities to explore and analyze it, is publicly available in our user-friendly web portal, POND. We also conducted case studies using the data set with both traditional statistical methods and ML approaches to demonstrate the potential of using this data set to identify factors relevant to youth pre-DM/DM. The combination of the comprehensive public data set and POND provides avenues for more informed investigations of youth pre-DM/DM.

The future translational impact of pre-DM/DM research, facilitated by comprehensive data sets such as the one developed in this study, holds significant promise for advancing our understanding of the disease and its risk factors among youth. By enabling researchers to investigate multifactorial variables associated with pre-DM/DM, this data set contributes to several areas of research and has a broader impact on the scientific community. First, the data set’s comprehensive nature allows researchers to explore the collective impact of various risk factors across multiple health domains. By incorporating sociodemographic factors, health status indicators, diet, and lifestyle behaviors, researchers can gain a holistic understanding of the interplay between these factors and pre-DM/DM risk among youth. This knowledge can be used to generate hypotheses for further studies and inform the development of targeted interventions and prevention strategies that address the specific needs of at-risk populations. Furthermore, the data set provides an opportunity to delve into less-studied variables and their interactions in relation to pre-DM/DM risk. Variables such as screen time, acculturation, or frequency of eating out, which are often overlooked in traditional research, can be examined to uncover their potential influence on pre-DM/DM risk among youth. This expands the scope of translational research and enhances our understanding of the multifaceted nature of the disease.

One of the major contributions of our work was POND, our publicly available web portal, which provided access to all materials related to our data set and analyses, thus enabling transparency and reproducibility. Although several such portals are available in other biomedical areas, such as genomics [ 76 - 78 ], there is a general lack of such tools in epidemiology and public health. We hope that, in addition to facilitating studies into pre-DM/DM, POND illustrates the use of such portals for population and epidemiological studies as well.

The results of the case studies and validation exercises we conducted were also consistent with existing literature. The case studies identified known pre-DM/DM risk factors, such as gender [ 15 , 17 , 19 ], race and ethnicity [ 2 , 9 , 10 , 24 ], health measures (BMI, hypertension, and cholesterol) [ 2 , 55 ], income [ 9 , 11 ], insurance status [ 9 , 10 ], and health care availability [ 9 , 10 ], thus affirming the validity of the data set. In addition, our analyses revealed some less studied variables, such as screen time, home ownership status, self-reported health status, soy and nut consumption, and frequency of school meal intake, which may influence youth pre-DM/DM risk. Further study of these variables may reveal new knowledge about pre-DM/DM among youth. More generally, such novel findings further demonstrate the use of our data set and data-driven methods for further translational discoveries about this complex disorder.

Limitations

Although our work has several strengths and high potential use for youth pre-DM/DM studies, it is not without limitations. First, as our data set was derived from NHANES, it inherits the limitations of the survey. Since NHANES is a cross-sectional survey, the pre-DM/DM status and its related variables provide only consecutive snapshots of youth in the United States over time across the available survey cycles. Thus, the associations identified are better suited for hypothesis generation purposes and require in-depth investigation using prospective longitudinal and randomized trial designs. In addition, we modified the ADA guideline for determining pre-DM/DM status according to variable availability. Due to the high missingness of 45% in family history (DIQ170) and the complete missingness of maternal history (DIQ175S) from 1999 to 2010 in the raw NHANES data, we were unable to include family history of diabetes in the data set. Similarly, NHANES does not provide data regarding every condition associated with insulin resistance. Therefore, we used hypertension and high cholesterol as proxies for insulin resistance. On the other hand, as our main purpose is to use POND as a conduit between this comprehensive youth pre-DM/DM database and interested researchers, our method can be adapted to longitudinal data sets should they become available in the future. Second, for the prediction of pre-DM/DM status, EI’s performance was found to be significantly better than the alternative approaches, including a modified form of the suggested guideline [ 45 ]. However, this performance assessment was based only on cross-validation, which is no substitute for the validation on external data sets that is necessary for rigorous assessment. Finally, while our preliminary case study analyses identified a wide range of variables associated with youth prediabetes and diabetes, other known risk factors, such as current asthma status [ 80 - 82 ], added sugar consumption [ 83 - 85 ], sugary fruit and juice intake [ 83 - 86 ], and physical activity per week [ 6 - 8 , 50 ], were not identified. This limitation can be addressed by using other data analysis methods beyond our bivariate testing and ML approaches, highlighting more potential use cases of our data set.

Conclusions

Overall, the future impact of translational pre-DM/DM research facilitated by comprehensive data sets and web servers like ours extends beyond individual studies. It creates opportunities for interdisciplinary collaboration and reproducibility, strengthens evidence-based decision-making, and supports the development of targeted interventions for the prevention and management of pre-DM/DM among youth. By providing rich resources, our work can enable researchers to build upon existing knowledge and push the boundaries of translational pre-DM/DM research, ultimately leading to improved health outcomes for at-risk populations.

Acknowledgments

This study was enabled in part by computational resources provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai. The Ensemble Integration used in this work was implemented by Jamie JR Bennett. This work was funded by National Institutes of Health grants R21DK131555 and R01HG011407.

Data Availability

The data set and code used in this study are available at Zenodo [ 87 ] and our web portal POND [ 47 ].

Authors' Contributions

BL and GP contributed equally as cosenior and cosupervisory authors. NV, BL, and GP conceptualized the project. CM, YCL, NV, BL, and GP designed the methodology. CM and BL implemented the data curation and bivariate analyses. YCL implemented the ML case study and POND. CM and YCL conducted formal analysis and visualization. CM, YCL, NV, BL, and GP wrote the manuscript. NV, BL, and GP supervised the project.

Conflicts of Interest

None declared.

Supplemental materials.

References

  • Temneanu OR, Trandafir LM, Purcarea MR. Type 2 diabetes mellitus in children and adolescents: a relatively new clinical problem within pediatric practice. J Med Life. 2016;9(3):235-239. [ FREE Full text ] [ Medline ]
  • ElSayed NA, Aleppo G, Aroda VR, Bannuru RR, Brown FM, Bruemmer D, et al. 2. Classification and diagnosis of diabetes: standards of care in diabetes-2023. Diabetes Care. 2023;46(Suppl 1):S19-S40. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Weiss R, Dufour S, Taksali SE, Tamborlane WV, Petersen KF, Bonadonna RC, et al. Prediabetes in obese youth: a syndrome of impaired glucose tolerance, severe insulin resistance, and altered myocellular and abdominal fat partitioning. Lancet. 2003;362(9388):951-957. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Zhang Y, Luk AOY, Chow E, Ko GTC, Chan MHM, Ng M, et al. High risk of conversion to diabetes in first-degree relatives of individuals with young-onset type 2 diabetes: a 12-year follow-up analysis. Diabet Med. 2017;34(12):1701-1709. [ CrossRef ] [ Medline ]
  • Zhuang P, Liu X, Li Y, Wan X, Wu Y, Wu F, et al. Effect of diet quality and genetic predisposition on hemoglobin A1c and type 2 diabetes risk: gene-diet interaction analysis of 357,419 individuals. Diabetes Care. 2021;44(11):2470-2479. [ CrossRef ] [ Medline ]
  • Pivovarov JA, Taplin CE, Riddell MC. Current perspectives on physical activity and exercise for youth with diabetes. Pediatr Diabetes. 2015;16(4):242-255. [ CrossRef ] [ Medline ]
  • Colberg SR, Sigal RJ, Yardley JE, Riddell MC, Dunstan DW, Dempsey PC, et al. Physical activity/exercise and diabetes: a position statement of the American Diabetes Association. Diabetes Care. 2016;39(11):2065-2079. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Thomson NM, Kraft N, Atkins RC. Cell-mediated immunity in glomerulonephritis. Aust N Z J Med. 1981;11(Suppl 1):104-108. [ Medline ]
  • Hill-Briggs F, Adler NE, Berkowitz SA, Chin MH, Gary-Webb TL, Navas-Acien A, et al. Social determinants of health and diabetes: a scientific review. Diabetes Care. 2020;44(1):258-279. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Butler AM. Social determinants of health and racial/ethnic disparities in type 2 diabetes in youth. Curr Diab Rep. 2017;17(8):60. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Walker RJ, Smalls BL, Campbell JA, Strom Williams JL, Egede LE. Impact of social determinants of health on outcomes for type 2 diabetes: a systematic review. Endocrine. 2014;47(1):29-48. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bansal N. Prediabetes diagnosis and treatment: a review. World J Diabetes. 2015;6(2):296-303. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Liu J, Li Y, Zhang D, Yi SS, Liu J. Trends in prediabetes among youths in the US from 1999 through 2018. JAMA Pediatr. 2022;176(6):608-611. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Tönnies T, Brinks R, Isom S, Dabelea D, Divers J, Mayer-Davis EJ, et al. Projections of type 1 and type 2 diabetes burden in the US population aged 20 years through 2060: the SEARCH for Diabetes in Youth Study. Diabetes Care. Feb 1, 2023;46(2):313-320. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lawrence JM, Divers J, Isom S, Saydah S, Imperatore G, Pihoker C, et al. SEARCH for Diabetes in Youth Study Group. Trends in prevalence of type 1 and type 2 diabetes in children and adolescents in the US, 2001-2017. JAMA. 2021;326(8):717-727. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Jensen ET, Dabelea D. Type 2 diabetes in youth: new lessons from the SEARCH Study. Curr Diab Rep. 2018;18(6):36. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Dabelea D, Mayer-Davis EJ, Saydah S, Imperatore G, Linder B, Divers J, et al. Prevalence of type 1 and type 2 diabetes among children and adolescents from 2001 to 2009. JAMA. 2014;311(17):1778-1786. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Andes LJ, Cheng YJ, Rolka DB, Gregg EW, Imperatore G. Prevalence of prediabetes among adolescents and young adults in the United States, 2005-2016. JAMA Pediatr. 2020;174(2):e194498. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Menke A, Casagrande S, Cowie CC. Prevalence of diabetes in adolescents aged 12 to 19 years in the United States, 2005-2014. JAMA. 2016;316(3):344-345. [ CrossRef ] [ Medline ]
  • Khan MAB, Hashim MJ, King JK, Govender RD, Mustafa H, Al Kaabi J. Epidemiology of type 2 diabetes—global burden of disease and forecasted trends. J Epidemiol Glob Health. 2020;10(1):107-111. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lin X, Xu Y, Pan X, Xu J, Ding Y, Sun X, et al. Global, regional, and national burden and trend of diabetes in 195 countries and territories: an analysis from 1990 to 2025. Sci Rep. 2020;10(1):14790. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Imperatore G, Boyle JP, Thompson TJ, Case D, Dabelea D, Hamman RF, et al. Projections of type 1 and type 2 diabetes burden in the U.S. population aged 20 years through 2050: dynamic modeling of incidence, mortality, and population growth. Diabetes Care. Dec 2012;35(12):2515-2520. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Herman WH, Ma Y, Uwaifo G, Haffner S, Kahn SE, Horton ES, et al. Diabetes Prevention Program Research Group. Differences in A1C by race and ethnicity among patients with impaired glucose tolerance in the Diabetes Prevention Program. Diabetes Care. 2007;30(10):2453-2457. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kahkoska AR, Shay CM, Crandell J, Dabelea D, Imperatore G, Lawrence JM, et al. Association of race and ethnicity with glycemic control and hemoglobin A1c levels in youth with type 1 diabetes. JAMA Netw Open. 2018;1(5):e181851. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lascar N, Brown J, Pattison H, Barnett AH, Bailey CJ, Bellary S. Type 2 diabetes in adolescents and young adults. Lancet Diabetes Endocrinol. 2018;6(1):69-80. [ CrossRef ] [ Medline ]
  • Lee AM, Fermin CR, Filipp SL, Gurka MJ, DeBoer MD. Examining trends in prediabetes and its relationship with the metabolic syndrome in US adolescents, 1999-2014. Acta Diabetol. 2017;54(4):373-381. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Weiss R, Taksali SE, Tamborlane WV, Burgert TS, Savoye M, Caprio S. Predictors of changes in glucose tolerance status in obese youth. Diabetes Care. 2005;28(4):902-909. [ CrossRef ] [ Medline ]
  • Nadeau K, Anderson B, Berg E, Chiang J, Chou H, Copeland K, et al. Youth-onset type 2 diabetes consensus report: current status, challenges, and priorities. Diabetes Care. 2016;39(9):1635-1642. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Dart A, Martens P, Rigatto C, Brownell M, Dean H, Sellers E. Earlier onset of complications in youth with type 2 diabetes. Diabetes Care. 2014;37(2):436-443. [ CrossRef ] [ Medline ]
  • American Diabetes Association. Economic costs of diabetes in the U.S. in 2017. Diabetes Care. 2018;41(5):917-928. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Al-Goblan AS, Al-Alfi MA, Khan MZ. Mechanism linking diabetes mellitus and obesity. Diabetes Metab Syndr Obes. 2014;7:587-591. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Chan JCN, Lim L, Wareham NJ, Shaw JE, Orchard TJ, Zhang P, et al. The Lancet Commission on diabetes: using data to transform diabetes care and patient lives. Lancet. 2021;396(10267):2019-2082. [ CrossRef ] [ Medline ]
  • IDF Diabetes Atlas, 10th Edition. International Diabetes Federation. URL: https://diabetesatlas.org/ [accessed 2024-05-16]
  • U.S. Chronic Disease Indicators: Diabetes | Chronic Disease and Health Promotion Data & Indicators. URL: https://chronicdata.cdc.gov/Chronic-Disease-Indicators/U-S-Chronic-Disease-Indicators-Diabetes/f8ti-h92k [accessed 2023-05-17]
  • Homepage of NCD Risk Factor Collaboration. NCD Risk Factor Collaboration. URL: https://ncdrisc.org/index.html [accessed 2023-05-17]
  • NCD Risk Factor Collaboration (NCD-RisC). Worldwide trends in diabetes since 1980: a pooled analysis of 751 population-based studies with 4.4 million participants. Lancet. 2016;387(10027):1513-1530. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • UCI Machine Learning Repository. Diabetes 130-US hospitals for years 1999-2008 Data Set. URL: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008 [accessed 2023-05-20]
  • Type 2 Diabetes Knowledge Portal. URL: https://t2d.hugeamp.org/ [accessed 2023-05-17]
  • Rashid A. Diabetes Dataset. Mendeley Data. German. Elsevier; Jul 18, 2020. URL: https://data.mendeley.com/datasets/wj9rwkp9c2/1 [accessed 2024-05-16]
  • Diabetes Dataset 2019. URL: https://www.kaggle.com/datasets/tigganeha4/diabetes-dataset-2019 [accessed 2023-05-20]
  • Diabetes Health Indicators Dataset. URL: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset [accessed 2023-05-17]
  • Vangeepuram N, Liu B, Chiu P, Wang L, Pandey G. Predicting youth diabetes risk using NHANES data and machine learning. Sci Rep. 2021;11(1):11212. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Nagarajan S, Khokhar A, Holmes DS, Chandwani S. Family consumer behaviors, adolescent prediabetes and diabetes in the national health and nutrition examination survey (2007-2010). J Am Coll Nutr. 2017;36(7):520-527. [ CrossRef ] [ Medline ]
  • Wallace AS, Wang D, Shin J, Selvin E. Screening and diagnosis of prediabetes and diabetes in US children and adolescents. Pediatrics. 2020;146(3):e20200265. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Chu P, Patel A, Helgeson V, Goldschmidt AB, Ray MK, Vajravelu ME. Perception and awareness of diabetes risk and reported risk-reducing behaviors in adolescents. JAMA Netw Open. 2023;6(5):e2311466. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Patel CJ, Pho N, McDuffie M, Easton-Marks J, Kothari C, Kohane IS, et al. A database of human exposomes and phenomes from the US National Health and Nutrition Examination Survey. Sci Data. 2016;3:160096. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • PreDM/DM in youth ONline Dashboard (POND). URL: https://rstudio-connect.hpc.mssm.edu/POND/ [accessed 2024-02-02]
  • Freepik. URL: https://www.flaticon.com [accessed 2024-05-31]
  • Zipf G, Chiappa M, Porter KS, Ostchega Y, Lewis BG, Dostal J. National health and nutrition examination survey: plan and operations, 1999-2010. Vital Health Stat 1. 2013;(56):1-37. [ FREE Full text ] [ Medline ]
  • Sampath Kumar A, Maiya AG, Shastry BA, Vaishali K, Ravishankar N, Hazari A, et al. Exercise and insulin resistance in type 2 diabetes mellitus: a systematic review and meta-analysis. Ann Phys Rehabil Med. 2019;62(2):98-103. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Karstoft K, Winding K, Knudsen SH, Nielsen JS, Thomsen C, Pedersen BK, et al. The effects of free-living interval-walking training on glycemic control, body composition, and physical fitness in type 2 diabetic patients: a randomized, controlled trial. Diabetes Care. 2013;36(2):228-236. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Karstoft K, Christensen CS, Pedersen BK, Solomon TPJ. The acute effects of interval- vs continuous-walking exercise on glycemic control in subjects with type 2 diabetes: a crossover, controlled study. J Clin Endocrinol Metab. 2014;99(9):3334-3342. [ CrossRef ] [ Medline ]
  • Li Y, Wang L, Law J, Murali T, Pandey G. Integrating multimodal data through interpretable heterogeneous ensembles. Bioinform Adv. 2022;2(1):vbac065. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bennett JJR, Li YC, Pandey G. An open-source Python package for multi-modal data integration using heterogeneous ensembles. arXiv. Preprint posted online on January 17, 2024. 2024. [ FREE Full text ] [ CrossRef ]
  • Arslanian S, Bacha F, Grey M, Marcus M, White N, Zeitler P. Evaluation and management of youth-onset type 2 diabetes: a position statement by the American Diabetes Association. Diabetes Care. 2018;41(12):2648-2668. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Centers for Disease Control and Prevention. The SAS Program for CDC Growth Charts. SAS Program. URL: https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm [accessed 2024-05-20]
  • BernardRosner. Childhood blood pressure macro-batch mode. URL: https://sites.google.com/a/channing.harvard.edu/bernardrosner/pediatric-blood-press/childhood-blood-pressure [accessed 2023-05-19]
  • United States Department of Agriculture (USDA). Food consumption and nutrient intake. URL: https://www.ers.usda.gov/data-products/food-consumption-and-nutrient-intakes/ [accessed 2024-05-20]
  • Caruana R, Niculescu-Mizil A, Crew G, Ksikes A. Ensemble selection from libraries of models. In: Machine Learning. ACM International Conference Proceeding Series; 2004. Presented at: Proceedings of the Twenty-first International Conference (ICML 2004); July 4-8 2004; Banff, Alberta, Canada.
  • Caruana R, Munson A, Niculescu-Mizil A. Getting the most out of ensemble selection. In: Machine Learning. IEEE Computer Society; 2006. Presented at: Proceedings of the 6th {IEEE} International Conference on Data Mining; March 24 2023:828-833; Hong Kong, China. URL: https://www.researchgate.net/publication/220766367_Getting_the_Most_Out_of_Ensemble_Selection
  • Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The balanced accuracy and its posterior distribution. In: Machine Learning. IEEE; 2004. Presented at: 2010 20th International Conference on Pattern Recognition; August 23-26 2010; Istanbul, Turkey. URL: https://ieeexplore.ieee.org/document/5597285/authors#authors
  • Benjamini Y, Hochberg Y. Controlling the false discovery rate. a practical and powerful approach to multiple testing. 1995;57(1):289-300. [ FREE Full text ]
  • R Markdown Format for flexible dashboards. URL: https://pkgs.rstudio.com/flexdashboard/ [accessed 2023-05-18]
  • Shiny. Welcome to shiny. URL: https://shiny.posit.co/r/getstarted/shiny-basics/lesson1/index.html [accessed 2023-05-18]
  • Kwak SG, Kim JH. Central limit theorem: the cornerstone of modern statistics. Korean J Anesthesiol. 2017;70(2):144-156. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Tomczak M, Tomczak E. The need to report effect size estimates revisited. an overview of some recommended measures of effect size. Trends Sport Sci. Feb 15, 2014;1(21):19-25.
  • Herman WH, Smith PJ, Thompson TJ, Engelgau MM, Aubert RE. A new and simple questionnaire to identify people at increased risk for undiagnosed diabetes. Diabetes Care. 1995;18(3):382-387. [ CrossRef ] [ Medline ]
  • Bang H, Edwards AM, Bomback AS, Ballantyne CM, Brillon D, Callahan MA, et al. Development and validation of a patient self-assessment score for diabetes risk. Ann Intern Med. 2009;151(11):775-783. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Poltavskiy E, Kim DJ, Bang H. Comparison of screening scores for diabetes and prediabetes. Diabetes Res Clin Pract. 2016;118:146-153. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Whalen S, Pandey OP, Pandey G. Predicting protein function and other biomedical characteristics with heterogeneous ensembles. Methods. 2016;93:92-102. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min New York. USA. Association for Computing Machinery; 2016. Presented at: Association for Computing Machinery; 2016; NY. URL: https://dl.acm.org/doi/10.1145/2939672.2939785 [ CrossRef ]
  • Shwartz-Ziv R, Armon A. Tabular data: deep learning is not all you need. Information Fusion. May 2022;81:84-90. [ FREE Full text ] [ CrossRef ]
  • Goyal K, Dumancic S, Blockeel H. Feature interactions in XGBoost. arXiv. Preprint posted online on July 11, 2020. 2020. [ FREE Full text ] [ CrossRef ]
  • Feature Interaction Constraints. XGBoost 2.0.3 documentation. URL: https://xgboost.readthedocs.io/en/stable/tutorials/feature_interaction_constraint.html [accessed 2024-02-01]
  • Sesmero MP, Ledezma AI, Sanchis A. Generating ensembles of heterogeneous classifiers using stacked generalization. WIREs Data Min & Knowl. 2015;5(1):21-34. [ CrossRef ]
  • Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, et al. BioPortal: enhanced functionality via new web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res. 2011;39(Web Server issue):W541-W545. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bhattacharya S, Andorf S, Gomes L, Dunn P, Schaefer H, Pontius J, et al. ImmPort: disseminating data to the public for the future of immunology. Immunol Res. 2014;58(2-3):234-239. [ CrossRef ] [ Medline ]
  • Zhang J, Baran J, Cros A, Guberman JM, Haider S, Hsu J, et al. International Cancer Genome Consortium Data Portal--a one-stop shop for cancer genomics data. Database (Oxford). 2011;2011:bar026. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • NHANES - NCHS Research Ethics Review Board Approval. 2022. URL: https://www.cdc.gov/nchs/nhanes/irba98.htm [accessed 2024-01-19]
  • Rayner L, McGovern A, Creagh-Brown B, Woodmansey C, de Lusignan S. Type 2 diabetes and asthma: systematic review of the bidirectional relationship. Curr Diabetes Rev. 2019;15(2):118-126. [ CrossRef ] [ Medline ]
  • Black MH, Anderson A, Bell RA, Dabelea D, Pihoker C, Saydah S, et al. Prevalence of asthma and its association with glycemic control among youth with diabetes. Pediatrics. 2011;128(4):e839-e847. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Wu TD. Diabetes, insulin resistance, and asthma: a review of potential links. Curr Opin Pulm Med. 2021;27(1):29-36. [ CrossRef ] [ Medline ]
  • Vartanian LR, Schwartz MB, Brownell KD. Effects of soft drink consumption on nutrition and health: a systematic review and meta-analysis. Am J Public Health. 2007;97(4):667-675. [ CrossRef ] [ Medline ]
  • Greenwood DC, Threapleton DE, Evans CEL, Cleghorn CL, Nykjaer C, Woodhead C, et al. Association between sugar-sweetened and artificially sweetened soft drinks and type 2 diabetes: systematic review and dose-response meta-analysis of prospective studies. Br J Nutr. 2014;112(5):725-834. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Malik VS, Popkin BM, Bray GA, Després JP, Willett WC, Hu FB. Sugar-sweetened beverages and risk of metabolic syndrome and type 2 diabetes: a meta-analysis. Diabetes Care. 2010;33(11):2477-2483. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Muraki I, Imamura F, Manson JE, Hu FB, Willett WC, van Dam RM, et al. Fruit consumption and risk of type 2 diabetes: results from three prospective longitudinal cohort studies. BMJ. 2013;347:f5001. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • McDonough C, Li Y. Youth preDM/DM dataset and Case Studies. Switzerland. Zenodo; 2024. URL: https://zenodo.org/records/10531245 [accessed 2024-05-29]

Abbreviations

ADA: American Diabetes Association
AUROC: area under the receiver operating characteristic curve
BA: balanced accuracy
DM: diabetes mellitus
EI: Ensemble Integration
FDR: false discovery rate
FPG: fasting plasma glucose
HbA1c: glycated hemoglobin
ML: machine learning
NHANES: National Health and Nutrition Examination Survey
POND: Prediabetes/diabetes in youth Online Dashboard
pre-DM: prediabetes
SDoH: social determinants of health
XGBoost: extreme gradient boosting

Edited by A Mavragani, T Sanchez; submitted 05.10.23; peer-reviewed by S El Khamlichi, C Zhao, Y Su; comments to author 09.01.24; revised version received 06.02.24; accepted 26.04.24; published 02.07.24.

©Catherine McDonough, Yan Chak Li, Nita Vangeepuram, Bian Liu, Gaurav Pandey. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 02.07.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.

Implementation of Zero Defect Manufacturing using quality prediction: a spot welding case study from Bosch

A practical guide for implementing zero defect manufacturing in new or existing manufacturing systems

The approach to achieving zero defects by using Industry 4.0 technologies is what constitutes Zero Defect Manufacturing (ZDM). However, its implementation is not a simple task since it requires careful design and new methods. The current ...

Zero Defect Manufacturing ontology: A preliminary version based on standardized terms

The global transition from traditional manufacturing systems to Industry 4.0 compatible systems has already begun. Therefore, the digitization of the manufacturing systems across the globe is increasing with exponential growth which ...

  • Develop an initial ontology for the Zero Defect Manufacturing (ZDM) domain.

Optimizing efficiency and zero-defect manufacturing with in-process inspection: challenges, benefits, and aerospace application

In this paper, we present a comprehensive study on the implementation of machine vision-enabled in-process quality inspection systems in machining operations. Our objective is to enable zero-defect manufacturing by maximizing efficiency and ...

Information

Publisher: Elsevier Science Publishers B. V., Netherlands

Author tags

  • Zero Defect Manufacturing
  • Machine learning
  • Quality prediction
  • Simulation Industry 4.0

Article type: Research-article

Title: A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling

Abstract: Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.
Comments: Accepted to EAMT 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
COMMENTS

  1. Introducing Yelp's Machine Learning Platform

    MLeap is a serialization format and execution engine, and provides two advantages for our ML Platform. Firstly, MLeap comes out of the box with support for Yelp's most commonly used ML libraries: Spark, XGBoost, Scikit-learn, and Tensorflow - and additionally can be extended for custom transformers to support edge cases.

  2. Machine Learning at Yelp

    Yelp's website, Yelp.com, is a crowd-sourced local business review site. Their business model relies on relevant reviews (on scale of 1-5 stars) which generates advertising revenue. 1 The content's search-ability is very important for businesses, an HBS study found that each "star" in a Yelp rating affected the business owner's sales by 5-9%. 2 Machine learning has been integral to ...

  3. Machine Learning and Visualization with Yelp Dataset

    This dataset contains labeled customer reviews from Yelp (Deceptive reviews and True reviews), which is used for training our predictive models on Fake Review machine learning approach. Source: It ...

  4. [2201.07999] Sentiment Analysis: Predicting Yelp Scores

    Sentiment Analysis: Predicting Yelp Scores. Bhanu Prakash Reddy Guda, Mashrin Srivastava, Deep Karkhanis. In this work, we predict the sentiment of restaurant reviews based on a subset of the Yelp Open Dataset. We utilize the meta features and text available in the dataset and evaluate several machine learning and state-of-the-art deep learning ...

  5. Low latency Neural Network Inference for ML Ranking Applications Yelp

    Speakers: Ryan Irwin, Engineering Manager, Yelp Inc. Ryan Irwin is a senior engineering manager at Yelp. He leads the teams responsible for the ML Platform, wh...

  6. Yelp Restaurant Recommendation System

    Word Clouds of Negative (Left) and Positive (Right) Reviews, Image by author. Natural Language Processing. Leveraging on the text reviews left by each Yelp User, I was able to create new rating scores for each User by incorporating sentiment analysis of their text reviews. Sentiment analysis is the process of determining the attitude or emotion of the user (whether it is positive or negative or ...

  7. PDF Applications of Machine Learning to Predict Yelp Ratings

    Yelp, founded in 2004, is a multinational corporation that publishes crowd-sourced online reviews on local businesses. As of 2014, Yelp.com had 57 million reviews and 132 million monthly visitors [1]. A portion of their large dataset is available on the Yelp Dataset Challenge homepage, which includes

  8. Sentiment Analysis: Predicting Yelp Scores

    Yelp Open Dataset. We utilize the meta features and text available in the dataset and evaluate several machine learning and state-of-the-art deep learning approaches for the prediction task. Through several qualitative experiments, we show the success of the deep models with attention mechanism in learning a balanced model for

  9. (PDF) Applications of Machine Learning Models on Yelp Data

    Abstract. The paper attempts to document the application of relevant Machine Learning (ML) models on Yelp (a crowd-sourced local business review and social networking site) dataset to analyze ...

  10. Sentiment Analysis: A Systematic Case Study with Yelp Scores

    Sentiment Analysis: A Systematic Case Study with Yelp Scores. This article experiments with various existing machine learning algorithms, from easy logistic regression to BERT embedding-based deep models, and uses ensemble to combine the aforementioned models into a single predictor, seeing if a combination of these models will achieve better ...

  11. PDF Predicting Usefulness of Yelp Reviews

    Introduction. The Yelp Dataset Challenge makes a huge set of user, business, and review data publicly available for machine learning projects. They wish to find interesting trends and patterns in all of the data they have accumulated. Our goal is to predict how useful a review will prove to be to users. We can use review upvotes as a metric.

  12. PDF Predicting the Helpfulness of Online Restaurant Reviews Using Different

    Reviews Using Different Machine Learning Algorithms: A Case Study of Yelp. Yi Luo 1 and Xiaowei Xu 2,* 1 College of Business Administration, Capital University of Economics and Business, Beijing 100070, China; 2 School of Business Administration, Southwestern University of Finance and Economics, Chengdu 611130, China

  13. Comparative study of deep learning models for analyzing online

    This study empirically analyzed online restaurant reviews from Yelp in the era of the COVID-19 pandemic using traditional machine learning methods as well as deep learning methods. Based on the number of restaurant reviews posted on Yelp, an observable decline in March and a sharp decline in April were consistent with the timeline of how the ...

  14. Using Yelp Data to Predict Restaurant Closure

    This value is approximated by the date of the first yelp review. This means that restaurants that joined yelp late or do not receive frequent comments would appear to have a relatively younger age than their real value. Also, the restaurant age is limited by the date Yelp was founded (i.e. 2004). 4. Machine learning models and optimization

  15. Yelp Review Rating Prediction: Machine Learning and Deep Learning

    We predict restaurant ratings from Yelp reviews based on Yelp Open Dataset. Data distribution is presented, and one balanced training dataset is built. Two vectorizers are experimented for feature engineering. Four machine learning models including Naive Bayes, Logistic Regression, Random Forest, and Linear Support Vector Machine are implemented.

  16. 16 Real World Case Studies of Machine Learning

    Machine Learning Case Study on Yelp. As far as our technical knowledge is concerned, we are not able to recognize Yelp as a tech company. However, it is effectively taking advantage of machine learning to improve its users' experience to a great extent. ... Machine Learning Case Studies in Life Science and Biology 7. Development of Microbiome ...

  17. Yelp Review Rating Prediction: Machine Learning and Deep Learning Models

    We predict restaurant ratings from Yelp reviews based on Yelp Open Dataset. Data distribution is presented, and one balanced training dataset is built. Two vectorizers are experimented for feature engineering. Four machine learning models including Naive Bayes, Logistic Regression, Random Forest, and Linear Support Vector Machine are implemented. Four transformer-based models containing BERT ...

  18. Sentiment Analysis of Yelp Reviews: A Comparison of Techniques and Models

    Abstract: We use over 350,000 Yelp reviews on 5,000 restaurants to perform an ablation study on text preprocessing techniques. We also compare the effectiveness of several machine learning and deep learning models on predicting user sentiment (negative, neutral, or positive). For machine learning models,

  19. Machine-Learning-Case-Studies/NLP/NLP_Yelp_Data.ipynb at master

    Various Machine Learning Case Studies. Contribute to kkaushi4/Machine-Learning-Case-Studies development by creating an account on GitHub.

  20. (PDF) Evaluation of Machine Learning Approach for ...

    Sentiment Analysis is a part of NLP application that extracts emotional information from texts. In this study, we investigate the performance of sequence-based model, i.e., LSTM, compared with ...

  21. 5 Machine Learning Case Studies to explore the Power of Technology

    Here are the five best machine learning case studies explained: 1. Machine Learning Case Study on Dell. The multinational leader in technology, Dell, empowers people and communities from across the globe with superior software and hardware. Since data is a core part of Dell's hard drive, their marketing team needed a data-driven solution that ...

  22. Using machine learning to detect bot attacks that leverage residential

    Detection results and case studies. We started testing v8 in shadow mode in March 2024. Every hour, v8 is classifying more than 17 million unique IPs that participate in residential proxy attacks. Figure 4 shows the geographic distribution of IPs with residential proxy activity belonging to more than 45 thousand ASNs in 237 countries/regions.

  23. Sustainability

    Helpful online reviews could be utilized to create sustainable marketing strategies in the restaurant industry, which contributes to national sustainable economic development. This study, the main aspects (including food/taste, experience, location, and value) from 294,034 reviews on Yelp.com were extracted empirically using the Latent Dirichlet Allocation (LDA) and positive and negative ...

  24. JMIR Public Health and Surveillance

    Background: The prevalence of type 2 diabetes mellitus (DM) and pre-diabetes mellitus (pre-DM) has been increasing among youth in recent decades in the United States, prompting an urgent need for understanding and identifying their associated risk factors. Such efforts, however, have been hindered by the lack of easily accessible youth pre-DM/DM data.

  25. Implementation of Zero Defect Manufacturing using quality prediction

    B. Zhou, Y. Svetashova, S. Byeon, T. Pychynski, R. Mikut, E. Kharlamov, Predicting Quality of Automated Welding with Machine Learning and Semantics: A Bosch Case Study, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2933-2940.

  26. [1605.05362] Yelp Dataset Challenge: Review Rating Prediction

    Review websites, such as TripAdvisor and Yelp, allow users to post online reviews for various businesses, products and services, and have been recently shown to have a significant influence on consumer shopping behaviour. An online review typically consists of free-form text and a star rating out of 5. The problem of predicting a user's star rating for a product, given the user's text review ...

  27. [2407.00108] A Case Study on Contextual Machine Translation in a

    Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV ...