Feature Engineering Techniques used by Facebook and LinkedIn

When training machine learning (ML) models, I often reach a point where the model's accuracy stops improving. I start asking questions: “Am I getting the best out of my data?”, “How can I create more useful features from the data I have?” Because of this curiosity, any time I see a new feature-building technique, I get excited! With the current barrage of Deep Learning, finding a clever new feature engineering technique feels like a breath of fresh air.

In this post I will discuss the use of Gradient Boosting Decision Trees (GBDT or GBT) and…


Optimal Notification Volume for Maximum Rewards

In this paper

  • Notification volume estimation is modelled as a constrained optimization problem
  • Organic and notification-driven user engagement are differentiated
  • Reward for notification volume is estimated from activity prediction, unsubscribe prediction and unsubscribe long-term effect models
  • The incremental reward of each notification is used to optimize volume with a hill-climbing algorithm
  • The optimal number of notifications is estimated at the user level, subject to global constraints
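
To make the hill-climbing idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the reward models have already produced, for each user, a list of incremental rewards for the 1st, 2nd, 3rd... notification, and that the only global constraint is a total budget. The function name `allocate_notifications` and the sample numbers are hypothetical.

```python
import heapq

def allocate_notifications(marginal_rewards, budget):
    """Greedy hill climbing over notification volume.

    marginal_rewards: {user: [gain of 1st notification, gain of 2nd, ...]}
    budget: global cap on the total number of notifications (assumed constraint)
    """
    volume = {user: 0 for user in marginal_rewards}
    # Max-heap (via negated gains) of each user's next candidate notification.
    heap = [(-gains[0], user) for user, gains in marginal_rewards.items() if gains]
    heapq.heapify(heap)
    sent = 0
    while heap and sent < budget:
        neg_gain, user = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # the best remaining step no longer increases total reward
        volume[user] += 1
        sent += 1
        gains = marginal_rewards[user]
        if volume[user] < len(gains):
            heapq.heappush(heap, (-gains[volume[user]], user))
    return volume

# Made-up incremental rewards, for illustration only.
rewards = {
    "alice": [0.9, 0.5, 0.1],
    "bob": [0.6, 0.4, -0.2],
    "carol": [0.3, -0.1],
}
print(allocate_notifications(rewards, budget=4))
```

Each step gives one more notification to whichever user gains the most from it. This greedy choice is only guaranteed to be optimal when each user's incremental rewards are diminishing, which is an assumption of this sketch.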

The full paper can be found at https://labs.pinterest.com/user/themes/pin_labs/assets/paper/notifications-kdd18.pdf

Outline

  • Introduction
  • Pinterest Notification System
  • Problem Formulation
  • Data
  • Proposed Algorithms
  • Experiment & Results

Introduction

The purpose of notifications is to keep users engaged on a platform by sending them the right content, at the…


Restricted Random Forests for interpretable predictions

Customers’ reliance on OTT services like WhatsApp, and the ease of using them, has shrunk the sources of income for telecom operators. For operators to grow, or at least maintain, their current market share, a good customer experience on the network has become ever more important. This paper presents an approach to capture near real-time mobile customer experience and assess the conditions which lead a user to place a call with the telco’s customer care center.

Outline

  • Introduction
  • Data
  • Data Exploration
  • Proposed Technique — Restricted Random Forests
  • Interpretability
  • Performance Metrics
  • Experiment & Results

Introduction

The core problem the paper addresses is how to proactively estimate the experience of a customer…


Resizing an image leads to loss of information critical for image forgery detection; full-resolution networks are better suited for the task

Noiseprint (Noiseprint: a CNN-based camera model fingerprint https://arxiv.org/abs/1808.08396)

In this paper

  • Xception is used to extract features from patches of the full image without any resizing.
  • Feature aggregation is performed using various pooling techniques.
  • Fully connected layers are used for forgery detection at image level.
  • Noiseprint is tried as an additional feature alongside the RGB bands.
  • Gradient checkpointing is used to manage the network's memory footprint.
  • The network is trainable end-to-end (E2E).
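
The patch-then-pool pipeline in the list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_patch_features` is a hypothetical stand-in for the Xception backbone, the patch size and image size are arbitrary, and the fully connected classifier is left out.

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split an H x W x C image into non-overlapping square patches, with no resizing.
    Border pixels that do not fill a whole patch are dropped in this sketch."""
    h, w, _ = image.shape
    patches = [
        image[r:r + patch_size, c:c + patch_size]
        for r in range(0, h - patch_size + 1, patch_size)
        for c in range(0, w - patch_size + 1, patch_size)
    ]
    return np.stack(patches)

def toy_patch_features(patches):
    """Hypothetical stand-in for the CNN backbone: per-channel mean and std."""
    flat = patches.reshape(patches.shape[0], -1, patches.shape[-1])
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)], axis=1)

def aggregate(features, mode="max"):
    """Pool patch-level features into a single image-level vector."""
    if mode == "max":
        return features.max(axis=0)
    if mode == "mean":
        return features.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

rng = np.random.default_rng(0)
image = rng.random((300, 500, 3))                # full-resolution image, never resized
patches = extract_patches(image, patch_size=100)
features = toy_patch_features(patches)           # one feature vector per patch
image_vector = aggregate(features, mode="max")   # input to an image-level classifier
print(patches.shape, features.shape, image_vector.shape)
```

Because pooling runs over the patch axis, the image-level vector has the same length whatever the image size, which is what lets a fixed classifier sit on top of full-resolution inputs.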

Francesco Marra is the first author of the paper. Multimedia forensics is one of his research areas. Full paper can be found at https://arxiv.org/abs/1909.06751.

Outline

  • Introduction
  • Data
  • Proposed Model
  • Loss Function
  • Performance Metrics
  • Results
  • Forgery Localization
  • Implementation

Introduction

Typical computer vision models rely…


Embeddings are trained on a bipartite graph derived from transactions, then used in downstream tasks and visualization

Word2Vec — CBOW and Skip-Gram
Source: Efficient Estimation of Word Representations in Vector Space (https://arxiv.org/pdf/1301.3781.pdf)

In this paper

  • A bipartite graph is derived from credit card transactions. If two transactions of an account, falling within a specified time window, are represented as {Merchant, Account, Merchant}, then the two merchants constitute an edge in the bipartite graph.
  • This graph is used to train Skip-Gram embeddings.
  • Embeddings are found to cluster similar merchants together.
  • Brand-based embeddings perform better than raw merchant embeddings for downstream tasks.
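
The {Merchant, Account, Merchant} rule above can be illustrated with a short sketch. It assumes transactions arrive as (account, merchant, timestamp) tuples and that the window is in the same units as the timestamps. The function name `merchant_pairs` and the sample data are hypothetical, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

def merchant_pairs(transactions, window):
    """Pair up merchants visited by the same account within `window` time units."""
    by_account = defaultdict(list)
    for account, merchant, timestamp in transactions:
        by_account[account].append((timestamp, merchant))

    pairs = []
    for visits in by_account.values():
        visits.sort()  # chronological order within each account
        for (t1, m1), (t2, m2) in combinations(visits, 2):
            if t2 - t1 <= window and m1 != m2:
                pairs.append((m1, m2))
    return pairs

# Made-up transactions: (account, merchant, timestamp in minutes).
transactions = [
    ("acct_1", "coffee_shop", 0),
    ("acct_1", "bookstore", 30),
    ("acct_1", "gas_station", 500),
    ("acct_2", "bookstore", 10),
    ("acct_2", "coffee_shop", 45),
]
print(merchant_pairs(transactions, window=60))
```

The resulting merchant pairs play the role that (word, context) pairs play in Word2Vec, so they could be fed to any Skip-Gram trainer.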

This research was done at Capital One, and the full paper can be found at https://arxiv.org/abs/1907.07225.

Outline

  • Introduction
  • Data
  • DeepTrax Methodology for Embeddings
  • Loss Function
  • Performance Metrics
  • Results

Introduction

Transactions between customers and merchants create a bipartite graph with…


Explanation of the paper

Can a simpler formulation of objective function achieve better results for online marketplaces? Online marketplaces like Groupon have multiple stakeholders and require addressing objectives of each stakeholder. Researchers at Groupon found that simplifying their objective function improved Conversion Rate by 1.56% and Operational Value of their business by 1.43%.

In this post, we will first cover what multi-objective recommendations in multi-stakeholder systems are and then delve into the contents of the paper.

Multi-stakeholder Systems and Multi-Objective Recommendations

Online Marketplace: Online marketplaces are platforms where merchants and customers meet and make an online transaction that benefits both. Depending on the business there can also be…


In this code snippet, I will show you how to write a custom preprocessing function for use with ImageDataGenerator that extracts image patches, with the patch size randomly selected from predefined sizes.

Here we go!
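
A minimal sketch of such a function, under a few assumptions: `p` is taken to be the probability of returning the complete image, the candidate sizes in `PATCH_SIZES` are arbitrary, and the patch is resized back to the input shape with nearest-neighbour sampling because Keras expects `preprocessing_function` to return an array of the same shape as its input. The names `random_patch` and `PATCH_SIZES` are hypothetical.

```python
import numpy as np

PATCH_SIZES = (64, 96, 128)      # assumed candidate side lengths, in pixels
_rng = np.random.default_rng()

def random_patch(image, p=0.5, sizes=PATCH_SIZES):
    """Return the full image with probability p, otherwise a random square
    patch resized (nearest neighbour) back to the input shape."""
    h, w = image.shape[:2]
    valid = [s for s in sizes if s <= min(h, w)]
    if not valid or _rng.random() < p:
        return image
    size = valid[_rng.integers(len(valid))]
    top = _rng.integers(0, h - size + 1)
    left = _rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    # Nearest-neighbour upsampling so the output shape matches the input.
    rows = np.arange(h) * size // h
    cols = np.arange(w) * size // w
    return patch[rows][:, cols]

# Usage with Keras (not executed here):
# from functools import partial
# from tensorflow.keras.preprocessing.image import ImageDataGenerator
# datagen = ImageDataGenerator(preprocessing_function=partial(random_patch, p=0.3))
```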

Note that, for every image, we get either the complete image or a patch of it. The relative proportion of cases can be controlled by the parameter p. Also note that all patches are of square shape but that can be changed easily. So I leave that to you.

That’s it!


You will need to install the nvidia-ml-py3 library (pip install nvidia-ml-py3), which provides Python bindings to the NVIDIA Management Library.

Here is the code snippet:
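
A minimal sketch of what these bindings allow, assuming the goal is to inspect per-GPU memory (the task itself is an assumption, as is the helper name `gpu_memory_report`). The package installs a module called `pynvml`. The helper returns an empty list on machines without the bindings or without an NVIDIA driver.

```python
def gpu_memory_report():
    """Return one dict per GPU with memory figures in MiB.
    Returns [] when pynvml or the NVIDIA driver is unavailable."""
    try:
        import pynvml  # module installed by nvidia-ml-py3
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []
    try:
        report = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            report.append({
                "index": index,
                "total_mib": memory.total // 2**20,
                "used_mib": memory.used // 2**20,
                "free_mib": memory.free // 2**20,
            })
        return report
    finally:
        pynvml.nvmlShutdown()  # release NVML once the queries are done

for gpu in gpu_memory_report():
    print(gpu)
```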

That’s it!


If you want only some GPUs to be available to TensorFlow, you can configure this with the CUDA_VISIBLE_DEVICES environment variable, set from Python.

Here is the code snippet
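
A minimal sketch, assuming a machine with at least three GPUs. The device indices "0,2" are only an example.

```python
import os

# Set this before TensorFlow (or any other CUDA library) initialises the GPUs,
# i.e. ahead of `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # expose only physical GPUs 0 and 2

# import tensorflow as tf  # would now see two devices, renumbered 0 and 1
```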

Note that for multiple GPUs, you need to specify them as a comma-separated string.

Abhay Shukla

Lead Data Scientist@Airtel X Labs https://www.linkedin.com/in/shuklaabhay/ #DataScience #ML #AI #Statistics #Reading #Music #Running
