Building Better Clouds: Four Lessons From the Healthcare.gov Fiasco


When your website has become a running joke on late-night talk shows, you know you have a problem. Yet, as CIOs move to cloud architectures, many are worrying that they’ll have a Healthcare.gov experience themselves.

Of course, website launches fail spectacularly all the time. It’s not uncommon for new apps and sites to crash, hang and frustrate users in a million and one ways. Eventually, the kinks get ironed out, but those ROI projections you presented to the CEO are now a complete fiction.



This is one of the reasons risk-averse CIOs are hesitant about moving from tried-and-true on-premise systems to cloud ones: fear of the unknown. If an on-premise system goes down, it’s not hard to figure out whose neck to choke. When a cloud application breaks, do you even have the visibility into the infrastructure to know what went wrong?

Here are four lessons you can learn from the Healthcare.gov fiasco to help guide you as you adopt, migrate to, and build cloud applications.

One thing overlooked as we discuss how profoundly Healthcare.gov failed as a site is that contractors were not able to tap into the efficiencies of cloud providers like AWS, which until very recently was considered too insecure for federal purposes. Thus, legacy providers such as Oracle, Quality Software Solutions, Booz Allen and CGI Federal were tasked with building a patchwork Healthcare.gov system that hearkens back to the late ’90s.

You know that old IT saying: “No one gets fired for choosing [Big Name IT Company].” True enough, but, unfortunately, Healthcare.gov missed its chance to create a streamlined, agile, cloud-based infrastructure, believing instead that this approach would be too risky.

Most organizations fear moving too quickly with new technologies. Sometimes the truly risky option is moving too slowly and sticking with the status quo, especially when the status quo is quickly disappearing in the rearview mirror (unless you’re stuck in the slow lane, of course).

It’s been more than a dozen years since a few top-tier software developers met at a Utah ski resort and came up with the Manifesto for Agile Software Development. Yet, the very term “agile development” seems to be one that doesn’t translate into Healthcare.gov development-speak.

“It’s important to engage service providers with experience in agile delivery methods that can effectively act as a catalyst for transforming delivery capabilities,” said Craig Wright, a principal at consulting firm Pace Harmon. “Ensure that outsourcing agreements include meaningful expectations around agile service delivery performance structures and relevant provisions to hold service providers responsible for quickly responding to changing needs, aggregating their services into an ecosystem-wide, seamless end-to-end service experience for users.”

A lingering question in my mind is: didn’t anyone see this coming? Good developers know about the bugs they ship. They may not have caught them all, but they know they’re in there and that they’ll have to fix them sooner or later. Typically, the product has reached “good enough” status, so shipping even with some bugs isn’t that big of a risk. Let’s just call this the Microsoft model, and that’s not a slam on Microsoft.

However, the Microsoft model dictates that the product is “good enough,” or that it actually works. It may not work perfectly, but if they’re shipping a browser, you can actually browse the web with it.

Judging from all of these Healthcare.gov “glitches,” you have to conclude that QA engineers must have all been furloughed during the government shutdown.

“It is clear to us, by the types of issues consumers are experiencing with Healthcare.gov this past month that the site was not fully tested,” said Tom Lounibos, CEO of SOASTA, a provider of test automation and monitoring tools. “One of the most common problems in website development is the focus on the speed of delivery versus the quality of the end-user experience.”

One of the rumors circulating this week is that Verizon will be tapped to fix Healthcare.gov. But there’s a problem buried in that solution. It’s not always easy to jump ship this far into a project. The existing contractors, including CGI Federal and Booz Allen, will have to release their proprietary code to Verizon (or to whomever the fixer ends up being). That fact alone makes the repair more complicated and difficult.

As you evaluate various cloud providers, platforms and tools, it’s worth putting “open source” on your list of selection criteria. Then, if something goes wrong, it’ll be much easier to throw the fools causing the problem overboard and bring someone else in who can take the helm without a lengthy transition period.

Jeff Vance is a technology journalist based in Santa Monica, California. Connect with him on LinkedIn, follow him on Twitter @JWVance, add him to your cloud computing circle on Google Plus, or just shoot him an old-fashioned email at [email protected].


Four Marketing Lessons From Consumer Inbox Behavior

UK DMA’s survey reveals key insights for email marketers

One of the challenges for email marketers is to stop thinking like email marketers.

A lot of assumptions about best practices are based on our collective view of just what’s going on inside consumer inboxes. But this view is biased by what’s going on inside our own inboxes.

If you’re an email marketer, you’re probably an online regular with a heavy duty email account or accounts. The same can’t be said of the proverbial man and woman in the street: the people who typically get the actual emails.

Our email experience is not their email experience.

Or is it?

Truth is we don’t really know.

Surveys of end users can, however, help correct our misconceptions. They provide important insight into how we might adapt our email campaigns to the reality of end users, to the benefit of the email bottom line.

The UK DMA, fast.MAP and Alchemy Worx recently released the 2011 edition of the Email Tracker Report, which surveyed 1,800 UK consumers on their inbox activity and habits, responses and attitudes to commercial email, use of mobile email, and their social sharing behavior.

The numbers include a few surprises. For example, it turns out most people do not use email during their working day.

Lesson 1: The wider relationship is important

Survey result: Over 60% of respondents are signed up to 10 or fewer senders.

People enter relatively few email relationships when you consider the total number of brands (i.e. potential senders) they interact with.

Many email marketers conceive campaigns on the assumption that the recipient’s inbox is flooded with great deals from their many competitors. This may not be the case, offering more scope for alternatives to the ever-increasing-discount wars, including content-oriented, loyalty and branding messages.

Senders face a challenge to crack what Merkle have long called the “inner circle” of senders (they say the average email user subscribes to email from just over 11 companies). The key here may be to exploit the wider, existing (hopefully positive) relationship with the recipient to capture the opt-in, for example by placing sign-up CTAs at key points of contact (like the point of sale in stores) or promoting the email list in transactional communications.

Loren McDonald has a good list building overview, where he emphasizes the need to exploit heightened interest caused by the forthcoming holiday season. It’s a similar principle: exploit the existing brand relationship to get the opt-in.

Lesson 2: Email drives many out-of-email responses

Alchemy Worx’s Dela Quist sums up nicely in the report’s introduction:

“Email makes things happen in other channels and at other times too. Many consumers hold on to email to refer to for later use, which is vital for attribution and a valuable growing trend: a consumer who returns a week later to retrieve a commercial email demonstrates very high purchase propensity, for instance.”

The recognition that emails are having a significant impact on attitudes and responses outside of the actual email itself changes everything, as I’ve written before. To summarize:

it tells us we need to reassess how we measure email success, so our investment in email reflects its true value

it encourages us to create specific campaigns to exploit the out-of-email response, driving action now through other channels or driving action in the future through branding and awareness impacts.


Lesson 3: You might be able to send more emails after all

Survey result: 94% of respondents are signed up to email from trusted brands, but over half are getting fewer than 3 such emails a day in total.

Most inboxes are not overflowing with commercial email from trusted brands. This is confirmed by benchmark reports which show UK email frequencies at long-term lows.

Email marketers have long been wary of increasing email frequency for fear of triggering excessive spam complaints. Sending “too much” email is a problem, but you also don’t want to miss out on responses by sending “too little”.

Just under 1 in 10 respondents cited “too many emails” as the primary reason for marking a message as spam. So it’s still an issue, but many senders may be well under the threshold for what constitutes “too many”.

My conclusions:

1. Consider carefully testing broad-brush increases in frequency

2. Explore ways to deliver more value, which lifts both responses and upper thresholds for acceptable frequency

3. Treat frequency changes as another option for specific segments or individuals. Your list is not an amorphous blob: some subscribers might resent more email, some will welcome it. The challenge is identifying the subscriber preferences, characteristics and/or behavior that lets you know who falls into each category

Lesson 4: Social sharing is not a global panacea

Survey result: 33% of respondents use no social network at all. Only 12% of respondents said they shared commercial email content into their social networks.

There is much interest in exploiting the interaction between social networks and email to the benefit of both.

That interest is justified, but needs to be tempered by realism: content and CTAs involving social networks are not relevant to many (most?) subscribers.

Rather than clutter up emails with “share with your network” links as a matter of course, marketers may benefit from using social CTAs more selectively. One option is to target by social network use, placing stronger focus on social calls to action with subscribers identified as potential sharers and influencers.

A second option is to reserve sharing efforts for specific contexts. Gretchen Scheiman, for example, recommends three reasons to keep (or not) sharing links in emails:

When the email content is newsworthy

When sharing is central to the message

When sharing is the way you increase your audience

We shouldn’t forget, of course, that what consumers say and what they do are not necessarily the same. Yet this kind of research does alert us to the key differences between our biased perceptions and inbox reality.

If you want to find out more about the way the email consumer thinks, as well as the DMA study, you might also want to check out surveys by e-Dialog, ContactLab, ExactTarget, Merkle. Or have you conducted your own study recently? What did you learn?

10 Lessons From RPA Leader UiPath’s IPO in 2023

UiPath, one of the largest RPA players, filed its S-1 on March 25, 2023 to get listed on the New York Stock Exchange under the ticker PATH. UiPath is expected to raise $1 billion. After the first quarter of 2023, UiPath raised its initial IPO price range from $43-$50 per share to $52-$54 per share.

UiPath’s initial public offering (IPO) filing includes important information about the company and the market trends shedding more light on how it became one of the fastest growing software companies in enterprise software. We summarized the highlights:

UiPath’s main activities

UiPath has created an end-to-end automation platform enabling customers to build, manage, run, engage, measure, and govern automation programs. Integrated Computer Vision capabilities, an AI platform, low code development capabilities and a free community edition are prominent features in the market for UiPath.

UiPath’s revenue comes from licenses for its proprietary software, maintenance and support, and professional services. License fees are based on the number of its software users and the number of automations running on its platform.

IPO timing

UiPath is expected to go public sometime between June-September 2023.

Control will remain with the founder, Daniel Dines.

Through the fundraising rounds, Daniel has kept Class B shares while investors have picked up Class A shares. Because each Class B share carries 35 votes and each Class A share carries 1 vote, Daniel will control approximately 91% of votes.
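To make the dual-class arithmetic concrete, here is a minimal sketch. It assumes, for illustration only, that all of the founder's votes come from Class B shares and that nobody else holds Class B; the function names are hypothetical and the actual share counts are not taken from the S-1.

```python
def class_b_vote_share(class_a_shares, class_b_shares, votes_per_b=35, votes_per_a=1):
    """Fraction of all votes held by Class B shares."""
    class_b_votes = class_b_shares * votes_per_b
    total_votes = class_b_votes + class_a_shares * votes_per_a
    return class_b_votes / total_votes


def implied_class_b_fraction(vote_share, votes_per_b=35, votes_per_a=1):
    """Fraction of all shares that must be Class B to reach a given vote share."""
    # From vote_share = B*vb / (B*vb + A*va), solve for the ratio B/A.
    b_to_a = (vote_share * votes_per_a) / ((1 - vote_share) * votes_per_b)
    return b_to_a / (1 + b_to_a)


fraction = implied_class_b_fraction(0.91)
print(f"Class B fraction of shares needed for 91% of votes: {fraction:.1%}")
```

Under these assumptions, holding roughly 22% of all shares as Class B would be enough to control about 91% of votes, which is what makes the 35-to-1 ratio so powerful.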

This is similar to the models followed by Facebook and Google where founders retained control in public companies. So far, this has generated significant value for shareholders in the case of mentioned companies. However, one person rule checked purely by stock market regulations and laws is not a very diversified or democratic model of corporate control.

Finally, even though the company is controlled by a single individual, it has not chosen to get listed as a “controlled company” which would have exempted it from certain requirements about its boards and committee.

Stakeholders

Customers

UiPath is focused on large enterprises. As of January/2024,

75% of revenues were from 13% of customers (i.e. 1,002 / 7,968). These customers have an ARR of $100,000+

35% of revenues were from 1% of customers (i.e. 89 / 7,968). ARR of these customers was $1.0+ million
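The customer concentration above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the text; the variable names are illustrative.

```python
total_customers = 7_968
customers_over_100k = 1_002   # customers with ARR of $100,000+
customers_over_1m = 89        # customers with ARR of $1.0+ million

share_over_100k = customers_over_100k / total_customers
share_over_1m = customers_over_1m / total_customers

# Revenue share divided by customer share: how over-represented each group is in revenue.
concentration_100k = 0.75 / share_over_100k
concentration_1m = 0.35 / share_over_1m

print(f"{share_over_100k:.1%} of customers generate 75% of revenue")
print(f"{share_over_1m:.1%} of customers generate 35% of revenue")
```

The second ratio is much larger than the first, which illustrates how heavily revenue depends on a small group of very large accounts.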

Most large enterprises seem to be trying UiPath with limited budgets or using it in limited scenarios. Its customer base has reached about 8,000 customers, including 80% of the Fortune 10 and 63% of the Fortune Global 500. Its large enterprise clients include:


Applied Materials

Bank of America


Chipotle Mexican Grill


CVS Health

Deutsche Post DHL




SBA Communications

Takeda Pharmaceuticals

Uber Technologies, Inc.

Most of its revenues (61%) are from outside the US, which is not surprising given that the company was started in Romania.

UiPath also released a version of its software as a community edition which helps with growing its visibility in the market.


Though we couldn’t find the numbers in the S-1, we expect most of UiPath’s customer relationships to be initiated by its partners.

UiPath has more than 3,700 business partners including BPOs, System Integrators (SIs) and consultants including Accenture LLP, Capgemini SE, CGI Inc., Cognizant Technology Solutions Corporation, Deloitte & Touche LLP, Ernst and Young LLP, KPMG LLP, and PricewaterhouseCoopers LLP.

UiPath also underlined 3 types of tech partnerships:

Integrations to tech platforms like Adobe, Alteryx built either by these platforms or in partnership with UiPath team

OCR/ NLP and custom ML and AI solutions

Cloud tech providers like AWS, Google, Microsoft Azure

Target market size

UiPath’s market potential estimate relies on the revenues of its best performing customers, so it seemed optimistic but not surprising to us. Companies tend to overestimate their addressable market to investors.

UiPath estimates its current market potential to be $60 billion. The underlying assumption is that all companies can spend on RPA at levels similar to UiPath’s highest revenue customers. UiPath’s rationale is that these customers have had the time to implement RPA and gain significant benefit from it and that other companies will also achieve the same.

Their methodology is:

Identification of number of companies with minimum 200 employees in all sectors

Grouping those companies into three segments according to their total number of employees

Multiplication of the number of companies in each segment by their estimated revenue potential. To estimate this, the UiPath team segmented their own customer base into three segments as described above. Then, for each segment, they take the 90th percentile of customer Annual Recurring Revenue (ARR) as of December 31, 2023, among customers with at least $10,000 in ARR. They take this ARR to be representative of the market potential for that segment

Summing the results from each of these segments
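The four steps above can be sketched in a few lines. This is an illustration of the described methodology, not UiPath's actual calculation: the segment names, company counts and ARR values are all hypothetical, and the percentile interpolation method is an assumption.

```python
from statistics import quantiles


def percentile_90(values):
    """90th percentile; quantiles(n=10) returns 9 cut points and the last is the 90th."""
    return quantiles(values, n=10, method="inclusive")[-1]


def estimate_market_potential(company_counts, customer_arr, min_arr=10_000):
    """Sum over segments of (number of companies) x (90th percentile customer ARR)."""
    total = 0.0
    for segment, n_companies in company_counts.items():
        # Only customers with at least min_arr in ARR are considered.
        qualifying = [arr for arr in customer_arr[segment] if arr >= min_arr]
        total += n_companies * percentile_90(qualifying)
    return total


# Hypothetical inputs: companies per employee-count segment and
# the ARR of existing customers in the same segments.
company_counts = {"small": 1_000, "medium": 200, "large": 50}
customer_arr = {
    "small": [5_000, 12_000, 20_000, 35_000, 60_000],
    "medium": [15_000, 40_000, 90_000, 150_000, 300_000],
    "large": [50_000, 200_000, 600_000, 1_200_000, 2_500_000],
}
print(estimate_market_potential(company_counts, customer_arr))
```

Because the estimate multiplies every company in a segment by what the top 10% of customers spend, it is sensitive to a handful of very large accounts, which is why the result tends to look optimistic.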

UiPath is active in the below markets and based on their total, we estimate their current market to be worth $2-3 billion in 2023:

RPA including AI enabled RPA applications: Current market size is expected to be $1.9bn by Gartner

Process mining: Less than $1 billion since in 2023 this market was still estimated to be ±160 million.

We did not estimate the market potential since it is not a measurable figure.

Before new companies adopt RPA, there will be alternatives to RPA which will be preferred by some of these companies.

The companies that have the potential to implement RPA will also be growing along with the overall economy.

Due to the dynamic nature of this, we do not find market potential to be informative. Market size and estimated growth rates are more relevant for estimating growth of industry participants. While the market growth is harder to estimate, UiPath grew its total revenue from $336M in 2023 to $608M in 2023, achieving 81% annual growth*.
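The growth figure quoted above follows directly from the two revenue numbers; a quick check (variable names are illustrative):

```python
revenue_prior = 336   # total revenue in $M, earlier fiscal year
revenue_latest = 608  # total revenue in $M, later fiscal year

growth = (revenue_latest - revenue_prior) / revenue_prior
print(f"Revenue growth: {growth:.0%}")
```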

Financials

Profitability

The company significantly reduced its operating loss from $517M to $110M in 2023*. Keeping losses at these levels, the $1 bn that is planned to be raised would give them a 10 year runway. However, the company will probably want to increase its losses and expand its revenues since it can raise funds relatively easily as a public company.

The reduction in operating loss was mostly driven by

$272M increase in revenues

$103M reduction in sales & marketing spend mainly due to reducing headcount via restructuring and limited conference related costs

$21M reduction in R&D

$18M reduction in G&A
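As a rough sanity check on the figures above, the sketch below adds up the four listed drivers and compares them with the change in operating loss, then computes the implied runway. All inputs come from the text; the variable names are illustrative.

```python
operating_loss_prior = 517   # $M
operating_loss_latest = 110  # $M
improvement = operating_loss_prior - operating_loss_latest

drivers = {
    "revenue increase": 272,
    "sales & marketing reduction": 103,
    "R&D reduction": 21,
    "G&A reduction": 18,
}
explained = sum(drivers.values())
# A negative value means other items, not listed in the text, moved the other way.
unexplained = improvement - explained

planned_raise = 1_000  # $M
runway_years = planned_raise / operating_loss_latest

print(f"Improvement: ${improvement}M, listed drivers: ${explained}M, other: ${unexplained}M")
print(f"Runway at current loss level: {runway_years:.1f} years")
```

The listed drivers add up to slightly more than the total improvement, so some other cost line must have increased; the text does not say which. The runway works out to about nine years, in line with the rounded 10-year figure.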

Revenue growth

UiPath is one of the fastest growing companies in the history of enterprise software. It took them about 13 years to reach $1M in ARR and just 2 years to reach $100M in ARR.


UiPath has been expanding the scope of its offering while other players have been making acquisitions. We expect more activity in the industry since this is still a fast growing industry which attracts the interest of numerous tech giants

UiPath has made acquisitions including these:

Cloud Elements, an API integration platform in 2023.

Process Gold a process mining platform in 2023.

Other acquisitions & partnerships in the RPA space include but are not limited to:

Service Now’s acquisition of Intellibot in 2023

Microsoft’s acquisition of Softomotive in 2023 and its launch of Power Automate as part of its Office 365 offering

Google Cloud & Automation Anywhere partnership

SAP’s acquisition of Contextor in 2023

Impact of Covid-19

As any other automation company, UiPath claims that the pandemic accelerated the adoption of automation and created more opportunity for RPA market. This is hard to confirm quantitatively because the company

was already in a high growth phase at the time of the pandemic. In such cases, the impact of events like pandemics is harder to observe than in companies with more stagnant financial growth.

did not share quarterly revenues before 2023 Q4.


However, based on the available data, it seems that the pandemic has not significantly accelerated UiPath’s already impressive growth. According to their S-1, they quadrupled their number of customers (from 700+ to 2,800+) in a year (2024-2024) before COVID-19. Post-pandemic, they doubled their number of customers in a similar timespan (from 2023 to 2023). It is also worth noting that growth rates tend to dwindle as companies get larger, since it is harder to grow a larger company. Therefore, it is not clear whether the pandemic had a positive or negative impact on UiPath’s growth.

To see all RPA companies, feel free to check out our prioritized, data driven list of RPA companies.


* These are 2023 and 2023 fiscal year figures. Their fiscal year ends in January 2023, and they call that fiscal year 2023 revenue. We simply called it 2023 revenue since 11 months of that fiscal year were in 2023.


Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.





The Milky Way’s Shiniest Known Exoplanet Has Glittering Metallic Clouds

Astronomers from the European Space Agency (ESA) have discovered the shiniest known exoplanet in our universe to date. Named LTT9779 b, this ultra hot exoplanet revolves around its host star every 19 hours and is 262 light-years away from Earth.


In our night sky, the moon and Venus are the brightest objects. Venus’ thick cloud layers reflect 75 percent of the sun’s incoming light, compared to Earth’s cloud layers that just reflect about 30 percent. LTT9779 b and its reflective metallic clouds can match Venus’ shininess. Detailed measurements taken by ESA’s Cheops (CHaracterising ExOPlanet Satellite) mission show that the glittering globe reflects 80 percent of the light that is shone on it by its host star.

LTT9779 b was first spotted in 2023 by NASA’s Transiting Exoplanet Survey Satellite (TESS) mission and ground-based observations conducted at the European Southern Observatory in Chile. ESA then selected this planet for additional observations as part of the Cheops mission.

At around the same size as the planet Neptune, LTT9779 b is the largest known “mirror” in the universe. According to ESA, it is so reflective due to its metallic clouds that are mostly made of silicate mixed in with metals like titanium. Sand and glass that are used to make mirrors are also primarily made up of silicate. The findings are detailed in a study published July 10 in the journal Astronomy & Astrophysics.

“Imagine a burning world, close to its star, with heavy clouds of metals floating aloft, raining down titanium droplets,” study co-author James Jenkins, an astronomer at Diego Portales University in Chile, said in a statement.

The amount of light that an object reflects is called its albedo. Most planets have a low albedo, primarily because they either have an atmosphere that absorbs a lot of light or their surface is rough or dark. Frozen ice-worlds or planets like Venus that boast a reflective cloud layer tend to be the exceptions. 

For the team on this study, LTT9779 b’s high albedo came as a surprise, since the side of the planet that faces its host star is estimated to be around 3,632 degrees Fahrenheit. Any temperature above 212 degrees is too hot for clouds of water to form. On paper, the temperature of LTT9779 b’s atmosphere should even be too hot for clouds that are made of glass or metal.

“It was really a puzzle, until we realized we should think about this cloud formation in the same way as condensation forming in a bathroom after a hot shower,” said co-author and Observatory of Côte d’Azur researcher Vivien Parmentier in a statement. “To steam up a bathroom you can either cool the air until water vapor condenses, or you can keep the hot water running until clouds form because the air is so saturated with vapor that it simply can’t hold any more. Similarly, LTT9779 b can form metallic clouds despite being so hot because the atmosphere is oversaturated with silicate and metal vapors.”


In addition to being a shiny happy exoplanet, LTT9779 b also is remarkable because it is a planet that shouldn’t really exist. Its size and temperature make it an “ultra-hot Neptune,” but no other known planets of its size and mass have been found orbiting this close to their host star. This means that LTT9779 b lives in the “hot Neptune desert,” a planet whose atmosphere is heated to more than 1,700 degrees.

“We believe these metal clouds help the planet to survive in the hot Neptune desert,” co-author and astronomer at Marseille Astrophysics Laboratory Sergio Hoyer said in a statement. “The clouds reflect light and stop the planet from getting too hot and evaporating. Meanwhile, being highly metallic makes the planet and its atmosphere heavy and harder to blow away.”

While its radius is about 4.7 times as big as Earth’s, one year on LTT9779 b takes only 19 hours. All of the previously discovered planets that orbit their star in less than one day are either gas giants with a radius that is at least 10 times Earth’s (called hot Jupiters) or rocky planets that are smaller than two Earth radii.

“It’s a planet that shouldn’t exist,” said Parmentier. “We expect planets like this to have their atmosphere blown away by their star, leaving behind bare rock.”

Cheops is the first of three ESA missions dedicated to studying the exciting world of exoplanets. In 2026, it will be joined by the Plato mission which will focus on Earth-like planets that could be orbiting at a distance from their star that supports life. Ariel is scheduled to join in 2029, specializing in studying the atmospheres of exoplanets. 

Guide For Building An End

This article was published as a part of the Data Science Blogathon

In this blog, we’ll go over everything you need to know about Logistic Regression to get started and build a model in Python. If you’re new to machine learning and have never built a model before, don’t worry; after reading this, I’m confident you’ll be able to do so.

For those who are new to this, let’s start with a basic understanding of machine learning before moving on to Logistic Regression.

In simple terms, the Machine learning model uses algorithms in which the machine learns from the data just like humans learn from their experiences. Machine learning allows computers to find hidden insights without being explicitly programmed.

Types of Machine Learning algorithms

Based on the output type and task done, machine learning models are classified into the following types

Logistic Regression falls under the Supervised Learning type. Let’s learn more about it.

Supervised Learning

It’s a type of Machine Learning that uses labelled data from the past. Models are trained using already labelled samples.

Example: You have past data of the football premier league and based on that data and previous match results you predict which team will win the next game.

Supervised learning is further divided into two types-

Regression – target/output variable is continuous.

Classification – target/output variable is categorical.

Logistic Regression is a Classification model. It helps to make predictions where the output variable is categorical. With this, let’s understand Logistic Regression in detail.

What is Logistic Regression?

As previously stated, Logistic Regression is used to solve classification problems. Models are trained on historical labelled datasets and aim to predict which category new observations will belong to.

Below are a few examples of binary classification problems which can be solved using logistic regression:

The probability of a political candidate winning or losing the next election.

Whether a machine in manufacturing will stop running in a few days or not.

Filtering email as spam or not spam.

Logistic regression is well suited when we need to predict a binary answer (only 2 possible values like yes or no).

The term logistic regression comes from “Logistic Function,” which is also known as “Sigmoid Function”. Let us learn more about it.

Logistic/Sigmoid Function

The sigmoid function, commonly known as the logistic function, predicts the likelihood of a binary outcome occurring. The function takes any value and converts it to a number between 0 and 1. The Sigmoid Function is a machine learning activation function that is used to introduce non-linearity to a machine learning model.

The formula of Logistic Function is:
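The formula itself did not survive in this copy of the article; the standard definition of the logistic (sigmoid) function, with x as the input value, is:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```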

When we plot the above equation, we get S shape curve like below.

The key point from the above graph is that no matter what value of x we use in the logistic or sigmoid function, the output along the vertical axis will always be between 0 and 1.

When the result of the sigmoid function is greater than 0.5, we classify the label as class 1 or positive class; if it’s less than 0.5, we can classify it as a negative class or class 0.
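A minimal sketch of the sigmoid function and the 0.5 threshold rule, using only the standard library. The function names are illustrative, and assigning an output of exactly 0.5 to class 1 is an assumption, since the text does not cover that case.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def classify(x, threshold=0.5):
    """Return 1 (positive class) if sigmoid(x) reaches the threshold, else 0."""
    # Assumption: a value exactly equal to the threshold is assigned to class 1.
    return 1 if sigmoid(x) >= threshold else 0


for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4), classify(x))
```

Negative inputs map below 0.5 and positive inputs above it, so the threshold at 0.5 corresponds to the sign of the input.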

Let’s understand the mathematics behind the sigmoid function.

Logistic regression is derived from Linear regression by passing its output value to the sigmoid function. The equation for Linear Regression is:
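The equation is missing from this copy of the article. Using the slope m and intercept c that the text refers to, the single-feature form and its logistic counterpart can be written as follows (the symbol \hat{y} for the predicted probability is a notational assumption):

```latex
y = m x + c
\qquad\Longrightarrow\qquad
\hat{y} = \sigma(m x + c) = \frac{1}{1 + e^{-(m x + c)}}
```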

In Linear Regression we try to find the best-fit line by changing m and c values from the above equation and y (output) can take any values from -infinity to +infinity. But, Logistic regression predicts the probability of outcome which can be between 0 to 1. So, to convert those values between 0 to 1 we use the sigmoid function.

After getting our output value, we need to see how well our model works. For that, we need to calculate the loss function. The loss function tells us how much our predicted output differs from the actual output. A good model should have a low loss value. Let’s see how to calculate the loss function.
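The loss formula is not present in this copy. The standard choice for logistic regression is the log loss (binary cross-entropy), shown here on the assumption that it is the one the article intended, with y the actual label and \hat{y} the predicted probability:

```latex
L(y, \hat{y}) =
\begin{cases}
-\log(\hat{y}) & \text{if } y = 1 \\
-\log(1 - \hat{y}) & \text{if } y = 0
\end{cases}
\qquad\text{or equivalently}\qquad
L(y, \hat{y}) = -\bigl[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\bigr]
```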

When y=1, the predicted y value should be close to 1 to reduce the loss. Now Let’s see when our actual output value is 0.

When y=0, the predicted y value should be close to 0 to reduce the loss.
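The two cases can be checked numerically with a small sketch of the standard log loss for a single observation. The function name and the clipping constant eps are illustrative choices, not part of the original article.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy for one observation with label 0 or 1."""
    # Clip the prediction away from 0 and 1 so that log() is always defined.
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Actual label 1: a prediction near 1 is cheap, a prediction near 0 is expensive.
print(log_loss(1, 0.9), log_loss(1, 0.1))
# Actual label 0: the situation is reversed.
print(log_loss(0, 0.1), log_loss(0, 0.9))
```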

Let’s move on to the implementation of the Logistic Regression model now that we’ve covered the basics.

Step by step implementation of  Logistic Regression Model in Python

Based on parameters in the dataset, we will build a Logistic Regression model in Python to predict whether an employee will be promoted or not.

For everyone, promotion or appraisal cycles are the most exciting times of the year. Final promotions are only disclosed after employees have been evaluated on a variety of criteria, which causes a delay in transitioning to new responsibilities. We will build a Machine Learning model to predict who is qualified for promotion to speed up the process.

You can get more understanding of the problem statement and download the dataset from Supervised learning.

Importing Libraries

We’ll begin by loading the necessary libraries for creating a Logistic Regression model.

```python
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# We will use sklearn for building the logistic regression model
from sklearn.linear_model import LogisticRegression
```

Loading Dataset

Python Code:
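The original loading code is not present in this copy. The sketch below shows how the data could be read with pandas; the in-memory rows are hypothetical stand-ins, and the `length_of_service` column name and the `train.csv` file name are assumptions.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the downloaded file; only the
# employee_id, education, previous_year_rating and is_promoted column names
# appear in the article, and length_of_service is an assumed name.
sample_csv = io.StringIO(
    "employee_id,education,previous_year_rating,length_of_service,is_promoted\n"
    "1,Bachelor's,5.0,8,0\n"
    "2,,3.0,4,0\n"
    "3,Master's & above,,1,1\n"
)
data = pd.read_csv(sample_csv)

# With the real dataset this would be something like:
# data = pd.read_csv("train.csv")  # file name is an assumption
print(data.head())
```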

Understanding the Data for Logistic Regression

It’s always a good idea to learn more about data after loading it, such as the shape of the data and statistical information about the columns in a dataset. We can achieve all of this with the code below :

```python
# Shape of the dataset
print("shape of dataframe is : ", data.shape)

# Summary of data: statistical details of the numeric columns
data.describe()
```

There are 14 variables in this dataset and 54808 observations. “is_promoted” is our target variable, which has two categories encoded as 1 (promoted) and 0 (not promoted); all the rest are input features. In addition, we can observe that our dataset contains both numerical and categorical features.

Data Cleaning

Data cleaning is a crucial stage in the data preprocessing process. We’ll remove columns with only one unique value because their variance will be 0 and they won’t help us anticipate anything.

Let’s check whether any columns have only one unique value.

# Checking the unique value counts in columns
featureValues = {}
for d in data.columns.tolist():
    count = data[d].nunique()
    if count == 1:
        featureValues[d] = count

# List of columns having only 1 unique value
cols_to_drop = list(featureValues.keys())
print("Columns having 1 unique value are :\n", cols_to_drop)

This signifies that there isn’t any column having only 1 unique value.

Next, we check the percentage of null values in each field of the dataset.

# Drop employee_id column as it is just a unique id
data.drop("employee_id", inplace=True, axis=1)

# Checking null percentage
data.isnull().mean() * 100

Both the previous_year_rating and education features have null values. We will impute those null values instead of dropping the rows. After examining those columns, we found the following:

For rows with a null previous_year_rating, we can see that their length of service is 1, which could be why they don’t have a previous year rating. As a result, we’ll use 0 to impute null values.

For the education column, we will impute null values with mode.

# fill missing values
data["previous_year_rating"] = data["previous_year_rating"].fillna(0)

# change type to int
data["previous_year_rating"] = data["previous_year_rating"].astype("int")

# Find out mode value for education
data["education"].mode()

# fill missing values with the mode
data["education"] = data["education"].fillna("Bachelor's")

There are no missing values in our data now. So, let’s go on to the next step.

Exploratory Data Analysis before creating a Logistic Regression Model

Getting insights from data and visualizing them is an important stage in machine learning since it provides us with a better view of features and their relationships.

Let’s look at the target variable’s distribution in the dataset.

# chart for distribution of target variable
fig = plt.figure(figsize=(10, 3))
fig.add_subplot(1, 2, 1)
a = data["is_promoted"].value_counts(normalize=True).plot.pie()
fig.add_subplot(1, 2, 2)
churnchart = sns.countplot(x=data["is_promoted"])
plt.tight_layout()

We can observe from the above charts that there is less data for promoted employees than for non-promoted employees, indicating a class imbalance: class 0 has more data points or observations than class 1.

Let’s visualize if there is any relationship between the target variable and other variables.

# Visualize relationship between promoted and other features
fig = plt.figure(figsize=(10, 5))
fig.add_subplot(1, 3, 1)
ar_6 = sns.boxplot(x=data["is_promoted"], y=data["length_of_service"])
fig.add_subplot(1, 3, 2)
ar_6 = sns.boxplot(x=data["is_promoted"], y=data["avg_training_score"])
fig.add_subplot(1, 3, 3)
ar_6 = sns.boxplot(x=data["is_promoted"], y=data["previous_year_rating"])
plt.tight_layout()

For an employee, the higher the avg_training_score value, the higher the chances of getting promoted.

We will plot correlations between different variables using a heatmap.

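# correlation between numeric features (non-numeric columns are excluded,
# since they have not been encoded yet at this point)
corr_plot = sns.heatmap(data.select_dtypes("number").corr(), annot=True, linewidths=3)
plt.title("Correlation plot")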

the service.

Feature Engineering

In feature engineering, we apply domain expertise to produce new features from raw data, or we convert or encode features. We’ll encode categorical features or make dummy features out of them in this section.

# Converting categorical columns into one-hot encoding
# gender is encoded as a single binary feature
data["gender"] = data["gender"].apply(lambda x: 1 if x == "m" else 0)

# list of remaining categorical columns
cols = data.select_dtypes(["object"]).columns

# Create dummy variables
ds = pd.get_dummies(data[cols], drop_first=True)
ds

# concat newly created columns with original dataframe
data = pd.concat([data, ds], axis=1)

# Drop original columns
data.drop(cols, axis=1, inplace=True)

Train-Test Split

We will divide the dataset into two subsets: train and test. To perform the train-test split, we’ll use the Scikit-learn machine learning library.

Train subset – we will use this subset to fit/train the model

Test subset – we will use this subset to evaluate our model

from sklearn.model_selection import train_test_split

# split data into independent variables (X) and the dependent variable (y) that we will predict
y = data.pop("is_promoted")
X = data

# Let's split X and y using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.8)

# get shape of train and test data
print("train size X : ", X_train.shape)
print("train size y : ", y_train.shape)
print("test size X : ", X_test.shape)
print("test size y : ", y_test.shape)

After splitting the dataset, we have 43846 observations in the training subset and 10962 in the test subset.

After dividing the dataset, let’s move on to the next phase: feature scaling.

Feature Scaling/Normalization

Why is Feature Scaling Important?

As previously stated, Logistic Regression uses Gradient Descent as one of the approaches for obtaining the best result, and feature scaling helps to speed up the Gradient Descent convergence process. When we have features that vary greatly in magnitude, the algorithm assumes that features with a large magnitude are more relevant than those with a small magnitude. As a result, when we train the model, those characteristics become more important.

Because of this, feature scaling is required to put all features into the same range, regardless of their relevance.

Feature Scaling Techniques

We bring all the features into the same range using feature scaling. There are many ways to do feature scaling like normalization, standardization, robust scaling, min-max scaling, etc. But here we will discuss the Standardization technique that we are going to apply to our features.

In standardization, features will be scaled to have a mean of 0 and a standard deviation of 1. It does not scale to a preset range. The features are scaled using the formula below:

z = (x – u) / s

where u is the mean of the training samples and s is a standard deviation of the training samples.
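As a quick check of the formula, here is a minimal sketch that uses only the Python standard library. The helper name `standardize` and the sample values are assumptions made for illustration. The population standard deviation is used because that is what Scikit-learn's StandardScaler computes, and new (test) values are scaled with the training mean and standard deviation.

```python
import statistics

def standardize(train_values, values):
    """Scale `values` using the mean and standard deviation of `train_values`."""
    u = statistics.mean(train_values)    # mean of the training samples
    s = statistics.pstdev(train_values)  # population standard deviation
    return [(x - u) / s for x in values]

train = [2.0, 4.0, 6.0, 8.0]                  # made-up training feature
scaled_train = standardize(train, train)
scaled_new = standardize(train, [5.0, 10.0])  # made-up unseen values

print(scaled_train)
print(scaled_new)
```

After scaling, the training feature has a mean of 0 and a standard deviation of 1, while unseen values are simply expressed in units of training standard deviations and can fall outside any fixed range.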

Let’s see how to do feature scaling in python using Scikit-learn.

# Feature scaling
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

Class Imbalance

What is class imbalance?

Class imbalance occurs when the classes of the target variable are not balanced and the distribution is skewed. Let’s see whether we have a class imbalance problem.

# check for distribution of labels
y_train.value_counts(normalize=True)

We can observe that the majority of the labels are from class 0 and only a few are from class 1.

If we use this distribution to develop our model, it may become biased towards predicting the majority class, since there will be insufficient data to learn the patterns of the minority class. The model will start predicting every new observation as 0, the majority class (in our problem, the employee is not promoted). We’ll get high accuracy here, but it won’t be a good model because it won’t predict class 1, the minority class, which is the crucial class.

As a result, we must consider class imbalance when developing our Logistic Regression model.

How to Handle Class Imbalance?

There are a variety of approaches to dealing with class imbalance, such as increasing minority class samples or decreasing majority class samples to ensure that both classes have the same distribution.

Because we’re using the Scikit-learn machine learning library to create the model, we can rely on its logistic regression implementation, which supports class weighting. We will use the built-in parameter “class_weight” while creating an instance of the Logistic Regression model.

Both the majority and minority classes will be given separate weights. During the training phase, the weight differences will influence the classification of the classes.

The purpose of adding class weights is to penalize misclassification of the minority class more heavily by giving it a higher class weight, while decreasing the weight for the majority class.
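One common way to pick such weights is the inverse-frequency heuristic that Scikit-learn applies when class_weight="balanced" is passed: n_samples / (n_classes * count of the class). The sketch below reproduces that formula in plain Python. The helper name and the 90/10 label split are made-up assumptions for illustration, not the weights used in this tutorial.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical imbalanced labels: 90 not promoted (0), 10 promoted (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The minority class ends up with the larger weight, so each misclassified minority sample contributes more to the loss during training.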

Build and Train Logistic Regression model in Python

To implement Logistic Regression, we will use the Scikit-learn library. We’ll start by building a base model with default parameters, then look at how to improve it with Hyperparameter Tuning.

As previously stated, we will use the “class_weight” parameter to address the problem of class imbalance. Let’s start by creating our base model with the code below.

# import library
from sklearn.linear_model import LogisticRegression

# make instance of model with default parameters except class weight,
# as we will add class weights due to the class imbalance problem
lr_basemodel = LogisticRegression(class_weight={0: 0.1, 1: 0.9})

# train model to learn relationships between input and output variables
lr_basemodel.fit(X_train, y_train)

# predict labels for the test dataset
y_pred_basemodel = lr_basemodel.predict(X_test)

After training our model on the training dataset, we use it to predict values for the test dataset and store them in the y_pred_basemodel variable.

Let’s look at which metrics to use and how to evaluate our base model.

Model Evaluation Metrics

To evaluate the performance of our model, we will use the “f1 score”. Because this is a class imbalance problem, accuracy is not a good performance metric, whereas the f1 score is the go-to metric for imbalanced classes. The formula for calculating the F1 score is as follows:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

Precision is the ratio of accurately predicted positive observations to the total predicted positive observations.

Precision = TP / (TP + FP)

Recall is the ratio of accurately predicted positive observations to all observations in actual class – yes.

Recall = TP / (TP + FN)

F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
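The three formulas above can be worked through with a small sketch. The confusion-matrix counts used here (40 true positives, 60 false positives, 10 false negatives) are made-up numbers chosen for illustration, and the function name is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Hypothetical counts: many false positives, few false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=60, fn=10)
print(precision, recall, f1)
```

With a precision of 0.4 and a recall of 0.8, the F1 score comes out at about 0.53, below the plain average of 0.6, because the harmonic mean is pulled towards the weaker of the two values.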

Let’s evaluate our base model using the f1 score.

from sklearn.metrics import f1_score

print("f1 score for base model is : ", f1_score(y_test, y_pred_basemodel))

We got a 0.37 f1 score on our base model created using default parameters.

Up to this point, we saw how to create a logistic regression model using default parameters.

Now let’s increase model performance and evaluate it again after tuning hyperparameters of the model.

Hyperparameter Optimization for the Logistic Regression Model

Model parameters (such as weight, bias, and so on) are learned from data, whereas hyperparameters specify how our model should be organized. The process of finding the optimum fit or ideal model architecture is known as hyperparameter tuning. Hyperparameters control the overfitting or underfitting of the model. Hyperparameter tuning can be done using algorithms like Grid Search or Random Search.

We will use Grid Search which is the most basic method of searching optimal values for hyperparameters. To tune hyperparameters, follow the steps below:

Create a model instance of the Logistic Regression class

Specify hyperparameters with all possible values

Define performance evaluation metrics

Apply cross-validation

Train the model using the training dataset

Determine the best values for the hyperparameters given.

We can use the below code to implement hyperparameter tuning in python using the Grid Search method.

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# define model/create instance
# Note: the default lbfgs solver does not support the 'l1' penalty, so those
# candidates fail to fit unless a solver such as 'liblinear' is set here
lr = LogisticRegression()

# x is the weight for the majority class (0); the minority class (1) gets 1 - x
# Setting the range for class weights
weights = np.linspace(0.0, 0.99, 500)

# specifying all hyperparameters with possible values
param = {
    "C": [0.1, 0.5, 1, 10, 15, 20],
    "penalty": ["l1", "l2"],
    "class_weight": [{0: x, 1: 1.0 - x} for x in weights],
}

# create 5 folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search for hyperparameter tuning
model = GridSearchCV(estimator=lr, param_grid=param, scoring="f1", cv=folds, return_train_score=True)

# train model to learn relationships between x and y
model.fit(X_train, y_train)
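To get a feel for how much work this grid implies, the sketch below enumerates the same combinations with itertools.product instead of running any training. The variable names are illustrative assumptions; the value lists, the 500 class-weight settings, and the 5 folds mirror the grid above.

```python
from itertools import product

C_values = [0.1, 0.5, 1, 10, 15, 20]
penalties = ["l1", "l2"]
n_weight_settings = 500  # number of class-weight dictionaries in the grid
n_folds = 5

# Grid search tries every combination of the listed values
candidates = list(product(C_values, penalties, range(n_weight_settings)))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

That is 6,000 candidate settings and 30,000 model fits, plus one final refit on the full training set that GridSearchCV performs by default, so a coarser weight range is worth considering if run time matters.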

After fitting the model, we will extract the best fit values for all specified hyperparameters.

# print best hyperparameters
print("Best F1 score: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)

We will now build our Logistic Regression model using the above values we got by tuning Hyperparameters.

Build Model using optimal values of Hyperparameters

Let’s use the below code to build our model again.

# Building model again with best params
lr2 = LogisticRegression(class_weight={0: 0.27, 1: 0.73}, C=20, penalty="l2")
lr2.fit(X_train, y_train)

After training our final model it’s time to evaluate our Logistic Regression model using chosen metrics.

Model Evaluation

We will evaluate our model on Test Dataset. First, we will predict values on the Test dataset.

We chose “f1 score” as our performance metric above, but for learning purposes let’s look at the scores for all of the metrics, including the confusion matrix, precision, recall, ROC-AUC score, and finally the f1 score.

Then, we’ll compare our final model’s f1 score to our base model to see if it’s improved.

The code below computes the various metrics:

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score

# predict probabilities on Test and take the probability for class 1 ([:, 1])
y_pred_prob_test = lr2.predict_proba(X_test)[:, 1]

# predict labels on test dataset
y_pred_test = lr2.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
print("confusion Matrix is :\n\n", cm)
print("\n")

# ROC-AUC score
print("ROC-AUC score test dataset: \t", roc_auc_score(y_test, y_pred_prob_test))

# Precision score
print("precision score test dataset: \t", precision_score(y_test, y_pred_test))

# Recall score
print("Recall score test dataset: \t", recall_score(y_test, y_pred_test))

# f1 score
print("f1 score test dataset : \t", f1_score(y_test, y_pred_test))

We can see that by tuning hyperparameters, we were able to improve the performance of our model: the F1 score of the final model (0.43) is higher than that of the base model (0.37). After hyperparameter tuning, the model also achieved a ROC-AUC score of 0.88.

With this, we were able to construct our logistic regression model and test it on the Test dataset. More feature engineering, hyperparameter optimization, and cross-validation techniques can improve its performance even more.


We began our learning journey by understanding the basics of machine learning and logistic regression. Then we moved on to the implementation of a Logistic Regression model in Python. We learned key steps in Building a Logistic Regression model like Data cleaning, EDA, Feature engineering, feature scaling, handling class imbalance problems, training, prediction, and evaluation of model on the test dataset. Apart from that, we learned how to use Hyperparameter Tuning to improve the performance of our model and avoid overfitting and underfitting.

I hope you find this information useful and will try it out.

Connect with me on LinkedIn.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Building A Food Image Classifier Using Fastai

This article was published as a part of the Data Science Blogathon.


Social Media platforms are a common way to share interesting and informative images. Food images, especially related to different cuisines and cultures, are a topic that appears to be frequently trending. Social media platforms like Instagram have a large number of images belonging to different categories. We all might have used the search options on google images or Instagram to browse through yummy-looking cake images for ideas. But to make these images available via search, we need to have some relevant tags for each image.

This makes it possible to search the keyword and match it with the tags. Since it is extremely challenging to label each and every image manually, companies use ML and DL techniques to generate correct labels for images. This can be achieved using an image classifier that identifies and labels the image based on some labelled data.

In this article, let us build an image classifier and identify a few food images using a library called ‘fastai‘.

Introduction to Fastai

Fastai is an open-source Deep Learning library that offers practitioners high-level components that can produce state-of-the-art results in conventional deep learning domains rapidly and effortlessly. It gives researchers low-level components to mix and combine to create new techniques. It aims to accomplish both without compromising usability, flexibility, or performance.

Because fastai is written in Python and based on PyTorch, knowledge of Python is required to understand this article. We will run this code in Google Colab. In addition to fastai, we will use a graphics processing unit (GPU) to get results as fast as possible.

Building an Image Classifier using Fastai

Let’s start by installing the fastai library with the following command:

!pip install -Uqq fastai

Run the following command if you’re using Anaconda:

conda install -c fastchan fastai anaconda

Let us import the packages we need for the classification task. The library is divided into modules, the most common of which are tabular, text, and vision. Because our task at hand includes vision, let’s import all of the functions we’ll need from the vision library.

from fastai.vision.all import *

A lot of academic datasets are available through the fastai library. One of them is FOOD, which is listed under URLs as URLs.FOOD.

The first step is to obtain and extract the data that we require. We will use the untar_data function, which will automatically download the dataset and untar it.

foodPath = untar_data(URLs.FOOD)

This dataset contains 101,000 images divided into 101 food categories, with 250 test images and 750 training images per class. The training images were not cleaned. All images were resized to a maximum of 512 pixels on each side. You can download the dataset from here.

The next command will tell us how many images we have to deal with.


Furthermore, using the following command, we will print the contents of the meta-directory of the Food dataset.


The meta folder contains eight files, four of which are text files: train.txt, test.txt, classes.txt, and labels.txt. The train.txt and test.txt files include a list of images for the training and test sets, respectively. The classes.txt file, on the other hand, includes a list of all food classes, and labels.txt provides a list of all food image labels. The directory also contains a .h5 file with a pre-trained model and an images folder with 101,000 images in JPG format. Finally, the train and test sets are provided in JSON format.

To view all the image categories, we will run the following command:

import os

image_dir_path = foodPath/'images'
image_categories = os.listdir(image_dir_path)
print(image_categories)

Then, we’ll execute the following command to see a sample image from the collection of 101,000 images.

img = PILImage.create('/root/.fastai/data/food-101/images/frozen_yogurt/1942235.jpg');

Next, we read the train.json file into a dataframe using pandas. The header of the dataframe can then be printed using the head() function as shown below.

df_train = pd.read_json('/root/.fastai/data/food-101/train.json')
df_train.head()

Similarly, by using the pandas function, we will read the test.json file and store it in the df_test dataframe.

df_test = pd.read_json('/root/.fastai/data/food-101/test.json')
df_test.head()

We are creating three labels with food names of our choice to classify the food images.

labelA = 'cheesecake'
labelB = 'donuts'
labelC = 'panna_cotta'

Now we will create a for loop which will run through all the images that we have downloaded. With the help of this loop, we are removing the images that don’t have labels A, B, or C. Also, we are renaming the images with their respective labels by using the following function.

# Prefix each kept image with its label and delete every other image
for img in get_image_files(foodPath):
    if labelA in str(img):
        img.rename(f"{img.parent}/{labelA}-{img.name}")
    elif labelB in str(img):
        img.rename(f"{img.parent}/{labelB}-{img.name}")
    elif labelC in str(img):
        img.rename(f"{img.parent}/{labelC}-{img.name}")
    else:
        os.remove(img)

Let’s check the count of images we get after running the loop by using the following command:


Let’s try out one sample label among the three chosen food dishes and see if the renaming is done correctly or not.

def GetLabel(fileName):
    return fileName.split('-')[0]

GetLabel("cheesecake-1092082.jpg")
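The renaming and label extraction fit together as a round trip, which can be sketched without touching any files. This assumes the renamed files follow the `<label>-<original file name>` pattern suggested by the example above. The helper names `renamed_path` and `get_label` and the sample path are hypothetical.

```python
from pathlib import Path

def get_label(file_name):
    """Return the part of the file name before the first hyphen."""
    return file_name.split("-")[0]

def renamed_path(img_path, label):
    """Build the new path with the label prefixed to the original file name."""
    return img_path.parent / f"{label}-{img_path.name}"

# Hypothetical image path used only for illustration
original = Path("images/panna_cotta/1092082.jpg")
new_path = renamed_path(original, "panna_cotta")
label = get_label(new_path.name)

print(new_path.name, label)
```

Splitting on the first hyphen works here because none of the three chosen labels contains a hyphen; underscores, as in panna_cotta, are not affected.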


The following code generates a DataLoaders object, which represents a mix of training and validation data.

dls = ImageDataLoaders.from_name_func(
    foodPath,
    get_image_files(foodPath),
    valid_pct=0.2,
    seed=42,
    label_func=GetLabel,
    item_tfms=Resize(224))

dls.train.show_batch()

In this case, we will-

Use the path option to specify the location of the downloaded and extracted data.

Use the get_image_files function to collect all file names from the specified location.

Use an 80–20 split for the dataset.

Extract labels from file names using the GetLabel function.

Resize all images to the same size, i.e., 224 pixels.

Use the show_batch function to generate an output window displaying a grid of training images with assigned labels.

It’s time to put the model through its paces. Using the ResNet34 architecture, we will build a convolutional neural network with a single function call, vision_learner(). The vision_learner function (known as cnn_learner in older fastai versions) is used for training computer vision models. It takes the DataLoaders built from our image dataset, the pre-trained model resnet34, and the error_rate metric, which measures the proportion of images identified incorrectly on the validation data. The 34 in resnet34 refers to the number of layers in this architectural type (other options are 18, 50, 101, and 152). Models that use more layers take longer to train and are more prone to overfitting.

Fastai provides a ‘fine_tune’ function for tuning the pre-trained model to solve our specific problem using the data we’ve chosen. For training the model, we will set the number of epochs to 10.

learn = vision_learner(dls, resnet34, metrics=error_rate, pretrained=True)
learn.fine_tune(epochs=10)

The same model can also be checked for accuracy by replacing the metrics with ‘accuracy.’
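The two metrics carry the same information, since accuracy is simply one minus the error rate. A minimal sketch in plain Python illustrates this; the function names and the four predictions are made up and stand in for validation results, not for fastai's own metric implementations.

```python
def simple_error_rate(predicted, actual):
    """Proportion of predictions that do not match the true labels."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

def simple_accuracy(predicted, actual):
    """Proportion of predictions that match the true labels."""
    return 1 - simple_error_rate(predicted, actual)

# Hypothetical validation predictions and true labels
predicted = ["donuts", "cheesecake", "donuts", "panna_cotta"]
actual = ["donuts", "cheesecake", "panna_cotta", "panna_cotta"]

err = simple_error_rate(predicted, actual)
acc = simple_accuracy(predicted, actual)
print(err, acc)
```

One wrong prediction out of four gives an error rate of 0.25 and an accuracy of 0.75.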

Now, let us test a few sample images to check how our model performs.

Sample image #1

Sample image #2

Sample image #3

From the above results, we can say that our model was able to correctly identify the sample images.

After training the model, we can deploy it as a web application for others to use. Although fastai is primarily intended for model training, you can quickly export the PyTorch model for use in production using the ‘learn.export’ function. The code for this tutorial is available on my GitHub repository.


In this tutorial, we learned how to build a food image classifier using fastai based on PyTorch. It is possible to deploy this model using a service like Heroku or Netlify to make this model available as a web app.

Here are some key takeaways from this article-

We can set up deep learning models with minimal code using fastai. Hence, fastai makes it easier to use PyTorch for deep learning tasks.

Food Classification is a challenging task for computer vision applications as the same food can look considerably different from place to place depending on the way it is garnished and served. Still, by leveraging the power of transfer learning, we can use a pre-trained model to identify a food item and classify it correctly.

We used a pre-trained model, ResNet34, for this classifier. However, you can use another pre-trained model like VGG, Inception, DenseNet, etc., to build your own model.


