Beginner’s Guide To Web Scraping In Python Using Beautifulsoup


Overview

Learn web scraping in Python using the BeautifulSoup library

Web Scraping is a useful technique to convert unstructured data on the web to structured data

BeautifulSoup is an efficient Python library for web scraping, used alongside a fetching library such as urllib

A basic knowledge of HTML and HTML tags is necessary to do web scraping in Python

Introduction

The need and importance of extracting data from the web is becoming increasingly loud and clear. Every few weeks, I find myself in a situation where we need to extract data from the web to build a machine learning model.

For example, last week we were thinking of creating an index of hotness and sentiment about various data science courses available on the internet. This would not only require finding new courses, but also scraping the web for their reviews and then summarizing them in a few metrics!

This is one of the problems / products whose efficacy depends more on web scraping and information extraction (data collection) than the techniques used to summarize the data.

Note: We have also created a free course for this article – Introduction to Web Scraping using Python. This structured format will help you learn better.

Ways to extract information from web

There are several ways to extract information from the web, and using an API is probably the best one. Almost all large websites like Twitter, Facebook, Google, and StackOverflow provide APIs to access their data in a structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping: if the provider already gives you structured data, why would you build an engine to extract the same information?

Sadly, not all websites provide an API. Some withhold one because they do not want readers to extract large amounts of information in a structured way, while others simply lack the technical resources to offer one. What do you do in these cases? Well, we need to scrape the website to fetch the information.

There might be a few other ways like RSS feeds, but they are limited in their use and hence I am not including them in the discussion here.

What is Web Scraping?

You can perform web scraping in various ways, from Google Docs to almost every programming language. I resort to Python because of its ease of use and rich ecosystem. It has a library known as ‘BeautifulSoup’ which assists with this task. In this article, I’ll show you the easiest way to learn web scraping using Python programming.

For those of you who need a non-programming way to extract information out of web pages, you can also look at import.io. It provides a GUI-driven interface to perform all basic web scraping operations. The hackers among you can continue reading this article!

Libraries required for web scraping

As we know, Python is an open source programming language, and you may find many libraries that perform the same function. Hence, it is necessary to pick the library that is best to use. I prefer BeautifulSoup (a Python library), since it is easy and intuitive to work with. Precisely, I’ll use two Python modules for scraping data:

Urllib2: It is a Python module which can be used for fetching URLs. It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc.). For more detail refer to the documentation page. Note: urllib2 is the name of the library included in Python 2. You can use the urllib.request library included with Python 3 instead; it works the same way urllib2 does in Python 2, and because it is part of the standard library you don’t need to install it. A minimal Python 3 fetch is sketched below.
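For Python 3 readers, here is a minimal sketch of fetching a page with urllib.request; the URL is only a placeholder for illustration.

# Python 3 equivalent of urllib2.urlopen (placeholder URL)
import urllib.request

url = "https://en.wikipedia.org/wiki/Web_scraping"   # hypothetical example page
response = urllib.request.urlopen(url)               # returns an HTTPResponse object
html = response.read()                               # raw bytes of the page
print(html[:200])                                    # peek at the start of the document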

BeautifulSoup: It is an incredible tool for pulling out information from a webpage. You can use it to extract tables, lists, paragraph and you can also put filters to extract information from web pages. In this article, we will use latest version BeautifulSoup 4. You can look at the installation instruction in its documentation page.

BeautifulSoup does not fetch the web page for us. That’s why, I use urllib2 in combination with the BeautifulSoup library.

Python has several other options for HTML scraping in addition to BeautifulSoup. Here are some others:

Basics – Get familiar with HTML (Tags)

While performing web scraping, we deal with HTML tags, so we must have a good understanding of them. If you already know the basics of HTML, you can skip this section. Below is the basic syntax of HTML. This syntax has various tags, as elaborated below:

Other useful HTML tags are:

If you are new to these HTML tags, I would also recommend referring to the HTML tutorial from W3Schools. This will give you a clear understanding of HTML tags.

Scraping a web page using BeautifulSoup

Here, I am scraping data from a Wikipedia page. Our final goal is to extract the list of state and union territory capitals in India, along with some basic details like establishment, former capital and others, from this Wikipedia page. Let’s learn by doing this project step by step:

#import the library used to query a website
import urllib2  #if you are using python3+ version, import urllib.request

#specify the url
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)  #For python 3 use urllib.request.urlopen(wiki)

#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page)
    Above, you can see the structure of the HTML tags. This will help you learn about the different available tags and how you can play with them to extract information.

      Work with HTML tags

        In [30]: soup.title

        In [38]: soup.title.string
        Out[38]: u'List of state and union territory capitals in India - Wikipedia, the free encyclopedia'

        In [40]: soup.a

        Above, it shows the links along with their titles and other information. Now, to show only the links, we need to iterate over each <a> tag and then return the link using the attribute “href” with get, as shown below.
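        Here is a minimal sketch of that loop, assuming soup has been created as above:

        # print the value of the "href" attribute of every <a> tag on the page
        for link in soup.find_all("a"):
            print(link.get("href"))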

        

          Find the right table: As we are seeking a table to extract information about state capitals, we should identify the right table first. Let’s write the command to extract the information within all table tags:

          all_tables = soup.find_all('table')

          Now, to identify the right table, we use the table’s class attribute:

          right_table = soup.find('table', class_='wikitable sortable plainrowheaders')
          right_table

          Above, we are able to identify the right table.

          #Generate lists
          A=[]
          B=[]
          C=[]
          D=[]
          E=[]
          F=[]
          G=[]
          for row in right_table.findAll("tr"):
              cells = row.findAll('td')
              states = row.findAll('th')  #To store second column data
              if len(cells) == 6:  #Only extract table body not heading
                  A.append(cells[0].find(text=True))
                  B.append(states[0].find(text=True))
                  C.append(cells[1].find(text=True))
                  D.append(cells[2].find(text=True))
                  E.append(cells[3].find(text=True))
                  F.append(cells[4].find(text=True))
                  G.append(cells[5].find(text=True))

          #import pandas to convert list to data frame
          import pandas as pd
          df = pd.DataFrame(A, columns=['Number'])
          df['State/UT'] = B
          df['Admin_Capital'] = C
          df['Legislative_Capital'] = D
          df['Judiciary_Capital'] = E
          df['Year_Capital'] = F
          df['Former_Capital'] = G
          df

          Similarly, you can perform various other types of web scraping using “BeautifulSoup“. This will reduce your manual efforts to collect data from web pages. You can also look at attributes like .parent, .contents, .descendants, .next_sibling and .previous_sibling, and at navigating by tag name. These will help you to scrape web pages effectively.

          But, why can’t I just use Regular Expressions?

          Now, if you know regular expressions, you might be thinking that you can write code using them to do the same thing. I definitely had this question. Based on my experience using both BeautifulSoup and regular expressions for the same task, I found that:

          Code written with BeautifulSoup is usually more robust than code written with regular expressions. Regex-based code needs to be altered whenever a page changes. BeautifulSoup needs that in some cases too; it just handles such changes relatively better.

          Regular expressions are much faster than BeautifulSoup, usually by a factor of 100 in giving the same outcome.

          So, it boils down to speed vs. robustness of the code and there is no universal winner here. If the information you are looking for can be extracted with simple regex statements, you should go ahead and use them. For almost any complex work, I usually recommend BeautifulSoup more than regex.

          End Note

          In this article, we looked at web scraping using “BeautifulSoup” and “urllib2” in Python. We also covered the basics of HTML and performed web scraping step by step while solving a challenge. I’d recommend you practice this and use it for collecting data from web pages.


          If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.



          Top 7 Python Web Scraping Libraries & Tools In 2023

          When it comes to web scraping, there are four common approaches for gathering data: 

          Developers use web scraping libraries to create in-house web crawlers. In-house web crawlers can be highly customized, requiring  significant development and maintenance time. Building a web scraper in a language you are familiar with will allow you to reduce the development time and resources needed  to build the scraper.

          Python is the most commonly used programming language of 2023.

          In this article, we summarized the main features, pros and cons of the most common open-source Python web scraping libraries.

          1. Beautiful Soup

          Beautiful Soup is a Python web scraping library that extracts data from HTML and XML files.

          Beautiful Soup Installation: You can install Beautiful Soup 4 with the “pip install beautifulsoup4” command.

          Prerequisites:

          Python

          Pip: It is a Python-based package management system.

          Supported features of Beautiful Soup:

          Beautiful Soup works with the built-in HTML parser in Python and other third-party Python parsers, such as HTML5lib and lxml.

          Beautiful Soup uses a sub-library called Unicode, Dammit to detect the encoding of a document automatically.

          BeautifulSoup provides a Pythonic interface and idioms for searching, navigating and modifying a parse tree.

          Beautiful Soup converts incoming HTML and XML entities to Unicode characters automatically.

          Benefits of Beautiful Soup:

          Provides parsers such as the “lxml” package for processing XML data, as well as parsers specific to HTML.

          Parses documents as HTML by default. You need to install lxml in order to parse a document as XML.

          Reduces time spent on data extraction and parsing the web scraping output.

          Lxml parser is built on the C libraries libxml2 and libxslt, allowing  fast and efficient XML and HTML parsing and processing.

          The Lxml parser is capable of handling large and complex HTML documents. It is a good option if you intend to scrape large amounts of web data.

          Can deal with broken HTML code, as shown in the sketch below.
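          As a minimal sketch of that last point, the snippet below parses a deliberately malformed fragment; the fragment itself is made up for illustration.

          from bs4 import BeautifulSoup

          broken_html = "<html><body><p>Unclosed paragraph<li>Stray list item"   # malformed on purpose
          soup = BeautifulSoup(broken_html, "html.parser")   # swap in "lxml" for faster parsing
          print(soup.prettify())            # Beautiful Soup repairs the tree and closes the tags
          print(soup.find("li").text)       # -> "Stray list item"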

          Challenges of Beautiful Soup:

          BeautifulSoup html.parser and html5lib are not suitable for time-critical tasks. If response time is crucial, lxml can accelerate the parsing process.

          Most websites employ detection techniques like browser fingerprinting and bot protection technology, such as Amazon’s, to prevent users from grabbing a web page’s HTML. For instance, when you send a get request to the target server, the target website may detect that you are using a Python script and block your IP address in order to control malicious bot traffic.

          Bright Data provides a residential proxy network with 72+ million IPs from 195 countries, allowing developers to circumvent restrictions and IP blocks.

          2. Requests

          Requests is an HTTP library that allows users to make  HTTP calls to collect data from web sources.

          Requests Installation: Requests’s source code is available on GitHub for integration into your Python package. Requests officially supports Python 3.7+.

          Prerequisites:

          Python

          Pip: You can import Requests library with the “pip install requests” command in your Python package.

          Features of Requests:

          Requests automatically decodes web content from the target server. There’s also a built-in JSON decoder if you’re working with JSON data.

          It uses a request-response protocol to communicate between clients and servers in a network.

          Requests provides built-in support for the standard HTTP methods, including GET, POST, PUT, PATCH, DELETE and HEAD, for making HTTP requests to the target web server (a short usage sketch follows this list).

          GET: Is used to extract data from the target web server.

          POST: Sends data to a server to create a resource.

          PUT: Creates or replaces the specified resource with the request payload.

          DELETE: Removes the specified resource.

          PATCH: Enables partial modifications to a specified resource.

          HEAD: Requests the headers of a particular resource, similar to GET, but does not return the response body.
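          A minimal sketch of a GET call with Requests, using a public echo endpoint purely for illustration:

          import requests

          # httpbin.org is a public testing service; any URL would work the same way
          response = requests.get("https://httpbin.org/get", params={"q": "web scraping"}, timeout=10)
          print(response.status_code)              # 200 if the request succeeded
          print(response.headers["Content-Type"])  # response headers behave like a dict
          data = response.json()                   # built-in JSON decoder
          print(data["args"])                      # -> {'q': 'web scraping'}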

          Benefits of Requests:

          Requests supports SOCKS and HTTP(S) proxy protocols.

          Figure 2: Showing how to import proxies into the user’s coding environment

          Source: Requests

          It supports Transport Layer Security (TLS) and Secure Sockets Layer (SSL) verification. TLS and SSL are cryptographic protocols that establish an encrypted connection between two computers on a network.

          Challenges of Requests:

          It is not intended for data parsing.

          It does not render JavaScript web pages.

          3. Scrapy

          Scrapy is an open-source web scraping and web crawling framework written in Python.

          Scrapy installation: You can install Scrapy from PyPI by using the “pip install Scrapy” command. Its documentation provides a step-by-step installation guideline with more information.

          Features of Scrapy:

          Extract data from HTML and XML sources using XPath and CSS selectors (see the spider sketch after this list).

          Offer a built-in telnet console for monitoring and debugging your crawler. It is important to note that using the telnet console over public networks is not secure because it does not provide transport-layer security.

          Include built-in extensions and middlewares for handling:

          Robots.txt

          User-agent spoofing

          Cookies and sessions

          Support for HTTP proxies.

          Save extracted data in CSV, JSON, or XML file formats.
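          Here is a minimal spider sketch illustrating CSS selectors and link following; quotes.toscrape.com is a public demo site and the selectors are illustrative, not taken from the article.

          import scrapy

          class QuotesSpider(scrapy.Spider):
              name = "quotes"
              start_urls = ["https://quotes.toscrape.com/"]

              def parse(self, response):
                  # extract fields with CSS selectors
                  for quote in response.css("div.quote"):
                      yield {
                          "text": quote.css("span.text::text").get(),
                          "author": quote.css("small.author::text").get(),
                      }
                  # follow the pagination link, if any
                  next_page = response.css("li.next a::attr(href)").get()
                  if next_page is not None:
                      yield response.follow(next_page, callback=self.parse)

          Running it with “scrapy runspider quotes_spider.py -o quotes.json” would save the output in JSON, one of the export formats listed above.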

          Benefits of Scrapy:

          Scrapy shell is an in-built debugging tool. It allows users to debug scraping code without running  the spider to figure out what needs to be fixed.

          Support robust encoding and auto-detection to handle foreign, non-standard, and broken encoding declarations.

          Challenges of Scrapy:

          Python 3.7+ is necessary for Scrapy.

          4. Selenium

          Selenium offers different open-source extensions and libraries to support web browser automation.

          WebDriver APIs: Utilizes the automation APIs made available by browser vendors to control the browser for automation and web testing.

          IDE (Integrated Development Environment): Is a Chrome and Firefox extension for creating test cases.

          Grid: Makes it simple to run tests on multiple machines in parallel.

          Figure 3: Selenium’s toolkit for browser automation

          Source: Selenium

          Prerequisites:

          Eclipse

          Selenium Web Driver for Python

          To learn how to set up Selenium, check Selenium for beginners.

          Features of Selenium:

          Provides testing automation features

          Capture Screenshots

          Provide JavaScript execution

          Supports various programming languages such as Python, Ruby, JavaScript (Node.js) and Java.

          Benefits of Selenium:

          Offers headless browser testing (a minimal sketch follows this list). A headless web browser lacks user interface elements such as icons, buttons, and drop-down menus. Headless browsers extract data from web pages without rendering the entire page. This speeds up data collection because you don’t have to wait for entire web pages to load visual elements like videos, gifs, and images.

          Can scrape JavaScript-rich web pages.

          Operates in multiple browsers (Chrome, Firefox, Safari, Opera and Microsoft Edge).
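          A minimal headless Chrome sketch using Selenium 4 syntax; the demo URL is illustrative and a matching chromedriver is assumed to be available.

          from selenium import webdriver
          from selenium.webdriver.chrome.options import Options
          from selenium.webdriver.common.by import By

          options = Options()
          options.add_argument("--headless=new")          # run Chrome without a visible window
          driver = webdriver.Chrome(options=options)      # assumes chromedriver is installed

          driver.get("https://quotes.toscrape.com/js/")   # a JavaScript-rendered demo page
          quotes = driver.find_elements(By.CSS_SELECTOR, "span.text")
          print([q.text for q in quotes[:3]])             # text extracted after JS has executed
          driver.save_screenshot("page.png")              # screenshot capability mentioned above
          driver.quit()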

          Challenges of Selenium:

          Taking screenshots of PDFs is not possible.

          5. Playwright

          Playwright is an open-source framework designed for web testing and automation. It is maintained by Microsoft.

          Features of Playwright:

          Three things are required to install Playwright:

          Python

          Pytest plugin

          Required browsers

          Benefits of Playwright:

          Capable of scraping JavaScript-rendered websites.

          Takes a screenshot of either a single element or the entire scrollable page, as in the sketch below.
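          A minimal sketch using Playwright’s synchronous Python API, assuming the browsers have been installed with “playwright install”; the URL is illustrative.

          from playwright.sync_api import sync_playwright

          with sync_playwright() as p:
              browser = p.chromium.launch(headless=True)
              page = browser.new_page()
              page.goto("https://quotes.toscrape.com/js/")            # JavaScript-rendered demo page
              print(page.title())
              page.screenshot(path="full_page.png", full_page=True)   # entire scrollable page
              html = page.content()                                   # rendered HTML for later parsing
              browser.close()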

          Challenges of Playwright:

          It does not support data parsing.

          6. Lxml

          Lxml is another Python-based library for processing and parsing XML and HTML content. Lxml is a wrapper over the C libraries libxml2 and libxslt. Lxml combines the speed of the C libraries with the simplicity of the Python API.

          Lxml installation: You can download and install the lxml library from Python Package Index (PyPI).

          Requirements

          Python 2.7 or 3.4+

          Pip package management tool (or virtualenv)

          Features of LXML:

          Lxml provides two different API for handling XML documents:

          lxml.etree: A generic, highly efficient API for handling XML and HTML (see the sketch after this list).

          lxml.objectify: It is a specialized API for handling XML data in Python object syntax.

          Lxml currently supports DTD (Document Type Definition), Relax NG, and XML Schema schema languages.
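          A minimal lxml sketch; the HTML string is made up for illustration and is deliberately missing closing tags.

          from lxml import etree, html

          snippet = "<ul><li><a href='/a'>First</a><li><a href='/b'>Second</a>"   # broken on purpose
          tree = html.fromstring(snippet)                  # the HTML parser repairs the markup
          print(tree.xpath("//a/@href"))                   # -> ['/a', '/b']
          print(tree.xpath("//a/text()"))                  # -> ['First', 'Second']
          print(etree.tostring(tree, pretty_print=True).decode())   # repaired document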

          Benefits of LXML:

          The key benefit of lxml is that it parses larger and more complex documents faster than other Python libraries. It performs at C-level libraries, including libxml2 and libxslt, making lxml fast.

          Challenges of LXML:

          lxml does not parse Python unicode strings. You must provide data that can be parsed in a valid encoding.

          The libxml2 HTML parser may fail to parse meta tags in broken HTML.

          Lxml Python binding for libxml2 and libxslt is independent of existing Python bindings. This results in some issues, including manual memory management and inadequate documentation.

          7. Urllib3

          Urllib is a popular Python package used to fetch URLs and extract information from HTML documents. It bundles several modules:

          urllib.request: for opening and reading URLs (mostly HTTP).

          urllib.parse: for parsing URLs.

          urllib.error: for the exceptions raised by urllib.request.

          urllib.robotparser: for parsing robots.txt files. The robots.txt file instructs a web crawler on which URLs it may access on a website.

          Beyond the standard urllib package, two related libraries are often mentioned alongside it: urllib2 and urllib3.

          urllib2: Sends HTTP requests and returns the page’s meta information, such as headers. It is included in Python version 2’s standard library.

          Figure 4: urllib2 sends a request to retrieve the target page’s meta information

          Source: Urllib2

          urllib3: urllib3 is one of the most downloaded PyPI (Python Package Index) packages.

          Urllib3 installation: Urllib3 can be installed using pip (package installer for Python). You can execute the “pip install urllib3” command to install urllib in your Python environment. You can also get the most recent source code from GitHub.

          Figure 5: Installing Urllib3 using pip command

          Source: GitHub

          Features of Urllib3:

          Proxy support for HTTP and SOCKS.

          Provides client-side TLS/SSL verification (a minimal usage sketch follows).
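          A minimal urllib3 sketch using its PoolManager; the echo endpoint is used purely for illustration.

          import urllib3

          http = urllib3.PoolManager()     # manages connection pools and certificate verification
          response = http.request("GET", "https://httpbin.org/get", fields={"q": "web scraping"})
          print(response.status)                        # 200 on success
          print(response.data[:200].decode("utf-8"))    # raw body bytes, decoded for display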

          Benefits of Urllib3:

          Urllib3’s pool manager verifies certificates when making requests and keeps track of required connection pools.

          Urllib allows users to access and parse data from HTTP and FTP protocols.

          Challenges of Urllib3:

          It can be more challenging to use than higher-level libraries such as Requests.

          8. MechanicalSoup

          MechanicalSoup is a Python library that automates website interaction.

          MechanicalSoup installation: MechanicalSoup is distributed on the Python Package Index (PyPI); install it with the “pip install MechanicalSoup” command.

          Features of MechanicalSoup:

          MechanicalSoup uses the BeautifulSoup (bs4) library, so you can navigate through the tags of a page using BeautifulSoup.

          Automatically stores and sends cookies.

          Utilizes Beautiful Soup’s find() and find_all() methods to extract data from an HTML document.

          Allows users to fill out forms using a script, as in the sketch below.
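          A minimal form-filling sketch with MechanicalSoup; the demo site and field name are illustrative.

          import mechanicalsoup

          browser = mechanicalsoup.StatefulBrowser()
          browser.open("https://httpbin.org/forms/post")   # public demo form, used for illustration
          browser.select_form("form")                      # pick the first <form> on the page
          browser["custname"] = "Jane Doe"                 # field name taken from the demo form
          response = browser.submit_selected()             # cookies are stored and sent automatically
          print(response.status_code)                      # 200 on success
          print(browser.page)                              # browser.page is a BeautifulSoup object of the current page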

          Benefits of MechanicalSoup:

          Supports CSS and XPath selectors. XPaths and CSS Selectors enable users to locate elements on a web page.

          Challenges of MechanicalSoup:

          MechanicalSoup is only compatible with HTML pages. It does not support JavaScript. You cannot access and retrieve elements on JavaScript-based web pages.

          Does not support JavaScript rendering and proxy rotation. 

          Further reading

          Feel free to Download our whitepaper for a more in-depth understanding of web scraping:

          If you have more questions, do not hesitate to contact us:

          Gulbahar Karatas

          Gülbahar is an AIMultiple industry analyst focused on web data collections and applications of web data.


          Using Slicers In Excel Pivot Table – A Beginner’s Guide

          A Pivot Table Slicer enables you to filter the data when you select one or more than one options in the Slicer box (as shown below).

          Let’s get started.

          Suppose you have a dataset as shown below:

          This is a dummy data set (US retail sales) and spans across 1000 rows. Using this data, we have created a Pivot Table that shows the total sales for the four regions.

          Read More: How to Create a Pivot Table from Scratch.

          Once you have the Pivot Table in place, you can insert Slicers.

          One may ask – Why do I need Slicers? 

          You may need slicers when you don’t want the entire Pivot Table, but only a part of it. For example, if you don’t want to see the sales for all the regions, but only for South, or South and West, then you can insert the slicer and quickly select the desired region(s) for which you want to get the sales data.

          Slicers are a more visual way that allows you to filter the Pivot Table data based on the selection.

          Here are the steps to insert a Slicer for this Pivot Table:

          Select any cell in the Pivot Table.

          In the Insert Slicers dialog box, select the dimension for which you want the ability to filter the data. The Slicer Box would list all the available dimensions and you can select one or more dimensions at once. For example, if I only select Region, it will insert the Region Slicer box only, and if I select both Region and Retailer Type, it’ll insert two Slicers.

          Note that Slicer would automatically identify all the unique items of the selected dimension and list it in the slicer box.

          You can also insert multiple slicers by selecting more than one dimension in the Insert Slicers dialog box.

          To insert multiple slicers:

          Select any cell in the Pivot Table.

          In the Insert Slicers dialog box, select all the dimensions for which you want to get the Slicers.

          This will insert all the selected Slicers in the worksheet.

          Note that these slicers are linked to each other. For example, If I select ‘Mid West’ in the Region filter and ‘Multiline’ in the Retailer Type filter, then it will show the sales for all the Multiline retailers in Mid West region only.

          Also, if I select Mid West, note that the Specialty option in the second filter gets a lighter shade of blue (as shown below). This indicates that there is no data for Specialty retailer in the Mid West region.

          What’s the difference between Slicers and Report Filters?

          Here are some key differences between Slicers and Report Filters:

          Slicers don’t occupy a fixed cell in the worksheet. You can move these like any other object or shape. Report Filters are tied to a cell.

          Report filters are linked to a specific Pivot Table. Slicers, on the other hand, can be linked to multiple Pivot Tables (as we will see later in this tutorial).

          Since a report filter occupies a fixed cell, it’s easier to automate it via VBA. On the other hand, a slicer is an object and would need a more complex code.

          A Slicer comes with a lot of flexibility when it comes to formatting.

          Here are the things that you can customize in a slicer.

          If you don’t like the default colors of a slicer, you can easily modify it.

          Select the slicer.

          If you don’t like the default styles, you can create your own. To do this, select the New Slicer Style option and specify your own formatting.

          By default, a Slicer has one column and all the items of the selected dimension are listed in it. In case you have many items, Slicer shows a scroll bar that you can use to go through all the items.

          You may want to have all the items visible without the hassle of scrolling. You can do that by creating multiple column Slicer.

          To do this:

          Select the Slicer.

          Change the Columns value to 2.

          This will instantly split the items in the Slicer into two columns. However, you may get something looking as awful as shown below:

          This looks cluttered and the full names are not displayed. To make it look better, you can change the size of the slicer and even of the buttons within it.

          To do this:

          Select the Slicer.

          Change Height and Width of the Buttons and the Slicer. (Note that you can also change the size of the slicer by simply selecting it and using the mouse to adjust the edges. However, to change the button size, you need to make the changes in the Options only).

          By default, a Slicer picks the field name from the data. For example, if I create a slicer for Regions, the header would automatically be ‘Region’.

          You may want to change the header or completely remove it.

          Here are the steps:

          In the Slicer Settings dialog box, change the header caption to what you want.

          This would change the header in the slicer.

          If you don’t want to see the header, uncheck the Display Header option in the dialog box.

          By default, the items in a Slicer are sorted in an ascending order in case of text and Older to Newer in the case of numbers/dates.

          You can change the default setting and even use your own custom sort criteria.

          Here is how to do this:

          In the Slicer Settings dialog box, you can change the sorting criteria, or use your own custom sorting criteria.

          Read More: How to create custom lists in Excel (to create your own sorting criteria)

          It may happen that some of the items in the Pivot Table have no data in them. In such cases, you can make the Slicer hide those items so they are not displayed at all.

          Here are the steps to do this:

          In the Slicer Settings dialog box, with the ‘Item Sorting and Filtering’ options, check the option ‘Hide items with no data’.

          A slicer can be connected to multiple Pivot Tables. Once connected, you can use a single Slicer to filter all the connected Pivot Tables simultaneously.

          Remember, to connect different Pivot Tables to a Slicer, the Pivot Tables need to share the same Pivot Cache. This means that they were either created using the same data, or one of the Pivot Tables has been copied and pasted as a separate Pivot Table.

          Read More: What is Pivot Table Cache and how to use it?

          Below is an example of two different Pivot tables. Note that the Slicer in this case only works for the Pivot Table on the left (and has no effect on the one on the right).

          To connect this Slicer to both the Pivot  Tables:

          In the Report Connections dialog box, you will see all the Pivot Table names that share the same Pivot Cache. Select the ones you want to connect to the Slicer. In this case, I only have two Pivot Tables and I’ve connected both with the Slicer.

          Now your Slicer is connected to both the Pivot Tables. When you make a selection in the Slicer, the filtering would happen in both the Pivot Tables (as shown below).

          Just as you use a Slicer with a Pivot Table, you can also use it with Pivot Charts.

          Something as shown below:

          Here is how you can create this dynamic chart:

          Make the fields selections (or drag and drop fields into the area section) to get the Pivot chart you want. In this example, we have the chart that shows sales by region for four quarters. (Read here on how to group dates as quarters).

          Select the Slicer dimension you want with the Chart. In this case, I want the retailer types so I check that dimension.

          Format the Chart and the Slicer and you’re done.

          Note that you can connect multiple Slicers to the same Pivot Chart and you can also connect multiple charts to the same Slicer (the same way we connected multiple Pivot Tables to the same Slicer).

          You May Also Like the Following Pivot Table Tutorials:

          Machine Learning Using C++: A Beginner’s Guide To Linear And Logistic Regression

          Why C++ for Machine Learning?

          The applications of machine learning transcend boundaries and industries so why should we let tools and languages hold us back? Yes, Python is the language of choice in the industry right now but a lot of us come from a background where Python isn’t taught!

          The computer science faculty in universities are still teaching programming in C++ – so that’s what most of us end up learning first. I understand why you should learn Python – it’s the primary language in the industry and it has all the libraries you need to get started with machine learning.

          But what if your university doesn’t teach it? Well – that’s what inspired me to dig deeper and use C++ for building machine learning algorithms. So if you’re a college student, a fresher in the industry, or someone who’s just curious about picking up a different language for machine learning – this tutorial is for you!

          In this first article of my series on machine learning using C++, we will start with the basics. We’ll understand how to implement linear regression and logistic regression using C++!

          Let’s begin!

          Note: If you’re a beginner in machine learning, I recommend taking the comprehensive Applied Machine Learning course.

          Linear Regression using C++

          Let’s first get a brief idea about what linear regression is and how it works before we implement it using C++.

          Linear regression models are used to predict the value of one factor based on the value of another factor. The value being predicted is called the dependent variable and the value that is used to predict the dependent variable is called an independent variable. The mathematical equation of linear regression is:

                                                           Y = B0 + B1*X

          Here,

          X: Independent variable

          Y: Dependent variable

          B0: Represents the value of Y when X=0

          B1: Regression Coefficient (this represents the change in the dependent variable based on the unit change in the independent variable)

          For example, we can use linear regression to understand whether cigarette consumption can be predicted based on smoking duration. Here, your dependent variable would be “cigarette consumption”, measured in terms of the number of cigarettes consumed daily, and your independent variable would be “smoking duration”, measured in days.

          Loss Function

          The loss is the error in our predicted value of B0 and B1. Our goal is to minimize this error to obtain the most accurate value of B0 and B1 so that we can get the best fit line for future predictions.

          For simplicity, we will use the below loss function:

          e(i) = p(i) - y(i)

          Here,

          e(i) : error of ith training example

          p(i) : predicted value of ith training example

          y(i): actual value of ith training example

          Overview of the Gradient Descent Algorithm

          Gradient descent is an iterative optimization algorithm to find the minimum of a function. In our case here, that function is our Loss Function.

          Here, our goal is to find the minimum value of the loss function (that is quite close to zero in our case). Gradient descent is an effective algorithm to achieve this. We start with random initial values of our coefficients B0 and B1 and based on the error on each instance, we’ll update their values.

          Here’s how it works:

          Initially, let B1 = 0 and B0 = 0. Let L be our learning rate. This controls how much the value of B1 changes with each step. L could be a small value like 0.01 for good accuracy

          We calculate the error for the first point: e(1) = p(1) – y(1)

          We’ll update B0 and B1 according to the following equation:

             b0(t+1) = b0(t) - L * error
             b1(t+1) = b1(t) - L * error * x

          We’ll do this for each instance of the training set. This completes one epoch. We’ll repeat this for more epochs to get more accurate predictions (a minimal sketch of the loop, written in Python purely for illustration, follows).
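          The update loop itself is language-agnostic; the sketch below is a Python illustration of the same procedure on a made-up five-point dataset, not the article’s C++ gist code.

          # Python illustration of the update loop described above (not the article's C++ code)
          x = [1, 2, 4, 3, 5]            # made-up training inputs
          y = [1, 3, 3, 2, 5]            # made-up targets
          b0, b1, L = 0.0, 0.0, 0.01     # coefficients and learning rate

          for epoch in range(4):                 # 4 epochs, as in the walkthrough below
              for xi, yi in zip(x, y):
                  p = b0 + b1 * xi               # prediction with current coefficients
                  error = p - yi                 # e(i) = p(i) - y(i)
                  b0 = b0 - L * error            # bias update (its input is assumed to be 1.0)
                  b1 = b1 - L * error * xi       # slope update uses the input value
          print(b0, b1)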

          You can refer to these comprehensive guides to get a more in-depth intuition of linear regression and gradient descent:

          Implementing Linear Regression in C++ Initialization phase:

          We’ll start by defining our dataset. For the scope of this tutorial, we’ll use this dataset:

          We’ll train our dataset on the first 5 values and test on the last value:

          View the code on Gist.

          Next, we’ll define our variables:

          View the code on Gist.

          Training Phase

          Our next step is the gradient descent algorithm:

          View the code on Gist.

          Since there are 5 values and we are running the whole algorithm for 4 epochs, our iterative update runs 20 times. The variable p holds the predicted value of each instance, and the variable err holds its error. We then update the values of b0 and b1 as explained in the gradient descent section above. Finally, we push the error into the error vector.

          As you will notice, B0 does not have any input. This coefficient is often called the bias or the intercept and we can assume it always has an input value of 1.0. This assumption can help when implementing the algorithm using vectors or arrays.

          Finally, we’ll sort the error vector to get the minimum value of error and corresponding values of b0 and b1. At last, we’ll print the values:

          View the code on Gist.

          Testing Phase:

          View the code on Gist.

          We’ll enter the test value which is 6. The answer we get is 4.9753 which is quite close to 5. Congratulations! We just completed building a linear regression model with C++, and that too with good parameters.

          Full Code for final implementation:

          View the code on Gist.

          Logistic Regression with C++

          Logistic Regression is one of the most famous machine learning algorithms for binary classification. This is because it is a simple algorithm that performs very well on a wide range of problems.

          The name of this algorithm is logistic regression because of the logistic function that we use in this algorithm. This logistic function is defined as:

          predicted = 1 / (1 + e^-x)

          Gradient Descent for Logistic Regression

          We can apply stochastic gradient descent to the problem of finding the coefficients for the logistic regression model as follows:

          Let us suppose that, for the example dataset, the logistic regression model has three coefficients, analogous to linear regression:

          output = b0 + b1*x1 + b2*x2

          The job of the learning algorithm will be to discover the best values for the coefficients (b0, b1, and b2) based on the training data.

          Given each training instance:

          Calculate a prediction using the current values of the coefficients: prediction = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2))).

          Calculate new coefficient values based on the error in the prediction. The values are updated according to the below equation:               b = b + alpha * (y – prediction) * prediction * (1 – prediction) * x

          Where b is the coefficient we are updating and prediction is the output of making a prediction using the model. Alpha is a parameter that you must specify at the beginning of the training run. This is the learning rate and controls how much the coefficients (and therefore the model) changes or learns each time it is updated.

          Like we saw earlier when talking about linear regression, B0 does not have any input. This coefficient is called the bias or the intercept and we can assume it always has an input value of 1.0. So while updating, we’ll multiply with 1.0.

          The process is repeated until the model is accurate enough (e.g. error drops to some desirable level) or for a fixed number of iterations.
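          Again purely as an illustration of the arithmetic (in Python, not the article’s C++ gist code), a single update step for one training instance could look like this; the instance values and learning rate are made up.

          import math

          # made-up single training instance and starting coefficients, for illustration only
          x1, x2, y = 2.78, 2.55, 0
          b0, b1, b2 = 0.0, 0.0, 0.0
          alpha = 0.3

          prediction = 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))
          step = alpha * (y - prediction) * prediction * (1 - prediction)
          b0 = b0 + step * 1.0        # the bias term uses a constant input of 1.0
          b1 = b1 + step * x1
          b2 = b2 + step * x2
          print(prediction, b0, b1, b2)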

          For a beginner’s guide to logistic regression, check this out – Simple Guide to Logistic Regression.

          Implementing Logistic Regression in C++ Initialization phase

          We’ll start by defining the dataset:

          We’ll train on the first 10 values and test on the last value:

          View the code on Gist.

          Next, we’ll initialize the variables:

          View the code on Gist.

          Training Phase

          View the code on Gist.

          Since there are 10 values, we’ll run one epoch that takes 10 steps. We’ll calculate the predicted value according to the equation as described above in the gradient descent section:

          prediction = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2)))

          Next, we’ll update the variables according to the similar equation described above:

          b = b + alpha * (y – prediction) * prediction * (1 – prediction) * x

          Finally, we’ll sort the error vector to get the minimum value of error and corresponding values of b0, b1, and b2. And finally, we’ll print the values:

          View the code on Gist.

          Testing phase:

          View the code on Gist.

          When we enter x1=7.673756466 and x2= 3.508563011, we get pred = 0.59985. So finally we’ll print the class:

          View the code on Gist.

          So the class printed by the model is 1. Yes! We got the prediction right!

          Final Code for full implementation

          View the code on Gist.

          One of the more important steps in learning machine learning is to implement algorithms from scratch. The simple truth is that if we are not familiar with the basics of an algorithm, we can’t implement it in C++.


          Do You Need Proxies For Web Scraping?

          Data lies at the heart of every successful business. You need relevant competitor data to outperform your direct competitors. You need customer data to understand your target market’s needs and desires. Job market data helps you improve recruitment processes, and pricing data enables you to keep your products and services affordable to your audiences while maximizing your profits.

          At first glance, collecting relevant data seems easy enough – all you have to do is Google the information you need, and you’ll find thousands of results. However, when you need larger volumes of data, such a manual approach will not cut it. You’ll need to automate this process with web scraping bots, and you’ll need to use a proxy service to do it right.

          Learn why proxies are critical to your web scraping efforts and how they can help you make the most of the data you have available.

          About Web Scraping

          First things first, you need to understand what web scraping is. Put plainly, it’s the process of gathering and later analyzing data that’s freely available on one of the millions of websites that are currently online. It’s valuable for lead generation, competitor research, price comparison, marketing, and target market research.

          Even manual data extraction, such as searching for product pricing information yourself and exporting it to your Excel file, counts as a type of web scraping. However, web scraping is more commonly automated since manual data extraction is slow and prone to human error.

          Web scraping automation involves scraper bots that crawl dozens of websites simultaneously, loading their HTML codes, and extracting the relevant information. The bots then present the data in a readable form that’s easy to understand and analyze when needed.

          Depending on your needs, you have access to several different types of web scrapers:

          Browser Extensions

          Like any other type of browser extension, such as an ad block, web scraper browser plug-ins simply need to be installed on your browser of choice. They’re affordable, easy to use, and effective for smaller data volumes.

          Installable Software

          Installable scrapers are much more powerful. Installed directly on your device, they can go through larger quantities of data without a hitch. The only problem is that they tend to be somewhat slower.

          Cloud-Based Solutions

          The best of the bunch are cloud-based scrapers. Built for significant data volumes, they are fast, reliable, and more expensive than the rest. They can extract data into any format type you prefer and completely automate every aspect of scraping.

          You can also build your own scraping bots from scratch if you have the required skills.

          Challenges of Web Scraping

          Although web scraping seems like a cut-and-dried process, it’s rarely so. You’ll come across numerous challenges when you first get into it, some of the greatest ones being:

          Prevented Bot Access

          Few sites will willingly allow bot access as it can cause many problems. Bots create unwanted traffic, which can overwhelm servers and even cause analytics issues to the site in question. Not to mention that there are numerous malicious bots designed to cause Distributed Denial of Service (DDoS) attacks, steal information, and more. Therefore, if a site identifies your web scrapers as bots, your access will immediately be prevented.

          IP Blocks

          Geo-Restrictions

          Proxies as A Solution

          If you want to go around the aforementioned web scraping challenges, you need a dependable proxy service, such as Oxylabs. Proxies are the middle-men between your device and the internet, forwarding all information requests from you to the site you’re trying to scrape and back.

          Depending on the proxy server you choose, you can receive multiple fake IP addresses that help hide your actual location and allow you to scrape data seamlessly.
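          For instance, in a Python scraper built on the Requests library, routing traffic through a proxy is a small configuration change; the proxy address and credentials below are placeholders, not a real endpoint.

          import requests

          # placeholder proxy credentials and address - substitute your provider's details
          proxies = {
              "http": "http://username:password@proxy.example.com:8080",
              "https": "http://username:password@proxy.example.com:8080",
          }
          response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
          print(response.json())   # shows the IP address the target site sees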

          How They Can Help

          By hiding your IP address and giving you a new, fake one, proxies can help you overcome the main challenges of web scraping:

          Make as Many Information Requests as Needed

          Your proxy can provide you with changing IP addresses, allowing you to present yourself as a unique site visitor every time you make an information request. The site will have a more challenging time identifying whether you’re using bots or not.

          Go Around IP Blocks

          If a site blocks your current IP address, a proxy can route your requests through another IP address, allowing you to continue scraping without issues.

          Bypass Geo-Restrictions

          Conclusion

          A Guide To Perform 5 Important Steps Of Nlp Using Python

          This article was published as a part of the Data Science Blogathon

          Natural Language Processing is a popular machine learning technique used to analyze text content. We see a lot of fancy reports around us, and a lot of companies use business intelligence insights to drive their business. Most of these insights and reports are created using structured data. Still, there are many use cases for unstructured data, which could be in the form of text, tweets, images, etc. NLP focuses on bringing out meaningful insights from these text-based sources.

          Some examples of NLP include sentiment analysis. So if you have a company and have newly launched a product, you can analyze the sentiments of the users via their tweets. Even product reviews on your website can be analyzed in the same way.

          Challenges of NLP

          So what seems to be the challenge here?

          Let us take an example of a review: “The product is extraordinarily bad”

          Extraordinary is usually referred to in a positive way. If we were to use a keyword-based approach and tag it using the word extraordinary, then it would be incorrect. This is where NLP comes in. These situations where oxymorons are used need to be handled carefully.

          Another challenge is in terms of similar words as well as ambiguous meanings.

          Irony and sarcasm are difficult for a machine to understand.

          Advantages of NLP

          Can work with unstructured data.

          More insights on the sentiments of a customer.

          Chatbots and other such AI/ML-based devices/technologies are being improved upon.

          Steps involved in NLP

          Let us take a look at the basic steps involved in running a simple NLP algorithm using a news article dataset.

          I have imported the required libraries for this data processing using NLP. After that, I imported the file from my local system:

          import gensim
          import numpy
          #numpy.numpy.random.bit_generator = numpy.numpy.random._bit_generator
          from gensim.utils import simple_preprocess
          from gensim.parsing.preprocessing import STOPWORDS
          from nltk.stem import WordNetLemmatizer, SnowballStemmer
          from nltk.stem.porter import *
          import numpy as np
          np.random.seed(2023)
          import nltk
          nltk.download('wordnet')
          import pandas as pd

          data = pd.read_csv('C:\\Users\\ktk\\Desktop\\BBC News Test.csv', error_bad_lines=False)
          data
          data_text = data[['Text']]
          data_text['index'] = data.ArticleId
          documents = data_text

          Tokenization

          This is the first major step to be done to any data. So what does this step do? Imagine you have a 100-word document. You need to split the document into 100 separate words in order to identify the keywords and the major topics. This process is called tokenization. I have used an example where I have imported the data sets and used a gensim library for all the preprocessing steps.

          This library has a preprocess function that helps tokenize the keywords. I have used a function called preprocess to help pick out the keywords. Different libraries have different functions for this process.

          processed_docs = documents['Text'].map(preprocess)
          processed_docs[:10]

          You can also remove the punctuation in this same step. There are functions for the same as well. Since this particular dataset does not have any punctuation, I have not used the punctuation removal functions.

          Stop Word Removal

          You have a huge dataset or several articles. In these articles, you will find that a lot of words like, “is”, “was”, “were”, etc are present. These words do not technically add any value to the main topic. These are tagged as stop words. There are a number of stop word removal techniques that can be used to remove these stop words. This will help us to arrive at the topic of focus.

          import nltk
          from nltk.corpus import stopwords

          print(stopwords.words('english'))
          stop_words = stopwords.words('english')
          output = [w for w in processed_docs if w not in stop_words]
          print("\n" + str(output[0]))

          I have used stop word function present in the NLTK library. The first list contains the list of stop words considered by the system. The second list contains the list of words after the stop words have been removed.

          We will be left with only the keywords once the stop words are removed. This step is important for any NLP processing.

          Stemming

          Stemming means cutting out the other parts of a word and keeping only the stem (i.e. the important part of the word). In English, we add prefixes and suffixes to a word to form different words/tense forms of the same word.

          For example, the root word stem can take the form of stemming or stems. The stemming process will remove the suffix to give out the word – stem. I have performed both the stemming as well as the lemmatization process explained in the next step together. The code snippet for both is attached together in the next step. I have attached an example for stemming in the code below. You can notice that the word “queens” has been stemmed to “queen“.

          from nltk.stem import PorterStemmer
          from nltk.tokenize import word_tokenize

          ps = PorterStemmer()
          a = doc_sample.split(' ')
          for w in a:
              print(w, " : ", ps.stem(w))

          Another example is the word ecosystem. The root word for this is “eco” while the derived word is “ecosystem“. You do not need to be a grammar expert to perform stemming. Python has libraries that support the stemming process.

          Lemmatization

          Lemmatization is similar to stemming but is different in a complex way. Stemming simply cuts out the prefix or the suffix without thinking whether the remaining root word makes sense or not. Lemmatization on the other hand looks at the stemmed word to check whether it makes sense or not.

          For example, the word “care” when stemmed will give out “car” but when lemmatized will give out “care”. The root word care is called a lemma.

          So why is lemmatization very important?

          Lemmatization helps in the disambiguation of words. It brings out the actual meaning of the word. So if you have multiple words which share a similar meaning, lemmatization can help sort this out. Hence, this is a very important step for your NLP process.

          def lemmatize_stemming(text):
              snow_stemmer = SnowballStemmer(language='english')
              return snow_stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

          def preprocess(text):
              result = []
              for token in gensim.utils.simple_preprocess(text):
                  result.append(lemmatize_stemming(token))
              return result

          doc_sample = documents[documents['index'] == 1018].values[0][0]
          print('original document: ')
          words = []
          for word in doc_sample.split(' '):
              words.append(word)
          print(words)
          print('\n\n tokenized and lemmatized document: ')
          print(preprocess(doc_sample))

          You can see the steps used to stem and lemmatize the same news article document. Here, I have used a snowball stemmer for this process.

          Modelling 

          Modeling your text is very important if you want to find out the core idea of the text. In the case of supervised machine learning, we use logistic regression or linear regression, etc., to model the data. In those cases, we have an output variable which we use to train the model. In this case, since we do not have an output variable, we rely on unsupervised techniques.

          There are a lot of good algorithms to help model the text data. Two of the most commonly used are the SVD (Singular Value decomposition) and LDA (latent Dirichlet allocation). These are widely used across the industry and are pretty simple to understand and implement.

          LDA is a probabilistic algorithm that focuses on iteratively assigning the probability of a word belonging to a topic. I have used LDA here to identify the possible topics for an article.

          lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
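          Note that the lda_model call above assumes that a dictionary and a bag-of-words corpus have already been built from processed_docs; a minimal sketch of that step, with illustrative filter thresholds, would be:

          import gensim

          # map each token to an id, dropping very rare and very common tokens (illustrative thresholds)
          dictionary = gensim.corpora.Dictionary(processed_docs)
          dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

          # bag-of-words representation: a list of (token_id, token_count) pairs per document
          bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]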

          Here, you can see the probabilities being listed out for each article. Each keyword has a value that states the likeliness of the word being the keyword.

          Conclusion

          What I have listed out are some of the key steps in NLP. NLP is a dimension unto itself. To fully understand its magnitude, we need to first understand how deep any language can be. Since NLP focuses on text data based on language, things like irony, sarcasm, comedy, trauma, and horror need to be considered.

          On a parting note, I wish to bring to your attention that the possibilities with NLP are limitless. The industry has realized the value of text data of late and has started exploring it more. Even automated chatbots that pass the Turing test have some amount of NLP embedded in them.

          About the Author

          Hi there! This is Aishwarya Murali, currently working in the analytics division of Karnataka Bank’s – Digital Centre of Excellence in Bangalore. Some of my interesting projects include ML-based scorecards for loan journey automation, customer segmentation, and improving the market share via selective profiling of customers using some machine learning analytics.

          I have a master’s in computer applications and have done certification in Business Analytics from IIM-K. Currently, I am working on R&D innovations at my workplace.

          You can connect with me at

          You can also mail me at

          [email protected]

          The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

