ETL for eBay’s Skin Care Products

Josephine
Jan 18, 2021 · 10 min read

In this project, we are required to perform a complete ETL on web data from a Malaysian website (one whose domain ends with .my). We chose Python because it is the most widely used programming language for web scraping.

What is Extract, Transform and Load (ETL)?

ETL process

The process of extracting data from multiple source systems, transforming it to suit business needs, and loading it into a destination database is commonly called ETL, which stands for extraction, transformation, and loading.
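As a minimal, self-contained sketch of the three stages (the record fields and file name here are made up for illustration):

```python
import csv

def extract(rows):
    # Extract: pull raw records from a source; here an in-memory list
    # stands in for a web page or source database.
    return list(rows)

def transform(records):
    # Transform: clean and reshape the data to suit business needs,
    # e.g. uppercase the name and strip the currency prefix from the price.
    return [{"name": r["name"].upper(),
             "price": float(r["price"].replace("RM", "").strip())}
            for r in records]

def load(records, path):
    # Load: write the cleaned records to a destination, a CSV file here.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(records)

raw = [{"name": "toner", "price": "RM 55.70"}]
load(transform(extract(raw)), "etl_demo.csv")
```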

Phase 1: Extract

Firstly, we need to choose a Malaysian website that is legal to scrape. The website used in this project is eBay (https://www.ebay.com.my/), and the “Skin Care” category provides the products that we scrape.

Before we start the web scraping process, we need to identify which data can be scraped. The information that we choose to scrape includes:

  1. Product Name
  2. Price
  3. Availability
  4. Shipping Cost
  5. Return
  6. Item Location
  7. Item Served Area
  8. Delivery Time
Screenshot of the data which will be scraped

Before we do the web scraping, we need to import some necessary libraries.

from bs4 import BeautifulSoup 
import requests
import pandas as pd
import numpy as np

Next, we use the find() function, which searches for specific tags with specific attributes, to locate the Tag object containing the title of the product, the price, and the other attributes.
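As a quick standalone illustration of find() on a made-up HTML snippet (not the live eBay page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a product page fragment
html = '<span id="prcIsum">RM 55.70</span>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first Tag matching the given name and attributes
tag = soup.find("span", attrs={'id': 'prcIsum'})
print(tag.get_text().strip())  # RM 55.70
```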

Attribute 1: Product Name

def get_title(soup):
    try:
        title = soup.find("span", attrs={'class': 'g-hdn'})
        texttitle = title.next_sibling
    except AttributeError:
        texttitle = np.nan
    return texttitle

Attribute 2: Price

def get_price(soup):
    try:
        price = soup.find("span", attrs={'id': 'prcIsum'})
        textprice = price.getText().strip()
    except AttributeError:
        textprice = np.nan
    return textprice

Attribute 3: Availability

def get_available(soup):
    try:
        available = soup.find("span", attrs={'id': 'qtySubTxt'})
        textavailable = available.getText().strip()
    except AttributeError:
        textavailable = np.nan
    return textavailable

Attribute 4: Shipping Cost

def get_cost(soup):
    try:
        cost = soup.find("span", attrs={'id': 'fshippingCost'})
        textcost = cost.getText().strip()
    except AttributeError:
        textcost = np.nan
    return textcost

Attribute 5: Return

def get_return(soup):
    try:
        returns = soup.find("span", attrs={'id': 'vi-ret-accrd-txt'})
        textreturns = returns.getText().strip()
    except AttributeError:
        textreturns = np.nan
    return textreturns

Attribute 6: Item Location

def get_location(soup):
    try:
        location = soup.find("span", attrs={'itemprop': 'availableAtOrFrom'})
        textlocation = location.getText().strip()
    except AttributeError:
        textlocation = np.nan
    return textlocation

Attribute 7: Item Served Area

def get_serve(soup):
    try:
        serve = soup.find("span", attrs={'itemprop': 'areaServed'})
        textserve = serve.getText().strip()
    except AttributeError:
        textserve = np.nan
    return textserve

Attribute 8: Delivery Time

def get_delivery(soup):
    try:
        delivery = soup.find("div", attrs={'class': 'sh-inline-div'})
        textdelivery = delivery.getText().strip()
    except AttributeError:
        textdelivery = np.nan
    return textdelivery

Get all URLs of Skin Care products
There are websites listing tons of user agents for the reader to choose from; the following is an example of a User-Agent within the header value. A webpage is accessed by its URL (Uniform Resource Locator). With the help of the URL, we send a request to the webpage to access its data.

HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36', 'Accept-Language': 'en-US'}
URL = "https://www.ebay.com.my/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=skin+care&_sacat=0"
webpage = requests.get(URL, headers=HEADERS)

Output:

Here are some examples of the URLs that were successfully collected.
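The code that builds links_list (used in the extraction loop) is not shown in this article; a plausible sketch is the following, where the 's-item__link' anchor class is an assumption about eBay's search-result markup, demonstrated here on a static snippet rather than the live page:

```python
from bs4 import BeautifulSoup

def collect_links(soup):
    # Gather the href of every search-result anchor; the 's-item__link'
    # class name is an assumption about eBay's markup.
    return [a.get("href") for a in soup.find_all("a", attrs={'class': 's-item__link'})]

# Demonstration on a small static snippet (not the live page):
sample = ('<a class="s-item__link" href="https://www.ebay.com.my/itm/1">A</a>'
          '<a class="s-item__link" href="https://www.ebay.com.my/itm/2">B</a>')
links_list = collect_links(BeautifulSoup(sample, "html.parser"))
print(links_list)
```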

Next, we create lists to store the data for every attribute and loop over the links to extract the data from each page.

title_data = []
price_data = []
available_data = []
cost_data = []
return_data = []
location_data = []
serve_data = []
delivery_data = []
for url in links_list:
    new_webpage = requests.get(url, headers=HEADERS)
    new_soup = BeautifulSoup(new_webpage.content, "lxml")
    # append each attribute so the lists accumulate one entry per link
    title_data.append(get_title(new_soup))
    price_data.append(get_price(new_soup))
    available_data.append(get_available(new_soup))
    cost_data.append(get_cost(new_soup))
    return_data.append(get_return(new_soup))
    location_data.append(get_location(new_soup))
    serve_data.append(get_serve(new_soup))
    delivery_data.append(get_delivery(new_soup))

Output:

Finally, we combine all the extracted attributes into a pandas DataFrame. We use the head() function to return the top n rows (5 by default) of the DataFrame.

pd.options.display.max_colwidth = None
df = pd.DataFrame(list(zip(title_data, price_data, available_data, cost_data, return_data, location_data, serve_data, delivery_data)),
                  columns=["Product Name", "Price", "Availability", "Shipping Cost", "Return", "Item Location", "Item Served Area", "Delivery Time"])
df.head()

Output:

First 5 rows of data in the dataframe

Phase 2: Transform

Data transformation is an integral part of visualization: choosing the variables to show and their level of detail is just as important as choosing appropriate visual encodings. Before doing the data transformation, we analyze the features of the data:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Product Name 50 non-null object
1 Price 50 non-null object
2 Availability 30 non-null object
3 Shipping Cost 10 non-null object
4 Return 50 non-null object
5 Item Location 50 non-null object
6 Item Served Area 50 non-null object
7 Delivery Time 9 non-null object
dtypes: object(8)
memory usage: 3.2+ KB

After observing the features, we know what they are:

  • Product Name (Name of the product)
  • Price (Price of the product)
  • Availability (Availability of the stock)
  • Shipping Cost (Cost of shipping the product)
  • Return (Whether the item can be returned or not)
  • Item Location (Location where the product is stored)
  • Item Served Area (Area where the product can be delivered)
  • Delivery Time (Estimated time to be delivered)

After that, we start by calculating the missing values for each feature; the results are shown as percentages (%):

for col in df.columns:
    print(col, df[col].unique(), len(df))
    pct_missing = np.mean(df[col].isnull())
    print('Missing values: {} - {}%\n'.format(col, round(pct_missing * 100)))

Output (Complete output may refer to this link):

Missing values: Product Name - 0%
Missing values: Price - 0%
Missing values: Availability - 40%
Missing values: Shipping Cost - 80%
Missing values: Return - 0%
Missing values: Item Location - 0%
Missing values: Item Served Area - 0%
Missing values: Delivery Time - 82%

In brief, we observe that Availability (40%), Shipping Cost (80%) and Delivery Time (82%) contain missing values, while the rest of the features do not. Thus, we start handling the missing values in those features:

Step 1: Handling Missing Values

Availability
Before:

df['Availability'].head(15)

Output:

0                3 available
1 NaN
2 NaN
3 NaN
4 10 available
5 10 available
6 NaN
7 NaN
8 NaN
9 More than 10 available
10 NaN
11 5 available
12 2 available
13 NaN
14 NaN
Name: Availability, dtype: object

Visualization:

Pie Chart of Availability

Data Cleaning Process
Remove those unnecessary words:

df["Availability"] = df["Availability"].str.replace('More', '')
df["Availability"] = df["Availability"].str.replace('than', '')
df["Availability"] = df["Availability"].str.replace('available', '')
df['Availability'].head(15)

Output:

0        3
1 NaN
2 NaN
3 NaN
4 10
5 10
6 NaN
7 NaN
8 NaN
9 10
10 NaN
11 5
12 2
13 NaN
14 NaN
Name: Availability, dtype: object

Then, we replace the integer counts with “Available” and NaN with “Out of Stock”:

df.loc[df['Availability'].notnull(), 'Availability'] = 'Available'
df.loc[df['Availability'].isnull(), 'Availability'] = 'Out of Stock'
df['Availability'].head(15)

After — Output:

0 Available
1 Out of Stock
2 Out of Stock
3 Out of Stock
4 Available
5 Available
6 Out of Stock
7 Out of Stock
8 Out of Stock
9 Available
10 Out of Stock
11 Available
12 Available
13 Out of Stock
14 Out of Stock
Name: Availability, dtype: object

After that, we visualize again and see the differences:

Pie Chart of Availability

Shipping Cost
Before:

df['Shipping Cost'].head(15)

Output:

0 RM 20.00
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 US $12.97\n\n\t\t\t\t\t\t\t(approx. RM 52.35)
13 NaN
14 NaN
Name: Shipping Cost, dtype: object

Data Cleaning Process:
Firstly, we keep the RM amount (for USD prices, eBay already shows an approximate RM value in parentheses), which also drops the “RM” currency symbol; then we remove the trailing “)”:

# Split strings into lists of words
sc = df['Shipping Cost'].str.split()
# Take the last element (the RM amount)
df['Shipping Cost'] = sc.str[-1]
df['Shipping Cost'].head(15)
# Remove the ')' (regex=False: treat it as a literal, not a pattern)
df['Shipping Cost'] = df['Shipping Cost'].str.replace(')', '', regex=False)
df['Shipping Cost'].head(15)

After that, we replace the “NaN” values with 0.00 to show that no fee is charged, then convert the column to float:

df['Shipping Cost'] = df['Shipping Cost'].replace(np.nan, '0.00')
df['Shipping Cost'] = df['Shipping Cost'].astype(float)
df['Shipping Cost'].head(15)

After — Output:

0 20.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.00
6 0.00
7 0.00
8 0.00
9 0.00
10 0.00
11 0.00
12 52.35
13 0.00
14 0.00
Name: Shipping Cost, dtype: float64

Then, we set the ranges of the Shipping Cost: Free (0.00), Low (0.01 to 60.00), Medium (60.01 to 120.00) and lastly High (120.01 to 180.00).

# binning with ranges: 0.00 falls into the first bin (-1, 0] = 'Free'
bins = [-1, 0, 60, 120, 180]
labels = ['Free', 'Low', 'Medium', 'High']
df['Range'] = pd.cut(df['Shipping Cost'], bins, labels=labels)
r = df['Range']
r.head(15)

After that, the column “Range” will be shown in the table (may refer to the link). Then, visualization will be shown in Bar Chart and Pie Chart:

Bar chart of Shipping Cost
Pie Chart of Shipping Cost

Delivery Time
Firstly, let’s see the original output:

Before:

df['Delivery Time'].unique()

Output:

array([‘Estimated Delivery within 10–22 business days’, nan, ‘Estimated Delivery within 4–9 business days’, ‘Estimated Delivery within 2–4 business days’], dtype=object)

Data Cleaning Process:
We notice that there is a lot of surrounding text to be processed, so we extract only the day range:

# extract only the number of days; the pattern accepts a hyphen or an en dash
df['Delivery Time'] = df['Delivery Time'].str.extract(r'(\d+[–-]\d+)')

We will show it in a simple bar chart:

Bar chart of Delivery Time

Then, we replace those “NaN” with standard delivery time 7–14 business days:

# replace NaN with 7–14 business days
df['Delivery Time'] = df['Delivery Time'].replace(np.nan, '7–14')
df.head()

Then, we visualize it into bar chart again and let’s see the differences:

Bar chart of Delivery Time

Lastly, we converted the range of delivery time into average delivery days:

# find the average delivery days and convert the data type from object to float
# (normalize the en dash first so the split works for both dash styles)
df['Delivery Time'] = (df['Delivery Time'].str.replace('–', '-')
                       .str.split('-', expand=True).astype(float).mean(axis=1))
df = df.rename(columns={"Delivery Time": "Average Delivery Days"})
print(df['Average Delivery Days'].dtypes)
df.head()

After — Output:

First 5 rows of data from the dataframe

Step 2: Remove columns with low variability

As we observed, there are two features whose contents barely vary across rows: “Return” and “Item Location”. Let’s see how the contents look:

Before:

df[['Return', 'Item Location']].head()

Output:

Then, let’s see the complete data for these features visualized in a bar chart:

Bar Chart of Return and Item Location

After:
After identifying the low variability, we decide to drop those features from the table:

df.drop(columns=['Return', 'Item Location'], inplace=True)
df.head()

Step 3: Change Letters to Uppercase

For all the product names, we will change all the letters to uppercase to ensure the naming convention of the data is standardized. Let’s see the original product names:

Before:

df['Product Name'].head(15)

Output:

After:
Then, we convert the letter case:

# Convert the Product Name to uppercase
df['Product Name'] = df['Product Name'].str.upper()
df['Product Name'].head(15)

Output:

Step 4: Convert currency and remove symbols

For the feature “Price”, the original data is shown as below:

Before:

df['Price'].head(15)

Output:

0 RM 55.70
1 RM 100.00
2 RM 185.00
3 RM 30.00
4 RM 806.00
5 RM 29.90
6 RM 185.00
7 RM 20.00
8 RM 60.99
9 RM 120.00
10 RM 120.00
11 RM 30.97
12 US $29.97
13 RM 185.00
14 RM 2,500.00
Name: Price, dtype: object

As we observed, there are different currencies; prices are prefixed with “RM” or “US $”. Thus, we extract the numeric part of the data:

# extract the numbers in the Price column
# (note: this pattern stops at a comma, so 'RM 2,500.00' yields '500.00')
df['Price'] = df['Price'].str.extract(r'(\d+\.\d+)')
df['Price'].unique()

Output:

array(['55.70', '100.00', '185.00', '30.00', '806.00', '29.90', '20.00', '60.99', '120.00', '30.97', '29.97', '500.00', '29.00', '300.00', '169.00', '150.00', '220.00', '160.00', '148.00', '65.00', '69.00', '50.00', '35.00', '400.00', '195.00', '360.00', '158.00', '399.00', '90.00', '499.00', '135.00', '85.00', '49.00', '10.00', '91.50', '109.00', '200.00', '212.05', '17.90', '39.00', '31.90'], dtype=object)

Then, we intended to convert US $ into RM. Note that the numeric extraction above has already stripped the currency symbols, so these replacements have no further effect, and no exchange-rate conversion is actually applied to the USD prices:

# strip any leftover currency markers (effectively no-ops after the extraction above)
df['Price'] = df['Price'].str.replace('US', '')
df['Price'] = df['Price'].str.replace('$', 'RM ', regex=False)
df['Price'] = df['Price'].str.replace('RM', '')
df['Price'].head(15)

Output:

0 55.70
1 100.00
2 185.00
3 30.00
4 806.00
5 29.90
6 185.00
7 20.00
8 60.99
9 120.00
10 120.00
11 30.97
12 29.97
13 185.00
14 500.00
Name: Price, dtype: object

Then, as a safeguard against thousands separators (as in RM 2,500.00), we remove any comma and convert the data type of “Price” into float:

# Remove the ',' thousands separator
df["Price"] = df["Price"].str.replace(',', '')
# convert data type of Price column into float
df['Price'] = df['Price'].astype(float)
df['Price'].head(15)

After — Output:

0 55.70
1 100.00
2 185.00
3 30.00
4 806.00
5 29.90
6 185.00
7 20.00
8 60.99
9 120.00
10 120.00
11 30.97
12 29.97
13 185.00
14 500.00
Name: Price, dtype: float64

Step 5: Generalize the data into broader categories

For the feature “Item Served Area”, there are many different areas included in the data. Let’s visualize it and observe:

Pie Chart of Item Served Area

As we observed, there are many different foreign countries. Therefore, we group all the foreign countries under “worldwide”:

#if not equal to malaysia, replace it to worldwide,
#else replace to malaysia
df['Item Served Area'] = np.where(df['Item Served Area']!='Malaysia', 'worldwide', df['Item Served Area'])
df['Item Served Area'] = np.where(df['Item Served Area']=='Malaysia', 'malaysia', df['Item Served Area'])
#replace column name
df = df.rename(columns={"Item Served Area": "Delivery Area"})
df.head()

After — Output:

After converting the terms, let’s see the visualization of “Delivery Area” in pie chart:

Pie Chart of Delivery Area

Updated Table after data transformation

Lastly, after doing all these steps, let’s have a look at our final table:

Phase 3: Load

After we have done the scraping and data cleaning processes, we store the data in a CSV file and download it.

# export the DataFrame as csv
df.to_csv("group2_ecommerce.csv")
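The ETL definition above mentions loading into a destination database; as an optional extension (not part of the original pipeline), the cleaned DataFrame could also be loaded into SQLite, for example:

```python
import sqlite3
import pandas as pd

# A small stand-in for the cleaned DataFrame
df = pd.DataFrame({"Product Name": ["TONER"], "Price": [55.70]})

conn = sqlite3.connect(":memory:")  # or a file path such as "ecommerce.db"
df.to_sql("skincare", conn, if_exists="replace", index=False)

result = pd.read_sql("SELECT * FROM skincare", conn)
conn.close()
print(result)
```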

Here is the complete code:

Lastly, special thanks to our professors Dr Sabrina and Dr Laila, who guided us throughout this subject, and thanks to all group members for their efforts in completing this project!
