Table of contents
Text Link
Text Link

SQL and Python: applying programming languages in machine learning, software development, and science careers

SQL and Python are two of the most popular programming languages. They have been consistently ranked among the top 10 essential tools by Stack Overflow, TIOBE, and IEEE Spectrum.

SQL, which stands for Structured Query Language, is the standard language with a simple syntax for database administrators to work with querying open-source relational database management systems: from statistical analysis to iterative data exploration. Python is a general-purpose programming language used in various tasks: from web application development process to machine learning, from mobile applications to refining artificial intelligence.

SQL and Python can handle complex queries and aggregate functions, but Python has a broader range of applications than SQL. In this article, we will examine the similarities and differences between the two languages and showcase how they can complement each other in different operations.

Similarities and Differences in Code

SQL and Python are some of the most popular programming languages for data-related tasks. SQL is listed in most data analyst job postings. Basic knowledge of SQL and visualization tools like Tableau and Power BI can get you started in data analytics. SQL and Python are suitable for data analysis, higher-level data manipulation, and visualization. You may need to integrate SQL with visualization tools, but Python has its visualization libraries.

SQL is significant when performing basic to medium computations to get quick data insights. Python is ideal when you want to transform your data and perform more advanced computations like regression tests and classification. Python is also better for business Intelligence and data mining, where you discover patterns and knowledge from large and complex datasets.

A key difference between SQL and Python is their different programming paradigms. SQL is a declarative programming language focusing on the database programming paradigm, while Python is a high-level language of the Object-oriented programming paradigm. In SQL, you specify what you want to happen, and the relational database management systems (DBMS) figure out how to do it for you. You must specify exactly how you want it to happen in Python step-by-step.

SQL and Python differ in the types of data structures they support. SQL primarily works with relational databases, tables organized into rows and columns, making it best suited for structured datasets.

In contrast, Python is more versatile and can handle a broader range of data structures, including dictionaries, queues, heaps, and trees.

Several implementations or “flavors” of the Python and SQL languages exist. There are the CPython, JPython, and IronPython implementations for Python, and there are the SQLite, MySQL, and PostgreSQL implementations for SQL.

Despite their differences, SQL and Python can be complementary. For example, SQL can extract data from relational databases for analysis and visualization in Python, and Python can preprocess data before storing it in a relational database.

Running SQL Queries from Scratch in Python

In this section, we will see how SQL and Python can complement each other for data-related and analysis tasks. Imagine the following Entity-Relationship Diagram (ERD): 

You can create the relationship between the database table and the CREATE TABLE command in SQL. The data for the single database has been carefully organized in the Python dictionary, database_dict:

# Create database dictionary from ERD

database_dict = {
    # Create the students table
    'students': {
        'fields': [
            ('student_id', 'INTEGER PRIMARY KEY'),
            ('first_name', 'TEXT NOT NULL'),
            ('last_name', 'TEXT NOT NULL')
        ],
        'entries': [
            (1, 'John', 'Doe'),
            (2, 'Jane', 'Doe'),
            (3, 'Bob', 'Smith')
        ]
    },
    # Create the student_contact table
    'student_contact': {
        'fields': [
            ('student_id', 'INTEGER NOT NULL'),
            ('email_address', 'TEXT NOT NULL'),
        ],
        'entries': [
            (1, '[email protected]'),
            (2, '[email protected]'),
            (3, '[email protected]')
        ],
        "foreign_keys": [
            ("student_id", "students(student_id)")
        ]
    },
    # Create the staff table
    'staff': {
        'fields': [
            ('staff_id', 'INTEGER PRIMARY KEY'),
            ('staff_first_name', 'TEXT NOT NULL'),
            ('staff_last_name', 'TEXT NOT NULL')
        ],
        'entries': [
            (1, 'Mark', 'Johnson'),
            (2, 'Amy', 'Lee'),
            (3, 'David', 'Wong')
        ]
    },
    # Create the staff_contact table
    'staff_contact': {
        'fields': [
            ('staff_id', 'INTEGER NOT NULL'),
            ('staff_email_address', 'TEXT NOT NULL'),
        ],
        'entries': [
            (1, '[email protected]'),
            (2, '[email protected]'),
            (3, '[email protected]')
        ],
        "foreign_keys": [
            ("staff_id", "staff(staff_id)")
        ]
    },
    # Create the courses table
    'courses': {
        'fields': [
            ('course_id', 'INTEGER PRIMARY KEY'),
            ('course_name', 'TEXT NOT NULL'),
            ('staff_id', 'INTEGER NOT NULL')
        ],
        'entries': [
            (1, 'Mathematics', 1),
            (2, 'Science', 3),
            (3, 'English', 2)
        ]
    },
    # Create the enrollment table
    'enrollment': {
        'fields': [
            ('enrollment_id', 'INTEGER PRIMARY KEY'),
            ('student_id', 'INTEGER NOT NULL'),
            ('course_id', 'INTEGER NOT NULL'),
            ('grade', 'INTEGER NOT NULL')
        ],
        'entries': [
            (1, 1, 1, 80),
            (2, 1, 2, 85),
            (3, 2, 1, 92),
            (4, 2, 3, 77),
            (5, 3, 1, 65),
            (6, 3, 2, 95),
            (7, 3, 3, 70),
            (8, 1, 3, 81),
            (9, 2, 2, 62)
        ],
        "foreign_keys": [
            ("student_id", "students(student_id)"),
            ("course_id", "courses(course_id)")
        ]
    }
}

Next, we will use the Sqlite3 API to connect to the database. If the database does not exist, it will be. Otherwise, we'll establish a connection for a new session. The next step is to create a cursor object that can execute SQL commands in the session:

%reload_ext sql
import sqlite3

# Connect to/or create the database
conn = sqlite3.connect('school.db')

# Create a cursor object
cur = conn.cursor()

# Iterate through each table in the dictionary and insert the values into the database
for table_name, table_dict in database_dict.items():
    # Create the table by CREATE TABLE SQL command
    fields = ', '.join([f'{field[0]} {field[1]}' for field in table_dict['fields']])
    cur.execute(f'CREATE TABLE IF NOT EXISTS {table_name} ({fields})')

    # Insert the entries into the table by executing the INSERT SQL command
    entries = ', '.join([str(entry) for entry in table_dict['entries']])
    cur.execute(f'INSERT OR IGNORE INTO {table_name} VALUES {entries}')

# Commit the changes and close the connection
conn.commit()
conn.close()

After executing the SQL commands and making changes to the database, the changes are not automatically saved to the database file stored on your drive. To make your changes permanent, commit your changes and close the connection to end the session.

You can also connect your database file with the sqlalchemy. Let’s execute the queries to confirm that our database has been populated with the data with this API:

from sqlalchemy import create_engine

# Create a connection to the school.db file
engine = create_engine('sqlite:///school.db')

# Execute an SQL
query = """
SELECT * FROM students;
"""
result = engine.execute(query)

print(*result.fetchall(), sep='\n')

# Close the session
engine.dispose()

You have successfully retrieved information from the students table in the school.db. Feel free to use the sqlite3 or sqlalchemy API to execute your queries.

Database Data Manipulation

SQL can also modify the data in huge databases. Let’s assume you made an incorrect entry for Jane Doe in the students table. The correct entry is Kate Doe, so let us update the necessary tables with this information:

# Connect to the school database
conn = sqlite3.connect('school.db')

# Create a cursor object
cur = conn.cursor()

# Execute SQL query
query = """
UPDATE students
SET first_name = 'Kate'
WHERE first_name = 'Jane';
"""
cur.execute(query)

query = """
UPDATE student_contact
SET email_address = '[email protected]'
WHERE student_id = 2;
"""
cur.execute(query)

# Commit and close connection
conn.commit()
conn.close()

You can verify that changes have been made by querying the students and student_info tables:

# Connect to the school database
conn = sqlite3.connect('school.db')

# Create a cursor object
cur = conn.cursor()

# Execute SQL query
result = cur.execute("SELECT * FROM students;")

print(*result.fetchall(), sep='\n')
print()

result = cur.execute("SELECT * FROM student_contact;")

print(*result.fetchall(), sep='\n')

# Close connection
conn.close()




(1, 'John', 'Doe')
(2, 'Kate', 'Doe')
(3, 'Bob', 'Smith')

(1, '[email protected]')
(2, '[email protected]')
(3, '[email protected]')
Share this article
Get more articles
like this
Thank you! Your submission has been received!
Oops! Something went wrong.

Great news! Your changes have been implemented successfully. Here's another example: if you need to obtain information about students' performance, you can use an SQL query to generate a Python DataFrame that displays their grades.

# import the pandas library
import pandas as pd

# Execute SQL query for student performance
query = """
SELECT
    student_id,
    ROUND(AVG(grade), 2) AS average_grade,
    CASE
        WHEN AVG(grade) > 95 THEN 'A+'
        WHEN AVG(grade) > 90 THEN 'A'
        WHEN AVG(grade) > 85 THEN 'B+'
        WHEN AVG(grade) > 80 THEN 'B'
        WHEN AVG(grade) > 75 THEN 'B-'
        WHEN AVG(grade) > 70 THEN 'C+'
        ELSE 'F'
    END AS gpa
FROM
    students
    INNER JOIN enrolment USING (student_id)
    INNER JOIN courses USING (course_id)
GROUP BY
    student_id
ORDER BY
    average_grade DESC;
"""

with sqlite3.connect('school.db') as conn:
    df_students_info = pd.read_sql_query(query, conn)

df_students_info

You can store the Python DataFrame as the students_gpa table in the school database as follows:

with sqlite3.connect("school.db") as conn:
    # Load the dataframe into a table in the database
    df_students_info.to_sql("students_gpa", conn, if_exists="replace", index=False)

# Now let’s check that indeed there is a new table called students_gpa in the database:
with sqlite3.connect("school.db") as conn:
    cur = conn.cursor()
    result = cur.execute("SELECT * FROM students_gpa;")
    print(*result, sep='\n')

# Output:
# (1, 82.0, 'B')
# (2, 77.0, 'B-')
# (3, 76.67, 'B-')

Data Analysis with SQL and Python

Data analysis software can be developed using both SQL and Python. SQL is commonly used for simple applications and queries such as calculating sums, averages, and counts. However, Python, with its dynamic typing, is better suited for more advanced scientific computing.

Let’s first see the distribution of grades by joining the students and students_gpa tables:

query = """
SELECT
    first_name,
    last_name,
    gpa
FROM
    students
    INNER JOIN students_gpa USING(student_id);
"""

with sqlite3.connect('school.db') as conn:
    cur = conn.cursor()
    result = cur.execute(query)
    print(*result.fetchall(), sep='\n')

# Output:
# ('John', 'Doe', 'B')
# ('Kate', 'Doe', 'B-')
# ('Bob', 'Smith', 'B-')

Next let’s see the students’ total scores across three subjects:

query = """
SELECT
    first_name || ' ' || last_name AS full_name,
    SUM(grade) AS total_score
FROM
    students
    INNER JOIN enrolment USING (student_id)
    INNER JOIN courses USING (course_id)
GROUP BY
    student_id
ORDER BY
    total_score;
"""

with sqlite3.connect('school.db') as conn:
    df_students_total = pd.read_sql_query(query, conn)

df_students_total

Next let’s see the course average:

query = """
SELECT
    course_name,
    ROUND(AVG(grade), 2) AS course_avg
FROM
    students
    INNER JOIN enrolment USING (student_id)
    INNER JOIN courses USING (course_id)
GROUP BY
    course_id
ORDER BY
    course_avg;
"""

with sqlite3.connect('school.db') as conn:
    df_course_avg = pd.read_sql_query(query, conn)

df_course_avg

In the next section, you will see how to visualize the query outputs with the Python Matplotlib’s library.

Visualization with Python Matplotlib

Data visualization is the process of presenting the results of your analyses in a graphical or visual format to help others better understand and interpret your results. Python has over 130,000 library options, especially for data visualization, but you may need to connect SQL database engines with visualization tools.

In this section, let's visualize the results of your data analyses. Let’s start by visualizing a bar chart of the students’ total scores:

import matplotlib.pyplot as plt

# Visualize the students' total scores from df_students_total dataframe
df_students_total.plot(kind='barh', x='full_name', color=['teal'], figsize=(7, 3))
plt.ylabel("Full name")
plt.xlabel("Total score")
plt.legend("", frameon=False)
plt.tight_layout()
plt.show()

Next, let’s visualize the course average scores using a pie chart:

# Visualize the course average from the df_course_avg dataframe
df_course_avg.plot(kind='pie', y='course_avg',
                   labels=df_course_avg.course_name,
                   explode=[0.02, 0.01, 0.05], figsize=(6, 6),
                   autopct='%1.1f%%')
plt.legend(loc=2, frameon=False)
plt.ylabel('')

Conveniently, you have been able to visualize the results of your analyses in Python with the matplotlib library.

Conclusion

To excel in data-related technical tasks, learning SQL and Python programming languages is crucial. These tools complement each other despite some differences, as highlighted in this article. Whether you aim to become a software engineer, machine learning engineer, or product manager, learning these fundamental languages is essential.

At Hyperskill, we offer a SQL with Python track that teaches you how to use SQL and Python to create programming projects. Additionally, we have a project-free SQL for Backend Developers track that allows you to efficiently master database theory and gain hands-on experience with MySQL. If you're interested in learning a high-level programming language like Python, we have several Python tracks available for beginners, each equipped with different tools.

Related Hyperskill Topics

Create a free account to access the full topic

Wide range of learning tracks for beginners and experienced developers
Study at your own pace with your personal study plan
Focus on practice and real-world experience
Andrei Maftei
It has all the necessary theory, lots of practice, and projects of different levels. I haven't skipped any of the 3000+ coding exercises.