Posted by smarttecs-lucas & smarttecs-marko at February 10 2025 > Tutorial

Code Security with Semgrep

Introduction

During our research with Semgrep, we identified a cross-site scripting (XSS) vulnerability in Dolibarr, which has been assigned CVE-2024-34051. This article shows how to detect similar vulnerabilities using Semgrep.

Before we dive into the details, let’s first explore what Semgrep is and how it helps identify vulnerabilities.

Semgrep is an open-source static application security testing (SAST) and software composition analysis (SCA) tool designed to help security researchers find code vulnerabilities and assist developers in integrating security checks into the development lifecycle. Released on February 6, 2020, Semgrep’s core logic is written in OCaml, with a Python-based command-line interface (CLI). Some of its key features include:

Comprehensive language support: Semgrep supports over 30 programming languages, making it adaptable to various development environments [1].
Built-in and custom rules: Predefined rules exist for common security issues and best practices. Additionally, custom rules can be created to meet specific requirements.
CI/CD integration: Semgrep integrates seamlessly with Continuous Integration/Continuous Deployment (CI/CD) pipelines, ensuring security is embedded within the software development process.

Getting Started with Semgrep

Installing Semgrep is straightforward. Use Python’s pip on Linux or Windows, or install it via Docker for an easy setup.

If you’re new to Semgrep, the tool offers auto-configuration to help you get started. It automatically applies the best built-in rules based on the detected programming languages or file names.

# Install Semgrep
python3 -m pip install "semgrep"
# Run an initial scan
semgrep scan --config=auto

# Or via Docker
docker pull semgrep/semgrep
docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep scan --config=auto /src/

Semgrep offers several parameters to enhance scan insights:

Use the --dataflow-traces flag to generate detailed outputs, showing how data flows between variables, function calls, and other code elements that lead to the identified issue.
The --verbose flag displays detailed information about the scanning process, including applied rules and skipped files.
For an even deeper analysis, the --debug flag provides additional debug information, along with all verbose details.

docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep --config=auto --dataflow-traces /src/
...
    435| print $form->selectcontacts($selectedCompany, '', 'contactid',
          0, '', $contactofproject, 0, '', false, 0, 0);
      
    Taint comes from:

    429| $selectedCompany = isset($_GET["newcompany"]) ? 
          $_GET["newcompany"] : $projectstatic->socid;
...

Semgrep’s default configuration skips /tests, /test, and /vendors folders. You can customize these exclusions using the .semgrepignore file (see Ignore files, folders, and code).

Data Privacy and Configuration

Semgrep lets you disable metrics collection to protect user privacy. Pass --metrics=off or set the SEMGREP_SEND_METRICS environment variable to off. This stops Semgrep from sending usage metrics during the scan.

# Disabling metrics collection
export SEMGREP_SEND_METRICS=off

Understanding and Writing Semgrep Rules

The power of Semgrep comes from its rules. Rules define patterns that Semgrep looks for in the code. When it identifies a matching pattern, the code is reported as a finding. This process of scanning and identifying code is known as matching.

Semgrep rules are typically written in YAML syntax, which allows for clear and readable definitions. This article introduces the basics of creating custom rules to improve your code analysis process. However, to deepen your understanding, we encourage you to explore these tutorials from the official docs.

A simple example of a Semgrep rule that detects the use of the eval() function is shown below:

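rules:
  - id: no-eval
    message: Avoid using eval()
    severity: WARNING
    # languages is a required key; php is assumed here to match the rest of this article
    languages:
      - php
    patterns:
      - pattern: eval(...)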

A great starting point for understanding the rules is to read the predefined rules at the Semgrep Registry. It contains a comprehensive collection of community-maintained rules for different languages and frameworks.

For example, to run Semgrep using the phpcs-security-audit ruleset from the Semgrep Registry, you can use the following command:

semgrep --config "p/phpcs-security-audit"

Here, the --config option specifies the ruleset to use, and the p/ prefix indicates that the configuration should be pulled from the Semgrep Registry.

The community is strong and the semgrep-rules repository has many contributors improving the tool. The figure below shows that the number of registry rules has grown continuously since 2021.

To run specific rules from the semgrep-rules repository, use the r prefix in the --config parameter followed by the specific path, such as /java/spring/.

semgrep --config "r/java/spring/"

Rule Definition

Another strength of Semgrep is its ability to customize and tune rule definitions. Users can define complex patterns that include data sources, sinks, and sanitizers, simplifying vulnerability detection and management while providing granular control.

For example, the following Semgrep rule unsafe-echo is designed to find potentially insecure sections of code in the programming language PHP where user input (pattern-sources) is written directly to the echo function (pattern-sinks) without being validated or sanitized (pattern-sanitizers). This behavior leads to vulnerabilities such as cross-site scripting (XSS). The detailed explanation about the rule is given below.

rules:
  - id: unsafe-echo
    message: Detected direct echoing of user input. Consider sanitizing the input before outputting it.
    severity: WARNING
    languages:
      - php
    mode: taint
    pattern-sources:
      - patterns:
        - pattern: |
            $REQ[...]
        - metavariable-regex:
            metavariable: $REQ
            regex: \$_(REQUEST|GET|POST)
    pattern-sinks:
      - patterns:
        - pattern: |
            echo(...);
    pattern-sanitizers:
      - patterns:
        - pattern-either:
          - pattern: |
              htmlentities(...);
          - pattern: |
              htmlspecialchars(...);

The following sections give some background to the structure of this rule and an explanation of each attribute.

Rule information

id: unsafe-echo
This is the unique identifier for the rule. It is used to identify the rule.
message: Detected direct echoing of user input. Consider ....
The message displayed when an unsafe code section is found. In this case, it warns about the direct echoing of user input and suggests checking or sanitizing the input.
severity: WARNING
The severity key specifies how critical the issues detected by the rule are. Here, it is classified as a WARNING, meaning it is a potential security issue.
languages: php
This rule is specific to the PHP programming language.
mode: taint
This mode instructs Semgrep to perform a taint analysis, which tracks how data (especially user input) flows through the code to determine if it is used in an unsafe way.

Sources and Sinks

pattern-sources:
Defines where the potentially unsafe data originates from. In this case, it looks for user inputs coming from global PHP arrays like $_REQUEST, $_GET, and $_POST.
- metavariable: $REQ
  Names the metavariable whose captured value is checked against the regular expression given in regex. Here it captures the array being indexed.
- regex: \$_(REQUEST|GET|POST)
  A regular expression that identifies the arrays $_REQUEST, $_GET, and $_POST, which contain user inputs.
pattern-sinks:
Defines where the potentially unsafe data is used. In this case, it checks if these data are directly used in the echo function.
- pattern: echo(...);
  Looks for code lines where the echo function is used.
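
To see which names the regular expression accepts, the check can be approximated in Python. This is only a sketch: it assumes that metavariable-regex is anchored at the start of the captured value but not at the end (the behaviour of Python's re.match), and the helper name is_request_superglobal is hypothetical.

```python
import re

# The regex from the unsafe-echo rule, as written in the YAML.
SUPERGLOBAL = re.compile(r"\$_(REQUEST|GET|POST)")


def is_request_superglobal(name: str) -> bool:
    """Approximate the metavariable-regex check on $REQ.

    Assumption: matching is anchored at the start only (like re.match),
    so a name with a trailing suffix is still accepted.
    """
    return SUPERGLOBAL.match(name) is not None


for candidate in ["$_GET", "$_POST", "$_REQUEST", "$_COOKIE", "$_GETTER", "$my_GET"]:
    print(candidate, is_request_superglobal(candidate))
```

Under that assumption, $_COOKIE and $my_GET are rejected, while a name such as $_GETTER is accepted because only its prefix has to match. If that is not wanted, the regex can be given an explicit end anchor.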

Sanitizers

pattern-sanitizers:
Defines which methods or functions are considered “sanitizers” that can prevent unsafe user input from being directly echoed.
- pattern-either:
  There are two possible ways to sanitize the data:
  - pattern: htmlentities(...);
    Looks for the htmlentities function, which converts characters into HTML entities to prevent potential XSS attacks.
  - pattern: htmlspecialchars(...);
    Looks for the htmlspecialchars function, which converts special characters into HTML entities.
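
How sources, sinks, and sanitizers interact can be illustrated with a toy taint tracker in Python. This is a deliberately simplified model for straight-line code, not Semgrep's engine; the Assign and Call classes and the function find_tainted_sinks are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

SOURCES = {"$_GET", "$_POST", "$_REQUEST"}         # pattern-sources
SINKS = {"echo"}                                    # pattern-sinks
SANITIZERS = {"htmlentities", "htmlspecialchars"}   # pattern-sanitizers


@dataclass(frozen=True)
class Assign:
    """target = func(arg), or target = arg when func is None."""
    target: str
    arg: str
    func: Optional[str] = None


@dataclass(frozen=True)
class Call:
    """func(arg) used as a statement, e.g. echo $x;"""
    func: str
    arg: str


def find_tainted_sinks(program: List[object]) -> List[int]:
    """Return the indices of statements where tainted data reaches a sink."""
    tainted = set(SOURCES)
    findings = []
    for index, stmt in enumerate(program):
        if isinstance(stmt, Assign):
            # Taint propagates through assignments unless a sanitizer is applied.
            if stmt.arg in tainted and stmt.func not in SANITIZERS:
                tainted.add(stmt.target)
            else:
                tainted.discard(stmt.target)
        elif isinstance(stmt, Call):
            if stmt.func in SINKS and stmt.arg in tainted:
                findings.append(index)
    return findings


program = [
    Assign("$name", "$_GET"),                      # $name = $_GET['name'];
    Assign("$safe", "$name", "htmlspecialchars"),  # $safe = htmlspecialchars($name);
    Call("echo", "$name"),                         # echo $name;  tainted value reaches the sink
    Call("echo", "$safe"),                         # echo $safe;  sanitized, no finding
]
print(find_tainted_sinks(program))
```

Only the third statement (index 2) is reported: the value assigned from $_GET stays tainted until it passes through one of the sanitizers.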

Running this rule

To run the unsafe-echo rule, save the rule definition to a file (e.g., echo.yaml) and run the following command:

semgrep scan --config echo.yaml ./

This command scans the current directory (./) using the specified configuration file.

For example, consider the following PHP code:

<?php
    echo $_GET['user_input']; // This would trigger the rule
?>

The rule identifies this as a finding because the user input ($_GET['user_input']) is echoed directly without sanitization, resulting in a potential XSS vulnerability.

In contrast, sanitized input, as shown below, will not trigger the rule:

<?php
    echo htmlspecialchars($_GET['user_input']); // This would not trigger the rule
?>

Here, the use of htmlspecialchars() ensures that user input is securely processed before being displayed, mitigating the risk of XSS attacks.
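
Why escaping neutralizes the payload can be shown with Python's html.escape, used here as a rough analogue of PHP's htmlspecialchars(). The two functions are not identical (for example, the entity emitted for a single quote and the available flags differ), so treat this as an illustration of the idea rather than of PHP's exact output.

```python
import html

# A typical XSS test payload as it might arrive in a request parameter.
payload = "<script>alert('xss')</script>"

# Convert HTML metacharacters to entities so the browser renders them as text.
escaped = html.escape(payload, quote=True)
print(escaped)
```

After escaping, the string no longer contains raw angle brackets, so a browser displays it as text instead of executing it as a script.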

Use the Semgrep Playground to develop and test your rules. The interactive environment allows you to create, test, and refine rules. The figure below shows the results of the echoed-request rule, which is evaluated against test code directly in the playground:

echoed-request from the Semgrep Playground

Check Semgrep’s documentation if you want more information about rule syntax.

Semgrep in Action: Real-World Example with Dolibarr

Predefined rules often fail to detect vulnerabilities in real-world applications. If you want to find vulnerabilities in real programs, you need to improve these rules and adapt them to the code.

Let’s illustrate this with an example from the open source project Dolibarr (version 19.0.3) which is available on GitHub. The following image shows the results when we apply the default predefined ruleset.

As you can see, we scanned over 3800 PHP files with a single rule and not a single vulnerability was found. Not even one false positive? So what’s wrong? Is our rule wrong, or is Dolibarr’s source code unbreakable and bulletproof?

If we take a closer look at the source code, we see that Dolibarr does not read the standard $_GET or $_POST arrays directly.

Instead, Dolibarr uses the GETPOST() function, a custom input-handling method that applies security checks based on its parameters, particularly the $check parameter.

When $check is provided, the function performs various validation checks. However, if it is empty, no security or validation measures are applied. Note that calling GETPOST() this way is deprecated, as described in the function documentation (see screenshots below).

The next step is to analyze what effect this misuse has.

For that, we need a custom rule, such as this one:

rules:
  - id: dolibarr-echoed-request
    mode: taint
    message: "`Echo`ing user input risks cross-site scripting vulnerability."
    languages:
      - php
    severity: ERROR
    pattern-sources:
      - pattern: GETPOST(..., '', ... )
      - pattern: GETPOST(..., none, ...)
    pattern-sinks:
      - pattern: echo ...;
      - pattern: print(...);
    pattern-sanitizers:
      - pattern: htmlentities(...)
      - pattern: htmlspecialchars(...)
      - pattern: GETPOST(..., 'alpha')

Let’s run Semgrep again:

Success.

Now we can view the results in Visual Studio Code using the SARIF (Static Analysis Results Interchange Format) integration. Use the --sarif parameter together with the SARIF Viewer extension for Visual Studio Code to navigate efficiently through the identified issues in the code.

Remember to add --dataflow-traces so that the output includes the detailed data-flow steps, which helps with diagnosing and resolving each issue.

semgrep --config PATH/TO/RULES --dataflow-traces --sarif --output=result.sarif PATH/TO/SRC

Analyse the results of the custom rule using SARIF Viewer
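
If you prefer a quick textual overview over a viewer, a SARIF file can also be summarized with a few lines of Python. The sketch below relies only on the generic SARIF 2.1.0 layout (runs, results, locations); the embedded sample is a hand-written, hypothetical fragment, not real Semgrep output, and the function name summarise is our own.

```python
import json
from typing import List, Tuple

# Hypothetical, hand-written SARIF fragment for illustration only.
SAMPLE_SARIF = """
{
  "version": "2.1.0",
  "runs": [
    {
      "results": [
        {
          "ruleId": "dolibarr-echoed-request",
          "message": {"text": "Echoing user input risks cross-site scripting."},
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {"uri": "example/page.php"},
                "region": {"startLine": 42}
              }
            }
          ]
        }
      ]
    }
  ]
}
"""


def summarise(sarif_text: str) -> List[Tuple[str, str, int]]:
    """Return (rule id, file, start line) for every result in a SARIF log."""
    log = json.loads(sarif_text)
    findings = []
    for run in log.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri", "?")
                line = physical.get("region", {}).get("startLine", 0)
                findings.append((result.get("ruleId", "?"), uri, line))
    return findings


for rule_id, uri, line in summarise(SAMPLE_SARIF):
    print(f"{uri}:{line}  {rule_id}")
```

To use it on a real scan, read the contents of result.sarif and pass them to summarise() instead of the sample string.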

This section should have made clear why writing your own rules matters. Semgrep is not an all-powerful tool; to use it to its full potential, you will sometimes need to adapt rules by hand.

Automation with Semgrep

Fixing issues automatically

Semgrep is more than just a scanning tool; it also provides remediation capabilities through its autofix feature, enabling automatic corrections of detected issues. To use this feature, a rule must define either the fix or fix-regex attribute.

In our case, the Cross-Site Scripting (XSS) vulnerability in Dolibarr can be prevented by passing the alpha parameter to the GETPOST function, ensuring that only alphanumeric characters are accepted. We chose to use the fix-regex functionality because the fix attribute still has some limitations when handling ellipsis metavariables (see Tips and tricks for writing fixes). The code snippet below demonstrates our custom rule with an extended fix-regex implementation:

rules:
  - id: dolibarr-echoed-request
    mode: taint
    message: "`Echo`ing user input risks cross-site scripting vulnerability."
    #    
    # [reduced]
    #
    fix-regex:
      regex: "GETPOST\\((.*?),\\s*(?:''|none)\\s*,(.*?)\\)"
      replacement: "GETPOST(\\1, 'alpha', \\2)"

After running Semgrep with this rule, the tool output will look like this.
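
Before applying a fix-regex to a whole codebase, it helps to try the expression on a few sample lines. The sketch below does this with Python's re.sub, assuming that fix-regex follows Python regular-expression semantics; the sample PHP lines and the helper apply_fix are hypothetical.

```python
import re

# Same regex and replacement as in the fix-regex block, with the YAML escaping removed.
PATTERN = re.compile(r"GETPOST\((.*?),\s*(?:''|none)\s*,(.*?)\)")
REPLACEMENT = r"GETPOST(\1, 'alpha', \2)"


def apply_fix(line: str) -> str:
    """Rewrite GETPOST calls whose $check argument is '' or none."""
    return PATTERN.sub(REPLACEMENT, line)


samples = [
    "$a = GETPOST('newcompany', '', 1);",  # empty $check: rewritten
    "$b = GETPOST('id', 'int');",          # already has a check: left unchanged
    "$c = GETPOST('x', '');",              # no third argument: not matched by this regex
]
for sample in samples:
    print(apply_fix(sample))
```

Two details show up in this experiment. The second capture group keeps the whitespace that follows the comma, so the rewritten call contains a double space before the third argument. And because the regex requires a comma after the empty $check, a two-argument call with an empty $check is not rewritten.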

CI/CD Integration using AppSec Platform

A major advantage of Semgrep is its seamless integration into existing development pipelines. To use this feature, it is necessary to register on the Semgrep AppSec Platform, which is available at https://semgrep.dev/login/ .

The repository must be available for the AppSec Platform to scan the source code. This is easily done using the Source code manager (SCM) module under Settings -> Source code managers -> Add.

Adding a source code repository to Semgrep’s Source code managers

Once a repository is connected and available, a new project can be added using various providers such as GitHub, Jenkins or GitLab.

GitHub Actions

The workflow is defined in semgrep.yml files located in the .github/workflows directory of the repository. Within this file, you can specify jobs that are triggered by certain events (e.g. pull requests). This will be automatically applied using the AppSec Platform, as shown below.

In our GitHub action file, the branch needs to be changed from main to develop as the main branch does not exist in the Dolibarr project.

on:
  workflow_dispatch: {}
  pull_request: {}
  push:
    branches:
    - develop
    paths:
    - .github/workflows/semgrep.yml
  schedule:
  # random HH:MM to avoid a load spike on GitHub Actions at 00:00
  - cron: 10 15 * * *
name: Semgrep
jobs:
  semgrep:
    name: semgrep/ci
    runs-on: ubuntu-latest
    env:
      SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}
    container:
      image: semgrep/semgrep
    steps:
    - uses: actions/checkout@v4
    - run: semgrep ci

After the commit, the Semgrep job starts automatically upon detecting the committed semgrep.yml file.

Semgrep GitHub action job is in progress

The results of the scan are uploaded to the AppSec Platform once the GitHub Action has completed (the scan phase can take more than one hour). Now the real work of the security engineer can begin, using the triage functionality. Each issue can be classified (e.g. as a false positive or as fixed) and annotated with a comment.

Issue Triage in Semgrep’s AppSec Platform

Other providers

The platform supports further providers, such as GitLab CI/CD and the popular pre-commit framework.

Since CI integration works in much the same way for all of them, we refer to the official documentation: Sample continuous integration (CI) configurations

Semgrep Pro: Expanding Capabilities

Semgrep Pro extends the capabilities of Semgrep OSS, offering enhanced analysis across multiple files as well as access to Pro rules and SCA features, free for up to 10 contributors. This professional tier further streamlines the integration of Semgrep into development workflows, providing access to a broader set of security rules optimized for specific threat models and compliance requirements. It supports more efficient scanning of large codebases, deeper integration with CI/CD pipelines, and advanced reporting and dashboards to enhance team collaboration.

Semgrep Pro also enables risk prioritization, offers compliance-specific rules, and delivers faster, dedicated support, making it ideal for larger teams and organizations with strict security requirements.

Conclusion

We hope this has given you a good insight into Semgrep’s capabilities. As we have shown, Semgrep can be especially useful for security engineers and development teams. Simple vulnerabilities can already be found with standard rules. Remember, the Semgrep Registry already contains predefined rules.

As our example in the chapter “Semgrep in Action: Real-World Example with Dolibarr” showed, most of the default rules will not work when dealing with a specific program.

However, the most important thing is to understand how Semgrep rules work so that you can write your own. It is now up to you to write rules that unlock Semgrep’s potential for your codebase. Remember that you can always use the Semgrep Playground to develop and test your rules.

As we have shown in the chapter “Automation with Semgrep”, integrating Semgrep into CI/CD pipelines increases its utility. This can be useful for software developers who want to prevent vulnerabilities before they become part of the new release.

The missing piece of the AI (and Outlook)

Semgrep released its AI assistant in March 2024 [9]. It uses GPT-4 to check Semgrep findings and make recommendations, all within pull requests. GPT is valuable because it leverages signals beyond the capabilities of Semgrep’s parsing and dataflow engines. It addresses some of the most challenging sources of false positives, such as context that a program analysis engine cannot fully interpret.

In the future, Semgrep’s assistant is expected to help with the following:

Auto-triage findings using GPT-4
Auto-fix code
When Semgrep Assistant detects a true positive, it provides an autofix recommendation for remediation. To minimize hallucinations, secondary prompts are used to review diffs for potential failure modes.
Writing custom rules
The AI assistant should be able to help you write custom rules specific to your codebase. It needs one example of “bad code”, one example of “good code”, and a prompt describing what you want the rule to do.
Drive awareness of secure coding principles
The goal in this case is to help developers progressively enhance their knowledge and understanding of secure coding practices over time.

The Semgrep assistant can leave comments in pull requests and send Slack notifications. Be aware that enabling the Semgrep assistant on your GitHub project allows Semgrep to access your code. This should be considered carefully, especially for sensitive, security-relevant code.

References

[1] Semgrep Supported Languages, Semgrep, Available at: https://semgrep.dev/docs/supported-languages/#language-maturity
[2] Semgrep GitHub Repository, GitHub, Available at: https://github.com/returntocorp/semgrep
[3] Semgrep Documentation, Semgrep, Available at: https://semgrep.dev/docs/
[4] Semgrep Tutorials, Semgrep, Available at: https://semgrep.dev/learn/basics
[5] Semgrep Registry, Semgrep, Available at: https://semgrep.dev/explore
[6] Semgrep Playground, Semgrep, Available at: https://semgrep.dev/playground/new
[7] Security scanning with Semgrep in CI, Semgrep, Available at: https://semgrep.dev/blog/2022/integrating-semgrep-with-ci
[8] pre-commit framework, Anthony Sottile, Available at: https://pre-commit.com/
[9] Semgrep AI Assistant Announcement, Semgrep, Available at: https://semgrep.dev/blog/2024/assistant-ga-launch/
[10] Semgrep Assistant, Semgrep, Available at: https://semgrep.dev/products/semgrep-code/assistant/
