During our research using Semgrep, we identified an XSS vulnerability in Dolibarr, which has been assigned CVE-2024-34051. This article will guide you in detecting similar vulnerabilities using Semgrep.
Before we dive into the details, let’s first explore what Semgrep is and how it helps identify vulnerabilities.
Semgrep is an open-source static application security testing (SAST) and software composition analysis (SCA) tool designed to help security researchers find code vulnerabilities and assist developers in integrating security checks into the development lifecycle. Released on February 6, 2020, Semgrep’s core logic is written in OCaml, with a Python-based command-line interface (CLI). Some of its key features include:
Installing Semgrep is straightforward. Use Python’s pip on Linux or Windows, or install it via Docker for an easy setup.
If you’re new to Semgrep, the tool offers auto-configuration to help you get started. It automatically applies the best built-in rules based on the detected programming languages or file names.
# Install Semgrep
python3 -m pip install "semgrep"
# Run an initial scan
semgrep scan --config=auto
# Or via Docker
docker pull semgrep/semgrep
docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep scan --config=auto /src/
Semgrep offers several parameters to enhance scan insights:
Use the --dataflow-traces flag to generate detailed output, showing how data flows between variables, function calls, and other code elements that lead to the identified issue.
The --verbose flag displays detailed information about the scanning process, including applied rules and skipped files.
The --debug flag provides additional debug information, along with all verbose details.
docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep --config=auto --dataflow-traces /src/
...
435| print $form->selectcontacts($selectedCompany, '', 'contactid',
0, '', $contactofproject, 0, '', false, 0, 0);
Taint comes from:
429| $selectedCompany = isset($_GET["newcompany"]) ?
$_GET["newcompany"] : $projectstatic->socid;
...
Semgrep’s default configuration skips the /tests, /test, and /vendors folders. You can customize these exclusions using the .semgrepignore file (see Ignore files, folders, and code).
Semgrep allows you to disable the collection of metrics to ensure user privacy. Pass --metrics=off on the command line or set the SEMGREP_SEND_METRICS environment variable to off. This prevents metrics from being sent online during the scan.
# Disabling metrics collection
export SEMGREP_SEND_METRICS=off
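The options discussed so far can also be assembled from a script, for example when Semgrep is wrapped in a larger tool. The following is a minimal Python sketch, not an official API: the helper names build_scan_command and run_scan are hypothetical, it assumes the semgrep executable is on the PATH, and it only uses flags mentioned above (--config, --dataflow-traces, --metrics=off).

```python
import subprocess
from typing import List


def build_scan_command(target: str, config: str = "auto",
                       dataflow_traces: bool = False,
                       metrics_off: bool = False) -> List[str]:
    """Assemble a `semgrep scan` command line from the options described above."""
    cmd = ["semgrep", "scan", f"--config={config}"]
    if dataflow_traces:
        cmd.append("--dataflow-traces")
    if metrics_off:
        cmd.append("--metrics=off")
    cmd.append(target)
    return cmd


def run_scan(target: str, **options) -> int:
    """Run the scan and return Semgrep's exit code (assumes semgrep is installed)."""
    return subprocess.run(build_scan_command(target, **options)).returncode


if __name__ == "__main__":
    # Only prints the command; call run_scan("./") to actually start a scan.
    print(" ".join(build_scan_command("./", config="p/phpcs-security-audit",
                                      metrics_off=True)))
```

Keeping the command construction separate from the subprocess call makes it easy to log or review the exact command before anything is executed.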
The power of Semgrep comes from its rules. Rules define patterns that Semgrep looks for in the code. When it identifies a matching pattern, the code is reported as a finding. This process of scanning and identifying code is known as matching.
Semgrep rules are typically written in YAML syntax, which allows for clear and readable definitions. This article introduces the basics of creating custom rules to improve your code analysis process. However, to deepen your understanding, we encourage you to explore these tutorials from the official docs.
A simple example of a Semgrep rule that detects the use of the eval() function is shown below:
rules:
  - id: no-eval
    message: Avoid using eval()
    severity: WARNING
    # Semgrep requires a "languages" key; php is assumed here
    languages:
      - php
    patterns:
      - pattern: eval(...)
A great starting point for understanding the rules is to read the predefined rules at the Semgrep Registry. It contains a comprehensive collection of community-maintained rules for different languages and frameworks.
For example, to run Semgrep using the phpcs-security-audit ruleset from the Semgrep Registry, you can use the following command:
semgrep --config "p/phpcs-security-audit"
Here, the --config option specifies the ruleset to use, and the p/ prefix indicates that the configuration should be pulled from the Semgrep Registry.
The community is strong and the semgrep-rules repository has many contributors improving the tool. The figure below shows that the number of registry rules has grown continuously since 2021.
To run specific rules from the semgrep-rules repository, use the r prefix in the --config parameter followed by the specific path, such as /java/spring/.
semgrep --config "r/java/spring/"
Another strength of Semgrep is its ability to customize and tune rule definitions. Users can define complex patterns that include data sources, sinks, and sanitizers, simplifying vulnerability detection and management while providing granular control.
For example, the following Semgrep rule unsafe-echo is designed to find potentially insecure sections of PHP code where user input (pattern-sources) is written directly to the echo function (pattern-sinks) without being validated or sanitized (pattern-sanitizers). This behavior leads to vulnerabilities such as cross-site scripting (XSS). A detailed explanation of the rule is given below.
rules:
  - id: unsafe-echo
    message: Detected direct echoing of user input. Consider sanitizing the input before outputting it.
    severity: WARNING
    languages:
      - php
    mode: taint
    pattern-sources:
      - patterns:
          - pattern: |
              $REQ[...]
          - metavariable-regex:
              metavariable: $REQ
              regex: \$_(REQUEST|GET|POST)
    pattern-sinks:
      - patterns:
          - pattern: |
              echo(...);
    pattern-sanitizers:
      - patterns:
          - pattern-either:
              - pattern: |
                  htmlentities(...);
              - pattern: |
                  htmlspecialchars(...);
The following sections give some background to the structure of this rule and an explanation of each attribute.
id: unsafe-echo
This is the unique identifier of the rule, which Semgrep reports with every finding.
message: Detected direct echoing of user input. Consider ....
The message displayed when an unsafe code section is found. In this case, it warns about the direct echoing of user input and suggests checking or sanitizing the input.
severity: WARNING
The severity key specifies how critical the issues detected by a rule are. Here, it is classified as WARNING, meaning it is a potential security issue.
languages: php
This rule is specific to the PHP programming language.
mode: taint
This mode instructs Semgrep to perform a taint analysis, which tracks how data (especially user input) flows through the code to determine if it is used in an unsafe way.
pattern-sources:
Defines where the potentially unsafe data originates from. In this case, it looks for user inputs coming from global PHP arrays like $_REQUEST, $_GET, and $_POST.
metavariable: $REQ
A metavariable that captures any of the superglobal arrays $_REQUEST, $_GET, or $_POST, as constrained by the regular expression in the regex key.
regex: \$_(REQUEST|GET|POST)
A regular expression that identifies the arrays $_REQUEST, $_GET, and $_POST, which contain user inputs.
pattern-sinks:
Defines where the potentially unsafe data is used. In this case, it checks if these data are directly used in the echo function.
pattern: echo(...);
pattern-sanitizers:
Defines which methods or functions are considered “sanitizers” that can prevent unsafe user input from being directly echoed.
pattern-either:
There are two possible ways to sanitize the data:
pattern: htmlentities(...);
Looks for the htmlentities function, which converts characters into HTML entities to prevent potential XSS attacks.
pattern: htmlspecialchars(...);
Looks for the htmlspecialchars function, which converts special characters into HTML entities.
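To make the interplay of sources, sinks, and sanitizers more concrete, here is a toy model in Python. It is not how Semgrep's engine works internally; it only mimics the three roles on a very simplified, hypothetical representation of PHP statements as nested tuples. The names SOURCE_RE, SANITIZERS, SINKS, is_tainted, and is_finding are made up for this illustration, and the assumption that unknown functions pass taint through is a simplification.

```python
import re

# Same regular expression as in the rule's metavariable-regex.
SOURCE_RE = re.compile(r"\$_(REQUEST|GET|POST)")
SANITIZERS = {"htmlentities", "htmlspecialchars"}
SINKS = {"echo"}


def is_tainted(expr) -> bool:
    """Return True if user-controlled data can reach this expression unsanitized."""
    kind = expr[0]
    if kind == "index":            # e.g. $_GET['user_input']
        return SOURCE_RE.match(expr[1]) is not None
    if kind == "call":             # e.g. htmlspecialchars(<arg>)
        _, name, arg = expr
        if name in SANITIZERS:
            return False           # a sanitizer stops the taint
        return is_tainted(arg)     # toy assumption: other functions pass taint through
    return False                   # literals are never tainted


def is_finding(statement) -> bool:
    """A finding is a sink that receives tainted data."""
    sink, expr = statement
    return sink in SINKS and is_tainted(expr)


unsafe = ("echo", ("index", "$_GET", "user_input"))
safe = ("echo", ("call", "htmlspecialchars", ("index", "$_GET", "user_input")))
wrapped = ("echo", ("call", "trim", ("index", "$_POST", "name")))

print(is_finding(unsafe))   # True
print(is_finding(safe))     # False
print(is_finding(wrapped))  # True
```

The third case shows why taint tracking is more useful than a plain text search: the user input is wrapped in another call, yet it still reaches the sink unsanitized.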
To run the unsafe-echo rule, save the rule definition to a file (e.g., echo.yaml) and run the following command:
semgrep scan --config echo.yaml ./
This command scans the current directory (./) using the specified configuration file.
For example, consider the following PHP code:
<?php
echo $_GET['user_input']; // This would trigger the rule
?>
The rule identifies this as a finding because the user input ($_GET['user_input']) is echoed directly without sanitization, resulting in a potential XSS vulnerability.
In contrast, sanitized input, as shown below, will not trigger the rule:
<?php
echo htmlspecialchars($_GET['user_input']); // This would not trigger the rule
?>
Here, the use of htmlspecialchars() ensures that user input is securely processed before being displayed, mitigating the risk of XSS attacks.
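What such a sanitizer does to a payload can be illustrated with Python's html.escape, used here only as a stand-in for PHP's htmlspecialchars() (the two are similar in spirit, but not identical in every detail). The payload string is an invented example.

```python
import html

payload = '<script>alert("xss")</script>'

# html.escape replaces &, <, > and (by default) quotes with HTML entities,
# so a browser renders the payload as text instead of executing it.
escaped = html.escape(payload)

print(escaped)  # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```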
Use the Semgrep Playground to develop and test your rules. The interactive environment allows you to create, test, and refine rules. The figure below shows the results of the echoed-request rule, which evaluates test code directly in the playground:
Check Semgrep’s documentation if you want more information about rule syntax.
Predefined rules often fail to detect vulnerabilities in real-world applications. If you want to find vulnerabilities in real programs, you need to improve these rules and adapt them to the code.
Let’s illustrate this with an example from the open source project Dolibarr (version 19.0.3) which is available on GitHub. The following image shows the results when we apply the default predefined ruleset.
As you can see, we scanned over 3800 PHP files with a single rule and not a single vulnerability was found. Not even one false positive? So what’s wrong? Is our rule wrong, or is Dolibarr’s source code unbreakable and bulletproof?
A closer look at the source code shows that Dolibarr does not read the standard $_GET or $_POST arrays directly.
Instead, Dolibarr uses the GETPOST() function, a custom input-handling method that applies security checks based on its parameters, particularly the $check parameter.
When $check is provided, the function performs various validation checks. However, if it is empty, no security or validation measures are applied. Note that calling the function this way is deprecated, as described in the function documentation (see screenshots below).
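Before writing a rule, it can help to get a rough feeling for how often GETPOST() is called with an empty check argument. The following Python sketch uses a simple regular expression for that. It is only a heuristic for illustration: the name find_unchecked_getpost and the sample lines are hypothetical, and the expression will miss calls that span several lines or contain nested parentheses, which is the kind of case where Semgrep's syntax-aware matching is the better tool.

```python
import re
from typing import Iterable, List, Tuple

# Matches GETPOST('name', '') or GETPOST("name", "", ...):
# the second argument is an empty string literal.
EMPTY_CHECK_RE = re.compile(r"""GETPOST\(\s*[^,()]+,\s*(?:''|"")\s*[,)]""")


def find_unchecked_getpost(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) pairs that call GETPOST with an empty $check."""
    return [(number, line.strip())
            for number, line in enumerate(lines, start=1)
            if EMPTY_CHECK_RE.search(line)]


# Invented sample lines in the style of Dolibarr code.
sample = [
    "$id = GETPOST('id', 'int');",
    "$company = GETPOST('newcompany', '');",
    '$note = GETPOST("note", "", 2);',
]

for number, text in find_unchecked_getpost(sample):
    print(number, text)
```

On the sample, only the second and third lines are reported, because the first call passes 'int' as its check.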
Next, we should check what effects this misuse has.
To do so, we need a custom rule, such as this one:
rules:
  - id: dolibarr-echoed-request
    mode: taint
    message: "`Echo`ing user input risks cross-site scripting vulnerability."
    languages:
      - php
    severity: ERROR
    pattern-sources:
      - pattern: GETPOST(..., '', ... )
      - pattern: GETPOST(..., none, ...)
    pattern-sinks:
      - pattern: echo ...;
      - pattern: print(...);
    pattern-sanitizers:
      - pattern: htmlentities(...)
      - pattern: htmlspecialchars(...)
      - pattern: GETPOST(..., 'alpha')
Let’s run Semgrep again:
Success.
Now we can view the results in Visual Studio Code using the SARIF (Static Analysis Results Interchange Format) integration.
Use the --sarif parameter together with the SARIF viewer extension in Visual Studio Code to efficiently navigate through the identified issues in the code.
Remember to add --dataflow-traces to include the detailed analysis steps, which makes issues easier to diagnose and resolve.
semgrep --config PATH/TO/RULES --dataflow-traces --sarif --output=result.sarif PATH/TO/SRC
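If you prefer to post-process the report in a script rather than in an editor, the SARIF file is plain JSON. The sketch below lists the rule ID, file, and line of each result. It relies only on fields defined by the SARIF 2.1.0 format (runs, results, ruleId, locations); the function name summarize_sarif, the file path, and the embedded sample report are made up for illustration.

```python
import json
from typing import List, Tuple


def summarize_sarif(sarif_text: str) -> List[Tuple[str, str, int]]:
    """Return (rule id, file, start line) for every result in a SARIF document."""
    findings = []
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri", "?")
                line = physical.get("region", {}).get("startLine", 0)
                findings.append((result.get("ruleId", "?"), uri, line))
    return findings


# Hypothetical, heavily shortened report; a real one would be read from result.sarif.
sample = json.dumps({
    "version": "2.1.0",
    "runs": [{"results": [{
        "ruleId": "dolibarr-echoed-request",
        "message": {"text": "Echoing user input risks XSS."},
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": "htdocs/example.php"},
            "region": {"startLine": 435},
        }}],
    }]}],
})

for rule_id, path, line in summarize_sarif(sample):
    print(f"{path}:{line} {rule_id}")  # htdocs/example.php:435 dolibarr-echoed-request
```

For a real scan, the text would come from the generated file, for example by reading result.sarif with open() and passing its contents to the function.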
This section should have made clear the importance of writing your own rules. Semgrep is not an all-powerful tool; from time to time it needs manual tuning to reach its full potential.
Semgrep is more than just a scanning tool; it also provides remediation capabilities through its autofix feature, enabling automatic corrections of detected issues. To use this feature, a rule must define either the fix or fix-regex attribute.
In our case, the Cross-Site Scripting (XSS) vulnerability in Dolibarr can be prevented by passing the alpha parameter to the GETPOST function, ensuring that only alphanumeric characters are accepted. We chose to use the fix-regex functionality because the fix attribute still has some limitations when handling ellipsis metavariables (see Tips and tricks for writing fixes). The code snippet below demonstrates our custom rule with an extended fix-regex implementation:
rules:
  - id: dolibarr-echoed-request
    mode: taint
    message: "`Echo`ing user input risks cross-site scripting vulnerability."
    #
    # [reduced]
    #
    fix-regex:
      regex: "GETPOST\\((.*?),\\s*(?:''|none)\\s*,(.*?)\\)"
      replacement: "GETPOST(\\1, 'alpha', \\2)"
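The effect of the regular expression can be tried out in isolation. The sketch below applies the same regex and replacement with Python's re.sub to an invented line of PHP. We assume here that fix-regex behaves like re.sub, which matches our reading of the Semgrep documentation but should be verified against your Semgrep version.

```python
import re

# Regex and replacement copied from the rule (YAML escaping removed).
FIX_REGEX = r"GETPOST\((.*?),\s*(?:''|none)\s*,(.*?)\)"
REPLACEMENT = r"GETPOST(\1, 'alpha', \2)"

# Invented example line with an empty check argument.
before = "$selectedCompany = GETPOST('newcompany', '', 2);"
after = re.sub(FIX_REGEX, REPLACEMENT, before)

print(after)  # $selectedCompany = GETPOST('newcompany', 'alpha',  2);

# A call with only two arguments is left untouched, because the
# expression expects a comma after the empty check argument.
print(re.sub(FIX_REGEX, REPLACEMENT, "GETPOST('x', '')"))  # GETPOST('x', '')
```

Note the doubled space before the third argument: the second capture group keeps the whitespace that followed the comma. This is harmless in PHP, but it shows that regex-based fixes are worth reviewing before they are committed.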
After running Semgrep with this rule, the tool output will look like this.
A major advantage of Semgrep is its seamless integration into existing development pipelines. To use this feature, it is necessary to register on the Semgrep AppSec Platform, which is available at https://semgrep.dev/login/.
The repository must be available for the AppSec Platform to scan the source code. This is easily done using the Source code manager (SCM) module under Settings -> Source code managers -> Add.
Once a repository is connected and available, a new project can be added using various providers such as GitHub, Jenkins, or GitLab.
The workflow is defined in semgrep.yml files located in the .github/workflows directory of the repository. Within this file, you can specify jobs that are triggered by certain events (e.g. pull requests). This will be automatically applied using the AppSec Platform, as shown below.
In our GitHub action file, the branch needs to be changed from main to develop, as the main branch does not exist in the Dolibarr project.
on:
  workflow_dispatch: {}
  pull_request: {}
  push:
    branches:
      - develop
    paths:
      - .github/workflows/semgrep.yml
  schedule:
    # random HH:MM to avoid a load spike on GitHub Actions at 00:00
    - cron: 10 15 * * *
name: Semgrep
jobs:
  semgrep:
    name: semgrep/ci
    runs-on: ubuntu-20.04
    env:
      SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}
    container:
      image: returntocorp/semgrep
    steps:
      - uses: actions/checkout@v4
      - run: semgrep ci
After the commit, the Semgrep job starts automatically upon detecting the committed semgrep.yml file.
The results of the scan are uploaded to the AppSec Platform once the GitHub Action has completed (the scan phase can take more than one hour). Now the real work of the security engineer can begin, using the triage functionality: each issue can be classified (e.g. as a false positive or as fixed) and annotated with a comment.
The platform supports more providers such as GitLab CI/CD or the popular pre-commit framework.
Since CI integration works much the same way for all of them, we refer to the official documentation: Sample continuous integration (CI) configurations
Semgrep Pro extends the capabilities of Semgrep OSS, offering enhanced analysis across multiple files as well as access to Pro rules and SCA features, free for up to 10 contributors. This professional tier further streamlines the integration of Semgrep into development workflows, providing access to a broader set of security rules optimized for specific threat models and compliance requirements. It supports more efficient scanning of large codebases, deeper integration with CI/CD pipelines, and advanced reporting and dashboards to enhance team collaboration.
Semgrep Pro also enables risk prioritization, offers compliance-specific rules, and delivers faster, dedicated support, making it ideal for larger teams and organizations with strict security requirements.
We hope this has given you a good insight into Semgrep’s capabilities. As we have shown, Semgrep can be especially useful for security engineers and development teams. Simple vulnerabilities can already be found with standard rules. Remember, the Semgrep Registry already contains predefined rules.
As our example in the chapter “Semgrep in Action: Real-World Example with Dolibarr” shows, most of the default rules will not work when dealing with a specific program.
However, the most important thing is to understand how Semgrep rules work so that you can write your own. It is now up to you to write custom rules and unlock Semgrep’s full potential. Remember that you can always use the Semgrep Playground to develop and test your rules.
As we have shown in the chapter “Automation with semgrep”, integrating Semgrep into CI/CD pipelines increases its utility. This can be useful for software developers who want to catch vulnerabilities before they become part of a new release.
Semgrep is releasing its AI assistant in March 2024 [9]. It uses GPT-4 to verify Semgrep findings and make recommendations, all within pull requests. GPT is valuable because it leverages signals beyond the capabilities of Semgrep’s parsing and dataflow engines. It addresses some of the most challenging sources of false positives, such as context that a program analysis engine cannot fully interpret.
In the future, Semgrep’s assistant is expected to help with the following:
Auto-triage findings
Using GPT4
Auto-fix code
When Semgrep Assistant detects a true positive, it provides an autofix recommendation for remediation. To minimize hallucinations, secondary prompts are used to review diffs for potential failure modes.
Writing custom rules
The AI assistant should be able to help you write custom rules specific to your codebase.
It needs one example of “bad code”, one example of “good code”, and a prompt describing what you want the rule to do.
Drive awareness of secure coding principles
The goal in this case is to help developers progressively enhance their knowledge and understanding of secure coding practices over time.
The Semgrep assistant can post comments in pull requests and send Slack notifications. Be aware that enabling the Semgrep assistant on your GitHub project allows Semgrep to access your code. This should be considered carefully, especially for sensitive, security-relevant code.
[1] Semgrep Supported Languages, Semgrep, Available at:
https://semgrep.dev/docs/supported-languages/#language-maturity
[2] Semgrep GitHub Repository, GitHub, Available at:
https://github.com/returntocorp/semgrep
[3] Semgrep Documentation, Semgrep, Available at:
https://semgrep.dev/docs/
[4] Semgrep Tutorials, Semgrep, Available at:
https://semgrep.dev/learn/basics
[5] Semgrep Registry, Semgrep, Available at:
https://semgrep.dev/explore
[6] Semgrep Playground, Semgrep, Available at:
https://semgrep.dev/playground/new
[7] Security scanning with Semgrep in CI, Semgrep, Available at:
https://semgrep.dev/blog/2022/integrating-semgrep-with-ci
[8] pre-commit framework, Anthony Sottile, Available at:
https://pre-commit.com/
[9] Semgrep AI Assistant Announcement, Semgrep, Available at:
https://semgrep.dev/blog/2024/assistant-ga-launch/
[10] Semgrep Assistant, Semgrep, Available at:
https://semgrep.dev/products/semgrep-code/assistant/