What is regex in Linux

Regular Expressions in Linux

A regular expression (regex) is a sequence of characters that defines a search pattern in text. Regular expressions are used by UNIX text-processing utilities such as awk, sed, and grep. The concept was first formalized by the mathematician Stephen Cole Kleene in the 1950s.

In regex, we have two categories of characters.

  1. Regular (literal) characters: any character that simply stands for itself, such as a letter or a digit.
  2. Metacharacters: either special characters such as $ or ^, or literal characters preceded by a backslash, such as \d or \w. These carry a special meaning: for example, \d matches a single digit, and ^ matches the beginning of a line or string.

Regular expressions are not specific to Linux or to any programming language, and the supported syntax depends on the tool you use. In this tutorial, we will cover regular expressions that can be used in a UNIX environment.

First, let’s get introduced to different categories of Regular Expressions and what they can do with simple examples. Then, we will see how to use them together to match complex patterns.

Most applications support regex search. The examples here use the VSCode search box with regex mode activated (the .* button). Later in this tutorial, you will also see a few examples of using regex with grep.

The most basic regex pattern

A single literal character or a set of characters without using any meta characters is the basic search pattern in regex. For example, when you search for the word “the” in a text file, the regex matches the letter “t” followed by “h” followed by “e” anywhere in that text.
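
A minimal sketch of a literal search with grep. The sample lines are made up for illustration. The pattern the contains no metacharacters, so it matches those three letters anywhere in a line, including inside a longer word such as "another".

```shell
# Hypothetical sample text; each printf argument becomes one input line.
literal_hits=$(printf '%s\n' 'the cat sat' 'another day' 'no match' | grep 'the')

# Prints the two lines that contain the letters t-h-e in sequence:
#   the cat sat
#   another day
printf '%s\n' "$literal_hits"
```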

Single Characters

\d Matches any digit 0-9.
\w Matches any word character: an ASCII letter, a digit, or an underscore.
\W The opposite of \w. Matches any character that is not a word character.
\s Matches any whitespace character.
\S Matches any character that is not whitespace.
\t Matches a tab.
\ Escapes a special character.
. Matches any character except a line break.

For example, use \d to match any single digit.
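
With grep, the same idea can be sketched as follows. POSIX regular expressions do not define \d, so this example uses the bracket expression [0-9] instead; GNU grep accepts \d only in its Perl-compatible mode (-P). The input string is made up, and the -o option (available in GNU and BSD grep) prints each match on its own line.

```shell
# Hypothetical input; -o prints only the matched text, one match per line.
digits=$(printf '%s\n' 'room 4b, floor 2' | grep -o '[0-9]')

# Prints:
#   4
#   2
printf '%s\n' "$digits"
```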

Quantifiers

Quantifiers are metacharacters that match repeated patterns. For example, to find 4-digit numbers, instead of writing \d\d\d\d we can write \d{4} (any digit, four times), which matches any four-digit number such as 2022.

* Matches the preceding character 0 or more times.
+ Matches the preceding character 1 or more times.
{n} Matches the preceding character exactly n times.
{n,m} Matches the preceding character at least n and at most m times.
{n,} Matches the preceding character n or more times.
? Matches the preceding character 0 or 1 time.

For example, \d{4} finds any 4-digit number in the regex.txt file.

Explanation: \d matches a single digit and {4} repeats it four times, so any run of four digits is matched.
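
A small sketch of the same quantifier with grep -E, using [0-9] in place of \d and made-up input lines. Because nothing anchors the pattern, a longer run of digits would also produce a four-digit match.

```shell
# Hypothetical lines; only runs of four digits are printed (-o shows the match itself).
years=$(printf '%s\n' 'born 1999' 'code 42' 'since 2022' | grep -Eo '[0-9]{4}')

# Prints:
#   1999
#   2022
printf '%s\n' "$years"
```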

Alternation

Alternation is the logical OR operator in regex. As in many programming languages, it is represented by the | symbol. It takes two or more regular expressions and matches any one of them.

For example, 1999|2000 finds 1999 or 2000 in regex.txt.

Each side of the OR operator is a separate regex.
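
As a rough illustration with grep -E (the input lines are invented for this example), the alternation keeps every line that contains either year:

```shell
# Hypothetical lines; a line is kept if it matches either side of the | operator.
alt_hits=$(printf '%s\n' 'released 1999' 'updated 2000' 'archived 2010' | grep -E '1999|2000')

# Prints:
#   released 1999
#   updated 2000
printf '%s\n' "$alt_hits"
```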

Character Classes

A character class, or character set, is represented by square brackets [ ] in regex, and it matches any single character inside the square brackets. For example, ^[aeiou] finds all the lines that start with a vowel.

Explanation: The caret ^ anchors the match to the beginning of a line, and [aeiou] means that the next character can be any one of the characters within the character class.
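
A minimal grep sketch of this pattern on made-up lines. The class lists only lowercase vowels, so a line starting with a capital E is not matched here; adding grep's -i option would make the match case-insensitive.

```shell
# Hypothetical lines; ^ anchors the class to the first character of each line.
vowel_lines=$(printf '%s\n' 'apple pie' 'banana split' 'orange juice' 'Egg salad' | grep '^[aeiou]')

# Prints:
#   apple pie
#   orange juice
printf '%s\n' "$vowel_lines"
```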

A dash (-) inside a character class is treated as a special character if it is not the first or last item: it specifies a range. For example, [A-Z] matches any capital letter from A to Z, and [0-9] matches any single digit. If the dash is the first or last item inside the character class, it’s treated as a literal character.

In addition, a caret (^) is a special character inside a character class if it is the first item. It negates the character class in this case. For example, [^A-Z] matches anything BUT the capital letters A to Z.

Moreover, a dot (.) is treated as a literal character inside a character class. Otherwise, a dot matches anything except a line break.

For example, [-.] matches either a dash or a dot.
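
The three behaviours above (a range, a negated class, and a literal dash and dot) can be sketched with grep -o on one made-up string. Setting LC_ALL=C is an assumption of this sketch: it makes ranges such as [A-Z] follow plain ASCII order regardless of the system locale.

```shell
# Use the C locale so that [A-Z] means exactly the ASCII capital letters.
export LC_ALL=C

sample='Ab-3.x'   # hypothetical input string

# Range: only the capital letter matches.
upper=$(printf '%s\n' "$sample" | grep -o '[A-Z]')

# Negated class: everything except capital letters; tr joins the one-per-line matches.
not_upper=$(printf '%s\n' "$sample" | grep -o '[^A-Z]' | tr -d '\n')

# Dash first and a dot: both are literal inside the brackets.
dash_or_dot=$(printf '%s\n' "$sample" | grep -o '[-.]' | tr -d '\n')

printf '%s\n' "$upper" "$not_upper" "$dash_or_dot"
```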

Grouping, Capturing, and Backreferencing

Using parentheses, we can group multiple characters into a single unit. The text matched by each group is stored in memory for later reference. We can refer to the first group with $1, the second group with $2, and so on.

For example, imagine replacing the last two octets of an IP address like changing 192.168.1.2 to 192.168.0.0

When using a group result outside of the regex (for example, in the replacement text of the VSCode search box), we use the dollar sign $. Some tools, such as sed, use a backslash there instead. Referencing a group inside the regular expression itself is done with a backslash (\), like \1 to refer to the first group. This is called backreferencing.
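
One possible way to do the IP replacement described above on the command line is with sed. This is a sketch, not the pattern from the original screenshot: it captures the first two octets in a group and writes them back with \1, which is how sed refers to a group in the replacement text. The -E option (accepted by GNU and BSD sed) enables extended syntax so the parentheses need no escaping.

```shell
# Capture the first two octets, then replace the last two with zeros.
new_ip=$(printf '%s\n' '192.168.1.2' | sed -E 's/^([0-9]+\.[0-9]+)\.[0-9]+\.[0-9]+$/\1.0.0/')

# Prints: 192.168.0.0
printf '%s\n' "$new_ip"
```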

Example: Find duplicate words like “is is” with the pattern (\w+)\s\1

Explanation:

The (\w+) part matches one or more word characters and captures them in a group. Next, \s matches a whitespace character, and finally \1 refers back to the captured group, that is, the first word.
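
A hedged grep version of the duplicate-word search, written in POSIX basic syntax, where groups are \( \) and "one or more" is \{1,\}. The classes [[:alnum:]_] and [[:space:]] stand in for \w and \s, and the input lines are invented. Note that without word-boundary anchors the pattern can also match across neighbouring words, for example the "is is" inside "this is".

```shell
# \(...\) captures a run of word characters; \1 must repeat it after one whitespace character.
dupes=$(printf '%s\n' 'it is is a typo' 'all good here' |
  grep '\([[:alnum:]_]\{1,\}\)[[:space:]]\1')

# Prints: it is is a typo
printf '%s\n' "$dupes"
```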

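To group without capturing, use the (?:x) form. This is good practice when the captured text and backreferences are not going to be used. Note that (?:x) is a Perl-style extension: it works in VSCode and with grep -P, but not in the POSIX syntax used by grep -E or sed.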

Some Examples

Example 1: Find email addresses

I have the following emails in regex.txt. Let’s assume we want to match all email addresses using grep.

emails: john@example.com rick@example.net mark@example.io bob.logan@something.com

The following regex will match these emails in a simple format:

\w+\.?\w+@\w+(?:\.com|\.net|\.io)

Explanation:

The \w+ matches 1 or more word characters.

The \.? matches an optional dot between two words

The second \w+ matches the last name if provided

The @ symbol is just a literal character

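The (?:\.com|\.net|\.io) part is a group (without capturing) that uses the OR operator to match top-level domains.

In this example, I am using this regex with grep. Because grep -E does not support the (?:...) form, the group is written with plain parentheses:

grep -E '\w+\.?\w+@\w+(\.com|\.net|\.io)' regex.txt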

The -E option enables Extended Regular Expressions for grep. Without it, grep uses Basic Regular Expressions, in which characters such as +, ?, |, ( and ) are treated as literal characters by default.
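
The email search can be tried without a file by piping sample lines into grep. This is a sketch under a few assumptions: the addresses are placed one per line, an extra non-matching line is added for contrast, and \w is spelled as the POSIX class [[:alnum:]_] so the pattern does not rely on GNU extensions. The variable names word and email_re are introduced here only for readability.

```shell
# Build the pattern from a reusable "word" fragment (equivalent to \w+).
word='[[:alnum:]_]+'
email_re="${word}\.?${word}@${word}(\.com|\.net|\.io)"

emails=$(printf '%s\n' \
  'john@example.com' 'rick@example.net' 'mark@example.io' \
  'bob.logan@something.com' 'no email here' | grep -E "$email_re")

# Prints the four address lines and skips 'no email here'.
printf '%s\n' "$emails"
```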

Example 2: Match phone numbers

I have the following lines in my file regex.txt

phones: 351-344-5467 (271)-871-0925

To match both formats, the following regex can be used:

\(?\d{3}\)?[-.]\d{3}[-.]\d{4}

Explanation:

The \(? matches an optional opening parenthesis (a literal character as it’s escaped by the backslash).

\d{3} matches the first 3-digit number.

\)? matches an optional closing parenthesis.

[-.] the next character can be either a dash (-) or a dot (.).

\d{3} matches the next 3-digit number.

Again, [-.] the next character can be either a dash (-) or a dot (.).

Finally, \d{4} matches the last 4-digit number.
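
A quick way to check this pattern with grep -E, writing \d as [0-9]. The two phone numbers come from the example above; the third line is a made-up non-match added for contrast.

```shell
# Optional parentheses around the first group, then dash- or dot-separated digit groups.
phones=$(printf '%s\n' '351-344-5467' '(271)-871-0925' '12-34-5678' |
  grep -E '\(?[0-9]{3}\)?[-.][0-9]{3}[-.][0-9]{4}')

# Prints:
#   351-344-5467
#   (271)-871-0925
printf '%s\n' "$phones"
```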

Conclusion

The main purpose of regex is to find patterns in text. Regular expressions are a huge topic, but you don’t need to know everything about them to use them. In this tutorial, we covered the most important use cases. You can use regex with Linux utilities such as grep, awk, sed, and more.

Linux Regular Expression Tutorial: Grep Regex Example

Linux regular expressions are special character sequences that help search data and match complex patterns. Regular expressions are shortened as ‘regexp’ or ‘regex’. They are used in many Linux programs like grep, bash, rename, sed, etc.

Types of Regular expressions

For ease of understanding let us learn the different types of Regex one by one.

Basic Regular expressions

Some of the commonly used commands with Regular expressions are tr, sed, vi and grep. Listed below are some of the basic Regex.

Symbol Descriptions
. matches any single character
^ matches start of string
$ matches end of string
* matches the preceding character zero or more times
\ escapes a special character
() groups regular expressions
? matches the preceding character zero or one time

Run cat sample to see the contents of an existing file.

Search for lines containing the letter ‘a’, for example with grep 'a' sample.

‘^’ matches the start of a string. Let’s search for lines that START with ‘a’, for example with grep '^a' sample.

Only lines that start with the character ‘a’ are shown. Lines that do not have ‘a’ at the start are ignored.
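
The contents of the sample file are not shown here, so the following sketch uses a made-up word list held in a shell variable to contrast the two searches.

```shell
# Hypothetical stand-in for the contents of the 'sample' file.
sample_text=$(printf '%s\n' 'apple' 'bat' 'cherry' 'avocado')

# Lines containing 'a' anywhere.
contains_a=$(printf '%s\n' "$sample_text" | grep 'a')

# Lines that start with 'a'.
starts_a=$(printf '%s\n' "$sample_text" | grep '^a')

printf 'contains a:\n%s\n' "$contains_a"   # apple, bat, avocado
printf 'starts with a:\n%s\n' "$starts_a"  # apple, avocado
```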

Let’s look at another example.

Select only those lines that end with ‘t’ using $, for example with grep 't$' sample.
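
A short sketch of the end-of-line anchor on invented input. The pattern is single-quoted so the shell does not try to interpret the $ sign.

```shell
# Hypothetical lines; $ anchors the 't' to the end of each line.
ends_t=$(printf '%s\n' 'bat' 'table' 'carrot' 'tea' | grep 't$')

# Prints:
#   bat
#   carrot
printf '%s\n' "$ends_t"
```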

Interval Regular expressions

These expressions specify how many times a character occurs in a string. They are:

Expression Description
{n} Matches the preceding character appearing exactly ‘n’ times
{n,m} Matches the preceding character appearing at least ‘n’ times but not more than ‘m’ times
{n,} Matches the preceding character only when it appears ‘n’ times or more

Select all lines that contain the character ‘p’, for example with grep 'p' sample.

We want to check that the character ‘p’ appears 2 times in a row in a string. For this, the pattern is p{2}.

Note: You need to add -E to grep to use these interval expressions with unescaped braces.
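
As an illustration on a made-up word list, the interval p{2} with grep -E keeps only the lines that contain two consecutive p characters:

```shell
# Hypothetical lines; p{2} requires two 'p' characters in a row.
double_p=$(printf '%s\n' 'apple' 'pear' 'pineapple' 'plum' | grep -E 'p{2}')

# Prints:
#   apple
#   pineapple
printf '%s\n' "$double_p"
```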

Extended regular expressions

These regular expressions contain combinations of more than one expression. Some of them are:

Expression Description
\+ Matches one or more occurrences of the previous character (written as + with grep -E)
\? Matches zero or one occurrence of the previous character (written as ? with grep -E)

Search for all lines containing the character ‘t’, for example with grep 't' sample.

Suppose we want to select only the lines where one or more ‘a’ characters come directly before a ‘t’. The pattern for this is a+t with grep -E.
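
A small sketch of the "one or more" quantifier with grep -E on invented lines. Only lines in which at least one a is immediately followed by a t are kept.

```shell
# Hypothetical lines; a+t needs one or more 'a' directly before a 't'.
a_then_t=$(printf '%s\n' 'bat' 'aat' 'tab' 'boot' | grep -E 'a+t')

# Prints:
#   bat
#   aat
printf '%s\n' "$a_then_t"
```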

Brace expansion

The syntax for brace expansion is either a sequence or a comma-separated list of items inside curly braces “{}”. The starting and ending items in a sequence are separated by two periods “..”. Note that brace expansion is a feature of the shell (such as bash), not part of regular expression syntax.

In commands such as echo {aa,bb,cc} or echo {1..5}, the echo command prints the strings generated by the brace expansion.
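
Both forms can be tried as follows. This sketch assumes bash is installed and runs each expansion through bash -c, because a plain POSIX sh is not required to support brace expansion; the item names are arbitrary.

```shell
# Comma-separated list: expands to each item in order.
list=$(bash -c 'echo {aa,bb,cc}')

# Sequence: expands to every value from the first item to the last.
sequence=$(bash -c 'echo {1..5}')

printf '%s\n' "$list"      # aa bb cc
printf '%s\n' "$sequence"  # 1 2 3 4 5
```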

Summary:

  • Regular expressions are sets of characters used to match patterns in strings
  • They are also called ‘regexp’ and ‘regex’
  • It is important to learn regular expressions for writing scripts
  • Some basic regular expressions are: . ^ $ * \
  • Some extended regular expressions are: \+ \?
  • Some interval regular expressions are: {n} {n,m} {n,}
  • Brace expansion is used to generate strings. It helps in creating multiple strings out of one.
