Using match to find substrings in strings with only bash
Although I am almost sure this has been covered, I can't seem to find anything specific to it. As I continue learning bash, I keep finding parts where I am baffled as to why things happen the way they do. Searching and replacing, or just matching sub-strings in strings, is most likely one of the first things you do when writing scripts. But sticking to a single language or set of tools is difficult in bash, as you can solve most problems in multiple ways. I am doing my best to stay as low level as possible with bash, and I have run into a snag that I need someone to explain to me. Doing a sub-string search in bash with match gives me different results depending on the regular expression I use, and I am not sure why.
#!/bin/bash
Stext="Hallo World"
echo `expr "$Stext" : '^\(.[a-z]*\)'` # Hallo
echo `expr "$Stext" : '.*World'` # 11
expr is not bash functionality at all — it’s an external tool that isn’t part of the shell. Consequently, its behavior is not guaranteed to be consistent on a given version of bash when installed on different platforms, beyond the minimal guarantees provided by the POSIX sh standard (guarantees which don’t promise any regex syntax beyond BRE). Also, being external means it’s far slower to execute, requiring a fork() to kick off a subshell and an exec() to replace that shell with an external executable.
In addition to expr being an external tool, you are echoing the results of calling the command in a subshell, making it doubly inefficient. These calls should be unwrapped, e.g. expr "$Stext" : '^\(.[a-z]*\)' (see superuser.com/questions/1352850/… for a thorough explanation).
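For comparison, here is a rough bash-only sketch of the same two checks, assuming a bash version that supports the =~ operator. The variable names first_word and match_len are illustrative and do not come from the question. It mirrors what expr does: when the pattern contains a group, expr prints the text of the group, and when it does not, expr prints the number of characters matched, which is why the two calls print different kinds of results. Note that expr anchors the pattern at the start of the string, so the sketch adds an explicit ^ in both patterns.

```bash
#!/usr/bin/env bash
Stext="Hallo World"

# Pattern with a capture group: the text of the group is the result.
# Keeping the regex in a variable avoids quoting differences between bash versions.
re_word='^(.[a-z]*)'
if [[ $Stext =~ $re_word ]]; then
    first_word=${BASH_REMATCH[1]}
fi
echo "$first_word"    # Hallo

# Pattern without a group: take the length of the whole match,
# which is what expr reports in this case.
re_len='^.*World'
if [[ $Stext =~ $re_len ]]; then
    match_len=${#BASH_REMATCH[0]}
fi
echo "$match_len"     # 11
```

No external process is started here, so this also avoids the fork() and exec() cost described above.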
How do I extract a string using a regex in a shell script?
I want to extract part of a string using a regular expression. For example, how do I extract the domain name from the $name variable?
name='here' domain_name=. # apply some regex on $name
2 Answers
re="http://([^/]+)/"
if [[ $name =~ $re ]]; then echo ${BASH_REMATCH[1]}; fi
Edit — OP asked for explanation of syntax. Regular expression syntax is a large topic which I can’t explain in full here, but I will attempt to explain enough to understand the example.
This is the regular expression stored in a bash variable, re — i.e. what you want your input string to match, and hopefully extract a substring. Breaking it down:
- http:// is just a string — the input string must contain this substring for the regular expression to match
- [] Normally square brackets are used to say "match any character within the brackets". So c[ao]t would match both "cat" and "cot". The ^ character within the [] modifies this to say "match any character except those within the square brackets". So in this case [^/] will match any character apart from "/".
- The square bracket expression will only match one character. Adding a + to the end of it says "match 1 or more of the preceding sub-expression". So [^/]+ matches 1 or more of the set of all characters, excluding "/".
- Putting () parentheses around a subexpression says that you want to save whatever matched that subexpression for later processing. If the language you are using supports this, it will provide some mechanism to retrieve these submatches. For bash, it is the BASH_REMATCH array.
- Finally we do an exact match on "/" to make sure we match all the way to the end of the fully qualified domain name and the following "/".
Next, we have to test the input string against the regular expression to see if it matches. We can use a bash conditional to do that:
if [[ $name =~ $re ]]; then
    echo ${BASH_REMATCH[1]}
fi
In bash, the [[ ]] specify an extended conditional test, and may contain the =~ bash regular expression operator. In this case we test whether the input string $name matches the regular expression $re . If it does match, then due to the construction of the regular expression, we are guaranteed that we will have a submatch (from the parentheses () ), and we can access it using the BASH_REMATCH array:
- Element 0 of this array, ${BASH_REMATCH[0]}, will be the entire string matched by the regular expression, i.e. "http://www.google.com/".
- Subsequent elements of this array will be the results of the submatches. Note that you can have multiple () submatches within a regular expression, and the BASH_REMATCH elements will correspond to these in order. So in this case ${BASH_REMATCH[1]} will contain "www.google.com", which I think is the string you want.
Note that the contents of the BASH_REMATCH array only apply to the last time the regular expression =~ operator was used. So if you go on to do more regular expression matches, you must save the contents you need from this array each time.
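As a small sketch of that last point, the snippet below copies the captures into ordinary variables before running a second match. The input string and the names full_match, domain_name and re2 are made up for illustration; only the URL comes from the explanation above.

```bash
#!/usr/bin/env bash
# Hypothetical input containing the URL used in the explanation.
name='see http://www.google.com/ for details'
re='http://([^/]+)/'

if [[ $name =~ $re ]]; then
    full_match=${BASH_REMATCH[0]}    # the whole matched text
    domain_name=${BASH_REMATCH[1]}   # the first parenthesised group
fi

# A second, unrelated match overwrites BASH_REMATCH ...
re2='^(other)'
[[ "other text" =~ $re2 ]]

# ... but the copies saved earlier are unaffected.
echo "$full_match"
echo "$domain_name"
echo "${BASH_REMATCH[1]}"
```

After the second match, BASH_REMATCH describes only "other text", so anything not copied out beforehand is lost.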
This may seem like a lengthy description, but I have really glossed over several of the intricacies of regular expressions. They can be quite powerful, and I believe with decent performance, but the regular expression syntax is complex. Also regular expression implementations vary, so different languages will support different features and may have subtle differences in syntax. In particular escaping of characters within a regular expression can be a thorny issue, especially when those characters would have an otherwise different meaning in the given language.
Note that instead of setting the $re variable on a separate line and referring to this variable in the condition, you can put the regular expression directly into the condition. However in bash 3.2, the rules were changed regarding whether quotes around such literal regular expressions are required or not. Putting the regular expression in a separate variable is a straightforward way around this, so that the condition works as expected in all bash versions that support the =~ match operator.
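The quoting difference can be seen in a short sketch like the one below, assuming bash 3.2 or later with default compatibility settings. The pattern and the result variable names are illustrative only.

```bash
#!/usr/bin/env bash
name='http://www.google.com/'

# Regex held in a variable and expanded unquoted: treated as a regular expression.
re='^http://.+/$'
if [[ $name =~ $re ]]; then
    regex_result=match
else
    regex_result=nomatch
fi

# Same text quoted directly in the condition: bash 3.2 and later match it
# as a literal string, so ^, .+ and $ lose their special meaning.
if [[ $name =~ '^http://.+/$' ]]; then
    literal_result=match
else
    literal_result=nomatch
fi

echo "$regex_result"
echo "$literal_result"
```

The first test succeeds because the pattern is interpreted as a regex, while the second fails because the input does not contain the literal characters of the pattern.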
Use regexp in bash to obtain substring of string
I'd like to use a regular expression to capture all text before the first instance of -[0-9], so in this case I would get:
I’m going to be porting this to ansible so it must use regexp and not sed or awk or something like that. I’ve used sed to come up with something, but again, I need regexp:
echo $x | rev | cut -d. -f6 | rev | sed -e 's/-[0-9]*$//g'
my-name-is-yes
You are expected to make an effort. Please show your code and state where you are having trouble. Also see Strange and maddening rules and Why is the “how to move the turtle in logo” question closed?.
Apparently you can’t do this in bash. Was said 2 days ago bash doesn’t support lazy quantifiers. I guess you’d have to use a different shell.
A basic regex and sed will do, e.g. sed 's/-[0-9][0-9]*.*$//' (note you need both [0-9][0-9]* as '*' will match zero or more occurrences). You can simply echo "my-name-is-yes-111111.maybe.text.here?-34.34.34" | sed 's/-[0-9][0-9]*.*$//' to obtain your desired results.
@sln, you can do this in bash much more easily with a simple parameter expansion. If the full string is stored in the variable a, then, e.g. echo ${a%%-[0-9]*} is all you need.
4 Answers
You can use shell parameter expansion to solve your stated test case. Here's an example:
# var=my-name-is-yes-111111.maybe.text.here?-34.34.34
# echo ${var%%-[0-9]*}
my-name-is-yes
If you need this in a variable, you can assign the expansion instead, i.e.
var=my-name-is-yes-111111.maybe.text.here?-34.34.34
var2=${var%%-[0-9]*}
echo $var2
my-name-is-yes
You can even overwrite your first value with the expansion value,
var=my-name-is-yes-111111.maybe.text.here?-34.34.34
var=${var%%-[0-9]*}
echo $var
my-name-is-yes
The % and %% parameter expansion operators mean "remove the matching value from the right side of the variable's value": % removes the shortest match, while %% removes the longest match.
There are also the # and ## parameter expansion operators, which perform a similar function but remove the matching value from the left side of the variable's value: # removes the shortest match and ## the longest. IHTH
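A quick way to compare the four operators is to apply them to the same value. This is a minimal sketch using the string from the question; the result variable names are invented for illustration, and the patterns are shell globs, not regular expressions.

```bash
#!/usr/bin/env bash
var='my-name-is-yes-111111.maybe.text.here?-34.34.34'

# -[0-9]* as a glob: a hyphen, one digit, then anything.
shortest_right=${var%-[0-9]*}    # removes the shortest matching suffix
longest_right=${var%%-[0-9]*}    # removes the longest matching suffix

# *. as a glob: anything up to and including a dot.
shortest_left=${var#*.}          # removes the shortest matching prefix
longest_left=${var##*.}          # removes the longest matching prefix

echo "$shortest_right"   # my-name-is-yes-111111.maybe.text.here?
echo "$longest_right"    # my-name-is-yes
echo "$shortest_left"    # maybe.text.here?-34.34.34
echo "$longest_left"     # 34
```

Only the %% form gives the text before the first hyphen-digit pair, because it removes as much as possible from the right.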
getting a substring from a regular expression
Here is the line of text: SRC='999' where 999 can be any three digits. I need a grep command that will return just the 999. How do I do this?
I'm trying this: egrep "SRC=\'([0-9]+)\'" /tmp/test/file which returns: SRC='112' But I just want this: 112
I’m using QShell, running on an iSeries. The regular expression part I’ve figured out, it’s the extracting just part of the line that I need that has me stymied.
8 Answers
Here is how to do it using sed
You can use the -o option on grep to return only the part of the string that matches the regex:
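One possible way to apply -o here, offered as a sketch and not necessarily the command this answer originally showed, is to run grep twice. It assumes a grep that supports -o (GNU and BSD grep do; the option is not required by POSIX, so it may be missing on other platforms). The sample line and the variable name digits are illustrative.

```bash
#!/usr/bin/env bash
line="SRC='112'"

# First pass keeps only the SRC='nnn' part of the line;
# second pass keeps only the digits inside it.
digits=$(printf '%s\n' "$line" | grep -o -E "SRC='[0-9]+'" | grep -o -E '[0-9]+')
echo "$digits"    # 112
```

The two passes are needed because grep -o prints the whole match and has no way to print a single capture group.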
That is cool. Nice way to get versions of installed software: ansible --version | grep -o -E "([0-9].*)*"
Are the lines to match always in the format SRC=’nnn’ ? Then you could use
$ echo SRC='999' | sed '/SRC/s/SRC=//'
999
Platform grep or general regular expression?
This would be faster than running a regular expression, with the tradeoff that it would not handle any change in the format of the data (e.g. two digits instead of three).
You can't do it with plain grep. As the man page on my box states: "grep - print lines matching a pattern". grep only prints lines, not parts of lines.
I would recommend awk since it can do both the pattern matching and sub-line extracting:
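A rough sketch of what such an awk command could look like follows. This is an assumption about the approach, not the answer's original code; the sample line and the variable name value are illustrative.

```bash
#!/usr/bin/env bash
line="SRC='112'"

# /SRC=/ selects the matching lines (the pattern-matching part).
# -F "'" splits each line on single quotes, so the digits land in
# field 2 (the sub-line extraction part).
value=$(printf '%s\n' "$line" | awk -F "'" '/SRC=/ { print $2 }')
echo "$value"    # 112
```

Against the file from the question, the same awk program could be given /tmp/test/file as its argument in place of the pipe.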
depends on your platform / language.
string = "SRC = '999'"
string.match(/([0-9]+)/).to_s.to_i
Site design / logo © 2023 Stack Exchange Inc; user contributions licensed under CC BY-SA . rev 2023.7.14.43533