Matching at specific points

If you want to match at the end of the line, make sure a $ is the last character in the regex. This one pulls out all those names ending in a. Slot it into the example above :


if (/a$/) {

And there is a corresponding character, the caret ^ , which in this context matches at the beginning of the string. Yes, the caret also negates a character class like this [^KCZ]arl but in this case it anchors the match to the beginning of the string.



if (/n/i)  {

if (/^n/i) {

The first one is true if the word contains an 'n' anywhere in it. The second specifies that the 'n' must be at the beginning of the string to be matched. Use this anchor where you can, because it makes the whole regex faster, and safer if you know what the first character must be.
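
To see the two anchors side by side, here is a minimal sketch. The list of names is made up for illustration, and grep is used only to keep it short; neither comes from the example above.

```perl
use strict;
use warnings;

# Hypothetical list of names, for illustration only.
my @names = ('Nina', 'anna', 'Karl', 'nathan', 'Petra');

# /a$/  : an 'a' must be the last character.
my @ends_in_a = grep { /a$/ } @names;

# /n/i  : an 'n' or 'N' anywhere in the name.
my @contains_n = grep { /n/i } @names;

# /^n/i : an 'n' or 'N' at the very start of the name.
my @starts_with_n = grep { /^n/i } @names;

print "Ends in a:     @ends_in_a\n";
print "Contains n:    @contains_n\n";
print "Starts with n: @starts_with_n\n";
```

Note that 'anna' contains an 'n' but does not start with one, so only the unanchored test lets it through.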

Negating the regex

If you want to negate the entire regex, change =~ to !~ . (Remember, ! means 'not'.)


if ($_ !~/[KC]arl/) {

Of course, as we are testing $_ this works too:

if (!/[KC]arl/) {


Returning the Match

Now things get interesting. What if we want to pull something out of a string ? So far all we have done is test for truth, that is say yea or nay if a string matches, but not return what we found. Run this:

$_='My email address is <Robert@NetCat.co.uk>.';



/(<robert\@netcat.co.uk>)/i;



print "Found it ! $1\n";

Firstly, note the single quotes when $_ is assigned. If there were double quotes, we'd need \@ instead of @ . Remember, double quotes "" allow variable interpolation, so Perl looks for an array called @NetCat which does not exist.

Secondly, look at the parens around the entire regex. If you use parens, a side effect is that the first match is put into a variable called $1 . We'll get to the main effect later. The second match goes into $2 and so on. Also note that the \@ has been escaped, so perl doesn't think it is an array. Remember \ either escapes a special character, or gives a special meaning. Think of it as Superman's telephone box. Imagine Clark Kent walking around with his magic partner Back Slash.

Notice how we specify in the regex case-insensitivity with /i and the regex returns the case-sensitive string - that is, exactly what it found.

Try the regex without parens. Then try this one:


/<(robert)\@netcat.co.uk>/i;

You can put the parens anywhere. More or less. Now, run this :

$_='My email address is <Robert@NetCat.co.uk>.';



/<(robert)\@(netcat.co.uk)>/i;



print "Found it ! $1 at $2\n";

See, you can have more than one ! Look at the above regex. Looks easy now, don't you think ? What about five minutes ago ? It would have looked like a typing mistake ! Well, there are some hairier regex to come, but you'll have a good barber.

* + -- regexes become line noise

What if we didn't know what the email address was going to be ?


$_='My email address is <webslave@work.com>.';



print "Found it ! :$1:" if /(<.*>)/i;

When you see an if statement like this, read it right to left. The print statement is only executed if code on the right of the expression is true.

We'll discuss this. Firstly, we have the opening parens ( . So everything from ( to ) will be put into $1 if the match is successful. Then the first character of what we are searching for, < . Then we have a dot, or period . . For this regex, we can assume . matches any character at all.

So we are now matching < followed by any character. The * means 0 or more of the previous character. The regex finishes by requiring > .

This is important. Get the basics right and all regex are easy (I read somewhere once). An example best illustrates the point. Slot this regex in instead:


$_='My email address is <webslave@work.com>.';



print "Found it ! :$1:" if /(<*>)/i;



What's happening here ?

The regex starts, logically, at the start of the string. This doesn't mean it starts at 'M', it starts just before 'M'. There is a 'nothing' between the string start and 'M'.

The regex is searching for <* , which is 0 or more < .

The first thing it finds is not < , but the nothing in between the start of the string and the 'M' from 'My email...'. Does this match ?

As the regex is looking for "0 or more" < , we can certainly say that there are 0 < at the start of the string. So the match is, so far, successful. We have dealt with <* .

However, the next item to match is > . Unfortunately, the next item in the string is 'M', from 'My email...'. The match fails at this point. Sure, it matched <* without any problem, but the complete match has to work.

The only two characters that can match successfully at this point are < or > . The 'point' being that <* has been matched successfully, and we need either > to complete the match or more of < to continue the '0 or more' match denoted by * .

'M' is neither of them, so the match fails at this point, having matched <* but not > .

Quick clarification - the regex cannot successfully match < , then skip on ahead through the string until it matches > . The characters in the string between < > also need to match the regex, and they don't in this case.

All is not lost. Regexes are hardy little beasts and don't give up easily. An attempt is made to match the regex wherever possible. The regex system keeps trying the match at every possible place in the string, working towards the end.

Let's look at the match when it reaches the 'm' in 'work.com'.

Again, we have here 0 < . So the match works as before. After success on <* the next character is analysed - it is a > , so the match is successful.

But, be warned. The match may be successful but your job is not done. Assuming the objective was to return the email address within the angle brackets, then that regex is a miserable failure. Watch for traps of this nature when regexing.
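
If you want to watch the trap being sprung, this small sketch prints what was captured and where the match began. It leans on the special array @- , which has not been covered in this tutorial: $-[0] holds the offset at which the last successful match started. The variable names $found and $offset are just for this sketch.

```perl
use strict;
use warnings;

$_ = 'My email address is <webslave@work.com>.';

my ($found, $offset);
if (/(<*>)/) {
    # $-[0] is the offset where the successful match began.
    ($found, $offset) = ($1, $-[0]);
    print "Matched :$found: starting at offset $offset\n";
}

# For comparison, where the closing > actually sits in the string.
print "The > is at offset ", index($_, '>'), "\n";
```

The two offsets agree: all that was matched was the lone > , with zero < in front of it.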

That's * explained. Just to consolidate, a quick look at:


$_='My email address is <webslave@work.com>.';

print "Match 1 worked :$1:" if /(<*)/i;



$_='<My email address is <webslave@work.com>.';

print "Match 2 worked :$1:" if /(<*)/i;



$_='My email address is <webslave@work.com<<<<>.';

print "Match 3 worked :$1:" if /(<*>)/i;



Match 1 is true. It doesn't return anything, but it is true because there are 0 < at the very start of the string.

Match 2 works. After the 0 < at the start of the string, there is 1 < so the regex can match that too.

Match 3 works. After the failing on the first < , it jumps to the second. After that, there are plenty more to match right up until the required ending.

Glad you followed that. Now, pay even closer attention ! Concentrate fully on the task at hand ! This should be straightforward now:


$_='HTML <I>munging</I> time !.';



/<I>(.*)<\/I>/i;



print "Found it ! $1\n";

Pretty much the same as the above, except the parens are moved so we return what's only inside the tags, not including the tags themselves. Also note how / is escaped like so; \/ otherwise Perl thinks that's the end of the regex.

Now, suppose we change $_ to :


$_='HTML <I>munging</I> time is here <I>again</I> !.';



and run it again. Interesting effect, eh ? This is known as Greedy Matching. What happens is that when Perl finds the initial match, that is <I> it jumps right to the end of the string and works back from there to find a match, so the longest string matches. This is fine unless you want the shortest string. And there is a solution:

/<I>(.*?)<\/I>/i;

Just add a question mark and Perl does stingy matching. No nationalistic jokes. I have Dutch and Scottish friends I don't want to offend.
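
Here are the two versions run against the same string, as a quick sketch. The variable names $greedy and $stingy are only for illustration, and the list assignment used to grab the capture is the same trick that appears with $star and $plus in the next section.

```perl
use strict;
use warnings;

$_ = 'HTML <I>munging</I> time is here <I>again</I> !.';

my ($greedy) = /<I>(.*)<\/I>/i;     # .* grabs as much as it can
my ($stingy) = /<I>(.*?)<\/I>/i;    # .*? grabs as little as it can

print "Greedy: $greedy\n";
print "Stingy: $stingy\n";
```

The greedy version runs from the first <I> right through to the last </I>, while the stingy one stops at the first </I> it meets.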


The Difference Between + and *

You know what * means, namely match 0 or more. If you want to match 1 or more, then use + . The difference is important.

$_='The number is 2200 and the day is Monday';



($star)=/([0-9]*)/;



($plus)=/([0-9]+)/;



print "Star is '$star' and Plus is '$plus'\n";

You'll note that $star has no value. The match was successful though. It managed to match 0 or more characters from 0 to 9 at the very start of the string.

The second regex with $plus worked a little better, because we are matching one or more characters from 0 to 9. Therefore, unless one 0 to 9 is found the match will fail. Once a 0-9 is found, the match continues as long as the next character is 0-9, then it stops.

Now we know this, there is another way to remove an email address from within angle brackets:


$_='My email address is <robert@netcat.co.uk> !.';



/<([^>]+)/i;



print "Found it ! $1\n";

This regex matches <. Then the capturing parens start. They have no effect on this regex other than to capture the match. After that, there is a character class, containing one character. As ^ is the first character in the class, it negates the class. That's why we are using a character class with only one character in it, because it can be negated.

So far we have matched < and anything that is not >. The + ensures we match as many characters that are not >'s as we can. For this string that has the same effect as .*? followed by > , but it is more efficient. It may also suit your purposes, as .*? relies on you knowing what you want to match up to, whereas [^>]+ simply continues matching until it finds something that fails its criteria. Just make sure you understand the difference because it is a crucial part of regexery.
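
The difference shows up when the closing > is missing. This is only a sketch; the second input string and the names @texts and @results are invented for the comparison.

```perl
use strict;
use warnings;

# Two hypothetical inputs: one well-formed, one missing its closing > .
my @texts = (
    'My email address is <robert@netcat.co.uk> !.',
    'My email address is <robert@netcat.co.uk !.',
);

my @results;
for my $text (@texts) {
    my ($class) = $text =~ /<([^>]+)/;    # run of non-> characters after <
    my ($lazy)  = $text =~ /<(.*?)>/;     # shortest run that is followed by >
    push @results, [$class, $lazy];

    printf "class: %-24s lazy: %s\n",
        $class, defined $lazy ? $lazy : '(no match)';
}
```

With the > present both return the address. Without it, .*? has nothing to match up to and fails, while [^>]+ carries on to the end of the string.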


Re-using the match -- \1, $1...

Suppose we didn't know what HTML tag we had to match ? It could be B, I, EM or whatever, and we want everything that is in between. Well, HTML container tags like B and EM have end tags which are the same as the start tag, except for the / . So what we could do is capture the name of the start tag, then insist on the same name in the end tag.

Can this be done ? Of course. This is perl, all things are possible. Now, remember the side effect of parens. I promise I'll explain the primary effect at some point. If whatever is in (parens) matches, the result is stored in a variable called $1 . So we can use <(.*?)> which will find us < then as many anythings (the . and * ) up to the next, not last > (the ? forces stingy matching).

The result is stored in $1 because we used parens. Next, we need everything up to the closing tag. That's easy : (.*?) matches everything up until the next character or set of characters. And how exactly do we define where to stop ?

We can use $1 even in the same regex it was found in. However, it is not referred to within a regex as $1 , but \1 .

So we want to match </$1> which in perl code is <\/\1> . The / must be escaped because otherwise it would end the regex, and 1 is escaped so it refers to $1 instead of matching the number 1.

Still here ? This is what it looks like:


$_='HTML <I>munging</I> time is here <I>again</I> !.';

/<(.*?)>(.*?)<\/\1>/i;



print "Found it ! $2\n";

If you want to know how to return all the matches above, read on. But before that:

How to Avoid Making Mountains while Escaping Special Characters

You want to match this; http://language.perl.com/faq/ . That's a real (useful) URL by the way. Hint. To match it, you need to do this:


/http:\/\/language\.perl\.com\/faq\//;

which should make the awful metaphor above clearer, if not funnier. The slash, / , is not normally a metacharacter but as it is being used for the regular expression delimiters, it needs to be escaped. We already know that . is special.

Fortunately for our eyes, Perl allows you to pick your delimiter if you prefix it with 'm' as this example shows. We'll use a #:


m#http://language\.perl\.com/faq/#; 

Which is a huge improvement, as we change / to # . We can go further with readability by quoting everything:

m#\Qhttp://language.perl.com/faq/\E#;

The \Q escapes everything up until \E or the regex delimiter (so we don't really need the \E above). In this case # will not be escaped, as it delimits the regex.
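
A short sketch of why the quoting matters, and of \Q working on an interpolated variable too. The near-miss URL in $lookalike is invented for the test, and $loose and $quoted are just names for this sketch.

```perl
use strict;
use warnings;

my $url = 'http://language.perl.com/faq/';

# A made-up near miss: an X where the first dot should be.
my $lookalike = 'http://languageXperl.com/faq/';

# Unquoted, each . in $url matches any character at all.
my $loose  = ($lookalike =~ m#$url#)     ? 'yes' : 'no';

# With \Q ... \E the dots are literal dots.
my $quoted = ($lookalike =~ m#\Q$url\E#) ? 'yes' : 'no';

print "Unquoted pattern matches the lookalike: $loose\n";
print "Quoted pattern matches the lookalike:   $quoted\n";

# The quotemeta function does the same escaping as \Q .
print quotemeta($url), "\n";
```

This is particularly handy when the text to match arrives in a variable and you can't escape it by hand.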

Someone once posted a question about this to the Perl-Win32-Users mailing list and I was so intrigued about this apparently undocumented trick I spent the next twenty minutes figuring it out by trial and error, and posted a reply. Next day I found lots of messages telling the poster to read the manual because it was clearly documented. <face colour='red' intensity='high'> My excuse was I didn't have the docs to hand....moral of the story - RTFM and RTF FAQs !

Substitution and Yet More Regex Power


Basic changes

Suppose you want to replace bits of a string. For example, 'us' with 'them'.

$_='Us ? The bus usually waits for us, unless the driver forgets us.';



print "$_\n";



s/Us/them/;   # operates on $_, otherwise you need $foo=~s/Us/them/;



print "$_\n";

What happens here is that the string 'Us' is searched for, and when a match is found it is replaced with the right side of the expression, in this case 'them'. Simple.

You'll notice that only one substitution was made. To match globally use /g which runs through the entire string, changing wherever it can. Try:


s/Us/them/g;



which changes nothing more than before. This is because regexes are, by default, case-sensitive, so 'Us' matches only the capitalised word at the start. So:

s/us/them/ig;



would be a better bet. Now, everything is changed. A little too much, but one problem at a time. Everything you have learned about regex so far can be used with s/// , like parens, character classes [ ] , greedy and stingy matching and much more. Deleting things is easy too. Just specify nothing as the replacement, like so s/Us//; .

So we can use some of that knowledge to fix this problem. We need to make sure that a space precedes the 'us'. What about:


s/ us/them/g;



A small improvement. The first 'Us' is now no longer changed, but one problem at a time ! We'll first consider the problem of the regex changing 'usually' and other words with 'us' in them.

What we are looking for is a space, then 'us', then a comma, period or space. We know how to specify one of a number of options - the character class.


s/ us[. ,]/them/g;



Another tiny step. Unfortunately, that step wasn't really in the right direction, more on the slippery slope to Poor Programming Practice. Why ? Because we are limiting ourselves. Suppose someone wrote ' send it to us; when we get it'.

You can't think of all the possible permutations. It is often easier, and safer, to simply state what must not follow the match. In this case, it can be anything except a letter. We can define that as a-z. So we can add that to the regex.


s/ us[^a-z]/ them/g;



The caret ^ negates the character class, and a-z represents every letter from a to z inclusive. A space has been added to the substitution part - as the original space was matched, it should be replaced to maintain readability.


\w

What would be more useful is to use a-zA-Z instead. If we weren't using /i we'd need that. As a-zA-Z is such a common construct, Perl provides an easy shorthand:

s/ us[^\w]/ them/g;



The \w construct actually means 'word' - equivalent to a-zA-Z_0-9 . So we'll use that instead.

To negate any construct, simply capitalise it:


s/ us[\W]/ them/g;



and of course we don't need the negating caret now. In fact, we don't even need the character class !

s/ us\W/ them/g;



So far, so good. Matching the first 'us' is going to be difficult though. Fortunately, there is an easy solution. We've seen Perl's definition of a word - \w . Between each word is a boundary. You can match this with \b .

s/\bus\W/ them/ig;



(that's \b followed by 'us', not 'bus' :-)

Now we require a word boundary before 'us', and /i is back so that the capitalised 'Us' can match. The 'nothing' at the start of the string, just before the 'U', is such a boundary, and there is a space after the first 'Us' for \W to match, so the match is successful. You might notice an extra space has crept in - that's the space we added to the replacement earlier. The match doesn't include a leading space any more - it matches on the word boundary, that is just before the word begins, and a boundary has no width.

Replacing with what was found

Did you notice the final period and the comma are replaced ? They are part of the match - it is the \W that matches them. We can't avoid that. We can however put back that part of the match.

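s/\bus(\W)/them$1/ig;



We start with capturing whatever the \W matches, using parens. Then, we add it to the replacement string. The capture is in $1 , and in the replacement half of s/// that is the name to use. Perl also tolerates \1 there, as a nod to sed users, but \1 properly belongs inside the pattern itself, as we saw earlier.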
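
Pulled together as a complete program, the state of play looks like this. It is a sketch of where we have got to, not the finished article: it uses /i so the leading 'Us' is caught, and it makes no attempt at the capitalisation yet.

```perl
use strict;
use warnings;

$_ = 'Us ? The bus usually waits for us, unless the driver forgets us.';

print "$_\n";

# \b  : a word boundary, so the 'us' inside 'bus' is left alone
# \W  : a non-word character must follow, which rules out 'usually'
# $1  : puts back whatever \W matched (space, comma or period)
s/\bus(\W)/them$1/ig;

print "$_\n";
```

Every standalone 'us' is changed and the punctuation survives, but the sentence now starts with a lower-case 'them'.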

The final problem is of course capitalising the replacement string when appropriate. Which in old versions of the tutorial I left as an exercise to the reader, having run out of motivation. A reader by the name of Paul Trafford duly solved the problem, and I have just inserted his excellent explanation for the elucidation of all concerned:


#         Solution to the us/them problem...

#

#   The program works through the text assigning the 

#   variable $1 to 'U' or 'u' for any words where this 

#   letter is followed by 's' and then by non 'word' 

#   characters.   The latter is assigned to variable $2.

#

#   For each such matching occurrence, $1 is replaced by 

#   the letter that precedes it in the alphabet using 

#   operations 'ord' and 'chr' that return the ASCII value 

#   of a character and the character corresponding to a 

#   given natural number.  After this 'hem' is tacked on 

#   followed by $2, to retain the shape of the original 

#   sentence.  The '/e' switch is used for evaluation.

#

#   NOTES

#   1. This solution will not replace US (short for 

#   United States) with Them or them.

#

#   2. If a 'magical' decrement operator '--' existed for 

#   strings then the solution could be simplified for we 

#   wouldn't need to use the 'chr' and 'ord' operators.


$_='Us ? The bus usually waits for us, unless the driver forgets us.';



print "$_\n";



s/\b([Uu])s(\W)/chr(ord($1)-1).'hem'.$2/eg;   # 'hem' is quoted so it is a string, not a bareword



print "$_\n";

An excellent solution, thanks Paul.

There are several more constructs. We'll take a quick look at \d which means anything that is a digit, that is 0-9 . First we'll use the negated form, \D , which is anything except 0-9 :


print "Enter a number :";

chomp ($input=<STDIN>);   # chomp only removes a trailing newline; chop removes any last character



if ($input=~/\D/) {

        print "Not a number !!!!\n";

} else {

        print 'Your answer is ',$input x 3,"\n";



}

This checks that there are no non-number characters in $input . It's not perfect because it'll choke on decimal points, but it's just an example. Writing your own number-checker is actually quite difficult, but it is an interesting exercise. Try it, and see how accurate yours is.


x

I hope you trusted me and typed the above in exactly as it is shown (or pasted it), because the x is not a mistake, it is a feature. If you were too smart and changed it to a * or something, change it back and see what it does.

Of course, there is another way to do it :


unless ($input=~/\D/) {

        print 'Your answer is ',$input x 3,"\n";

} else {

        print "Not a number !!!!\n";

}

which reverses the logic with an unless statement.

More Matching

Assume we have:

$_='HTML <I>munging</I> time is here <I>again</I> !.';

and we want to find all the italic words. We know that /g will match globally, so surely this will work :

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';



$match=/<i>(.*?)<\/i>/ig;



print "$match\n";

except it returns 1, and there were definitely two matches. In scalar context the match operator returns true or false, not the number of matches. So you can test it for truth with constructs like if, while and unless. Incidentally, the s/// operator does return the number of substitutions.

To return what is matched, you need to supply a list.


($match) = /<i>(.*?)<\/i>/i;

which handily puts the first match into $match . Note that an = is used (for assignment), as opposed to =~ (to point the regex at a variable other than $_ ).

The parens force a list context in this case. There is just the one element in the list, but it is still a list. The entire match will be assigned to the list, or whatever is in the parens. Try adding some parens:


$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';



($word1, $word2) = /<i>(.*?)<\/i>/ig;



print "Word 1 is $word1 and Word 2 is $word2\n";

In the example above notice /g has been added so a global match is done - this means perl carries on matching even after it finds the first match. Of course, you might not know how many matches there will be, so you can just use an array, or any other type of list:

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';



@words = /<i>(.*?)<\/i>/ig;



foreach $word (@words) {

        print "Found $word\n";

}

and @words will be grown to the appropriate size for the matches. You really can supply what you like to be assigned to:

($word1, @words[2..3], $last) = /<i>(.*?)<\/i>/ig;

you'll need more italics for that last one to work. It was only a demonstration.

There is another trick worth knowing. Because a regex returns true each time it matches, we can test that and do something every time it returns true. The ideal function is while which means 'do something as long as the condition I'm testing is true'. In this case, we'll print out the match every time it is true.


$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';



while (/<(.*?)>(.*?)<\/\1>/g) {

        print "Found the HTML tag $1 which has $2 inside\n";

}

So the while operator runs the regex, and if it is true, carries out the statements inside the block.

Try running the program above without the /g . Notice how it loops forever ? That's because the expression always evaluates to true. By using the /g we force the match to move on until it eventually fails.

Now we know this, an easy way to find the number of matches is:


$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';



$found++ while /<i>.*?<\/i>/ig;



print "Found $found matches\n";

You don't need braces in this case as nothing apart from the expression to be evaluated follows the while function.
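
There is also a well-known one-line idiom for the same count, which this tutorial has not covered so far: assign the list of matches to an empty list, then read that assignment in scalar context. A minimal sketch:

```perl
use strict;
use warnings;

$_ = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

# In list context /g returns every match; the empty list () soaks them up,
# and the scalar on the left receives how many there were.
my $found = () = /<i>.*?<\/i>/ig;

print "Found $found matches\n";
```

Whether this reads better than the while loop is a matter of taste; the loop version makes it more obvious what is going on.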

Parentheses Again: OR

The real use for them. Precedence. Try this, and yes you can try it at home:

$_='One word sentences ? Eliminate. Avoid clichés like the plague.  They are old hat.';



while (/o(rd|ne|ld)/gi) {

        print "Matched $1\n";

}

Firstly, notice the subtle introduction of the or operator, in this case | , the pipe. What I really want to explain however, is that this regex matches o followed by rd, ne or ld. Without the parens it would be /ord|ne|ld/ which is definitely not what we want. That matches just plain ord, or ne or ld.
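
To see the two side by side, this sketch collects what each version matches. The sentence is shortened from the one above, the arrays @grouped and @ungrouped are just names for the sketch, and an extra pair of parens captures the whole match so it can be printed.

```perl
use strict;
use warnings;

$_ = 'One word sentences ? They are old hat.';

my (@grouped, @ungrouped);

# o followed by rd, ne or ld
push @grouped,   $1 while /(o(rd|ne|ld))/gi;

# plain ord, or plain ne, or plain ld
push @ungrouped, $1 while /(ord|ne|ld)/gi;

print "With parens:    @grouped\n";
print "Without parens: @ungrouped\n";
```

Without the grouping, the o only belongs to the first alternative, so 'One' and 'old' lose their first letter.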


(?: OR Efficiency)

In the interests of efficiency, consider this:

print "Give me a name :";

chomp($_=<STDIN>);



print "Good name\n" if /Pe(tra|ter|nny)/;

The code above functions correctly. If you were wondering what a good name is, Petra, Peter and Penny qualify. The regex is not as efficient as it could be though. Think about what Perl is doing with the regex, that you are just ignoring. Simply throwing away casually. Without consideration as to the effort that has gone into creating it for you. The resources squandered. The little bytes of memory whose sole function in life is to store this information, which will never be used.

What's happening is that because parens are used, perl is creating $1 for your usage and abusage. While this may not seem important, a fair amount of resources go into creating $1, $2 and so on. Not so much the memory used to store them, more the CPU effort involved. So, if you aren't going to use the parens for capturing purposes, why bother capturing the match?


print "Give me a name :";

chomp($_=<STDIN>);



print "Good name\n" if /Pe(?:tra|ter|nny)/;



print "The match is :$1:\n";

The second print statement demonstrates that nothing is captured this time. You get the benefits of the parens' precedence-changing capabilities, but without the overhead of the capturing. This benefit is especially worthwhile if you are writing CGI programs which use parens in regex -- with CGI, every little bit of efficiency counts.
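
The same comparison without needing to type a name in, as a sketch. The name is hard-coded, and $captured and $grouped_only are invented variable names.

```perl
use strict;
use warnings;

my $name = 'Peter';

my ($captured, $grouped_only);

# Capturing parens: the alternative that matched lands in $1.
if ($name =~ /Pe(tra|ter|nny)/) {
    $captured = $1;
}

# Non-capturing parens: same grouping, but $1 is left undefined.
if ($name =~ /Pe(?:tra|ter|nny)/) {
    $grouped_only = defined $1 ? $1 : 'nothing';
}

print "Capturing parens gave:     $captured\n";
print "Non-capturing parens gave: $grouped_only\n";
```

Both regexes accept exactly the same names; the only difference is whether Perl bothers to remember which alternative it used.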


Matching specific amounts of...

Finally, take a look at this :


$_='I am sleepy....zzzz....DING ! Wake Up!';



if (/(z{5})/) {

        print "Matched $1\n";

} else {

        print "Match failed\n";

}

The braces { } specify how many of the preceding character to match. So z{2} matches exactly two 'z's and so on. Change z{5} to z{4} and see how it works. And there's more...

/z{3}/ 3 z only
/z{3,}/ At least 3 z
/z{1,3}/ 1 to 3 z
/z{4,8}/ 4 to 8 z
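
A sketch that tries each of those against the sleepy string, plus the z{5} that failed earlier. The hash %got and the loop variable $quant are just for this sketch.

```perl
use strict;
use warnings;

my $text = 'I am sleepy....zzzz....DING ! Wake Up!';

my %got;
for my $quant ('{3}', '{3,}', '{1,3}', '{4,8}', '{5}') {
    # The quantifier is interpolated into the regex, giving z{3} and so on.
    my ($match) = $text =~ /(z$quant)/;
    $got{$quant} = $match;

    printf "z%-6s %s\n", $quant,
        defined $match ? "matched '$match'" : 'failed';
}
```

Notice that z{3} is quite happy to match inside a run of four; it means 'three z here', not 'a run of exactly three and no more'.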

To any of the above you may suffix a question mark, the effect of which is demonstrated in the following program. Run it a couple of times, inputting 2, 3 and 4:


print "How many letters do you want to match ? ";

chomp($num=<STDIN>);



# we assign and print in one smooth move

print $_="The lowest form of wit is indeed sarcasm, I don't think.\n";



print "Matched \\w{$num,} : $1 \n"  if /(\w{$num,})/;



print "Matched \\w{$num,}?: $1 \n"  if /(\w{$num,}?)/;

The first match is 'match any word (that's a-zA-Z0-9_) equal to or longer than $num characters, and return it.' So if you enter 4, then 'lowest' is returned. The word 'The' doesn't match.

The second match is exactly the same, but the ? forces a minimal match, so only the part actually matched is returned.

Just to clear this up, amend the program thus:


print "\nMatched \\w{$num,} :";

print "$1 " while /(\w{$num,})/g;



print "\nMatched \\w{$num,}? :";

print "$1 " while /(\w{$num,}?)/g;

Note the addition of /g . Try it without - notice how the match never moves on ?


Pre, Post, and Match

And now on the Regex Programme Today, we have guest stars Prematch, Postmatch and Match. All of whom are going to slow our entire programme down, but are useful anyway :

$_='I am sleepy....snore....DING ! Wake Up!';



/snore/;	# look, no parens !



print "Postmatch: $'\n";

print "Prematch: $`\n";

print "Match: $&\n";



If you are wondering what the difference between match and using parens is, you should remember that you can move the parens around, but you can't vary what $& and its ilk return. Also, using any of the above three variables does slow your entire program, whereas using parens will just slow the particular regex you use them for. However, once you've used one of the three you might as well use them all over the place, as you've paid the speed penalty. Use parens where possible.
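
If your perl is version 5.10 or later, there is a third option this tutorial doesn't otherwise cover: add the /p modifier to a regex and read ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} instead. These are filled in only for regexes that ask for them with /p. A sketch, with $pre, $match and $post as made-up names:

```perl
use strict;
use warnings;

$_ = 'I am sleepy....snore....DING ! Wake Up!';

my ($pre, $match, $post);
if (/snore/p) {    # /p asks perl to keep the three pieces for this match
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});

    print "Prematch:  $pre\n";
    print "Match:     $match\n";
    print "Postmatch: $post\n";
}
```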

RHS Expressions


/e

RHS means Right Hand Side. Suppose we have an HTML file, which contains:

<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>

and we wish to double the size of each font so 2 becomes 4 and 4 becomes 8 etc. What about :

$data="<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>";



print "$data\n";



$data=~s/(size=)(\d)/\1\2 * 2/ig;



print "$data\n";

which doesn't really work out. What this does is match size=x, where x is any digit. The first match, size=, goes into $1 and the second match, whatever the digit is, goes into $2 . The second part of the regex simply prints $1 and $2 (referred to as \1 and \2 ), and attempts to multiply $2 by 2. Remember /i means case insensitive matching.

What we need to do is evaluate the right hand side of the regex as an expression - that is not just print out what it says, but actually evaluate it. That means work it through, not blindly treat it as string. Perl can do this:


$data=~s/(size=)(\d)/$1.($2 * 2)/eig;

A little explanation....the LHS is the same as before. We add /e so Perl evaluates the RHS as an expression. So we need to change \1 into $1 and so on. The parens are there to ensure that $2 * 2 is evaluated, then joined to $1 . And that's it !
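
Both attempts in one sketch, each working on its own copy of the data so they can be compared. The names $literal and $doubled are invented here.

```perl
use strict;
use warnings;

my $data = '<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>';

# Without /e the right hand side is just a string: '* 2' is pasted in as text.
(my $literal = $data) =~ s/(size=)(\d)/$1$2 * 2/ig;

# With /e the right hand side is Perl code: the multiplication really happens.
(my $doubled = $data) =~ s/(size=)(\d)/$1.($2 * 2)/eig;

print "Original:   $data\n";
print "Without /e: $literal\n";
print "With /e:    $doubled\n";
```

Note that the matched text is put back exactly as found, so SIZE stays in capitals even though the regex says size.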


/ee

It is even possible to have more than one /e . For example:

$data='The function is <5funcA>';



$funcA='*2+4';



print "$data\n";



$data=~s/<(\d)(\w+)>/($1+2).${$2}/;	# first time

# $data=~s/<(\d)(\w+)>/($1+2).${$2}/e;	# second time

# $data=~s/<(\d)(\w+)>/($1+2).${$2}/ee;	# third time



print "$data\n";



To properly appreciate this you need to run it three times, each time commenting out a different line. Only one regex line should be uncommented when the program is run.

The first time round the regex is a dumb variable interpolation. Perl just searches the string for any variables, finds $1 and $2, and replaces them.

Second time round the expression is evaluated, as opposed to just plain variable-interpolated. This means that $1+2 is evaluated. $1 has a value of 5, plus 2 == 7. The other part of the replacement, ${$2} is evaluated only so far as working out that the variable named $2 should be placed in the string.

Third time round and Perl now makes a second pass through the string, looking for things to do. After the first pass, and just before that second pass the string looks like this; 7*2+4 . Perl evaluates this, and prints the result.

So the more /e 's you add on the end of the regex, the more passes Perl makes through the replacement string trying to evaluate the code.

This is fairly advanced stuff here, and it is probably not something you will use every day. But knowing it is there is handy.


A Worked Example: Date Change

Imagine you have a list of dates which are in the US format of month, day, year as opposed to the rest of the world's logical notion of day, month, year. We need a regex to transpose the day and month. The dates are:

@dates=(

'01/22/95',

'5/15/87',

'8-13-96',

'5.27.78',

'6/16/1993'

);

The task can be split into steps such as:
  1. Match the first digit, or two digits. Capture this result.
  2. Match the delimiter, which appears to be one of / - .
  3. Match the second two digits, and capture that result
  4. Rebuild the string, but this time reversing the day and month.
That may not be all the steps, but it is certainly enough for a start. Planning regex is important. So, first pass:

@dates=(

'01/22/95',

'5/15/87',

'8-13-96',

'5.27.78',

'6/16/1993'

);



foreach (@dates) {

	print;

	s#(\d\d)/(\d\d)#$2/$1#;

	print " $_\n";

}

Hmm. This hasn't worked for the dates delimited with - . , and the last date hasn't worked either. The first problem is pretty easy; we are just matching / , nothing else. The second problem arises because we are matching two digits. Therefore, 5/15/87 is matched on the 15 and 87, not the 5 and 15. The date 6/16/1993 is matched on the 16 and the 19 of 1993.

We can fix both of those. First, we'll match either 1 or 2 digits. There are a few ways of doing this, such as \d{1,2} which means either 1 or two of the preceding character, or perhaps more easily \d\d? which means match one \d and the other digit is optional, hence the question mark. If we used \d+ then that would match 19988883 which is not a valid date, at least not as far as we are concerned.

Secondly, we'll use a character class for all the possible date delimiters. Here is just the loop with those amendments:


foreach (@dates) {

	print;

	s#(\d\d?)[/-.](\d\d?)#$2/$1#;

	print " $_\n";

}

which fails. Examine the error statement carefully. The key word is 'range'. What range? Well, the range between / and . because - is the range operator within a character class. That means it is a special character, or a metacharacter. And to negate the special meaning of metacharacters we have to use a backslash.

But wait! I don't hear you cry. Surely . is a metacharacter too? It is, but not within a character class so it doesn't need to be escaped.


foreach (@dates) {

	print;

	s#(\d\d?)[/\-.](\d\d?)#$2/$1#;

	print " $_\n";

}

Nearly there. However, we are always replacing the delimiter with / which is messy. That's an easy fix:

foreach (@dates) {

	print;

	s#(\d\d?)([/\-.])(\d\d?)#$3$2$1#;

	print " $_\n";

}

so that fixes that. In case you were wondering, the . dot does not act as '1 of anything' inside a character class. It would defeat the object of the character class if it did. So it doesn't need escaping. There is a further improvement you can make to this regex:

$m='/.-';



foreach (@dates) {

	print;

	s#(\d\d?)([$m])(\d\d?)#$3$2$1#;

	print " $_\n";

}

which is good practice because you are bound to want to change your delimiters at some point, and putting them inside the regex is hardcoding, and we all know that ends in tears. You can also re-use the $m variable elsewhere, which is handy.

Did you notice the difference between what we assign to $m and what we had before?


    /\-.

$m='/.-';

The difference is that the - is no longer escaped. Why not? Logic. Perl knows - is the range operator. Therefore, there must be a character to the immediate left and immediate right of it in order for it to work, for example e-f. When we assign a string to $m, the range operator is the last character and therefore has no character to the right of it, so Perl doesn't interpret it as a range operator. Try this:

$m='/-.';

and watch it fail.
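As a rough sketch of the rule, here are three ways of getting a literal hyphen into a character class without upsetting perl. The variable names and test strings are made up for the illustration.

```perl
# Three safe ways to get a literal hyphen into a character class.
$last    = '/.-';      # hyphen last: nothing to its right, so no range
$first   = '-/.';      # hyphen first: nothing to its left, so no range
$escaped = '/\-.';     # hyphen escaped: safe in any position

foreach $m ($last, $first, $escaped) {
        $count = 0;
        foreach ('8-13', '5.27', '01/22') {
                $count++ if /^\d\d?[$m]\d\d?$/;
        }
        print "[$m] matched $count of 3 test strings\n";
}
```

If you can't be sure where the hyphen will end up, escaping it is the safest of the three.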

Something else that causes heartache is matching what you don't mean to. Try this:


@dates=(

'01/22/95',

'5/15/87',

'8-13-96',

'5.27.78',

'/16/1993',

'8/1/993',

);



$m='/.-';



foreach (@dates) {

	print;

	s#(\d\d?)([$m])(\d\d?)#$3$2$1# or print "Invalid date! ";

	print " $_\n";

}

The two invalid dates at the end are let through. If you wanted to check the validity of every possible date since the start of the modern calendar then you might be better off with a database rather than a regex, but we can do some basic checking. The important point is that we know the limitations of what we are doing.

What we can do is make sure of two things: that there are three sets of digits separated by our chosen delimiters, and that the last set of digits is either two digits, eg 99, 98, 87, or four digits, eg 1999, 1998, 1987.

How can we do this? Extend the match. After the second digit match we need to match the delimiter again, then either two digits or four digits. How about:


$m='/.-';



foreach (@dates) {

	print;

	s#(\d\d?)([$m])(\d\d?)[$m](\d\d|\d{4})#$3$2$1$2# or print "Invalid date! ";

	print " $_\n";

}

which doesn't really work out. The problem is it lets 993 through. This is because \d\d will match on the front of 993. Furthermore, we aren't fixing the year back on to the end result.

The delimiter match is also faulty. We could match / as the first delimiter, and - as the second. So, three problems to fix:


foreach (@dates) {

	print;

	s#(\d\d?)([$m])(\d\d?)\2(\d\d|\d{4})$#$3$2$1$2$4# or print "Invalid!";

	print " $_\n";

}

This is now looking like a serious regex. Changes:
  1. We are re-using the second match, which is the delimiter, further on in the regex. That's what the \2 is. This ensures the second delimiter is the same as the first one, so 5/7-98 gets rejected.
  2. The $ on the end means end of string. Nothing allowed after that. So the regex now has to find either 2 or 4 digits at the end of the string, or it fails.
  3. Added the match of the year ($4) to the rebuild section of the regex.
Regexes can be as complex as you need. The code above can be improved still further. We could reject all four-digit years that don't begin with either 19 or 20. The other problem with the code so far is that it would reject a perfectly valid date such as 02/24/99 if there are any characters after the year. Both can be fixed:

@dates=(

'01/22/95',

'5/15/87',

'8-13-96',

'5.27.78',

'/16/1993',

'8/1/993',

'3/29/1854',

'! 4/23/1972 !',

);



$m='/.-';



foreach (@dates) {

	print;

	s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})(?:$|\D)#$3$2$1$2$4# or print "Invalid!";

	print " $_\n";

}

We have now got a nested OR, and the inner OR is non-capturing for reasons of efficiency and readability. At the end we alternate between letting the regex match either an end of line or any non-digit, symbolised with \D.

We could go on. It is often very difficult to write a regex that matches anything of even minor complexity with absolute certainty. Think about IP addresses for example. What is important is to build the regex carefully, and understand what it can and cannot do. Catching anything supposedly invalid is a good idea too. Test your regex with all sorts of invalid data, and you'll understand what it can do.
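One limitation worth knowing about: the \D in (?:$|\D) actually matches a character, and because that character is not put back in the replacement it vanishes, so '! 4/23/1972 !' loses the space after the year. The sketch below is one possible variation, not the only fix. It swaps in (?!\d), which means 'not followed by a digit' and matches without consuming anything. The subroutine name swap_date is made up for the illustration.

```perl
# A variation on the date loop. (?!\d) checks that no digit follows the
# year, but unlike (?:$|\D) it does not swallow the next character.
$m='/.-';

sub swap_date {
        my $date = shift;       # private copy of the argument
        $date =~ s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})(?!\d)#$3$2$1$2$4#
                or return "Invalid!";
        return $date;
}

foreach ('01/22/95', '8-13-96', '8/1/993', '3/29/1854', '! 4/23/1972 !') {
        print "$_ becomes ", swap_date($_), "\n";
}
```

Wrapping the regex in a subroutine also makes it easy to throw a long list of nasty test dates at it.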

Split and Join


Splitting

While you are in the regex mood, a quick look at split and join . Destruction is always easier (just ask your car mechanic), so let's start with split .

$_='Piper:PA-28:Archer:OO-ROB:Antwerp';



@details=split /:/, $_;



foreach (@details) {

        print "$_\n";

}

Here split is given two arguments. The first one is a regex specifying what to split on. The next is what to split. Actually, I could leave $_ out because as usual it is the default if nothing is specified.

The assignment can either be a scalar variable or a list like an array (or hash, but at this time 'hash' to you means what you think the Dutch do or a silly drinking event spoilt by some running). If it's a scalar variable you get the number of elements the split has splut. Should that be 'the split has splittered' or 'the split has splat'. Hmmm. Probably 'the split has split'. You know what I mean. I think I just generated a Fatal Error in English.dll. Whoops. In any case, splitting to a scalar variable is not always a Good Thing, as we'll see later.

If the assignment is an array, then as you can see in the above example the array is created with the relevant elements in order. You can also assign to a list of scalars, or a mixture of scalars and arrays, for example :


$_='Piper:PA-28:Archer:OO-ROB:Antwerp';



($maker,$model,$name,$reg,$location) = split /:/, $_;

(@aircraft[0..1],$aname,@regdetails) = split /:/, $_;



$number=split /:/ ;             # not bothering with the $_ at the end, as it is the default



print "Using the first 'split'\n";

print "$reg is a $maker $model $name based in $location\n";

print "There are $number details available on this aircraft\n\n";



print "Using the second 'split'\n";

print "You can find $regdetails[0], an $aircraft[1], $regdetails[1]\n";



This demonstrates that a list can be a list of scalar variables (which is basically what an array is anyway), and that you can easily see how many elements the expression can be split into.

The example below adds a third parameter to split, which is the maximum number of elements you want returned. The last element holds whatever is left of the string, unsplit. If you don't want the extra stuff at the end, pop it.


$_='Piper:PA-28:Archer:OO-ROB:Antwerp';



@details=split /:/, $_, 3;



foreach (@details) {

        print "$_\n";

}
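To make the point about the leftovers concrete, here is a minimal sketch that looks at the last element and then pops it. The variable name $leftover is made up for the illustration.

```perl
# With a limit of 3, the third element holds everything that is left,
# delimiters included.
$_='Piper:PA-28:Archer:OO-ROB:Antwerp';

@details=split /:/, $_, 3;

print "Number of elements: ", scalar(@details), "\n";
print "Last element: $details[2]\n";

$leftover=pop @details;         # throw the unsplit remainder away
print "After the pop: @details\n";
```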

In the example below we split on whitespace. Whitespace, in perl terms, is a space, tab, newline, formfeed or carriage return. Instead of writing \t\n\f\r for each of the above, you can simply use \s , or the negated version \S which means anything except whitespace. Think of whitespace as anything you know is there, but you can't see.

The whitespace split is specially optimised for speed. I've used spaces, double spaces, a tab and a newline in the list below. Also note the + , which means one or more of the preceding character, so it will split on any combination of whitespace. The split function does not return the delimiter, so in this case the whitespace will not be returned. And I think the final split is useful to know: splitting on an empty pattern breaks the string into individual characters.


$_='Piper       PA-28  Archer           OO-ROB

Antwerp';



@details=split /\s+/, $_;



foreach (@details) {

        print "$_\n";

}



@chars=split //, $details[0];



foreach $char (@chars) {

        print "$char !\n";

}


A very FAQ

The following question has come up at least three times in the Perl-Win32-Users mailing list. Can you answer it ?

"My data is delimited by |, for example:

name|age|sex|height|

Why doesn't

@array=split /|/, $line;

work ?"

Why indeed. If you don't already know the answer, some simple troubleshooting steps can be applied. First, create a sample program and run it.

$line='name|age|sex|height';



@array=split /|/,$line;



foreach (@array) { print "$_\n" }

The effect is to split each character. The | is returned. As it is the delimiter, | should be ignored, not returned.

At this point you should be thinking 'metacharacter'. A little research (looking at the documentation) will reveal that | is indeed a metacharacter, which means 'or', when inside a regex. So, in effect, the regex /|/ means 'nothing, or nothing'. The split is therefore performed on 'nothings', and there are 'nothings' in between each character. The solution is easy ; /\|/ .


$line='name|age|sex|height';



@array=split /\|/,$line;



foreach (@array) { print "$_\n" }
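If the delimiter lives in a variable you may not know in advance whether it contains metacharacters. One way round that, sketched here with a made-up variable called $delim, is to wrap it in \Q and \E, which escape anything special for you.

```perl
# \Q...\E escapes any metacharacters in the interpolated variable,
# so the | is treated as a plain character.
$line='name|age|sex|height';
$delim='|';

@array=split /\Q$delim\E/, $line;

foreach (@array) { print "$_\n" }
```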

So that's the fun stuff, destruction. Now to put it back together again with join .

What Humpty Dumpty needs : Join


$w1="Mission critical ?";

$w2="Internet ready modems !";

$w3="J(insert your cool phrase here)";	# anything prefixed by 'J' is now cool ;-)

$w4="y2k compatible.";

$w5="We know the Web.";

$w6="...the leading product in an emerging market.";



$cool=join ' ', $w1,$w2,$w3,$w4,$w5,$w6;



print $cool;

Join takes a 'glue' string, which is not a regular expression. It can be a scalar variable however. In this case it is a space. Then it takes a list, which can either be a list of scalar variables, an array or whatever as long as it's a list. And you can see what the result is. You could assign it to an array, but you'd end up with everything in the first element of the array.

The example below adds an array into the list, and demonstrates use of a variable as the delimiter.


$w1="Mission critical ?";

$w2="Internet ready modems !";

$w3="J(insert your cool phrase here)"; 	# anything prefixed by 'J' is now cool ;-)

$w4="y2k approved, tested and safe !";

$w5="We know the Web.";

$w6="...the leading product in an emerging market.";

@morecool=("networkable","compatible");



$sep=" ";



$cool=join $sep, $w1,$w2,$w3,@morecool,$w4,$w5,$w6;



print $cool;
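Split and join make a handy pair when you want to change a delimiter. A minimal sketch, re-using the aircraft string from earlier ($csv is a name made up for the illustration):

```perl
# Take the string apart on colons, then glue it back together with commas.
$_='Piper:PA-28:Archer:OO-ROB:Antwerp';

@details=split /:/;             # $_ is the default thing to split
$csv=join ',', @details;

print "$csv\n";
```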


A recap, but with some new functions


Randomness

Aren't you wishing you could mix and match randomly so you too could get a job marketing vapourware ? Heh.

@cool=(

"networkable directory services",

"legacy systems compatible",

"Mission critical, Business Ready",

"Internet ready modems !",

"J(insert your cool phrase here)",

"y2k approved, tested and safe !",

"We know the Web. Yeah.",

"...the leading product in an emerging market."

);



srand;



print "How many phrases would you like (max ",scalar(@cool),") ?";

while (1) {

        chop ($input=<STDIN>);

        if ($input <= scalar(@cool) and $input > 0) {

                last;

        }

        print 'Sorry, invalid input, try again :';

}



for (1..$input) {

        $index=int(rand @cool);

        print "$cool[$index] ";

        splice @cool, $index, 1;

}

A few things to explain. Firstly, while (1) { . We want an everlasting loop, and this is one way to do it. 1 is always true, so round it goes. We could test $input directly, but that wouldn't allow last to be demonstrated.

Everlasting loops aren't useful unless you are a politician being interviewed. We need to break out at some point. This is done by the last function. When $input is between 1 and the number of elements in @cool then out we go. (You can also break out to labels, in case you were wondering. And break out in a sweat. Don't start now if you weren't.)

The srand operator initialises the random number generator. Works ok for us, but CGI programmers should think of something different because their programs are so frequently run (they hope :-).

rand generates a random number from 0 up to (but not including) 1, or from 0 up to (but not including) the number it is given. In this case that number is the number of elements currently in @cool, so on the first pass the result is anywhere from 0 to just under 8, which int turns into 0 to 7. If we gave it $#cool instead, the last element could never be picked. There is no point generating numbers between 1 and 8 because the array elements run from 0 to 7.

The int function makes sure it is an integer, that is no messy bits after the decimal point.

The splice function removes the printed element from the array so it won't appear again. Don't want to stress the point.

Concatenation

There is another joining operator, this time the humble dot, or period: . . This concatenates (joins) variables:

$x="Hello";

$y=" World";

$z="\n";



print "$x\n";           # print $x and a newline



$prt=$x.$y.$z;          # make a new var $prt out of $x, $y and $z



print $prt;



$x.=$y." again ".$z;    # add stuff to $x



print $x;


Files


Opening

Perl is very good at handling files. Create, in your perl scripts directory c:\scripts, a file called stuff.txt. Copy the following into it :

The Main Perl Newsgroup:comp.lang.perl.misc

The Perl FAQ:http://www.perl.com/faq/

Where to download perl:http://www.activestate.com/

Now, to open and do things with this file. First, we must open the file and assign it to a filehandle. All operations will be done on the file via the filehandle. Earlier, we used <STDIN> as a filehandle - we read from it.

$stuff="c:\scripts\stuff.txt";



open STUFF, $stuff;



while (<STUFF>) {

        print "Line number $. is : $_";

}

What this script does is fail. What it should do is open the file defined in $stuff , assign it to the filehandle STUFF and then, while there are still lines left in the file, print the line number $. and the current line.


An unforgivable error

It fails. That's not so bad, everything fails sometimes. What is unforgivable is NOT CHECKING THE ERROR CODE !

This is a better version:


open STUFF, $stuff or die "Cannot open $stuff for read :$!";

If the open operation fails, the or means that the code on the RHS (right hand side) is evaluated. Perl dies. This means it prints your message, tells you the line number at which it died and exits the script. The reason for the failure is written up in $! . Just because $! contains useful information doesn't mean to say it is automagically printed, in true perl fashion; you have to put it in the message yourself. Usually you will wish to avail yourself of the information inside as it is of great help when working out why something is not going according to plan. The moral of the chapter is:

Always check your return codes !
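A word of warning while we are here. It is tempting to write || instead of or , but it binds much more tightly and the check silently stops working. A small sketch, assuming a made-up filename that does not exist on your system:

```perl
# $missing is a made-up filename that should not exist on your system.
$missing="c:/scripts/no_such_file.txt";

# Wrong: || binds tighter than the comma, so perl sees
#   open STUFF, ($missing || die ...)
# and because $missing is a true value, die is never reached.
open STUFF, $missing || die "You will never see this message";
print "Still running after the first open\n";

# Right: 'or' has very low precedence, so the result of the whole open is tested.
open STUFF, $missing or print "Second open failed: $!\n";
```

If you really want || then put parens around the arguments to open, but or is less typing.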

\\ or / in pathnames -- your choice

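The problem should now be apparent. The backslashes, being escape characters, are not displayed. There are two ways to fix this: double up each backslash, as in "c:\\scripts\\stuff.txt", or use forward slashes instead, as in "c:/scripts/stuff.txt".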

The forward slashes are the preferred option, even under Win32, because you can then port the script direct to Unix or other platforms (assuming you don't use drive letters), and it is less typing. If you wish to use Perl to start external processes then you must use the \\ method, but this variable will be used only in a Perl program, not as a parameter to start an external program. Changing the $stuff variable results in a working script. Always check your return codes !
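If you'd like to see for yourself what perl stores in each case, try this sketch. The variable names are made up, and the single-quoted version is an extra thrown in for comparison.

```perl
# Print what perl actually stores for each way of writing the path.
$broken  = "c:\scripts\stuff.txt";     # \s is not a known escape, so the backslash vanishes
$doubled = "c:\\scripts\\stuff.txt";   # \\ gives one literal backslash
$forward = "c:/scripts/stuff.txt";     # nothing to escape
$single  = 'c:\scripts\stuff.txt';     # single quotes leave \s alone

print "$broken\n$doubled\n$forward\n$single\n";
```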

Reading a file


$stuff="c:/scripts/stuff.txt";



open STUFF, $stuff or die "Cannot open $stuff for read :$!";



while (<STUFF>) {

        print "Line $. is : $_";

}

A little more detail on what is happening here. The file is opened for read. You can append and write too. You don't have to use a variable, but I always do because it is then easy to insert into the or die section, and it is easy to change later on. Hardcoding things is not the best way to write a maintainable and flexible program. Just ask the Year 2000 people about code that lived a little longer than the authors imagined :-).

open STUFF, "c:/scripts/stuff.txt" or die "Cannot open stuff.txt for read :$!";

is just as good but more work if you want to change anything.

The line input operator (that's the angle brackets <>) reads from the beginning of the file up until and including the first newline. The read data goes into $_ , and you can do what you want with it there. On the next iteration of the loop data is read from where the last read left off, up to the next newline. And so on until there is no more data. When that happens the condition is false and the loop terminates. That's the default behaviour, but we can change this.

This means that you can open a 200Mb file in perl and run through it without having to load the entire file into memory. 200Mb of memory is quite a bit. If you really want to load the entire 200Mb file into one variable, Perl lets you. Limits are not the Perl Way.

The special variable $. is the current line number, starting at 1.

As usual, there is a quicker way to do the previous program.


$STUFF="c:/scripts/stuff.txt";



open STUFF or die "Cannot open $STUFF for read :$!";



while (<STUFF>) {

        print "Line $. is : $_";

}

This saves a little bit of typing, but does tie your filehandle to the variable name. In fact, that entire program could be compressed further, but that's for later.

If you are really into shortness, try this:


$STUFF="c:/scripts/stuff.txt";



open STUFF or die "Cannot open $STUFF for read :$!";



print "Line $. is : $_" while (<STUFF>);

        




Writing to a File


A simple write


$out="c:/scripts/out.txt";



open OUT, ">$out" or die "Cannot open $out for write :$!";



for $i (1..10) {

        print OUT "$i : The time is now : ",scalar(localtime),"\n";

}

Note the addition of > to the filename. This opens it for writing. If we want to print to the file we now just specify the filehandle name. You print to the filehandle, which is a gateway to the file.

Filehandles don't have to be capitalised, but it is wise. All Perl functions are lowercase, and Perl is case-sensitive. So if you choose uppercase names they are guaranteed not to conflict with current or future function words.

And a neat way to grab the date sneaked in there too. You should be aware that writing to a file overwrites the file. It does not append data! However, you may append:

Appending


$out="c:/scripts/out.txt";



&printfile;



open OUT, ">>$out" or die "Cannot open $out for append :$!";



print OUT 'The time is now : ',scalar(localtime),"\n";



close OUT;



&printfile;



sub printfile {

        open IN, $out or die "Cannot open $out for read :$!";

        while (<IN>) {

                print;

        }

        close IN;

}

This script demonstrates subroutines again, and how to append to a file, that is write additional data at the end. The close function is introduced here. This, well, closes a filehandle. You don't have to close a filehandle - just leave it open until the script finishes, or the next open command to the same filehandle will close it for you.

@ARGV: Command Line Arguments

Perl has a special array called @ARGV . This is the list of arguments passed along with the script name on the command line. Run the following perl script as:


perl myscript.pl hello world how are you





foreach (@ARGV) {

        print "$_\n";

}

Another useful way to get parameters into a program -- this time without user input. The relevance to filehandles is as follows. Run the following perl script as:

perl myscript.pl stuff.txt out.txt



while (<>) {

        print;

}

Short and sweet ? If you don't specify anything in the angle brackets, whatever is in @ARGV is used instead. And after it finishes with the first file, it will carry on with the next and so on. You'll need to remove non-file elements from @ARGV before you use this.

It can be shorter still:


perl myscript.pl stuff.txt out.txt



print while <>;

Read it right to left. It is possible to shorten it even further !

perl myscript.pl stuff.txt out.txt



print <>;

This takes a little explanation. As you know, many things in Perl, including filehandles, can be evaluated in list or scalar context. The result that is returned depends on the context.

If a filehandle is evaluated in scalar context, it returns the next line of whatever file it is reading from. If it is evaluated in list context, it returns a list, the elements of which are all the remaining lines of the files it is reading from.

The print function is a list operator, and therefore evaluates everything it is given in list context. As the filehandle is evaluated in list context, it is given a list !
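Here is a small sketch of the two contexts at work. To keep it self-contained it reads from a string rather than a file on disk, which assumes perl 5.8 or later; the names DEMO, $text, $next and @rest are made up for the illustration.

```perl
# Opening a reference to a scalar gives a filehandle that reads from
# the string (perl 5.8 or later).
$text="line one\nline two\nline three\n";

open DEMO, '<', \$text or die "Cannot open in-memory file :$!";

$next=<DEMO>;           # scalar context: just the next line
@rest=<DEMO>;           # list context: every remaining line

close DEMO;

print "Scalar context gave: $next";
print "List context gave ", scalar(@rest), " more lines\n";
```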

Who said short is sweet? Not my girlfriend, but that's another story. The shortest scripts are not usually the easiest to understand, and not even always the quickest. Aside from knowing what you want to achieve with the program from a functional point of view, you should also know whether you are coding for maximum performance, easy maintenance or whatever -- because chances are those goals may be to some extent mutually exclusive.

Modifying a File with $^I

One of the most frequent Perl tasks is to open a file, make some changes and write it back to the original filename. You already have enough knowledge to do this. The steps would be:

  1. Make a backup copy of the file
  2. Open the file for read
  3. Open a new temporary file for write
  4. Go through the read file, and write it and any changes to the temp file
  5. When finished, close both files
  6. Delete the original file
  7. Rename the temp file to the original filename
If you have managed to get this far and assiduously worked through the examples, the above will be child's play. Play if you want, but there is a Better Way.
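For comparison, the seven manual steps might look something like the sketch below. The filenames are made up, the script creates its own sample data so that it runs on its own, and copy comes from the standard File::Copy module.

```perl
use File::Copy;         # standard module, supplies copy()

# The filenames are made up for this sketch.
$file="demo_out.txt";
$temp="$file.tmp";

# Create some sample data so the sketch runs on its own.
open OUT, ">$file" or die "Cannot open $file for write :$!";
print OUT "Hello World\nThe time is now\n";
close OUT;

copy($file, "$file.bk") or die "Cannot back up $file :$!";          # step 1
open IN, $file or die "Cannot open $file for read :$!";             # step 2
open TMP, ">$temp" or die "Cannot open $temp for write :$!";        # step 3
while (<IN>) {                                                      # step 4
        s/Hello/Goodbye/;       # the change we want to make
        print TMP;              # write $_ to the temp file
}
close IN;                                                           # step 5
close TMP;
unlink $file or die "Cannot delete $file :$!";                      # step 6
rename $temp, $file or die "Cannot rename $temp to $file :$!";      # step 7
```

It works, but that is a lot of typing for one small change.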

Make sure you have data in c:\scripts\out.txt then run this:


@ARGV="c:/scripts/out.txt";



$^I=".bk";              # let the magic begin



while (<>) {

        tr/A-Z/a-z/;    # another new function sneaked in

        print;          # this goes to the temp filehandle, ARGVOUT, 

			# not STDOUT as usual, so don't mess with it !

}

So, what's happening? First, we load up @ARGV with the name of a file. It doesn't matter how @ARGV is loaded. We could just as well have given the filename on the command line.

The $^I is a special variable. You knew that just by looking at it. Its name is the Inplace Edit variable, and when it has a value the effects are:

  1. The name of the file to be in-place edited is taken from the first element of @ARGV. In this case, that is c:/scripts/out.txt. The file is renamed to its existing name plus the value of $^I, ie out.txt.bk.
  2. The file is read as usual by the diamond operator <>, placing a line at a time into $_.
  3. A new filehandle is opened, called ARGVOUT, and no prizes for guessing it is opened on a file called out.txt. The original out.txt has already been renamed in step 1.
  4. The print prints automatically to ARGVOUT, not STDOUT as it would usually.
At the end of the operation you have neatly edited the file and made a backup. If you don't want a backup, assign a null string to $^I but don't go crying on any mailing lists if you lose data.

The usual method of in-place editing would involve just printing everything back where it came from until your regex finds whatever needs changing. You could of course slurp the whole file into memory and play with it there, which could be a lot easier but if you are dealing with files of more than a few megabytes this is probably not a feasible approach.

Now take a look at out.txt . Notice how all capital letters have been transliterated into lowercase. This is the tr operator at work, which is more efficient than regex for changing single characters. But that's only a small part of the tr function's value to the world. More later.

You should also have an out.txt.bk file. And finally, notice the way @ARGV has been created. You don't have to create it from the command line arguments -- it can be treated like an ordinary array, for that is what it is.


$/ -- Changing what is read into $_

On a different note, what if your input file doesn't look like this:

Beer

Wine

Pizza

Catfood

which is nicely delimited with a newline each time, but like this:

shorts

t-shirt

blouse



pizza

beer

wine

catfood



Viz

Private Eye

The Independent

Byte



toothpaste

soap

towel

which is delimited by TWO newlines, not one. You don't have to save the above as shop.txt, but if you don't, the examples will be difficult to follow.

Now, if you want each set of items as elements in an array you'll have to do something like this:


$SHOP="shop.txt";

$x=0;



open SHOP or die "Can't open $SHOP for read: $!\n";



while (<SHOP>) {

        if (/^\n/) {            # does line begin with newline ?

                $x++;           # if so, increment $x.  Rest of if statement not executed.

        } else {

                $list[$x].=$_;  # glue $_ on the end of whatever is in $list[$x], using a .

        }               

}



foreach (@list) {

        print "Items are:\n$_\n\n";

}

which works, but there is a much easier way to do it. You knew I was going to say that.

$SHOP="shop.txt";

$/="\n\n";



open SHOP or die "Can't open $SHOP for read: $!\n";



while (<SHOP>) {

        push (@list, $_);

}



foreach (@list) {

        print "Items are:\n$_\n\n";

}

The $/ variable is a special variable (it even looks special). It is the Input Record Separator. Remember the operation of the angle brackets being to read a file in up until the next newline ? Time to come clean. What the angle brackets actually do is read up until whatever $/ is set to. It is set to a newline by default.

So if we set it to two newlines, as above, then it reads up until it finds two consecutive newlines, then puts the data into $_ . This makes the program a lot shorter and quicker. You can set $/ to just about anything, not just a newline. If you want to hack this list for example:

Tea:Beer:Wine:Pizza:Catfood:Coffee:Chicken:Salmon:Icecream
you could just leave $/ as a newline and slurp it into memory in one go, but imagine the above items are a list of clothes that your girlfriend wants to buy or a list of clothes your boyfriend should have thrown away by now. Either are going to be really big files, and you might not want to read it all into memory in one go. So set $/=":"; and all will be well. There are also read and seek functions, but they aren't covered here. Those are useful for files where you read in a precise number of bytes.
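A minimal sketch of reading that list one item at a time. To keep it self-contained it reads from a string rather than a file (which assumes perl 5.8 or later), and the names LIST, $data and @items are made up. The local keyword simply means $/ goes back to its old value at the end of the block.

```perl
# Reading from a string keeps the sketch self-contained (perl 5.8 or later).
$data='Tea:Beer:Wine:Pizza:Catfood:Coffee:Chicken:Salmon:Icecream';

open LIST, '<', \$data or die "Cannot open in-memory file :$!";

{
        local $/=":";           # change $/ only inside this block
        while (<LIST>) {
                chomp;          # chomp removes whatever $/ is set to
                print "Item $. is $_\n";
                push @items, $_;
        }
}

close LIST;
```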

We'll go back to the last example for a moment. It is useful to know how to read just one line (well, up to $/ ) at a time:


$SHOP="shop.txt";

$/="\n\n";



open SHOP or die "Can't open $SHOP for read: $!\n";



$clothes=<SHOP>;        # everything up until the first occurrence of $/ into $clothes



$food=<SHOP>;   # everything from first occurrence of $/ to the second into $food



print "We need...\n",$clothes,"...and\n",$food;

And now we know that, there is an even quicker way to achieve the aim of the original program :



$SHOP="shop.txt";

$/="\n\n";



open SHOP or die "Can't open $SHOP for read: $!\n";



@list=<SHOP>;   # dumps *all* of $SHOP into @list, not just one line.



foreach (@list) {

        print "Items are:\n$_\n\n";

}

and you don't need to grab it all :

@list[0..2]=<SHOP>;

We haven't mentioned list context for a while. Whether the line input operator <> returns a single value or a list depends on the context you use it in. When you supply @xxxxx then this must be a list. If you supply $xxxxx then that's a scalar variable. You can force it into list context by using parens.

The two lines below are provided so you can paste them into the above program. They demonstrate how parens force list context. Remember to replace the foreach with something that prints the variables.


($first, $second) = <SHOP>;

$first,  $second  = <SHOP>;


HERE Docs

The problem:

print "This is a long line of text which might be too long to fit on just one line\n";

print "and I was right, it was too long to fit on one line.  In fact, it looks like it\n";

print "might very well take up to FOUR, yes FOUR lines to print.  That's four print\n";

print "statements, which takes up even more room.  But wait! I'm wrong!  It will take\n";

print "FIVE lines to print this statement!  Or is that six lines? I'm not sure....\n";

The solution:

$var='variable interpolated';



print <<PRT;

This is a long line of text which might be too long to fit on just one line

and I was right, it was too long to fit on one line.  In fact, it looks like

it might very well take up to FOUR, yes FOUR lines to print.  



That's four print statements, which takes up even more room.  But wait! I'm 

wrong!  It will take FIVE lines to print this statement!  Or maybe six lines? 

I'm not sure....but anyway, just to prove this can be $var.

PRT

That's called a 'here' document and you don't need to use PRT, you can use whatever you like within reason. You don't need to put in explicit newlines, although if you do they perform as usual. Now you know about here docs you can stop wearing the print function out by calling it every couple of lines. Here docs aren't just for the screen either: use them when printing to files, or anywhere else you'd normally put more than one print statement.

Reading Directories


Globbing

For this exercise, I suggest creating another directory where you have at least two text files and two or more binary files. Copy a couple of .dll files from your WINDIR directory if you need to, those will do for the binaries, and save a couple of random text files. Size doesn't matter, in this case.

Then run this, giving the directory as the command line argument:


$dir=shift;	# shifts @ARGV, the command line arguments after the script name



chdir $dir or die "Can't chdir to $dir:$!\n" if $dir;



while (<*>) {

	print "Found a file: $_\n" if -T;

}

The chdir function changes perl's working directory. You should, as ever, test to see if it worked or not. In this case we only try and change directory if $dir is true.

The <*> construct returns the name of each file in the current directory in turn, and we print it if it passes the file test -T , which returns true if the file is a non-binary, ie text file. You can be more specific:


$dir =shift;

$type='txt';



chdir $dir or die "Can't chdir to $dir:$!\n" if $dir;



while (<*.$type>) {

	print "Found a file: $_\n";

}

like so. But, there is a better way to read from directories. The method above is rather slow and inflexible.

readdir : How to read from directories

Instead, there is readdir . Another version of the previous example:

$dir= shift || '.';



opendir DIR, $dir or die "Can't open directory $dir: $!\n";



while ($file= readdir DIR) {

	print "Found a file: $file\n";

}

The first difference is the first line, which essentially says if shift is false, then $dir = ., which is of course the current directory. Then, the directory is opened and we have the chance to trap the error. It is assigned a directory handle, which looks just like a filehandle. The readdir function reads each filename into $file. There is no while (<DIR>) { construct.

We can also apply the text file test. Run this, once without entering a directory and the second time with entering a directory path other than the one the script is in:


$dir= shift || '.';



opendir DIR, $dir or die "Can't open directory $dir: $!\n";



while ($file= readdir DIR) {

	print "Found a file: $file\n" if -T $file ;

}

Firstly, because the filename is now not in $_ we have to explicitly apply the -T test to it with -T $file.

Why did this not work the second time? Look at the code carefully. You are testing $file. If perl doesn't get a fully qualified pathname, it assumes you are still in the directory the script was run from, or that of the last successful chdir . Not necessarily where you are readdir'ing from. So, to fix it:




        print "Found a file: $dir/$file\n" if -T "$dir/$file" ;



where we now specify the pathname, both in the printout and in the file test itself. The "" are used because otherwise perl tries to divide $dir by $file.
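Put together, the whole corrected script might look like the sketch below. The $count variable and the closedir at the end are additions made for the illustration.

```perl
$dir= shift || '.';

opendir DIR, $dir or die "Can't open directory $dir: $!\n";

$count=0;

while ($file= readdir DIR) {
        next unless -T "$dir/$file";    # test the full path, not just the name
        print "Found a text file: $dir/$file\n";
        $count++;
}

closedir DIR;

print "$count text file(s) in $dir\n";
```

This time it should work whichever directory you point it at.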

Try running this on a directory with only a few files in it:


$dir= shift || '.';



opendir DIR, $dir or die "Can't open directory $dir: $!\n";



while ($file= readdir DIR) {

	print "Found a file: '$file'\n";

}

Notice that two files are found which have interesting names, namely . and .. . These two entries are the current and parent directories respectively. Nothing new, they have always been there -- run the DOS command dir if you don't believe me. You don't usually want to know about them, so:

while ($file= readdir DIR) {

	next if $file=~/^\./;

	print "Found a file: '$file'\n";

}

is the usual workaround. You can use list context to dump everything to a list of some description:

$dir= shift || '.';



opendir DIR, $dir or die "Can't open directory $dir: $!\n";



@files=readdir(DIR);



print "@files";

but that includes the . files, so it is best to ensure they aren't included:

@files=grep !/^\./, readdir(DIR);

We haven't met grep yet, but for the moment just remember it works through a list and, for each element where the test returns true, lets that element pass. In this case, if the name doesn't begin with . then that's true so it goes into @files.
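You can try the filter without going near a directory. A small sketch, using a list of invented filenames:

```perl
# The filenames are invented for this illustration.
@all=('.', '..', '.hidden', 'stuff.txt', 'out.txt', 'perl.dll');

@visible=grep !/^\./, @all;     # keep names that do NOT begin with a dot
@text   =grep /\.txt$/, @all;   # keep names that end in .txt

print "Visible: @visible\n";
print "Text files: @text\n";
```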

There are other commands associated with reading directories, which tell you where in a directory you are, and then where to go to return. You should be aware of their existence, because you never know when you might need them. The one other command of use is closedir , which closes a directory. Optional, but recommended for clarity.

Associative Arrays


The Basics

Very, very useful. First, a quick recap on arrays. Arrays are an ordered list of scalar variables, which you access by their index number starting at 0. The elements in arrays always stay in the same order.

Hashes, which is the shorter name for associative arrays, are also a list of scalars, but instead of being accessed by index number, they are accessed by a key. The tables below illustrate the point:

@myarray
Index No.   Value
0           The Netherlands
1           Belgium
2           Germany
3           Monaco
4           Spain

%myhash
Key         Value
NL          The Netherlands
BE          Belgium
DE          Germany
MC          Monaco
ES          Spain

So if we want 'Belgium' from @myarray and also from %myhash , it'll be:


print "$myarray[1]";

print "$myhash{'BE'}";

Notice that the $ prefix is used, because it is a scalar variable. Despite the fact it is part of a list, it is still a scalar variable. The hash syntax is simply to use braces { } instead of square brackets.
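The two print statements need something to print, so here is a complete sketch that builds both variables from the tables above first:

```perl
@myarray = ('The Netherlands', 'Belgium', 'Germany', 'Monaco', 'Spain');
%myhash  = ('NL', 'The Netherlands', 'BE', 'Belgium', 'DE', 'Germany',
            'MC', 'Monaco', 'ES', 'Spain');

print "$myarray[1]\n";      # second element, because counting starts at 0
print "$myhash{'BE'}\n";    # the value stored under the key BE
```

Both lines print Belgium.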

So why use hashes ? When you want to look something up by a keyword. Suppose we wanted to create a program which returns the name of the country when given a country code. We'd input ES, and the program would come back with Spain.

You could do it with arrays. It would be messy however. One possible approach:

  1. Create @countries , and give it values such as 'ES,Spain'
  2. Iterate over the entire array
  3. Split each element of the array, and check the first result to see if it matches the input
  4. If so, print the country name

@countries=('NL,The Netherlands','BE,Belgium','DE,Germany','MC,Monaco','ES,Spain');



print "Enter the country code:";

chop ($find=<STDIN>);



foreach (@countries) {

        ($code,$name)=split /,/;

        if ($find=~/^$code$/i) {        # anchored, so 'des' does not match DE or ES

                print "$name has the code $code\n";

        }

}

Complex and slow. We could also store a reference to another array in each element of @countries , but that is not efficient. Whatever way we choose, you still need to search the whole thing. And what if @countries is a big array ? See how much easier a hash is:

A Hash in Action


%countries=('NL','The Netherlands','BE','Belgium','DE','Germany','MC','Monaco','ES','Spain');



print "Enter the country code:";

chop ($find=<STDIN>);



$find=~tr/a-z/A-Z/;

print "$countries{$find} has the code $find\n";

Very easy. All we need to do is make sure everything is in uppercase with tr and we are there. Notice the way %countries is defined - exactly the same as a normal array, except that the values are put into the hash in key/value pairs.
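One thing the program above does not do is cope with a code that isn't in the hash; it just prints an empty name. A possible fix, sketched here with the exists function and two made-up inputs instead of STDIN:

```perl
%countries=('NL','The Netherlands','BE','Belgium','DE','Germany','MC','Monaco','ES','Spain');

foreach $find ('es', 'xx') {           # stand-ins for what a user might type
    $code = $find;                     # work on a copy of the input
    $code =~ tr/a-z/A-Z/;              # same uppercase trick as above
    if (exists $countries{$code}) {    # true only if the key is in the hash
        print "$countries{$code} has the code $code\n";
    } else {
        print "Sorry, I don't know the code $code\n";
    }
}
```

The first input finds Spain, and the second gets the apology.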


When you should use hashes

So why use arrays ? One excellent reason is because when an array is created, its variables stay in the same order you created them in. With a hash, perl reorders elements for quick access. Add print %countries; to the end of that program above and run it. See what I mean ? No recognisable sequence at all. It's like trying to herd cats. If you were writing code that stored a list of variables over time and you wanted it back in the order you found it in, don't use a hash.
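If all you need is a predictable order rather than the original one, sorting the keys is one way out. A minimal sketch, assuming alphabetical order by country code is good enough for your purposes:

```perl
%countries=('NL','The Netherlands','BE','Belgium','DE','Germany','MC','Monaco','ES','Spain');

# keys returns the keys in perl's own order; sort makes them alphabetical.
foreach $code (sort keys %countries) {
    print "$code is $countries{$code}\n";
}
```

This prints the codes in the order BE, DE, ES, MC, NL every time, but note that it is still not the order they were put in.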

Finally, you should know that each key of a hash must be unique. Stands to reason, if you think about it. You are accessing the hash via keys, so how can you have two keys named 'NL' or something ? If you do define a certain key twice, the second value overwrites the first. This is a feature, and useful. The values of a hash can be duplicates, but never the keys.
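A quick sketch shows the overwriting in action. The value 'Holland' is only there for illustration:

```perl
# NL appears twice in the list; the later value wins.
%countries = ('NL', 'Holland', 'BE', 'Belgium', 'NL', 'The Netherlands');

print "$countries{'NL'}\n";           # The Netherlands

$count = keys %countries;             # in scalar context, keys gives the number of keys
print "The hash has $count keys\n";   # The hash has 2 keys
```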

If you want to assign to a hash, there is of course no concept of push , pop and splice etc. Instead: