Capturing multiple elements from a URL
Cyan
Jan 10 2012, 1:31 pm
I thought I had already explained how to capture only a part of the URL, but I can't find it in the topics anymore, so I am posting it here.


IMPORTANT: Before following this tutorial, you need to allow ASF to look at the file's full URL instead of the domain only.
Go to preferences > option > page 2 and add the option "3" to the text input. For example "1,3".


For example, you have these URLs:
http://www.asf.mangaheart.org/filters/test_subdomain/File_test_01.zip
http://www.asf.mangaheart.org/filters/test_subdomain/File_test_02.zip
http://www.asf.mangaheart.org/filters/test_subdomain/File_test_03.zip


You can decompose these URLs into multiple chunks. The easiest way is to use the slash character / as the separator:
protocol, domain, folder 1, folder 2, filename.
^.*?://[^/]+/[^/]+/[^/]+/.*$

I will decompose it so you can see what each part does.
These parts are used throughout the tutorial to represent their values.

^ = start of the string
.*?:// = the protocol
[^/]+ = any characters that are not a slash (the domain and the folder names, in fact anything between two separator characters)
/ = the slash (our separator)
.* = any string of characters (the filename)
$ = end of the string


Place as many [^/]+/ as you need.
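
To try the pattern outside ASF, here is a minimal sketch in Python. It assumes that Python's re module behaves like ASF's regular expression engine for the simple syntax used in this tutorial. The helper name build_pattern is hypothetical and is not part of ASF.

```python
import re

def build_pattern(folder_count: int) -> str:
    """Return the tutorial's URL pattern with `folder_count` folder chunks."""
    # protocol + domain, then one "[^/]+/" per folder level, then the filename
    return r"^.*?://[^/]+/" + r"[^/]+/" * folder_count + r".*$"

urls = [
    "http://www.asf.mangaheart.org/filters/test_subdomain/File_test_01.zip",
    "http://www.asf.mangaheart.org/filters/test_subdomain/File_test_02.zip",
    "http://www.asf.mangaheart.org/filters/test_subdomain/File_test_03.zip",
]

two_folders = build_pattern(2)    # the same pattern as shown in the post
three_folders = build_pattern(3)  # one folder level more than these URLs have

for url in urls:
    # The first check is True, the second False: the number of
    # [^/]+/ chunks must match the folder depth of the URL.
    print(url, bool(re.match(two_folders, url)), bool(re.match(three_folders, url)))
```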


How to capture an element:

To capture a string, you need to put parentheses around it.
You can capture up to 9 elements in one string (9 in the domain field of the filter, 9 in the filename field of the filter).

To call a captured element in the saved path, use these tags in your path:
$1d, $2d, $3d .... $9d to call the parenthesis number 1, 2, 3, etc. from the domain field of your filter
$1f, $2f, $3f .... $9f to call the parenthesis number 1, 2, 3, etc. from the filename field of your filter

[Filter form: Domain (All / Regexp), File name (All / Regexp), Local folder]



Some filter examples:

1. Capture Folder1 and Folder2:
Place parentheses around each [^/]+ like this:
^.*?://[^/]+/([^/]+)/([^/]+)/.*$

From our previous test URL http://www.asf.mangaheart.org/filters/test_subdomain/File_test_01.zip, the first parenthesis $1d will capture the string "filters", and the second parenthesis $2d will capture the string "test_subdomain".
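
A quick way to check the two captures, sketched in Python under the assumption that its re module treats this pattern like ASF does. It also shows why the + has to stay inside the parentheses: with the + outside, the group keeps only the last character it matched.

```python
import re

url = "http://www.asf.mangaheart.org/filters/test_subdomain/File_test_01.zip"

# Quantifier inside the parentheses: each group captures a whole folder name.
inside = re.match(r"^.*?://[^/]+/([^/]+)/([^/]+)/.*$", url)
print(inside.group(1), inside.group(2))  # filters test_subdomain

# Quantifier outside the parentheses: the group is repeated one character
# at a time, and only the last repetition is kept.
outside = re.match(r"^.*?://[^/]+/([^/])+/([^/])+/.*$", url)
print(outside.group(1), outside.group(2))  # s n
```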

2. Capture the filename:
^.*?://[^/]+/[^/]+/[^/]+/(.*)$

As with the folders, place the parentheses around the "filename" part to capture it.
$1d will contain the filename.

BUT remember that you can also place a capture in the filename field of your filter, and then use $1f.
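
To illustrate how the $Nd and $Nf tags relate to the two fields, here is a rough sketch in Python. This is not ASF's actual code: the function expand_tags and the folder template "downloads/$1d/$2d/$1f" are hypothetical, and Python's re module is assumed to behave like ASF's engine for these patterns.

```python
import re

def expand_tags(template: str, domain_match, filename_match) -> str:
    """Replace $1d..$9d and $1f..$9f with the matching captured groups."""
    def substitute(tag):
        index = int(tag.group(1))
        # "d" tags read from the domain field, "f" tags from the filename field
        source = domain_match if tag.group(2) == "d" else filename_match
        if source is None or index > source.re.groups:
            return tag.group(0)  # no such capture: leave the tag untouched
        value = source.group(index)
        # Assumption: an unmatched optional group becomes an empty string here.
        return value if value is not None else ""
    return re.sub(r"\$([1-9])([df])", substitute, template)

url = "http://www.asf.mangaheart.org/filters/test_subdomain/File_test_01.zip"
domain_match = re.match(r"^.*?://[^/]+/([^/]+)/([^/]+)/.*$", url)
filename_match = re.match(r"^(.*)$", "File_test_01.zip")

result = expand_tags("downloads/$1d/$2d/$1f", domain_match, filename_match)
print(result)  # downloads/filters/test_subdomain/File_test_01.zip
```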



Capturing also works with normal (non regular expression) filters.
*.zip -> (*).zip will capture the filename of the zip file.

But let's go back to the tutorial on the domain with the full URL:


3. Capture the filename without the extension:
Instead of using (.*) for the whole filename, we decompose it into three groups: filename, dot, extension.
To capture the filename part, use:
(.*)\..*$

(.*) = the part of the filename to capture, which is followed by
\. = the last filename dot, followed by
.* = any character past the final dot (the extension of the file), followed by
$ = end of the filename
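
The same pattern can be tried in Python, assuming its re module behaves like ASF's engine here. Because (.*) is greedy, it runs up to the last dot, so a name with several dots only loses its final extension. The filenames other than File_test_01.zip are made up for the example.

```python
import re

pattern = r"(.*)\..*$"

def strip_extension(filename: str):
    """Return the captured part before the last dot, or None if no dot."""
    match = re.search(pattern, filename)
    return match.group(1) if match else None

print(strip_extension("File_test_01.zip"))  # File_test_01
print(strip_extension("archive.tar.gz"))    # archive.tar
print(strip_extension("README"))            # None (no dot, so no match)
```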


4. Capture only a specific number of letters from the filename:

You can give an exact count instead of the asterisk.
* means 0 to infinity (as many as possible)
+ means 1 to infinity (as many as possible)
{xx} tells how many times you want to repeat the previous element.
. = a single character

.* = any character, any number of times
.{10} = any character, 10 times = the first 10 characters


To capture the first 10 letters of the filename, use this:
(.{10}).*$
It will capture "File_test_" and omit the trailing numbers and the file extension.
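
A small check in Python, assuming its re module matches ASF's behavior for this syntax. Note that .{10} needs at least 10 characters, so a shorter filename does not match at all. The {1,10} range form is standard regular expression syntax but is not covered in this post, and "a.zip" is a made-up filename.

```python
import re

first_ten = re.search(r"(.{10}).*$", "File_test_01.zip")
print(first_ten.group(1))  # File_test_

# A filename shorter than 10 characters cannot satisfy .{10}
too_short = re.search(r"(.{10}).*$", "a.zip")
print(too_short)  # None

# {1,10} means "between 1 and 10 times", so short names still match
up_to_ten = re.search(r"^(.{1,10})", "a.zip")
print(up_to_ten.group(1))  # a.zip
```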



5.1 Capture the domain:

The domain is a little trickier to capture, as it can be written differently on each website:
http://domain.com
https://domain.com
ftp://domain.com
http://www.domain.com
www.domain.com
domain.com



So, we will separate the possible elements one by one:
Presence of the protocol, presence of www, domain name

^.*?:// = the protocol (any characters from the start, up to the first ://)
www\. = the www. (the dot is escaped with a backslash, because an unescaped dot matches any character)
[^/]+ = the first element, capturing all characters which are not a slash /


You always want to capture the domain name, so start by placing parentheses around it:
^.*?://www\.([^/]+)

$1d = domain name.
But it will work only if the URL is made of the protocol + the www. + the domain name.
If your URL doesn't include the www., the filter will not trigger at all.
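
The limitation is easy to see in a short Python sketch (assuming Python's re module behaves like ASF's engine for this pattern). The dot after www is written \. so that it matches only a literal dot. The paths added to the URLs are made up for the example.

```python
import re

pattern = r"^.*?://www\.([^/]+)"

with_www = re.match(pattern, "http://www.domain.com/file.zip")
print(with_www.group(1))  # domain.com

# Without the www. the pattern does not match, so the filter would not trigger.
without_www = re.match(pattern, "http://domain.com/file.zip")
print(without_www)  # None
```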


You have two choices:
- make two different filters, one with the www. and one without the www.
- make the www. presence in the filter optional.


5.2 Make a part of the filter optional:
To make a part of the filter optional, put a question mark ? after the element you want to make optional.
jpe?g = the letter e is optional (it will match jpeg AND jpg).

To make a group of letters optional, use parentheses:
(www\.)? = www. is optional, but the parentheses also capture it.

This filter:
^.*?://(www\.)?([^/]+).*$

will work with both URLs:
http://www.domain.com
http://domain.com

$1d will be either "www." or undefined.
$2d will always be the domain name without the www.
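
The two cases can be checked in Python, assuming its re module behaves like ASF's engine for this pattern. One difference to keep in mind: where ASF gives "undefined" for an optional group that did not match, Python gives None.

```python
import re

pattern = r"^.*?://(www\.)?([^/]+).*$"

with_www = re.match(pattern, "http://www.domain.com")
print(with_www.groups())  # ('www.', 'domain.com')

without_www = re.match(pattern, "http://domain.com")
print(without_www.groups())  # (None, 'domain.com')
```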


5.3 Make the protocol optional:
^(.*?://)?(www\.)?([^/]+).*$

$1d = http://, https://, ftp://, etc.
$2d = www.
$3d = domain.com
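
Here is how the three groups come out for different URL forms, sketched in Python under the assumption that its re module matches ASF's behavior. Groups that did not match show up as None in Python.

```python
import re

pattern = r"^(.*?://)?(www\.)?([^/]+).*$"

def split_url(url: str):
    """Return the (protocol, www, domain) captures for a URL."""
    return re.match(pattern, url).groups()

print(split_url("http://www.domain.com"))  # ('http://', 'www.', 'domain.com')
print(split_url("ftp://domain.com"))       # ('ftp://', None, 'domain.com')
print(split_url("www.domain.com"))         # (None, 'www.', 'domain.com')
print(split_url("domain.com"))             # (None, None, 'domain.com')
```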


5.4 Make an element optional but not capturing:
By default, parentheses capture, and if you use a lot of optional elements, you will run into the limit of 9 capturing groups.

You can make a group non-capturing by placing ?: right after its opening parenthesis:

^(?:.*?://)?(?:www\.)?([^/]+).*$

$1d = domain.com
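
A short Python check of the non-capturing version, run against the six URL forms listed in 5.1. It assumes Python's re module behaves like ASF's engine for this syntax. Only one capturing group is left, and it holds the domain in every case.

```python
import re

pattern = re.compile(r"^(?:.*?://)?(?:www\.)?([^/]+).*$")
print(pattern.groups)  # 1 (the two (?:...) groups are not counted)

urls = [
    "http://domain.com",
    "https://domain.com",
    "ftp://domain.com",
    "http://www.domain.com",
    "www.domain.com",
    "domain.com",
]

domains = [pattern.match(url).group(1) for url in urls]
print(domains)  # six times 'domain.com'
```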


5.5 Capture only the domain name without the TLD:
^(?:.*?://)?(?:www\.)?([^/]+)\..{3}

It captures every character that is not a slash, up to a dot followed by three characters. The protocol and the www. are optional, but when present they must be at the start of the string, because of the ^.

^ = start of the string
([^/]+) = anything not a slash followed by
\. = a dot
.{3} = 3 characters (a three-letter TLD such as com)

http://sub.domain.com

$1d = sub.domain
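
The pattern can be tried in Python, assuming its re module behaves like ASF's engine here. The sketch also shows one limit of .{3}: a URL that ends in a two-letter TLD does not match at all. The URLs other than http://sub.domain.com are made up for the example.

```python
import re

pattern = r"^(?:.*?://)?(?:www\.)?([^/]+)\..{3}"

def domain_without_tld(url: str):
    """Return the captured domain part, or None if the pattern fails."""
    match = re.match(pattern, url)
    return match.group(1) if match else None

print(domain_without_tld("http://sub.domain.com"))           # sub.domain
print(domain_without_tld("http://www.domain.com/file.zip"))  # domain
# Only two characters follow the dot, so .{3} cannot match.
print(domain_without_tld("http://domain.fr"))                # None
```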


6. Capturing an optional element:
As you saw previously, if you capture an optional element, it will be replaced by "undefined" when it is not present.

I will most likely change this behavior so that it returns nothing instead of "undefined".

Until I fix it, you can check this other tutorial:
http://asf.mangaheart.org/index.php?menu=5&f=3&t=13



Conclusion:
I hope you now understand how regular expressions work.
If you have any questions, just ask here.

I hope you will like regular expressions as much as I do. Once you understand how to use them, they are really useful :D