@@ -24,7 +24,7 @@ Groups with adult contents haven't been supported yet.

## Installation

- The script requires `bash-4`, `sort`, `wget`, `sed`, `awk`.
+ The script requires `bash-4`, `sort`, `curl`, `sed`, `awk`.

Make the script executable with `chmod 755` and put it in your path
(e.g., `/usr/local/bin/`).
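A minimal sketch of these two installation steps; the directory below is a throwaway stand-in for `/usr/local/bin` so the sketch can run without root, and the placeholder script only exists for illustration:

```shell
# Install sketch: make the script executable and place it on $PATH.
bindir=$(mktemp -d)
printf '#!/usr/bin/env bash\necho "crawler placeholder"\n' > "$bindir/crawler.sh"
chmod 755 "$bindir/crawler.sh"
export PATH="$bindir:$PATH"
crawler.sh    # now found via $PATH
```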
@@ -39,16 +39,16 @@ https://github.com/icy/google-group-crawler/issues/26.
For a private group, please
[prepare your cookies file](#private-group-or-group-hosted-by-an-organization).

-     # export _WGET_OPTIONS="-v"  # use wget options to provide e.g. cookies
+     # export _CURL_OPTIONS="-v"  # use curl options to provide e.g. cookies
      # export _HOOK_FILE="/some/path"  # provide a hook file, see in #the-hook

      # export _ORG="your.company"      # required, if you are using Gsuite
      export _GROUP="mygroup"           # specify your group
      ./crawler.sh -sh                  # first run for testing
-     ./crawler.sh -sh > wget.sh       # save your script
-     bash wget.sh                     # downloading mbox files
+     ./crawler.sh -sh > curl.sh       # save your script
+     bash curl.sh                     # downloading mbox files

- You can execute the `wget.sh` script multiple times, as `wget` will
+ You can execute the `curl.sh` script multiple times, as `curl` will
quickly skip any fully downloaded files.

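The skip behaviour can be illustrated with a small sketch; this is not the code `crawler.sh` actually generates, just the idea of guarding each download with a test on the output file (the function and file names are made up):

```shell
# Sketch: re-running the download script is cheap because non-empty
# output files are skipped instead of fetched again.
fetch() {
  local out="$1" url="$2"
  if [ -s "$out" ]; then
    echo "skip: $out"                      # already downloaded
  else
    curl -Ls -o "$out" "$url" && echo "fetched: $out"
  fi
}

f=$(mktemp)
printf 'From ...\n' > "$f"                 # pretend this mbox was downloaded earlier
fetch "$f" "https://example.com/mbox"      # takes the skip branch, no request made
```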
### Update your local archive thanks to RSS feed
@@ -66,32 +66,33 @@ It's useful to follow this way frequently to update your local archive.
### Private group or Group hosted by an organization

To download messages from a private group or a group hosted by your organization,
- you need to provide cookies in legacy format.
-
- 1. Export cookies for `google` domains from your browser and
-    save them as a file. Please use the Netscape format, and you may want to
-    edit the file to meet a few conditions:
-
-    1. The first line should be `# Netscape HTTP Cookie File`
-    2. The file must use tabs instead of spaces.
-    3. The first field of every line in the file must be `groups.google.com`.
-
-    A simple script to process this file is as below
-
-        $ cat original_cookies.txt \
-          | tail -n +3 \
-          | awk -v OFS='\t' \
-            'BEGIN {printf("# Netscape HTTP Cookie File\n\n")}
-             {$1 = "groups.google.com"; printf("%s\n", $0)}'
-
-    See the sample files in the `tests/` directory
-
-    1. The original file: [tests/sample-original-cookies.txt](tests/sample-original-cookies.txt)
-    2. The fixed file: [tests/sample-fixed-cookies.txt](tests/sample-fixed-cookies.txt)
-
- 2. Specify your cookie file by `_WGET_OPTIONS`:
-
-        export _WGET_OPTIONS="--load-cookies /your/path/fixed_cookies.txt --keep-session-cookies"
+ you need to provide some cookie information to the script. In the past,
+ the script used `wget` and the Netscape cookie file format;
+ now it uses `curl` with a cookie string and a configuration file.
+
+ 0. Open Firefox, press F12 to enable Debug mode, and select the Network tab
+    in the Debug console of Firefox. (You may find a similar way in
+    your favorite browser.)
+ 1. Log in to your testing Google account, and access your group.
+    For example,
+    https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public
+    (replace `google-group-crawler-public` with your group name).
+    Make sure you can read some content with your own group URI.
+ 2. Now, from the Network tab in the Debug console, select the address
+    and select `Copy -> Copy Request Headers`. You will have a lot of
+    things in the result, but please paste them into your text editor
+    and keep only the `Cookie` part.
+ 3. Now prepare a file `curl-options.txt` as below
+
+        user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
+        header = "Cookie: <snip>"
+
+    Of course, replace the `<snip>` part with your own cookie strings.
+    See `man curl` for more details of the file format.
+
+ 4. Specify your cookie file by `_CURL_OPTIONS`:
+
+        export _CURL_OPTIONS="-K /path/to/curl-options.txt"

Now every hidden group can be downloaded :)

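The last two steps above can also be scripted. This is a hedged sketch, not part of the crawler itself; the cookie value below is a dummy placeholder you must replace with the header copied from your browser, and nothing here contacts Google:

```shell
# Write curl-options.txt (curl's -K/--config file format) and point the
# crawler at it via _CURL_OPTIONS.
cookie='SID=replace-me; HSID=replace-me'   # placeholder, not a real cookie
cat > curl-options.txt <<EOF
user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
header = "Cookie: $cookie"
EOF
export _CURL_OPTIONS="-K $PWD/curl-options.txt"
```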
@@ -100,13 +101,13 @@ you need to provide cookies in legacy format.
If you want to execute a `hook` command after an `mbox` file is downloaded,
you can do as below.

- 1. Prepare a Bash script file that contains a definition of the `__wget_hook`
+ 1. Prepare a Bash script file that contains a definition of the `__curl_hook`
   command. The first argument specifies the output filename, and the
   second argument specifies the URL. For example, here is a simple hook
       # $1: output file
       # $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
-      __wget_hook() {
+      __curl_hook() {
         if [[ "$(stat -c %b "$1")" == 0 ]]; then
           echo >&2 ":: Warning: empty output '$1'"
         fi
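The example hook can be exercised by hand. The calling convention (output file first, URL second) is taken from the comments above; the file and URL here are made up, and `stat -c %b` assumes GNU stat as in the snippet:

```shell
# Define the example hook and invoke it the way the generated script
# presumably does: __curl_hook <output-file> <url>.
__curl_hook() {
  if [[ "$(stat -c %b "$1")" == 0 ]]; then
    echo >&2 ":: Warning: empty output '$1'"
  fi
}

f=$(mktemp)   # freshly created, empty: occupies 0 blocks, triggers the warning
__curl_hook "$f" "https://groups.google.com/forum/message/raw?msg=demo/t/1" 2>&1
```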
@@ -119,7 +120,7 @@ you can do as below.
   to your file. For example,

       export _GROUP=archlinuxvn
-      export _HOOK_FILE=$HOME/bin/wget.hook.sh
+      export _HOOK_FILE=$HOME/bin/curl.hook.sh

   Now the hook file will be loaded in your future output of commands
   `crawler.sh -sh` or `crawler.sh -rss`.